## Overview
### Data
I am using the Kaggle dataset from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) competition. It contains:

- articles.csv: information about fashion items.
- customers.csv: information about users.
- transactions_train.csv: information about transactions.

### Preprocessing
1. Removing nulls
2. Converting variables to their appropriate types
3. Grouping ages
4. Imputing nulls (if possible)
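
Step 3 can be sketched with `pd.cut`. The `preprocessing` module is not shown in this notebook, so the bin edges and labels below are assumptions, chosen only so that they reproduce the two `age_group` values visible in the customers output (49 gives `46-55`, 25 gives `19-25`). `group_ages` is a hypothetical helper name.

```python
import pandas as pd

# Assumed bin edges and labels; the real preprocessing module may differ.
AGE_BINS = [0, 18, 25, 35, 45, 55, 65, 120]
AGE_LABELS = ["0-18", "19-25", "26-35", "36-45", "46-55", "56-65", "66+"]


def group_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an age_group column derived from age."""
    out = df.copy()
    # pd.cut uses right-closed intervals by default, so 25 falls in (18, 25].
    out["age_group"] = pd.cut(out["age"], bins=AGE_BINS, labels=AGE_LABELS)
    return out


sample = pd.DataFrame({"age": [49.0, 25.0, 18.0]})
grouped = group_ages(sample)
print(grouped)
```

Ages outside the outermost edges would come back as missing values, so the edges should cover the full range kept after null removal.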

In [3]:
import pandas as pd
import numpy as np

import great_expectations as gx
from great_expectations import ExpectationSuite
import preprocessing

## Articles

In [4]:
articles_data_df = pd.read_csv("data/articles.csv", encoding="utf-8")
articles_data_df.head(2)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


In [5]:
articles_data_df.shape

(105542, 25)

In [6]:
articles_df = preprocessing.preprocess_articles(articles_data_df)

In [7]:
print(articles_df.shape)
articles_df.head(2)

(105542, 24)


Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,1676,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,1676,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic


## Customers

In [8]:
customers_data_df = pd.read_csv("data/customers.csv", encoding="utf-8")
customers_data_df.head(2)

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...


In [9]:
customers_data_df.shape

(1371980, 7)

In [10]:
customers_df = preprocessing.preprocess_customers(customers_data_df)

In [11]:
print(customers_df.shape)
customers_df.head(2)

(1356119, 5)


Unnamed: 0,customer_id,club_member_status,age,postal_code,age_group
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,ACTIVE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...,46-55
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,ACTIVE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...,19-25


## Transactions

In [12]:
transaction_data_df = pd.read_csv("data/transactions_train.csv", encoding="utf-8")
transaction_data_df.head(2)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2


In [13]:
transaction_data_df.shape

(31788324, 5)

In [14]:
%load_ext autoreload
%autoreload 2
import preprocessing


In [15]:
transaction_df = preprocessing.preprocess_transactions(transaction_data_df)

In [16]:
print(transaction_df.shape)
transaction_df.head(2)

(31788324, 11)


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,year,month,day,day_of_week,month_sin,month_cos
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,2018,9,20,3,-0.866025,-0.5
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,2018,9,20,3,-0.866025,-0.5
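
The added columns are consistent with a cyclical month encoding in which January maps to angle 0: for month 9, sin(2π·8/12) ≈ -0.866025 and cos(2π·8/12) = -0.5, which matches the output above. A minimal sketch, assuming this is how `preprocess_transactions` derives them (its body is not shown here, and `add_date_features` is a hypothetical name):

```python
import numpy as np
import pandas as pd


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar and cyclical month features."""
    out = df.copy()
    dates = pd.to_datetime(out["t_dat"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0, so Thursday=3
    # Assumed encoding: shift months to 0..11 so January sits at angle 0.
    angle = 2 * np.pi * (out["month"] - 1) / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out


sample = pd.DataFrame({"t_dat": ["2018-09-20"], "price": [0.050831]})
features = add_date_features(sample)
print(features)
```

Encoding the month as a point on a circle keeps December and January adjacent, which a plain integer month does not.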


## Great Expectations

Great Expectations provides a batteries-included solution for testing and documenting data. To do this, it lets you create expectation suites, which you can think of as unit tests for data.

- https://docs.greatexpectations.io/docs/core/define_expectations/create_an_expectation
- https://colab.research.google.com/github/datarootsio/tutorial-great-expectations/blob/main/tutorial_great_expectations.ipynb#scrollTo=f1lbHTKIZ5H4
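
To make the "unit tests for data" analogy concrete, here is a plain-pandas illustration of what a column expectation checks. It is not how Great Expectations is implemented, and `expectation_holds` is a hypothetical helper. It assumes the documented meaning of `mostly`: an expectation succeeds when at least that fraction of rows satisfies the condition.

```python
import pandas as pd


def expectation_holds(row_passes: pd.Series, mostly: float = 1.0) -> bool:
    """True when at least `mostly` of the rows satisfy the condition."""
    return bool(row_passes.mean() >= mostly)


ids = pd.Series(["a1", "b2", None, "d4"])

# "Values are not null" with the default mostly=1.0: one missing id fails it.
not_null_ok = expectation_holds(ids.notna())

# "Values are null" with mostly=0.0: the threshold is 0%, so it always holds
# and therefore cannot be used to assert that a column has no nulls.
null_mostly_zero_ok = expectation_holds(ids.isna(), mostly=0.0)

print(not_null_ok, null_mostly_zero_ok)
```

The second case shows why the choice of expectation matters more than the threshold: a "no nulls" rule is expressed as a not-null expectation, not as a null expectation with a low `mostly`.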

In [169]:
context = gx.get_context(mode="file")

In [170]:
articles_asset = context.data_sources.add_pandas("articles_source").add_dataframe_asset(name="articles_df")
articles_suite = context.suites.add(gx.ExpectationSuite(name="articles_suite"))

In [171]:
articles_batch_def = articles_asset.add_batch_definition_whole_dataframe("articles_batch")
articles_batch = articles_batch_def.get_batch(batch_parameters={"dataframe": articles_df})

In [172]:
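# article_id and product_code must never be null.
# ExpectColumnValuesToBeNull with mostly=0.0 passes for any data, so use the not-null expectation.
for column in ["article_id", "product_code"]:
    articles_suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=column))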

In [173]:
# The context is shared, so the other dataframes only need their own source and asset
customers_source = context.data_sources.add_pandas("customers_source")  # Define the data source; in our case a pandas DataFrame
customers_asset = customers_source.add_dataframe_asset(name="customers_df")  # Define the asset; in our case customers_df

In [174]:
# Define how to retrieve the data. For a pandas DataFrame the whole dataframe is always retrieved, even when a batch is defined
customers_batch_def = customers_asset.add_batch_definition_whole_dataframe("customers_batch")
customers_batch = customers_batch_def.get_batch(batch_parameters={"dataframe": customers_df})

In [175]:
# Define the expectation suite for the customers data
customers_suite = context.suites.add(gx.ExpectationSuite(name="customers_suite"))

In [176]:
# Age should be between 1 and 101
customers_suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="age", min_value=1, max_value=101))

ExpectColumnValuesToBeBetween(id='0f0666cb-7e63-460d-98fa-af2813775abf', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description=None, catch_exceptions=True, rendered_content=None, windows=None, batch_id=None, column='age', mostly=1, row_condition=None, condition_parser=None, min_value=1.0, max_value=101.0, strict_min=False, strict_max=False)

In [177]:
# No column should contain nulls.
# ExpectColumnValuesToBeNull with mostly=0.0 passes for any data, so use the not-null expectation.
for column in customers_df.columns:
    customers_suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=column))

In [178]:
# The context is shared; only the source, asset, and suite are new for transactions
transactions_source = context.data_sources.add_pandas("transactions_source")
transactions_asset = transactions_source.add_dataframe_asset(name="transactions_df")
transactions_suite = gx.ExpectationSuite(name="transactions_suite")
transactions_suite = context.suites.add(transactions_suite)

In [179]:
transactions_batch_def = transactions_asset.add_batch_definition_whole_dataframe("transactions_batch")
transactions_batch = transactions_batch_def.get_batch(batch_parameters={"dataframe": transaction_df})

In [180]:
transactions_suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="price", min_value=0, max_value=None))

ExpectColumnValuesToBeBetween(id='6640956b-672d-4040-b6ef-2387730d2d14', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description=None, catch_exceptions=True, rendered_content=None, windows=None, batch_id=None, column='price', mostly=1, row_condition=None, condition_parser=None, min_value=0.0, max_value=None, strict_min=False, strict_max=False)

In [181]:
# customer_id cannot be null.
# ExpectColumnValuesToBeNull with mostly=0.0 passes for any data, so use the not-null expectation.
transactions_suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id"))

In [182]:
context.variables.save()

In [183]:
batch_definition = articles_asset.get_batch_definition("articles_batch")
validation_definition = gx.ValidationDefinition(data=batch_definition, suite=articles_suite, name="articles_validation")
validation_result_articles = validation_definition.run(batch_parameters={"dataframe": articles_df})
print(validation_result_articles.success)

Calculating Metrics: 100%|██████████| 13/13 [00:00<00:00, 59.89it/s] 


True


In [184]:
batch_definition = customers_asset.get_batch_definition("customers_batch")
validation_definition = gx.ValidationDefinition(data=batch_definition, suite=customers_suite, name="customer_validation")
validation_result_customers = validation_definition.run(batch_parameters={"dataframe": customers_df})
print(validation_result_customers.success)  # This tells you if the data passes the expectations

Calculating Metrics: 100%|██████████| 38/38 [00:03<00:00, 11.81it/s] 


True


In [185]:
batch_definition = transactions_asset.get_batch_definition("transactions_batch")
validation_definition = gx.ValidationDefinition(data=batch_definition, suite=transactions_suite, name="transactions_validation")
validation_result_transactions = validation_definition.run(batch_parameters={"dataframe": transaction_df})
print(validation_result_transactions.success)

Calculating Metrics: 100%|██████████| 18/18 [00:34<00:00,  1.92s/it] 


True
