## Data validation:
This notebook contains the code for performing data validation on the Mercari dataset. The ideal choice for data validation in Python would be TensorFlow Data Validation (`tensorflow-data-validation`). Since it is not compatible with our Python environment, we consider the following packages as alternatives:
- Great Expectations
- pandas-profiling
- pydantic

In [1]:
# import essentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
%matplotlib inline

In [2]:
cd ..

/Users/mehuljain/Documents/course_related/ML_Ops/project/Price_Alchemy


## Great Expectations:

In [3]:
CTX_ROOT='./data/great_expectations'
CSV_PATH='./data/Mercari Price Suggestion Challenge/train.csv'

Create data context

In [4]:
context=gx.get_context(context_root_dir=CTX_ROOT)

In [5]:
# add a pandas data source
DS_NAME = 'train_file'
datasource = context.sources.add_pandas(DS_NAME)

# add the CSV file as a data asset
asset_name = 'mercari_training'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=CSV_PATH)

# build a batch request for the asset
batch_request = asset.build_batch_request()

Add expectation suite

In [6]:
context.add_or_update_expectation_suite('mercari_first_expectation_suite')

{
  "expectation_suite_name": "mercari_first_expectation_suite",
  "ge_cloud_id": null,
  "expectations": [],
  "data_asset_type": null,
  "meta": {
    "great_expectations_version": "0.18.10"
  }
}

In [7]:
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name='mercari_first_expectation_suite',
)

What does the data look like?

In [8]:
validator.head()

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | | 44.0 | 0 | Complete with certificate of authenticity |


Let's create some expectations

In [9]:
# not null expectations
validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_not_be_null("price")

{
  "success": true,
  "result": {
    "element_count": 1482535,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
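
To make the fields in this result easier to read, here is a minimal sketch of what a not-null check with a `mostly` threshold measures. It is plain Python, not Great Expectations code: the helper `not_null_result` and the sample list `brands` are hypothetical names introduced for illustration, and the empty-column behaviour is an assumption.

```python
import math


def not_null_result(values, mostly=1.0):
    """Summarise a not-null check over `values` (hypothetical helper, not GX code)."""
    element_count = len(values)
    # Count None and float NaN as missing values.
    unexpected_count = sum(
        1
        for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    if element_count == 0:
        # Assumption for this sketch: an empty column passes trivially.
        non_null_fraction = 1.0
        unexpected_percent = 0.0
    else:
        non_null_fraction = (element_count - unexpected_count) / element_count
        unexpected_percent = 100.0 * unexpected_count / element_count
    return {
        # The check passes when at least `mostly` of the values are present.
        "success": non_null_fraction >= mostly,
        "result": {
            "element_count": element_count,
            "unexpected_count": unexpected_count,
            "unexpected_percent": unexpected_percent,
        },
    }


# Hypothetical sample: two of four brand names are missing.
brands = ["Razer", None, "Target", None]
strict = not_null_result(brands)               # mostly=1.0, fails
lenient = not_null_result(brands, mostly=0.5)  # tolerates 50% missing, passes
```

With `mostly=1.0` a single missing value fails the check, which is why a column such as `price` reporting `unexpected_count: 0` succeeds, while a sparsely populated column needs a lower threshold.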

Let's run the missingness data assistant, excluding the `train_id` identifier column

In [10]:
exclude_cols_names=['train_id']

In [11]:
data_assistant_result = context.assistants.missingness.run(
    validator=validator,
    exclude_column_names=exclude_cols_names,
)
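
As a rough sanity check on what a missingness profile measures, the per-column non-null fraction can also be computed by hand. The sketch below uses only the standard library and a four-row sample taken from the preview above, reduced to a few columns. The helper `non_null_fractions` is a hypothetical name, and this is not how Great Expectations computes its metrics.

```python
import csv
import io


def non_null_fractions(csv_text, exclude=()):
    """Return {column: fraction of rows with a non-empty value} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [c for c in reader.fieldnames if c not in exclude]
    filled = {c: 0 for c in columns}
    total = 0
    for row in reader:
        total += 1
        for c in columns:
            # Treat empty fields as missing, as pandas.read_csv does by default.
            if row[c] not in (None, ""):
                filled[c] += 1
    return {c: (filled[c] / total if total else 0.0) for c in columns}


# Small sample built from the rows shown by validator.head().
sample = """train_id,name,brand_name,price
0,MLB Cincinnati Reds T Shirt Size XL,,10.0
1,Razer BlackWidow Chroma Keyboard,Razer,52.0
2,AVA-VIV Blouse,Target,10.0
3,Leather Horse Statues,,35.0
"""

fractions = non_null_fractions(sample, exclude=["train_id"])
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.2f}")
```

Comparing these fractions with the `mostly` values in the generated suite is a quick way to spot a threshold that looks too strict or too loose for a column.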





Save expectations

In [12]:
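# replace the validator's suite with the one proposed by the data assistant
validator.expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name='mercari_first_expectation_suite'
)
# persist the suite, keeping expectations even if they currently fail
validator.save_expectation_suite(discard_failed_expectations=False)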

In [13]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_v1",
    validator=validator,
)

Run expectations

In [14]:
checkpoint_result = checkpoint.run()

In [15]:
assert checkpoint_result["success"] is True
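
A bare `assert` only says that something failed, not what. The sketch below shows one way to produce a more informative message, assuming each expectation result is available as a plain dict with the `success` and `result` keys seen in the not-null output earlier in this notebook. The helper `summarize_failures`, the caller-supplied labels, and the sample values are all hypothetical. Extracting these dicts from `checkpoint_result` is not shown.

```python
def summarize_failures(labelled_results):
    """Return one message per failed result (hypothetical helper).

    `labelled_results` is a list of (label, result_dict) pairs, where the
    label is a caller-chosen description of the expectation.
    """
    failures = []
    for label, res in labelled_results:
        if res["success"]:
            continue
        stats = res.get("result", {})
        failures.append(
            f"{label}: {stats.get('unexpected_count', '?')} of "
            f"{stats.get('element_count', '?')} values unexpected"
        )
    return failures


# Hypothetical results, shaped like the expectation output shown earlier.
results = [
    ("price not null",
     {"success": True, "result": {"element_count": 4, "unexpected_count": 0}}),
    ("brand_name not null",
     {"success": False, "result": {"element_count": 4, "unexpected_count": 2}}),
]

failures = summarize_failures(results)
for message in failures:
    print(message)
```

The collected messages can then be attached to the assertion, for example `assert not failures, failures`, so a failing run names the offending expectations.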

Let's plot and view the results

In [18]:
data_assistant_result.plot_metrics()

56 Metrics calculated, 14 Metric plots implemented
Use DataAssistantResult.metrics_by_domain to show all calculated Metrics





View the generated expectations

In [17]:
data_assistant_result.show_expectations_by_expectation_type()

[ { 'expect_column_values_to_not_be_null': { 'column': 'name',
                                             'domain': 'column',
                                             'mostly': 1.0}},
  { 'expect_column_values_to_not_be_null': { 'column': 'item_condition_id',
                                             'domain': 'column',
                                             'mostly': 1.0}},
  { 'expect_column_values_to_not_be_null': { 'column': 'category_name',
                                             'domain': 'column',
                                             'mostly': 0.99}},
  { 'expect_column_values_to_not_be_null': { 'column': 'brand_name',
                                             'domain': 'column',
                                             'mostly': 0.55}},
  { 'expect_column_values_to_not_be_null': { 'column': 'price',
                                             'domain': 'column',
                                             'mostly': 1.0}},
  { 'expect_column_