## Data validation:
This notebook contains the code for validating the Mercari dataset. The ideal choice for data validation in Python would be `tensorflow data validation`. Since it is not compatible with our Python environment, we use the following packages as alternatives:
- great expectations
- pandas-profiling
- pydantic

In [18]:
# import essentials
import numpy
import pandas as pd
import pandera as pa
import matplotlib.pyplot as plt
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
%matplotlib inline

In [2]:
cd ..

/Users/mehuljain/Documents/course_related/ML_Ops/project/Price_Alchemy


## Great Expectations:

In [3]:
CTX_ROOT='./data/great_expectations'
CSV_PATH='./data/Mercari Price Suggestion Challenge/train.csv'

Create data context

In [4]:
context=gx.get_context(context_root_dir=CTX_ROOT)

In [5]:
# add data sources
DS_NAME='train_file'
datasource=context.sources.add_pandas(DS_NAME)

# adding csv asset
asset_name='mercari_training'
asset= datasource.add_csv_asset(asset_name, filepath_or_buffer=CSV_PATH)

# build batch request
batch_request= asset.build_batch_request()

Add expectation suite

In [6]:
context.add_or_update_expectation_suite('mercari_first_expectation_suite')

{
  "expectation_suite_name": "mercari_first_expectation_suite",
  "ge_cloud_id": null,
  "expectations": [],
  "data_asset_type": null,
  "meta": {
    "great_expectations_version": "0.18.10"
  }
}

In [7]:
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name='mercari_first_expectation_suite',
)

What does the data look like?

In [8]:
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

| Unnamed: 0 | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | | 44.0 | 0 | Complete with certificate of authenticity |


Let's create some expectations

In [9]:
# column should exist
validator.expect_column_to_exist("name")
validator.expect_column_to_exist("price")
validator.expect_column_to_exist("item_condition_id")
validator.expect_column_to_exist("category_name")
validator.expect_column_to_exist("brand_name")
validator.expect_column_to_exist("shipping")
validator.expect_column_to_exist("item_description")

# not null expectations
validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_not_be_null("price")
validator.expect_column_values_to_not_be_null("item_condition_id")
validator.expect_column_values_to_not_be_null("shipping")
validator.expect_column_values_to_not_be_null("item_description", mostly=0.95)
validator.expect_column_values_to_not_be_null("category_name", mostly=.95)
validator.expect_column_values_to_not_be_null("brand_name", mostly=0.5)

# value expectations
# validator.expect_column_values_to_be_between(
#     "price", min_value=0, max_value=2000)

validator.expect_column_max_to_be_between(
    "price", min_value=1000, max_value=2500)

validator.expect_column_distinct_values_to_be_in_set(
    "shipping",
    [0, 1])

validator.expect_column_distinct_values_to_be_in_set(
    "item_condition_id",
    [1, 2, 3, 4, 5])

# distribution expectations
validator.expect_column_stdev_to_be_between(
    'price', min_value=30, max_value=50)

validator.expect_column_mean_to_be_between(
    'price', min_value=20, max_value=30)

validator.expect_column_value_z_scores_to_be_less_than(
    'price', threshold=3, mostly=.9, double_sided=False)

# regex expectations
# values should not contain URLs; raw strings avoid the invalid "\/" escape sequence
validator.expect_column_values_to_not_match_regex(
    'name', regex=r'https?://.*[\r\n]*')

validator.expect_column_values_to_not_match_regex(
    'brand_name', regex=r'https?://.*[\r\n]*')

validator.expect_column_values_to_not_match_regex(
    'category_name', regex=r'https?://.*[\r\n]*')

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1482535,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 6327,
    "missing_percent": 0.42676901388500105,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
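The `mostly` argument and the z-score check used above can be read as simple fraction tests. The sketch below is a plain-Python illustration, not Great Expectations' own implementation: the helper names are hypothetical, the use of the sample standard deviation is an assumption, and the counts are taken from the result shown above on the assumption that the 6327 missing values belong to `category_name`, the column of the last expectation in the cell.

```python
import math
import statistics


def fraction_passing(flags):
    """Return the share of True values in a list of booleans."""
    return sum(flags) / len(flags)


def not_null_with_mostly(values, mostly=1.0):
    """Pass when at least `mostly` of the values are present (not None or NaN)."""
    flags = [
        v is not None and not (isinstance(v, float) and math.isnan(v))
        for v in values
    ]
    return fraction_passing(flags) >= mostly


def z_scores_below(values, threshold, mostly=1.0):
    """One-sided z-score check: pass when at least `mostly` of z-scores are < threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (assumption)
    flags = [(v - mean) / stdev < threshold for v in values]
    return fraction_passing(flags) >= mostly


# Counts reported in the validation result above.
element_count = 1_482_535
missing_count = 6_327

missing_percent = 100 * missing_count / element_count
non_null_fraction = 1 - missing_count / element_count

print(f"missing_percent   = {missing_percent:.4f}")
print(f"non_null_fraction = {non_null_fraction:.4f}")
print("passes mostly=0.95:", non_null_fraction >= 0.95)
```

Under these assumptions, roughly 0.43% of the values are missing, so a not-null expectation with `mostly=0.95` has a wide margin, while the same expectation without `mostly` would fail.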

Let's run the data assistant 

In [10]:
# exclude_cols_names=['train_id']

In [11]:
# data_assistant_result = context.assistants.missingness.run(
#     validator=validator,
#     exclude_column_names=exclude_cols_names,
# )

Save expectations

In [12]:
# validator.expectation_suite= data_assistant_result.get_expectation_suite(
#     expectation_suite_name='mercari_first_expectation_suite'
# )
validator.save_expectation_suite(discard_failed_expectations=False)

In [13]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_v4",
    validator=validator,
)

Run expectations

In [14]:
checkpoint_result = checkpoint.run()

Calculating Metrics:   0%|          | 0/64 [00:00<?, ?it/s]

In [15]:
# assert checkpoint_result["success"] is True
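If the commented-out check is re-enabled in a pipeline, raising an explicit exception may be preferable to a bare `assert`, because assertions are skipped when Python runs with `-O`. The following is a minimal sketch: `require_success` is a hypothetical helper, and the dictionary is a stand-in for the checkpoint result, which is only assumed to support `result["success"]` as in the commented line above.

```python
def require_success(result, label="validation"):
    """Raise if a dict-like validation result does not report success."""
    if not result["success"]:
        raise RuntimeError(f"{label} failed; inspect the data docs for details")
    return result


# Stand-in for a checkpoint result (hypothetical; a real run returns a richer object).
fake_result = {"success": True}
require_success(fake_result, label="checkpoint_v4")
```

In the notebook this would be called as `require_success(checkpoint_result, label="checkpoint_v4")`, again assuming the result object supports that lookup.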

View results 

In [16]:
# data_assistant_result.show_expectations_by_expectation_type()

Let's view the data doc

In [17]:
context.view_validation_result(checkpoint_result)

## Pandera: