In [2]:
from IPython.display import clear_output

!pip install deepchecks -U --user
!pip install pandas -U --user
!pip install polars -U --user

clear_output()

# Data Validation & Exploration


Today we'll dive into automated _Data Validation_ and _Data Exploration_.

Every day we work through large amounts of data using heuristics, statistics and many tools. But are there better tools out there? Is there a way to automate part of the process so we can put greater emphasis on the important things?

## Data Validation Tools

There are a few tools:

1. [Deepchecks](https://github.com/deepchecks/deepchecks) _Tests for Continuous Validation of ML Models & Data_
2. [ydata-profiling](https://github.com/ydataai/ydata-profiling) (previously _pandas-profiling_) _Create HTML profiling reports from pandas DataFrame objects_
3. [Great Expectations](https://github.com/great-expectations/great_expectations) _Always know what to expect from your data._
4. [pandera](https://pandera.readthedocs.io/en/stable/) _A Statistical Data Testing Toolkit_

We'll focus on a few discussion points today:

- When does it make sense to introduce this type of tool?
- How do you use this type of tool today?
- How can it be improved?
- Can it be used as part of Data Analysis?
- Can it be used in any other part of the process?

## Introduction

As we all know, data is incredibly important when developing machine learning applications.

> Shit in, shit out

First, we'll briefly introduce each tool and its strengths.

Second, I'll share a few use-case examples.

Finally, we'll discuss how we could use, or already use, these tools.

### Deepchecks

![[Deepchecks Checks](https://github.com/deepchecks/deepchecks)](https://github.com/deepchecks/deepchecks/raw/main/docs/source/_static/images/general/checks-and-conditions.png){#fig-deepchecks}

> Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. This includes checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.

#### Data Formats

Deepchecks supports the following formats:

1. Tabular
2. Computer Vision
3. NLP (text)

#### Example

![Animation of a Deepchecks model evaluation suite](https://github.com/deepchecks/deepchecks/raw/main/docs/source/_static/images/general/model_evaluation_suite.gif)

#### Types of checks

The checks are divided into three types, depending on where in the pipeline they run: data integrity, train-test validation, and model evaluation.

![Deepchecks Types and where they run](https://github.com/deepchecks/deepchecks/raw/main/docs/source/_static/images/general/pipeline_when_to_validate.svg)

#### Running Deepchecks

You can run either a full suite or a single check. You choose!

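In [None]:
from deepchecks.tabular.suites import model_evaluation

# Assumes `train_dataset` and `test_dataset` are deepchecks.tabular.Dataset objects
# and `model` is an already fitted model, all defined earlier.
suite = model_evaluation()
result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
# Saves an HTML report; use result.show() or result.show_in_window() to see results inline or in a window
result.save_as_html()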

In [None]:
from deepchecks.tabular.checks import FeatureDrift
import pandas as pd

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')
# Initialize and run desired check
FeatureDrift().run(train_df, test_df)
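
A drift check asks how differently a feature is distributed in the training data and in the test data. One common statistic for numeric features is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The sketch below is plain Python to show the idea only; it is not Deepchecks' implementation, and the function name `ks_statistic` is made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Largest gap between the empirical CDFs of two numeric samples (0 = identical, 1 = disjoint)."""
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    n, m = len(train_sorted), len(test_sorted)
    # The empirical CDFs only change at observed values, so checking those is enough.
    points = set(train_sorted) | set(test_sorted)
    return max(
        abs(bisect_right(train_sorted, x) / n - bisect_right(test_sorted, x) / m)
        for x in points
    )


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
print(same, shifted, disjoint)
```

A value close to 0 suggests the feature looks the same in both sets, while a value close to 1 suggests strong drift.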

### ydata-profiling
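
ydata-profiling creates HTML profiling reports from pandas DataFrame objects. To give a feel for what "profiling" means, here is a library-free sketch that computes a few per-column summaries. It is not ydata-profiling code; the function name `profile_column` and the example columns are hypothetical.

```python
from statistics import mean


def profile_column(values):
    """Summarise one column: row count, missing values, distinct values and basic numeric stats."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only add numeric statistics when every present value is a number.
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return summary


# Hypothetical toy table, stored as column name -> list of values.
table = {
    "passenger_count": [1, 2, 2, None, 6],
    "vendor": ["a", "b", "a", "a", None],
}
profile = {name: profile_column(column) for name, column in table.items()}
for name, summary in profile.items():
    print(name, summary)
```

A real profiling report covers far more than this, but summaries like these are the starting point.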

### Great Expectations

![great expectations](https://docs.greatexpectations.io/assets/images/gx_oss_process-050a4264f415a1bff3ceea3ac6f9b3a0.png)

> Great Expectations (GX) helps data teams build a shared understanding of their data through quality testing, documentation, and profiling.



In [None]:

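# `validator` is assumed to be a Great Expectations Validator (legacy, pre-1.0 API)
# wrapping the batch of data to check.
validator.expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1,
    max_value=6
)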
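
Conceptually, this expectation checks that every value in a column falls inside a closed range and reports the values that do not. Below is a minimal plain-Python sketch of that idea. It is not Great Expectations code: the function name `check_values_between` and the keys of the result dictionary are made up for illustration.

```python
def check_values_between(values, min_value, max_value):
    """Report which values fall outside the closed range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# Hypothetical column values; 0 and 9 violate the expected range.
passenger_count = [1, 2, 6, 0, 9]
result = check_values_between(passenger_count, min_value=1, max_value=6)
print(result)
```

The point of returning a structured result rather than raising immediately is that it can be rendered into a report afterwards.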

![automated data docs](https://docs.greatexpectations.io/assets/images/datadocs-8d8bc71d8aec770a38656ce60cc1e073.png)

It even has _Data Assistants_ that build checks automatically from a golden dataset!

There are more than 50 built-in expectations, and more than 300 when community contributions are included!

### pandera

With pandera you can:

1. Define a schema once and use it to validate different dataframe types.
2. Check the types and properties of columns/values.
3. Perform more complex statistical validation like hypothesis testing.
4. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
5. Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
6. Synthesize data from schema objects for property-based testing with pandas data structures.
7. Lazily validate dataframes so that all validation rules are executed before raising an error.
8. Integrate with a rich ecosystem of Python tools like pydantic, FastAPI and mypy.
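
Lazy validation (point 7) is worth a closer look: instead of stopping at the first broken rule, every rule is run and all failures are reported together. The following is a plain-Python sketch of that behaviour, not pandera's API; `ValidationFailures`, `validate_lazily` and the `fare` column are hypothetical names used only for illustration.

```python
class ValidationFailures(Exception):
    """Raised once, after all checks have run, carrying every failure found."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} check(s) failed")
        self.failures = failures


def validate_lazily(rows, checks):
    """Run every check on every row and collect all failures before raising."""
    failures = []
    for index, row in enumerate(rows):
        for column, description, predicate in checks:
            if not predicate(row[column]):
                failures.append((index, column, description, row[column]))
    if failures:
        raise ValidationFailures(failures)
    return rows


checks = [
    ("passenger_count", "between 1 and 6", lambda v: 1 <= v <= 6),
    ("fare", "non-negative", lambda v: v >= 0),
]
rows = [
    {"passenger_count": 2, "fare": 12.5},
    {"passenger_count": 0, "fare": -3.0},  # breaks both rules
]

try:
    validate_lazily(rows, checks)
except ValidationFailures as err:
    for failure in err.failures:
        print(failure)
```

Seeing every failure in one pass tends to be more useful during exploration than fixing one error at a time.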

Supported features:

| Tool | Data Stores (Pandas, Spark, DB, Other) | Validation | Profiling Data | Drift | Hypothesis | Data Generation | Data Types | Personal Favorite(s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepchecks | | | | | | | | |
| ydata-profiling | | | | | | | | |
| Great Expectations | | | | | | | | |
| pandera | | | | | | | | |