# Data Quality

Checking the quality of the data used in your analysis is a key step, and it is often done during exploratory data analysis (EDA). However, it is easily overlooked when moving to production, sometimes even by the data engineering team. If there is any doubt about the quality of the data your application will ingest, you should feel empowered to set up your own quality control (QC) process.

It is amazing how quickly and easily data can go from high-quality to junk. A seemingly innocuous change in the schema of a single upstream table can ruin all downstream tables and applications. It is important to remember that, especially in large organizations, multiple teams may own multiple sources of data, and not all of these teams will have disciplined and knowledgeable engineers working on them.

The most important data sources to test are typically the raw sources as the data comes in, but since you may not have access privileges for all of them, test whichever sources you can reach. You may also want to create tests further downstream to ensure that your preprocessing, feature generation, model scoring, and postprocessing code produces the *expected* data types, shapes, sizes, values, and distributions.
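To make "expected data types, shapes, sizes, values and distributions" concrete, here is a minimal, library-free sketch of such checks. Everything in it is an assumption made for illustration: the three-row sample, the `check_table` helper, and the thresholds are hypothetical, not part of any tool discussed in this section.

```python
import csv
import io
import statistics

# Hypothetical three-row sample standing in for a downstream table.
SAMPLE_CSV = """age,workclass,hours-per-week
39,State-gov,40
50,Self-emp-not-inc,13
38,Private,40
"""

EXPECTED_COLUMNS = ["age", "workclass", "hours-per-week"]


def check_table(csv_text, expected_columns, min_rows, mean_age_range):
    """Return a list of problems found; an empty list means every check passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    problems = []

    # Schema: column names and their order.
    if reader.fieldnames != expected_columns:
        problems.append(f"columns {reader.fieldnames} != expected {expected_columns}")
        return problems  # the remaining checks depend on the schema

    # Size: at least min_rows rows.
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {min_rows}")

    # Types: every age must parse as an integer.
    ages = []
    for i, row in enumerate(rows):
        try:
            ages.append(int(row["age"]))
        except (TypeError, ValueError):
            problems.append(f"row {i}: age {row['age']!r} is not an integer")

    # Distribution: the mean age should fall inside a plausible range.
    if ages:
        low, high = mean_age_range
        mean_age = statistics.mean(ages)
        if not low <= mean_age <= high:
            problems.append(f"mean age {mean_age:.1f} outside [{low}, {high}]")

    return problems


problems = check_table(SAMPLE_CSV, EXPECTED_COLUMNS, min_rows=3, mean_age_range=(17, 90))
print(problems or "all checks passed")
```

Returning a list of problems, rather than raising on the first failure, lets one run report every issue at once, which is closer to how the tool introduced next reports results.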

## Great Expectations

[Great Expectations](https://greatexpectations.io/) is a popular, free, and mature data quality tool that allows you to easily write tests, or *expectations*, using Python. According to the website, you can use it to validate, document, and profile your data. Great Expectations does NOT do data versioning or orchestrate data pipelines.

The high-level steps for using Great Expectations are: (a) install Great Expectations; (b) create and configure a Data Context; (c) create your expectations (tests); and (d) validate your data. However, let's start simpler and use the Great Expectations Python library directly.

Let's install Great Expectations, within our virtual environment, with 

`pip install great_expectations`  

and add it to our requirements.txt file. From here, check out the help files:

`great_expectations --help`  

You'll see that, thankfully, there are only a handful of commands. Each command has many optional arguments, but we won't worry about these right now.

In [1]:
import great_expectations as ge
import pandas as pd

# First, I'm going to add a header to my data and resave it. It is annoying not having a header.
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'y']
path = "../data/"
train = pd.read_csv(f"{path}adult.data", names = col_names)
test = pd.read_csv(f"{path}adult.test", names = col_names)

train.to_csv(f"{path}adult_train.csv", index = False)
test.to_csv(f"{path}adult_test.csv", index = False)

# Now reload the new data and make sure it looks right
train = pd.read_csv(f"{path}adult_train.csv")
test = pd.read_csv(f"{path}adult_test.csv")

# Here we are creating two datasets that we can use with Great Expectations
train_df = ge.dataset.PandasDataset(train)
test_df = ge.dataset.PandasDataset(test)

train_df.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Table-Level Expectations

First, let's think of what we expect our entire table to look like. 

- What columns should be there?  
- Do any columns form a unique identifier?  

In [2]:
# columns
train_df.expect_table_columns_to_match_ordered_list(
    column_list=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'y']
)

{
  "success": true,
  "result": {
    "observed_value": [
      "age",
      "workclass",
      "fnlwgt",
      "education",
      "education-num",
      "marital-status",
      "occupation",
      "relationship",
      "race",
      "sex",
      "capital-gain",
      "capital-loss",
      "hours-per-week",
      "native-country",
      "y"
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [3]:
# Change column list to make success = false
test_df.expect_table_columns_to_match_ordered_list(
    column_list=['Rage', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'y']
)

{
  "success": false,
  "result": {
    "observed_value": [
      "age",
      "workclass",
      "fnlwgt",
      "education",
      "education-num",
      "marital-status",
      "occupation",
      "relationship",
      "race",
      "sex",
      "capital-gain",
      "capital-loss",
      "hours-per-week",
      "native-country",
      "y"
    ],
    "details": {
      "mismatched": [
        {
          "Expected Column Position": 0,
          "Expected": "Rage",
          "Found": "age"
        }
      ]
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
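The cells above cover the first question (which columns should be present) but not the second (whether any columns form a unique identifier). For a single column, Great Expectations offers `expect_column_values_to_be_unique`. The sketch below shows the underlying idea for a combination of columns without using the library. The `find_duplicate_keys` helper and the fourth row are hypothetical, added only for illustration.

```python
from collections import Counter

# The first three rows come from the head of the training data shown earlier;
# the fourth row is made up so that "age" alone is no longer unique.
rows = [
    {"age": 39, "fnlwgt": 77516},
    {"age": 50, "fnlwgt": 83311},
    {"age": 38, "fnlwgt": 215646},
    {"age": 39, "fnlwgt": 12345},
]


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# "age" alone is not a unique identifier for this sample...
print(find_duplicate_keys(rows, ["age"]))
# ...but the pair ("age", "fnlwgt") is.
print(find_duplicate_keys(rows, ["age", "fnlwgt"]))
```

An empty list means the chosen columns uniquely identify every row in the sample. It says nothing about data that has not arrived yet, which is why this kind of check is worth rerunning on each new batch.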

### Column-Level Expectations

Second, let's think about our individual columns. 

- What data types should they be?   
- Should there be any missing values?  
- What are the minimum and maximum values?  
- etc.

There are many more expectations which you can find [here](https://greatexpectations.io/expectations), and you can customize your own expectations.

Go left to right and define expectations for each column. We'll just do our first two columns.

In [4]:
# age
## should be integer
train_df.expect_column_values_to_be_of_type(column="age", type_="int")

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [5]:
test_df.expect_column_values_to_be_of_type(column="age", type_="int")

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [6]:
## Should have no missing values
train_df.expect_column_values_to_not_be_null(column="age")

{
  "success": true,
  "result": {
    "element_count": 32561,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [7]:
test_df.expect_column_values_to_not_be_null(column="age")

{
  "success": true,
  "result": {
    "element_count": 16281,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [8]:
# workclass
## should be string
train_df.expect_column_values_to_be_of_type(column="workclass", type_="str")

{
  "success": true,
  "result": {
    "element_count": 32561,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [9]:
## should have 9 unique values
train_df.expect_column_unique_value_count_to_be_between(column="workclass", min_value=9, max_value=9)

{
  "success": true,
  "result": {
    "observed_value": 9,
    "element_count": 32561,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [15]:
## should have 9 unique values
test_df.expect_column_unique_value_count_to_be_between(column="workclass", min_value=9, max_value=9)

{
  "success": true,
  "result": {
    "observed_value": 9,
    "element_count": 16281,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
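The question about minimum and maximum values is not exercised in the cells above. Great Expectations provides `expect_column_values_to_be_between` for this. The sketch below is not that implementation: it is a library-free imitation that returns a dictionary shaped loosely like the results printed above, so the meaning of fields such as `unexpected_count` is easier to see. The function name, the bounds 17 and 90, and the value 212 are assumptions chosen for illustration.

```python
def expect_values_between(values, min_value, max_value):
    """Range check that returns a dict loosely shaped like a Great Expectations result."""
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(non_missing),
            "unexpected_count": len(unexpected),
            # Only a sample of the offending values is kept, as in the outputs above.
            "partial_unexpected_list": unexpected[:20],
        },
    }


ages = [39, 50, 38, 53, 28]  # the ages from the head of the training data
ok = expect_values_between(ages, min_value=17, max_value=90)
bad = expect_values_between(ages + [212], min_value=17, max_value=90)

print(ok["success"], ok["result"]["unexpected_count"])
print(bad["success"], bad["result"]["partial_unexpected_list"])
```

Missing values are counted separately rather than treated as out of range, which mirrors the way the outputs above report `missing_count` alongside `unexpected_count`.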

### Expectation Suite

After defining the expectations we can collect them all into a suite and then use the validate method to run them all and show us any expectations that fail.

In [20]:
# Expectation suite
expectation_suite = train_df.get_expectation_suite(discard_failed_expectations=False)
print(train_df.validate(expectation_suite=expectation_suite, only_return_failures=True))

{
  "success": true,
  "results": [],
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 5,
    "successful_expectations": 5,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "meta": {
    "great_expectations_version": "0.18.19",
    "expectation_suite_name": "default",
    "run_id": {
      "run_name": null,
      "run_time": "2024-09-05T14:22:04.162938-07:00"
    },
    "batch_kwargs": {
      "ge_batch_id": "0f8093be-6bc6-11ef-ac4d-ae0cde32659d"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20240905T212204.162888Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.18.19"
    }
  }
}


In [21]:
expectation_suite

{
  "expectation_suite_name": "default",
  "ge_cloud_id": null,
  "expectations": [
    {
      "expectation_type": "expect_table_columns_to_match_ordered_list",
      "kwargs": {
        "column_list": [
          "age",
          "workclass",
          "fnlwgt",
          "education",
          "education-num",
          "marital-status",
          "occupation",
          "relationship",
          "race",
          "sex",
          "capital-gain",
          "capital-loss",
          "hours-per-week",
          "native-country",
          "y"
        ]
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_of_type",
      "kwargs": {
        "column": "age",
        "type_": "int"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "age"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_of_type",
      "kwargs": {
    

### Using Great Expectations Within a Project

Next, let's use Great Expectations through the CLI. This way, we can set up our testing suites and document them along with our results. First, we need to initialize and configure our Data Context using 

`great_expectations init`

The Data Context is how we will interact with Great Expectations, and you will see it appear again later. Running this command creates a new `great_expectations` folder containing a `great_expectations.yml` file, which will be our main configuration file. We can add this to our git repo, and ignore the rest. Inside the `great_expectations` folder is another folder called `uncommitted`. It is meant to be ignored by git, and you'll notice that it has already been added to a `.gitignore` file in the `great_expectations` folder.

There is some valuable information in the great_expectations.yml file we should take a look at before moving on. 

#### Create a New Data Source

Next, we should tell Great Expectations about our data sources. We can use data from our local filesystem, or data stored in a relational database. We also must tell Great Expectations how we are processing our data, either using pandas or spark. Let's set up our data sources by running:

`great_expectations datasource new`

In the terminal, we will be asked:  
```
What data would you like Great Expectations to connect to?  
    1. Files on a filesystem (for processing with Pandas or Spark)  
    2. Relational database (SQL)  
```

Select 1 for filesystem. Then we will be asked:  

```
What are you processing your files with?  
1. Pandas  
2. PySpark  
```

Select 1 for Pandas. Lastly, we will be asked:  

```
Enter the path of the root directory where the data files are stored: 
```

We should enter the path as `data`. A Jupyter notebook will open up automatically which will allow us to configure our data source. We might only need to make one simple change to the notebook, to the `datasource_name` field (we can rename it to whatever makes the most sense). From here, we run each cell in the notebook, and then we can look in the great_expectations.yml file to see that our new data source was added to it.

#### Create a New Suite

Next, let's create a suite of expectations by running:

`great_expectations suite new`

In the terminal we will be asked: 

```
How would you like to create your Expectation Suite?  
    1. Manually, without interacting with a sample batch of data (default)  
    2. Interactively, with a sample batch of data  
    3. Automatically, using a profiler  
```

Select 2 so that a sample of data will be used to test our expectations. Then we will be asked:

```
Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?  
```

We will select whichever dataset we want to create a suite for. Let's choose adult_train.csv. And lastly, we will be asked to name the suite. After naming the suite, a notebook will open and we can write our expectations directly in the notebook, and run the cells to validate the expectations. 

Notice that the suite will be saved in the expectations folder as a json file. If we ever need to check which suites we've created, or edit any of our suites, we can simply run 

`great_expectations suite list`  
`great_expectations suite edit suite_name`  

One of the nice things about Great Expectations is that it makes it easy to document our expectations and results. We can build and launch our docs in a browser by running:

`great_expectations docs build`

#### Using the Profiler

Rather than manually creating all of our expectations ourselves, we can use a User Configurable Profiler to do some of the work for us. A profiler will create a set of very *strict* expectations: expectations that *overfit* our data and go beyond what we would probably want. After the expectation suite is created using the profiler, we can review the expectations and make any adjustments, additions, or deletions.
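To see why profiled expectations tend to overfit, consider this toy sketch. It is not how the Great Expectations profiler is implemented. The `profile_numeric_column` and `within` helpers, the sample values, and the reviewed bounds of 1 and 99 are all hypothetical.

```python
def profile_numeric_column(values):
    """Derive the tightest bounds that fit the sample, as a naive profiler might."""
    return {"min_value": min(values), "max_value": max(values)}


def within(value, bounds):
    """Check a single value against a pair of bounds."""
    return bounds["min_value"] <= value <= bounds["max_value"]


# hours-per-week values from the head of the training data
sample_hours = [40, 13, 40, 40, 40]

strict = profile_numeric_column(sample_hours)
print(strict)

# A perfectly reasonable new value fails the profiled expectation...
print(within(45, strict))

# ...so after review we replace the bounds using domain knowledge
# (1 and 99 are assumed here purely for illustration).
reviewed = {"min_value": 1, "max_value": 99}
print(within(45, reviewed))
```

The profiled bounds describe the sample exactly, which is what makes them too strict for new data. The review step is where domain knowledge replaces what happened to be observed.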

We can use the profiler in one of two ways: (1) the CLI; or (2) a .py file. Let's use the CLI by running 

`great_expectations suite new --profile`

We'll be asked to select a data asset for the suite and then name the suite. Let's pick adult_train.csv and name the suite "adult_train_data". Great Expectations will launch a notebook that we can then edit. First, in the list of **excluded** columns, we should comment out any column that we want the profiler to include. After that, we can run each code cell in the notebook in order to create the expectation suite. 

After using the profiler and reviewing the expectations that were created, we should run `great_expectations suite edit adult_train_data` to edit the expectations.

#### Checkpoints

Checkpoints are used to validate data. A checkpoint pairs an expectation suite with the data to validate, so if you want to check your data every time a data pipeline job runs, you can run a checkpoint as part of that job. We can see what checkpoints exist by running 

`great_expectations checkpoint list`

and create a new one by running 

`great_expectations checkpoint new {NAME}`

This will open up a notebook that you can then edit to configure a checkpoint. Be sure to check that the yaml for the checkpoint contains the expectation suite that you want to use for the checkpoint and the correct data set.

After creating a checkpoint you can run it in the command line:

`great_expectations checkpoint run {NAME}`  

Or you can run a checkpoint within a python script like below:

In [None]:
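import great_expectations as ge

# Load the Data Context for this project, then fetch and run the checkpoint by name
context = ge.get_context()
checkpoint = context.get_checkpoint("data_ingestion")
checkpoint.run()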

Once a checkpoint is run, it would make sense to take some action based on the results, such as sending an email or a Slack notification if there are failures (or even if they all run successfully). This can, theoretically, be done within Great Expectations itself by defining a new `action` within the yaml file for the checkpoint that was created above (which you can find in the `checkpoints/` folder). There is an action called `EmailAction` for sending emails, and one called `SlackNotificationAction` for sending Slack notifications. Each action requires some configuration, and possibly changes to your `config_variables.yml` file found under the `uncommitted/` folder.
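If you would rather handle notifications in your own code, the idea can be sketched as follows. This is an illustration, not the Great Expectations action mechanism: `build_notification` and its `notify_on` parameter are hypothetical, and the input dictionaries only mimic the `success`, `statistics`, and `meta` fields of the validation result printed in the Expectation Suite section. Actually sending the message (by email or Slack) is left out.

```python
def build_notification(validation_result, notify_on="failure"):
    """Return a one-line message, or None when no notification is warranted.

    notify_on is a hypothetical switch: "all", "failure", or "success".
    """
    success = validation_result["success"]
    if notify_on == "failure" and success:
        return None
    if notify_on == "success" and not success:
        return None

    stats = validation_result["statistics"]
    suite = validation_result["meta"]["expectation_suite_name"]
    status = "PASSED" if success else "FAILED"
    return (
        f"[{status}] suite '{suite}': "
        f"{stats['successful_expectations']}/{stats['evaluated_expectations']} expectations met"
    )


# Dictionaries trimmed down to the fields used above.
passing = {
    "success": True,
    "statistics": {"evaluated_expectations": 5, "successful_expectations": 5},
    "meta": {"expectation_suite_name": "default"},
}
failing = {
    "success": False,
    "statistics": {"evaluated_expectations": 5, "successful_expectations": 4},
    "meta": {"expectation_suite_name": "default"},
}

print(build_notification(passing))                   # no message on success by default
print(build_notification(failing))
print(build_notification(passing, notify_on="all"))
```

Keeping the decision of whether to notify separate from how the message is delivered makes it easy to swap email for Slack, or to log the message instead.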


### Great Expectations Summary

There is a lot that Great Expectations can do, and we've only scratched the surface, but hopefully you've noticed how *simple* it can be to create, validate, and document tests for your data without having to write your own unit tests and corresponding application. We have not yet shown how to include Great Expectations in your data pipelines - there are integrations with several popular orchestration engines such as Airflow and Prefect. We'll cover these in a future session.

Most of what we've done has been from the CLI which launches notebooks that we have to run and edit - it's all very interactive (and admittedly feels a little strange). There are more pythonic ways of using Great Expectations by simply using the library in some python scripts. 

### Other Tools 

Admittedly, Great Expectations is a heavy tool. It takes a lot of setup to do things. But, notice what was done in each step:

- Define the data sources  
- Create the table-level and column-level expectations (tests)  
- Combine expectations for a data source into a suite  
- Create a checkpoint, with actions  
- Visualize results in a report or dashboard  

Each of these steps can also be accomplished with other tools. For example, we can write unit tests with `pytest`, use a scheduler to run those tests every time we ingest new data or on a fixed schedule, write the results to a log file, and send the results by email or Slack notification.
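As a rough sketch of the `pytest` route, the file below defines three data tests that `pytest` would discover by their `test_` prefix. The in-memory sample, the `load_rows` helper, and the age range are assumptions for illustration. In a real project, `load_rows` would read the newly ingested file instead.

```python
import csv
import io

# Hypothetical sample standing in for a freshly ingested file.
SAMPLE = """age,workclass
39,State-gov
50,Self-emp-not-inc
"""


def load_rows(text=SAMPLE):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def test_columns_match_expected_order():
    rows = load_rows()
    assert list(rows[0].keys()) == ["age", "workclass"]


def test_age_has_no_missing_values():
    assert all(row["age"] not in ("", None) for row in load_rows())


def test_age_is_integer_in_plausible_range():
    for row in load_rows():
        assert 0 < int(row["age"]) < 120
```

Saved under a name such as `test_data_quality.py` (a hypothetical file name), these tests would run with `pytest test_data_quality.py`, and the exit code can be used by a scheduler to decide whether to send a notification.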

# Data Quality Lab

## Overview 

In this lab we will learn about running tests for data quality for our model data.

## Goal

The goal of this lab is to set up *unit tests* for your project data.

## Instructions

### Great Expectations 

There are a few tools out there for data quality, but we will be using Great Expectations since it is one of the most popular. Deepchecks is a newer tool that checks more than just the data: it also checks model outputs.

Remember the high-level steps for using Great Expectations are: (a) install Great Expectations; (b) create and configure a Data Context; (c) create your expectations (tests); and (d) validate your data.

Before beginning, be aware of the data set that you will be checking for quality. In a real job scenario, you should absolutely have checks for every column, but since I am not paying you, do not spend hours and hours doing this for this assignment. If your data has more than ten columns, then you can choose the ten **most important** features and create expectations just for those ten, and then use a profiler for the rest. 

#### Install Great Expectations, Create and Configure a Data Context

1. Install Great Expectations (in your virtual environment): `pip install great_expectations`. You can also add it to your requirements.txt file. I'm using version 0.18.19.   
2. Create a Data Context by initializing Great Expectations: `great_expectations init`  
3. Add everything to git. The `uncommitted/` folder should automatically not be added.  
4. Look over the great_expectations.yml file.  
5. Create a data source: `great_expectations datasource new`  

#### Create Expectations

1. Create a new suite of expectations for your data sets: `great_expectations suite new`.    
    - If you have more than ten columns, create expectations only for the **top ten** columns.  

#### Validate Data

1. Run the cells in the notebook that contains your suite of expectations.  
    - Fix any data issues that are revealed by your suite.    
2. Take a look at the docs: `great_expectations docs build`.  

#### Profile Data

1. Create a new suite of expectations for your data sets using a User Configurable Profiler.  
    - Edit this suite down to only the essential expectations needed for your data. 
    - **Do not** simply use the expectations that are given. Most of these will not make sense and are just a starting point. 
2. You should now have two suites of expectations for your data. At this point you can combine expectations into one suite for each set of data.  

#### Create a Checkpoint  

1. Create a new checkpoint. This is essentially what you can run every time you need to check the data, or run it on a schedule.  
2. Run the checkpoint.  
3. Build the docs.  

### Turning It In

Make sure you push to GitHub and submit your GitHub URL in Canvas. Your repo should contain the yaml files for your initialized Great Expectations Data Context and checkpoint, and the json files for your expectation suite. Also, please zip up your data docs folder and upload it to Canvas. 

### Grade

This lab is worth 10 points. Each step above should be completed. There should be **meaningful** tests for your project data table and columns, and the checkpoint should have been run a few times and should appear in the data docs.