## What You Will Learn

#### The goal of this tutorial is to introduce [Great Expectations](https://greatexpectations.io/), a Python package that helps ensure data quality in data science and analytics workflows. In this tutorial, you will learn:

1. Why data quality is imperative for data science.
2. What expectations are, and how you can test them.
3. How to set up a Great Expectations suite in a production environment.


## 1. Importance of Data Quality

According to a [2016 survey](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=77d4a0526f63), data scientists spend around 60 percent of their time cleaning and organizing data, and many of them consider it the least enjoyable part of their work.

Coursework rarely focuses on data cleaning. In most assignments and projects, we receive already cleaned data in CSV files or access to a database, then build models and conduct analysis without paying much attention to data quality issues.

The data we encounter in most real-world data science projects often has issues with **accuracy, completeness, uniqueness, consistency, and timeliness**. Furthermore, we may have to work with dynamic data that gets updated daily, weekly, or monthly. Consider building a machine learning model and an analytics dashboard with [311 data from the NYC open data portal](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). This data gets updated daily, and we have to ensure that the columns in each new update are consistent with what we already have.
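As a rough illustration of the column-consistency check described above, the following plain-Python sketch compares the column names of a baseline dataset with those of a new update. The function name `compare_columns` and the column names are hypothetical and chosen only for this example; they are not taken from the actual 311 schema.

```python
def compare_columns(expected, incoming):
    """Report columns that disappeared or appeared in a new data update."""
    expected_set, incoming_set = set(expected), set(incoming)
    return {
        "missing": sorted(expected_set - incoming_set),
        "unexpected": sorted(incoming_set - expected_set),
        "same_order": list(expected) == list(incoming),
    }


# Hypothetical column names, used only for illustration.
baseline_columns = ["unique_key", "created_date", "agency", "complaint_type"]
new_columns = ["unique_key", "created_date", "agency_name", "complaint_type"]

report = compare_columns(baseline_columns, new_columns)
print(report)
```

A check like this can run before any modeling code, so that a renamed or dropped column is caught as soon as the update arrives.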

Most importantly, poor data quality can adversely impact model quality. For instance, when you are pre-processing a categorical variable to be entered into a model, an inconsistency between ***‘NYPD’ and ‘NyPD’*** would cause these two values to be treated as two distinct categories. Beyond this simple example, data quality issues can degrade model performance in ways that are often difficult to detect during the modeling stage.
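The effect of inconsistent casing can be seen in a small sketch. The sample values below are made up for illustration; the point is only that the number of distinct categories changes once the strings are normalized.

```python
from collections import Counter

# Made-up sample values with inconsistent casing.
agencies = ["NYPD", "NyPD", "DOT", "nypd", "DOT"]

# Count categories as they appear in the raw data.
raw_counts = Counter(agencies)

# Count categories after trimming whitespace and upper-casing each value.
normalized_counts = Counter(value.strip().upper() for value in agencies)

print(len(raw_counts))         # distinct categories before normalization
print(len(normalized_counts))  # distinct categories after normalization
```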


## Set Up
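In order to complete this section, you need to install the **great_expectations** and **pandas** packages. Once you have successfully installed them using either pip or Anaconda, you can import them using the following commands. The outputs shown in this tutorial were produced with great_expectations version 0.14.10, so the API may differ in other releases.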

In [1]:
import great_expectations as ge
import pandas as pd

## 2.1 Quick Intro to Setting Expectations

You will be working with a slightly modified version of the Titanic dataset for this section.

Let's imagine that the Titanic was able to divert its course, dodge the iceberg that struck it in 1912, and complete its maiden voyage. Let's assume that the RMS Titanic went on to make many voyages between Europe and the UK, and that a team of data engineers and data scientists in the Titanic's marketing division is collecting passenger data to determine which passengers should be offered discounts on an all-inclusive first-class pass. After each voyage, the team receives a new batch of data.

Let's load the Titanic dataset from the GitHub link with the `read_csv` function in Great Expectations.

In [7]:
titanic_df = ge.read_csv("https://raw.githubusercontent.com/laknath123/Practical-Data-Science-15-688-Tutorial/main/titanic.csv")

Let's explore the first few rows of this dataset.

In [8]:
titanic_df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In this tutorial, we make a few modifications to the original dataset:
- First, we remove the `Survived` column.
- Next, we split the dataset into two batches.
- Finally, we create a data anomaly by setting the `Embarked` value of one row in the second batch to `Z`.

In [9]:
titanic_df = titanic_df.loc[:, titanic_df.columns != 'Survived']
df_batch1 = titanic_df[0:400]
# Copy the slice so that the assignment below modifies an independent
# DataFrame and does not trigger pandas' SettingWithCopyWarning
df_batch2 = titanic_df[400:].copy()
df_batch2.loc[400, 'Embarked'] = 'Z'


In [13]:
df_batch1.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

The data team knows that passengers board the ship at one of three ports: Cherbourg in France, Queenstown in Ireland, and Southampton in the UK. In the `Embarked` column, these ports are coded as `C` for Cherbourg, `Q` for Queenstown, and `S` for Southampton.

We can use the Great Expectations package to create an expectation that the `Embarked` column should only contain these three values.

## 2.2 Setting Categorical Expectations

In [15]:
df_batch1.expect_column_values_to_be_in_set('Embarked',['S','C','Q'])

{
  "result": {
    "element_count": 400,
    "missing_count": 1,
    "missing_percent": 0.25,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

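When you run an expectation on a dataset, it returns the result that you see above. The most important item to look at is the **success** field. Here, `"success": true` means that the `Embarked` column in this batch of data has met our expectation. Note that `missing_count` is 1: by default, missing values are ignored when the expectation is evaluated, so the one `nan` in this batch does not cause a failure.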
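To make the fields of this result more concrete, here is a minimal plain-Python sketch of the kind of bookkeeping such a check involves. It is not the library's implementation: the function name `values_in_set` is hypothetical, `None` stands in for a missing value, and only some of the fields from the output above are reproduced.

```python
def values_in_set(values, allowed):
    """Summarize how many non-missing values fall outside `allowed`."""
    allowed = set(allowed)
    # Missing values (None here) are set aside before checking membership.
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if v not in allowed]
    return {
        "element_count": len(values),
        "missing_count": len(values) - len(non_missing),
        "unexpected_count": len(unexpected),
        "unexpected_percent_nonmissing": (
            100.0 * len(unexpected) / len(non_missing) if non_missing else 0.0
        ),
        "partial_unexpected_list": unexpected[:20],
        "success": not unexpected,
    }


embarked = ["S", "C", None, "Q", "S"]  # None stands in for a missing value
result = values_in_set(embarked, ["S", "C", "Q"])
print(result)
```

In this sketch the missing entry is counted in `missing_count` but does not affect `success`, which mirrors the behaviour seen in the output above.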

Anytime you create an expectation, it is stored as a configuration. We can look at the config object it creates using the `get_expectation_suite()` method.

In [16]:
df_batch1.get_expectation_suite()

{
  "ge_cloud_id": null,
  "expectations": [
    {
      "kwargs": {
        "column": "Embarked",
        "value_set": [
          "S",
          "C",
          "Q"
        ]
      },
      "meta": {},
      "expectation_type": "expect_column_values_to_be_in_set"
    }
  ],
  "meta": {
    "great_expectations_version": "0.14.10"
  },
  "expectation_suite_name": "default",
  "data_asset_type": "Dataset"
}

Since we hope to use this config to validate the data we receive in the future, let's save the expectation we just created as `titanic_config` in our workspace using the following command.

In [17]:
titanic_config = df_batch1.get_expectation_suite()

Once the data team receives another batch of passenger data from a voyage, they can use the config they created to validate the new data.
To do so, load the new batch of data, then call the `validate` method and pass in the `titanic_config` you created earlier as a parameter.

Load the second batch of data:

In [18]:
batch2 = df_batch2

You can now validate this second batch of data. Passing `only_return_failures=True` limits the output to the validation rules that have failed.

In [19]:
batch2.validate(expectation_suite=titanic_config, only_return_failures=True)

{
  "statistics": {
    "evaluated_expectations": 1,
    "successful_expectations": 0,
    "unsuccessful_expectations": 1,
    "success_percent": 0.0
  },
  "success": false,
  "evaluation_parameters": {},
  "meta": {
    "great_expectations_version": "0.14.10",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2022-04-05T00:05:07.595107+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "737d0114-b473-11ec-8fc7-9e305b1c753f"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20220405T000507.594107Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.14.10"
    }
  },
  "results": [
    {
      "result": {
        "element_count": 491,
        "missing_count": 1,
        "missing_percent": 0.20366598778004072,
        "unexpected_count": 1,
        "unexpected_percent": 0.20408163265306123,
        "unexpected_percent_total": 0.20366598778004072,
        "unexpected_percent_non

If we look at the above output, it is evident that the validation failed, since `"success": false`.
The `results` entry explains why: `"unexpected_count": 1` reflects the unexpected value `Z` that we added to the `Embarked` column.
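The idea of storing expectations as a config and replaying them against later batches can be sketched in plain Python. This is only an illustration of the pattern, not how Great Expectations is implemented: `check_in_set`, `CHECKS`, and `validate_rows` are hypothetical names, batches are represented as lists of dictionaries, and the suite dictionary copies only the `expectation_type` and `kwargs` fields from the config shown earlier.

```python
def check_in_set(rows, column, value_set):
    """Pass when every non-missing value in `column` is in `value_set`."""
    allowed = set(value_set)
    unexpected = [
        row[column]
        for row in rows
        if row.get(column) is not None and row[column] not in allowed
    ]
    return {"success": not unexpected, "unexpected": unexpected}


# Hypothetical registry mapping an expectation_type string to a checker.
CHECKS = {"expect_column_values_to_be_in_set": check_in_set}


def validate_rows(rows, suite, only_return_failures=False):
    """Run every expectation stored in `suite` against a batch of rows."""
    results = []
    for expectation in suite["expectations"]:
        check = CHECKS[expectation["expectation_type"]]
        outcome = check(rows, **expectation["kwargs"])
        outcome["expectation_type"] = expectation["expectation_type"]
        results.append(outcome)
    success = all(r["success"] for r in results)
    if only_return_failures:
        results = [r for r in results if not r["success"]]
    return {"success": success, "results": results}


suite = {
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "Embarked", "value_set": ["S", "C", "Q"]},
        }
    ]
}

batch1 = [{"Embarked": "S"}, {"Embarked": "C"}, {"Embarked": None}]
batch2 = [{"Embarked": "Z"}, {"Embarked": "Q"}]

report1 = validate_rows(batch1, suite, only_return_failures=True)
report2 = validate_rows(batch2, suite, only_return_failures=True)
print(report1["success"], report2["success"])
```

Because the rules live in data rather than in code, the same suite can be applied unchanged to every new batch.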

## 2.3 Setting Statistical Expectations 

The Great Expectations package also provides a host of ways to build expectations about statistics computed from our dataset.

Let's look at the average fare that we charged passengers during the first voyage.

In [20]:
df_batch1['Fare'].mean()

33.24182199999998

Certain industries and organizations have [Average Cost Pricing Rules](https://www.investopedia.com/terms/a/average_cost_pricing_rule.asp) or internal policies that require them to set prices within a certain range. Let's assume that the Titanic's pricing team has determined that they want the average fare to be between \$25.00 and \$35.00.

We can set an expectation to test this in our dataset using the `expect_column_mean_to_be_between` method and run this expectation on our second batch of data

In [21]:
df_batch2.expect_column_mean_to_be_between('Fare',25.00,35.00)

{
  "result": {
    "observed_value": 31.35890122199591,
    "element_count": 491,
    "missing_count": null,
    "missing_percent": null
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

When we run this expectation on our second batch of data, the observed average fare of about 31.36 falls within the expected range, so the expectation succeeds.
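The logic behind a mean-range check is simple enough to sketch in plain Python. The function name `mean_between` is hypothetical, `None` stands in for a missing value, and the bounds are treated as inclusive here. The sample uses only the five fares from the rows previewed earlier, so its mean differs from the full-batch values reported above.

```python
from statistics import mean


def mean_between(values, min_value, max_value):
    """Check whether the mean of the non-missing values lies in a range."""
    non_missing = [v for v in values if v is not None]
    observed = mean(non_missing)
    return {
        "observed_value": observed,
        "success": min_value <= observed <= max_value,
    }


# The five fares from the rows previewed earlier in this tutorial.
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]
result = mean_between(fares, 25.00, 35.00)
print(result)
```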

## 2.4 Explore The World of Expectations

We just scratched the surface of what Great Expectations has to offer in the previous section.
The Great Expectations library has a host of tests to check for issues with your data. These tests include ways to evaluate statistical expectations, expectations based on regular expressions, and expectations for geospatial data, such as requiring latitude and longitude values to be within a certain range.
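To give a feel for what regular-expression and range checks involve, here is a plain-Python sketch of both. It does not use the Great Expectations API: the helper names `matches_regex` and `within_range` and the sample values are hypothetical, and `None` stands in for a missing value.

```python
import re


def matches_regex(values, pattern):
    """Collect the non-missing values that do not match `pattern`."""
    compiled = re.compile(pattern)
    unexpected = [v for v in values if v is not None and not compiled.search(v)]
    return {"success": not unexpected, "unexpected": unexpected}


def within_range(values, min_value, max_value):
    """Collect the non-missing values outside [min_value, max_value]."""
    unexpected = [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]
    return {"success": not unexpected, "unexpected": unexpected}


# Made-up sample values for illustration.
zip_codes = ["10001", "11201", "1001A"]
latitudes = [40.7128, 40.6782, 95.0]

zip_result = matches_regex(zip_codes, r"^\d{5}$")
lat_result = within_range(latitudes, -90.0, 90.0)
print(zip_result)
print(lat_result)
```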

You can check out the entire list of available expectations at this [link](https://greatexpectations.io/expectations).

Since Great Expectations is an open-source package, the community can contribute new expectations. According to the website, the goal of the package is to create a **shared, open standard of data quality**.

## 3.0 Setting Up a Great Expectations Suite in a Production Environment

The data science workflow in a typical organization involves obtaining data from many sources. For instance, the data may come from enterprise resource planning systems, customer relationship management systems, or from sensors.

In an analytics workflow that spans multiple teams, data engineering teams create pipelines to move data and may create databases and data warehouses to store data. These pipelines would then be used by data analysts/ business intelligence developers to create dashboards, and data scientists to build and deploy machine learning models.
 
Great Expectations seems to be a popular technology among data engineering and data science teams, who use it to run data quality tests before ingesting data into databases and models. For example:

- [How Great Expectations is used by the Global Analytics Team at Heineken](https://greatexpectations.io/case-studies/heineken-case-study/)

 
In this section, I guide you through how Great Expectations may be deployed in a typical data science workflow, where you have to deal with dynamic data inflows rather than one CSV file at a time.

Again, a real-world workflow may be a lot more involved than what I present here, but I hope that this gives you a general idea.
 

## Set Up
1. Download the zip file or obtain it from GitHub, and save it somewhere on your machine.
2. Open the Anaconda prompt or a terminal and use `cd` to go to the directory where the file is saved.
3. Once you are in that folder, create a virtual environment with the command `python -m venv venv`.
4. Activate the virtual environment by running `venv\Scripts\activate.bat` (on Windows; on macOS or Linux, run `source venv/bin/activate` instead).
5. Once you have activated the virtual environment, run:

    - `pip install great_expectations`
    - `pip install sqlalchemy`


## 3.1 Launch and initialize Great Expectations
 
Run the following command:

`great_expectations init`

The command above initializes Great Expectations in your folder. If you look at the folder, you should see a new folder named `great_expectations`, with the structure described in the following image.

In [None]:
df1 = ge.read_csv("titanicbatch1.csv")
df1 = df1.loc[:,df1.columns !='Survived']

In [None]:
df_batch1 = df1[0:400]
# Copy the slice so the assignment in the next cell does not act on a view of df1
df_batch2 = df1[400:].copy()

In [None]:
df_batch2.loc[400,'Embarked']='Z'

In [None]:
df_batch1.to_csv('titanicbatch1.csv')
df_batch2.to_csv('titanicbatch2.csv')