## Quick Intro to Setting Expectations

For the purpose of this exercise, let's imagine that the Titanic was able to divert its course, dodge the iceberg that struck it in 1912, and successfully complete its maiden voyage. In fact, let's assume that the RMS Titanic made many voyages between Europe and the UK, and that a team of data engineers and data scientists in the Titanic's marketing division has been collecting passenger data to determine which passengers should be offered premium discounts on an all-inclusive first-class pass.

In [33]:
import great_expectations as ge
import pandas as pd


For each voyage, the data science team receives a batch of data. Let's load the first batch using the `read_csv` function that Great Expectations provides.

In [34]:
batch1 = ge.read_csv('titanicbatch1.csv')

In [35]:
batch1.head()

| Unnamed: 0.1 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


The data engineers know that passengers board the ship at one of three ports: Cherbourg in France, Queenstown in Ireland, and Southampton in the UK. In the `Embarked` column, these ports are coded as `C` (Cherbourg), `Q` (Queenstown), and `S` (Southampton).

We can use the Great Expectations package to create an expectation that ensures the `Embarked` column contains only these three values.

In [36]:
batch1.expect_column_values_to_be_in_set('Embarked', ['S', 'C', 'Q'])


{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 400,
    "missing_count": 1,
    "missing_percent": 0.25,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true,
  "meta": {}
}

When you run an expectation on a dataset, it returns a result like the one above. The most important field is `"success"`. Here `"success": true` means that the `Embarked` column met our expectation. The single missing value (`"missing_count": 1`) is reported separately and does not count as unexpected.
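To make the result fields less opaque, here is a minimal sketch of the bookkeeping behind a "values in set" check. It is not the library's own code: `check_values_in_set` is a hypothetical helper written with the standard library only, `None` stands in for a missing cell, and the percentage arithmetic is inferred from the outputs shown in this notebook.

```python
def check_values_in_set(values, allowed):
    """Summarise how many entries of `values` fall outside `allowed`.

    Hypothetical helper that mirrors the result fields shown above.
    None stands in for a missing cell.
    """
    allowed = set(allowed)
    present = [v for v in values if v is not None]
    unexpected = [v for v in present if v not in allowed]

    total = len(values)
    missing = total - len(present)

    def pct(part, whole):
        # Guard against division by zero on empty input
        return 100.0 * part / whole if whole else 0.0

    return {
        # Missing values do not make the check fail; only unexpected ones do
        "success": not unexpected,
        "result": {
            "element_count": total,
            "missing_count": missing,
            "missing_percent": pct(missing, total),
            "unexpected_count": len(unexpected),
            "unexpected_percent_total": pct(len(unexpected), total),
            "unexpected_percent_nonmissing": pct(len(unexpected), len(present)),
            # Cap the sample of offending values; the cutoff of 20 is an assumption
            "partial_unexpected_list": unexpected[:20],
        },
    }


# Toy column: 399 valid port codes plus one missing value, like batch 1
embarked = ["S"] * 300 + ["C"] * 70 + ["Q"] * 29 + [None]
outcome = check_values_in_set(embarked, ["S", "C", "Q"])
print(outcome["success"], outcome["result"]["missing_percent"])
```

The split between "total" and "nonmissing" percentages matters once a column has many gaps: the same number of bad values looks smaller when divided by all rows than when divided by the rows that actually hold a value.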

Any time you create an expectation, it is stored as part of a configuration. We can inspect the config object with the `get_expectations_config()` method.

In [37]:
batch1.get_expectations_config()


{
  "expectations": [
    {
      "meta": {},
      "kwargs": {
        "column": "Embarked",
        "value_set": [
          "S",
          "C",
          "Q"
        ]
      },
      "expectation_type": "expect_column_values_to_be_in_set"
    }
  ],
  "expectation_suite_name": "default",
  "meta": {
    "great_expectations_version": "0.14.10"
  },
  "data_asset_type": "Dataset",
  "ge_cloud_id": null
}

Since we hope to use this config to validate the data we receive in the future, let's save it in our workspace as `titanic_config`:

In [38]:
titanic_config = batch1.get_expectations_config()
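A variable only lives as long as the notebook session. If the suite needs to survive a kernel restart, one option is to write it to disk as JSON. The sketch below uses only the standard library and a hand-written dict that copies the structure printed above. Whether the object returned by `get_expectations_config()` can be handed to `json.dump` directly depends on the library version, so treat that as something to verify. The file name is hypothetical.

```python
import json
import os
import tempfile

# Stand-in for the suite printed above, written out as a plain dict
suite = {
    "expectations": [
        {
            "meta": {},
            "kwargs": {"column": "Embarked", "value_set": ["S", "C", "Q"]},
            "expectation_type": "expect_column_values_to_be_in_set",
        }
    ],
    "expectation_suite_name": "default",
    "meta": {"great_expectations_version": "0.14.10"},
    "data_asset_type": "Dataset",
    "ge_cloud_id": None,
}

# Hypothetical file name; a temp directory keeps the example free of side effects
path = os.path.join(tempfile.mkdtemp(), "titanic_config.json")

# Write the suite to disk ...
with open(path, "w") as fh:
    json.dump(suite, fh, indent=2)

# ... and read it back, as a later session would
with open(path) as fh:
    restored = json.load(fh)

print(restored == suite)
```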

Once the data engineering team receives another batch of passenger data from the most recent voyage, they can use the config created earlier to validate it. First load the new batch, then call `validate` and pass `titanic_config` as the `expectation_suite` argument.

In [46]:
batch2 = ge.read_csv('titanicbatch2.csv')

In [49]:
batch2.validate(expectation_suite=titanic_config, only_return_failures=True)


{
  "success": false,
  "evaluation_parameters": {},
  "results": [
    {
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "result": {
        "element_count": 491,
        "missing_count": 1,
        "missing_percent": 0.20366598778004072,
        "unexpected_count": 1,
        "unexpected_percent": 0.20408163265306123,
        "unexpected_percent_total": 0.20366598778004072,
        "unexpected_percent_nonmissing": 0.20408163265306123,
        "partial_unexpected_list": [
          "Z"
        ]
      },
      "success": false,
      "expectation_config": {
        "meta": {},
        "kwargs": {
          "column": "Embarked",
          "value_set": [
            "S",
            "C",
            "Q"
          ]
        },
        "expectation_type": "expect_column_values_to_be_in_set"
      },
      "meta": {}
    }
  ],
  "meta": {
    "great_expectations_version": "0.14.10",
    "exp

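The validation output is verbose, so in practice it helps to boil it down to the failures. The following is a small sketch, assuming the result can be treated as a plain dict with the same keys as the output above. `summarize_failures` is a hypothetical helper, and the sample dict is a hand-copied subset of the batch 2 output.

```python
def summarize_failures(validation):
    """Return one readable line per failed expectation.

    Hypothetical helper; `validation` is assumed to be a plain dict with the
    same keys as the validation output shown above.
    """
    lines = []
    for item in validation["results"]:
        if item["success"]:
            continue  # only failures are interesting here
        config = item["expectation_config"]
        stats = item["result"]
        lines.append(
            "{}: {} failed for {} of {} rows, e.g. {}".format(
                # Table-level expectations have no column, so fall back to a label
                config["kwargs"].get("column", "<table>"),
                config["expectation_type"],
                stats["unexpected_count"],
                stats["element_count"],
                stats["partial_unexpected_list"],
            )
        )
    return lines


# Hand-copied subset of the batch 2 output above
validation = {
    "success": False,
    "results": [
        {
            "success": False,
            "result": {
                "element_count": 491,
                "unexpected_count": 1,
                "partial_unexpected_list": ["Z"],
            },
            "expectation_config": {
                "kwargs": {"column": "Embarked", "value_set": ["S", "C", "Q"]},
                "expectation_type": "expect_column_values_to_be_in_set",
            },
        }
    ],
}

summary = summarize_failures(validation)
for line in summary:
    print(line)
```

Here the summary points straight at the problem: one row in `Embarked` holds `Z`, which is not one of the three known port codes.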
In [39]:
df1 = ge.read_csv("titanicbatch1.csv")
df1 = df1.loc[:, df1.columns != 'Survived']

In [40]:
# `df` is assumed to be the full passenger DataFrame loaded in an earlier cell (not shown)
# .copy() makes each batch independent, so the later assignment does not trigger SettingWithCopyWarning
df_batch1 = df[0:400].copy()
df_batch2 = df[400:].copy()

In [44]:
# Plant an invalid port code in the second batch so that validation has something to catch
df_batch2.loc[400, 'Embarked'] = 'Z'


In [45]:
df_batch1.to_csv('titanicbatch1.csv')
df_batch2.to_csv('titanicbatch2.csv')