# Creating Expectations on CSV Files

As your data products and models are developed, you can encode assumptions about input and output datasets as **expectations**.

Encoding assumptions as expectations provides the following benefits:

1. Expectations are machine verifiable and can be used to monitor data flowing through your pipelines.
2. They eliminate poisonous implicit assumptions that cause data engineers to redo work and waste time ("How do we define visits?").
3. They **will eventually** be easy to edit.
4. They **will eventually** be easy to reason about visually.

In [1]:
import json
import os

import great_expectations as ge
import pandas as pd

Unable to load spark context; install optional spark dependency for support.


## Initialize a DataContext

A Great Expectations `DataContext` represents the collection of data asset specifications in this project.

You'll need:
- The directory where you ran `great_expectations init` (the one that contains the `.great_expectations.yml` file).
- A datasource for your CSV files (named "mycsvfile" in this example) defined in the datasources section of your Great Expectations configuration.

In [2]:
#context = ge.data_context.DataContext('../../', expectation_explorer=True)
context = ge.data_context.DataContext('../../', expectation_explorer=False)
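
Before constructing the context, it can help to confirm that the directory you pass really is the project root. The sketch below is a minimal check, assuming the configuration file is named `.great_expectations.yml` as described above. `has_project_config` is a hypothetical helper, not part of Great Expectations.

```python
import os
import tempfile


def has_project_config(directory, config_name=".great_expectations.yml"):
    """Return True if `directory` contains the project configuration file."""
    return os.path.isfile(os.path.join(directory, config_name))


# Demonstrate on a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as project_dir:
    before = has_project_config(project_dir)
    # Create an empty stand-in for the file that `great_expectations init` writes.
    open(os.path.join(project_dir, ".great_expectations.yml"), "w").close()
    after = has_project_config(project_dir)

print(before, after)
```

In a real project you would call `has_project_config('../../')` with the same path you pass to `DataContext`.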

## Get a Dataset

Using the data context, provide the name of the datasource configured in your project config ("mycsvfile" in this case), a name for the data asset, and the path to the CSV file to load.

In [3]:
df = context.get_data_asset("mycsvfile", data_asset_name="titanic_input_file", file_path="tutorial_data/Titanic.csv")

In [4]:
df.get_expectations_config()

	0 failing expectations
	0 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


{'data_asset_name': 'titanic_input_file',
 'meta': {'great_expectations.__version__': '0.6.0__develop__sch_internal'},
 'expectations': [{'expectation_type': 'expect_column_values_to_be_in_set',
   'kwargs': {'column': 'Sex', 'value_set': ['female', 'male']}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'Age', 'mostly': 0.5}},
  {'expectation_type': 'expect_column_values_to_be_between',
   'kwargs': {'column': 'Age', 'min_value': 0, 'max_value': 120}}],
 'data_asset_type': 'Dataset'}
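
The configuration returned above is a plain dictionary, so it can be inspected with ordinary Python. As a rough sketch, the snippet below groups the expectation types by column. The dictionary is copied from the output above (with `meta` and `data_asset_type` trimmed), and `expectations_by_column` is a hypothetical helper written for this example.

```python
from collections import defaultdict

# Copied from the output above; `meta` and `data_asset_type` are omitted.
config = {
    "data_asset_name": "titanic_input_file",
    "expectations": [
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "Sex", "value_set": ["female", "male"]}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "Age", "mostly": 0.5}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "Age", "min_value": 0, "max_value": 120}},
    ],
}


def expectations_by_column(config):
    """Group expectation types by the column they apply to."""
    grouped = defaultdict(list)
    for expectation in config["expectations"]:
        column = expectation["kwargs"].get("column")
        grouped[column].append(expectation["expectation_type"])
    return dict(grouped)


summary = expectations_by_column(config)
for column, expectation_types in summary.items():
    print(column, expectation_types)
```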

## Declare Expectations

As you develop your code:

1. State an assumption your code makes about its input data.
2. Check on the available data sample whether you can expect this assumption to be true.
3. If the available data sample violates this assumption, decide how your code should deal with the violations.
4. Update your assumption.
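
The loop described above can be sketched in plain Python, without Great Expectations, to make the steps concrete. This is only an illustration: `check_assumption` is a hypothetical helper, and the last sample row is made up to stand in for a passenger with a missing age.

```python
def check_assumption(rows, column, predicate):
    """Return the rows whose value in `column` violates `predicate`."""
    return [row for row in rows if not predicate(row[column])]


# Two rows from the Titanic sample plus one made-up row with a missing age.
sample = [
    {"Name": "Allen, Miss Elisabeth Walton", "Age": 29.0},
    {"Name": "Allison, Master Hudson Trevor", "Age": 0.92},
    {"Name": "(made-up row for illustration)", "Age": None},
]


# 1. State an assumption: every row has an age.
def has_age(age):
    return age is not None


# 2. Check it against the available sample.
violations = check_assumption(sample, "Age", has_age)

# 3. The sample violates the assumption, so decide how the code should cope.
#    Here the decision is to tolerate missing ages.


# 4. Update the assumption: an age, when present, lies in a plausible range.
def age_is_plausible(age):
    return age is None or 0 <= age <= 120


remaining = check_assumption(sample, "Age", age_is_plausible)

print(len(violations), len(remaining))
```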

In [5]:
df.head(10)

| Unnamed: 0.1 | Unnamed: 0 | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | 2 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
| 2 | 3 | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
| 3 | 4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.0 | female | 0 | 1 |
| 4 | 5 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
| 5 | 6 | Anderson, Mr Harry | 1st | 47.0 | male | 1 | 0 |
| 6 | 7 | Andrews, Miss Kornelia Theodosia | 1st | 63.0 | female | 1 | 1 |
| 7 | 8 | Andrews, Mr Thomas, jr | 1st | 39.0 | male | 0 | 0 |
| 8 | 9 | Appleton, Mrs Edward Dale (Charlotte Lamson) | 1st | 58.0 | female | 1 | 1 |
| 9 | 10 | Artagaveytia, Mr Ramon | 1st | 71.0 | male | 0 | 0 |


### Can we assume that "male" and "female" are the only values we will see in the "Sex" column?

In [6]:
df.expect_column_values_to_be_in_set('Sex', ['female', 'male'], include_config=True)

{'success': True,
 'result': {'element_count': 1313,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []},
 'expectation_config': {'expectation_type': 'expect_column_values_to_be_in_set',
  'kwargs': {'column': 'Sex',
   'value_set': ['female', 'male'],
   'result_format': 'BASIC'}}}
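
The validation result is also a plain dictionary, so downstream code can act on it directly. The sketch below turns the result shown above into a one-line summary. The dictionary is copied from that output with `expectation_config` left out, and `describe` is a hypothetical helper, not a Great Expectations function.

```python
# Copied from the output above; `expectation_config` is omitted for brevity.
result = {
    "success": True,
    "result": {
        "element_count": 1313,
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "unexpected_percent_nonmissing": 0.0,
        "partial_unexpected_list": [],
    },
}


def describe(result):
    """Summarize a validation result as a single human-readable line."""
    details = result["result"]
    status = "PASS" if result["success"] else "FAIL"
    return "{}: {} of {} values unexpected".format(
        status, details["unexpected_count"], details["element_count"]
    )


summary_line = describe(result)
print(summary_line)
```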

### Yes. Let's keep this expectation. If our code encounters input data that does not conform to it, we want to know about it.

### Can we assume that all people in our input data have a non-empty value in the "Age" column?

In [7]:
df.expect_column_values_to_not_be_null('Age')

{'success': False,
 'result': {'element_count': 1313,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 557,
  'unexpected_percent': 0.4242193450114242,
  'partial_unexpected_list': []}}

### No. We will have to adjust our code to deal with nulls in this column. However, let's make sure that if our code later encounters input data with more nulls than we expect, we will be notified:

In [6]:
df.expect_column_values_to_not_be_null('Age', mostly=0.5)

{'success': True,
 'result': {'element_count': 1313,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 557,
  'unexpected_percent': 0.4242193450114242,
  'partial_unexpected_list': []}}
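
To see why `mostly=0.5` turns the failure into a success, it helps to work through the arithmetic with the counts reported above. The sketch below assumes that `mostly` is a minimum fraction of conforming values and that the comparison is "greater than or equal". `passes_mostly` is a hypothetical helper written for this illustration, not the library's implementation.

```python
def passes_mostly(element_count, unexpected_count, mostly):
    """Return True when the fraction of conforming values is at least `mostly`."""
    conforming_fraction = (element_count - unexpected_count) / element_count
    return conforming_fraction >= mostly


element_count = 1313     # rows in the file, from the result above
unexpected_count = 557   # rows with a null "Age", from the result above

conforming = (element_count - unexpected_count) / element_count
print(round(conforming, 4))

# About 58% of rows have an age, so a 50% threshold passes and a 60% one would not.
print(passes_mostly(element_count, unexpected_count, mostly=0.5))
print(passes_mostly(element_count, unexpected_count, mostly=0.6))
```

Under this reading, a threshold a little below the observed fraction tolerates today's data while still flagging a file in which noticeably more ages are missing.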

### Can we assume that all "Age" column values are in a reasonable range?

In [8]:
df.expect_column_values_to_be_between('Age', min_value=0, max_value=120)

{'success': True,
 'result': {'element_count': 1313,
  'missing_count': 557,
  'missing_percent': 0.4242193450114242,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}
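
Note that the result above counts the 557 null ages as missing rather than unexpected, so they do not make the range check fail. The sketch below shows that behavior in plain Python. It is a simplified stand-in, not the library's code: `values_between_result` is a hypothetical helper, `None` represents a missing age, and 150 is a made-up value added only to show a failure.

```python
def values_between_result(values, min_value, max_value):
    """Range check that skips missing values and reports them separately."""
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if not (min_value <= v <= max_value)]
    return {
        "success": len(unexpected) == 0,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(non_missing),
            "unexpected_count": len(unexpected),
            "unexpected_list": unexpected,
        },
    }


# Ages from the first rows shown earlier, plus None for a missing age.
ages = [29.0, 2.0, 30.0, 25.0, 0.92, None]
in_range = values_between_result(ages, min_value=0, max_value=120)

# 150 is a made-up value, added only to demonstrate a failing check.
out_of_range = values_between_result(ages + [150], min_value=0, max_value=120)

print(in_range["success"], in_range["result"]["missing_count"])
print(out_of_range["success"], out_of_range["result"]["unexpected_list"])
```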

### Yes. Great, let's keep this expectation. Our code can assume that this is true. If we later see input data that violates this expectation, our validation will catch it.

### Let's review the expectations.

In [9]:
df.get_expectations_config()

	1 failing expectations
	2 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


{'data_asset_name': 'titanic_input_file',
 'meta': {'great_expectations.__version__': '0.6.0__develop__sch_internal'},
 'expectations': [{'expectation_type': 'expect_column_values_to_be_in_set',
   'kwargs': {'column': 'Sex', 'value_set': ['female', 'male']}},
  {'expectation_type': 'expect_column_values_to_be_between',
   'kwargs': {'column': 'Age', 'min_value': 0, 'max_value': 120}}],
 'data_asset_type': 'Dataset'}

### And save them. Expectations for "titanic_input_file" will be saved in a JSON file in the `great_expectations/data_asset_configurations` directory. We will load this file when we need to validate.

In [11]:
df.save_expectations_config()

	1 failing expectations
	2 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.
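
Because the saved expectations are JSON, they round-trip through the standard `json` module. The sketch below writes the configuration shown earlier to a temporary directory and reads it back. The file name derived from the asset name is an assumption made for this example, and `save_config` and `load_config` are hypothetical helpers. In a real project, `save_expectations_config()` decides where the file goes.

```python
import json
import os
import tempfile

# Configuration as printed by get_expectations_config() above.
config = {
    "data_asset_name": "titanic_input_file",
    "meta": {"great_expectations.__version__": "0.6.0__develop__sch_internal"},
    "expectations": [
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "Sex", "value_set": ["female", "male"]}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "Age", "min_value": 0, "max_value": 120}},
    ],
    "data_asset_type": "Dataset",
}


def save_config(config, directory):
    """Write the config to <directory>/<data_asset_name>.json and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, config["data_asset_name"] + ".json")
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path


def load_config(path):
    """Read a previously saved expectations config."""
    with open(path) as f:
        return json.load(f)


# Use a throwaway directory so the example does not touch a real project.
with tempfile.TemporaryDirectory() as tmp:
    path = save_config(config, os.path.join(tmp, "data_asset_configurations"))
    reloaded = load_config(path)

print(os.path.basename(path), reloaded == config)
```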
