In [None]:
import json
import os
import great_expectations as ge
import great_expectations.jupyter_ux
import pandas as pd

# Author Expectations For Your CSV Files

When you develop your data pipeline code, you make some assumptions about what valid input data looks like.
You can encode these assumptions as *expectations* (e.g., "column X should not have more than 5% null values").

Once you deploy your code in production, Great Expectations will validate new data and check if it conforms to the assumptions your code makes.

This way you can stop data that your code does not know how to deal with from being processed, thus avoiding the "garbage in, garbage out" problem.

In this notebook you will create expectations for the CSV files your pipeline processes.


## Create a DataContext object

First, create a `DataContext` object. It represents Great Expectations in your data pipeline.
The path `'../../'` passed to it tells it where to find its configuration. You do not need to modify this line.


In [None]:
# context = ge.data_context.DataContext('../../', expectation_explorer=True)
context = ge.data_context.DataContext('../../', expectation_explorer=False)

## Data source


Data sources are locations where your pipeline reads its input data from. In our case, it is a directory on the local file system.

When you ran `great_expectations init` in your project, you configured a data source of type "pandas" and gave it a name.


In [None]:
data_source_name = great_expectations.jupyter_ux.set_data_source(context, data_source_type='pandas')


In [None]:
#data_source_name = ???

In [None]:
data_source_name

In Great Expectations, a "data asset" is the name for each "type" of file.

Let's say that your data pipeline processes CSV files in the `/data/my_input_directory` directory on the filesystem.
CSV files that contain order lines are deposited in the subdirectory `orders`, and the ones that contain cancellation lines in `cancellations`. Each CSV file has a date and/or a sequence number in its name.

Following this example, the directory will look like this:

    my_input_directory
        ├── orders
        │   ├── orders_20190101_1.csv
        │   ├── orders_20190102_1.csv
        │   └── orders_20190103_1.csv
        └── cancellations
            ├── cancellations_20190101_1.csv
            ├── cancellations_20190102_1.csv
            └── cancellations_20190103_1.csv

In this example there are 2 data assets: "orders" and "cancellations". You can create expectations about these types.
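
The mapping from directories to data assets can be sketched in plain Python. This is only an illustration of the convention described above, not the code Great Expectations runs; `list_data_assets` is a hypothetical helper, and a temporary directory stands in for `/data/my_input_directory`.

```python
import tempfile
from pathlib import Path


def list_data_assets(base_dir):
    """Map each subdirectory name to the sorted CSV file names it contains."""
    assets = {}
    for sub in sorted(Path(base_dir).iterdir()):
        if sub.is_dir():
            assets[sub.name] = sorted(p.name for p in sub.glob("*.csv"))
    return assets


# Build a throwaway copy of the example layout so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_input_directory"
    for asset in ("orders", "cancellations"):
        (root / asset).mkdir(parents=True)
        for day in ("20190101", "20190102", "20190103"):
            (root / asset / f"{asset}_{day}_1.csv").touch()
    assets = list_data_assets(root)
```

Each key of `assets` corresponds to one data asset, and each file name listed under it is one candidate file to load.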

In order to create expectations about a data asset (e.g., orders), you will need to load one of the files of this type
into Great Expectations. 

In [None]:
great_expectations.jupyter_ux.list_available_data_asset_names(context, data_source_name=data_source_name)

#### Pick one of the data asset names above and use it as the value of the `data_asset_name` argument below.

In [None]:
df = context.get_batch(data_source_name, data_asset_name="orders")
df.head()

**Note: If you need to pass options to `read_csv` (e.g., `sep`, `header`), you can add them as arguments in the method call above. Once you have all your options, add them to the config of this data source in `great_expectations.yml` under the `read_csv_kwargs` key.**

The call in the cell above loaded one of the batches of this data asset.
When working with files, a batch corresponds to one file.
You can read more about batches here:
https://great-expectations.readthedocs.io/en/latest/what_are_batches.html


In [None]:
# this is how you can see which file was loaded
df._batch_kwargs

## Author Expectations

Now that you have one of the files loaded, you can call `expect_*` methods on the dataframe to check
whether an assumption holds for the data.

For example, to check if you can expect values in column "order_date" to never be empty, call: `df.expect_column_values_to_not_be_null('order_date')`
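
To make the idea concrete, here is a minimal plain-Python sketch of what a not-null check amounts to, with an optional tolerance for a small share of empty values. It is not how Great Expectations implements `expect_column_values_to_not_be_null`; the helper names `null_fraction` and `check_not_null`, the sample data, and the shape of the returned dictionary are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample standing in for one orders file; column names are assumptions.
SAMPLE_CSV = """order_id,order_date
1,2019-01-01
2,
3,2019-01-03
4,2019-01-03
"""


def null_fraction(rows, column):
    """Return the fraction of rows whose value in `column` is empty."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row[column] in ("", None))
    return missing / len(rows)


def check_not_null(rows, column, max_null_fraction=0.0):
    """Report whether the share of empty values stays within the allowed limit."""
    fraction = null_fraction(rows, column)
    return {"success": fraction <= max_null_fraction, "null_fraction": fraction}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
strict = check_not_null(rows, "order_date")
tolerant = check_not_null(rows, "order_date", max_null_fraction=0.30)
```

With one empty value out of four rows, the strict check fails, while the check that tolerates up to 30% empty values passes.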

### How do I know which types of expectations I can add?
* *Tab-complete* this statement, and add an expectation of your own; copy the cell to add more
* In Jupyter, you can also use *shift-tab* to see the docstring for each expectation, which lists the parameters it takes and gives more information about the expectation.
* Here is a glossary of expectations you can add:
https://great-expectations.readthedocs.io/en/latest/glossary.html

In [None]:
# Example: expect the first column to contain no null values

column_name = df.columns[0]
df.expect_column_values_to_not_be_null(column_name)


In [None]:
# add more expectations here

In [None]:
# add more expectations here

In [None]:
# add more expectations here

### Let's review the expectations.

Expectations that held true on this data sample were added. To view all the expectations you have added so far for this type of file, run:

In [None]:
df.get_expectations_config()

Now let's save the expectations for this type of file. Expectations for "orders" in our example will be saved in a JSON file in the `great_expectations/expectations` directory. We will load this file when we need to validate.


      your_project_root
        └── great_expectations
            └── expectations
                ├── orders.json
                └── cancellations.json
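
As a rough picture of what ends up on disk, the sketch below writes a small expectations document to `orders.json` and reads it back using only the standard library. It does not reproduce what `save_expectations_config` does internally, and the dictionary is a simplified, hypothetical stand-in: the exact keys and layout of the real file depend on your Great Expectations version.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, simplified content; the real file layout may differ by version.
suite = {
    "data_asset_name": "orders",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_date"},
        }
    ],
}

# A temporary directory stands in for your_project_root.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "great_expectations" / "expectations" / "orders.json"
    target.parent.mkdir(parents=True)
    target.write_text(json.dumps(suite, indent=2))
    reloaded = json.loads(target.read_text())
```

Because the expectations are stored as plain JSON, the file can be reviewed and kept under version control alongside your pipeline code.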


In [None]:
df.save_expectations_config()

### You created and saved expectations for at least one of the types of CSV files your data pipeline processes. 

### We will show you how to set up validation - the process of checking if new files of this type conform to your expectations before they are processed by your pipeline's code. 

### Go to [integrate_validation_into_pipeline.ipynb](integrate_validation_into_pipeline.ipynb) to proceed.


