In [None]:
import json
import os
import great_expectations as ge
import pandas as pd

# Author Expectations For Your CSV Files

As your data products and models are developed, you can encode assumptions about input and output datasets as **expectations**.

This workflow provides the following benefits:

1. Expectations are machine-verifiable and can be used to monitor data flowing through your pipelines.
2. They eliminate poisonous implicit assumptions that force data engineers to redo work and waste time (for example, "How do we define visits?").
3. They **will eventually** be easy to edit.
4. They **will eventually** be easy to reason about visually.


Let's say that your data pipeline processes CSV files in the `/data/my_input_directory` directory on the filesystem.
CSV files that contain order lines are deposited in the subdirectory `orders`, and the ones that contain cancellation lines
are deposited in `cancellations`. Each CSV file has a date and/or sequence number in its name.

Following this example, the directory will look like this:

    my_input_directory
        ├── orders
        |   └── orders_20190101_1.csv        
        |   └── orders_20190102_1.csv        
        |   └── orders_20190103_1.csv        
        ├── cancellations
        |   └── cancellations_20190101_1.csv        
        |   └── cancellations_20190102_1.csv        
        |   └── cancellations_20190103_1.csv        
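
The layout above can be explored with a few lines of standard-library Python. This is only a sketch: it assumes every file name follows the `<type>_<YYYYMMDD>_<sequence>.csv` scheme shown in the example tree, and `FILE_PATTERN` and `group_by_asset` are hypothetical names introduced for illustration, not part of Great Expectations.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <type>_<YYYYMMDD>_<sequence>.csv (taken from the example tree)
FILE_PATTERN = re.compile(r"^(?P<asset>[a-z]+)_(?P<date>\d{8})_(?P<seq>\d+)\.csv$")


def group_by_asset(relative_paths):
    """Group 'subdir/filename.csv' paths by subdirectory, sorted by date and sequence."""
    groups = defaultdict(list)
    for path in relative_paths:
        subdir, _, filename = path.rpartition("/")
        match = FILE_PATTERN.match(filename)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        groups[subdir].append((match["date"], int(match["seq"]), path))
    # Sorting the tuples orders files chronologically, then by sequence number
    return {asset: sorted(files) for asset, files in groups.items()}


paths = [
    "orders/orders_20190102_1.csv",
    "orders/orders_20190101_1.csv",
    "cancellations/cancellations_20190101_1.csv",
    "orders/README.txt",
]
grouped = group_by_asset(paths)
latest_orders = grouped["orders"][-1][2]  # path of the most recent orders file
```

Grouping by subdirectory mirrors the idea that each subdirectory holds one type of file, so any file in a group can serve as a sample for that type.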

Your code that processes these files as they arrive makes some assumptions about what a valid file looks like.
You can encode these assumptions as expectations (e.g., "column X should not have more than 5% null values").

When you validate new files to check if they conform to the assumptions your code makes, you can stop data that your code
does not know how to deal with from being processed, thus avoiding the "garbage in, garbage out" problem.
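
To make the idea concrete, here is a minimal sketch of the "no more than 5% null values" check written with the standard library only. It is not how Great Expectations implements the check; the helper names `null_fraction` and `is_valid`, the column name `order_date`, and the sample data are assumptions made for illustration.

```python
import csv
import io


def null_fraction(csv_text, column):
    """Return the fraction of rows whose value in `column` is missing or blank."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if (row[column] or "").strip() == "")
    return missing / len(rows)


def is_valid(csv_text, column, max_null_fraction=0.05):
    """Gate a file: accept it only if the null fraction stays within the limit."""
    return null_fraction(csv_text, column) <= max_null_fraction


good_file = "order_id,order_date\n1,2019-01-01\n2,2019-01-02\n"
bad_file = "order_id,order_date\n1,\n2,2019-01-02\n"

files_to_process = [
    name
    for name, text in [("good", good_file), ("bad", bad_file)]
    if is_valid(text, "order_date")
]
```

Only files that pass the check reach the processing step, which is the "stop bad data before it is processed" behavior described above.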

First, you have to author your expectations for every type of file your code processes.




## Create a DataContext object

First, we need to create a `DataContext` object, which represents Great Expectations in your data pipeline.
We pass `'../../'` to this object to let it know where to find its configuration. There is no need to modify this line.


In [None]:
#context = ge.data_context.DataContext('../../', expectation_explorer=True)
context = ge.data_context.DataContext('../../', expectation_explorer=False)

## Data source


Data sources are the locations your pipeline reads its input data from. In our case, the data source is a directory.

When you ran `great_expectations init` in your project, you configured a data source of type "filesystem" and gave it a name ("my_input_directory" in our example).

In the following cell, set `data_source_name` to the name of your data source.

If you did not create the data source during init, here is how to add it now: 
https://great-expectations.readthedocs.io/en/latest/how_to_add_data_source.html

In [None]:
data_source_name = "my_input_directory"

In Great Expectations, we use the name "data asset" for each "type" of file (e.g., orders and cancellations).

To create expectations about a data asset (e.g., orders), you need to load one of the files of this type
into Great Expectations. `df` below behaves like a regular Pandas dataframe, but with additional methods added by Great Expectations, as you will see shortly.

In the next cell, we call `context.get_data_asset` to load one of the files.


In [None]:
# df = context.get_data_asset(data_source_name, data_asset_name="orders")
# df.head()

## Author Expectations

Now that you have one of the files loaded, you can call `expect_*` methods on the dataframe to check
whether a given assumption holds for the data.

For example, to check whether you can expect values in the column "order_date" to never be empty, call: `df.expect_column_values_to_not_be_null('order_date')`
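
The following toy model sketches the general shape of such a call: the method checks the data, returns a result, and records the expectation when it holds on the sample. `ToyDataset` is a hypothetical stand-in written for illustration, not the Great Expectations implementation, and the result dictionary is simplified (the real result format may contain different or additional fields).

```python
class ToyDataset:
    """Toy stand-in for a dataset with an expect-style method (illustration only)."""

    def __init__(self, rows):
        self.rows = rows          # list of dicts, one per CSV line
        self.expectations = []    # expectations that held on this sample

    def expect_column_values_to_not_be_null(self, column):
        # Treat both None and the empty string as "null" in this toy model
        unexpected = [row for row in self.rows if row.get(column) in (None, "")]
        success = not unexpected
        if success:
            # Record the expectation so that it can be reviewed and saved later
            self.expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
            })
        return {"success": success, "unexpected_count": len(unexpected)}


sample = ToyDataset([
    {"order_id": "1", "order_date": "2019-01-01", "coupon": ""},
    {"order_id": "2", "order_date": "2019-01-02", "coupon": "SPRING"},
])
date_result = sample.expect_column_values_to_not_be_null("order_date")
coupon_result = sample.expect_column_values_to_not_be_null("coupon")
```

In this sketch, only the `order_date` expectation is recorded, because the `coupon` column contains an empty value in the sample.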


Here is a glossary of expectations you can add:
https://great-expectations.readthedocs.io/en/latest/glossary.html

### Let's review the expectations.

Expectations that were true on this data sample were added. To view all the expectations you have added so far for this type of file, run:

In [None]:
#df.get_expectations_config()

Now let's save the expectations for this type of file. Expectations for "orders" in our example will be saved in a JSON file in the `great_expectations/data_asset_configurations` directory. We will load this file when we need to validate.


      your_project_root
        ├── great_expectations
        |   └── expectations
        |     └── orders.json        
        |     └── cancellations.json        
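
As a rough illustration of what "saved in a JSON file" means, the sketch below writes a small set of expectations to `orders.json` and reads it back. The dictionary layout (`data_asset_name`, `expectations`, `expectation_type`, `kwargs`) is an assumption about the file's shape, and `save_suite` and `load_suite` are hypothetical helpers; in the notebook itself, saving is done by `df.save_expectations_config()`.

```python
import json
import os
import tempfile

# Assumed shape of a saved expectations file (illustration only)
suite = {
    "data_asset_name": "orders",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_date"},
        },
    ],
}


def save_suite(suite, directory):
    """Write the suite to <directory>/<data_asset_name>.json and return the path."""
    path = os.path.join(directory, suite["data_asset_name"] + ".json")
    with open(path, "w") as f:
        json.dump(suite, f, indent=2)
    return path


def load_suite(path):
    """Read a previously saved suite back into a dictionary."""
    with open(path) as f:
        return json.load(f)


# A temporary directory keeps the sketch from touching a real project
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_suite(suite, tmp)
    reloaded = load_suite(saved_path)
```

Keeping one file per data asset means the validation step only needs the asset name to find the expectations it should apply.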


In [None]:
#df.save_expectations_config()

### Next steps

Now that you have created and saved expectations for at least one of the types of CSV files your data pipeline processes, we will show you how to set up validation: the process of checking whether new files of this type conform to your expectations before they are processed by your pipeline's code.

Check out the notebook "validate_csv_files.ipynb".

