In [1]:
import json
import os
import great_expectations as ge
import great_expectations.jupyter_ux

# Author Expectations For The CSV Files You Read Into Spark Dataframes

When you develop your data pipeline code, you make some assumptions about what valid input data looks like.
You can encode these assumptions as *expectations* (e.g., "column X should not have more than 5% null values").
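
To make the idea concrete, here is a minimal sketch in plain Python of what the example expectation above checks. It is not the Great Expectations API: the function name `null_fraction_at_most` and the sample rows are hypothetical and serve only as illustration.

```python
from typing import Any, Mapping, Sequence


def null_fraction_at_most(rows: Sequence[Mapping[str, Any]], column: str, limit: float) -> bool:
    """Return True if the share of None values in `column` does not exceed `limit`."""
    if not rows:
        # An empty input has no null values, so the assumption holds trivially.
        return True
    null_count = sum(1 for row in rows if row.get(column) is None)
    return null_count / len(rows) <= limit


# Hypothetical sample: 1 of 4 order dates is missing (25% nulls).
sample_rows = [
    {"order_id": 1, "order_date": "2019-01-01"},
    {"order_id": 2, "order_date": None},
    {"order_id": 3, "order_date": "2019-01-02"},
    {"order_id": 4, "order_date": "2019-01-03"},
]

within_5_percent = null_fraction_at_most(sample_rows, "order_date", 0.05)
within_50_percent = null_fraction_at_most(sample_rows, "order_date", 0.50)
```

With this sample, the 5% limit is violated while a 50% limit would hold, which is the kind of pass/fail signal an expectation gives you.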

Once you deploy your code in production, Great Expectations will validate new data and check if it conforms to the assumptions your code makes.

This way you can stop data that your code does not know how to deal with from being processed, thus avoiding the "garbage in, garbage out" problem.

First, you have to author your expectations for every type of file your code processes.

In this notebook you can create expectations for the CSV files that you will load into Spark Dataframes.


## Create a DataContext object

First, we need to create a `DataContext` object, which represents Great Expectations in your data pipeline.
We pass `'../../'` to this object to tell it where to find its configuration. You do not need to modify this line.


In [2]:
# context = ge.data_context.DataContext('../../', expectation_explorer=True)
context = ge.data_context.DataContext('../../', expectation_explorer=False)

## Data source


Data sources are locations where your pipeline reads its input data from. In our case, it is a directory on the local file system.

When you ran `great_expectations init` in your project, you configured a data source of type "spark" and gave it a name.


In [7]:
data_source_name = great_expectations.jupyter_ux.set_data_source(context, 'spark')


In [8]:
#data_source_name = ???

In [9]:
data_source_name

'201810'

In Great Expectations, we use the name "data asset" for each "type" of file.

Let's say that your Spark data pipeline processes CSV files in the `/data/my_input_directory` directory on the filesystem.
CSV files that contain order lines are deposited in the subdirectory `orders`, and the ones that contain cancellation lines in `cancellations`. Each CSV file has a date and/or sequence number in its name.

Following this example, the directory will look like this:

    my_input_directory
        ├── orders
        |   ├── orders_20190101_1.csv
        |   ├── orders_20190102_1.csv
        |   └── orders_20190103_1.csv
        └── cancellations
            ├── cancellations_20190101_1.csv
            ├── cancellations_20190102_1.csv
            └── cancellations_20190103_1.csv

In this example there are 2 data assets: "orders" and "cancellations". You can create expectations about these types.
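
The mapping from directory layout to data assets can be sketched in plain Python as follows. This is only an illustration of the idea, not how Great Expectations discovers assets internally; the helper `list_data_assets` is hypothetical, and the directory is recreated in a temporary location so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path


def list_data_assets(base_dir: Path) -> dict:
    """Map each subdirectory name (a data asset) to the sorted names of its CSV files."""
    assets = {}
    for entry in sorted(base_dir.iterdir()):
        if entry.is_dir():
            assets[entry.name] = sorted(p.name for p in entry.glob("*.csv"))
    return assets


# Build a throwaway copy of the example layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "my_input_directory"
    for asset in ("orders", "cancellations"):
        (base / asset).mkdir(parents=True)
        for day in ("20190101", "20190102", "20190103"):
            (base / asset / f"{asset}_{day}_1.csv").touch()
    assets = list_data_assets(base)
```

Here `assets` has two keys, "orders" and "cancellations", each listing its three CSV files.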

In order to create expectations about a data asset (e.g., orders), you will need to load one of the files of this type
into Great Expectations. 

In [10]:
great_expectations.jupyter_ux.list_available_data_asset_names(context, data_source_name=data_source_name)

{'member', 'providersupplemental', 'claimcode', 'memberenrollment', 'upkmemberkeys', 'claim', 'provider', '.DS_Store'}


#### Pick one of the data asset names above and use it as the value of the `data_asset_name` argument below.

**Note: If you need to pass options to the Spark reader (e.g., delimiter, header), you can add them as arguments in the method call below. Once you have settled on your options, add them to the config of this data source in great_expectations.yml under the "reader_options" key.**

In [21]:
# Replace "orders" with one of the data asset names listed above.
df = context.get_data_asset(data_source_name, data_asset_name="orders")
df.spark_df.show()

OSError: ../../project_data/clarify_payer/claim_data/ability_payer/staging/201810/orders

The call in the cell above loaded one of the batches of this data asset.
When working with files, a batch corresponds to one file.
You can read more about batches here:
https://great-expectations.readthedocs.io/en/latest/what_are_batches.html


In [13]:
# this is how you can see which file was loaded
df._batch_kwargs

{'path': '/Users/eugenemandel/projects/forum-edw/../../project_data/clarify_payer/claim_data/ability_payer/staging/201810/claimcode/000000_0'}

## Author Expectations

Now that you have one of the files loaded, you can call expect* methods on the dataframe in order to check
if you can make an assumption about the data.

For example, to check if you can expect values in column "order_date" to never be empty, call: `df.expect_column_values_to_not_be_null('order_date')`

### How do I know which types of expectations I can add?
* *Tab-complete* this statement, and add an expectation of your own; copy the cell to add more
* In Jupyter, you can also use *shift-tab* to see the docstring for each expectation, which lists the parameters it takes and gives more information about the expectation.
* Here is a glossary of expectations you can add:
https://great-expectations.readthedocs.io/en/latest/glossary.html
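
If tab completion is not available, the same list can be produced with plain Python introspection. The sketch below assumes that expectations are ordinary methods whose names start with `expect_`; `StandInDataset` is a hypothetical stand-in for the loaded data asset so that the example runs without Spark or Great Expectations.

```python
class StandInDataset:
    """Hypothetical stand-in for the loaded data asset; only the method names matter here."""

    def expect_column_values_to_not_be_null(self, column):
        """Expect no null values in the column."""

    def expect_column_to_exist(self, column):
        """Expect the column to exist."""

    def get_expectations_config(self):
        """Not an expectation, so it is filtered out below."""


def list_expectation_methods(obj) -> list:
    """Return the sorted names of callable attributes that start with 'expect_'."""
    return sorted(
        name for name in dir(obj)
        if name.startswith("expect_") and callable(getattr(obj, name))
    )


expectation_names = list_expectation_methods(StandInDataset())
```

Under the same assumption, calling `list_expectation_methods(df)` on the data asset loaded in this notebook should list the expectations it supports.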

In [15]:
#example:

column_name = df.spark_df.columns[0]
df.expect_column_values_to_not_be_null(column_name)


{'success': True,
 'result': {'element_count': 26427155,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

### Let's review the expectations.

Expectations that were true on this data sample were added. To view all the expectations you have added so far about this type of file, run:

In [16]:
df.get_expectations_config()

	0 failing expectations
	1 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


{'data_asset_name': 'claimcode',
 'meta': {'great_expectations.__version__': '0.6.1__develop__sch_internal'},
 'expectations': [{'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': '_c0'}}],
 'data_asset_type': 'Dataset'}

Now let's save the expectations about this type of file. Expectations for "orders" in our example will be saved in a JSON file in the great_expectations/data_asset_configurations directory. We will load this file when we need to validate.


      your_project_root
        ├── great_expectations
        |   └── expectations
        |     └── orders.json        
        |     └── cancellations.json        
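
Because the saved file is plain JSON, you can inspect it with the standard `json` module. The sketch below assumes the file has the same structure as the output of `get_expectations_config()` shown earlier; the helper `summarize_expectations` and the sample content are hypothetical, and the file is written to a temporary directory rather than to your project.

```python
import json
import tempfile
from pathlib import Path

# Sample config modeled on the structure printed by get_expectations_config().
sample_config = {
    "data_asset_name": "orders",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_date"},
        }
    ],
    "data_asset_type": "Dataset",
}


def summarize_expectations(path: Path) -> list:
    """Return (expectation_type, column) pairs from a saved expectations JSON file."""
    with path.open() as f:
        config = json.load(f)
    return [
        (e["expectation_type"], e["kwargs"].get("column"))
        for e in config["expectations"]
    ]


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "orders.json"
    config_path.write_text(json.dumps(sample_config, indent=2))
    summary = summarize_expectations(config_path)
```

A summary like this is a quick way to review what has been recorded for a data asset before committing the file to version control.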


In [18]:
df.save_expectations_config()

	0 failing expectations
	1 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


### Now that you have created and saved expectations for at least one of the types of CSV files your data pipeline processes, we will show you how to set up validation: the process of checking whether new files of this type conform to your expectations before they are processed by your pipeline's code.

Check out the notebook "validate_csv_files.ipynb"

