In [1]:
import json
import os
import great_expectations as ge
import great_expectations.jupyter_ux
import pandas as pd

2019-10-08T18:44:06-0400 - INFO - Great Expectations logging enabled at INFO level by JupyterUX module.


# Author Expectations



[**Watch a short tutorial video**](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#video)

[**Read more in the tutorial**](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations)

**Reach out for help on** [**Great Expectations Slack**](https://greatexpectations.io/slack)


### Get a DataContext object
[Read more in the tutorial](https://great-expectations.readthedocs.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#get-datacontext-object)




In [3]:
context = ge.data_context.DataContext()

### List data assets in your project

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#data-assets)


In [4]:
great_expectations.jupyter_ux.list_available_data_asset_names(context)

data_source: demo__dir (PandasDatasource)
  generator_name: default (SubdirReaderGenerator)
    generator_asset: npidata_pfile
      expectation suite: BasicDatasetProfiler
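
The listing above suggests how a fully qualified data asset name is put together: datasource, generator, and generator asset joined by slashes, as in the `demo__dir/default/npidata_pfile_transformed` value used in the next cell. The following is a minimal sketch of that naming scheme under that assumption; `split_data_asset_name` and `DataAssetName` are hypothetical helpers, not Great Expectations functions.

```python
from collections import namedtuple

# Hypothetical container for the three parts of a data asset name.
DataAssetName = namedtuple("DataAssetName", ["datasource", "generator", "generator_asset"])


def split_data_asset_name(name, delimiter="/"):
    """Split 'datasource/generator/asset' into its three parts (illustrative only)."""
    parts = name.split(delimiter)
    if len(parts) != 3:
        raise ValueError(f"Expected datasource/generator/asset, got {name!r}")
    return DataAssetName(*parts)


parsed = split_data_asset_name("demo__dir/default/npidata_pfile")
print(parsed.datasource, parsed.generator, parsed.generator_asset)
```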


### Set the name of the data asset to create expectations for and the name of the expectation suite that will hold them

We recommend naming your first expectation suite for a data asset "warning". Later, as you identify some of the expectations in this suite as critical, you can move them into another suite named "failure".

In [21]:
data_asset_name = "demo__dir/default/npidata_pfile_transformed" # TODO: replace with your value!
expectation_suite_name = "warning" # TODO: replace with your value!
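
The "warning" to "failure" split recommended above can be pictured as moving entries between two suite dictionaries. This is a minimal sketch that assumes a suite is a plain dict with an `expectations` list, like the dicts printed elsewhere in this notebook; `promote_expectations` is a hypothetical helper, not part of Great Expectations.

```python
import copy


def promote_expectations(warning_suite, failure_suite, critical_types):
    """Move expectations whose type is in critical_types into the failure suite.

    Hypothetical helper: works on plain dicts and returns new copies,
    leaving the input suites untouched.
    """
    warning = copy.deepcopy(warning_suite)
    failure = copy.deepcopy(failure_suite)
    kept = []
    for expectation in warning["expectations"]:
        if expectation["expectation_type"] in critical_types:
            failure["expectations"].append(expectation)
        else:
            kept.append(expectation)
    warning["expectations"] = kept
    return warning, failure


warning_suite = {
    "data_asset_name": "demo__dir/default/npidata_pfile",
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "NPI"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "NPI"}},
    ],
}
failure_suite = {
    "data_asset_name": "demo__dir/default/npidata_pfile",
    "expectations": [],
}

warning_after, failure_after = promote_expectations(
    warning_suite, failure_suite, critical_types={"expect_column_to_exist"}
)
```

Returning copies keeps the original "warning" suite available until you are sure the split is what you want.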

### Create the expectation suite



In [22]:
context.create_expectation_suite(data_asset_name=data_asset_name, expectation_suite_name=expectation_suite_name)

{'data_asset_name': 'demo__dir/default/npidata_pfile_transformed',
 'meta': {'great_expectations.__version__': '0.8.0a3+15.g8cd46e9c'},
 'expectations': []}

### Load a batch of data from the data asset you want to validate

Learn about `get_batch` in [this tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#get-batch)

__Quick Guide:__

##### If you want to validate data in Pandas Dataframes or in Spark Dataframes:

* A. If GE listed and profiled your files correctly:

```
data_asset_name = "CHOOSE FROM THE LIST ABOVE"
batch = context.get_batch(data_asset_name, 
                          expectation_suite_name,
                          context.yield_batch_kwargs(data_asset_name))
```
* B. Otherwise (you want to control the logic of reading the data):

```
# Load the data into a dataframe, e.g., df = SparkDFDataset(spark.read.csv(...)) or df = pd.read_csv(...)
df = ...
data_asset_name = "COME UP WITH A NAME - THIS WILL CREATE A NEW DATA ASSET"
batch = context.get_batch(data_asset_name, 
                          expectation_suite_name, 
                          df)
```


##### If you want to validate data in a database:

* A. To validate an existing table:

```
data_asset_name = 'CHOOSE THE NAME OF YOUR TABLE FROM THE LIST OF DATA ASSETS ABOVE'
batch = context.get_batch(data_asset_name,
                        expectation_suite_name='my_suite',
                        batch_kwargs=BatchKwargs(table=data_asset_name))
```

* B. To validate a query result set:

```
data_asset_name = 'NAME YOUR QUERY (E.G., daily_users_query) - THIS WILL CREATE A NEW DATA ASSET'
batch = context.get_batch(data_asset_name,
                        expectation_suite_name='my_suite',
                        batch_kwargs=BatchKwargs(query='SQL FOR YOUR QUERY'))
```
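
The snippets above differ mainly in what the batch kwargs point at: a file path, a table, or a query. The sketch below illustrates that choice with plain dictionaries. `make_batch_kwargs` is a hypothetical helper, and the table name and SQL text are made up for illustration; only the `path`, `table`, and `query` keys come from this notebook.

```python
def make_batch_kwargs(path=None, table=None, query=None):
    """Build a batch_kwargs-style dict from exactly one data source.

    Hypothetical helper for illustration; not a Great Expectations API.
    """
    sources = {"path": path, "table": table, "query": query}
    chosen = {key: value for key, value in sources.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError("Specify exactly one of path, table, or query")
    return chosen


# A file on disk (path taken from the cell below).
file_kwargs = make_batch_kwargs(
    path="/data/demo/npidata_pfile/npidata_pfile_20050523-20190908_0.csv"
)
# An existing database table (made-up name).
table_kwargs = make_batch_kwargs(table="npidata_pfile")
# A query result set (made-up SQL).
query_kwargs = make_batch_kwargs(query="SELECT * FROM npidata_pfile")
```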





In [26]:
# COPY THE APPROPRIATE CODE SNIPPET FROM THE CELL ABOVE
batch = context.get_batch(data_asset_name,
                          expectation_suite_name,
                          batch_kwargs={"path": '/data/demo/npidata_pfile/npidata_pfile_20050523-20190908_0.csv'})
batch.head()

2019-10-08T19:05:46-0400 - INFO - 	0 expectation(s) included in expectation_suite.


Unnamed: 0,NPI,Entity Type Code,Replacement NPI,Employer Identification Number (EIN),Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,...,Healthcare Provider Taxonomy Group_6,Healthcare Provider Taxonomy Group_7,Healthcare Provider Taxonomy Group_8,Healthcare Provider Taxonomy Group_9,Healthcare Provider Taxonomy Group_10,Healthcare Provider Taxonomy Group_11,Healthcare Provider Taxonomy Group_12,Healthcare Provider Taxonomy Group_13,Healthcare Provider Taxonomy Group_14,Healthcare Provider Taxonomy Group_15
0,1679576722,1.0,,,,WIEBE,DAVID,A,,,...,,,,,,,,,,
1,1588667638,1.0,,,,PILCHER,WILLIAM,C,DR.,,...,,,,,,,,,,
2,1497758544,2.0,,<UNAVAIL>,"CUMBERLAND COUNTY HOSPITAL SYSTEM, INC",,,,,,...,,,,,,,,,,
3,1306849450,1.0,,,,SMITSON,HAROLD,LEROY,DR.,II,...,,,,,,,,,,
4,1215930367,1.0,,,,GRESSOT,LAURENT,,DR.,,...,,,,,,,,,,


#### Optionally, customize the options used to read your data (e.g., separators, header, etc.) by setting reader options in `get_batch`

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#reader-options)



In [20]:
# this is how you can see which data batch was loaded
batch._batch_kwargs

{'path': '/data/demo/npidata_pfile/npidata_pfile_20050523-20191001_1.csv',
 'partition_id': 'npidata_pfile_20050523-20191001_1',
 'sep': None,
 'engine': 'python'}
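
The batch kwargs above carry `'sep': None` together with `'engine': 'python'`. In pandas, that combination asks `read_csv` to detect the separator automatically with Python's `csv.Sniffer`. The sketch below uses only the standard library and made-up sample text to illustrate the detection step, then shows how an explicit separator could sit next to the path in a batch kwargs dict. Placing `sep` at the top level mirrors the output above; the sample rows are an assumption for illustration.

```python
import csv

# Made-up sample standing in for the first lines of a semicolon-separated file.
sample = (
    "NPI;Entity Type Code;Provider First Name\n"
    "1679576722;1;DAVID\n"
    "1588667638;1;WILLIAM\n"
)

# Restrict the candidates so free text in the header cannot be mistaken for a separator.
dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")

# An explicit reader option next to the path, instead of letting the reader guess.
batch_kwargs = {
    "path": "/data/demo/npidata_pfile/npidata_pfile_20050523-20191001_1.csv",
    "sep": dialect.delimiter,
}
print(batch_kwargs["sep"])
```

Setting the separator explicitly avoids surprises when a file's first rows are not representative of the rest.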

## Author Expectations

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#create-expectations)

See available expectations in the [expectation glossary](https://docs.greatexpectations.io/en/latest/glossary.html?utm_source=notebook&utm_medium=create_expectations)


In [None]:
#example:

column_name = batch.get_table_columns()[0]
batch.expect_column_values_to_not_be_null(column_name)
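
To make the example concrete, here is a minimal plain-Python sketch of what a "values are not null" check amounts to. `check_values_not_null` and `is_missing` are hypothetical stand-ins, and the keys in the returned dict are illustrative rather than the exact result format Great Expectations produces.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, as a dataframe column typically would."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_values_not_null(values):
    """Succeed only if no value in the column is missing (illustrative only)."""
    unexpected_count = sum(1 for value in values if is_missing(value))
    return {
        "success": unexpected_count == 0,
        "result": {
            "element_count": len(values),
            "unexpected_count": unexpected_count,
        },
    }


complete = check_values_not_null([1679576722, 1588667638, 1497758544])
with_gap = check_values_not_null(["WIEBE", None, "SMITSON", float("nan")])
print(complete["success"], with_gap["success"])
```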


In [12]:
# add more expectations here
batch.expect_column_to_exist('Provider Other Organization Name Type Code')

{'success': True}

In [14]:
# add more expectations here
batch.expect_column_distinct_values_to_be_in_set('Provider Other Organization Name Type Code', value_set=[3.0, 4.0, 5.0])

{'success': True,
 'result': {'observed_value': [3.0, 4.0, 5.0],
  'element_count': 10000,
  'missing_count': 96,
  'missing_percent': 0.0096}}
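
The result above can be read as: ignore missing values, collect the distinct values that remain, and check that they all fall inside `value_set`. The sketch below reproduces those numbers in plain Python. `check_distinct_values_in_set` is a hypothetical stand-in, and the column is reconstructed from the counts implied by this notebook's output (10000 rows, 96 missing, and the partition weights shown later), so treat it as an assumption.

```python
def check_distinct_values_in_set(values, value_set):
    """Check that every distinct non-missing value is in value_set (illustrative only)."""
    non_missing = [value for value in values if value is not None]
    observed = sorted(set(non_missing))
    missing_count = len(values) - len(non_missing)
    return {
        "success": set(observed) <= set(value_set),
        "result": {
            "observed_value": observed,
            "element_count": len(values),
            "missing_count": missing_count,
            # A fraction between 0 and 1, matching the output above.
            "missing_percent": missing_count / len(values) if values else 0.0,
        },
    }


# Reconstructed column: 2 x 3.0, 1 x 4.0, 9901 x 5.0, 96 missing.
column = [3.0] * 2 + [4.0] + [5.0] * 9901 + [None] * 96

passing = check_distinct_values_in_set(column, value_set=[3.0, 4.0, 5.0])
failing = check_distinct_values_in_set(column, value_set=[3.0, 4.0])
print(passing["success"], failing["success"])
```

Note that `missing_percent` here is a fraction (96 / 10000), not a value on a 0 to 100 scale.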

In [16]:
# build a categorical partition object (distinct values and their weights) from the column
partition_object = ge.dataset.util.build_categorical_partition_object(batch, column='Provider Other Organization Name Type Code')

In [17]:
batch.expect_column_kl_divergence_to_be_less_than(
    'Provider Other Organization Name Type Code', 
    partition_object=partition_object, 
    threshold=0.2
)

{'success': True,
 'result': {'observed_value': 0.0,
  'element_count': 10000,
  'missing_count': 96,
  'missing_percent': 0.0096}}
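
An observed value of 0.0 is what you would expect when a batch is compared against a partition built from that same batch. The sketch below works through the arithmetic in plain Python, assuming the divergence is computed from the observed distribution to the partition's weights as KL(P || Q) = sum of p_i * log(p_i / q_i). `build_categorical_partition` and `kl_divergence` are hypothetical stand-ins, and the column is reconstructed from the counts implied by this notebook's output.

```python
import math
from collections import Counter


def build_categorical_partition(values):
    """Return distinct non-missing values with their relative frequencies (illustrative only)."""
    counts = Counter(value for value in values if value is not None)
    total = sum(counts.values())
    keys = sorted(counts)
    return {"values": keys, "weights": [counts[key] / total for key in keys]}


def kl_divergence(observed_weights, expected_weights):
    """KL(P || Q) over aligned categories; terms with p == 0 contribute nothing."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(observed_weights, expected_weights)
        if p > 0
    )


# Reconstructed column: 2 x 3.0, 1 x 4.0, 9901 x 5.0, 96 missing.
column = [3.0] * 2 + [4.0] + [5.0] * 9901 + [None] * 96
partition = build_categorical_partition(column)

# Same data against its own partition: no divergence.
same_batch = kl_divergence(partition["weights"], partition["weights"])

# A made-up later batch whose distribution has shifted.
shifted = build_categorical_partition([3.0] * 500 + [4.0] * 500 + [5.0] * 9000)
shifted_batch = kl_divergence(shifted["weights"], partition["weights"])

threshold = 0.2
print(same_batch < threshold, shifted_batch < threshold)
```

The expectation becomes informative only when later batches are validated against the saved partition, which is when a shift like the made-up one above would exceed the threshold.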

### Review the expectations

Expectations that evaluated to true on this batch of data were added to the suite. To view all the expectations you have added so far for this data asset, run:

In [18]:
batch.get_expectation_suite()

2019-10-08T18:50:27-0400 - INFO - 	3 expectation(s) included in expectation_suite. result_format settings filtered.


{'data_asset_name': 'demo__dir/default/npidata_pfile',
 'meta': {'great_expectations.__version__': '0.8.0a3+15.g8cd46e9c'},
 'expectations': [{'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'Provider Other Organization Name Type Code'}},
  {'expectation_type': 'expect_column_distinct_values_to_be_in_set',
   'kwargs': {'column': 'Provider Other Organization Name Type Code',
    'value_set': [3.0, 4.0, 5.0]}},
  {'expectation_type': 'expect_column_kl_divergence_to_be_less_than',
   'kwargs': {'column': 'Provider Other Organization Name Type Code',
    'partition_object': {'values': [3.0, 4.0, 5.0],
     'weights': [0.00020193861066235866,
      0.00010096930533117933,
      0.9996970920840065]},
    'threshold': 0.2}}],
 'data_asset_type': 'Dataset'}

    
    
If you decide not to save some of the expectations you created, use the [`remove_expectation` method](https://docs.greatexpectations.io/en/latest/module_docs/data_asset_module.html?highlight=remove_expectation&utm_source=notebook&utm_medium=create_expectations#great_expectations.data_asset.data_asset.DataAsset.remove_expectation).


The following call saves the expectation suite as a JSON file in the `great_expectations/expectations` directory of your project:
    

In [27]:
batch.save_expectation_suite()

2019-10-08T19:05:51-0400 - INFO - 	0 expectation(s) included in expectation_suite.
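
Conceptually, saving a suite means serializing a dictionary like the one printed earlier to a JSON file. The following standard-library sketch illustrates that round trip in a temporary directory. `save_suite_as_json` is a hypothetical helper, and the `<project_dir>/expectations/<suite_name>.json` layout is an assumption for illustration; the actual file layout used by Great Expectations may differ.

```python
import json
import os
import tempfile


def save_suite_as_json(suite, project_dir, suite_name):
    """Write a suite dict to <project_dir>/expectations/<suite_name>.json.

    Hypothetical helper with an assumed file layout.
    """
    target_dir = os.path.join(project_dir, "expectations")
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, suite_name + ".json")
    with open(target_path, "w") as handle:
        json.dump(suite, handle, indent=2)
    return target_path


suite = {
    "data_asset_name": "demo__dir/default/npidata_pfile",
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "Provider Other Organization Name Type Code"}},
    ],
}

with tempfile.TemporaryDirectory() as project_dir:
    saved_path = save_suite_as_json(suite, project_dir, "warning")
    with open(saved_path) as handle:
        reloaded = json.load(handle)

print(os.path.basename(saved_path))
```

Because the suite is plain JSON, it can be reviewed and versioned alongside the rest of the project.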


### You created and saved expectations for at least one of the data assets.

### Next, we will show you how to set up validation: the process of checking whether new files of this type conform to your expectations before they are processed by your pipeline's code.

### Go to [integrate_validation_into_pipeline.ipynb](integrate_validation_into_pipeline.ipynb) to proceed.


