In [1]:
import os
import json
import sys
import great_expectations as ge
import great_expectations.jupyter_ux
from datetime import datetime
import math
import pandas as pd
os.chdir('/Users/mparayil/Desktop/Development/dsa-data-workflows/grtexp_agero_dsa/great_expectations')

2020-03-17T11:37:20-0400 - INFO - Great Expectations logging enabled at INFO level by JupyterUX module.


In [2]:
import ge_prod.ge_data_access as gda
import ge_prod.queries as queries

In [3]:
rule_query = queries.queries.get('service_specification').get('create_expectations_dec2019')

In [4]:
rule_query

"SELECT * FROM service_specification where create_time_utc >= to_date('2019-12-01') and create_time_utc <= to_date('2019-12-31');"

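One thing worth checking in the filter above: if `create_time_utc` is a timestamp (an assumption, although the `.dt.date` grouping on `CREATE_TIME_UTC` later in this notebook suggests it is), an upper bound of `to_date('2019-12-31')` would likely compare as midnight at the start of 31 December and drop the rest of that day. The plain-Python sketch below only illustrates the boundary behavior; the event timestamp is hypothetical.

```python
from datetime import datetime

# Assumption: a date literal compares as midnight at the start of that day.
lower = datetime(2019, 12, 1)
inclusive_upper = datetime(2019, 12, 31)
exclusive_upper = datetime(2020, 1, 1)

# Hypothetical record created on the afternoon of 31 December 2019.
event = datetime(2019, 12, 31, 14, 30)

in_closed_range = lower <= event <= inclusive_upper      # mirrors the query's bounds
in_half_open_range = lower <= event < exclusive_upper    # half-open alternative

print(in_closed_range)     # False
print(in_half_open_range)  # True
```

If the whole month is intended, a half-open upper bound (`create_time_utc < to_date('2020-01-01')`) is one way to express it.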
# Author Expectations



[**Watch a short tutorial video**](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#video)

[**Read more in the tutorial**](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations)

**Reach out for help on** [**Great Expectations Slack**](https://tinyurl.com/great-expectations-slack)


### Get a DataContext object
[Read more in the tutorial](https://great-expectations.readthedocs.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#get-datacontext-object)




In [5]:
context = ge.data_context.DataContext()

2020-03-17T11:37:25-0400 - INFO - Using project config: /Users/mparayil/Desktop/Development/dsa-data-workflows/grtexp_agero_dsa/great_expectations/great_expectations.yml


### List data assets in your project

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#data-assets)


In [6]:
great_expectations.jupyter_ux.list_available_data_asset_names(context)

Inspecting your data sources. This may take a moment...


#### Pick one of the data asset names above and use it as the value of the `data_asset_name` argument below

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#get-batch)


### Specify data_asset & expectation_suite_name

In [7]:
data_asset_name = 'service_specification'
normalized_data_asset_name = context.normalize_data_asset_name(data_asset_name)
print(normalized_data_asset_name)

NormalizedDataAssetName(datasource='agero_dsa_pandas', generator='default', generator_asset='service_specification')


### Create a new empty expectation suite

In [8]:
expectation_suite_name = 'warnings_dec2019'
# context.create_expectation_suite(data_asset_name=normalized_data_asset_name, expectation_suite_name=expectation_suite_name,
#                                 overwrite_existing=True)

In [9]:
context.list_expectation_suite_keys()

[{'data_asset_name': agero_dsa_pandas/default/service_specification,
 {'data_asset_name': agero_dsa_pandas/default/customer_experience,
 {'data_asset_name': agero_dsa_pandas/default/network_outreach,
 {'data_asset_name': agero_dsa_pandas/default/network_outreach,
 {'data_asset_name': agero_dsa_pandas/default/network_outreach,
 {'data_asset_name': agero_dsa_pandas/default/customer_complaints,
 {'data_asset_name': agero_dsa_pandas/default/network_claims,
 {'data_asset_name': agero_dsa_pandas/default/network_claims,
 {'data_asset_name': agero_dsa_pandas/default/service_progress,
 {'data_asset_name': agero_dsa_pandas/default/zip_code_lookup,

### Get batch to create expectations against

In [10]:
rule_df = gda.snowflake_connector_to_df(rule_query)
# rule_df.to_pickle('temp_data/network_claims_2019Q4.pkl')

In [12]:
rule_df.to_pickle('temp_data/service_specification_dec2019.pkl')

In [11]:
rule_df.shape

(1516298, 56)

In [12]:
b_kwargs = {"dataset": rule_df}
batch = context.get_batch(normalized_data_asset_name, expectation_suite_name=expectation_suite_name,
                         batch_kwargs=b_kwargs)

In [13]:
batch.get_row_count()

1516298

In [14]:
print(rule_df.shape)

(1516298, 56)


In [15]:
[datasource['name'] for datasource in context.list_datasources() if datasource['class_name'] == 'PandasDatasource']

['agero_dsa_pandas']

In [16]:
# getting rule_df batchId & fingerprint
rule_batch_fingerprint = batch.batch_fingerprint
rule_batch_id = batch.batch_id

In [17]:
print('rule_batch_fingerprint: ', rule_batch_fingerprint, sep='\n')
print('rule_batch_id: ', rule_batch_id, sep='\n')

rule_batch_fingerprint: 
{'partition_id': '20200317T154359.443839Z', 'fingerprint': 'a22d45d9f735dbbea0f1712aeff7ef5c'}
rule_batch_id: 
{'timestamp': 1584459555.327921, 'PandasInMemoryDF': True, 'fingerprint': '9ef58f980a692de2c01fc861bf7948be'}


## Author Expectations

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#create-expectations)

See available expectations in the [expectation glossary](https://docs.greatexpectations.io/en/latest/glossary.html?utm_source=notebook&utm_medium=create_expectations)


### Dataset exploration & understanding of fields to ensure rules reflect behavior of data
- Validate that the expected columns exist in the table
- Validate the expected column count of the table
- Expect column values to belong to a given set
- Expect columns to be null or non-null X percent of the time
- Expect column values to be of certain data type(s)
- Place max and min limits on numerical columns
- Expect the average or median column value to fall within a certain range
- Expect column A to be greater/less than column B
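As a rough orientation, the checklist above can be mapped to expectation method names from the legacy `Dataset` API. The mapping below is a sketch, not an exhaustive list; the dictionary name and its keys are made up for illustration, and the method names should be confirmed against the expectation glossary for the installed version.

```python
# Checklist item -> candidate expectation method names (nothing is executed here).
CHECKLIST_TO_EXPECTATIONS = {
    "columns exist": ["expect_column_to_exist"],
    "column count": ["expect_table_column_count_to_equal"],
    "values in a set": ["expect_column_values_to_be_in_set"],
    "null / non-null share": ["expect_column_values_to_not_be_null",
                              "expect_column_values_to_be_null"],
    "data types": ["expect_column_values_to_be_of_type",
                   "expect_column_values_to_be_in_type_list"],
    "min / max limits": ["expect_column_values_to_be_between",
                         "expect_column_min_to_be_between",
                         "expect_column_max_to_be_between"],
    "mean / median range": ["expect_column_mean_to_be_between",
                            "expect_column_median_to_be_between"],
    "column A vs column B": ["expect_column_pair_values_A_to_be_greater_than_B"],
}

for item, methods in CHECKLIST_TO_EXPECTATIONS.items():
    print(f"{item}: {', '.join(methods)}")
```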

### 1. Validating to see if every column exists in table

In [18]:
# add more expectations here
column_names = batch.get_table_columns()
print(column_names)

['CASE_ID', 'TASK_ID', 'PO_NUMBER', 'TASK_STATUS_CODE', 'POLICY_NUMBER', 'IS_DISPATCH', 'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'IS_SCHEDULED_DISPATCH', 'SERVICE_TYPE_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_TYPE', 'PRODUCT_TYPE', 'IS_PER_EVENT', 'DISABLEMENT_LATITUDE', 'DISABLEMENT_LONGITUDE', 'DISABLEMENT_ZIP_CODE', 'DISABLEMENT_ADDRESS_1', 'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_STATE_CODE', 'IS_DRIVER_WITH_VEHICLE', 'LOCATION_TYPE_CODE', 'VIN', 'VEHICLE_MODEL_YEAR', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'BILL_GROUP_ID', 'CLIENT_ID', 'COVERAGE_STATUS_CODE', 'COVERAGE_AMOUNT', 'COVERAGE_STATUS_DESCRIPTION', 'COVERED_AMOUNT', 'OVERAGE_PAYMENT_METHOD_CODE', 'OVERAGE_PAYMENT_AMOUNT', 'AUTHORIZATION_REASON_CODE', 'ESTIMATED_TOW_MILEAGE', 'TOW_DESTINATION_ID', 'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_ZIP_CODE', 'TOW_DESTINATION_CITY', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_AD

In [19]:
colnames = list(batch.columns)
# colnames.sort()

In [20]:
print(colnames)

['CASE_ID', 'TASK_ID', 'PO_NUMBER', 'TASK_STATUS_CODE', 'POLICY_NUMBER', 'IS_DISPATCH', 'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'IS_SCHEDULED_DISPATCH', 'SERVICE_TYPE_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_TYPE', 'PRODUCT_TYPE', 'IS_PER_EVENT', 'DISABLEMENT_LATITUDE', 'DISABLEMENT_LONGITUDE', 'DISABLEMENT_ZIP_CODE', 'DISABLEMENT_ADDRESS_1', 'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_STATE_CODE', 'IS_DRIVER_WITH_VEHICLE', 'LOCATION_TYPE_CODE', 'VIN', 'VEHICLE_MODEL_YEAR', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'BILL_GROUP_ID', 'CLIENT_ID', 'COVERAGE_STATUS_CODE', 'COVERAGE_AMOUNT', 'COVERAGE_STATUS_DESCRIPTION', 'COVERED_AMOUNT', 'OVERAGE_PAYMENT_METHOD_CODE', 'OVERAGE_PAYMENT_AMOUNT', 'AUTHORIZATION_REASON_CODE', 'ESTIMATED_TOW_MILEAGE', 'TOW_DESTINATION_ID', 'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_ZIP_CODE', 'TOW_DESTINATION_CITY', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_AD

In [21]:
master_column_names = ['CASE_ID', 'TASK_ID', 'PO_NUMBER', 'TASK_STATUS_CODE', 'POLICY_NUMBER', 'IS_DISPATCH',
                       'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'IS_SCHEDULED_DISPATCH',
                       'SERVICE_TYPE_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS',
                        'PRIMARY_EQUIPMENT_TYPE', 'PRODUCT_TYPE', 'IS_PER_EVENT', 'DISABLEMENT_LATITUDE',
                        'DISABLEMENT_LONGITUDE', 'DISABLEMENT_ZIP_CODE', 'DISABLEMENT_ADDRESS_1',
                        'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_STATE_CODE', 'IS_DRIVER_WITH_VEHICLE',
                        'LOCATION_TYPE_CODE', 'VIN', 'VEHICLE_MODEL_YEAR', 'VEHICLE_MAKE', 'VEHICLE_MODEL',
                        'BILL_GROUP_ID', 'CLIENT_ID', 'ESTIMATED_TOW_MILEAGE', 'TOW_DESTINATION_ID',
                        'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_ZIP_CODE',
                        'TOW_DESTINATION_CITY', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_ADDRESS_1',
                        'TOW_DESTINATION_ADDRESS_2', 'TOW_DESTINATION_NAME', 'EQUIPMENT_CONFIGURATION',
                        'EQUIPMENT_COUNT', 'SERVICE_TIME_EASTERN', 'SERVICE_TIME_UTC', 'SERVICE_TIME_LOCAL',
                        'CREATE_TIME_EASTERN', 'CREATE_TIME_UTC', 'CREATE_TIME_LOCAL']

In [22]:
len(master_column_names)

49

In [23]:
len(column_names)

56
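The batch reports 56 columns while the master list holds 49. A set difference makes the gap explicit. The sketch below uses abbreviated lists built from the visible part of the printed column names; `observed`, `master`, `extra`, and `missing` are illustrative names, and in the notebook the inputs would be `column_names` and `master_column_names`.

```python
# Abbreviated stand-ins for column_names (56 entries) and master_column_names (49 entries).
observed = ['CLIENT_ID', 'COVERAGE_STATUS_CODE', 'COVERAGE_AMOUNT',
            'COVERAGE_STATUS_DESCRIPTION', 'COVERED_AMOUNT',
            'OVERAGE_PAYMENT_METHOD_CODE', 'OVERAGE_PAYMENT_AMOUNT',
            'AUTHORIZATION_REASON_CODE', 'ESTIMATED_TOW_MILEAGE']
master = ['CLIENT_ID', 'ESTIMATED_TOW_MILEAGE']

master_set, observed_set = set(master), set(observed)

# Columns present in the batch but absent from the master list, in batch order.
extra = [c for c in observed if c not in master_set]
# Columns the master list expects but the batch lacks.
missing = [c for c in master if c not in observed_set]

print(len(extra), extra)
print(missing)
```

With these inputs `extra` holds the seven coverage, overage, and authorization columns, which matches the 56 versus 49 difference.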

In [27]:
# Ensure every column in the master list exists in the batch
for col in master_column_names:
    print(col + ':', batch.expect_column_to_exist(col, result_format='BASIC', catch_exceptions=True), sep='\n')

CASE_ID:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
TASK_ID:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
PO_NUMBER:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
TASK_STATUS_CODE:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
POLICY_NUMBER:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
IS_DISPATCH:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
CALL_REASON_CODE:
{'success': True, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
DISABLEMENT_REASON_CODE:
{'success': True, 'exception_info':

### 2. Validating column count in table is always the same

In [28]:
print('# of columns in {}: '.format(data_asset_name), len(column_names))

# of columns in service_specification:  49


In [29]:
print('# of columns in {}: '.format(data_asset_name), len(master_column_names), '\n')
# The master list is the source of truth for the expected column count.
print(batch.expect_table_column_count_to_equal(len(master_column_names), result_format='SUMMARY'))

# of columns in service_specification:  49 

{'success': True, 'result': {'observed_value': 49}}


### 3. Checking which columns should not have null values

In [30]:
# identifying which columns should not be null
print(column_names)

['CASE_ID', 'TASK_ID', 'PO_NUMBER', 'TASK_STATUS_CODE', 'POLICY_NUMBER', 'IS_DISPATCH', 'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'IS_SCHEDULED_DISPATCH', 'SERVICE_TYPE_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_TYPE', 'PRODUCT_TYPE', 'IS_PER_EVENT', 'DISABLEMENT_LATITUDE', 'DISABLEMENT_LONGITUDE', 'DISABLEMENT_ZIP_CODE', 'DISABLEMENT_ADDRESS_1', 'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_STATE_CODE', 'IS_DRIVER_WITH_VEHICLE', 'LOCATION_TYPE_CODE', 'VIN', 'VEHICLE_MODEL_YEAR', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'BILL_GROUP_ID', 'CLIENT_ID', 'ESTIMATED_TOW_MILEAGE', 'TOW_DESTINATION_ID', 'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_ZIP_CODE', 'TOW_DESTINATION_CITY', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_ADDRESS_1', 'TOW_DESTINATION_ADDRESS_2', 'TOW_DESTINATION_NAME', 'EQUIPMENT_CONFIGURATION', 'EQUIPMENT_COUNT', 'SERVICE_TIME_EASTERN', 'SERVICE_TIME_UTC', 'SERVICE_TIME_LOCAL', 'CR

In [24]:
rule_df.isnull().sum()

CASE_ID                              0
TASK_ID                              0
PO_NUMBER                       505447
TASK_STATUS_CODE                     0
POLICY_NUMBER                   525755
IS_DISPATCH                          0
CALL_REASON_CODE                 95593
DISABLEMENT_REASON_CODE        1025729
PROBLEM_CODE                    410807
IS_SCHEDULED_DISPATCH                0
SERVICE_TYPE_CODE                    0
PRIMARY_EQUIPMENT_CODE          444717
PRIMARY_EQUIPMENT_CLASS         444717
PRIMARY_EQUIPMENT_TYPE          444717
PRODUCT_TYPE                         0
IS_PER_EVENT                         0
DISABLEMENT_LATITUDE            458783
DISABLEMENT_LONGITUDE           458783
DISABLEMENT_ZIP_CODE            458801
DISABLEMENT_ADDRESS_1           458903
DISABLEMENT_ADDRESS_2          1470928
DISABLEMENT_CITY                458794
DISABLEMENT_STATE_CODE          458808
IS_DRIVER_WITH_VEHICLE               0
LOCATION_TYPE_CODE              460466
VIN                      

In [27]:
# Separating null & non-null columns (null counts are computed once and reused)
null_counts = batch.isnull().sum()
null_cols = list(null_counts[null_counts > 0].keys())
not_null_cols = list(null_counts[null_counts == 0].keys())

In [28]:
print('Viewing column null value counts: ', batch.isnull().sum(), sep='\n')

Viewing column null value counts: 
CASE_ID                              0
TASK_ID                              0
PO_NUMBER                       505447
TASK_STATUS_CODE                     0
POLICY_NUMBER                   525755
IS_DISPATCH                          0
CALL_REASON_CODE                 95593
DISABLEMENT_REASON_CODE        1025729
PROBLEM_CODE                    410807
IS_SCHEDULED_DISPATCH                0
SERVICE_TYPE_CODE                    0
PRIMARY_EQUIPMENT_CODE          444717
PRIMARY_EQUIPMENT_CLASS         444717
PRIMARY_EQUIPMENT_TYPE          444717
PRODUCT_TYPE                         0
IS_PER_EVENT                         0
DISABLEMENT_LATITUDE            458783
DISABLEMENT_LONGITUDE           458783
DISABLEMENT_ZIP_CODE            458801
DISABLEMENT_ADDRESS_1           458903
DISABLEMENT_ADDRESS_2          1470928
DISABLEMENT_CITY                458794
DISABLEMENT_STATE_CODE          458808
IS_DRIVER_WITH_VEHICLE               0
LOCATION_TYPE_CODE           

In [29]:
not_null_cols.sort()

In [30]:
print(not_null_cols)

['CASE_ID', 'CREATE_TIME_EASTERN', 'CREATE_TIME_UTC', 'EQUIPMENT_COUNT', 'IS_DISPATCH', 'IS_DRIVER_WITH_VEHICLE', 'IS_PER_EVENT', 'IS_SCHEDULED_DISPATCH', 'PRODUCT_TYPE', 'SERVICE_TIME_EASTERN', 'SERVICE_TIME_UTC', 'SERVICE_TYPE_CODE', 'TASK_ID', 'TASK_STATUS_CODE']


In [31]:
null_cols.sort()

In [32]:
print(null_cols)

['AUTHORIZATION_REASON_CODE', 'BILL_GROUP_ID', 'CALL_REASON_CODE', 'CLIENT_ID', 'COVERAGE_AMOUNT', 'COVERAGE_STATUS_CODE', 'COVERAGE_STATUS_DESCRIPTION', 'COVERED_AMOUNT', 'CREATE_TIME_LOCAL', 'DISABLEMENT_ADDRESS_1', 'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_LATITUDE', 'DISABLEMENT_LONGITUDE', 'DISABLEMENT_REASON_CODE', 'DISABLEMENT_STATE_CODE', 'DISABLEMENT_ZIP_CODE', 'EQUIPMENT_CONFIGURATION', 'ESTIMATED_TOW_MILEAGE', 'LOCATION_TYPE_CODE', 'OVERAGE_PAYMENT_AMOUNT', 'OVERAGE_PAYMENT_METHOD_CODE', 'POLICY_NUMBER', 'PO_NUMBER', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_TYPE', 'PROBLEM_CODE', 'SERVICE_TIME_LOCAL', 'TOW_DESTINATION_ADDRESS_1', 'TOW_DESTINATION_ADDRESS_2', 'TOW_DESTINATION_CITY', 'TOW_DESTINATION_ID', 'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_NAME', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_ZIP_CODE', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'VEHICLE_MODEL_YEAR', 'VIN']


In [43]:
# confirm that the columns expected to be non-null contain no nulls
for col in not_null_cols:
    print(col, '\n', batch.expect_column_values_to_not_be_null(col, result_format='BASIC'))

CASE_ID 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
TASK_ID 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
TASK_STATUS_CODE 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
IS_DISPATCH 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
IS_SCHEDULED_DISPATCH 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
SERVICE_TYPE_CODE 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}}
PRODUCT_TYPE 
 {'success': True, 'result': {'element_count': 1516258, 'unexpected

In [44]:
print(f"capturing the following columns to not be null: {not_null_cols}")

capturing the following columns to not be null: ['CASE_ID', 'TASK_ID', 'TASK_STATUS_CODE', 'IS_DISPATCH', 'IS_SCHEDULED_DISPATCH', 'SERVICE_TYPE_CODE', 'PRODUCT_TYPE', 'IS_PER_EVENT', 'IS_DRIVER_WITH_VEHICLE', 'EQUIPMENT_COUNT', 'SERVICE_TIME_EASTERN', 'SERVICE_TIME_UTC', 'CREATE_TIME_EASTERN', 'CREATE_TIME_UTC']


### 4. Validating columns to have null values
- **columns to check:**
    - 'PO_NUMBER', 'POLICY_NUMBER', 'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_TYPE', 'DISABLEMENT_LATITUDE', 'DISABLEMENT_LONGITUDE', 'DISABLEMENT_ZIP_CODE', 'DISABLEMENT_ADDRESS_1', 'DISABLEMENT_ADDRESS_2', 'DISABLEMENT_CITY', 'DISABLEMENT_STATE_CODE', 'LOCATION_TYPE_CODE', 'VIN', 'VEHICLE_MODEL_YEAR', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'BILL_GROUP_ID', 'CLIENT_ID', 'ESTIMATED_TOW_MILEAGE', 'TOW_DESTINATION_ID', 'TOW_DESTINATION_LATITUDE', 'TOW_DESTINATION_LONGITUDE', 'TOW_DESTINATION_ZIP_CODE', 'TOW_DESTINATION_CITY', 'TOW_DESTINATION_STATE_CODE', 'TOW_DESTINATION_ADDRESS_1', 'TOW_DESTINATION_ADDRESS_2', 'TOW_DESTINATION_NAME', 'EQUIPMENT_CONFIGURATION', 'SERVICE_TIME_LOCAL', 'CREATE_TIME_LOCAL'

In [45]:
from typing import Union
from great_expectations.dataset import PandasDataset
def get_df_not_null_weights(df: Union[pd.DataFrame, PandasDataset], groupby_col: str, not_null_col: str) -> float:
    """
    Estimates the share of rows in which the specified column is expected to be non-null.

    Parameters
    -----------
    df: pd.DataFrame or great_expectations.dataset.PandasDataset
        dataframe object to look at
    groupby_col: str
        datetime column whose calendar date is used to group the dataframe
    not_null_col: str
        column whose daily non-null fraction is used to derive a safe threshold

    Returns
    ------------
    float
        10th percentile (interpolation='lower') of the daily non-null fractions, rounded to 4 decimals
    """

    df_group = df.groupby(df[groupby_col].dt.date)
    df_group = df_group.apply(lambda x: x[not_null_col].notnull().mean())

    adjusted_weight = df_group.quantile(0.1, interpolation='lower')
    return adjusted_weight.round(4)

In [33]:
from typing import Union
from great_expectations.dataset import PandasDataset
def get_df_not_null_weights(df: Union[pd.DataFrame, PandasDataset], groupby_col: str, not_null_col: str) -> float:
    """
    Estimates the share of rows in which the specified column is expected to be non-null.

    Parameters
    -----------
    df: pd.DataFrame or great_expectations.dataset.PandasDataset
        dataframe object to look at
    groupby_col: str
        datetime column whose calendar date is used to group the dataframe
    not_null_col: str
        column whose daily non-null fraction is used to derive a safe threshold

    Returns
    ------------
    float
        10th percentile (interpolation='midpoint') of the daily non-null fractions, lowered by 0.009
        unless that would push it below 0.005, rounded to 4 decimals
    """

    # fraction of non-null values per calendar day
    df_group = df.groupby(df[groupby_col].dt.date)
    df_group = df_group.apply(lambda x: x[not_null_col].notnull().mean())

    base_weight = df_group.quantile(0.1, interpolation='midpoint')
    adjusted_weight = base_weight - 0.009
    # keep the unadjusted weight for columns that are almost always null
    if adjusted_weight < 0.005:
        final_weight = base_weight.round(4)
    else:
        final_weight = adjusted_weight.round(4)
    return float(final_weight)
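To make the arithmetic of this threshold easier to follow, here is a dependency-free sketch of the same idea: the daily non-null fraction, a low quantile with midpoint interpolation, then a small safety margin. The helper names and sample rows are hypothetical, and `midpoint_quantile` only approximates what pandas does for `interpolation='midpoint'`.

```python
import math
from collections import defaultdict
from datetime import date

def daily_not_null_fractions(rows):
    """rows: iterable of (day, value) pairs -> {day: fraction of non-None values}."""
    totals, present = defaultdict(int), defaultdict(int)
    for day, value in rows:
        totals[day] += 1
        if value is not None:
            present[day] += 1
    return {day: present[day] / totals[day] for day in totals}

def midpoint_quantile(values, q):
    """Average of the two order statistics that surround position q * (n - 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    return (ordered[math.floor(pos)] + ordered[math.ceil(pos)]) / 2

def mostly_threshold(rows, q=0.1, margin=0.009, floor=0.005):
    base = midpoint_quantile(daily_not_null_fractions(rows).values(), q)
    adjusted = base - margin
    # Keep the unadjusted value when the margin would push it below the floor.
    return round(base if adjusted < floor else adjusted, 4)

example_rows = (
    [(date(2019, 12, 1), v) for v in ("a", "b", None, None)]    # 0.50 non-null
    + [(date(2019, 12, 2), v) for v in ("a", "b", "c", None)]   # 0.75 non-null
    + [(date(2019, 12, 3), v) for v in ("a", "b", "c", "d")]    # 1.00 non-null
)
print(mostly_threshold(example_rows))  # 0.616
```

Taking a low quantile of the daily rates, rather than the overall rate, keeps the threshold from failing on days that are only slightly worse than average.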

In [34]:
for col in null_cols:
    w = get_df_not_null_weights(rule_df, 'CREATE_TIME_UTC', col)
    print(col, w, sep='\n')

AUTHORIZATION_REASON_CODE
0.006
BILL_GROUP_ID
0.74
CALL_REASON_CODE
0.9247
CLIENT_ID
0.7334
COVERAGE_AMOUNT
0.715
COVERAGE_STATUS_CODE
0.715
COVERAGE_STATUS_DESCRIPTION
0.7123
COVERED_AMOUNT
0.6641
CREATE_TIME_LOCAL
0.6664
DISABLEMENT_ADDRESS_1
0.6685
DISABLEMENT_ADDRESS_2
0.0163
DISABLEMENT_CITY
0.6686
DISABLEMENT_LATITUDE
0.6686
DISABLEMENT_LONGITUDE
0.6686
DISABLEMENT_REASON_CODE
0.2806
DISABLEMENT_STATE_CODE
0.6686
DISABLEMENT_ZIP_CODE
0.6686
EQUIPMENT_CONFIGURATION
0.6755
ESTIMATED_TOW_MILEAGE
0.302
LOCATION_TYPE_CODE
0.6665
OVERAGE_PAYMENT_AMOUNT
0.0215
OVERAGE_PAYMENT_METHOD_CODE
0.053
POLICY_NUMBER
0.6137
PO_NUMBER
0.6372
PRIMARY_EQUIPMENT_CLASS
0.6755
PRIMARY_EQUIPMENT_CODE
0.6755
PRIMARY_EQUIPMENT_TYPE
0.6755
PROBLEM_CODE
0.7057
SERVICE_TIME_LOCAL
0.6664
TOW_DESTINATION_ADDRESS_1
0.294
TOW_DESTINATION_ADDRESS_2
0.0065
TOW_DESTINATION_CITY
0.302
TOW_DESTINATION_ID
0.302
TOW_DESTINATION_LATITUDE
0.302
TOW_DESTINATION_LONGITUDE
0.302
TOW_DESTINATION_NAME
0.2579
TOW_DESTINATION_S

In [49]:
# fraction of non-null values for every column that contains at least one null
# (despite the variable names, these are not-null fractions, later used as `mostly` weights)
original_null_percents = dict(1 - (batch.isnull().sum() / len(batch))[batch.isnull().sum() / len(batch) > 0])
    
print('original_null_percents', original_null_percents, sep='\n')

# lowering each weight by 0.01 (one percentage point) to leave a safety margin
adjusted_null_percents = {}
for key, weight in original_null_percents.items():
    adjusted_null_percents[key] = (weight - 0.01).round(3)
    
print('---------------------------------------')
print('adjusted_null_weights:')
print(adjusted_null_percents)
# original_null_percents = {k:round(v, 3) for k, v in original_null_percents.items()}
# adjusted_null_percents = {k:round(v, 3) for k, v in not_null_weights.items()}

original_null_percents
{'PO_NUMBER': 0.6615721071216112, 'POLICY_NUMBER': 0.6532324973718193, 'CALL_REASON_CODE': 0.936957298823815, 'DISABLEMENT_REASON_CODE': 0.3235089279001331, 'PROBLEM_CODE': 0.7290091791766309, 'PRIMARY_EQUIPMENT_CODE': 0.7066515065378056, 'PRIMARY_EQUIPMENT_CLASS': 0.7066515065378056, 'PRIMARY_EQUIPMENT_TYPE': 0.7066515065378056, 'DISABLEMENT_LATITUDE': 0.6973727426335097, 'DISABLEMENT_LONGITUDE': 0.6973727426335097, 'DISABLEMENT_ZIP_CODE': 0.6973608713029049, 'DISABLEMENT_ADDRESS_1': 0.6972936004294783, 'DISABLEMENT_ADDRESS_2': 0.0299197102340103, 'DISABLEMENT_CITY': 0.6973654879314735, 'DISABLEMENT_STATE_CODE': 0.6973562546743364, 'LOCATION_TYPE_CODE': 0.6962621137036045, 'VIN': 0.6483362330157533, 'VEHICLE_MODEL_YEAR': 0.7495874712614872, 'VEHICLE_MAKE': 0.7495004148370528, 'VEHICLE_MODEL': 0.7494905220615489, 'BILL_GROUP_ID': 0.7584111674926035, 'CLIENT_ID': 0.7554367396577627, 'ESTIMATED_TOW_MILEAGE': 0.374998186324491, 'TOW_DESTINATION_ID': 0.37506150008771

In [125]:
default_not_null_weights = {'COUNTY': 0.919, 'LATITUDE': 0.952, 'LONGITUDE': 0.952, 'TIMEZONE': 0.945}

In [51]:
for col, weight in adjusted_null_percents.items():
    print(col, batch.expect_column_values_to_not_be_null(col, mostly=weight, include_config=True,
                                                           catch_exceptions=True,
                                                           result_format='SUMMARY'), sep='\n')

PO_NUMBER
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 513144, 'unexpected_percent': 33.842789287838876, 'partial_unexpected_list': []}, 'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'PO_NUMBER', 'mostly': 0.652, 'result_format': 'SUMMARY'}}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
POLICY_NUMBER
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 525789, 'unexpected_percent': 34.67675026281807, 'partial_unexpected_list': []}, 'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'POLICY_NUMBER', 'mostly': 0.643, 'result_format': 'SUMMARY'}}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
CALL_REASON_CODE
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 95589, 'unexpected_percent': 6.3042

VEHICLE_MODEL
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 379837, 'unexpected_percent': 25.05094779384511, 'partial_unexpected_list': []}, 'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'VEHICLE_MODEL', 'mostly': 0.739, 'result_format': 'SUMMARY'}}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
BILL_GROUP_ID
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 366311, 'unexpected_percent': 24.15888325073965, 'partial_unexpected_list': []}, 'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'BILL_GROUP_ID', 'mostly': 0.748, 'result_format': 'SUMMARY'}}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
CLIENT_ID
{'success': True, 'result': {'element_count': 1516258, 'unexpected_count': 370821, 'unexpected_percent': 24.45
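The results above can be read as follows: with `mostly=w`, the expectation is reported as successful when at least a fraction `w` of the values are non-null. The sketch below re-derives the `PO_NUMBER` outcome from the figures printed in this notebook; `passes_mostly` is a hypothetical helper written for illustration, not part of Great Expectations.

```python
def passes_mostly(element_count: int, unexpected_count: int, mostly: float) -> bool:
    """True when the share of expected (here: non-null) values reaches `mostly`."""
    expected_fraction = 1 - unexpected_count / element_count
    return expected_fraction >= mostly

# Figures copied from the PO_NUMBER output above.
not_null_fraction = 0.6615721071216112
mostly = round(not_null_fraction - 0.01, 3)

print(mostly)                                  # 0.652
print(passes_mostly(1516258, 513144, mostly))  # True
print(passes_mostly(1516258, 513144, 0.70))    # False
```

Because the weights were derived from this same batch and then lowered by 0.01, every column passes here by construction; the thresholds only become informative on later batches.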

### 5. Expecting column values to be in a set
- COMPLAINT_CATEGORY
- COMPLAINT_REASON
- COMPLAINT_REASON_DETAILS
- COMPLAINT_ORIGIN
- CASE_RESOLUTION
- COMPLAINT_TYPE

In [61]:
master_column_names


['CASE_ID',
 'TASK_ID',
 'PO_NUMBER',
 'TASK_STATUS_CODE',
 'POLICY_NUMBER',
 'IS_DISPATCH',
 'CALL_REASON_CODE',
 'DISABLEMENT_REASON_CODE',
 'PROBLEM_CODE',
 'IS_SCHEDULED_DISPATCH',
 'SERVICE_TYPE_CODE',
 'PRIMARY_EQUIPMENT_CODE',
 'PRIMARY_EQUIPMENT_CLASS',
 'PRIMARY_EQUIPMENT_TYPE',
 'PRODUCT_TYPE',
 'IS_PER_EVENT',
 'DISABLEMENT_LATITUDE',
 'DISABLEMENT_LONGITUDE',
 'DISABLEMENT_ZIP_CODE',
 'DISABLEMENT_ADDRESS_1',
 'DISABLEMENT_ADDRESS_2',
 'DISABLEMENT_CITY',
 'DISABLEMENT_STATE_CODE',
 'IS_DRIVER_WITH_VEHICLE',
 'LOCATION_TYPE_CODE',
 'VIN',
 'VEHICLE_MODEL_YEAR',
 'VEHICLE_MAKE',
 'VEHICLE_MODEL',
 'BILL_GROUP_ID',
 'CLIENT_ID',
 'ESTIMATED_TOW_MILEAGE',
 'TOW_DESTINATION_ID',
 'TOW_DESTINATION_LATITUDE',
 'TOW_DESTINATION_LONGITUDE',
 'TOW_DESTINATION_ZIP_CODE',
 'TOW_DESTINATION_CITY',
 'TOW_DESTINATION_STATE_CODE',
 'TOW_DESTINATION_ADDRESS_1',
 'TOW_DESTINATION_ADDRESS_2',
 'TOW_DESTINATION_NAME',
 'EQUIPMENT_CONFIGURATION',
 'EQUIPMENT_COUNT',
 'SERVICE_TIME_EASTERN',
 '

In [74]:
def get_categorical_columns_values(df: Union[pd.DataFrame, PandasDataset], cols: list, table_name: str) -> dict:
    """
    Returns the distinct values stored in `table_name` for every column in `cols` that looks categorical.
    """
    # mean percentage share per distinct value, i.e. 100 / (number of distinct non-null values)
    c_weights = {}
    for col in cols:
        unique_weights = df[col].value_counts(normalize=True) * 100
        c_weights[col] = unique_weights.values.mean().round(5)

    # identifier-like and count columns are never treated as categorical
    excluded_cols = ['TASK_ID', 'task_id', 'CLIENT_ID', 'client_id', 'equipment_count',
                     'EQUIPMENT_COUNT', 'BILL_GROUP_ID', 'bill_group_id']
    cat_weight_dict = {c: w for (c, w) in c_weights.items()
                       if w > 0.9 and df[c].dtypes != bool and c not in excluded_cols}

    # one SELECT DISTINCT statement per categorical column, sent through a single execute_string call
    execute_strings = ' '.join(f"SELECT DISTINCT {c_name} FROM {table_name};" for c_name in cat_weight_dict.keys())
    ctx = gda.get_snowflake_connector()

    cursor_list = ctx.execute_string(execute_strings, remove_comments=True, return_cursors=True)
    category_col_values = {}
    for cur in cursor_list:
        col_names = ','.join([col[0] for col in cur.description])
        # the truthiness test drops NULLs, but also falsy values such as 0 and ''
        cat_values = [x[0] for x in cur.fetchall() if x[0]]
        category_col_values[col_names] = cat_values
    return category_col_values
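The `w > 0.9` cutoff in this function is easier to interpret once you notice that the percentage shares of a column always sum to 100, so their mean equals 100 divided by the number of distinct values. A cutoff of 0.9 therefore keeps columns with roughly 111 or fewer distinct values. The sketch below reproduces that calculation without pandas; the helper and sample data are hypothetical.

```python
from collections import Counter

def mean_percent_share(values):
    """Mean percentage share per distinct non-null value."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    shares = [100 * n / total for n in counts.values()]
    return sum(shares) / len(shares)

status_codes = ['DR', 'CDN', 'DR', 'TTC', 'DR', 'CDN', None, 'SPI']  # 4 distinct values
case_ids = list(range(200))                                          # 200 distinct values

for name, column in [('status_codes', status_codes), ('case_ids', case_ids)]:
    share = round(mean_percent_share(column), 5)
    print(name, share, 'categorical' if share > 0.9 else 'not categorical')
```

Here `status_codes` averages 25.0 percent per value and is kept, while `case_ids` averages 0.5 percent and is skipped.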

In [78]:
dtest = get_categorical_columns_values(rule_df, master_column_names, 'service_specification')

In [79]:
dtest.keys()

dict_keys(['TASK_STATUS_CODE', 'CALL_REASON_CODE', 'DISABLEMENT_REASON_CODE', 'PROBLEM_CODE', 'SERVICE_TYPE_CODE', 'PRIMARY_EQUIPMENT_CODE', 'PRIMARY_EQUIPMENT_CLASS', 'PRIMARY_EQUIPMENT_TYPE', 'PRODUCT_TYPE', 'DISABLEMENT_STATE_CODE', 'LOCATION_TYPE_CODE', 'VEHICLE_MODEL_YEAR', 'TOW_DESTINATION_STATE_CODE'])

In [116]:
c_weights = {}
for col in master_column_names:
    # mean relative frequency per distinct value; .mean() already returns a scalar
    unique_weights = rule_df[col].value_counts(normalize=True).mean()
    c_weights[col] = unique_weights

In [113]:
unique_weights

1.2835767636665653e-06

In [105]:
c_weights = {}
for col in master_column_names:
    # number of distinct non-null values; .count() already returns a plain integer
    unique_weights = rule_df[col].value_counts().count()
    c_weights[col] = unique_weights

# cat_weight_dict = {c: w for (c, w) in c_weights.items() if w > 0.9 if df[c].dtypes != bool
#                    if c not in ['TASK_ID', 'task_id'] if c not in ['CLIENT_ID', 'client_id']}

In [101]:
c_weights

{'CASE_ID': 1.12957,
 'TASK_ID': 84236.55556,
 'PO_NUMBER': 1.00001,
 'TASK_STATUS_CODE': 63177.41667,
 'POLICY_NUMBER': 1.40018,
 'IS_DISPATCH': 758129.0,
 'CALL_REASON_CODE': 34650.46341,
 'DISABLEMENT_REASON_CODE': 13257.37838,
 'PROBLEM_CODE': 55268.3,
 'IS_SCHEDULED_DISPATCH': 758129.0,
 'SERVICE_TYPE_CODE': 79803.05263,
 'PRIMARY_EQUIPMENT_CODE': 25511.09524,
 'PRIMARY_EQUIPMENT_CLASS': 357155.33333,
 'PRIMARY_EQUIPMENT_TYPE': 357155.33333,
 'PRODUCT_TYPE': 758129.0,
 'IS_PER_EVENT': 758129.0,
 'DISABLEMENT_LATITUDE': 1.31685,
 'DISABLEMENT_LONGITUDE': 1.34087,
 'DISABLEMENT_ZIP_CODE': 41.03139,
 'DISABLEMENT_ADDRESS_1': 1.44582,
 'DISABLEMENT_ADDRESS_2': 3.76325,
 'DISABLEMENT_CITY': 43.35859,
 'DISABLEMENT_STATE_CODE': 16020.78788,
 'IS_DRIVER_WITH_VEHICLE': 758129.0,
 'LOCATION_TYPE_CODE': 87976.08333,
 'VIN': 1.39199,
 'VEHICLE_MODEL_YEAR': 10332.43636,
 'VEHICLE_MAKE': 1732.37195,
 'VEHICLE_MODEL': 374.80904,
 'BILL_GROUP_ID': 2731.46556,
 'CLIENT_ID': 8612.30827,
 'ESTIMATE

In [93]:
{k:[min(v), max(v)] for k, v in dtest.items() if k == 'VEHICLE_MODEL_YEAR'}

{'VEHICLE_MODEL_YEAR': [1900.0, 2032.0]}

In [119]:
batch.expect_column_values_to_be_json_parseable('EQUIPMENT_CONFIGURATION', result_format='SUMMARY',
                                        catch_exceptions=True, mostly=0.99)

{'success': True,
 'result': {'element_count': 1516258,
  'missing_count': 444775,
  'missing_percent': 29.333728164995666,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': [],
  'partial_unexpected_index_list': [],
  'partial_unexpected_counts': []},
 'exception_info': {'raised_exception': False,
  'exception_message': None,
  'exception_traceback': None}}
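For intuition, the check above can be approximated in plain Python: every non-missing value is passed to `json.loads`, and the share that parses is compared with `mostly`. This is a sketch of the idea rather than the library's implementation, and the sample values are invented.

```python
import json

def is_json_parseable(value) -> bool:
    """True when json.loads accepts the value."""
    try:
        json.loads(value)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical column values; None stands for a missing entry.
samples = ['{"axles": 2}', '[]', 'flatbed', None]

non_missing = [s for s in samples if s is not None]
parseable_fraction = sum(is_json_parseable(s) for s in non_missing) / len(non_missing)

print(parseable_fraction >= 0.99)  # False
```

Missing values are left out of the denominator in this sketch, which is why a column that is about 29% missing can still pass with `mostly=0.99`.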

In [59]:
c_weights = {}
for col in master_column_names:
    unique_weights = rule_df[col].value_counts(normalize=True) * 100
    c_weights[col] = unique_weights.values.mean().round(5)

In [123]:
execute_strings = ' '.join(f"SELECT DISTINCT {c_name} FROM service_specification;" for c_name in master_column_names)
ctx = gda.get_snowflake_connector()

cursor_list = ctx.execute_string(execute_strings, remove_comments=True, return_cursors=True)
category_col_values = {}
for cur in cursor_list:
    col_names = ','.join([col[0] for col in cur.description])
    cat_values = [x[0] for x in cur.fetchall() if x[0]]
    category_col_values[col_names] = cat_values

KeyboardInterrupt: 

In [None]:
category_col_values

In [None]:
{k:len(v) for k, v in category_col_values.items()}

In [248]:
execute_strings = ' '.join(f"SELECT DISTINCT {col} FROM customer_complaints;" for col in cat_dict)

In [249]:
execute_strings

'SELECT DISTINCT CASE_RESOLUTION FROM customer_complaints; SELECT DISTINCT COMPLAINT_CATEGORY FROM customer_complaints; SELECT DISTINCT COMPLAINT_ORIGIN FROM customer_complaints; SELECT DISTINCT COMPLAINT_REASON FROM customer_complaints; SELECT DISTINCT COMPLAINT_REASON_DETAILS FROM customer_complaints; SELECT DISTINCT COMPLAINT_TYPE FROM customer_complaints;'

In [120]:
for v in category_col_values.values():
    print(len(v))

NameError: name 'category_col_values' is not defined

In [92]:
for col, val_set in dtest.items():
    print(col, '\n', batch.expect_column_values_to_be_in_set(col, val_set, result_format='BASIC', 
                                        include_config=True, catch_exceptions=True), '\n')

TASK_STATUS_CODE 
 {'success': True, 'result': {'element_count': 1516258, 'missing_count': 0, 'missing_percent': 0.0, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'unexpected_percent_nonmissing': 0.0, 'partial_unexpected_list': []}, 'expectation_config': {'expectation_type': 'expect_column_values_to_be_in_set', 'kwargs': {'column': 'TASK_STATUS_CODE', 'value_set': ['DR', 'CDN', 'TTC', 'SPCN', 'DEC', 'ICC', 'NCC', 'FCM', 'PSI', 'RPD', 'SCM', 'ADV', 'SPI', 'PSAP', 'IN', 'DS', 'TCR', 'DA', 'FRI', 'ACMT', 'AGCN', 'IS', 'DCM', 'ADD', 'HLD', 'CCAN', 'RDN', 'SU', 'ICI'], 'result_format': 'BASIC'}}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}} 

CALL_REASON_CODE 
 {'success': True, 'result': {'element_count': 1516258, 'missing_count': 95589, 'missing_percent': 6.304270117618506, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'unexpected_percent_nonmissing': 0.0, 'partial_unexpected_list': []}, 'expectation_config': {'expectation

VEHICLE_MODEL_YEAR 
 {'success': False, 'result': {'element_count': 1516258, 'missing_count': 379690, 'missing_percent': 25.041252873851285, 'unexpected_count': 3844, 'unexpected_percent': 0.25351886024673903, 'unexpected_percent_nonmissing': 0.3382111761020898, 'partial_unexpected_list': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, 'expectation_config': {'expectation_type': 'expect_column_values_to_be_in_set', 'kwargs': {'column': 'VEHICLE_MODEL_YEAR', 'value_set': [2012.0, 1963.0, 1983.0, 1954.0, 1973.0, 1965.0, 1959.0, 1924.0, 1916.0, 1913.0, 1914.0, 1949.0, 1918.0, 1995.0, 1947.0, 1993.0, 1988.0, 1964.0, 1905.0, 1955.0, 1930.0, 1937.0, 1926.0, 2015.0, 2006.0, 1978.0, 1982.0, 2019.0, 1962.0, 1900.0, 1967.0, 1958.0, 1936.0, 1901.0, 1934.0, 2022.0, 1972.0, 1943.0, 1907.0, 1903.0, 2014.0, 2017.0, 2001.0, 1987.0, 1970.0, 1952.0, 1932.0, 1992.0, 1979.0, 1971.0, 1957.0, 1950.0, 1929.0, 1942.0, 1941.0, 1917.0, 1985.0, 1996.0, 2008.0
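The `VEHICLE_MODEL_YEAR` check fails because of `0.0` values that are not in the value set. For a numeric column like this, a range check with a `mostly` tolerance may be easier to maintain than an enumerated set; Great Expectations provides `expect_column_values_to_be_between` for that purpose. The stdlib sketch below only illustrates the logic. The helper name and the sample data are made up, and the bounds reuse the 1900 to 2032 range observed earlier.

```python
def between_summary(values, min_value, max_value, mostly=1.0):
    """Share of non-null values inside [min_value, max_value], compared to mostly."""
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if not (min_value <= v <= max_value)]
    pass_rate = 1 - len(unexpected) / len(non_missing) if non_missing else 1.0
    return {
        "success": pass_rate >= mostly,
        "unexpected_count": len(unexpected),
        "pass_rate": pass_rate,
    }

# Made-up sample: one placeholder year (0.0) and one null.
years = [2012.0, 1998.0, 0.0, None, 2019.0, 2032.0]
strict = between_summary(years, 1900, 2032)
lenient = between_summary(years, 1900, 2032, mostly=0.75)
print(strict)
print(lenient)
```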

In [124]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

### 8. Determine if columns are unique per row
- CASE_ID, TASK_ID

In [132]:
equip_config_json_schema = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "http://example.com/example.json",
    "type": ["array", "null"],
    "title": "The Root Schema",
    "description": "The root schema comprises the entire JSON document.",
    "items": {
        "$id": "#/items",
        "type": ["object", "null"],
        "title": "The Items Schema",
        "description": "An explanation about the purpose of this instance.",
        "default": {},
        "examples": [
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_class": "M",
                "equipment_code": "MDW",
                "equipment_type": "TOW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_class": "M",
                "equipment_code": "MDW",
                "equipment_type": "TOW"
            },
            {
                "equipment_class": "M",
                "equipment_code": "MDW",
                "equipment_type": "TOW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_class": "M",
                "equipment_code": "MDW",
                "equipment_type": "TOW"
            },
            {
                "equipment_class": "M",
                "equipment_code": "MDW",
                "equipment_type": "TOW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            },
            {
                "equipment_type": "TOW",
                "equipment_class": "M",
                "equipment_code": "MDW"
            }
        ],
        "required": [
            "equipment_class",
            "equipment_code",
            "equipment_type"
        ],
        "properties": {
            "equipment_class": {
                "$id": "#/items/properties/equipment_class",
                "type": ["string", "null"],
                "title": "The Equipment_class Schema",
                "description": "An explanation about the purpose of this instance.",
                "default": "",
                "examples": [
                    "M"
                ]
            },
            "equipment_code": {
                "$id": "#/items/properties/equipment_code",
                "type": ["string", "null"],
                "title": "The Equipment_code Schema",
                "description": "An explanation about the purpose of this instance.",
                "default": "",
                "examples": [
                    "MDW"
                ]
            },
            "equipment_type": {
                "$id": "#/items/properties/equipment_type",
                "type": ["string", "null"],
                "title": "The Equipment_type Schema",
                "description": "An explanation about the purpose of this instance.",
                "default": "",
                "examples": [
                    "TOW"
                ]
            }
        }
    }
}

In [133]:
print('EQUIPMENT_CONFIGURATION', 
      batch.expect_column_values_to_match_json_schema("EQUIPMENT_CONFIGURATION", equip_config_json_schema),
      sep='\n')

KeyboardInterrupt: 
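The schema check above was interrupted, presumably because validating roughly 1.5 million rows one at a time is slow. Since the verdict for a given string never changes, one option is to validate each distinct value once and weight the result by its frequency. The sketch below uses a small hand-written validator that mirrors the structure of the schema (an array, or null, of objects with three required, nullable string fields) rather than a JSON Schema library; all names are illustrative.

```python
import json
from collections import Counter

REQUIRED_KEYS = ("equipment_class", "equipment_code", "equipment_type")

def matches_equipment_schema(raw):
    """True if raw is JSON null or an array of null/objects with the required keys."""
    try:
        doc = json.loads(raw)
    except (TypeError, ValueError):
        return False
    if doc is None:
        return True
    if not isinstance(doc, list):
        return False
    for item in doc:
        if item is None:
            continue
        if not isinstance(item, dict):
            return False
        for key in REQUIRED_KEYS:
            if key not in item or not isinstance(item[key], (str, type(None))):
                return False
    return True

def count_schema_failures(values):
    """Validate each distinct non-null value once, weighted by its frequency."""
    counts = Counter(v for v in values if v is not None)
    return sum(n for raw, n in counts.items() if not matches_equipment_schema(raw))

good = '[{"equipment_type": "TOW", "equipment_class": "M", "equipment_code": "MDW"}]'
bad = '[{"equipment_type": "TOW"}]'  # missing two required keys
rows = [good] * 3 + [bad] * 2 + [None]
failures = count_schema_failures(rows)
print(failures)
```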

In [139]:
for z in ['DISABLEMENT_ZIP_CODE', 'TOW_DESTINATION_ZIP_CODE']:
    # Raw string so that \d is not treated as a string escape; the pattern itself is unchanged.
    print(z, batch.expect_column_values_to_match_regex(z, r'^([A-Z]\d[A-z])|(\d{5})(-\d{3})|\d{4,5}$'), sep='\n')

DISABLEMENT_ZIP_CODE
{'success': True, 'result': {'element_count': 1516258, 'missing_count': 458879, 'missing_percent': 30.263912869709507, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'unexpected_percent_nonmissing': 0.0, 'partial_unexpected_list': []}}
TOW_DESTINATION_ZIP_CODE
{'success': True, 'result': {'element_count': 1516258, 'missing_count': 947570, 'missing_percent': 62.49398189490179, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'unexpected_percent_nonmissing': 0.0, 'partial_unexpected_list': []}}
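Both columns pass, but the pattern is more permissive than it looks: `|` binds more loosely than the anchors, so `^` applies only to the first alternative and `$` only to the last, and `[A-z]` also matches the punctuation characters that sit between `Z` and `a`. The sketch below groups the alternatives and uses `re.fullmatch`. The accepted formats are an assumption (4- or 5-digit ZIP, ZIP+4, and a Canadian forward sortation area or full postal code) and should be adjusted to whatever the data is actually supposed to contain.

```python
import re

# The pattern as written in the cell above, kept for comparison.
ORIGINAL_PATTERN = r'^([A-Z]\d[A-z])|(\d{5})(-\d{3})|\d{4,5}$'

# Assumed target formats; the grouping makes the whole value subject to the match.
POSTAL_CODE_RE = re.compile(
    r"(?:\d{4,5}"                      # 4- or 5-digit US ZIP
    r"|\d{5}-\d{4}"                    # ZIP+4
    r"|[A-Z]\d[A-Z](?: ?\d[A-Z]\d)?)"  # Canadian FSA, optionally the full code
)

def is_postal_code(value):
    """True only if the entire string is one of the assumed formats."""
    return POSTAL_CODE_RE.fullmatch(value) is not None

for candidate in ["02139", "12345-6789", "M5V 2T6", "M5_junk", "12345-678"]:
    loose = re.search(ORIGINAL_PATTERN, candidate) is not None
    print(candidate, loose, is_postal_code(candidate))
```

If the expectation applies search rather than full-match semantics, the stricter pattern would need explicit anchors around the group, for example `"^" + POSTAL_CODE_RE.pattern + "$"`.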


In [140]:
batch.expect_multicolumn_values_to_be_unique(column_list=['CASE_ID', 'TASK_ID'], result_format='SUMMARY',
                                            catch_exceptions=True, include_config=True)

{'success': True,
 'result': {'element_count': 1516258,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': [],
  'partial_unexpected_index_list': [],
  'partial_unexpected_counts': []},
 'expectation_config': {'expectation_type': 'expect_multicolumn_values_to_be_unique',
  'kwargs': {'column_list': ['CASE_ID', 'TASK_ID'],
   'result_format': 'SUMMARY'}},
 'exception_info': {'raised_exception': False,
  'exception_message': None,
  'exception_traceback': None}}
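It is worth confirming in the documentation for the installed version which of two meanings `expect_multicolumn_values_to_be_unique` has. Depending on the release, it may test that the listed columns hold different values within each row, rather than that the `(CASE_ID, TASK_ID)` combination is unique across rows. The stdlib sketch below shows how the two checks differ on made-up data; the helper names are hypothetical.

```python
from collections import Counter

def rows_with_repeated_values(rows):
    """Indexes of rows whose own values are not all distinct (within-row check)."""
    return [i for i, row in enumerate(rows) if len(set(row)) < len(row)]

def rows_with_duplicate_keys(rows):
    """Indexes of rows whose full tuple appears more than once (compound-key check)."""
    counts = Counter(rows)
    return [i for i, row in enumerate(rows) if counts[row] > 1]

# Made-up (CASE_ID, TASK_ID) pairs.
pairs = [(1001, 1), (1001, 2), (1002, 1), (1001, 1), (7, 7)]
within_row = rows_with_repeated_values(pairs)
compound = rows_with_duplicate_keys(pairs)
print(within_row)
print(compound)
```

The repeated pair `(1001, 1)` is caught only by the compound-key check, while `(7, 7)` is caught only by the within-row check.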

In [143]:
batch.expect_column_values_to_be_unique('PO_NUMBER', mostly=0.999, result_format='SUMMARY', catch_exceptions=True)

{'success': True,
 'result': {'element_count': 1516258,
  'missing_count': 513144,
  'missing_percent': 33.842789287838876,
  'unexpected_count': 20,
  'unexpected_percent': 0.001319036733854001,
  'unexpected_percent_nonmissing': 0.0019937913337865886,
  'partial_unexpected_list': [246074618.0,
   678691177.0,
   853728396.0,
   568081068.0,
   312945368.0,
   145733598.0,
   379252333.0,
   832810811.0,
   379252333.0,
   912521626.0,
   102106642.0,
   678691177.0,
   102106642.0,
   312945368.0,
   145733598.0,
   832810811.0,
   246074618.0,
   853728396.0,
   568081068.0,
   912521626.0],
  'partial_unexpected_index_list': [227,
   242,
   251,
   252,
   1955,
   1956,
   80007,
   82087,
   85643,
   180813,
   186856,
   345956,
   366011,
   453176,
   453178,
   987341,
   1083995,
   1084023,
   1084024,
   1260300],
  'partial_unexpected_counts': [{'value': 102106642.0, 'count': 2},
   {'value': 145733598.0, 'count': 2},
   {'value': 246074618.0, 'count': 2},
   {'value': 
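The percentages in this result can be reproduced from the counts: duplicates are measured against all rows for `unexpected_percent` and against non-missing rows for `unexpected_percent_nonmissing`, and the expectation passes when the non-duplicated share of non-missing rows is at least `mostly`. A short worked calculation using the numbers reported above:

```python
element_count = 1516258
missing_count = 513144
unexpected_count = 20
mostly = 0.999

non_missing = element_count - missing_count
unexpected_percent = 100 * unexpected_count / element_count
unexpected_percent_nonmissing = 100 * unexpected_count / non_missing
# The pass rate is computed over non-missing rows only.
success = (1 - unexpected_count / non_missing) >= mostly

print(non_missing, unexpected_percent, unexpected_percent_nonmissing, success)
```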

In [163]:
batch.expect_column_pair_values_A_to_be_greater_than_B(
    'SERVICE_TIME_UTC', 'CREATE_TIME_UTC', mostly=0.25, or_equal=True,
    ignore_row_if='either_value_is_missing',
    result_format='SUMMARY', catch_exceptions=True)

{'success': True,
 'result': {'element_count': 1516258,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 405487,
  'unexpected_percent': 26.742612405012867,
  'unexpected_percent_nonmissing': 26.742612405012867,
  'partial_unexpected_list': [['2018-10-08 15:29:39', '2019-12-12 05:00:26'],
   ['2018-10-17 05:28:11', '2019-12-21 05:01:33'],
   ['2018-10-18 14:30:00', '2019-12-12 05:00:24'],
   ['2019-01-10 10:34:14', '2019-12-20 05:02:48'],
   ['2019-01-11 15:46:14', '2019-12-17 15:09:54'],
   ['2019-01-11 17:40:07', '2019-12-17 15:40:48'],
   ['2019-01-11 08:19:45', '2019-12-20 05:02:49'],
   ['2018-12-22 17:30:00', '2019-12-12 05:00:22'],
   ['2018-12-03 08:16:35', '2019-12-20 05:02:05'],
   ['2019-03-01 15:10:50', '2019-12-20 05:03:23'],
   ['2019-03-03 18:46:31', '2019-12-17 15:09:55'],
   ['2019-03-06 08:03:03', '2019-12-05 17:07:14'],
   ['2019-03-07 14:28:08', '2019-12-17 15:09:55'],
   ['2019-01-23 08:23:37', '2019-12-16 17:00:31'],
   ['2019-01-25 16:40:23',
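Note that this expectation passes even though about 26.7% of rows have `SERVICE_TIME_UTC` earlier than `CREATE_TIME_UTC`, because `mostly=0.25` only requires a quarter of the rows to satisfy the condition. The stdlib sketch below illustrates the pairwise comparison and the effect of the threshold. The helper name is hypothetical, and only the first pair is taken from the output above; the others are made up.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def a_ge_b_rate(pairs):
    """Share of complete (a, b) pairs where a >= b; pairs with a missing side are skipped."""
    complete = [(a, b) for a, b in pairs if a is not None and b is not None]
    ok = sum(
        datetime.strptime(a, FMT) >= datetime.strptime(b, FMT) for a, b in complete
    )
    return ok / len(complete)

pairs = [
    ("2018-10-08 15:29:39", "2019-12-12 05:00:26"),  # from the output above: fails
    ("2019-12-12 06:00:00", "2019-12-12 05:00:24"),  # made up: passes
    ("2019-12-20 05:02:48", "2019-12-20 05:02:48"),  # made up: equal, passes with or_equal
    (None, "2019-12-17 15:09:54"),                   # made up: skipped as missing
]
rate = a_ge_b_rate(pairs)
print(rate, rate >= 0.25, rate >= 0.95)
```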

### 9. Expect columns to be of a certain data type

In [164]:
rule_df.dtypes

CASE_ID                                int64
TASK_ID                                 int8
PO_NUMBER                            float64
TASK_STATUS_CODE                      object
POLICY_NUMBER                         object
IS_DISPATCH                             bool
CALL_REASON_CODE                      object
DISABLEMENT_REASON_CODE               object
PROBLEM_CODE                          object
IS_SCHEDULED_DISPATCH                   bool
SERVICE_TYPE_CODE                     object
PRIMARY_EQUIPMENT_CODE                object
PRIMARY_EQUIPMENT_CLASS               object
PRIMARY_EQUIPMENT_TYPE                object
PRODUCT_TYPE                          object
IS_PER_EVENT                            bool
DISABLEMENT_LATITUDE                 float64
DISABLEMENT_LONGITUDE                float64
DISABLEMENT_ZIP_CODE                  object
DISABLEMENT_ADDRESS_1                 object
DISABLEMENT_ADDRESS_2                 object
DISABLEMENT_CITY                      object
DISABLEMEN

In [35]:
for x, y in batch.dtypes.items():  # Series.iteritems() was removed in pandas 2.0
    print(x, y)

CASE_ID int64
TASK_ID int8
PO_NUMBER float64
TASK_STATUS_CODE object
POLICY_NUMBER object
IS_DISPATCH bool
CALL_REASON_CODE object
DISABLEMENT_REASON_CODE object
PROBLEM_CODE object
IS_SCHEDULED_DISPATCH bool
SERVICE_TYPE_CODE object
PRIMARY_EQUIPMENT_CODE object
PRIMARY_EQUIPMENT_CLASS object
PRIMARY_EQUIPMENT_TYPE object
PRODUCT_TYPE object
IS_PER_EVENT bool
DISABLEMENT_LATITUDE float64
DISABLEMENT_LONGITUDE float64
DISABLEMENT_ZIP_CODE object
DISABLEMENT_ADDRESS_1 object
DISABLEMENT_ADDRESS_2 object
DISABLEMENT_CITY object
DISABLEMENT_STATE_CODE object
IS_DRIVER_WITH_VEHICLE bool
LOCATION_TYPE_CODE object
VIN object
VEHICLE_MODEL_YEAR float64
VEHICLE_MAKE object
VEHICLE_MODEL object
BILL_GROUP_ID float64
CLIENT_ID float64
COVERAGE_STATUS_CODE object
COVERAGE_AMOUNT float64
COVERAGE_STATUS_DESCRIPTION object
COVERED_AMOUNT float64
OVERAGE_PAYMENT_METHOD_CODE object
OVERAGE_PAYMENT_AMOUNT float64
AUTHORIZATION_REASON_CODE object
ESTIMATED_TOW_MILEAGE float64
TOW_DESTINATION_ID object


In [166]:
service_specification_data_types = dict(batch.dtypes.items())

In [167]:
for key, val in service_specification_data_types.items():
    service_specification_data_types[key] = str(val)

In [168]:
print(service_specification_data_types)

{'CASE_ID': 'int64', 'TASK_ID': 'int8', 'PO_NUMBER': 'float64', 'TASK_STATUS_CODE': 'object', 'POLICY_NUMBER': 'object', 'IS_DISPATCH': 'bool', 'CALL_REASON_CODE': 'object', 'DISABLEMENT_REASON_CODE': 'object', 'PROBLEM_CODE': 'object', 'IS_SCHEDULED_DISPATCH': 'bool', 'SERVICE_TYPE_CODE': 'object', 'PRIMARY_EQUIPMENT_CODE': 'object', 'PRIMARY_EQUIPMENT_CLASS': 'object', 'PRIMARY_EQUIPMENT_TYPE': 'object', 'PRODUCT_TYPE': 'object', 'IS_PER_EVENT': 'bool', 'DISABLEMENT_LATITUDE': 'float64', 'DISABLEMENT_LONGITUDE': 'float64', 'DISABLEMENT_ZIP_CODE': 'object', 'DISABLEMENT_ADDRESS_1': 'object', 'DISABLEMENT_ADDRESS_2': 'object', 'DISABLEMENT_CITY': 'object', 'DISABLEMENT_STATE_CODE': 'object', 'IS_DRIVER_WITH_VEHICLE': 'bool', 'LOCATION_TYPE_CODE': 'object', 'VIN': 'object', 'VEHICLE_MODEL_YEAR': 'float64', 'VEHICLE_MAKE': 'object', 'VEHICLE_MODEL': 'object', 'BILL_GROUP_ID': 'float64', 'CLIENT_ID': 'float64', 'ESTIMATED_TOW_MILEAGE': 'float64', 'TOW_DESTINATION_ID': 'object', 'TOW_DESTI

In [169]:
for col, typ in service_specification_data_types.items():
    print(batch.expect_column_values_to_be_of_type(col, typ, result_format='SUMMARY', catch_exceptions=True))

{'success': True, 'result': {'observed_value': 'int64'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'observed_value': 'int8'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'observed_value': 'float64'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'observed_value': 'object_'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'observed_value': 'object_'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'observed_value': 'bool_'}, 'exception_info': {'raised_exception': False, 'exception_message': None, 'exception_traceback': None}}
{'success': True, 'result': {'obser

### Review the expectations

Expectations that held true on this data batch were added to the suite. To view all the expectations added so far for this data asset, run:

In [131]:
batch.get_expectation_suite()

2020-02-24T12:38:40-0500 - INFO - 	128 expectation(s) included in expectation_suite. Omitting 3 expectation(s) that failed when last run; set discard_failed_expectations=False to include them. result_format settings filtered.


{'data_asset_name': 'agero_dsa_pandas/default/network_claims',
 'meta': {'great_expectations.__version__': '0.8.8'},
 'expectations': [{'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDCHARGE_AMOUNT'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDCHARGE_COUNT'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDCHARGE_DETAILS'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_APPROVED_DATE_EASTERN'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_APPROVED_DATE_UTC'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_APPROVED_PAYMENT'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_COUNT'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_DETAILS'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'ADDPAY_PAYME
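Because the suite is returned as a plain dictionary in this version, it can be summarised with ordinary Python rather than read line by line. A minimal sketch, assuming the structure shown above (`expectations` as a list of dictionaries with `expectation_type` and `kwargs`); the helper name and the sample suite are made up:

```python
from collections import Counter

def summarize_suite(suite):
    """Count expectations by type in a suite dictionary."""
    return Counter(e["expectation_type"] for e in suite["expectations"])

# Made-up suite in the same shape as the output above.
suite = {
    "data_asset_name": "example_asset",
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "ADDCHARGE_AMOUNT"}},
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "ADDCHARGE_COUNT"}},
        {"expectation_type": "expect_column_values_to_be_unique",
         "kwargs": {"column": "PO_NUMBER", "mostly": 0.999}},
    ],
}
by_type = summarize_suite(suite)
print(by_type)
```

In the notebook this could be applied to `batch.get_expectation_suite(discard_failed_expectations=False)`, the option named in the log message above, to include the expectations that failed on the last run.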

In [41]:
batch.save_expectation_suite()

2020-01-31T18:26:29-0500 - INFO - 	64 expectation(s) included in expectation_suite. result_format settings filtered.


### You created and saved expectations for at least one of the data assets.

### Next, we will show you how to set up validation: the process of checking whether new files of this type conform to your expectations before your pipeline's code processes them.

### Go to [integrate_validation_into_pipeline.ipynb](integrate_validation_into_pipeline.ipynb) to proceed.


