# Creating & Editing Expectation Suites
Use this example notebook as a "boilerplate" template for creating and modifying your expectation suites.

Although one notebook can manage multiple expectation suites, developers often find it helpful to dedicate a separate notebook to each suite: this makes the organization of the suites in the code repository more explicit and improves readability.

## IMPORTANT
Be sure to commit your notebook to GitHub as part of your repository!  This notebook is the source of truth, capturing your expectations with respect to the given data asset.  (To facilitate code review, you may wish to "Restart Kernel and Clear All Outputs" before committing the notebook to Git.)
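
If clearing outputs by hand is easy to forget, the same effect can be scripted. The following is a minimal sketch, assuming the notebook is stored in the standard `nbformat` 4 JSON layout; the function name and the file name in the usage comment are hypothetical and not part of this project.

```python
import json


def clear_notebook_outputs(notebook):
    """Return a copy of a notebook dict with all code-cell outputs removed."""
    # Deep copy via a JSON round-trip so the caller's dict is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical usage (the file name is an assumption):
# with open("manage_expectations.ipynb") as f:
#     nb = json.load(f)
# with open("manage_expectations.ipynb", "w") as f:
#     json.dump(clear_notebook_outputs(nb), f, indent=1)
```

Markdown cells are left as they are, so the explanatory text of the notebook survives the cleanup.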

## _We are here to help!_

You can always **reach out to us on** the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack)

## Initialize Spark Context and Import Python Basics


In [None]:
import os
import sys
import io

import time
import datetime

from pyspark import SQLContext

from pyspark.context import SparkContext
from pyspark.sql import SparkSession


In [None]:
from pyspark.sql import functions as F

In [None]:
sys.version_info

In [None]:
os.environ.get('PYSPARK_PYTHON')

In [None]:
spark_session = SparkSession.builder.appName("pytest-pyspark-local-notebook-manage_expectations"). \
    master("local[2]"). \
    config("spark.executor.memory", "6g"). \
    config("spark.driver.memory", "6g"). \
    config("spark.ui.showConsoleProgress", "false"). \
    config("spark.sql.shuffle.partitions", "2"). \
    config("spark.default.parallelism", "4"). \
    enableHiveSupport(). \
    getOrCreate()
sc = spark_session.sparkContext

In [None]:
spark = SQLContext(sc)

## Import Useful Python Utilities

Also import Great Expectations.

In [None]:

import json
import re

import pandas as pd

import great_expectations as ge


## Add Repository to Spark Context

Also import frequently used utilities from your repository.

### _Important_
Make sure that the path to your repository archive in S3 for the `sc.addPyFile(s3_path_to_repo_zip)` call below is correct and that the contents are up to date.
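
A quick format check can catch typos in the archive path before it is passed to `sc.addPyFile(...)`. This is only a sketch under stated assumptions: the helper name is hypothetical, the accepted URI schemes are a guess at what your cluster supports, and the check does not contact S3, so it cannot tell whether the archive exists or is up to date.

```python
from urllib.parse import urlparse


def split_s3_zip_path(s3_path):
    """Split an S3 URI for a .zip archive into (bucket, key), or raise ValueError."""
    parsed = urlparse(s3_path)
    # Assumption: only these URI schemes are meaningful in your environment.
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError("Expected an S3 URI, got: {0!r}".format(s3_path))
    key = parsed.path.lstrip("/")
    if not parsed.netloc or not key.endswith(".zip"):
        raise ValueError("Expected s3://<bucket>/<key>.zip, got: {0!r}".format(s3_path))
    return parsed.netloc, key
```

For the commented-out path in the next cell, `split_s3_zip_path('s3://alex-ge-test/code-0.0.0.zip')` yields the bucket `alex-ge-test` and the key `code-0.0.0.zip`.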

In [None]:
# sc.addPyFile('s3://alex-ge-test/code-0.0.0.zip')

In [None]:
def load_csv(spark_context, path, delimiter):
    return spark_context.read \
        .format("com.databricks.spark.csv") \
        .option("delimiter", delimiter) \
        .option("header", "true") \
        .load(path)


def load_parquet(spark_context, path, prefix_path=None, select_cols=None):
    if prefix_path is None:
        spark_parquet_read_func = spark_context.read
    else:
        spark_parquet_read_func = spark_context.read.option("basePath", prefix_path)

    if isinstance(path, list):
        df = spark_parquet_read_func.parquet(*path)
    else:
        df = spark_parquet_read_func.parquet(path)

    if select_cols:
        df = df.select(*select_cols)

    return df


## Great Expectations Basics

Check the Great Expectations version.

Import the `get_ge_context()` function (defined inline below for this example) and execute it using the standard buckets as the parameters:
* json_s3_bucket -- stores JSON files containing the authoritative expectation suite definitions and validation results
* html_docs_s3_bucket -- stores HTML files for displaying the expectation suite definitions and reporting their corresponding validation results

In [None]:
ge.__version__

In [None]:
# from repo.lib.test.great_expectations.ge_context import get_ge_context

In [None]:
import datetime

from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

class GeContext(object):
    def __init__(
            self,
            json_s3_bucket,
            html_docs_s3_bucket,
            site_name='s3_site',
            slack_webhook=None
    ):
        GeContext._validate_arguments(
            json_s3_bucket=json_s3_bucket,
            html_docs_s3_bucket=html_docs_s3_bucket,
            site_name=site_name,
            slack_webhook=slack_webhook
        )
        self._site_name = site_name
        action_list = [
            {
                'name': 'store_validation_result',
                'action': {
                    'class_name': 'StoreValidationResultAction'
                }
            },
            {
                'name': 'store_evaluation_params',
                'action': {
                    'class_name': 'StoreEvaluationParametersAction'
                }
            },
            {
                'name': 'update_data_docs',
                'action': {
                    'class_name': 'UpdateDataDocsAction'
                }
            },
        ]
        
        notify_slack_action_dict = {
            'name': 'notify_slack',
            'action': {
                'class_name': 'SlackNotificationAction',
                'slack_webhook': slack_webhook,
                'notify_on': 'all',
                'renderer': {
                    'module_name': 'great_expectations.render.renderer.slack_renderer',
                    'class_name': 'SlackRenderer'
                }
            }
        }
        
        if slack_webhook is not None:
            action_list.append(notify_slack_action_dict)

        project_config = DataContextConfig(
            config_version=1,
            datasources={
                's3_files_spark_datasource': {
                    'class_name': 'SparkDFDatasource',
                    'data_asset_type': {
                        'class_name': 'SparkDFDataset'
                    }
                }
            },
            config_variables_file_path=None,
            plugins_directory=None,
            validation_operators={
                'action_list_operator': {
                    'class_name': 'ActionListValidationOperator',
                    'action_list': action_list
                }
            },
            stores={
                'expectations': {
                    'class_name': 'ExpectationsStore',
                    'store_backend': {
                        'class_name': 'TupleS3StoreBackend',
                        'bucket': json_s3_bucket,
                        'prefix': 'great_expectations/ExpectationSuites'
                    }
                },
                'validations': {
                    'class_name': 'ValidationsStore',
                    'store_backend': {
                        'class_name': 'TupleS3StoreBackend',
                        'bucket': json_s3_bucket,
                        'prefix': 'great_expectations/Validations'
                    }
                },
                'evaluation_parameters': {
                    'class_name': 'EvaluationParameterStore'
                }
            },
            expectations_store_name='expectations',
            validations_store_name='validations',
            evaluation_parameter_store_name='evaluation_parameters',
            data_docs_sites={
                self._site_name: {
                    'class_name': 'SiteBuilder',
                    'store_backend': {
                        'class_name': 'TupleS3StoreBackend',
                        'bucket': html_docs_s3_bucket,
                        'prefix': ''
                    },
                    'site_index_builder': {
                        'class_name': 'DefaultSiteIndexBuilder',
                        'show_cta_footer': True
                    }
                }
            }
        )
        ge_context = BaseDataContext(project_config=project_config)
        self._ge_context = ge_context

    def build_data_docs(self):
        # site_names is a collection of site names, so wrap the single name in a list.
        self._ge_context.build_data_docs(site_names=[self._site_name])

    def get_ge_context(self):
        return self._ge_context

    @staticmethod
    def _validate_arguments(json_s3_bucket, html_docs_s3_bucket, site_name, slack_webhook):
        if not json_s3_bucket or not isinstance(json_s3_bucket, str):
            raise ValueError('Error: "json_s3_bucket" must be a non-empty string.')
        if not html_docs_s3_bucket or not isinstance(html_docs_s3_bucket, str):
            raise ValueError('Error: "html_docs_s3_bucket" must be a non-empty string.')
        if not site_name or not isinstance(site_name, str):
            raise ValueError('Error: "site_name" must be a non-empty string.')
        if slack_webhook and not isinstance(slack_webhook, str):
            raise ValueError('Error: "slack_webhook" must be either a non-empty string or entirely omitted.')

def get_ge_context(json_s3_bucket, html_docs_s3_bucket, slack_webhook=None):
    return GeContext(
        json_s3_bucket=json_s3_bucket,
        html_docs_s3_bucket=html_docs_s3_bucket,
        slack_webhook=slack_webhook
    ) \
        .get_ge_context()


In [None]:
json_s3_bucket = 'alex-ge-test'

In [None]:
html_docs_s3_bucket = 'alex-ge-test'

In [None]:
ge_context = get_ge_context(json_s3_bucket=json_s3_bucket, html_docs_s3_bucket=html_docs_s3_bucket)

# Manage Your Expectation Suite
Use this notebook to recreate and modify your expectation suite (write down the name of the expectation suite below for future reference):

**Expectation Suite Name**: `Titanic_Expectation_Suite`

You can always **reach out to us on** the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack)

## Data Asset Specification

Specify the S3 path to the data asset that you wish to reason about (by characterising it with expectations) in this notebook.  Then use the previously defined utilities to load this asset into a PySpark DataFrame (we also recommend printing some basic information about your dataframe).

### Terminology
We use the term "check dataframe" when referring to the dataframe corresponding to your data asset, because this is the dataframe on which the various checks against what is expected will be performed in the course of building the expectation suite.  As part of this process, you may need to create additional columns (e.g., to combine existing columns), join different dataframes, and so on in order to produce a check dataframe for expectations.

In [None]:
data_asset_path = 's3a://alex-ge-test/data_assets/Titanic.csv'

In [None]:
df_check = load_csv(
    spark_context=spark,
    path=data_asset_path,
    delimiter=','
)

In [None]:
print(df_check.columns)

In [None]:
print((df_check.count(), len(df_check.columns)))

In [None]:
df_check.show(n=200, truncate=False)

## Define Expectation Suite Name

Now create the name for your expectation suite.

We recommend a naming convention that appends the suffix "_Expectation_Suite" to the root of your output file name (or project ID).  While the name of an expectation suite can be any alphanumeric string, this naming convention facilitates clarity, standardization, and repeatability.
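
The convention is simple enough to capture in a small helper. The sketch below is illustrative only: the function name is hypothetical, and treating underscores as valid alongside letters and digits is an assumption based on the suggested suffix itself.

```python
import re

SUITE_SUFFIX = "_Expectation_Suite"


def make_expectation_suite_name(root):
    """Build a suite name from the root of an output file name or project ID."""
    # Assumption: letters, digits, and underscores are the only allowed characters.
    if not root or not re.fullmatch(r"[A-Za-z0-9_]+", root):
        raise ValueError("Suite root must be a non-empty alphanumeric string.")
    return root + SUITE_SUFFIX
```

With this helper, `make_expectation_suite_name('Titanic')` produces the name used in the next cell.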

In [None]:
expectation_suite_name = 'Titanic_Expectation_Suite'

## Create Expectation Suite

Use the Great Expectations context to create your expectation suite with the above name.


In [None]:
ge_context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name,
    overwrite_existing=True
)

## Obtain Data Batch

Now wrap your check dataframe into a batch of data within the Great Expectations context.

This is a two-step process.  First, we create keyword arguments as metadata for your data asset.  Then we use the Great Expectations context to generate the batch of data from your data asset and place it within the scope of your expectation suite.  We also display several rows of the batch to make sure that the contents are the same as in your original check dataframe.  Finally, we print out the batch keyword arguments for diagnostic purposes.


In [None]:
batch_kwargs = {
    'datasource': 's3_files_spark_datasource',
    'dataset': df_check
}

In [None]:
batch = ge_context.get_batch(
    expectation_suite_name=expectation_suite_name,
    batch_kwargs=batch_kwargs
)
batch.head(10)

In [None]:
batch.batch_kwargs

## Use Great Expectations API

The Great Expectations API provides information about the data batch.  For example, `batch.get_table_columns()` returns the columns of your data asset.  In the remainder of this notebook, you will be expressing your reasoning about the data in these columns by creating various expectations on them.

In [None]:
data_source_column_names_list = batch.get_table_columns()
print(data_source_column_names_list, len(data_source_column_names_list))

## Create & Edit Expectations

Add expectations by calling specific expectation methods on the `batch` object. They all begin with `.expect_`, which makes autocompletion with the "tab" key easy.

You can see all the available expectations in the **[expectation glossary](https://docs.greatexpectations.io/en/latest/expectation_glossary.html?utm_source=notebook&utm_medium=create_expectations)**.
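
Because every expectation method shares the `expect_` prefix, the methods available on an object can also be listed programmatically instead of relying on tab completion. The helper below is a hypothetical convenience, not part of the Great Expectations API; it relies only on Python's built-in `dir()` and `getattr()`.

```python
def list_expectation_methods(obj):
    """Return the sorted names of callable attributes that start with 'expect_'."""
    return sorted(
        name
        for name in dir(obj)
        if name.startswith("expect_") and callable(getattr(obj, name))
    )


# Hypothetical usage with the batch created earlier in this notebook:
# for method_name in list_expectation_methods(batch):
#     print(method_name)
```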

In [None]:
column_list = data_source_column_names_list

In [None]:
result = batch.expect_table_columns_to_match_ordered_list(
    column_list=column_list,
    result_format='SUMMARY',
    include_config=True,
    catch_exceptions=None,
    meta=None
)
print(result, 'Success: {0}'.format(result.success))

In [None]:
min_value = 1300

In [None]:
max_value = 1500

In [None]:
result = batch.expect_table_row_count_to_be_between(
    min_value=min_value,
    max_value=max_value,
    result_format='SUMMARY',
    include_config=True,
    catch_exceptions=None,
    meta=None
)
print(result, 'Success: {0}'.format(result.success))

In [None]:
column_names = ['Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode']

In [None]:
for column_name in column_names:
    result = batch.expect_column_values_to_not_be_null(
        column=column_name,
        mostly=None,
        result_format='SUMMARY',
        include_config=True,
        catch_exceptions=None,
        meta=None
    )
    print(result, 'Success: {0}'.format(result.success))
    print("\n")

In [None]:
column_name = '_c0'

In [None]:
result = batch.expect_column_values_to_not_be_null(
    column=column_name,
    mostly=9.8e-1,
    result_format='SUMMARY',
    include_config=True,
    catch_exceptions=None,
    meta=None
)
print(result, 'Success: {0}'.format(result.success))
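
The `mostly` argument in the cell above is a fraction between 0 and 1: the expectation succeeds when at least that share of the evaluated values satisfies the condition. The arithmetic can be sketched as follows. This is an illustration of the idea rather than the library's implementation, and the function and parameter names are hypothetical.

```python
def meets_mostly(unexpected_count, element_count, mostly=None):
    """Sketch of the 'mostly' threshold: the share of passing values must reach it."""
    if element_count == 0:
        # Assumption: with nothing to evaluate, treat the check as passing.
        return True
    if mostly is None:
        # Without a threshold, every value must satisfy the condition.
        return unexpected_count == 0
    passing_fraction = (element_count - unexpected_count) / float(element_count)
    return passing_fraction >= mostly
```

For example, with `mostly=0.98` and 1000 evaluated values, 10 unexpected values still pass the threshold while 30 do not.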

In [None]:
# column_name = 'Zip'

In [None]:
# regex_pattern = '^[0-9]{5}(?:-[0-9]{4})?$'

In [None]:
# result = batch.expect_column_values_to_match_regex(
#     column=column_name,
#     regex=regex_pattern,
#     mostly=9.0e-1,
#     result_format='SUMMARY',
#     include_config=True,
#     catch_exceptions=None,
#     meta=None
# )
# print(result, 'Success: {0}'.format(result.success))

In [None]:
# column_name = 'Year'

In [None]:
# value_set = [
#     2019,
#     2020
# ]

In [None]:
# result = batch.expect_column_values_to_be_in_set(
#     column=column_name,
#     value_set=value_set,
#     mostly=None,
#     result_format='SUMMARY',
#     include_config=True,
#     catch_exceptions=None,
#     meta=None
# )
# print(result, 'Success: {0}'.format(result.success))

In [None]:
# column_name = 'Week'

In [None]:
# min_value = 1

In [None]:
# max_value = 52

In [None]:
# result = batch.expect_column_values_to_be_between(
#     column=column_name,
#     min_value=min_value,
#     max_value=max_value,
#     mostly=None,
#     result_format='SUMMARY',
#     include_config=True,
#     catch_exceptions=None,
#     meta=None
# )
# print(result, "Success: {0}".format(result.success))

## Save & Review Your Expectations

Let's save the expectation suite as a JSON file in the expectations store configured above (the `great_expectations/ExpectationSuites` prefix of your JSON S3 bucket).
If you decide not to save some expectations that you created, use the [`remove_expectation` method](https://docs.greatexpectations.io/en/latest/module_docs/data_asset_module.html?highlight=remove_expectation&utm_source=notebook&utm_medium=edit_expectations#great_expectations.data_asset.data_asset.DataAsset.remove_expectation).
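
When reviewing a suite, a per-type tally of its expectations gives a quick overview of what will be saved. The sketch below assumes the suite is available as a plain dict in the saved JSON layout, that is, with an `expectations` list whose entries carry an `expectation_type` key; the function name is hypothetical.

```python
from collections import Counter


def count_expectation_types(suite_dict):
    """Count how many expectations of each type a suite (as a JSON dict) contains."""
    return Counter(
        expectation["expectation_type"]
        for expectation in suite_dict.get("expectations", [])
    )


# Hypothetical usage with a suite loaded from its saved JSON file:
# for expectation_type, count in sorted(count_expectation_types(suite_dict).items()):
#     print(expectation_type, count)
```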

Let's now rebuild your Data Docs, which help you communicate about your data with both machines and humans.

In [None]:
batch.get_expectation_suite(discard_failed_expectations=False)

In [None]:
batch.save_expectation_suite(discard_failed_expectations=False)

In [None]:
ge_context.build_data_docs()

In [None]:
sc.stop()