# Creating Expectation Suites with Pandas Profiling

Pandas profiling provide a feature that can generate an expectation suite based on the data profile analysis.

But this feature does not work anymore due to the new release of the greate expectation.

The official doc can be found [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/great_expectations_integration.html)

The code example can be found [here](https://github.com/pandas-profiling/pandas-profiling/blob/develop/examples/features/great_expectations_example.py)

I found this feature is no longer needed, because greate expectation provides a feature called **profiler** which does exactly the same thing. So I did not provide a fix or pr to this bug.

I manage to make below code example work, but the generated expectation suite is empty. If I added the expectation manually, it could work, but it does not make any sense anymore. Because the goal is to generate the appropriate expectation rules automatically.

In [25]:
import pandas as pd
from pandas_profiling import ProfileReport
import great_expectations as ge
from ruamel.yaml import YAML

In [14]:
file_path="../data/adult_with_duplicates.csv"
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,139.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,-12.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,,emp-by-pengfei,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
3,39.5,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
4,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [15]:
# generate a data profile report
data_profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# 1 Instantiate a GE data_context

In [28]:
ge_project_root_dir="../great_expectations_validation/great_expectations"

# this will read great_expectations.yaml in the project and build the data_context
data_context = ge.data_context.DataContext(context_root_dir=ge_project_root_dir)


ConfigNotFoundError: Error: No great_expectations directory was found here!
    - Please check that you are in the correct directory or have specified the correct directory.
    - If you have never run Great Expectations in this project, please run `great_expectations init` to get started.


# 2. Create the expectation suite

In [22]:
# Create the expectation suite by using the pandas' data profiling
expectation_suite = data_profile.to_expectation_suite(suite_name="pandas_generated_expectations",
    data_context=data_context,
    save_suite=False,
    run_validation=False,
    build_data_docs=False)

# 3. Create a GE dataframe and associate it with an expectation suite

In [23]:
# convert pandas dataframe to ge dataframe
ge_df = ge.dataset.PandasDataset(df,expectation_suite)
ge.get_context()

In [24]:
# Run validation on your dataframe
ge_df.validate()


{
  "success": true,
  "statistics": {
    "evaluated_expectations": 0,
    "successful_expectations": 0,
    "unsuccessful_expectations": 0,
    "success_percent": null
  },
  "evaluation_parameters": {},
  "results": [],
  "meta": {
    "great_expectations_version": "0.14.1",
    "expectation_suite_name": "default",
    "run_id": {
      "run_name": null,
      "run_time": "2022-01-22T12:54:48.370199+00:00"
    },
    "batch_kwargs": {
      "ge_batch_id": "79b5779a-7b82-11ec-839c-65b23fbe7c08"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20220122T125448.370151Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.14.1"
    }
  }
}

You can notice the expectation suite has **0** validation rule. And we do find the expectation json file **pandas_generated_expectations.json**. And the content of the file is shown below:

```json
{
  "data_asset_type": null,
  "expectation_suite_name": "pandas_generated_expectations",
  "expectations": [],
  "ge_cloud_id": null,
  "meta": {
    "great_expectations_version": "0.14.1"
  }
}

```

You can notice, the expectation list is empty at line 3. As we don't have **.** in the expectation suite name so there is no folder generated (Unlike pengfei.text1). It's just a simple json file under **expectations** folder.

In the web UI, you can also find the expectation suite **pandas_generated_expectations** is created. See below image
![expectation_suite](../images/expectation_pandas_validation.png)

Just to verify that if we add validation rule in expectation list, it could work. We added the following code to insert a rule manually. Then we create a checkpoint to test the expectation suite on a sample data

In [32]:

yaml = YAML()

# Use yaml to configure a checkpoint
my_checkpoint_name = "pandas_generated_checkpoint"

# create a new checkpoint by using the expectation pandas_generated_expectations
checkpoint_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-pandas_generated_checkpoint_run"
validations:
  - batch_request:
      datasource_name: pengfei_test
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_with_duplicates.csv
      data_connector_query:
        index: -1
    expectation_suite_name: pandas_generated_expectations
"""

# preview the checkpoint config
print(checkpoint_config)


name: pandas_generated_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-pandas_generated_checkpoint_run"
validations:
  - batch_request:
      datasource_name: pengfei_test
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_with_duplicates.csv
      data_connector_query:
        index: -1
    expectation_suite_name: pandas_generated_expectations



In [33]:
# check the config correctness
my_checkpoint = data_context.test_yaml_config(yaml_config=checkpoint_config)

Attempting to instantiate class from config...
	Instantiating as a SimpleCheckpoint, since class_name is SimpleCheckpoint
{
  "name": "pandas_generated_checkpoint",
  "config_version": 1.0,
  "template_name": null,
  "module_name": "great_expectations.checkpoint",
  "class_name": "SimpleCheckpoint",
  "run_name_template": "%Y%m%d-%H%M%S-pandas_generated_checkpoint_run",
  "expectation_suite_name": null,
  "batch_request": null,
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "evaluation_parameters": {},
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_n

In [34]:
# save the checkpoint
data_context.add_checkpoint(**yaml.load(checkpoint_config))

{
  "name": "pandas_generated_checkpoint",
  "config_version": 1.0,
  "template_name": null,
  "module_name": "great_expectations.checkpoint",
  "class_name": "Checkpoint",
  "run_name_template": "%Y%m%d-%H%M%S-pandas_generated_checkpoint_run",
  "expectation_suite_name": null,
  "batch_request": null,
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "evaluation_parameters": {},
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "pengfei_test",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "adult_w

<great_expectations.checkpoint.checkpoint.SimpleCheckpoint at 0x7f46c2153550>

In [35]:
# run the checkpoint
data_context.run_checkpoint(checkpoint_name=my_checkpoint_name)

# view the result
data_context.open_data_docs()

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

The checkpoint is executed, see the below image.

You can notice the validation failed, because we added a rule to avoid null value, but the sample data contains null value. So it's normal

![expectation_pandas_profiling_integration](../images/expectation_pandas_integration.png)