# Generate a Synthetic Data Quality Report with Gretel Evaluate

* This notebook walks through the process of generating a Synthetic Data Quality Score (SQS) report using Gretel Evaluate.
* To run this notebook, you will need an API key from the Gretel console, at https://console.gretel.cloud.



# Getting started

In [1]:
%%capture
!pip install -U gretel-client

In [2]:
import pandas as pd

from gretel_client.config import RunnerMode
from gretel_client.evaluation.quality_report import QualityReport
from gretel_client import configure_session
from gretel_client.projects import create_or_get_unique_project

In [3]:
# Show full column contents when previewing DataFrames
pd.set_option("max_colwidth", None)

# Specify your Gretel API key (you will be prompted to enter it)
configure_session(api_key="prompt", cache="yes", validate=True)

Gretel Api Key··········
Caching Gretel config to disk.
Using endpoint https://api.gretel.cloud
Logged in as grace@gretel.ai ✅
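
When the notebook runs non-interactively (for example as a scheduled job), prompting for the key is not practical. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `GRETEL_API_KEY` (an assumed name for this example); `resolve_api_key` is a hypothetical helper and not part of `gretel_client`. It falls back to the interactive prompt used above when the variable is unset:

```python
import os


def resolve_api_key(env_var: str = "GRETEL_API_KEY") -> str:
    """Return the API key from the environment, or "prompt" if it is not set.

    The variable name is an assumption for this sketch; "prompt" is the value
    the cell above passes to configure_session to ask for the key interactively.
    """
    key = os.environ.get(env_var, "").strip()
    return key if key else "prompt"


# In the notebook, the result would be passed to the same call as above:
# configure_session(api_key=resolve_api_key(), cache="yes", validate=True)
```

Keeping the key out of the notebook source also avoids committing it to version control by accident.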


# Load and preview the datasets


Specify a real-world dataset and a synthetic dataset to evaluate. The synthetic data must have been generated from the real-world data. Each dataset can be a local file or a web location.

For demonstration purposes, we'll use a United States Census dataset as our real-world data. Our synthetic data is the corresponding data generated by Gretel Synthetics.

In [4]:
# Load and preview real-world data

real_data = "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"

real_df = pd.read_csv(real_data)
real_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
1,34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
2,34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
3,69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
4,26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,42,Self-emp-inc,287037,12th,8,Divorced,Craft-repair,Not-in-family,White,Male,0,0,10,United-States,<=50K
4996,48,Private,236858,11th,7,Divorced,Other-service,Not-in-family,White,Female,0,0,31,United-States,<=50K
4997,53,Private,317313,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,60,United-States,>50K
4998,23,Private,113601,Some-college,10,Never-married,Handlers-cleaners,Own-child,White,Male,0,0,30,United-States,<=50K


In [5]:
# Load and preview synthetic data

synth_data = "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/USAdultIncome5kGenerated.csv"

synth_ref = pd.read_csv(synth_data)
synth_ref

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,29,Private,179541.0,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,60,United-States,>50K
1,17,?,143604.0,10th,6,Never-married,?,Own-child,White,Male,0,0,12,United-States,<=50K
2,80,?,242001.0,Masters,14,Widowed,?,Not-in-family,Other,Male,0,0,48,United-States,<=50K
3,27,?,143058.0,11th,7,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K
4,29,Private,116834.0,HS-grad,9,Never-married,?,Not-in-family,White,Male,0,0,35,?,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,49,Private,94413.0,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,>50K
4996,42,Private,31621.0,Bachelors,13,Separated,Sales,Own-child,Black,Female,0,0,35,United-States,<=50K
4997,35,Private,167967.0,11th,7,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,35,United-States,>50K
4998,37,Private,213640.0,Some-college,10,Divorced,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
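
Before submitting an evaluation job, it can be useful to confirm that the two datasets expose the same columns. The following is a small sketch of such a check; `compare_columns` is a hypothetical helper written for this example and is not part of `gretel_client` or pandas. It accepts any iterable of column names, so it works with a DataFrame's `.columns` as well as with plain lists.

```python
from typing import Iterable


def compare_columns(real_cols: Iterable[str], synth_cols: Iterable[str]) -> dict:
    """Report column names present in one dataset but not the other."""
    real, synth = list(real_cols), list(synth_cols)
    return {
        "missing_from_synthetic": [c for c in real if c not in synth],
        "extra_in_synthetic": [c for c in synth if c not in real],
        "same_order": real == synth,
    }


# A few column names taken from the previews above.
columns = ["age", "workclass", "fnlwgt", "education", "education_num"]
result = compare_columns(columns, columns)
```

In the notebook, the same check would be `compare_columns(real_df.columns, synth_ref.columns)`. Note that matching column names do not guarantee matching types: in the previews above, `fnlwgt` is shown as an integer in the real data and as a float in the synthetic data.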


# Create a Quality Report 

Now, we will task a worker running in Gretel Cloud to generate a Quality Report using a temporary project.

In [6]:
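# data_source is the synthetic data; ref_data is the real-world data it is compared against
report = QualityReport(data_source=synth_data, ref_data=real_data)
# Submit the job and wait for it to finish
report.run()
# Show the top-level Synthetic Data Quality Score
report.peek()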

[32mINFO: [0mStarting poller


{
    "uid": "62cf29e3152cbb2acd5fc695",
    "guid": "model_2Bu6uEuSzOVAUJEeuZZb8gyBt4o",
    "model_name": "clever-wiggly-iguana",
    "runner_mode": "cloud",
    "user_id": "626c14f7bff6215ff674589c",
    "user_guid": "user_28TpRgxbGDpUE4TUKlJrFDwB4Aq",
    "billing_domain": "gretel.ai",
    "billing_domain_guid": "domain_28eujAnf9EFme26oSFok8xCUT4n",
    "project_id": "62cf29d92901ead70dc2ff58",
    "project_guid": "proj_2Bu6t2vYJx40xKQvxJ1robiNguH",
    "status_history": {
        "created": "2022-07-13T20:24:03.257890Z"
    },
    "last_modified": "2022-07-13T20:24:03.420439Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/evaluate@sha256:9311f2a0b7228573ae2c252157e5057f09f7b24de8651918c2dbfc0654434cdd",
    "model_type": "evaluate",
    "config": {
        "schema_version": "1.0",
        "name": null,

[32mINFO: [0mStatus is pending. A Gretel Cloud worker is being allocated to begin model creation.
[32mINFO: [0mStatus is active. A worker has started creating your model!
2022-07-13T20:24:13.503898Z  Starting Gretel Evaluate
2022-07-13T20:24:13.504914Z  Loading data sets for SQS creation...
2022-07-13T20:24:13.534278Z  Creating SQS...
2022-07-13T20:24:24.624564Z  SQS finished, exporting report artifacts...
2022-07-13T20:24:25.029731Z  Evaluate job completed!
2022-07-13T20:24:25.030903Z  Uploading artifacts to Gretel Cloud


{'grade': 'Excellent', 'raw_score': 91.42962962962963, 'score': 91}

# View results

## Synthetic Data Quality Score (SQS)

In [7]:
report.as_dict["synthetic_data_quality_score"]

{'grade': 'Excellent', 'raw_score': 91.42962962962963, 'score': 91}
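
The score dictionary can also be used programmatically, for instance to decide whether a synthetic dataset is good enough to pass on to a later step. A minimal sketch follows, in which `meets_quality_bar` is a hypothetical helper and the threshold of 80 is an arbitrary value chosen for illustration, not a Gretel recommendation:

```python
def meets_quality_bar(sqs: dict, min_score: int = 80) -> bool:
    """Return True when the integer SQS score is at or above min_score."""
    return sqs["score"] >= min_score


# The dictionary printed above, copied verbatim.
sqs = {"grade": "Excellent", "raw_score": 91.42962962962963, "score": 91}
passed = meets_quality_bar(sqs)
```

In the notebook, the input would be `report.as_dict["synthetic_data_quality_score"]`. In this run, `score` equals `raw_score` rounded to the nearest integer.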

## Gretel Synthetic Report as HTML

In [8]:
from IPython.display import HTML

# Render the full report inline in the notebook
HTML(report.as_html)

Synthetic Data Use Cases,Excellent,Good,Moderate,Poor,Very Poor
Significant tuning required to improve model,,,,,
Improve your model using our tips and advice,,,,,
Demo environments or mock data,,,,,
Pre-production testing environments,,,,,
Balance or augment machine learning data sources,,,,,
Machine learning or statistical analysis,,,,,

,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
education_num,16,0,1.55,Categorical,Excellent
education,16,0,8.43,Categorical,Excellent
capital_gain,79,0,1.28,Categorical,Excellent
hours_per_week,82,0,1.98,Categorical,Excellent
age,70,0,2.0,Categorical,Excellent
native_country,40,0,12.3,Categorical,Excellent
income_bracket,2,0,4.76,Binary,Excellent
fnlwgt,4557,0,5.83,Numeric,Excellent
capital_loss,53,0,1.14,Categorical,Excellent
occupation,15,0,12.18,Categorical,Excellent
