# Evaluate two datasets and monitor data drift  

### How to use this notebook
Customers often ask how to understand and evaluate the quality of their data as it changes over time. Even if you don't create synthetic data, you can use Gretel Evaluate to compare any two datasets, for example to monitor data drift in the same database over time. This is useful when you collect more data (e.g. from user growth), implement a new data policy (such as GDPR compliance), or need to monitor data quality to maintain the accuracy of machine learning models.




### Installation
You'll need an API key from the [Gretel console](https://www.console.gretel.ai) to configure your session.

In [None]:
%%capture
# Install the latest Gretel Client (the %%capture cell magic must be the first line of the cell)
%pip install -U gretel-client

In [None]:
# Configure your Gretel session - enter your API key when prompted
from gretel_client import configure_session

configure_session(endpoint="https://api.gretel.cloud", api_key="prompt", cache="yes")

### Configure the evaluation

In [None]:
import pandas as pd
from gretel_client.evaluation.quality_report import QualityReport
from gretel_client.projects import create_or_get_unique_project


project = create_or_get_unique_project(name="evaluate-datasets-monitor-data") 

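# First dataset to compare
data_path = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/creditcard_kaggle_25k.csv.zip'
data_source = pd.read_csv(data_path)

# Reference dataset to compare against.
# Note: this example points at the same file as data_path; swap in your own second snapshot to measure drift.
ref_data_path = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/creditcard_kaggle_25k.csv.zip'
ref_data = pd.read_csv(ref_data_path)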
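
Before comparing two snapshots, it can help to confirm that they share the same columns, since a renamed or dropped column is itself a form of drift. The sketch below is not part of the Gretel client: `compare_columns` is a hypothetical helper, and the toy column lists (including the made-up `Merchant` column) stand in for `list(data_source.columns)` and `list(ref_data.columns)`.

```python
def compare_columns(source_columns, ref_columns):
    """Return the columns unique to each side, preserving their original order."""
    source_set, ref_set = set(source_columns), set(ref_columns)
    only_in_source = [c for c in source_columns if c not in ref_set]
    only_in_ref = [c for c in ref_columns if c not in source_set]
    return only_in_source, only_in_ref


# Toy column lists (assumed for illustration); in the notebook you would pass
# list(data_source.columns) and list(ref_data.columns) instead.
source_cols = ["Time", "V1", "Amount", "Class"]
ref_cols = ["Time", "V1", "Amount", "Class", "Merchant"]

only_in_source, only_in_ref = compare_columns(source_cols, ref_cols)
print("Only in data_source:", only_in_source)
print("Only in ref_data:", only_in_ref)
```

If either list is non-empty, you may want to align the two datasets before running the report.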

### Run Evaluate

In [None]:
# Create Quality Report LOCALLY, using the specified project
local_report = QualityReport(project=project, data_source=data_source, ref_data=ref_data, output_dir='report_results')
local_report.run()
local_report.peek()


### View the data quality report

In [None]:
# Render the full HTML contents of the report inline.
from IPython.display import HTML

HTML(data=local_report.as_html)

### Next: evaluate data on machine learning classification models

In [None]:
from gretel_client.evaluation.downstream_classification_report import DownstreamClassificationReport

# Target to predict, required field -- enter the header name of the label or target
target = "Class" 

test_holdout = 0.05

# Supply a subset if you do not want all of these, default is to use all of them
# models = classification_models

# Metric to use for ordering results, default is "acc" (Accuracy) for classification
# metric = "acc"

# Evaluate classification
evaluate = DownstreamClassificationReport(
    project=project,
    target=target, 
    data_source=data_source, 
    ref_data=ref_data,
    holdout=test_holdout,
    # models=models,
    # metric=metric,
    # output_dir = '/tmp',
    # runner_mode="cloud",
)

evaluate.run() # this will wait for the job to finish
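
Because the classification report needs a usable `target` column, a quick look at the label distribution can catch problems early, such as a dataset in which only one class is present. The following is a minimal sketch and not part of the Gretel client: `summarize_target` is a hypothetical helper, and the toy labels stand in for `data_source[target].dropna().tolist()`.

```python
from collections import Counter


def summarize_target(labels):
    """Count how often each class label occurs, skipping missing values (None)."""
    return Counter(label for label in labels if label is not None)


# Toy labels (assumed for illustration); in the notebook you would pass
# data_source[target].dropna().tolist() instead.
toy_labels = [0, 0, 0, 1, 0, None, 1, 0]

counts = summarize_target(toy_labels)
print("Label counts:", dict(counts))

# A classifier needs at least two distinct classes to learn from.
if len(counts) < 2:
    print("Warning: fewer than two classes found in the target column")
```

Running the same check on both `data_source` and `ref_data` also gives a rough view of whether the class balance has shifted between the two snapshots.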

### View the data utility report

In [None]:
# Render the full HTML contents of the report inline.
from IPython.display import HTML

HTML(data=evaluate.as_html)