# Gretel Blueprint: Auto-Balance Dataset
Use Gretel-Synthetics to automatically balance your project data. This blueprint can be used in support of fair AI and, more generally, with any imbalanced dataset to boost minority classes. In a single pass, you can balance the class distribution of as many fields as you like.

# Install Packages
Install the open source and premium packages from Gretel.ai.

In [None]:
%%capture
!pip install numpy pandas 
!pip install -U gretel-client "gretel-synthetics>=0.14.0"

In [None]:
# Be sure to use your Gretel URI here, which is available from the Integration menu in the Console

import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

In [None]:
# Install Gretel SDKs

from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)
client = project.client
project.client.install_packages()

# Import Blueprint Modules
If you are running on Google Colab, use the first cell to download files from our blueprint repo into the Colab notebook's working directory. Remember to switch the Colab runtime to GPU.

In [None]:
!curl -sL https://get.gretel.cloud/blueprint.sh | bash -s gretel/auto_balance_dataset/*.py

In [None]:
import bias_bp_inputs as bpi
import bias_bp_generate as bpgen
import bias_bp_graphs as bpg
import bias_bp_data as bpd

# Gather Project Data
There are two modes for balancing your data. The first (mode="full") generates a complete synthetic dataset with bias removed. The second (mode="additive") generates only the synthetic samples that, when added to the original set, remove the bias.

In the command below that gathers project data, specify the appropriate mode and the number of records from your project that you'd like to use (num_records). If you are running in mode "full", also specify the number of synthetic data records you'd like generated (gen_lines). If you are running in mode "additive", you will be told how many synthetic data records are needed to balance your dataset after you have chosen the fields to balance.

In [None]:
project_info = bpd.get_project_info(project, mode="full", num_records=14000, gen_lines=1000)
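
To make mode "full" more concrete, here is a minimal sketch of what a balanced target could look like for a single field: the requested gen_lines are split as evenly as possible across the field's values. The helper `full_mode_targets` and the class labels are hypothetical and are not part of the blueprint modules; the blueprint's own allocation may differ, especially when several fields are balanced at once.

```python
def full_mode_targets(class_values, gen_lines):
    """Split gen_lines as evenly as possible across the given class values.

    Hypothetical helper for illustration only; not part of the blueprint.
    """
    base, remainder = divmod(gen_lines, len(class_values))
    # The first `remainder` classes receive one extra record each.
    return {
        value: base + (1 if i < remainder else 0)
        for i, value in enumerate(class_values)
    }


# Made-up class labels for a single field
targets = full_mode_targets(["A", "B", "C"], gen_lines=1000)
print(targets)
```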

In [None]:
project_info["records"].head()

# Look at Current Categorical Field Distributions
Graphs are shown for categorical fields having a unique value count less than or equal 
to the parameter "uniq_cnt_threshold".  Adjust this parameter to fit your needs.

In [None]:
bpg.show_field_graphs(project_info["field_stats"], uniq_cnt_threshold=10)
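
As a rough illustration of how a unique-value threshold narrows down the fields to graph, the sketch below selects columns of a pandas DataFrame by their distinct value count. The helper `low_cardinality_fields` and the toy data are assumptions made for this example; the blueprint's `show_field_graphs` works from its own field statistics and may apply additional checks (for instance, on field type).

```python
import pandas as pd


def low_cardinality_fields(df, uniq_cnt_threshold=10):
    """Return the columns whose distinct non-null value count is at most the threshold.

    Hypothetical helper for illustration only; not part of the blueprint.
    """
    return [col for col in df.columns if df[col].nunique() <= uniq_cnt_threshold]


# Toy data, made up for illustration
toy = pd.DataFrame({
    "gender": ["F", "M", "M", "M", "F", "M"],
    "age": [23, 35, 41, 52, 29, 60],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

selected = low_cardinality_fields(toy, uniq_cnt_threshold=2)
print(selected)
```

Raising the threshold pulls in more fields, at the cost of busier graphs.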

# Choose Which Fields to Fix Bias In

In [None]:
project_info = bpi.choose_bias_fields(project_info)

# Compute Records Needed to Fix Bias

If you are running in mode "additive", this command will also tell you the total number of synthetic
records that will need to be generated to fix the bias in your chosen fields. After viewing this, if you
would like to go back and adjust your bias field selections, you may.

In [None]:
project_info = bpgen.compute_synth_needs(project_info)
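
For intuition about the "additive" record count, the sketch below computes, for one field, how many extra records each value would need to match the most frequent value. This is one common definition of balance and is only an assumption here; `records_needed_to_balance` is a hypothetical helper, and `compute_synth_needs` may compute its totals differently, particularly across multiple fields.

```python
import pandas as pd


def records_needed_to_balance(series):
    """Return, per value, how many records are missing to reach the majority count.

    Hypothetical helper for illustration only; not part of the blueprint.
    """
    counts = series.value_counts()
    target = counts.max()
    return {value: int(target - count) for value, count in counts.items()}


# Toy field with a 70 / 25 / 5 split, made up for illustration
toy_field = pd.Series(["M"] * 70 + ["F"] * 25 + ["X"] * 5)

needed = records_needed_to_balance(toy_field)
total_needed = sum(needed.values())
print(needed, total_needed)
```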

# Train Your Synthetic Model

- See [our documentation](https://gretel-synthetics.readthedocs.io/en/stable/api/config.html) for additional config options

In [None]:
# Create the Gretel Synthetics training / model configuration
from pathlib import Path

checkpoint_dir = str(Path.cwd() / "checkpoints")

config_template = {
    "checkpoint_dir": checkpoint_dir,
    "overwrite": True
}

In [None]:
# Create the synthetic training bundle
from gretel_helpers.synthetics import SyntheticDataBundle

bundle = SyntheticDataBundle(
    header_prefix=bpd.bias_fields(project_info),
    training_df=project_info["records"],
    delimiter=",",  # Specify the appropriate delimiter for your data
    auto_validate=True, 
    synthetic_config=config_template, 
)

bundle.build()

In [None]:
# Now train your model
bundle.train()

# Generate Balanced Synthetic Data

In [None]:
synth_df = bpgen.gen_synth_nobias(bundle, project_info)

# Take a Look At Your Synthetic Data

In [None]:
synth_df.head()

# Combine Your Original and New Synthetic Data
This step is only relevant if you are using mode="additive".

In [None]:
import pandas as pd

# Stack the synthetic records on top of the original training records
new_df = pd.concat([synth_df, project_info["records"]], ignore_index=True)

# Save to CSV

In [None]:
synth_df.to_csv('synthetic-data.csv', index=False, header=True)
#new_df.to_csv('synth-plus-orig-data.csv', index=False, header=True)

# Save to New Gretel Project

In [None]:
new_project = client.get_project(create=True)
new_project.send_dataframe(synth_df, detection_mode='fast') #alternatively use new_df
print(f"Access your project at {new_project.get_console_url()}") 

In [None]:
# Delete the project if you no longer need it
new_project.delete()

# Show New Distributions
When running in "full" mode, the graphs compare training data to synthetic data. When running in "additive" mode, still pass in synth_df; the graphs will automatically compare training data to training plus synthetic records.

In [None]:
bpg.show_new_graphs(project_info, synth_df)
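
If you would also like a numeric check alongside the graphs, the sketch below puts the relative frequencies of one field before and after balancing side by side. The helper `compare_distributions` and the toy series are assumptions for illustration; in practice "before" would be a column of the training records and "after" the same column of the synthetic data (mode "full") or of the training plus synthetic records (mode "additive").

```python
import pandas as pd


def compare_distributions(before, after):
    """Return a table of relative value frequencies before and after balancing.

    Hypothetical helper for illustration only; not part of the blueprint.
    """
    table = pd.DataFrame({
        "before": before.value_counts(normalize=True),
        "after": after.value_counts(normalize=True),
    })
    # Values present in only one of the two series get a frequency of 0.
    return table.fillna(0.0)


# Toy series, made up for illustration
before = pd.Series(["M"] * 3 + ["F"] * 1)
after = pd.Series(["M"] * 3 + ["F"] * 3)

table = compare_distributions(before, after)
print(table)
```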

# Generate a Full Synthetic Performance Report
Correlations and distributions in non-bias fields should, as always, transfer from training data to synthetic data.

In [None]:
from gretel_helpers.reports.correlation import generate_report
from IPython.display import IFrame, display

generate_report(project_info["records"], synth_df, report_path="./report.html") #alternatively use new_df
display(IFrame("./report.html", 1000, 600))