## Setup and Installation

This section installs the Python packages the notebook needs to run, then creates a session with the Gretel API endpoint so that we can communicate with Gretel Cloud. Learn more in our documentation covering [environment setup](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).

In [None]:
%%capture
!pip install -U gretel-trainer gretel-client

## Gretel Setup
Set up the Gretel API connection. You will be prompted for your API key.

In [None]:
from getpass import getpass
from gretel_client import configure_session

gretel_endpoint = "https://api.gretel.cloud"
gretel_api_key = getpass("API Key: ")

configure_session(
    api_key=gretel_api_key,
    endpoint=gretel_endpoint,
    validate=True,
    clear=True,
)

## Fetch and prepare data
Read in the dataset as a Gretel Relational object

In [None]:
from gretel_trainer.relational import MultiTable, RelationalData
import pandas as pd

DATA_PATH = "https://gretel-datasets.s3.us-west-2.amazonaws.com/telecom.json"

data = pd.read_json(DATA_PATH)
data.iloc[:5].to_json("telecom_preview.json", orient="table", indent=4, index=None)

rd = RelationalData()
rd.add_table(name="telecom", primary_key=None, data=data)

### Select JSON tables

Select the JSON tables that meet the minimum record threshold. Nesting depth is checked in the next step.

In [None]:
MINIMUM_REQUIRED_RECORDS = 1000

# We already omit empty invented tables from the set of tables considered "modelable"
all_tables = rd.list_all_tables("all")
modelable_tables = rd.list_all_tables("modelable")

below_threshold_tables = [
    table
    for table in modelable_tables
    if len(rd.get_table_data(table)) < MINIMUM_REQUIRED_RECORDS
]
above_threshold_tables = [
    table for table in modelable_tables if table not in below_threshold_tables
]

print(f"total table count: {len(all_tables)}")
print(f"modelable table count: {len(modelable_tables)}")
print(f"below threshold count: {len(below_threshold_tables)}")
print(f"above threshold count: {len(above_threshold_tables)}")

Of the tables above the record threshold, keep only those within the maximum nesting depth.

In [None]:
MAX_JSON_DEPTH = 3

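# Import the submodule explicitly so the `gretel_trainer` name is bound;
# the earlier `from ... import` statement does not bind it.
import gretel_trainer.relational.json

table_separator = gretel_trainer.relational.json.TABLE_SEPARATOR

def get_depth(rd: RelationalData, table: str) -> int:
    """Return the nesting depth of an invented table and save its data as CSV."""
    invented_table_metadata = rd.get_invented_table_metadata(table)
    breadcrumb = invented_table_metadata.json_breadcrumb_path
    # Side effect: write the table's data to <table>.csv in the working directory.
    data = rd.get_table_data(table)
    data.to_csv(f"{table}.csv", index=False)
    # Each separator in the breadcrumb path marks one level of nesting.
    return breadcrumb.count(table_separator)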

json_depths = {
    table: get_depth(rd, table)
    for table in above_threshold_tables
}

ok_tables = [table for table, depth in json_depths.items() if depth <= MAX_JSON_DEPTH]
print(f"modelable tables above record threshold and within max json depth: {len(ok_tables)}")
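
The depth check above counts separators in each table's breadcrumb path. A minimal standalone sketch of that idea follows; the separator character, the breadcrumb strings, and the helper name `breadcrumb_depth` are hypothetical stand-ins for illustration, not values taken from the Gretel library or the telecom dataset.

```python
# Hypothetical separator and breadcrumb paths, for illustration only.
SEPARATOR = "^"
MAX_DEPTH = 3


def breadcrumb_depth(breadcrumb: str, separator: str = SEPARATOR) -> int:
    """Depth is the number of separators: a root-level path has depth 0."""
    return breadcrumb.count(separator)


# Made-up breadcrumb paths keyed by made-up table names.
example_breadcrumbs = {
    "root": "telecom",
    "one_level": "telecom^plans",
    "three_levels": "telecom^plans^addons^fees",
    "four_levels": "telecom^plans^addons^fees^taxes",
}

depths = {name: breadcrumb_depth(path) for name, path in example_breadcrumbs.items()}
within_limit = [name for name, depth in depths.items() if depth <= MAX_DEPTH]

print(depths)
print(within_limit)
```

Only the table nested four levels deep is excluded here, mirroring how `MAX_JSON_DEPTH` filters `above_threshold_tables`.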

## Train Synthetic model on nested JSON data
Note that this example uses the tabular-dp model (`synthetics/tabular-differential-privacy`) for all tables.

Model training on the demo nested JSON dataset will take around 10 minutes to complete.

In [None]:
PROJECT_DISPLAY_NAME = "demo-nested-json"

mt = MultiTable(rd, project_display_name=PROJECT_DISPLAY_NAME)

config = "synthetics/tabular-differential-privacy"

mt.train_synthetics(config=config, only=ok_tables)

## Generate synthetic JSON records

Now that our model is trained, we can generate high-quality synthetic JSON records. The output can be a fraction or a multiple of the original data source's size.

In [None]:
RECORD_SIZE_RATIO = 1.0

# To adjust the amount of data generated, change RECORD_SIZE_RATIO.
mt.generate(record_size_ratio=RECORD_SIZE_RATIO)
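
As a rough guide to choosing a ratio, the sketch below assumes the ratio acts as a simple multiplier on a table's source row count. That is an assumption made for illustration: the helper `scaled_row_count` is hypothetical, and Gretel's actual rounding and its handling of nested child tables may differ.

```python
def scaled_row_count(source_rows: int, record_size_ratio: float) -> int:
    """Estimate output rows, ASSUMING the ratio is a plain multiplier."""
    if record_size_ratio <= 0:
        raise ValueError("record_size_ratio must be positive")
    return round(source_rows * record_size_ratio)


source_rows = 1000  # hypothetical source table size

half = scaled_row_count(source_rows, 0.5)    # a fraction of the source
same = scaled_row_count(source_rows, 1.0)    # same size as the source
double = scaled_row_count(source_rows, 2.0)  # a multiple of the source

print(half, same, double)
```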

## Convert synthetic dataframe to single JSON

In [None]:
df = mt.synthetic_output_tables['telecom']
df.iloc[:5].to_json("synth_telecom_preview.json", orient="table", indent=4, index=None)
df.to_json("synth_telecom.json", orient="table", indent=4, index=None)
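
Files written with `orient="table"` hold a `schema` object describing the fields and a `data` list holding the records. The sketch below checks such a file using only the standard library. The sample document, its column names, and the helper `summarize_table_json` are hypothetical; it writes its own small file so that it runs without the notebook's outputs.

```python
import json
import os
import tempfile

# Hypothetical document laid out like pandas' orient="table" output.
sample = {
    "schema": {
        "fields": [
            {"name": "index", "type": "integer"},
            {"name": "customer_id", "type": "string"},
        ],
        "primaryKey": ["index"],
    },
    "data": [
        {"index": 0, "customer_id": "a1"},
        {"index": 1, "customer_id": "b2"},
    ],
}


def summarize_table_json(path: str) -> tuple[list[str], int]:
    """Return the column names and the record count of a table-oriented JSON file."""
    with open(path, encoding="utf-8") as handle:
        document = json.load(handle)
    columns = [field["name"] for field in document["schema"]["fields"]]
    return columns, len(document["data"])


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_table.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle, indent=4)
    columns, record_count = summarize_table_json(sample_path)

print(columns, record_count)
```

Pointing `summarize_table_json` at `synth_telecom.json` should give a quick check on the generated file's columns and record count.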

## Accessing Output Files
All of the Relational Synthetics output files can be found in your local working directory. You can also download the outputs as a single archive file from the Gretel Console using the URL below.

In [None]:
console_url = f"https://console.gretel.ai/{mt._project.name}"
print(console_url)