[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)

<br>

<center><a href=https://gretel.ai/><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>

## Welcome to the Gretel 101 Blueprint!

In this Blueprint, we use Gretel Navigator Fine Tuning (`navigator_ft`) to fine-tune a language model that was pre-trained specifically on tabular datasets with learned schema-based rules. We then use the fine-tuned model to generate high-quality synthetic tabular data. Both steps run as training and generation jobs submitted to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).

This model supports multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values. The datasets provided in this notebook are selected to include them all.

Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.

## Create your Gretel account

To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).

<br>

#### Ready? Let's go 🚀

## 💾 Install `gretel-client` and its dependencies

In [None]:
%%capture
!pip install gretel-client

## 🛜 Configure your Gretel session

- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).

- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).

- With `validate=True`, your login credentials will be validated immediately at instantiation.

In [None]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", validate=True)
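
When the notebook runs non-interactively, a prompt is not practical. One option is to read the key from an environment variable and fall back to the interactive prompt otherwise. This is a minimal sketch: the helper name `resolve_api_key` and the choice of `GRETEL_API_KEY` as the variable name are assumptions made for illustration, not part of this Blueprint.

```python
import os


def resolve_api_key(env_var="GRETEL_API_KEY"):
    """Return the key stored in env_var, or "prompt" if it is unset or empty.

    Hypothetical helper; the env_var default is an assumption.
    """
    return os.environ.get(env_var) or "prompt"


# Usage in this notebook (needs a Gretel account, so it is left commented out):
# gretel = Gretel(api_key=resolve_api_key(), validate=True)
```

Falling back to `"prompt"` keeps the interactive behavior of the cell above whenever the variable is not set.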

In [None]:
# @title 🗂️ Pick a tabular dataset 👇 { display-mode: "form" }
dataset_path_dict = {
    "patient events (7348 records, 17 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-patient-events.csv",
    # Cited papers for the car accidents data:
    #   Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath.
    #   "A Countrywide Traffic Accident Dataset.", 2019.
    #   Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath.
    #   "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights."
    #   In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
    "car accidents (25000 records, 46 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-car-accidents.csv",
}

dataset = "patient events (7348 records, 17 fields)" # @param [ "patient events (7348 records, 17 fields)", "car accidents (25000 records, 46 fields)" ]
dataset = dataset_path_dict[dataset]



In [None]:
import pandas as pd

# explore the data using pandas
df = pd.read_csv(dataset)
df.head()

In [None]:
# The patient dataset is sequential: records are grouped by "patient_id",
# and each group is ordered by "event_id".
df_sorted = df.groupby('patient_id', group_keys=True).apply(lambda group: group.sort_values('event_id'))
df_sorted
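
To make "grouped by one column, ordered by another" concrete, here is a small library-free sketch of the same idea. The helper `group_and_order` and the toy rows are hypothetical and made up for illustration; only the column names `patient_id` and `event_id` come from the dataset.

```python
from collections import defaultdict


def group_and_order(records, group_key, order_key):
    """Return {group value: [records of that group, sorted by order_key]}."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    return {
        key: sorted(rows, key=lambda row: row[order_key])
        for key, rows in groups.items()
    }


# Toy rows standing in for the patient events data (values are made up).
toy_records = [
    {"patient_id": "p2", "event_id": 2},
    {"patient_id": "p1", "event_id": 3},
    {"patient_id": "p2", "event_id": 1},
    {"patient_id": "p1", "event_id": 1},
]

grouped = group_and_order(toy_records, "patient_id", "event_id")
for patient, events in grouped.items():
    print(patient, [event["event_id"] for event in events])
```

Each group is one patient's sequence of events. To try it on the real data, `df.to_dict("records")` produces a list of dicts in the shape this helper expects.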

## 🏋️‍♂️ Train a generative model

- The [navigator-ft](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/navigator-ft.yml) base config tells Gretel which model to train and how to configure it.

- You can replace `navigator-ft` with the path to a custom config file, or you can select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).

- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.

- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console.

In [None]:
# For the car accidents data, drop the "group_training_examples_by" and
# "order_training_examples_by" params, since that data is not sequential.
trained = gretel.submit_train(
    "navigator-ft",
    data_source=dataset,
    group_training_examples_by="patient_id",  # group records by the "patient_id" column
    order_training_examples_by="event_id",  # order records within each group by "event_id"
)
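
Because the grouping and ordering parameters apply only to sequential data, one way to keep a single training cell for both datasets is to build the keyword arguments conditionally. This is a sketch: `build_train_kwargs` is a hypothetical helper, while the parameter names are the ones used in the cell above.

```python
def build_train_kwargs(data_source, group_by=None, order_by=None):
    """Assemble keyword arguments for submit_train.

    The grouping and ordering parameters are included only when given,
    so non-sequential datasets simply omit them.
    """
    kwargs = {"data_source": data_source}
    if group_by is not None:
        kwargs["group_training_examples_by"] = group_by
    if order_by is not None:
        kwargs["order_training_examples_by"] = order_by
    return kwargs


# Sequential data (patient events): group and order the training examples.
sequential_kwargs = build_train_kwargs("patients.csv", "patient_id", "event_id")

# Non-sequential data (car accidents): only the data source is passed.
flat_kwargs = build_train_kwargs("accidents.csv")

# Usage in this notebook (submits a cloud job, so it is left commented out):
# trained = gretel.submit_train("navigator-ft", **sequential_kwargs)
```

The file names `patients.csv` and `accidents.csv` are placeholders; in this notebook the `dataset` variable would be passed instead.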

## 🧐 Evaluate the synthetic data quality

- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.

- The training results object returned by `submit_train` has a `report` attribute, a `GretelReport` object, for viewing the quality report.


In [None]:
# view the quality scores
print(trained.report)

In [None]:
# display the full report within this notebook
trained.report.display_in_notebook()

In [None]:
# inspect the synthetic data used to create the report
df_synth_report = trained.fetch_report_synthetic_data()
df_synth_report.head()

## 🤖 Generate synthetic data

- The `model_id` argument can be the ID of any trained model within the current project.


In [None]:
generated = gretel.submit_generate(trained.model_id, num_records=1000)

In [None]:
# inspect the generated synthetic data
generated.synthetic_data.head()