[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/generate_differentially_private_synthetic_text.ipynb)

<br>

<center><a href=https://gretel.ai/><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>

## Generate Differentially Private Synthetic Text with Gretel GPT

In this Blueprint, we'll fine-tune Gretel GPT on a dataset using differential privacy and generate synthetic text suitable for analytics, ML, or AI applications. You will need a [free Gretel account](https://console.gretel.ai/) to run this notebook. If this is your first time using the Gretel Client SDK, you can learn more about it [here](https://docs.gretel.ai/gretel-basics/getting-started/blueprints).

<br>

### Dataset

1. **alexa/Commonsense-Dialogues on 🤗**: Consists of 9k snippets of everyday conversations between people. Training time: 2hrs.

#### Ready? Let's go 🚀

## 💾 Install gretel-client and dependencies

In [None]:
%%capture
! pip install -Uqq gretel-client datasets

## 🛜 Configure your Gretel session

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).  

- Set the project name at instantiation or use the `set_project` method.

- Retrieve your API key [here](https://console.gretel.ai/users/me/key).

In [None]:
from gretel_client import Gretel

gretel = Gretel(project_name="dp-synthetic-text", api_key="prompt", validate=True)

## 📂 Load and Process the Dataset

In [None]:
from datasets import load_dataset
import pandas as pd

def print_dataset_statistics(data_source):
    """Print high level dataset statistics"""
    num_rows = data_source.shape[0]
    num_chars = data_source['text'].str.len().sum()

    print(f"Number of rows: {num_rows}")
    print(f"Number of characters: {num_chars}")

# Load the commonsense dialogues dataset, preprocessed into dialog format
dataset = load_dataset("meowterspace42/commonsense_dialogues")

# Convert the dataset to a pandas DataFrame
dataset_df = dataset['train'].to_pandas()

print("Sample Dialogue:\n")
print(dataset_df.iloc[0]['text'])
print_dataset_statistics(dataset_df)
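
The aggregate counts above do not show how long individual dialogues are, which matters because the training configuration later in this notebook sets `max_tokens` to 512. As a rough sketch, the hypothetical helper `text_length_summary` (not part of the Gretel SDK) summarizes per-row character lengths. It runs on a tiny made-up DataFrame here, and character counts are only a loose proxy for token counts.

```python
import pandas as pd


def text_length_summary(df: pd.DataFrame, column: str = "text") -> dict:
    """Summarize per-row character lengths of a text column."""
    lengths = df[column].str.len()
    return {
        "min": int(lengths.min()),
        "median": float(lengths.median()),
        "max": int(lengths.max()),
    }


# Made-up rows for illustration; in the notebook you would pass dataset_df.
sample_df = pd.DataFrame({"text": ["Hi there!", "How are you today?", "Fine."]})
summary = text_length_summary(sample_df)
print(summary)
```

If the maximum is far above the median, a few very long dialogues may be truncated during training while most records fit comfortably.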


## 🏗️ Train Gretel GPT with a **custom configuration**

### Base Configuration
For the full base YAML configuration for Gretel GPT, refer to [this link](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/natural-language.yml).

### Customizing the Configuration
You can customize the configuration by passing *keyword arguments* to the `submit_train` method. Each keyword names a section under the model in the YAML configuration, such as `params`, `generate`, or `privacy_params`, and its value must be a dictionary of parameters from that section.

**Tip:** Use the `job_label` argument to append a descriptive label to the model's name.
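
To make the keyword-to-section mapping concrete, here is a minimal sketch in plain Python. The helper `apply_overrides` and the base values are hypothetical and only illustrate the idea of section-level overrides. This is not how the Gretel SDK is implemented, and the numbers are placeholders rather than the real template defaults.

```python
from copy import deepcopy


def apply_overrides(base_config: dict, **sections: dict) -> dict:
    """Return a copy of base_config with each keyword merged into its section."""
    merged = deepcopy(base_config)
    for section, values in sections.items():
        if not isinstance(values, dict):
            raise TypeError(f"Section '{section}' must be given as a dictionary.")
        # Create the section if it is missing, then overwrite matching keys.
        merged.setdefault(section, {}).update(values)
    return merged


# Placeholder base values for illustration only, not the real template defaults.
base = {
    "params": {"epochs": 1, "batch_size": 4},
    "generate": {"num_records": 80},
}

custom = apply_overrides(
    base,
    params={"epochs": 3},
    privacy_params={"dp": True, "epsilon": 8},
)
print(custom)
```

Keys that are not overridden keep their base values, so each dictionary only needs to list the parameters you want to change.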

☕ Go grab a coffee while the model fine-tunes!

In [None]:
# Submit the fine-tuning job to Gretel

trained = gretel.submit_train(
    base_config="natural-language",
    job_label="commonsense_epsilon_8",
    data_source=dataset_df,
    params={
        "pretrained_model": "mistralai/Mistral-7B-Instruct-v0.2",
        "batch_size": 8,
        "steps": None,
        "epochs": 3,
        "max_tokens": 512,
        "learning_rate": 0.001
    },
    privacy_params={
        "dp": True,
        "epsilon": 8,
        "delta": "auto"
    },
    generate={
        "num_records": 100,
        "temperature": 0.8,
        "maximum_text_length": 512
    }
)
print(trained.model_id)
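
For reference, `epsilon` and `delta` in `privacy_params` are the two parameters of the standard $(\varepsilon, \delta)$-differential privacy guarantee. Assuming record-level neighboring datasets, a randomized mechanism $M$ (here, the fine-tuning procedure) satisfies it if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set of outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Smaller values of $\varepsilon$ and $\delta$ mean a stronger guarantee. This is the textbook definition only; how Gretel chooses $\delta$ when it is set to `"auto"` is not covered in this notebook.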

### 🔄 Loading a Fine-tuned Model

To reload the trained model object in a later session, pass its model ID (the value printed by the training cell above) to `fetch_train_job_results`:

```python
trained = gretel.fetch_train_job_results(model_id)
```
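
The model ID has to survive between sessions for this to work. One possible approach, sketched below with hypothetical helper names and a hypothetical file name, is to store the ID in a small JSON file after training and read it back before calling `fetch_train_job_results`. The demo uses a temporary directory and a placeholder ID so it runs without a Gretel account.

```python
import json
import tempfile
from pathlib import Path


def save_model_id(model_id: str, path: Path) -> None:
    """Write the model ID to a small JSON file."""
    path.write_text(json.dumps({"model_id": model_id}))


def load_model_id(path: Path) -> str:
    """Read the model ID back from the JSON file."""
    return json.loads(path.read_text())["model_id"]


# Demo with a placeholder ID; in the notebook you would pass trained.model_id.
with tempfile.TemporaryDirectory() as tmp_dir:
    id_file = Path(tmp_dir) / "gretel_model.json"
    save_model_id("example-model-id", id_file)
    restored = load_model_id(id_file)

print(restored)
# In a later session:
# trained = gretel.fetch_train_job_results(load_model_id(Path("gretel_model.json")))
```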

## 📈 View the synthetic quality report

In [None]:
# view synthetic data quality scores
print(trained.report)

## 📄 View the sample generation

In [None]:
# Fetch the synthetic records that were generated for the quality report
df = trained.fetch_report_synthetic_data()

print("Sample Dialogue:\n")
print(df.iloc[0]['text'])

## 🌱 Prepare the seed data

- Conditional data generation is accomplished by submitting seed data, which can be given as a file path or `DataFrame`.

- The seed data should contain a subset of the dataset's columns with the desired seed values.

- Currently, only categorical seed columns are supported.
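
The subset requirement above can be checked locally before submitting a job. The sketch below uses a hypothetical helper, `check_seed_columns`, which is not part of the Gretel SDK; it only compares column names and does not verify that the seed values themselves are supported.

```python
from typing import Sequence

import pandas as pd


def check_seed_columns(seed_df: pd.DataFrame, training_columns: Sequence[str]) -> None:
    """Raise ValueError if the seed data is empty or uses unknown columns."""
    if seed_df.empty:
        raise ValueError("Seed data must contain at least one row.")
    unknown = [col for col in seed_df.columns if col not in training_columns]
    if unknown:
        raise ValueError(f"Seed columns not in the training data: {unknown}")


# The training data in this notebook has a single "text" column.
example_seed = pd.DataFrame(
    {"text": ["The context of the following conversation is that ..."]}
)
check_seed_columns(example_seed, training_columns=["text"])
print("Seed columns look valid.")
```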

In [None]:
import pandas as pd

# A dataframe with 5 sample commonsense conversation contexts to complete.
data = {
    "text": [
        "The context of the following conversation is that Ashley went to a fancy dinner party at a high-end restaurant. She accidentally spilled soup on the host's expensive rug.",
        "The context of the following conversation is that John missed his flight and is now trying to find an alternative way to get to his business meeting on time.",
        "The context of the following conversation is that Mary found a stray cat on her way home and is figuring out what to do with it.",
        "The context of the following conversation is that Mike's car broke down in the middle of a road trip, and he needs to get it fixed to continue his journey.",
        "The context of the following conversation is that Sarah is planning a surprise birthday party for her best friend and needs to keep it a secret while making all the arrangements."
    ]
}

seed_data = pd.DataFrame(data)

## 🤖 Generate additional DP synthetic data

- The `submit_generate` method requires either `num_records` **or** `seed_data` as a keyword argument.

- If `seed_data` is given, the number of generated records will equal `len(seed_data)`.

- **Tip:** You can generate data from any trained model in the current project by using its associated `model_id`.
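
The first two rules above can be expressed as a small pre-flight check. The helper `expected_record_count` is hypothetical and does not call Gretel. It assumes that `seed_data` determines the count whenever it is present, as the second bullet states.

```python
from typing import Optional, Sized


def expected_record_count(
    num_records: Optional[int] = None,
    seed_data: Optional[Sized] = None,
) -> int:
    """Return how many records a generation request is expected to produce."""
    if seed_data is not None:
        # One generated record per seed row.
        return len(seed_data)
    if num_records is not None:
        return num_records
    raise ValueError("Provide either num_records or seed_data.")


unconditional_count = expected_record_count(num_records=100)
seeded_count = expected_record_count(seed_data=["context A", "context B"])
print(unconditional_count, seeded_count)
```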

In [None]:
generated = gretel.submit_generate(
    trained.model_id,
    seed_data=seed_data,
    temperature=0.8,
    maximum_text_length=512,
)

In [None]:
# inspect conditionally generated data
print("Sample Dialogue:\n")
print(generated.synthetic_data.iloc[0]['text'])