<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/navigator_augmenting_llm_training_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>

<center><a href=https://gretel.ai/><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>


# 🚀 Augmenting LLM training data with high-quality _synthetic_ examples
In this notebook, we will use [Gretel's Navigator](https://gretel.ai/navigator) to generate diverse, high-quality training examples for training or fine-tuning LLMs with less data. Our goal is to demonstrate how to get started creating high-quality synthetic data for LLM training and to facilitate further research into safeguards for completion models.

## Background
Recent research has shown that small, efficient language models trained on high-quality, diverse data can achieve results competitive with much larger models, as demonstrated by Microsoft's [phi-1.5](https://arxiv.org/abs/2309.05463) and [Orca2](https://arxiv.org/abs/2311.11045) models.

Creating diverse synthetic training data is challenging but vital to reduce overfitting and improve generalization. We will demonstrate how to boost generation diversity using an approach similar to the [TinyStories](https://arxiv.org/abs/2305.07759) study, in which the authors chose random words from a fixed vocabulary to inject into the prompt.
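
The word-injection idea can be sketched in a few lines of plain Python. This is only an illustration, not the TinyStories authors' code: the vocabulary, the `build_prompt` helper, and the prompt wording are all hypothetical.

```python
import random

# Hypothetical vocabulary; TinyStories drew from a much larger fixed word list.
VOCABULARY = ["river", "lantern", "whisper", "bridge", "garden", "compass"]


def build_prompt(rng: random.Random, num_words: int = 3) -> str:
    """Return a prompt that asks the model to use randomly chosen words."""
    words = rng.sample(VOCABULARY, num_words)  # sample without replacement
    return "Write a short story that uses the words: " + ", ".join(words) + "."


rng = random.Random(6)  # fixed seed so the sketch is reproducible
prompts = [build_prompt(rng) for _ in range(4)]
for p in prompts:
    print(p)
```

Because each prompt carries a different random subset of words, repeated calls steer the model toward different content even though the instruction text stays the same.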

## Prerequisites

Before diving into the notebook, there are a couple of prerequisites:

1. **Gretel API Key**: You'll need an API key from Gretel. If you don't have one already, you can obtain it from [Gretel's console](https://console.gretel.ai/users/me/key). This key will enable us to use Gretel's services for generating our synthetic datasets.

2. **Access to Gretel's Navigator**: To utilize the specific features of Navigator, you need to have access to the early preview. If you're not already signed up, you can request early access at [Gretel's Navigator page](https://gretel.ai/navigator).

Let's get started!


## 💾 Install and import necessary packages

In [None]:
%%capture
!pip install datasets keybert keyphrase_vectorizers
!pip install -Uqq gretel_client

In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

from gretel_client import Gretel

## 🛜 Configure your Gretel session and initialize Navigator

- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).

In [None]:
gretel = Gretel(api_key="prompt")

navigator = gretel.factories.initialize_inference_api(backend_model="gretelai-google/gemini-pro")

## ⚙️ Set demo parameters

In [None]:
DATASET_NAME = "databricks/databricks-dolly-15k"
CATEGORY = "closed_qa"
MAX_WORDS = 400
NUM_EXAMPLES = 10
NUM_SELECT_PHRASES = 2
UPSAMPLE_MULTIPLE = 3
RANDOM_SEED = len("GRETEL")

## 💾 Load and preprocess dataset

In the cell below we perform the following preprocessing steps:

- Load the Dolly dataset and convert to a pandas `DataFrame`

- Keep only the examples whose category matches `CATEGORY`

- Clean the text (replace line breaks with spaces) and drop non-ASCII characters

- Remove examples with more words than `MAX_WORDS`

- Drop unnecessary columns

- Sample `NUM_EXAMPLES` examples for the demo
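
To make the cleaning and length-filter steps concrete, here is a minimal pure-Python sketch of what they do to a single record. The `clean_text` and `within_word_limit` helpers and the sample record are hypothetical; the notebook itself applies the same logic with pandas in the cell below.

```python
def clean_text(text: str) -> str:
    """Replace line breaks with spaces and drop non-ASCII characters."""
    text = text.replace("\n", " ").replace("\r", " ")
    return text.encode("ascii", "ignore").decode("ascii")


def within_word_limit(context: str, response: str, max_words: int) -> bool:
    """True when context and response together have fewer than max_words words."""
    return len(f"{context} {response}".split()) < max_words


# Hypothetical record, for illustration only.
record = {"context": "Caf\u00e9 culture\nin Paris", "response": "It is popular."}
cleaned = {key: clean_text(value) for key, value in record.items()}
print(cleaned["context"])
print(within_word_limit(cleaned["context"], cleaned["response"], max_words=400))
```

Note that the ASCII step silently drops accented characters rather than transliterating them, so "Café" becomes "Caf".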

In [None]:
%%capture
dataset = load_dataset(DATASET_NAME, split="train")

df = (
    dataset
    .to_pandas()
    .query("category==@CATEGORY")
    .applymap(lambda x: x.replace('\n', ' ').replace('\r', ' ').encode('ascii', 'ignore').decode('ascii'))
    .assign(num_words=lambda df_: df_["context"].str.cat(df_["response"], sep=" ").str.split().apply(len))
    .query("num_words < @MAX_WORDS")
    .drop(columns=["category", "num_words"])
    .sample(NUM_EXAMPLES, random_state=RANDOM_SEED)
    .reset_index(drop=True)
)

In [None]:
navigator.display_dataframe_in_notebook(df)

## 🗝️ Identify key phrases

- Here we use a BERT-based model to extract interesting/important key phrases from the `context` of each example.

- We upsample the dataset by a factor of `UPSAMPLE_MULTIPLE`, which will allow us to create multiple examples from the same context.

- We then sample `NUM_SELECT_PHRASES` phrases from the key phrases extracted for each example.

- We will use these `select_phrases` to boost the diversity in our synthetic instructions.
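
The upsample-then-sample idea can be illustrated without any model. In this sketch the `(phrase, score)` pairs are made up to mimic the shape of KeyBERT's per-document output, and `UPSAMPLE` and `NUM_SELECT` stand in for `UPSAMPLE_MULTIPLE` and `NUM_SELECT_PHRASES`.

```python
import random

# Hypothetical (phrase, score) pairs for one context.
extracted = [
    ("solar panels", 0.71),
    ("energy storage", 0.65),
    ("grid", 0.52),
    ("inverter", 0.48),
]

UPSAMPLE = 3    # copies of each context (stands in for UPSAMPLE_MULTIPLE)
NUM_SELECT = 2  # phrases drawn per copy (stands in for NUM_SELECT_PHRASES)

rng = random.Random(6)  # fixed seed so the sketch is reproducible
selections = []
for _ in range(UPSAMPLE):
    chosen = rng.sample(extracted, NUM_SELECT)  # sample without replacement
    selections.append(", ".join(phrase for phrase, _ in chosen))

print(selections)
```

Each copy of the context can end up with a different phrase pair, which is what lets a single context seed several distinct synthetic instructions.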

In [None]:
%%capture
np.random.seed(RANDOM_SEED)

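def sample_key_phrases(phrases):
    # Short contexts may yield fewer phrases than requested, so cap the sample size.
    num_select = min(NUM_SELECT_PHRASES, len(phrases))
    random_idx = np.random.choice(len(phrases), num_select, replace=False)
    # Each entry is a (phrase, score) tuple; keep only the phrase text.
    return ", ".join([phrases[i][0] for i in random_idx])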


df["select_phrases"] = KeyBERT().extract_keywords(
    docs=df["context"].tolist(),
    vectorizer=KeyphraseCountVectorizer(), top_n=3 * NUM_SELECT_PHRASES
)

df = pd.DataFrame(np.repeat(df.values, UPSAMPLE_MULTIPLE, axis=0), columns=df.columns)

df["select_phrases"] = df["select_phrases"].apply(sample_key_phrases)

In [None]:
# preview example context + select_phrases
navigator.display_dataframe_in_notebook(df[["context", "select_phrases"]].head())

## 🤖 Prompt Gretel's Navigator to create synthetic instructions and responses

In [None]:
prompt = """\
For each example in the Dataset, please act as a tutor and create high quality,
detailed synthetic question and answers of higher quality than the provided example.
Frame your question around one of the phrases in the 'select_phrases' column.
Ensure the data teaches concepts step-by-step and focuses on improving reasoning skills.
Focus on generating questions and answers about under-represented topics and knowledge gaps.

Add two new columns to the Dataset:
1. 'synthetic_instruction':
  * Introduce the topic from the example briefly in 1-2 sentences
  * Ask a clear question related to the topic that requires logical thinking or common sense reasoning
  * Provide any necessary context to set up the reasoning problem
  * Do not repeat the instruction from the Dataset example
2. 'synthetic_response':
  * Respond to the synthetically generated instruction thoroughly in a step-by-step manner
  * Provide the complete reasoning needed to arrive at the answer
  * Ensure the explanation is textbook quality with all details needed to learn the concept
  * Answer in 3-5 sentences.
"""

In [None]:
columns = ["instruction", "context", "response", "synthetic_instruction", "synthetic_response"]

synthetic = navigator.edit(prompt=prompt, seed_data=df, top_k=40, temperature=0.8)[columns]

In [None]:
navigator.display_dataframe_in_notebook(synthetic)
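
Before evaluating quality, it can be useful to spot-check that generated rows respect the prompt. The sketch below checks two of the prompt's constraints with simple string matching. The `follows_prompt` helper and the example row are hypothetical and not part of `gretel_client`. Because `select_phrases` is not among the columns kept above, the phrases are passed in separately.

```python
def follows_prompt(row: dict, phrases: str) -> bool:
    """Heuristic check of two prompt constraints for one generated row."""
    instruction = row["synthetic_instruction"].lower()
    # Constraint 1: the new instruction must not repeat the original one.
    not_repeated = instruction.strip() != row["instruction"].lower().strip()
    # Constraint 2: at least one selected phrase should appear in it.
    mentions_phrase = any(
        p.strip().lower() in instruction
        for p in phrases.split(",")
        if p.strip()
    )
    return not_repeated and mentions_phrase


# Hypothetical row, for illustration only.
example_row = {
    "instruction": "What are solar panels?",
    "synthetic_instruction": (
        "Solar panels convert sunlight into electricity. "
        "Why does panel output drop on cloudy days?"
    ),
}
print(follows_prompt(example_row, "solar panels, energy storage"))
```

Exact substring matching is a rough heuristic: a model may paraphrase a phrase, so a failed check is a prompt for manual review rather than proof of a bad row.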

In [None]:
# Compare the quality of the synthetic text against the original examples
from IPython.display import HTML

from gretel_client.evaluation.text_quality_report import TextQualityReport

# Reference data: original instruction + response pairs
real = pd.DataFrame()
real["text"] = synthetic["instruction"] + " " + synthetic["response"]
# Data under evaluation: synthetic instruction + response pairs
synthetic["text"] = synthetic["synthetic_instruction"] + " " + synthetic["synthetic_response"]

report = TextQualityReport(data_source=synthetic,
                           ref_data=real,
                           target="text",
                           record_count=len(synthetic))
report.run()

HTML(report.as_html, metadata=dict(isolated=True))