[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_Text_Generation_Blueprint.ipynb)

<br>

<center><a href=https://gretel.ai/><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>

## Welcome to the Gretel Text Generation Blueprint!

In this Blueprint, we will use Gretel's platform to **fine-tune** and **prompt** a multi-billion-parameter large language model (LLM). If you have already worked through the tabular Blueprints, you will notice that Gretel's SDK interface is the same for all data modalities, so you only need to learn it once!

As with the previous Blueprints, we will submit training and generation jobs to the Gretel Cloud, which will spin up the compute resources required for fine-tuning and prompting an LLM.

## In the right place?

If you are new to Gretel, we recommend starting with these [SDK Blueprints](https://github.com/gretelai/gretel-blueprints/tree/main/sdk_blueprints):

1. [Gretel 101 Blueprint](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)

2. [Gretel Advanced Tabular Blueprint](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_Advanced_Tabular_Blueprint.ipynb)

**Note:** You will need a [free Gretel account](https://console.gretel.ai/) to run this notebook.

<br>

#### Ready? Let's go 🚀

In [None]:
%%capture
!pip install gretel-client

## 🛜 Configure your Gretel session

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).  

- You can set the project name at instantiation, or you can use the `set_project` method.

- If you do not set the project, a random project will be created with your first job submission.

- You can retrieve your API key from the [Gretel Console](https://console.gretel.ai/users/me/key).

In [None]:
from gretel_client import Gretel

# api_key="prompt" asks for your API key interactively; validate=True checks it before continuing
gretel = Gretel(project_name="text-gen", api_key="prompt", validate=True)

In [None]:
# @title 🗂️ Set the dataset path

dataset_path_dict = {
    "banking intents and queries (1082 examples)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-banking-questions-intents.csv",
}
dataset = "banking intents and queries (1082 examples)" # @param ["banking intents and queries (1082 examples)"]

dataset = dataset_path_dict[dataset]

- This Blueprint's dataset is based on a subset of [banking77](https://huggingface.co/datasets/PolyAI/banking77), which consists of banking queries and the associated intents.

- In the [data subset](https://github.com/gretelai/gretel-blueprints/blob/main/sample_data/sample-banking-questions-intents.csv), we concatenate each intent and query into a single string for input into a pretrained language model.

<br>

#### Example inputs

- **intent**,_query_
- **card payment fee charged**,_What was the extra charge for using my card?_
- **transaction charged twice**,_Why are there multiple transactions showing for one purchase?_
- **cash withdrawal charge**,_Why did I get a fee for getting cash?_
- **wrong amount of cash received**,_You shorted me money when I tried to make a withdrawal._
- **direct debit payment not recognized**,_Why is there a direct debit to my account? I didn't do that._
- **balance not updated after cheque or cash deposit**,_I've deposited a check in my account but the cash isn't showing as available._

<br>

> By fine tuning a model on examples similar to the above, we can conditionally generate banking queries with a particular intent simply by prompting the model with the intent. We will show you how to do this below!
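The concatenation step described above can be sketched with pandas. This is a minimal illustration, not the script used to build the sample file: the input column names `intent` and `query` and the helper `concat_intent_and_query` are assumptions, while the output column name `intent_and_text` matches the `column_name` used for training later in this Blueprint.

```python
import pandas as pd

# Hypothetical raw data with separate columns; the column names "intent" and
# "query" are assumptions made for this illustration.
raw = pd.DataFrame(
    {
        "intent": ["card payment fee charged", "cash withdrawal charge"],
        "query": [
            "What was the extra charge for using my card?",
            "Why did I get a fee for getting cash?",
        ],
    }
)


def concat_intent_and_query(df, intent_col="intent", query_col="query"):
    """Return a one-column DataFrame of 'intent,query' strings."""
    # Strip stray whitespace so the comma is the only separator.
    combined = df[intent_col].str.strip() + "," + df[query_col].str.strip()
    return combined.to_frame(name="intent_and_text")


training_df = concat_intent_and_query(raw)
print(training_df["intent_and_text"].iloc[0])
```

Because the intent always comes first and ends at the first comma, a model trained on such strings can later be prompted with just the intent and the comma.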

## 🎛️ Fine-tune a multi-billion-parameter LLM

- We use the natural language [base config](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/natural-language.yml) by setting `base_config="natural-language"`.

- If you are fine tuning a model, `data_source` must be a path or `DataFrame`.

- If your goal is zero/few-shot prompting (i.e., no fine tuning), set `data_source=None`.


> **Customizing the config:** As we saw in the [Gretel Advanced Tabular Blueprint](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_Advanced_Tabular_Blueprint.ipynb), you can pass model config sections like `params` and `generate` as keyword arguments to the `submit_train` method. Model parameters that are not nested in a section can be passed directly as keyword arguments. For example, below we set Gretel GPT's `column_name` parameter to `"intent_and_text"`, the name of the text column in the dataset.

In [None]:
trained = gretel.submit_train(
    base_config="natural-language",
    data_source=dataset,
    column_name="intent_and_text",
    params={"batch_size": 16, "steps": 500},
    generate={"num_records": 100, "temperature": 0.8}
)

In [None]:
# view the text data quality scores
print(trained.report)

In [None]:
# display the full report within this notebook
trained.report.display_in_notebook()

In [None]:
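def print_examples(dataframe, num=5):
    """Print up to `num` random examples from the first column of the given DataFrame."""
    # Cap the sample size so small DataFrames do not raise a ValueError.
    sample = dataframe.sample(min(num, len(dataframe)))
    # Read the first column by position so extra columns do not break unpacking.
    print("\n".join(sample.iloc[:, 0].astype(str)))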

# view synthetic examples used in the quality report
print_examples(trained.fetch_report_synthetic_data())

## 🤖 Generate synthetic data

- You can pass any of Gretel GPT's [`generate` parameters](https://docs.gretel.ai/reference/synthetics/models/gretel-gpt#data-generation) as keyword arguments to the `submit_generate` method.

In [None]:
generated = gretel.submit_generate(
    trained.model_id,
    num_records=50,
    temperature=0.8
)

In [None]:
# view random synthetic examples
print_examples(generated.synthetic_data)

## 🎙️ Prompting your fine-tuned LLM

- To prompt your fine-tuned LLM, use the `seed_data` argument of the `submit_generate` method.

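- Below, we prompt the model to generate 50 synthetic queries with an intent of `"card payment fee charged"`. Each row of `seed_data` is one prompt, so we repeat the intent 50 times, and the trailing comma matches the `intent,query` separator used in the training data.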

In [None]:
import pandas as pd

seed_data = pd.DataFrame(["card payment fee charged,"] * 50)

prompted = gretel.submit_generate(trained.model_id, seed_data=seed_data)

In [None]:
# view random queries conditioned on the intent
print_examples(prompted.synthetic_data, num=10)
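The same seeding pattern extends to several intents at once. The sketch below uses only pandas and does not call Gretel. The helpers `build_seed_data` and `split_intent_and_query` are hypothetical names introduced here, and the splitting step assumes the generated text keeps the `intent,query` format of the training data, which you should verify on your own output.

```python
import pandas as pd


def build_seed_data(intent_counts):
    """Build a one-column seed DataFrame with one 'intent,' prompt per requested record."""
    prompts = []
    for intent, count in intent_counts.items():
        # The trailing comma mirrors the separator used in the training strings.
        prompts.extend([f"{intent}," for _ in range(count)])
    return pd.DataFrame(prompts)


def split_intent_and_query(texts):
    """Split 'intent,query' strings at the first comma into two columns."""
    # partition always yields three columns (before, separator, after),
    # so commas inside the query text are left untouched.
    parts = pd.Series(list(texts), dtype="object").str.partition(",")
    return pd.DataFrame(
        {"intent": parts[0].str.strip(), "query": parts[2].str.strip()}
    )


seed_data = build_seed_data(
    {"card payment fee charged": 2, "cash withdrawal charge": 1}
)
print(seed_data)

# Illustrative strings in the training format, not real model output.
labeled = split_intent_and_query(
    ["cash withdrawal charge,Why did I get a fee, exactly, for getting cash?"]
)
print(labeled)
```

A DataFrame built this way could be passed as `seed_data` to `submit_generate` in the same way as the single-intent example, and splitting the results back into two columns gives labeled rows for downstream use such as training an intent classifier.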