<a href="https://colab.research.google.com/gist/zredlined/542135c75ec9b1ba4fdd54fd71400d28/tabular_llm_basic_sdk_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 💾 Install `gretel-client` and its dependencies

In [None]:
%%capture
!pip install -Uqq gretel-client
!pip install -qq Jinja2 pandas

## 🛜 Configure your Gretel session

- You will be prompted to enter your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).

In [None]:
import pandas as pd
import yaml
from IPython.display import display

from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

# Configure Gretel session
configure_session(endpoint="https://api.gretel.cloud", api_key="prompt", cache="yes")

# Set Pandas display options (if required)
pd.set_option('display.max_rows', 100)

# Create or get a unique Gretel project
project = create_or_get_unique_project(name="TabularLLM")

print(f"Project URL: {project.get_console_url()}")

## 🏗️ Initialize Gretel's Tabular LLM with a custom configuration

- Below we initialize the Tabular LLM model in your Gretel Cloud project using a base YAML configuration.

- We use JSONL as the output format in this notebook, but CSV can also be used with `output_format: csv`.

In [None]:
# Create the Tabular LLM model (configured under the `navigator` key)
model_config = """
schema_version: 1.0
models:
  - navigator:
        model_id: "gretelai/tabular-v0"
        output_format: "jsonl"
"""
model_config = yaml.safe_load(model_config)
model = project.create_model_obj(model_config)
model.submit_cloud()
poll(model, verbose=False)
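
The YAML above parses to a plain Python dictionary, so the same configuration can also be built programmatically, for example to switch the output format. The sketch below is illustrative only: `build_model_config` is a hypothetical helper, not part of `gretel-client`, and it mirrors just the keys used in this notebook. Restricting the format to `jsonl` and `csv` is an assumption based on the two values mentioned above.

```python
def build_model_config(output_format: str = "jsonl") -> dict:
    """Return a config dict shaped like the YAML used in this notebook.

    Hypothetical helper for illustration; not part of gretel-client.
    """
    # Assumption: only the two formats named in this notebook are accepted.
    if output_format not in ("jsonl", "csv"):
        raise ValueError(f"unsupported output_format: {output_format!r}")
    return {
        "schema_version": 1.0,
        "models": [
            {
                "navigator": {
                    "model_id": "gretelai/tabular-v0",
                    "output_format": output_format,
                }
            }
        ],
    }


csv_config = build_model_config("csv")
print(csv_config["models"][0]["navigator"]["output_format"])
```

Note that the `submit_generate` helper in this notebook reads results as JSONL, so a CSV configuration would need a matching reader.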

In [None]:
# @title 🧰 Define helper functions
# @markdown - Run this cell to define helper functions for
# @markdown submitting generation jobs and displaying the results.

# Set pandas display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

def clear_project_artifacts(project):
    """Clear artifacts from the given project."""
    artifacts = project.artifacts
    if artifacts:
        print("Clearing artifacts")
        for artifact in artifacts:
            print(f" -- {artifact}")
            project.delete_artifact(artifact['key'])

def display_all_rows(df):
    """Display a DataFrame with left-aligned, word-wrapped cells."""
    # Style DataFrame for better visibility and word-wrap
    styled = df.style.set_properties(**{
        'text-align': 'left',
        'white-space': 'normal',
        'height': 'auto'
    })

    # Display the styled DataFrame
    display(styled)

def submit_generate(model, prompt: str, params: dict, ref_data=None) -> pd.DataFrame:
    """
    Generate or augment data from the Tabular LLM model.

    Args:
    model: The model object that will process the prompt.
    prompt (str): The text prompt to generate data from.
    params (dict): Parameters for data generation.
    ref_data: Optional existing dataset to edit or augment.

    Returns:
    pd.DataFrame: The generated data.
    """
    data_processor = model.create_record_handler_obj(
        data_source=pd.DataFrame({"prompt": [prompt]}),
        params=params,
        ref_data=ref_data
    )
    data_processor.submit_cloud()
    poll(data_processor, verbose=False)
    return pd.read_json(data_processor.get_artifact_link("data"), lines=True, compression="gzip")
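
`submit_generate` reads the job output as gzip-compressed JSON Lines, that is, one JSON object per line. For readers unfamiliar with the format, here is a standard-library sketch of the same decoding step applied to in-memory bytes. The sample records are made up for illustration, and `parse_gzip_jsonl` is a hypothetical name.

```python
import gzip
import json


def parse_gzip_jsonl(payload: bytes) -> list:
    """Decode gzip-compressed JSON Lines into a list of records."""
    text = gzip.decompress(payload).decode("utf-8")
    # Each non-empty line is an independent JSON document.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up sample data standing in for a downloaded artifact.
sample = [
    {"first_name": "Jean", "city": "Lyon"},
    {"first_name": "Marie", "city": "Nice"},
]
payload = gzip.compress("\n".join(json.dumps(r) for r in sample).encode("utf-8"))

records = parse_gzip_jsonl(payload)
print(records)
```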


In [None]:
# Optionally clear out previous project artifacts
clear_project_artifacts(project)

## 🤖 Generate synthetic data

- Prompt Tabular LLM to create a synthetic dataset.


In [None]:
# Generate mock dataset
prompt = """\
Generate a mock dataset for users from the Foo company based in France.

Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames.
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io).
* gender: Male/Female/Non-binary.
* city: a city in France.
* country: always 'France'.
"""

params = {
    "num_records": 10,
    "temperature": 0.8,
    "top_p": 1,
    "top_k": 50
}
df = submit_generate(model=model, prompt=prompt, params=params)

df
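
Because the prompt states an exact rule for the `email` column, generated rows can be checked against it. The following is a minimal sketch, assuming that accents are stripped and that spaces or hyphens in names are removed; the prompt specifies neither, so treat both choices as assumptions. `expected_email` is a hypothetical helper.

```python
import unicodedata


def expected_email(first_name: str, last_name: str, domain: str = "foo.io") -> str:
    """Build the email the prompt describes: first initial + last name @ domain."""

    def normalize(s: str) -> str:
        # Assumption: drop accents and keep only ASCII letters, lowercased.
        decomposed = unicodedata.normalize("NFKD", s)
        return "".join(c for c in decomposed if c.isascii() and c.isalpha()).lower()

    first, last = normalize(first_name), normalize(last_name)
    return f"{first[:1]}{last}@{domain}"


print(expected_email("Jean", "Dupont"))
```

Applied row by row to `df`, a comparison against the `email` column flags records where the model drifted from the requested format.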

## 🔧 Augment an existing dataset

- Prompt Tabular LLM to add new columns to an existing dataset.

In [None]:
# Add a new column, derived from existing values, to our pandas DataFrame.

prompt = """Add a new column: initials, which contains the initials of the person."""
params = {"num_records": len(df), "temperature": 0.8}
ref_data = {"data": df}

df = submit_generate(model, prompt=prompt, params=params, ref_data=ref_data)

df
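
A derived column like `initials` can be spot-checked deterministically. This is a rough sketch under stated assumptions: the model may punctuate its output differently (for example `J.D.` or `J. D.`), so the comparison normalizes first, and hyphenated names are treated as a single name. Both function names are hypothetical.

```python
def initials(first_name: str, last_name: str) -> str:
    """Return uppercase initials for whitespace-separated name parts."""
    # Assumption: a hyphenated name such as "Jean-Pierre" counts as one part.
    parts = first_name.split() + last_name.split()
    return "".join(p[0].upper() for p in parts)


def normalize_initials(value: str) -> str:
    """Strip punctuation and spacing so 'J. D.' and 'JD' compare equal."""
    return "".join(c for c in value if c.isalpha()).upper()


print(initials("Jean", "Dupont"))
print(normalize_initials("J. D.") == initials("Jean", "Dupont"))
```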

## 📊 Generate diverse data with Tabular LLM

- Prompt Tabular LLM to answer questions and create new, diverse examples from your domain-specific data.

In [None]:
# List of questions
questions = [
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
    "Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?",
    "Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?",
    "James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?"
]

# Create a DataFrame
df = pd.DataFrame(questions, columns=['question'])

prompt = """Add a new column: answer, which contains a detailed step-by-step answer to the question in each row."""
params = {"num_records": len(df), "temperature": 0.8}
ref_data = {"data": df}

df = submit_generate(model, prompt=prompt, params=params, ref_data=ref_data)

display_all_rows(df)
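
Generated step-by-step answers are not guaranteed to be correct, so it helps to have reference values to compare against. The arithmetic below is worked out directly from the five question texts; the only assumption is a 52-week year for the last question.

```python
# Reference answers derived from the question text, in the same order
# as the `questions` list.
natalia = 48 + 48 // 2                   # April sales plus half as many in May
weng = 12 * 50 // 60                     # $12 per hour for 50 minutes
betty = 100 - (100 // 2 + 15 + 2 * 15)   # cost minus savings and both gifts
julie = (120 - 12 - 2 * 12) // 2         # half of the pages left after two days
james = 3 * 2 * 2 * 52                   # pages x friends x times per week x weeks

expected = [natalia, weng, betty, julie, james]
print(expected)
```

Comparing these values with the final number in each generated `answer` gives a quick accuracy check before using the data further.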