[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/llm/fine-tuned-gpt.ipynb)


# <a id="top">Tweet sentiment classification using an LLM</a>

This notebook illustrates how LLMs (such as OpenAI's GPT) can be uploaded to the Openlayer platform.

## <a id="toc">Table of contents</a>

1. [**Getting the data and training the model**](#1)
    - [Downloading the dataset](#download)
    - [Preparing the data](#prepare)
    - [Training the model](#train)
    - [Getting predictions from the trained model](#preds)
    

2. [**Using Openlayer's Python API**](#2)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
        - [Shell models](#shell)
        - [Full models](#full-model)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/sklearn/sentiment-analysis/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Getting the data and training the model </a>

[Back to top](#top)

In this first part, we will get the dataset, pre-process it, fine-tune a model, and get its predictions for the training and validation sets. Feel free to skim through this section if you are already comfortable with these steps.

In [None]:
import numpy as np
import pandas as pd

### <a id="download">Downloading the dataset </a>


We have stored the dataset in an S3 bucket. If, for some reason, you get an error reading the CSV files directly from it, feel free to copy and paste the URLs into your browser and download them. Alternatively, you can find the original datasets on [this Kaggle dataset page](https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset?select=testdata.manual.2009.06.14.csv). The training set in this example corresponds to the first 20,000 rows of the original training set.

In [None]:
%%bash

if [ ! -e "sentiment_train.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/sentiment-analysis/sentiment_train.csv" --output "sentiment_train.csv"
fi

if [ ! -e "sentiment_val.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/sentiment-analysis/sentiment_val.csv" --output "sentiment_val.csv"
fi

In [None]:
columns = ['polarity', 'tweetid', 'query_name', 'user', 'text']

df_train = pd.read_csv(
    "./sentiment_train.csv",
    encoding='ISO-8859-1', 
)

df_val = pd.read_csv(
    "./sentiment_val.csv",
    encoding='ISO-8859-1'
)
df_train.columns = columns
df_val.columns = columns

In [None]:
df_train.head()

In [None]:
# Re-mapping the label 4 to 1 so that 'polarity' is zero-indexed (0 = negative, 1 = positive, 2 = neutral)
df_val['polarity'] = df_val['polarity'].replace(4, 1)
df_train['polarity'] = df_train['polarity'].replace(4, 1)

### <a id="prepare">Preparing the data</a>

**Disclaimer: there are costs associated with using OpenAI's API. Use at your own discretion. If you don't want to fine-tune a model, but would like to see what the data uploaded to the Openlayer platform looks like, feel free to skip to the [dataset upload](#dataset) section.**

From this part onward, we assume that you have an OpenAI API key as the environment variable `OPENAI_API_KEY`. If this is not the case, run:

In [None]:
# `!export` only affects a temporary subshell; `%env` sets the variable for the notebook kernel
%env OPENAI_API_KEY=<OPENAI_API_KEY>

We are going to fine-tune an LLM for our task. As described in [OpenAI's documentation](https://platform.openai.com/docs/guides/fine-tuning/prepare-training-data), the training data must be a JSONL document, with one JSON object per line, as in the example below:
```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
```
In our case, the `"<prompt text>"` would be the tweet (in the column `"text"` of our original dataset) and the `"<ideal generated text>"` would be one of the classes `"negative"`, `"positive"`, or `"neutral"` (in the column `"polarity"`, as indexes originally).
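
As a minimal sketch of this mapping, the hypothetical helper below (`to_finetune_record` is our own illustration, not part of any library) builds one JSONL line from a tweet and its class index. It assumes the prompt separator and stop sequence conventions that the following cells apply to the dataframes.

```python
import json

# Conventions applied to the dataframes later in this notebook
PROMPT_SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = "###"
CLASSES_MAP = {0: " negative", 1: " positive", 2: " neutral"}


def to_finetune_record(text: str, polarity: int) -> str:
    """Hypothetical helper: serializes one example as a single JSONL line."""
    record = {
        "prompt": text + PROMPT_SEPARATOR,
        "completion": CLASSES_MAP[polarity] + STOP_SEQUENCE,
    }
    # json.dumps escapes the newlines, so the record stays on one line
    return json.dumps(record)


print(to_finetune_record("Today is going to be a great day!", 1))
```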

In [None]:
# Getting the relevant columns
df_train_ = df_train[["polarity", "text"]].copy()
df_val_ = df_val[["polarity", "text"]].copy()

In [None]:
# Re-mapping the "polarity" column to have text (instead of indexes)
# Note the blank space before the class names -- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
classes_map = {0: " negative", 1: " positive", 2: " neutral"}
df_train_.loc[:, "polarity"] = df_train_["polarity"].map(classes_map)
df_val_.loc[:, "polarity"] = df_val_["polarity"].map(classes_map)

In [None]:
# Re-naming the columns
names_map = {"polarity": "completion", "text": "prompt"}
df_train_ = df_train_.rename(columns=names_map)
df_val_ = df_val_.rename(columns=names_map)

In [None]:
# Adding a unique separator to the end of the prompts \n\n###\n\n  -- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
df_train_["prompt"] += "\n\n###\n\n"
df_val_["prompt"] += "\n\n###\n\n"

In [None]:
# Adding a fixed stop sequence to the end of the completions ###  -- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
df_train_["completion"] += "###"
df_val_["completion"] += "###"

Let's save the dataframes to csv so that we can use [OpenAI's CLI data preparation tool](https://platform.openai.com/docs/guides/fine-tuning/cli-data-preparation-tool) to generate a JSONL.

In [None]:
df_train_.to_csv("training_set.csv", index=False)
df_val_.to_csv("validation_set.csv", index=False)

In [None]:
!openai tools fine_tunes.prepare_data -f "training_set.csv" 
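
If the CLI tool is not available, the same conversion can be approximated by hand. The sketch below uses only the standard library and assumes the CSV has the `prompt` and `completion` columns we saved above. The helper name `csv_to_jsonl` is hypothetical, and the CLI tool may apply extra checks and suggestions that this sketch does not replicate.

```python
import csv
import json


def csv_to_jsonl(csv_path: str, jsonl_path: str) -> int:
    """Hypothetical helper: writes one JSON object per CSV row and
    returns the number of rows written."""
    n_rows = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            record = {"prompt": row["prompt"], "completion": row["completion"]}
            dst.write(json.dumps(record) + "\n")
            n_rows += 1
    return n_rows
```

For example, `csv_to_jsonl("training_set.csv", "training_set_manual.jsonl")` would write to a separate file, leaving the output of the CLI tool untouched.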

### <a id="train">Training the model</a>

With our file `training_set_prepared.jsonl` saved, we can create a new model for fine-tuning.

In [None]:
!openai api fine_tunes.create -t "training_set_prepared.jsonl" -m ada

The above command queues a fine-tuning job. We can tell that the fine-tuning process is complete when the shell command `openai api fine_tunes.list` returns a `fine_tuned_model` name. **This may take several minutes to complete**.
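
Rather than re-running the command by hand, the wait can be automated with a generic polling loop. The sketch below is an assumption-laden illustration: `wait_for_fine_tuned_model` is a hypothetical helper, and `fetch_model_name` is a callable you would supply yourself (for instance, one that parses the job information and returns the `fine_tuned_model` value, or `None` while the job is still running).

```python
import time
from typing import Callable, Optional


def wait_for_fine_tuned_model(
    fetch_model_name: Callable[[], Optional[str]],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Hypothetical helper: calls `fetch_model_name` until it returns a
    non-empty model name, or raises TimeoutError once the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        name = fetch_model_name()
        if name:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError("fine-tuning job did not finish in time")
        time.sleep(poll_seconds)
```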

### <a id="preds">Getting predictions from the trained model</a>

With the fine-tuned model created, we can use the Completions API to get its predictions for the training and validation sets. Set `FINE_TUNED_MODEL` to the name of your own fine-tuned model (from the previous step).

In [None]:
import openai
import os 

openai.api_key = os.environ.get("OPENAI_API_KEY")
FINE_TUNED_MODEL = "YOUR_FINE_TUNED_MODEL"

Let's test our fine-tuned model with a sample text:

In [None]:
response = openai.Completion.create(
  model=FINE_TUNED_MODEL,
  prompt="Today is going to be a great day!" + "\n\n###\n\n",
  temperature=0,
  max_tokens=1,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)

In [None]:
response

In [None]:
from typing import List

def get_predictions(df: pd.DataFrame) -> List[str]:
    """Uses the Completion API to get the fine-tuned model's
    predictions for each row of a dataset df. 
    
    Some models support batching, so you may want to adapt this
    function or use async requests."""
    preds = []
    
    for row in df["prompt"]:        
        response = openai.Completion.create(
            model=FINE_TUNED_MODEL,
            prompt=row,
            temperature=0,
            max_tokens=1,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )
        
        preds.append(response["choices"][0]["text"])
    
    return preds
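
Since the function above sends one request per row, a single transient failure would interrupt the whole loop. One way to adapt it is to wrap each request in a retry with exponential backoff. The sketch below is a generic illustration: `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplifying assumption that you would likely narrow to the errors you expect.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Hypothetical helper: retries `request`, doubling the wait after
    each failure, and re-raises the error from the last attempt."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:  # assumption: narrow this to transient errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Inside `get_predictions`, the call to the Completions API could then be passed in as a `lambda`, leaving the rest of the loop unchanged.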

In [None]:
preds_val = get_predictions(df_val_)

In [None]:
# Let's use just a sample from the training set
df_train_ = df_train_[:100].copy()  # copy, so that adding a column later does not write to a slice

In [None]:
preds_train = get_predictions(df_train_)

In [None]:
df_val_["predictions"] = preds_val
df_train_["predictions"] = preds_train

## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [None]:
!pip install openlayer

### <a id="client">Instantiating the client</a>

In [None]:
import openlayer


client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

### <a id="project">Creating a project on the platform</a>

In [None]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(
    name="Sentiment analysis with GPT",
    task_type=TaskType.TextClassification,
    description="Evaluating a GPT model"
)

### <a id="dataset">Uploading datasets</a>

**If you haven't fine-tuned a model but would like to see what the datasets look like, please download the csv files for the [training](https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/GPT-datasets/training_set_gpt.csv) and [validation](https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/GPT-datasets/validation_set_gpt.csv) sets. Then, load them into `df_train_` and `df_val_`, respectively.**

Before adding the datasets to a project, we need to do two things:
1. Process the labels and predictions columns, so that both contain zero-indexed integers (instead of strings).
2. Prepare a `dataset_config.yaml` file. This is a file that contains all the information needed by the Openlayer platform to utilize the dataset. It should include the column names, the class names, etc. For details on the fields of the `dataset_config.yaml` file, see the [API reference](https://reference.openlayer.com/reference/api/openlayer.OpenlayerClient.add_dataset.html#openlayer.OpenlayerClient.add_dataset).

Let's start by processing the labels and predictions columns:

In [None]:
labels_map = {" negative###": 0, " positive###": 1, " neutral###": 2}

df_train_["completion"] = df_train_["completion"].map(labels_map)
df_val_["completion"] = df_val_["completion"].map(labels_map)

In [None]:
predictions_map = {" negative": 0, " positive": 1, " neutral": 2}

df_train_["predictions"] = df_train_["predictions"].map(predictions_map)
df_val_["predictions"] = df_val_["predictions"].map(predictions_map)
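
Note that mapping a pandas column with a dict yields `NaN` for any value that is not one of the dict's keys, so an unexpected model output (for example, a token without the leading space) would silently become a missing prediction. As a rough sanity check, the sketch below counts such outputs in a plain list of raw predictions. The helper `count_unmapped` is hypothetical and the sample values are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

PREDICTIONS_MAP = {" negative": 0, " positive": 1, " neutral": 2}


def count_unmapped(raw_predictions: Iterable[str], mapping: Dict[str, int]) -> Counter:
    """Hypothetical helper: counts raw outputs that have no entry in `mapping`."""
    return Counter(p for p in raw_predictions if p not in mapping)


# Made-up sample: two of these outputs are not keys of the mapping
sample = [" positive", " negative", " pos", "positive", " neutral"]
print(count_unmapped(sample, PREDICTIONS_MAP))
```

Running it on the raw lists `preds_val` and `preds_train` would show whether any predictions are lost in the mapping step.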

Now, we can prepare the `dataset_config.yaml` files for the training and validation sets.

In [None]:
# Some variables that will go into the `dataset_config.yaml` file
column_names = list(df_train_.columns)
class_names = ["Negative", "Positive"]
label_column_name = "completion"
predictions_column_name = "predictions"
text_column_name = "prompt"

In [None]:
import yaml 

# Note the camelCase for the dict's keys
training_dataset_config = {
    "classNames": class_names,
    "columnNames": column_names,
    "textColumnName": text_column_name,
    "label": "training",
    "labelColumnName": label_column_name,
    "predictionsColumnName": predictions_column_name,
}

with open("training_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(training_dataset_config, dataset_config_file, default_flow_style=False)

In [None]:
import copy

validation_dataset_config = copy.deepcopy(training_dataset_config)

# In our case, the only fields that change are the `label`, from "training" -> "validation", and the `classNames`
validation_dataset_config["label"] = "validation"
validation_dataset_config["classNames"] = ["Negative", "Positive", "Neutral"]

with open("validation_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(validation_dataset_config, dataset_config_file, default_flow_style=False)
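
Before staging the datasets, it can help to check that each config agrees with its data. The sketch below is our own illustration, not part of the Openlayer API: `check_dataset_config` is a hypothetical helper, and it assumes that the zero-indexed labels are meant to index into `classNames`.

```python
from typing import Dict, List


def check_dataset_config(config: Dict, labels: List[int]) -> List[str]:
    """Hypothetical helper: returns a list of problems found (empty if none)."""
    problems = []
    columns = config["columnNames"]
    # Every column referenced by name should be listed in columnNames
    for key in ("textColumnName", "labelColumnName", "predictionsColumnName"):
        if config[key] not in columns:
            problems.append(f"{key}={config[key]!r} is not in columnNames")
    # Assumption: each label is an index into classNames
    n_classes = len(config["classNames"])
    out_of_range = sorted({label for label in labels if not 0 <= label < n_classes})
    if out_of_range:
        problems.append(f"labels {out_of_range} have no entry in classNames")
    return problems
```

For instance, `check_dataset_config(validation_dataset_config, df_val_["completion"].tolist())` should return an empty list if the validation config is consistent with the validation labels.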

In [None]:
# Training set
project.add_dataframe(
    dataset_df=df_train_,
    dataset_config_file_path="training_dataset_config.yaml",
)

In [None]:
# Validation set
project.add_dataframe(
    dataset_df=df_val_,
    dataset_config_file_path="validation_dataset_config.yaml",
)

We can check that both datasets are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

Now, we are going to add the GPT model as a shell model to the platform. To do so, we need to prepare a `model_config.yaml` file:

In [None]:
import yaml

model_config = {
    "name": "Fine-tuned ada model",
    "architectureType": "llm",
    "classNames": ["Negative", "Positive"],
}

with open("model_config.yaml", "w") as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)

In [None]:
project.add_model(
    model_config_file_path="model_config.yaml",
)

We can check that both datasets and the model are staged using the `project.status()` method.

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version to the platform. 

In [None]:
project.commit("Initial commit!")

In [None]:
project.status()

In [None]:
project.push()