# Template for Zero-Shot Classification with OpenAI Models

### Import all Modules

In [4]:
#IMPORTANT - to ensure package loading, first add the path of the utils folder to your system path
import os
import sys

module_dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(module_dir, os.pardir, "utils")))

In [5]:

import time

from tqdm import tqdm

import wandb
import openai
import pandas as pd
import glob

from utils import (
    dataset_has_format_errors,
    write_jsonl,
)
from dataload_utils import load_full_dataset,load_dataset_task_prompt_mappings
from label_utils import map_label_to_completion


# Read the API key (strip the trailing newline so the key is valid)
with open('OpenAI_key.txt') as f:
    openai.api_key = f.readline().strip()

### Setup Arguments and Data

In the following code block, set the key parameters that define the environment of your zero-shot inference run:

1. **WandB Project Name (`WANDB_PROJECT_NAME`)**: The name of the Weights & Biases (WandB) project where your run will be logged. WandB is a tool for tracking experiments, visualizing data, and sharing results. Setting the project name keeps the metrics, outputs, and logs of your runs organized under a single project for easy access and comparison. Choose a meaningful name that reflects the experiment. If you leave the argument empty, the run will not be tracked on WandB.

2. **Model Name (`MODEL_NAME`)**: The OpenAI model you want to run inference with. The model name, such as `'gpt-3.5-turbo-1106'`, refers to a particular version of the model. Make sure the name corresponds to a model that is available to your OpenAI account. You can find more models here: https://platform.openai.com/docs/models

3. **Completion Retries (`COMPLETION_RETRIES`)**: The maximum number of attempts made to generate a completion. A request may fail on the first try for various reasons (e.g., network issues, API errors). This setting provides resilience by retrying the request several times before treating it as a failure.
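
The retry behavior that `COMPLETION_RETRIES` controls can be sketched in isolation. The helper below is a minimal illustration and not part of this repository: `call_with_retries`, `max_retries`, and `wait_seconds` are hypothetical names, and `flaky_call` merely stands in for an API request.

```python
import time


def call_with_retries(fn, max_retries=10, wait_seconds=0.0):
    """Call fn() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            time.sleep(wait_seconds)
    raise RuntimeError(f"All {max_retries} attempts failed") from last_error


# Stand-in for an API request (hypothetical): fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky_call, max_retries=5)
print(result, attempts["count"])
```

A larger retry count makes a run more robust to transient errors, at the cost of waiting longer before a persistent failure surfaces.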


In [6]:
# WandB project settings and the model to run inference with
PROJECT_NAME = "chatGPT_template_1"
WANDB_PROJECT_NAME = "chatGPT_template_1"
MODEL_NAME = 'gpt-3.5-turbo-1106'
COMPLETION_RETRIES = 10

In the next code block, set the configuration variables that determine how inference is executed. They define the task, the data, and the model's behavior during inference.

1. **Task (`task`)**: Specify the type of task you want to run inference on. The task is represented by an integer, with each number corresponding to a different type of task (e.g., 1, 2, 3, etc.). You must select from the predefined choices, which are typically mapped to specific NLP tasks or scenarios.

2. **Dataset (`dataset`)**: Choose the dataset on which you want to run inference. Like tasks, datasets are identified by integers, and each number corresponds to a different dataset. Ensure that the dataset selected is relevant to the task at hand.

3. **Output Directory (`output_dir`)**: Define the path to the directory where the generated outputs are stored. The preprocessed data and the predictions are saved here.

4. **Random Seed (`seed`)**: Setting a random seed ensures that the results are reproducible. By using the same seed, you can achieve the same outcomes on repeated runs under identical conditions.

5. **Data Directory (`data_dir`)**: Specify the path to the directory containing the datasets you plan to use for inference.

6. **Label Usage (`use_full_labels`)**: This boolean variable determines whether to use the full label descriptions or abbreviated labels during inference. Setting it to `True` means full labels will be used.

7. **Dataset-Task Mappings File Path (`dataset_task_mappings_fp`)**: Define the path to the file containing mappings between datasets and tasks. This file is crucial for ensuring the correct dataset is used for the specified task.

8. **Rewrite DataFrame (`rewrite_df_in_openai`)**: A boolean that dictates whether the dataset file is uploaded to OpenAI again. If set to `True`, the file is re-uploaded even when a file of the same name was already uploaded; if set to `False`, an existing upload is reused.

9. **Number of Epochs (`n_epochs`)**: The number of epochs, where an epoch refers to one complete pass through the entire training dataset. No training happens in this notebook; the value is only logged in the WandB run configuration.

10. **Run Name (`run_name`)**: Give a unique name to your run, which will help you identify it later, especially when tracking multiple experiments or runs.

11. **Temperature (`temp`)**: Set the temperature for text generation. Temperature controls the randomness of the output; a lower temperature results in less random completions.

12. **Few-shot (`few_shot`)**: A boolean indicating whether to use a few-shot approach. When set to `True`, the few-shot prompt from the dataset-task mappings file is used instead of the zero-shot prompt.

13. **System-User Prompt Division (`system_user_prompt_division_line`)**: This integer sets how many lines, counted from the end of the prompt, form the user prompt; all preceding lines form the system prompt. Setting it correctly is critical for structuring the model input.
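
As a minimal sketch of how the division line splits a prompt, consider the toy example below. The function name `split_prompt` and the example prompt text are made up for illustration; the slicing follows the same idea as the prompt-splitting cell later in this notebook.

```python
def split_prompt(prompt, division_line):
    """Split a prompt: the last `division_line` lines become the user prompt."""
    lines = prompt.split("\n")
    system_prompt = "\n".join(lines[:-division_line]).strip()
    user_prompt_format = "\n".join(lines[-division_line:]).strip()
    return system_prompt, user_prompt_format


# Toy prompt (hypothetical), five lines long including one empty line
example_prompt = (
    "You are a careful annotator.\n"
    "Answer with one label only.\n"
    "Is the following tweet about moderation?\n"
    "\n"
    "{text}"
)

system_part, user_part = split_prompt(example_prompt, division_line=3)
print(repr(system_part))
print(repr(user_part))
```

With `division_line=3`, the question, the empty line, and the `{text}` placeholder end up in the user prompt, while the two instruction lines form the system prompt.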

**Customizing for Your Own Tasks:**
If you plan to run a custom task or use a dataset that is not predefined, you will need to make modifications to the `utils_src` file. This file contains all mappings for different datasets and tasks. Adding your custom task or dataset involves defining the new task or dataset number and specifying its characteristics and mappings in the `utils_src` file. This ensures that your custom task or dataset integrates seamlessly with the existing framework for training and inference.


In [7]:
# Configuration Variables

# Number of task to run inference on
task = 2  # As defined in dataset_task_mappings.csv 1-6 or more if you add them in label_utils

# Number of dataset to run inference on
dataset = 1  # As defined in dataset_task_mappings.csv

# Path to the directory containing the datasets
data_dir = '../../data'

#Name of dataset to run inference on
#NOTE: we recommend keeping the names of the training and evaluation sets in the way we provided.
eval_set_name = f"ds_{dataset}__task_{task}_eval_set"

# Path to the directory where outputs are stored
output_dir = '../../data'

# Random seed to use
seed = 2019

# Whether to use the full label
use_full_labels = True

# Path to the dataset-task mappings file
dataset_task_mappings_fp = os.path.normpath(os.path.join(module_dir, '..', '..', 'dataset_task_mappings.csv'))

# Whether to rewrite the dataframe in OpenAI format
rewrite_df_in_openai = True

# Number of epochs (only logged to WandB; this notebook does not train a model)
n_epochs = 3

# Name of the run - if not empty make sure the run identifies which dataset and task is used
run_name = ''

# Temperature to use when generating text
temp = 0.0

# Whether to use the few-shot prompt instead of the zero-shot prompt
few_shot = False

# Separation between system and user prompt
system_user_prompt_division_line = 3

The next cell shows how `project_run_name` is generated automatically when `run_name` is empty. Change it if needed.

In [8]:
project_run_name = run_name if run_name != '' else f'ds_{dataset}_t_{int(task)}_sample_0__fl_{str(use_full_labels)}_temp_{temp}_{MODEL_NAME}'
if few_shot:
    project_run_name += "_few_shot"

In [9]:
print(project_run_name)

ds_1_t_2_sample_0__fl_True_temp_0.0_gpt-3.5-turbo-1106


In [10]:
client = openai.OpenAI(api_key=openai.api_key)

### Define Utility Functions

In [11]:
def create_training_example(system_prompt, user_prompt_format, user_prompt_text, completion):
    return {'messages': [
        {'role': 'system',
         'content': system_prompt},

        {'role': 'user',
         'content': user_prompt_format.format(text=user_prompt_text)},

        {'role': 'assistant',
         'content': completion}
    ]}
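
To illustrate what `create_training_example` returns, here is a small usage sketch. The prompt strings and the tweet are invented for illustration, and the function is repeated so the snippet runs on its own.

```python
def create_training_example(system_prompt, user_prompt_format, user_prompt_text, completion):
    # Same function as in the cell above, repeated so this snippet is self-contained
    return {'messages': [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt_format.format(text=user_prompt_text)},
        {'role': 'assistant', 'content': completion},
    ]}


example = create_training_example(
    system_prompt="You are a careful annotator.",          # made-up system prompt
    user_prompt_format="Classify the following tweet:\n\n{text}",
    user_prompt_text="Moderation removed my post again.",  # made-up tweet
    completion="PROBLEM",
)

# For inference, the assistant message (the gold label) is dropped
messages_without_completion = example['messages'][:-1]
print([m['role'] for m in messages_without_completion])  # prints ['system', 'user']
```

Dropping the last message is what keeps the gold label out of the request sent to the model.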

In [12]:
#TODO: can any of these be moved to utils?
def upload_datasets_to_openai(output_dir, not_use_full_labels, rewrite_df_in_openai, datasets):

    uploaded_files = client.files.list()
    #store IDs created in this session for easier reference
    datasets_open_ai_metadata = list()

    df_id_metadata = pd.DataFrame(columns = ["df_name", "file_id"]) if not os.path.exists('datasets_open_ai_metadata.csv') \
        else pd.read_csv('datasets_open_ai_metadata.csv')

    for df_name, df in datasets.items():
        #file to send
        df_filename = f'{PROJECT_NAME}_{df_name}'
        df_jsonl_filename = os.path.join(output_dir, 'temp', df_filename+'.jsonl')
        write_jsonl(data_list=df['openai_instance_format'].tolist(), filename=df_jsonl_filename)

        if not_use_full_labels:
            df_filename += '_single_letter_labels'

        #check whether files with this name already exist in the OpenAI account
        matches_openai = [d.id for d in uploaded_files.data if d.filename == df_filename]

        #check if there are matches in the dataframe
        matches_stored_data = df_id_metadata.loc[df_id_metadata['df_name'] == df_filename, 'file_id'].values if df_filename in df_id_metadata['df_name'].values else []

        #skip uploading only if the ID in the local file corresponds to one of the IDs in the organization account
        if not rewrite_df_in_openai and len(matches_stored_data) > 0 and len(matches_openai) > 0:
            if matches_stored_data[-1] in matches_openai:
                print(f"Dataset {df_name} has already been uploaded to OpenAI during this project.")
            else:
                print(f"Dataset of this name has already been uploaded to OpenAI outside of this project under ID {matches_openai[0]}. Please add this ID to metadata manually if you don't want to re-upload this file.")
            continue
        
        #otherwise upload the files to OpenAI
        print(f"Uploading {df_filename} to OpenAI")
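        # open the file in a context manager so the handle is closed after the upload
        with open(df_jsonl_filename, "rb") as jsonl_file:
            df_response = client.files.create(file=jsonl_file, purpose="fine-tune")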
        df_file_id = df_response.id

        # Wait until the file is processed
        while True:
            file = client.files.retrieve(df_file_id)
            if file.status == "processed":
                break
            time.sleep(15)
        datasets_open_ai_metadata.append({'df_name': df_filename, 'file_id': df_file_id})

    #Store IDs of file uploaded in this session for later check
    df_id_metadata = pd.concat([df_id_metadata, pd.DataFrame(datasets_open_ai_metadata)])
    df_id_metadata.to_csv('datasets_open_ai_metadata.csv', index=False)

    return df_id_metadata
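
The polling loop above waits indefinitely for the status `"processed"`. A possible variant, sketched here under assumptions, also stops on a failure status or after a timeout. The helper `wait_for_status`, its parameters, and the `"error"` status value are assumptions for illustration and are not taken from this repository.

```python
import time


def wait_for_status(get_status, done="processed", failed=("error",),
                    timeout=600.0, interval=15.0):
    """Poll get_status() until it returns `done`, a failure status, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError(f"Processing failed with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for repeated status lookups (hypothetical sequence of statuses)
statuses = iter(["uploaded", "uploaded", "processed"])
final_status = wait_for_status(lambda: next(statuses), interval=0.0)
print(final_status)
```

In the upload function, `get_status` could be a small lambda that retrieves the file and returns its `status` attribute.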


## Main Implementation

In [13]:
# Initialize the Weights and Biases run
if WANDB_PROJECT_NAME != "":
    wandb.init(
        # set the wandb project where this run will be logged
        project=WANDB_PROJECT_NAME,
        name=project_run_name,
        # track hyperparameters and run metadata
        config = {
            "model": MODEL_NAME,
            "dataset": dataset,
            "task": task,
            "epochs": n_epochs,
            "temp": temp
        }
    )

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


[34m[1mwandb[0m: Currently logged in as: [33mmaria-korobeynikova[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Load and Process Data

In [14]:
dataset_idx, dataset_task_mappings = load_dataset_task_prompt_mappings(
    dataset_num=dataset, task_num=task, dataset_task_mappings_fp=dataset_task_mappings_fp)

In [15]:
# Get information specific to the dataset
label_column = dataset_task_mappings.loc[dataset_idx, "label_column"]
if few_shot:
    prompt = dataset_task_mappings.loc[dataset_idx, 'few_shot_prompt']
else:
    prompt = dataset_task_mappings.loc[dataset_idx, 'zero_shot_prompt']

labelset = dataset_task_mappings.loc[dataset_idx, "labelset_fullword" if use_full_labels else "labelset"].split(";")
labelset = [label.strip() for label in labelset]

In [16]:
system_prompt = ('\n'.join(prompt.split('\n')[:-system_user_prompt_division_line])).strip()
user_prompt_format = ('\n'.join(prompt.split('\n')[-system_user_prompt_division_line:])).strip()
print("This is where the user prompt starts:")
print(user_prompt_format)

# Log the system prompt and user_prompt_format as files in wandb (only if a run was initialized)
if wandb.run is not None:
    prompts_artifact = wandb.Artifact('prompts', type='prompts')
    with prompts_artifact.new_file('system_prompt.txt', mode='w', encoding="utf-8") as f:
        f.write(system_prompt)
    with prompts_artifact.new_file('user_prompt_format.txt', mode='w', encoding="utf-8") as f:
        f.write(user_prompt_format)
    wandb.run.log_artifact(prompts_artifact)

This is where the user prompt starts:
Now, is the following tweet describing content moderation as a PROBLEM, as a SOLUTION, or NEUTRAL?

{text}


<Artifact prompts>

In [19]:
datasets = load_full_dataset(data_dir, eval_set_name)
eval_df = datasets[eval_set_name]

In [20]:
preprocessed_output_dir = os.path.join(
    output_dir, 'preprocessed', 'full_name_labels' if use_full_labels else 'single_letter_labels')

In [21]:
#make an OpenAI type datafile to run predictions on
eval_df['completion_label'] = eval_df[label_column].map(lambda label: map_label_to_completion(label=label, task_num=task,
                                        full_label=use_full_labels)
)
eval_df['openai_instance_format'] = eval_df.apply(lambda row: create_training_example(
    system_prompt=system_prompt, user_prompt_format=user_prompt_format,
    user_prompt_text=row['text'],completion=row['completion_label']),axis=1)

eval_df['openai_instance_without_completion'] = eval_df['openai_instance_format'].map(lambda x: x['messages'][:-1])

#check for any errors
print(f'Check for errors in the prediction set: ')
assert not dataset_has_format_errors(eval_df['openai_instance_format'].tolist()), f"Errors found"

#export pre-processed output (optional)
os.makedirs(preprocessed_output_dir, exist_ok= True)
eval_df.to_csv(os.path.join(preprocessed_output_dir,eval_set_name + '.csv'), index=False)

Check for errors in the prediction set: 
No errors found


In [22]:
# Create jsonl file and upload to OpenAI
df_id_metadata = upload_datasets_to_openai(output_dir, not use_full_labels, rewrite_df_in_openai, datasets)

Uploading chatGPT_template_1_ds_1__task_2_eval_set to OpenAI


### Run predictions

In [None]:
predictions = []
for messages in tqdm(eval_df['openai_instance_without_completion'].tolist()):
    # Retry the completion at most COMPLETION_RETRIES times
    num_retries = 0
    response = None
    while num_retries < COMPLETION_RETRIES and response is None:
        try:
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=messages,
                temperature=temp,
                n=1
            )
        except Exception as e:
            print('Error getting predictions. Retrying...')
            time.sleep(5)
            num_retries += 1
            if num_retries >= COMPLETION_RETRIES:
                print('Maximum number of retries reached')
                raise e
    predictions.append(response.choices[0].message.content)

# Add predictions to df
eval_df['prediction'] = predictions

In [34]:
# Store output
predictions_output_dir = os.path.join(output_dir, 'predictions', MODEL_NAME.replace("/", "_"))
os.makedirs(predictions_output_dir, exist_ok=True)
eval_df.to_csv(os.path.join(predictions_output_dir, f"{project_run_name}.csv"), index=False)

### Run preliminary accuracy for WandB

Unlike in the fine-tuning script, the zero-shot accuracy obtained by comparing the raw ChatGPT output with the original labels will be 0, because the extra words in the output need to be stripped first. Please see ``02-measure_performance_chatGPT.ipynb`` for suggestions.
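
One possible way to strip the extra words is sketched below. This is an assumption-laden illustration, not the approach of the notebook referenced above: the helper `extract_label`, the example outputs, and the gold labels are invented, and the label set is taken from the example prompt shown earlier.

```python
import re


def extract_label(prediction, labelset):
    """Return the single label mentioned in the model output, or None if ambiguous."""
    found = [
        label for label in labelset
        if re.search(rf"\b{re.escape(label)}\b", prediction, flags=re.IGNORECASE)
    ]
    return found[0] if len(found) == 1 else None


labelset = ["PROBLEM", "SOLUTION", "NEUTRAL"]

# Invented model outputs and gold labels, for illustration only
raw_predictions = [
    "The tweet describes content moderation as a PROBLEM.",
    "Neutral",
    "It is both a problem and a solution.",
]
gold_labels = ["PROBLEM", "NEUTRAL", "SOLUTION"]

cleaned = [extract_label(p, labelset) for p in raw_predictions]
accuracy = sum(c == g for c, g in zip(cleaned, gold_labels)) / len(gold_labels)
print(cleaned, accuracy)
```

Outputs that mention no label or more than one label are mapped to `None` and count as errors here, which is a conservative choice; other handling may suit your task better.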