# Run Fine-tune CoT on OpenAI using our `oai` module

This notebook contains code to (1) generate reasoning samples from teacher models (e.g., GPT-3 175B `text-davinci-002`), (2) fine-tune student models (e.g., GPT-3 0.3B `ada`) and (3) generate and evaluate samples from fine-tuned student models.

- To run from scratch, first download and save original benchmark data (see README).
- To use existing teacher-generated samples, first download and save original benchmark data and teacher completion data (see README). Then, replace the completion_key `zs_cot_test` with `zs_cot` in the code below.

### TODO: Set OpenAI Key

Create an account on OpenAI and retrieve your API key. Running the experiments will incur fees on your OpenAI account.

In [1]:
import os
os.environ['API_KEY'] = ""
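
Because the key above is left as an empty string, a forgotten edit would only surface later as an API error. The following is a minimal sketch of an early check; `require_env` is a hypothetical helper and is not part of the `oai` module.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper (not part of the `oai` module): raises early if the
    variable is unset, empty, or whitespace only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is empty; set it before running the cells below."
        )
    return value
```

Calling `require_env("API_KEY")` right after the cell above would fail fast if the key is still empty.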

### Imports and Parameters

In [2]:
import sys
import os

# Get the parent directory
parent_dir = os.path.dirname(os.getcwd())

# Add the module's directory to Python path
module_dir = os.path.join(parent_dir, 'src')
sys.path.append(module_dir)

In [3]:
from data.completion_dataset import CompletionMetadata, CompletionDataset
from oai.inference import infer_completion_data

In [4]:
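teacher_base_model = "gpt-3.5-turbo-instruct"
# Student model to fine-tune (used by the fine-tuning and student inference cells below).
# Keep exactly one of the following lines uncommented.
base_model = "ada"                         # GPT-3 (0.3B)
# base_model = "babbage"                   # GPT-3 (1.3B)
# base_model = "curie"                     # GPT-3 (6.7B)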
dataset_key = "date_understanding"

## Infer teacher completions using OpenAI (generate CompletionDataset)

In [5]:
# Note, completion_key identifies the method used to generate completions
# Note, prediction_template selects the prediction template from those pre-defined in
#       `oai.data.format.Formatter`.
completion_metadata = CompletionMetadata(base_model=teacher_base_model, completion_key="zs_cot_test",
                                         dataset_key=dataset_key, prediction_template="zs_cot")

In [6]:
# Run Zero-shot-CoT step 1 (rationale generation)
# Note, sample_indices=[0] infers only the first sample; pass sample_indices=None to infer on all samples
completion_dataset = infer_completion_data(completion_metadata, zs_cot_step=1,
                                           sample_indices=[0], augs=1, temperature=0,
                                           max_tokens=128)

Loaded 1 samples from:
/Users/khuongle/Documents/cs470/reasoning-teacher/saved/completion_data/B_gpt-3.5-turbo-instruct__C_zs_cot_test/D_date_understanding.json
Inferring completions for 1 remaining samples (total=1)


Inferring completions via OpenAI: 100%|██████████| 1/1 [00:01<00:00,  1.46s/it]


In [7]:
# Run Zero-shot-CoT step 2 (answer)
completion_dataset = infer_completion_data(completion_metadata, zs_cot_step=2,
                                           sample_indices=[0], augs=1, temperature=0,
                                           max_tokens=128)

Loaded 1 samples from:
/Users/khuongle/Documents/cs470/reasoning-teacher/saved/completion_data/B_gpt-3.5-turbo-instruct__C_zs_cot_test/D_date_understanding.json
Inferring completions for 1 remaining samples (total=1)


Inferring completions via OpenAI: 100%|██████████| 1/1 [00:00<00:00,  1.17it/s]
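
The two steps above follow the general Zero-shot-CoT recipe: first elicit a rationale, then append that rationale to the prompt and ask for the final answer. The sketch below illustrates the idea with the commonly used trigger phrases. The function names are hypothetical, and the exact wording of the `zs_cot` prediction template in this repository is an assumption and may differ.

```python
def build_step1_prompt(question: str) -> str:
    """Step 1 (rationale generation): ask the model to reason step by step."""
    return f"Q: {question}\nA: Let's think step by step."


def build_step2_prompt(question: str, rationale: str) -> str:
    """Step 2 (answer extraction): append the rationale, then ask for the answer."""
    step1 = build_step1_prompt(question)
    return f"{step1} {rationale.strip()}\nTherefore, the answer is"


if __name__ == "__main__":
    q = "Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?"
    print(build_step1_prompt(q))
    print(build_step2_prompt(q, "Yesterday was 04/30/2021, so today is one day later."))
```

The completion returned for the step 1 prompt plays the role of `rationale` in step 2, which is why the two `infer_completion_data` calls must run in order.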


## Load CompletionDataset and evaluate test set

In [7]:
from data.completion_dataset import CompletionIdentifier
from data.split import load_train_test_split 
from evaluation.evaluator import Evaluator
from evaluation.summary import summarize_evaluation 

In [8]:
completion_identifier = CompletionIdentifier(teacher_base_model, "zs_cot_test", dataset_key)
completion_dataset = CompletionDataset.load(completion_identifier)
# Note, completion_metadata can be used instead of completion_identifier, as shown below
# completion_dataset = CompletionDataset.load(completion_metadata)
train, test = load_train_test_split(dataset_key)

In [9]:
evaluator = Evaluator.for_completion_dataset(completion_dataset)
evaluation = evaluator.evaluate_completion_dataset(completion_dataset, test)

In [10]:
evaluation.head()

Unnamed: 0,sample_index,completion_index,correct,contains_answer,correct_format,complete
0,0,0,True,True,True,True
1,9,0,True,True,True,True
2,23,0,True,True,True,True
3,25,0,True,True,True,True
4,28,0,True,True,True,True


In [11]:
summarize_evaluation(evaluation)

{'accuracy': 0.7477477477477478,
 'contains_answer': 0.7477477477477478,
 'correct_format': 1.0,
 'complete': 1.0}
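
One plausible reading of the summary is that each value is the fraction of evaluated rows whose boolean flag is true, with `accuracy` derived from the `correct` column. Under that assumption, 0.7477 would correspond to 83 of 111 test samples. The sketch below reproduces the computation on plain dictionaries; `summarize_flags` is a hypothetical stand-in, not the implementation of `summarize_evaluation`.

```python
from typing import Dict, List


def summarize_flags(rows: List[Dict[str, bool]]) -> Dict[str, float]:
    """Return the fraction of rows for which each boolean flag is true.

    Assumption: `accuracy` is the mean of the `correct` column.
    """
    if not rows:
        raise ValueError("cannot summarize an empty evaluation")
    n = len(rows)

    def fraction(column: str) -> float:
        return sum(bool(row[column]) for row in rows) / n

    return {
        "accuracy": fraction("correct"),
        "contains_answer": fraction("contains_answer"),
        "correct_format": fraction("correct_format"),
        "complete": fraction("complete"),
    }


if __name__ == "__main__":
    ok = {"correct": True, "contains_answer": True, "correct_format": True, "complete": True}
    wrong = {"correct": False, "contains_answer": False, "correct_format": True, "complete": True}
    print(summarize_flags([ok, ok, ok, wrong]))
```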

## Create fine-tune `File` and `Finetune` using training set

In [12]:
from oai.finetune import init_finetune, generate_finetune_data_from_completion_dataset
from oai.utils.api_wrapper import fetch_model_ids

In [13]:
# Replace "zs_cot_test" with "zs_cot" to use our teacher-generated completions (see README for how to download).
completion_identifier = CompletionIdentifier(teacher_base_model, "zs_cot_test", dataset_key)
completion_dataset = CompletionDataset.load(completion_identifier)
train, test = load_train_test_split(dataset_key)

In [14]:
finetune_key = "zs_cot_test_{}".format(dataset_key)
train_key = "ft_cot_test"

In [15]:
# Note, finetune_key is a unique identifier for the finetuning data and should contain the source dataset
generate_finetune_data_from_completion_dataset(completion_dataset=completion_dataset,
                                               prediction_template="ft_cot_token",
                                               finetune_key=finetune_key,
                                               sample_indices=train,
                                               only_correct=True,  # default
                                              )

Saving 171 fine-tuning samples to /Users/itsnamgyu/code/temp/reasoning-teacher/saved/finetune_data/P_openai/F_zs_cot_test_date_understanding.jsonl


In [16]:
# Inspect finetune data
import json
from paths import get_finetune_data_path
with open(get_finetune_data_path("openai", finetune_key)) as f:
    print(json.dumps(json.loads(f.readline()), indent=4))

{
    "prompt": "Yesterday was April 30, 2021. What is the date one year ago from today in MM/DD/YYYY?\nWhich choice is true? Answer choices: (A) 05/01/1971, (B) 04/01/2020, (C) 05/15/2020, (D) 05/01/2020, (E) 05/08/2020. ###",
    "completion": " One year ago from today would be 2020. Today is 2021. 2020 is two years ago. Two years ago from today would be 05/01/2019. --> D END"
}
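
The sample above suggests a simple layout: the prompt ends with a ` ###` separator, and the completion starts with a space, carries the rationale, marks the answer with `-->`, and closes with ` END` as a stop sequence. The sketch below rebuilds that layout from its parts. The function and constant names are hypothetical, and the actual `ft_cot_token` template may handle details differently.

```python
import json
from typing import Dict, Iterable

PROMPT_SUFFIX = " ###"   # separates the question from the completion
ANSWER_MARKER = " -->"   # separates the rationale from the final answer
STOP_TOKEN = " END"      # marks the end of the completion


def format_finetune_sample(question: str, rationale: str, answer: str) -> Dict[str, str]:
    """Build one prompt/completion pair in the layout shown above (assumed)."""
    return {
        "prompt": question.strip() + PROMPT_SUFFIX,
        "completion": " " + rationale.strip() + ANSWER_MARKER + " " + answer.strip() + STOP_TOKEN,
    }


def to_jsonl(samples: Iterable[Dict[str, str]]) -> str:
    """Serialize samples as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(sample) + "\n" for sample in samples)


if __name__ == "__main__":
    sample = format_finetune_sample("What is 2 + 2? (A) 3, (B) 4.", "2 + 2 equals 4.", "B")
    print(to_jsonl([sample]), end="")
```

A fixed separator and stop token give the fine-tuned model an unambiguous signal for where the prompt ends and where generation should stop.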


In [17]:
# Note, train_key identifies the method used to train the model, i.e., the method used to fine-tune the base model.
init_finetune(finetune_key, base_model, dataset_key, train_key)



'B_ada__D_date_understanding__T_ft_cot_test'

### Fetch fine-tuned `Model` id

Call this function repeatedly until your `Finetune` has finished; the model id can only be fetched after that. Fine-tuning typically takes between 5 minutes and 1 hour.

In [19]:
fetch_model_ids()

No model ids to fetch


True
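
Repeated manual calls can be replaced by a small polling loop. The sketch below is generic: `poll_until` is a hypothetical helper, and `check` is any zero-argument callable that returns `True` once the fine-tuned model is available. Whether the return value of `fetch_model_ids()` carries that meaning is not confirmed by this notebook, so the predicate is left to the reader.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    interval_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `check` every `interval_seconds` until it returns True.

    Returns True as soon as `check` succeeds, or False once
    `timeout_seconds` have elapsed without success.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

The defaults (one check per minute, one hour in total) match the typical fine-tuning duration mentioned above; `sleep` and `clock` are injectable so the loop can be exercised without waiting.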

### Access OpenAI metadata

We use metadata files to map our identifiers (keys) to the identifiers (ids) used by OpenAI objects.
These can be accessed manually, as follows.

In [20]:
from oai.utils.metadata import get_file_id, get_finetune_id, get_model_id, get_model_key

In [21]:
# Note that `base_model`, `dataset_key`, `train_key` are joined together to form a `model_key` which
# identifies fine-tuned models. There is a one-to-one-to-one mapping between a model_key, Finetune object,
# and Model object.
model_key = get_model_key(base_model, dataset_key, train_key)
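
Judging from the key returned by `init_finetune` earlier (`'B_ada__D_date_understanding__T_ft_cot_test'`), the three parts appear to be prefixed with `B_`, `D_`, and `T_` and joined with double underscores. The sketch below mirrors that pattern; `build_model_key` and `parse_model_key` are hypothetical helpers inferred from that single output, not the implementation of `get_model_key`.

```python
from typing import Tuple


def build_model_key(base_model: str, dataset_key: str, train_key: str) -> str:
    """Join the three identifiers in the pattern observed in the notebook output."""
    return f"B_{base_model}__D_{dataset_key}__T_{train_key}"


def parse_model_key(model_key: str) -> Tuple[str, str, str]:
    """Invert `build_model_key`; assumes no component contains a double underscore."""
    parts = model_key.split("__")
    prefixes = ("B_", "D_", "T_")
    if len(parts) != 3 or not all(p.startswith(pre) for p, pre in zip(parts, prefixes)):
        raise ValueError(f"unexpected model key format: {model_key!r}")
    base_model, dataset_key, train_key = (p[2:] for p in parts)
    return base_model, dataset_key, train_key


if __name__ == "__main__":
    key = build_model_key("ada", "date_understanding", "ft_cot_test")
    print(key)  # B_ada__D_date_understanding__T_ft_cot_test
    print(parse_model_key(key))
```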

In [22]:
# Note that our `finetune_key` identifies the fine-tuning "data" and is therefore mapped to a File object
# rather than a Finetune object.
get_file_id(finetune_key)

'file-3lwlV7lJRebTr0JTniuZ7lCX'

In [23]:
get_finetune_id(model_key)

'ft-ord6Qs8vmXQI8VjVWrfNTrTs'

In [24]:
get_model_id(model_key)  # fetched by `fetch_model_ids()`

'ada:ft-namgyu-ho-2023-06-11-04-37-04'

## Infer student completions

We only infer test set samples for evaluation.

In [25]:
# Note, completion_key and train_key are both "ft_cot_test". Recall that completion_key refers to
# the method used to generate completions by the student model, and train_key refers to the method
# used to train the student model.
completion_metadata = CompletionMetadata(base_model=base_model, completion_key="ft_cot_test",
                                         dataset_key=dataset_key, finetune_key=finetune_key,
                                         prediction_template="ft_cot_token",
                                         train_key=train_key, epoch=None)
train, test = load_train_test_split(dataset_key)

In [26]:
# Note, `infer_completion_data` will find our new student model (that we fetched above) by using
#       `base_model`, `dataset_key`, and `train_key`, which are specified in `completion_metadata`.
completion_dataset = infer_completion_data(completion_metadata, zs_cot_step=None,
                                           sample_indices=test, augs=1, temperature=0,
                                           max_tokens=1024)  # note, we use 1024 tokens for student inference

Initializing new CompletionDataset at:
/Users/itsnamgyu/code/temp/reasoning-teacher/saved/completion_data/B_ada__C_ft_cot_test/D_date_understanding__T_ft_cot_test.json
Inferring completions for 111 remaining samples (total=111)


Inferring completions via OpenAI: 100%|███████████████████████████| 111/111 [00:12<00:00,  8.72it/s]


## Evaluate student completions

In [27]:
completion_identifier = CompletionIdentifier(base_model, completion_key="ft_cot_test", dataset_key=dataset_key,
                                             train_key="ft_cot_test")
completion_dataset = CompletionDataset.load(completion_identifier)
train, test = load_train_test_split(dataset_key)

In [28]:
evaluator = Evaluator(dataset_key, "ft_cot_token")
evaluation = evaluator.evaluate_completion_dataset(completion_dataset, test)

In [29]:
summarize_evaluation(evaluation)

{'accuracy': 0.12612612612612611,
 'contains_answer': 0.12612612612612611,
 'correct_format': 0.9819819819819819,
 'complete': 0.9819819819819819}