# Divergent Thinking Scoring with GPT-3

<a href="https://colab.research.google.com/github/massivetexts/llm_aut_study/blob/main/notebooks/GPT-3 AUT Scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code is for the GPT-3 portion of experiments in Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2022). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. http://dx.doi.org/10.13140/RG.2.2.32393.31840.

GPT-3 is a large language model from OpenAI ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). Its weights were not released publicly; it is offered as a hosted service instead. That limits what you can do with your trained models and how you can iterate on them, but there are also benefits: the service is well built, and this is the easiest model from our paper to re-apply.

In [2]:
#@title Prep, Installs, imports
!pip -qq install openai wandb
import openai
import time
from openai.wandb_logger import WandbLogger
from pathlib import Path
import numpy as np
import json
from tqdm.auto import tqdm
import pandas as pd
from scipy.spatial.distance import cosine
tqdm.pandas()

#@markdown Point to a text file with your OpenAI key. GPT-3 is a hosted service and fine-tuning/hosting of models is done on their servers.
openai.api_key_path = '/content/drive/MyDrive/keys/openaikey.txt' #@param {type:'string'}

In [3]:
#@markdown Working directory.
base_dir = Path('drive/MyDrive/Grants/MOTES/') #@param { type: 'raw' }
#@markdown Where Ground truth data is held (from [Process AUT GT.ipynb](https://colab.research.google.com/github/massivetexts/llm_aut_study/blob/main/notebook/Process_AUT_GT.ipynb))
gt_dir = base_dir / 'Data' / 'aut_ground_truth' #@param { type: 'raw' }
print("GT options", [x.name for x in gt_dir.glob('*tar.gz')])
#@markdown Name of the ground truth tar.gz file
data_subdir = "gt_alltests2" #@param ['gt_main2', 'gt_byparticipant', 'gt_byprompt', 'all'] {allow-input: true}
#@markdown Where outputs are saved
evaldir = base_dir / 'Data' / 'evaluation' / data_subdir #@param { type: 'raw' }
evaldir.mkdir(exist_ok=True)

# copy locally and unzip
!cp "{gt_dir}/{data_subdir}.tar.gz" .
!rm -rf data
!tar -xf {data_subdir}.tar.gz
data_dir = Path(f"data/{data_subdir}")
print("Data decompressed to", data_dir)

# this is unsupervised, so go straight to test (unless this is a special case)
splits = [x.name for x in data_dir.iterdir() if x.is_dir()]
print(splits)
if 'test' in splits:
    testdata = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / 'test').glob('*json')])
    testdata.sample()

GT options ['gt_main.tar.gz', 'gt_bypart3.tar.gz', 'gt_byprompt4.tar.gz', 'gt_byparticipant.tar.gz', 'gt_byprompt.tar.gz', 'all.tar.gz', 'gt_main2.tar.gz', 'gt_main_std.tar.gz', 'gt_alltests1.tar.gz', 'gt_alltests2.tar.gz']
Data decompressed to data/gt_alltests2
['group2', 'group3', 'val', 'group1']


# 1. Using Embeddings

In [None]:
#@markdown Class def: GPTEmbedding
# notes: max is 2048 tokens, replace '\n' with ' '
# https://beta.openai.com/docs/api-reference/embeddings/create
#@markdown Init class as `embedder`
embedmodel = "text-similarity-babbage-001" #@param ["text-similarity-ada-001","text-similarity-babbage-001","text-similarity-curie-001","text-similarity-davinci-001"]
embedding_dir = base_dir / 'Data' / 'gpt-embeddings' #@param {type:'raw'}

#Original JSON model deleted, see pre-deletion in revision 'ConvertedToParquet'
class GPTEmbeddingsParquet():
    ''' An alternative to GPT Embedding that uses parquet rather than JSON via Pyson '''
    PRICING = {
        "text-similarity-ada-001":	0.0080,
        "text-similarity-babbage-001":	0.0120,
        "text-similarity-curie-001":	0.0600,
        "text-similarity-davinci-001":	0.6000,
    } # per thousand tokens

    def __init__(self, model="text-similarity-ada-001", parquet_root='gpt.parquet'):

        assert model in self.PRICING.keys()
        self.parquet_root = Path(parquet_root)
        self.parquet_root.mkdir(exist_ok=True)
        self.df = pd.read_parquet(self.parquet_root)
        if self.df.empty:
            self.df = pd.DataFrame([], columns=['phrase', 'model', 'usage', 'embedding'])
        self.model = model
        self.total_tokens = 0
        self.preload = None
        self._buffer = []

    def _clean(self, phrase):
        # replace newlines with spaces (per the API notes above) so words don't fuse together
        return phrase.replace('\n', ' ')

    def full_embedding(self, phrase, autocommit=True, autocommit_every=400):
        ''' Load from parquet or buffer, else make a call to OpenAI (and cache) '''

        cleanphrase = self._clean(phrase)

        results = self.query(phrase)
        if results.empty:
            embed = openai.Embedding.create(model=self.model, input=cleanphrase)
            input = {
                "phrase": cleanphrase,
                "model": self.model,
                "usage": embed.usage.total_tokens,
                "embedding": embed.data[0]['embedding']
                }
            self.total_tokens += embed.usage.total_tokens
            self._buffer.append(input)
            if autocommit and (len(self._buffer) >= autocommit_every):
                self.commit()
            return input
        else:
            return results

    def embedding(self, phrase, autocommit=True, autocommit_every=400):
        ''' Get just the vector, as an array '''
        full = self.full_embedding(phrase,
                                   autocommit=autocommit,
                                   autocommit_every=autocommit_every)
        return np.array(full['embedding'])

    def query(self, phrase):
        cleanphrase = self._clean(phrase)
        # cached phrases are stored in cleaned form, so match on the cleaned phrase
        results = self.df[(self.df.phrase == cleanphrase) & (self.df.model == self.model)]
        if not results.empty:
            return results.iloc[0]

        results = self._query_buffer(cleanphrase)
        if len(results):
            return pd.Series(results[0])
        else:
            return pd.Series(dtype='object') # empty series
    
    def _query_buffer(self, phrase, early_stop = True):
        results = []
        for entry in self._buffer:
            if (entry['model'] == self.model) and (entry['phrase'] == phrase):
                results.append(entry)
                if early_stop:
                    break
        return results

    def commit(self):
        ''' Write jsonl buffer to parquet '''
        if len(self._buffer) == 0:
            return
        n = len(list(self.parquet_root.glob('*.parquet')))
        outname = self.parquet_root / f"{n+1:04.0f}.parquet"
        
        newfile = pd.DataFrame(self._buffer)
        newfile.to_parquet(outname, compression='snappy')
        self.df = pd.read_parquet(self.parquet_root)
        self._buffer = []
        self.total_tokens = 0

    def getAll(self):
        ''' Get all cached rows for the current model '''
        return self.df[self.df.model == self.model]

    def get_cost(self, all_runs=False, all_models=False):
        ''' Get cost in dollars. If all_runs=True, count up all token use for the model'''
        if all_models:
            df = pd.DataFrame(self._buffer, columns=['phrase', 'model', 'usage', 'embedding'])
            if all_runs:
                df = pd.concat([df, self.df])
            by_model = df.groupby('model').usage.sum() / 1000 * pd.Series(self.PRICING)
            return by_model.fillna(0).round(2)
        else:
            total_tokens = self.total_tokens
            if all_runs:
                past_tokens = self.df[self.df.model == self.model].usage.sum()
                total_tokens += past_tokens
            return total_tokens/1000*self.PRICING[self.model]

embedder = GPTEmbeddingsParquet(embedmodel, parquet_root=embedding_dir)
embedder

<__main__.GPTEmbeddingsParquet at 0x7f29fdb4ab10>
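
The class above avoids paying for the same embedding twice: it checks the committed parquet data, then the uncommitted buffer, and only then calls the API. The lookup order can be sketched in isolation as below. This is a minimal illustration, not the class itself: `CachedEmbedder`, `fake_fetch`, and the in-memory dictionaries are hypothetical stand-ins for the parquet files and the paid API call.

```python
class CachedEmbedder:
    """Toy two-level cache: committed store, then uncommitted buffer, then fetch."""

    def __init__(self, fetch, commit_every=2):
        self.fetch = fetch              # hypothetical stand-in for the paid API call
        self.store = {}                 # plays the role of the committed parquet files
        self.buffer = {}                # plays the role of the in-memory buffer
        self.commit_every = commit_every
        self.calls = 0                  # number of (simulated) API calls made

    def embedding(self, phrase):
        key = phrase.replace("\n", " ")  # normalise before looking anything up
        if key in self.store:
            return self.store[key]
        if key in self.buffer:
            return self.buffer[key]
        self.calls += 1
        vec = self.fetch(key)
        self.buffer[key] = vec
        if len(self.buffer) >= self.commit_every:
            self.commit()
        return vec

    def commit(self):
        # move buffered results into the committed store
        self.store.update(self.buffer)
        self.buffer = {}


def fake_fetch(text):
    # deterministic placeholder vector, not a real embedding
    return [float(len(text)), float(text.count(" "))]


demo = CachedEmbedder(fake_fetch)
demo.embedding("brick")
demo.embedding("brick")        # served from the buffer, no new call
demo.embedding("door\nstop")   # second call; buffer reaches 2 and is committed
print(demo.calls, sorted(demo.store))
```

Cleaning the phrase before both lookup and storage matters here: if the two disagree, a cached phrase is never found again and gets re-billed.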

In [None]:
prompt_emb = testdata.drop_duplicates('prompt')[['prompt', 'question']]
prompt_emb['pemb'] = prompt_emb.prompt.progress_apply(lambda x: embedder.embedding(x))
#prompt_emb['qemb'] = prompt_emb.question.progress_apply(lambda x: embedder.embedding(x))
testdata['remb'] = testdata.response.progress_apply(lambda x: embedder.embedding(x))
embedder.commit()

In [None]:
combined = testdata.merge(prompt_emb)
combined['predicted'] = combined.apply(lambda x: cosine(x['pemb'], x['remb']), axis=1)
#combined['predicted_against_question'] = combined.apply(lambda x: cosine(x['qemb'], x['remb']), axis=1)
#combined['predicted_combined'] = combined[['predicted', 'predicted_against_prompt']].mean(1)
combined['src'] = combined['id'].apply(lambda x: x.split('_')[0].split('-')[0])
combined['model'] = 'gpt-emb-' + embedder.model
output = combined[['id', 'model', 'participant', 'prompt', 'target', 'predicted', 'src']]
output.to_csv(evaldir / f'gpt-emb-{embedder.model}.csv')

In [None]:
display(combined.corr().loc['predicted', 'target'].round(4))
print('By task')
combined.groupby('src').corr().round(4)['target'].loc[(slice(None), 'predicted'),].sort_values()

0.1727

By task


src               
snbmo09  predicted    0.1509
bs12     predicted    0.2142
motesf   predicted    0.2143
setal08  predicted    0.2434
hmsl     predicted    0.2525
snb17    predicted    0.3191
betal18  predicted    0.3204
motesp   predicted    0.3397
dod20    predicted    0.3517
Name: target, dtype: float64

In [None]:
combined.groupby('prompt').corr().round(4)['target'].loc[(slice(None), 'predicted'),].sort_values()

prompt               
spoon       predicted    0.1113
brick       predicted    0.2048
hat         predicted    0.2360
lightbulb   predicted    0.2640
knife       predicted    0.2789
box         predicted    0.2801
tire        predicted    0.2801
book        predicted    0.2809
backpack    predicted    0.2847
table       predicted    0.2959
rope        predicted    0.2987
bottle      predicted    0.3257
sock        predicted    0.3622
paperclip   predicted    0.3712
pants       predicted    0.3835
shovel      predicted    0.4150
fork        predicted    0.4215
pencil      predicted    0.4649
ball        predicted    0.4652
toothbrush  predicted    0.5528
shoe        predicted    0.6448
Name: target, dtype: float64
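
The `predicted` column in this section is a cosine *distance* between the prompt embedding and the response embedding, so larger values mean the response is semantically further from the prompt. As a rough sketch of what that quantity is, the pure-Python function below computes the same value as `scipy.spatial.distance.cosine`; the three-dimensional vectors are made-up toy values, not real GPT-3 embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (hypothetical values).
prompt_vec = [1.0, 0.0, 0.0]
common_use = [0.9, 0.1, 0.0]    # points in nearly the same direction as the prompt
unusual_use = [0.1, 0.9, 0.3]   # points somewhere quite different

near = cosine_distance(prompt_vec, common_use)
far = cosine_distance(prompt_vec, unusual_use)
print(near < far)
```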

## API Costs

- `ada`: $`0.07`

In [None]:
embedder.get_cost(all_runs=True, all_models=True)

text-similarity-ada-001        0.55
text-similarity-babbage-001    0.73
text-similarity-curie-001      0.00
text-similarity-davinci-001    0.00
dtype: float64

In [None]:
embedder.df[embedder.df.model.str.contains('babbage')][['usage']].sum()

usage    61244
dtype: int64
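
As a sanity check, the babbage figure reported by `get_cost` can be reproduced by hand from the token count above and the per-thousand-token price in the `PRICING` table. A small sketch (the function name `embedding_cost` is just for illustration):

```python
# Prices per thousand tokens, copied from the PRICING table in the class above.
PRICE_PER_1K = {
    "text-similarity-ada-001": 0.0080,
    "text-similarity-babbage-001": 0.0120,
    "text-similarity-curie-001": 0.0600,
    "text-similarity-davinci-001": 0.6000,
}


def embedding_cost(tokens, model):
    """Cost in dollars for a given number of tokens."""
    return tokens / 1000 * PRICE_PER_1K[model]


babbage_cost = embedding_cost(61244, "text-similarity-babbage-001")
print(round(babbage_cost, 2))  # 0.73
```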

# 2. Fine-tuning

## Fine-tune training

In [None]:
#@title Prep data
#@markdown completions are multiplied by 10 and rounded
all_data = []
for split in splits:
    if (split == 'val') and ('byprompt' in data_subdir):
        continue
    print(f'preparing {split} data')
    df = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / split).glob('*json')])
    df['split'] = split
    all_data.append(df)
all_data = pd.concat(all_data)

def gt_preparation(df):
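    df = df[~df.target.isna()].copy()  # copy, so the assignments below don't write to a slice
    df['response'] = df.response.str.replace('\n', ' ')
    # round before casting: int() alone truncates, e.g. a target of 2.37 would become 23, not 24
    df['completion'] = df.target.apply(lambda x: f' {int(round(x*10))}')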

    if 'type' not in df.columns:
        # this is the functionality for the first LLM paper, which was AUT only and did not include a 'type' column
        df['gptprompt'] = df.apply(lambda x: f"AUT Prompt:{x['prompt']}\nResponse:{x['response']}\nScore:\n", 1)
    else:
        # construct prompts for everything
        match = df['type'] == 'uses'
        df.loc[match, 'gptprompt'] = df.loc[match].apply(lambda x: f"DT Uses Prompt:{x['prompt']}\nResponse:{x['response']}\nScore:\n", 1)

        match = (df['type'] == 'instances')
        df.loc[match, 'gptprompt'] = df.loc[match].apply(lambda x: f"DT Instances Prompt:{x['prompt']}\nResponse:{x['response']}\nScore:\n", 1)

        match = (df['type'] == 'completion')
        df.loc[match, 'simplequestion'] = df.loc[match, 'question'].str.replace(r'Complete.*?: "(.*?)\.\.\.".*', r'\1', regex=True).tolist()
        df.loc[match, 'gptprompt'] = df.loc[match].apply(lambda x: f"DT Completion Prompt:{x['simplequestion']}\nResponse:{x['response']}\nScore:\n", 1)

        match = (df['type'] == 'consequences')
        df.loc[match, 'gptprompt'] = df.loc[match].apply(lambda x: f"DT Consequences Prompt:if {x['question'].split('consequence if')[1].replace('?', '').strip()}\nResponse:{x['response']}\nScore:\n", 1)

    return df

all_data = gt_preparation(all_data)
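
To make the training format concrete, here is a sketch of what one prepared record looks like for the AUT-only case. The helper `make_record` and the example response are hypothetical; the prompt template mirrors the one in `gt_preparation`, and the sketch rounds the score before casting, as the note on the cell describes.

```python
import json


def make_record(prompt, response, target):
    """Build one prompt/completion pair in the fine-tuning format."""
    response = response.replace("\n", " ")  # keep each response on one line
    return {
        "prompt": f"AUT Prompt:{prompt}\nResponse:{response}\nScore:\n",
        # leading space, then the score on a 10-50 integer scale
        "completion": f" {int(round(target * 10))}",
    }


record = make_record("brick", "use it as\na doorstop", 2.3)
line = json.dumps(record)  # one line of the .jsonl training file
print(line)
```

Each record becomes one line of the `.jsonl` file. The leading space in the completion and the fixed `Score:\n` ending of the prompt give the model a consistent separator between input and output.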

In [316]:
#@title Save data
#@markdown Partial splits are for training on less data
use_partial = False #@param {type:'boolean'}
partial_portions = [0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0] #@param {type:"raw"}

#@markdown The naming suffix can help differentiate custom tweaks to the data. Make it blank if you don't need it (in most cases, you'll want it blank).
naming_suffix = '' #@param {type:'string'}

#@markdown If using atypical names for the dataset (e.g. group1, group2, etc.), describe here,
#@markdown else make it `train = ['train']`, etc. In most cases (such as reproducing results from our paper), you'll want the values to be 'train', 'val', and 'test'.
setnames = dict()
trainsetnames = ['train'] #@param {type:'raw'}
valsetnames= ['val']#@param {type:'raw'}
testsetnames = ['test'] #@param {type:'raw'}

setnames = dict(train=trainsetnames,
                val=valsetnames,
                test=testsetnames)

# save train and validation sets
for split in ['train', 'val']:
    out = all_data[all_data.split.isin(setnames[split])][['gptprompt', 'completion']]
    out.columns = ['prompt', 'completion']
    fname = f'finetune-{data_subdir}{naming_suffix}_prepared_{split}.jsonl'
    print(f'saving {split} data to {fname}')
    out.to_json(fname, orient='records', lines=True)

    if use_partial and (split == 'train'):
        for prop in partial_portions:
            fname = f'finetune-{data_subdir}{naming_suffix}_prepared_{split}-{prop}.jsonl'
            print(f'saving {split} data to {fname}')
            out.sample(frac=prop, random_state=12345).to_json(fname, orient='records', lines=True)
        print("Saved partials with the following proportions:", partial_portions)
# use to doublecheck data
# don't split since validation already exists

saving train data to finetune-gt_alltests2-perm3_prepared_train.jsonl
saving val data to finetune-gt_alltests2-perm3_prepared_val.jsonl


In [None]:
#@markdown Upload training files or retrieve already-uploaded files from OpenAI
sid = dict()

def upload_or_load(fname):
    existing_files = [file for file in openai.File.list().data if file.filename == fname]
    if len(existing_files):
        print(f"Using already uploaded file for {fname}. If this was unintended, delete the server one.")
        return existing_files[0].id
    else:
        print("Uploading", fname)
        with open(fname) as f:
            results = openai.File.create(file=f, purpose='fine-tune', user_provided_filename=fname)
        return results.id

for split in ['train', 'val']:
    if (split == 'val') and ('byprompt' in data_subdir):
        continue
    fname = f'finetune-{data_subdir}{naming_suffix}_prepared_{split}.jsonl'
    sid[split] = upload_or_load(fname)

    if use_partial and (split == 'train'):
        for prop in partial_portions:
            fname = f'finetune-{data_subdir}{naming_suffix}_prepared_{split}-{prop}.jsonl'
            sid[f"{split}-{prop}"] = upload_or_load(fname)

print("files:", sid.keys())

This is mainly a note to self, but it might be valuable for others applying similar methods. The models in the LLM paper were trained on our data (and other researchers'), but we also needed unbiased scores for our main work on MOTES, meaning a model can't have previously seen the responses that it's scoring. To get around this, we trained multiple models, each with a different train/test split, so that each response can be scored by a model that hasn't seen it before.

For example, `alltests2` had three groups of input data, each 32% of the data. Permutation 1 (`perm1`) is trained on groups 1+2 (tested on group 3), `perm2` is trained on groups 1+3 (tested on group 2), and `perm3` is trained on groups 2+3 (tested on group 1).
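
The permutation scheme amounts to a leave-one-group-out plan. A minimal sketch of how the three train/test assignments could be enumerated is below; the function name and the dictionary layout are illustrative assumptions, not code from the study.

```python
from itertools import combinations


def leave_one_group_out(groups):
    """For each way of holding out one group, list the training groups and the held-out group."""
    plans = []
    n_train = len(groups) - 1
    for n, train in enumerate(combinations(groups, n_train), start=1):
        held_out = [g for g in groups if g not in train][0]
        plans.append({"name": f"perm{n}", "train": list(train), "test": held_out})
    return plans


plans = leave_one_group_out(["group1", "group2", "group3"])
for plan in plans:
    print(plan["name"], "train:", plan["train"], "test:", plan["test"])
```

Because every group is held out exactly once, every response ends up scored by a model that never saw it during training.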

## Fine-tuning

In [318]:
finetune_model = "curie" #@param ["ada", "babbage", "curie", "davinci"]
project_name = 'aut-gpt3' #@param {type:'string'}

In [319]:
#@markdown ### Start Fine-tuning
if use_partial:
    print("Stopping. Are you sure you don't want to run code from section below, 'Fine-tuning for partial train data'?")
else:
    if 'val' in sid:
        results = openai.FineTune.create(training_file=sid['train'],
                            validation_file=sid['val'],
                            model=finetune_model,
                            suffix=data_subdir+naming_suffix)
    else:
        results = openai.FineTune.create(training_file=sid['train'],
                            model=finetune_model,
                            suffix=data_subdir+naming_suffix)

In [None]:
# check in every few minutes to see training status
while True:
    status = openai.FineTune.retrieve(id=results.id).status
    print(status)
    if status not in ['running', 'pending']:
        break
    time.sleep(5*60)

### Costs

- Davinci: "Fine-tune costs $35.65". Curie is 1/10th of the price.

*These notes are outdated, from before I grew the dataset:*
Fine-tuning was 942,736 tokens according to the wandb logs, but 620,984 tokens according to the billing interface. On ada: `620984/1000*0.0004=$0.248`, which is about what the billing interface says. At that rate, Babbage (`0.0006`) would train for `$0.37`, Curie (`0.0030`) for `$1.86`, and Davinci (`0.030`) for `$18.63`.

My training+val example n = 8400.

Usage cost is 4x per token, though because training had 4 epochs, the cost per item should be about the same. If test data is 15% of the dataset size, just divide the train costs by $5\frac{2}{3}$.
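
The arithmetic behind that rule of thumb can be written out as follows. This is a back-of-envelope sketch under the assumptions stated in the note: the training bill covers 4 epochs at the base rate, usage is billed at 4x the base rate for a single pass, and prompts are about the same length in both sets, so the cost per item is the same for training and scoring. The function name is illustrative.

```python
import math


def scoring_cost_from_training_cost(train_cost, test_share=0.15):
    """Estimate the cost of scoring the test set from the training bill.

    Assumes equal cost per item for training and scoring, so the two costs
    scale with the number of items in each set.
    """
    train_share = 1.0 - test_share
    return train_cost * test_share / train_share


ratio = (1.0 - 0.15) / 0.15   # train items per test item, i.e. 5 and 2/3
ada_scoring = scoring_cost_from_training_cost(0.248)
print(round(ratio, 3), round(ada_scoring, 3))
```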

#### Updates with larger dataset

- Fine-tuning on gt_main was 1,033,728 tokens: `$0.41` on ada, `$0.61` on babbage, `$3.10` on curie, and `$31` on davinci.

## Fine-tuning for partial train data

In [None]:
finetune_model = "curie" #@param ["ada", "babbage", "curie", "davinci"]
project_name = 'aut-gpt3' #@param {type:'string'}

In [None]:
doublecheck = input(f"Are you sure you want to start fine-tuning runs for {partial_portions}? (Y/N)")
if doublecheck.lower() == 'y':
    for prop in partial_portions:
        print(prop)
        results = openai.FineTune.create(training_file=sid[f'train-{prop}'],
                            validation_file=sid['val'],
                            model=finetune_model,
                            suffix=f'{data_subdir}#{prop}')

Are you sure you want to start fine-tuning runs for [0.01]? (Y/N)Y
0.01


In [None]:
openai.FineTune.retrieve(results.id)

<FineTune fine-tune id=ft-0lEwbHncradjiReqMlyOYAPp at 0x7ffb1faaccb0> JSON: {
  "created_at": 1659726118,
  "events": [
    {
      "created_at": 1659726118,
      "level": "info",
      "message": "Created fine-tune: ft-0lEwbHncradjiReqMlyOYAPp",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1659726119,
      "level": "info",
      "message": "Fine-tune costs $0.04",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1659726120,
      "level": "info",
      "message": "Fine-tune enqueued. Queue number: 0",
      "object": "fine-tune-event"
    }
  ],
  "fine_tuned_model": null,
  "hyperparams": {
    "batch_size": 1,
    "learning_rate_multiplier": 0.1,
    "n_epochs": 4,
    "prompt_loss_weight": 0.1
  },
  "id": "ft-0lEwbHncradjiReqMlyOYAPp",
  "model": "curie",
  "object": "fine-tune",
  "organization_id": "org-48rfjyoSnZfLJWOte33sjuqL",
  "result_files": [],
  "status": "pending",
  "training_files": [
    {
      "bytes": 15458,
      "created

## Test Fine-Tune



In [4]:
#@markdown Show all fine-tuned models
rows = []
all_models = [(x['id']) for x in openai.Model.list()['data'] if 'massive-texts-lab' in x['id']]
# or see all fine tunes (including deleted models)
#all_models = [(x['fine_tuned_model']) for x in openai.FineTune.list()['data'] if x['status'] == 'succeeded']
for x in all_models:
    modelsize, lab, fullname = x.split(':')
    split, date = fullname.split('-2022-')
    if split.count('-') == 3:
        a,b,c,d = split.split('-')
        split, proportion = f"{a}-{b}", float(f"{c}.{d}")
    else:
        proportion = 1
    rows.append((x, modelsize, lab, split, proportion, date))
all_models = pd.DataFrame(rows, columns=['name', 'size', 'lab', 'split', 'proportion', 'date'])
all_models['proportion'] = all_models.proportion.astype(float)
all_models = all_models.set_index(['split', 'size', 'proportion']).sort_index()
all_models

| split | size | proportion | name | lab | date |
|---|---|---|---|---|---|
| gt-alltests2-perm1 | curie | 1.0 | curie:ft-massive-texts-lab:gt-alltests2-perm1-... | ft-massive-texts-lab | 09-20-23-08-34 |
| gt-alltests2-perm2 | curie | 1.0 | curie:ft-massive-texts-lab:gt-alltests2-perm2-... | ft-massive-texts-lab | 09-21-15-22-40 |
| gt-alltests2-perm3 | curie | 1.0 | curie:ft-massive-texts-lab:gt-alltests2-perm3-... | ft-massive-texts-lab | 09-21-15-51-50 |
| gt-bypart3 | ada | 1.0 | ada:ft-massive-texts-lab:gt-bypart3-2022-07-22... | ft-massive-texts-lab | 07-22-19-33-36 |
| gt-byprompt | ada | 1.0 | ada:ft-massive-texts-lab:gt-byprompt-2022-08-1... | ft-massive-texts-lab | 08-14-02-20-49 |
| gt-byprompt | babbage | 1.0 | babbage:ft-massive-texts-lab:gt-byprompt-2022-... | ft-massive-texts-lab | 08-13-21-01-44 |
| gt-byprompt | curie | 1.0 | curie:ft-massive-texts-lab:gt-byprompt-2022-08... | ft-massive-texts-lab | 08-13-21-27-20 |
| gt-main2 | ada | 0.01 | ada:ft-massive-texts-lab:gt-main2-0-01-2022-08... | ft-massive-texts-lab | 08-03-21-54-16 |
| gt-main2 | ada | 0.05 | ada:ft-massive-texts-lab:gt-main2-0-05-2022-08... | ft-massive-texts-lab | 08-03-22-02-14 |
| gt-main2 | ada | 0.1 | ada:ft-massive-texts-lab:gt-main2-0-1-2022-08-... | ft-massive-texts-lab | 08-03-22-21-38 |
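
The index in this table is recovered purely from the model name, so it may help to see the parsing on its own. The sketch below wraps the same string logic in a function (`parse_model_name` is an illustrative name). The first two names appear elsewhere in this notebook; the third, with a `0-01` proportion suffix, uses a made-up timestamp because the real name is truncated in the table.

```python
def parse_model_name(name):
    """Split a fine-tuned model name into size, lab, split, proportion and date."""
    size, lab, fullname = name.split(":")
    # relies on the timestamp starting with the year 2022
    split, date = fullname.split("-2022-")
    if split.count("-") == 3:
        # e.g. 'gt-main2-0-01' -> split 'gt-main2', proportion 0.01
        a, b, c, d = split.split("-")
        split, proportion = f"{a}-{b}", float(f"{c}.{d}")
    else:
        proportion = 1.0
    return dict(size=size, lab=lab, split=split, proportion=proportion, date=date)


full = parse_model_name("curie:ft-massive-texts-lab:gt-main2-2022-08-01-19-44-29")
perm = parse_model_name("curie:ft-massive-texts-lab:gt-alltests2-perm1-2022-09-20-23-08-34")
# Hypothetical name: the timestamp here is invented for illustration.
partial = parse_model_name("ada:ft-massive-texts-lab:gt-main2-0-01-2022-01-01-00-00-00")
print(full["split"], full["proportion"], partial["split"], partial["proportion"])
```

Note the built-in assumptions: the name must contain `-2022-` exactly once, and a split name is only read as carrying a proportion when it has exactly three hyphens, so models created in another year or with differently shaped suffixes would need the parsing adjusted.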


In [None]:
#@title delete old models
model_to_delete = "curie:ft-massive-texts-lab:gt-alltests1-perm1-2022-09-20-19-42-57" #@param {type: 'string'}
results = None
if model_to_delete:
    results = openai.Model.delete(model_to_delete)
results

In [48]:
#@title Select model.
finetuned_size = "curie" #@param ["ada", "babbage", "curie", "davinci"]
finetuned_split = "gt-main2" #@param ["gt-byparticipant", "gt-byprompt", "gt-main2"] {allow-input: true}
finetuned_suffix = "" #@param [""] {allow-input: true}
finetuned_proportion = 1 #@param [0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0] {type:"raw"}

try:
    finetuned_model = all_models.loc[(finetuned_split+finetuned_suffix, finetuned_size, float(finetuned_proportion)), 'name']
    print("Using", finetuned_model)
except KeyError:
    print("No trained model for those settings")
    raise

Using curie:ft-massive-texts-lab:gt-main2-2022-08-01-19-44-29


In [None]:
def score_gpt(prompt, model, just_final=False):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=0,
        n=1,
        logprobs=None,
        stop='\n',
        max_tokens=1
    )
    if just_final:
        return response.choices[0]['text']
    else:
        return response

def score_batch(gptprompts, model, raise_errs=False, batch_size=500, **kwargs):
    ''' this is adapted from the batch scoring in the Open Scoring library module
    https://github.com/massivetexts/open-scoring/blob/master/open_scoring/scoring.py
    '''
    scores = []
    nbatches = np.ceil(len(gptprompts) / batch_size).astype(int)
    for i in tqdm(range(nbatches)):
        promptbatch = gptprompts[i*batch_size:(i+1)*batch_size]

        sleeptime = 10
        while True:
            try:
                response = score_gpt(promptbatch, model=model, just_final=False)
                break
            except openai.error.RateLimitError:
                print(f"Rate limit error - trying again in {sleeptime} seconds")
                time.sleep(sleeptime)
                sleeptime += 2
        total_tokens = response.usage.total_tokens
        scores_raw = [x.text for x in response.choices]
        avg_tokens_per = total_tokens / len(scores_raw)
        for j, score_raw in enumerate(scores_raw):
            try:
                score = int(score_raw.strip()) / 10
            except ValueError:
                # non-numeric completion: optionally show the offending prompt, else record None
                if raise_errs:
                    print(f"GPT prompt: {promptbatch[j].strip()}")
                    print(f"raw response: {score_raw}")
                    raise
                score = None
            scores.append((score, avg_tokens_per))
    return scores

testdata = all_data[all_data.split.isin(setnames['test'])].copy()
results = score_batch(testdata.gptprompt.tolist(), finetuned_model, batch_size=400)
testdata[['predicted', 'total_tokens']] = pd.DataFrame(results).values

s = finetuned_split.replace('-', '_')

if (finetuned_proportion < 1) and (finetuned_suffix == ""):
    raise Exception("This condition not coded yet!")

if finetuned_suffix != "":
    fname = f'gpt-ft-{finetuned_size}-s{finetuned_suffix}.csv'
elif finetuned_proportion == 1:
    fname = f'gpt-ft-{finetuned_size}.csv'
else:
    fname  = f'gpt-ft-{finetuned_size}-{finetuned_proportion}.csv'

testdata['model'] = f"gpt3-{finetuned_size}"
testdata['proportion'] = finetuned_proportion

#testdata['predicted'] = testdata.predicted_raw.str.strip().str.replace('[\-\:/]','', regex=True).apply(lambda x:x.split(' ')[0])
#testdata['predicted'] = pd.to_numeric(testdata['predicted'], errors='coerce').div(10)
returncols = ['id', 'model', 'type', 'participant', 'prompt', 'target', 'predicted', 'src', 'total_tokens', 'proportion']
output = testdata[[x for x in returncols if x in testdata.columns]]
print("Saving to", (base_dir / 'Data' / 'evaluation' / s / fname))
output.to_csv(base_dir / 'Data' / 'evaluation' / s / fname)
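
The model was trained to emit a single integer token such as ` 23`, so turning a raw completion back into a score on the original scale is just a strip, a cast, and a division by 10. A small standalone sketch of that step, with unparseable completions mapped to `None` as in `score_batch` (the function name `parse_score` is illustrative):

```python
def parse_score(raw_completion):
    """Convert a raw completion like ' 23' to 2.3; return None if it is not an integer."""
    try:
        return int(raw_completion.strip()) / 10
    except ValueError:
        return None


print(parse_score(" 23"), parse_score(" n/a"))
```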

In [None]:
output.groupby('type')[['target', 'predicted']].corr().iloc[1::2, 0]

type                   
completion    predicted    0.847867
consequences  predicted    0.713104
instances     predicted    0.907785
uses          predicted    0.785307
Name: target, dtype: float64

## Special Eval

This code is specific for the multiple permutation training (where multiple models were trained on different data, to allow scores for 100% of the data without any data leakage). You probably don't need this :)

In [11]:
all_models[all_models.name.str.contains('alltests2')].name.tolist()

['curie:ft-massive-texts-lab:gt-alltests2-perm1-2022-09-20-23-08-34',
 'curie:ft-massive-texts-lab:gt-alltests2-perm2-2022-09-21-15-22-40',
 'curie:ft-massive-texts-lab:gt-alltests2-perm3-2022-09-21-15-51-50']

In [56]:
modelname = ''
all_eval = []
for gptmodel, testpath in [('curie:ft-massive-texts-lab:gt-alltests2-perm1-2022-09-20-23-08-34', 'group3'),
                            ('curie:ft-massive-texts-lab:gt-alltests2-perm2-2022-09-21-15-22-40', 'group2'),
                            ('curie:ft-massive-texts-lab:gt-alltests2-perm3-2022-09-21-15-51-50', 'group1'),
                            ('curie:ft-massive-texts-lab:gt-alltests2-perm3-2022-09-21-15-51-50', 'val')]:
    df = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / testpath).glob('*json')])
    just_motes = df.query('src=="motesf"')
    just_motes = gt_preparation(just_motes)
    results = score_batch(just_motes.gptprompt.tolist(), gptmodel, batch_size=600)
    just_motes[['predicted', 'total_tokens']] = pd.DataFrame(results).values
    all_eval.append(just_motes)
    
fulldata = pd.concat(all_eval)
fulldata['participant'] = fulldata['participant'].str.replace('motesf','')
fulldata.prompt = fulldata.prompt.str.replace('lightbulb', 'light bulbs').str.replace('hat', 'hat cap').str.replace('ball','soccer ball').str.replace('pencil', 'lead pencil').str.replace('spoon', 'spoons')
fulldata.to_csv(base_dir / 'motesf_alltest2_allperms.csv')
fulldata.groupby('type')[['target', 'predicted']].corr().iloc[1::2, 0]

type                 
completion  predicted    0.851884
instances   predicted    0.787339
uses        predicted    0.747670
Name: target, dtype: float64

In [54]:
# Merge back to original data
df = pd.read_csv(gt_dir / 'motes_full_gt_scores.csv').replace(-999, pd.NA).copy()
items = [col.replace('_prompt', '') for col in df.columns if col.startswith('G') and col.endswith('_prompt')]
collector = []
for item in items:
    subset = df[['Order'] + [col for col in df.columns if col.startswith(item)]].copy()
    subset.columns = [col.split('_')[-1] for col in subset.columns]
    subset['game'] = item.split('_')[0]
    subset['prompt_code'] = item
    collector.append(subset)
reshaped = pd.concat(collector)
reshaped = reshaped.rename(columns={'corrected':'response'})
final = reshaped.merge(fulldata.drop('participant', axis=1), on=['prompt', 'response'])[['Order', 'response_num', 'prompt', 'response', 'target', 'predicted', 'type']]
final.to_csv(base_dir / 'motesf-llm-scores.csv')
final.sample(1)

| | Order | response_num | prompt | response | target | predicted | type |
|---|---|---|---|---|---|---|---|
| 7540 | 126 | 6 | rain | cat dog | 2.3 | 1.9 | completion |


# 3. Prompt Engineering (Zero-shot or Few-shot)

In [None]:
base_dir = Path('drive/MyDrive/Grants/MOTES/') #@param { type: 'raw' }
data_subdir = "gt_main_std" # should be identical to gt_main2, but I didn't want to accidentally overwrite gt_main when exporting standard deviations
!cp "{gt_dir}/{data_subdir}.tar.gz" .
!rm -rf data
!tar -xf {data_subdir}.tar.gz
data_dir = Path(f"data/{data_subdir}")

In [None]:
testdata = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / 'test').glob('*json')])
traindata = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / 'train').glob('*json')])

In [None]:
#@title Few/Zero Shot Definitions
import re
PROMPT_TEMPLATE = "Below is a list of uses for a {0}. On a scale of 10-50, judge how original each use for a {1} is, where 10 is 'not at all creative' and 50 is 'very creative':\n\nUSES\n{2}\n{3}\n\nRATINGS\n{4}\n{5}" #@param {type:"raw"}
#@markdown Examples and completions per prompt. Set nexamples=0 for zero-shot, though note that zero-shot may not work.
nexamples = 5 #@param {type:"integer"}
ncompletions =  5#@param {type:"integer"}

def select_by_std(x, nexamples=5, max_std=None):
    ''' Choose examples from the low-stdev group that span the range of target scores '''
    z = x.sort_values('target')
    if max_std:
        z = z[z.rating_std <= max_std]
    nexamples = min(len(z), nexamples)
    batch_size = len(z) // nexamples
    samples = []
    for i in range(nexamples): 
        maxv = (i+1)*batch_size if (i+1) < nexamples else len(z)+1
        sample = z.iloc[(i*batch_size):maxv].sample(1).iloc[0]
        samples.append(sample)
    return pd.DataFrame(samples)

def format_fewshot(df, startn=1, no_target=False, shuffle=True):
    assert len(df.question.unique()) == 1
    if shuffle:
        # jumble, so GPT doesn't speculate based on monotonically increasing values
        df = df.sample(frac=1)
    q = df.iloc[0]['question']
    rlist = "\n".join([f"{i+startn}. {response}" for i, response in enumerate(df.response)])
    if no_target:
        tlist = f"{startn}."
    else:
        tlist = "\n".join([f"{i+startn}. {int(target*10)}" for i, target in enumerate(df.target)])
    return pd.Series(dict(q=q, rlist=rlist, tlist=tlist))

def fewshot_prompt(x, nexamples=5, max_std=None):
    df = select_by_std(x, nexamples=nexamples, max_std=max_std)
    return format_fewshot(df)

def chunked_prompts(x, startn=6, ncomplete = 5, no_target=True):
    nchunks = np.ceil(len(x) / ncomplete).astype(int)
    rows = []
    for i in range(nchunks):
        subset = x[i::nchunks]
        rows.append(format_fewshot(subset, startn=startn, no_target=no_target))
    return pd.DataFrame(rows)

def parse_uses_response(gptresponse, startn = 6, ncomplete = 5):
    # raw strings avoid invalid-escape warnings for \d and \.
    prompt = re.findall(r" each use for a? ?(.*?) is, where \d+ is '", gptresponse)[0]
    raw_list = re.findall(r'(\d+\..*)$', gptresponse, flags=re.MULTILINE)
    cleaned = []
    for n in range(startn, startn+ncomplete):
        rawvals = [x for x in raw_list if x.startswith(f"{n}. ")]
        vals = [x.split('.', 1)[-1].strip() for x in rawvals]
        if len(vals) > 2:
            print(f"ERROR WITH {n}: {vals}")
        elif len(vals) == 0:
            # skip
            continue

        try:
            response, predicted = vals[0], vals[1]
            if not predicted.isnumeric():
                # failsafe where everything but last two integers are stripped
                # many ways this can fail, but let it fail
                predicted = "".join([x for x in list(predicted) if x.isnumeric()][-2:])
        except IndexError:
            # only one match: the item has no rating, so skip it rather than
            # reuse `response`/`predicted` left over from the previous item
            print(f"Can't Parse {n}: {rawvals}")
            continue

        try:
            predicted = int(predicted)
            cleaned.append(dict(prompt=prompt, response=response, predicted=predicted))
        except ValueError:
            print(f"Can't cast predicted to int {n}: {rawvals}")
    return pd.DataFrame(cleaned)

def full_prompt_from_completes_row(row):
    start = prompt_starts.loc[row.name]
    full_prompt = PROMPT_TEMPLATE.format(row.name.upper(), row.name.lower(),start.rlist, row.rlist,start.tlist, row.tlist)
    full_prompt = full_prompt.replace('A PANTS', 'PANTS').replace('a pants', 'pants')
    return full_prompt
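
The selection step in `select_by_std` is a form of stratified sampling: sort by the human score, optionally drop items the raters disagreed on, cut the sorted list into equal slices, and draw one item per slice. Below is a minimal pure-Python sketch of that idea on toy data. The function name `pick_spanning_examples`, the dict keys, and the toy values are assumptions for illustration, not part of the notebook.

```python
import random

def pick_spanning_examples(items, n_examples=5, max_std=None, seed=None):
    """Pick one item from each of n slices of the score-sorted list.

    `items` is a list of dicts with hypothetical keys 'response', 'target',
    and 'rating_std' (mirroring the DataFrame columns used above).
    """
    rng = random.Random(seed)
    pool = sorted(items, key=lambda d: d["target"])
    if max_std is not None:
        # keep only examples the human raters largely agreed on
        pool = [d for d in pool if d["rating_std"] <= max_std]
    n_examples = min(len(pool), n_examples)
    if n_examples == 0:
        return []
    size = len(pool) // n_examples
    picks = []
    for i in range(n_examples):
        # the last slice absorbs any remainder
        end = (i + 1) * size if i + 1 < n_examples else len(pool)
        picks.append(rng.choice(pool[i * size:end]))
    return picks

# toy data: 11 responses with increasing targets and varying rater agreement
toy = [{"response": f"use {k}", "target": 1 + k * 0.4, "rating_std": 0.1 * (k % 4)}
       for k in range(11)]
examples = pick_spanning_examples(toy, n_examples=3, max_std=0.25, seed=0)
for ex in examples:
    print(ex["response"], round(ex["target"], 1))
```

Because each pick comes from a different slice of the sorted pool, the few-shot examples cover low, middle, and high scores rather than clustering around the mean.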

In [None]:
sdlim = traindata.rating_std.quantile(0.4)

prompt_completes = testdata.groupby('prompt').apply(lambda x: chunked_prompts(x, startn=nexamples+1, ncomplete=ncompletions)).reset_index(1)
if nexamples > 0:
    prompt_starts = traindata.groupby(['prompt']).apply(lambda x: fewshot_prompt(x, nexamples=nexamples, max_std=sdlim))
else:
    prompt_starts = traindata[['prompt', 'question']].drop_duplicates().set_index('prompt').rename(columns={'question':'q'})
    prompt_starts[['rlist','tlist']] = ('', '')

prompt_completes['prompt_for_gpt'] = prompt_completes.apply(full_prompt_from_completes_row, axis=1)

if nexamples == 0:
    def replaces(x):
        return (x.replace('USES\n\n', 'USES\n')
        .replace('RATINGS\n\n', 'RATING FROM 1-10\n')
        .replace('10-50', '1-10').replace('10 is', '1 is').replace('50 is', '10 is') # change scale to be 1-10 (/2 for final score)
        )
    prompt_completes['prompt_for_gpt'] = prompt_completes['prompt_for_gpt'].apply(replaces)

testprompt = prompt_completes.sample().iloc[0]['prompt_for_gpt']
print(testprompt)

Below is a list of uses for a SOCK. On a scale of 10-50, judge how original each use for a sock is, where 10 is 'not at all creative' and 50 is 'very creative':

USES
1. to use it like a puppet.
2. You can put googly eyes and make a sock puppet show.
3. You can color it and maybe make a snake.
4. a cool and funny puppet.
5. maybe you could put it on your hands and pretend to have superpowers.
6. using it as gloves.
7. you could use it for ASMR
8. Cut them and make a 3D scupture.
9. you can make a dress for your doll
10. to use it like a backpack or store money in it

RATINGS
1. 27
2. 27
3. 32
4. 24
5. 36
6.
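
The prompt printed above is assembled in two steps: the test responses are split into chunks by stride (the `x[i::nchunks]` slice in `chunked_prompts`), and each chunk is appended to the few-shot examples as one continuous numbered list that ends on the first unrated number. A minimal sketch of both steps, using plain lists instead of DataFrames; the helper names `chunk_by_stride`, `numbered`, and `build_prompt` are hypothetical.

```python
def chunk_by_stride(responses, per_prompt=5):
    """Split responses into ceil(n / per_prompt) chunks by taking every k-th item."""
    n_chunks = -(-len(responses) // per_prompt)  # ceiling division
    return [responses[i::n_chunks] for i in range(n_chunks)]

def numbered(lines, start=1):
    """Render lines as '1. foo', '2. bar', ..."""
    return "\n".join(f"{i + start}. {line}" for i, line in enumerate(lines))

def build_prompt(item, shots, to_score):
    """shots: list of (response, rating) pairs; to_score: list of unrated responses."""
    uses = numbered([r for r, _ in shots] + list(to_score))
    ratings = numbered([str(s) for _, s in shots])
    next_n = len(shots) + 1  # the model continues from this number
    return (
        f"Below is a list of uses for a {item.upper()}. On a scale of 10-50, "
        f"judge how original each use for a {item.lower()} is, where 10 is "
        f"'not at all creative' and 50 is 'very creative':"
        f"\n\nUSES\n{uses}\n\nRATINGS\n{ratings}\n{next_n}."
    )

chunks = chunk_by_stride(list(range(12)), per_prompt=5)
shots = [("to use it like a puppet.", 27), ("You can color it and maybe make a snake.", 32)]
to_score = ["using it as gloves.", "you could use it for ASMR"]
prompt = build_prompt("sock", shots, to_score)
print(chunks)
print(prompt)
```

Striding rather than slicing contiguous blocks keeps chunk sizes within one of each other, so no prompt ends up with a single leftover response.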


In [None]:
fewshot_model = "text-davinci-002" #@param ['text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-002']

def score_fewshot(prompt, model=fewshot_model, just_final=False):
    # Uses the legacy Completions endpoint (openai Python package < 1.0)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=0,
        n=1,
        logprobs=None,
        max_tokens=200
    )
    if just_final:
        return response.choices[0]['text']
    else:
        return response

In [None]:
# example in paper
testprompt = '''Below is a list of uses for a SOCK. On a scale of 10-50, judge how original each use for a sock is, where 10 is 'not at all creative' and 50 is 'very creative':

USES
1. to use it like a puppet.
2. You can put googly eyes and make a sock puppet show.
3. You can color it and maybe make a snake.
4. a cool and funny puppet.
5. maybe you could put it on your hands and pretend to have superpowers.
6. using it as gloves.
7. you could use it for ASMR
8. Cut them and make a 3D scupture.
9. you can make a dress for your doll
10. to use it like a backpack or store money in it

RATINGS
1. 27
2. 27
3. 32
4. 24
5. 36
6.'''
gptresponse = score_fewshot(testprompt)
print(gptresponse['choices'][0]['text'])

 20
7. 40
8. 50
9. 45
10. 35
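
Since the prompt ends on `6.`, the completion starts mid-line with the first rating and then continues the numbered list. The following is a simplified sketch of recovering the ratings from the completion alone; it is not the notebook's `parse_uses_response` (which re-parses the full prompt plus completion), and the name `parse_ratings` is hypothetical.

```python
import re

def parse_ratings(completion, start_n=6):
    """Return {item_number: rating} from a completion that continues 'start_n.'"""
    # re-attach the number that the prompt ended on
    text = f"{start_n}.{completion}"
    pairs = re.findall(r"^(\d+)\.[ \t]*(\d+)[ \t]*$", text, flags=re.MULTILINE)
    return {int(n): int(score) for n, score in pairs}

completion = " 20\n7. 40\n8. 50\n9. 45\n10. 35"
ratings = parse_ratings(completion)
# the prompt uses a 10-50 scale; divide by 10 to return to the 1-5 rating scale
scaled = {n: score / 10 for n, score in ratings.items()}
print(ratings)
print(scaled)
```

Lines that do not consist of a number, a period, and an integer rating are ignored, so a malformed line simply leaves that item without a prediction.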


In [None]:
collector = []

davinci_cost = 0.06/1000
total_usage = 0
n_processed = 0
err_prompts = []

pbar = tqdm(prompt_completes.prompt_for_gpt)
for fullprompt in pbar:
    try:
        gptresponse = score_fewshot(fullprompt)
        completed = (fullprompt + gptresponse['choices'][0]['text'])
        out = parse_uses_response(completed, startn=nexamples+1, ncomplete=ncompletions)
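        # NOTE: the divisor is hard-coded. Summing this column recovers the request's
        # total tokens only when `out` has 10 rows; with 5 rows it recovers half.
        out['usage_per_prompt'] = gptresponse['usage']['total_tokens'] / 10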
        collector.append(out)
        # Calculate costs
        n_processed += len(out)
        total_usage += gptresponse['usage']['total_tokens']
        est = np.round(len(testdata) * (total_usage / n_processed) * davinci_cost, 2)
        pbar.set_description(f"Cost est (davinci): ${est}")
    except KeyboardInterrupt:
        raise
    except Exception as err:
        # Log the failing prompt and continue with the rest of the batch
        print("Err with ", fullprompt)
        err_prompts.append((fullprompt, repr(err)))

all_fewshot = pd.concat(collector)
all_fewshot

  0%|          | 0/615 [00:00<?, ?it/s]

In [None]:
all_fewshot.usage_per_prompt.sum() * davinci_cost

3.1992419999999995

In [None]:
testdata.count()

src             3030
question        3030
prompt          3030
response        3030
id              3030
target          3030
participant     3030
response_num    2249
rating_std      3023
count            338
dtype: int64

In [None]:
results.predicted.count()

2995

Cost for DaVinci with 10 completions:
- $4.32 (`r=0.42`, n(errs)=39)

Cost for DaVinci with 5 completions:
- $3.20 (`r=0.36`, n(errs)=36) (huh? Why lower?)

Cost for DaVinci zero-shot with 10 completions:
- $3.34 (`r=0.13`, n(errs)=106)
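
The cost figures come from simple token arithmetic: the running estimate extrapolates average tokens per scored response to the whole test set, and the final figure sums a per-row token share. The sketch below reproduces both calculations with made-up token counts; the function names and the numbers are assumptions, and only the `0.06/1000` rate is taken from the notebook. It also shows one thing worth checking when comparing runs: if each row stores the request's tokens divided by a fixed 10, the sum matches the request total only when 10 rows were scored per request.

```python
PRICE_PER_TOKEN = 0.06 / 1000  # davinci rate used in this notebook

def projected_cost(tokens_so_far, rows_so_far, total_rows, price=PRICE_PER_TOKEN):
    """Extrapolate full-run cost from the average tokens per scored response."""
    return round(total_rows * (tokens_so_far / rows_so_far) * price, 2)

def summed_row_usage(request_tokens, rows_in_request, divisor):
    """Tokens recovered when each row stores request_tokens / divisor and rows are summed."""
    return rows_in_request * (request_tokens / divisor)

# hypothetical progress: 3,500 tokens spent on the first 50 of 3,030 responses
est = projected_cost(tokens_so_far=3500, rows_so_far=50, total_rows=3030)
full = summed_row_usage(400, rows_in_request=10, divisor=10)  # whole request recovered
half = summed_row_usage(400, rows_in_request=5, divisor=10)   # only half recovered
print(est, full, half)
```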

In [None]:
results = testdata.merge(all_fewshot, on=['prompt', 'response'], how='left')
if nexamples > 0:
    results.predicted = results.predicted.div(10)
else:
    results.predicted = results.predicted.div(2)
results['model'] = f"gpt3-{fewshot_model.split('-')[1]}"
results['nexamples'] = nexamples
results['ncompletions'] = ncompletions
print(f"n(errs)={results.predicted.isna().sum()}")
results = results.rename(columns={'usage_per_prompt': 'total_tokens'})

output = results[['id', 'model', 'participant', 'prompt', 'target', 'predicted', 'src', 'total_tokens', 'nexamples', 'ncompletions']]
(base_dir / 'Data' / 'evaluation' / 'fewshot').mkdir(parents=True, exist_ok=True)
fname = f'gpt3-{fewshot_model}-{nexamples}-{ncompletions}.csv'
output.to_csv(base_dir / 'Data' / 'evaluation' / 'fewshot' / fname)

n(errs)=36


In [None]:
output[['target', 'predicted']].corr().loc['target', 'predicted']

0.3639529389033076

In [None]:
output.groupby('prompt')[['target', 'predicted']].corr().loc[(slice(None),'target'), 'predicted']

prompt            
backpack    target   -0.372925
ball        target    0.248275
book        target    0.009457
bottle      target    0.227964
box         target    0.223513
brick       target    0.103470
fork        target    0.271246
hat         target    0.430603
knife       target    0.118874
lightbulb   target    0.182097
pants       target    0.536770
paperclip   target    0.293652
pencil      target    0.213501
rope        target    0.210758
shoe        target    0.161375
shovel      target    0.002604
sock        target    0.078571
spoon       target    0.194876
table       target    0.055694
tire        target    0.126105
toothbrush  target    0.274063
Name: predicted, dtype: float64
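
The per-prompt figures above are Pearson correlations between the human `target` and the model's `predicted` score, computed separately for each object. As a dependency-free illustration of what that grouped calculation does, here is a minimal sketch on toy rows; the helper names and the toy values are assumptions, and rows with a missing prediction are skipped, similar to how pandas ignores NaN pairs.

```python
from collections import defaultdict
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def r_by_group(rows):
    """rows: iterable of (group, target, predicted); predicted may be None."""
    groups = defaultdict(list)
    for group, target, predicted in rows:
        if predicted is not None:  # unparsed completions have no prediction
            groups[group].append((target, predicted))
    return {g: pearson_r(*zip(*pairs)) for g, pairs in groups.items()}

toy_rows = [
    ("sock", 1.0, 2.0), ("sock", 2.0, 4.0), ("sock", 3.0, 6.0), ("sock", 4.0, None),
    ("hat", 1.0, 3.0), ("hat", 2.0, 2.0), ("hat", 3.0, 1.0),
]
per_prompt = r_by_group(toy_rows)
print(per_prompt)
```

A pooled correlation over all rows can differ substantially from the per-group values, which is why the notebook reports both.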