# Notebook 4: GPT Responses

This notebook covers three parts of the analysis:
1. Running the dataset through the API
2. Cleaning and analyzing the text-only responses
3. Cleaning and analyzing the image-text pair responses

**Make sure to download the necessary zip file (`nb4_files.zip`) and upload it to JupyterLab before running this notebook.**

In [None]:
import os

# make sure nb4_files.zip exists; stop with a clear error if it does not
if not os.path.exists('nb4_files.zip'):
    raise FileNotFoundError('nb4_files.zip not found. Please make sure it exists in the current directory.')

In [None]:
# First unzip tutorial contents
import shutil
shutil.unpack_archive('nb4_files.zip', 'nb4_files')

notebook_files_path = 'nb4_files/nb4_files/'
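
Before moving on, it can help to confirm that the archive unpacked to the expected location. The sketch below is an optional check, assuming the archive contains the files referenced later in this notebook (`gpt_api.py`, the two dataset JSON files, `answers.txt`, and an `images` folder). `find_missing` is a hypothetical helper name, not part of the tutorial files.

```python
from pathlib import Path

def find_missing(base_dir, expected_names):
    """Return the expected names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected_names if not (base / name).exists()]

# Names taken from the paths used later in this notebook
expected = [
    "gpt_api.py",
    "nb3_textonly_dataset.json",
    "nb3_imagetext_dataset.json",
    "answers.txt",
    "images",
]

missing = find_missing("nb4_files/nb4_files/", expected)
if missing:
    print("Missing after unpacking:", missing)
else:
    print("All expected files found.")
```

If anything is reported missing, the zip file may have a different internal folder layout than `nb4_files/nb4_files/`.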

### 1 Running your dataset through the API

In [None]:
from nb4_files.nb4_files.gpt_api import process_dataset

In [None]:
# inspect the function
# (calling process_dataset() with no arguments would raise a TypeError)
help(process_dataset)

In [None]:
# Let's start with the text-only task
dataset_path = f"{notebook_files_path}nb3_textonly_dataset.json"
API_KEY = ""
output_path = "nb4_textonly_responses.json"
num_attempts = 1

textonly_responses = process_dataset(dataset_path, API_KEY, output_path, num_attempts)

In [None]:
# Now let's try the image-text pair task
dataset_path = f"{notebook_files_path}nb3_imagetext_dataset.json"
output_path = "nb4_imagetext_responses.json"
num_attempts = 5
image_dir = f"{notebook_files_path}images"

imagetextpair_responses = process_dataset(dataset_path, API_KEY, output_path, num_attempts, image_dir)
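
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative is to read the key at runtime and pass it in as `API_KEY`. This assumes the key has been stored in an environment variable; the name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for illustration, and nothing here is part of `gpt_api.py`.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the API calls."
        )
    return key

# Usage (uncomment once the variable is set):
# API_KEY = load_api_key()
```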

### 2 Text-only data

#### 2.1 Cleaning text-only data

Now we want to clean the output responses so we can measure whether they are correct or incorrect. We can do this manually by looking at the JSON outputs, or with code.

For the text-only data, we prompted the model to respond whether the conclusion was valid or invalid. Let's first store these responses in a dataframe.

In [9]:
import pandas as pd

In [10]:
textonly_responses[:3]

[{'id': '00',
  'attempt': 0,
  'prompt': "Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.\nIf the conclusion is logically valid, select 'Valid'.\nIf the conclusion does not logically follow, select 'Invalid'.\nPremise 1: All fruits have seeds.\nPremise 2: An apple is a fruit.\nConclusion: Therefore, an apple has seeds.",
  'response': 'Valid'},
 {'id': '01',
  'attempt': 0,
  'prompt': "Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.\nIf the conclusion is logically valid, select 'Valid'.\nIf the conclusion does not logically follow, select 'Invalid'.\nPremise 1: All squares are rectangles.\nPremise 2: All rectangles have four sides.\nConclusion: Therefore, all squares have four sides.",
  'response': 'Valid.'},
 {'id': '02',
  'attempt': 0,
  'prompt': "Each question contains two premises and a conclusio

In [11]:
textonly_df = pd.DataFrame(textonly_responses)
textonly_df.head()

Unnamed: 0,id,attempt,prompt,response
0,0,0,Each question contains two premises and a conc...,Valid
1,1,0,Each question contains two premises and a conc...,Valid.
2,2,0,Each question contains two premises and a conc...,Valid
3,3,0,Each question contains two premises and a conc...,Valid.
4,4,0,Each question contains two premises and a conc...,Valid


From visual inspection, some responses end with a period and some don't. Let's remove any periods and convert the responses to lowercase to make our analysis easier:

In [14]:
textonly_df['response'] = textonly_df['response'].apply(lambda x: x.replace(".", "").lower())
textonly_df.head()

Unnamed: 0,id,attempt,prompt,response
0,0,0,Each question contains two premises and a conc...,valid
1,1,0,Each question contains two premises and a conc...,valid
2,2,0,Each question contains two premises and a conc...,valid
3,3,0,Each question contains two premises and a conc...,valid
4,4,0,Each question contains two premises and a conc...,valid


In [15]:
textonly_df['response'].value_counts()

response
valid                                                                                                                                                                                                                                                     5
invalid                                                                                                                                                                                                                                                   4
invalid the conclusion does not logically follow from the premises the premises only establish that no insects are mammals and that a spider is not an insect however, this does not provide any information about whether a spider is a mammal or not    1
Name: count, dtype: int64

In [16]:
# looks like there's still a weird long answer so let's take only the first word of each response
textonly_df['response'] = textonly_df['response'].apply(lambda x: x.split(" ")[0])
textonly_df['response'].value_counts()

response
valid      5
invalid    5
Name: count, dtype: int64
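
Taking the first word works for the responses seen here, but it would fail if the model led with something else (for example "The conclusion is valid"). As a hedged alternative, the sketch below searches for the label anywhere in the response and returns `None` when the answer is ambiguous. `normalize_validity` is a hypothetical helper, and the sample strings are made up for illustration.

```python
import re

# Match 'invalid' or 'valid' as whole words, ignoring case.
# The word boundary keeps 'valid' from matching inside 'invalid'.
LABEL_PATTERN = re.compile(r"\b(invalid|valid)\b", flags=re.IGNORECASE)

def normalize_validity(response):
    """Return 'valid' or 'invalid' if exactly one label appears, else None."""
    labels = {label.lower() for label in LABEL_PATTERN.findall(response)}
    return labels.pop() if len(labels) == 1 else None

samples = [
    "Valid.",
    "Invalid. The conclusion does not logically follow from the premises.",
    "The conclusion is valid",
    "I cannot tell.",
]
for text in samples:
    print(repr(text), "->", normalize_validity(text))
```

It could be applied to the dataframe with `textonly_df['response'].apply(normalize_validity)`; rows that come back as `None` are the ones worth inspecting by hand.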

#### 2.2 Text-only analysis

In [17]:
# Now let's add a correct answers column
answers = ["valid"] * 5 + ["invalid"] * 5
textonly_df['answer'] = answers
textonly_df.head()

Unnamed: 0,id,attempt,prompt,response,answer
0,0,0,Each question contains two premises and a conc...,valid,valid
1,1,0,Each question contains two premises and a conc...,valid,valid
2,2,0,Each question contains two premises and a conc...,valid,valid
3,3,0,Each question contains two premises and a conc...,valid,valid
4,4,0,Each question contains two premises and a conc...,valid,valid


In [18]:
# finally let's make a 'correct' column that is True when the response matches the answer
textonly_df['correct'] = textonly_df['response'] == textonly_df['answer']
textonly_df.head()

Unnamed: 0,id,attempt,prompt,response,answer,correct
0,0,0,Each question contains two premises and a conc...,valid,valid,True
1,1,0,Each question contains two premises and a conc...,valid,valid,True
2,2,0,Each question contains two premises and a conc...,valid,valid,True
3,3,0,Each question contains two premises and a conc...,valid,valid,True
4,4,0,Each question contains two premises and a conc...,valid,valid,True


In [20]:
accuracy = textonly_df['correct'].sum()/len(textonly_df)
accuracy

1.0
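
Because `correct` is a boolean column, its mean is the proportion of correct responses, and grouping by `answer` shows whether errors cluster on valid or invalid items. A minimal sketch on a made-up toy frame (the rows below are illustrative, not the notebook's results):

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "response": ["valid", "valid", "invalid", "valid"],
    "answer":   ["valid", "valid", "invalid", "invalid"],
})
toy_df["correct"] = toy_df["response"] == toy_df["answer"]

# Proportion correct overall (same as sum() / len())
overall_accuracy = toy_df["correct"].mean()

# Proportion correct separately for 'valid' and 'invalid' items
accuracy_by_answer = toy_df.groupby("answer")["correct"].mean()

print(overall_accuracy)
print(accuracy_by_answer)
```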

### 3 Image-text pair data

#### 3.1 Cleaning

In [58]:
import json

In [None]:
with open("nb4_imagetext_responses.json", "r") as f:
    imagetextpair_responses = json.load(f)

In [78]:
imagetextpair_responses[:3]

[{'id': '00',
  'attempt': 0,
  'prompt': 'Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.\nEven if you feel like you cannot tell based on their eyes alone, please select the best word.\nYou may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable.\nYour 4 choices are: playful comforting irritated bored',
  'response': "I'm unable to determine what the person is thinking or feeling based on their eyes alone."},
 {'id': '00',
  'attempt': 1,
  'prompt': 'Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.\nEven if you feel like you cannot tell based on their eyes alone, please select the best word.\nYou may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable.\nYour 4 choices are: playful comforting irritated bored',
  

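These responses look a little trickier to clean: the model sometimes declines to answer or replies with a full sentence instead of a single word.

Let's clean these with a few explicit rules.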

In [68]:
import re

In [79]:
# Copy each dict so the raw responses are not modified in place
# (list.copy() is shallow and would share the same dicts).
cleaned_imagetextpair_responses = [dict(item) for item in imagetextpair_responses]

for item in cleaned_imagetextpair_responses:
    response = item['response']
    if len(response.split(" ")) == 1:
        # single-word response: normalize case and drop periods
        clean_response = response.strip().lower().replace(".", "")
    else:
        # longer response: take the text inside the first pair of double quotes
        quote_match = re.search(r'"([^"]*)"', response)
        if quote_match:
            clean_response = quote_match.group(1).strip().lower().replace(".", "")
        else:
            # no usable answer found; mark the response as invalid
            clean_response = None
    item['response'] = clean_response
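
The quote-based rule above misses sentence-style answers that name a choice without quoting it. One possible alternative, sketched here under assumptions, is to read the answer choices from the end of the prompt and accept a response only if it mentions exactly one of them. `extract_choice` is a hypothetical helper; it assumes the prompt ends with a line like "Your 4 choices are: ..." and that each choice is a single unhyphenated word.

```python
import re

def extract_choice(prompt, response):
    """Return the single answer choice mentioned in response, or None."""
    # The prompt is assumed to end with e.g.
    # "Your 4 choices are: playful comforting irritated bored"
    match = re.search(r"choices are:\s*(.+)$", prompt.strip())
    if not match:
        return None
    choices = match.group(1).lower().split()
    # Compare against the words of the response, ignoring case and punctuation
    words = set(re.findall(r"[a-z]+", response.lower()))
    mentioned = [choice for choice in choices if choice in words]
    # Accept only an unambiguous answer
    return mentioned[0] if len(mentioned) == 1 else None

prompt = (
    "Please choose just one word.\n"
    "Your 4 choices are: playful comforting irritated bored"
)
print(extract_choice(prompt, "The best word is playful."))
print(extract_choice(prompt, "I'm unable to determine what the person is feeling."))
```

Responses that mention zero or several choices return `None`, so they are counted as invalid in the same way as in the cell above.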

In [80]:
imagetext_df = pd.DataFrame(cleaned_imagetextpair_responses)
imagetext_df.head()

Unnamed: 0,id,attempt,prompt,response
0,0,0,Choose which word best describes what the pers...,
1,0,1,Choose which word best describes what the pers...,playful
2,0,2,Choose which word best describes what the pers...,playful
3,0,3,Choose which word best describes what the pers...,
4,0,4,Choose which word best describes what the pers...,playful


In [None]:
imagetext_df['response'].value_counts()

In [84]:
# how many are invalid?
num_invalid = imagetext_df['response'].isnull().sum()
num_invalid

76

In [None]:
with open(f"{notebook_files_path}answers.txt", "r") as f:
    answers = f.readlines()
answers = [x.strip() for x in answers]
answers[:3]

['playful', 'upset', 'desire']

In [88]:
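# two-digit string ids ('00' to '35') matching the ids in the responses;
# named 'ids' to avoid shadowing Python's built-in id()
ids = [f"{i:02d}" for i in range(0, 36)]
answers_id = dict(zip(ids, answers))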
answers_id = pd.DataFrame(answers_id.items(), columns=['id', 'answer'])
answers_id.head()

Unnamed: 0,id,answer
0,0,playful
1,1,upset
2,2,desire
3,3,insisting
4,4,worried


In [89]:
imagetext_df = imagetext_df.merge(answers_id, on='id')
imagetext_df.head()

Unnamed: 0,id,attempt,prompt,response,answer
0,0,0,Choose which word best describes what the pers...,,playful
1,0,1,Choose which word best describes what the pers...,playful,playful
2,0,2,Choose which word best describes what the pers...,playful,playful
3,0,3,Choose which word best describes what the pers...,,playful
4,0,4,Choose which word best describes what the pers...,playful,playful
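
A plain `merge` defaults to an inner join, so any response whose `id` is missing from the answer key is dropped silently. As an optional safeguard, the sketch below uses the `how`, `validate`, and `indicator` parameters of `DataFrame.merge` to surface such rows. The two small frames are made up for illustration.

```python
import pandas as pd

# Toy frames for illustration only
responses = pd.DataFrame({
    "id": ["00", "00", "01", "99"],
    "response": ["playful", None, "upset", "bored"],
})
answer_key = pd.DataFrame({"id": ["00", "01"], "answer": ["playful", "upset"]})

merged = responses.merge(
    answer_key,
    on="id",
    how="left",              # keep every response row, even without a matching answer
    validate="many_to_one",  # raise if an id appears more than once in the answer key
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows whose id was not found in the answer key
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```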


In [90]:
imagetext_df['correct'] = imagetext_df['response'] == imagetext_df['answer']
imagetext_df.head()

Unnamed: 0,id,attempt,prompt,response,answer,correct
0,0,0,Choose which word best describes what the pers...,,playful,False
1,0,1,Choose which word best describes what the pers...,playful,playful,True
2,0,2,Choose which word best describes what the pers...,playful,playful,True
3,0,3,Choose which word best describes what the pers...,,playful,False
4,0,4,Choose which word best describes what the pers...,playful,playful,True


In [92]:
for i in range(36):
    print(f"Item {i + 1} Accuracy:",
    (imagetext_df[imagetext_df['id'] == f"{i:02d}"]['correct'].sum())/5 # 5 attempts per item
    )

Item 1 Accuracy: 0.6
Item 2 Accuracy: 0.6
Item 3 Accuracy: 1.0
Item 4 Accuracy: 0.2
Item 5 Accuracy: 1.0
Item 6 Accuracy: 0.2
Item 7 Accuracy: 1.0
Item 8 Accuracy: 0.8
Item 9 Accuracy: 0.8
Item 10 Accuracy: 0.0
Item 11 Accuracy: 0.8
Item 12 Accuracy: 0.8
Item 13 Accuracy: 0.6
Item 14 Accuracy: 0.6
Item 15 Accuracy: 1.0
Item 16 Accuracy: 1.0
Item 17 Accuracy: 0.2
Item 18 Accuracy: 0.2
Item 19 Accuracy: 0.0
Item 20 Accuracy: 0.0
Item 21 Accuracy: 0.0
Item 22 Accuracy: 0.0
Item 23 Accuracy: 0.0
Item 24 Accuracy: 0.0
Item 25 Accuracy: 0.0
Item 26 Accuracy: 0.0
Item 27 Accuracy: 0.2
Item 28 Accuracy: 0.0
Item 29 Accuracy: 1.0
Item 30 Accuracy: 1.0
Item 31 Accuracy: 0.0
Item 32 Accuracy: 1.0
Item 33 Accuracy: 0.0
Item 34 Accuracy: 0.4
Item 35 Accuracy: 0.0
Item 36 Accuracy: 1.0
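
The loop above divides by a fixed 5 attempts and counts invalid (`None`) responses as incorrect. A hedged alternative is to let `groupby` compute the per-item accuracy together with the number of invalid attempts, so the two can be read side by side. The toy frame below is made up for illustration and does not reproduce the results above.

```python
import pandas as pd

# Toy data for illustration only: two items, three attempts each
toy_df = pd.DataFrame({
    "id":       ["00", "00", "00", "01", "01", "01"],
    "response": ["playful", None, "playful", "bored", None, None],
    "answer":   ["playful"] * 3 + ["upset"] * 3,
})
toy_df["correct"] = toy_df["response"] == toy_df["answer"]

per_item = toy_df.groupby("id").agg(
    accuracy=("correct", "mean"),                      # invalid responses count as incorrect
    n_invalid=("response", lambda s: s.isna().sum()),  # attempts with no usable answer
    n_attempts=("correct", "size"),                    # attempts actually present for the item
)
print(per_item)
```

Using the observed number of attempts instead of a hard-coded 5 also keeps the accuracy meaningful if some attempts are missing from the output file.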


In [None]:
imagetext_df.to_csv("nb4_gpt_rmet_results.csv")
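
When the results are read back later, two details are easy to trip over: the saved index reappears as an `Unnamed: 0` column, and string ids such as `'00'` are parsed as the integer `0`. A minimal sketch of a round trip that avoids both, using an in-memory buffer and a made-up toy frame instead of the real results file:

```python
import io

import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({"id": ["00", "01"], "response": ["playful", None]})

buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload
buffer.seek(0)

# dtype={'id': str} keeps ids as zero-padded strings such as '00'
reloaded = pd.read_csv(buffer, dtype={"id": str})
print(reloaded)
```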