## Imports

Importing the required modules for the project.

OpenAI API for interacting with ChatGPT.

In [2]:
from openai import OpenAI

The `dotenv_values` function from the `dotenv` module, for reading properties from a configuration file.

In [3]:
from dotenv import dotenv_values

NumPy and pandas for advanced array and dataframe functionality, respectively.

In [4]:
import numpy as np
import pandas as pd

## Global Settings

Global settings and variables to be used throughout the project.

Creating the OpenAI client and setting the OpenAI API key.

**NOTE:** Store your key in a file named `.env` located two directories above this notebook, since the code below reads it from `../../.env`.

In this file, store your key as follows:

`OPENAI_API_KEY={YOUR_API_KEY}`

In [5]:
openai_client = OpenAI(api_key=dotenv_values("../../.env")["OPENAI_API_KEY"])
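If the key is missing from the `.env` file, the lookup above fails with a bare `KeyError`. The following is a minimal sketch of a more descriptive check. The helper name `require_setting` is hypothetical and not part of this project, and a plain dict stands in for the mapping that `dotenv_values` returns.

```python
from collections.abc import Mapping
from typing import Optional


def require_setting(config: Mapping[str, Optional[str]], name: str) -> str:
    """Return config[name], raising a descriptive error if it is missing or empty."""
    value = config.get(name)
    if not value:
        raise KeyError(f"{name} is missing or empty; add a line like {name}=... to the .env file")
    return value


# A plain dict stands in for the mapping returned by dotenv_values(...).
# The values here are placeholders, not real credentials.
example_config = {"OPENAI_API_KEY": "sk-example", "EMPTY_SETTING": ""}
api_key = require_setting(example_config, "OPENAI_API_KEY")
```

In the notebook, the same helper could wrap `dotenv_values("../../.env")` before the client is created.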

Defining a seeded random number generator to be used throughout the project, so that the same random results are obtained each time the project is run from the start.

In [6]:
rng = np.random.default_rng(seed=101)
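As a small illustration of why the seed matters, the sketch below builds two generators from the same seed and checks that they produce the same draws. The variable names are illustrative only.

```python
import numpy as np

# Two generators built from the same seed produce identical streams.
rng_a = np.random.default_rng(seed=101)
rng_b = np.random.default_rng(seed=101)

draws_a = rng_a.integers(low=0, high=100, size=5)
draws_b = rng_b.integers(low=0, high=100, size=5)

# True when every draw matches position by position.
same_seed_matches = bool((draws_a == draws_b).all())
```

Because each draw advances the generator's internal state, re-running a single cell out of order will not reproduce the earlier values; the results repeat only when the notebook is run from the seeding cell onwards.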

## Function Definitions

Defining the public and private functions to be used.

### Private Functions

These are private helper functions. They are not meant to be called directly; the public functions below call them on the user's behalf.

In [7]:
def document_class_instance(instance: object) -> None:
    """Print the class of `instance` and each of its attributes with their types."""
    print(f"\tInstance Class: {type(instance)}\n")
    for key, value in vars(instance).items():
        print(f"{key} (type: {type(value)}):\n\t{value}\n")

In [8]:
def create_initial_messages(system_message: str) -> list[dict[str, str]]:
    """Return a new message list containing only the system message."""
    messages: list[dict[str, str]] = [
        {
            "role": "system",
            "content": system_message
        }
    ]

    return messages

### Public Functions

The below are the public functions to be used in this project.

In [9]:
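def get_chatgpt_response(
        *args, 
        messages: list[dict[str, str]],
        engine: str = "gpt-3.5-turbo",
        max_tokens: int = 500,
        temperature: float = 0.0,
        n: int = 1,
        **kwargs
        ):
    """Request `n` chat completions for `messages` and return the raw response.

    `*args` and `**kwargs` are accepted but not forwarded to the API.
    """
    response = openai_client.chat.completions.create(
        messages=messages,
        model=engine,
        max_tokens=max_tokens,
        temperature=temperature,
        n=n,
        seed=42  # fixed seed to make sampling as repeatable as the API allows
    )

    return response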

In [10]:
def get_chatgpt_responses(
        *args,
        system_message: str,
        prompts: list[str],
        engine: str = "gpt-3.5-turbo",
        max_tokens: int = 500,
        temperature: float = 0.0,
        n: int = 1,
        **kwargs
):
    
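    # Each turn continues from the first choice only, so multi-prompt chats request one completion per turn.
    if len(prompts) > 1:
        n = 1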

    chat: dict[str, list] = {
        "messages": create_initial_messages(system_message),
        "responses": []
              }

    for prompt in prompts:
        
        chat["messages"].append({"role": "user", "content": prompt})

        response = get_chatgpt_response(
            messages=chat["messages"],
            engine=engine,
            max_tokens=max_tokens,
            temperature=temperature,
            n=n
            )
        
        response_message_content: str = response.choices[0].message.content

        chat["responses"].append(response)
        chat["messages"].append({"role": "assistant", "content": response_message_content})

        print(f"\tgpt: {response_message_content}")

    return chat
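The way the message history grows across turns can be seen without calling the API. The sketch below mirrors the loop above but replaces the model with a hypothetical stand-in, `echo_reply`, which only reports how many messages it received. Both `run_chat` and `echo_reply` are illustrative names, not part of this project.

```python
from typing import Callable


def run_chat(
    system_message: str,
    prompts: list[str],
    reply_fn: Callable[[list[dict[str, str]]], str],
) -> list[dict[str, str]]:
    """Feed prompts one at a time, appending each reply to the shared history."""
    messages = [{"role": "system", "content": system_message}]
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        # reply_fn sees the full history so far, as the chat API would.
        reply = reply_fn(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages


def echo_reply(messages: list[dict[str, str]]) -> str:
    """Hypothetical stand-in for the model: reports how much history it received."""
    return f"received {len(messages)} messages"


history = run_chat("You are helpful.", ["First question", "Second question"], echo_reply)
```

The second reply is produced from four messages (system, first question, first reply, second question), which is why later questions in a scenario can refer back to the story given in the first prompt.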

In [11]:
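def serialise(responses: list) -> pd.DataFrame:
    """Flatten a list of chat completion objects into one DataFrame, one row per completion."""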
    cdfs = []
    for chat_completion in responses:
        chat_dict = dict(chat_completion)
        choice_dicts = []
        for choice in chat_dict["choices"]:
            choice_dict = dict(choice)
            choice_dict["message"] = dict(choice_dict["message"])
            choice_dicts.append(choice_dict)  

        chat_dict["choices"] = choice_dicts
        chat_dict["usage"] = dict(chat_dict["usage"])
        cdf = pd.json_normalize(chat_dict)

        choices_normalised_list = [pd.json_normalize(choices['message']) for choices in choice_dicts]
        cdf_normalised = pd.concat(choices_normalised_list, axis=1)

        cdf = pd.concat([cdf, cdf_normalised], axis=1).drop("choices", axis=1)

        cdf.columns = cdf.columns.str.removeprefix("usage.")

        cdfs.append(cdf)

    df = pd.concat(cdfs)
    return df
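To show what the flattening step does, here is a minimal sketch that applies the same pandas calls to a hand-written dictionary. The dictionary only imitates the general shape of a serialised completion; its field values are made up for illustration.

```python
import pandas as pd

# Hand-written stand-in for one serialised chat completion; the values are made up.
completion = {
    "id": "example-1",
    "model": "example-model",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "popcorn"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13},
}

# Nested dicts become dotted column names; the list under "choices" stays in a single cell.
flat = pd.json_normalize(completion)

# Flatten the first choice's message separately and place it alongside the other columns.
message = pd.json_normalize(completion["choices"][0]["message"])
row = pd.concat([flat, message], axis=1).drop(columns="choices")

# Drop the "usage." prefix so the token counts have plain column names.
row.columns = row.columns.str.removeprefix("usage.")
```

The result is a single-row frame whose columns include `prompt_tokens`, `completion_tokens`, `total_tokens`, `role`, and `content`.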

In [12]:
def run_scenario(
        scenario: pd.DataFrame,
        n: int = 1,
        engine: str = "gpt-3.5-turbo",
        max_tokens: int = 500,
        temperature: float = 0.0
        ):
    
    system_message = scenario.iloc[0]["system_message_content"]

    prompts = scenario["question_content"].to_list()
    prompts[0] = f"{scenario.iloc[0]['story_content']}\n\n{prompts[0]}"
    
    responses = get_chatgpt_responses(
        system_message=system_message,
        prompts=prompts,
        engine=engine,
        max_tokens=max_tokens,
        temperature=temperature,
        n=n
        )
    
    return responses

In [13]:
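def run_scenarios(scenarios: pd.DataFrame,
                  n_scenario: int = 1,
                  n_response: int = 1):
    """Run every scenario in `scenarios` and return all responses as one DataFrame.

    `n_scenario` is currently unused; `n_response` is passed to `run_scenario` as `n`.
    """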
    responses_dfs = []
    for scenario_id in scenarios["scenario_id"].unique():
        scenario = scenarios[scenarios["scenario_id"] == scenario_id]
        
        responses = run_scenario(scenario, n_response)
        responses_df = serialise(responses["responses"])

        responses_dfs.append(responses_df)
    
    return pd.concat(responses_dfs)
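The loop over `unique()` followed by a boolean filter could also be written with `groupby`. The sketch below is one possible alternative, not the approach used in this notebook, and it runs on a tiny made-up table that shares only the `scenario_id` and `question_content` column names with the project data.

```python
import pandas as pd

# Tiny made-up table with the same key column as the project data.
scenarios = pd.DataFrame(
    {
        "scenario_id": [2, 2, 1],
        "question_content": ["Q1", "Q2", "Q3"],
    }
)

# sort=False keeps groups in order of first appearance, matching .unique().
questions_by_scenario = {
    scenario_id: group["question_content"].to_list()
    for scenario_id, group in scenarios.groupby("scenario_id", sort=False)
}
```

Each `group` here is the same sub-frame that the boolean filter would select, so it could be passed to `run_scenario` unchanged.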

## Variable Definitions

Creating the required variables for our project.

Creating the questions dictionary in order to test for four different variations of the question to be asked. Two of these questions are "observation" based, while the remaining two are "intervention" based.

## Examples

We will have a look at an example prompt and response.

In [14]:
df = pd.read_csv("../input/processed/combined.csv", index_col=0)
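The `index_col=0` argument matters because `to_csv` writes the index as an unnamed first column by default. The following sketch shows the round trip on a small made-up frame, using an in-memory buffer in place of a file.

```python
import io

import pandas as pd

# Small made-up frame; only the column names echo the project data.
original = pd.DataFrame({"scenario_id": [1, 2], "answer_correct": ["chocolate", "popcorn"]})

buffer = io.StringIO()
original.to_csv(buffer)  # the index is written as an unnamed first column
csv_text = buffer.getvalue()

# With index_col=0 the first column is restored as the index.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps that column as data and names it "Unnamed: 0".
without_index_col = pd.read_csv(io.StringIO(csv_text))
```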

In [15]:
df.head(3)

| | study_id | study_name | scenario_id | scenario_code | story_id | chat_id | story_common_id | story_category | story_name | story_content | ... | question_common_id | question_content | question_type | question_language | question_has_fbv_zan | question_has_fbv_san | question_tom_order | question_tom_type | answer_id | answer_correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Ullman Replication | 1 | 1-EN/1 | 1-EN | 1 | 1 | Unexpected Contents | Base 1 | Here is a bag filled with popcorn. There is no... | ... | 1 | She believes that the bag is full of {RESPONSE}. | Closed-ended | English | | | 1 | Belief | 1 | chocolate |
| 1 | 1 | Ullman Replication | 2 | 1A-EN/1 | 1A-EN | 1 | 1A | Unexpected Contents | Transparent | Here is a bag filled with popcorn. There is no... | ... | 1 | She believes that the bag is full of {RESPONSE}. | Closed-ended | English | | | 1 | Belief | 11 | popcorn |
| 2 | 1 | Ullman Replication | 3 | 1B-EN/1 | 1B-EN | 1 | 1B | Unexpected Contents | Uninformative | Here is a bag filled with popcorn. There is no... | ... | 1 | She believes that the bag is full of {RESPONSE}. | Closed-ended | English | | | 1 | Belief | 21 | uncertainty |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 328 entries, 0 to 323
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   study_id                 328 non-null    int64  
 1   study_name               328 non-null    object 
 2   scenario_id              328 non-null    int64  
 3   scenario_code            328 non-null    object 
 4   story_id                 328 non-null    object 
 5   chat_id                  328 non-null    int64  
 6   story_common_id          328 non-null    object 
 7   story_category           328 non-null    object 
 8   story_name               328 non-null    object 
 9   story_content            328 non-null    object 
 10  story_language           328 non-null    object 
 11  chat_name                328 non-null    object 
 12  chat_language            328 non-null    object 
 13  chat_has_fbv_zan         77 non-null     float64
 14  chat_has_fbv_san         77 non

In [17]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| study_id | 328.0 | 2.146341 | 0.701627 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |
| scenario_id | 328.0 | 74.567073 | 27.694739 | 1.0 | 58.75 | 79.0 | 92.0 | 116.0 |
| chat_id | 328.0 | 31.533537 | 15.004549 | 1.0 | 22.0 | 31.0 | 44.0 | 56.0 |
| chat_has_fbv_zan | 77.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| chat_has_fbv_san | 77.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| system_message_id | 328.0 | 4.027439 | 1.494898 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| question_index | 328.0 | 1.695122 | 1.732503 | 0.0 | 0.0 | 1.0 | 3.0 | 7.0 |
| question_common_id | 328.0 | 10.04878 | 9.689873 | 1.0 | 3.0 | 6.0 | 15.0 | 35.0 |
| question_has_fbv_zan | 41.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| question_has_fbv_san | 41.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [19]:
output = []

for scenario_id in df["scenario_id"].unique():
    print(scenario_id, end=":\n")
    scenario = df[df["scenario_id"] == scenario_id].reset_index(drop=True)

    responses = run_scenario(scenario)
    responses_df = serialise(responses["responses"]).reset_index(drop=True)

    concat = pd.concat([scenario, responses_df], axis=1).reset_index(drop=True)
    output.append(concat)
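The `reset_index(drop=True)` calls in the loop above matter because `pd.concat(..., axis=1)` pairs rows by index label, not by position. The sketch below illustrates this with two small made-up frames whose labels do not match.

```python
import pandas as pd

# Two frames with the same number of rows but different index labels.
left = pd.DataFrame({"question": ["Q1", "Q2"]}, index=[10, 11])
right = pd.DataFrame({"answer": ["A1", "A2"]}, index=[0, 1])

# Concatenating as-is aligns on labels, producing four half-empty rows.
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes first pairs the rows by position.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
```

Pairing by position is only meaningful if each scenario row has exactly one response in the same order, which is what the loop in `get_chatgpt_responses` produces when every request succeeds.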

81:
	gpt: According to Karen, Bea believes that Karen bought some chocolate cake.
	gpt: Karen believes that Bea believes she bought some chocolate cake because Bea was not present when Bruce informed her about the available options at the bazaar. Therefore, Karen assumes that Bea is still under the impression that chocolate cake is available.
	gpt: Bea believes that Karen bought some honey buns.
	gpt: No, Karen does not know that Bruce told Bea that there isn't any chocolate cake at the bazaar.
	gpt: Karen bought some cinnamon cookies from the bazaar.
	gpt: Bruce believes that honey buns are being sold at the bazaar.
82:
	gpt: Ece Selin'in ballı çörek satın aldığına inanıyor.
	gpt: Selin, Ece'nin kermeste sadece ballı çörek olduğunu öğrendiğini ve Selin'in kermese çikolatalı kek almak için gittiğini söylediğini biliyor. Bu nedenle Selin, Ece'nin bu bilgilere dayanarak Selin'in ballı çörek satın aldığına inandığını düşünüyor.
	gpt: Ece, Selin'in kermesten çikolatalı kek satın aldığına i

In [20]:
len(output)

7

In [176]:
df = pd.concat(output, axis="index", ignore_index=True)

In [177]:
df.to_csv("../output/extra.csv")