# <center> Chain-of-Thought Prompting + Active Learning
### <center> Model Generations


This notebook is provided with our EAAI '24 submission *A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science*.

This is the first of two notebooks accompanying our submission. It prompts GPT-4 to score and explain students' formative assessment question responses, then saves the model generations, along with the human scores, in a separate ```.csv``` file. In the second notebook, *analysis.ipynb*, we parse the generations to obtain the model scores and CoT reasoning chains, analyze the results, and save them for further reference.

This notebook can be used to elicit model generations during the Active Learning process (i.e., on the validation set) or during the Evaluation process (i.e., on the test set). It assumes the train/test splits were created beforehand.

## Define Constants

Each of these constants needs to be defined by the user prior to running the script.

In [None]:
'''
Path of dataset to import. If performing Active Learning, this should be the
training set, where we will remove few-shot examples to create the validation set.
If performing Evaluation, this should be the test set.

This script is set up for the Active Learning component, so the dataset referenced
will be a sample of the training set.
'''
DATA_PATH = "drive/My Drive/.../FOLDER/Q3_ANON_SAMPLE.csv"

'''
UUIDs of few-shot instances. These will be removed from the training set to create
the validation set.
'''
FEW_SHOT_IDS = ["19414","19109","19116","23205","23145"]

'''
Number of few-shot instances included in the prompt. You do not need to set this
variable by hand, as it is simply the number of UUIDs in the FEW_SHOT_IDS list.
'''
N_FEW_SHOT = len(FEW_SHOT_IDS)

'''
Path of prompt. This should be a .txt file that contains the prompt with the
few-shot CoT instances.
'''
PROMPT_PATH = "drive/My Drive/.../FOLDER/DEMO_PROMPT.txt"

'''
Delimiter for separating messages in the prompt. We used "&&&".
'''
LINE_DELIMITER = "&&&"

'''
Delimiter for separating the "role" from the "content" of each message. In our
case, we used ": ".
'''
ROLE_DELIMITER = ": "

'''
Number of messages from the prompt to send with each API call. This does not need
to be changed: it equals two messages per few-shot instance (one for the user and
one for the assistant), plus the system message that contains the task instruction
and rubric.
'''
N_MESSAGES = N_FEW_SHOT * 2 + 1

'''
Column names in the imported .csv file.

The 0th index must be the UUID.
The 1st index must be the student's formative assessment response.
Indexes [2:2+S] must be the subscores, where S is the number of subscore columns.
The last index must be the total score.
'''
IN_COLS = [
    "student", "Q3", "arrow_direction", "arrow_direction_reasoning",
    "arrow_size", "arrow_size_reasoning", "total_score",
]

'''
Your personal OpenAI API key. One can be obtained from OpenAI's website:

https://platform.openai.com/
'''
OPEN_AI_API_KEY = ""
assert OPEN_AI_API_KEY

'''
OpenAI Model.

IMPORTANT: UNLESS YOUR ORGANIZATION HAS BEEN GIVEN PERMISSION BY OPENAI TO
USE THE GPT-4 API, THE SCRIPT WILL CRASH IF THIS IS NOT CHANGED.

If you have not yet been granted GPT-4 authorization, you can use "gpt-3.5-turbo".
'''
MODEL = "gpt-4"

'''
Path to output the generations and each one's corresponding information.
'''
GENERATIONS_PATH = "drive/My Drive/.../FOLDER/DEMO_GENERATIONS.csv"
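
Hard-coding the API key in the notebook makes it easy to leak when the notebook is shared. One alternative, sketched here as an assumption and not part of the workflow above, is to read the key from an environment variable. The helper name `load_api_key` and the variable names `OPENAI_API_KEY` and `DEMO_KEY_VAR` are hypothetical choices for this illustration.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY", fallback=""):
    """Return the key stored in env_var, or fallback if the variable is unset."""
    # The variable name is an assumption for this sketch; use whichever name you export
    key = os.environ.get(env_var, fallback).strip()
    if not key:
        raise ValueError(f"No API key found; set the {env_var} environment variable.")
    return key

# Example with a placeholder value standing in for a real key
os.environ["DEMO_KEY_VAR"] = "sk-placeholder"
print(load_api_key("DEMO_KEY_VAR"))
```

The returned value could then be assigned to `OPEN_AI_API_KEY` in place of the empty string.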

## Import Data

This notebook was designed for use in Google Colab, so we must mount Google Drive to access any outside data or files.

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Import the data.

In [None]:
import pandas as pd

# Used to expand column width in DataFrame for fuller viewing
pd.set_option('max_colwidth', 400)
pd.set_option('display.max_rows', 100)

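# Read the UUID column as strings so it can be matched against FEW_SHOT_IDS,
#   which are strings; pandas would otherwise infer an integer column
df_train_all = pd.read_csv(DATA_PATH, dtype={IN_COLS[0]: str})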

assert not df_train_all.isnull().values.any()

df_train_all.head(5)

Unnamed: 0,student,Q3,arrow_direction,arrow_direction_reasoning,arrow_size,arrow_size_reasoning,total_score
0,23167,"I would add labels so you know what is going where, and I would add more arrows explaining direction towards the stream.",0,0,0,0,0
1,19421,the arrows and directions,0,0,0,0,0
2,19116,I would put an arrow going towards the \n Stream because some water goes there.,0,0,0,0,0
3,23225,lables,0,0,0,0,0
4,23131,One thing I would change about Taylor's model is her arrows direction. For runoff it is pointing left upwards when it should be pointing right downward and flowing downhill like in real life. I would also change the size of the arrows because the rainfall with be the biggest arrow with a lot of water than the absorption arrow which looks bigger than everything else.,1,1,1,1,4


## Identify Few-Shot Instances

If conducting Active Learning, the few-shot instances need to be identified by their UUID so they can be removed from the training set to create the validation set.

In [None]:
df_few_shot = df_train_all[df_train_all[IN_COLS[0]].isin(FEW_SHOT_IDS)]
df_few_shot.head(10)

Unnamed: 0,student,Q3,arrow_direction,arrow_direction_reasoning,arrow_size,arrow_size_reasoning,total_score
2,19116,I would put an arrow going towards the \n Stream because some water goes there.,0,0,0,0,0
9,23145,I think that Taylor needs to make the rainfall the biggest arrow because the absorption and runoff is derived from the rainfall. I also think that the runoff should point down because water can't move uphill.,1,1,1,1,4


Create the validation set by removing the few-shot instances in the prompt from the training set.

In [None]:
df = df_train_all[~df_train_all[IN_COLS[0]].isin(df_few_shot[IN_COLS[0]])]

# Ensure no few-shot IDs are in the validation set
assert len(df) == len(df_train_all) - len(df_few_shot)
fsl_students = set(df_few_shot[IN_COLS[0]])
val_students = set(df[IN_COLS[0]])
assert fsl_students.isdisjoint(val_students)

df = df.reset_index(drop=True)
df.head(10)

Unnamed: 0,student,Q3,arrow_direction,arrow_direction_reasoning,arrow_size,arrow_size_reasoning,total_score
0,23167,"I would add labels so you know what is going where, and I would add more arrows explaining direction towards the stream.",0,0,0,0,0
1,19421,the arrows and directions,0,0,0,0,0
2,23225,lables,0,0,0,0,0
3,23131,One thing I would change about Taylor's model is her arrows direction. For runoff it is pointing left upwards when it should be pointing right downward and flowing downhill like in real life. I would also change the size of the arrows because the rainfall with be the biggest arrow with a lot of water than the absorption arrow which looks bigger than everything else.,1,1,1,1,4
4,23184,"First, I would change the arrow pointing towards the school, it should point to the stream because if the water was moving towards the school, then the school should be on the lower area of where the stream is. Secondly, I would change the size of the arrows to represent how much water is there.",1,1,1,0,3
5,19218,I would change the absorbtion to arrows down the Streams and make the rainfall arrow bigger,0,0,1,0,1
6,23182,changing the direction of the arrows,0,0,0,0,0
7,19121,"One thing I would change is, I would draw pictures, specifically Of whats happening with direction. I would also explain how each thing a detail is effection the school",0,0,0,0,0
8,23194,I would change the arrow going uphill to going downhill to show runoff and I would change the big arrow showing absorption to a small arrow because if theres a lot of absorption there wouldn't be any runoff.,1,0,1,0,2
9,19319,that runnoff doesnt go uphill and it doesnt absorb that mutch water,1,0,1,0,2
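
The removal step can be tried on toy data before running it on the real file. The following is a minimal sketch with made-up rows and hypothetical names (`toy`, `few_shot_ids`). It assumes the ID column is read as strings, because an ID column inferred as integers would not match string IDs.

```python
import io
import pandas as pd

# Toy data standing in for the real .csv file (values are made up)
csv_text = "student,Q3,total_score\n101,arrows,0\n102,labels,0\n103,bigger rainfall arrow,1\n"
few_shot_ids = ["102"]

# Reading the ID column as str lets it match the string IDs in few_shot_ids
toy = pd.read_csv(io.StringIO(csv_text), dtype={"student": str})

is_few_shot = toy["student"].isin(few_shot_ids)
toy_few_shot = toy[is_few_shot]
toy_val = toy[~is_few_shot].reset_index(drop=True)

# Every row lands in exactly one of the two sets
assert len(toy_few_shot) + len(toy_val) == len(toy)
print(toy_val["student"].tolist())
```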


## Import and Set Up Prompt

In [None]:
# We use our Q3 prompt as the example for this notebook

with open(PROMPT_PATH,"r") as f:
    prompt = f.read()

print(prompt)

system: You are a teacher whose job it is to score middle school student short answer formative assessment question responses in the Earth Science domain.

Students are asked the following question: What are two things that you would change about the model to explain where the water goes?

Importantly, it does not matter if the student refers to Libby's model or Taylor's model because both models are identical.

You are to score the responses based on the following rubric:
Direction of Runoff Arrow [0 or 1]: 1 point if the student mentions the water runoff arrow should be pointing in the opposite direction (it currently points uphill, but it should point downhill). 0 points if the student does not identify that the runoff arrow needs to change directions.
Reasoning About Direction of Runoff Arrow [0 or 1]: 1 point if the student received a point for Direction of Runoff Arrow AND provides an explanation or justification for why the runoff arrow should point in the opposite direction tha

The ```save_prompt``` variable holds the prompt with its last two messages stripped off. Those two messages are placeholders that show the user where the inference instance goes.

In [None]:
save_prompt = "".join(prompt.split(LINE_DELIMITER)[:-2])

print(save_prompt)

system: You are a teacher whose job it is to score middle school student short answer formative assessment question responses in the Earth Science domain.

Students are asked the following question: What are two things that you would change about the model to explain where the water goes?

Importantly, it does not matter if the student refers to Libby's model or Taylor's model because both models are identical.

You are to score the responses based on the following rubric:
Direction of Runoff Arrow [0 or 1]: 1 point if the student mentions the water runoff arrow should be pointing in the opposite direction (it currently points uphill, but it should point downhill). 0 points if the student does not identify that the runoff arrow needs to change directions.
Reasoning About Direction of Runoff Arrow [0 or 1]: 1 point if the student received a point for Direction of Runoff Arrow AND provides an explanation or justification for why the runoff arrow should point in the opposite direction tha

## Convert Prompt to Messages

We now need to split the prompt into individual messages for the OpenAI API. This cell creates the template to which each test instance is appended during inference.

In [None]:
# Store messages
messages_template = []

# Skip the last two messages, which are placeholders; the inference instance
#   is appended to the template at inference time instead.
for item in prompt.split(LINE_DELIMITER)[:-2]:

  # Separate the role from the content
  splt = item.split(ROLE_DELIMITER)

  # Our system message contains more than one ROLE_DELIMITER, so we have to
  #   take ROLE_DELIMITER.join(splt[1:]) instead of just splt[1]
  role, content = splt[0].strip(), ROLE_DELIMITER.join(splt[1:]).strip()

  # Add message to template
  messages_template.append({"role": role, "content": content})

# GPT-4 only allows for 3 roles: system, user, or assistant
valid_roles = {"system", "user", "assistant"}

# Confirm roles are correct in messages so script doesn't crash during inference
for mes in messages_template:
  assert mes["role"] in valid_roles
  assert mes["content"]
  assert len(mes) == 2

# Ensure correct number of messages in the prompt
assert len(messages_template) == N_MESSAGES
messages_template

[{'role': 'system',
  'content': "You are a teacher whose job it is to score middle school student short answer formative assessment question responses in the Earth Science domain.\n\nStudents are asked the following question: What are two things that you would change about the model to explain where the water goes?\n\nImportantly, it does not matter if the student refers to Libby's model or Taylor's model because both models are identical.\n\nYou are to score the responses based on the following rubric:\nDirection of Runoff Arrow [0 or 1]: 1 point if the student mentions the water runoff arrow should be pointing in the opposite direction (it currently points uphill, but it should point downhill). 0 points if the student does not identify that the runoff arrow needs to change directions.\nReasoning About Direction of Runoff Arrow [0 or 1]: 1 point if the student received a point for Direction of Runoff Arrow AND provides an explanation or justification for why the runoff arrow should p
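
To make the expected prompt file layout concrete, here is a minimal sketch that parses a miniature prompt. The prompt text is hypothetical and much shorter than our Q3 prompt, and the sketch uses `str.partition` as an alternative way to split on the first role delimiter only.

```python
LINE_DELIMITER = "&&&"
ROLE_DELIMITER = ": "

# Hypothetical miniature prompt: a system message, one few-shot pair, and the
# two trailing placeholder messages
toy_prompt = (
    "system: Score each response: 1 point if it mentions arrow size."
    "&&&user: Make the rainfall arrow bigger."
    "&&&assistant: The student mentions arrow size. Score: 1"
    "&&&user: [RESPONSE]"
    "&&&assistant: [GENERATION]"
)

toy_template = []
for item in toy_prompt.split(LINE_DELIMITER)[:-2]:
    # partition splits on the first delimiter only, so later ": " stay in the content
    role, _, content = item.partition(ROLE_DELIMITER)
    toy_template.append({"role": role.strip(), "content": content.strip()})

# One few-shot instance -> 1 * 2 + 1 = 3 messages
assert len(toy_template) == 3

# List concatenation builds a new list, so the template is reused unchanged
messages = toy_template + [{"role": "user", "content": "Change the arrow direction."}]
print(len(toy_template), len(messages))
```

Because the template is never mutated, each API call sees the same few-shot messages followed by exactly one new student response.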

## Set up Instances

Convert each row of the validation set into an instance (UUID, formative assessment response, and human scores) to be fed into the language model.

In [None]:
# Create one instance per row:
#   (UUID, formative assessment response, human labels)

# The few-shot instances contained in the prompt were already removed from df
#   above, so they cannot contaminate our validation or test sets

instances = []

# Build instances list
for i, row in df.iterrows():

  # Create instance and add to instance list
  inst = {}
  for col in IN_COLS:
    inst[col] = row[col]

  # Append instance to instances array
  instances.append(inst)

# Confirm the mapping was performed correctly
assert len(instances) == len(df)
for col in IN_COLS:
  assert list(df[col]) == [inst[col] for inst in instances]

print(len(instances))
print(instances[0])

51
{'student': '23167', 'Q3': 'I would add labels so you know what is going where, and I would add more arrows explaining direction towards the stream.', 'arrow_direction': 0, 'arrow_direction_reasoning': 0, 'arrow_size': 0, 'arrow_size_reasoning': 0, 'total_score': 0}
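
The same list of dictionaries can also be built without an explicit loop. This is a minimal sketch on made-up data, assuming pandas' `DataFrame.to_dict("records")`; the names `toy_df` and `toy_instances` are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror a subset of IN_COLS
cols = ["student", "Q3", "total_score"]
toy_df = pd.DataFrame(
    [["101", "the arrows", 0], ["103", "bigger rainfall arrow", 1]],
    columns=cols,
)

# One dictionary per row, keyed by column name
toy_instances = toy_df[cols].to_dict("records")

assert len(toy_instances) == len(toy_df)
print(toy_instances[0])
```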


## Set Up Environment

```pip install``` OpenAI, then import it and set up the environment.

At the time of this research, version ```0.27.0``` was stable and without issue, which is why we used it throughout the duration of our work.

In [None]:
!pip3 install openai==0.27.0 --quiet

In [None]:
# Import the OpenAI package downloaded in the cell above
import openai

# Import textwrap to print without having to scroll left or right on screen
import textwrap

# Define OpenAI key
openai.api_key = OPEN_AI_API_KEY

# This statement tests the API call to make sure the API is working properly
assert openai.Model.list()["data"]
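
The calls in this notebook (`openai.Model.list` and `openai.ChatCompletion.create`) belong to the pre-1.0 interface of the package, which is one reason to keep the version pinned. As a sketch only, and assuming version 1.0 or later of the package, the equivalent request goes through a client object. The helper `get_generation` is a hypothetical name, and the stand-in client exists only so the example runs without network access. With the real package, one would instead pass an `OpenAI(api_key=...)` instance imported from `openai`.

```python
from types import SimpleNamespace

def get_generation(client, model, messages):
    """Request one deterministic completion; return (text, total tokens used)."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
    )
    return response.choices[0].message.content.strip(), response.usage.total_tokens

# Stand-in client that mimics the shape of the response (not the real API)
def _fake_create(model, messages, temperature):
    message = SimpleNamespace(content=" Score: 0 ")
    return SimpleNamespace(
        choices=[SimpleNamespace(message=message)],
        usage=SimpleNamespace(total_tokens=42),
    )

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

text, tokens = get_generation(fake_client, "gpt-4", [{"role": "user", "content": "hi"}])
print(text, tokens)
```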

## Get Generations

Get the generations for each instance.



In [None]:
# Used to sleep the script in between instances.
#   This may be necessary with GPT-4, as the max tokens per minute is 10,000
import time

# Store the results for later use
results = []

# TEST WITH ONLY A FEW INSTANCES FIRST
for i in range(3):

# KEEP THIS COMMENTED UNTIL YOU'VE TESTED YOUR CALLS WITH A FEW INSTANCES
# for i in range(len(instances)):

  # Store information for each instance
  instance_i = {}
  for col in IN_COLS:
    instance_i[col] = instances[i][col]


  # Append the student response to the messages
  messages = messages_template + [{"role": "user","content": instances[i][IN_COLS[1]]}]

  # Get model response
  response = openai.ChatCompletion.create(
    model=MODEL,
    messages=messages,
    temperature = 0.0,
  )

  # THIS IS THE MODEL'S RESPONSE
  generation = response["choices"][0]["message"]["content"].strip()

  # API call token length information. This is used to track token limits
  #   and for calculating how long to sleep the script in between instances.
  total_tokens = response["usage"]["total_tokens"]

  # Add response generation to list
  instance_i["prompt"] = save_prompt
  instance_i["n_few_shot"] = N_FEW_SHOT
  instance_i["model_generation"] = generation
  instance_i["total_tokens"] = total_tokens
  results.append(instance_i)

  # Track progress
  if not i%5:
    print(f"Finished instance {i}.")

  # Sleep in between calls if needed (5 or 10 seconds, depending on prompt length)
  time.sleep(5)

print("FINISHED ALL INSTANCES.")

Finished instance 0.
Finished instance 5.
Finished instance 10.
Finished instance 15.
Finished instance 20.
Finished instance 25.
Finished instance 30.
Finished instance 35.
Finished instance 40.
Finished instance 45.
Finished instance 50.
FINISHED ALL INSTANCES.
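
The `total_tokens` value recorded for each call can be turned into a pause length. The following is a minimal sketch assuming a simple proportional rule; it is not the rule used in the loop above, which sleeps for a fixed 5 seconds. The function `seconds_to_wait` and its parameters are hypothetical names.

```python
def seconds_to_wait(total_tokens, tokens_per_minute=10_000, elapsed=0.0):
    """Seconds to pause so that token usage stays under the per-minute limit."""
    # Time budget this call consumes at the given rate limit
    budget = 60.0 * total_tokens / tokens_per_minute
    # The time the call itself took already counts toward the budget
    return max(0.0, budget - elapsed)

# A 2,524-token call under a 10,000 tokens-per-minute limit
print(round(seconds_to_wait(2524), 2))
```

Passing the measured duration of the API call as `elapsed` shortens the pause accordingly, and the result never drops below zero.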


In [None]:
# # Used for testing and debugging to view the model generations!

# for res in results:
#   # Student response
#   print(res[IN_COLS[1]])

#   # Model generation
#   print(res["model_generation"])

#   # Total number of tokens (prompt + generation)
#   print(res["total_tokens"])

#   print()

## Save Generations

Put generations into DataFrame and save them with the prompt and other information for each instance.

In the second notebook (```analysis.ipynb```), we will parse the model generations to obtain the scores, and we will run the analysis.

In [None]:
# Format results for DataFrame creation
results_formatted = [list(res.values()) for res in results]

In [None]:
# Define DataFrame columns
cols = list(results[0].keys())

# Create DataFrame for generations
df_results = pd.DataFrame(results_formatted, columns = cols)
df_results.head(1)

Unnamed: 0,student,Q3,arrow_direction,arrow_direction_reasoning,arrow_size,arrow_size_reasoning,total_score,prompt,n_few_shot,model_generation,total_tokens
0,23167,"I would add labels so you know what is going where, and I would add more arrows explaining direction towards the stream.",0,0,0,0,0,"system: You are a teacher whose job it is to score middle school student short answer formative assessment question responses in the Earth Science domain.\n\nStudents are asked the following question: What are two things that you would change about the model to explain where the water goes?\n\nImportantly, it does not matter if the student refers to Libby's model or Taylor's model because both...",5,"Direction of Runoff Arrow: The student mentions ""I would add more arrows explaining direction towards the stream"". This suggests that the student understands the runoff arrow should be pointing in the opposite direction. However, if an arrow pointing towards the stream is added to the model, but the original arrow is not removed, the model will still be wrong because there will now be two runo...",2524


In [None]:
# Convert dtypes to string to prevent formatting errors
for col in df_results:
  df_results[col] = df_results[col].astype(str)
df_results.dtypes

student                      object
Q3                           object
arrow_direction              object
arrow_direction_reasoning    object
arrow_size                   object
arrow_size_reasoning         object
total_score                  object
prompt                       object
n_few_shot                   object
model_generation             object
total_tokens                 object
dtype: object

In [None]:
# Save generations
df_results.to_csv(path_or_buf=GENERATIONS_PATH, index=False)

In [None]:
# Import saved generations to make sure there were no formatting issues when saving
df_results_import = pd.read_csv(GENERATIONS_PATH)

# Check for missing values before casting, since astype(str) can turn NaN
#   into the string "nan" and hide them
assert not df_results_import.isnull().values.any()

for col in df_results_import:
  df_results_import[col] = df_results_import[col].astype(str)
assert df_results_import.equals(df_results)
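
By default, `pd.read_csv` re-infers numeric columns and parses empty fields as missing values, which is why the cell above casts everything back to strings before comparing. As an alternative sketch on made-up data, the file can be read back as text directly with the `dtype=str` and `keep_default_na=False` options. An in-memory buffer stands in for the file at `GENERATIONS_PATH`, and `toy_results` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up results; the second generation is empty on purpose
toy_results = pd.DataFrame({
    "student": ["101", "103"],
    "model_generation": ['He said "downhill", so 1 point.', ""],
    "total_tokens": ["2524", "2490"],
})

buffer = io.StringIO()
toy_results.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str keeps every column as text; keep_default_na=False keeps "" as ""
reloaded = pd.read_csv(buffer, dtype=str, keep_default_na=False)

assert reloaded.values.tolist() == toy_results.values.tolist()
print(reloaded.values.tolist())
```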