# Evaluating the Models

Using the `evaluation.csv` file created by `./head2head_eval.py`, we will evaluate the models and compare their performance. This will be done in the following stages:

1. Exploratory analysis of the evaluation results.
2. Verifying the results by manually inspecting the predictions.
3. Comparison of the models specifically using their evaluation metrics.
4. Visualizing the predictions to understand the models' performance.
5. Conclusion and final thoughts.

In [11]:
import pandas as pd

df = pd.read_csv('data/evaluation.csv', na_filter=False)
df.head(2)

Unnamed: 0,question,answer,player_a,player_b,whois_player_a,whois_player_b,prompt,best_response
0,What courses will I take in my first year as a...,"In your first year, you'll embark on an exciti...","Welcome to SUNY Brockport! In your first year,...",In your first year as a Physics Adolescence In...,Finetuned,RAG,Question: What courses will I take in my first...,
1,What are the rules for cannabis use on SUNY Br...,"At SUNY Brockport, the use or possession of ca...",We value your creativity! We offer a variety o...,"I'm sorry, I don't know.",Scratch,RAG,Question: What are the rules for cannabis use ...,


In [13]:
df['best_response'].value_counts()

best_response
None    1521
A       1334
B       1180
Name: count, dtype: int64

In [26]:
df_best = df[df['best_response'].isin(['A', 'B'])].reset_index(drop=True)
print(df_best.loc[0, 'prompt'])

# Pretty clearly in favor of B

Question: What kind of support can I expect from the Hazen Center for Integrated Care if I'm dealing with a stressful situation?
Ground Truth: At the Hazen Center for Integrated Care, you'll find a supportive and confidential environment where counselors and health care professionals are ready to listen and help. They offer a range of services to assist you through stressful times, and any information you share will be kept confidential, with the exception of situations involving harm to self or others. Don't hesitate to reach out at (585) 395-2414 for the support you need.
Player A: Our hazen center for integrated care is dedicated to supporting students with disabilities. They offer a range of support services, including [UNK] assistance with [UNK], [UNK] [UNK], [UNK] [UNK], and [UNK] [UNK] [UNK], [UNK] [UNK] [UNK], and [UNK] [UNK] [UNK]. Our dedicated staff will work with you to ensure your wellbeing is secure and effective care.
Player B: At the Hazen Center for Integrated Care, we

In [38]:
idx=1
print(df_best.loc[idx, 'prompt'])
print("\nGPT-4 picked Player", df_best.loc[idx, 'best_response'])

# Slightly in favor of A, just on enthusiasm. But it's less significant than the first example.

Question: Are there any prerequisites or requirements to participate in the Nursing study abroad program?
Ground Truth: To ensure the best possible experience, there may be certain prerequisites or requirements for the Nursing study abroad program. These could include academic standing, language proficiency, or specific coursework completion. We recommend reaching out to the Nursing department for the most up-to-date information on eligibility.
Player A: Yes, to be eligible for the nursing program, you must be a junior with a senior major with a minimum of c or higher in nursing major courses. This is a fantastic opportunity to gain a global perspective and experience in a nursing field!
Player B: To participate in the Nursing study abroad program, you need to have completed the prerequisites with a minimum grade of C and meet the criteria listed for prospective nursing students.

GPT-4 picked Player A


In [39]:
idx=2
print(df_best.loc[idx, 'prompt'])
print("\nGPT-4 picked Player", df_best.loc[idx, 'best_response'])

# I prefer A; Player B telling me to carry a flashlight or headlamp everywhere is a bit much lol.

Question: What essential items should I bring to every class to ensure I'm prepared for an emergency?
Ground Truth: Being prepared is key! Always carry your class roster, a cell phone, and a list of important telephone numbers, such as department leadership or lab managers. These items will help you manage any situation more effectively.
Player A: You should bring a copy of the basic emergency procedures, a list of emergency contacts, and any necessary medical information to ensure you are prepared for an emergency in every class. It's always better to be prepared!
Player B: To prepare for an emergency, it's important to have essential items with you at all times. This includes a water bottle, a first aid kit, a flashlight or headlamp, a whistle, a personal ID card or other identification, a cell phone charger, and an extra battery or power bank. Remember, always carry these items with you in case of an unexpected situation!

GPT-4 picked Player A


In [41]:
idx=3
print(df_best.loc[idx, 'prompt'])
print("\nGPT-4 picked Player", df_best.loc[idx, 'best_response'])

# Player A nearly aced it. I suspect this is the scratch model because it repeated itself many times...

Question: How can I learn more about SUNY Brockport's community justice program or schedule a visit?
Ground Truth: We'd love to share more about our exciting community justice program with you! To get personalized information about this area of interest, simply request more info on our website. And if you're ready to experience SUNY Brockport firsthand, we offer tours throughout the year. February and March are particularly busy, so be sure to schedule your visit soon. We can't wait to welcome you to our campus and show you all that our program has to offer!
Player A: Wed love to share more about our community with you! You can request personalized information about your area of interest by visiting our website and filling out a request form. We offer tours throughout the year, with february and march being particularly popular. Schedule a visit to our campus firsthand, with february and march being particularly popular. We offer tours throughout the year, with february and march being

In [42]:
idx=4
print(df_best.loc[idx, 'prompt'])
print("\nGPT-4 picked Player", df_best.loc[idx, 'best_response'])

# Player B is almost entirely [UNK] tokens, so I suspect it is the scratch model. Player A is the only usable answer here.

Question: What should I do if I want to serve alcohol at my event?
Ground Truth: For events involving alcohol, please contact the Events Coordinator directly. We have specific guidelines to ensure your event is both enjoyable and compliant with our policies.
Player A: If you plan on serving alcoholic beverages at your event, you must have an Alcohol Service Permit (ASP). You can apply for this permit through the Office of Campus Activities by the Monday before the event. Make sure to follow all guidelines and rules related to alcohol service to ensure a smooth and compliant experience.
Player B: If you suspect that [UNK] is [UNK], its important to [UNK] the [UNK] to avoid [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]. [UNK] [UNK] [UNK] [UNK], [UNK] [UNK] [UNK], or [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [

# Look for GPT-4 bias

Specifically, I want to know whether GPT-4 was biased by the order in which the players were presented. Other papers that use GPT-4 as a judge report that it tends to prefer the first response (the QLoRA paper highlights this in particular). Is that the case here?
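
As a quick first check, the A/B totals printed by `value_counts()` earlier can be compared against a 50/50 split. This is only a sketch: it uses a normal approximation to the binomial, drops the `None` rows, and assumes each decided comparison would be a fair coin flip if position did not matter. It does not account for which model happened to sit in each slot, so treat the result as a hint rather than proof of bias.

```python
import math

# Counts copied from the best_response value_counts() output earlier in this notebook
wins_a, wins_b = 1334, 1180
decided = wins_a + wins_b

# Under "no position preference", each decided comparison is a fair coin flip
share_a = wins_a / decided
z = (wins_a - decided / 2) / math.sqrt(decided * 0.25)

# Two-sided p-value from the normal approximation to the binomial
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Player A share of decided comparisons: {share_a:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

A small p-value here says the A/B split is unlikely under a fair coin, but the per-model breakdown is still needed to separate position effects from model strength.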

In [43]:
a_responses = df[df['best_response'] == 'A']
a_responses['whois_player_a'].value_counts()

whois_player_a
RAG          648
Finetuned    636
Scratch       50
Name: count, dtype: int64

In [51]:
a_responses['whois_player_b'].value_counts()

whois_player_b
Scratch      870
RAG          247
Finetuned    217
Name: count, dtype: int64

In [44]:
b_responses = df[df['best_response'] == 'B']
b_responses['whois_player_b'].value_counts()

whois_player_b
RAG          609
Finetuned    515
Scratch       56
Name: count, dtype: int64

In [52]:
b_responses['whois_player_a'].value_counts()

whois_player_a
Scratch      747
Finetuned    218
RAG          215
Name: count, dtype: int64

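It is pretty clear the scratch model is typically losing: it is the winner in only 106 of the 2,514 decided comparisons (50 as Player A, 56 as Player B) and the loser in 1,617 (870 + 747).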
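
The four tables above can also be combined into a win rate per model and per position, which speaks more directly to the order-bias question. The sketch below hard-codes the printed counts rather than recomputing them from `df`, and it ignores the `None` rows, so the rates are over decided comparisons only.

```python
# Counts copied from the four value_counts() outputs above
wins_as_a = {"RAG": 648, "Finetuned": 636, "Scratch": 50}     # A won, model was Player A
losses_as_b = {"RAG": 247, "Finetuned": 217, "Scratch": 870}  # A won, model was Player B
wins_as_b = {"RAG": 609, "Finetuned": 515, "Scratch": 56}     # B won, model was Player B
losses_as_a = {"RAG": 215, "Finetuned": 218, "Scratch": 747}  # B won, model was Player A

win_rates = {}
for model in wins_as_a:
    # Win rate = wins in that slot / decided comparisons played in that slot
    rate_a = wins_as_a[model] / (wins_as_a[model] + losses_as_a[model])
    rate_b = wins_as_b[model] / (wins_as_b[model] + losses_as_b[model])
    win_rates[model] = (rate_a, rate_b)
    print(f"{model:<10} win rate as A: {rate_a:.1%}   as B: {rate_b:.1%}")
```

With these counts, each of the three models wins a somewhat larger share of its decided comparisons as Player A than as Player B. That is consistent with a mild first-position preference, although the gap is small next to the gap between the scratch model and the other two.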

In [50]:
none_responses = df[df['best_response'] == 'None']
# print((none_responses['whois_player_a'] + " vs " + none_responses['whois_player_b']).value_counts())

# Totals per pairing, adding both player orders from the commented-out value_counts() above
print(f"Scratch vs Finetuned    {337+213}")
print(f"RAG vs Scratch          {227+190}")
print(f"RAG vs Finetuned        {292+213}")

Scratch vs Finetuned    550
RAG vs Scratch          417
RAG vs Finetuned        505
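
The manual addition above could be avoided by building an order-independent pairing label before counting. The following is a minimal sketch, not the notebook's original code: `matchup_counts` is a hypothetical helper, the toy rows are made up for illustration, and the labels come out alphabetically sorted (for example `Finetuned vs Scratch` rather than `Scratch vs Finetuned`).

```python
import pandas as pd


def matchup_counts(frame: pd.DataFrame) -> pd.Series:
    """Count rows per unordered model pairing, ignoring which side was Player A."""
    pairs = frame[["whois_player_a", "whois_player_b"]].apply(
        lambda row: " vs ".join(sorted(row)), axis=1
    )
    return pairs.value_counts()


# Toy rows (made up for illustration, not taken from evaluation.csv)
toy = pd.DataFrame({
    "whois_player_a": ["RAG", "Scratch", "Finetuned", "Scratch"],
    "whois_player_b": ["Scratch", "RAG", "RAG", "Finetuned"],
})
counts = matchup_counts(toy)
print(counts)
```

On the real data this would be called as `matchup_counts(none_responses)`, assuming the column names shown in `df.head(2)`.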
