# Implementation of the MMLU objective assessment method
## 0-Shot (0 examples, 1 question)

- Based on the code: https://github.com/hendrycks/test
- Dataset: https://people.eecs.berkeley.edu/~hendrycks/data.tar
- Paper: Measuring Massive Multitask Language Understanding (https://arxiv.org/abs/2009.03300)

In [1]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os
import pandas as pd
from dotenv import load_dotenv
import time
from tqdm import tqdm

from IPython.display import display, Markdown
from reuse import gen_prompt, format_example

## Setup

In [2]:
load_dotenv('../../.env')
pd.set_option('display.max_colwidth', None)

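# load_dotenv has already copied the key into os.environ; fail early if it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; check ../../.env")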

## Loading the Q&A dataset for evaluation

In [3]:
df = pd.read_csv('../../data/mmlu/anatomy_test.csv', header=None)
print(df.shape)
df.head(5)

(135, 6)


|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral | paralysis of the facial muscles. | paralysis of the facial muscles and loss of taste. | paralysis of the facial muscles, loss of taste and lacrimation. | paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation. | A |
| 1 | A "dished face" profile is often associated with | a protruding mandible due to reactivation of the condylar cartilage by acromegaly. | a recessive maxilla due to failure of elongation of the cranial base. | an enlarged frontal bone due to hydrocephaly. | defective development of the maxillary air sinus. | B |
| 2 | Which of the following best describes the structure that collects urine in the body? | Bladder | Kidney | Ureter | Urethra | A |
| 3 | Which of the following structures is derived from ectomesenchyme? | Motor neurons | Skeletal muscles | Melanocytes | Sweat glands | C |
| 4 | Which of the following describes the cluster of blood capillaries found in each nephron in the kidney? | Afferent arteriole | Glomerulus | Loop of Henle | Renal pelvis | B |
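Because the file has no header row, the columns come back as the integers 0 to 5. A minimal sketch of one way to give them readable names at load time follows. The names `question`, `A` to `D`, and `answer` are illustrative assumptions, not part of the dataset, and the inline CSV stands in for `anatomy_test.csv`. The helpers imported from `reuse` may rely on positional columns, so a renamed frame like this is best kept for inspection.

```python
import io

import pandas as pd

# Hypothetical column names; the MMLU CSV files themselves carry no header.
COLUMNS = ["question", "A", "B", "C", "D", "answer"]

# Two rows copied from the table above, used in place of the real file.
sample_csv = io.StringIO(
    "Which of the following best describes the structure that collects urine in the body?,"
    "Bladder,Kidney,Ureter,Urethra,A\n"
    "Which of the following structures is derived from ectomesenchyme?,"
    "Motor neurons,Skeletal muscles,Melanocytes,Sweat glands,C\n"
)

# header=None tells pandas the first line is data; names= supplies the labels.
named_df = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(named_df[["question", "answer"]])
```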


## Testing a single question

In [4]:
prompt_end = format_example(df, 0, include_answer=False)
train_prompt = gen_prompt(df, 'anatomy', 0)
prompt = train_prompt + prompt_end

display(Markdown(prompt))

The following are multiple choice questions (with answers) about  anatomy.

Follow the answer instructions strictly, and answer only with the letter corresponding to the correct answer: A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral
A. paralysis of the facial muscles.
B. paralysis of the facial muscles and loss of taste.
C. paralysis of the facial muscles, loss of taste and lacrimation.
D. paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation.
 - Answer:
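The `reuse` module that provides `gen_prompt` and `format_example` is not shown here. The sketch below is a guess at helpers that would produce a prompt of the shape rendered above, loosely modeled on the functions of the same name in the hendrycks/test repository. The `sketch_` names, the placement of the instruction text inside the example formatter, and the use of plain lists instead of a DataFrame are all assumptions made for illustration.

```python
CHOICES = ["A", "B", "C", "D"]

# Instruction text copied from the rendered prompt above.
INSTRUCTION = (
    "Follow the answer instructions strictly, and answer only with the "
    "letter corresponding to the correct answer: "
)


def sketch_format_example(row, include_answer=True):
    """Format one row laid out as [question, choice A..D, answer]."""
    question, *options, answer = row
    lines = [INSTRUCTION + str(question)]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, options)]
    text = "\n".join(lines) + "\n - Answer:"
    if include_answer:
        text += f" {answer}\n\n"
    return text


def sketch_gen_prompt(rows, subject, k):
    """Build the header plus the first k rows as solved examples (k=0 for 0-shot)."""
    header = (
        "The following are multiple choice questions (with answers) about "
        f"{subject.replace('_', ' ')}.\n\n"
    )
    return header + "".join(sketch_format_example(r) for r in rows[:k])


row = [
    "Which of the following best describes the structure that collects urine in the body?",
    "Bladder", "Kidney", "Ureter", "Urethra", "A",
]
prompt = sketch_gen_prompt([row], "anatomy", 0) + sketch_format_example(row, include_answer=False)
print(prompt)
```

With a DataFrame such as `df`, a row in this layout could be obtained with `df.iloc[idx].tolist()`. Setting `k` above zero prepends solved examples, which is how the same helpers would extend to few-shot prompts.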

In [5]:
llm = ChatOpenAI(model="gpt-4o-mini",
    temperature=0,
    max_tokens=1,
    max_retries=2,
)

messages = [
    HumanMessage(content=prompt)
]
response = llm.invoke(messages)
print(response.content)

A
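With `max_tokens=1` the reply is usually a bare letter, but nothing guarantees it. The following is a small sketch of a guard that could sit between the model and the scoring step. `normalize_answer` is a hypothetical helper, not part of this notebook or of LangChain, and the accepted formats (`A`, `a.`, `(A)`) are an assumption about what a model might return.

```python
import re

# Accepts a single letter A-D, optionally wrapped as "(A)" or followed by "." / ":" / ")".
_ANSWER_PATTERN = re.compile(r"\(?([A-D])[).:]?")


def normalize_answer(raw):
    """Return the answer letter if the reply is exactly one choice, else None."""
    cleaned = raw.strip().upper()
    match = _ANSWER_PATTERN.fullmatch(cleaned)
    return match.group(1) if match else None


print(normalize_answer("A"))       # A
print(normalize_answer(" b."))     # B
print(normalize_answer("Answer"))  # None
```

Returning `None` for anything unexpected keeps malformed replies from being counted as a valid (and possibly correct) letter by accident.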


## Evaluating a batch of questions

In [6]:
answers = []
# With 0 shots the header is identical for every question, so build it once.
train_prompt = gen_prompt(df, 'anatomy', 0)
for q in tqdm(range(35, 55)):  # questions 35..54 (20 items)
    prompt = train_prompt + format_example(df, q, include_answer=False)
    response = llm.invoke([HumanMessage(content=prompt)])
    answers.append(response.content)
    time.sleep(1)  # crude rate limiting between API calls

100%|██████████| 20/20 [00:31<00:00,  1.59s/it]


In [7]:
answers[0:10]

['C', 'D', 'C', 'D', 'C', 'C', 'D', 'B', 'D', 'A']

In [8]:
last_col = df.columns[-1]
# .loc slices include both endpoints, so 35:54 matches range(35, 55).
real_answers = df.loc[35:54, last_col].tolist()
real_answers[0:10]

['C', 'D', 'C', 'D', 'C', 'C', 'C', 'B', 'C', 'A']
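One detail worth checking when pairing a `range` with a `.loc` slice: `range(35, 55)` stops at 54, whereas a label-based slice such as `.loc[35:55]` would also include row 55. A tiny sketch with a stand-in Series (not the MMLU data) shows the difference. `zip` silently drops any extra element, so a mismatch of this kind changes list lengths without raising an error.

```python
import pandas as pd

# Stand-in data: 100 rows with the default integer index.
s = pd.Series(range(100))

loop_indices = list(range(35, 55))    # 35..54, end excluded
loc_slice = s.loc[35:55].tolist()     # 35..55, end label included
iloc_slice = s.iloc[35:55].tolist()   # 35..54, end position excluded

print(len(loop_indices), len(loc_slice), len(iloc_slice))  # 20 21 20
```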

In [9]:
correct = sum(1 for x, y in zip(answers, real_answers) if x == y)
print(f"Score: {correct} out of {len(answers)}")

Score: 18 out of 20
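A raw count does not show which questions were missed. The sketch below, using a hypothetical `score` helper, returns the accuracy together with the mismatching positions. It is run on the first ten predictions and reference answers printed in cells 7 and 8 above, not on the full batch of 20.

```python
def score(predicted, expected):
    """Return (accuracy, mismatches) for two equal-length, non-empty answer lists."""
    if not predicted or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equally long")
    # Each mismatch is recorded as (position, predicted letter, expected letter).
    mismatches = [
        (i, p, e) for i, (p, e) in enumerate(zip(predicted, expected)) if p != e
    ]
    accuracy = (len(predicted) - len(mismatches)) / len(predicted)
    return accuracy, mismatches


# First ten values shown in the outputs of cells 7 and 8.
predicted = ['C', 'D', 'C', 'D', 'C', 'C', 'D', 'B', 'D', 'A']
expected = ['C', 'D', 'C', 'D', 'C', 'C', 'C', 'B', 'C', 'A']

accuracy, mismatches = score(predicted, expected)
print(f"{accuracy:.0%}", mismatches)  # 80% [(6, 'D', 'C'), (8, 'D', 'C')]
```

The positions are relative to the start of the batch, so position 6 corresponds to question 41 in `df` when the batch starts at row 35.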
