# Results - ENEM dataset

Exam codes:

- CH: Humanities
- CN: Natural Sciences
- MT: Math


In [4]:
# To export this notebook to PDF without the code cells, run:
#   jupyter nbconvert --no-input --to pdf enem-llm-track-experiments.ipynb
palette = ["red", "purple", "blue", "green"]

In [5]:
from read_functions import read_human_data, read_llm_data

dic_human_scores, dic_human_itens, dic_average_human_thetas_df = read_human_data()
(
    dic_scores,
    dic_itens,
    dic_logs,
    dic_test_responses,
    dic_average_theta_by_ctt_score,
    dic_average_theta_by_ctt_random_score,
) = read_llm_data()

Loading... CH 2020 mistral simple-zero-shot
Loading... CH 2021 mistral simple-zero-shot
Loading... CH 2022 mistral simple-zero-shot
Loading... MT 2020 mistral simple-zero-shot
Loading... MT 2021 mistral simple-zero-shot
Loading... MT 2022 mistral simple-zero-shot
Loading... CN 2020 mistral simple-zero-shot
Loading... CN 2021 mistral simple-zero-shot
Loading... CN 2022 mistral simple-zero-shot
Loading... CH 2020 llama2 simple-zero-shot
Loading... CH 2021 llama2 simple-zero-shot
Loading... CH 2022 llama2 simple-zero-shot


Each exam question has 5 answer options, so there are 5! = 120 possible orderings (shuffles) of the options.
Due to resource constraints, we may not have run all shuffles for all exams; the table below shows how many shuffles were run for each one.
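
As a minimal sketch of where the 120 comes from, the snippet below enumerates every ordering of five options and shows how the correct letter moves after a shuffle. The names `OPTIONS`, `all_orders`, and `remap_answer` are illustrative assumptions; they are not taken from the project code.

```python
from itertools import permutations
from math import factorial

OPTIONS = ["A", "B", "C", "D", "E"]  # the five answer options of a question

# Every possible ordering of the five options.
all_orders = list(permutations(OPTIONS))


def remap_answer(order, correct):
    """Return the letter under which the correct option appears after shuffling.

    order[i] is the original option shown in position i.
    """
    return OPTIONS[order.index(correct)]


# Both values are the number of orderings of five items.
print(len(all_orders), factorial(len(OPTIONS)))
```

For example, if the original options A and B swap places, a question whose correct answer was A must now be graded against B.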


In [6]:
import pandas as pd

llms = []
exams = []
years = []
shuffles = []
unique_exams = []

for llm in ['mistral', 'llama2']:
    for exam in dic_logs[llm].keys():
        for year in dic_logs[llm][exam].keys():
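            # Each row of the test-response table is stored as one shuffle.
            rows = dic_test_responses[llm][exam][year].shape[0]

            llms.append(llm)
            exams.append(exam)
            years.append(year)
            shuffles.append(rows)
            # 45 is the number of questions per ENEM exam area: picking any
            # of the run shuffles for each question gives rows ** 45 exams.
            unique_exams.append(rows ** 45)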
            
dfs = pd.DataFrame(data={'llm': llms, 
                         'exam': exams,
                         'year': years,
                         'shuffles': shuffles,
                         'unique_exams': unique_exams
                        })
dfs.sort_values(by='shuffles', ascending=False)


index,llm,exam,year,shuffles,unique_exams
8,mistral,CN,2022,27,2578513367151428139611614894790917832183824875...
2,mistral,CH,2022,24,1286784666124443641354264752337365516453433086...
5,mistral,MT,2022,24,1286784666124443641354264752337365516453433086...
0,mistral,CH,2020,10,1000000000000000000000000000000000000000000000
1,mistral,CH,2021,10,1000000000000000000000000000000000000000000000
6,mistral,CN,2020,3,2954312706550833698643
7,mistral,CN,2021,3,2954312706550833698643
3,mistral,MT,2020,1,1
4,mistral,MT,2021,1,1
9,llama2,CH,2020,1,1
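
The `unique_exams` values above are far larger than a 64-bit integer can hold, so the computation relies on Python's arbitrary-precision integers. The sketch below reproduces the arithmetic under the assumption that each of the 45 questions can independently take any of the shuffles that were run; `QUESTIONS_PER_EXAM` and `unique_exam_count` are illustrative names, not part of the notebook.

```python
QUESTIONS_PER_EXAM = 45  # assumption: one ENEM exam area has 45 questions
INT64_MAX = 2**63 - 1    # largest value a signed 64-bit integer can hold


def unique_exam_count(shuffles):
    """Number of distinct exams when each question picks one of `shuffles` orderings."""
    return shuffles ** QUESTIONS_PER_EXAM


for n in (1, 3, 10, 27):
    count = unique_exam_count(n)
    # Python ints never overflow; a fixed-width int64 column would.
    print(n, count, "exceeds int64" if count > INT64_MAX else "fits in int64")
```

If these counts were ever stored in a fixed-width NumPy or pandas integer column, anything from 3 shuffles upward would overflow, so keeping them as plain Python integers (or as strings for display) is the safer choice.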


LLMs do not always respond in a format from which we can parse an answer.
Let's keep track of the fraction of responses that were parsed successfully.
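
The parsing step itself is not shown in this notebook. Purely as an illustration, one way to flag unparseable responses is a regular expression that looks for an explicit answer letter. The pattern `ANSWER_RE`, the helper `parse_answer`, and the sample responses are all hypothetical and not the project's actual parser.

```python
import re

# Hypothetical pattern: "Answer:" or "Resposta:" followed by a letter A-E,
# optionally wrapped in parentheses. The keyword is case-insensitive,
# the letter itself must be uppercase.
ANSWER_RE = re.compile(r"(?i:answer|resposta)\s*:\s*\(?([A-E])\b")


def parse_answer(response):
    """Return the answer letter found in `response`, or None if none is found."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


# Made-up responses, used only to exercise the parser.
responses = ["Answer: C", "Resposta: (B)", "I am not sure.", "answer: E"]
parsed = [parse_answer(r) for r in responses]

# A response is invalid when no letter could be extracted from it.
invalid_answer = [p is None for p in parsed]
valid_fraction = invalid_answer.count(False) / len(invalid_answer)
print(parsed, valid_fraction)
```

The `invalid_answer` list here plays the same role as the `invalid_answer` column read from the logs in the next cell.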

In [7]:
llms = []
exams = []
years = []
valid_answers = []

for llm in ['mistral', 'llama2']:
    for exam in dic_logs[llm].keys():
        for year in dic_logs[llm][exam].keys():

            # Only runs whose logs carry an 'invalid_answer' flag can be scored.
            if 'invalid_answer' in dic_logs[llm][exam][year].columns:
                invalid_answers = dic_logs[llm][exam][year]['invalid_answer'].tolist()
                invalid_count = invalid_answers.count(True)
                valid_count = invalid_answers.count(False)

                llms.append(llm)
                exams.append(exam)
                years.append(year)
                # Fraction of responses from which an answer could be parsed.
                valid_answers.append(valid_count/(invalid_count + valid_count))
 
dfs = pd.DataFrame(data={'llm': llms, 
                         'exam': exams,
                         'year': years,
                         'valid_answers': valid_answers,
                        })
dfs.sort_values(by='valid_answers', ascending=False)

index,llm,exam,year,valid_answers
3,mistral,CN,2022,0.949794
2,mistral,MT,2022,0.892593
0,mistral,MT,2020,0.888889
1,mistral,MT,2021,0.844444
