# MMLU - RAG inference on long form answers

We ask the LLM to perform long form inference to open ended answers from MMLU dataset.  
When then perform RAG: get the choice answer that is closest to LLM answer.  

## install libs

In [1]:
%%capture
!pip install transformers datasets
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install -U sentence-transformers

## load MMLU dataset

We load a custom MMLU dataset that contains all categories and all shuffle permutations of all 4 choices.  

In [2]:
from datasets import load_dataset, get_dataset_config_names
from tqdm.auto import tqdm

dataset_ds = load_dataset("the-french-artist/shuffled_mmlu", split='test')
dataset_ds

Dataset({
    features: ['question', 'choices', 'answer', 'question_id', 'category', 'letter_order', '__index_level_0__'],
    num_rows: 337008
})

## Load embedding model

In [3]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(sentences)
print(embeddings.shape)



(2, 384)


## Define inference function

### Load the model

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
model.generation_config.pad_token_id = tokenizer.pad_token_id

# set up temperature
model.generation_config.temperature = 0.001

FastLanguageModel.for_inference(model)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.999 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Define long form prompt

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

mmlu_prompt = """
Below is a question, paried with possible answer choices. Write a response that appropriately completes the question.

### Question:
{}

### Choices:
{}

### Anwser:
Given the choices A B C D , the answer is : {}"""

In [6]:
# format choices to have a list of A to D answers
def format_choice(choices):

  letters = ["A", "B", "C", "D"]

  lines = []
  for letter, choice in zip(letters, choices):
    lines.append(f"  {letter}. {choice}  ")

  return "\n".join(lines) + '\n'

example_choices = [55, 19, 2, 3]
print(format_choice(example_choices))

  A. 55  
  B. 19  
  C. 2  
  D. 3  



### Define inference function

In [7]:
def get_open_ended_answer_to_question(question, choices):

  prompt = mmlu_prompt.format(
          question, # instruction
          format_choice(choices), # input
          "", # output - leave this blank for generation!
      )

  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
  return (tokenizer.batch_decode(outputs, skip_special_tokens=True)[0][len(prompt):].strip())

question = 'What is the capital city of France?'
choices = ["London", "Paris", "Berlin", "Rome"]
response = get_open_ended_answer_to_question(question, choices)
print(response)

B. Paris


## Perform LLM inference on a small category  

We choose `abstract_algebra` for a start, as we know the LLM has a poor performance on this one.  

### create a subset of the dataset  

We select `abstract_algebra`, on a `ABCD` shuffle for now.  

In [8]:
dataset_df = dataset_ds.to_pandas()

In [9]:
dataset_df.head()

Unnamed: 0,question,choices,answer,question_id,category,letter_order,__index_level_0__
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...,abstract_algebra,ABCD,0
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...,abstract_algebra,ABCD,1
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...,abstract_algebra,ABCD,2
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...,abstract_algebra,ABCD,3
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...,abstract_algebra,ABCD,4


In [10]:
abstract_algebra_df = dataset_df[(dataset_df.category == 'abstract_algebra') & (dataset_df.letter_order == 'ABCD')]
len(dataset_df), len(abstract_algebra_df)

(337008, 100)

We also remove 3 features that are not useful anymore:

In [11]:
del abstract_algebra_df ['__index_level_0__']
del abstract_algebra_df ['letter_order']
del abstract_algebra_df ['category']

In [12]:
abstract_algebra_df.head()

Unnamed: 0,question,choices,answer,question_id
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...


### perform inference on the subset

In [13]:
from tqdm.auto import tqdm
tqdm.pandas()

abstract_algebra_df['inferred_answer'] = abstract_algebra_df.progress_apply(lambda x: get_open_ended_answer_to_question(x.question, x.choices), axis=1)

  0%|          | 0/100 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  abstract_algebra_df['inferred_answer'] = abstract_algebra_df.progress_apply(lambda x: get_open_ended_answer_to_question(x.question, x.choices), axis=1)


In [14]:
abstract_algebra_df[['question', 'answer', 'inferred_answer']]

Unnamed: 0,question,answer,inferred_answer
0,Find the degree for the given field extension ...,1,
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",2,0
2,Find all zeros in the indicated finite field o...,3,",4\n\n### Explanation:\nThe polynomial is x^5 ..."
3,Statement 1 | A factor group of a non-Abelian ...,1,"A. True, True"
4,Find the product of the given polynomials in t...,1,x^2 + 4x + 6\n\n### Explanation:\nThe product ...
...,...,...,...
95,Statement 1 | If H is a subgroup of G and a be...,2,"A. True, True"
96,Find all zeros in the indicated finite field o...,1,",1"
97,Find the number of elements in the indicated c...,2,0
98,"The element (4, 2) of Z_12 x Z_8 has order",2,### Explanation:\nThe order of an element is t...


## Perform RAG between inferred answers and choices

### define RAG function
1. compute embeddings of the inferred answers  
2. compute embeddings of choices  
3. return the choice number with the highest similarity score

In [15]:
from sentence_transformers import util
import numpy as np

def get_closes_choice_rag(inferred_answer, choices):
  # format to A. B. etc... choices but as a list and without whitespaces
  rag_formatted_choices = [choice.strip() for choice in format_choice(choices).split('\n')]

  answer_embed = embedding_model.encode(inferred_answer, convert_to_tensor=True)
  choices_embeds = embedding_model.encode(rag_formatted_choices, convert_to_tensor=True)
  distances = util.cos_sim(answer_embed, choices_embeds)[0]
  return np.argmax(distances.cpu()).item()

In [16]:
inferred_answer = 'my cat is in the bedroom'

choices = [
    'cat in bathroom',
    'cat in bedroom',
    'dog in bedroom',
    'dog in bathroom',
]

get_closes_choice_rag(inferred_answer, choices)

1

### perform RAG on dataset

In [17]:
abstract_algebra_df['rag_answer'] = abstract_algebra_df.progress_apply(lambda x: get_closes_choice_rag(x.inferred_answer, x.choices), axis=1)
abstract_algebra_df.head()

  0%|          | 0/100 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  abstract_algebra_df['rag_answer'] = abstract_algebra_df.progress_apply(lambda x: get_closes_choice_rag(x.inferred_answer, x.choices), axis=1)


Unnamed: 0,question,choices,answer,question_id,inferred_answer,rag_answer
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...,,4
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...,0,4
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...,",4\n\n### Explanation:\nThe polynomial is x^5 ...",3
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...,"A. True, True",0
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...,x^2 + 4x + 6\n\n### Explanation:\nThe product ...,1


In [18]:
abstract_algebra_df['is_correct'] = abstract_algebra_df['answer'] == abstract_algebra_df['rag_answer']
abstract_algebra_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  abstract_algebra_df['is_correct'] = abstract_algebra_df['answer'] == abstract_algebra_df['rag_answer']


Unnamed: 0,question,choices,answer,question_id,inferred_answer,rag_answer,is_correct
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...,,4,False
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...,0,4,False
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...,",4\n\n### Explanation:\nThe polynomial is x^5 ...",3,True
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...,"A. True, True",0,False
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...,x^2 + 4x + 6\n\n### Explanation:\nThe product ...,1,True


## Analyze results  

We have results that are exactly equal to random baseline...

In [19]:
# display accuracy
print(f"Mean Accuracy : {abstract_algebra_df['is_correct'].mean()*100:.02f}%")

Mean Accuracy : 25.00%


## Perform inference on complete MMLU - single shuffle

In [22]:
single_shuffle = dataset_df[dataset_df.letter_order == 'ABCD']
del single_shuffle ['__index_level_0__']
del single_shuffle ['letter_order']
del single_shuffle ['category']

In [24]:
len(single_shuffle)

14042

In [25]:
single_shuffle['inferred_answer'] = single_shuffle.progress_apply(lambda x: get_open_ended_answer_to_question(x.question, x.choices), axis=1)

  0%|          | 0/14042 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  single_shuffle['inferred_answer'] = single_shuffle.progress_apply(lambda x: get_open_ended_answer_to_question(x.question, x.choices), axis=1)


In [27]:
single_shuffle.to_parquet('single_shuffle_inference_dataset.parquet')

In [28]:
single_shuffle['rag_answer'] = single_shuffle.progress_apply(lambda x: get_closes_choice_rag(x.inferred_answer, x.choices), axis=1)
single_shuffle.head()

  0%|          | 0/14042 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  single_shuffle['rag_answer'] = single_shuffle.progress_apply(lambda x: get_closes_choice_rag(x.inferred_answer, x.choices), axis=1)


Unnamed: 0,question,choices,answer,question_id,inferred_answer,rag_answer
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...,,4
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...,0,4
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...,",4\n\n### Explanation:\nThe polynomial is x^5 ...",3
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...,"A. True, True",0
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...,x^2 + 4x + 6\n\n### Explanation:\nThe product ...,1


### Analyze results

In [29]:
single_shuffle['is_correct'] = single_shuffle['answer'] == single_shuffle['rag_answer']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  single_shuffle['is_correct'] = single_shuffle['answer'] == single_shuffle['rag_answer']


In [31]:
print(f"Mean Accuracy : {single_shuffle['is_correct'].mean()*100:.02f}%")

Mean Accuracy : 55.25%


In [37]:
# we restore that category feature we shouldn't have deleted...
import pandas as pd
restore_category = dataset_df[dataset_df.letter_order == 'ABCD'][['question_id', 'category']]
single_shuffle_with_cat = pd.merge(how='left', left=single_shuffle, right=restore_category, on='question_id')
single_shuffle_with_cat.head()

Unnamed: 0,question,choices,answer,question_id,inferred_answer,rag_answer,is_correct,category
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,bc3778ec85a3abdf375449e14780a1318d32e859c2a2c1...,,4,False,abstract_algebra
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,9dbee06135bb2cd4f1d6fc47c5b9698485a7758ce3ee76...,0,4,False,abstract_algebra
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,4cfb894cedaec3e7dee2ba71a6a781fbf1d0ded44bac22...,",4\n\n### Explanation:\nThe polynomial is x^5 ...",3,True,abstract_algebra
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,7bdc038b56be4a1a507b6d156e061fc66c43098d756822...,"A. True, True",0,False,abstract_algebra
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,ff99adc312cd773b4959d6f9398f00342297a0d0379c65...,x^2 + 4x + 6\n\n### Explanation:\nThe product ...,1,True,abstract_algebra


In [43]:
single_shuffle_with_cat.groupby(['category'])['is_correct'].mean().to_frame().reset_index().sort_values('is_correct')

Unnamed: 0,category,is_correct
25,high_school_mathematics,0.174074
17,global_facts,0.23
43,moral_scenarios,0.246927
0,abstract_algebra,0.27
10,college_physics,0.284314
8,college_mathematics,0.3
15,elementary_mathematics,0.306878
27,high_school_physics,0.350993
13,econometrics,0.368421
16,formal_logic,0.404762
