# Part 2 - Perform inference on the complete shuffled MMLU dataset  

This dataset is 24 times larger than the original, because each question appears in 24 permutations of its answer choices.  
To speed up inference, we use an A100 GPU in Google Colab.  
The dataset is retrieved from the Hugging Face Hub, then inference is run in batches of 10 samples.  

Finally, the results are uploaded to the Hub for further analysis.  
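
The factor of 24 is the number of orderings of four answer choices (4! = 24). A quick sanity check of that arithmetic, assuming each question appears exactly once per ordering of the letters A to D; the row count is the one reported when the dataset is loaded in this notebook:

```python
from itertools import permutations

# All orderings of the four answer letters.
letter_orders = ["".join(p) for p in permutations("ABCD")]
n_orders = len(letter_orders)

# Row count of the shuffled dataset, as reported by load_dataset in this notebook.
num_rows = 337008

# Assumption: every question appears once per ordering, so the division is exact.
n_questions, remainder = divmod(num_rows, n_orders)

print(n_orders)                # 24
print(letter_orders[:3])       # ['ABCD', 'ABDC', 'ACBD']
print(n_questions, remainder)  # 14042 0
```

A remainder of zero is consistent with every question being present in all 24 orderings.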

## Install libraries

In [None]:
%%capture
!pip install transformers datasets
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

## Load the shuffled MMLU dataset

In [None]:
from datasets import load_dataset
from tqdm.auto import tqdm

dataset_ds = load_dataset("the-french-artist/shuffled_mmlu", split='test')
dataset_ds

Dataset({
    features: ['question', 'choices', 'answer', 'category', 'letter_order', '__index_level_0__'],
    num_rows: 337008
})

In [None]:
dataset_df = dataset_ds.to_pandas()
dataset_df.head()

Unnamed: 0,question,choices,answer,category,letter_order,__index_level_0__
0,Find the degree for the given field extension ...,"[0, 4, 2, 6]",1,abstract_algebra,ABCD,0
1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...","[8, 2, 24, 120]",2,abstract_algebra,ABCD,1
2,Find all zeros in the indicated finite field o...,"[0, 1, 0,1, 0,4]",3,abstract_algebra,ABCD,2
3,Statement 1 | A factor group of a non-Abelian ...,"[True, True, False, False, True, False, False,...",1,abstract_algebra,ABCD,3
4,Find the product of the given polynomials in t...,"[2x^2 + 5, 6x^2 + 4x + 6, 0, x^2 + 1]",1,abstract_algebra,ABCD,4


## Load model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Define inference functions

In [None]:
def format_choice(example_choices):

  letters = ["A", "B", "C", "D"]

  lines = []
  for letter, choice in zip(letters, example_choices):
    lines.append(f"  {letter}) {choice}  ")

  return "\n".join(lines) + '\n'
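
To make the formatting concrete, here is what `format_choice` produces for the choices of the first row shown earlier. The function is copied into the snippet only so that it runs on its own; `rendered` is a name introduced for this illustration:

```python
def format_choice(example_choices):
    # Same logic as the cell above: one "  LETTER) choice  " line per choice.
    letters = ["A", "B", "C", "D"]
    lines = []
    for letter, choice in zip(letters, example_choices):
        lines.append(f"  {letter}) {choice}  ")
    return "\n".join(lines) + "\n"

rendered = format_choice(["0", "4", "2", "6"])
print(repr(rendered))
# '  A) 0  \n  B) 4  \n  C) 2  \n  D) 6  \n'
```

Because of `zip`, a row with more than four choices would be silently truncated to four, and a row with fewer would simply yield fewer lines.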

In [None]:
from itertools import permutations
import random

mmlu_prompt = """
Answer the following multiple choice question.
The last line of your response should be of the following format: 'The answer letter is : $LETTER' (without quotes) where LETTER is one of A B C D.
Think step by step before answering.

### Question:
{}

### Choices:
{}

### Anwser:
Given the choices A B C D , the answer is : {}"""

def get_number_result_from_question_batch_mode(rows):

    prompts = []
    for question, choices in zip(rows['question'], rows['choices']):
        prompt = mmlu_prompt.format(
            question,
            format_choice(choices),
            ""  # output - leave blank for model answer
        )
        prompts.append(prompt)

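    # Token ids for the answer letters (a 2-D tensor with one row, because of return_tensors="pt").
    answer_tokens = tokenizer.encode(" A B C D", add_special_tokens=False, return_tensors="pt")
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    logits_list = []
    with torch.no_grad():
        logits = model(inputs.input_ids, attention_mask=inputs.attention_mask).logits
        for i in range(len(prompts)):
            # Next-token logits restricted to the answer letters.
            # Position -1 is the last prompt token only if the tokenizer pads on the left.
            logits_ans = logits[i, -1, answer_tokens].cpu()
            logits_list.append(logits_ans)
        # Release the full logits tensor before the next batch.
        del logits
        torch.cuda.empty_cache()

    rows['inferred_answer'] = []
    for logits_ans in logits_list:
        prob_ans = torch.softmax(logits_ans, dim=-1)
        # Index of the most likely letter, stored as a plain int rather than a tensor.
        inferred_answer = int(prob_ans.argmax(dim=-1)[0])
        rows['inferred_answer'].append(inferred_answer)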

    return rows
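
The function reads the logits at position `-1` of every padded sequence. That position holds the last prompt token only when the tokenizer pads on the left; with right padding it would be a pad token for every prompt shorter than the longest one in the batch. A minimal sketch of how the last real position can be derived from the attention mask regardless of the padding side. Plain Python lists stand in for the tensors, and `last_token_positions` is a hypothetical helper, not part of the notebook:

```python
def last_token_positions(attention_mask):
    """Return, for each row, the index of the last position whose mask value is 1."""
    positions = []
    for row in attention_mask:
        real = [i for i, m in enumerate(row) if m == 1]
        if not real:
            raise ValueError("row contains no real tokens")
        positions.append(real[-1])
    return positions

# Two prompts of length 3 and 5, padded to length 5 on either side.
right_padded = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
left_padded = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

print(last_token_positions(right_padded))  # [2, 4]
print(last_token_positions(left_padded))   # [4, 4]
```

With left padding the result is always the final index, which is the case where indexing with `-1` is safe.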

## Perform complete inference

In [None]:
dataset_ds = dataset_ds.map(get_number_result_from_question_batch_mode, batched=True, batch_size=10, num_proc=1)



Map:   0%|          | 0/337008 [00:00<?, ? examples/s]

In [None]:
analysis_df = dataset_ds.to_pandas()
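
A first check on the predictions is the overall accuracy. A minimal sketch on a few hand-made rows rather than the real data, assuming `answer` and `inferred_answer` are both 0-based indices into the choices as presented (0 for A, 3 for D). That assumption should be verified against how the shuffled dataset was built:

```python
# Toy rows for illustration only; these are not results from the notebook.
rows = [
    {"answer": 1, "inferred_answer": 1},
    {"answer": 2, "inferred_answer": 0},
    {"answer": 3, "inferred_answer": 3},
    {"answer": 1, "inferred_answer": 1},
]

letters = "ABCD"

# Fraction of rows where the predicted index matches the reference index.
n_correct = sum(r["answer"] == r["inferred_answer"] for r in rows)
accuracy = n_correct / len(rows)

# Map predicted indices back to letters for readability.
predicted_letters = [letters[r["inferred_answer"]] for r in rows]

print(accuracy)           # 0.75
print(predicted_letters)  # ['B', 'A', 'D', 'B']
```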

## Upload dataset to the Hub

In [None]:
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import login

login(hf_token, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from datasets import load_dataset

# it seems that we still need to pass the token if we have previously used a deactivated token,
# even if we use the login() function above
dataset_ds.push_to_hub("the-french-artist/shuffled_mmlu_no_splits_unsloth_llama-3-8b-bnb-4bit", token=hf_token)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/338 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/the-french-artist/shuffled_mmlu_no_splits_unsloth_llama-3-8b-bnb-4bit/commit/cdd9833f452f936113fccc3db3266bc94fd02728', commit_message='Upload dataset', commit_description='', oid='cdd9833f452f936113fccc3db3266bc94fd02728', pr_url=None, pr_revision=None, pr_num=None)

## Conclusion  

Now that the expensive part is done, we can analyze the results in a GPU-free session.  
The complete inference took 3 hours on an A100.  

We used 10 samples per batch because some samples are much longer than others (more than 4,000 tokens). We could have set those aside to allow larger batches and run them separately afterwards, but the observed speedup was small (about 60 samples/s with batches of 10 versus 70 samples/s with batches of 100) and not worth the hassle of splitting the dataset into pieces.    
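
The trade-off can be put into numbers with the figures quoted above. This is plain arithmetic on the reported row count and throughputs, assuming each rate were sustained over the whole run:

```python
num_rows = 337008     # rows in the shuffled dataset
rate_batch_10 = 60    # samples/s observed with batches of 10
rate_batch_100 = 70   # samples/s observed with batches of 100

# Estimated run time in minutes at each sustained rate.
minutes_batch_10 = num_rows / rate_batch_10 / 60
minutes_batch_100 = num_rows / rate_batch_100 / 60
minutes_saved = minutes_batch_10 - minutes_batch_100

# Average rate implied by the reported total of 3 hours.
avg_rate_3h = num_rows / (3 * 3600)

print(round(minutes_batch_10, 1))   # 93.6
print(round(minutes_batch_100, 1))  # 80.2
print(round(minutes_saved, 1))      # 13.4
print(round(avg_rate_3h, 1))        # 31.2
```

At these rates, larger batches would save roughly a quarter of an hour at best. The reported 3 hours corresponds to a lower average rate than either figure, so the quoted rates may be peak values or exclude overhead; that reading is a guess, not something the notebook states.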