<a href="https://colab.research.google.com/github/kslevi/kslevi/blob/main/Open_LLM_Benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open-LLM-Benchmark Assignment

This code block installs the datasets library, which is an essential Python package for accessing and processing a wide variety of datasets. Developed by Hugging Face, this library simplifies the process of loading, processing, and evaluating benchmark datasets commonly used in natural language processing (NLP) and machine learning tasks.

In this assignment, the datasets library will be used to load and preprocess datasets from the Open-LLM-Benchmark, which provides a standardized framework for evaluating large language models on various NLP tasks. By running this code, you ensure that all required tools for handling datasets are readily available in your Colab environment.

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading x

This code block initiates the login process for Hugging Face using the huggingface-cli login command. Hugging Face is a platform offering tools and resources for machine learning, including pre-trained models, datasets, and APIs.

By running this command, you will be prompted to enter your Hugging Face token, which you can generate from your Hugging Face account. Logging in allows you to access private models, datasets, or APIs that require authentication. It also facilitates seamless integration with the Hugging Face ecosystem when working on tasks like model inference, dataset access, or fine-tuning.

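In the context of this assignment, logging in is only needed when the resources you fetch from the Hugging Face Hub are private or gated (for example, the meta-llama models). Public datasets and models can be downloaded without authentication, which is why the command is commented out in the cell below.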
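
As an alternative to the interactive prompt, the token can be read from the environment and passed to `huggingface_hub.login`. The following is a minimal sketch, assuming the token has been exported as an environment variable named `HF_TOKEN`; the helper names `read_hf_token` and `login_if_token_available` are hypothetical and not part of the assignment.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_if_token_available(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub only when a token is present."""
    token = read_hf_token(env_var)
    if token is None:
        # No token: public datasets and models can still be downloaded.
        return False
    # Imported lazily so the helpers can be defined without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_if_token_available()` at the top of the notebook leaves the session unauthenticated when no token is set, which is sufficient for public resources.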

In [None]:
# !huggingface-cli login

*   Importing Necessary Libraries:

    *   datasets: A library from Hugging Face used for accessing and processing datasets.
    *   json: A Python library for handling JSON data structures.

*   Loading the Evaluation Dataset:

    *   The datasets.load_dataset function loads the "questions" configuration of the Open-LLM-Benchmark dataset; its examples are stored in the train split. This dataset contains evaluation questions designed to test the performance of large language models (LLMs) across various tasks.

*   Grouping the Dataset by Source:

    *   The dataset examples are iterated through, and the dataset field in each example is used to group the questions based on their originating dataset.
    *   A dictionary named grouped_datasets is created, where:
        *   Keys represent the source datasets (e.g., task names or domains).
        *   Values are lists of examples belonging to that dataset.

In [None]:
import datasets
import json

eval_set = datasets.load_dataset("Open-Style/Open-LLM-Benchmark", "questions")
grouped_datasets = {}
for example in eval_set['train']:
    dataset = example["dataset"]
    if dataset not in grouped_datasets:
        grouped_datasets[dataset] = []
    grouped_datasets[dataset].append(example)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
grouped_datasets.keys()

dict_keys(['ARC', 'CommonsenseQA', 'Hellaswag', 'MMLU', 'MedMCQA', 'OpenbookQA', 'Winogrande', 'piqa', 'race'])
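
To check how many questions each source contributes, the same grouping can be written with `collections.defaultdict` and followed by a count per group. This is a minimal sketch on toy records standing in for `eval_set['train']`; the question values are made up, and only the `dataset` field mirrors the real examples.

```python
from collections import defaultdict

# Toy records standing in for eval_set["train"]; the field values are made up.
toy_examples = [
    {"dataset": "ARC", "question": "q1"},
    {"dataset": "piqa", "question": "q2"},
    {"dataset": "ARC", "question": "q3"},
]

# defaultdict(list) creates the empty list on first access, so no membership test is needed.
grouped = defaultdict(list)
for example in toy_examples:
    grouped[example["dataset"]].append(example)

# Number of examples per source dataset.
group_sizes = {name: len(items) for name, items in grouped.items()}
print(group_sizes)  # {'ARC': 2, 'piqa': 1}
```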

*   Importing Required Classes:

    *   AutoTokenizer: A generic class for loading the appropriate tokenizer for a given pre-trained model.
    *   AutoModelForCausalLM: A generic class for loading a causal language model, designed for autoregressive tasks like text generation.
*   Loading the Pre-trained Model:

    *   The AutoModelForCausalLM.from_pretrained function loads the pre-trained model from the Hugging Face Hub; the cell below uses microsoft/phi-2.
    *   device_map="auto" lets the library place the model weights on the available hardware automatically, using the GPU when one is present and falling back to the CPU otherwise.
*   Loading the Tokenizer:

    *   The AutoTokenizer.from_pretrained function loads the tokenizer corresponding to the same pre-trained model.
    *   padding_side="left" adds padding tokens to the left of shorter inputs, so that in a padded batch every prompt ends at the same position and the generated tokens directly follow the real input.

Note: The identifier microsoft/phi-2 in the cell below is the pre-trained model used in this example. Replace it with the specific model and tokenizer you want to evaluate. For example:

*   If you are testing another model, update the model identifier (e.g., "meta-llama/Llama-3.2-1B", "EleutherAI/gpt-j-6B", etc.).

*   Ensure the model and tokenizer are compatible with the dataset and task at hand.

You can find a suitable model in this list: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
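
The effect of padding_side="left" can be illustrated without loading a tokenizer. The sketch below is a pure-Python illustration, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the token ids and the pad id 0 are made up. With left padding, every row ends in a real token, so a causal model continues from the prompt rather than from padding.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)   # 0 marks padding positions
        mask_real = [1] * len(seq)      # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(mask_real + mask_pad)
    return input_ids, attention_mask


# Made-up token ids for two prompts of different length; 0 is a hypothetical pad id.
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0, side="left")
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```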

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "microsoft/phi-2" in both calls with the model you want to evaluate.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", padding_side="left")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


1.   Device Allocation:

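*   The device variable records the device on which device_map="auto" placed the model (the GPU if one is available), so that the tokenized inputs can be moved to the same device.
*   model = model.to(device) is commented out because the model is already on that device. Uncomment it only if you load the model without device_map and want to move it explicitly.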

2.   Inference Function (infer_llm):

*   Takes a single dataset sample as input.
*   Constructs a prompt that includes the question, the options, and an "Answer:" section where the model generates its response.
*   Customization:

    *   Modify the prompt to match the style of your model (e.g., "Answer in yes/no," or "Answer in a single word").
*   Tokenizes the prompt using the tokenizer and prepares it for the model by converting it to the appropriate device.


*   Generates output using the model:
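    *   max_length: Caps the total sequence length at the prompt length plus 5 tokens, so at most 5 new tokens are generated.
    *   do_sample=True: Samples from the model's output distribution instead of decoding greedily, so the answers can vary between runs.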
*   Decodes the model's output into a human-readable string and extracts the generated answer.
3.   Evaluation Function (evaluate_samples):

*   Iterates through all samples in a dataset.
*   Calls the infer_llm function to predict answers for each sample.
*   Compares the predicted answers to the correct answers (sample["answerKey"]).
*   Tracks the number of correct predictions and calculates the accuracy.
4.   Dataset Evaluation:

*   The evaluate_samples function is applied to the CommonsenseQA dataset from the grouped datasets.
*   Returns the predicted answers and the model's accuracy on the dataset.

5.   Customization:
*   Prompt Design: Modify the prompt string in infer_llm to align with the expected input format of your chosen LLM.
*   Model Parameters: Adjust settings like max_length and do_sample to balance output quality and computational efficiency.
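
The prompt construction described for infer_llm can be isolated into a small function, which makes it easier to try different instruction lines. This is a minimal sketch: `build_prompt` and its `instruction` parameter are hypothetical, and the toy sample only assumes the `question` and `options` (`label`, `text`) fields used in this notebook, with made-up option texts.

```python
def build_prompt(sample, instruction="Answer: "):
    """Format a question and its labelled options, ending with the instruction line."""
    lines = [sample["question"]]
    for option in sample["options"]:
        lines.append(f"{option['label']}: {option['text']}")
    return "\n".join(lines) + "\n" + instruction


# Toy sample with the same fields as the benchmark examples; option texts are made up.
toy_sample = {
    "question": "Where are you likely to find a hamburger?",
    "options": [
        {"label": "A", "text": "fast food restaurant"},
        {"label": "B", "text": "pizza"},
    ],
}

print(build_prompt(toy_sample, instruction="Answer with a single letter: "))
```

If the instruction no longer contains the string "Answer:", the decoding step that splits the output on "Answer:" has to be adapted to the new instruction as well.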


In [None]:
# Reuse the device that device_map="auto" placed the model on (the GPU when one is available)
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    # Build the prompt: the question followed by one "label: text" line per option
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_length=len(inputs["input_ids"][0]) + 5,  # prompt length plus at most 5 new tokens
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,  # sampling: answers can differ between runs
    )

    # Decode the output and keep only the text after the last "Answer:"
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    for sample in samples:
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

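        # Exact string match against the gold label
        if predicted_answer == sample["answerKey"]:
            correct += 1

    # Accuracy as a percentage; an empty sample list yields 0.0 instead of a division error
    accuracy = correct / total * 100 if total else 0.0
    return predictions, accuracy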

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['CommonsenseQA'])
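
Because evaluate_samples compares the raw generation with sample["answerKey"] by exact string match, a generation such as "A: Complete job" does not count as matching "A". One possible refinement is to extract the option letter before comparing. The sketch below is a heuristic under stated assumptions, not part of the original assignment: `extract_choice` and `accuracy_percent` are hypothetical helpers, and the labels are assumed to be single letters from A to E.

```python
import re


def extract_choice(generated, labels="ABCDE"):
    """Return the first standalone option letter in the generated text, or None."""
    match = re.search(rf"\b([{labels}])\b", generated.strip())
    return match.group(1) if match else None


def accuracy_percent(predictions, answer_keys):
    """Percentage of predictions whose extracted letter equals the answer key."""
    if not answer_keys:
        return 0.0
    correct = sum(extract_choice(p) == k for p, k in zip(predictions, answer_keys))
    return correct / len(answer_keys) * 100


# Raw generations paired with their answer keys, in the style of this notebook's output.
raw_predictions = ["A: Complete job", "", "E) Illinois", "1: A"]
answer_keys = ["A", "B", "A", "A"]
print(accuracy_percent(raw_predictions, answer_keys))  # 50.0
```

On these four pairs an exact string match would score 0%, since none of the raw generations equals its key. The heuristic can still misfire, for example when a generation begins with the article "A", so a stricter prompt or pattern may be preferable.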

In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['CommonsenseQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Predicted Answer: 
Correct Answer: A

Question: What do people aim to do at work?
Predicted Answer: A: Complete job
Correct Answer: A

Question: Where would you find magazines along side many other printed works?
Predicted Answer: 
Correct Answer: B

Question: Where are  you likely to find a hamburger?
Predicted Answer: A: fast food
Correct Answer: A

Question: James was looking for a good place to buy farmland.  Where might he look?
Predicted Answer: E) Illinois
Correct Answer: A

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Predicted Answer: B: mex
Correct Answer: B

Question: What do animals do when an enemy is approaching?
Predicted Answer: 
Correct Answer: D

Question: Reading newspaper one of many ways to practice your what?
Predicted Answer: 1: A
Correct Answer: A

Question: What do people typically do while playin