<a href="https://colab.research.google.com/github/mzilberman40/LLM-Engineering-Essentials/blob/main/topic2/2.1_structured_inputs_and_outputs_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course github: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

# 2.1. Structured Inputs and Outputs

# **Practice task solutions**

In [1]:
!pip install openai -qU


In [1]:
from google.colab import userdata
from openai import OpenAI
import os

os.environ['NEBIUS_API_KEY'] = userdata.get("nebius_api_key")

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_model = "meta-llama/Llama-3.3-70B-Instruct"

def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.

    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_model,
                    prettify=True,
                    temperature=None) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content


## Task 1. LLM Information extraction

The goal of this task is to build a system that extracts data about events from free text into a predictable format.

Let's imagine that you work for a marketing agency and need to gather analytics about past events dedicated to AI and Machine Learning. For that, you need to process press releases and extract:
- Event name,
- Event date,
- Number of participants,
- Number of speakers,
- Attendance price.

Of course, you can do it manually, but it's much more fun to use Generative AI! So, your task is to write a function that does this with only one request to the OpenAI API.

Below is an example of a press release (generated by ChatGPT, of course, so both the event and the personae are fictional). All the press releases are in the press_releases.zip archive in the hometask week 1 folder.

<blockquote>
<p>PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence</p>

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact:
Jane Cipher
Director of Communications, InnovAI Summit
Email: jane.cipher@innovai.org
Phone: +123-4567-8910</p>
</blockquote>

More specifically, you should write a function

```python
parse_press_release(pr: str) -> dict
```

where the output should be in the format

```python
{
  'name': 'InnovAI Summit 2023',
  'date': '05.11.2023',
  'n_participants': 3500,
  'n_speakers': 4,
  'price': None
}
```

If any of these characteristics is not mentioned in the text, put `None` in the respective field.

At the end, calculate the share of correct answers and analyse what kinds of mistakes your model makes most often.

**Hints and suggestions:**
- It's gonna be more convenient to experiment in Nebius AI Studio's playground https://studio.nebius.com/playground.
- You need to be very accurate with what you want from the model.
- It will help if you specify in the prompt that the output should be in JSON format; this way you will spend less time parsing the output. But be careful: even though some models are easily prompted to output JSON, always check the output format. It may contain extra formatting, for example:
<pre><code>```json
{"name": "InnovAI Summit 2023", ...}
```</code></pre>
Actually, examining LLM outputs and their format is a must when working with them.

- Please be careful with the details. For example, Jane Cipher in the text above is not a speaker and shouldn't be counted as such (how can you get rid of a contact person?). Also pay attention to the date format.
- If the model is too wilful with the output format, don't hesitate to show it some examples. Decreasing the temperature can help reduce the creativity of the answer, which is what we want for such a task.
- Debugging an LLM-powered application can be tough. Even when you think you've polished it, an LLM can still surprise you. So we don't expect 100% accuracy in this task, but we do expect you to do your best to achieve high-quality results.
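
The extra formatting mentioned in the hints can be stripped before calling `json.loads`. Below is a minimal sketch, not part of the course solution: the helper name `parse_llm_json` is hypothetical, and it assumes the reply contains a single JSON object, optionally wrapped in a Markdown code fence.

```python
import json
import re

# Three backticks, built programmatically so this example stays easy to read.
FENCE = "`" * 3

# Hypothetical helper (an assumption, not course code): match an optional
# "json" language tag after the opening fence and capture the payload.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

def parse_llm_json(reply: str) -> dict:
    """Parse a JSON object from a reply that may be wrapped in a code fence."""
    match = FENCE_RE.search(reply)
    payload = match.group(1) if match else reply.strip()
    return json.loads(payload)

fenced = FENCE + 'json\n{"name": "InnovAI Summit 2023", "n_speakers": 4}\n' + FENCE
plain = '{"name": "InnovAI Summit 2023", "n_speakers": 4}'
print(parse_llm_json(fenced) == parse_llm_json(plain))
```

If the reply is not valid JSON even after the fence is removed, `json.loads` raises `json.JSONDecodeError`, which is a good moment to print the raw reply and inspect it.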

**Bonus points**:
Try writing the solution using:
- Structured JSON Output
- Guiding JSON Output using Structures


In [6]:
def answer_with_llm(
    messages: list,
    client,
    model,
    max_tokens=512,
    prettify=True,
    temperature=None,
    extra_body: dict = None,
) -> str:

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
        extra_body=extra_body,
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content


In [7]:
press_release = """PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact: Jane Cipher Director of Communications, InnovAI Summit Email: jane.cipher@innovai.org Phone: +123-4567-8910"""

In [17]:
from typing import List, Union
from pydantic import BaseModel, Field, field_validator
import json
import re

In [11]:
LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct"


In [12]:
SYSTEM_PROMPT = """
You are an information extraction assistant.

Below is a press release describing a public event. Extract and return a JSON object with the following fields:

- "name": the full official name of the event.
- "date": the date of the event in the format "DD.MM.YYYY", or "DD.MM.YYYY-DD.MM.YYYY" if it lasted several days. If not specified, return "None".
- "n_participants": the number of attendees or audience members. Do not include organizers, staff, or service personnel. If not specified, return "None".
- "n_speakers": the number of people who gave talks, presentations, participated in panels, or spoke publicly during the event. Do not include contact persons, moderators, or organizers unless they also spoke publicly. If not specified, return "None".
- "price": the cost to attend the event, formatted as "EUR 100", "USD 1000", or "GBP 100". Do not use currency symbols. If the event was free or no price is mentioned, return "None".

Return only the JSON object. Do not add explanations. Use "None" (as a string) if any field is missing.
""".strip()

In [21]:
class EventProfile(BaseModel):
    name: str = Field(..., description="Full event name")

    date: str = Field(..., description="DD.MM.YYYY or DD.MM.YYYY-DD.MM.YYYY or 'None'")
    n_participants: Union[int, str] = Field(..., description="Number of participants or 'None'")
    n_speakers: Union[int, str] = Field(..., description="Number of speakers or 'None'")
    price: str = Field(..., description="Price in 'EUR 100' etc., or 'None'")

    @field_validator('date')
    @classmethod
    def validate_date_format(cls, v: str) -> str:
        if v == "None":
            return v
        single = r'\d{2}\.\d{2}\.\d{4}'
        date_range = f'{single}-{single}'
        if re.fullmatch(single, v) or re.fullmatch(date_range, v):
            return v
        raise ValueError("Date must be 'DD.MM.YYYY', 'DD.MM.YYYY-DD.MM.YYYY', or 'None'")

    @field_validator('n_participants', 'n_speakers')
    @classmethod
    def validate_number_or_none(cls, v: Union[int, str]) -> Union[int, str]:
        if v == "None":
            return v
        # Accept stringified integers like "3500"
        if isinstance(v, str) and v.isdigit():
            return int(v)
        if isinstance(v, int):
            return v
        raise ValueError("Must be an integer or 'None'")

    @field_validator('price')
    @classmethod
    def validate_price_format(cls, v: str) -> str:
        if v == "None":
            return v
        if re.fullmatch(r'^(EUR|USD|GBP) \d+$', v):
            return v
        raise ValueError("Price must be 'EUR 100', 'USD 1000', 'GBP 100', or 'None'")
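
To see what the format validators above accept, the same regular expressions can be exercised on their own. This is a small standalone sketch using only the standard library; the helper names `is_valid_date` and `is_valid_price` are illustrative and do not appear in the solution.

```python
import re

SINGLE_DATE = r"\d{2}\.\d{2}\.\d{4}"
DATE_RANGE = f"{SINGLE_DATE}-{SINGLE_DATE}"
PRICE = r"(EUR|USD|GBP) \d+"

def is_valid_date(value: str) -> bool:
    """Mirror the date validator: a single date, a date range, or 'None'."""
    if value == "None":
        return True
    return bool(re.fullmatch(SINGLE_DATE, value) or re.fullmatch(DATE_RANGE, value))

def is_valid_price(value: str) -> bool:
    """Mirror the price validator: currency code, space, digits, or 'None'."""
    return value == "None" or bool(re.fullmatch(PRICE, value))

for candidate in ["05.11.2023", "05.11.2023-07.11.2023", "November 5, 2023", "None"]:
    print(candidate, is_valid_date(candidate))
for candidate in ["USD 1000", "$1000", "None"]:
    print(candidate, is_valid_price(candidate))
```

Note that these patterns only check the shape of the string: a value such as `99.99.2023` would pass. If calendar validity matters, parsing with `datetime.strptime(value, "%d.%m.%Y")` is a stricter option.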

In [22]:
# For JSON schema
schema = EventProfile.model_json_schema()
print(json.dumps(schema, indent=2))

{
  "properties": {
    "name": {
      "description": "Full event name",
      "title": "Name",
      "type": "string"
    },
    "date": {
      "description": "DD.MM.YYYY or DD.MM.YYYY-DD.MM.YYYY or 'None'",
      "title": "Date",
      "type": "string"
    },
    "n_participants": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "string"
        }
      ],
      "description": "Number of participants or 'None'",
      "title": "N Participants"
    },
    "n_speakers": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "string"
        }
      ],
      "description": "Number of speakers or 'None'",
      "title": "N Speakers"
    },
    "price": {
      "description": "Price in 'EUR 100' etc., or 'None'",
      "title": "Price",
      "type": "string"
    }
  },
  "required": [
    "name",
    "date",
    "n_participants",
    "n_speakers",
    "price"
  ],
  "title": "EventProfile",
  "type"

In [36]:
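def parse_press_release(pr: str) -> dict:
    """Extract event fields from a press release with a single LLM request."""
    completion = nebius_client.chat.completions.create(
        model=LLM_MODEL,
        temperature=1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": pr}
        ],
        # Ask the API to guide the output with the JSON schema of EventProfile
        extra_body={
            "guided_json": EventProfile.model_json_schema()
        }
    )
    # Validate the reply against the Pydantic model before returning a plain dict
    validated = EventProfile.model_validate_json(completion.choices[0].message.content)
    return validated.model_dump()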


In [37]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023', 'date': '05.11.2023', 'n_participants': 3500, 'n_speakers': 5, 'price': 'None'}

In [16]:
import json
import re

def extract_triple_backtick_blocks(text):
    """
    Extracts all text enclosed between triple backticks (```).
    Returns a list of code/text blocks.
    """
    # Skip an optional "json" language tag right after the opening backticks
    return re.findall(r"```(?:json)?(.*?)```", text, re.DOTALL)

def parse_press_release(pr: str) -> dict:
    answer = answer_with_llm(
            f"Here's a press release\n{pr}\n\nExtract from it the following json:\n"\
            '{"name": NAME_OF_EVENT, "date": DATE_OF_EVENT, "n_participants": NUM_PARTICIPANTS, "n_speakers": NUM_SPEAKERS, "price": PRICE}\n'\
            "NAME_OF_EVENT should be the name of event advertised,\n"\
            "DATE_OF_EVENT should be the date of event mentioned in format DD.MM.YYYY or DD.MM.YYYY-DD.MM.YYYY if the event lasted for several days,\n"\
            "NUM_PARTICIPANTS should be the estimated amount of participants of said event in a format like 200 or 1000 or 10000, do not write it like 2,000,\n"\
            "NUM_SPEAKERS is a number, corresponding to amount of names of speakers and hosts mentioned\n"\
            "PRICE should be the price of event in the format EUR 100 or USD 1000 or GBP 100 depending on currency. Do not write currency symbol, instead write an abbreviation.\n"\
            "If any information needed for JSON is not available, write a json string \"None\" instead (with quotes)."
    )
    return answer
    # try:
    #     if "```" in answer:
    #         answer = extract_triple_backtick_blocks(answer)[0]
    #     return json.loads(answer)
    # except Exception as e:
    #     print(answer)
    #     raise

In [32]:
text = parse_press_release(press_release)
print(text)

{'name': 'InnovAI Summit 2023', 'date': '05.11.2023', 'n_participants': 3500, 'n_speakers': 5, 'price': 'None'}


In [18]:
extract_triple_backtick_blocks(text)

['\n{\n"name": "InnovAI Summit 2023",\n"date": "05.11.2023",\n"n_participants": 3500,\n"n_speakers": 5,\n"price": "None"\n}\n']

### **Solution**

In [25]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023', 'date': '05.11.2023', 'n_participants': 3500, 'n_speakers': 5, 'price': 'None'}

### Testing

We've prepared a small dataset for you to test your prompt on. Once you've written your function, try running the following code. At the end, you can also look at the results side by side in a table in with_results.csv. Your goal is to get at least 60% of the fields right.

In [26]:
!pip install --upgrade gdown
!gdown -O press_release_extraction.csv https://docs.google.com/spreadsheets/d/15IGdc3MV8864lxrLxsug0Ij480p76T1EAwBM7WGT_OI/export?format=csv

Downloading...
From: https://docs.google.com/spreadsheets/d/15IGdc3MV8864lxrLxsug0Ij480p76T1EAwBM7WGT_OI/export?format=csv
To: /content/press_release_extraction.csv
16.0kB [00:00, 23.4MB/s]


In [33]:
import pandas
pr_df = pandas.read_csv("press_release_extraction.csv")
pr_df.head()

| | pr_text | pr_parsed |
|---|---|---|
| 0 | InnovAI Summit 2023: A Glimpse into the Future... | {\n "name": "InnovAI Summit 2023",\n "date":... |
| 1 | Press Dispatch: 'Artificial Mariners: Navigati... | {"name": "Artificial Mariners: Navigatin' the ... |
| 2 | FOR IMMEDIATE RELEASE\n\nAI Innovators Convene... | {"name": "Annual Machine Learning Symposium 20... |
| 3 | Press Release: Cutting-Edge Innovations Debute... | {"name": "AI Advancements Summit 2023",\n "dat... |
| 4 | Press Release: Innovative Minds Gather at "AI ... | {"name": "AI Horizon 2023",\n "date": "15.10.2... |


In [34]:
pr_df.pr_parsed[0]

'{\n  "name": "InnovAI Summit 2023",\n  "date": "05.11.2023",\n  "n_participants": 3500,\n  "n_speakers": 4,\n  "price": "None"\n}'

In [38]:
import json

parsed_list = []
fields = {
    "name": str,
    "date": str,
    "n_speakers": int,
    "n_participants": int,
    "price": str
}
correct_fields = 0
for row in pr_df.itertuples():
    parsed_release = parse_press_release(row.pr_text)
    parsed_list.append(json.dumps(parsed_release, indent=4))
    golden = json.loads(row.pr_parsed)
    for field, field_type in fields.items():
        golden_field = golden[field]
        parsed_field = parsed_release.get(field)
        try:
            parsed_field = field_type(parsed_field)
        except (ValueError, TypeError):
            pass
        if golden_field == parsed_field:
            correct_fields += 1
        else:
            print(f"For {golden['name']} {field} {parsed_release.get(field)} doesn't seem the same as {golden[field]}")

print(f"Correctly extracted {correct_fields} out of {len(fields) * len(pr_df)}")

For InnovAI Summit 2023 n_speakers 5 doesn't seem the same as 4
Correctly extracted 34 out of 35
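
To analyse which kinds of mistakes dominate, it can help to break the single total down by field. The sketch below is one possible way to do that; the function name `per_field_accuracy` and the toy rows are assumptions made for illustration and are not real evaluation results.

```python
from collections import Counter

FIELDS = ["name", "date", "n_participants", "n_speakers", "price"]

def per_field_accuracy(golden_rows, parsed_rows, fields=FIELDS):
    """Return {field: share of rows where the parsed value equals the golden one}."""
    if not golden_rows:
        return {}
    hits = Counter()
    for golden, parsed in zip(golden_rows, parsed_rows):
        for field in fields:
            # Compare as strings so that 3500 and "3500" count as equal.
            if str(golden.get(field)) == str(parsed.get(field)):
                hits[field] += 1
    return {field: hits[field] / len(golden_rows) for field in fields}

# Toy data for illustration only.
golden_rows = [
    {"name": "Event A", "date": "01.02.2023", "n_participants": 100, "n_speakers": 4, "price": "None"},
    {"name": "Event B", "date": "03.04.2023", "n_participants": 250, "n_speakers": 2, "price": "USD 50"},
]
parsed_rows = [
    {"name": "Event A", "date": "01.02.2023", "n_participants": "100", "n_speakers": 5, "price": "None"},
    {"name": "Event B", "date": "03.04.2023", "n_participants": 250, "n_speakers": 2, "price": "USD 50"},
]
print(per_field_accuracy(golden_rows, parsed_rows))
```

With the real data, `golden_rows` could be built with `json.loads(row.pr_parsed)` and `parsed_rows` from the outputs of `parse_press_release`, as in the loop above.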


### Bonus points
- Try and compare different ways of establishing the correct answer formatting
- Try and compare different LLMs

## Task 2. Character localization

A cool thing about structured output is that it makes it very easy to produce a translated version of a dataset: the model can take all the context into account and return the result in a format that is easy to parse. Let's try this on MMLU.

**Task:** Write a function that takes a sample from MMLU and returns a translated version, using structured outputs.

Tip: make sure that the correct answer doesn't change.

In [None]:
!pip install -qU datasets


In [None]:
from typing import List
from pydantic import BaseModel

class MMLUSample(BaseModel):
    ...

def translate_mmlu_sample(sample: MMLUSample, target_language: str) -> MMLUSample:
    ...

In [None]:
from typing import List
from pydantic import BaseModel

class MMLUSample(BaseModel):
    question: str
    A: str
    B: str
    C: str
    D: str
    correct_answer: str

def translate_mmlu_sample(sample: MMLUSample, target_language: str) -> MMLUSample:
    completion = nebius_client.chat.completions.create(
        model=llama_model,
        messages=[
            {
                "role": "user",
                # Adjacent f-strings are concatenated; every line ends with "\n"
                "content": (
                    f"Translate this MMLU sample into {target_language}.\n"
                    f"Question: {sample.question}\n"
                    f"A: {sample.A}\n"
                    f"B: {sample.B}\n"
                    f"C: {sample.C}\n"
                    f"D: {sample.D}\n"
                    f"Correct answer: {sample.correct_answer}\n"
                    f"Translated sample:"
                ),
            }
        ],
        extra_body={
            "guided_json": MMLUSample.model_json_schema()
        },
    )

    translated = MMLUSample.model_validate_json(completion.choices[0].message.content)
    # The answer letter must survive translation; restore it if the model changed it
    if translated.correct_answer != sample.correct_answer:
        translated.correct_answer = sample.correct_answer
    return translated

In [None]:
mmlu_sample = MMLUSample(
    question = "Which of the following statements about Ethernets is typically FALSE?",
    A = "Ethernets use circuit switching to send messages.",
    B = "Ethernets use buses with multiple masters.",
    C = "Ethernet protocols use a collision-detection method to ensure that messages are transmitted properly.",
    D = "Networks connected by Ethernets are limited in length to a few hundred meters.",
    correct_answer = "A"
)

translate_mmlu_sample(mmlu_sample, target_language="German")

MMLUSample(question='Welche der folgenden Aussagen über Ethernets ist typischerweise FALSCH?', A='Ethernets verwenden Schaltauswahl, um Nachrichten zu senden.', B='Ethernets verwenden Busse mit mehreren Master-Geräten.', C='Ethernet-Protokolle verwenden ein Kollisions-Erkennungsverfahren, um sicherzustellen, dass Nachrichten ordnungsgemäß übertragen werden.', D='Netzwerke, die durch Ethernets verbunden sind, sind in ihrer Länge auf einige hundert Meter begrenzt.', correct_answer='A')
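
Following the tip about the correct answer, a translated sample can be sanity-checked before it is used for evaluation. The sketch below works on plain dictionaries so that it needs only the standard library; the function name `check_translation` and the toy "translated" samples are assumptions made for illustration.

```python
REQUIRED_KEYS = ("question", "A", "B", "C", "D", "correct_answer")

def check_translation(original: dict, translated: dict) -> list:
    """Return a list of problems; an empty list means the sample looks fine."""
    problems = []
    for key in REQUIRED_KEYS:
        if not str(translated.get(key) or "").strip():
            problems.append(f"missing or empty field: {key}")
    if translated.get("correct_answer") != original.get("correct_answer"):
        problems.append("correct_answer changed")
    return problems

original = {
    "question": "Which of the following statements about Ethernets is typically FALSE?",
    "A": "Ethernets use circuit switching to send messages.",
    "B": "Ethernets use buses with multiple masters.",
    "C": "Ethernet protocols use a collision-detection method to ensure that messages are transmitted properly.",
    "D": "Networks connected by Ethernets are limited in length to a few hundred meters.",
    "correct_answer": "A",
}
# Toy stand-ins for model output: only the question is translated here.
translated_ok = dict(
    original,
    question="Welche der folgenden Aussagen über Ethernets ist typischerweise FALSCH?",
)
translated_bad = dict(translated_ok, D="", correct_answer="B")

print(check_translation(original, translated_ok))
print(check_translation(original, translated_bad))
```

With the Pydantic model above, the dictionaries could be obtained via `sample.model_dump()`.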

Now let's recall the MMLU evaluator code we've written and add a little twist: we'll specify both the topic and the language in which we want to evaluate the model.

In [None]:
!pip install datasets -q

**Task**: Modify the following MMLUEvaluator code so that it can also translate the input question and evaluate the performance in a different language.

In [None]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm

from datasets import load_dataset

class MMLUEvaluator:
    def __init__(self, system_prompt: str = None, prompt: str = None,
                 topic: str = "high_school_mathematics"):
        """
        Initialize the MMLU evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
            topic: Which topic to choose
        """

        self.topic = topic
        self.topic_prettified = topic.replace("_", " ")
        self.system_prompt = system_prompt or f"You are an expert in {self.topic_prettified}."

        self.prompt = prompt or """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""

        self.questions, self.choices, self.answers = self.load_mmlu_data(topic=self.topic)

    def load_mmlu_data(self, topic: str) -> Tuple[pd.Series, pd.DataFrame, pd.Series]:
        """
        Load MMLU test data on a given topic.

        Args:
            topic: Which topic to choose

        Returns:
            Tuple of (questions, choices, answers)
        """

        dataset = load_dataset("cais/mmlu", topic, split="test")
        dataset = pd.DataFrame(dataset)

        # Load questions and choices separately
        questions = dataset["question"]
        choices = pd.DataFrame(
            data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
        )
        # In the dataset, true answer labels are in 0-3 format;
        # We convert it to A-D
        answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

        return questions, choices, answers

    def extract_answer(self, solution: str) -> str:
        """
        Extract the letter answer from model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer letter (A, B, C, D, or Failed to parse)
        """
        # Take whatever follows the '#ANSWER:' marker
        try:
            answer = solution.split('#ANSWER:')[1].strip()
        except (IndexError, AttributeError):
            answer = "Failed to parse"
        return answer

    def evaluate_single_question(self, question: str, choices: Dict[str, str],
                                 correct_answer: str,
                                 client, model) -> Tuple[bool, str, str]:
        """
        Evaluate a single question.

        Args:
            question: Formatted question string
            choices: Answer options keyed by "A", "B", "C", "D"
            correct_answer: Correct answer letter
            client: Which client to use
            model: Which model to use

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    topic_prettified=self.topic_prettified,
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D']
                ),
                system_prompt=self.system_prompt,
                # client and model belong to the LLM call, not to the prompt template
                client=client, model=model,
                prettify=False
            )
            answer = self.extract_answer(model_response)
            is_correct = (answer.upper() == correct_answer.upper())
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client=nebius_client, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       n_questions=50) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0

        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                choices=self.choices.iloc[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
            )

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'evaluation_log': evaluation_log
        }

        return evaluation_results
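
Splitting on `#ANSWER:` returns everything after the marker, so a reply such as `#ANSWER: **B**` or `#ANSWER: B.` would not compare equal to the letter `B`. A stricter extraction could look like the sketch below. It is an illustration rather than the course's method: the function name `extract_answer_letter` is hypothetical, and it assumes the answer is written as an uppercase letter, optionally wrapped in asterisks or parentheses.

```python
import re

# Marker, optional Markdown asterisks or an opening parenthesis,
# then a single uppercase letter A-D that ends at a word boundary.
ANSWER_RE = re.compile(r"#ANSWER:\s*\**\s*\(?([A-D])\b")

def extract_answer_letter(solution: str) -> str:
    """Return the letter after the last '#ANSWER:' marker, or 'Failed to parse'."""
    matches = ANSWER_RE.findall(solution or "")
    return matches[-1] if matches else "Failed to parse"

for reply in [
    "Let's think step by step... #ANSWER: B",
    "#ANSWER: **C**",
    "#ANSWER: (D).",
    "I am not sure which option is right.",
]:
    print(repr(reply), "->", extract_answer_letter(reply))
```

The word boundary keeps a word like "Both" from being read as the letter B, although a reply such as `#ANSWER: A good choice` would still be read as A.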


### Solution

In [None]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm

from datasets import load_dataset

class MMLUEvaluator:
    def __init__(self, system_prompt: str = None, prompt: str = None,
                 topic: str = "high_school_mathematics",
                 language: str = "English"):
        """
        Initialize the MMLU evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
            topic: Which topic to choose
            language: Language to evaluate in; questions are translated when it is not "English"
        """

        self.topic = topic
        self.language = language
        self.topic_prettified = topic.replace("_", " ")
        self.system_prompt = system_prompt or f"You are an expert in {self.topic_prettified}."

        self.prompt = prompt or """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""

        self.questions, self.choices, self.answers = self.load_mmlu_data(topic=self.topic)

    def load_mmlu_data(self, topic: str) -> Tuple[pd.Series, pd.DataFrame, pd.Series]:
        """
        Load MMLU test data on a given topic.

        Args:
            topic: Which topic to choose

        Returns:
            Tuple of (questions, choices, answers)
        """

        dataset = load_dataset("cais/mmlu", topic, split="test")
        dataset = pd.DataFrame(dataset)

        # Load questions and choices separately
        questions = dataset["question"]
        choices = pd.DataFrame(
            data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
        )
        # In the dataset, true answer labels are in 0-3 format;
        # We convert it to A-D
        answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

        return questions, choices, answers

    def extract_answer(self, solution: str) -> str:
        """
        Extract the letter answer from model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer letter (A, B, C, D, or Failed to parse)
        """
        # Take everything that follows the '#ANSWER:' marker
        try:
            answer = solution.split('#ANSWER:')[1].strip()
        except (IndexError, AttributeError):
            # The marker is missing or the response is not a string
            answer = "Failed to parse"
        return answer

    def evaluate_single_question(self, question: str, choices: Dict[str, str],
                                 correct_answer: str,
                                 client, model) -> Tuple[bool, str, str]:
        """
        Evaluate a single question.

        Args:
            question: Question text
            choices: Answer options keyed by "A", "B", "C", "D"
            correct_answer: Correct answer letter
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            if self.language != "English":
                sample = MMLUSample(
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D'],
                    correct_answer=correct_answer
                )
                translated = translate_mmlu_sample(sample, target_language=self.language)
                question = translated.question
                choices = {"A": translated.A, "B": translated.B, "C": translated.C, "D": translated.D}
                correct_answer = translated.correct_answer
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    topic_prettified=self.topic_prettified,
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D']
                ),
                system_prompt=self.system_prompt,
                # client and model are arguments of answer_with_llm, not of str.format
                client=client, model=model,
                prettify=False
            )
            answer = self.extract_answer(model_response)
            is_correct = (answer.upper() == correct_answer.upper())
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client=nebius_client, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       n_questions=50) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0

        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                choices=self.choices.iloc[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
            )

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'evaluation_log': evaluation_log
        }

        return evaluation_results
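
Note that `extract_answer` returns everything after `#ANSWER:`, so a response ending in `#ANSWER: B.` or `#ANSWER: **B**` is counted as wrong even when B is correct. Below is a minimal sketch of a stricter parser, assuming the model writes the letter in upper case as the prompt requests. The helper name `extract_answer_letter` and the pattern are illustrative and not part of the original solution.

```python
import re
from typing import Optional

# Hypothetical helper: the marker, optional punctuation or whitespace,
# then a single upper-case letter A-D that is not the start of a longer word.
ANSWER_PATTERN = re.compile(r"#ANSWER:\W*([A-D])\b")


def extract_answer_letter(solution: Optional[str]) -> Optional[str]:
    """Return the letter after the last '#ANSWER:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(solution or "")
    # If the model repeats the marker, trust the final occurrence
    return matches[-1] if matches else None


print(extract_answer_letter("Let us think.\n#ANSWER: **C**. Because ..."))
print(extract_answer_letter("I am not sure."))
```

Returning `None` instead of a sentinel string makes it possible to count unparsed responses separately from wrong answers when inspecting `evaluation_log`.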


### Testing

In [None]:
evaluator = MMLUEvaluator(topic="medical_genetics", language="English")

results = evaluator.run_evaluation(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                                   n_questions=50)
print(f'\nAccuracy: {results["accuracy"]}')


100%|██████████| 50/50 [08:23<00:00, 10.08s/it]


Accuracy: 0.9





In [None]:
evaluator_de = MMLUEvaluator(topic="medical_genetics", language="German")

results_de = evaluator_de.run_evaluation(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                                         n_questions=10)
print(f'\nAccuracy: {results_de["accuracy"]}')

100%|██████████| 10/10 [02:16<00:00, 13.62s/it]


Accuracy: 0.5
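
The two runs above use different sample sizes (50 and 10 questions), so the accuracies are not equally reliable. As a rough sanity check, here is a sketch that computes a 95% Wilson score interval for an accuracy measured on `n` questions. The function name `wilson_interval` is illustrative, and the interval assumes the questions are independent samples, which is only an approximation here.

```python
import math
from typing import Tuple


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion observed on n trials.

    z = 1.96 corresponds to a confidence level of roughly 95%.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z ** 2 / n
    center = (accuracy + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Accuracies and sample sizes taken from the two runs above
for label, acc, n in [("English", 0.9, 50), ("German", 0.5, 10)]:
    low, high = wilson_interval(acc, n)
    print(f"{label}: accuracy={acc}, n={n}, 95% interval=({low:.2f}, {high:.2f})")
```

With only 10 questions the interval is wide, so a fair comparison between languages would need the same, larger `n_questions` in both runs.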



