# Evaluating YandexGPT Pro (4th Gen) on the HellaSwag Benchmark

The goal of this notebook is to evaluate [YandexGPT Pro (4th Gen)](https://yandex.cloud/ru/docs/foundation-models/concepts/yandexgpt/models) on the [HellaSwag benchmark](https://github.com/rowanz/hellaswag).

You'll need to set up a [Yandex Cloud account](https://yandex.cloud/ru/docs/overview/quickstart) to run this notebook yourself.  

In [None]:
# Let's have a look at the HellaSwag dataset that we'll be using to evaluate our model
from datasets import load_dataset
dataset = load_dataset("hellaswag", split="validation")

# Change the index to inspect other rows of the HellaSwag dataset
print(dataset[0])


{'ind': 24, 'activity_label': 'Roof shingle removal', 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he', 'ctx': 'A man is sitting on a roof. he', 'endings': ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain', 'label': '3'}


In [None]:
import os
import time
import requests
from datasets import load_dataset, Dataset
from dotenv import load_dotenv
import pandas as pd

# I've generated an IAM token and saved it as an environment variable
# Please follow these instructions to do the same: https://yandex.cloud/ru/docs/iam/operations/iam-token/create
load_dotenv()
token = os.environ["iamToken"]


## Mapping and Configuration

This exercise compares the model's output against known correct answers. Evaluating on HellaSwag means prompting a model with an incomplete passage and asking it to pick the most plausible continuation from a set of four choices.

We'll use the `completion` endpoint for this task, and ask the model to return only a single choice (A, B, C, or D) as a response.

> Please note: it is highly recommended to set the `limit` parameter to a small number first, to confirm that the notebook returns values in the expected format. The `validation` split contains about 10,000 rows (10,042). A complete evaluation cost about 3,000 RUR (about 35 USD), so please exercise caution when setting `limit` to `None`, which evaluates against the entire dataset.

In [None]:
limit = None  # Number of rows to evaluate; None means the entire dataset

# YandexGPT settings. Replace `model_uri` with a valid URI for your own folder.
# To build one, follow these instructions: https://yandex.cloud/ru/docs/foundation-models/concepts/yandexgpt/models#addressing-models
url = 'https://llm.api.cloud.yandex.net/foundationModels/v1/completion'
model_uri = 'gpt://b1glj00q90mpauq6mpft/yandexgpt/latest'

# Load dataset
dataset = load_dataset("hellaswag", split="validation")
total = len(dataset) if limit is None else min(limit, len(dataset))

# Answer mappings
index_to_letter = ["A", "B", "C", "D"]
letter_to_index = {l: i for i, l in enumerate(index_to_letter)}

results = []
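Before launching a run, it can help to get a rough idea of what a given `limit` might cost. The sketch below simply scales the full-run figure quoted above. It assumes that cost grows linearly with the number of rows, takes the 3,000 RUR figure at face value, and uses the 10,042-row size of the validation split. `estimate_cost_rub` is a hypothetical helper, not part of the original notebook.

```python
# Figures quoted in this notebook; linear scaling is an assumption.
FULL_RUN_ROWS = 10_042
FULL_RUN_COST_RUB = 3_000


def estimate_cost_rub(limit=None):
    """Rough cost in RUR of evaluating `limit` rows (None means the full split)."""
    rows = FULL_RUN_ROWS if limit is None else min(limit, FULL_RUN_ROWS)
    return FULL_RUN_COST_RUB * rows / FULL_RUN_ROWS


for n in (25, 100, 1_000, None):
    print(f"limit={n}: ~{estimate_cost_rub(n):.0f} RUR")
```

Actual billing depends on token counts per request, so treat the result as an order-of-magnitude guide only.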


## A Word about Bias

The prompt is one of the key sources of bias in any experiment. The prompt used below was designed to minimize evaluation bias that might result from imperfect fragment recall or similar effects: the model is asked to return only the letter of its choice. This is in line with other experiments that evaluate LLMs through zero-shot classification.
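One bias that the prompt wording alone does not address is position bias, where a model favors a particular letter regardless of content. A common way to probe for it, which was not part of this experiment, is to shuffle the answer options and track where the correct one ends up. The following is a minimal sketch under that assumption; `shuffle_endings` and its `seed` argument are hypothetical names introduced here for illustration.

```python
import random


def shuffle_endings(endings, label, seed):
    """Return (shuffled_endings, new_label) with the options in a seeded random order."""
    rng = random.Random(seed)  # seeded so every run sees the same order
    order = list(range(len(endings)))
    rng.shuffle(order)
    shuffled = [endings[j] for j in order]
    new_label = order.index(label)  # position the correct ending moved to
    return shuffled, new_label


# Endings and label taken from the sample row printed earlier in this notebook
endings = [
    "is using wrap to wrap a pair of skis.",
    "is ripping level tiles off.",
    "is holding a rubik's cube.",
    "starts pulling up roofing on a roof.",
]
shuffled, new_label = shuffle_endings(endings, label=3, seed=0)
print(shuffled)
print("Correct letter after shuffling:", "ABCD"[new_label])
```

Comparing accuracy with and without shuffling would give a rough indication of how much the letter position influences the model's choice.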

In [62]:
def query_yandex_model(context, endings):
    prompt = f"""Read the context and choose the most plausible continuation from the options below.

Context: {context}

A) {endings[0]}
B) {endings[1]}
C) {endings[2]}
D) {endings[3]}

Answer with only A, B, C, or D."""

    payload = {
        "modelUri": model_uri,
        "completionOptions": {
            "stream": False,
            "temperature": 0.3,
            "maxTokens": 100
        },
        "messages": [
            {
                "role": "system",
                "text": "You are a multiple-choice reasoning assistant. Pick the best answer based only on the context."
            },
            {
                "role": "user",
                "text": prompt
            }
        ]
    }

    headers = {"Authorization": f"Bearer {token}"}
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    return response.json()["result"]["alternatives"][0]["message"]["text"].strip()
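The model is asked to answer with a single letter, but with `maxTokens` set to 100 it may still reply with something like "Answer: D", in which case taking the first character of the reply would record the wrong letter. The sketch below shows one way to pull the choice out of such replies. `extract_choice` is a hypothetical helper, not part of the original notebook, and the heuristic is an assumption about what the replies look like.

```python
import re


def extract_choice(text):
    """Return the chosen letter (A-D) from a model reply, or None if none is found."""
    reply = text.strip()
    # Case 1: the reply starts with the letter, e.g. "B", "b)", "C."
    match = re.match(r"^([A-Da-d])(?:[\s).:,]|$)", reply)
    if match:
        return match.group(1).upper()
    # Case 2: first standalone uppercase letter elsewhere, e.g. "The answer is D"
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


for reply in ["B", "c) starts pulling up roofing", "Answer: D", "I don't know"]:
    print(repr(reply), "->", extract_choice(reply))
```

The heuristic is not foolproof: a reply that begins with the article "A" (for example "A man is...") would be read as choice A, so unparsed or suspicious replies are worth logging and inspecting by hand.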


In [63]:
for i in range(total):
    sample = dataset[i]
    ctx = sample["ctx"]
    endings = sample["endings"]
    label = int(sample["label"])
    correct_letter = index_to_letter[label]

    try:
        model_response = query_yandex_model(ctx, endings)
        predicted_letter = model_response.strip().upper()[0]

        is_correct = predicted_letter == correct_letter

        print(f"[{i}] Predicted: {predicted_letter} | Correct: {correct_letter} | ✅ {is_correct}")

        results.append({
            "id": i,
            "context": ctx,
            "endings": endings,
            "model_answer": predicted_letter,
            "correct_answer": correct_letter,
            "is_correct": is_correct
        })

        time.sleep(0.5)

    except Exception as e:
        print(f"[{i}] Error: {e}")


[0] Predicted: B | Correct: D | ✅ False
[1] Predicted: D | Correct: D | ✅ True
[2] Predicted: C | Correct: C | ✅ True
[3] Predicted: A | Correct: C | ✅ False
[4] Predicted: B | Correct: B | ✅ True
[5] Predicted: B | Correct: B | ✅ True
[6] Predicted: B | Correct: C | ✅ False
[7] Predicted: A | Correct: A | ✅ True
[8] Predicted: B | Correct: B | ✅ True
[9] Predicted: B | Correct: B | ✅ True
[10] Predicted: B | Correct: D | ✅ False
[11] Predicted: C | Correct: D | ✅ False
[12] Predicted: C | Correct: C | ✅ True
[13] Predicted: C | Correct: C | ✅ True
[14] Predicted: A | Correct: A | ✅ True
[15] Predicted: C | Correct: D | ✅ False
[16] Predicted: C | Correct: C | ✅ True
[17] Predicted: A | Correct: A | ✅ True
[18] Predicted: B | Correct: B | ✅ True
[19] Predicted: A | Correct: B | ✅ False
[20] Predicted: A | Correct: B | ✅ False
[21] Predicted: A | Correct: A | ✅ True
[22] Predicted: D | Correct: D | ✅ True
[23] Predicted: D | Correct: D | ✅ True
[24] Predicted: D | Correct: A | ✅ False
...
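The loop above prints an error and moves on when a request fails, so any row that hits a transient network or rate-limit error is left out of `results`. One possible mitigation is to retry failed calls with an increasing delay. The following is a minimal sketch of that idea; `with_retries` and its parameters are hypothetical and not part of the original notebook, and the flaky function stands in for a real API call.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**k, then retry.

    The exception from the last attempt is re-raised.
    """
    for k in range(attempts):
        try:
            return fn()
        except Exception:
            if k == attempts - 1:
                raise
            sleep(base_delay * 2 ** k)


# Stand-in for an API call that fails twice and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "B"


delays = []  # record the delays instead of actually sleeping
answer = with_retries(flaky, sleep=delays.append)
print(answer, delays)
```

In the evaluation loop this could be used as `with_retries(lambda: query_yandex_model(ctx, endings))`, leaving the surrounding `try`/`except` in place to catch rows that still fail after the final attempt.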

In [64]:
hf_dataset = Dataset.from_list(results)
hf_dataset.save_to_disk("yandex-hellaswag-local")
print(f"\n✅ Saved {len(results)} examples to `yandex-hellaswag-local`")

# Show in a table
df = pd.DataFrame(results)
df[["id", "model_answer", "correct_answer", "is_correct"]]


Saving the dataset (1/1 shards): 100%|██████████| 10042/10042 [00:00<00:00, 1133150.41 examples/s]


✅ Saved 10042 examples to `yandex-hellaswag-local`





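| | id | model_answer | correct_answer | is_correct |
|---|---|---|---|---|
| 0 | 0 | B | D | False |
| 1 | 1 | D | D | True |
| 2 | 2 | C | C | True |
| 3 | 3 | A | C | False |
| 4 | 4 | B | B | True |
| ... | ... | ... | ... | ... |
| 10037 | 10037 | B | B | True |
| 10038 | 10038 | D | D | True |
| 10039 | 10039 | D | D | True |
| 10040 | 10040 | C | C | True |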
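The table lists per-row results, but the headline number for a benchmark like this is overall accuracy: the fraction of rows where the model's letter matches the correct one. A minimal sketch of that calculation follows. `accuracy` is a hypothetical helper, and the sample rows are the first five rows of the table above, so the printed value describes only those five rows and not the full run.

```python
def accuracy(rows):
    """Fraction of rows whose `is_correct` flag is True; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["is_correct"]) / len(rows)


# First five rows of the results table, in the shape appended to `results`
sample_rows = [
    {"id": 0, "model_answer": "B", "correct_answer": "D", "is_correct": False},
    {"id": 1, "model_answer": "D", "correct_answer": "D", "is_correct": True},
    {"id": 2, "model_answer": "C", "correct_answer": "C", "is_correct": True},
    {"id": 3, "model_answer": "A", "correct_answer": "C", "is_correct": False},
    {"id": 4, "model_answer": "B", "correct_answer": "B", "is_correct": True},
]
print(f"Accuracy on the sample rows: {accuracy(sample_rows):.1%}")
```

Calling `accuracy(results)` after the evaluation loop would give the figure for the whole run, and `df["is_correct"].mean()` computes the same value from the DataFrame.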


## Upload to the Hub

As with any good open-source experiment, let's upload our results to the Hugging Face Hub so that others can make use of the data that we've generated here!

In [None]:
from huggingface_hub import login
from datasets import load_from_disk

# Log in to the HF Hub with an access token
# (create one at https://huggingface.co/settings/tokens)
login()

# Load the saved dataset from disk
hf_dataset = load_from_disk("yandex-hellaswag-local")

# Push to HF Hub
hf_dataset.push_to_hub("ZennyKenny/yandexgptpro_4th_gen-hellaswag")


Creating parquet from Arrow format: 100%|██████████| 11/11 [00:00<00:00, 217.23ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:03<00:00,  3.89s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/ZennyKenny/yandexgptpro_4th_gen-hellaswag/commit/395cb8788599f1f7a31df10254aca32f0df03ff9', commit_message='Upload dataset', commit_description='', oid='395cb8788599f1f7a31df10254aca32f0df03ff9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ZennyKenny/yandexgptpro_4th_gen-hellaswag', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ZennyKenny/yandexgptpro_4th_gen-hellaswag'), pr_revision=None, pr_num=None)