# Translating the HellaSwag Dataset to Russian

The goal of this notebook is to translate the [HellaSwag dataset](https://huggingface.co/datasets/Rowan/hellaswag) into Russian. See [README.md](../README.md) for more information about the HellaSwag benchmark and how it is used to assess commonsense sentence completion in large language models (LLMs).

In [None]:
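# install the datasets library and its supporting packages
# (the extras spec is quoted so the shell does not treat the brackets as a glob)
%pip install datasets "huggingface_hub[hf_xet]" ipywidgets --quiet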

In [1]:
# pull the dataset from the Hugging Face Hub
from datasets import load_dataset

dataset = load_dataset("Rowan/hellaswag")

In [None]:
# cool to see that hellaswag already supports the new, more efficient xet file storage
# read more about xet here: https://huggingface.co/blog/xet-on-the-hub

# let's have a look at the first sample from the hellaswag training split
sample = dataset["train"][0]
print("Context (ctx):", sample["ctx"])
print("\nEndings:", sample["endings"])
print("\nLabel (correct ending index):", sample["label"])


In [None]:
# print out the features of the dataset
print("Features:", dataset["train"].features)

# check number of rows for each split
print("Train samples:", len(dataset["train"]))
print("Validation samples:", len(dataset["validation"]))
print("Test samples:", len(dataset["test"]))
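The `label` field is an index into `endings`, so the correct ending can be looked up directly. Below is a minimal sketch of that lookup. It assumes the label is stored as a string (for example `"1"`) and that unlabeled rows, such as those in the test split, carry an empty string; check the printed features above to confirm. `correct_ending` and the `example` row are hypothetical, made up for illustration rather than taken from the dataset.

```python
def correct_ending(sample):
    """Return the ending that the label points at, or None if the row is unlabeled."""
    label = sample["label"]
    # Assumption: unlabeled rows use an empty string (or None) as the label.
    if label in ("", None):
        return None
    # int() accepts both "1" and 1, so this works whether the label is a string or an int.
    return sample["endings"][int(label)]


# Hypothetical row that mimics the dataset's fields (not real HellaSwag data).
example = {
    "ctx": "A man picks up a guitar. He",
    "endings": ["eats it.", "starts to play a song.", "flies away.", "melts."],
    "label": "1",
}

print("Correct ending:", correct_ending(example))
```

The same helper can be applied to a real row, for instance `correct_ending(dataset["train"][0])`.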

## Using Open Source Translation

For this exercise, we'll use the [Opus MT](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru) English-to-Russian translation model from the [Language Technology Research Group at the University of Helsinki](https://huggingface.co/Helsinki-NLP).

Once we've translated the data, human annotators will review it to confirm its accuracy.

In [2]:
# install additional libs
%pip install transformers sentencepiece --quiet

Note: you may need to restart the kernel to use updated packages.


In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch

# load model and tokenizer
model_name = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# use the GPU if one is available (this speeds up translation)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
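With the model loaded, one possible way to translate a row is to send the context and its endings through the model as a single batch and then put the results back into the row. The sketch below is an assumption about how that step could look, not the notebook's final pipeline. `marian_translate` and `translate_row` are hypothetical helper names; the demo at the bottom uses a stand-in translator so that it runs without downloading the model.

```python
def marian_translate(texts, tokenizer, model, device):
    """Translate a list of strings with an already loaded MarianMT model and tokenizer."""
    # Pad to the longest string in the batch and truncate anything over the model limit.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def translate_row(row, translate_fn):
    """Return a copy of a HellaSwag-style row with `ctx` and `endings` translated.

    `translate_fn` is any callable that maps a list of strings to a list of
    translated strings of the same length.
    """
    texts = [row["ctx"], *row["endings"]]
    translated = translate_fn(texts)
    # All other fields (label, ids, and so on) are copied through unchanged.
    return {**row, "ctx": translated[0], "endings": list(translated[1:])}


# Demo with a stand-in translator (upper-casing) and a hypothetical row.
fake_translate = lambda texts: [t.upper() for t in texts]
row = {"ctx": "She opens the door and", "endings": ["walks in.", "sings."], "label": "0"}
print(translate_row(row, fake_translate))

# With the real model from the cell above, the call could look like:
# translate_row(dataset["train"][0], lambda t: marian_translate(t, tokenizer, model, device))
```

One caveat with this design: the endings are translated separately from the context they complete, so their Russian grammar may not always agree with the translated context. That is one of the things the human review step can check.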
