# Introduction

In this project I am going to explore finetuning a pre-trained LLM from Hugging Face for the purpose of translating Russian to English. I will then use a different LLM to summarize the translation. I will be using the PyTorch library.

## Loading the Libraries

In [None]:
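# sentencepiece is needed by MarianTokenizer; requests and beautifulsoup4 are used for scraping
!pip install transformers datasets torch sentencepiece requests beautifulsoup4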

import torch
from transformers import MarianMTModel, MarianTokenizer, Trainer, TrainingArguments, pipeline, BartForConditionalGeneration, BartTokenizer
from datasets import load_dataset
import requests
from bs4 import BeautifulSoup

## Loading the Dataset and Model for Finetuning

I am using the Helsinki-NLP/opus-mt-ru-en model for translation and the opus-100 dataset with English and Russian sentence pairs.

In [None]:
dataset = load_dataset("Helsinki-NLP/opus-100", "en-ru")
model_name = "Helsinki-NLP/opus-mt-ru-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Finetuning the LLM

## Processing the Data

I will begin by setting the inputs and targets, with Russian as the input and English as the target. The sentence pairs are then tokenized, truncated to at most 128 tokens, and padded to that same length for uniformity.


In [None]:
def preprocess_function(examples):
    # Russian is the source language, English is the target
    inputs = [example['ru'] for example in examples['translation']]
    targets = [example['en'] for example in examples['translation']]

    # text_target tokenizes the English side in target-language mode
    # and stores the result under "labels"
    model_inputs = tokenizer(inputs, text_target=targets, padding="max_length", truncation=True, max_length=128)

    # Replace padding ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [token if token != tokenizer.pad_token_id else -100 for token in label]
        for label in model_inputs["labels"]
    ]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
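
To make the padding step more concrete, here is a minimal pure-Python sketch of what padding or truncating to a fixed length does to a list of token ids, and how padded label positions can be masked with -100 (the value that PyTorch's cross-entropy loss ignores by default). The helper names and the pad id of 0 are hypothetical and chosen only for illustration; the real tokenizer handles this internally.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, or right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_padding(label_ids, pad_id):
    """Replace padding positions in the labels so the loss ignores them."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]


PAD_ID = 0  # hypothetical pad id, for illustration only
labels = pad_or_truncate([17, 42, 8], max_length=5, pad_id=PAD_ID)
print(labels)                        # [17, 42, 8, 0, 0]
print(mask_padding(labels, PAD_ID))  # [17, 42, 8, -100, -100]
```

Without the masking step, the model would also be trained to predict the padding tokens, which can dominate the loss when most sentences are much shorter than the maximum length.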

## Splitting the Data for Training and Validation

In [None]:
train_dataset = tokenized_datasets["train"]
validation_dataset = tokenized_datasets["validation"]

## Setting the Parameters and Training the Model

I set the training arguments to fit the GPU used and the dataset type. Experiment with adjusting the arguments for different results.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

training_arguments = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=1000,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=1,
    logging_dir="./logs",
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=torch.cuda.is_available(),  # mixed precision only works when a GPU is present
    fp16_opt_level="O2",
    weight_decay=0.01,
    learning_rate=5e-5,
    warmup_steps=1000,
    lr_scheduler_type="cosine",
    report_to="none",
    run_name="opus_mt_ru_en_translation",
    disable_tqdm=False,
    dataloader_num_workers=8,
    seed=42,
    gradient_checkpointing=True,
    dataloader_pin_memory=True
)

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer
)

trainer.train()
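
To get a feel for how `warmup_steps`, `lr_scheduler_type="cosine"`, and the batch size interact, the sketch below estimates the total number of optimizer steps and evaluates a cosine schedule with linear warmup at a few points. It is a plain-Python approximation of the usual formula, not the Trainer's own code. The training set size of 1,000,000 pairs is an assumption (check `len(train_dataset)`), and the calculation assumes a single device.

```python
import math


def steps_per_epoch(num_examples, batch_size, grad_accum_steps=1):
    """Approximate number of optimizer steps in one epoch on a single device."""
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / grad_accum_steps)


def cosine_lr(step, total_steps, warmup_steps=1000, peak_lr=5e-5):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


NUM_TRAIN_EXAMPLES = 1_000_000  # assumed size; replace with len(train_dataset)
total_steps = 3 * steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size=64)
print("Total optimizer steps:", total_steps)  # 46875 under the assumptions above

for step in (0, 500, 1000, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

Under these assumptions the warmup covers only about 2% of training, and evaluating every 500 steps gives roughly 90 evaluation passes over the validation set.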

# Translation and Summarization

## Scraping the Text

Here the text is scraped from the Russian-language Wikipedia article on machine learning. The text is then printed together with its word count.

In [None]:
def scrape_website_text(url):
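    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page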
    soup = BeautifulSoup(response.content, "html.parser")
    paragraphs = soup.find_all('p')
    website_text = " ".join([para.get_text() for para in paragraphs])
    return website_text

url = "https://ru.wikipedia.org/wiki/%D0%9C%D0%B0%D1%88%D0%B8%D0%BD%D0%BD%D0%BE%D0%B5_%D0%BE%D0%B1%D1%83%D1%87%D0%B5%D0%BD%D0%B8%D0%B5"

website_text = scrape_website_text(url)

print("Word Count:")
print(len(website_text.split()))
print("Scraped Text:")
print(website_text)
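
Text scraped from Wikipedia paragraphs typically still contains bracketed footnote markers such as `[1]` and irregular whitespace, which add noise for the translation model. The following is a minimal sketch of an optional cleanup step; `clean_scraped_text` is a hypothetical helper, not part of the original pipeline.

```python
import re


def clean_scraped_text(text):
    """Remove numeric footnote markers like [12] and collapse runs of whitespace."""
    without_refs = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_refs).strip()


sample = "Машинное обучение[1] изучает методы.[2][3]  Далее."
print(clean_scraped_text(sample))  # Машинное обучение изучает методы. Далее.
```

It could be applied right after scraping, for example `website_text = clean_scraped_text(website_text)`.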

## Translating

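The finetuned model is used to translate the scraped text. Note that the input is truncated to 512 tokens, so for a long article only the beginning is translated.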

In [None]:
def translate_text(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    translated = model.generate(**inputs)
    translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
    return translated_text

translated_text = translate_text(website_text)

print("Word Count:")
print(len(translated_text.split()))
print("\nTranslated Text:")
print("\n".join(translated_text.split(". ")))
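
Because `translate_text` truncates its input to 512 tokens, one way to cover a whole article is to split the text into sentence-based chunks and translate each chunk separately. The sketch below shows that idea under some assumptions: `split_into_chunks` and `translate_long_text` are hypothetical helpers, the sentence splitting is a simple regex, and the word count is only a rough proxy for the token count.

```python
import re


def split_into_chunks(text, max_words=200):
    """Group sentences into chunks of at most max_words words where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_long_text(text, translate_fn, max_words=200):
    """Translate each chunk with translate_fn and join the results."""
    return " ".join(translate_fn(chunk) for chunk in split_into_chunks(text, max_words))


# Demonstration with a stand-in "translator" that just upper-cases the text
print(translate_long_text("Раз два. Три четыре. Пять шесть.", str.upper, max_words=4))
```

With the real model this would be called as `translate_long_text(website_text, translate_text)`. A single sentence longer than `max_words` is kept as its own chunk and may still be truncated by the tokenizer.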

## Summarizing

Finally, the translated text is summarized using the facebook/bart-large-cnn model.

In [None]:
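# Use separate names so the translation model and tokenizer stay available
summary_model_name = "facebook/bart-large-cnn"
summary_model = BartForConditionalGeneration.from_pretrained(summary_model_name).to(device)
summary_tokenizer = BartTokenizer.from_pretrained(summary_model_name)

# Inputs longer than 1024 tokens are truncated before summarization
inputs = summary_tokenizer(translated_text, return_tensors="pt", max_length=1024, truncation=True).to(device)

summary_ids = summary_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=4,
    max_length=100,
    early_stopping=True,
    no_repeat_ngram_size=2,
    length_penalty=1.0
)

summary = summary_tokenizer.decode(summary_ids[0], skip_special_tokens=True)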

print("Summary Word Count:")
print(len(summary.split()))
print("\nSummarized Text:")
print("\n".join(summary.split(". ")))

# Conclusion

Thank you for reading. Any feedback is welcome.