# Part 3: Translation with Flan-T5

In this part you will experiment with Google Flan-T5 models for machine translation. The original T5 model was a unified sequence-to-sequence encoder-decoder architecture pretrained on a variety of tasks including machine translation. The Flan line of models improved on the performance of the original T5 series through instruction fine-tuning.

In this part we will only apply the models, not train them. We will evaluate our results using the [BLEU score](https://en.wikipedia.org/wiki/BLEU), a common (but not perfect) quantitative metric for evaluating the quality of translations. Flan-T5 also comes in several different model sizes; we will study the impact of model size on performance by considering several different versions of Flan-T5.

**Learning objectives.** You will:
1. Examine an encoder-decoder sequence-to-sequence Flan-T5 transformer model
2. Apply Flan-T5 models to perform machine translation
3. Evaluate the quality of machine translations by computing BLEU scores with respect to reference translations
4. Study the effect of model size on translation quality

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate inference, consider using GPU resources such as a `CUDA` device through the CS department cluster. Alternatives include Google Colab or local GPU resources for those running on machines with GPU support.

First, ensure that you have the `transformers` and `datasets` modules installed. We will use these modules for importing tokenizers, pretrained models, and datasets. You can run the following cells to try to install them with `pip` if needed. If you are using ondemand, ideally you would simply include `module load transformers` and `module load datasets` when making your initial reservation.

In [None]:
#pip install transformers

In [None]:
#pip install datasets

First we import the `flan-t5` tokenizer (shared across all model sizes) and demonstrate its characteristics and usage. Note that the API is the same as we saw previously for the `BERT` model; you may want to review the extra details in that earlier part.

Note that the example contains both English and French text.

(If you have trouble downloading the tokenizer, you may need to install the `sentencepiece` module, for example with `pip install sentencepiece`.)

In [None]:
# run but you do not need to modify this code

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

print("Vocabulary size: ", tokenizer.vocab_size)
tokenized = tokenizer(["The little black cat sleeps in the window", 
                       "Le petit chat noir dort dans la fenêtre"], padding='longest')
print(tokenized)

## Task 1

Below we import and preview the `flan-t5` model, beginning with the small version. You will also note that we are using 16-bit float representations to save on memory (this may be particularly relevant if you are using GPU compute for larger models, where the GPU may have limited memory available).

In [None]:
# run, but you do not need to modify this code
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small", torch_dtype=torch.float16)
print(model)

Examine the `model.parameters()`. How much memory (in kilobytes (KB), megabytes (MB), or gigabytes (GB)) should it take to store the model itself, given the 16-bit (or 2 byte) precision specified in the import? Briefly explain.
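As a reminder of the arithmetic involved, storage is roughly the number of parameters times the bytes per parameter. The sketch below works this out for a made-up toy model; the shapes in `toy_shapes` and the helper `storage_bytes` are hypothetical illustrations, not the real Flan-T5 shapes or part of any library.

```python
import math

# Hypothetical parameter tensor shapes for a toy model (NOT the Flan-T5 shapes).
toy_shapes = [(1000, 64), (64, 64), (64,)]
BYTES_PER_PARAM = 2  # 16-bit floats occupy 2 bytes each


def storage_bytes(shapes, bytes_per_param):
    """Total bytes needed to store tensors with the given shapes."""
    num_params = sum(math.prod(shape) for shape in shapes)
    return num_params * bytes_per_param


total = storage_bytes(toy_shapes, BYTES_PER_PARAM)
print(f"{total} bytes = {total / 1e3:.1f} KB = {total / 1e6:.3f} MB")
```

For a real PyTorch model, each tensor yielded by `model.parameters()` reports the same two quantities through `numel()` and `element_size()`.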

In [None]:
# write code for task 1 here

*Briefly explain for task 1 here*

## Task 2

Now we demonstrate the basic usage of the model's `generate` method. This method autoregressively generates new text, as we have discussed before in the context of causal language modeling, and supports different decoding approaches (greedy, beam search, and sampling). The `generate` method API is [documented here](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/text_generation#transformers.GenerationMixin.generate).

The example below demonstrates encoding a *batch* of inputs and passing them to the model for autoregressive generation. The printed output is generated by the model for the first and second input in the batch respectively.

In [None]:
# run but you do not need to modify this code

input_text = ["The little black cat sleeps in the window", "The dog runs in the field"]
encoded = tokenizer(input_text, return_tensors="pt", padding="longest")

outputs = model.generate(**encoded, max_new_tokens=100)
print(outputs)

for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))

Note that the model did not translate the inputs. That is by design: Flan-T5 was pretrained on several different tasks including but not limited to machine translation.

In order to use the model for translation, we need to **prompt it** to do so, providing context (literally) connecting to its pretraining. Several different prompts should work; we demonstrate one below:

In [None]:
# run but you do not need to modify this code

input_text = ["The little black cat sleeps in the window", "The dog runs in the field"]
prompt = "Translate English into French: "
prompted_text = [prompt + in_text for in_text in input_text]
encoded = tokenizer(prompted_text, return_tensors="pt", padding="longest")

outputs = model.generate(**encoded, max_new_tokens=100)

for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))

For this task, your goal is to use the model to translate a large collection of German text into English, drawing from a paired translation of the novel Jane Eyre. Below we download and prepare the data.

In [None]:
# run but you do not need to modify this code

from datasets import load_dataset
from torch.utils.data import Dataset

class OpusDataset(Dataset):
    def __init__(self, dataset_stream, num_examples):
        # Convert streaming dataset to list for random access
        self.examples = list(dataset_stream.take(num_examples))
        
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        return self.examples[idx]

# Load the dataset in streaming mode
dataset_stream = load_dataset(
    "opus_books",
    "de-en",
    split="train",
    streaming=True
)

# Create instance of custom dataset
dataset = OpusDataset(dataset_stream, num_examples=500)

# Print a few examples to verify
for i in range(100, 103):
    print(f"\nExample {i+1}:")
    print(f"German: {dataset[i]['translation']['de']}")
    print(f"English: {dataset[i]['translation']['en']}")
print(f"\nTotal examples loaded: {len(dataset)}")

**Use the Flan-T5 model to translate all of the German text in the dataset into English.** Even with GPU compute, this may take several minutes, but should not take hours. We encourage you to add some output every 10 or 50 examples so that you can track the progress, though you are not required to do so.

**Select at least three examples from the dataset and print the model translation as well as the real English text.**
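One possible way to organize the loop is to process the texts in small batches and print a progress message at regular intervals. The sketch below shows only that structure: `batched`, `translate_all`, and `fake_translate` are hypothetical helpers, and `fake_translate` is a stand-in for a function that would tokenize the prompts, call `model.generate`, and decode the outputs as in the cells above. The German-to-English prompt wording mirrors the earlier English-to-French prompt and is an assumption you may need to adjust.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(texts, translate_batch, prompt, batch_size=8, report_every=50):
    """Prepend `prompt` to every text, translate batch by batch, report progress."""
    translations = []
    next_report = report_every
    for batch in batched(texts, batch_size):
        prompted = [prompt + text for text in batch]
        translations.extend(translate_batch(prompted))
        if len(translations) >= next_report:
            print(f"Translated {len(translations)} / {len(texts)}")
            next_report += report_every
    return translations


def fake_translate(prompts):
    """Stand-in for the real model call; it just upper-cases its input."""
    return [p.upper() for p in prompts]


german = ["Der Hund läuft.", "Die Katze schläft.", "Es regnet."]
prompt = "Translate German into English: "  # assumed wording
print(translate_all(german, fake_translate, prompt, batch_size=2, report_every=2))
```

Batching keeps memory use bounded, since only `batch_size` padded sequences are passed to the model at once.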

In [None]:
# write code for task 2 here

## Task 3

Translations are difficult to evaluate quantitatively, especially without expert human translators. One common metric is the [BLEU score](https://en.wikipedia.org/wiki/BLEU).
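To give some intuition for what BLEU measures, the sketch below computes a clipped n-gram precision, one of the ingredients of the score: the fraction of n-grams in a prediction that also appear in the reference, where each n-gram is credited at most as often as it occurs in the reference. This is a simplified illustration with whitespace tokenization and a hypothetical helper name (`clipped_precision`); the full metric also combines several n-gram orders and applies a brevity penalty, so these numbers are not BLEU scores.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, reference, n=1):
    """Fraction of prediction n-grams found in the reference, with clipped counts."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    # Each n-gram is credited at most as many times as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


prediction = "the the the cat"
reference = "the cat sat"
# Only one "the" and one "cat" are credited, so 2 of the 4 unigrams count.
print(clipped_precision(prediction, reference, n=1))
print(clipped_precision(prediction, reference, n=2))
```

The clipping step is what keeps a prediction from scoring well by repeating a single common word.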

The example below demonstrates calculating BLEU scores with the `evaluate` module from Hugging Face. You can [see the documentation here](https://huggingface.co/spaces/evaluate-metric/bleu). The first value in the results dictionary, under the key `bleu`, gives the score. Normally the score is reported on a scale from 0 to 100; this implementation reports it on a scale from 0 to 1. Higher values are better, but scores of 1 are not necessarily expected given the many possible ways to translate a sentence.

Note that `predictions` is a list of strings, but `references` is a list of lists of strings. This is because a single predicted translation could potentially have multiple equally good reference translations. In our case, however, we have just a single reference translation per pair, so `references` will be a list of lists, each with a single element.

In [None]:
# run but you do not need to modify this code

import evaluate

predictions = ["the black cat is sleeping in the sun by the window", 
               "the dog runs in the field while it rains"]
references = [["the black cat is sleeps on the sun by the window"], 
              ["the dog run in the field while it rain"]]

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)
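When working from dataset records rather than hand-written strings, the two inputs need to be reshaped into this form. The sketch below shows one way to do that, assuming records shaped like the `translation` dictionaries printed earlier; the `records` and `model_outputs` values are made-up examples, not real dataset entries or model output.

```python
# Made-up records with the same structure as the dataset examples above.
records = [
    {"translation": {"de": "Der Hund läuft.", "en": "The dog runs."}},
    {"translation": {"de": "Die Katze schläft.", "en": "The cat sleeps."}},
]
# Made-up model outputs, one per record and in the same order.
model_outputs = ["The dog is running. ", "The cat sleeps."]

# predictions: a flat list of strings.
predictions = [text.strip() for text in model_outputs]
# references: one single-element list per prediction.
references = [[record["translation"]["en"]] for record in records]

# The metric pairs predictions and references by position, so lengths must match.
assert len(predictions) == len(references)
print(predictions)
print(references)
```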

**Calculate the `BLEU` score of your translations from task 2 against the real English text. Report your results.**

In [None]:
# write code for task 3 here

## Task 4



In this task, study the impact of model scale on the quality of the resulting translations. Earlier we used `model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small", torch_dtype=torch.float16)` to import a `flan-t5-small` model. 

**Use two additional models, `flan-t5-base` and `flan-t5-large`, to generate translations of the same dataset. Evaluate and report the BLEU score for each model's translations.**

In [None]:
# write code for task 4 here