# Homework 3

**How to submit.** For this homework, submit this `.ipynb` file with your answers.

## BERT

### Task 1: BERT + text classification (3 points)

In [practice session 4](https://colab.research.google.com/drive/17wS7QHUHAqkn7Hay370Yo1TbGvDEIdyw?usp=sharing), we trained a text classification model on the IMDb dataset using BERT's sentence representations. We used a randomly selected part of the dataset, froze all layers of the pre-trained `bert-base-uncased` model, and trained only the classifier itself, which takes the final-layer representation of the [CLS] token as input. In this task, you will apply a different strategy to the same problem.

Use the whole IMDb dataset. Use the `train` split (25,000 examples) as the training set, and the `test` split (25,000 examples) as the development set.

Train a binary classifier based on BERT embeddings. You may use `bert-base-uncased` or another model of the same class. You may employ any strategy except the one used in the practice session. For example, you can fine-tune the whole model with a classifier on top of the [CLS] token, or first extract embeddings from the model and then train an independent classifier on them (see [this post](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)). You can use the output of the last layer, any intermediate layer, or a combination of several layers. Instead of the embedding of the [CLS] token, you can use the average of the embeddings of all tokens in the text. You can also try changing the learning rate and the number of training epochs.
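
As an illustration of the averaging option mentioned above, here is a minimal sketch of masked mean pooling. It is written in plain Python so the arithmetic is easy to follow; the function name `mean_pool` and the toy vectors are assumptions made for this example, and with real model outputs you would most likely do the same computation on tensors. The attention mask is used so that padding positions do not contribute to the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, skipping padded positions.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Average each embedding dimension over the kept tokens only.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example (hypothetical 2-dimensional embeddings, last position is padding).
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(embeddings, mask)
```

The resulting sentence vector can then serve as the input to whichever classifier you choose.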

The evaluation accuracy of your model should be over 83%. You get **1 bonus point** if it's over 93%.

**Subtask 1 (1 point).** Describe the details of your approach.

* What model are you using (e.g. `bert-base-uncased`, `distilbert-base-cased`, etc.)?
* Are you fine-tuning the parameters of the model or just using outputs of the pre-trained model?
* What does your classifier look like (e.g. `AutoModelForSequenceClassification`, a feed-forward neural network, logistic regression, etc.)?
* What is the classifier's input (the final-layer representation of the [CLS] token, the average of final-layer word representations, etc.)?
* What hyperparameters does your model have (learning rate, number of epochs)?

**Your answer:** ...

**Subtask 2 (2 points).** Train your model using `transformers` and `datasets` libraries (refer to practice session 4 materials for a detailed example).

In [None]:
### YOUR CODE ###

## Evaluating MT

In the following tasks, you will explore MT quality metrics and work on evaluating the Estonian$\rightarrow$English model that you trained in homework 2. 

### Task 2: MT evaluation metrics (1.5 points)

There are quite a few metrics for evaluating machine translation. BLEU is the most popular automatic metric, but it has disadvantages that other metrics try to overcome. In this task, we will compare these metrics.

**Subtask 1 (1.5 points).** Explain the main idea behind five MT quality metrics (2-3 sentences per metric are enough).

**BLEU:** ...

**chrF:** ...

**METEOR:** ...

**TER:** ...

**BERTScore:** ...

### Task 3: Tricking BLEU (1 point)

The most popular automatic metric for evaluating machine translation quality is BLEU (bilingual evaluation understudy). It measures how close a translation is to a reference ("ground truth") translation produced by a human. We put "ground truth" in quotes, because, unlike with, say, classification, in machine translation there is no single correct answer. Several different translations can all be perfectly correct, while having very different wording.

BLEU is reported to correlate well with human judgements of translation quality. However, a BLEU score for a single sentence, as opposed to a score computed over many sentences, can be misleading.
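
To make the computation concrete, here is a simplified sketch of single-reference, sentence-level BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is an illustration under stated assumptions, not the `sacreBLEU` implementation: it splits on whitespace, applies no smoothing, and the function names are made up for this example, so its numbers will not match `sacreBLEU` output.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU for one hypothesis and one reference (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            # Without smoothing, a single n-gram order with no match zeroes the score.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes hypotheses shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


score = sentence_bleu("the cat sat on the mat", "the cat sat on a mat")
```

Note that the score depends only on surface n-gram overlap with the reference, which is why a single-sentence value can diverge from a human judgement in either direction.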

**Subtask 1 (0.5 points)**. Try to come up with examples of translations that can fool BLEU. (If you are unsure how BLEU works, check out the practice session 5 materials or search online.) Give an example of a sentence in some language you know, a good translation of this sentence into English, and a bad translation into English that would still get a decent BLEU score with the good translation as the reference. (Please also explain what is happening in your non-English sentence, what it means, and why the bad translation is bad.)

**Your answer:** ...

**Subtask 2 (0.5 points).** Now do the same, but the other way around: come up with a sentence in your language, a good reference translation of that sentence into English, and another translation which is also good, but would have a low BLEU score when compared to the first translation. Explain.

**Your answer:** ...

### Task 4: Calculate your model's BLEU (1.5 points)

**Subtask 1 (0.75 points).** In the previous homework, you separated a test set of 2,000 lines. Preprocess this test set, translate it with the last checkpoint of your model, and postprocess the translation. Compare your translation to the reference translation (the English side of your test set) by calculating the BLEU score with `sacreBLEU`. (Do `pip install sacrebleu` if you don't have the package in your virtual environment.)

Use `sacreBLEU` in the following way:

`cat hypothesis.en | sacrebleu reference_translation.en`

Report the output.
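
Before scoring, it may be worth checking that the hypothesis and reference files are line-aligned, since a corpus-level score is only meaningful when line *i* of one file corresponds to line *i* of the other. The sketch below is an optional sanity check; the helper names are hypothetical, and the file names in the usage comment are simply the ones from the command above.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def check_parallel(hypothesis_path, reference_path):
    """Return the shared line count, or raise ValueError if the files differ in length."""
    hyp = read_lines(hypothesis_path)
    ref = read_lines(reference_path)
    if len(hyp) != len(ref):
        raise ValueError(
            f"line count mismatch: {len(hyp)} hypotheses vs {len(ref)} references"
        )
    # Empty hypothesis lines are legal but usually point to a processing problem.
    empty = [i + 1 for i, line in enumerate(hyp) if not line.strip()]
    if empty:
        print(f"warning: {len(empty)} empty hypothesis lines, first at line {empty[0]}")
    return len(hyp)


# Usage (assuming the files from the command above exist):
# check_parallel("hypothesis.en", "reference_translation.en")
```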

**Your answer:** ...

**Subtask 2 (0.75 points).** Now let's see how well your model does on data that does not come from the corpora on which the model was trained. Copy the test set from `/gpfs/hpc/projects/nlpgroup/mt2021/data/test-set/`. There are two files: `test-src.et` contains the Estonian side of the test set, and `test-ref.en` contains the English reference translations. Preprocess the Estonian side, translate it with the last checkpoint of your model, and postprocess the result. Use `sacreBLEU` to calculate BLEU against `test-ref.en`. Report the output of `sacreBLEU`.

**Your answer:** ...

### Task 5: Manual analysis (3 points)

Even though automatic metrics are widely used to evaluate machine translation quality, they cannot show what kinds of errors the models make. A number provided by an automatic metric is not enough to make informed decisions about how to improve your model. It is always important to have an idea of what exactly your model is doing right and wrong.

That is why, in this task, you will manually evaluate your model's performance on the external test set (`/gpfs/hpc/projects/nlpgroup/mt2021/data/test-set/`) that you translated in Task 4.

**Subtask 1 (2 points).** Analyse 30 sentences from the translated test set. For each of the 30, report:

1. Sentence ID (line number)
2. Source sentence (in Estonian)
3. Reference translation
4. Machine translation (by your model)
5. Description of errors in the translation. You may use any system that seems reasonable to you. For instance, you could classify errors as "word order errors", "source words left untranslated", etc. A description of a sentence can be something like "the translation tries to convey the meaning, but contains grammatical errors" or "the hypothesis is fluent, but does not convey the meaning correctly".
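
One possible way to pick the 30 sentences is to sample line numbers at random with a fixed seed, so that the selection is not biased towards the beginning of the file and can be reproduced. The following is a minimal sketch; the function name, the dictionary keys, and the default seed are assumptions made for this example, and the three lists are expected to hold the lines of the source, reference, and translated files in the same order.

```python
import random


def sample_for_analysis(sources, references, hypotheses, k=30, seed=0):
    """Pick k aligned (source, reference, MT) triples; IDs are 1-based line numbers."""
    if not (len(sources) == len(references) == len(hypotheses)):
        raise ValueError("the three lists must have the same length")
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    indices = sorted(rng.sample(range(len(sources)), min(k, len(sources))))
    return [
        {
            "id": i + 1,
            "source": sources[i],
            "reference": references[i],
            "mt": hypotheses[i],
        }
        for i in indices
    ]


# Toy usage with placeholder lines instead of real test-set files.
src = [f"src {n}" for n in range(1, 101)]
ref = [f"ref {n}" for n in range(1, 101)]
hyp = [f"mt {n}" for n in range(1, 101)]
picked = sample_for_analysis(src, ref, hyp)
```

Each returned entry already carries the first four items of the list above, leaving only the error description to be written by hand.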

**Hint.** A convenient tool for comparing translations to references: [https://www.letsmt.eu/Bleu.aspx](https://www.letsmt.eu/Bleu.aspx)

**Your answer:** ...

**Subtask 2 (1 point).** Can you see any patterns and typical errors? Summarize your analysis.

**Your answer:** ...