# Teacher's Assignment - Extra Credit #2

***Author:*** *Ofir Paz* $\qquad$ ***Version:*** *22.07.2024* $\qquad$ ***Course:*** *22961 - Deep Learning* \
***Extra Assignment Course:*** *20999 - Extra Assignment 4*

Welcome to the second question of extra assignment #2, part of the course *Deep Learning*. \
In this question we will train an autoencoder to denoise a language dataset, and afterwards use transfer learning on the trained model for a classification task.

## Imports

In [None]:
import torch  # pytorch.
import torch.nn as nn  # neural network module.
import torch.nn.functional as F  # neural network functional module.
from torch.utils.data import DataLoader, Dataset  # data handling.
import torchtext; torchtext.disable_torchtext_deprecation_warning()
from torchtext.vocab import build_vocab_from_iterator  # vocabulary builder.
import matplotlib.pyplot as plt  # plotting module.
import datasets as ds  # public dataset module.
from base_model import BaseModel  # base model class.

# Type hinting.
from torch import Tensor
from torchtext.vocab import Vocab
from typing import Tuple

## Adding Noise

To add noise to a language dataset, I thought of four options:
1. Duplicate random words that appear in the sentence. For example:
$$\text{"The princess is beautiful"} \rightarrow \text{"The princess is is beautiful"}$$
2. Add a random word somewhere in the sentence. For example:
$$\text{"The princess is beautiful"} \rightarrow \text{"The house princess is beautiful"}$$
3. Change the order of words in the sentence. For example:
$$\text{"The princess is beautiful"} \rightarrow \text{"The beautiful is princess"}$$
4. Change a random token in the sentence to the unknown token. For example:
$$\text{"The princess is beautiful"} \rightarrow \text{"The <unk> is beautiful"}$$

Option 4 is the simplest, and it seems the most likely to work well for language processing, so it is the only one I will implement.

In [None]:
def add_noise(sentence_tokens: list[int], vocab: Vocab, noise_term: float = 0.1) -> list[int]:
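    """
    Add noise to the sentence tokens by randomly replacing tokens with `<unk>`.

    Args:
        sentence_tokens (list[int]): Sentence tokens (vocabulary indices).
        vocab (Vocab): Vocabulary, expected to contain the `<unk>` token.
        noise_term (float): Probability of replacing each token with `<unk>`.

    Returns:
        list[int]: Noisy sentence tokens.
    """
    unk_idx = vocab["<unk>"]  # look up the unknown-token index once.
    return [token if torch.rand(1).item() > noise_term else unk_idx for token in sentence_tokens]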
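
Options 1 to 3 are not implemented in this notebook. For comparison, here is a minimal sketch of how they could look on a list of token indices. The function names, the `rng` argument and the `vocab_size` parameter are assumptions made for illustration, and the standard `random` module is used instead of `torch` to keep the sketch dependency-free.

```python
import random
from typing import List


def duplicate_token(tokens: List[int], rng: random.Random) -> List[int]:
    """Option 1 (sketch): repeat one randomly chosen token right after itself."""
    if not tokens:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[: i + 1] + [tokens[i]] + tokens[i + 1 :]


def insert_random_token(tokens: List[int], vocab_size: int, rng: random.Random) -> List[int]:
    """Option 2 (sketch): insert a random vocabulary index at a random position."""
    i = rng.randrange(len(tokens) + 1)
    return tokens[:i] + [rng.randrange(vocab_size)] + tokens[i:]


def swap_tokens(tokens: List[int], rng: random.Random) -> List[int]:
    """Option 3 (sketch): swap the tokens at two distinct random positions."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    noisy = list(tokens)  # copy so the clean sentence is left untouched.
    noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy
```

Note that options 1 and 2 change the sentence length, so the noisy input and the clean target no longer align token by token. Option 4 keeps the length fixed, which is part of what makes it the simplest to train against.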

## Loading & Pre-Processing

In [None]:
# class AutoDenoiser(nn.Module):
#     def __init__(self) -> None:
#         super().__init__()
#         self.encoder = nn.TransformerEncoderLayer()

In [None]:
# Load the SST-2 dataset.
dataset: ds.DatasetDict = ds.load_dataset("glue", "sst2")  # type: ignore

train_dataset = dataset["train"]
validation_dataset = dataset["validation"]
test_dataset = dataset["test"]  # note: GLUE hides the test labels (all are -1).

In [None]:
# Create the SST-2 dataset.
class SST2Dataset(Dataset):
    def __init__(self, dataset: ds.Dataset, vocab: Vocab) -> None:
        # Tokenize on whitespace and map each token to its vocabulary index.
        self.sentences = [torch.tensor(vocab(sentence.split())) for sentence in dataset["sentence"]]
        self.labels = torch.tensor(dataset["label"], dtype=torch.long)

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, idx: int) -> Tuple[Tensor, Tensor]:
        # Returns (variable-length tensor of token indices, label).
        return self.sentences[idx], self.labels[idx]
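
Because each sentence tensor has a different length, the default `DataLoader` collation cannot stack them into a batch. One possible way around this is a padding collate function, sketched below. The name `pad_collate` and the padding index `pad_idx=0` are assumptions and not part of the notebook; in practice the vocabulary would need a dedicated padding token at that index.

```python
from typing import List, Tuple

import torch
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch: List[Tuple[Tensor, Tensor]], pad_idx: int = 0) -> Tuple[Tensor, Tensor, Tensor]:
    """Pad a batch of (sentence, label) pairs to the longest sentence in the batch."""
    sentences, labels = zip(*batch)
    # Keep the original lengths so padded positions can be masked later.
    lengths = torch.tensor([len(s) for s in sentences], dtype=torch.long)
    # Shape: (batch_size, max_len); shorter sentences are right-padded with pad_idx.
    padded = pad_sequence(list(sentences), batch_first=True, padding_value=pad_idx)
    return padded, torch.stack(labels), lengths


# Toy batch with two sentences of different lengths.
toy_batch = [
    (torch.tensor([5, 8, 3]), torch.tensor(1)),
    (torch.tensor([7]), torch.tensor(0)),
]
padded, labels, lengths = pad_collate(toy_batch)
```

It would be passed to the loader as `DataLoader(train_set, batch_size=32, collate_fn=pad_collate)`, where `train_set` is a hypothetical `SST2Dataset` instance.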