$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Mini-Project
<a id=part3></a>

In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.

You must **choose one** of the project options specified below.

### Guidelines

- You should implement the code which displays your results in this notebook, and add any additional code files for your implementation in the `project/` directory. You can import these files here, as we do for the homeworks.
- Running this notebook should not perform any training - load your results from some output files and display them here. The notebook must be runnable from start to end without errors.
- You must include a detailed write-up (in the notebook) of what you implemented and how. 
- Explain the structure of your code and how to run it to reproduce your results.
- Explicitly state any external code you used, including built-in pytorch models and code from the course tutorials/homework.
- Analyze your numerical results, explaining **why** you got these results (not just specifying the results).
- Where relevant, place all results in a table or display them using a graph.
- Before submitting, make sure all files which are required to run this notebook are included in the generated submission zip.
- Try to keep the submission file size under 10MB. Do not include model checkpoint files, dataset files, or any other non-essentials files. Instead include your results as images/text files/pickles/etc, and load them for display in this notebook. 

## Sentiment Analysis with Self-Attention and Word Embeddings

Based on Tutorials 6 and 7, we'll implement and train an improved sentiment analysis model.
We'll use self-attention instead of RNNs and incorporate pre-trained word embeddings.

In tutorial 6 we saw that we can train word embeddings together with the model.
Although this produces embeddings which are customized to the specific task at hand,
it also greatly increases training time.
A common technique is to use pre-trained word embeddings.
This is essentially a large mapping from words (e.g. in english) to some
high-dimensional vector, such that semantically similar words have an embedding that is
"close" by some metric (e.g. cosine distance).
Use the [GloVe](https://nlp.stanford.edu/projects/glove/) 6B embeddings for this purpose.
You can load these vectors into the weights of an `nn.Embedding` layer.

In tutorial 7 we learned how attention can be used to learn to predict a relative importance
for each element in a sequence, compared to the other elements.
Here, we'll replace the RNN with self-attention only approach similar to Transformer models, roughly based on [this paper](https://www.aclweb.org/anthology/W18-6219.pdf).
After embedding each word in the sentence using the pre-trained word-embedding a positional-encoding vector is added to provide each word in the sentence a unique value based on it's location.
One or more self-attention layers are then applied to the results, to obtain an importance weighting for each word.
Then we classify the sentence based on the average these weighted encodings.


Now, using these approaches, you need to:

- Implement a **baseline** model: Use pre-trained embeddings with an RNN-based model.
You can use LSTM/GRU or bi-directional versions of these, in a way very similar to what we implemented in the tutorial.
-  Implement an **improved** model: Based on the self-attention approach, implement an attention-based sentiment analysis model that has 1-2 self-attention layers instead of an RNN. You should use the same pre-trained word embeddings for this model.
- You can use pytorch's built-in RNNs, attention layers, etc.
- For positional encoding you can use the sinosoidal approach described in the paper (first proposed [here](https://arxiv.org/pdf/1706.03762.pdf)). You can use existing online implementations (even though it's straightforward to implement). 
- You can use the SST database as shown in the tutorial.

Your results should include:
- Everything written in the **Guidelines** above.
- A comparative analysis: compare the baseline to the improved model. Compare in terms of overall classification accuracy and show a multiclass confusion matrix.
- Visualize of the attention maps for a few movie reviews from each class, and explain the results.

In [1]:
# Setup
%matplotlib inline
import os
import sys
import time
import torch
import matplotlib.pyplot as plt
import warnings
import torch.nn as nn
warnings.simplefilter("ignore")
plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

**Dataset**<br/>
We will use torchtext package and the SST dataset as was used in tutorial 6 (Sequence Models). 

In [2]:
import torchtext.data
from torchtext.vocab import Vectors, GloVe

review_parser = torchtext.data.Field(
    sequential=True, use_vocab=True, lower=True,
    init_token='<sos>', eos_token='<eos>', dtype=torch.long,
    tokenize='spacy', tokenizer_language='en_core_web_sm'
)

# This Field object converts the text labels into numeric values (0,1,2)
label_parser = torchtext.data.Field(
    is_target=True, sequential=False, unk_token=None, use_vocab=True
)
import torchtext.datasets

ds_train, ds_valid, ds_test = torchtext.datasets.SST.splits(
    review_parser, label_parser, root=data_dir,fine_grained=False
)

n_train = len(ds_train)
print(f'Number of training samples: {n_train}')
print(f'Number of test     samples: {len(ds_test)}')


Number of training samples: 8544
Number of test     samples: 2210


We will build a vocabulary:

In [3]:
review_parser.build_vocab(ds_train,vectors=GloVe(name='6B', dim=300))
label_parser.build_vocab(ds_train)
word_embeddings = review_parser.vocab.vectors
word_embeddings = word_embeddings.to(device=device)

**Dataloader**: we will use bucketIterator as dataloader to deal with reviews of different length

In [16]:
BATCH_SIZE = 32 #hyper parameter, could be changed

# BucketIterator creates batches with samples of similar length
# to minimize the number of <pad> tokens in the batch.
dl_train, dl_valid, dl_test = torchtext.data.BucketIterator.splits(
    (ds_train, ds_valid, ds_test), batch_size=BATCH_SIZE,
    shuffle=True, device=device)


#train_iter, valid_iter, test_iter = torchtext.data.BucketIterator.splits((ds_train, ds_valid, ds_test), batch_size=BATCH_SIZE, sort_key=lambda x: len(x.text), repeat=False, shuffle=True)


**Model:** we load the model and create an instance

In [35]:
from project.Analyzer import SentimentAnalyzer
from project.Attention import AttentionAnalyzer

INPUT_DIM = len(review_parser.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 128
OUTPUT_DIM = 3 #5

model = SentimentAnalyzer(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, word_embeddings, layers=2)
attnModel = AttentionAnalyzer(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, word_embeddings, layers=2)
#model = RNN(32, 2, 256, len(review_parser.vocab), 300, word_embeddings)
attnModel

AttentionAnalyzer(
  (embd): Embedding(15482, 300)
  (rnn): GRU(300, 128, num_layers=2, bidirectional=True)
  (W_s1): Linear(in_features=256, out_features=350, bias=True)
  (W_s2): Linear(in_features=350, out_features=30, bias=True)
  (sentiment): Linear(in_features=7680, out_features=3, bias=True)
  (log_softmax): LogSoftmax(dim=1)
)

In [6]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The RNN model has {count_parameters(model):,} trainable weights.')
print(f'The Attention model has {count_parameters(attnModel):,} trainable weights.')

The RNN model has 627,459 trainable weights.
The Attention model has 750,211 trainable weights.


Defining the model, hyperparameters, device and checkpoints file

In [36]:

checkpoint = 'checkpoints/RNN'
checkpoint_attn = 'checkpoints/ATTN'
learning_rate = 1e-4
batch_size = BATCH_SIZE
output_size = OUTPUT_DIM
hidden_size = HIDDEN_DIM
embedding_length = EMBEDDING_DIM
EPOCHS = 20

model = model.to(device)
checkpoint_file = os.path.join(os.getcwd(), checkpoint)

In [None]:
def train_and_eval(model,train_iter, valid_iter, optimizer, loss_fn =nn.NLLLoss() , epochs=20,checkpoint_file_final='final.pt'):
    #TRAIN!!!!
    
    for epoch_idx in range(epochs):
        model.train()
        total_loss, num_correct = 0, 0
        total_samples = 0
        start_time = time.time()

        for train_batch in train_iter:
            X, y = train_batch.text.cuda(), train_batch.label.cuda()

            # Forward pass
            y_pred_log_proba = model(X)

            # Backward pass
            optimizer.zero_grad()
            loss = loss_fn(y_pred_log_proba, y)
            loss.backward()

            # Weight updates
            optimizer.step()

            # Calculate accuracy
            total_loss += loss.item()
            y_pred = torch.argmax(y_pred_log_proba, dim=1)
            num_correct += torch.sum(y_pred == y).float().item()
            total_samples+= len(train_batch) 

                
        print(f"Train: Epoch #{epoch_idx}, loss={total_loss /(len(train_iter)):.3f}, accuracy={num_correct /(total_samples):.3f}, elapsed={time.time()-start_time:.1f} sec")
        
        total_loss, num_correct = 0, 0
        total_samples = 0
        model.eval()
        with torch.no_grad():
            for val_batch in valid_iter:
                X, y = val_batch.text.cuda(), val_batch.label.cuda()
                y = torch.autograd.Variable(y).long()

                y_pred_log_proba = model(X)
                loss = loss_fn(y_pred_log_proba, y)
                total_loss += loss.item()
                y_pred = torch.argmax(y_pred_log_proba, dim=1)
                num_correct += torch.sum(y_pred == y).float().item()
                total_samples+= len(train_batch)

        print(f"Val: Epoch #{epoch_idx}, loss={total_loss /(len(train_iter)):.3f}, accuracy={num_correct /(total_samples):.3f}, elapsed={time.time()-start_time:.1f} sec")
    #return total_epoch_loss/len(val_iter), total_epoch_acc/len(val_iter)
        
        
        
        
        
        
def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)

        

        
loss_fn = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
train_and_eval(attnModel.cuda(),train_iter, valid_iter, optimizer, loss_fn =loss_fn , epochs=20,checkpoint_file_final='final.pt')

Train: Epoch #0, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #0, loss=0.143, accuracy=0.396, elapsed=4.1 sec
Train: Epoch #1, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #1, loss=0.143, accuracy=0.396, elapsed=4.0 sec
Train: Epoch #2, loss=1.092, accuracy=0.423, elapsed=4.0 sec
Val: Epoch #2, loss=0.143, accuracy=0.396, elapsed=4.1 sec
Train: Epoch #3, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #3, loss=0.143, accuracy=0.396, elapsed=4.1 sec
Train: Epoch #4, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #4, loss=0.143, accuracy=0.396, elapsed=4.1 sec
Train: Epoch #5, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #5, loss=0.143, accuracy=0.396, elapsed=4.1 sec
Train: Epoch #6, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #6, loss=0.143, accuracy=0.396, elapsed=4.0 sec
Train: Epoch #7, loss=1.092, accuracy=0.423, elapsed=3.9 sec
Val: Epoch #7, loss=0.143, accuracy=0.396, elapsed=4.0 sec
Train: Epoch #8, loss=1.092, accuracy=0.

In [8]:
from project.Trainer import train,evaluate_model

In [33]:
#not to run
#####evaluate_model(model, nn.NLLLoss(), dl_valid, max_epochs=EPOCHS,batch_size=BATCH_SIZE)

*Training the baseline model*

In [32]:
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = nn.NLLLoss()
# Train, unless final checkpoint is found
checkpoint_file_final = f'{checkpoint}_final.pt'
if os.path.isfile(checkpoint_file_final):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} (not training)')
    saved_state = torch.load(checkpoint_file_final, map_location=device)
    model.load_state_dict(saved_state['model_state'])
else:
    try:
        #RNN_train_loss_arr, RNN_train_acc_arr, RNN_val_loss_arr, RNN_val_acc_arr = 
        train(model, optimizer, loss_fn, dl_train, max_epochs=EPOCHS,batch_size=BATCH_SIZE)

    except KeyboardInterrupt as e:
        print('\n *** Training interrupted by user')


Epoch #0, loss=1.055, accuracy=0.412, elapsed=2.8 sec
Epoch #1, loss=1.048, accuracy=0.420, elapsed=2.8 sec
Epoch #2, loss=1.049, accuracy=0.415, elapsed=2.8 sec
Epoch #3, loss=1.045, accuracy=0.422, elapsed=2.8 sec
Epoch #4, loss=1.045, accuracy=0.430, elapsed=2.8 sec
Epoch #5, loss=0.988, accuracy=0.511, elapsed=2.8 sec
Epoch #6, loss=0.865, accuracy=0.628, elapsed=2.8 sec
Epoch #7, loss=0.821, accuracy=0.647, elapsed=2.8 sec
Epoch #8, loss=0.802, accuracy=0.660, elapsed=2.8 sec
Epoch #9, loss=0.789, accuracy=0.666, elapsed=2.8 sec
Epoch #10, loss=0.788, accuracy=0.667, elapsed=2.8 sec
Epoch #11, loss=0.771, accuracy=0.673, elapsed=2.8 sec
Epoch #12, loss=0.754, accuracy=0.678, elapsed=2.8 sec
Epoch #13, loss=0.746, accuracy=0.680, elapsed=2.8 sec
Epoch #14, loss=0.741, accuracy=0.683, elapsed=2.8 sec
Epoch #15, loss=0.737, accuracy=0.686, elapsed=2.8 sec
Epoch #16, loss=0.718, accuracy=0.693, elapsed=2.8 sec
Epoch #17, loss=0.710, accuracy=0.698, elapsed=2.8 sec
Epoch #18, loss=0.70

*Training the model with attention*

In [33]:
optimizer = optim.Adam(attnModel.parameters(), lr=learning_rate)
loss_fn = nn.NLLLoss()
# Train, unless final checkpoint is found
checkpoint_file_final = f'{checkpoint_attn}_final.pt'
if os.path.isfile(checkpoint_file_final):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} (not training)')
    saved_state = torch.load(checkpoint_file_final, map_location=device)
    attnModel.load_state_dict(saved_state['model_state'])
else:
    try:
        #RNN_train_loss_arr, RNN_train_acc_arr, RNN_val_loss_arr, RNN_val_acc_arr = 
        train(attnModel, optimizer, loss_fn, dl_train, max_epochs=EPOCHS,batch_size=BATCH_SIZE)

    except KeyboardInterrupt as e:
        print('\n *** Training interrupted by user')

Epoch #0, loss=1.516, accuracy=0.324, elapsed=3.1 sec
Epoch #1, loss=1.342, accuracy=0.415, elapsed=3.1 sec
Epoch #2, loss=1.275, accuracy=0.443, elapsed=3.0 sec
Epoch #3, loss=1.238, accuracy=0.464, elapsed=3.0 sec
Epoch #4, loss=1.222, accuracy=0.468, elapsed=3.0 sec
Epoch #5, loss=1.192, accuracy=0.480, elapsed=3.0 sec
Epoch #6, loss=1.177, accuracy=0.487, elapsed=3.0 sec
Epoch #7, loss=1.165, accuracy=0.503, elapsed=3.0 sec
Epoch #8, loss=1.158, accuracy=0.501, elapsed=3.1 sec
Epoch #9, loss=1.126, accuracy=0.521, elapsed=3.1 sec
Epoch #10, loss=1.111, accuracy=0.523, elapsed=3.1 sec
Epoch #11, loss=1.082, accuracy=0.541, elapsed=3.0 sec
Epoch #12, loss=1.065, accuracy=0.547, elapsed=3.1 sec
Epoch #13, loss=1.057, accuracy=0.551, elapsed=3.1 sec
Epoch #14, loss=1.040, accuracy=0.558, elapsed=3.1 sec
Epoch #15, loss=1.017, accuracy=0.569, elapsed=3.1 sec
Epoch #16, loss=0.993, accuracy=0.584, elapsed=3.0 sec
Epoch #17, loss=0.970, accuracy=0.598, elapsed=3.0 sec
Epoch #18, loss=0.93

## Spectrally-Normalized Wasserstein GANs

In HW3 we implemented a simple GANs from scratch, using an approach very similar to the original GAN paper. However, the results left much to be desired and we discovered first-hand how hard it is to train GANs due to their inherent instability.

One of the prevailing approaches for improving training stability for GANs is to use a technique called [Spectral Normalization](https://arxiv.org/pdf/1802.05957.pdf) to normalize the largest singular value of a weight matrix so that it equals 1.
This approach is generally applied to the discriminator's weights in order to stabilize training. The resulting model is sometimes referred to as a SN-GAN.
See Appendix A in the linked paper for the exact algorithm. You can also use pytorch's `spectral_norm`.

Another very common improvement to the vanilla GAN is known a [Wasserstein GAN](https://arxiv.org/pdf/1701.07875.pdf) (WGAN). It uses a simple modification to the loss function, with strong theoretical justifications based on the Wasserstein (earth-mover's) distance.
See also [here](https://developers.google.com/machine-learning/gan/loss) for a brief explanation of this loss function.

One problem with generative models for images is that it's difficult to objectively assess the quality of the resulting images.
To also obtain a quantitative score for the images generated by each model,
we'll use the [Inception Score](https://arxiv.org/pdf/1606.03498.pdf).
This uses a pre-trained Inception CNN model on the generated images and computes a score based on the predicted probability for each class.
Although not a perfect proxy for subjective quality, it's commonly used a way to compare generative models.
You can use an implementation of this score that you find online, e.g. [this one](https://github.com/sbarratt/inception-score-pytorch) or implement it yourself.

Based on the linked papers, add Spectral Normalization and the Wassertein loss to your GAN from HW3.
Compare between:
- The baseline model (vanilla GAN)
- SN-GAN (vanilla + Spectral Normalization)
- WGAN (using Wasserstein Loss)
- Optional: SN+WGAN, i.e. a combined model using both modifications.

As a dataset, you can use [LFW](http://vis-www.cs.umass.edu/lfw/) as in HW3 or [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), or even choose a custom dataset (note that there's a dataloder for CelebA in `torchvision`). 

Your results should include:
- Everything written in the **Guidelines** above.
- A comparative analysis between the baseline and the other models. Compare:
  - Subjective quality (show multiple generated images from each model)
  - Inception score (can use a subset of the data).
- You should show substantially improved subjective visual results with these techniques.

## Implementation

**TODO**: This is where you should write your explanations and implement the code to display the results.
See guidelines about what to include in this section.