### Compare the ShallowBiLSTM class with the regular unidirectional LSTM class provided (5 pts)

Note: the training for these two models may take a while.

Load the SNLI dataset using PyTorch’s `DataLoader` class. Use `batch_size=32` and
`shuffle=True`.
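
As a rough illustration of those two settings, here is a minimal sketch on a toy dataset. The tensors and the names `toy_dataset` and `toy_loader` are made up for this example; the notebook itself gets its loaders from `load_snli_dataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for tokenised SNLI pairs: 100 examples, 3 labels (hypothetical data).
premises = torch.randint(1, 50, (100, 12))
hypotheses = torch.randint(1, 50, (100, 8))
labels = torch.randint(0, 3, (100,))

toy_dataset = TensorDataset(premises, hypotheses, labels)
# shuffle=True reshuffles the examples at the start of every epoch.
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

first_premises, first_hypotheses, first_labels = next(iter(toy_loader))
```

With 100 examples and `batch_size=32`, the loader yields four batches per epoch, the last one smaller than the rest.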

In [1]:
from code import *

batch_size = 32
num_layers = 1
num_classes = 3
hidden_dim = 512
num_epochs = 100

In [2]:
(emb_dict, emb_dim), (dataloader_train, dataloader_valid, dataloader_test) = load_snli_dataset(batch_size)

word_counts = build_word_counts(dataloader_train)
index_map = build_index_map(word_counts)

## Initialise the model with GloVe embeddings

Use `create_embedding_matrix` to create the embedding matrix initialized with GloVe embeddings, then use it to initialize your `embedding_layer` via `from_pretrained` (consult the PyTorch `nn.Embedding` documentation if you need to).

Overwrite the embedding layer of your model (outside the class, in the training code) after initializing it. Pass `freeze=False` and `padding_idx=0`.
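
A minimal sketch of that step, assuming the embedding matrix is a float tensor of shape `[vocab_size, emb_dim]` whose row 0 is the padding vector. The random matrix below is only a stand-in for whatever `create_embedding_matrix` returns, and the commented-out assignment assumes the model stores its layer as `embedding_layer`.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10, 4  # toy sizes, not the SNLI values

# Stand-in for the output of create_embedding_matrix (hypothetical values).
embedding_matrix = torch.randn(vocab_size, emb_dim)
embedding_matrix[0] = 0.0  # row 0 is reserved for padding

# freeze=False keeps the vectors trainable; padding_idx=0 excludes row 0 from updates.
embedding_layer = nn.Embedding.from_pretrained(
    embedding_matrix, freeze=False, padding_idx=0
)

# Overwrite the layer outside the class, after the model has been built:
# model.embedding_layer = embedding_layer

toy_batch = torch.tensor([[0, 3]])
toy_vectors = embedding_layer(toy_batch)
```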

We do not want the GloVe embeddings to be frozen, since keeping them trainable lets them adapt to the task and usually helps performance. Because the earlier code assumed `embedding_size == hidden_dim`, you need to modify your classes in the following way:

- Decouple the `hidden_dim` from the `embedding_size` in your model, by passing another parameter (e.g. `embedding_size`) to your `model.__init__` method and setting it separately, to initialize your Embedding layer within `__init__`. This will require you to tweak the input features to the `LSTM`(s) as well.
- Make sure any tensors you create are moved to the GPU with `.to(device)`, where `device` is `torch.device("cuda" if torch.cuda.is_available() else "cpu")`.
- Use the `evaluate` method to evaluate performance on the validation and test sets.
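
The decoupling described in the list above might look roughly like the sketch below. `ToyUniLSTM` is a hypothetical class, not the course's `UniLSTM`; in particular, sharing one LSTM between premise and hypothesis and concatenating their final cell states is an assumption about the architecture.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class ToyUniLSTM(nn.Module):
    """Hypothetical sketch with embedding_size decoupled from hidden_dim."""

    def __init__(self, vocab_size, embedding_size, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
        # The LSTM input size is now the embedding size, not hidden_dim.
        self.lstm = nn.LSTM(embedding_size, hidden_dim, num_layers, batch_first=True)
        # Assumed design: premise and hypothesis cell states are concatenated.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, premise, hypothesis):
        _, (_, c_premise) = self.lstm(self.embedding_layer(premise))
        _, (_, c_hypothesis) = self.lstm(self.embedding_layer(hypothesis))
        # Index -1 selects the final layer's cell state.
        features = torch.cat([c_premise[-1], c_hypothesis[-1]], dim=1)
        return self.classifier(features)


model = ToyUniLSTM(vocab_size=50, embedding_size=8, hidden_dim=16,
                   num_layers=1, num_classes=3).to(device)
premise = torch.randint(1, 50, (4, 7)).to(device)
hypothesis = torch.randint(1, 50, (4, 5)).to(device)
logits = model(premise, hypothesis)
```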

To make a fair comparison between the two models, halve the `hidden_dim` parameter on the `ShallowBiLSTM` (e.g. 100-dim for `UniLSTM`, 50-dim for `ShallowBiLSTM`; see the note about GloVe embeddings above).
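
To check how close the two models actually are in size, the LSTM parameter count can be worked out by hand. The helper below is a sketch, with `lstm_param_count` a name chosen for this example. It follows the layout `torch.nn.LSTM` uses: four gates, each with input weights, recurrent weights and two bias vectors. It assumes an input size of 100 and ignores the embedding and Linear layers.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Count the weights and biases of an LSTM laid out like torch.nn.LSTM (no projection)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Deeper layers read the concatenated outputs of the layer below.
        layer_input = input_size if layer == 0 else directions * hidden_size
        # Four gates, each with input weights, recurrent weights and two biases.
        per_direction = 4 * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
    return total


uni = lstm_param_count(input_size=100, hidden_size=100)
bi_half = lstm_param_count(input_size=100, hidden_size=50, bidirectional=True)
print(f"UniLSTM: {uni}, halved bidirectional: {bi_half}")
```

With these sizes the counts are 80,800 and 60,800. Halving `hidden_dim` keeps the concatenated feature width identical and the parameter counts in the same range, but not exactly equal.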

Modify your `UniLSTM` (or alternatively your `ShallowBiLSTM`) to create a true `BiLSTM` to compare against: pass `bidirectional=True` and `num_layers=2` to a single `LSTM` initialization, and adjust the incoming feature size of the Linear layers accordingly. Compare this model against the equivalent `UniLSTM` and `ShallowBiLSTM`.
- You will need to modify the concatenation as well, to concatenate 4 different cell states instead of just 2. The bidirectional `LSTM` with `num_layers=2` will output a tensor of shape `[4, BATCH_SIZE, hidden_size]` for the final cell state.
- In this case, `final_cell_state[-2,:,:]` and `final_cell_state[-1,:,:]` are the 2 outputs (the forward and backward LSTMs of the final layer, respectively) that we care about, similar to `ShallowBiLSTM`. The difference is that now we have a deeper network, with interconnections between layers; PyTorch takes care of this for us (true `BiLSTM` vs. shallow `BiLSTM`).
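
The shapes and indexing described above can be checked on a toy input. This is a sketch for a single sentence only; the sizes are arbitrary and `sentence_vector` is a name chosen for this example.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_size, hidden_size = 3, 6, 8, 5
lstm = nn.LSTM(embedding_size, hidden_size, num_layers=2,
               bidirectional=True, batch_first=True)

inputs = torch.randn(batch_size, seq_len, embedding_size)
outputs, (final_hidden_state, final_cell_state) = lstm(inputs)

# The first dimension is ordered layer by layer, forward direction first,
# so the last two rows belong to the final layer.
last_forward = final_cell_state[-2, :, :]
last_backward = final_cell_state[-1, :, :]
sentence_vector = torch.cat([last_forward, last_backward], dim=1)

print(final_cell_state.shape)  # first dimension is num_layers * 2 directions
print(sentence_vector.shape)   # last dimension is 2 * hidden_size
```

Because each sentence now contributes `2 * hidden_size` features, the Linear layer that consumes the concatenated premise and hypothesis vectors needs its input size adjusted to match.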

Start with `num_layers=1`. Feel free to experiment with increasing it, but make sure that you are always making a fair comparison (approximately the same number of parameters overall).

When you increase it beyond 1, make sure you take the cell state from the final layer.

In [None]:
progress = {}

progress["unilstm"] = run_snli(
    UniLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=False
)

In [None]:
progress["unilstm_glove"] = run_snli(
    UniLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=True
)

In [None]:
progress["shallow_bilstm"] = run_snli(
    ShallowBiLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=False
)

In [None]:
progress["shallow_bilstm_glove"] = run_snli(
    ShallowBiLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=True
)

In [None]:
progress["bilstm"] = run_snli(
    BiLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=False
)

In [None]:
progress["bilstm_glove"] = run_snli(
    BiLSTM,
    num_epochs=num_epochs,
    vocab_size=len(index_map),
    hidden_dim=hidden_dim,
    embedding_size=emb_dim,
    num_layers=num_layers,
    num_classes=num_classes,
    use_glove=True
)

In [None]:
import json
with open('data/snli_progress.json', 'w') as f:
    json.dump(progress, f)

Compare 2 versions of each model: 1 with GloVe embeddings and 1 without.
- What do you observe in these two cases?
- What is the difference in performance between the two models in each of the two cases?
- Does this correspond to what you expected? Provide a potential explanation for what you observe.

- Report the final accuracy on the validation and test sets for each model (taken from the same epoch), with and without GloVe embeddings.
- Comment on the efficacy of the GloVe embeddings.
- Plot accuracy over time on the validation and test sets for each model, with and without GloVe embeddings.
- Train each model until validation accuracy stops going up.
- Do you observe any other differences between the models, such as training time?
- You should have 5 different configurations being compared in the end: UniLSTM with GloVe, UniLSTM without GloVe, ShallowBiLSTM with GloVe, ShallowBiLSTM without GloVe, and the true BiLSTM (modified UniLSTM).
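
Two of the points above, reporting validation and test accuracy from the same epoch and stopping once validation accuracy plateaus, can be sketched in plain Python. The history layout (`valid_acc` and `test_acc` lists with one entry per epoch) is an assumption and may not match what `run_snli` returns. The helper names are hypothetical, and the numbers are invented purely for illustration.

```python
def best_epoch_report(history):
    """Return (epoch, valid_acc, test_acc) at the epoch with the highest validation accuracy."""
    valid, test = history["valid_acc"], history["test_acc"]
    best = max(range(len(valid)), key=lambda epoch: valid[epoch])
    return best, valid[best], test[best]


def should_stop(valid_acc, patience=3):
    """True once validation accuracy has not improved for `patience` epochs."""
    if len(valid_acc) <= patience:
        return False
    best_before = max(valid_acc[:-patience])
    return max(valid_acc[-patience:]) <= best_before


# Made-up numbers for illustration only; not results from this assignment.
toy_progress = {
    "toy_model": {
        "valid_acc": [0.50, 0.61, 0.64, 0.63, 0.62, 0.62],
        "test_acc": [0.49, 0.60, 0.63, 0.63, 0.61, 0.62],
    },
}

for name, history in toy_progress.items():
    epoch, valid_acc, test_acc = best_epoch_report(history)
    print(f"{name}: epoch {epoch}, valid {valid_acc:.2f}, test {test_acc:.2f}")
```

Selecting the epoch by validation accuracy and reading the test accuracy at that same epoch avoids choosing a checkpoint based on test performance.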