# Project 3: Emotion detection with Neural Networks
## CS4740/5740 Fall 2020

Names: Ikra Monjur, Komukill Loganathan

Netids: im324, kl866

### Project Submission Due: November 13th
Please submit **pdf file** of this notebook on **Gradescope**, and **ipynb** on **CMS**. For instructions on generating pdf and ipynb files, please refer to project 1 instructions.



## Introduction
In this project we will consider **neural networks**: first a Feedforward Neural Network (FFNN) and second a Recurrent Neural Network (RNN), for performing a 5-class emotion detection task.

The project is divided into parts. In **Part 1**, you will be given an implementation for a FFNN and be asked to debug it in a specific way. In **Part 2**, you will then implement an RNN model for performing the same task. In **Part 3**, you will analyze these two models in two types of comparative studies and in **Part 4** you will answer questions describing what you have learned through this project. You also will be required to submit a description of libraries used, how your group divided up the work, and your feedback regarding the assignment (**Part 5**).

## Advice 🚀
As always, the report is important! The report is where you get to show
that you understand not only what you are doing but also why and how you are doing it. So be clear, organized and concise; avoid vagueness and excess verbiage. Spend time doing error analysis for the models. This is how you understand the advantages and drawbacks of the systems you build. The reports should read more like the papers that we have been writing critiques for.

All throughout the report you may be asked to place images, plots, etc. Feel free to write code that will generate the plots for you and use those or generate them some other way and insert into the colab. To add images in your colab, these are a few possible ways to do it!

1. Copy and paste the image in markdown! Yes this really does work

2. Upload to google drive, get a shareable link. It will be something like:

```
https://drive.google.com/file/d/1xDrydbSbijvK2JBftUz-5ovagN2B_RWH/view?usp=sharing
```
We want just the id which is `1xDrydbSbijvK2JBftUz-5ovagN2B_RWH` and the link we will use is:

```
https://drive.google.com/uc?export=view&id=your_id
```

Then in markdown you'd write the following:

```markdown
![image](https://drive.google.com/uc?export=view&id=1xDrydbSbijvK2JBftUz-5ovagN2B_RWH)
```

3. Using IPython!
```python
from IPython.display import Image
Image(filename="drive/GPU/data/iris.PNG")
```

4. Using your connected GDrive
```markdown
![iris](drive/GPU/data/iris.PNG)
```

## Dataset
You are given access to a set of tweets. These tweets have an associated
emotion $y \in Y := \{anger, fear, joy, love, sadness\}$. For this project, given the review text, you will
need to predict the associated rating, y. This is sometimes called fine-grained sentiment analysis in the literature; we will simply refer to it as sentiment analysis in this project.

We will minimally preprocess the reviews and handle tokenization in what we re-
lease. For this assignment, we do not anticipate any further preprocessing to be done by you. Should you choose to do so, it would be interesting to hear about in the report (along with whether or not it helped performance), but it is not a required aspect of the assignment.


In [1]:
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)

train_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS4740", "Project3", "p3-cs4740-2020fa","p3_train.txt") # replace based on your Google drive organization
val_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS4740", "Project3", "p3-cs4740-2020fa","p3_val.txt") # replace based on your Google drive organization
test_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS4740", "Project3", "p3-cs4740-2020fa","p3_test_no_labels.txt") # replace based on your Google drive organization

Mounted at /content/drive


# Part 1: Feedforward Neural Network

In this section, there are two main components relevant to **Part 1**.

1. `Data loader`\
As the name suggests, this section loads the data from the dataset files and handles other preprocessing and setup. You will **not** need to change this file and should **not** change this file throughout the assignment.

2. `ffnn`\
This contains the model and code that uses the model for **Part 1**

In the `ffnn` section, you will find a Feedforward Neural Net serving as the underlying model for performing emotion detection.



## Part 1: Tips

We do not assume you have **any** experience working with neural networks and/or debugging them. You may discover this process, while similar, is quite different from debuging in general software engineering and from debugging in other domains such as algorithms and systems.

We suggest you systematically step through the code and simultanously (perhaps by physically drawing it out) describe what the computations _mean_. What you are looking for is where the code differs from what is expected.

## Part 1: Rules

For **Part 1**, you will not be able to ask any questions on Piazza and we will be unable to provide any meaningful advice in office hours. Unfortunately, this is the nature of debugging, it is unlikely anyone can give you specific advice for most problems you encounter and we have already provided general tips in the preceding section, If you absolutely must ask a question or you believe there is some kind of issue with the assignment for this part, please submit a private Piazza post and we will respond swiftly.

As a reminder **communication about the assignment _between_ distinct groups is not permissed and is a violation of the Academic Integrity policy** For this assignment, we will be _extremely_ stringent about this, given that debugging is entirely pointless if someone else in a different group tells you where the error is.

## Import libraries and connect to Google Drive

In [2]:
import json
import math
import os
from pathlib import Path
import random
import time
from tqdm.notebook import tqdm, trange
from typing import Dict, List, Set, Tuple

import numpy as np
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
from tqdm.notebook import tqdm, trange

## Data loader

In [3]:
emotion_to_idx = {
    "anger": 0,
    "fear": 1,
    "joy": 2,
    "love": 3,
    "sadness": 4
}
idx_to_emotion = {v: k for k, v in emotion_to_idx.items()}
UNK = "<UNK>"

In [4]:
def fetch_data(train_data_path, val_data_path, test_data_path):
    """fetch_data retrieves the data from a json/csv and outputs the validation
    and training data

    :param train_data_path:
    :type train_data_path: str
    :return: Training, validation pair where the training is a list of document, label pairs
    :rtype: Tuple[
        List[Tuple[List[str], int]],
        List[Tuple[List[str], int]],
        List[List[str]]
    ]
    """
    with open(train_data_path) as training_f:
        training = training_f.read().split("\n")[1:-1]
    with open(val_data_path) as valid_f:
        validation = valid_f.read().split("\n")[1:-1]
    with open(test_data_path) as testing_f:
        testing = testing_f.read().split("\n")[1:-1]
	
    # If needed you can shrink the training and validation data to speed up somethings but this isn't always safe to do by setting k < 10000
    # k = #fill in
    # training = random.shuffle(training)
    # validation = random.shuffle(validation)
    # training, validation = training[:k], validation[:(k // 10)]

    tra = []
    val = []
    test = []
    for elt in training:
        if elt == '':
            continue
        txt, emotion = elt.split(",")
        tra.append((txt.split(" "), emotion_to_idx[emotion]))
    for elt in validation:
        if elt == '':
            continue
        txt, emotion = elt.split(",")
        val.append((txt.split(" "), emotion_to_idx[emotion]))
    for elt in testing:
        if elt == '':
            continue
        txt = elt
        test.append(txt.split(" "))

    return tra, val, test

In [5]:
def make_vocab(data):
    """make_vocab creates a set of vocab words that the model knows

    :param data: The list of documents that is used to make the vocabulary
    :type data: List[str]
    :returns: A set of strings corresponding to the vocabulary
    :rtype: Set[str]
    """
    vocab = set()
    for document, _ in data:
        for word in document:
            vocab.add(word)
    return vocab 


def make_indices(vocab):
	"""make_indices creates a 1-1 mapping of word and indices for a vocab.

	:param vocab: The strings corresponding to the vocabulary in train data.
	:type vocab: Set[str]
	:returns: A tuple containing the vocab, word2index, and index2word.
		vocab is a set of strings in the vocabulary including <UNK>.
		word2index is a dictionary mapping tokens to its index (0, ..., V-1)
		index2word is a dictionary inverting the mapping of word2index
	:rtype: Tuple[
		Set[str],
		Dict[str, int],
		Dict[int, str],
	]
	"""
	vocab_list = sorted(vocab)
	vocab_list.append(UNK)
	word2index = {}
	index2word = {}
	for index, word in enumerate(vocab_list):
		word2index[word] = index 
		index2word[index] = word 
	vocab.add(UNK)
	return vocab, word2index, index2word 


def convert_to_vector_representation(data, word2index, test=False):
	"""convert_to_vector_representation converts the list of strings into a vector

	:param data: The dataset to be converted into a vectorized format
	:type data: Union[
		List[Tuple[List[str], int]],
		List[str],
	]
	:param word2index: A mapping of word to index
	:type word2index: Dict[str, int]
	:returns: A list of vector representations of the input or pairs of vector
		representations with expected output
	:rtype: List[Tuple[torch.Tensor, int]] or List[torch.Tensor]

	List[Tuple[List[torch.Tensor], int]] or List[List[torch.Tensor]]
	"""
	if test:
		vectorized_data = []
		for document in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append(vector)
	else:
		vectorized_data = []
		for document, y in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append((vector, y))
	return vectorized_data

In [6]:
class EmotionDataset(Dataset):
    """EmotionDataset is a torch dataset to interact with the emotion data.

    :param data: The vectorized dataset with input and expected output values
    :type data: List[Tuple[List[torch.Tensor], int]]
    """
    def __init__(self, data):
        self.X = torch.cat([X.unsqueeze(0) for X, _ in data])
        self.y = torch.LongTensor([y for _, y in data])
        self.len = len(data)
    
    def __len__(self):
        """__len__ returns the number of samples in the dataset.

        :returns: number of samples in dataset
        :rtype: int
        """
        return self.len
    
    def __getitem__(self, index):
        """__getitem__ returns the tensor, output pair for a given index

        :param index: index within dataset to return
        :type index: int
        :returns: A tuple (x, y) where x is model input and y is our label
        :rtype: Tuple[torch.Tensor, int]
        """
        return self.X[index], self.y[index]

def get_data_loaders(train, val, batch_size=16):
    """
    """
    # First we create the dataset given our train and validation lists
    dataset = EmotionDataset(train + val)

    # Then, we create a list of indices for all samples in the dataset
    train_indices = [i for i in range(len(train))]
    val_indices = [i for i in range(len(train), len(train) + len(val))]

    for i in train_indices:
      if (torch.any(torch.isnan(train[i][0]))):
        print("NAN TRUE")

    # Now we define samplers and loaders for train and val
    train_sampler = SubsetRandomSampler(train_indices)
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    
    val_sampler = SubsetRandomSampler(val_indices)
    val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_sampler)

    return train_loader, val_loader

In [7]:
train, val, test = fetch_data(train_path, val_path, test_path)

In [8]:
vocab = make_vocab(train)
vocab, word2index, index2word = make_indices(vocab)
train_vectorized = convert_to_vector_representation(train, word2index)
val_vectorized = convert_to_vector_representation(val, word2index)
test_vectorized = convert_to_vector_representation(test, word2index, True)

In [9]:
train_loader, val_loader = get_data_loaders(train_vectorized, val_vectorized, batch_size=1)

In [None]:
# Note: Colab has 12 hour limits on GPUs, also potential inactivity may kill the notebook. Save often!

## 1.1 FFNN Implementation

### 1.1 Task
Assume that an onmiscient oracle has told you there are **4 fundamental errors** in the **FFNN** implementation. They may be anywhere in this section unless otherwise indicated. Your objective is to _find_ and _fix_ each of these errors and to include in the report a description of the original error along with the fix. To help your efforts, the oracle has provided you with additional information about the properties of the errors as follows:

* _Correctness_ \
Each error causes the code to be strictly incorrect. There is absolutely no ambiguity that the errant code (or missing code) is incorrect. This means errors are not due to the code being inefficient (in run-time or in memory).

* _Localized_ \
Each error can be judged to be erroneous by strictly looking at the code (along with your knowledge of machine learning as taught through this course). The errors therefore are not due to the model being uncompetitive in terms of performance with state-of-the-art performance for this task nor are they due to the amount of data being insufficient for this task in general.

* _General_ \
Each error is general in nature. They will not be triggered by the model receiving a pathological input, i.e. they will not be something that is triggered specifically when NLP is referenced with negative sentiment.

* _Fundamental_ \
Each error is a fundamental failure in terms of doing what is intended. This means that errors do not hinge on nuanced understanding of specific PyTorch functionality. This also means they will not exploit properties of the dataset in
a subtle way that could only be realized by someone who has comprehensively studied the data.

The bottom line: the errors should be fairly obvious. The oracle further reminds you that performance/accuracy of the (resulting) model should not be how you ensure you have debugged successfully. For example, if you correct some, but not all, of the errors, the remaining errors may mask the impact of your fixes. Further, performance is not guaranteed to improve by fixing any particular error. Consider the case where the training set is also employed as the test set; performance will be very high but there is something very wrong. And fixing the problem will reduce performance.
In fixing each error, the oracle provides some further insight about the fixes:

* _Minimal_ \
A reasonable fix for each error can be achieved in < 5 lines of code being changes. We do not require you to make fixes of 4 of fewer lines, but it should be a cause for concern if your fixes are far more elaborate

* _Ill-posed_ \
While the errors are unambiguous, the method for fixing them is under-specified: You are free to implement any reasonable fix and all such fixes will equally recieve full credit.

In [10]:
# Lambda to switch to GPU if available
get_device = lambda : "cuda:0" if torch.cuda.is_available() else "cpu"
get_device()

'cuda:0'

In [11]:
unk = '<UNK>'

# Consult the PyTorch documentation for information on the functions used below:
# https://pytorch.org/docs/stable/torch.html

class FFNN(nn.Module):
	def __init__(self, input_dim, h, output_dim):
		super(FFNN, self).__init__()
		self.h = h
		self.W1 = nn.Linear(input_dim, h)
		self.activation = nn.ReLU() # The rectified linear unit; one valid choice of activation function
		## ERROR 1 occurs in the line below. The fix was to change the second h to output_dim
		self.W2 = nn.Linear(h, output_dim) 
    # The below two lines are not a source for an error
		self.softmax = nn.LogSoftmax(dim=1) # The softmax function that converts vectors into probability distributions; computes log probabilities for computational benefits
		self.loss = nn.NLLLoss() # The cross-entropy/negative log likelihood loss taught in class

	def compute_Loss(self, predicted_vector, gold_label):
		return self.loss(predicted_vector, gold_label)

	def forward(self, input_vector):
		# The z_i are just there to record intermediary computations for your clarity
		##ERROR 2 occur in the line below. The fix was to apply the activation function to the W1 layer.
		z1 = self.activation(self.W1(input_vector)) 
		z2 = self.W2(z1)
		predicted_vector = self.softmax(z2)
		return predicted_vector
	
	def load_model(self, save_path):
		self.load_state_dict(torch.load(save_path))
	
	def save_model(self, save_path):
		torch.save(self.state_dict(), save_path)


def train_epoch(model, train_loader, optimizer):
	model.train()
	total = 0
	loss = 0
	acc_loss = 0
	correct = 0
	for (input_batch, expected_out) in tqdm(train_loader, leave=False, desc="Training Batches"):
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out == predicted.to("cpu")).cpu().numpy().sum()
	
		loss = model.compute_Loss(output, expected_out.to(get_device()))
		acc_loss += loss
		##ERROR 3 was the accumulated gradients. The fix was to add the line below to zero out the gradients.
		optimizer.zero_grad() 
		loss.backward()
		optimizer.step()
	# Print accuracy
	acc_loss /= len(train_loader)
	print("Training loss", acc_loss)
	print("Training accuracy", correct/total)

	return


def evaluation(model, val_loader, optimizer):
	model.eval()
	loss = 0
	correct = 0
	total = 0
	for (input_batch, expected_out) in tqdm(val_loader, leave=False, desc="Validation Batches"):
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out.to("cpu") == predicted.to("cpu")).cpu().numpy().sum()

		loss += model.compute_Loss(output, expected_out.to(get_device()))
	loss /= len(val_loader)
	# Print validation metrics
	print("Validation loss", loss)
	print("Validation accuracy", correct/total)
	pass

def train_and_evaluate(number_of_epochs, model, train_loader, val_loader, lr=0.001):
	optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
	for epoch in trange(number_of_epochs, desc="Epochs"):
		## ERROR 4 occurs in the line below. The fix was to train using the training data instead of the validation.
		train_epoch(model, train_loader, optimizer)
		evaluation(model, val_loader, optimizer)
	print("")
	return

In [None]:
h = 512
model = FFNN(len(vocab), h, len(emotion_to_idx)).to(get_device())
train_and_evaluate(2, model, train_loader, val_loader)
model.save_model("ffnn_fixed.pth") # Save our model!

In [None]:
# Example of how to load
loaded_model = FFNN(len(vocab), h, len(emotion_to_idx))
loaded_model.load_model("ffnn_fixed.pth")

## 1.2 Part 1 Report
Please include a description of the error, a description of your fix, and a python comment indicating the fix for each of the 4 errors.

### Error 1: 
Error code: `self.W2 = nn.Linear(h, h)`.

The output layer (W2 layer) in the FFNN had the wrong output dimension. In the code, it was h which is the dimension of the hidden layer while it should be the output_dim given as the argument. The output dimension of the hidden layer and the output layer are not the same. 

We fixed this error by changing the second argument of the nn.Linear function to be the output_dim as shown below.

Fixed code: `self.W2 = nn.Linear(h, output_dim)`

### Error 2: 
Error code: `z1 = self.W1(input_vector)`

All the hidden layers in a FFNN should have an activation function applied to them to introduce non-linearity. In the code, the hidden layer (W1 layer) did not have the activation function applied to it which was an error.

To fix this error, we simply applied the activation function (which is the ReLu function in our case) to the hidden layer.

Fixed code: `z1 = self.activation(self.W1(input_vector))`

### Error 3: 

In the train_epoch function, the error was that the gradient of the optimizer (SGD in our case) was not zeroed out before each iteration. This is needed to ensure the gradient is pointing in the right direction and is not influenced by the previous accumulated gradients.

In order to fix this error, we used the pytorch function zero_grad() to zero out the gradients at the beginning of each iteration.

Fixed code: `optimizer.zero_grad()`

### Error 4: 
Error code: `train_epoch(model, val_loader, optimizer)`

The error was in train_and_evaluate function where the training was being done using the validation set. This is wrong because we should not have the same training set as the validation set since we will always get a high validation accuracy. We will not know if the model overfits if the train and validation sets are the same.

To fix this, we used the training dataset to train the model.

Fixed code: `train_epoch(model, train_loader, optimizer)`

# Part 2: Recurrent Neural Network
Recurrent neural networks have been the workhorse of NLP for a number of years. A fundamental reason for this success is they can inherently deal with _variable_ length sequences. This is axiomatically important for natural language; words are formed from a variable number of characters, sentences from a variable number of words, paragraphs from a variable number of sentences, and so forth. This differs from a field like Computer Vision where images are (generally) of a fixed size.
<br></br>
This is also very different scenario than that of the classifiers we have studied (e.g.Naive Bayes, Perceptron Learning, Feedforward Neural Networks), which take in a
fixed-length vector.
<br></br>
To clarify this, we can think of the _types_ of the mathematical functions described by a FFNN and an RNN. What is pivotal in what follows is that k need not be constant
across examples.

$\textbf{FFNN.}$ \
$Input: \vec{x} \in \mathcal{R}^d$ \
$Model\text{ }Output: \vec{z} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$
$Final\text{ }Output: \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the contraint of being a probability distribution, ie $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{y} \mid}{min} \text{ }\vec{y}[i] \leq 1$, which is achieved via _Softmax_ applied to $\vec{z}$.
<br></br>
$\textbf{RNN.}$ \
$Input: \vec{x}_1,\vec{x}_2, \dots, \vec{x}_k; \vec{x}_i \in \mathcal{R}^d$ \
$Model\text{ }Output: \vec{z}_1,\vec{z}_2, \dots, \vec{z}_k; \vec{z}_i \in \mathcal{R}^{h}$
$Final\text{ }Output: \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the contraint of being a probability distribution, ie $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{y} \mid}{min} \text{ }\vec{y}[i] \geq 0$, which is achieved by the process described later in this report and as you have seen in class

Intuitively, an RNN takes in a sequence of vectors and computes a new vector corresponding to each vector in the original sequence. It achieves this by processing the input sequence one vector at a time to (a) compute an updated representation of the entire sequence (which is then re-used when processing the next vector in the input sequence), and (b) produce an output for the current position. The vector computed in (a) therefore not only contains information about the current input vector but also about the previous input vectors. Hence, $\vec{z}_j$ is computed after having observed $\vec{x}_1, \dots, \vec{x}_j$. As such, a simple observation is we can treat the last vector computed by the RNN, ie $\vec{z}_k$ as a representation of the entire sequence. Accordingly, we can use this as the input to a single-layer linear classifier to compute a yector $\vec{y}$ as we will need for classification.

$$\vec{y} = Softmax(W\vec{z}_k); W\in \mathcal{R}^{\mid \mathcal{Y}\mid \times h}$$

## Part 2: Rules
**Part 2** requires implementing a rudimentary RNN in PyTorch for text classification. Countless blog posts, internet tutorials and other implementations available publicly (and privately) do precisely this. In fact, almost every student in [Cornell NLP](https://nlp.cornell.edu/people/) likely has some code for doing this on their Github. You **cannot** use any such code (though you may use anything you find in course notes or course texts) irrespective of whether you cite it or do not.

Submissions will be passed through the MOSS system, which is a sophisticated system for detecting plagiarism in code and is robust in the sense that it tries to find alignments in the underlying semantics of the code and not just the surface level syntax. Similarly, the course staff are also quite astute with respect to programming neural models for NLP and we will strenuously look at your code. We flagged multiple groups for this last year, so we strongly suggest you resist any such temptation (if the Academic Integrity policy alone is insufficient at dissuading you).

## 2.1 RNN Implementation

Similar to **Part 1**, we have the previous `Data loader` section and the new `RNN` component. We don't envision that it will be useful to modify the `Data loader`. We have included some stubs to help give you a place to start for the RNN.

Additionally, we remind you that Part 1 furnishes a near-functional implementation of a similar neural model for the same task. If you successfully do Part 1 correctly, it will be wholely functional. Using it as a template for Part 2 is both prudent and suggested.

In [12]:
# get max document length for vector dimension
max_len = 0
for document,_ in train:
  if max_len < len(document):
    max_len = len(document)

#pre-trained GloVe word embeddings used is of dimension 50
word_vec_dim = 50

def rnn_preprocessing(data, test=False, max_len=max_len):
    """rnn_preprocessing creates a list of tensors that have word vectors for each document in data.

    :param data: data to be preprocessed
    :type data: list of string lists, or list of 2-tuples with elements, string list and int
    :param test: if pre-processing is done for test data
    :type test: boolean
    :param max_len: maximum length a document can be (to create word vectors tensor)
    :type max_len: int
    """
    # Do some preprocessing similar to convert_to_vector_representation
    # For the RNN, remember that instead of a single vector per training
    # example, you will have a sequence of vectors where each vector
    # represents some information about a specific token.
    docs = []
    if test:
      max_len = 0
      for document in data:
        docs.append(document)
        if max_len < len(document):
          max_len = len(document)
    else:
      for document,_ in data:
        docs.append(document)

    # Use Pretrained GloVe from Twitter Data with 50 dimensions to extract word embeddings 
    glove_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS4740", "Project3", "p3-cs4740-2020fa","glove.twitter.27B.50d.txt")
    embeddings_dict = {}
    with open(glove_path, 'r') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector

    vector = []
    if test:
      for document in data:
        doc_vector = torch.zeros([max_len, word_vec_dim], dtype=torch.float32) 
        for i in range(len(document)):
          word = document[i]
          if word in embeddings_dict:
            word_vec = embeddings_dict[word]
          doc_vector[i] = torch.from_numpy(word_vec)
        vector.append(doc_vector)
    else:
      for document,y in data:
        doc_vector = torch.zeros([max_len, word_vec_dim], dtype=torch.float32) 
        for i in range(len(document)):
          word = document[i]
          if word in embeddings_dict:
            word_vec = embeddings_dict[word]
          doc_vector[i] = torch.from_numpy(word_vec)
        vector.append((doc_vector, y))
    return vector

In [13]:
# pre-process train and validation data
train_vectorized_rnn = rnn_preprocessing(train)
val_vectorized_rnn = rnn_preprocessing(val)

In [14]:
train_loader_rnn, val_loader_rnn = get_data_loaders(train_vectorized_rnn, val_vectorized_rnn, batch_size=1)

In [15]:
from torch import autograd
class RNN(nn.Module):
	def __init__(self, input_dim, h, output_dim): # Add relevant parameters
		super(RNN, self).__init__()
		self.h = h
		self.W = nn.Linear(input_dim, h)
		self.V = nn.Linear(h, output_dim)
		self.U = nn.Linear(h, h)
		self.activation = nn.LeakyReLU()	
		# using LeakyReLU to ensure gradients to ensure small vector values are not affecting hidden vectors 
		# Ensure parameters are initialized to small values, see PyTorch documentation for guidance
		self.softmax = nn.LogSoftmax(dim=1)
		self.loss = nn.NLLLoss()

	def compute_Loss(self, predicted_vector, gold_label):
		return self.loss(predicted_vector, gold_label)

	def forward(self, inputs):
	# begin code
		h_vec = torch.zeros(self.h).to(get_device())
		zero_inp = torch.zeros((1,word_vec_dim)).to(get_device())
		for i in range(0, inputs.size()[1]):
			inp = torch.reshape(inputs[0][i], (1,word_vec_dim))
	 		# stop training on document if no more words
			if torch.all(torch.eq(inp, zero_inp)): 
				break
			h_vec = self.activation(self.U(h_vec) + self.W(inp))
			
		y = self.V(h_vec)
		# remember to include the predicted unnormalized scores which should be normalized into a (log) probability distribution
		# end code
		return self.softmax(y)

	def load_model(self, save_path):
		self.load_state_dict(torch.load(save_path))
	
	def save_model(self, save_path):
		torch.save(self.state_dict(), save_path)
	
def train_epoch_rnn(model, train_loader, optimizer):
	model.train()
	total = 0
	loss = 0
	correct = 0
	acc_loss = 0
	for (input_batch, expected_out) in tqdm(train_loader, leave=False, desc="Training Batches"):
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out == predicted.to("cpu")).cpu().numpy().sum()
		loss = model.compute_Loss(output, expected_out.to(get_device()))
		acc_loss += loss
		optimizer.zero_grad() 
		loss.backward()
		torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
		optimizer.step()

	acc_loss /= len(train_loader)
	# Print accuracy
	print("Training accuracy", correct/total)
	print("Training loss", acc_loss.item())
	return acc_loss

def evaluation_rnn(model, val_loader, optimizer):
	model.eval()
	loss = 0
	correct = 0
	total = 0
	for (input_batch, expected_out) in tqdm(val_loader, leave=False, desc="Validation Batches"):
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out.to("cpu") == predicted.to("cpu")).cpu().numpy().sum()
		loss += model.compute_Loss(output, expected_out.to(get_device()))
	loss /= len(val_loader)
	# Print validation metrics
	print("Validaiton loss", loss.item())
	print("Validation accuracy", correct/total)
	return loss
	
def train_and_evaluate_rnn(number_of_epochs, model, train_loader, val_loader, lr=0.001):
	optimizer = optim.Adam(model.parameters(), lr=lr)
	scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
	for epoch in trange(number_of_epochs, desc="Epochs"):
		print("Epoch", epoch+1)
		tr_loss = train_epoch_rnn(model, train_loader, optimizer)
		val_loss = evaluation_rnn(model, val_loader, optimizer)
		scheduler.step()
	return


In [None]:
h = 250
model = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
train_and_evaluate_rnn(10, model, train_loader_rnn, val_loader_rnn)
model.save_model("rnn.pth") # Save our model!

In [18]:
# Example of how to load
loaded_model = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
loaded_model.load_model("rnn.pth")

## 2.2 Part 2 Report
For Part 2, your report should have a description of each major step of implementing the RNN accompanied by the associated code-snippet. Each step should have an explanation for why you decided to do something (when one could reasonably do the same step in a different way); your justification will not be based on empirical results in this section but should relate to something we said in class, something mentioned in any of the course texts, or some other source (i.e. literature in NLP or official PyTorch documentation). **Unjustified, vague, and/or under-substantiated explanations will not receive credit.**

Things to include:

1. _Representation_ \
Each $\vec{x}_i$ needs to be produced in some way and should correspond to word $i$ in the text. This is different from the text classification approaches we have studied previously (BoW for example) where the entire document is represented with a single vector. Where and how is this being done for the RNN?

2. _Initialization_ \
There will be weights that you update in training the RNN. Where and how are these initialized?

3. _Training_ \
You are given the entire training set of N examples. How do you make use of this training set? How does the model modify its weights in training (this likely entails somewhere where gradients are computed and somehwere else where these gradients are used to update the model)?

4. _Model_ \
This is the core model code, ie. where and how you apply the RNN to the $\vec{x}_i$

5. _Linear Classifier_ \
Given the outputs of the RNN, how do you consume these to actually compute $\vec{y}$?

6. _Stopping_ \
How does your training procedure terminate?

7. _Hyperparameters_ \
To run your model, you must fix some hyperparameters, such as $h$ (the hidden dimensionality of the $\vec{z}_i$ referenced above). Be sure to exhaustively describe these hyperparameters and why you set them as you did ( this almost certainly will require some brief exploration: we suggest the course text by Yoav Goldberg as well as possibly the PyTorch official documentation). Be sure to accurately cite either source.



### 2.2.1 Representation


Our input data is created to be of a specific form in the rnn_preprocessing function. We represent our input data in the following way:

Each word in each of the documents is represented by a vector. So, each document is represented by multiple vectors. And, the size of the vectors for each document is set to be the maximum document length in order to have uniform dimension across all document vectors. This avoids dimension mismatch errors for further computations. Smaller documents are padded with zeros, which does not affect further calculations.

To create the word vectors for each document, we used pretrained GloVe with Twitter data with 50 dimension word embeddings. We chose a dimension of 50 since it is enough to get the embedding of each word since tweets are generally not too long. 


### 2.2.2 Initialization


We use 3 weights in our RNN and these are initialized in the constructor of the RNN class. We initialized our weights W, V, and U using PyTorch's Linear transformation function.

` self.W = nn.Linear(input_dim, h)`

` self.V = nn.Linear(h, output_dim)`

` self.U = nn.Linear(h, h) `

Weight matrix W is initialized with a dimension of input_dim by h, where input_dim is the size of the word vector and h is the size of the hidden layer. 

Weight matrix U is initialized with a dimension of h by h, where h is the size of the hidden layer. U is then multiplied with the previous hidden layer, to get the context of the previous parts of the sequence. Adding this product to the product of W and input, and applying the activation function gives the current hidden layer. The activation function we use is the LeakyReLU function. This function is similar to ReLu function but takes care of the dying ReLU problem.

Weight matrix V is initialized with a dimension of h by output_dim, where h is the size of hidden layer and output_dim is the size of the output, which is the number of emotions we have. 

We used pytorch Linear function to ensure that the weights are initialized to be small random values as mentioned in the pytorch documentation. This is important because random values break symmetry in the weights to make sure that the hidden layer values are not the same for the first round of calculations. Small values are important so that the slope of the gradient does not change too slowly which would make the learning take a long time.

### 2.2.3 Training


We train our model using the train_epoch_rnn function. In here, we are looping through each of our inputs and calculating the predicted value using our RNN model. The output from the model returns a vector of probabilities for each of the emotions and to get the predicted emotion, we take the emotion corresponding to the maximum probability in the vector. The loss is calculated using cross entropy loss function. The backward function is called on this loss which calculates the new gradient. This gradient is then used to update the weights when the step function is called on the optimizer, which is the Adam optimizer from pytorch in our case. We tried training our model with SGD but the loss was decreasing at a very slow rate and did not improve much over multiple epochs. Adam optimizer, on the other hand, had a good pattern of reduction in loss and the model learned faster. Before we calculate the new gradient using `loss.backward()`, we zero out the gradients in order to avoid accumulation of the gradients from previous training iterations. We are also using a scheduler in our training process which changes the learning rate after 5 epochs. The learning rate is decreased by a factor of 10 making the model learn slower so it can learn things it missed previously.

### 2.2.4 Model


We apply our RNN to the inputs in forward function in RNN class.
```
h_vec = torch.zeros(self.h).to(get_device())
		zero_inp = torch.zeros((1,word_vec_dim)).to(get_device())
		for i in range(0, inputs.size()[1]):
			inp = torch.reshape(inputs[0][i], (1,word_vec_dim))
			if torch.all(torch.eq(inp, zero_inp)):
				break
			h_vec = self.activation(self.U(h_vec) + self.W(inp))
		y = self.V(h_vec)
return self.softmax(y)
```
We initialize our h vector to be a zero vector for the first time step. We then loop through the word vectors of the input. For each word vector, we calculate the current h vector by adding the product of weight matrix W and the input word vector to the product of weight matrix U and the previous h vector, and applying the activation function on this sum. The activation function used is LeakyReLU function in order to take care of the dying ReLU problem. We made all the input vectors the same size (the size of the maximum tweet) during our preprocessing, but many of the inputs are actually smaller than that making them contain many zero word vectors at the end of the actual input. Using these zero vectors would affect our model and we only want to take into account the last actual word of the document so we added a check to exit the for-loop when we encounter a word vector that is filled with zeros. 

Outside the loop, we calculate our output by multiplying the last hidden layer vector and the weight matrix V and applying the softmax function on this product to normalize the probabilities.

### 2.2.5 Linear Classifier

We compute the final prediction vector by multiplying the last hidden layer by the weight matrix V in RNN forward function. Then, softmax (log softmax in our case) is applied on this product in order to normalize it. RNN returns this normalized prediction vector ($\vec{y}$) for each input. Lastly, while training and evaluating, we are getting one emotion label per document by taking the emotion corresponding to the maximum probability from the prediction vector.

### 2.2.6 Stopping


Our training procedure terminates after the given number of epochs. 

### 2.2.7 Hyperparameters


**Number of hidden dimension:**  We chose our hidden dimension to be 250  given that we have about 10,000 inputs and the maximum length of one document is 66 words where one word vector is of dimension 50. Given these input information, we need to choose a hidden dimension value that is not too large nor too small. The number of hidden dimension corresponds to the amount of information being passed into each of the layers in the network. Since our dataset consists of tweets which are not too large, a hidden dimension smaller than the text size would mean not enough information is being propagated. We chose 250 after some experimentation and with 250 as the hidden dimension, the network seemed very stable and seemed to learn properly.


**Epochs:** For the number of epochs, we chose it to be 10. After trying out multiple epochs (we tried 5 epochs, 10 epochs, and 15 epochs). We noticed that 5 epochs was too little since when we changed the number of epochs to be 10, both the training loss and the validation loss decreased. However, when we changed the number of epochs to be 15, it seemed like there was definitely overfitting occuring since the training loss decreased but the validation loss stayed around the same (sometimes increased negligibly). Thus, after such experimentation, we chose our number of epochs to be 10 so the model has enough time to learn well without overfitting. 

**Learning Rate:**
For the learning rate, we initialized it to be 0.001. According to the text by Yoav Goldberg, a big learning rate means that the model will have trouble converging to a proper solution while if the learning rate is too small, the model will be learning very slowly. We are also using a scheduler which decreases learning rate after 5 epochs since after 5 epochs we noticed that the model starts to not learn as much (the validation loss starts to plateau).

# Part 3: Analysis
From **Part 1** and **Part 2**, you will have two different models in hand for performing the same emotion detection task. In **Part 3**, you will conduct a comprehensive analysis of these models, focusing on two comparative settings.

## Part 3 Note
You will be required to submit the code used in finding these results on CMSX. This code should be legible and we will consult it if we find issues in the results. It is worth noting that in **Part 1** and **Part 2**, we primarily are considering the correctness of the code-snippets in the report. If your model is flawed in a way that isn’t exposed by those snippets, this will likely surface in your results for **Part 3**. We will deduct points for correctness in this section to reflect this and we will try to localize where the error is (or think it is, if it is opaque from your code). That said, we will be lenient about absolute performance (within reason) in this section.

## 3.1: Across-Model Comparison
In this section, you will report results detailing the comparison of the two models. Specifically, we will consider the issue of _fair comparison_<sup>5</sup>, which is a fundamental notion in NLP and ML research and practice. In particular, given model $A$, it is likely the case we can make a model $B$ that is computationally more complex and, hence, more costly and achieves superior performance. However, this makes for an unfair comparison. For our purposes, we want to study how the FFNN and RNN compare when we try to control for hyperparameters and other configurable values being of similar computational cost<sup>6</sup>. That said, it is impossible to have identical configurations as these are different models, i.e. the RNN simply has hyperparameters for which there are no analogues in the FFNN.


In the report you will need to begin by describing 3 pairs of configurations, with each pair being comprised of a FFNN configuration and a RNN configuration that constitute a _fair comparison_. You will need to argue for why the two parts of each pair are a fair comparison. Across the pairs, you should try different types of configurations (e.g. trying to resolve like questions of the form: _Does the FFNN perform better or worse when the hidden dimensionality is small as opposed to when it is large?_) and justify what you are trying to study by having the results across the pairs.


Next, you will report the quantitative accuracy of the 6 resulting models. You will
analyze these results and then move on to a more descriptive analysis.

The descriptive analysis can take one of two forms<sup>7</sup>:

1. _Nuanced quantitative analysis_ \
If you choose this option, you will need to further break down the quantitative statistics you reported initially. We provide some initial strategies to prime you for what you should think about in doing this: one possible starting point is to consider: if model $X$ achieves greater accuracy than model $Y$, to what extent is $X$ getting everything correct that $Y$ gets correct? Alternatively, how is model performance affected if you measure performance on a specific strata/subset of the reviews?

2. _Nuanced qualitative analysis_ \
If you choose this option, you will need to select individual examples and try to explain or reason about why one model may be getting them right whereas the other isn’t. Are there any examples that all 6 models get right or wrong and, if so, can you hypothesize a reason why this occurs?

In [None]:
#@markdown ⠀
display(HTML('''<hr><p style="font-family:verdana; font-size:90%;">
5. This term takes on different meanings in different settings. Here we simply mean that we are trying to
compare different models while controlling for similar “complexity”/computational cost. <br></br>

6. We have not taught you how to do this rigorously and the theory for doing this is still underdeveloped. We only expect a reasonable attempt. <br></br>

7. This is the minimal requirement, if you provide other, more elaborate, analyses, we certainly welcome this.
</p>'''))

### 3.1.1 Configuration 1
Modify the code below for this configuration.

In [26]:
from collections import defaultdict

"""
eval_preds evaluates the predictions of [model] for a [data_vectorized].
Prints various evaluation metrics for each emotion.

model: the trained NN model 
data_vectorized: vectorized data and its associated labels
"""
def eval_preds(model, data_vectorized):
  correct = 0
  tp_dict = defaultdict(int)  # if an emotion is predicted right, it is true positive (tp)
  fp_dict = defaultdict(int)  # if the prediction is wrong, it is false positive for predicted emotion (fp)
  emo_dict = defaultdict(int) # to store the count of emotions in training set
  preds = []
  total=len(data_vectorized)
  for idx, (input_vector, expected) in tqdm(enumerate(data_vectorized), total=len(data_vectorized)):
    output = model(torch.Tensor(input_vector).unsqueeze(0).to(get_device())).cpu()
    _, pred = torch.max(output, 1)
    pred = int(pred)
    correct += (expected == pred)
    preds.append((idx, pred, expected))
    if pred == expected:
      tp_dict[pred] += 1
    else:
      fp_dict[pred] += 1
    emo_dict[expected] += 1
  
  print("Accuracy", correct/total)
  print("Emotion counts in training set", emo_dict)
  print("True positive counts:", tp_dict)
  print("False positive counts:", fp_dict)
  print("")

  for emo in emotion_to_idx:
    idx = emotion_to_idx[emo]
    # true positive rate
    tp_rate = tp_dict[idx] / emo_dict[idx]
    # predictive power
    pred_power = tp_dict[idx] / (fp_dict[idx] + tp_dict[idx] + 1)
    print("Emotion:", idx, emo)
    print("True positive rate:", tp_rate)
    print("Predictive power:", pred_power)
    print("")
  return preds

In [None]:
# constants for both model
h = 512
epochs = 10
lr = 0.001
ffnn_config_1 = FFNN(len(vocab), h, len(emotion_to_idx)).to(get_device())
rnn_config_1 = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())

In [None]:
#training FFNN
train_and_evaluate(epochs, ffnn_config_1, train_loader, val_loader, lr=lr)
ffnn_config_1.save_model("ffnn_config_1.pth")

In [None]:
#training RNN
train_and_evaluate_rnn(epochs, rnn_config_1, train_loader_rnn, val_loader_rnn, lr=lr)
rnn_config_1.save_model("rnn_config_1.pth") # Save our model!

In [None]:
#get first 10 predictions using ffnn and rnn
print("--- FFNN Predictions")
ffnn_preds = eval_preds(ffnn_config_1, val_vectorized)
print("\n--- RNN Predictions")
rnn_preds = eval_preds(rnn_config_1, val_vectorized_rnn)

print("--- Some examples of predictions and expected output ---\n")
for i in range(10):
  print("INPUT:", " ".join(val[i][0]))
  print("ffnn pred", ffnn_preds[i][1])
  print("rnn pred", rnn_preds[i][1])
  print("correct", ffnn_preds[i][2])
  print("")

### 3.1.1 Report
Describe configurations, report the results, and then perform a nuanced analysis

For the first and second configurations, we control the number of epochs and learning rate while varying the dimension of hidden layer to study how dimension of hidden layer influences the performance of the neural networks.

For the first configuration, we have 512 hidden dimensions, a learning rate of 0.001, and 10 epochs for both FFNN and RNN models. This is a fair comparison because when we control the number of epochs, the computations done on both models amounts to about the same cost as both models need to be trained with 10 epochs. And since we are using a batch size of 1, the number of computations and iterations done on training are the same. Controlling the learning rate ensures that both models learn at the same rate, converging from a similar starting point.

## Result
When this configuration is used to train FFNN and RNN models, we found that the FFNN learns slightly better than RNN. FFNN has a final training accuracy of 0.99 and loss of 0.0307, which is slightly better than RNN that has training accuracy of 0.95 and loss of 0.26. With a hidden dimension of 512, both models learn well gradually, with the training loss decreasing in a stable manner over the 10 epochs.

The trained models also predict the validation set pretty well, where both FFNN and RNN models have 0.89 as the validation accuracy. However, the loss of validation is higher in RNN with 0.78 whereas FFNN has a loss of 0.34 on validation after 10 epochs. These loses are low which indicates that 512 is a good hidden dimension, such that both models FFNN and RNN do not overfit, yet still perform well on the validation set. More on how hidden dimension affects model performance will be explored in the next configuration, as we will have a comparison point there.

## Nuanced analysis - Qualitative Analysis

Firstly, we'd like to explore how the FFNN model learns and predicts in a qualitative sense. We see, from the evaluation metrics printed above, that FFNN predicts anger emotion the best, followed by fear and joy. Some examples that FFNN gets right are:

*   `i didn t know that i would feel so completely exhausted` as sadness
*   `i couldn t help but feel slightly skeptical and apprehensive as i realized the tough task funes was taking on that night` as fear

We see that these sentences are predicted correctly because our FFNN model picks on the meaning of words like "exhausted" which is more related to sadness and "skeptical" and "apprehensive" which are more likely to be fear.

RNN, on the other hand, predicts anger the best as well, followed by sadness. Some correct examples:

* `i feel excluded and worthless my connection to everyone summarily cut off` as sadness

* `i shouldnt feel threatened by that` as fear

These make sense as the RNN preprocessed input would give the meanings of words like "threatened" to be similar to fear and "excluded" and "worthless" to be sadness. The RNN model picks up these patterns and learns to predict as such.

Some examples that FFNN gets right but RNN does not for this configuration are:

* `i know she feels helpless but that kiss that cuddle the hug every morning and the love you every night` is sadness but RNN predicts as fear. Here, it could be that RNN picks up words like "helpless" to get a prediction of fear, which might be weighted more by the RNN model than the rest of the sentence. FFNN correctly gets the semantic of the sentence that the sentence has a sad tone.

* `i just act how i feel im becoming what ive always hated` is anger but RNN predicts as fear. RNN is again picking up words like "hated" to mean something sad but FFNN captures the entire semantic of the sentence, perhaps remembering words like "feel" and "becoming", which are words that one generally does not say in an angry statement.

Some examples that RNN gets right but FFNN does not:

* `i enjoy going to churches acquired there feeling is always so peaceful and tranquil thats why ive had a wish to visit pochayiv monastery and without comments it was really worthy` is joy but FFNN predicts as sadness. RNN correctly picks up the meaning of words like "enjoy", whereas FFNN is not performing very well in this with the appearance of possible negative words like "without", "wish" which one tends to say more when they are sad.

* `i guess we would naturally feel a sense of loneliness even the people who said unkind things to you might be missed` is anger but FFNN predicts as sadness. RNN here understands the semntic of the sentence, that the sentence has a passive aggressive tone. FFNN might interpret words like "loneliness", "unkind", "missed" to mean sadness.

An example that both models get right:

* `i get why she is concerned because i have been pretty honest about feeling shitty about all of it` as sadness.

This shows that both models are able to capture words like "concerned", "shitty" to be sadness as there are no words that can have ambiguous meaning when they appear in a different context. And, hence, both models predict well in this case.

For how hidden dimension influences individual predictions, we can look at the next configuration.



### 3.1.2 Configuration 2
Modify the code below for this configuration.

In [22]:
#constants for both model
h = 50
epochs = 10
lr = 0.001
ffnn_config_2 = FFNN(len(vocab), h, len(emotion_to_idx)).to(get_device())
rnn_config_2 = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())

In [None]:
#training FFNN
train_and_evaluate(epochs, ffnn_config_2, train_loader, val_loader, lr=lr)
ffnn_config_2.save_model("ffnn_config_2.pth")

In [None]:
#training RNN
train_and_evaluate_rnn(epochs, rnn_config_2, train_loader_rnn, val_loader_rnn, lr=lr)
rnn_config_2.save_model("rnn_config_2.pth") # Save our model!

In [None]:
#get first 10 predictions using ffnn and rnn
print("--- FFNN Predictions")
ffnn_preds = eval_preds(ffnn_config_2, val_vectorized)
print("\n--- RNN Predictions")
rnn_preds = eval_preds(rnn_config_2, val_vectorized_rnn)

print("--- Some examples of predictions and expected output ---\n")
for i in range(10):
  print("INPUT:", " ".join(val[i][0]))
  print("ffnn pred", ffnn_preds[i][1])
  print("rnn pred", rnn_preds[i][1])
  print("correct", ffnn_preds[i][2])
  print("")

### 3.1.2 Report
Describe configurations, report the results, and then perform a nuanced analysis

For the second configuration, we have 50 hidden dimensions, a learning rate of 0.001, and 10 epochs for both FFNN and RNN models. This is a fair comparison because, as described in the previous report, when we control the number of epochs, the computations done on both models amounts to about the same cost as both models need to be trained with 10 epochs. And since we are using a batch size of 1, the number of computations and iterations done on training are the same. Controlling the learning rate ensures that both models learn at the same rate, converging from a similar starting point.

## Result
When this configuration is used to train FFNN and RNN models, we found that the FFNN still learns better than RNN, where FFNN has a training accuracy of 0.98 and loss of 0.04, and RNN has a training accuracy of 0.85 and loss of 0.76. Both models perform fairly well on the validation set as well where FFNN has a validation accuracy of 0.89 and loss of 0.33, and RNN has a validation accuracy of 0.81 and loss of 1.009.

However, the FFNN and RNN models with a higher hidden dimension (h=512) in the previous configuration perform better than both of these model. The FFNN model with h=512 performs better than both the FFNN and RNN in this configuration as the loss of the previous FFNN is lower than the models in this configuration, whereas the RNN model performs better than the RNN in this configuration but not better than the FFNN model in this configuration.

## Nuanced Analysis - Qualitative Analysis

The FFNN model in this configuration predicts anger and sadness the best, whereas the RNN model predicts anger the best. The overall predictive power for both of these models have dropped from the previous configuration with a higher hidden dimension. This shows that a higher hidden dimension produces a higher performing model. A higher hidden dimension makes a better performing model because more information or pattern is captured from the training data when there is a bigger dimension of hidden layer.

However, we want to make sure that the model does not overfit the training data, and a really, really big hidden dimension could make the model overfit and hence, it would not produce a good model. At the same time, we don't want a small hidden dimension either because as we see in this configuration, h=50 performs worse than h=512. A good dimension of hidden layer needs to be picked.

As explored in the previous section, the variations between just two FFNN models and two RNN models of different configurations, respectively, are not the most interesting part that we want to analyze. Like any improvement on a model, the FFNN with higher hidden dimension (config 1) will predict better than the FFNN with lower hidden dimension (config 2). The same goes to the RNN models.  For example,

* `i guess we would naturally feel a sense of loneliness even the people who said unkind things to you might be missed` is anger. RNN of config 1 predicts this correctly, but RNN of config 2 predicts this as sadness. FFNN of both configs predict this as sadness. This shows that RNN with a smaller dimension loses some information on preserving the semantics of passive aggressive tone in this document, which the RNN with higher dimension preserved.

* `i know she feels helpless but that kiss that cuddle the hug every morning and the love you every night` is sadness. RNN of both cofigs predict this incorrectly as anger. FFNN of config 1 predicts correctly but FFNN of config 2 predicts this incorrectly as anger as FFNN with a smaller hidden dimension loses information on longing semantics of the document which makes it to have sadness as the emotion. The FFNN with higher dimension has this information preserved.

Across these 2 pairs of configurations, some examples are always predicted correctly by all 4 models. Some exmaples of those are:

* `i shouldnt feel threatened by that` is fear. 
* `i feel excluded and worthless my connection to everyone summarily cut off` is sadness.

These sort of documents are predicted correctly by all of these 4 models regardless of the hidden dimension configured because these sentences are simple, and have only words that are neutral or emotionally direct words like "exhausted" or "excluded" and "worthless". There is no ambiguity which requires a well-learned, nicely fit model to get correct predictions.

Some examples where RNN has predicted incorrectly across configurations:

* `i just act how i feel im becoming what ive always hated` is anger but is predicted as sadness in both config 1 and config 2. 
* `i know she feels helpless but that kiss that cuddle the hug every morning and the love you every night` is sadness but is predicted as fear in both configs.

This is because RNN captures words like "hated" and "helpless" to mean emotions that we generically mean as opposed to taking the semantic of the entire sentences. The context of these words is lost in RNN no matter how the hidden dimension changes in these two configurations. 

An example that FFNN consistently gets incorrect across these two configuration is:
* `i guess we would naturally feel a sense of loneliness even the people who said unkind things to you might be missed` which is anger but is predicted as sadness. Here, the FFNN is not capturing the passive aggressive context of the document in either configuration. But, the RNN model captures this information when a larger hidden dimension is used.

Hence, we can say that a higher hidden dimension is better than a smaller hidden dimension to create a better performing model, for both FFNN and RNN architecture. However, we want to refrain from overfitting the model, so a really large hidden dimension should be avoided. 

Next, we will study how learning rate affects the performance of a model for both FFNN and RNN. The result of configuration 2 will be used to compare with the result of configuration 3 for this study.

### 3.1.3 Configuration 3
Modify the code below for this configuration.

In [None]:
#constants for both model
h = 50
epochs = 10
lr = 0.01 # increased learning rate
ffnn_config_3 = FFNN(len(vocab), h, len(emotion_to_idx)).to(get_device())
rnn_config_3 = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())

In [None]:
#training FFNN
train_and_evaluate(epochs, ffnn_config_3, train_loader, val_loader, lr=lr)
ffnn_config_3.save_model("ffnn_config_3.pth")

In [None]:
#training RNN
train_and_evaluate_rnn(epochs, rnn_config_3, train_loader_rnn, val_loader_rnn, lr=lr)
rnn_config_3.save_model("rnn_config_3.pth") # Save our model!

In [None]:
#get first 10 predictions using ffnn and rnn
print("--- FFNN Predictions")
ffnn_preds = eval_preds(ffnn_config_3, val_vectorized)
print("\n--- RNN Predictions")
rnn_preds = eval_preds(rnn_config_3, val_vectorized_rnn)

print("--- Some examples of predictions and expected output ---\n")
for i in range(10):
  print("INPUT:", " ".join(val[i][0]))
  print("ffnn pred", ffnn_preds[i][1])
  print("rnn pred", rnn_preds[i][1])
  print("correct", ffnn_preds[i][2])
  print("")

### 3.1.3 Report
Describe configurations, report the results, and then perform a nuanced analysis

In the third configuration, we want to study the importance of choosing a good learning rate. How does learning rate influence the performance of FFNN and RNN? We will then compare this pair of configuration with Config 2 that also has a hidden dimension of 50 and 10 epochs, but 0.001 as it's learning rate for both FFNN and RNN. 

So, for the third configuration, we have 50 hidden dimensions, a learning rate of 0.01, and 10 epochs for both FFNN and RNN models. This is a fair comparison because when we control the number of epochs, the computations done on both models amounts to about the same cost as both models need to be trained with 10 epochs, as said in the previous two parts. Keeping the hidden dimension the same across FFNN and RNN ensures that the same amount of information is preserved in each forward and backward propagations of the models from epoch to epoch. 

## Result
When this configuration is used to train FFNN and RNN models, we found that both of the trained models perform much worse than the previous two configurations. For a fair comparison, we can compare this configuration with the previous configuration that has h=50, number of epochs=10 and learning rate=0.001, where the learning rate is the variable studied in this round. 

We see that with a higher learning rate, both models perform worse than those with a lower learning rate. The FFNN model has a final training accuracy of 0.9 and loss of 0.31. The validation accuracy is 0.77 and the loss is 0.74. This performance is lower than that of config 2. The RNN model, however, is especially really bad at learning as it has a final training accuracy of 0.21 and loss of NaN, which indicates that the model has an explosive gradient problem. As a result the validation accuracy is 0.21 and loss is NaN as well.Such scenario is possible because the learning rate determines how fast the gradient of the model optimizer advances, and hence how fast and how optimally the model learns. With a large learning rate, the optimizer can miss the optimal point and jump to the wrong direction and can cause an explosion in loss as the model gets worse. For the case of RNN, the scheduler that we use only reduces the learning rate after 5 epochs, there is an texplosion that happens in the 4th epoch, before the scheduler reduces the given learning rate by a factor of 10. From there on, the model gets worse and stays as such as the weight vectors have really small numbers now. The FFNN model, however, is not as bad as the RNN but still has worse performance than the FFNN and RNN of the previous configuration (with smaller learning rate), which says that the FFNN does not learn as well with a higher learning rate as well.

## Nuanced Analysis - Qualitative Analysis
FFNN performs pretty badly compared to the previous configuration which had a lower learning rate. This shows that a larger learning rate affects performance of the trained model negatively as a lot of information will be stepped over when the model advances at a fast rate. Some exmaples of predictions that did not work well:

* `i just act how i feel im becoming what ive always hated` is anger. The FFNN model predicts this to be sadness. This is because information about the nuanced semantic of the document is not captured by the fast advancing optimizer in the model. 

* `i guess we would naturally feel a sense of loneliness even the people who said unkind things to you might be missed` is anger but FFNN has been consistently getting this incorrect by predicting it to be sadness in all three configurations. This is an example that cannot be captured by the studied hidden dimensions and learning rates.

With a really bad RNN model in this configuration, the RNN model always predicts anger as the emotion for all documents. This means that no information or pattern is captured by the model from the training data after 10 epochs, and all documents are predicted as anger as that is the first emotion in the emotion dictionary and probability vector of the output of RNN.




## Part 3.2: Within-model comparison
To complement **Part 3.1: Across-Model Comparison**, in **Part 3.2: Within-Model Comparison**, you will need to study what happens when you change parameters within a model. To limit your workload, you need only do this for the RNN; and you may use at most one RNN model from the prior section.

In the prior section, we discussed _fair comparison_. Anothr aspect of rigorous experimentation in NLP (and other domains) is the _ablation study_. In this, we _ablate_ or remove aspects of a more complex model, making it less complex, to evaluate whether each aspect was neccessary. To be concrete, for this part, you should train 4 variants of the RNN model and describe them as we do below:

1. Baseline model
2. Baseline model made more complex by modification $A$ (e.g. changing the hidden dimensionality from $h$ to $2h$).
3. Baseline model made more complex by modification $B$ (where $B$ is an entirely distinct/different update from $A$).
4. Baseline model with both modificatons $A$ and $B$ applied.

Under the framing of an ablation study, you woud describe this as beginning with model 4 and then ablating (i.e. removing) each of the two modifications, in turn; and then removing both to see if they were genuinely neccessary for the performance you observe.

Once you describe each of the four models, report the quantitative accuracy as in the previous section. Conclude by performing the **opposite** nuanced analysis from the one you did in the previous section (i.e. if in **Part 3.1: Across-Model Comparison** you did _Nuanced quanitative analysis_, for **Part 3.2: Within-Model Comparison** perform a _Nuanced qualitative analysis_ and vice versa).

### 3.2.1 Configuration 1
Modify the code below for this configuration.

In [None]:
# hyperparameters from fixed rnn model
h = 250
lr = 0.001
epochs = 10
baseline_rnn = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
train_and_evaluate_rnn(epochs, baseline_rnn, train_loader_rnn, val_loader_rnn, lr=lr)
baseline_rnn.save_model("baseline_rnn.pth")

In [None]:
print("Baseline RNN")
baseline_preds = eval_preds(baseline_rnn, val_vectorized_rnn)

### 3.2.1 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

Report written in section 3.2 below.

### 3.2.2 Configuration 2
Modify the code below for this configuration.

In [None]:
# train and evaluate with early stopping
def train_and_evaluate_rnn_es(number_of_epochs, model, train_loader, val_loader, lr=0.001):
	optimizer = optim.Adam(model.parameters(), lr=lr)
	scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
	prev_loss = 0
	for epoch in trange(number_of_epochs, desc="Epochs"):
		tr_loss = train_epoch_rnn(model, train_loader, optimizer)
		val_loss = evaluation_rnn(model, val_loader, optimizer)
    #early stopping when validation loss starts becoming larger than training loss
		if tr_loss <= val_loss:
			return
		scheduler.step()
	return

In [None]:
# early stopping
h = 250
lr = 0.001
epochs = 10
mod_a_rnn = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
train_and_evaluate_rnn_es(epochs, mod_a_rnn, train_loader_rnn, val_loader_rnn, lr=lr)

In [None]:
print("Early Stop RNN")
mod_a_preds = eval_preds(mod_a_rnn, val_vectorized_rnn)

### 3.2.2 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

Report written in section 3.2 below. 

### 3.2.3 Configuration 3
Modify the code below for this configuration.

In [None]:
# reverse inputs
h = 250
lr = 0.001
epochs = 10
mod_b_rnn = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
train_vectorized_rnn_rev = train_vectorized_rnn[::-1]
val_vectorized_rnn_rev = val_vectorized_rnn[::-1]
train_loader_rnn_rev, val_loader_rnn_rev = get_data_loaders(train_vectorized_rnn_rev, val_vectorized_rnn_rev, batch_size=1)
train_and_evaluate_rnn(epochs, mod_b_rnn, train_loader_rnn_rev, val_loader_rnn_rev, lr=lr)

In [None]:
print("Reversed Input RNN")
mod_b_preds = eval_preds(mod_b_rnn, val_vectorized_rnn_rev)

### 3.2.3 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

Report written in section 3.2 below.

### 3.2.4 Configuration 4
Modify the code below for this configuration.

In [None]:
# early stopping and reverse input
h = 250
lr = 0.001
epochs = 10
both_mod_rnn = RNN(word_vec_dim, h, len(emotion_to_idx)).to(get_device())
train_and_evaluate_rnn_es(epochs, both_mod_rnn, train_loader_rnn_rev, val_loader_rnn_rev, lr=lr)

In [None]:
print("Early Stopping + Reversed Input RNN")
both_mod_preds = eval_preds(both_mod_rnn, val_vectorized_rnn_rev)

### 3.2.4 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

Report written in section 3.2 below.

### 3.2 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

**Modification 1:** The first modification we chose to use for our model was including early stopping. Since if we are training our model for too many epochs, it can cause overfitting where the training accuracy increases while the validatio accuracy stays the same or decreases. Thus, introducing early stopping can prevent that and improve the model. So, we wanted to see if our number of epochs was good or early stopping actually helps with the accuracy.

**Modification 2:** The second modification we chose to do was reversing the inputs when feeding them into the RNN. We chose this modification since for many NLP applications, feeding the input backward into the RNN tends to improve the result (our reasoning is mentioned in 4.2). We wanted to see if for our sentiment analysis task, if reversing the input actually makes much of a difference.

**Ablation Study with Nuanced Quantitative Analysis:** In order to perform a quantitative analysis, we are using accuracy, true positive rate and predictive power across each of the emotion classes. A higher true positive rate means for that specific emotion, the model did well. A higher predictive power suggests that when the model predicts that emotion, the predictions are correct more often than not.

Below is the table containing the **accuracy** of each model:

| Model | Accuracy | 
| --- | --- | 
| Baseline RNN | 0.875  |  
| RNN with early stopping| 0.791 |
| RNN with reversed input | 0.877 |
| RNN with early stopping and reversed input| 0.806|

Below is the table containing the **true positive rate** for each of the models for each emotion:

| Emotion Class | Baseline RNN  | RNN with early stopping | RNN with reversed inputs | RNN with both |
| --- | --- | --- | --- | --- |
|Anger | 0.89 | 0.83 | 0.89 | 0.86 |
| Fear | 0.84 | 0.67 | 0.87 | 0.64 |
| Joy | 0.87 | 0.83 | 0.88 | 0.89 |
| Love| 0.88 | 0.78 | 0.88 | 0.80 |
| Sadness | 0.89 | 0.81 | 0.87 | 0.80 |


Below is the table containing the **predictive power** for each of the models for each emotion:

| Emotion Class | Baseline RNN  | RNN with early stopping | RNN with reversed inputs | RNN with both |
| --- | --- | --- | --- | --- |
|Anger | 0.88 | 0.80 | 0.88 | 0.79 |
| Fear | 0.86 | 0.90 | 0.84 | 0.91 |
| Joy | 0.86 | 0.74 | 0.85 | 0.76 |
| Love| 0.86 | 0.75 | 0.88 | 0.82 |
| Sadness | 0.90 | 0.79 | 0.92 | 0.81 |


Model 4 (with both modifications) is the baseline model with both early stopping and reversed input. When we ablate one of these modifications we have our model 2 with only the early stopping and model 3 with only reversed inputs. Our model 1 is simply without any of these modifications (i.e the baseline model). For all these models the hyperparameters (number of hidden dimensions, learning rate, and number of epochs) were kept the same in order to make sure we only get the effect of the modification we are making. 

*Model 4 (both) vs Model 2 (early stopping):*

In terms of accuracy, model 2 does slightly worse than model 4. Thus, it means that ablating the reversed inputs makes the model less accurate meaning it is an important feature of the model. When looking across all the emotion classes, model 4 does much better in predicting texts that are labeled as joy with a true positive rate of 0.89 while model 2 has a true positive rate of 0.83. However, for fear, model 2 does slightly better than model 4 but they both have a true positive rate below 0.7. In terms of predicitve power, both models have similar quantities. The largest difference is for texts labeled as love. The predictive power is much higher for model 4 compared to model 2 which suggests that when model 4 predicts a text to be the label love, it is more often accurate than when model 2 predicts a text to be of label love. 

*Model 4 (both) vs Model 3 (reversed inputs):*

Model 3 has a an accuracy of 87.7% while model 4 has an accuracy of 80.6%. The 7% difference is a big difference. In terms of true positive rate across all the emotions, model 3 does much better than model 4. The biggest difference is for texts that are labeled with fear, the difference is true positive rate is 0.23 which is quite a lot. This means that when we are stopping early, the model has not learned enough to detect texts are labeled as fear. When looking at the predictive power for the emotion fear, model 4 has a higher predictive power of 0.91 while model 3 has a predictive power of 0.84. This implies that whenever model 4 predicts a text as fear, it gets it right more often than model 3 but since we also know that the true positive rate is low for model 4 for fear, it must be the case that model 4 does not classify texts as fear too often compared to model 2. Besides fear, model 3 has a higher predictive power for all the other emotions compared to model 4.

*Model 4 (both) vs Baseline:* 

The baseline model is much more accurate than the model 4 with both modification. The baseline accuracy is 87.5% while model 4 accuracy is 80.6%. Similar to model 3, baseline model has much higher true positive rate for fear compared to model 4. The difference is 0.2. However, model 4 has a higher predictive power than baseline for fear. This, again, means that model 4 does not classify texts as fear as often as the baseline, hence having a lower true positive rate but higher predictive power. For the other emotions, the baseline has a higher true positive rate meaning it classifies them correctly more often than model 4. Baseline also has a higher predictive power for the other emotions besides fear.

*Model 2 (early stopping) vs Baseline:* 

Model 2 accuracy is 79.1% while the baseline has an accuracy of 87.5%. The difference in accuracy indicates that removing early stopping makes the model much more accurate. Across all the emotions, model 2 has a lower true positive rate compares to the baseline. Specifically, for both fear and joy, model 2 has a much lower true positive rate meaning that model 2 has a harder time classifying texts that are of label fear or joy correctly compared to the other emotions. Model 2 has lower predictive power for all the emotions except fear compared to the baseline. This suggests that when model 2 predicts a text to be an emotion, it is predicting them wrong more often than the baseline model. For fear, the predictive power is higher while the true positive rate is lower, which implies that model 2 does not predict a text to be of label fear as often. So, when it does, it gets them correct resulting in a higher predictive power.

*Model 3 (reversed inputs) vs Baseline:*

The baseline model and model 3 have very similar accuracy, differing in only 0.2%. Looking across all emotions, they both have high true positive rates. For fear and joy, however, model 3 seems to be doing slightly better compared to the baseline model. For sadness, model 3 does slightly worse than the baseline model. The predictive powers across all the emotions are also very high for both of the models. However, the baseline model predictive powers seem more stable across all the emotions than model 3. These quantities suggest that ablating reversed inputs does not make too much of a difference.

Overall, it seems to be the case that removing early stopping makes the model much better, especially for the emotion fear. When the model is stopped early, it does not seem to have acquired enough information to classify a text as fear. Since our baseline models only included 10 epochs, it is most likely not the case that it is overfitting but if the validation loss is slightly higher than the training loss after an iteration, the training will be stopped. Choosing a different early stopping condition might have been better but since we only have 10 epochs to begin with, early stopping does not seem to be necessary. Removing the Reversal of the inputs makes the model slightly worse but when looking across all the statistics, it seems to indicate that it does not make too much of a difference.


# Part 4: Questions
In **Part 4**, you will need to answer the three questions below. We expect answers tobe to-the-point; answers that are vague, meandering, or imprecise **will receive fewer points** than a precise but partially correct answer.

## 4.1 Q1
Earlier in the course, we studied models that make use of _Markov_ assumptions. Recurrent neural networks do not make any such assumption. That said, RNNs are known to struggle with long-distance dependencies. What is a fundamental reason for why this is the case?

RNNs struggle with long-distance dependencies because when we backpropagate through time to calculate weight updates for each of the previous time steps, we take the gradient of a later layer and multiply it with the previous layer to get the gradient of the previous layer. So, as we keep multiplying across layers like this, until we reach the initial layer, we now get a really small gradient for the initial layer because of the repeated multiplications of smaller and smaller numbers across layers. This means that the weights for the previous layers are not updated properly which diminishes the dependencies on the earlier information. This is also known as the vanishing gradient problem.

## 4.2 Q2
In applying RNNs to tasks in NLP, we have discovered that (at least for tasks in English) feeding a sentence into an RNN backwards (i.e. inputting the sequence of vectors corresponding to ($course$, $great$, $a$, $is$, $NLP$) instead of ($NLP$, $is$, $a$, $great$, $course$)) tends to improve performance. Why might this be the case?

For a language like English, where the sentences take the form of subject-verb-object, the subject and the verb of the sentence give us more information about tenses, gender, singularity/plurality of the subject, as well as other part-of-speech details than the object does. So, when we input such sentences in the original order into RNN, we tend to lose these information acquired at the beginning of the sentence as the model advances, especially if the input sentence is long. So, if we input a reversed sentence, we will be able to preserve the essential information we get at the beginning of the sentence because we process them later, and these information will have higher weights when calculating the output in the model. Hence, reversed inputs tend to improve the performance of RNN.

## 4.3 Q3
In using RNNs and word embeddings for NLP tasks, we are no longer required to engineer specific features that are useful for the task; the model discovers them automatically. Stated differently, it seems that neural models tend to discover better features than human researchers can directly specify. This comes at the cost of systems having to consume tremendous amounts of data to learn these kinds of patterns from the data. Beyond concerns of dataset size (and the computational resources required to process and train using this data as well as the further environmental harm that results from this process), why might we disfavor RNN models?

We might disfavor RNN models because learning only through a large amount of dataset will make the model be biased which we do not want. This is because datasets from the real world reflects real world biases and the model will pick these patterns. For example, gender-role stereotypes can be generally reflected in large datasets, where these data usually come from historically collected information. If the RNN learns features by itself only through the dataset given, it will pick up these stereotypes creating a gender-biased model.






# Part 5: Miscellaneous
List the libraries you used and sources you referenced and cited (labelled with the section in which you referred to them). Include a description of how your group split
up the work. Include brief feedback on this asignment.

**References and Citation**

We used a pretrained GloVe word embeddings dataset from [Stanford NLP GloVe project site](https://nlp.stanford.edu/projects/glove/) to pre-process the input for RNN training. We specifically used the Twitter data with vector dimensions of 50, which is available for public use. (used in section 2.1)

We used pytorch to implement our RNN model. We used the nn module in pytorch to initialize our weight matrices. (used in section 2.1)

We refered to Yoav Goldberg, Neural Network Methods for Natural Language Processing (Chapter 5), to understand how to work with the hyperparameters of the neural network. (used in section 2.2.7)

**Work**

We met up and worked on the project through pair-programming. 

**Assignment Feedback**

This project was very interesting and taught us a lot about neural networks and how to use pytorch properly to implement them. 

**Each section must be clearly labelled, complete, and the corresponding pages should be correctly assigned to the corresponding Gradescope rubric item.** If you follow these steps for each of the 4 components requested, you are guaranteed full credit for this section. Otherwise, you will receive no credit for this section.

# Part 6: Kaggle Submission

In [19]:
# Create Kaggle submission function
kaggle_model = loaded_model
#rnn_document_preprocessor = lambda x: rnn_preprocessor(x, True) # This is for your RNN
file_name = "submission.csv"
ffnn_document_preprocessor = lambda x: convert_to_vector_representation(x, word2index, True)
rnn_document_preprocessor = lambda x: rnn_preprocessing(x, True)

In [20]:
def generate_submission(filename, model, document_preprocessor, test):
    test_vectorized = document_preprocessor(test)
    with Path(file_name).open("w") as fp:
        fp.write("Id,Predicted\n")
        for idx, input_vector in tqdm(enumerate(test_vectorized), total=len(test_vectorized)):
            output = model(torch.Tensor(input_vector).unsqueeze(0).to(get_device())).cpu()#.squeeze(0)
            _, pred = torch.max(output, 1)
            fp.write(f"{idx},{int(pred)}\n")
    return

In [None]:
generate_submission(file_name, kaggle_model, rnn_document_preprocessor, test)

# Live running demo

In [None]:
#@title Emotion Detection
#@markdown Enter a sentence to see the emotion
input_string = "I am so joyful!" #@param {type:"string"}
model_type = "ffnn_config_1" #@param ["baseline_ffnn", "baseline_rnn", "mod_a_rnn", "mod_b_rnn", "both_mods_rnn", "ffnn_config_1", "rnn_config_1", "ffnn_config_2", "rnn_config_2", "ffnn_config_3", "rnn_config_3"]
from IPython.display import HTML

output = ""

# BAD THING TO DO BELOW!!
model_used = globals()[model_type]

with torch.no_grad():
    if "ffnn" in model_type:
        vec_in = ffnn_document_preprocessor([[input_string]])[0]
        model_output = model_used(torch.Tensor(vec_in).unsqueeze(0)).cpu().squeeze(0)
    else:
        # RUN MODEL
        vec_in = rnn_document_preprocessor([[input_string]])[0]
        model_output = model_used(torch.Tensor(vec_in).unsqueeze(0)).cpu().squeeze(0)
    #print(torch.cat([torch.Tensor(z).unsqueeze(0) for z in model_inputs]).unsqueeze(0).shape)
    #model_output = model_used(torch.cat([torch.Tensor(z).unsqueeze(0) for z in model_inputs]).unsqueeze(0))
    #print(model_output.shape)
predicted = torch.argmax(model_output)
# MAP BACK TO EMOTION
# print(int(predicted))
emotion = idx_to_emotion[int(predicted)]

# Generate nice display
output += '<p style="font-family:verdana; font-size:110%;">'
output += " Input sequence: "+input_string+"</p>"
output += '<p style="font-family:verdana; font-size:110%;">'
output += f" Emotion detected: {emotion}</p><hr>"
output = "<h3>Results:</h3>" + output

display(HTML(output))

In [37]:
%%capture
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc

In [38]:
%%capture
# the red text is a placeholder! Change it to your directory structure!
!cp 'drive/My Drive/Colab Notebooks/4740_FA20_p3_im324_kl866.ipynb' ./ 

In [39]:
# the red text is a placeholder! Change it to the name of this notebook!
!jupyter nbconvert --to PDF "4740_FA20_p3_im324_kl866.ipynb"

[NbConvertApp] Converting notebook 4740_FA20_p3_im324_kl866.ipynb to PDF
[NbConvertApp] Writing 181635 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 165666 bytes to 4740_FA20_p3_im324_kl866.pdf
