# Assignment 1 - Sentiment analysis

## 0. Instructions

To complete the assignment, please do the following steps (all three are required to get full credit):


### 0.1. Notebook (*.ipynb)

Upload the filled-in notebook with your answers (this file) to Canvas.
Please enter your answers inside this notebook in the designated cells.

### 0.2. Python Scripts (*.py)

In the *Practical* part of this notebook you will be asked to implement different classifier architectures. Upload to Canvas the code that implements your solutions to these parts in the form of *.py files (not *.ipynb notebooks), one per model:

  - ``classifier_lr.py`` -  an LR based classifier
  - ``classifier_ffnn.py`` - a FFNN based classifier
  - ``classifier_rnn.py`` - an RNN based classifier

  These scripts should have the specific structure shown in the baseline solution [here](https://github.com/skoltech-nlp/filimdb_evaluation/blob/master/classifier.py). So, you should implement your ``train`` and ``classify`` functions (and ``pretrain`` if needed). Your model should be implemented as a dedicated class or function in this script (if you add any external dependencies, make sure that everything is imported correctly and that the results are reproducible).
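As a rough orientation, a script of this shape might look like the sketch below. Only the function names ``pretrain``, ``train``, and ``classify`` come from the text above; the argument lists, the label strings ``'pos'``/``'neg'``, and the majority-class "model" are assumptions made for illustration, so please check the linked baseline for the actual interface.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def pretrain(texts: List[str]) -> Optional[Any]:
    """Optional hook for unsupervised pretraining; unused in this sketch."""
    return None


def train(texts: List[str], labels: List[str],
          pretrain_params: Optional[Any] = None) -> Dict[str, str]:
    """Fit a trivial majority-class model (a placeholder for LR/FFNN/RNN)."""
    majority_label = Counter(labels).most_common(1)[0][0]
    return {"majority_label": majority_label}


def classify(texts: List[str], params: Dict[str, str]) -> List[str]:
    """Return one predicted label per input text."""
    return [params["majority_label"] for _ in texts]


# Tiny demo with made-up data (the label strings are assumed, not prescribed).
demo_params = train(["great movie", "loved it", "boring"], ["pos", "pos", "neg"])
demo_preds = classify(["what a film", "meh"], demo_params)
```

Keeping all state in the object returned by ``train`` (rather than in module-level globals) is one way to make the script independent of the machine it runs on.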

  Each of these Python classifiers will be renamed to "classifier.py" and automatically evaluated using the evaluate.py script. Please make sure that they do not contain any dependencies that are specific to your system.

  *Important*: to make sure everything works, please use ONLY the following software configuration (no matter which operating system you use): Anaconda 2020.07 distribution for Python 3.8 and PyTorch 1.3. The preferred way to install PyTorch is "conda install -c pytorch pytorch" and Torchtext is "conda install -c pytorch torchtext". No additional libraries should be used: Anaconda already provides a sufficient number of them. If you need something, select it from those available. Check that there are no extra dependencies by running your code in a fresh virtual environment. 



### 0.3 Shared task at Codalab

After implementing the model architectures, you are asked to participate in the [competition](https://competitions.codalab.org/competitions/30517) to solve the **Sentiment Analysis for IMDb Movie Review** task using your implemented code. 

  You should use your classifier scripts from the previous part to train, validate, and generate predictions for the public and private test sets. For this, use the [``evaluate.py``](https://github.com/skoltech-nlp/filimdb_evaluation/blob/master/evaluate.py) script. It will produce predictions (preds.tsv) for the dataset and score them if the true labels are present. You can use these scores to evaluate your model on the dev set and choose the best one. Be sure:

  - To download the [dataset](https://github.com/skoltech-nlp/filimdb_evaluation/blob/master/FILIMDB.tar.gz) and unzip it in the same folder where ``evaluate.py`` is.
  - To put your ``classifier.py`` and ``evaluate.py`` scripts in the same folder. 

  The models can be trained on your local machine on CPU. However, if you work in Colab, you can download the data and scripts with the ``wget`` command and run them from notebook cells. 

For each of LR, FFNN, and RNN, take the model with the best results on the dev set and upload the resulting TSV file with your predictions (preds.tsv), packed in a ``.zip`` archive, to the public leaderboard of the competition. 

  *Important*: You have to upload predictions based on the LR model in the sub-task for LR (https://competitions.codalab.org/competitions/25623), predictions based on the FFNN model in the sub-task for FFNN (https://competitions.codalab.org/competitions/25623), and predictions based on the RNN model in the sub-task for RNN (https://competitions.codalab.org/competitions/25623). This way each track is a fair competition (only models of the same type are compared). Your scores will not be taken into account if you submit to the wrong sub-task, e.g. LR predictions to the FFNN or RNN sub-task!

Please provide here in the notebook your user name in the Codalab competition so that we can recognize you in the leaderboard.


**YOUR USERNAME IN THE CODALAB LEADERBOARD:**

```

ENTER HERE

```

## 1. Theoretical part

This part contains some questions about the models and concepts.

### 1.1 Logistic Regression

Let us introduce the following notation:

$(x_{\{1\}}, y_{\{1\}}), \ldots, (x_{\{N\}}, y_{\{N\}})$ --- train set of size N, 

$x_{\{i\}} \in\mathbb{R}^M$ --- feature vector of the $i^{th}$ sample from train set, $M$ --- number of features, 

$y_{\{i\}} \in \{0, 1\}$ --- label (class) of the $i^{th}$ sample, 

$w\in\mathbb{R}^{M+1}$ --- weight vector in LogReg.

_**NB:**_ the linear transformation of $x_{\{i\}}$ is as follows: 
$$
w_0 + w^Tx_{\{i\}} = w_0+w_1*x_{\{i\},1}+\ldots+w_M*x_{\{i\},M},
$$

where $w_0$ stands for intercept term (bias).

For convenience of implementation we will set $x_{\{i\},0} = 1$. In other words, we will prepend 1 to each vector $x_{\{i\}}$. Therefore, the linear transformation becomes:

$$w_0*1+w_1*x_{\{i\},1}+\ldots+w_M*x_{\{i\},M} \equiv w^T[1;x_{\{i\}}]$$
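The equivalence of the two forms can be checked numerically. The snippet below is a minimal sketch with random data; the variable names and sizes are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                  # number of features (arbitrary here)
x = rng.normal(size=M)                 # one feature vector x_{i}
w = rng.normal(size=M + 1)             # w[0] plays the role of the intercept w_0

explicit = w[0] + w[1:] @ x            # w_0 + w_1*x_1 + ... + w_M*x_M
x_aug = np.concatenate(([1.0], x))     # [1; x]
augmented = w @ x_aug                  # w^T [1; x]

# For a whole feature matrix X of shape (N, M), prepend a column of ones.
X = rng.normal(size=(3, M))
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (N, M + 1)
scores = X_aug @ w                                 # one linear score per sample
```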


1. Find the derivative of the sigmoid function $\sigma(z)$ and express it in terms of sigmoid, considering $z$ to be scalar 
$$ 
\sigma(z) = {\frac {1}{1+e^{-z}}}
$$

```

PLEASE ENTER HERE YOUR ANSWER 

```

2. Prove that:  

$$ \sigma(-z) = 1 - \sigma(z)$$

```

PLEASE ENTER HERE YOUR ANSWER 

```

3. Write out the formula of hypothesis $h_w(x)$ for logistic regression.

```

PLEASE ENTER HERE YOUR ANSWER 

```

4. Plot the values of the Binary Cross-Entropy error function for one sample from the positive class and one sample from the negative class as a function of the logistic regression output $\hat y=h_w(x)$. What is the value of the loss function given zero weights (right after initialization)?
$$ bce(y, \hat y)= -y \log \hat y - (1-y) \log (1 - \hat y)$$

```

PLEASE ENTER HERE YOUR ANSWER 

```

5. Calculate the gradient of the cost function $\nabla_w L(w, x_{\{1\}},\ldots,x_{\{N\}})$ for binary (two-class) logistic regression. As the cost function, use cross-entropy with $L_2$ regularization:
$$ L(w,x_{\{1\}},\ldots,x_{\{N\}}) = - \frac1{N} \sum_{i=1}^N(y_{\{i\}} \log h_w(x_{\{i\}}) + (1 - y_{\{i\}}) \log (1 - h_w(x_{\{i\}}))) + \alpha\sum_{j=1}^M(w_j)^2$$

```

PLEASE ENTER HERE YOUR ANSWER 

```

_**NB:**_ The regularization component $\sum_{j=1}^M(w_j)^2$ does not include $w_0$, as it is responsible for the overall shift only. It may take any value, whereas $L_2$ regularization seeks to minimize the $w_j$ values.

6. Write out the update formula for the parameter vector $w$ under stochastic gradient descent optimization.

```

PLEASE ENTER HERE YOUR ANSWER 

```

7. Prove that the binary cross-entropy cost function for binary logistic regression has only one minimum.

```

PLEASE ENTER HERE YOUR ANSWER 

```

8. Show that minimizing the Binary Cross-Entropy loss for logistic regression is equivalent to minimizing the following function (the sum over samples and the regularization component are omitted):
$$ \mathrm{softplus}(-tw^Tx),$$

where

$$\mathrm{softplus}(x)=\log(1+e^x)$$

$$t=2y-1 \in \{-1, +1\}$$


```

PLEASE ENTER HERE YOUR ANSWER 

```

### 1.2 Feed-forward Neural Network

Let us use the following notation:

$(x_{\{1\}}, y_{\{1\}}), \ldots, (x_{\{N\}}, y_{\{N\}})$ --- train set of size $N$

$x_{\{i\}} \in\mathbb{R}^M$, where $i$ is the review index, 

$M=s^{(0)}$ --- number of features or dictionary size, 

$y_{\{i\}} \in \{0, 1\}$, $s^{(l)}$ - number of neurons in the layer $l$, 

$W^{(l)}$ --- parameter matrix of the $l^{th}$ layer of size $s^{(l)} \times (s^{(l-1)}+1)$ (the $+1$ accounts for the bias), where

$l \in \{1,2,\ldots,L\}$, $L$ --- number of layers (the number of hidden layers is equal to $L-1$)


Feed-forward Neural Network example with two layers (one hidden layer):

![img](http://panchenko.me/figures/nn.jpg)


Feed-forward propagation:

$a^{(0)} = x_{\{i\}} $

$z^{(1)} = W^{(1)} [1, a^{(0)}] $

$a^{(1)} = tanh(z^{(1)})$

$z^{(2)} = W^{(2)}[1, a^{(1)}] $

$a^{(2)} = softmax(z^{(2)}) $
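The forward propagation steps above can be traced in a few lines of NumPy. This is only an illustrative sketch: the sizes `M`, `H`, `K` and the random weights are arbitrary assumptions, and a real implementation would use PyTorch layers instead.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax over a 1-D vector of scores."""
    e = np.exp(z)
    return e / e.sum()


def forward(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Forward pass of a network with one hidden layer (L = 2)."""
    a0 = x                                     # a^(0) = x
    z1 = W1 @ np.concatenate(([1.0], a0))      # z^(1) = W^(1) [1, a^(0)]
    a1 = np.tanh(z1)                           # a^(1) = tanh(z^(1))
    z2 = W2 @ np.concatenate(([1.0], a1))      # z^(2) = W^(2) [1, a^(1)]
    return softmax(z2)                         # a^(2) = softmax(z^(2))


rng = np.random.default_rng(0)
M, H, K = 5, 3, 2                              # input, hidden, output sizes (toy values)
W1 = rng.normal(scale=0.1, size=(H, M + 1))    # s^(1) x (s^(0) + 1)
W2 = rng.normal(scale=0.1, size=(K, H + 1))    # s^(2) x (s^(1) + 1)
probs = forward(rng.normal(size=M), W1, W2)    # class probabilities for one sample
```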


Backpropagation:
...

1. Calculate the derivative of the $\tanh(z)$ function and express it in terms of $\tanh(z)$, considering $z$ to be a scalar. Transform your answer so that the exponential is evaluated only once when computing $\tanh(z)$ and its derivative.
$$ \tanh(z) = {\frac {e^{z}-e^{-z}}{e^{z}+e^{-z}}}$$


```

PLEASE ENTER HERE YOUR ANSWER 

```

2. Write out the cross entropy loss function $L(W^{(1)}, \ldots, W^{(L)},x_{\{1\}},\ldots,x_{\{N\}})$ for neural network with one hidden layer ($L=2$), and then generalize it for neural network with $L-1$ hidden layers and for multiclass classification (with $K$ classes). Use $\tanh(z)$ activation function for hidden layer,  $softmax(z)$ for output layer.

```

PLEASE ENTER HERE YOUR ANSWER 

```

3. Demonstrate that $softmax(z+c)=softmax(z)$, where $c$ is a vector with equal components.

```

PLEASE ENTER HERE YOUR ANSWER 

```

4. How many parameters does the neural network have? The input vector size is $M$, the output vector size is $K$, and the number of hidden neurons is $H$.

```

PLEASE ENTER HERE YOUR ANSWER 

```

5. Provide the formula for $\delta^{(L)}$ --- the gradient of the loss function with respect to the pre-activation of the last layer, $z^{(L)}$.

```

PLEASE ENTER HERE YOUR ANSWER 

```

6. Provide the formula for $\delta^{(l)}$ --- the gradient of the loss function with respect to $z^{(l)}$, expressed through $\delta^{(l+1)}$.

```

PLEASE ENTER HERE YOUR ANSWER 

```

7. Provide the formula for $\nabla_{W^{(l)}} L$ --- the gradient of the loss function with respect to the weights $W^{(l)}$, using $\delta^{(l)}$.

```

PLEASE ENTER HERE YOUR ANSWER 

```

### 1.3 Word Embeddings

1. Write down the objective function of the Skip-Gram word embedding model with negative sampling (SGNS).

```

PLEASE ENTER HERE YOUR ANSWER 

```

2. Write down derivatives with respect to the parameters (weights) of this loss function.

```

PLEASE ENTER HERE YOUR ANSWER 

```

### 1.4 Recurrent Neural Networks

#### 1.4.1 Computing probability

Consider a sequence $x_1, x_2, ..., x_T$, where $x_i$ is an index of token in the vocabulary $V$ and a two-layer language model based on the LSTM neural network. Let's assume that this network generates an estimation $\hat{y_i}[w] = \hat{P}(x_i=w|x_1, ..., x_{i-1})$ of probability that token $w$ follows the sequence of tokens $x_1, ..., x_{i-1}$. 

1. Write formulas for the forward pass of this network: how to estimate the $\hat{y_i}[w] = \hat{P}(x_i=w|x_1, ..., x_{i-1})$? For simplicity, write first formulas for the individual layers, then use them to write the full formula for the forward pass.


```

PLEASE ENTER HERE YOUR ANSWER 

```

2. How can the probability of a sequence of tokens $x_1, x_2, ..., x_T$ be estimated, i.e., what is $\hat{P}(x_1,...,x_T)$?

```

PLEASE ENTER HERE YOUR ANSWER 

```

#### 1.4.2 Vanishing gradient problem



What is the vanishing / exploding gradient problem in Elman recurrent neural networks? Write down the update equations for the Elman RNN and explain what causes the vanishing / exploding gradient issue.

```

PLEASE ENTER HERE YOUR ANSWER 

```

How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network? Write down the equations of LSTM and explain technically why this scheme is better than the Elman recurrent neural network.

```

PLEASE ENTER HERE YOUR ANSWER 

```

## 2. Practical part





The goal of this part is to implement text classifiers based on logistic regression (LR), a feed-forward neural network (FFNN), or a recurrent neural network (RNN), where pre-trained word embeddings are used as features. For LR and FFNN you simply need to average the word embeddings of a sentence (perform average pooling of word vectors) and then apply the LR/FFNN to the resulting representation. In the case of RNN you feed one embedding per input token.
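Average pooling of word vectors can be sketched as follows. The whitespace tokenizer, the toy two-dimensional vectors, and the choice to return a zero vector for texts without known words are all assumptions for illustration; real pre-trained embeddings would be loaded from disk.

```python
import numpy as np


def tokenize(text: str) -> list:
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def average_embedding(tokens: list, vectors: dict, dim: int) -> np.ndarray:
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy "pre-trained" embeddings (made up for this example).
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

doc_vec = average_embedding(tokenize("Good movie"), toy_vectors, dim=2)
unknown_vec = average_embedding(tokenize("xyzzy"), toy_vectors, dim=2)
```

The resulting fixed-size vector is what the LR or FFNN model receives as input, regardless of the length of the review.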

### 2.1 Logistic Regression

#### 2.1.1 Implementation of the model

The goal of this section is to implement a text categorization model based on logistic regression using **pytorch**. Following the steps below, you can complete the corresponding parts in your ``classifier_lr.py`` script and apply the code to the sentiment classification task. 

**Important**: You are not expected to implement logistic regression training from scratch in this (updated) version of the task to get full scores. You are free to use the optimization package of pytorch. The implementations in this assignment could and should be based on the models from the second and third seminars (and your variants of them). 

To implement the model, you can follow the steps below. 

1. Load the dataset, preprocess and tokenize it. Build a dictionary with unique tokens from the train set.

2. To train logistic regression, the train and test data should be converted to matrices of size $N \times M$ ($N$ -- number of reviews, $M$ -- number of features). You can use either a bag-of-words feature representation, in which case the features are words, or word embeddings as the initial embedding matrix. In the latter case, first load the embeddings from disk, then select the subset of embeddings for the words that are actually present in the data, and finally set the Embedding layer's weight matrix to the loaded subset.

3. Add a sigmoid function.

4. Initialization. Write weights initialization function.

5. Forward pass. Write a function that computes the objective function of logistic regression.

6. Training loop. Write a function that performs gradient descent. It is recommended to use mini-batch gradient descent: split your train data into mini-batches of 100-500 samples, and on each iteration calculate the gradient on the current batch only rather than on the whole train set. This reduces the computation time of one iteration, and the model will converge faster.
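The steps above can be put together roughly as in the sketch below. It uses synthetic data in place of the review features, and the batch size, learning rate, and number of epochs are arbitrary values chosen so that the toy example converges quickly; they are not the settings requested in this assignment.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for an (N, M) feature matrix and binary labels.
N, M = 512, 20
X = torch.randn(N, M)
true_w = torch.randn(M)
y = (X @ true_w > 0).float()

model = nn.Linear(M, 1)                       # computes w_0 + w^T x
loss_fn = nn.BCEWithLogitsLoss()              # applies sigmoid + binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 128
epoch_losses = []
for epoch in range(30):
    perm = torch.randperm(N)                  # reshuffle the samples every epoch
    running = 0.0
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        optimizer.zero_grad()
        logits = model(X[idx]).squeeze(1)
        loss = loss_fn(logits, y[idx])        # loss on the current batch only
        loss.backward()
        optimizer.step()
        running += loss.item() * len(idx)
    epoch_losses.append(running / N)          # average loss over the epoch

with torch.no_grad():
    train_acc = ((model(X).squeeze(1) > 0).float() == y).float().mean().item()
```

Storing one loss value per epoch, as `epoch_losses` does here, gives exactly the kind of series that can later be plotted as a training curve.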
 

#### 2.1.2 Learning rate 

Set the learning rate to **1e-3** and the $L_2$ regularization coefficient to $\alpha$=**1e-5**. Train logistic regression on the train set. Plot the loss function values and accuracies on the train and validation sets during training.

To plot these curves, use the matplotlib library ([very short](http://cs231n.github.io/python-numpy-tutorial/#matplotlib) and [not so very short](https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py) tutorials). You may draw these plots in a Jupyter notebook as well (see [example](http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Part\%203\%20-\%20Plotting\%20with\%20Matplotlib.ipynb)). 


_Question: plot the training curves. In how many epochs does your algorithm converge? What accuracy do you get on the train, dev and test sets? Do you observe underfitting or overfitting?_

```

PLEASE ENTER HERE YOUR ANSWER 

```

Try to set different learning rates.

*Question: Plot training curves for different learning rates. What conclusions can be drawn from them?*

```

PLEASE ENTER HERE YOUR ANSWER 

```

#### 2.1.3 Regularization 

For the $\alpha$ coefficient of the $L_2$ regularizer we used an arbitrary value. An inappropriate $\alpha$ causes underfitting ($\alpha$ is too large) or overfitting ($\alpha$ is too small). Choose an $\alpha$ that helps the model perform better on the validation set. Be careful: $\alpha$ changes the objective function, so it is possible that the learning rate and the number of epochs should be changed too. Use plots to choose the appropriate values!
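In PyTorch, the $L_2$ term can be added to the loss explicitly or delegated to the optimizer. The sketch below shows both variants on random data, with the bias excluded from regularization as in the theoretical part. Note that `torch.optim.SGD` adds `weight_decay * w` to the gradient, so matching the penalty $\alpha\sum_j w_j^2$ requires `weight_decay = 2 * alpha`. The layer size and data are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
alpha = 1e-5
model = nn.Linear(10, 1)
X = torch.randn(8, 10)
y = torch.randint(0, 2, (8,)).float()

# Variant 1: add alpha * sum_j w_j^2 to the loss explicitly (bias excluded).
data_loss = nn.functional.binary_cross_entropy_with_logits(model(X).squeeze(1), y)
penalty = alpha * model.weight.pow(2).sum()
total_loss = data_loss + penalty
total_loss.backward()
explicit_grad = model.weight.grad.clone()

# Variant 2: let the optimizer apply weight decay to the weights only.
# SGD adds weight_decay * w to the gradient, which equals the gradient of
# alpha * w^2 when weight_decay = 2 * alpha.
optimizer = torch.optim.SGD(
    [
        {"params": [model.weight], "weight_decay": 2 * alpha},
        {"params": [model.bias], "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

With the second variant the reported loss value does not include the penalty, which is worth keeping in mind when comparing training curves for different $\alpha$.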

_Question: plot training curves for several $\alpha$ values. What conclusions could be made? How many epochs and which learning rate do you need until it converges? How long does it take to train and to label test data?_

```

PLEASE ENTER HERE YOUR ANSWER 

```

### 2.2 Feed-forward Neural Network



The goal of this section is to implement a text categorization model based on a Feed-forward Neural Network using **pytorch**. Following the steps below, you can complete the corresponding parts in your ``classifier_ffnn.py`` script and apply the code to the sentiment classification task.

1. Repeat the steps 2.1.1 - 2.1.2 from the Logistic regression subsection.

2. Implement a Feed-forward Neural Network with **one** hidden layer using **pytorch**.

3. Similarly to 2.1.2 and 2.1.3, tune the learning rate and the $\alpha$ coefficient of the $L_2$ regularizer on the validation set.


_Question: plot learning curves for different $\alpha$. What is the optimal value of $\alpha$? Of the learning rate? How many epochs does it take to converge?_


```

PLEASE ENTER HERE YOUR ANSWER 

```

Using the $\alpha$ and learning rate from 2.2.3, train the classifier on the whole train set. 

### 2.3 Recurrent Neural Networks

The goal of this section is to implement a text categorization model based on Recurrent Neural Networks using **pytorch**. Following the steps below, you can complete the corresponding parts in your ``classifier_rnn.py`` script and apply the code to the sentiment classification task. Among all proposed configurations, choose the one that gives you the best validation score and implement it for the final submission.

#### 2.3.1 Use LSTM and word embeddings for text classification 


Implement a text classifier based on a Bi-LSTM network. Use the hidden state(s) to represent an input text document. If you use ``torch``, use ``torch.nn.Embedding`` to load pre-trained word embeddings. Use the [GloVe](http://nlp.stanford.edu/data/wordvecs/glove.6B.zip) embeddings in the input layer of your network.
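One possible shape of such a model is sketched below. The class name, the hidden size, and the random matrix standing in for the selected GloVe vectors are assumptions for illustration; `nn.Embedding.from_pretrained` is used to copy a pre-trained matrix into the embedding layer, and the final forward and backward hidden states are concatenated to represent the document.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    """Bi-LSTM text classifier on top of frozen pre-trained embeddings."""

    def __init__(self, embedding_matrix: torch.Tensor,
                 hidden_size: int = 64, num_classes: int = 2):
        super().__init__()
        # Row 0 is reserved for padding in this sketch.
        self.embedding = nn.Embedding.from_pretrained(
            embedding_matrix, freeze=True, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_matrix.size(1),
                            hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)            # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embedded)               # h_n: (2, batch, hidden)
        # Concatenate the last forward and last backward hidden states.
        # Padded batches would normally be packed first; omitted for brevity.
        doc_repr = torch.cat([h_n[0], h_n[1]], dim=1)   # (batch, 2 * hidden)
        return self.out(doc_repr)                       # unnormalized class scores


torch.manual_seed(0)
vocab_size, emb_dim = 50, 16
fake_glove = torch.randn(vocab_size, emb_dim)   # stand-in for real GloVe rows
fake_glove[0] = 0.0                             # padding vector
model = BiLSTMClassifier(fake_glove, hidden_size=8)
batch = torch.randint(1, vocab_size, (4, 12))   # 4 documents of 12 token ids
logits = model(batch)
```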

#### 2.3.2 Use LSTM and ELMo for text classification 

Use ``allennlp`` and the model  ``elmo_2x2048_256_2048cnn_1xhighway_weights`` (used in the RNN seminar) to build a text classification system. The only difference from the previous point is the use of ELMo contextualized word embeddings. Do not use any additional dependencies or versions of the ELMo model. Make sure that the model is located in the same directory with the classification Python script.

#### 2.3.3 Use of document embeddings for text classification 



Use ``gensim`` to obtain document embeddings for all reviews. Build a model based on logistic regression using ``sklearn`` which loads these embeddings for each document and performs classification. 

_Discuss: with which configuration did you achieve the best score? What were the results? In your opinion, why did some models perform worse and some better?_

```

PLEASE ENTER HERE YOUR ANSWER 

```

## 3. Research part

### 3.1 Logistic Regression

Apart from the classical gradient descent approach, there are many SGD variations, such as Adam, Adagrad, and RMSProp, that are frequently used in real-world cases ([short](http://cs231n.github.io/neural-networks-3/#update) and [long](http://ruder.io/optimizing-gradient-descent/index.html#momentum) reviews of these approaches). The key thing about these approaches: apart from the current gradient, they use statistics accumulated over previous steps (such as momentum or running averages of squared gradients) to compute the next step.
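With PyTorch, switching between such optimizers only requires changing the optimizer object. The sketch below compares a few of them on synthetic data with a shared initialization; the helper names `make_optimizer` and `final_loss`, the learning rate, and the number of steps are assumptions for illustration, not recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(1)
M = 10
X = torch.randn(256, M)
y = (X @ torch.randn(M) > 0).float()
loss_fn = nn.BCEWithLogitsLoss()


def make_optimizer(name: str, params, lr: float) -> torch.optim.Optimizer:
    """Map a short name to a torch.optim optimizer."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr)
    if name == "momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "adagrad":
        return torch.optim.Adagrad(params, lr=lr)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")


def final_loss(name: str, steps: int = 50, lr: float = 0.01) -> float:
    """Train a fresh LR model with full-batch steps and return the final loss."""
    torch.manual_seed(0)                       # same initial weights for every run
    model = nn.Linear(M, 1)
    optimizer = make_optimizer(name, model.parameters(), lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return loss_fn(model(X).squeeze(1), y).item()


initial_loss = final_loss("sgd", steps=0)      # loss before any update
results = {name: final_loss(name)
           for name in ["sgd", "momentum", "adagrad", "rmsprop", "adam"]}
```

Recording the loss after every step instead of only the final value would give the curves needed for the plots requested below.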


1. Test Momentum or Adagrad and use it to train your LR. At this point you are asked to test various optimizers to observe how they and their meta-parameters influence the performance of an NLP model, not to implement them from scratch. Draw several plots for training with different hyperparameter values. What can be observed from the results and what conclusions can be drawn?

```

PLEASE ENTER HERE YOUR ANSWER

```


2. In order to improve classification performance, it is important to understand why and where the classifier makes mistakes. Find some examples and determine which features lead to the errors (e.g., in positive reviews, features with mostly negative weights make the scalar product $w^Tx_{\{i\}}$ negative, resulting in a wrong classification). Are there any common features or tendencies in the errors found? If so, which steps could be taken to improve the classifier?

```

PLEASE ENTER HERE YOUR ANSWER 

```

3. Experiment with various types of representations.

  **In case your model was based on bag-of-words**: different feature values may be used as vector components: absolute word counts, relative counts, binary features (whether the word appears in the text), etc. Each feature can also be transformed: take the logarithm, rescale it to the [0, 1] range, or normalize it (subtract the mean value and divide by the standard deviation). 

  **In case your model was based on word embeddings**: Try other types of embeddings, e.g. character/subword-aware embeddings if you used word2vec, or simply a different embedding model: if you tried word2vec, use GloVe, fastText, or Flair embeddings. 

  Try different approaches and describe the results.
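For the bag-of-words case, the transformations mentioned above can be sketched with NumPy as follows. The tiny count matrix is made up; in a real experiment the column statistics (minimum, maximum, mean, standard deviation) would be computed on the train set only and guarded against zero denominators.

```python
import numpy as np

# Rows are documents, columns are vocabulary words (toy absolute counts).
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 1, 1]], dtype=float)

binary = (counts > 0).astype(float)                      # word present or not
relative = counts / counts.sum(axis=1, keepdims=True)    # counts / document length
log_counts = np.log1p(counts)                            # log(1 + count)

# Rescale every feature (column) to the [0, 1] range.
col_min, col_max = counts.min(axis=0), counts.max(axis=0)
min_max = (counts - col_min) / (col_max - col_min)

# Standardize every feature: subtract the mean and divide by the std.
standardized = (counts - counts.mean(axis=0)) / counts.std(axis=0)
```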


```

PLEASE ENTER HERE YOUR ANSWER 

```

### 3.2 Feed-forward Neural Network

1. Try to improve the performance of your neural classifier by experimenting with the model architecture (change the number of layers and their sizes). Draw the training curves (loss and accuracy) showing the dependency on 1) layer size and 2) number of layers. According to your experiments, which structure should be considered the most efficient?

```

PLEASE ENTER HERE YOUR ANSWER 

```

2. Apply other gradient descent algorithms to your model (e.g. Adam, Adagrad, RMSProp).

```

PLEASE ENTER HERE YOUR ANSWER 

```

### 3.3 Recurrent Neural Networks

#### 3.3.1 Different types of embeddings


Compare the performance of the [GloVe](http://nlp.stanford.edu/data/wordvecs/glove.6B.zip) and [word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) models to a model with a randomly initialized embedding layer (no pre-trained embeddings are used). Plot the results depending on the type of embeddings used. 

```

PLEASE ENTER HERE YOUR ANSWER 

```

#### 3.3.2 Impact of hyper-parameter choice

Try different numbers of hidden layers, numbers of LSTM cells in each layer, learning rates, and other meta-parameters. Present plots which demonstrate the performance of the model depending on the values of these meta-parameters. Does a bi-directional LSTM work better than a uni-directional LSTM for this task? 


```

PLEASE ENTER HERE YOUR ANSWER

```


## 4. Bonus practical part: using BERT for text categorization



This additional task can earn you extra points. In this task, you will need to create a sentiment text categorization model using a transformer-based pre-trained language model such as BERT, ELECTRA, RoBERTa, etc. 

To complete the task you need to complete the two sub-tasks below.


### 4.1 Text classifier

Write in the cell below the complete executable code of your solution (you do not need to provide the ``classifier.py`` script in this case). 

1. Please enter your code below.
2. Perform the required downloads of the data for training of the model and generation of the TSV file.
3. Your model has to be trained and has to generate the file for the Codalab submission.

In [None]:
# Enter code here 

### 4.2 Submission to Codalab


Upload the file to Codalab. Write below how its score compares to the scores of your submissions with the simpler models in this assignment (LR, FFNN, RNN). 

```

PLEASE ENTER TEXT HERE

```