# Today's Problem
#### The first problem of today is a classic in ML: sentiment analysis.
#### First thing to do is to load and look at the dataset, we use some support functions that you can find in utils/data_utils.py

In [1]:
trainFolder = '../data/sentiment/train/'
validFolder = '../data/sentiment/valid/'
testFolder = '../data/sentiment/test/'

#### some positive reviews

In [2]:
with open(trainFolder + 'positive.txt', 'r') as fp:
    positives = fp.readlines()[:5]
    
print("\n".join(positives))

excellent food .

superb customer service .

they also have daily specials and ice cream which is really good .

it 's a good toasted hoagie .

the staff is friendly .



#### ... and some negative ones

In [3]:
with open(trainFolder + 'negative.txt', 'r') as fp:
    negatives = fp.readlines()[:5]
    
print("\n".join(negatives))

i was sadly mistaken .

so on to the hoagies , the italian is general run of the mill .

minimal meat and a ton of shredded lettuce .

nothing really special & not worthy of the $ _num_ price tag .

second , the steak hoagie , it is atrocious .



#### Unfortunately, words are not the right representation to carry out Machine Leraning.
There are many possible ways to look at sentences in order to let a computer do some maths on it, we start from the classical ones:

One-Hot encoding represents each word as a vector $[0, 0, \dots 0, 1, \dots, 0]$ of all zeros except of a one in j-th position, and to each word is associated a unique position.

This is the basic representation (Vector Space) that most ML algorithms use, it has some drawbacks, the first one is that it requires a lot of memory to store the dataset.

### Luckily, we have already implemented some of these functionalities for you

In [4]:
from utils.data_utils import Corpus
corpus = Corpus(trainFolder, validFolder, testFolder, limit=10000)

In [5]:
corpus.train_positive[:5]

[tensor([ 0,  1,  2,  3]),
 tensor([ 4,  5,  6,  2,  3]),
 tensor([  7,   8,   9,  10,  11,  12,  13,  14,  15,  16,  17,  18,
           2,   3]),
 tensor([ 19,  20,  21,  18,  22,  23,   2,   3]),
 tensor([ 24,  25,  16,  26,   2,   3])]

In [6]:
corpus.dictionary.idx2word[0]

'excellent'

##### In this training set, a sentence is represented by a sequence of indices.
This is not the final version of the training set yet, as our algorithms do not understand sequences (yet), instead each input must be a vector of features.
We can convert a sequence of indices to a vector of features by using one-hot encoding, in particular a sentence can be represented by the sum of its one-hot encoded vectors

In [7]:
numDistincWords = len(corpus.dictionary.idx2word)

In [8]:
# TODO Remove this in the lesson version
import numpy as np
#def indicesToFeatures(seq, vecLen):
#    out = np.zeros(vecLen)
#    for ind in seq:
#        out[ind] += 1
#    return out

In [9]:
# V2: without error
#import torch
#def indicesToFeatures(seq, vecLen):
#    out = torch.zeros(vecLen)
#    for ind in seq:
#        out[ind] += 1
#    return out

In [10]:
indicesToFeatures(corpus.train_positive[0], numDistincWords)

tensor([ 1.,  1.,  1.,  ...,  0.,  0.,  0.])

## Time for some machine learning!
We're going to go back to the roots, some good old logistic regression
... with some Deep Learning flavour!

We are going to use a powerful DL package called pyTorch https://pytorch.org/, it has a wonderful community and really easy sintax.

We consider the following model:

$logit(p) = \beta_0 + \beta_1 x_1 + ... \beta_n x_n$

or in a DL syntax:

$y = logSoftmax(Ax + b)$

The problem is now to find the right coefficients (A, b) in order to "fit" the model

<img src="files/images/Logistic_Classifier.png">

## The basics of DL: an optimization problem
in order to pose & solve an optimization problem, we need the following ingredients
- an objective function
- some parameters against which minimize (or maximize) the objective function
- an algorithm to optimize

In this setting we have
- objective function: log likelihood ($ll$)
- parameters: (A, b)
- algorithm: ???

### Gradient Descent
$\nabla ll = \sum x_i (y_i - \hat y _i)$

#### Gradient descent is nice, but...
Requires a lot of memory (need to store all the datapoints) and evaluate the model at all the datapoints on each update of the parameters.
We can relax this and use at each step only a batch (or so called minibatch) of data.
### --> This is the Stochastic Gradient Descent (SGD)

<img src="files/images/sgd.jpeg">

In [11]:
import torch.nn as nn
import torch.nn.functional as F

class LRClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(LRClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec))


In [12]:
indicesToFeatures(corpus.train_positive[0], numDistincWords)

tensor([ 1.,  1.,  1.,  ...,  0.,  0.,  0.])

In [13]:
clf = LRClassifier(2, numDistincWords)
clf(indicesToFeatures(corpus.train_positive[0], numDistincWords))



tensor([-0.6833, -0.7031])

### Nice error! 
This is due to the fact that Torch functions talk only to Torch tensors.
Luckily, we can pass from numpy's ndarrays to torch tensors smoothly, just by calling torch.LongTensor(a)

Go back and modify the indicesToFeatures function in order to return torch longvectors

### what happens if instead we pass a minibatch?

In [14]:
def createMiniBatch(positives, negatives, batch_size, batch_num, vecLen):
    data = positives[batch_num: batch_num + batch_size // 2]
    data.extend(negatives[batch_num: batch_num + batch_size // 2])
    data = list(map(lambda x: indicesToFeatures(x, vecLen), data))
    data = torch.stack(data)
    labels = [1] * (batch_size // 2) + [0] * (batch_size // 2)
    labels = torch.tensor(labels)
    return data, labels

In [15]:
data, labels = createMiniBatch(
    corpus.train_positive, corpus.train_negative,
    12, 0, numDistincWords)

In [16]:
clf(data)



tensor([[-0.6833, -0.7031],
        [-0.6710, -0.7158],
        [-0.6724, -0.7143],
        [-0.6758, -0.7108],
        [-0.6837, -0.7027],
        [-0.6913, -0.6950],
        [-0.6768, -0.7098],
        [-0.6837, -0.7026],
        [-0.6786, -0.7079],
        [-0.6682, -0.7188],
        [-0.6758, -0.7108],
        [-0.6855, -0.7009]])

#### We're still missing out the optimization part
- Option 1: write by hand the partial derivatives and the update schema of the weights.
- Option 2: use pyTorch automatic differentiation's toolbox

<img src="files/images/automatic_differentiation.png" height="250" width="400">

### What we need:
- loss function: provided by pyTorch torch.nn package, in this case nn.NLLLoss() (standard negative log-likelihood)
- optimizer: provided by torch.optim, in this case optim.SGD
- a way to differentiate, this is the best part! just call the method backward() on the loss and it will propagate the gradients up to the input data
- an update rule: just call optimizer.setp() after the loss has been backpropagated

In [17]:
import torch.optim as optim

loss_function = nn.NLLLoss()
optimizer = optim.SGD(clf.parameters(), lr=0.01)
batchSize = 64
numBatches = 300

for batchNum in range(numBatches):
    # Step 1 PyTorch accumulates gradients.
    # We need to clear them out before each instance
    clf.zero_grad()

    # Step 2. Get the datapoints and labels
    dataBatch, labelsBatch = createMiniBatch(
        corpus.train_positive, corpus.train_negative, batchSize, batchNum, numDistincWords)

    # Step 3. Run our forward pass.
    logProbs = clf(dataBatch)

    # Step 4. Compute the loss, gradients, and update the parameters by
    # calling optimizer.step()
    loss = loss_function(logProbs, labelsBatch)
    if batchNum % 20 == 0:
        print('\rBatch: {0}, loss:{1:.5f}'.format(batchNum, loss), flush=True, end=" ")
    loss.backward()
    optimizer.step()

Batch: 20, loss:0.68277 



Batch: 280, loss:0.62338 

#### This is what is called an epoch
Usually, we run the same for many epochs

TODO: try to improve the results by running several epochs
does shuffling the data across epochs improve the results?

In [None]:
def trainEpoch(epochNum, model, loss_function, optimizer, cuda=False):
"""
    for batchNum in range(numBatches):
        # Step 1 PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get the datapoints and labels
        dataBatch, labelsBatch = createMiniBatch(
            corpus.train_positive, corpus.train_negative, batchSize, batchNum, numDistincWords)
        if cuda:
            dataBatch = dataBatch.cuda()
            labelsBatch = labelsBatch.cuda()
        # Step 3. Run our forward pass.
        logProbs = model(dataBatch)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(logProbs, labelsBatch)
        if batchNum % 20 == 0:
            print('\rEpoch: {0}, Batch: {1}, loss:{2:.5f}'.format(epochNum, batchNum, loss), flush=True, end=" ")
        loss.backward()
        optimizer.step()
"""

In [32]:
numEpochs = 10

def trainModel(numEpochs, model, cuda=False):
"""
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for epochNum in range(numEpochs):
        trainEpoch(epochNum, model, loss_function, optimizer, cuda)
"""        
        
trainModel(numEpochs, clf)

Epoch: 0, Batch: 20, loss:0.60953 



Epoch: 9, Batch: 280, loss:0.37426 

## When to stop???

We can imagine that at each epoch a "new" model is trained, so we need an evaluation set to compare

We already stored that in corpus.valid.positive & corpus.valid.negative

In [18]:
numValidBatches = 300

def validate(clf, loss_function, cuda=False):
    clf.eval() # turn off gradient propagation
    totalLoss = 0
    for batchNum in range(numValidBatches):
        dataBatch, labelsBatch = createMiniBatch(
            corpus.valid_positive, corpus.valid_negative, batchSize, batchNum, numDistincWords)
        if cuda:
            dataBatch = dataBatch.cuda()
            labelsBatch = labelsBatch.cuda()
        # Step 3. Run our forward pass.
        logProbs = clf(dataBatch)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        totalLoss += loss_function(logProbs, labelsBatch)
    return totalLoss / numValidBatches

In [19]:
validate(clf, loss_function)



tensor(0.6251)

<img src="files/images/we-need-to-go-deeper.jpg">

What we've seen so fare is an example of a shallow feed-forward neural network.
In this context, a layer is a pass of matrix multiplication + nonlinearity --> logSoftmax(Ax + b) is a layer.
Deep Neural networks are just sequences of layers of this kind, with different sizes and activation functions.
most famous activation functions are:
- tanh
- softmax
- relu
- leaky relu
And many more to be found here: https://pytorch.org/docs/stable/nn.html?highlight=activation#relu

Complete the following code to get a 3 layer neural network with relu activation functions and final softmax layer to perform binary classification

In [20]:
class MyDNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.firstLayer = nn.Linear(input_dim, 200)
        self.secondLayer = nn.Linear(200, 100)
        self.thirdLayer = nn.Linear(100, 2)
        self.relu = nn.ReLU()
        
    def forward(self, bow_vec):
    """
        out = self.firstLayer(bow_vec)
        out = self.relu(out)
        out = self.secondLayer(out)
        out = self.relu(out)
        out = self.thirdLayer(out)
        return F.log_softmax(out)
    """

In [None]:
# If we have cuda installed, we should use it!

In [41]:
useCuda = torch.cuda.is_available()
useCuda

True

In [21]:
# if you don't have cuda installed, just remove the .cuda()
dnn = MyDNN(numDistincWords)
dnn = dnn.cuda()
dnn(data.cuda())

  from ipykernel import kernelapp as app


tensor([[-0.6721, -0.7146],
        [-0.6711, -0.7157],
        [-0.6663, -0.7208],
        [-0.6716, -0.7152],
        [-0.6715, -0.7153],
        [-0.6717, -0.7151],
        [-0.6729, -0.7138],
        [-0.6719, -0.7149],
        [-0.6711, -0.7157],
        [-0.6731, -0.7137],
        [-0.6691, -0.7177],
        [-0.6718, -0.7149]], device='cuda:0')

In [24]:
### What about training it?

In [25]:
numEpochs = 20

trainModel(numEpochs, dnn, cuda=True)

Epoch: 0, Batch: 20, loss:0.69353 

  from ipykernel import kernelapp as app


Epoch: 19, Batch: 280, loss:0.01337 

### Nice training error, but what about validation??

In [26]:
validate(dnn, loss_function, cuda=True)

  from ipykernel import kernelapp as app


tensor(0.3534, device='cuda:0')

This is a clear indication of overfitting!

In [27]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(dnn)

1667102

The model is overparametrized, we have two options:
- reduce model complexity
- add regularization

In [28]:
np.sum(dnn.firstLayer.weight.data.cpu().numpy()**2) + \
    np.sum(dnn.firstLayer.weight.data.cpu().numpy()**2)

150.8012

#### $L^p$ regularization
Is a technique to prevent weights in the nerual network to explode, by adding a penalization term in the loss function
the new loss will be
$loss = NLL + \sum (|w_i|^p)^{\frac{1}{p}}$

Two famous lossess are with p=2 and p=1
$L^2$ loss is already included in most of the optimization algorithms https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
For $L^1$ we need to go deeper in the code and insert it by hand as a penalization term

In [22]:
def trainEpochWithPenalization(epochNum, model, loss_function, optimizer, factor, cuda=False):
    for batchNum in range(numBatches):
        # Step 1 PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get the datapoints and labels
        dataBatch, labelsBatch = createMiniBatch(
            corpus.train_positive, corpus.train_negative, batchSize, batchNum, numDistincWords)
        if cuda:
            dataBatch = dataBatch.cuda()
            labelsBatch = labelsBatch.cuda()
        # Step 3. Run our forward pass.
        logProbs = model(dataBatch)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(logProbs, labelsBatch)
        l1_crit = nn.L1Loss(size_average=False)
        reg_loss = 0
        for param in model.parameters():
            target = torch.zeros(param.shape)
            if cuda:
                target = target.cuda()
            reg_loss += l1_crit(param, target)

        loss += factor * reg_loss
        if batchNum % 20 == 0:
            print('\rEpoch: {0}, Batch: {1}, loss:{2:.5f}'.format(epochNum, batchNum, loss), flush=True, end=" ")
        loss.backward()
        optimizer.step()
        
        
def trainModelWithPenalization(numEpochs, model, factor, cuda=False):
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for epochNum in range(numEpochs):
        trainEpochWithPenalization(epochNum, model, loss_function, optimizer, factor, cuda)

In [24]:
dnn = MyDNN(numDistincWords)
dnn = dnn.cuda()
trainModelWithPenalization(20, dnn, 0.00001, useCuda)

Epoch: 0, Batch: 20, loss:0.79110 

  from ipykernel import kernelapp as app


Epoch: 19, Batch: 280, loss:0.10534  

In [25]:
np.sum(dnn.firstLayer.weight.data.cpu().numpy()**2) + \
    np.sum(dnn.firstLayer.weight.data.cpu().numpy()**2)

130.2332

In [26]:
validate(dnn, loss_function, cuda=useCuda)

  from ipykernel import kernelapp as app


tensor(0.3531, device='cuda:0')

### A different perspective: Dropout regularization
http://jmlr.org/papers/v15/srivastava14a.html
https://arxiv.org/pdf/1207.0580.pdf

simply speaking, dropout sets randomly to zero some of the weights of the network ad each pass.
The rationale behind it is that it should force the network to understand the nature of data without overfitting it. 

https://pytorch.org/docs/stable/nn.html#dropout

#### Your Turn! use dropout to regularize the neural network, does it improve its ability to generalize?

In [38]:
class MyDNNWithDropout(nn.Module):
    """
    def __init__(self, input_dim):
        super().__init__()
        self.model = nn.Sequential(
            torch.nn.Linear(input_dim, 500),
            torch.nn.ReLU(),
            nn.Dropout(p=0.5),
            torch.nn.Linear(500, 100),
            torch.nn.ReLU(),
            nn.Dropout(p=0.5),
            torch.nn.Linear(100, 2)
        )
        
    def forward(self, bow_vec):
        out = self.model(bow_vec)
        return F.log_softmax(out)
    """

In [39]:
dnn = MyDNNWithDropout(numDistincWords)
trainModel(numEpochs, dnn)

Epoch: 0, Batch: 0, loss:0.69391 

  app.launch_new_instance()


Epoch: 9, Batch: 280, loss:0.14762 

In [40]:
validate(dnn, loss_function)

  app.launch_new_instance()


tensor(0.3526)

# Homework

1- write the code to perform real testing (e.g. classification accuracy) on the test set

2- try to use torch.nn.sequential(...) to avoid having to write

    def forward(self, bow_vec):
        out = self.firstLayer(bow_vec)
        out = self.relu(out)
        out = self.secondLayer(out)
        out = self.relu(out)
        out = self.thirdLayer(out)
        return F.log_softmax(out)

3 - adapt all what has been done to work with sequences of digits insetad of words
the problem you need to solve is the following: given a proper name, understand if it's masculine or feminine

Does the model perform well? Can you explain why?