# Hands on: Sentiment Analysis classification using different DNN architectures

## Task

The objective of sentiment analysis is to determine the polarity of a sentence. For example, in a movie review the objective is to determine whether the overall user satisfaction is positive or negative. Here, we will not cover aspect-based sentiment analysis (polarity for specific aspects, e.g. actors, scenario, costume); instead we focus on a simple classification task with positive and negative polarity. 

We will train a model to predict the polarity of a given sentence or paragraph. The input is text and the output is the positive or negative class. 

To perform this task we need data. Download your data and store it in the data folder inside the "Colabnotebooks" folder you created for the project. 

The dataset is called aclImdb and contains movie reviews from a well known benchmark. The folder contains the training, test and validation data. 

We will 
1. build the model using different architectures
2. train the model 
3. evaluate the model's accuracy 

The architectures we will use are
1. a simple logistic regression model 
2. an LSTM model 
3. a BERT model 


## Prepare notebook
Before you proceed make sure that your notebook 

## Prepare data
You need to mount your Google Drive to be able to read from and write to it. 
The code reads the data from the drive and, after training, stores the model under the data/models folder, so make sure you create that folder in advance. To mount the drive, follow the instructions: go to the URL, choose the email account linked to this notebook, and copy and paste the link into the window that appears below. 
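
The models folder can also be created from code rather than by hand. The following is a minimal sketch; the helper `ensure_models_dir` and the temporary demo directory are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

def ensure_models_dir(project_root):
    """Create <project_root>/data/models if it is missing and return its path."""
    models_dir = Path(project_root) / "data" / "models"
    # parents=True also creates data/; exist_ok=True makes the call safe to repeat
    models_dir.mkdir(parents=True, exist_ok=True)
    return models_dir

# Demonstration in a temporary directory. In Colab you would pass the project
# folder under your mounted drive instead (the exact path is up to you).
demo_root = tempfile.mkdtemp()
models_dir = ensure_models_dir(demo_root)
print(models_dir.is_dir())
```

Calling the helper a second time is harmless, so it can sit at the top of the notebook and run on every session.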

In [None]:
https://github.com/mkoutsog/uoc_nlp_lectures.git

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
!ls

drive  gdrive  sample_data


In [5]:
%cd gdrive/MyDrive/UOC/invited_talk/

/content/gdrive/MyDrive/UOC


In [9]:
!ls


 1703.01619.pdf		  csd
 1-HandsOnUoC.ipynb	 'Project on Auditory Sig. Proc..ipynb'
 2020summer_dl4nlp_labs   UOC-2021.gslides


In [None]:
# How to setup a Multi-Layer Perceptron model for imdb sentiment analysis in Keras

def Snippet_380(): 

    print()
    print(format('How to setup a Multi-Layer Perceptron model for sentiment analysis in Keras','*^92'))

    import time
    start_time = time.time()

    # load libraries
    from keras.datasets import imdb
    from keras.models import Sequential
    # Embedding is exposed directly in keras.layers; the keras.layers.embeddings path is not available in recent versions
    from keras.layers import Dense, Flatten, Embedding
    from keras.preprocessing import sequence
    
    # load data and Set the number of words we want
    top_words = 5000
    input_length = 500

    # Load data and target vector from movie review data
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
    
    print(); print(X_train.shape); print(X_train)
    print(); print(y_train.shape); print(y_train)    
    print(); print(X_test.shape);  print(X_test)
    print(); print(y_test.shape);  print(y_test)    
    
    # Convert movie review data to feature matrix
    X_train = sequence.pad_sequences(X_train, maxlen=input_length)
    print(); print(X_train.shape); print(X_train)

    X_test = sequence.pad_sequences(X_test, maxlen=input_length)
    print(); print(X_test.shape);  print(X_test)

    # setup a MLP network
    model = Sequential()
    model.add(Embedding(top_words, 32, input_length=input_length))
    model.add(Flatten())
    model.add(Dense(250, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    # Fit the model
    model.fit(X_train, y_train, validation_data=(X_test, y_test), 
              epochs=20, batch_size=128, verbose=1)

    # Final evaluation of the model
    scores = model.evaluate(X_test, y_test, verbose=1)
    print("Accuracy: %.2f%%" % (scores[1]*100))

    print(); print("Execution Time %s seconds: " % (time.time() - start_time))

Snippet_380()

Install the packages and load libraries. 


In [None]:
# set seed for replicability of results
import numpy as np
import tensorflow as tf

np.random.seed(1)
tf.random.set_seed(2)

In [None]:
# Load the data
import re
import pandas as pd

# Let's do 2-way positive/negative classification instead of 5-way    
def load_sst_data(path,
                  easy_label_map={0:0, 1:0, 2:None, 3:1, 4:1}):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)
    data = pd.DataFrame(data)
    return data

sst_home = 'drive/My Drive/Colab Notebooks/2020summer_dl4nlp_labs/data/trees/'
training_set = load_sst_data(sst_home + 'train.txt')
dev_set = load_sst_data(sst_home + 'dev.txt')
test_set = load_sst_data(sst_home + 'test.txt')

print('Training size: {}'.format(len(training_set)))
print('Dev size: {}'.format(len(dev_set)))
print('Test size: {}'.format(len(test_set)))

## 2. Examining the data

In [None]:
# Print a sample of negative text chunks
training_set[training_set.label == 0].head(10)

In [None]:
# Print a sample of positive text chunks
training_set[training_set.label == 1].head(10)

## 3. Preprocessing the data
Once the data is loaded, the next step is to preprocess it to obtain its vectorized form (i.e. to transform text into numeric tensors). This basically consists of:

- Tokenization, which typically segments the text into words. (Alternatively, we could segment text into characters, or extract n-grams of words or characters.)
- Definition of the dictionary index and the vocabulary size (in this case we keep the 1000 most frequent words).
- Transformation of each word into a vector. 
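
The first two steps above can be sketched in plain Python. This is a simplified illustration, not the Keras tokenizer used later in the notebook: the helper names, the regular expression used for tokenization and the toy sentences are all assumptions.

```python
import re
from collections import Counter

def build_word_index(texts, vocab_size):
    """Map the `vocab_size` most frequent words to integer indices starting at 1."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits and apostrophes as tokens
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Most frequent first; ties broken alphabetically so the result is deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # Index 0 is deliberately left unassigned
    return {word: i for i, word in enumerate(ranked[:vocab_size], start=1)}

def to_indices(text, word_index):
    """Replace each known word by its index; out-of-vocabulary words are dropped."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[t] for t in tokens if t in word_index]

toy_texts = ["a good movie", "a bad movie", "good good fun"]
toy_word_index = build_word_index(toy_texts, vocab_size=3)
print(toy_word_index)                             # {'good': 1, 'a': 2, 'movie': 3}
print(to_indices("a bad movie", toy_word_index))  # [2, 3]
```

With a vocabulary of three words, "bad" and "fun" fall outside the index and simply disappear from the encoded text.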


There are multiple ways to vectorize tokens. The two main ones are ___one-hot encoding___ and ___word embedding___. In this lab, we'll use Keras's basic tools to obtain the one-hot encoding, and we'll leave word embeddings for the following labs. 

In [None]:
from sklearn.utils import shuffle

# Shuffle dataset
training_set = shuffle(training_set)
dev_set = shuffle(dev_set)
test_set = shuffle(test_set)

# Obtain text and label vectors, and tokenize the text
train_texts = training_set.text
train_labels = training_set.label

dev_texts = dev_set.text
dev_labels = dev_set.label

test_texts = test_set.text
test_labels = test_set.label

In [None]:
print(training_set.loc[0])
print(train_labels.loc[0])

### 3.1. One-hot encoding of the data

One-hot encoding is the most basic way to convert a token into a vector. Here, we'll turn the input texts into (0,1)-vectors. The process consists of associating a unique integer index with every word in the vocabulary.

>>>>>![](http://ixa2.si.ehu.es/~jibloleo/uc3m_dl4nlp/img/vectorize_small.png)


For example, if the tokenized text contains a word whose dictionary index is 14, then in the processed vector the entry at index 14 will be 1, and entries for words that do not occur will be set to 0.

Note that when using keras built-in tools for indexing, ```0``` is a reserved index that won't be assigned to any word.
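
As a rough illustration of the encoding described above, the sketch below turns a list of word indices into a 0/1 vector. The function name and the vector size of 20 are assumptions chosen for the example; the notebook itself relies on the Keras tokenizer for this step.

```python
def to_binary_vector(indices, vector_size):
    """Return a list of 0.0/1.0 values with a 1.0 at every index in `indices`."""
    vec = [0.0] * vector_size
    for i in indices:
        # Index 0 is reserved, and indices >= vector_size are out of vocabulary
        if 0 < i < vector_size:
            vec[i] = 1.0
    return vec

# A text whose words have indices 14 and 3 (14 occurs twice, 25 is out of range)
toy_vec = to_binary_vector([14, 3, 14, 25], vector_size=20)
print(toy_vec[14], toy_vec[3], sum(toy_vec))   # 1.0 1.0 2.0
```

Because a whole text is encoded into one vector, a text with several in-vocabulary words has several entries set to 1, and repeated words still contribute a single 1.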

In [None]:
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras import utils

# Create a tokenizer that keeps the 1000 most common words
tokenizer = text.Tokenizer(num_words=1000)

# Build the word index (dictionary)
tokenizer.fit_on_texts(train_texts) # Create word index using only training part

# Vectorize texts into one-hot encoding representations
x_train = tokenizer.texts_to_matrix(train_texts, mode='binary')
x_dev = tokenizer.texts_to_matrix(dev_texts, mode='binary')
x_test = tokenizer.texts_to_matrix(test_texts, mode='binary')
          
# Labels are already binary (0 = negative, 1 = positive), so they are used as they are
y_train = train_labels
y_dev = dev_labels
y_test = test_labels

print('Text of the first example: \n{}\n'.format(train_texts.iloc[0]))
print('Vector of the first example:\n{}\n'.format(x_train[0]))
# Use .iloc for positional access: after shuffling, label 0 of the index is not the first row
print('Label of the first example:\n{}\n'.format(y_train.iloc[0]))

print('Shape of the training set (nb_examples, vector_size): {}\n'.format(x_train.shape))

In [None]:
# Recover the word index that was created with the tokenizer
word_index = tokenizer.word_index
print('Found {} unique tokens.\n'.format(len(word_index)))
word_count = tokenizer.word_counts
print("Most frequent words, as word (count) --> index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break

print("Least frequent words, as word (count) --> index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=False)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break

Check what we obtain when we vectorize words that are out of the index (_out of vocabulary_ words).

In [None]:
oov_sample = ['saddam', 'plausible']
sequences = tokenizer.texts_to_matrix(oov_sample)
print(sequences[0])

It is also possible to obtain lists of integer indices instead of the one-hot binary representation.

In [None]:
# Turns strings into list of integer indices
print(train_texts.iloc[0])
one_hot_results = tokenizer.texts_to_sequences(train_texts)
print(one_hot_results[0])

## 4. Building the model architecture
When we build a neural network we usually take into account the following points:
- The __layers__, and how they are combined (that is, the structure and parameters of the model).
- The __input__ and the __labeled output__ data that the model needs to map.
- The __loss function__, which signals how well the model is doing.
- The __optimizer__, which defines the learning procedure.

In this very first session we'll keep all of this very simple. Keras provides a simple framework for combining layers. Two kinds of API are available for building the model: the _Sequential_ class and the _functional_ API. The latter is intended for DAG structures, which let you build arbitrary models. The former is for linear stacks of layers, which is the most common and simplest architecture. 

In this session, we will build a __logistic regression__, which is the simplest neural network model. More complicated models, like the multilayer perceptron, will be covered in the next lab sessions.

The logistic regression can be implemented in Keras using ```Dense``` layer. ```Dense``` implements the following  operation:

> ```output = activation(dot(input, kernel) + bias)```

where activation is the element-wise activation function passed as the ```activation``` argument (_sigmoid_ function in our case), ```kernel``` is a weights matrix created by the layer, and ```bias``` is a bias vector created by the layer.
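
The operation above, for a single output unit with a sigmoid activation, can be written out in a few lines of plain Python. This is only a sketch of the arithmetic: the function names and the toy numbers are assumptions, and Keras performs the same computation on whole batches of inputs.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(x, kernel, bias):
    """Single-unit dense layer: sigmoid(dot(x, kernel) + bias)."""
    z = sum(xi * wi for xi, wi in zip(x, kernel)) + bias
    return sigmoid(z)

toy_x = [1.0, 0.0, 1.0]          # binary input vector
toy_kernel = [0.5, -2.0, 0.25]   # one weight per input feature
toy_bias = -0.75

toy_p = dense_sigmoid(toy_x, toy_kernel, toy_bias)
print(toy_p)   # 0.5, because 0.5 + 0.25 - 0.75 = 0 and sigmoid(0) = 0.5
```

The output can be read as the predicted probability of the positive class; only the weights of words that are present in the input contribute to it.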

Remember from the slides that, mathematically, this can be written as follows:
> $sigmoid(W^{T}X + b)$


Regarding input data, we will use the __one-hot encoding__ (see previous sections). We'll set ```(binary) cross-entropy``` as a __loss function__ and ```adam```, a variant of the _Stochastic Gradient Descent_, as the __optimizer__.
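
To make the loss function more concrete, here is a small sketch of the binary cross-entropy averaged over a batch. The function name, the clipping constant `eps` and the example probabilities are assumptions for illustration; in the notebook the loss is computed by Keras.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so that log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)   # True
```

Confident predictions on the wrong side are penalized much more heavily than confident correct ones, which is what drives the weights toward useful values during training.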

Feel free to explore different loss functions (e.g. MSE) and optimizers (e.g. rmsprop) to see whether you can improve the model (see Exercise 2, below).

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

input_size = x_train[0].shape[0]  # vector length equals the vocabulary size
print(input_size)
num_classes = 2

model = Sequential()
model.add(Dense(units=1, activation='sigmoid', input_shape=(input_size,)))
# now the model will take as input arrays of shape (*, input_size)
# and output arrays of shape (*, 1)

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

### Exercise 1
Answer the following questions:
- How many layers does the model have?
- What is the input size? 
- How many parameters does the model have? What is the relationship with the input size?

## 5. Training the model
The next piece of code trains the model defined above. As you can see, the way we train the model is very similar to scikit-learn.

```model.fit()``` returns the history of the training, which contains the accuracy and loss values obtained on the training and development sets in each of the 50 epochs.
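
The per-epoch values can be inspected directly, not only plotted. The sketch below uses a hand-written dictionary as a stand-in for `history.history`; its numbers are made up for illustration, and `best_epoch` is a hypothetical helper, not a Keras function.

```python
# Hypothetical stand-in for `history.history`; the values are invented.
example_history = {
    'accuracy':     [0.60, 0.70, 0.78, 0.83],
    'val_accuracy': [0.58, 0.68, 0.72, 0.71],
    'loss':         [0.68, 0.60, 0.52, 0.45],
    'val_loss':     [0.69, 0.62, 0.58, 0.59],
}

def best_epoch(history_dict, metric='val_accuracy'):
    """Return (1-based epoch number, value) at which `metric` is highest."""
    values = history_dict[metric]
    idx = max(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

toy_epoch, toy_value = best_epoch(example_history)
print(toy_epoch, toy_value)   # 3 0.72
```

In this made-up run the training accuracy keeps rising after epoch 3 while the development accuracy drops, which is the pattern to look for when checking for overfitting.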

In [None]:
# TODO Run the model. Feel free to experiment with different batch sizes and number of epochs.
history = model.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=2)

In [None]:
import matplotlib.pyplot as plt

# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

## 6. Evaluating the model

Once we fit the model we can use the method ```model.evaluate()``` to obtain the accuracy on test set.
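
The accuracy reported for a binary classifier can be reproduced by hand from predicted probabilities. This is a minimal sketch that assumes a decision threshold of 0.5; the function name and the toy values are illustrative.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(preds, y_true))
    return correct / len(y_true)

# The third example is positive but predicted with probability 0.3, so it is missed
toy_acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.2])
print(toy_acc)   # 0.75
```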

In [None]:
score = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: ", score[1])

## 7. Model Tuning

### 7.1. Effect of Learning Rate 

#### Exercise 2
The model in Section 4 uses default values of learning rate, and does not use any type of regularization. It would be great if you could try improving the model by exploring different parameters. You can explore the following hyperparameters: 
- Optimizers: SGD, ADAGRAD, etc.
- Learning Rates of the optimizer.
- Regularization.

You can check Keras API to learn how to use and set up different optimizers: 
- https://www.tensorflow.org/api_docs/python/tf/keras/optimizers (https://keras.io/optimizers/)
- https://www.tensorflow.org/api_docs/python/tf/keras/regularizers (https://keras.io/regularizers/)


In this session we'll focus on the importance of the learning rate. We'll compare a large and a small learning rate with the default one. 
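
As a toy illustration of why the learning rate matters, the sketch below minimizes f(w) = w² with plain gradient descent. This is an assumption-laden simplification: the notebook uses Adam on a real model, and the function name, step count and learning rates here are chosen only to make the behaviour visible.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w -= lr * grad      # standard update: w <- w - lr * grad
    return abs(w)

small = gradient_descent(lr=0.001)    # barely moves away from the start
moderate = gradient_descent(lr=0.1)   # gets close to the minimum at 0
large = gradient_descent(lr=1.1)      # overshoots more at every step
print(small, moderate, large)
```

Each step multiplies `w` by `(1 - 2 * lr)`, so a tiny rate shrinks it very slowly, a moderate rate shrinks it quickly, and a rate above 1 makes it grow in magnitude.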


__Please run the following cells and answer the questions below:__

- Why do we obtain such different plots with each learning rate?

- What is the difference when comparing the following curves:
   - ```train large``` vs  ```train orig```
   - ```dev large``` vs  ```dev orig```

- And the following ones:
   - ```train small``` vs  ```train orig```
   - ```dev small``` vs ```dev orig```




In [None]:
# Example of using optimizer objects with an explicit learning rate

from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam

def build_logreg(learning_rate):
    # Build a fresh model for each run so that every learning rate starts
    # from newly initialized weights instead of continuing a previous training
    model = Sequential()
    # An L2 factor of 0. means no regularization is applied here (see Section 7.2)
    regularizer = regularizers.l2(0.)
    model.add(Dense(units=1, activation='sigmoid', input_shape=(input_size,), kernel_regularizer=regularizer))
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=learning_rate), metrics=['accuracy'])
    return model

# Small learning rate
model_small = build_logreg(0.00001)
history_small_lr = model_small.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

# Large learning rate
model_large = build_logreg(0.5)
history_large_lr = model_large.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)


In [None]:
import matplotlib.pyplot as plt

# summarize history for accuracy
plt.plot(history_large_lr.history['accuracy'])
plt.plot(history_large_lr.history['val_accuracy'])

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])

plt.plot(history_small_lr.history['accuracy'])
plt.plot(history_small_lr.history['val_accuracy'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train large', 'dev large', 'train orig', 'dev orig', 'train small', 'dev small'], loc='center right')
plt.show()

### 7.2. Effect of regularization

#### Exercise 3

In this session we'll focus on the effect of regularization. We'll compare a regularized model against a non-regularized one.

__Please run the following cell and answer the question below__:

What is the effect of including a regularization term? Is it always a good thing to be included?


------

The plots might not be what you expect; note that we reduced the vocabulary to only the 1000 most frequent words in training. Even so, you should see the differences between the learning curves when training with and without regularization.

----
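
For reference, an L2 regularization term adds a penalty proportional to the sum of squared weights to the loss being minimized. The sketch below shows only that arithmetic; the function names, the data-loss value of 0.5 and the weights are illustrative assumptions.

```python
def l2_penalty(weights, factor):
    """L2 penalty: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, factor):
    """Total loss = loss on the data + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, factor)

toy_weights = [3.0, -4.0]   # sum of squares is 25
plain = regularized_loss(0.5, toy_weights, factor=0.0)
penalized = regularized_loss(0.5, toy_weights, factor=0.01)
print(plain, penalized)     # 0.5 0.75
```

With a factor as small as the one used in the cell below, the penalty is tiny compared with the data loss, so large differences between the curves should not be expected.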


In [None]:
import matplotlib.pyplot as plt
from tensorflow.keras import optimizers, regularizers

model2 = Sequential()

# add L2 weight regularization to logistic regression
regularizer = regularizers.l2(0.000001)
model2.add(Dense(units=1, activation='sigmoid', input_shape=(input_size,), kernel_regularizer=regularizer))

# Init optimizer
opt = optimizers.Adam() 
model2.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
history_reg = model2.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])

plt.plot(history_reg.history['accuracy'])
plt.plot(history_reg.history['val_accuracy'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train orig', 'dev orig', 'train reg', 'dev reg'], loc='lower right')
plt.show()