# Deep Learning for Natural Language Processing: Exercise 02
Moritz Eck (14-715-296)<br/>
University of Zurich

Please see the section right at the bottom of this notebook for the discussion of the results as well as the answers to the exercise questions.

### Mount Google Drive (Please do this step first => only needs to be done once!)

This mounts the user's Google Drive directly.

On my personal machine inside the Google Drive folder the input files are stored in the following folder:<br/> 
**~/Google Drive/Colab Notebooks/ex02/**

Below I have defined the default filepath as **default_fp = 'drive/Colab Notebooks/ex02/'**.<br/>
Please change the filepath to the location where you have the input file and the glove embeddings saved.

In [None]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse

from google.colab import auth
auth.authenticate_user()

from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()

import getpass

!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

**Mount Google Drive**

In [None]:
!mkdir -p drive
!google-drive-ocamlfuse drive

### Install the required packages

In [None]:
!pip install pandas==0.23.4
!pip install numpy 
!pip install scikit-learn==0.20
!pip install tensorflow
!pip install keras

### Check that the GPU is used

In [None]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()

if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')

print('Found GPU at: {}'.format(device_name))

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

### Preprocessing TED Talks File
- Run Once

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict

# fix random seed for reproducibility
seed = 200
np.random.seed(seed)  # seed() returns None, so `seed` must not be reassigned

# default inputs filepath
default_fp = 'drive/Colab Notebooks/ex02/'
ted_file = default_fp + './inputs/ted_en-20160408.xml'

# constants
lines = []
talks = {}
relevant = False
talk = ""
key = ""
talk_count = 0

print("Preprocessing TED Talks...")

with open(ted_file, 'r', encoding='utf-8') as freader:
    lines = freader.readlines()

for line in lines:
    # determine according to the keywords if the talk is relevant
    if '<keywords' in line:
        key = ""

        if 'technology' in line:
            key += "T"
        else:
            key += 'x'          

        if 'entertainment' in line:
            key += "E"
        else:
            key += 'x'  

        if 'design' in line:
            key += 'D'
        else:
            key += 'x'

        if key != 'xxx':
            relevant = True
            continue
        else:
            relevant = False
            continue

    if not relevant:
        continue
    
    # start reading the content
    if '<transcription>' in line:
        talk = ""

    # append each line of the transcript
    elif '<seekvideo' in line:
        start = line.find('>') + 1
        end = line.rfind('<') 
        talk += line[start:end] + " "

    # end of a talk
    elif '</transcription>' in line: 
        # store each talk with key and content
        talks[talk_count] = [key, talk]
        talk_count += 1
        relevant = False

# transform the dict into a dataframe
df = pd.DataFrame.from_dict(talks, orient='index')
df = df.rename(index=str, columns={0:'label', 1:'talk'})
df = df.reset_index().drop(columns=['index'])
print(df.describe())

# link to pretrained embeddings tutorial
# https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
# https://fasttext.cc/docs/en/english-vectors.html
# https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
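
The label encoding used in the loop above can be isolated into a small helper. This is a minimal sketch; the function name `encode_label` is an assumption made for illustration and is not part of the notebook. Like the loop, it relies on plain substring matching, so a keyword such as "biotechnology" would also count as "technology".

```python
def encode_label(keywords_line):
    """Map a <keywords> line to a three-character label such as 'TxD'."""
    label = ""
    # one position per topic: the letter if the keyword occurs, otherwise 'x'
    for keyword, letter in (("technology", "T"),
                            ("entertainment", "E"),
                            ("design", "D")):
        label += letter if keyword in keywords_line else "x"
    return label


# hypothetical example line, not taken from the data set
example = '<keywords name="keywords">talks, technology, design</keywords>'
print(encode_label(example))  # TxD
```

A talk is kept only if the resulting label differs from `'xxx'`.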

### Preprocessing Pre-Trained Word Embeddings
- Run Once
- To change the embedding dimension, change the filepath to the respective GloVe file (50d/100d/200d/300d) and re-run. `EMBEDDING_DIM` in the Keras cell has to match the chosen file.

In [None]:
print('Indexing word vectors.')

glove_file = default_fp + './glove/glove.6B.100d.txt'

embeddings_index = {}

with open(glove_file, encoding='utf-8') as freader:
    for line in freader:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
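
Because the embedding dimension is encoded only in the file name, a mismatch with `EMBEDDING_DIM` is easy to miss. The following sketch shows how a single GloVe line could be parsed and checked; `parse_glove_line` and its `expected_dim` parameter are hypothetical names, and plain Python floats are used instead of NumPy arrays to keep the example self-contained.

```python
def parse_glove_line(line, expected_dim=None):
    """Split one line of a GloVe text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    word, values = parts[0], parts[1:]
    # optional sanity check against the dimension the model expects
    if expected_dim is not None and len(values) != expected_dim:
        raise ValueError(
            "expected %d values for %r, found %d" % (expected_dim, word, len(values))
        )
    return word, [float(v) for v in values]


# made-up 3-dimensional line for illustration
word, vector = parse_glove_line("the 0.1 -0.2 0.3", expected_dim=3)
print(word, vector)
```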

### Basic Sklearn MLP


In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, LabelBinarizer, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.utils.class_weight import compute_class_weight

# get the preprocessed data
df = df.copy()

# compute the class weight
class_weights = compute_class_weight('balanced', np.unique(df['label']), df['label'])

for label, weight in zip(np.unique(df['label']), class_weights):
    print("label: {} -> weight: {}".format(label, weight))

# split in training and test set
train = df.sample(frac=0.8, random_state=seed)
test = df.drop(train.index)

# for training
y_train = train['label']
x_train = train.drop('label', axis=1)
x_train = x_train['talk'].values

# for testing
y_test = test['label']
x_test = test.drop('label', axis=1)
x_test = x_test['talk'].values

print('Training samples shape: ', x_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test samples shape: ', x_test.shape)
print('Test labels shape: ', y_test.shape)

# encode the label
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
print(label_encoder.classes_)

# transform the talks
tfvect = TfidfVectorizer(ngram_range=(1,3), max_df=0.99, max_features=None)
tfvect.fit(x_train)
x_train = tfvect.transform(x_train)
x_test = tfvect.transform(x_test)

# standardize the data
stand = StandardScaler(with_mean=False, with_std=True)
stand.fit(x_train)
x_train = stand.transform(x_train)
x_test = stand.transform(x_test)

# setup base mlp classifier
mlp = MLPClassifier(early_stopping=True, validation_fraction=0.2, random_state=seed, batch_size='auto', 
                    max_iter=200, n_iter_no_change=15, learning_rate='adaptive', verbose=True)

# set up parameter grid to evaluate over
param_grid = dict(hidden_layer_sizes=[(100,100,100), (100,100), (100,)], 
                  solver = ['adam', 'sgd'], activation=['tanh', 'relu'], alpha=[0.001, 0.0001])

# train mlp classifier using randomized grid search
gs_mlp = RandomizedSearchCV(mlp, param_grid, n_iter=5, cv=5, n_jobs=4, verbose=True, refit=True)
gs_mlp.fit(x_train, y_train)

# print the best parameters of the evaluation
print(gs_mlp.best_params_)
print(gs_mlp.best_score_)

# predict the test label
y_pred = gs_mlp.predict(x_test)

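# print the accuracy and the confusion matrix
# note: with the arguments in this order (y_pred first), the rows of the
# confusion matrix are the predicted labels and the columns the true labels
print(accuracy_score(y_pred, y_test))
print(confusion_matrix(y_pred, y_test))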
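
scikit-learn's `confusion_matrix` expects `(y_true, y_pred)` and puts the true labels on the rows. Passing the predictions first, as in the cell above, transposes the matrix, while the accuracy is unaffected. The pure-Python sketch below illustrates this with made-up labels; the helper `confusion` is an assumption for illustration and not the scikit-learn implementation.

```python
def confusion(first, second, n_classes):
    """Count label pairs: rows are indexed by `first`, columns by `second`."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(first, second):
        matrix[a][b] += 1
    return matrix


# toy labels, not taken from the TED data
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]

rows_true = confusion(y_true, y_pred, 2)  # conventional orientation
rows_pred = confusion(y_pred, y_true, 2)  # argument order used in the cell above

print(rows_true)  # [[1, 1], [0, 3]]
print(rows_pred)  # [[1, 0], [1, 3]]
```

The second matrix is the transpose of the first, which matters when reading off which classes are over-predicted.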

### Train Keras Model
Run this cell once, and again whenever the model input parameters have been changed:
- e.g. changing input/embedding dimensions, MAX_NUM_WORDS, MAX_SEQUENCE_LENGTH

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model, Sequential
from keras.initializers import Constant
from keras.regularizers import l2
from keras.optimizers import SGD, Adam, Adagrad
from keras.callbacks import EarlyStopping

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, accuracy_score

# constants
MAX_SEQUENCE_LENGTH = 2000
MAX_NUM_WORDS = 10000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

# input data
texts = df['talk'].values
labels = df['label'].values

# compute the class weight
classes = np.unique(df['label'])
class_weights = compute_class_weight('balanced', classes, df['label'])
class_weights = {label:weight for label, weight in zip(range(len(classes)), class_weights)}

# encode the label as numerical value
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labels)

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, 
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', 
                      lower=True, split=' ', char_level=False)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# one-hot encode the labels
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training, validation and test set
x_train, x_test, y_train, y_test = train_test_split(data, labels, stratify=labels, 
                                                    test_size=VALIDATION_SPLIT, shuffle=True)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, stratify=y_train, 
                                                  test_size=VALIDATION_SPLIT, shuffle=True)

print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
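
The `pad_sequences` call earlier in this cell relies on the Keras defaults `padding='pre'` and `truncating='pre'`: short talks are filled with zeros at the front, and talks longer than `MAX_SEQUENCE_LENGTH` keep only their last tokens. The sketch below mimics that behaviour for a single sequence; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pad or truncate at the front so the result has exactly `maxlen` items."""
    truncated = sequence[-maxlen:]  # keep the last `maxlen` tokens
    return [value] * (maxlen - len(truncated)) + truncated


print(pad_pre([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With `MAX_SEQUENCE_LENGTH = 2000`, this means the model sees the end of a long talk rather than its beginning.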

### Model Training
- Change the parameters
- Rerun with new configurations

In [None]:
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model, Sequential
from keras.initializers import Constant
from keras.regularizers import l2
from keras.optimizers import SGD, Adam, Adagrad
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, accuracy_score

# load pre-trained word embeddings into an embedding layer
# if trainable = False, it will keep the embeddings fixed
embedding_layer = Embedding(input_dim=num_words, output_dim=EMBEDDING_DIM, 
                            embeddings_initializer=Constant(embedding_matrix), 
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)

print('Training model.')

# create a NN using a pre-trained embedding layer
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# apply a 1D convolution over the embedded sequence, then max-pool over all positions
x = Conv1D(filters=128, kernel_size=10, padding='same', activation='relu')(embedded_sequences)
# x = Flatten()(embedded_sequences)
x = GlobalMaxPooling1D()(x)

# hidden layers
x = Dense(128, activation='relu', kernel_regularizer=l2(0.1))(x)
x = Dense(128, activation='relu', kernel_regularizer=l2(0.2))(x)
x = Dense(128, activation='relu', kernel_regularizer=l2(0.3))(x)

# output layer with softmax activation
preds = Dense(len(classes), activation='softmax')(x)

# different optimizers
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 
adagrad = Adagrad()
adam = Adam()

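# NOTE: the layers above are created only once, so every model built in this loop
# shares the same weights: each optimizer continues from the weights left by the
# previous one. Rebuild the layers inside the loop for an independent comparison.
for opt in [adam, adagrad, sgd]:
    # build, compile and print summary of the current model
    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['acc', 'mae'])
    model.summary()
    
    # stop training if the monitored metric stops improving for `patience` epochs
    callbacks = [EarlyStopping(patience=25, monitor='val_acc')]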

    # fit the model
    model.fit(x_train, y_train, batch_size=128, epochs=100, 
              validation_data=(x_val, y_val), callbacks=callbacks,
              class_weight=class_weights)

    # evaluate the model on the test set
    y_pred = model.predict(x_test)
    
    # convert one hot back to categorical
    y_pred = np.argmax(y_pred, axis=1)
    y_test_ = np.argmax(y_test, axis=1)
    
    # evaluate the model using the test set
    # (y_pred is passed first, so rows = predicted labels, columns = true labels)
    print(accuracy_score(y_pred, y_test_))
    print(label_encoder.classes_)
    print(class_weights)
    print(confusion_matrix(y_pred, y_test_))
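
The early-stopping rule used in the training loop can be sketched in plain Python. This is a simplified model of patience-based stopping (it ignores options such as `min_delta`, and the exact behaviour may differ between Keras versions); `early_stopping_epoch` is a hypothetical helper and the validation scores are made up.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 0-based epoch at which training stops, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            # improvement: remember it and reset the counter
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# the best score (0.60) is reached in epoch 1; two epochs without improvement follow
print(early_stopping_epoch([0.50, 0.60, 0.55, 0.58, 0.59], patience=2))  # 3
```

With `patience=25` and `epochs=100` as in the cell above, training only stops early if the validation accuracy fails to improve for 25 consecutive epochs.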

## Answers to Questions Ex02

#### What happens if you make the embeddings trainable? <br/>
- I have seen a tendency for the model to overfit when trained for a larger number of epochs (50+), compared to when the embedding matrix stays fixed.<br/>
- It's possible that the model starts to rely too much on certain features => **assumption:** maybe the ones fitted too closely to the training data => when dropout regularization is used, the result can be slightly improved and the training error doesn't converge as fast towards zero. <br/>
- In general the model required fewer neurons per layer, or at least this is what I was able to pick up from the results => models with fewer neurons but a trainable embedding matrix performed better than their counterparts with a fixed embedding matrix => this is only true up to a certain number of units (e.g., around 128 per layer).


#### Does it work better than your first MLP? <br/>
- The best result of the sklearn MLP (59.7% accuracy) and the best result of Keras (58.3% accuracy) are comparable. <br/> => For more results please see the next section (Results Ex02).
- The sklearn result was achieved using RandomizedSearchCV.
- The Keras model was more of an experiment => see the end of this answer section ("best Keras results"). Keras can get a lot better results when one or more convolutional steps are integrated. 


#### What happens if you change the number of hidden layers? <br/>
- The training error converges faster towards zero but the validation accuracy converges at the same speed. Sometimes using just one hidden layer provided good results.<br/>
- I was unable to bring the rate of convergence of the training error and the validation error to approx. the same speed. The training loss always decreased a lot faster and the validation accuracy rarely got above 60%.


#### Do you notice anything when you change the number of units per layer? 
- **Too few units** => underfitting: the model cannot learn any patterns => the result is even worse when the embeddings are fixed. 
- **Too many units** => the model learns the training set perfectly (training loss=0.0 & acc=1.0) => strong overfit to the training set; the model relies heavily on specifics of the data it has been trained on (even more so when the embeddings are trainable)


#### The best Keras result was achieved using: 
- optimizer: Adam() and activation function: relu()
- embedding layer (not trainable)
- GlobalMaxPooling1D() instead of Flatten() => **Note:** Flatten() had a tendency to produce out-of-memory exceptions in Google Colab when used with a large embedding dimension, number of words and max sequence length. Flatten() only worked up to 200 embedding dimensions using at most 2000 words and sequence length. Additionally, the max pooling layer reduced the number of parameters compared to Flatten() => **2nd assumption**: with Flatten() there might be too many trainable parameters, so that the result of the model is more or less random. 
- 1 hidden layer: 128 units, tanh activation, 0.005 kernel l2 regularization
- 1 output layer using softmax
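
To make the difference in parameter count between Flatten() and global max pooling concrete, the following sketch counts the weights of the first Dense layer in both cases. It assumes the notebook's constants (sequence length 2000, embedding dimension 100), a first hidden layer of 128 units, and that the pooling is applied directly to the embedding output without a convolution in between; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


MAX_SEQUENCE_LENGTH = 2000
EMBEDDING_DIM = 100
UNITS = 128  # assumed size of the first hidden layer

flatten_features = MAX_SEQUENCE_LENGTH * EMBEDDING_DIM  # every position kept
pooled_features = EMBEDDING_DIM                         # max over all positions

print(dense_params(flatten_features, UNITS))  # 25600128
print(dense_params(pooled_features, UNITS))   # 12928
```

Under these assumptions the Flatten() variant has roughly 2000 times as many trainable weights in its first hidden layer.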

For the three different Keras optimizers listed above the following network setups have been tested:
```
# create a NN using a pre-trained embedding layer
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# flatten the embedding matrix to use it in a feed forward nn
x = Flatten()(embedded_sequences)
# x = GlobalMaxPooling1D() => used instead of Flatten() sometimes provided better results

# hidden layers - different combinations of layers and dropouts
# units: range of 10, 128, 256, 512
# dropout: 0.1, 0.3, 0.5
# l2 reg: different levels 0.001-0.01

# example below
x = Dense(128, activation='tanh', kernel_regularizer=l2(0.005))(x)
x = Dropout(0.2)(x)
x = Dense(128, activation='tanh', kernel_regularizer=l2(0.005))(x)

# output layer with softmax activation
preds = Dense(len(classes), activation='softmax')(x)
```



## Results Ex02

#### Task 1: Using Sklearn MLP
- Default Settings of TfidfVectorizer: ngram_range=(1, 1), max_df=1.0, max_features=None.<br/>
best parameters: {'solver': 'adam', 'hidden_layer_sizes': (100, 100), 'alpha': 0.001, 'activation': 'relu'} <br/>
**validation score: 0.5796344647519582, test score: 0.5968586387434555**

- Settings of TfidfVectorizer: ngram_range=(1,4), max_df=0.99, max_features=3000<br/>
best parameters: {'solver': 'sgd', 'hidden_layer_sizes': (100,), 'alpha': 0.001, 'activation': 'tanh'}<br/>
**validation score: 0.5039164490861618, test score: 0.5078534031413613**

```
labels: ['TED' 'TEx' 'TxD' 'Txx' 'xED' 'xEx' 'xxD']
[[ 0  0  0  1  0  0  1]
 [ 0  0  1  0  0  1  0]
 [ 1  0  7  3  0  1  3]
 [ 3  5 26 59  1  8 13]
 [ 0  0  0  0  0  0  1]
 [ 1  1  1  5  2 16  7]
 [ 1  1  2  3  0  1 15]]
```
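
The reported test score can be recomputed from the confusion matrix above: the accuracy is the sum of the diagonal divided by the number of test samples. The sketch below also derives the per-class recall, assuming that the rows are predicted labels and the columns true labels (which follows from the argument order used in the code cell, but is an assumption about how this particular matrix was produced).

```python
labels = ['TED', 'TEx', 'TxD', 'Txx', 'xED', 'xEx', 'xxD']
matrix = [
    [0, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 0],
    [1, 0, 7, 3, 0, 1, 3],
    [3, 5, 26, 59, 1, 8, 13],
    [0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 5, 2, 16, 7],
    [1, 1, 2, 3, 0, 1, 15],
]


def accuracy(matrix):
    """Correct predictions (diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Diagonal entry divided by the column sum (columns = true labels)."""
    recalls = []
    for j in range(len(matrix)):
        column_sum = sum(row[j] for row in matrix)
        recalls.append(matrix[j][j] / column_sum if column_sum else 0.0)
    return recalls


print(round(accuracy(matrix), 4))  # 0.5079
for label, recall in zip(labels, recall_per_class(matrix)):
    print(label, round(recall, 3))
```

The per-class values show how strongly the overall accuracy is carried by the majority class 'Txx'.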

- Settings of TfidfVectorizer: ngram_range=(1,2), max_df=0.99, max_features=5000<br/>
best parameters: {'solver': 'adam', 'hidden_layer_sizes': (100, 100, 100), 'alpha': 0.001, 'activation': 'tanh'}<br/>
**validation score: 0.5691906005221932, test score: 0.581151832460733**

```
labels: ['TED' 'TEx' 'TxD' 'Txx' 'xED' 'xEx' 'xxD']
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 3  5  8  7  0  2  9]
 [ 1  2 16 67  0  7  6]
 [ 0  0  0  0  0  0  0]
 [ 2  4  3  0  2 21  4]
 [ 0  0  4  2  0  1 15]]
```

#### Task 2: Using Keras
- **Experiment 1: with no class weights or sampling** => only instances of the majority classes (those with the most instances in the data set) are classified correctly. Almost all others are classified incorrectly => the model predicts most talks to be of class 'Txx' <br/>
Optimizer: **SGD**, Best Result: **0.453125**
 
```
labels: ['TED' 'TEx' 'TxD' 'Txx' 'xED' 'xEx' 'xxD']
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 2  0  3  4  0  0  0]
 [ 3  1 28 58  2  7 29]
 [ 0  0  0  0  0  0  0]
 [ 0  5  2  4  6 23  5]
 [ 1  1  2  2  1  0  3]]
```

 - **Experiment 2: Using the following class_weights**: {0: 4.0210084033613445, 1: 3.7976190476190474, 2: 0.9428571428571428, 3: 0.35883014623172105, 4: 5.696428571428571, 5: 0.7902559867877786, 6: 0.8336236933797909}. The class_weights can be thought of as multiplicative factors that determine how strongly each category is weighted up or down during training, similar to up/down sampling.<br/> 
 => **Conclusion:** The model is still very focussed on the majority class even though the result is slightly better
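
For reference, scikit-learn's `'balanced'` heuristic computes each weight as `n_samples / (n_classes * count(class))`, so rare classes receive weights above 1 and frequent classes weights below 1. The sketch below reproduces this formula in plain Python; `balanced_class_weights` is a hypothetical helper and the toy label list is made up, not the TED label distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count) for every class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# toy distribution: one frequent, one medium and one rare class
toy_labels = ["Txx"] * 6 + ["xxD"] * 3 + ["TED"] * 1

for label, weight in balanced_class_weights(toy_labels).items():
    print(label, round(weight, 3))
```

In Keras, such weights scale each sample's contribution to the loss; the data itself is not resampled.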


Optimizer: **Adam**, Best Result: **0.5833333333333334**

```
labels: ['TED' 'TEx' 'TxD' 'Txx' 'xED' 'xEx' 'xxD']
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  1  0  0  0]
 [ 0  0  0  2  0  0  1]
 [ 3  1 16 75  1  5 13]
 [ 0  0  0  0  0  0  0]
 [ 3  1  7  4  4 22  3]
 [ 1  1  7  4  0  2 15]]
```

Optimizer: **Adagrad**, Best Result: **0.578125**

```
labels: ['TED' 'TEx' 'TxD' 'Txx' 'xED' 'xEx' 'xxD']
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  1  0  0  0]
 [ 0  0  0  4  0  0  2]
 [ 4  1 14 74  1  6 11]
 [ 0  0  0  0  0  0  0]
 [ 3  1  9  5  3 21  3]
 [ 0  1  7  2  1  2 16]]
```