<center>
<h1>Deep Learning with Python</h1>
<h2>Emotions text classification </h2>
<h2>Iro-Georgia Malta</h2>
</center>

<img src="https://www.seekpng.com/png/detail/88-884850_inside-joy-sadness-inside-out-pixel-art.png"
     alt="Pixel art of Joy and Sadness from Inside Out"
     style="float: center; margin-right: 10px;" />

## About the task:
The aim of this project is to build a **multi-class classification model** trained on tweets that each convey one of the following emotions: joy, sadness, anger or fear. The task is also a single-label classification, since each sample carries exactly one label (emotion). The dataset used for this project is "Emotion Classification NLP", which can be found on Kaggle (https://www.kaggle.com/datasets/anjaneyatripathi/emotion-classification-nlp?select=emotion-labels-train.csv). Identifying emotions in data (e.g., tweets, articles, reviews) has become an integral part of many NLP and Data Science tasks such as text classification, sentiment analysis or automatic summarization. Additionally, analyzing the emotions expressed in a text can improve the performance of NLP systems that predict the context or the intent of a text. For these reasons, I decided to build and train a neural model on this specific dataset.
<br>
For this project, besides building a multi-class classification model, I will use the **One-vs-Rest** strategy to find out which approach (one multi-class classification model vs. several binary classification models) achieves higher **accuracy** and **F-measure**, and why it makes better predictions for this dataset. Based on the models' results, some final conclusions are drawn at the end of the project.
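
To make the One-vs-Rest idea concrete, here is a minimal sketch of how multi-class labels could be turned into one binary target vector per emotion. The label values and the helper name `one_vs_rest_targets` are illustrative assumptions, not part of the project code.

```python
import numpy as np

# Hypothetical integer class labels for six tweets (four emotion classes, 0-3)
labels = np.array([0, 1, 2, 3, 1, 2])

def one_vs_rest_targets(labels, positive_class):
    """Return 1 for samples of `positive_class` and 0 for all other classes."""
    return (labels == positive_class).astype(int)

# One binary target vector per class; each would train its own binary classifier
binary_targets = {c: one_vs_rest_targets(labels, c) for c in range(4)}
print(binary_targets[1])
```

Each binary model then answers "is this tweet emotion *c* or not?", and the class whose model is most confident would be chosen at prediction time.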

## Import modules and set seed:
In this section, I import the modules that are used throughout the whole project. Modules or libraries that are only needed at a specific point of the project are imported there instead. Additionally, I set a seed in order to make the results reproducible. The function **set_seeds()** must be called before training each model.

In [None]:
# import necessary modules

import keras
import numpy as np
import tensorflow as tf
import random as python_random

In [None]:
def set_seeds():
   np.random.seed(123) 
   python_random.seed(123)
   tf.random.set_seed(1234)

set_seeds()

## 1. Data Processing:

Before I start building the model(s), the dataset needs to be loaded and processed so that it can be fed into the neural network(s). To load the dataset and extract the data I use the **pandas** library:

In [None]:
# import pandas library for extracting data
import pandas as pd

The dataset consists of three separate **CSV files**: train, validation and test dataset. The data from the CSV files are loaded and assigned to the following variables:

In [None]:
# read_csv() method from pandas library

emotions_train_data = pd.read_csv("/compLing/students/courses/deepLearning/finalProject23/iro.malta/emotion-labels-train.csv") # train data
emotions_val_data = pd.read_csv("/compLing/students/courses/deepLearning/finalProject23/iro.malta/emotion-labels-val.csv") # validation data
emotions_test_data = pd.read_csv("/compLing/students/courses/deepLearning/finalProject23/iro.malta/emotion-labels-test.csv") # test data

The content of the three datasets must be inspected in order to know which information needs to be extracted from them. The **head()** function shows that the datasets contain **two columns**: text and label.

In [None]:
# visualize train data

emotions_train_data.head()

In [None]:
# visualize validation data

emotions_val_data.head()

In [None]:
# visualize test data

emotions_test_data.head()

As an extra step, I iterate through the columns to confirm the **column labels** (text, label) and **the number of columns** (2):

In [None]:
for col in emotions_train_data.columns:
    print(col)

## 2. Label Encoding:

Each sample in the three datasets is assigned one of the following labels: **joy, sadness, anger and fear**. These string labels cannot be used directly by a neural network, so I convert them to classes **0-3**. To encode the labels I use **preprocessing.LabelEncoder** from **scikit-learn**.

In [None]:
# find the unique elements of the column 'label' in the datasets

emotions_train_data['label'].unique() # train data

In [None]:
emotions_val_data['label'].unique() # validation data

In [None]:
emotions_test_data['label'].unique() # test data

All three datasets contain the labels: **joy, fear, anger and sadness**. An additional step of exploring the data of the column 'label' is to **count the instances of each label** in the datasets. For this reason, I use the **value_counts()** function on the datasets:

In [None]:
count_labels_train = emotions_train_data['label'].value_counts()
print(count_labels_train)

In [None]:
count_labels_val = emotions_val_data['label'].value_counts()
print(count_labels_val)

In [None]:
count_labels_test = emotions_test_data['label'].value_counts()
print(count_labels_test)

The **value_counts()** function returns the number of instances of each label in **descending order**. The label **'fear'** is the most frequent, the label **'anger'** comes second, and **'joy'** and **'sadness'** follow. The label counts also show that the train dataset and the test dataset have more samples than the validation dataset.
<br>
Now, I import the **preprocessing.LabelEncoder** from **scikit-learn** to encode the labels to classes **0-3**:

In [None]:
# import label encoder from scikit-learn

from sklearn import preprocessing

I create an extra column with the title **'label_class'** in all three datasets. In this way, I can associate the classes **0-3** with each emotion:

In [None]:
label_encoder = preprocessing.LabelEncoder() # create the LabelEncoder object

# create column 'label_class' and encode the emotion labels of column 'label'
# fit the encoder on the train data only and reuse the same mapping for validation and test data
emotions_train_data['label_class'] = label_encoder.fit_transform(emotions_train_data['label']) # train data
emotions_val_data['label_class'] = label_encoder.transform(emotions_val_data['label']) # validation data
emotions_test_data['label_class'] = label_encoder.transform(emotions_test_data['label']) # test data

Now, column **'label_class'** has the classes **0-3** as unique elements in all datasets:

In [None]:
# check the unique elements of the column 'label_class'

emotions_train_data['label_class'].unique() # train data

In [None]:
emotions_val_data['label_class'].unique() # validation data

In [None]:
emotions_test_data['label_class'].unique() # test data

The extra column **'label_class'** can be found now in the datasets. Here, I check the extra column in the train dataset:

In [None]:
emotions_train_data.head() # train data

Now, I use the **value_counts()** function on the three datasets to associate the emotion labels with the label classes. The function returns the associated pairs as well as their counts in the dataset:

In [None]:
emotions_train_data[['label', 'label_class']].value_counts() # train data

In [None]:
emotions_val_data[['label', 'label_class']].value_counts() # validation data

In [None]:
emotions_test_data[['label', 'label_class']].value_counts() # test data

The associated pairs between the emotions and the classes are: **anger - 0, fear - 1, joy - 2 and sadness - 3**. 
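
The mapping above follows from the fact that `LabelEncoder` numbers the unique labels in sorted order. The following plain-Python sketch reproduces that behaviour on a small, made-up list of labels; it is only an illustration of the mechanism, not a replacement for the encoder.

```python
# Emotion labels in arbitrary order (hypothetical sample, not the real dataset)
labels = ["joy", "fear", "anger", "sadness", "fear", "joy"]

# Sort the unique labels and number them in that order
classes = sorted(set(labels))
label_to_class = {label: index for index, label in enumerate(classes)}
encoded = [label_to_class[label] for label in labels]

print(label_to_class)  # {'anger': 0, 'fear': 1, 'joy': 2, 'sadness': 3}
print(encoded)         # [2, 1, 0, 3, 1, 2]
```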

## 3. Conversion of data and labels into numerical formats:

I extract the necessary information from the columns **text** and **label_class** in the three datasets. I put the extracted samples and labels into lists:

In [None]:
# train data
emotions_train_list = emotions_train_data['text'].tolist()
emotions_train_labels = emotions_train_data['label_class'].tolist()

# validation data
emotions_val_list = emotions_val_data['text'].tolist()
emotions_val_labels = emotions_val_data['label_class'].tolist()

# test data
emotions_test_list = emotions_test_data['text'].tolist()
emotions_test_labels = emotions_test_data['label_class'].tolist()

Now, I calculate the **average sentence length** (the mean of sentences length) in the train dataset:

In [None]:
# the mean of sentences length in the train data

sentence_length = []

for l in emotions_train_list:
    sentence_length.append(len(l.split(' ')))
    
sentence_mean = np.mean(sentence_length) # the mean is 16
print(sentence_mean)

Additionally, I plot a histogram of the sentence lengths (axis label **Max_length**) and **their counts** in the train dataset; I also plot **the mean sentence length**. The **Matplotlib** library is used for the histogram.

In [None]:
# import matplotlib to visualize max length of sentences

import matplotlib.pyplot as plt

x = np.array(sentence_length) # convert the list to numpy array

plt.hist(x, color="skyblue", ec="white", lw=1, density=False, bins=20) # density=False for counts
plt.ylabel('Counts')
plt.xlabel('Max_length')
plt.axvline(x.mean(), color='k', linestyle='dashed', linewidth=1) # plot the mean of x

The histogram shows that the longest sentences are around **32** words. It also shows that **most sentences** have a length between **16** and **24**. The **mean sentence length** is plotted close to **16**, which is consistent with the previously calculated mean.
<br>
Additionally, I calculate **the number of sentences** for each length in the train dataset to see which sentence length is the most frequent:

In [None]:
sentence_count = {}

for c in sentence_length:
    if c in sentence_count.keys():
        sentence_count[c]+=1
    else:
        sentence_count[c]=1

print(sentence_count)

From the counts shown above, most sentences have a length between **16** and **24**. Length **19** is the most frequent, with **203** sentences. There is only one sentence of length **58**, so it is not visible in the histogram.
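
The same counting can be written more compactly with `collections.Counter` from the standard library. This is a small sketch on made-up lengths; in the notebook the input would be the `sentence_length` list.

```python
from collections import Counter

# Hypothetical sentence lengths (in words)
lengths = [16, 19, 19, 24, 19, 16, 58]

length_counts = Counter(lengths)

# most_common(1) returns a list with the single most frequent (length, count) pair
most_common_length, count = length_counts.most_common(1)[0]
print(most_common_length, count)  # 19 3
```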
<br>
I also check whether the sentence lengths in the dataset follow a normal distribution, in order to determine the max length for sequence padding more accurately. For this, I use the normal **probability density function** (**norm.pdf**) from **scipy.stats**:

In [None]:
from scipy.stats import norm

x, bins, y = plt.hist(sentence_length, 20, density=True) # density=True for probability density
mu = np.mean(sentence_length) # mean
sigma = np.std(sentence_length) # SD
plt.ylabel('Probability_density')
plt.xlabel('Max_length')
plt.plot(bins, norm.pdf(bins, mu, sigma)) # plot normal probability density

According to the histogram above, many sentences are longer than **16** words. Therefore, truncating the sentences at length **16** might result in losing valuable information. For this reason, I decide to truncate the tweets at **30** words so that more data is included in the training process of the model.
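
One way to back up the choice of a cut-off is to measure how many sentences fit into a candidate length without truncation. The sketch below uses made-up lengths and a hypothetical helper `coverage`; with the real data, `sentence_length` would be passed instead.

```python
# Hypothetical sentence lengths (in words)
lengths = [5, 12, 16, 19, 19, 22, 24, 28, 31, 58]

def coverage(lengths, max_length):
    """Fraction of sentences that fit into `max_length` words without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

for candidate in (16, 24, 30):
    print(candidate, coverage(lengths, candidate))
```

A cut-off that covers most sentences loses little information while keeping the padded sequences short.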
<br>
Before **sequence padding**, the text data is **tokenized**. I only consider the **10,000** most frequent words, so I set the parameter *num_words* to *10000* in the **Tokenizer()**:

In [None]:
# import tokenizer and padding

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from tensorflow.keras.utils import to_categorical

In [None]:
# truncate the tweets after 30 words
max_length = 30

# consider 10,000 most frequent words for tokenization
max_words = 10000

# tokenize the datasets and set num_words parameter = max_words
tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(emotions_train_list) # train data
tokenizer.fit_on_texts(emotions_val_list) # validation data
tokenizer.fit_on_texts(emotions_test_list) # test data

sequences_train = tokenizer.texts_to_sequences(emotions_train_list) # train data
sequences_val = tokenizer.texts_to_sequences(emotions_val_list) # validation data
sequences_test = tokenizer.texts_to_sequences(emotions_test_list) # test data

# find number of unique tokens
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Now that the sentences from the three datasets are tokenized, the **sequence padding** process can be performed on the datasets:

In [None]:
# pad sequences to the same length, max_length = 30
train_np_data = pad_sequences(sequences_train, maxlen = max_length) # train data
print('Shape of train data tensor:', train_np_data.shape)

val_np_data = pad_sequences(sequences_val, maxlen = max_length) # validation data
print('Shape of validation data tensor:', val_np_data.shape)

test_np_data = pad_sequences(sequences_test, maxlen = max_length) # test data
print('Shape of test data tensor:', test_np_data.shape)
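
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` fills short sequences with zeros at the front and, for long sequences, keeps the last `maxlen` tokens. The plain-Python sketch below mimics that default for a single sequence; the helper name `pad_or_truncate` is an illustrative assumption.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence:
    keep the last `maxlen` tokens, then left-pad with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_or_truncate([4, 8, 15], maxlen=5))              # [0, 0, 4, 8, 15]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # [3, 4, 5, 6, 7]
```

Note that under these defaults a tweet longer than 30 tokens loses its first words, not its last ones.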

The sentences of the datasets are now converted into appropriate **numerical formats**. At this point, I convert the **class labels** into numpy arrays and then, into **one-hot encoding** format:

In [None]:
# convert the labels into numpy arrays and then, into one-hot encoding

labels_one_hot_train = np.asarray(emotions_train_labels) # train labels
labels_one_hot_train = to_categorical(labels_one_hot_train)
print('Shape of label tensor:', labels_one_hot_train.shape)

labels_one_hot_val = np.asarray(emotions_val_labels) # validation labels
labels_one_hot_val = to_categorical(labels_one_hot_val)
print('Shape of label tensor:', labels_one_hot_val.shape)

labels_one_hot_test = np.asarray(emotions_test_labels) # test labels
labels_one_hot_test = to_categorical(labels_one_hot_test)
print('Shape of label tensor:', labels_one_hot_test.shape)
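
As an illustration of what one-hot encoding produces, the following NumPy sketch builds the same kind of matrix by indexing an identity matrix. The label values are made up for the example.

```python
import numpy as np

class_labels = np.array([2, 0, 3, 1])  # hypothetical encoded emotions
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[class_labels]
print(one_hot)

# argmax along axis 1 recovers the original class labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```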

For the model's training process, I shuffle the text data of the **train dataset** as well as the corresponding labels:

In [None]:
# shuffle the train dataset

indices = np.arange(train_np_data.shape[0]) 
np.random.shuffle(indices)
train_np_data = train_np_data[indices] # train data
labels_one_hot_train = labels_one_hot_train[indices] # train labels

Now that the tweets as well as the class labels are converted into appropriate numerical formats, they are assigned to new variables which will be used later for the model's training and testing processes:

In [None]:
# train data and labels
x_train = train_np_data
y_train = labels_one_hot_train

# validation data and labels
x_val = val_np_data
y_val = labels_one_hot_val

# test data and labels
x_test = test_np_data
y_test = labels_one_hot_test

In [None]:
# check that the number of samples matches the number of rows in the original datasets

len(x_train) + len(x_val) + len(x_test) == len(emotions_train_data) + len(emotions_val_data) + len(emotions_test_data)

In [None]:
# check that the number of labels matches the number of rows in the original datasets

len(y_train) + len(y_val) + len(y_test) == len(emotions_train_data) + len(emotions_val_data) + len(emotions_test_data)

## 4. Word Embeddings

For the models' training process I use pre-trained GloVe embeddings, which are trained on **Twitter data** and represented as **200**-dimensional vectors. The embeddings file is **'glove.twitter.27B.200d.txt'** and it can be found on this website (https://nlp.stanford.edu/projects/glove/).
<br>
First, I read in the file containing the embeddings and then, I pre-process the embeddings so that they can be loaded into the models:

In [None]:
import os 

glove_path = '/compLing/students/courses/deepLearning/finalProject23/iro.malta/'


embeddings_index = {}

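# the file contains non-ASCII tokens, so the encoding is set explicitly
with open(os.path.join(glove_path, 'glove.twitter.27B.200d.txt'), encoding='utf-8') as file_txt:
    for line in file_txt:
        values = line.split()
        word = values[0] # the token
        coefs = np.asarray(values[1:], dtype='float32') # its 200 coefficients
        embeddings_index[word] = coefs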

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
embedding_dim = 200 # 200d vectors

embedding_matrix = np.zeros((max_words, embedding_dim)) 
for word, i in word_index.items(): # iterate through the tokens 
    embedding_vector = embeddings_index.get(word) # return the value of the key
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
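
It can be useful to know how many vocabulary words actually receive a pre-trained vector, since the remaining rows stay all-zero. The sketch below repeats the matrix construction on toy stand-ins for `word_index` and `embeddings_index` (all names prefixed with `toy_` are hypothetical) and counts hits and misses.

```python
import numpy as np

# Toy stand-ins for the tokenizer vocabulary and the GloVe index
toy_word_index = {"happy": 1, "sad": 2, "lol": 3, "zzzqx": 4}
toy_embeddings = {"happy": np.ones(3), "sad": np.full(3, 2.0), "lol": np.full(3, 3.0)}
toy_max_words, toy_dim = 5, 3

matrix = np.zeros((toy_max_words, toy_dim))
found, missing = 0, []
for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # index falls outside the matrix, same cut-off as in the notebook
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)  # row stays all-zero
    else:
        matrix[i] = vector
        found += 1

coverage = found / (found + len(missing))
print(found, missing, coverage)
```

Row 0 is never filled because the tokenizer's word indices start at 1; index 0 is the padding value.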

## 5. Multi-class classification Model Setup, Training and Testing

Before I set up the models for training and testing, I import the necessary modules for **model setups, plots, evaluation metrics and confusion matrices**:

In [None]:
# import modules for building the models, plots, evaluation & confusion matrices

import itertools

import matplotlib.pyplot as plt
import sklearn
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM
from sklearn.metrics import f1_score, confusion_matrix, classification_report
%matplotlib inline

Additionally, I define the function **plot_confusion_matrix**, which plots a confusion matrix when called. The function was found on this website: https://deeplizard.com/learn/video/km7pxKy4UHU.

In [None]:
# define function to produce a confusion matrix

def plot_confusion_matrix(cm, classes,
                        normalize=False,
                        title='Confusion matrix',
                        cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

## 5.1. First multi-class classification model:
I decide to develop **LSTM models** for this task because they retain important information across a sequence and can learn long-term dependencies, which is useful in sequence prediction problems. Specifically, I try the following types of **LSTM models**: Vanilla and Stacked LSTM models with and without dropout.

## Vanilla LSTM models
I set up a **Sequential model** that contains an **Embedding layer** and one hidden **LSTM layer** with **128** units. On the **Embedding layer** the pre-trained embeddings are loaded. The model also contains a **Dense layer** as the output layer, and the layer has **4** output units since the model classifies **4** different types of emotions (anger, fear, joy and sadness). Additionally, I use the **softmax** activation function on the **Dense layer** because the **softmax** function is appropriate for multi-class classification problems with mutually exclusive classes such as this one.
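
To illustrate why softmax suits mutually exclusive classes, here is a minimal NumPy sketch of the function: it turns four raw output scores into probabilities that sum to 1. The score values are made up for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores for anger, fear, joy, sadness
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
print(probs, probs.sum())
```

The predicted emotion is the class with the highest probability, here the first one.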

In [None]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length = max_length))
model.add(LSTM(128))
model.add(Dense(4, activation='softmax'))
model.summary()

I load the GloVe matrix which was prepared into the **Embedding layer**. Additionally, I set the parameter **trainable** to **True** so that the pre-trained embeddings adapt to the specific training set. I also tried setting the parameter **trainable** to **False**, but the model's performance was worse. Thus, in all the models the parameter **trainable** of the pre-trained embeddings is set to **True**.

In [None]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = True

For the training process, I use the **RMSProp** algorithm for optimization, and I choose **categorical_crossentropy** as the loss function because the task has four mutually exclusive classes (0, 1, 2, 3) whose labels are one-hot encoded. Additionally, I choose accuracy as the metric to monitor the model's performance during training.
<br>
I fit the model to the training data and the training labels, I train the model over **5** epochs, and I group the data into batches of size **32**. The data for validation are also specified.
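
As a worked example of the loss named above, the following NumPy sketch computes categorical cross-entropy for two made-up predictions. It is a simplified illustration of the formula, not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two hypothetical samples with one-hot true labels
y_true = np.array([[0, 0, 1, 0], [1, 0, 0, 0]], dtype=float)
y_pred = np.array([[0.1, 0.1, 0.7, 0.1],        # fairly confident and correct
                   [0.25, 0.25, 0.25, 0.25]])   # uniform guess
loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

A uniform guess over four classes costs ln 4 ≈ 1.386 per sample, so the loss is not a percentage and can exceed 1.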

In [None]:
model.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history = model.fit(x_train, y_train, # training/ fitting the model
                    epochs=5,
                    batch_size=32,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test)

In [None]:
# plotting training/ validation accuracy and training/ validation loss

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
The **validation accuracy** and **validation loss** show that the model starts to overfit after epoch **3**. Thus, I would stop training the model at epoch **3**, where the **validation loss** is lowest. On the test data the model reaches a test accuracy of **~78%**; however, it performs better on the training data (~92%) than on the test data, which also signals overfitting. Thus, I decide to change the parameters in the second model with the aim of improving its performance.
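
The epoch with the lowest validation loss can also be read off programmatically. The sketch below uses made-up loss values; in the notebook the list would come from `history.history['val_loss']`.

```python
# Hypothetical validation losses for five epochs (not the notebook's actual values)
val_loss = [0.95, 0.71, 0.64, 0.69, 0.78]

# Index of the smallest loss, converted to a 1-based epoch number
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 3
```

Keras also offers the `EarlyStopping` callback (with `monitor`, `patience` and `restore_best_weights` arguments), which could stop training automatically; it is not used in this project.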

## 5.2. Second multi-class classification model: 
The setup of the second model is similar to the first one: it is a **Sequential model** containing an **Embedding layer** and one hidden **LSTM layer** with fewer units this time, specifically **64**. It also contains a **Dense layer** as the output layer, with **4** output units and a **softmax** activation function.

In [None]:
model_2 = Sequential()
model_2.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_2.add(LSTM(64))
model_2.add(Dense(4, activation='softmax'))
model_2.summary()

In [None]:
# load the GloVe matrix
model_2.layers[0].set_weights([embedding_matrix])
model_2.layers[0].trainable = True

The setup for the model's training is the same as for the first model. However, this time I reduce the training **epochs** to **3** because it was previously observed that overfitting occurs after 3 epochs:

In [None]:
model_2.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_2 = model_2.fit(x_train, y_train, # training/ fitting the model
                    epochs=3,
                    batch_size=32,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model_2.evaluate(x_test, y_test)

In [None]:
# plotting training/ validation accuracy and training/ validation loss

import matplotlib.pyplot as plt

acc = history_2.history['acc']
val_acc = history_2.history['val_acc']
loss = history_2.history['loss']
val_loss = history_2.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
According to the **validation accuracy** and **validation loss**, there is no overfitting during training this time. However, there is still a gap between training and test performance: the test accuracy is **~75%**, while the maximum training accuracy is **~83%**.
<br>
In comparison to the previous model, the second model achieves a slightly lower test **loss** (**~0.66**). The results of both models are very similar and, judging by both the **validation loss** and the **test loss**, neither model performs well. Therefore, I decide to build another model with fewer units in the hidden layer to see if it performs better.

## 5.3. Third multi-class classification model:

The third model's setup is similar to the previous two models; the only difference is the reduced number of units in the hidden **LSTM layer**, which is **32** in the third model:

In [None]:
model_3 = Sequential()
model_3.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_3.add(LSTM(32))
model_3.add(Dense(4, activation='softmax'))
model_3.summary()

In [None]:
# load the GloVe matrix
model_3.layers[0].set_weights([embedding_matrix])
model_3.layers[0].trainable = True

The model's parameters during training are similar to the previous two models. This time, I train the model for **5** epochs to observe whether overfitting occurs after epoch **3** again:

In [None]:
model_3.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_3 = model_3.fit(x_train, y_train, # training/ fitting the model
                    epochs=5,
                    batch_size=32,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model_3.evaluate(x_test, y_test)

In [None]:
import matplotlib.pyplot as plt

acc = history_3.history['acc']
val_acc = history_3.history['val_acc']
loss = history_3.history['loss']
val_loss = history_3.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
As predicted, overfitting occurs after epoch **3** on the validation dataset. Compared to the previous model, a lower test **loss** of around **0.62** is obtained. This could indicate that reducing the number of units in the hidden layer affects the model's performance on this dataset. The test accuracy reaches **78%**, which is again lower than the model's training accuracy (**~90%**).
<br>
I decide to compute other metrics (F-measure, Precision, Recall) and create a confusion matrix with the model's predictions so that I can evaluate the model's performance more accurately.

### Confusion Matrix and F-measure, Precision, Recall (model_3):
I use **model_3** to make predictions with **model_3.predict()**. I store these predictions in the variable **y_pred3**:

In [None]:
y_pred3 = model_3.predict(x_test)
print(y_pred3)

I use the NumPy **argmax()** function to return the indices of the maximum values along the second axis (axis=1) of the model's predicted probabilities (y_pred3) as well as of the one-hot encoded true labels (y_test). This converts both into class labels, from which I create the confusion matrix:

In [None]:
y_pred_3 = np.argmax(y_pred3, axis=1) # converts the predicted probabilities to the predicted class labels
y_test_3 = np.argmax(y_test, axis=1) # converts the true class probabilities to the true class labels
cm3 = confusion_matrix(y_test_3, y_pred_3) # creates a confusion matrix with y_test and y_pred

# use the function to produce the confusion matrix
plot_confusion_matrix(cm=cm3, classes=["anger", "fear", "joy", "sadness"], title='Confusion Matrix')
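
To show what happens in this step, the following NumPy sketch builds a confusion matrix by hand from one-hot labels and predicted probabilities. The three-class data and the helper name `confusion_from_probabilities` are illustrative assumptions; the layout (rows are true classes, columns are predicted classes) matches `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

def confusion_from_probabilities(y_true_one_hot, y_pred_probs, num_classes):
    """Count (true class, predicted class) pairs after taking argmax of both inputs."""
    true_labels = np.argmax(y_true_one_hot, axis=1)
    pred_labels = np.argmax(y_pred_probs, axis=1)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1  # rows: true class, columns: predicted class
    return cm

# Four hypothetical samples with true classes 0, 1, 2, 2
y_true = np.eye(3)[[0, 1, 2, 2]]
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.6, 0.1, 0.3],   # class 2 misclassified as class 0
                   [0.1, 0.2, 0.7]])
cm = confusion_from_probabilities(y_true, y_prob, 3)
print(cm)
```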

### Confusion Matrix interpretation (model_3):

According to the confusion matrix above (cm3), the model has correctly predicted **2,458** out of **3,142** labels of the test data (if the sum differs when the notebook is rerun, the values in the confusion matrix have changed). The label **fear** has the most correct classifications, followed by **joy** and lastly **anger** and **sadness**. However, the confusion matrix also shows many misclassifications. The model frequently confuses **fear** with **sadness** and vice versa. Additionally, **sadness** is often misclassified as **anger**. Furthermore, it is surprising that **joy** is misclassified as **sadness** as well as the other two labels.
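
The overall accuracy implied by these counts can be checked with one line of arithmetic. The sketch assumes the counts quoted above; they may differ when the notebook is rerun.

```python
# Correct predictions = sum of the confusion matrix diagonal; total = all test samples
correct, total = 2458, 3142
accuracy = correct / total
print(round(accuracy, 3))  # 0.782
```

This agrees with the test accuracy of roughly 78% reported for **model_3**.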

I compute the macro-averaged **F1-score**, also known as F-measure. The F1-score reaches **0.78**, which is reasonably good given that the maximum value is 1:

In [None]:
sklearn.metrics.f1_score(y_test_3, y_pred_3, average='macro')

I use the **classification_report()** function to compute **Precision**, **Recall** and **F1-score** of each label class which was predicted by the model:

In [None]:
report_3 = classification_report(y_test_3, y_pred_3, labels=[0,1,2,3], target_names=["anger", "fear", "joy", "sadness"])
print(report_3)

According to the report above, **anger** and **joy** have the highest precision scores; however, label **anger** (or sadness) has the lowest recall score. Furthermore, **sadness** has the lowest precision score, but a recall of **78%** (or more). Label **joy** has the highest F1-score, while **sadness** has the lowest.
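
The per-class scores in such a report can be derived directly from a confusion matrix. The NumPy sketch below does this for a made-up two-class matrix; it assumes every class is predicted at least once, otherwise the precision division would hit zero.

```python
import numpy as np

def per_class_scores(cm):
    """Precision, recall and F1 per class from a confusion matrix
    (rows: true classes, columns: predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)  # column sums = how often each class was predicted
    recall = true_pos / cm.sum(axis=1)     # row sums = how often each class truly occurs
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

toy_cm = [[8, 2],
          [1, 9]]
precision, recall, f1 = per_class_scores(toy_cm)
macro_f1 = f1.mean()  # unweighted mean over classes, as with average='macro'
print(precision, recall, f1, macro_f1)
```

A class can therefore have high precision but low recall when the model predicts it rarely but correctly.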

I run the same model again; however, this time I train it for **3** epochs to compare the results of the test dataset:

In [None]:
model_3b = Sequential()
model_3b.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_3b.add(LSTM(32))
model_3b.add(Dense(4, activation='softmax'))
model_3b.summary()

In [None]:
# load the GloVe matrix
model_3b.layers[0].set_weights([embedding_matrix])
model_3b.layers[0].trainable = True

In [None]:
model_3b.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_3b = model_3b.fit(x_train, y_train, # training/ fitting the model
                    epochs=3,
                    batch_size=32,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model_3b.evaluate(x_test, y_test)

In [None]:
import matplotlib.pyplot as plt

acc = history_3b.history['acc']
val_acc = history_3b.history['val_acc']
loss = history_3b.history['loss']
val_loss = history_3b.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
This time there is no overfitting during training. However, compared to **model_3**, a higher test loss (**~0.70**) and a lower test **accuracy** (**~73%**) are obtained. There is still a gap between the test accuracy (**~73%**) and the maximum training accuracy (**~80%**); however, this gap is smaller than for **model_3**.

### Confusion Matrix and F-measure, Precision, Recall (model_3b):
I use **model_3b** to make predictions with **model_3b.predict()**. I store these predictions in the variable **y_pred3b**:

In [None]:
y_pred3b = model_3b.predict(x_test)
print(y_pred3b)

I use the **argmax() NumPy** function again, to return the indices of the maximum values along the second axis (axis=1). Then, I create the confusion matrix:

In [None]:
y_pred_3b = np.argmax(y_pred3b, axis=1) # converts the predicted probabilities to the predicted class labels
y_test_3b = np.argmax(y_test, axis=1) # converts the true class probabilities to the true class labels
cm3b = confusion_matrix(y_test_3b, y_pred_3b) # creates a confusion matrix with y_test and y_pred

# use the function to produce the confusion matrix
plot_confusion_matrix(cm=cm3b, classes=["anger", "fear", "joy", "sadness"], title='Confusion Matrix')
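
To make the **argmax** step concrete, here is a minimal sketch on made-up softmax outputs. The probabilities and the column order (anger, fear, joy, sadness) are assumptions chosen for illustration; they are not values produced by **model_3b**:

```python
import numpy as np

# Hypothetical softmax outputs for three samples over the four classes
# (assumed column order: anger, fear, joy, sadness).
probs = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
class_names = ["anger", "fear", "joy", "sadness"]

# axis=1 selects, for every row (sample), the column with the highest probability.
pred_ids = np.argmax(probs, axis=1)
pred_names = [class_names[i] for i in pred_ids]

print(pred_ids)    # [1 0 3]
print(pred_names)  # ['fear', 'anger', 'sadness']
```

The same call applied to the one-hot encoded **y_test** recovers the true class index, because the only `1` in each row is also the row maximum.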

### Confusion Matrix interpretation (model_3b):

According to the confusion matrix above (cm3b), the model predicted the correct label for **2,335** out of **3,142** test samples in total (if the sum of the correctly identified labels differs, the values in the confusion matrix have changed between runs). The label **fear** has the most correct classifications, followed by **joy** and lastly **sadness** and **anger**. Compared to the **model_3 confusion matrix**, this model has more misclassifications, especially between the labels **anger** and **fear**, **fear** and **sadness** (in both directions), and **joy** and **fear**.
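
The total number of correct predictions is the sum of the diagonal of the confusion matrix. The sketch below shows this on a small made-up matrix; `correct_and_accuracy` and `toy_cm` are hypothetical names and values, not taken from the notebook:

```python
def correct_and_accuracy(cm):
    """Return (number of correct predictions, accuracy) for a square confusion matrix."""
    # Diagonal entries are samples whose predicted label equals the true label.
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct, correct / total


# Made-up 4x4 confusion matrix (rows: true labels, columns: predicted labels).
toy_cm = [
    [8, 1, 0, 1],
    [2, 9, 1, 0],
    [0, 1, 7, 2],
    [1, 2, 1, 6],
]

correct, accuracy = correct_and_accuracy(toy_cm)
print(correct, round(accuracy, 3))  # 30 0.714
```

With a NumPy array such as **cm3b**, the same count could be obtained with `np.trace(cm3b)`.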

I compute the macro-averaged **F1-score**. It reaches **~0.73**, which is lower than the F1-score of **model_3**:

In [None]:
sklearn.metrics.f1_score(y_test_3b, y_pred_3b, average='macro')
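
For readers who want to see what `average='macro'` means, the following sketch computes the macro F1-score by hand from a confusion matrix: one F1-score per class, then their unweighted mean. It assumes the scikit-learn convention of rows as true labels and columns as predicted labels; `macro_f1` is a hypothetical helper written for this illustration:

```python
def macro_f1(cm):
    """Unweighted mean of the per-class F1-scores of a square confusion matrix."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true label differs
        fn = sum(cm[k]) - tp                       # true k, predicted label differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / n


# Toy 2x2 example: class 0 has F1 = 2/3, class 1 has F1 = 0.8.
print(round(macro_f1([[1, 1], [0, 2]]), 4))  # 0.7333
```

Because every class counts equally, a class with few samples that is predicted poorly lowers the macro F1-score as much as a large class would.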

In [None]:
report_3b = classification_report(y_test_3b, y_pred_3b, labels=[0,1,2,3], target_names=["anger", "fear", "joy", "sadness"])
print(report_3b)

In this report, it is observed that the model's predictions for each label have lower **recall** values and f1-scores than **model_3**. Since the results reported for this model's predictions are worse than the results reported for **model_3**, I reach the conclusion that this model doesn't perform well for this task. Therefore, I decide to build **model_4**.

### 5.4. Fourth multi-class classification model
### Vanilla LSTM model with dropout rate = 0.5
The training setup of this model is similar to that of the previous (three) models; however, this time I set the **dropout** argument of the hidden **LSTM** layer to **0.5**. I also tried other dropout rates, starting from **0.1**, and found that the model performs better with a dropout rate of **0.5**:

In [None]:
model_4 = Sequential()
model_4.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_4.add(LSTM(32, dropout=0.5))
model_4.add(Dense(4, activation='softmax'))
model_4.summary()
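
As a rough intuition for what a dropout rate of 0.5 does, the sketch below applies generic "inverted" dropout to a list of values: each value is zeroed with probability `rate`, and the surviving values are scaled up so that the expected sum stays the same. This is a simplified illustration of the general idea, not the exact computation inside the Keras **LSTM** layer; `inverted_dropout` is a hypothetical helper:

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the kept ones by 1 / (1 - rate)."""
    assert 0.0 <= rate < 1.0, "rate must be in [0, 1)"
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)          # fixed seed so the mask is reproducible
x = [1.0] * 10
dropped = inverted_dropout(x, rate=0.5, rng=rng)
print(dropped)                  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
```

Dropout is only active during training; at prediction time all units are used, which is one reason training accuracy can look lower than validation accuracy in the plots.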

In [None]:
# load the GloVe matrix
model_4.layers[0].set_weights([embedding_matrix])
model_4.layers[0].trainable = True

I train the model for **5** epochs and reduce the batch size from **32** to **16** (I also tried a batch size of 32, but the metrics were worse):

In [None]:
model_4.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_4 = model_4.fit(x_train, y_train, # training/ fitting the model
                    epochs=5,
                    batch_size=16,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)
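
Halving the batch size doubles the number of weight updates per epoch, which is one way to read the change from 32 to 16. A minimal sketch, where the training-set size `n_train` is a made-up number and not the size of the dataset used here:

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches (and thus weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 3200  # hypothetical number of training samples
print(steps_per_epoch(n_train, 32), steps_per_epoch(n_train, 16))  # 100 200
```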

In [None]:
test_loss, test_acc = model_4.evaluate(x_test, y_test)

In [None]:
acc = history_4.history['acc']
val_acc = history_4.history['val_acc']
loss = history_4.history['loss']
val_loss = history_4.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
According to the plots above, there is no overfitting during training. I tried training the model for more epochs and observed that overfitting occurs after epoch 5. Thus, I train the model for **5** epochs. Compared to **model_3**, testing yields a lower loss (**~62%**) and a higher accuracy (**~79%**). The difference between the maximum accuracy on the test dataset (**~79%**) and on the training dataset (**~80%**) is considerably smaller than for all the previous models. The same holds for the loss on the training set (**~52%**) and on the test set (**~62%**).
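
One simple way to locate the point where overfitting starts is to look for the epoch with the lowest validation loss. The sketch below does this on a made-up loss curve; in the notebook, the list could be taken from `history_4.history['val_loss']`. The helper `best_epoch` and the values in `toy_val_loss` are assumptions for illustration:

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Hypothetical validation losses for 7 epochs: the loss rises again after epoch 5.
toy_val_loss = [0.95, 0.78, 0.70, 0.66, 0.62, 0.65, 0.71]
print(best_epoch(toy_val_loss))  # 5
```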

### Confusion Matrix and F-measure, Precision, Recall (model_4):
I use **model_4** to make predictions, using the **model_4.predict()**. I store these predictions in **y_pred4** variable:

In [None]:
y_pred4 = model_4.predict(x_test)
print(y_pred4)

I use the **argmax() NumPy** function again, to return the indices of the maximum values along the second axis (axis=1). Then, I create the confusion matrix:

In [None]:
y_pred_4 = np.argmax(y_pred4, axis=1)
y_test_4 =np.argmax(y_test, axis=1)
cm4 = confusion_matrix(y_test_4, y_pred_4)

plot_confusion_matrix(cm=cm4, classes=["anger", "fear", "joy", "sadness"], title='Confusion Matrix')

### Confusion Matrix interpretation (model_4):

According to the confusion matrix above (cm4), the model predicted the correct label for **2,464** out of **3,142** test samples in total (if the sum of the correctly identified labels differs, the values in the confusion matrix have changed between runs). The label **fear** has the most correct classifications, followed by **joy** and lastly **anger** and **sadness**. Compared to the **model_3 confusion matrix**, this model has fewer misclassifications. However, some label pairs are still confused considerably often: **sadness** with **fear** (in both directions), **anger** with **fear** (in both directions), **joy** with **fear** (in both directions), and **anger** with **sadness**.

I compute the **F1-score**. It reaches **~0.78**, which is slightly higher than the F1-score of **model_3** and closer to the ideal value of 1:

In [None]:
sklearn.metrics.f1_score(y_test_4, y_pred_4, average='macro')

In [None]:
report_4 = classification_report(y_test_4, y_pred_4, labels=[0,1,2,3], target_names=["anger", "fear", "joy", "sadness"])
print(report_4)

The labels **fear** and **sadness** have lower precision values than the other two labels. The label **joy** (or **anger**) has the highest precision value, and the label **fear** has the highest recall value. Based on the reported results, this is the best model for this task so far, because the F1-score of each class is higher than the corresponding one for **model_3**.

### 5.5. Fifth multi-class classification model
### Stacked LSTM model with dropout rate = 0.4
I decide to build a **Stacked LSTM** model with dropout in the hidden layers to see whether the accuracy improves. I set up a **Sequential model** containing an **Embedding layer** and two hidden **LSTM layers** with **32** units each. In the first hidden **LSTM** layer, I set the argument **return_sequences** to **True**, so that it passes its output for every time step to the second LSTM layer. In both hidden layers I set the dropout rate to **0.4** (I also tried lower dropout rates, but the model performed worse). The output layer, which is a **Dense layer**, is configured as in the previous models:

In [None]:
model_5 = Sequential()
model_5.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_5.add(LSTM(32, return_sequences=True, dropout=0.4))
model_5.add(LSTM(32, dropout=0.4))
model_5.add(Dense(4, activation='softmax'))
model_5.summary()
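
To illustrate why the first recurrent layer needs **return_sequences=True**, the sketch below uses a toy recurrent layer written in NumPy. It is a plain tanh recurrence with random weights, which is a deliberate simplification and not an LSTM; `toy_rnn` and all shapes are assumptions for illustration. The point is only the output shape: a second recurrent layer expects one vector per time step, not just the final state.

```python
import numpy as np


def toy_rnn(inputs, units, return_sequences=False, seed=0):
    """Toy tanh recurrence over inputs of shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = inputs.shape
    w_in = rng.normal(scale=0.1, size=(features, units))   # input weights
    w_rec = rng.normal(scale=0.1, size=(units, units))     # recurrent weights
    h = np.zeros((batch, units))
    states = []
    for t in range(timesteps):
        h = np.tanh(inputs[:, t, :] @ w_in + h @ w_rec)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)  # (batch, timesteps, units)
    return h                             # (batch, units): last state only


x = np.ones((2, 5, 8))                               # 2 samples, 5 time steps, 8 features
seq = toy_rnn(x, units=32, return_sequences=True)    # fed to the second layer
last = toy_rnn(seq, units=32)                        # second layer keeps only the last state
print(seq.shape, last.shape)  # (2, 5, 32) (2, 32)
```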

In [None]:
# load the GloVe matrix
model_5.layers[0].set_weights([embedding_matrix])
model_5.layers[0].trainable = True

For the training process, I reduce the batch size to **16** and train the model for **7** epochs (I also tried 10 epochs, but overfitting occurs after epoch 7):

In [None]:
model_5.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_5 = model_5.fit(x_train, y_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model_5.evaluate(x_test, y_test)

In [None]:
import matplotlib.pyplot as plt

acc = history_5.history['acc']
val_acc = history_5.history['val_acc']
loss = history_5.history['loss']
val_loss = history_5.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
The **Stacked LSTM model_5 with dropout rate 0.4** performs better than the previous models, model_4 and model_3. It achieves a maximum **validation accuracy** of **~79%** and a validation loss of **~61%**. Overfitting is reduced during training, and testing yields a maximum accuracy of **~81%** with a loss of **~54%**. The difference between the maximum training and test accuracy is also smaller: **~88%** for training and **~81%** for testing.

### Confusion Matrix and F-measure, Precision, Recall (model_5):
I use **model_5** to make predictions, using the **model_5.predict()**. I store these predictions in **y_pred5** variable:

In [None]:
y_pred5 = model_5.predict(x_test)
print(y_pred5)

I use the **argmax() NumPy** function again, to return the indices of the maximum values along the second axis (axis=1). Then, I create the confusion matrix:

In [None]:
y_pred_5 = np.argmax(y_pred5, axis=1)
y_test_5 =np.argmax(y_test, axis=1)
cm5 = confusion_matrix(y_test_5, y_pred_5)

plot_confusion_matrix(cm=cm5, classes=["anger", "fear", "joy", "sadness"], title='Confusion Matrix')

### Confusion Matrix interpretation (model_5):

According to the confusion matrix above (cm5), the model predicted the correct label for **2,560** out of **3,142** test samples in total (if the sum of the correctly identified labels differs, the values in the confusion matrix have changed between runs). The label **fear** has the most correct classifications, followed by **anger** and lastly **joy** and **sadness**. Compared to the previous confusion matrices, this model has fewer misclassifications. However, some label pairs are still confused considerably often: **fear** with **sadness** (in both directions), **fear** with **anger** (in both directions), **joy** with **fear**, **joy** with **sadness**, and **sadness** with **anger**.

I compute the **F1-score**. It reaches **~0.81**, which is higher than all the previous F1-scores and closer to the ideal value of 1:

In [None]:
sklearn.metrics.f1_score(y_test_5, y_pred_5, average='macro')

In [None]:
report_5 = classification_report(y_test_5, y_pred_5, labels=[0,1,2,3], target_names=["anger", "fear", "joy", "sadness"])
print(report_5)

The classification report demonstrates high precision, recall and f1-score values for each class, which indicates that this is a better and more suited model for this task than **model_4**.

### 5.6. Sixth multi-class classification model
### Stacked LSTM model with dropout rate = 0.5
I build another **Stacked LSTM** model with dropout in the hidden layers to see whether the accuracy improves. As in the previous model, I set up a **Sequential model** containing an **Embedding layer** and two hidden **LSTM layers** with **32** units each. However, this time both hidden layers have a dropout rate of **0.5**. The output layer, which is a **Dense layer**, is configured as in the previous models:

In [None]:
model_6 = Sequential()
model_6.add(Embedding(max_words, embedding_dim, input_length = max_length))
model_6.add(LSTM(32, return_sequences=True, dropout=0.5))
model_6.add(LSTM(32, dropout=0.5))
model_6.add(Dense(4, activation='softmax'))
model_6.summary()

In [None]:
# load the GloVe matrix
model_6.layers[0].set_weights([embedding_matrix])
model_6.layers[0].trainable = True

I again train the model for **7 epochs** to see whether **model_6** can achieve similar or better results than **model_5** (I also tried training the model for more than 7 epochs, but overfitting occurs after epoch 7):

In [None]:
model_6.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task is a multi-class classification
              metrics=['acc'])

history_6 = model_6.fit(x_train, y_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = model_6.evaluate(x_test, y_test)

In [None]:
acc = history_6.history['acc']
val_acc = history_6.history['val_acc']
loss = history_6.history['loss']
val_loss = history_6.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

### Summary of results:
The **Stacked LSTM model_6 with dropout rate 0.5** seems to perform similarly to **model_5**. It achieves a maximum **validation accuracy** of **~76%** and a validation loss of **~63%**. Testing yields a maximum accuracy of **~80%** with a loss of **~55%**. The difference between the maximum training and test accuracy is also smaller: **~82%** for training and **~80%** for testing.

### Confusion Matrix and F-measure, Precision, Recall (model_6):
I use **model_6** to make predictions, using the **model_6.predict()**. I store these predictions in **y_pred6** variable:

In [None]:
y_pred6 = model_6.predict(x_test)
print(y_pred6)

I use the **argmax() NumPy** function again, to return the indices of the maximum values along the second axis (axis=1). Then, I create the confusion matrix:

In [None]:
y_pred_6 = np.argmax(y_pred6, axis=1)
y_test_6 =np.argmax(y_test, axis=1)
cm6 = confusion_matrix(y_test_6, y_pred_6)

plot_confusion_matrix(cm=cm6, classes=["anger", "fear", "joy", "sadness"], title='Confusion Matrix')

### Confusion Matrix interpretation (model_6):

According to the confusion matrix above (cm6), the model predicted the correct label for **2,512** out of **3,142** test samples in total (if the sum of the correctly identified labels differs, the values in the confusion matrix have changed between runs). The label **fear** has the most correct classifications, followed by **anger** and lastly **joy** and **sadness**. Compared to the confusion matrix of model_5, this model has more misclassifications, but the results of the two models are similar. Some label pairs are still confused considerably often: **fear** with **anger** (in both directions), **sadness** with **anger**, **sadness** with **fear**, and **joy** with **anger** as well as with **fear**.

I compute the **F1-score**. The F1-score reaches a **0.80** value, which is close to the F1-score of model_5 (0.81):

In [None]:
sklearn.metrics.f1_score(y_test_6, y_pred_6, average='macro')

In [None]:
report_6 = classification_report(y_test_6, y_pred_6, labels=[0,1,2,3], target_names=["anger", "fear", "joy", "sadness"])
print(report_6)

The classification report demonstrates high precision, recall and f1-score values for each class. The results reported are similar to the ones of **model_5**.

## Conclusion:
**Model_5** is **the best multi-class classification model** for this task, with an F1-score of **0.81**. **Model_6** comes in second place with an F1-score of **0.80**. **Model_4** could be considered as a third option, but its F1-score of **0.78** is lower than that of the other two models.
<br>
Now, I will apply **model_5** with the One-vs-Rest strategy to see whether it performs well on a binary classification problem with the same data. Additionally, I will find out for which class (anger, fear, joy or sadness) the model achieves the highest accuracy and F1-score.

## 6. One vs. Rest strategy
The One-vs-Rest strategy splits a multi-class classification into one binary classification problem per class. Given the multi-class classification problem with examples for each class **anger, fear, joy and sadness**, this can be divided into **four** binary classification problems as follows:

* **Binary classification problem 1**: anger vs [fear, joy, sadness]
* **Binary classification problem 2**: fear vs [anger, joy, sadness]
* **Binary classification problem 3**: joy vs [anger, fear, sadness]
* **Binary classification problem 4**: sadness vs [anger, fear, joy]

Therefore, one model must be created for each problem (or class) mentioned above. Before I start building the models for each class, I prepare the labels of the Twitter data accordingly. In each of the three datasets (train, validation, test), I create four extra columns. Each column contains the new labels for one binary classification problem, and these labels are used to train and test the corresponding binary classification model.
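
The relabeling behind the four binary problems can be sketched in plain Python as follows. The helper `one_vs_rest_labels` and the short `labels` list are made up for illustration and are independent of the DataFrame columns used in this notebook:

```python
def one_vs_rest_labels(labels, positive_class):
    """Map the positive class to 1 and every other class to 0."""
    return [1 if label == positive_class else 0 for label in labels]


labels = ["joy", "anger", "fear", "sadness", "anger"]   # toy multi-class labels
classes = ["anger", "fear", "joy", "sadness"]

# One binary label list per class, i.e. one binary classification problem each.
ovr = {name: one_vs_rest_labels(labels, name) for name in classes}
print(ovr["anger"])  # [0, 1, 0, 0, 1]
print(ovr["joy"])    # [1, 0, 0, 0, 0]
```

Each sample is positive in exactly one of the four problems, so every binary problem has noticeably more "rest" samples than positive samples.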

## 6.1. Data processing (One vs. Rest)

I create the function **add_ovr_label()**, which takes as argument a **Pandas DataFrame** (in this case the train, validation and test datasets) and creates the following **four** columns: **'ovr_for_label_anger', 'ovr_for_label_fear', 'ovr_for_label_joy', 'ovr_for_label_sadness'**. In these columns, the class mentioned in the name of the column (e.g., anger in 'ovr_for_label_anger') is substituted with **1**, and the other classes are substituted with **0** by the function:

In [None]:
def add_ovr_label(data: pd.DataFrame):
    data['ovr_for_label_anger'] = data['label'].apply(lambda x: 1 if x == 'anger' else 0)
    data['ovr_for_label_fear'] = data['label'].apply(lambda x: 1 if x == 'fear' else 0)
    data['ovr_for_label_joy'] = data['label'].apply(lambda x: 1 if x == 'joy' else 0)
    data['ovr_for_label_sadness'] = data['label'].apply(lambda x: 1 if x == 'sadness' else 0)


# call the function to create four columns with new labels
add_ovr_label(emotions_train_data) # train data
add_ovr_label(emotions_val_data) # validation data
add_ovr_label(emotions_test_data) # test data

I check that the four columns have been added to the three datasets using the **head()** and **value_counts()** functions:

In [None]:
# visualize train data
emotions_train_data.head()

In [None]:
# visualize validation data
emotions_val_data.head()

In [None]:
# visualize test data
emotions_test_data.head()

In [None]:
emotions_train_data[['label', 'ovr_for_label_anger', 'ovr_for_label_fear', 'ovr_for_label_joy',
                     'ovr_for_label_sadness']].value_counts()

In [None]:
emotions_val_data[['label', 'ovr_for_label_anger', 'ovr_for_label_fear', 'ovr_for_label_joy',
                     'ovr_for_label_sadness']].value_counts()

In [None]:
emotions_test_data[['label', 'ovr_for_label_anger', 'ovr_for_label_fear', 'ovr_for_label_joy',
                     'ovr_for_label_sadness']].value_counts()

The **value_counts()** function shows above that each class for each binary classification problem is substituted with **1** in the columns, while the other classes are substituted with **0**.

## 6.3. Vectorizing labels:
The text data have already been converted into an appropriate numerical format. What remains is to vectorize the new binary labels for each classification problem. First, I put the extracted labels into lists; then, I vectorize them by encoding them in one-hot format:

In [None]:
# train labels
y_train_for_label_anger = emotions_train_data['ovr_for_label_anger'].tolist()
y_train_for_label_fear = emotions_train_data['ovr_for_label_fear'].tolist()
y_train_for_label_joy = emotions_train_data['ovr_for_label_joy'].tolist()
y_train_for_label_sadness = emotions_train_data['ovr_for_label_sadness'].tolist() # sadness column, not joy

# validation labels
y_val_for_label_anger = emotions_val_data['ovr_for_label_anger'].tolist()
y_val_for_label_fear = emotions_val_data['ovr_for_label_fear'].tolist()
y_val_for_label_joy = emotions_val_data['ovr_for_label_joy'].tolist()
y_val_for_label_sadness = emotions_val_data['ovr_for_label_sadness'].tolist() # sadness column, not joy

# test labels
y_test_for_label_anger = emotions_test_data['ovr_for_label_anger'].tolist()
y_test_for_label_fear = emotions_test_data['ovr_for_label_fear'].tolist()
y_test_for_label_joy = emotions_test_data['ovr_for_label_joy'].tolist()
y_test_for_label_sadness = emotions_test_data['ovr_for_label_sadness'].tolist() # sadness column, not joy

In [None]:
# Binary classification problem 1: anger vs [fear, joy, sadness]

y_one_hot_anger_train = np.asarray(y_train_for_label_anger) # train labels
y_one_hot_anger_train = to_categorical(y_one_hot_anger_train)
print('Shape of label tensor for anger:', y_one_hot_anger_train.shape)

y_one_hot_anger_val = np.asarray(y_val_for_label_anger) # validation labels
y_one_hot_anger_val = to_categorical(y_one_hot_anger_val)
print('Shape of label tensor for anger:', y_one_hot_anger_val.shape)

y_one_hot_anger_test = np.asarray(y_test_for_label_anger) # test labels
y_one_hot_anger_test = to_categorical(y_one_hot_anger_test)
print('Shape of label tensor for anger:', y_one_hot_anger_test.shape)


# Binary classification problem 2: fear vs [anger, joy, sadness]

y_one_hot_fear_train = np.asarray(y_train_for_label_fear) # train labels
y_one_hot_fear_train = to_categorical(y_one_hot_fear_train)
print('Shape of label tensor for fear:', y_one_hot_fear_train.shape)

y_one_hot_fear_val = np.asarray(y_val_for_label_fear) # validation labels
y_one_hot_fear_val = to_categorical(y_one_hot_fear_val)
print('Shape of label tensor for fear:', y_one_hot_fear_val.shape)

y_one_hot_fear_test = np.asarray(y_test_for_label_fear) # test labels
y_one_hot_fear_test = to_categorical(y_one_hot_fear_test)
print('Shape of label tensor for fear:', y_one_hot_fear_test.shape)


# Binary classification problem 3: joy vs [anger, fear, sadness]

y_one_hot_joy_train = np.asarray(y_train_for_label_joy) # train labels
y_one_hot_joy_train = to_categorical(y_one_hot_joy_train)
print('Shape of label tensor for joy:', y_one_hot_joy_train.shape)

y_one_hot_joy_val = np.asarray(y_val_for_label_joy) # validation labels
y_one_hot_joy_val = to_categorical(y_one_hot_joy_val)
print('Shape of label tensor for joy:', y_one_hot_joy_val.shape)

y_one_hot_joy_test = np.asarray(y_test_for_label_joy) # test labels
y_one_hot_joy_test = to_categorical(y_one_hot_joy_test)
print('Shape of label tensor for joy:', y_one_hot_joy_test.shape)


# Binary classification problem 4: sadness vs [anger, fear, joy]

y_one_hot_sadness_train = np.asarray(y_train_for_label_sadness) # train labels
y_one_hot_sadness_train = to_categorical(y_one_hot_sadness_train)
print('Shape of label tensor for sadness:', y_one_hot_sadness_train.shape)

y_one_hot_sadness_val = np.asarray(y_val_for_label_sadness) # validation labels
y_one_hot_sadness_val = to_categorical(y_one_hot_sadness_val)
print('Shape of label tensor for sadness:', y_one_hot_sadness_val.shape)

y_one_hot_sadness_test = np.asarray(y_test_for_label_sadness) # test labels
y_one_hot_sadness_test = to_categorical(y_one_hot_sadness_test)
print('Shape of label tensor for sadness:', y_one_hot_sadness_test.shape)
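
For binary integer labels, one-hot encoding turns every `0` into `[1, 0]` and every `1` into `[0, 1]`. The sketch below reproduces this with NumPy only, as an approximation of what **to_categorical** returns for such labels; `one_hot` is a hypothetical helper and the label list is made up:

```python
import numpy as np


def one_hot(labels, num_classes):
    """Encode integer class labels as rows of an identity matrix."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]


binary_labels = [0, 1, 0, 0, 1]            # toy One-vs-Rest labels
encoded = one_hot(binary_labels, num_classes=2)
print(encoded.shape)  # (5, 2)
print(encoded[1])     # [0. 1.]
```

Since the twelve conversions above follow the same pattern, a small helper like this, called once per label list, would also shorten the cell considerably.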

## 6.4. Binary classification Models Setup, Training and Testing:
### 6.4.1 Binary classification problem 1: anger vs [fear, joy, sadness]
I set up the same architecture as **model_5**, with stacked LSTM hidden layers, each containing **32 units** and a dropout rate of **0.4**. As the output layer I again use a Dense layer, but this time with **2** units because the task has two labels:

In [None]:
binary_model_1 = Sequential()
binary_model_1.add(Embedding(max_words, embedding_dim, input_length = max_length))
binary_model_1.add(LSTM(32, return_sequences=True, dropout=0.4))
binary_model_1.add(LSTM(32, dropout=0.4))
binary_model_1.add(Dense(2, activation='softmax'))
binary_model_1.summary()
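
A two-unit softmax output is mathematically equivalent to a single sigmoid unit applied to the difference of the two logits, which is why both designs are common for binary tasks. The following sketch checks this numerically with made-up logits; the function names and the values are assumptions for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [0.3, 1.7]                          # hypothetical logits: [rest, positive class]
p_softmax = softmax(logits)[1]               # P(positive) from the 2-unit softmax
p_sigmoid = sigmoid(logits[1] - logits[0])   # P(positive) from one sigmoid unit
print(abs(p_softmax - p_sigmoid) < 1e-12)    # True
```

Keeping the two-unit softmax has the practical advantage that the rest of the pipeline (one-hot labels, `categorical_crossentropy`, `argmax`) can be reused unchanged from the multi-class models.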

In [None]:
# load the GloVe matrix
binary_model_1.layers[0].set_weights([embedding_matrix])
binary_model_1.layers[0].trainable = True

I train the model over **7 epochs**, as I did in the multi-class classification task for **model_5**:

In [None]:
binary_model_1.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task has two labels
              metrics=['acc'])

binary_history_1 = binary_model_1.fit(x_train, y_one_hot_anger_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_one_hot_anger_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = binary_model_1.evaluate(x_test, y_one_hot_anger_test)

In [None]:
acc = binary_history_1.history['acc']
val_acc = binary_history_1.history['val_acc']
loss = binary_history_1.history['loss']
val_loss = binary_history_1.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
y_pred_binary1 = binary_model_1.predict(x_test)
print(y_pred_binary1)

In [None]:
y_pred_binary_1 = np.argmax(y_pred_binary1, axis=1)
y_one_hot_anger_test_1 = np.argmax(y_one_hot_anger_test, axis=1)
cm1_binary = confusion_matrix(y_one_hot_anger_test_1, y_pred_binary_1, labels=[1, 0]) # rows/columns ordered as [anger, Rest] to match the plot labels

plot_confusion_matrix(cm=cm1_binary, classes=["anger", "Rest"], title='Confusion Matrix')

In [None]:
sklearn.metrics.f1_score(y_one_hot_anger_test_1, y_pred_binary_1, average='macro')

In [None]:
report_binary_1 = classification_report(y_one_hot_anger_test_1, y_pred_binary_1, labels=[1,0], target_names=["anger", "Rest"])
print(report_binary_1)

### Summary of binary_model_1 results:
Overall, the results are poor when classifying the label **anger** against the rest of the labels. According to the confusion matrix, the model clearly underperforms in comparison to **model_5**, since it struggles to distinguish **anger** from the other labels. For this reason, the model achieves a very low F1-score of **0.46**. The results could perhaps be improved by changing some parameters of the model (e.g., reducing the dropout rate to 0.2); however, this falls outside the scope of this project.
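
When reading such a low macro F1-score, it can help to compare it with a trivial baseline. The sketch below computes the accuracy and macro F1-score of a classifier that always predicts "Rest" on a made-up test set in which 25% of the samples are positive. The counts are assumptions for illustration and not the real class distribution of this dataset:

```python
def f1(tp, fp, fn):
    """F1-score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


n_pos, n_rest = 250, 750   # hypothetical test set: 25% positive class, 75% rest

# A classifier that always predicts "Rest":
f1_rest = f1(tp=n_rest, fp=n_pos, fn=0)   # every positive sample is a false "Rest"
f1_pos = f1(tp=0, fp=0, fn=n_pos)         # the positive class is never predicted
macro = (f1_rest + f1_pos) / 2
accuracy = n_rest / (n_pos + n_rest)

print(round(accuracy, 2), round(macro, 3))  # 0.75 0.429
```

Under these assumed proportions, a macro F1-score in the mid 0.4 range is about what a model obtains when it rarely predicts the positive class at all, so inspecting the positive-class row of the confusion matrix is a useful check.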

### 6.4.2 Binary classification problem 2: fear vs [anger, joy, sadness]
Similar to the previous binary classification model, I set up the exact same **model_5** for the binary classification problem, fear vs the rest of the labels, and I train the model for **7** epochs.

In [None]:
binary_model_2 = Sequential()
binary_model_2.add(Embedding(max_words, embedding_dim, input_length = max_length))
binary_model_2.add(LSTM(32, return_sequences=True, dropout=0.4))
binary_model_2.add(LSTM(32, dropout=0.4))
binary_model_2.add(Dense(2, activation='softmax'))
binary_model_2.summary()

In [None]:
# load the GloVe matrix
binary_model_2.layers[0].set_weights([embedding_matrix])
binary_model_2.layers[0].trainable = True

In [None]:
binary_model_2.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task has two labels
              metrics=['acc'])

binary_history_2 = binary_model_2.fit(x_train, y_one_hot_fear_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_one_hot_fear_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = binary_model_2.evaluate(x_test, y_one_hot_fear_test)

In [None]:
y_pred_binary2 = binary_model_2.predict(x_test)
print(y_pred_binary2)

In [None]:
y_pred_binary_2 = np.argmax(y_pred_binary2, axis=1)
y_one_hot_fear_test_2 = np.argmax(y_one_hot_fear_test, axis=1)
cm2_binary = confusion_matrix(y_one_hot_fear_test_2, y_pred_binary_2, labels=[1, 0]) # rows/columns ordered as [fear, Rest] to match the plot labels

plot_confusion_matrix(cm=cm2_binary, classes=["fear", "Rest"], title='Confusion Matrix')

In [None]:
sklearn.metrics.f1_score(y_one_hot_fear_test_2, y_pred_binary_2, average='macro')

In [None]:
report_binary_2 = classification_report(y_one_hot_fear_test_2, y_pred_binary_2, labels=[1,0], target_names=["fear", "Rest"])
print(report_binary_2)

### Summary of binary_model_2 results:
As with the previous binary model, the results are poor when binary_model_2 classifies the label **fear** against the rest of the labels. According to the confusion matrix, the model clearly underperforms in comparison to **model_5**, since it struggles to distinguish **fear** from the other labels. For this reason, the model achieves a very low F1-score of **0.45**, which is close to that of the previous binary model (**0.46**). The reason behind the higher F1-score (**0.12**) for the label **fear** could be that **fear** has more samples (995) available than the other labels.

### 6.4.3 Binary classification problem 3: joy vs [anger, fear, sadness]

In [None]:
binary_model_3 = Sequential()
binary_model_3.add(Embedding(max_words, embedding_dim, input_length = max_length))
binary_model_3.add(LSTM(32, return_sequences=True, dropout=0.4))
binary_model_3.add(LSTM(32, dropout=0.4))
binary_model_3.add(Dense(2, activation='softmax'))
binary_model_3.summary()

In [None]:
# load the GloVe matrix
binary_model_3.layers[0].set_weights([embedding_matrix])
binary_model_3.layers[0].trainable = True

In [None]:
binary_model_3.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # the task has two labels
              metrics=['acc'])

binary_history_3 = binary_model_3.fit(x_train, y_one_hot_joy_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_one_hot_joy_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = binary_model_3.evaluate(x_test, y_one_hot_joy_test)

In [None]:
y_pred_binary3 = binary_model_3.predict(x_test)
print(y_pred_binary3)

In [None]:
y_pred_binary_3 = np.argmax(y_pred_binary3, axis=1)
y_one_hot_joy_test_3 = np.argmax(y_one_hot_joy_test, axis=1)
cm3_binary = confusion_matrix(y_one_hot_joy_test_3, y_pred_binary_3, labels=[1, 0]) # rows/columns ordered as [joy, Rest] to match the plot labels

plot_confusion_matrix(cm=cm3_binary, classes=["joy", "Rest"], title='Confusion Matrix')

In [None]:
sklearn.metrics.f1_score(y_one_hot_joy_test_3, y_pred_binary_3, average='macro')

In [None]:
report_binary_3 = classification_report(y_one_hot_joy_test_3, y_pred_binary_3, labels=[1,0], target_names=["joy", "Rest"])
print(report_binary_3)

### Summary of binary_model_3 results:
As with the previous binary models, the results are poor when binary_model_3 classifies the label **joy** against the rest of the labels. According to the confusion matrix, the model clearly underperforms in comparison to **model_5**, since it struggles to distinguish **joy** from the other labels. For this reason, the model achieves a very low F1-score of **0.44**.

### 6.4.4 Binary classification problem 4: sadness vs [anger, fear, joy]

In [None]:
binary_model_4 = Sequential()
binary_model_4.add(Embedding(max_words, embedding_dim, input_length = max_length))
binary_model_4.add(LSTM(32, return_sequences=True, dropout=0.4))
binary_model_4.add(LSTM(32, dropout=0.4))
binary_model_4.add(Dense(2, activation='softmax'))
binary_model_4.summary()

In [None]:
# load the GloVe matrix
binary_model_4.layers[0].set_weights([embedding_matrix])
binary_model_4.layers[0].trainable = True

In [None]:
binary_model_4.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', # targets are one-hot encoded over two classes (sadness vs. Rest)
              metrics=['acc'])

binary_history_4 = binary_model_4.fit(x_train, y_one_hot_sadness_train, # training/ fitting the model
                    epochs=7,
                    batch_size=16,
                    validation_data=(x_val, y_one_hot_sadness_val), # specify validation data
                    verbose = 1)

In [None]:
test_loss, test_acc = binary_model_4.evaluate(x_test, y_one_hot_sadness_test)

In [None]:
y_pred_binary4 = binary_model_4.predict(x_test)
print(y_pred_binary4)

In [None]:
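# convert the predicted probabilities and the one-hot targets to class indices
y_pred_binary_4 = np.argmax(y_pred_binary4, axis=1)
y_one_hot_sadness_test_4 = np.argmax(y_one_hot_sadness_test, axis=1)
# order rows/columns as [1, 0] so that they match classes=["sadness", "Rest"] in the plot below
cm4_binary = confusion_matrix(y_one_hot_sadness_test_4, y_pred_binary_4, labels=[1, 0])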

plot_confusion_matrix(cm=cm4_binary, classes=["sadness", "Rest"], title='Confusion Matrix')

In [None]:
sklearn.metrics.f1_score(y_one_hot_sadness_test_4, y_pred_binary_4, average='macro')

In [None]:
report_binary_4 = classification_report(y_one_hot_sadness_test_4, y_pred_binary_4, labels=[1,0], target_names=["sadness", "Rest"])
print(report_binary_4)

### Summary of binary_model_4 results:
As with the previous binary models, binary_model_4 performs poorly when it classifies the label **sadness** against the rest of the labels. The confusion matrix shows that the model struggles to distinguish **sadness** from the other emotions and clearly underperforms **model_5**. As a result, the model reaches a very low macro-averaged F1-score of **0.44**. The label **sadness** also has the lowest precision score compared to the labels of the previous binary models.
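
The four binary models are evaluated separately above. In a One-vs-Rest setup, their outputs can also be combined into a single multi-class prediction by choosing, for each sample, the emotion whose binary model assigns the highest positive-class probability. The sketch below illustrates this step with made-up probabilities. The function `combine_one_vs_rest` and all numbers are hypothetical and not part of the project.

```python
import numpy as np


def combine_one_vs_rest(positive_probs, class_names):
    """Pick, per sample, the class whose binary model is most confident.

    positive_probs: dict mapping a class name to a 1-D array holding the
    probability of that class (vs. Rest) for every sample.
    """
    # Stack into shape (n_samples, n_classes); columns follow class_names.
    scores = np.column_stack([positive_probs[name] for name in class_names])
    winners = np.argmax(scores, axis=1)
    return [class_names[i] for i in winners]


# Made-up positive-class probabilities for three samples (assumption).
class_names = ["anger", "fear", "joy", "sadness"]
positive_probs = {
    "anger":   np.array([0.70, 0.10, 0.20]),
    "fear":    np.array([0.20, 0.15, 0.60]),
    "joy":     np.array([0.05, 0.80, 0.10]),
    "sadness": np.array([0.30, 0.05, 0.55]),
}

ovr_labels = combine_one_vs_rest(positive_probs, class_names)
print(ovr_labels)  # ['anger', 'joy', 'fear']
```

With the models of this project, the positive-class probabilities would be one column of each prediction array, for example `y_pred_binary4[:, 1]`, assuming that index 1 is the positive class as the `labels=[1,0]` argument of `classification_report` suggests. The combined labels could then be compared with the multi-class predictions of **model_5** on the same test samples.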

## Final conclusions:

Regarding the performance of the models on classifying texts by the emotions **anger, fear, joy and sadness**, I draw the following conclusions:
1. The multi-class classification models performed better on the task than the binary classification models. This can be concluded from the F1-scores of the models as well as from the confusion matrices. The best model for this task is **model_5**, which reaches an F1-score of **0.81**.
2. Since all the binary classification models struggle to separate their emotion label from the rest, the task appears to be too complicated to be solved by a binary classification model. However, the binary classification models could reach better accuracy and F1-scores if the parameters of **model_5** were adjusted differently.
3. The reason behind the large number of misclassifications in the **binary classification models** could be the nature and the size of the dataset:
   - The dataset contains some information that is not useful, such as hashtags, mentions of other users (@...) and possibly emojis, which adds extra noise while training the model. As a next step, the dataset could be preprocessed in order to improve the model's performance.
   - The number of samples per label is unbalanced for the binary classification models. This can be observed with **value_counts()** and **classification_report()**, where **Rest** always has more samples than the specific label. In general though, the label **fear** has the most samples and therefore always has the most correct predictions from the multi-class models.
4. The samples of the labels **anger, fear and sadness** should be re-evaluated because they might not be representative enough of each class and are therefore often confused with one another. This issue could also be addressed with more labeled data.
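
The preprocessing step mentioned in conclusion 3 could look roughly like the sketch below. It is only one possible cleaning routine and was not applied in this project. The function name `clean_tweet`, the regular expressions and the example tweet are all assumptions made for illustration, and emojis are not handled.

```python
import re


def clean_tweet(text):
    """Remove user mentions, URLs and the '#' sign, then normalize spaces."""
    text = re.sub(r"@\w+", " ", text)          # drop user mentions such as @user
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    text = re.sub(r"\s+", " ", text)           # collapse repeated whitespace
    return text.strip().lower()


# Made-up example tweet (not taken from the dataset).
example = "@user I am SO #happy today!! https://t.co/abc"
cleaned = clean_tweet(example)
print(cleaned)  # i am so happy today!!
```

The sketch keeps the word of a hashtag and removes only the `#` sign, because a hashtag such as `#happy` may itself carry the emotion. Removing hashtags completely would be an alternative design choice that has to be tested on the validation data.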