# Introduction


*   While analyzing the predictions made by the initial BERT model, we noticed that **some 'Neutral' comments were actually not 'Neutral'**
*   It appears that this class has been used as a **"trash" class** when manual labeling was difficult
*   Some samples were labeled with multiple labels, including the 'Neutral' emotion. However, **a sample cannot be 'Neutral' and also include other emotions**
*   Training a model on these **noisy data** could increase the confusion between some emotions
*   Let's try to train a new model by **filtering out the 'Neutral'** samples



# 1 - Importing libraries and loading data

## 1.1 - Installing and importing libraries

First, let's install the `transformers` library which contains thousands of pre-trained models, including BERT.

In [2]:
!pip install transformers
!pip install emoji
!pip install contractions

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)

In [3]:
# Data manipulation libraries
import sys, os
import pandas as pd
import numpy as np
import json

import emoji
import contractions
import re

# Scikit-learn packages
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.utils.class_weight import compute_class_weight

# Packages to define a BERT model
from transformers import TFBertModel, BertTokenizerFast, BertConfig

# Keras and TensorFlow packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import TruncatedNormal

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1.2 - Loading datasets and lists of emotions

First, let's load our clean data.

In [5]:
# Importing train, validation and test datasets with preprocessed texts and labels
train_GE = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Thesis_Work/Thesis-II/DataFiles/full_dataset/train_dataset.csv")
val_GE = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Thesis_Work/Thesis-II/DataFiles/full_dataset/validation_dataset.csv")
test_GE = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Thesis_Work/Thesis-II/DataFiles/full_dataset/test_dataset.csv")

# Shape validation
print(train_GE.shape)
print(val_GE.shape)
print(test_GE.shape)


(147847, 30)
(31682, 30)
(31682, 30)


Let's also load the list of emotions from the GoEmotions taxonomy, **excluding the 'Neutral' emotion** this time.

In [6]:
# Loading emotion labels for GoEmotions taxonomy
with open("/content/drive/MyDrive/Colab Notebooks/Thesis_Work/Thesis-II/DataFiles/full_dataset/emotions.txt", "r") as file:
    GE_taxonomy = file.read().split("\n")
GE_taxonomy.remove('neutral')
print("Emotions on GoEmotions taxonomy are : \n{}".format(GE_taxonomy))

print()


Emotions on GoEmotions taxonomy are : 
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise']



## 1.3 - Filtering out the 'Neutral' only samples

First, we need to drop the 'Neutral' emotion from all datasets.

In [7]:
train_GE = train_GE.drop(columns=['neutral'])
val_GE = val_GE.drop(columns=['neutral'])
test_GE = test_GE.drop(columns=['neutral'])

In [8]:
train_GE.head()

Unnamed: 0,cleaned_text,emotion,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,...,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise
0,i was born in 98 so i feel like your 98 loss i...,approval,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"wow, you all are heroes!",curiosity,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,if its not obvious everyone is having issues w...,realization,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,this architecture will be treasured even more ...,neutral,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"that sucks, i hope you find the finances to be...",nervousness,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0


In [9]:
# Drop the 'emotion' column
train_GE = train_GE.drop(columns=['emotion'], errors='ignore')
val_GE = val_GE.drop(columns=['emotion'], errors='ignore')
test_GE = test_GE.drop(columns=['emotion'], errors='ignore')

# Display the first few rows of each dataframe to verify
print(train_GE.head())
print(val_GE.head())
print(test_GE.head())

                                        cleaned_text  admiration  amusement  \
0  i was born in 98 so i feel like your 98 loss i...           0          0   
1                          wow, you all are heroes!            0          0   
2  if its not obvious everyone is having issues w...           0          0   
3  this architecture will be treasured even more ...           0          0   
4  that sucks, i hope you find the finances to be...           0          0   

   anger  annoyance  approval  caring  confusion  curiosity  desire  ...  joy  \
0      0          0         1       0          0          0       0  ...    0   
1      0          0         0       0          0          1       0  ...    0   
2      0          0         0       0          0          0       0  ...    0   
3      0          0         0       0          0          0       0  ...    0   
4      0          0         0       0          0          0       0  ...    0   

   love  nervousness  optimism  pride 

Then, we need to remove all the samples that have been left without a label.

In [10]:
# Removing samples whose label columns are all 0
train_GE = train_GE.loc[train_GE[GE_taxonomy].sum(axis=1) > 0]
val_GE = val_GE.loc[val_GE[GE_taxonomy].sum(axis=1) > 0]
test_GE = test_GE.loc[test_GE[GE_taxonomy].sum(axis=1) > 0]

# Shape validation
print(train_GE.shape)
print(val_GE.shape)
print(test_GE.shape)

# Original shapes: (147847, 30) | (31682, 30) | (31682, 30)

(106728, 28)
(22941, 28)
(22847, 28)


In [11]:
# Preview of data
display(val_GE.head(3))

Unnamed: 0,cleaned_text,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,...,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise
0,i had the biggest smile!,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,minimalist aesthetic stuff that does not look ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,i have seen it too. i think it was an honest m...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Doing so, we have **decreased the number of samples by about 28%** (the training set goes from 147,847 to 106,728 samples).
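
As a quick check of that figure, the reduction can be recomputed from the shapes printed above. This is only a small sketch: the `rows_before` and `rows_after` dictionaries are names introduced here for illustration, filled with the row counts from the cell outputs.

```python
# Row counts copied from the shape outputs above (before and after filtering)
rows_before = {"train": 147847, "validation": 31682, "test": 31682}
rows_after = {"train": 106728, "validation": 22941, "test": 22847}

# Fraction of samples removed from each split
removed = {
    split: 1 - rows_after[split] / rows_before[split]
    for split in rows_before
}

for split, fraction in removed.items():
    print(f"{split}: {fraction:.1%} of samples removed")
```

Each split loses roughly 28% of its rows, so the three datasets keep similar relative sizes.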

# 2 - Modeling : BERT (Bidirectional Encoder Representations from Transformers)

Now we can go ahead and start defining our BERT-based model.

## 2.1 - Configuration of the base model

First of all, let's define a `max_length` variable. This variable sets the fixed length of the sequences fed to our model: longer sequences are truncated and shorter ones are padded. To limit truncation, we set this value from the longest sample in our data, measured in whitespace-separated words. Note that WordPiece tokenization can produce more tokens than words and adds the special `[CLS]` and `[SEP]` tokens, so the longest samples can still be truncated.

In [12]:
# Computing max length of samples
full_text = pd.concat([train_GE['cleaned_text'], val_GE['cleaned_text'], test_GE['cleaned_text']])
max_length = full_text.apply(lambda x: len(x.split())).max()
max_length

34
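
To make the truncation and padding behaviour concrete, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and the token ids are hypothetical and only illustrate the idea; the real BERT tokenizer does this internally (and, unlike this sketch, keeps the final `[SEP]` token when it truncates).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids)[:max_length]   # truncate if too long
    mask = [1] * len(ids)                # 1 = real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # pad if too short
    mask += [0] * padding                # 0 = padding position
    return ids, mask

# Token ids below are arbitrary placeholders, not real vocabulary lookups
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The second list returned by the helper plays the role of the attention mask: it tells the model which positions hold real tokens and which are padding.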

We are going to use BERT's base model which contains almost 110 M trainable parameters.

Also, in order to match the tokenization and vocabulary used during the training, we are going to use a BERT tokenizer.

In [13]:
# Importing BERT pre-trained model and tokenizer
model_name = 'bert-base-uncased'
config = BertConfig.from_pretrained(model_name, output_hidden_states=False)
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)
transformer_model = TFBertModel.from_pretrained(model_name, config = config)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

## 2.2 - Definition of the model architecture

Now that everything is in place, we can create a model based on BERT's main layer, and replace the top layers to reach our main objective (multi-label classification across **27 possible emotions**).

Our model takes three inputs that result from tokenization:

*   `input_ids`: indices of the input sequence tokens in the vocabulary
*   `token_ids`: segment token indices (passed to BERT as `token_type_ids`) that indicate the first and second portions of the input: 0 for sentence A and 1 for sentence B
*   `attention_mask`: mask to avoid performing attention on padding token indices: 0 for masked and 1 for not masked



In [16]:
def create_model(nb_labels):
    # Load the MainLayer
    bert = transformer_model.layers[0]

    # Build the model inputs
    input_ids = tf.keras.Input(shape=(max_length,), name='input_ids', dtype='int32')
    attention_mask = tf.keras.Input(shape=(max_length,), name='attention_mask', dtype='int32')
    token_ids = tf.keras.Input(shape=(max_length,), name='token_ids', dtype='int32')

    # Call BERT's main layer on all three inputs and keep the pooled output (index 1)
    bert_model = bert({'input_ids': input_ids, 'attention_mask': attention_mask, 'token_type_ids': token_ids})[1]

    # Dropout on the pooled output (training=False keeps this layer inactive)
    dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
    pooled_output = dropout(bert_model, training=False)

    # Build the model output
    emotion = Dense(units=nb_labels, activation="sigmoid", kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='emotion')(pooled_output)
    outputs = emotion

    # And combine it all in a model object
    model = Model(inputs=[input_ids, attention_mask, token_ids], outputs=outputs, name='BERT_MultiLabel')
    return model

We use a `sigmoid` activation function in the last dense layer, which is better suited here than a `softmax` activation function. `softmax` normalizes the outputs so that the probabilities over all labels sum to 1, which makes the labels compete with each other. In our case, each label (emotion) can independently have a probability between 0 and 1, and `sigmoid` allows that.
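
The difference between the two activations can be seen on a toy example. The sketch below uses NumPy only, with made-up logits for three labels; the helper names `sigmoid` and `softmax` are defined here for illustration and are not the Keras implementations.

```python
import numpy as np

def sigmoid(z):
    # Element-wise: each label gets its own independent probability
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Normalized across labels: the outputs always sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up logits: the first two labels are both strongly activated
logits = np.array([2.0, 2.0, -3.0])

sigmoid_probs = sigmoid(logits)
softmax_probs = softmax(logits)

print("sigmoid:", sigmoid_probs, "sum =", sigmoid_probs.sum())
print("softmax:", softmax_probs, "sum =", softmax_probs.sum())
```

With `sigmoid`, both active labels can score above a high threshold at the same time, whereas with `softmax` they have to share the probability mass and neither can exceed 0.5.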

We can now create our model using 27 labels and visualize a summary.

In [18]:
# Creating a model instance
model = create_model(27)

# Take a look at the model
model.summary()



Model: "BERT_MultiLabel"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
Total params: 109482240 (417.64 MB)
Trainable params: 109482240 (417.64 MB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________


## 2.3 - Data preprocessing and model training

### 2.3.1 - Tokenizing data

Let's go ahead and process our data. We will first separate texts from labels in the train, validation and test datasets, and then tokenize the texts using the BERT tokenizer.

In [19]:
# Creating train, validation and test variables
X_train = train_GE['cleaned_text']
y_train = train_GE.loc[:, GE_taxonomy].values.astype(float)

X_val = val_GE['cleaned_text']
y_val = val_GE.loc[:, GE_taxonomy].values.astype(float)

X_test = test_GE['cleaned_text']
y_test = test_GE.loc[:, GE_taxonomy].values.astype(float)

# Tokenizing train data
train_token = tokenizer(
    text = X_train.to_list(),
    add_special_tokens = True,
    max_length = max_length,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'tf',
    return_token_type_ids = True,
    return_attention_mask = True,
    verbose = True)

# Tokenizing validation data
val_token = tokenizer(
    text = X_val.to_list(),
    add_special_tokens = True,
    max_length = max_length,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'tf',
    return_token_type_ids = True,
    return_attention_mask = True,
    verbose = True)

# Tokenizing test data
test_token = tokenizer(
    text = X_test.to_list(),
    add_special_tokens = True,
    max_length = max_length,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'tf',
    return_token_type_ids = True,
    return_attention_mask = True,
    verbose = True)

In [20]:
# Creating BERT compatible inputs with Input Ids, attention masks and token Ids
train = {'input_ids': train_token['input_ids'], 'attention_mask': train_token['attention_mask'],'token_ids': train_token['token_type_ids']}
val = {'input_ids': val_token['input_ids'], 'attention_mask': val_token['attention_mask'],'token_ids': val_token['token_type_ids']}
test = {'input_ids': test_token['input_ids'], 'attention_mask': test_token['attention_mask'],'token_ids': test_token['token_type_ids']}

During the training phase, we are going to use batches of 16 samples, and the training data will be reshuffled after each epoch. Let's create TensorFlow datasets accordingly.

In [21]:
# Creating TF datasets
# The shuffle buffer is sized with len(y_train): len(train) would return 3, the number of keys in the dict
train_tensor = tf.data.Dataset.from_tensor_slices((train, y_train)).shuffle(len(y_train)).batch(16)
# Validation and test data do not need shuffling
val_tensor = tf.data.Dataset.from_tensor_slices((val, y_val)).batch(16)
test_tensor = tf.data.Dataset.from_tensor_slices((test, y_test)).batch(16)
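
The size of the shuffle buffer matters: calling `len()` on a dictionary of inputs returns its number of keys (3), not the number of samples. The sketch below is a simplified pure-Python model of buffer-based shuffling, written for illustration only (`buffered_shuffle` is a hypothetical helper, not TensorFlow's implementation). It shows that a buffer of 3 leaves the data almost in its original order.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Fill a buffer, then repeatedly emit one random element from it."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            output.append(buffer.pop(rng.randrange(len(buffer))))
    # Drain what is left once the input is exhausted
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output

inputs = {"input_ids": [], "attention_mask": [], "token_ids": []}
print(len(inputs))  # 3: the number of keys, not the number of samples

small = buffered_shuffle(range(20), buffer_size=len(inputs))
full = buffered_shuffle(range(20), buffer_size=20)

print("buffer of 3 :", small)
print("buffer of 20:", full)
```

With a buffer of 3, the element emitted at position `k` can only come from the first `k + 3` inputs, so samples stay close to where they started. A buffer as large as the dataset gives a full shuffle.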

### 2.3.2 - Class weights for multi-label and custom loss function

Training requires monitoring the loss function, and possibly some other metrics, to see how the model behaves throughout the epochs.

Therefore, we need to define a weighted loss function that takes class weights into account in our multi-label case.

First, we need to compute class weights.

Then, we can define a custom cross-entropy function in which each label's loss is multiplied by its class weight.

### 2.3.3 - Model training

Everything is ready, we can now start training our model.

We chose not to train for more than 4 epochs, as beyond that the model will most likely start to overfit our data.

In [22]:

from sklearn.utils.class_weight import compute_class_weight

# Function for calculating multilabel class weights
def calculating_class_weights(y_true):
    number_dim = np.shape(y_true)[1]
    weights = np.empty([number_dim, 2])
    for i in range(number_dim):
        unique_classes = np.unique(y_true[:, i])
        weights[i] = compute_class_weight('balanced', classes=unique_classes, y=y_true[:, i])
    return weights

class_weights = calculating_class_weights(y_train)
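
For a single binary label, the `'balanced'` heuristic of scikit-learn amounts to `n_samples / (n_classes * count_per_class)`. The NumPy sketch below reproduces that formula on a made-up label column; `balanced_binary_weights` is a hypothetical helper and assumes both classes (0 and 1) are present in the column.

```python
import numpy as np

def balanced_binary_weights(column):
    """Return [weight_for_0, weight_for_1] for one binary label column."""
    column = np.asarray(column).astype(int)
    counts = np.bincount(column, minlength=2)   # [n_negative, n_positive]
    return len(column) / (2 * counts)           # assumes no count is zero

# Made-up column: the emotion is present in 1 sample out of 4
toy_label_column = np.array([0, 0, 0, 1])
toy_weights = balanced_binary_weights(toy_label_column)

print(toy_weights)
```

The rare positive class receives the larger weight, so errors on the few positive samples of an infrequent emotion count more in the loss.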

In [23]:
# Custom loss function for multilabel
def get_weighted_loss(weights):
    def weighted_loss(y_true, y_pred):
        return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
    return weighted_loss
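
To see what this loss computes, here is a NumPy sketch of the same expression on made-up values. It is an illustration under assumptions, not the Keras code path: `weighted_bce` is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Weighted binary cross-entropy, averaged over the labels of each sample."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Pick the weight of class 0 when y_true == 0 and of class 1 when y_true == 1
    per_label_weight = weights[:, 0] ** (1 - y_true) * weights[:, 1] ** y_true
    return np.mean(per_label_weight * bce, axis=-1)

# One row per label: [weight of class 0, weight of class 1] (made-up values)
toy_weights = np.array([[0.6, 3.0],
                        [1.0, 1.0]])
toy_y_true = np.array([[1.0, 0.0]])   # one sample, two labels
toy_y_pred = np.array([[0.5, 0.5]])

toy_loss = weighted_bce(toy_y_true, toy_y_pred, toy_weights)
print(toy_loss)
```

Both labels have the same unweighted cross-entropy here, but the first label is a positive with weight 3.0, so it contributes three times as much to the sample's loss as the second one.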

In [None]:
# Set an optimizer
optimizer = Adam(
    learning_rate=3.e-05
    )

# Set loss
loss = get_weighted_loss(class_weights)

# Compile the model
model.compile(
    optimizer = optimizer,
    loss = loss)

# train the model
history = model.fit(train_tensor,
                    epochs=4,
                    validation_data=val_tensor,
                    )

Epoch 1/4
1069/6671 [===>..........................] - ETA: 10:24:27 - loss: 0.5548

## 2.4 - Model evaluation

### 2.4.1 - Evaluation on GoEmotions taxonomy

In [None]:
# Define the directory path where you want to save the model weights
directory_path = '/content/drive/MyDrive/Colab Notebooks/Thesis_Work/Thesis-II/DataFiles/full_dataset/'

# Save model weights
model.save_weights(directory_path + 'bert-weights.hdf5')

Let's generate predictions on test data.

In [None]:
# Making probability predictions on test data
y_pred_proba = model.predict(test)

When making predictions, we only generate probabilities associated with each label. To predict actual labels, we need to add an additional step that transforms these probabilities into labels given a certain threshold.

We define a function to do so with a default threshold set to 0.8.

In [None]:
# From probabilities to labels using a given threshold
def proba_to_labels(y_pred_proba, threshold=0.8):

    # 1 where the probability exceeds the threshold, 0 otherwise (same dtype as the input)
    y_pred_proba = np.asarray(y_pred_proba)
    y_pred_labels = (y_pred_proba > threshold).astype(y_pred_proba.dtype)

    return y_pred_labels

In [None]:
# Generate labels
y_pred_labels = proba_to_labels(y_pred_proba)

Let's evaluate these predictions using the evaluation function we defined in the previous notebooks.

In [None]:
# Model evaluation function
def model_eval(y_true, y_pred_labels, emotions):

    # Defining variables
    precision = []
    recall = []
    f1 = []

    # Per emotion evaluation
    idx2emotion = {i: e for i, e in enumerate(emotions)}

    for i in range(len(emotions)):

        # Computing precision, recall and f1-score
        p, r, f1_score, _ = precision_recall_fscore_support(y_true[:, i], y_pred_labels[:, i], average="binary")

        # Append results in lists
        precision.append(round(p, 2))
        recall.append(round(r, 2))
        f1.append(round(f1_score, 2))

    # Macro evaluation
    macro_p, macro_r, macro_f1_score, _ = precision_recall_fscore_support(y_true, y_pred_labels, average="macro")

    # Append results in lists
    precision.append(round(macro_p, 2))
    recall.append(round(macro_r, 2))
    f1.append(round(macro_f1_score, 2))

    # Converting results to a dataframe
    df_results = pd.DataFrame({"Precision":precision, "Recall":recall, 'F1':f1})
    df_results.index = emotions+['MACRO-AVERAGE']

    return df_results
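
The macro average reported in the last row is the unweighted mean of the per-emotion scores. The sketch below recomputes it by hand with NumPy on made-up labels; `per_label_scores` is a hypothetical helper, and scores with an empty denominator are set to 0 here.

```python
import numpy as np

def per_label_scores(y_true, y_pred):
    """Precision, recall and F1 for each label (column)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    zeros = lambda: np.zeros(len(tp))
    precision = np.divide(tp, tp + fp, out=zeros(), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=zeros(), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros(), where=denom > 0)
    return precision, recall, f1

# Made-up ground truth and predictions: 4 samples, 2 labels
toy_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_pred = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

toy_p, toy_r, toy_f1 = per_label_scores(toy_true, toy_pred)
toy_macro_f1 = toy_f1.mean()   # every label counts equally

print(toy_p, toy_r, toy_f1, toy_macro_f1)
```

Because every label has the same weight in the mean, a rare emotion with a poor score lowers the macro average as much as a frequent one.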

In [None]:
# Model evaluation
model_eval(y_test, y_pred_labels, GE_taxonomy)

Looking at the results, we see that this model performs better than the previous one. It looks like **removing the noise brought by the 'Neutral' emotion helped to better distinguish the other emotions.**

### 2.4.2 - Threshold optimization

In the initial evaluation, we set an arbitrary threshold. However, we can also choose a threshold that maximizes a certain metric.

We define a function that tests a certain number of possible thresholds, and returns the best threshold together with the best predicted labels and best macro f1-score.

In [None]:
# Function that computes labels from probabilities and optimizes the threshold that maximizes f1-score
def proba_to_labels_opt(y_true, y_pred_proba):

    '''
    Inputs:
        y_true: Ground truth labels
        y_pred_proba: predicted probabilities

    Outputs :
        best_y_pred_labels: predicted labels associated with best threshold
        best_t: best threshold
        best_macro_f1: macro f1-score associated with predicted labels
    '''

    # range of possible thresholds
    thresholds = np.arange(0.7, 0.99, 0.01)

    # Computing threshold that maximizes macro f1-score
    best_y_pred_labels = np.zeros_like(y_pred_proba)
    best_t = 0
    best_macro_f1 = 0

    # Iterating through possible thresholds
    for t in thresholds:

        y_pred_labels = proba_to_labels(y_pred_proba, t)

        _, _, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred_labels, average="macro")

        if macro_f1 > best_macro_f1:
            best_macro_f1 = macro_f1
            best_t = t
            best_y_pred_labels = y_pred_labels

    return best_y_pred_labels, best_t, best_macro_f1
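
The same search can be illustrated on a few made-up probabilities. This is a self-contained NumPy sketch: `macro_f1` is a hypothetical helper that stands in for the scikit-learn call, and the data are invented so that one threshold clearly wins.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores (0 when a label has no positives at all)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros(len(tp)), where=denom > 0)
    return f1.mean()

# Made-up ground truth and predicted probabilities: 4 samples, 2 labels
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_proba = np.array([[0.900, 0.200],
                      [0.300, 0.855],
                      [0.755, 0.955],
                      [0.725, 0.100]])

best_t, best_f1 = 0.0, 0.0
for t in np.arange(0.7, 0.99, 0.01):
    score = macro_f1(toy_true, (toy_proba > t).astype(int))
    if score > best_f1:          # keep the first threshold reaching the best score
        best_t, best_f1 = t, score

print(round(best_t, 2), best_f1)
```

Below the selected threshold, the false positive scored at 0.725 is still predicted as a label; just above it, that error disappears. In practice the sweep is often run on the validation split so that the test score stays an unbiased estimate.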

We can now apply this function to our predicted probabilities and compute optimized label predictions.

In [None]:
# Compute label predictions and corresponding optimal thresholds
y_pred_labels_opt, threshold_opt, macro_f1_opt = proba_to_labels_opt(y_test, y_pred_proba)
print("The model's threshold is {}".format(threshold_opt))
print("The model's best macro-f1 is {}".format(macro_f1_opt))

In [None]:
# Model evaluation : Precision, Recall, F-score
model_eval(y_test, y_pred_labels_opt, GE_taxonomy)

**Optimizing the threshold** helped us to **slightly improve** the model predictions.

## 2.5 - Make predictions

To make predictions on a new sample, it first needs to go through the same preprocessing steps we applied to the training data.

In [None]:
# Retrieving initial preprocessings
def preprocess_corpus(x):

    # Adding a space between words and punctuation
    x = re.sub( r'([a-zA-Z\[\]])([,;.!?])', r'\1 \2', x)
    x = re.sub( r'([,;.!?])([a-zA-Z\[\]])', r'\1 \2', x)

    # Demojize
    x = emoji.demojize(x)

    # Expand contraction
    x = contractions.fix(x)

    # Lower
    x = x.lower()

    #correct some acronyms/typos/abbreviations
    x = re.sub(r"lmao", "laughing my ass off", x)
    x = re.sub(r"amirite", "am i right", x)
    x = re.sub(r"\b(tho)\b", "though", x)
    x = re.sub(r"\b(ikr)\b", "i know right", x)
    x = re.sub(r"\b(ya|u)\b", "you", x)
    x = re.sub(r"\b(eu)\b", "europe", x)
    x = re.sub(r"\b(da)\b", "the", x)
    x = re.sub(r"\b(dat)\b", "that", x)
    x = re.sub(r"\b(dats)\b", "that is", x)
    x = re.sub(r"\b(cuz)\b", "because", x)
    x = re.sub(r"\b(fkn)\b", "fucking", x)
    x = re.sub(r"\b(tbh)\b", "to be honest", x)
    x = re.sub(r"\b(tbf)\b", "to be fair", x)
    x = re.sub(r"faux pas", "mistake", x)
    x = re.sub(r"\b(btw)\b", "by the way", x)
    x = re.sub(r"\b(bs)\b", "bullshit", x)
    x = re.sub(r"\b(kinda)\b", "kind of", x)
    x = re.sub(r"\b(bruh)\b", "bro", x)
    x = re.sub(r"\b(w/e)\b", "whatever", x)
    # "w/o" must be handled before "w/", which would otherwise turn it into "witho"
    x = re.sub(r"\b(w/o)\b", "without", x)
    x = re.sub(r"\b(w/)\b", "with", x)
    x = re.sub(r"\b(doj)\b", "department of justice", x)

    # Replace words with multiple occurrences of a letter, e.g. "coooool" becomes "cool"
    x = re.sub(r"\b(j+e{2,}z+e*)\b", "jeez", x)
    x = re.sub(r"\b(co+l+)\b", "cool", x)
    x = re.sub(r"\b(g+o+a+l+)\b", "goal", x)
    x = re.sub(r"\b(s+h+i+t+)\b", "shit", x)
    x = re.sub(r"\b(o+m+g+)\b", "omg", x)
    x = re.sub(r"\b(w+t+f+)\b", "wtf", x)
    x = re.sub(r"\b(w+h+a+t+)\b", "what", x)
    x = re.sub(r"\b(y+e+y+|y+a+y+|y+e+a+h+)\b", "yeah", x)
    x = re.sub(r"\b(w+o+w+)\b", "wow", x)
    x = re.sub(r"\b(w+h+y+)\b", "why", x)
    x = re.sub(r"\b(s+o+)\b", "so", x)
    x = re.sub(r"\b(f)\b", "fuck", x)
    x = re.sub(r"\b(w+h+o+p+s+)\b", "whoops", x)
    x = re.sub(r"\b(ofc)\b", "of course", x)
    x = re.sub(r"\b(the us)\b", "usa", x)
    x = re.sub(r"\b(gf)\b", "girlfriend", x)
    x = re.sub(r"\b(hr)\b", "human resources", x)
    x = re.sub(r"\b(mh)\b", "mental health", x)
    x = re.sub(r"\b(idk)\b", "i do not know", x)
    x = re.sub(r"\b(gotcha)\b", "i got you", x)
    x = re.sub(r"\b(y+e+p+)\b", "yes", x)
    x = re.sub(r"\b(a*ha+h[ha]*|a*ha +h[ha]*)\b", "haha", x)
    x = re.sub(r"\b(o?l+o+l+[ol]*)\b", "lol", x)
    x = re.sub(r"\b(o*ho+h[ho]*|o*ho +h[ho]*)\b", "ohoh", x)
    x = re.sub(r"\b(o+h+)\b", "oh", x)
    x = re.sub(r"\b(a+h+)\b", "ah", x)
    x = re.sub(r"\b(u+h+)\b", "uh", x)

    # Handling emojis
    x = re.sub(r"<3", " love ", x)
    x = re.sub(r"xd", " smiling_face_with_open_mouth_and_tightly_closed_eyes ", x)
    x = re.sub(r":\)", " smiling_face ", x)
    x = re.sub(r"\^_\^", " smiling_face ", x)
    x = re.sub(r"\*_\*", " star_struck ", x)
    x = re.sub(r":\(", " frowning_face ", x)
    x = re.sub(r":\^\(", " frowning_face ", x)
    x = re.sub(r";\(", " frowning_face ", x)
    x = re.sub(r":\/",  " confused_face", x)
    x = re.sub(r";\)",  " wink", x)
    x = re.sub(r">__<",  " unamused ", x)
    x = re.sub(r"\b([xo]+x*)\b", " xoxo ", x)
    x = re.sub(r"\b(n+a+h+)\b", "no", x)

    # Handling special cases of text
    x = re.sub(r"h a m b e r d e r s", "hamberders", x)
    x = re.sub(r"b e n", "ben", x)
    x = re.sub(r"s a t i r e", "satire", x)
    x = re.sub(r"y i k e s", "yikes", x)
    x = re.sub(r"s p o i l e r", "spoiler", x)
    x = re.sub(r"thankyou", "thank you", x)
    x = re.sub(r"a\^r\^o\^o\^o\^o\^o\^o\^o\^n\^d", "around", x)

    # Remove special characters and numbers replace by space + remove double space
    x = re.sub(r"\b([.]{3,})"," dots ", x)
    x = re.sub(r"[^A-Za-z!?_]+"," ", x)
    x = re.sub(r"\b([s])\b *","", x)
    x = re.sub(r" +"," ", x)
    x = x.strip()

    return x
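
Two of the regular-expression techniques used above can be checked in isolation. The sketch below uses only the standard `re` module; `squeeze_elongated` is a hypothetical helper that copies two of the rules, and the second part shows why a literal `^` has to be escaped (unescaped, it is an anchor for the start of the string).

```python
import re

def squeeze_elongated(text):
    """Collapse repeated letters for two of the words handled above."""
    text = re.sub(r"\b(co+l+)\b", "cool", text)
    text = re.sub(r"\b(y+e+p+)\b", "yes", text)
    return text

squeezed = squeeze_elongated("that is coooool , yeeep")
print(squeezed)

# Unescaped carets are anchors, so this pattern never matches the emoticon
unescaped = re.sub(r"^_^", " smiling_face ", "nice ^_^")
# Escaped carets match the literal characters
escaped = re.sub(r"\^_\^", " smiling_face ", "nice ^_^")

print(repr(unescaped))
print(repr(escaped))
```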

Now we can define a prediction function that takes one or more samples, and outputs the detected emotions from the model.

In [None]:
def predict_samples(text_samples, model, threshold):

    # Text preprocessing and cleaning
    text_samples_clean = [preprocess_corpus(text) for text in text_samples]

    # Tokenizing the samples
    samples_token = tokenizer(
        text = text_samples_clean,
        add_special_tokens = True,
        max_length = max_length,
        truncation = True,
        padding = 'max_length',
        return_tensors = 'tf',
        return_token_type_ids = True,
        return_attention_mask = True,
        verbose = True,
    )

    # Preparing to feed the model
    samples = {'input_ids': samples_token['input_ids'],
               'attention_mask': samples_token['attention_mask'],
               'token_ids': samples_token['token_type_ids']
              }

    # Probability predictions
    samples_pred_proba = model.predict(samples)

    # Label prediction using the given threshold
    samples_pred_labels = proba_to_labels(samples_pred_proba, threshold)

    samples_pred_labels_df = pd.DataFrame(samples_pred_labels)
    samples_pred_labels_df = samples_pred_labels_df.apply(lambda x: [GE_taxonomy[i] for i in range(len(x)) if x[i]==1], axis=1)

    #return list(samples_pred_labels_df)
    return pd.DataFrame({"Text":text_samples, "Emotions":list(samples_pred_labels_df)})
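
The last step of this function, turning each binary row into a list of emotion names, can be illustrated on its own. The sketch below uses a shortened, made-up list of emotions and a hypothetical helper `rows_to_emotions`; it is an alternative formulation of the same mapping, not the code used above.

```python
import numpy as np

# Shortened list of emotions, for illustration only
toy_emotions = ["admiration", "amusement", "anger"]

# Two samples: the first has two detected emotions, the second has none
toy_labels = np.array([[1, 0, 1],
                       [0, 0, 0]])

def rows_to_emotions(label_matrix, names):
    """Map each binary row to the names of the columns set to 1."""
    return [[names[j] for j in np.flatnonzero(row)] for row in label_matrix]

decoded = rows_to_emotions(toy_labels, toy_emotions)
print(decoded)
```

A row of zeros maps to an empty list, which is how a text with no detected emotion shows up in the output.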

Let's try it on a few examples.

In [None]:
# Predict samples
predict_samples(["My favourite food is anything I didn't have to cook myself", "are you kiddin me ??!!", "red"], model, threshold_opt)

Although the score is not very high, we see that the model **detects emotions that are coherent**. Also, **when entering a neutral text such as "red", the model does not detect any emotion**.

# Conclusion

*   In this notebook, we verified our assumption: **filtering out the 'Neutral' samples from the data improves the model**.

*   Not only **do we better distinguish actual emotions**, but **we can also detect 'Neutral' comments**, **without teaching our model what a 'Neutral' comment is**.