# Introduction



*   In the previous step, we cleaned our data.
*   In this notebook, we build a baseline model that detects one or more emotions in a text, based on the GoEmotions data (multi-label text classification).
*   The score of our baseline model will serve as a reference when building more complex models.






# 1 - Importing libraries and loading data

First, let's install and import some libraries for data exploration and processing.

In [None]:
# Installing additional libraries for text preprocessing
!pip install emoji
!pip install contractions



In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np
import json
from pprint import pprint

# Text processing libraries
import emoji
import re
import contractions
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Scikit-Learn packages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import precision_recall_fscore_support


import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Now, let's import our data.

In [None]:
# Importing train, validation and test datasets with preprocessed texts and labels
train_GE = pd.read_csv("/content/drive/MyDrive/GoEmotions_Git/data/train_clean.csv")
val_GE = pd.read_csv("/content/drive/MyDrive/GoEmotions_Git/data/val_clean.csv")
test_GE = pd.read_csv("/content/drive/MyDrive/GoEmotions_Git/data/test_clean.csv")

# Shape validation
print(train_GE.shape)
print(val_GE.shape)
print(test_GE.shape)

(43410, 29)
(5426, 29)
(5427, 29)


In [None]:
# Loading emotion labels for GoEmotions taxonomy
with open("/content/drive/MyDrive/GoEmotions_Git/data/emotions.txt", "r") as file:
    GE_taxonomy = file.read().split("\n")

for emo in GE_taxonomy:
  print(emo)

admiration
amusement
anger
annoyance
approval
caring
confusion
curiosity
desire
disappointment
disapproval
disgust
embarrassment
excitement
fear
gratitude
grief
joy
love
nervousness
optimism
pride
realization
relief
remorse
sadness
surprise
neutral


# 2 - Preprocessings and transformations

Before defining and building a baseline model, we need to perform some additional processing steps, such as tokenizing and lemmatizing our samples.

## 2.1 - Additional preprocessings for basic Machine Learning tasks

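First, let's remove punctuation and digits, keeping only letters and underscores (underscores appear in demojized emoji names such as `smiling_face`).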

In [None]:
# Additional text preprocessing
train_GE['Clean_text'] = train_GE['Clean_text'].apply(lambda x: re.sub(r"[^A-Za-z_]+"," ", x))
test_GE['Clean_text'] = test_GE['Clean_text'].apply(lambda x: re.sub(r"[^A-Za-z_]+"," ", x))
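
To make the effect of this pattern concrete, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not taken from the dataset). The character class keeps only letters and underscores, so digits disappear along with punctuation, while underscore-joined names survive.

```python
import re

# Same pattern as in the cell above: every run of characters that are
# neither letters nor underscores is replaced by a single space.
PATTERN = r"[^A-Za-z_]+"

# Hypothetical sample, for illustration only
sample = "wow, 10/10 would watch again! smiling_face"
cleaned = re.sub(PATTERN, " ", sample)
print(cleaned)
```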

Now we can tokenize our samples using spaCy, and more specifically its English model. Once the tokens are created, we can lemmatize them and remove English stop words that are unlikely to help with the classification task.

In [None]:
# Download model 
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
# Import English using en_core_web_sm.load()
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
# Creating tokenized documents
tokenized_train_GE = train_GE["Clean_text"].apply(lambda desc: nlp(desc))
tokenized_test_GE = test_GE["Clean_text"].apply(lambda desc: nlp(desc))

In [None]:
# Lemmatizing each token and removing English stop words
tokenized_train_GE = tokenized_train_GE.apply(lambda x: [token.lemma_ for token in x if token.lemma_ not in STOP_WORDS])
tokenized_test_GE = tokenized_test_GE.apply(lambda x: [token.lemma_ for token in x if token.lemma_ not in STOP_WORDS])

# Creating clean data in our dataframes
train_GE["Clean_token"] = [" ".join(x) for x in tokenized_train_GE]
test_GE["Clean_token"] = [" ".join(x) for x in tokenized_test_GE]

## 2.2 - Create TF-IDF matrix

Finally, we can create a TF-IDF matrix that will help us represent each sample of our corpus using the importance and frequency of each word in the sample, but also in the whole corpus.

In [None]:
# TF-IDF vector with 1000 words vocabulary 
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)

# Fitting the vectorizer and transforming train and test data
tfidf_train_GE = vectorizer.fit_transform(train_GE['Clean_token'])
tfidf_test_GE = vectorizer.transform(test_GE['Clean_token'])

# Converting the sparse TF-IDF matrices to dense arrays
tfidf_train_GE = tfidf_train_GE.toarray()
tfidf_test_GE = tfidf_test_GE.toarray()

# Validating the shape of train and test data
print(tfidf_train_GE.shape)
print(tfidf_test_GE.shape)

(43410, 1000)
(5427, 1000)


The `max_features` argument of the `TfidfVectorizer` caps the number of words kept in the vocabulary: only the most frequent terms across the corpus are retained. Therefore, each sample in the train and test datasets is represented by a vector of dimension `(1, 1000)`.
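
As an illustration of this cap, the following sketch fits a vectorizer on a tiny made-up corpus (the sentences and the variable names `corpus`, `toy_vectorizer` and `toy_matrix` are assumptions for this example only). With `max_features=2`, only the two most frequent terms are kept, and every sample becomes a vector of length 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
corpus = [
    "happy happy joy",
    "sad angry sad",
    "happy sad surprise",
]

# Keep only the 2 most frequent terms across the corpus
toy_vectorizer = TfidfVectorizer(max_features=2)
toy_matrix = toy_vectorizer.fit_transform(corpus)

print(sorted(toy_vectorizer.vocabulary_))  # terms kept in the vocabulary
print(toy_matrix.shape)                    # (number of samples, max_features)
```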




## 2.3 - Train and test variables

Let's define some explicit variables that will be used in constructing a machine learning model.

In [None]:
# Defining train and test variables
X_train =  tfidf_train_GE
y_train = train_GE.loc[:,GE_taxonomy].values

X_test =  tfidf_test_GE
y_test = test_GE.loc[:,GE_taxonomy].values

# Shape validation
print("The shape of X_train is : ", X_train.shape)
print("The shape of y_train is : ", y_train.shape)
print()
print("The shape of X_test is : ", X_test.shape)
print("The shape of y_test is : ", y_test.shape)

The shape of X_train is :  (43410, 1000)
The shape of y_train is :  (43410, 28)

The shape of X_test is :  (5427, 1000)
The shape of y_test is :  (5427, 28)


# 3 - Dummy model

## 3.1 - Simulating dummy predictions

Before creating a baseline model, we can try and simulate a **"dummy model"** that will **always detect the same emotions**, regardless of the sample. In our case, the dummy model could always predict the **'Neutral'** emotion as it is the most represented class in our train dataset.

In [None]:
# Preview of data
display(train_GE.head(3))

Unnamed: 0,Clean_text,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral,Clean_token
0,my favourite food is anything i did not have t...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-PRON- favourite food cook -PRON-
1,now if he does off himself everyone will think...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-PRON- -PRON- think -PRON- s laugh screw peopl...
2,why the fuck is bayless isoing,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,fuck bayles isoe


Without training an actual model, we can directly generate a prediction matrix that mimics this behaviour. The 'Neutral' emotion is the last emotion in our `GE_taxonomy` list; therefore, it is also the last column in `y_train` and `y_test`.

In [None]:
# Always predicting neutral emotion 
dummy_preds = np.zeros_like(y_test)
dummy_preds[:,-1] = 1
dummy_preds

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])
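
The choice of the constant label can be checked by counting positives per column. Below is a minimal sketch on a hypothetical toy label matrix (the names `toy_labels`, `toy_y` and `toy_dummy` and the values are made up for illustration); it finds the most frequent label and builds the matching constant prediction.

```python
import numpy as np

# Hypothetical toy label matrix: 5 samples, 3 labels (last column = "neutral")
toy_labels = ["joy", "anger", "neutral"]
toy_y = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])

# Number of positive samples per label
label_counts = toy_y.sum(axis=0)
most_frequent = toy_labels[int(label_counts.argmax())]
print(dict(zip(toy_labels, label_counts.tolist())), "->", most_frequent)

# Constant prediction: only the most frequent label is switched on
toy_dummy = np.zeros_like(toy_y)
toy_dummy[:, label_counts.argmax()] = 1
```

On the real data, the same count would be obtained with `y_train.sum(axis=0)`.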

## 3.2 - Evaluation on GoEmotions taxonomy

To evaluate the model, we will use the F1-score. The F1-score balances recall and precision, which is very useful when the data is imbalanced.
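
As a reminder, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them by hand for a single label on made-up predictions. The helper name `binary_prf` is hypothetical, and returning 0 when a denominator is zero is a convention assumed here.

```python
def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for one binary label (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Convention assumed here: a metric with a zero denominator is set to 0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical rare label: 2 positives among 10 samples
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(binary_prf(y_true, always_negative))  # accuracy would be 0.8, yet F1 is 0
print(binary_prf(y_true, one_hit))
```

This is why accuracy alone can be misleading for rare emotions: never predicting the label still gives a high accuracy, while the F1-score drops to zero.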

We define a custom function that will compute the f1-score, precision and recall for each emotion, and also compute the macro-average of these metrics as a global metric.

In [None]:
# Model evaluation function 
def model_eval(y_true, y_pred_labels, emotions):
    
    # Defining variables
    precision = []
    recall = []
    f1 = []
    
    # Per emotion evaluation      
    idx2emotion = {i: e for i, e in enumerate(emotions)}
    
    for i in range(len(emotions)):
   
        # Computing precision, recall and f1-score
        p, r, f1_score, _ = precision_recall_fscore_support(y_true[:, i], y_pred_labels[:, i], average="binary")
        
        # Append results in lists
        precision.append(round(p, 2))
        recall.append(round(r, 2))
        f1.append(round(f1_score, 2))
    
    # Macro evaluation
    macro_p, macro_r, macro_f1_score, _ = precision_recall_fscore_support(y_true, y_pred_labels, average="macro")
    
    # Append results in lists
    precision.append(round(macro_p, 2))
    recall.append(round(macro_r, 2))
    f1.append(round(macro_f1_score, 2))
    
    # Converting results to a dataframe
    df_results = pd.DataFrame({"Precision":precision, "Recall":recall, 'F1':f1})
    df_results.index = emotions+['MACRO-AVERAGE']
    
    return df_results

In [None]:
# Model evaluation
model_eval(y_test, dummy_preds, GE_taxonomy)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,Precision,Recall,F1
admiration,0.0,0.0,0.0
amusement,0.0,0.0,0.0
anger,0.0,0.0,0.0
annoyance,0.0,0.0,0.0
approval,0.0,0.0,0.0
caring,0.0,0.0,0.0
confusion,0.0,0.0,0.0
curiosity,0.0,0.0,0.0
desire,0.0,0.0,0.0
disappointment,0.0,0.0,0.0


As expected, **the model performs very poorly**. However, we can try to improve this score by implementing a baseline model using a simple machine learning classification model.

# 4 - Baseline model: Ridge Classifier

In this section, we will train a simple classification algorithm, the ridge classifier. To handle the multi-label setting, we use a simple strategy: fitting one independent model per target with the `MultiOutputClassifier`.
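
To see what "one model per target" means in practice, here is a minimal sketch on hypothetical toy data (the arrays `X_toy` and `Y_toy` and the name `toy_clf` are made up for illustration). After fitting, the wrapper exposes one fitted estimator per label column in `estimators_`.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data: 6 samples, 2 features, 3 binary labels
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 1.], [1., 2.]])
Y_toy = np.array([
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
])

toy_clf = MultiOutputClassifier(RidgeClassifier()).fit(X_toy, Y_toy)

# One independent RidgeClassifier is fitted per label column
print(len(toy_clf.estimators_))
print(toy_clf.predict(X_toy).shape)
```

Because the models are independent, correlations between emotions are not exploited by this strategy.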

## 4.1 - Training the model and evaluation on GoEmotions taxonomy

Let's create our model and fit it to our data. This is a pretty simple model: it converts the target variables to {-1, 1} and treats the problem as a regular regression task.
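
The sketch below illustrates this idea for a single binary label on hypothetical toy data: a plain `Ridge` regression on targets recoded to {-1, 1}, thresholded at 0, should give the same labels as `RidgeClassifier`. The toy arrays are made up, and the `class_weight='balanced'` option used in this notebook is left out here to keep the comparison simple.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeClassifier

# Hypothetical toy data for a single binary label
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 1.], [1., 2.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Classifier: handles the {-1, 1} encoding internally
clf = RidgeClassifier(alpha=1.0).fit(X_toy, y_toy)

# Same idea by hand: regress on targets in {-1, 1}, then threshold at 0
reg = Ridge(alpha=1.0).fit(X_toy, 2.0 * y_toy - 1.0)
manual_preds = (reg.predict(X_toy) > 0).astype(int)

print(clf.predict(X_toy))
print(manual_preds)
```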

In [None]:
# Multi-label classification 
rc = RidgeClassifier(class_weight='balanced')
classifier = MultiOutputClassifier(rc, n_jobs=-1)
classifier.fit(X_train, y_train)



MultiOutputClassifier(estimator=RidgeClassifier(alpha=1.0,
                                                class_weight='balanced',
                                                copy_X=True, fit_intercept=True,
                                                max_iter=None, normalize=False,
                                                random_state=None,
                                                solver='auto', tol=0.001),
                      n_jobs=-1)

In [None]:
# Making predictions on GoEmotions taxonomy 
classifier_preds = classifier.predict(X_test)
classifier_preds

array([[0, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 1]])

In [None]:
# Model evaluation
model_eval(y_test, classifier_preds, GE_taxonomy)

Unnamed: 0,Precision,Recall,F1
admiration,0.38,0.71,0.5
amusement,0.58,0.91,0.71
anger,0.15,0.65,0.24
annoyance,0.14,0.61,0.23
approval,0.15,0.67,0.24
caring,0.07,0.59,0.13
confusion,0.07,0.61,0.13
curiosity,0.08,0.53,0.15
desire,0.11,0.69,0.18
disappointment,0.06,0.51,0.1


As we can see, our baseline model performs better than the dummy model. However, the score is still relatively low and can be improved using a more advanced model. This score will be used as a reference in the next steps.

## 4.2 - Make predictions

To make predictions on a new sample, it needs to go through the same preprocessing steps we applied to our data.

In [None]:
# Retrieving initial text preprocessings
def preprocess_corpus(x):
    
    # Adding a space between words and punctuation
    x = re.sub( r'([a-zA-Z\[\]])([,;.!?])', r'\1 \2', x)
    x = re.sub( r'([,;.!?])([a-zA-Z\[\]])', r'\1 \2', x)

    # Demojize
    x = emoji.demojize(x)

    # Expand contraction
    x = contractions.fix(x)

    # Lower
    x = x.lower()

    # Correct some acronyms/typos/abbreviations
    x = re.sub(r"lmao", "laughing my ass off", x)  
    x = re.sub(r"amirite", "am i right", x)
    x = re.sub(r"\b(tho)\b", "though", x)
    x = re.sub(r"\b(ikr)\b", "i know right", x)
    x = re.sub(r"\b(ya|u)\b", "you", x)
    x = re.sub(r"\b(eu)\b", "europe", x)
    x = re.sub(r"\b(da)\b", "the", x)
    x = re.sub(r"\b(dat)\b", "that", x)
    x = re.sub(r"\b(dats)\b", "that is", x)
    x = re.sub(r"\b(cuz)\b", "because", x)
    x = re.sub(r"\b(fkn)\b", "fucking", x)
    x = re.sub(r"\b(tbh)\b", "to be honest", x)
    x = re.sub(r"\b(tbf)\b", "to be fair", x)
    x = re.sub(r"faux pas", "mistake", x)
    x = re.sub(r"\b(btw)\b", "by the way", x)
    x = re.sub(r"\b(bs)\b", "bullshit", x)
    x = re.sub(r"\b(kinda)\b", "kind of", x)
    x = re.sub(r"\b(bruh)\b", "bro", x)
    x = re.sub(r"\b(w/e)\b", "whatever", x)
    x = re.sub(r"\b(w/)\b", "with", x)
    x = re.sub(r"\b(w/o)\b", "without", x)
    x = re.sub(r"\b(doj)\b", "department of justice", x)

    # Normalize words with repeated letters, e.g. "coooool" becomes "cool"
    x = re.sub(r"\b(j+e{2,}z+e*)\b", "jeez", x)
    x = re.sub(r"\b(co+l+)\b", "cool", x)
    x = re.sub(r"\b(g+o+a+l+)\b", "goal", x)
    x = re.sub(r"\b(s+h+i+t+)\b", "shit", x)
    x = re.sub(r"\b(o+m+g+)\b", "omg", x)
    x = re.sub(r"\b(w+t+f+)\b", "wtf", x)
    x = re.sub(r"\b(w+h+a+t+)\b", "what", x)
    x = re.sub(r"\b(y+e+y+|y+a+y+|y+e+a+h+)\b", "yeah", x)
    x = re.sub(r"\b(w+o+w+)\b", "wow", x)
    x = re.sub(r"\b(w+h+y+)\b", "why", x)
    x = re.sub(r"\b(s+o+)\b", "so", x)
    x = re.sub(r"\b(f)\b", "fuck", x)
    x = re.sub(r"\b(w+h+o+p+s+)\b", "whoops", x)
    x = re.sub(r"\b(ofc)\b", "of course", x)
    x = re.sub(r"\b(the us)\b", "usa", x)
    x = re.sub(r"\b(gf)\b", "girlfriend", x)
    x = re.sub(r"\b(hr)\b", "human ressources", x)
    x = re.sub(r"\b(mh)\b", "mental health", x)
    x = re.sub(r"\b(idk)\b", "i do not know", x)
    x = re.sub(r"\b(gotcha)\b", "i got you", x)
    x = re.sub(r"\b(y+e+p+)\b", "yes", x)
    x = re.sub(r"\b(a*ha+h[ha]*|a*ha +h[ha]*)\b", "haha", x)
    x = re.sub(r"\b(o?l+o+l+[ol]*)\b", "lol", x)
    x = re.sub(r"\b(o*ho+h[ho]*|o*ho +h[ho]*)\b", "ohoh", x)
    x = re.sub(r"\b(o+h+)\b", "oh", x)
    x = re.sub(r"\b(a+h+)\b", "ah", x)
    x = re.sub(r"\b(u+h+)\b", "uh", x)

    # Handling emojis
    x = re.sub(r"<3", " love ", x)
    x = re.sub(r"xd", " smiling_face_with_open_mouth_and_tightly_closed_eyes ", x)
    x = re.sub(r":\)", " smiling_face ", x)
    x = re.sub(r"\^_\^", " smiling_face ", x)  # carets escaped so they match literally
    x = re.sub(r"\*_\*", " star_struck ", x)
    x = re.sub(r":\(", " frowning_face ", x)
    x = re.sub(r":\^\(", " frowning_face ", x)
    x = re.sub(r";\(", " frowning_face ", x)
    x = re.sub(r":\/",  " confused_face", x)
    x = re.sub(r";\)",  " wink", x)
    x = re.sub(r">__<",  " unamused ", x)
    x = re.sub(r"\b([xo]+x*)\b", " xoxo ", x)
    x = re.sub(r"\b(n+a+h+)\b", "no", x)
    
    # Handling special cases of text
    x = re.sub(r"h a m b e r d e r s", "hamberders", x)
    x = re.sub(r"b e n", "ben", x)
    x = re.sub(r"s a t i r e", "satire", x)
    x = re.sub(r"y i k e s", "yikes", x)
    x = re.sub(r"s p o i l e r", "spoiler", x)
    x = re.sub(r"thankyou", "thank you", x)
    x = re.sub(r"a\^r\^o\^o\^o\^o\^o\^o\^o\^n\^d", "around", x)  # carets escaped so they match literally

    # Replace special characters and numbers with a space, then collapse repeated spaces
    x = re.sub(r"\b([.]{3,})"," dots ", x)
    x = re.sub(r"[^A-Za-z!?_]+"," ", x)
    x = re.sub(r"\b([s])\b *","", x)
    x = re.sub(r" +"," ", x)
    x = x.strip()

    return x     
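
To illustrate how the normalization rules behave, here is a minimal sketch applying three of the patterns from `preprocess_corpus` to a made-up string (the sample text is an assumption for illustration). The `\b` word boundaries ensure that only whole words are rewritten, so a word like "school" is left untouched by the "cool" rule.

```python
import re

# Three of the patterns used in preprocess_corpus, applied to a made-up string
text = "coooool that was sooo funny hahahaha"
text = re.sub(r"\b(co+l+)\b", "cool", text)
text = re.sub(r"\b(s+o+)\b", "so", text)
text = re.sub(r"\b(a*ha+h[ha]*|a*ha +h[ha]*)\b", "haha", text)
print(text)

# The word boundary prevents matches inside longer words
unchanged = re.sub(r"\b(co+l+)\b", "cool", "school")
print(unchanged)
```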

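Now, let's define a function that makes predictions based on a text sample. Note that it applies only the initial cleaning defined above: the lemmatization from section 2.1 is not repeated here, so inflected words may not match the TF-IDF vocabulary.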

In [None]:
def predict_samples(text_samples, model):

    # Text preprocessing and cleaning
    text_samples = pd.Series(text_samples)
    text_samples_clean = text_samples.apply(preprocess_corpus)
    
    # Create tfidf representation
    tfidf_text_samples_clean = vectorizer.transform(text_samples_clean)
    
    # labels predictions
    samples_pred_labels = model.predict(tfidf_text_samples_clean)
    samples_pred_labels_df = pd.DataFrame(samples_pred_labels)
    samples_pred_labels_df = samples_pred_labels_df.apply(lambda x: [GE_taxonomy[i] for i in range(len(x)) if x[i]==1], axis=1)
    
    return pd.DataFrame({"Text":text_samples, "Emotions":list(samples_pred_labels_df)})

In [None]:
# Predict samples
predict_samples("no one cares my guy", classifier)

Unnamed: 0,Text,Emotions
0,no one cares my guy,"[curiosity, neutral]"


# Conclusion

*   In this notebook, we constructed a dummy model that always predicts the 'Neutral' emotion. Given that this emotion is the most represented, the "model" performs reasonably well when it comes to detecting 'Neutral', but performs poorly overall.
*   The baseline model we trained improved the score, but it remains very low. This may be because it weighs the words of a text sample only by their importance and does not place them in context (a sample is treated as a combination of independent words).
*   In the next step, we will use an algorithm that addresses the latter issue using the 'attention' mechanism: the BERT model.