# Introduction

Student: Davide Brescia 0001036867

The workflow of this project is as follows:

1. **Managing contractions**: I used the `contractions` library (*they're* becomes *they are*).
2. **Handling emoji**: I used a CSV found on [GitHub](https://gist.github.com/bfeldman89/fb25ddb63bdaa6de6ab7ac946acde96f) that maps each emoji to its textual description (😂 becomes *Face With Tears of Joy*).
3. **Regex basic cleaning**: I used a script found on [Kaggle](https://www.kaggle.com/code/amackcrane/python-version-of-glove-twitter-preprocess-script/script), a Python port of the preprocessing script from the official [GloVe](https://nlp.stanford.edu/projects/glove/) page. It improves the quality of the text by tagging written smileys, words in all caps, repeated punctuation, etc.
4. **Tokenization**: I used the `nltk` library because it provides a tokenizer designed for Twitter. The writing in the dataset is quite informal, so this seemed a good choice.
5. **Spell checker**: I used the `pyspellchecker` library for a simple typo check. To avoid weighing down training and to keep the preprocessing from being too aggressive, I chose a smaller edit distance (`distance=1`). In short, it handles cases such as two letters being swapped in a word.
6. **Preparing the dataset**: I used the TensorFlow tokenizer to assign a unique integer to each word. Once each sentence had been converted into a vector of integers, I padded the vectors so that all inputs have the same length.
7. **Neural network**: The architecture is taken from the paper [A Comparison of Word-Embeddings in Emotion Detection from Text using BiLSTM, CNN and Self-Attention](https://www.researchgate.net/publication/333740389_A_Comparison_of_Word-Embeddings_in_Emotion_Detection_from_Text_using_BiLSTM_CNN_and_Self-Attention). After some research and several model changes, I concluded that this network was among the best options I tried. The embedding layer uses pre-trained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors.



# Imports

In [154]:
import pandas as pd
import numpy as np

In [155]:
#Reading the datasets

pd.set_option('display.max_colwidth', None)

train_df = pd.read_csv("/content/train_ekmann.csv")
print("Number of train data: ", train_df.shape[0])

test_df = pd.read_csv("/content/test_ekmann.csv")
print("Number of test data: ", test_df.shape[0])

val_df = pd.read_csv("/content/val_ekmann.csv")
print("Number of validation data: ", val_df.shape[0])

train_df[['Text', 'Emotion']].sample(n=5)

Number of train data:  43410
Number of test data:  5427
Number of validation data:  5426


Unnamed: 0,Text,Emotion
38338,Because [NAME] was definitely pregnant with both a boy and a girl when he got on. Yup.,neutral
43328,i can also post pics of the receipt if anyone doesn’t believe me!!!,neutral
41663,And thirsty babies,neutral
36987,His friend was sexy as hell 💁,joy
31068,"If you don’t include AZ in your 3, you’re wrong.",anger


In [156]:
!wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
!unzip -q glove.twitter.27B.zip

In [157]:
# Path to the 200-dimensional Twitter GloVe vectors extracted above
path_to_glove_file = "/content/glove.twitter.27B.200d.txt"

# Map each word to its embedding vector
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 1193514 word vectors.


# Preprocessing

## Manage Contractions

In [158]:
!pip install contractions
import contractions

def expand_contraction(text):
  expanded_words = []   
  for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))
  return ' '.join(expanded_words)

train_df['textpp'] = train_df.apply(lambda x: expand_contraction(x['Text']), axis=1)
test_df['textpp'] = test_df.apply(lambda x: expand_contraction(x['Text']), axis=1)
val_df['textpp'] = val_df.apply(lambda x: expand_contraction(x['Text']), axis=1)

train_df[['textpp', 'Text']].sample(n=5)



Unnamed: 0,textpp,Text
7668,I did not downvote Mate :) In fact I just gave you two upvotes :) Cheers :),I did not downvote Mate :) In fact I just gave you two upvotes :) Cheers :)
10055,Great point.,Great point.
26654,"i believe he has a minor ankle injury at the moment, i am not too worried about his skating","i believe he has a minor ankle injury at the moment, i’m not too worried about his skating"
22404,"i see, thanks bro","i see, thanks bro"
15219,Is this a real question or a snark?,Is this a real question or a snark?
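
The idea behind this step can be sketched without the library: a lookup table maps each contraction to its expanded form. The table `TOY_CONTRACTIONS` and the helper `expand_toy` below are hypothetical and cover only a few cases; the real `contractions` package ships a far larger mapping.

```python
# Toy lookup table (hypothetical); keys are lowercase with a straight apostrophe.
TOY_CONTRACTIONS = {
    "they're": "they are",
    "won't": "will not",
    "i'm": "i am",
}


def expand_toy(text):
    """Expand whitespace-separated contractions found in the toy table."""
    expanded = []
    for word in text.split():
        # Normalize case and curly apostrophes before the lookup.
        key = word.lower().replace("\u2019", "'")
        expanded.append(TOY_CONTRACTIONS.get(key, word))
    return " ".join(expanded)


example = expand_toy("i\u2019m sure they're late")
print(example)
```

Normalizing the curly apostrophe matters for this dataset, since many comments use `’` rather than `'`.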


## Handle Emoji

In [159]:
#Emoji
# Load the emoji -> description table as a dict (first column is the emoji)
emoji = pd.read_csv("/content/emojis.csv", index_col=0, header=None).squeeze("columns").to_dict()

train_df[['textpp']] = train_df[['textpp']].replace(emoji, regex=True)
test_df[['textpp']] = test_df[['textpp']].replace(emoji, regex=True)
val_df[['textpp']] = val_df[['textpp']].replace(emoji, regex=True)

train_df[['textpp', 'Text']].sample(n=5)

Unnamed: 0,textpp,Text
9220,No longer want to what?,No longer want to what?
4898,Thank you for your inspiration,Thank you for your inspiration
43036,Everyone forgot to tell you the 1st step is to marry the Sheriff's daughter.,Everyone forgot to tell you the 1st step is to marry the Sheriff's daughter.
28800,We will not. If Ireland do they will be violating the GFA.,We won't. If Ireland do they'll be violating the GFA.
23611,Just tagging along for the screenshot. As you were.,Just tagging along for the screenshot. As you were.
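
As a small illustration of the substitution, here is a sketch that uses a two-entry mapping instead of the CSV. `TOY_EMOJI` and `replace_emoji` are hypothetical names, and padding the description with spaces is an assumption of this sketch (the cell above substitutes the description without adding spaces).

```python
# Toy mapping (hypothetical); the notebook loads the full table from emojis.csv.
TOY_EMOJI = {
    "😂": "Face With Tears of Joy",
    "🔥": "Fire",
}


def replace_emoji(text, mapping):
    """Replace each emoji with its description, padded with spaces."""
    for symbol, description in mapping.items():
        text = text.replace(symbol, f" {description} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())


example = replace_emoji("so funny😂", TOY_EMOJI)
print(example)
```

The padding keeps the description from fusing with an adjacent word when an emoji is typed without a preceding space.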


## Regex Basic cleaning

URLs, text emoticons, hashtags, etc.
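
To illustrate the kind of rewriting this step performs, here is a reduced sketch with three rules (URLs, user mentions, repeated punctuation) using only the standard `re` module. The helper `mini_clean` and the sample sentence are hypothetical; the full script in the next cell covers many more cases.

```python
import re


def mini_clean(text):
    """Apply three GloVe-style substitutions, then lowercase."""
    # Replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "<url>", text)
    # Replace user mentions with a placeholder token.
    text = re.sub(r"@\w+", "<user>", text)
    # Collapse repeated punctuation and mark the repetition.
    text = re.sub(r"([!?.]){2,}", r"\1 <repeat>", text)
    return text.lower()


sample = "Check https://example.com @bob this is GREAT!!!"
cleaned = mini_clean(sample)
print(cleaned)
```

The placeholder tokens (`<url>`, `<user>`, `<repeat>`) match entries in the GloVe Twitter vocabulary, which is why the text is rewritten this way rather than simply stripped.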

In [160]:
import sys
import regex as re
FLAGS = re.MULTILINE | re.DOTALL

def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = "<hashtag> {} <allcaps>".format(hashtag_body.lower())
    else:
        result = " ".join(["<hashtag>"] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
    return result

def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps> " # amackcrane added trailing space


def tokenize(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"/"," / ")
    text = re_sub(r"<3","<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    text = re_sub(r"#\w+", hashtag)  # amackcrane edit
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")

    ## -- I just don't understand why the Ruby script adds <allcaps> to everything so I limited the selection.
    # text = re_sub(r"([^a-z0-9()<>'`\-]){2,}", allcaps)
    #text = re_sub(r"([A-Z]){2,}", allcaps)  # moved below -amackcrane

    text = re_sub(r'\'s ', " ")

    # amackcrane additions
    text = re_sub(r"([a-zA-Z<>()])([?!.:;,])", r"\1 \2")
    text = re_sub(r"\(([a-zA-Z<>]+)\)", r"( \1 )")
    text = re_sub(r"  ", r" ")
    text = re_sub(r" ([A-Z]){2,} ", allcaps)

    return text.lower()


#_, text = sys.argv  # kaggle envt breaks this -amackcrane
#if text == "test":

train_df['textpp'] = train_df.apply(lambda x: tokenize(x['textpp']), axis=1)
test_df['textpp'] = test_df.apply(lambda x: tokenize(x['textpp']), axis=1)
val_df['textpp'] = val_df.apply(lambda x: tokenize(x['textpp']), axis=1)

train_df[['textpp', 'Text']].sample(n=5)

Unnamed: 0,textpp,Text
10952,"> but , president [name] can just hand out pardons for anything federal . presidential pardons do not affect impeachments .","> But, President [NAME] can just hand out pardons for anything federal. Presidential pardons don't affect impeachments."
41414,they are so damn handsome !,They are so damn handsome!
9569,"thanks , i guess ? haha","Thanks, I guess? Haha"
7256,lol okay . it is not like they are billion dollar organizations that have their own personal agenda . fine . enjoy your bubble .,Lol okay. It's not like they're billion dollar organizations that have their own personal agenda. Fine. Enjoy your bubble.
3401,"would it make you feel any better , little girl , if they was pushed out of windows ?","Would it make you feel any better, little girl, if they was pushed out of windows?"


## Tokenization with NLTK

In [161]:
import nltk
from nltk.tokenize.casual import TweetTokenizer

t = TweetTokenizer(reduce_len = True)
nltk.download('punkt')
#nltk.download('wordnet')

#Twitter tokenization
train_df['tokenized'] = train_df['textpp'].apply(t.tokenize)
test_df['tokenized'] = test_df['textpp'].apply(t.tokenize)
val_df['tokenized'] = val_df['textpp'].apply(t.tokenize)

train_df[['tokenized', 'Text']].sample(n=5)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,tokenized,Text
5923,"[i, never, believe, draft, rumors, of, who, teams, are, looking, at, or, wanting, ,, especially, if, its, fan, rumors, .]","I never believe draft rumors of who teams are looking at or wanting, especially if its fan rumors."
33004,"[you, are, right, ., totally, my, bad, .]",You are right. Totally my bad.
8115,"[i, am, surprised, it, was, uploaded, today, ,, especially, since, he, uploaded, a, video, just, a, few, hours, ago]","I’m surprised it was uploaded today, especially since he uploaded a video just a few hours ago"
7906,"[>, it, is, upsetting, ., what, can, be, done, ?, we, will, have, to, take, it, to, the, high, courts, ,, let, them, decide, !]","> It's upsetting. What can be done? We'll have to take it to the high courts, let them decide!"
11810,"[this, is, horrible, ., <repeat>, one, of, my, worst, nightmares, actually, .]",This is horrible... One of my worst nightmares actually.
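
`TweetTokenizer(reduce_len=True)` shortens runs of three or more repeated characters to exactly three. A regex sketch of that normalization follows, assuming it mirrors NLTK's behavior; the helper name `shorten_runs` is hypothetical.

```python
import re


def shorten_runs(text):
    """Reduce any character repeated three or more times to three repetitions."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


example = shorten_runs("soooooo goooood")
print(example)
```

This keeps elongated words distinguishable from their base form while limiting the number of distinct spellings that reach the vocabulary.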


## Spell Checker 

In [162]:
#Spell checker
!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker(distance=1)

def spell_correct(tokens):
  #tokens = text.split()
  misspelled = spell.unknown(tokens)
  sptext = []
  for word in tokens:
    if len(word)>2 and word in misspelled:
        sptext.append(spell.correction(word))
    else:
      sptext.append(word)
  #return ' '.join(sptext)    
  return sptext

train_df['tokenized'] = train_df['tokenized'].apply(spell_correct)
test_df['tokenized'] = test_df['tokenized'].apply(spell_correct)
val_df['tokenized'] = val_df['tokenized'].apply(spell_correct)

train_df[['tokenized', 'Text']].sample(n=5)



Unnamed: 0,tokenized,Text
39427,"[hey, friend, nothing, down, here, but, there, are, all, <allcaps>, sorts, of, interesting, smells, !]",Hey friendo nothing down here but there are ALL SORTS of interesting smells!
24871,"[holy, shit, <allcaps>, i, have, <allcaps>, kari, in, <allcaps>, my, pool]",HOLy SHIT I HAVE KADRI IN MY POOL
6618,"[a, better, description, would, be, she, got, a, nasty, fright, rather, than, it, being, terrifying, .]",A better description would be she got a nasty fright rather than it being terrifying.
19426,"[told, you, <allcaps>, all, ,, bravo]","TOLD YOU ALL, bravo"
23721,"[by, having, to, pay, out, over, a, hundred, million, in, lawsuits, ?]",By having to pay out over a hundred million in lawsuits?
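
With `distance=1`, only candidates one edit away from the unknown word are considered. The following sketch, in the style of Peter Norvig's well-known spelling corrector, enumerates those single edits (deletion, transposition, replacement, insertion). The vocabulary `TOY_VOCAB` and the alphabetical tie-break in `correct` are assumptions for illustration; `pyspellchecker` ranks candidates by word frequency instead.

```python
import string

# Toy vocabulary (hypothetical); a real checker uses a large frequency list.
TOY_VOCAB = {"the", "friend", "their"}


def edits1(word):
    """Return every string that is one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return a known word one edit away, or the word itself if none exists."""
    if word in TOY_VOCAB:
        return word
    candidates = sorted(edits1(word) & TOY_VOCAB)
    return candidates[0] if candidates else word


print(correct("freind"), correct("teh"), correct("xyzzy"))
```

The transposition case is what fixes swapped letters such as *freind*; the deletion case explains corrections like *friendo* to *friend* in the sample output above.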


In [163]:
from collections import Counter

counter_train = Counter()
train_df['tokenized'].apply(counter_train.update)

vocabulary_length = len(counter_train)

print("Number of words found: ", vocabulary_length)

Number of words found:  24560


# Sequences and Padding

In [164]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

#From words to number
oov_token = "<OOV>"
max_words = vocabulary_length
max_length = 200
tokenizer = Tokenizer(num_words = max_words, oov_token = oov_token)

tokenizer.fit_on_texts(train_df['tokenized'])
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_df['tokenized'])
test_sequences = tokenizer.texts_to_sequences(test_df['tokenized'])
val_sequences = tokenizer.texts_to_sequences(val_df['tokenized'])

#Here we need to use a notation like this [1, 50, 40, 7, 0, 0, 0, ...]
#in order to have the same size for each array
train_padded = pad_sequences(train_sequences, maxlen = max_length, padding='post', truncating='post')
test_padded = pad_sequences(test_sequences, maxlen = max_length, padding='post', truncating='post')
val_padded = pad_sequences(val_sequences, maxlen = max_length, padding='post', truncating='post')
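
To make the mapping concrete, here is a pure-Python sketch of what the cell above does on toy data: words are ranked by frequency, index 1 is reserved for out-of-vocabulary tokens, and index 0 is used for padding. The helpers `build_word_index` and `to_padded` are hypothetical stand-ins for the Keras `Tokenizer` and `pad_sequences` (with `padding='post'` and `truncating='post'`), not their actual implementation.

```python
from collections import Counter


def build_word_index(token_lists, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for unknown tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_padded(tokens, word_index, maxlen, oov_token="<OOV>"):
    """Convert tokens to ids, cut after maxlen ids, then right-pad with zeros."""
    ids = [word_index.get(tok, word_index[oov_token]) for tok in tokens]
    ids = ids[:maxlen]  # post-truncation keeps the first maxlen ids
    return ids + [0] * (maxlen - len(ids))  # post-padding appends zeros


train = [["i", "am", "happy"], ["i", "am"], ["i"]]
toy_index = build_word_index(train)
padded = to_padded(["i", "feel", "happy"], toy_index, maxlen=5)
truncated = to_padded(["i", "am", "happy", "i", "am", "happy"], toy_index, maxlen=5)
print(toy_index, padded, truncated)
```

The word *feel* never appears in the toy training data, so it maps to the out-of-vocabulary index rather than being dropped.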

# Create the embedding Matrix

In [165]:
embedding_dim = 200  # dimensionality of the glove.twitter.27B.200d vectors

# Rows for words missing from the GloVe index stay all-zeros
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# One hot encoder

In [166]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse = False, dtype='uint8')

ohe.fit(train_df[['Emotion']])
train_labels = ohe.transform(train_df[['Emotion']])
test_labels = ohe.transform(test_df[['Emotion']])
val_labels = ohe.transform(val_df[['Emotion']])

i=3
test_padded[i], test_labels[i], test_df['Emotion'][i]

(array([   3,   58,   12,   67,   11,    6,   93,    9,   21, 3474,   41,
         149,  364,   18,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

# Coverage check

In [167]:
import operator
from tqdm.notebook import tqdm
from collections import Counter

#Coverage check
def check_coverage(vocab, embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            # Word has no GloVe vector: record it as out-of-vocabulary
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key= operator.itemgetter(1))[::-1]

    return sorted_x

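# Note: word_index maps each word to its rank, not its frequency, so the
# "all text" percentage printed by check_coverage is weighted by rank.
oov = check_coverage(word_index, embeddings_index)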

'''
To analyze the difference between tokenization and glove dictionary,
I print this dictionary to get an estimate of how preprocessing is working
'''
with open(r'/content/unusedw.txt', 'w') as fp:
    for item in oov:
        # write each item on a new line
        fp.write(str(item) + "\n")
    print('Done')

Counter(oov).most_common(10)

  0%|          | 0/24561 [00:00<?, ?it/s]

Found embeddings for 89.83% of vocab
Found embeddings for  85.17% of all text
Done


[(('legitimise', 24550), 1),
 (("guilty'ing", 24549), 1),
 (('flummery', 24548), 1),
 (('th-graders', 24546), 1),
 (('college-level', 24545), 1),
 (('scrapheap', 24539), 1),
 (('tamburitzans', 24536), 1),
 (('serbian-american', 24535), 1),
 (('redditsmuseumoffilth', 24533), 1),
 (('sterilised', 24532), 1)]
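
A frequency-weighted variant of this check can be sketched as follows, assuming token counts (such as `counter_train` from the earlier cell) are passed in rather than the tokenizer's rank indices. The function name `coverage` and the toy data are hypothetical.

```python
from collections import Counter


def coverage(counts, embeddings):
    """Return vocabulary coverage, token coverage, and OOV words by frequency."""
    known = {word: c for word, c in counts.items() if word in embeddings}
    vocab_cov = len(known) / len(counts)
    text_cov = sum(known.values()) / sum(counts.values())
    oov = sorted(
        ((word, c) for word, c in counts.items() if word not in embeddings),
        key=lambda item: item[1],
        reverse=True,
    )
    return vocab_cov, text_cov, oov


toy_counts = Counter(["the", "the", "the", "cat", "flummery"])
toy_embeddings = {"the": [0.1], "cat": [0.2]}  # stand-in for embeddings_index
vocab_cov, text_cov, oov_words = coverage(toy_counts, toy_embeddings)
print(f"{vocab_cov:.2%} of vocab, {text_cov:.2%} of all text, OOV: {oov_words}")
```

Sorting the out-of-vocabulary words by count puts the most frequent misses first, which is the most useful view when deciding which preprocessing rule to adjust.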

# Neural Network

In [168]:
!pip install keras-self-attention
from keras.layers import (
    Input,
    Embedding,
    Bidirectional,
    LSTM,
    Conv1D,
    MaxPool1D,
    Dropout,
    Concatenate,
    GlobalMaxPool1D,
    Dense,
)
from keras_self_attention import SeqSelfAttention
from keras import Model
import keras

#GloVe embedding layer
embedding_layer = Embedding(
    input_dim = len(word_index) + 1,
    output_dim = embedding_matrix.shape[1],  # GloVe vector size (200), not the sequence length
    weights = [embedding_matrix],
    input_length = max_length,
    trainable = False
)

'''
Neural Network from the paper:
A Comparison of Word-Embeddings in Emotion Detection from 
Text using BiLSTM, CNN and Self-Attention
'''
inputs = Input(shape=(max_length,))
embedding = embedding_layer(inputs)
bilstm = Bidirectional(LSTM(200, return_sequences = True, dropout=0.3, activation='tanh'))(embedding)
selfattention = SeqSelfAttention(attention_activation='sigmoid')(bilstm)
conv1D = Conv1D(400, 5, activation='relu')(selfattention)
maxpool1D = MaxPool1D(2)(conv1D)
dropout_one = Dropout(0.2)(maxpool1D)
concatted = Concatenate(axis=1)([bilstm, dropout_one])
globalmaxpool1D = GlobalMaxPool1D()(concatted)
dense = Dense(100)(globalmaxpool1D)
dropout_two = Dropout(0.2)(dense)
output = Dense(7, activation = 'softmax')(dropout_two)

model = Model(inputs = inputs, outputs = output)
model.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_9 (InputLayer)           [(None, 200)]        0           []                               
                                                                                                  
 embedding_8 (Embedding)        (None, 200, 200)     4912400     ['input_9[0][0]']                
                                                                                                  
 bidirectional_8 (Bidirectional  (None, 200, 400)    641600      ['embedding_8[0][0]']            
 )                                                                                                
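
The summary output stops after the BiLSTM layer, but the sequence lengths through the remaining layers can be worked out by hand. A sketch, assuming the Keras defaults of `padding='valid'` and stride 1 for `Conv1D`, and a stride equal to the pool size for `MaxPool1D`; the helper names are hypothetical.

```python
def conv1d_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (steps - kernel_size) // stride + 1


def maxpool1d_length(steps, pool_size):
    """Output length of 1D max pooling with stride equal to pool_size."""
    return (steps - pool_size) // pool_size + 1


max_length = 200
after_bilstm = max_length                        # return_sequences=True keeps every step
after_attention = after_bilstm                   # self-attention preserves the length
after_conv = conv1d_length(after_attention, 5)   # Conv1D(400, 5)
after_pool = maxpool1d_length(after_conv, 2)     # MaxPool1D(2)
after_concat = after_bilstm + after_pool         # Concatenate(axis=1)
print(after_conv, after_pool, after_concat)
```

Concatenating along axis 1 works because both branches have 400 features per step: the BiLSTM outputs 2 x 200 units and the convolution uses 400 filters.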
                                                                                            

# Training

In [169]:
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

callback = EarlyStopping(monitor = "val_loss", 
                         mode = "auto", 
                         min_delta=0.001, 
                         patience = 5, 
                         verbose = 2,
                         restore_best_weights = True,
                         baseline=None)

mc = ModelCheckpoint('./model.h5', 
                     monitor = 'val_f1_score', 
                     mode = 'max', 
                     verbose = 1, 
                     save_best_only = True)
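
The interaction between `patience` and `min_delta` can be sketched in plain Python. The function below is a simplified model of the early-stopping rule, not the Keras implementation, and the validation losses are made-up numbers chosen only to illustrate the logic.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of losses."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, for illustration only.
toy_losses = [1.0, 0.9, 0.85, 0.8, 0.81, 0.82, 0.8005, 0.83, 0.84]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=5, min_delta=0.001)
print(stop_epoch, best_epoch)
```

Note that the callback above watches `val_loss` while the checkpoint watches `val_f1_score`, so the weights restored by early stopping and the weights saved to `model.h5` can come from different epochs.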

In [170]:
!pip install tensorflow_addons
import tensorflow_addons as tfa

model.compile(loss = 'categorical_crossentropy',
              optimizer = 'adam',
              metrics = [tfa.metrics.F1Score(num_classes = 7, average='macro')])

history = model.fit(train_padded, train_labels,
                    epochs = 15, batch_size = 16,
                    validation_data=(val_padded, val_labels),
                    verbose = 1, callbacks = [mc, callback])

Epoch 1/15
Epoch 1: val_f1_score improved from -inf to 0.57312, saving model to ./model.h5
Epoch 2/15
Epoch 2: val_f1_score did not improve from 0.57312
Epoch 3/15
Epoch 3: val_f1_score improved from 0.57312 to 0.57529, saving model to ./model.h5
Epoch 4/15
Epoch 4: val_f1_score improved from 0.57529 to 0.58781, saving model to ./model.h5
Epoch 5/15
Epoch 5: val_f1_score did not improve from 0.58781
Epoch 6/15
Epoch 6: val_f1_score did not improve from 0.58781
Epoch 7/15
Epoch 7: val_f1_score improved from 0.58781 to 0.59679, saving model to ./model.h5
Epoch 8/15
Epoch 8: val_f1_score did not improve from 0.59679
Epoch 9/15
Epoch 9: val_f1_score did not improve from 0.59679
Restoring model weights from the end of the best epoch: 4.
Epoch 9: early stopping


In [171]:
from sklearn import metrics
from sklearn.metrics import f1_score

predicted_labels = np.argmax(model.predict(test_padded), axis = 1)
true_labels = np.argmax(test_labels, axis = 1)

print(metrics.classification_report(true_labels, predicted_labels))

print(f1_score(true_labels, predicted_labels, average='macro'))

              precision    recall  f1-score   support

           0       0.57      0.42      0.48       572
           1       0.59      0.48      0.53       116
           2       0.54      0.78      0.64        81
           3       0.84      0.72      0.78      1978
           4       0.58      0.74      0.65      1648
           5       0.61      0.55      0.57       355
           6       0.57      0.56      0.57       677

    accuracy                           0.66      5427
   macro avg       0.61      0.61      0.60      5427
weighted avg       0.67      0.66      0.66      5427

0.6021878522094174
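
The final number is the macro-averaged F1, that is, the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. A pure-Python sketch of the computation on toy labels (the helper `macro_f1` and the label lists are hypothetical):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-class scores are 0.5, 0.8 and 2/3, and the macro average is their plain mean.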
