# **<font color="blue">Overview</font>** 

In this notebook, I've used a tweets dataset that contains tweet text with 12 emotions (neutral, worry, happiness, sadness, love, surprise, fun, relief, hate, empty, enthusiasm, boredom and anger) and the goal is to predict the percentage of emotions in a giving text

To achieve that goal I've used some techniques fist to preprocess the text data :

<li>correct misspelled text</li>
<li>replace English contractions with there meaning (isn't => is not)</li>
<li>remove some punctuations, URLS user mentions and extra spaces</li>
<li>replace emojis with there meaning</li><br>

For the modeling part I've used LSTM's and Roberta base Model:
<li>First a Basic LSTM </li>
<li>LSTM model with glove word embeddings</li>
<li>Roberta Base model </li>
<br>
In the final part, I've made a donut chart that detects the level of emotions is a particular text.

In [1]:
!pip install tweet-preprocessor 2>/dev/null 1>/dev/null

In [2]:
import preprocessor as p
import numpy as np 
import pandas as pd 
import emoji
import keras
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
import plotly.graph_objects as go
import plotly.express as px
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors
from tqdm import tqdm

Using TensorFlow backend.


# **<font color="blue">Data preparation</font>**

In [3]:
data = pd.read_csv("/kaggle/input/figure-eight-labelled-textual-dataset/text_emotion.csv")

In [4]:
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...


### Misspelled data

Correcting word spelling as defined in miss_corr dictionery

In [5]:
misspell_data = pd.read_csv("/kaggle/input/spelling/aspell.txt",sep=":",names=["correction","misspell"])
misspell_data.misspell = misspell_data.misspell.str.strip()
misspell_data.misspell = misspell_data.misspell.str.split(" ")
misspell_data = misspell_data.explode("misspell").reset_index(drop=True)
misspell_data.drop_duplicates("misspell",inplace=True)
miss_corr = dict(zip(misspell_data.misspell, misspell_data.correction))

#Sample of the dict
{v:miss_corr[v] for v in [list(miss_corr.keys())[k] for k in range(20)]}

{'nevade': 'Nevada',
 'presbyterian': 'Presbyterian',
 'rsx': 'RSX',
 'Steffen': 'Stephen',
 'susan': 'Susan',
 'abilitey': 'ability',
 'abouy': 'about',
 'absorbtion': 'absorption',
 'accidently': 'accidentally',
 'accomodate': 'accommodate',
 'acommadate': 'accommodate',
 'acord': 'accord',
 'aquantance': 'acquaintance',
 'equire': 'acquire',
 'adultry': 'adultery',
 'aggresive': 'aggressive',
 'alchohol': 'alcohol',
 'alchoholic': 'alcoholic',
 'allieve': 'alive',
 'alright': 'all_right'}

In [6]:
def misspelled_correction(val):
    for x in val.split(): 
        if x in miss_corr.keys(): 
            val = val.replace(x, miss_corr[x]) 
    return val

data["clean_content"] = data.content.apply(lambda x : misspelled_correction(x))

In [7]:
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content,clean_content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...,@dannycastillo We want to trade with someone w...


### Contractions
Reducing the size of word

In [8]:
contractions = pd.read_csv("/kaggle/input/contractions/contractions.csv")
cont_dic = dict(zip(contractions.Contraction, contractions.Meaning))

In [9]:
def cont_to_meaning(val): 
  
    for x in val.split(): 
        if x in cont_dic.keys(): 
            val = val.replace(x, cont_dic[x]) 
    return val


In [10]:
data.clean_content = data.clean_content.apply(lambda x : cont_to_meaning(x))
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content,clean_content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...,@dannycastillo We want to trade with someone w...


### Remove URLS and mentions 

In [11]:
p.set_options(p.OPT.MENTION, p.OPT.URL)
p.clean("hello guys @alx #sport🔥 1245 https://github.com/s/preprocessor")

'hello guys #sport🔥 1245'

In [12]:
data["clean_content"]=data.content.apply(lambda x : p.clean(x))

### Punctuations and emojis

In [13]:
def punctuation(val): 
  
    punctuations = '''()-[]{};:'"\,<>./@#$%^&_~'''
  
    for x in val.lower(): 
        if x in punctuations: 
            val = val.replace(x, " ") 
    return val


In [14]:
punctuation("test @ #ldfldlf??? !! ")

'test    ldfldlf??? !! '

In [15]:
data.clean_content = data.clean_content.apply(lambda x : ' '.join(punctuation(emoji.demojize(x)).split()))

In [16]:
def clean_text(val):
    val = misspelled_correction(val)
    val = cont_to_meaning(val)
    val = p.clean(val)
    val = ' '.join(punctuation(emoji.demojize(val)).split())
    
    return val

In [17]:
clean_text("isn't 💡 adultry @ttt good bad ... ! ? ")

'is not light bulb adultery good bad ! ?'

In [18]:
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content,clean_content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier and...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh waitin on y...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony gloomy friday
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...,We want to trade with someone who has Houston ...


### Remove empty comments

In [19]:
data = data[data.clean_content != ""]
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content,clean_content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier and...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh waitin on y...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony gloomy friday
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...,We want to trade with someone who has Houston ...


### Count of records by sentiment

In [20]:
data.sentiment.value_counts()

neutral       8579
worry         8454
happiness     5208
sadness       5162
love          3841
surprise      2187
fun           1776
relief        1526
hate          1323
empty          815
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

# **<font color="blue">Data Modelling</font>**

### Encoding the data and train test split 

In [21]:
sent_to_id  = {"empty":0, "sadness":1,"enthusiasm":2,"neutral":3,"worry":4,
                        "surprise":5,"love":6,"fun":7,"hate":8,"happiness":9,"boredom":10,"relief":11,"anger":12}

In [22]:
data["sentiment_id"] = data['sentiment'].map(sent_to_id)


In [23]:
data.head(5)

Unnamed: 0,tweet_id,sentiment,author,content,clean_content,sentiment_id
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...,i know i was listenin to bad habit earlier and...,0
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh waitin on y...,1
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony gloomy friday,1
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!,2
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...,We want to trade with someone who has Houston ...,3


In [24]:
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data.sentiment_id)

onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
Y = onehot_encoder.fit_transform(integer_encoded)

In [25]:
Y

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [26]:
Y[0]

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [27]:
X_train, X_test, y_train, y_test = train_test_split(data.clean_content,Y, random_state=1995, test_size=0.2, shuffle=True)

### LSTM 

In [28]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 160
Epoch = 5
token.fit_on_texts(list(X_train) + list(X_test))
X_train_pad = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=max_len)
X_test_pad = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=max_len)

In [29]:
w_idx = token.word_index

In [30]:
embed_dim = 160
lstm_out = 250

model = Sequential()
model.add(Embedding(len(w_idx) +1 , embed_dim,input_length = X_test_pad.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.core.Dense(13, activation='softmax'))
#adam rmsprop 
model.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 160, 160)          4835200   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 160, 160)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 250)               411000    
_________________________________________________________________
dense_1 (Dense)              (None, 13)                3263      
Total params: 5,249,463
Trainable params: 5,249,463
Non-trainable params: 0
_________________________________________________________________
None


In [31]:
batch_size = 32

In [32]:
model.fit(X_train_pad, y_train, epochs = Epoch, batch_size=batch_size,validation_data=(X_test_pad, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 31935 samples, validate on 7984 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7f8d4622db50>

In [33]:
def get_sentiment(model,text):
    text = clean_text(text)
    #tokenize
    twt = token.texts_to_sequences([text])
    twt = sequence.pad_sequences(twt, maxlen=max_len, dtype='int32')
    sentiment = model.predict(twt,batch_size=1,verbose = 2)
    sent = np.round(np.dot(sentiment,100).tolist(),0)[0]
    result = pd.DataFrame([sent_to_id.keys(),sent]).T
    result.columns = ["sentiment","percentage"]
    result=result[result.percentage !=0]
    return result

In [34]:
def plot_result(df):
    #colors=['#D50000','#000000','#008EF8','#F5B27B','#EDECEC','#D84A09','#019BBD','#FFD000','#7800A0','#098F45','#807C7C','#85DDE9','#F55E10']
    #fig = go.Figure(data=[go.Pie(labels=df.sentiment,values=df.percentage, hole=.3,textinfo='percent',hoverinfo='percent+label',marker=dict(colors=colors, line=dict(color='#000000', width=2)))])
    #fig.show()
    colors={'love':'rgb(213,0,0)','empty':'rgb(0,0,0)',
                    'sadness':'rgb(0,142,248)','enthusiasm':'rgb(245,178,123)',
                    'neutral':'rgb(237,236,236)','worry':'rgb(216,74,9)',
                    'surprise':'rgb(1,155,189)','fun':'rgb(255,208,0)',
                    'hate':'rgb(120,0,160)','happiness':'rgb(9,143,69)',
                    'boredom':'rgb(128,124,124)','relief':'rgb(133,221,233)',
                    'anger':'rgb(245,94,16)'}
    col_2={}
    for i in result.sentiment.to_list():
        col_2[i]=colors[i]
    fig = px.pie(df, values='percentage', names='sentiment',color='sentiment',color_discrete_map=col_2,hole=0.3)
    fig.show()

### Test LSTM Results

In [35]:
result =get_sentiment(model,"Had an absolutely brilliant day ðŸ˜ loved seeing an old friend and reminiscing")
plot_result(result)
result =get_sentiment(model,"The pain my heart feels is just too much for it to bear. Nothing eases this pain. I can’t hold myself back. I really miss you")
plot_result(result)
result =get_sentiment(model,"I hate this game so much,It make me angry all the time ")
plot_result(result)

### **<font color="blue">LSTM with glove 6B 200d word embedding</font>** 

In [36]:
def read_data(file_name):
    with open(file_name,'r') as f:
        word_vocab = set() 
        word2vector = {}
        for line in f:
            line_ = line.strip() 
            words_Vec = line_.split()
            word_vocab.add(words_Vec[0])
            word2vector[words_Vec[0]] = np.array(words_Vec[1:],dtype=float)
    print("Total Words in DataSet:",len(word_vocab))
    return word_vocab,word2vector

In [37]:
vocab, word_to_idx =read_data("/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.200d.txt")

Total Words in DataSet: 400000


In [38]:
embedding_matrix = np.zeros((len(w_idx) + 1, 200))
for word, i in w_idx.items():
    embedding_vector = word_to_idx.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [39]:
embed_dim = 200
lstm_out = 250

model_lstm_gwe = Sequential()
model_lstm_gwe.add(Embedding(len(w_idx) +1 , embed_dim,input_length = X_test_pad.shape[1],weights=[embedding_matrix],trainable=False))
model_lstm_gwe.add(SpatialDropout1D(0.2))
model_lstm_gwe.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model_lstm_gwe.add(keras.layers.core.Dense(13, activation='softmax'))
#adam rmsprop 
model_lstm_gwe.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model_lstm_gwe.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 160, 200)          6044000   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 160, 200)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 250)               451000    
_________________________________________________________________
dense_2 (Dense)              (None, 13)                3263      
Total params: 6,498,263
Trainable params: 454,263
Non-trainable params: 6,044,000
_________________________________________________________________
None


In [40]:
batch_size = 32

In [41]:

model_lstm_gwe.fit(X_train_pad, y_train, epochs = Epoch, batch_size=batch_size,validation_data=(X_test_pad, y_test))

Train on 31935 samples, validate on 7984 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7f88f253bb90>

### Test LSTM glove Results 

In [42]:
result =get_sentiment(model_lstm_gwe,"Had an absolutely brilliant day ðŸ˜ loved seeing an old friend and reminiscing")
plot_result(result)
result =get_sentiment(model_lstm_gwe,"The pain my heart feels is just too much for it to bear. Nothing eases this pain. I can’t hold myself back. I really miss you")
plot_result(result)
result =get_sentiment(model_lstm_gwe,"I hate this game so much,It make me angry all the time ")
plot_result(result)

### **<font color="blue">Roberta Base Model</font>**

In [43]:
def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    
    return np.array(enc_di['input_ids'])

def build_model(transformer, max_len=160):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(13, activation='softmax')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [44]:
AUTO = tf.data.experimental.AUTOTUNE
MODEL = 'roberta-base'

In [45]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=481.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898823.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




In [46]:
X_train_t = regular_encode(X_train, tokenizer, maxlen=max_len)
X_test_t = regular_encode(X_test, tokenizer, maxlen=max_len)

In [47]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_train_t, y_train))
    .repeat()
    .shuffle(1995)
    .batch(batch_size)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_test_t, y_test))
    .batch(batch_size)
    .cache()
    .prefetch(AUTO)
)


In [48]:
transformer_layer = TFAutoModel.from_pretrained(MODEL)
model_roberta_base = build_model(transformer_layer, max_len=max_len)
model_roberta_base.summary()

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=657434796.0, style=ProgressStyle(descri…


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 160)]             0         
_________________________________________________________________
tf_roberta_model (TFRobertaM ((None, 160, 768), (None, 124645632 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 768)]             0         
_________________________________________________________________
dense (Dense)                (None, 13)                9997      
Total params: 124,655,629
Trainable params: 124,655,629
Non-trainable params: 0
_________________________________________________________________


In [49]:
n_steps = X_train.shape[0] // batch_size
model_roberta_base.fit(train_dataset,steps_per_epoch=n_steps,validation_data=valid_dataset,epochs=Epoch)

Train for 997 steps, validate for 250 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f88eace4b10>

### Test Roberta Model Results

In [50]:
def get_sentiment2(model,text):
    text = clean_text(text)
    #tokenize
    x_test1 = regular_encode([text], tokenizer, maxlen=max_len)
    test1 = (tf.data.Dataset.from_tensor_slices(x_test1).batch(1))
    #test1
    sentiment = model.predict(test1,verbose = 0)
    sent = np.round(np.dot(sentiment,100).tolist(),0)[0]
    result = pd.DataFrame([sent_to_id.keys(),sent]).T
    result.columns = ["sentiment","percentage"]
    result=result[result.percentage !=0]
    return result

In [51]:
result =get_sentiment2(model_roberta_base,"Had an absolutely brilliant day ðŸ˜ loved seeing an old friend and reminiscing")
plot_result(result)
result =get_sentiment2(model_roberta_base,"The pain my heart feels is just too much for it to bear. Nothing eases this pain. I can’t hold myself back. I really miss you")
plot_result(result)
result =get_sentiment2(model_roberta_base,"I hate this game so much,It make me angry all the time ")
plot_result(result)

## End