# Sentiment and Emotion Classification Model Training
This notebook uses a Kaggle dataset of `416,809 observations` of labeled text content from Twitter.
<br>
Kaggle dataset url: [https://www.kaggle.com/datasets/nelgiriyewithana/emotions](https://www.kaggle.com/datasets/nelgiriyewithana/emotions)

**About the dataset**
<br>
Each entry in this dataset consists of a text segment representing a Twitter message and a corresponding label indicating the predominant emotion conveyed. The emotions are classified into six categories: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5). This dataset provides a rich foundation for exploring the nuanced emotional landscape within the realm of social media.

**Dataset features**
* text: Twitter text content
* label: emotion classification
  * sadness (0)
  * joy (1)
  * love (2)
  * anger (3)
  * fear (4)
  * surprise (5)

## Initialization

In [293]:
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
import re
import os

from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2

In [294]:
%%capture
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


Loading the data

In [295]:
df = pd.read_csv('data/dataset.csv', index_col=0)

df.head()

Unnamed: 0,text,label
0,i just feel really helpless and heavy hearted,4
1,ive enjoyed being able to slouch about relax a...,0
2,i gave up my internship with the dmrg and am f...,4
3,i dont know i feel so lost,0
4,i am a kindergarten teacher and i am thoroughl...,4


Getting the label distribution

In [296]:
df['label'].value_counts().sort_index()

label
0    121187
1    141067
2     34554
3     57317
4     47712
5     14972
Name: count, dtype: int64

Renaming the `label` attribute to `emotion_label`

In [297]:
df.rename(columns={
  'label': 'emotion_label'
}, inplace=True)

df.head()

Unnamed: 0,text,emotion_label
0,i just feel really helpless and heavy hearted,4
1,ive enjoyed being able to slouch about relax a...,0
2,i gave up my internship with the dmrg and am f...,4
3,i dont know i feel so lost,0
4,i am a kindergarten teacher and i am thoroughl...,4


Adding emotion_label_description attribute

In [298]:
emotion_label_description_map = {
  0: 'sadness',
  1: 'joy',
  2: 'love',
  3: 'anger',
  4: 'fear',
  5: 'surprised',
}

In [299]:
df['emotion_label_description'] = df['emotion_label'].apply(lambda label: emotion_label_description_map.get(label))

df.head()

Unnamed: 0,text,emotion_label,emotion_label_description
0,i just feel really helpless and heavy hearted,4,fear
1,ive enjoyed being able to slouch about relax a...,0,sadness
2,i gave up my internship with the dmrg and am f...,4,fear
3,i dont know i feel so lost,0,sadness
4,i am a kindergarten teacher and i am thoroughl...,4,fear


Adding another attribute called `sentiment_label`, derived from the emotion label as follows:
* [0] negative: [sadness (0), anger (3), fear (4)]
* [1] positive: [joy (1), love (2)]
* [2] neutral: [surprised (5)]

In [300]:
df['sentiment_label'] = df['emotion_label'].apply(lambda label: 0 if label in [0, 3, 4] else 1 if label in [1, 2] else 2)

df.head()

Unnamed: 0,text,emotion_label,emotion_label_description,sentiment_label
0,i just feel really helpless and heavy hearted,4,fear,0
1,ive enjoyed being able to slouch about relax a...,0,sadness,0
2,i gave up my internship with the dmrg and am f...,4,fear,0
3,i dont know i feel so lost,0,sadness,0
4,i am a kindergarten teacher and i am thoroughl...,4,fear,0
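
Because the sentiment label is a pure function of the emotion label, the sentiment class sizes follow directly from the emotion counts printed earlier. A small sketch of that arithmetic; `emotion_to_sentiment` is a hypothetical dictionary introduced here for illustration and is meant to be equivalent to the mapping listed above:

```python
# Emotion label counts copied from the value_counts() output earlier in the notebook
emotion_counts = {0: 121187, 1: 141067, 2: 34554, 3: 57317, 4: 47712, 5: 14972}

# Hypothetical lookup table equivalent to the lambda used for sentiment_label
emotion_to_sentiment = {0: 0, 3: 0, 4: 0, 1: 1, 2: 1, 5: 2}

# Aggregate the emotion counts into sentiment counts
sentiment_counts = {0: 0, 1: 0, 2: 0}
for emotion, count in emotion_counts.items():
    sentiment_counts[emotion_to_sentiment[emotion]] += count

total = sum(sentiment_counts.values())
for sentiment, count in sentiment_counts.items():
    print(sentiment, count, f"{count / total:.1%}")
```

The neutral class ends up far smaller than the other two, which is worth keeping in mind when reading the per-class recall later on. With pandas, the same lookup could be applied as `df['emotion_label'].map(emotion_to_sentiment)`.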


Adding sentiment_label_description attribute

In [301]:
sentiment_label_description_map = {
  0: 'negative',
  1: 'positive',
  2: 'neutral'
}

In [302]:
df['sentiment_label_description'] = df['sentiment_label'].apply(lambda label: sentiment_label_description_map.get(label))

df.head()

Unnamed: 0,text,emotion_label,emotion_label_description,sentiment_label,sentiment_label_description
0,i just feel really helpless and heavy hearted,4,fear,0,negative
1,ive enjoyed being able to slouch about relax a...,0,sadness,0,negative
2,i gave up my internship with the dmrg and am f...,4,fear,0,negative
3,i dont know i feel so lost,0,sadness,0,negative
4,i am a kindergarten teacher and i am thoroughl...,4,fear,0,negative


Casting attributes to the right datatype

In [303]:
category_attribute_list = ['emotion_label_description', 'sentiment_label_description']
integer_attribute_list = ['emotion_label', 'sentiment_label']

for attribute in category_attribute_list:
  df[attribute] = df[attribute].astype('category')

for attribute in integer_attribute_list:
  df[attribute] = df[attribute].astype('int')

In [304]:
df.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
Index: 416809 entries, 0 to 416808
Data columns (total 5 columns):
 #   Column                       Non-Null Count   Dtype   
---  ------                       --------------   -----   
 0   text                         416809 non-null  object  
 1   emotion_label                416809 non-null  int64   
 2   emotion_label_description    416809 non-null  category
 3   sentiment_label              416809 non-null  int64   
 4   sentiment_label_description  416809 non-null  category
dtypes: category(2), int64(2), object(1)
memory usage: 13.5+ MB


## Execution

### Data preprocessing
Building a data pipeline to:
* Denoise: remove Twitter usernames, URLs, and non-alphabetical characters, then strip surrounding whitespace and lowercase the text
* Stopword removal: strip out stopwords such as `[a, an, the, and, but, or]`, which carry little signal for classification
* Lemmatization: reduce words to their base form, e.g. `[changing, changed, change] -> change`

Removing usernames, URLs, and non-alphabetical characters from the content to reduce noise

In [305]:
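def denoiser(df: pd.DataFrame):
  def strip(text: str):
    # Drop @usernames first, while the '@' marker is still present
    text = re.sub(r'@\w+', '', text)
    # Keep only letters and spaces; this also fuses a URL such as
    # 'https://t.co/abc' into a single token like 'httpstcoabc'
    text = re.sub(r'[^a-zA-Z ]', '', text)
    # Remove those fused URL tokens ('http\w+' also covers 'https...')
    text = re.sub(r'http\w+', '', text)
    text = text.strip()
    return text.lower()

  df['text'] = df['text'].apply(strip)
  return df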
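
To see what the regular expressions in `strip` do, here is a standalone sketch that applies equivalent substitutions, in the same order, to a made-up tweet. The sample text and the name `strip_text` are illustrative and do not come from the dataset:

```python
import re

def strip_text(text: str) -> str:
    text = re.sub(r'@\w+', '', text)        # remove @usernames
    text = re.sub(r'[^a-zA-Z ]', '', text)  # keep letters and spaces only
    text = re.sub(r'http\w+', '', text)     # remove fused URL tokens (also matches 'https...')
    return text.strip().lower()

sample = "@someone Loving this!! https://t.co/abc123 #blessed"
cleaned = strip_text(sample)
print(cleaned.split())
```

Note that the URL is only caught because the second substitution has already fused it into one alphabetic token. Any runs of spaces left behind are collapsed later, by the stopword removal step.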

Removing stopwords for standardization

In [306]:
def stopwords_remover(df: pd.DataFrame):
  matcher = re.compile(r"|".join([fr"\b{word}\b" for word in stopwords.words("english")]))
  def remove_stopwords(text: str):
    return " ".join(matcher.sub('', text).split())

  df['text'] = df['text'].apply(remove_stopwords)
  return df
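
The word-boundary anchors (`\b`) in the pattern are what keep stopwords from being removed when they occur inside longer words. A minimal sketch with a hypothetical five-word stopword list (the notebook itself uses `stopwords.words("english")`); `re.escape` is added here as a precaution and is an assumption, not part of the original function:

```python
import re

# Hypothetical miniature stopword list, for illustration only
stopword_list = ["i", "so", "on", "the", "and"]
matcher = re.compile("|".join(rf"\b{re.escape(word)}\b" for word in stopword_list))

def remove_stopwords(text: str) -> str:
    # Blank out whole-word matches, then collapse the leftover whitespace
    return " ".join(matcher.sub("", text).split())

result = remove_stopwords("i feel so lost on the island")
print(result)
```

The word `island` survives even though it contains both `i` and `and`, because neither occurrence sits between two word boundaries.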

Removing observations whose text is null or empty (for example, texts that consisted only of stopwords)

In [307]:
def null_content_observation_remover(df: pd.DataFrame):
  df = df[~df['text'].isnull()]
  df = df[~df['text'].isin([''])]
  df = df.reset_index(drop=True)
  return df

Removing observations whose text is 20 characters or fewer, since very short texts are likely to carry too little signal for the model

In [308]:
def n_characters_content_remover(df: pd.DataFrame, len_limit: int):
  df = df[df['text'].apply(lambda text: len(text) > len_limit)]
  df = df.reset_index(drop=True)
  return df

Lemmatizing, i.e. reducing words to their base form with the help of part-of-speech tags, so that inflected variants map to the same feature

In [309]:
def lemmatizer(df: pd.DataFrame):
  wordnet_lemmatizer = WordNetLemmatizer()
  tokenizer = WordPunctTokenizer()

  wordnet_pos_tag_map = {
    "J": wordnet.ADJ,
    "N": wordnet.NOUN,
    "V": wordnet.VERB,
    "R": wordnet.ADV,
  }

  def lemmatize(text: str):
    tokens = tokenizer.tokenize(text)
    pos_tags = pos_tag(tokens)

    lemmatized_tokens = []
    for token, tag in pos_tags:
      wordnet_tag = wordnet_pos_tag_map.get(tag[0].upper())
      if wordnet_tag is None:
        lemmatized_tokens.append(token)
      else:
        lemmatized_tokens.append(wordnet_lemmatizer.lemmatize(token, wordnet_tag))

    return ' '.join(lemmatized_tokens)

  df['text'] = df['text'].apply(lemmatize)
  return df
    

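Running the pipeline and exporting the result to CSV so that later runs can skip reprocessing the dataset. Note that the `category` dtypes set earlier are not preserved when the data is reloaded from CSV.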

In [310]:
if os.path.isfile('data/dataset_processed.csv'):
  df = pd.read_csv('data/dataset_processed.csv')
else:
  df = (
    df
    .pipe(denoiser)
    .pipe(stopwords_remover)
    .pipe(null_content_observation_remover)
    .pipe(n_characters_content_remover, 20)
    .pipe(lemmatizer))

  df.to_csv('data/dataset_processed.csv', index=False)

In [311]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 389684 entries, 0 to 389683
Data columns (total 5 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   text                         389684 non-null  object
 1   emotion_label                389684 non-null  int64 
 2   emotion_label_description    389684 non-null  object
 3   sentiment_label              389684 non-null  int64 
 4   sentiment_label_description  389684 non-null  object
dtypes: int64(2), object(3)
memory usage: 14.9+ MB


In [312]:
df.head()

Unnamed: 0,text,emotion_label,emotion_label_description,sentiment_label,sentiment_label_description
0,feel really helpless heavy hearted,4,fear,0,negative
1,ive enjoy able slouch relax unwind frankly nee...,0,sadness,0,negative
2,give internship dmrg feeling distraught,4,fear,0,negative
3,kindergarten teacher thoroughly weary job take...,4,fear,0,negative
4,begin feel quite disheartened,0,sadness,0,negative


### Modeling

Initializing the primary variables

In [313]:
x = np.array(df['text'])
y_sentiment = np.array(df['sentiment_label'])
y_emotion = np.array(df['emotion_label'])

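Fitting a TF-IDF vectorizer on unigrams and bigrams. Note that it is fitted on the full corpus, before the train/test split.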

In [314]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(x)
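
As a rough illustration of what the vectorizer computes, the sketch below derives TF-IDF weights by hand for a toy corpus. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and, unlike the notebook, uses unigrams only; the corpus and the helper names `idf` and `tfidf` are made up for illustration:

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed texts (illustrative only)
docs = ["feel really helpless", "feel so lost", "feel joy"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term: str) -> float:
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc: list) -> dict:
    # Term count times IDF, followed by L2 normalization of the document vector
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in weights.items():
    print(term, round(weight, 3))
```

The term `feel` appears in every document, so it receives a lower weight than `really` or `helpless`, which each appear in only one.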

Chi-squared feature selectors that keep the 4,500 highest-scoring features for each task

In [315]:
selector_sentiment = SelectKBest(chi2, k=4500)
selector_emotion = SelectKBest(chi2, k=4500)
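
`SelectKBest(chi2, k=4500)` scores each TF-IDF column by how unevenly its total weight is distributed across the classes and keeps the top `k` columns. Below is a minimal NumPy sketch of that statistic on a toy matrix. The function name `chi2_scores` and the data are made up for illustration, and the computation is intended to mirror `sklearn.feature_selection.chi2` rather than reproduce it exactly:

```python
import numpy as np

def chi2_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Observed total weight of each feature within each class
    observed = Y.T @ X
    # Expected weight if each feature were spread according to class frequency
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy feature matrix: column 0 favors class 0, column 1 favors class 1,
# column 2 is spread evenly and therefore uninformative
X = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.0, 0.8, 0.5],
])
y = np.array([0, 0, 1, 1])

scores = chi2_scores(X, y)
top_k = np.argsort(scores)[::-1][:2]  # indices of the two highest-scoring columns
print(scores, top_k)
```

The evenly spread column scores zero and is dropped, while the two class-specific columns are kept.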

Defining a class to streamline the transformers

In [316]:
class Transformer:
  def __init__(self, x):
    self.x = x

  def vectorizer(self):
    self.x = vectorizer.transform(self.x)
    return self

  def selector(self, selector):
    self.x = selector.transform(self.x)
    return self

  def get_value(self):
    return self.x

Splitting the training and testing data

In [317]:
x_sentiment_train, x_sentiment_test, y_sentiment_train, y_sentiment_test = train_test_split(
  x, y_sentiment,
  test_size=0.25,
  random_state=42,
  stratify=y_sentiment, )

selector_sentiment.fit(
  vectorizer.transform(x_sentiment_train),
  y_sentiment_train, )

x_sentiment_train_selected = (
  Transformer(x_sentiment_train)
  .vectorizer()
  .selector(selector_sentiment)
  .get_value())

x_sentiment_test_selected = (
  Transformer(x_sentiment_test)
  .vectorizer()
  .selector(selector_sentiment)
  .get_value())

In [318]:
x_emotion_train, x_emotion_test, y_emotion_train, y_emotion_test = train_test_split(
  x, y_emotion,
  test_size=0.25,
  random_state=42,
  stratify=y_emotion, )

selector_emotion.fit(
  vectorizer.transform(x_emotion_train),
  y_emotion_train, )

x_emotion_train_selected = (
  Transformer(x_emotion_train)
  .vectorizer()
  .selector(selector_emotion)
  .get_value())

x_emotion_test_selected = (
  Transformer(x_emotion_test)
  .vectorizer()
  .selector(selector_emotion)
  .get_value())
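
Since the vectorizer is fitted on all of `x` before splitting, its vocabulary and IDF weights are informed by the test texts as well. One possible alternative, sketched here on a made-up toy corpus, is to chain the steps in a scikit-learn `Pipeline` (already imported at the top) so that every step is fitted on the training texts only. The step names, the toy data, and `k=10` are assumptions for illustration, not the notebook's configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only): 0 = negative, 1 = positive
train_texts = [
    "feel really helpless heavy hearted",
    "feel so lost and alone",
    "feel distraught and weary",
    "feel happy and joyful today",
    "feel love and joy",
    "feel delighted and cheerful",
]
train_labels = [0, 0, 0, 1, 1, 1]

# Vectorizer, selector, and classifier are all fitted inside fit(),
# so none of them ever sees the held-out texts
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=10)),
    ("nb", MultinomialNB(alpha=0.02)),
])
pipeline.fit(train_texts, train_labels)

test_texts = ["feel helpless and lost", "feel ecstatic and happy"]
predictions = pipeline.predict(test_texts)
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
print(predictions, "ecstatic" in vocabulary)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer and selector are refitted within each cross-validation fold.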


#### Training the sentiment classifier

In [319]:
model_sentiment = MultinomialNB()

Tuning the smoothing parameter `alpha` with grid search and 5-fold cross-validation, selecting the value with the best mean accuracy

In [320]:
gridsearch_model_sentiment = GridSearchCV(
  estimator=model_sentiment,
  param_grid={'alpha': [0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 1.1, 1.2]},
  cv=5,
  scoring='accuracy', )

gridsearch_model_sentiment.fit(
  x_sentiment_train_selected, 
  y_sentiment_train, )

gridsearch_model_sentiment.best_params_

{'alpha': 0.02}

In [321]:
gridsearch_model_sentiment.best_score_

np.float64(0.9412173222535307)

In [322]:
model_sentiment = gridsearch_model_sentiment.best_estimator_

In [323]:
df_sentiment_results = pd.DataFrame({
  'actual_value': y_sentiment_test,
  'predicted_value': model_sentiment.predict(x_sentiment_test_selected)
})

df_sentiment_results['classification'] = df_sentiment_results.apply(
  lambda x:
    'true positive' if x['actual_value'] == 1 and x['predicted_value'] == 1 else
    'true negative' if x['actual_value'] == 0 and x['predicted_value'] == 0 else
    'true neutral' if x['actual_value'] == 2 and x['predicted_value'] == 2 else
    'false positive' if x['actual_value'] != 1 and x['predicted_value'] == 1 else
    'false negative' if x['actual_value'] != 0 and x['predicted_value'] == 0 else
    'false neutral' if x['actual_value'] != 2 and x['predicted_value'] == 2 else
    None
  , axis=1
)

df_sentiment_results.head()

Unnamed: 0,actual_value,predicted_value,classification
0,0,0,true negative
1,0,0,true negative
2,0,0,true negative
3,0,0,true negative
4,1,1,true positive


In [324]:
df_sentiment_results['classification'].value_counts()

classification
true negative     52004
true positive     38211
false negative     4831
true neutral       1136
false positive     1071
false neutral       168
Name: count, dtype: int64

In [325]:
print(classification_report(df_sentiment_results['actual_value'], df_sentiment_results['predicted_value']))

              precision    recall  f1-score   support

           0       0.91      0.99      0.95     52556
           1       0.97      0.92      0.95     41329
           2       0.87      0.32      0.47      3536

    accuracy                           0.94     97421
   macro avg       0.92      0.75      0.79     97421
weighted avg       0.94      0.94      0.93     97421
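
The figures in the report can be reproduced from the `classification` counts printed above. As a worked example, the sketch below recomputes precision, recall, and F1 for the neutral class (label 2), plus the overall accuracy, using only numbers that appear in those outputs:

```python
# Counts copied from the value_counts() and classification_report outputs
true_neutral = 1136      # predicted neutral, actually neutral
false_neutral = 168      # predicted neutral, actually another class
support_neutral = 3536   # test observations whose actual label is neutral

precision = true_neutral / (true_neutral + false_neutral)
recall = true_neutral / support_neutral
f1 = 2 * precision * recall / (precision + recall)

# Correct predictions: true negative + true positive + true neutral
correct = 52004 + 38211 + 1136
accuracy = correct / 97421

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

About two thirds of the neutral texts are assigned to another class, which the overall accuracy hides because the neutral class is small.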



In [326]:
response_list = [
  "I believe online anonymity encourages more honest and open communication, allowing users to express their true opinions",
  "In my view, online anonymity can lead to a significant increase in negative behaviors, such as trolling and cyberbullying, because users feel shielded from accountability.",
  "I think anonymity provides a double-edged sword; while it allows for free expression, it also creates an environment where people may engage in harmful or deceitful actions.",
  "Online anonymity empowers marginalized voices to speak out, but it also makes it difficult to identify and address harmful content effectively.",
  "I see online anonymity as a critical factor in fostering diverse discussions, but it also contributes to the spread of misinformation, as sources cannot always be verified.",
  "I think that online anonymity can lead to more genuine interactions in certain communities, but it may also reduce the quality of discourse by enabling users to avoid responsibility for their words.",
  "Anonymity online is essential for privacy, but it can also encourage users to engage in behavior they might avoid if their identity were known.",
  "In my opinion, the impact of online anonymity is largely context-dependent; it can promote both positive and negative behaviors depending on the platform and community norms.",
  "I believe online anonymity amplifies both the best and worst aspects of human behavior, providing a space for both creativity and cruelty.",
  "I think online anonymity allows people to connect more authentically, but it can also lead to a lack of trust and credibility in online interactions."
]

test = (pd.DataFrame({'text': response_list})
  .pipe(denoiser)
  .pipe(stopwords_remover)
  .pipe(null_content_observation_remover)
  .pipe(lemmatizer))

model_sentiment.predict(
  Transformer(test['text'])
  .vectorizer()
  .selector(selector_sentiment)
  .get_value())


array([1, 1, 1, 1, 0, 1, 1, 1, 0, 0])
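
The integer predictions are easier to read once decoded with `sentiment_label_description_map`. A small sketch, with the predictions and the map copied from the cells above:

```python
from collections import Counter

# Predictions copied from the output above
predictions = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

# Same mapping as sentiment_label_description_map in the notebook
sentiment_label_description_map = {0: 'negative', 1: 'positive', 2: 'neutral'}

decoded = [sentiment_label_description_map[label] for label in predictions]
summary = Counter(decoded)
print(decoded)
print(summary)
```

None of the ten responses is predicted as neutral, which is not surprising given that neutral here stands for the surprise emotion rather than for a balanced opinion.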

#### Training the emotion classifier

In [327]:
model_emotion = MultinomialNB()

Tuning the smoothing parameter `alpha` with grid search and 5-fold cross-validation, selecting the value with the best mean accuracy

In [328]:
gridsearch_model_emotion = GridSearchCV(
  estimator=model_emotion,
  param_grid={'alpha': [0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 1.1, 1.2]},
  cv=5,
  scoring='accuracy', )

gridsearch_model_emotion.fit(
  x_emotion_train_selected, 
  y_emotion_train, )

gridsearch_model_emotion.best_params_

{'alpha': 0.02}

In [329]:
gridsearch_model_emotion.best_score_

np.float64(0.8467613146065893)

In [330]:
model_emotion = gridsearch_model_emotion.best_estimator_

In [331]:
df_emotion_results = pd.DataFrame({
  'actual_value': y_emotion_test,
  'predicted_value': model_emotion.predict(x_emotion_test_selected)
})

df_emotion_results['classification'] = df_emotion_results.apply(
  lambda x:
    'true sadness' if x['actual_value'] == 0 and x['predicted_value'] == 0 else
    'true joy' if x['actual_value'] == 1 and x['predicted_value'] == 1 else
    'true love' if x['actual_value'] == 2 and x['predicted_value'] == 2 else
    'true anger' if x['actual_value'] == 3 and x['predicted_value'] == 3 else
    'true fear' if x['actual_value'] == 4 and x['predicted_value'] == 4 else
    'true surprised' if x['actual_value'] == 5 and x['predicted_value'] == 5 else
    'false sadness' if x['actual_value'] != 0 and x['predicted_value'] == 0 else
    'false joy' if x['actual_value'] != 1 and x['predicted_value'] == 1 else
    'false love' if x['actual_value'] != 2 and x['predicted_value'] == 2 else
    'false anger' if x['actual_value'] != 3 and x['predicted_value'] == 3 else
    'false fear' if x['actual_value'] != 4 and x['predicted_value'] == 4 else
    'false surprised' if x['actual_value'] != 5 and x['predicted_value'] == 5 else
    None
  , axis=1
)

df_emotion_results.head()


In [332]:
df_emotion_results['classification'].value_counts()

classification
true joy           32607
true sadness       26961
false joy          10000
true anger          9788
true fear           8035
false sadness       3719
true love           3300
true surprised      1337
false fear           910
false anger          530
false surprised      127
false love           107
Name: count, dtype: int64

In [333]:
print(classification_report(df_emotion_results['actual_value'], df_emotion_results['predicted_value']))

              precision    recall  f1-score   support

           0       0.88      0.96      0.92     28023
           1       0.77      0.99      0.86     33086
           2       0.97      0.40      0.57      8243
           3       0.95      0.74      0.83     13311
           4       0.90      0.72      0.80     11222
           5       0.91      0.38      0.53      3536

    accuracy                           0.84     97421
   macro avg       0.90      0.70      0.75     97421
weighted avg       0.86      0.84      0.83     97421
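
The `classification` counts above also show where the emotion model's errors go. The sketch below, using only numbers from those outputs, computes the share of all misclassifications that are wrongly predicted as joy, along with the precision for joy:

```python
# Counts copied from the value_counts() output above
true_counts = {'joy': 32607, 'sadness': 26961, 'anger': 9788,
               'fear': 8035, 'love': 3300, 'surprised': 1337}
false_counts = {'joy': 10000, 'sadness': 3719, 'fear': 910,
                'anger': 530, 'surprised': 127, 'love': 107}

total_errors = sum(false_counts.values())
total = total_errors + sum(true_counts.values())

# Share of all errors that are texts wrongly predicted as joy
joy_error_share = false_counts['joy'] / total_errors
# Precision for joy: correct joy predictions over all joy predictions
joy_precision = true_counts['joy'] / (true_counts['joy'] + false_counts['joy'])

print(total, round(joy_error_share, 2), round(joy_precision, 2))
```

Roughly 65% of all misclassifications are texts wrongly labeled as joy, which is consistent with joy's high recall (0.99) and comparatively low precision (0.77). This suggests the model tends to fall back on the largest class when the evidence is weak.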



In [334]:
response_list = [
  "I believe online anonymity encourages more honest and open communication, allowing users to express their true opinions",
  "In my view, online anonymity can lead to a significant increase in negative behaviors, such as trolling and cyberbullying, because users feel shielded from accountability.",
  "I think anonymity provides a double-edged sword; while it allows for free expression, it also creates an environment where people may engage in harmful or deceitful actions.",
  "Online anonymity empowers marginalized voices to speak out, but it also makes it difficult to identify and address harmful content effectively.",
  "I see online anonymity as a critical factor in fostering diverse discussions, but it also contributes to the spread of misinformation, as sources cannot always be verified.",
  "I think that online anonymity can lead to more genuine interactions in certain communities, but it may also reduce the quality of discourse by enabling users to avoid responsibility for their words.",
  "Anonymity online is essential for privacy, but it can also encourage users to engage in behavior they might avoid if their identity were known.",
  "In my opinion, the impact of online anonymity is largely context-dependent; it can promote both positive and negative behaviors depending on the platform and community norms.",
  "I believe online anonymity amplifies both the best and worst aspects of human behavior, providing a space for both creativity and cruelty.",
  "I think online anonymity allows people to connect more authentically, but it can also lead to a lack of trust and credibility in online interactions."
]

test = (pd.DataFrame({'text': response_list})
  .pipe(denoiser)
  .pipe(stopwords_remover)
  .pipe(null_content_observation_remover)
  .pipe(lemmatizer))

model_emotion.predict(
  Transformer(test['text'])
  .vectorizer()
  .selector(selector_emotion)
  .get_value())


array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])