# Sentiment Analysis Model Training
Sentiment analysis model training using data set from kaggle consisting of **937,854 observations** of labeled text content from twitter
<br>
Kaggle url: [https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets](https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets)
<br> <br>
**Dataset attributes**
* 0 - language used
* 1 - text
* 2 - label (positive, negative, uncertainty, litigious)

## Initialization

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
import re
import os

from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [2]:
%%capture
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cabrera/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/cabrera/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/cabrera/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/cabrera/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/cabrera/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/cabrera/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/cabrera/nltk_data...
[nltk_data]   Package averaged_perceptron_ta

Loading the data

In [3]:
df = pd.read_csv('data/dataset.csv')

df

Unnamed: 0,Text,Language,Label
0,@Charlie_Corley @Kristine1G @amyklobuchar @Sty...,en,litigious
1,#BadBunny: Como dos gotas de agua: Joven se di...,es,negative
2,https://t.co/YJNiO0p1JV Flagstar Bank disclose...,en,litigious
3,Rwanda is set to host the headquarters of Unit...,en,positive
4,OOPS. I typed her name incorrectly (today’s br...,en,litigious
...,...,...,...
937849,@Juice_Lemons in the dark. it’s so good,en,positive
937850,8.SSR &amp; Disha Salian case should be solved...,en,negative
937851,*ACCIDENT: Damage Only* - Raleigh Fire Depart...,en,negative
937852,@reblavoie So happy for her! She’s been incred...,en,positive


In [4]:
df.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 937854 entries, 0 to 937853
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Text      937854 non-null  object
 1   Language  937831 non-null  object
 2   Label     937854 non-null  object
dtypes: object(3)
memory usage: 21.5+ MB


Making the column names lowercase and underscore separated

In [5]:
column_names_replacement_map = {}
for column in df.columns:
  column_names_replacement_map[column] = column.strip().replace(' ', '_').lower()

column_names_replacement_map

{'Text': 'text', 'Language': 'language', 'Label': 'label'}

In [6]:
df = df.rename(columns=column_names_replacement_map)

df.head()

Unnamed: 0,text,language,label
0,@Charlie_Corley @Kristine1G @amyklobuchar @Sty...,en,litigious
1,#BadBunny: Como dos gotas de agua: Joven se di...,es,negative
2,https://t.co/YJNiO0p1JV Flagstar Bank disclose...,en,litigious
3,Rwanda is set to host the headquarters of Unit...,en,positive
4,OOPS. I typed her name incorrectly (today’s br...,en,litigious


Retaining only observations that are using the english language to stay within the scope

In [7]:
df = df[df['language'] == 'en']

Excluding the language field as it is no longer of need and to reduce memory usage

In [8]:
df = df[[column for column in df.columns if column != 'language']]

Removing observations that are tagged as litigious

In [9]:
df = df[df['label'] != 'litigious']

Converting the features into its proper data type

In [10]:
df['text'] = df['text'].astype('str')
df['label'] = df['label'].astype('str')

In [11]:
df.reset_index(drop=True, inplace=True)

df

Unnamed: 0,text,label
0,Rwanda is set to host the headquarters of Unit...,positive
1,It sucks for me since I'm focused on the natur...,negative
2,@ShawnTarloff @itsmieu you can also relate thi...,uncertainty
3,Social Security. Constant political crises dis...,negative
4,@FilmThePoliceLA A broken rib can puncture a l...,negative
...,...,...
691243,@Juice_Lemons in the dark. it’s so good,positive
691244,8.SSR &amp; Disha Salian case should be solved...,negative
691245,*ACCIDENT: Damage Only* - Raleigh Fire Depart...,negative
691246,@reblavoie So happy for her! She’s been incred...,positive


In [12]:
df.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691248 entries, 0 to 691247
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    691248 non-null  object
 1   label   691248 non-null  object
dtypes: object(2)
memory usage: 10.5+ MB


Checking how much of the observations' sentiment are positive, negative, and uncertain

In [13]:
df['label'].value_counts()

label
positive       248516
negative       244146
uncertainty    198586
Name: count, dtype: int64

## Execution

### Data cleaning
Making a data pipeline to:
* Denoise: removing the twitter usernames and non-alphabetical characters and stripping it of white space
* Stopwords removal: stripping out the stopwords in the content such as `[a, an, the, and, but, or]` to improve data quality
* Lemmatization: reducing words to their base form e.g. `[changing, changed, change] -> change`

Removing the username and non-alphabetical characters in the content to reduce noise and improve data quality

In [14]:
def denoiser(df: pd.DataFrame):
  def strip(text: str):
    text = re.sub(r'@\w+', '', text) 
    text = re.sub(r'[^a-zA-Z ]', '', text)
    text = re.sub(r'https\w+', '', text)
    text = re.sub(r'http\w+', '', text)
    text = text.strip()
    return text.lower()

  df['text'] = df['text'].apply(strip)
  return df

Removing stopwords for standardization

In [15]:
def stopwords_remover(df: pd.DataFrame):
  matcher = re.compile(r"|".join([fr"\b{word}\b" for word in stopwords.words("english")]))
  def remove_stopwords(text: str):
    return " ".join(matcher.sub('', text).split())

  df['text'] = df['text'].apply(remove_stopwords)
  return df

Removing observations with null text values

In [16]:
def null_content_observation_remover(df: pd.DataFrame):
  df = df[~df['text'].isnull() | df['text'].isin([''])]
  df = df.reset_index(drop=True)
  return df

Reducing words to their base or lemmatizing to enhance the effectiveness of the model

In [17]:
def lemmatizer(df: pd.DataFrame):
  wordnet_lemmatizer = WordNetLemmatizer()
  tokenizer = WordPunctTokenizer()

  wordnet_pos_tag_map = {
    "J": wordnet.ADJ,
    "N": wordnet.NOUN,
    "V": wordnet.VERB,
    "R": wordnet.ADV,
  }

  def lemmatize(text: str):
    tokens = tokenizer.tokenize(text)
    pos_tags = pos_tag(tokens)

    lemmatized_tokens = []
    for token, tag in pos_tags:
      wordnet_tag = wordnet_pos_tag_map.get(tag[0].upper())
      if wordnet_tag is None:
        lemmatized_tokens.append(token)
      else:
        lemmatized_tokens.append(wordnet_lemmatizer.lemmatize(token, wordnet_tag))

    return ' '.join(lemmatized_tokens)

  df['text'] = df['text'].apply(lemmatize)
  return df
    

Running the pipeline and exporting to csv to skip reprocessing of the dataset

In [18]:
if os.path.isfile('data/dataset_processed.csv'):
  df = pd.read_csv('data/dataset_processed.csv')
else:
  df = (
    df
    .pipe(denoiser)
    .pipe(stopwords_remover)
    .pipe(null_content_observation_remover)
    .pipe(lemmatizer))

  df.to_csv('data/dataset_processed.csv', index=False)

In [19]:
df.head()

Unnamed: 0,text,label
0,rwanda set host headquarters united nation dev...,positive
1,suck since im focus nature aspect thing enviro...,negative
2,also relate art lot people dismay start art ki...,uncertainty
3,social security constant political crisis dist...,negative
4,broken rib puncture lung lead collapse lung mu...,negative


### Modelling

Adding a target column

In [20]:
target_label_map = {
  'negative': 0,
  'positive': 1,
  'uncertainty': 2,
}

df['target'] = df['label'].apply(lambda label: target_label_map.get(label))

In [21]:
df.head()

Unnamed: 0,text,label,target
0,rwanda set host headquarters united nation dev...,positive,1
1,suck since im focus nature aspect thing enviro...,negative,0
2,also relate art lot people dismay start art ki...,uncertainty,2
3,social security constant political crisis dist...,negative,0
4,broken rib puncture lung lead collapse lung mu...,negative,0


Splitting the training and testing data

In [22]:
x = df['text'].to_list()
y = df['target'].to_list()

x_train, x_test, y_train, y_test = train_test_split(
  x, y,
  test_size=0.33,
  random_state=42,
  stratify=y, )

Getting the split details

In [23]:
print("Split size details")
print(f"X-original size: {len(x)}, X-train: {len(x_train)}, X-test: {len(x_test)}")
print(f"y-original size: {len(y)}, y-train: {len(y_train)}, y-test: {len(y_test)}")
print("Split percentage")
print(f"Train: {round(len(x_train)/len(x) * 100)}%")
print(f"Test: {round(len(x_test)/len(x) * 100)}%")

Split size details
X-original size: 691248, X-train: 463136, X-test: 228112
y-original size: 691248, y-train: 463136, y-test: 228112
Split percentage
Train: 67%
Test: 33%


Making the model pipeline

In [24]:
model_naive_bayes = Pipeline([
    ('transformer', TfidfVectorizer()),
    ('model', MultinomialNB(alpha=1)),
]).fit(x_train, y_train)

In [25]:
y_pred = model_naive_bayes.predict(x_test)

df_results = pd.DataFrame({
  'actual_value': y_test,
  'predicted_value': y_pred
})

df_results['classification'] = df_results.apply(
  lambda x:
    'true positive' if x['actual_value'] == 1 and x['predicted_value'] == 1 else
    'true negative' if x['actual_value'] == 0 and x['predicted_value'] == 0 else
    'true neutral' if x['actual_value'] == 2 and x['predicted_value'] == 2 else
    'false positive' if x['actual_value'] != 1 and x['predicted_value'] == 1 else
    'false negative' if x['actual_value'] != 0 and x['predicted_value'] == 0 else
    'false neutral' if x['actual_value'] != 2 and x['predicted_value'] == 2 else
    None
  , axis=1
)

df_results

Unnamed: 0,actual_value,predicted_value,classification
0,0,1,false positive
1,0,0,true negative
2,2,0,false negative
3,2,2,true neutral
4,0,0,true negative
...,...,...,...
228107,1,2,false neutral
228108,1,1,true positive
228109,2,2,true neutral
228110,2,2,true neutral


In [27]:
df_results['classification'].value_counts()

classification
true negative     73708
true positive     71078
true neutral      47867
false negative    19678
false positive    11888
false neutral      3893
Name: count, dtype: int64

In [28]:
confusion_matrix(
  y_true = y_test,
  y_pred = y_pred
)

array([[73708,  4858,  2002],
       [ 9041, 71078,  1891],
       [10637,  7030, 47867]])

In [36]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.91      0.85     80568
           1       0.86      0.87      0.86     82010
           2       0.92      0.73      0.82     65534

    accuracy                           0.84    228112
   macro avg       0.86      0.84      0.84    228112
weighted avg       0.85      0.84      0.84    228112



Predicting a list of responses from a dummy question given:
<br>
**How do you think about the impact of online anonymity on user behavior in social media platforms?**

In [69]:
response_list = [
  "I believe online anonymity encourages more honest and open communication, allowing users to express their true opinions",
  "In my view, online anonymity can lead to a significant increase in negative behaviors, such as trolling and cyberbullying, because users feel shielded from accountability.",
  "I think anonymity provides a double-edged sword; while it allows for free expression, it also creates an environment where people may engage in harmful or deceitful actions.",
  "Online anonymity empowers marginalized voices to speak out, but it also makes it difficult to identify and address harmful content effectively.",
  "I see online anonymity as a critical factor in fostering diverse discussions, but it also contributes to the spread of misinformation, as sources cannot always be verified.",
  "I think that online anonymity can lead to more genuine interactions in certain communities, but it may also reduce the quality of discourse by enabling users to avoid responsibility for their words.",
  "Anonymity online is essential for privacy, but it can also encourage users to engage in behavior they might avoid if their identity were known.",
  "In my opinion, the impact of online anonymity is largely context-dependent; it can promote both positive and negative behaviors depending on the platform and community norms.",
  "I believe online anonymity amplifies both the best and worst aspects of human behavior, providing a space for both creativity and cruelty.",
  "I think online anonymity allows people to connect more authentically, but it can also lead to a lack of trust and credibility in online interactions."
]

test = (pd.DataFrame({'text': response_list})
  .pipe(denoiser)
  .pipe(stopwords_remover)
  .pipe(null_content_observation_remover)
  .pipe(lemmatizer))

model_naive_bayes.predict(test['text'].to_list())


array([1, 0, 0, 1, 0, 1, 1, 1, 1, 0])

In [50]:
df[df['text'].apply(lambda x: len(x) < 50)].sort_values(by='text', ascending=False)

Unnamed: 0,text,label,target
193626,zzzzzzzz best u get historic news day,positive,1
316179,zygote isnt person seriously body problem choice,negative,0
105893,zydrate gun go somewhere anatomyhah hah,uncertainty,2
196357,zydrate gun go somewhere anatomy,uncertainty,2
401586,zviriko l try best,positive,1
...,...,...,...
581567,,negative,0
520632,,negative,0
598838,,uncertainty,2
147071,,negative,0


In [52]:
df.iloc[581567]['text']

''

In [60]:
df[df['text'].isnull() | df['text'].isin([''])]

Unnamed: 0,text,label,target
13168,,negative,0
19273,,negative,0
22721,,positive,1
25081,,negative,0
28104,,positive,1
...,...,...,...
633347,,negative,0
647641,,positive,1
651390,,positive,1
668489,,negative,0


In [58]:
df['text'].isin(['']).sum()

np.int64(109)