# Sentiment Analysis

Hi! Mario here! In this project I'll perform sentiment analysis on a Kaggle dataset (https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset/data), which contains the following columns:

1) Text ID
2) Text
3) Selected text
4) Sentiment
5) Time of tweet
6) Age of User
7) Country
8) Population
9) Land Area
10) Density

It comes in 2 separate files, one for training and one for testing. Let's first import some packages and load both files.

In [1]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import re
import string
from bs4 import BeautifulSoup
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

import warnings

warnings.filterwarnings("ignore")

In [2]:
train = pd.read_csv('Sentiment_train.csv', encoding='unicode_escape')

train

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26
...,...,...,...,...,...,...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative,night,31-45,Ghana,31072940,227540.0,137
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative,morning,46-60,Greece,10423054,128900.0,81
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive,noon,60-70,Grenada,112523,340.0,331
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive,night,70-100,Guatemala,17915568,107160.0,167


In [3]:
test = pd.read_csv('Sentiment_test.csv', encoding='unicode_escape')

test

Unnamed: 0,textID,text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,morning,0-20,Afghanistan,38928346.0,652860.0,60.0
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,noon,21-30,Albania,2877797.0,27400.0,105.0
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,night,31-45,Algeria,43851044.0,2381740.0,18.0
3,01082688c6,happy bday!,positive,morning,46-60,Andorra,77265.0,470.0,164.0
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,noon,60-70,Angola,32866272.0,1246700.0,26.0
...,...,...,...,...,...,...,...,...,...
4810,,,,,,,,,
4811,,,,,,,,,
4812,,,,,,,,,
4813,,,,,,,,,


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            27481 non-null  object 
 1   text              27480 non-null  object 
 2   selected_text     27480 non-null  object 
 3   sentiment         27481 non-null  object 
 4   Time of Tweet     27481 non-null  object 
 5   Age of User       27481 non-null  object 
 6   Country           27481 non-null  object 
 7   Population -2020  27481 non-null  int64  
 8   Land Area (Km²)   27481 non-null  float64
 9   Density (P/Km²)   27481 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 2.1+ MB


In [5]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4815 entries, 0 to 4814
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            3534 non-null   object 
 1   text              3534 non-null   object 
 2   sentiment         3534 non-null   object 
 3   Time of Tweet     3534 non-null   object 
 4   Age of User       3534 non-null   object 
 5   Country           3534 non-null   object 
 6   Population -2020  3534 non-null   float64
 7   Land Area (Km²)   3534 non-null   float64
 8   Density (P/Km²)   3534 non-null   float64
dtypes: float64(3), object(6)
memory usage: 338.7+ KB


Let's see the frequency of each sentiment in the training set. The info() output also shows that the test set has null values in every column (only 3,534 of its 4,815 rows are populated), so let's count the null values in both sets.

In [6]:
train['sentiment'].value_counts()

sentiment
neutral     11118
positive     8582
negative     7781
Name: count, dtype: int64
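
To put the counts above in perspective, here is a small sketch that turns them into class shares. The numbers are copied by hand from the value_counts() output, and the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 11118, "positive": 8582, "negative": 7781}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# Ratio between the largest and smallest class; 1.0 would be perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

Neutral makes up about 40% of the rows and negative about 28%, so the split is moderately rather than perfectly even.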

In [7]:
print('Null values in the training set: \n', train.isnull().sum())
print('')
print('Null values in the test set: \n', test.isnull().sum())

Null values in the training set: 
 textID              0
text                1
selected_text       1
sentiment           0
Time of Tweet       0
Age of User         0
Country             0
Population -2020    0
Land Area (Km²)     0
Density (P/Km²)     0
dtype: int64

Null values in the test set: 
 textID              1281
text                1281
sentiment           1281
Time of Tweet       1281
Age of User         1281
Country             1281
Population -2020    1281
Land Area (Km²)     1281
Density (P/Km²)     1281
dtype: int64


The classes in the training set are reasonably balanced, with neutral being the largest. There are also many null values in the test set (1,281 rows that are empty in every column) and only one incomplete row in the training set. Since the text and sentiment columns are required for the analysis, we'll drop those rows.

In [8]:
train = train.dropna()

test = test.dropna()

print('Null values in the training set: \n', train.isnull().sum())
print('')
print('Null values in the test set: \n', test.isnull().sum())
print('')
print('Rows number in the training set: ', len(train))
print('Rows number in the test set: ', len(test))


Null values in the training set: 
 textID              0
text                0
selected_text       0
sentiment           0
Time of Tweet       0
Age of User         0
Country             0
Population -2020    0
Land Area (Km²)     0
Density (P/Km²)     0
dtype: int64

Null values in the test set: 
 textID              0
text                0
sentiment           0
Time of Tweet       0
Age of User         0
Country             0
Population -2020    0
Land Area (Km²)     0
Density (P/Km²)     0
dtype: int64

Rows number in the training set:  27480
Rows number in the test set:  3534


Now that there are no null values left, we have a training dataset of 27,480 rows and a test dataset of 3,534 rows, and we can begin the text analysis. First, we need the sentiment labels to be numeric, so I'll create a function that assigns the following values: negative = -1, neutral = 0, positive = 1. Then, I'll apply it to both DataFrames.

In [9]:
def sentiment_encoding(sentiment):
    
    if sentiment == 'negative':
        return -1
    
    elif sentiment == 'neutral':
        return 0
    
    elif sentiment == 'positive':
        return 1
    
#Apply function to the train and test sets.
train['sentiment'] = train['sentiment'].apply(lambda x: sentiment_encoding(x))
test['sentiment'] = test['sentiment'].apply(lambda x: sentiment_encoding(x))

print(train['sentiment'].head())
print('')
print(test['sentiment'].head())

0    0
1   -1
2   -1
3   -1
4   -1
Name: sentiment, dtype: int64

0    0
1    1
2   -1
3    1
4    1
Name: sentiment, dtype: int64
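
One thing to keep in mind: sentiment_encoding has no final else branch, so an unexpected label would quietly become None. A possible alternative, sketched below in plain Python, is a lookup table that fails loudly instead. The names LABEL_TO_INT and encode_strict are hypothetical and not part of the notebook.

```python
# Same encoding as sentiment_encoding, written as a lookup table.
LABEL_TO_INT = {"negative": -1, "neutral": 0, "positive": 1}

def encode_strict(label):
    """Return the numeric code for a label, raising KeyError on anything unexpected."""
    return LABEL_TO_INT[label]

sample = ["neutral", "negative", "positive"]
encoded = [encode_strict(s) for s in sample]
print(encoded)  # [0, -1, 1]

# With pandas, the equivalent would be train['sentiment'].map(LABEL_TO_INT),
# which yields NaN (rather than an error) for labels missing from the dict.
```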


We now have the sentiment labels numerically encoded. Our next step is to preprocess the text. I'll use the full text column rather than selected_text: the test set has no selected_text column, and the raw text is closer to what a model would see in reality. The first thing I'll do is convert everything to lower case.

In [10]:
train['text'] = train['text'].apply(lambda x: ' '.join(x.lower() for x in str(x).split()))

test['text'] = test['text'].apply(lambda x: ' '.join(x.lower() for x in str(x).split()))


Let's remove the HTML tags and websites.

In [11]:
# Name the parser explicitly so the result does not depend on which parsers are installed.
train['text'] = train['text'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
train['text'] = train['text'].apply(lambda x: re.sub(r"http\S+", "", x))

test['text'] = test['text'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
test['text'] = test['text'].apply(lambda x: re.sub(r"http\S+", "", x))
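
To see what the URL pattern does and does not catch, here is a short sketch using only the re module. The first two strings are lower-cased versions of tweets from the test preview above; the third is a made-up example with a link that has no scheme.

```python
import re

URL_PATTERN = re.compile(r"http\S+")

samples = [
    "last session of the day http://twitpic.com/67ezh",
    "http://twitpic.com/4w75p - i like it!!",
    "see www.example.com for details",  # hypothetical tweet, no http prefix
]

cleaned = [URL_PATTERN.sub("", s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The pattern removes everything from "http" up to the next whitespace, so links written without a scheme stay in the text. Whether that matters for this dataset is something I haven't checked.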

Let's expand some contractions. For example, from "I won't" to "I will not".

In [12]:
def expand(s):
    # Specific contractions first, then the generic suffix rules.
    # Note: these patterns match the plain apostrophe (') only.
    s = re.sub(r"won't", "will not", s)
    s = re.sub(r"wouldn't", "would not", s)
    s = re.sub(r"couldn't", "could not", s)
    s = re.sub(r"\'d", " would", s)
    s = re.sub(r"can\'t", "can not", s)
    s = re.sub(r"n\'t", " not", s)
    s = re.sub(r"\'re", " are", s)
    s = re.sub(r"\'s", " is", s)
    s = re.sub(r"\'ll", " will", s)
    s = re.sub(r"\'t", " not", s)
    s = re.sub(r"\'ve", " have", s)
    s = re.sub(r"\'m", " am", s)

    return s

train['text']  = train['text'].apply(lambda x: expand(x))  
test['text']  = test['text'].apply(lambda x: expand(x)) 
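
The data preview above shows tweets written with a backtick instead of an apostrophe (I`d, couldn`t, don`t), which apostrophe-based patterns will not match. The sketch below shows one possible way to normalize them first. normalize_apostrophes and expand_subset are hypothetical helpers, and the set of quote variants is an assumption.

```python
import re

def normalize_apostrophes(s):
    """Map backticks and curly quotes to a plain apostrophe (assumed variants)."""
    return re.sub(r"[`’‘]", "'", s)

def expand_subset(s):
    # Two of the rules from expand(), enough to show the effect.
    s = re.sub(r"n't", " not", s)
    s = re.sub(r"'d", " would", s)
    return s

raw = "i`d have responded, why couldn`t they"
print(expand_subset(raw))                         # unchanged: the backticks do not match
print(expand_subset(normalize_apostrophes(raw)))  # i would have responded, why could not they
```

Applying a step like this before expand would change the cleaned text and therefore the numbers reported later, so treat it as an idea to try rather than something this notebook does.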

Let's remove non-alphabetic characters. Note that the pattern keeps letters only, so digits are removed along with punctuation.

In [13]:
train['text'] = train['text'].apply(lambda x: " ".join([re.sub('[^A-Za-z]+','', x) for x in nltk.word_tokenize(x)])) 
test['text'] = test['text'].apply(lambda x: " ".join([re.sub('[^A-Za-z]+','', x) for x in nltk.word_tokenize(x)])) 
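
Here is a minimal sketch of what the letters-only pattern does to individual tokens. It uses str.split() in place of nltk.word_tokenize, which is a simplification so the example runs without NLTK data; keep_letters is a hypothetical name.

```python
import re

def keep_letters(text):
    # str.split() stands in for nltk.word_tokenize here (a simplifying assumption).
    return " ".join(re.sub("[^A-Za-z]+", "", tok) for tok in text.split())

print(repr(keep_letters("gr8 day 2day !!!")))   # 'gr day day '
print(repr(keep_letters("room 101 is free")))   # 'room  is free'
```

Tokens made only of digits or punctuation become empty strings, which leaves extra spaces behind. The later steps split on whitespace again, so those gaps disappear.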

Now let's remove the stop words.

In [14]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
train['text'] = train['text'].apply(lambda x: ' '.join([x for x in x.split() if x not in stop]))
test['text'] = test['text'].apply(lambda x: ' '.join([x for x in x.split() if x not in stop]))
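
A point worth knowing for sentiment work: NLTK's English stop word list includes negations such as "not" and "no", so the "not" produced by the contraction step is removed here. The sketch below uses a tiny hand-written subset of stop words (not the real list) to show the effect and one possible variation that keeps negations.

```python
# Small illustrative subset; the real list comes from stopwords.words('english').
stop_subset = {"i", "am", "is", "the", "this", "do", "not", "no"}
negations = {"not", "no"}

def remove_stop_words(text, stop_words):
    return " ".join(w for w in text.split() if w not in stop_words)

text = "i do not like this"
print(remove_stop_words(text, stop_subset))              # like
print(remove_stop_words(text, stop_subset - negations))  # not like
```

Whether keeping negations would improve the models in this notebook is untested.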


Time to lemmatize.

In [15]:

lemmatizer = WordNetLemmatizer()
train['text'] = train['text'].apply(lambda x: 
                                            ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))
test['text'] = test['text'].apply(lambda x: 
                                            ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))
train

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,responded going,"I`d have responded, if I were going",0,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,sooo sad miss san diego,Sooo SAD,-1,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,bos bullying,bullying me,-1,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,interview leave alone,leave me alone,-1,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,son put release already bought,"Sons of ****,",-1,noon,60-70,Angola,32866272,1246700.0,26
...,...,...,...,...,...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,-1,night,31-45,Ghana,31072940,227540.0,137
27477,4f4c4fc327,wondered rake client made clear net force devs...,", don`t force",-1,morning,46-60,Greece,10423054,128900.0,81
27478,f67aae2310,yay good enjoy break probably need hectic week...,Yay good for both of you.,1,noon,60-70,Grenada,112523,340.0,331
27479,ed167662a5,worth,But it was worth it ****.,1,night,70-100,Guatemala,17915568,107160.0,167


We can't classify the text until it is in numeric form. To get there, I'll use TF-IDF (term frequency-inverse document frequency), a numeric measure of how important a word is to a document within a collection. It is often preferred over plain counts (CountVectorizer) because it down-weights words that appear in most documents. First we convert the training and test texts to lists and then apply the transformation.

In [16]:
train_text = train['text'].astype(str).tolist()
test_text = test['text'].astype(str).tolist()
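
To make the TF-IDF idea concrete, here is a plain-Python sketch on a made-up three-document corpus. It follows the smoothed formula that scikit-learn documents as the TfidfVectorizer default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The corpus and all names are illustrative only.

```python
import math
from collections import Counter

docs = ["sad miss san diego", "good day", "sad day"]  # toy corpus, not from the dataset
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]   # L2-normalized row

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the second document, "good" (which appears in one document) gets a higher weight than "day" (which appears in two), which is the down-weighting of common words that the method is meant to provide.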

In [17]:
vectorizer = TfidfVectorizer()

#Create the X_train and X_test sets.
X_train = vectorizer.fit_transform(train_text).toarray()

X_test = vectorizer.transform(test_text).toarray()

In [18]:
print(X_train.shape)

(27480, 22570)
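
A quick back-of-the-envelope check of what that shape means in memory once .toarray() makes the matrix dense. The sketch assumes 8 bytes per value, since float64 is the default dtype of TfidfVectorizer.

```python
rows, cols = 27480, 22570   # X_train.shape printed above
bytes_per_value = 8         # float64

dense_bytes = rows * cols * bytes_per_value
print(f"dense X_train: {dense_bytes / 1e9:.2f} GB")  # dense X_train: 4.96 GB
```

The three estimators used below accept SciPy sparse matrices, so leaving out .toarray() should reduce memory use considerably. I haven't re-run the notebook that way, so take it as a suggestion.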


Let's create the target features.

In [19]:
y_train = train['sentiment']
y_test = test['sentiment']


Now I'll try 3 models: a Logistic Regression, a Multinomial Naive Bayes (widely used in text mining) and a Random Forest, for different values of C, alpha and maximum depth, respectively. Both Logistic Regression and Random Forest will use a balanced class weight. From there I'll decide which one performs best. First, a helper function that fits a model and collects the metrics for both sets.

In [20]:
def prediction(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred = model.predict(X_test)

    print('Confusion matrix train set: \n', confusion_matrix(y_train, y_pred_train))
    print('Confusion matrix test set: \n', confusion_matrix(y_test, y_pred))
    print(' ')
    report_train = classification_report(y_train, y_pred_train, output_dict = True)
    report_test = classification_report(y_test, y_pred, output_dict = True)
    
    labels = [-1, 0, 1]
    precision_train = [report_train['-1']['precision'], report_train['0']['precision'], report_train['1']['precision']]
    precision_test = [report_test['-1']['precision'], report_test['0']['precision'], report_test['1']['precision']]
    
    recall_train = [report_train['-1']['recall'], report_train['0']['recall'], report_train['1']['recall']]
    recall_test = [report_test['-1']['recall'], report_test['0']['recall'], report_test['1']['recall']]
    
    F1score_train = [report_train['-1']['f1-score'], report_train['0']['f1-score'], report_train['1']['f1-score']]
    F1score_test = [report_test['-1']['f1-score'], report_test['0']['f1-score'], report_test['1']['f1-score']]
    
    accuracy_train = [report_train['accuracy'], report_train['accuracy'], report_train['accuracy']]
    accuracy_test = [report_test['accuracy'], report_test['accuracy'], report_test['accuracy']]
    
    report = pd.DataFrame({'Sentiment': labels, 'Precision train': precision_train, 'Precision test': precision_test,
                          'Recall train': recall_train, 'Recall test': recall_test, 'F1 train': F1score_train,
                          'F1 test': F1score_test, 'Accuracy train': accuracy_train, 'Accuracy test': accuracy_test})
    
    return report, model

First, the Logistic Regression.

In [25]:
C = [0.01, 0.1, 1, 10]

for c in C:
    
    lr = LogisticRegression(C = c, class_weight = 'balanced')
    
    print(F'Results for C = {c}: \n')
    
    print(prediction(lr, X_train, y_train, X_test, y_test)[0])
    
    print('')

Results for C = 0.01: 

Confusion matrix train set: 
 [[4887 2375  519]
 [1591 8217 1309]
 [ 582 2518 5482]]
Confusion matrix test set: 
 [[ 631  316   54]
 [ 254 1009  167]
 [  71  319  713]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0         -1         0.692210        0.660042      0.628068     0.630370   
1          0         0.626773        0.613747      0.739138     0.705594   
2          1         0.749932        0.763383      0.638779     0.646419   

   F1 train   F1 test  Accuracy train  Accuracy test  
0  0.658581  0.644865        0.676346       0.665818  
1  0.678334  0.656474        0.676346       0.665818  
2  0.689907  0.700049        0.676346       0.665818  

Results for C = 0.1: 

Confusion matrix train set: 
 [[5386 1962  433]
 [1424 8471 1222]
 [ 437 1928 6217]]
Confusion matrix test set: 
 [[ 666  284   51]
 [ 253 1012  165]
 [  62  260  781]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0        
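
As a sanity check on the report built by prediction, the per-class metrics can be recomputed by hand from a confusion matrix. The sketch below uses the test-set matrix printed for C = 0.01, where rows are true labels and columns are predicted labels. The metrics dictionary is just an illustrative container.

```python
# Test-set confusion matrix printed for C = 0.01 (rows: true -1, 0, 1; columns: predicted).
cm = [[631, 316, 54],
      [254, 1009, 167],
      [71, 319, 713]]
labels = [-1, 0, 1]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

metrics = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(3))  # column sum
    truly_label = sum(cm[i])                              # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / truly_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{label:>2}  precision={precision:.6f}  recall={recall:.6f}  f1={f1:.6f}")

print(f"accuracy={accuracy:.6f}")
```

The values match the table printed for C = 0.01, for example a precision of 0.660042 and a recall of 0.630370 for the negative class, and an accuracy of 0.665818.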

Second, the Multinomial Naive Bayes.

In [26]:
alpha = [0.01, 0.1, 1, 8, 10]

for a in alpha:
    
    nb = MultinomialNB(alpha = a)
    
    print(F'Results for alpha = {a}: \n')

    print(prediction(nb, X_train, y_train, X_test, y_test)[0])
    
    print('')

Results for alpha = 0.01: 

Confusion matrix train set: 
 [[ 6554  1026   201]
 [  504 10021   592]
 [  185   911  7486]]
Confusion matrix test set: 
 [[555 373  73]
 [267 907 256]
 [ 92 372 639]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0         -1         0.904874        0.607221      0.842308     0.554446   
1          0         0.838016        0.549031      0.901412     0.634266   
2          1         0.904215        0.660124      0.872291     0.579329   

   F1 train   F1 test  Accuracy train  Accuracy test  
0  0.872471  0.579634        0.875582        0.59451  
1  0.868559  0.588579        0.875582        0.59451  
2  0.887966  0.617093        0.875582        0.59451  

Results for alpha = 0.1: 

Confusion matrix train set: 
 [[ 6385  1181   215]
 [  486 10045   586]
 [  183  1038  7361]]
Confusion matrix test set: 
 [[553 388  60]
 [248 945 237]
 [ 73 373 657]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0

Third, the Random Forest.

In [23]:
depth = [5, 10, 15, 20]

for d in depth:
    
    rf = RandomForestClassifier(random_state = 0, n_estimators = 50, max_depth = d, class_weight = 'balanced')
    
    print(F'Results for max_depth = {d}: \n')

    print(prediction(rf, X_train, y_train, X_test, y_test)[0])
    
    print('')

Results for max_depth = 5: 

Confusion matrix train set: 
 [[4021 3188  572]
 [1580 8101 1436]
 [ 697 3597 4288]]
Confusion matrix test set: 
 [[536 399  66]
 [264 986 180]
 [ 97 507 499]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0         -1         0.638457        0.597547      0.516772     0.535465   
1          0         0.544203        0.521142      0.728704     0.689510   
2          1         0.681067        0.669799      0.499650     0.452403   

   F1 train   F1 test  Accuracy train  Accuracy test  
0  0.571205  0.564805        0.597162       0.571873  
1  0.623082  0.593618        0.597162       0.571873  
2  0.576422  0.540043        0.597162       0.571873  

Results for max_depth = 10: 

Confusion matrix train set: 
 [[4753 2541  487]
 [1282 8426 1409]
 [ 415 2529 5638]]
Confusion matrix test set: 
 [[607 341  53]
 [248 993 189]
 [ 50 360 693]]
 
   Sentiment  Precision train  Precision test  Recall train  Recall test  \
0         -1   

Looking at the results, and allowing a maximum difference of 5 percentage points between training and test accuracy to limit overfitting, the best model on this metric is the Logistic Regression with C = 0.1. It reaches about 70% accuracy on the test set (73% on the training set), and its per-class test precision and recall, derived from the confusion matrix above, range from about 0.65 to 0.78. This could be considered an acceptable model, although a more advanced one, such as an LSTM or another kind of neural network, may do better. I have yet to learn those, and TensorFlow, so I'll update this notebook in the future. For now, this is a good starting point.
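
The selection rule described above can be sketched in a few lines. The example covers only a few of the runs shown earlier: accuracies are copied from the printed tables where available, and computed from the printed confusion matrices for C = 0.1, whose table is cut off. The names candidates, MAX_GAP and best are illustrative.

```python
def accuracy(cm):
    """Share of correct predictions: diagonal sum over total."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Confusion matrices printed for Logistic Regression with C = 0.1.
lr_c01_train = [[5386, 1962, 433], [1424, 8471, 1222], [437, 1928, 6217]]
lr_c01_test = [[666, 284, 51], [253, 1012, 165], [62, 260, 781]]

# (train accuracy, test accuracy) per run.
candidates = {
    "LR C=0.01": (0.676346, 0.665818),
    "LR C=0.1": (accuracy(lr_c01_train), accuracy(lr_c01_test)),
    "NB alpha=0.01": (0.875582, 0.59451),
    "RF max_depth=5": (0.597162, 0.571873),
}

MAX_GAP = 0.05  # at most 5 percentage points between train and test accuracy

eligible = {name: (tr, te) for name, (tr, te) in candidates.items() if tr - te <= MAX_GAP}
best = max(eligible, key=lambda name: eligible[name][1])

for name, (tr, te) in candidates.items():
    flag = "ok" if name in eligible else "gap too large"
    print(f"{name:<15} train={tr:.3f} test={te:.3f} ({flag})")
print("best:", best)
```

Among these runs the Naive Bayes model is excluded by the gap rule, and Logistic Regression with C = 0.1 has the highest test accuracy of the rest.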