# Twitter Sentiment Analysis Using ML

## What is Sentiment Analysis?
Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative, or neutral. It is also known as **opinion mining**, which involves deriving the opinion or attitude of a speaker.

## Why Sentiment Analysis?

### 1. **Business**
In the marketing field, companies use sentiment analysis to develop their strategies. It helps them understand customers’ feelings towards products or brands, how people respond to campaigns or product launches, and why consumers don’t buy certain products.

### 2. **Politics**
In the political field, sentiment analysis is used to track political views, detect inconsistencies between statements and actions at the government level, and predict election results.

### 3. **Public Actions**
Sentiment analysis is also used to monitor and analyze social phenomena, for example to spot potentially dangerous situations and to gauge the general mood of the blogosphere.


In [5]:
# import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import joblib

In [6]:
# download stopwords from nltk library
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mulombi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# print stopwords (common English words that carry little meaning)
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [None]:
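# download dataset from https://www.kaggle.com/datasets/kazanova/sentiment140
# load data from file (the CSV has no header row, so pandas treats the first
# tweet as the column names; the columns are renamed further below)
dataset = pd.read_csv('twitter-comments.csv', encoding='iso-8859-1')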
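
Because the file has no header row, the first tweet ends up as the column names in the outputs that follow. A minimal sketch of an alternative, assuming the six-column layout that the rename step in this notebook uses (`target`, `id`, `date`, `flag`, `user`, `text`). Here `io.StringIO` stands in for the real file, and the two sample rows are adapted from the outputs shown in this notebook, with the first tweet shortened:

```python
import io
import pandas as pd

# Column names taken from the rename step in this notebook.
COLUMNS = ['target', 'id', 'date', 'flag', 'user', 'text']

# Two rows standing in for the real CSV, which has no header line.
sample_csv = io.StringIO(
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","Awww, that\'s a bummer."\n'
    '"4","1881672289","Fri May 22 05:16:44 PDT 2009","NO_QUERY","viry_trivium","Happy birthday, sister!"\n'
)

# header=None keeps the first row as data; names= labels the columns up front.
df = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(df.shape)
print(list(df.columns))
```

With this approach no row is lost to the header and the later rename becomes unnecessary.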

In [9]:
# take a random sample of 500,000 rows
twitter_data = dataset.sample(n=500000, random_state=0, ignore_index=True)

In [10]:
# print first 5 rows of data
print(twitter_data.head())

   0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY _TheSpecialOne_  \
0  4  1881672289  Fri May 22 05:16:44 PDT 2009  NO_QUERY    viry_trivium   
1  4  2009051656  Tue Jun 02 15:04:22 PDT 2009  NO_QUERY      Earlthedog   
2  0  2211886069  Wed Jun 17 13:24:27 PDT 2009  NO_QUERY     StefyyMarie   
3  4  1558734942  Sun Apr 19 09:15:07 PDT 2009  NO_QUERY        tezzer57   
4  4  1834470136  Mon May 18 03:03:30 PDT 2009  NO_QUERY   dave_sherratt   

  @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  
0                           Happy birthday, sister!                                                                    
1  Just finished eating supper and now I am attac...                                                                   
2                            i hate love right now.                                                                    
3  Photo fest in LDN, Tudor feast last night, don...           

In [11]:
# get shape of data
print(twitter_data.shape)

(500000, 6)


In [12]:
# get columns of data
print(twitter_data.columns)

Index(['0', '1467810369', 'Mon Apr 06 22:19:45 PDT 2009', 'NO_QUERY',
       '_TheSpecialOne_',
       '@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D'],
      dtype='object')


In [13]:
# get info of data
print(twitter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 6 columns):
 #   Column                                                                                                               Non-Null Count   Dtype 
---  ------                                                                                                               --------------   ----- 
 0   0                                                                                                                    500000 non-null  int64 
 1   1467810369                                                                                                           500000 non-null  int64 
 2   Mon Apr 06 22:19:45 PDT 2009                                                                                         500000 non-null  object
 3   NO_QUERY                                                                                                             500000 non-null  object
 4   _TheSpeci

In [14]:
# rename columns
twitter_data.rename(columns={
    '0': 'target', 
    '1467810369': 'id', 
    'Mon Apr 06 22:19:45 PDT 2009': 'date', 
    'NO_QUERY': 'flag', 
    '_TheSpecialOne_': 'user', 
    "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D": 'text'
}, inplace=True)

In [15]:
twitter_data.columns

Index(['target', 'id', 'date', 'flag', 'user', 'text'], dtype='object')

In [16]:
twitter_data.head()

|   | target | id | date | flag | user | text |
|---|--------|----|------|------|------|------|
| 0 | 4 | 1881672289 | Fri May 22 05:16:44 PDT 2009 | NO_QUERY | viry_trivium | Happy birthday, sister! |
| 1 | 4 | 2009051656 | Tue Jun 02 15:04:22 PDT 2009 | NO_QUERY | Earlthedog | Just finished eating supper and now I am attac... |
| 2 | 0 | 2211886069 | Wed Jun 17 13:24:27 PDT 2009 | NO_QUERY | StefyyMarie | i hate love right now. |
| 3 | 4 | 1558734942 | Sun Apr 19 09:15:07 PDT 2009 | NO_QUERY | tezzer57 | Photo fest in LDN, Tudor feast last night, don... |
| 4 | 4 | 1834470136 | Mon May 18 03:03:30 PDT 2009 | NO_QUERY | dave_sherratt | @piercedbrat happy bday for tomoz, all the bes... |


In [17]:
# count rows per target value (0 = negative, 4 = positive)
twitter_data['target'].value_counts()

target
4    250266
0    249734
Name: count, dtype: int64

In [18]:
# count occurrences of each id (some ids appear twice)
twitter_data['id'].value_counts()

id
2053108314    2
1792325742    2
2190104868    2
1994953686    2
2058144895    2
             ..
1971890900    1
2191548217    1
1968427436    1
1966691090    1
2053368869    1
Name: count, Length: 499835, dtype: int64

In [19]:
# check for missing values
twitter_data.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [20]:
# check for fully duplicated rows
twitter_data.duplicated().sum()

0
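
`duplicated()` with no arguments compares whole rows, so it can report 0 even though the `value_counts()` output for `id` shows some ids twice (500000 rows against 499835 distinct ids). A small sketch on made-up rows, showing how `subset='id'` with `keep=False` would surface such rows. The toy DataFrame is an assumption for illustration, not data from the file:

```python
import pandas as pd

# Hypothetical rows: the first two share an id but differ in target.
toy = pd.DataFrame({
    'target': [0, 1, 1],
    'id': [101, 101, 102],
    'text': ['same tweet', 'same tweet', 'other tweet'],
})

# Whole-row comparison: no row is an exact copy of another.
full_row_dupes = int(toy.duplicated().sum())

# Comparison on id only; keep=False flags every row that shares an id.
id_dupes = int(toy.duplicated(subset='id', keep=False).sum())

print(full_row_dupes, id_dupes)
```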

In [21]:
# convert 4 in target column to 1
twitter_data['target'] = twitter_data['target'].replace(4, 1)

In [22]:
# count rows per class (0 = negative, 1 = positive)
twitter_data['target'].value_counts()

target
1    250266
0    249734
Name: count, dtype: int64

In [23]:
# stemming (reducing a word to its root form)
post_stem = PorterStemmer()

In [24]:
# cache the stopword list as a set so it is built once and lookups are fast
stop_words = set(stopwords.words('english'))

# create function to clean text
def clean_text(text):
    # remove @mentions (underscores are not matched, so fragments like '_sg' can remain)
    text = re.sub(r'@[A-Za-z0-9]+', ' ', text)
    text = re.sub(r'#', ' ', text) # remove '#' symbol
    text = re.sub(r'RT\s+', ' ', text) # remove the retweet marker "RT"
    text = re.sub(r'https?://\S+', ' ', text) # remove links
    text = re.sub(r'[^\w\s]', ' ', text) # remove punctuation
    text = re.sub(r'\d+', ' ', text) # remove digits
    text = text.lower() # convert to lowercase
    words = text.split() # split text into words
    words = [post_stem.stem(word) for word in words if word not in stop_words] # drop stopwords, then stem
    return ' '.join(words) # join words with single spaces
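
The `@[A-Za-z0-9]+` pattern stops at an underscore, which is presumably where fragments such as `_sg` in the stemmed output come from. A minimal sketch comparing it with `@\w+`, which also consumes underscores. The handle `@user_sg` is a made-up example:

```python
import re

tweet = "@user_sg weird email"  # hypothetical handle containing an underscore

narrow = re.sub(r'@[A-Za-z0-9]+', ' ', tweet)  # pattern used in clean_text
wide = re.sub(r'@\w+', ' ', tweet)             # \w also matches '_'

print(narrow.split())
print(wide.split())
```

The first pattern leaves `_sg` behind as a token, while the second removes the whole handle.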

In [25]:
# stemmed content column
twitter_data['stemmed_text'] = twitter_data['text'].apply(clean_text)

In [26]:
# print first 5 rows of data
print(twitter_data.head())

   target          id                          date      flag           user  \
0       1  1881672289  Fri May 22 05:16:44 PDT 2009  NO_QUERY   viry_trivium   
1       1  2009051656  Tue Jun 02 15:04:22 PDT 2009  NO_QUERY     Earlthedog   
2       0  2211886069  Wed Jun 17 13:24:27 PDT 2009  NO_QUERY    StefyyMarie   
3       1  1558734942  Sun Apr 19 09:15:07 PDT 2009  NO_QUERY       tezzer57   
4       1  1834470136  Mon May 18 03:03:30 PDT 2009  NO_QUERY  dave_sherratt   

                                                text  \
0                           Happy birthday, sister!    
1  Just finished eating supper and now I am attac...   
2                            i hate love right now.    
3  Photo fest in LDN, Tudor feast last night, don...   
4  @piercedbrat happy bday for tomoz, all the bes...   

                                      stemmed_text  
0                            happi birthday sister  
1                   finish eat supper attack daddi  
2                      

In [27]:
# print stemmed texts
print(twitter_data['stemmed_text'])

0                                     happi birthday sister
1                            finish eat supper attack daddi
2                                           hate love right
3           photo fest ldn tudor feast last night think abl
4                      happi bday tomoz best peopl born may
                                ...                        
499995                            come came could come come
499996                                        stomach cramp
499997                                    whoop sound happi
499998                       _sg weird scari email facebook
499999    prospect exam bgt figur near work magic im suf...
Name: stemmed_text, Length: 500000, dtype: object


In [28]:
# split the dataset into features (X) and target (y)
X = twitter_data['stemmed_text'] # Independent variable
y = twitter_data['target'] # 0. Negative, 1. Positive (dependent variable)

In [29]:
print(X)

0                                     happi birthday sister
1                            finish eat supper attack daddi
2                                           hate love right
3           photo fest ldn tudor feast last night think abl
4                      happi bday tomoz best peopl born may
                                ...                        
499995                            come came could come come
499996                                        stomach cramp
499997                                    whoop sound happi
499998                       _sg weird scari email facebook
499999    prospect exam bgt figur near work magic im suf...
Name: stemmed_text, Length: 500000, dtype: object


In [30]:
print(y)

0         1
1         1
2         0
3         1
4         1
         ..
499995    0
499996    0
499997    1
499998    1
499999    0
Name: target, Length: 500000, dtype: int64


In [31]:
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
print(X.shape, X_train.shape, X_test.shape)

(500000,) (400000,) (100000,)


In [34]:
# convert textual data to numerical data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
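
To make the values in the printed matrices less opaque, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`, raw term counts): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The three toy documents are made up, and the code illustrates the formula rather than scikit-learn's internal implementation:

```python
import math
from collections import Counter

docs = ["happi birthday sister", "hate love right", "happi love"]  # made-up toy corpus
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return L2-normalised tf-idf weights for one tokenised document."""
    counts = Counter(tokens)
    # raw term count multiplied by the smoothed inverse document frequency
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_row(tokenized[0])
print(first)
```

In the first document, `happi` occurs in two of the three documents and therefore receives a smaller weight than `birthday` and `sister`, which occur in only one.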

In [35]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2696201 stored elements and shape (400000, 102625)>
  Coords	Values
  (0, 78245)	0.3433135973344137
  (0, 67879)	0.5491859389656334
  (0, 11762)	0.3996104700628026
  (0, 14350)	0.2656436007677318
  (0, 9611)	0.24478563455266164
  (0, 56542)	0.19873654795718387
  (0, 52983)	0.29944388434735103
  (0, 26002)	0.36245333456939377
  (0, 55260)	0.17268539580488468
  (1, 23850)	0.6471040548515594
  (1, 73917)	0.2668231250746326
  (1, 40436)	0.49417356599639395
  (1, 13936)	0.5156105592309805
  (2, 85593)	0.20373694733441325
  (2, 39354)	0.21052644832014503
  (2, 29087)	0.18542407890086796
  (2, 91704)	0.4309381692479243
  (2, 21917)	0.24366751892488503
  (2, 80088)	0.23939212490428563
  (2, 99169)	0.273485116069025
  (2, 40144)	0.16478162474110616
  (2, 96681)	0.1740556247628811
  (2, 74608)	0.25958913154746094
  (2, 41833)	0.17078585658703535
  (2, 44388)	0.35964915669175657
  :	:
  (399996, 93007)	0.2670293989252079
  (399996, 134

In [36]:
print(X_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 655604 stored elements and shape (100000, 102625)>
  Coords	Values
  (0, 45137)	0.7776909420103503
  (0, 85968)	0.4658145038362093
  (0, 88839)	0.422153581923546
  (1, 13627)	0.30022393446405915
  (1, 23357)	0.18528760143835749
  (1, 36539)	0.2422261319638227
  (1, 37043)	0.1975816890524984
  (1, 41833)	0.19819576561383356
  (1, 42871)	0.26105556740416097
  (1, 44324)	0.1794843267071652
  (1, 56542)	0.1810194215938191
  (1, 64824)	0.39803836432632517
  (1, 65154)	0.40100776788477577
  (1, 70136)	0.21977059752242833
  (1, 77676)	0.2021041471322486
  (1, 78524)	0.17381651563149175
  (1, 85630)	0.2431204123145439
  (1, 91934)	0.3281785467886472
  (2, 43183)	0.4718882894271042
  (2, 55260)	0.3494125092597571
  (2, 84293)	0.8094642306330538
  (3, 22613)	0.36570893177467323
  (3, 24081)	0.404266562246309
  (3, 26194)	0.5705543273123663
  (3, 52876)	0.27171363617161104
  :	:
  (99995, 97137)	0.190008077610133
  (99996, 14149)	0.294

In [37]:
# train the machine learning model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [41]:
# helper that evaluates a fitted model on any labelled dataset (train or test)
def model_evaluation(model, X, y):
    y_pred = model.predict(X)
    cm = confusion_matrix(y, y_pred)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    r_score = recall_score(y, y_pred)
    f_score = f1_score(y, y_pred)

    print(f"Confusion matrix : {cm}")
    print(f"Accuracy : {accuracy}")
    print(f"Precision score : {precision}")
    print(f"Recall score : {r_score}")
    print(f"F1 score : {f_score}")

In [62]:
# model evaluation for testing set
print('Model evaluation for testing set')
model_evaluation(model, X_test, y_test)

Model evaluation for testing set
Confusion matrix : [[37348 12555]
 [10481 39616]]
Accuracy : 0.76964
Precision score : 0.7593490636560541
Recall score : 0.7907858754017206
F1 score : 0.7747486994954433
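
The reported scores can be re-derived by hand from the confusion matrix, which is a useful sanity check. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1. The sketch below plugs in the test-set counts printed above:

```python
# Test-set confusion matrix from the output above (rows = true, columns = predicted)
tn, fp = 37348, 12555
fn, tp = 10481, 39616

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all tweets classified correctly
precision = tp / (tp + fp)                   # of predicted positives, how many are positive
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The four values agree with the scores printed by `model_evaluation` for the testing set.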


In [63]:
# model evaluation for training set
print('Model evaluation for training set')
model_evaluation(model, X_train, y_train)

Model evaluation for training set
Confusion matrix : [[154968  44863]
 [ 36609 163560]]
Accuracy : 0.79632
Precision score : 0.7847502434951996
Recall score : 0.817109542436641
F1 score : 0.8006030465598934


In [60]:
# function to predict sentiment
def predict_sentiment(text):
    text = clean_text(text)
    text = vectorizer.transform([text])
    sentiment = model.predict(text)
    
    if sentiment[0] == 0:
        return "Predicted Sentiment : Negative"
    else:
        return "Predicted Sentiment : Positive"

In [47]:
# save the model
joblib.dump(model, 'twitter_comments_model.pkl')

['twitter_comments_model.pkl']
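
The saved file holds only the classifier; predicting in a new session also needs the fitted `TfidfVectorizer`. One option, sketched here on a tiny made-up corpus, is to wrap both steps in a scikit-learn `Pipeline` and persist that single object. The file name `sentiment_pipeline.pkl` and the toy texts are assumptions for illustration, and the notebook itself does not do this:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy, already-cleaned texts (hypothetical) with 0 = negative, 1 = positive
texts = ["love amaz product", "hate bad servic", "happi great day", "sad terribl day"]
labels = [1, 0, 1, 0]

# vectorizer and classifier chained into one estimator
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# persist both steps together, then reload them as a new session would
path = os.path.join(tempfile.mkdtemp(), 'sentiment_pipeline.pkl')
joblib.dump(pipeline, path)
restored = joblib.load(path)

# the restored pipeline accepts cleaned text directly
same = list(restored.predict(texts)) == list(pipeline.predict(texts))
print(same)
```

As with any pickle-based format, such a file should only be loaded from a trusted source and with compatible library versions.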

In [61]:
example_text = "I love this product, it's amazing!"
print(predict_sentiment(example_text))

Predicted Sentiment : Positive
