## Twitter Sentiment Analysis - Analyzes tweets to determine wether they express positive or negative sentiments

Step1 : Collect tweet data
Step2 : Preprocess - Convert textual data to numerical
Step3 : Train-Test Split
Step4 : Using Logistic Regress because it is a Classification model
Step5 : After training the model if we give it a tweet it will analyze its sentiment as Positive or Negative

In [1]:
# Load datset from kaggle via api
# installing kaggle library

! pip install kaggle 

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting certifi>=2023.7.22 (from kaggle)
  Using cached certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB)
Collecting requests (from kaggle)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting urllib3 (from kaggle)
  Using cached urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting bleach (from kaggle)
  Downloading bleach-6.2.0-py3-none-any.whl.metadata (30 kB)
Co


[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Uploaded kaggle.json file which contains my api details

In [4]:
# configuring path of kaggle.json file

import os
import shutil

# Create the .kaggle directory in the user's home directory
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Copy kaggle.json to the .kaggle directory
shutil.copy("kaggle.json", os.path.expanduser("~/.kaggle/kaggle.json"))

# Set permissions for kaggle.json
os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)

print("Configuration of kaggle.json completed.")

Configuration of kaggle.json completed.


Importing Twitter Sentiment dataset

In [5]:
# API to fetch the dataset from kaggle

!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to c:\Users\Asus\OneDrive\Desktop\Twitter Sentiment Analysis




  0%|          | 0.00/80.9M [00:00<?, ?B/s]
  1%|          | 1.00M/80.9M [00:03<04:01, 347kB/s]
  2%|▏         | 2.00M/80.9M [00:06<04:30, 306kB/s]
  4%|▎         | 3.00M/80.9M [00:09<03:47, 359kB/s]
  5%|▍         | 4.00M/80.9M [00:11<03:34, 376kB/s]
  6%|▌         | 5.00M/80.9M [00:21<06:34, 202kB/s]
  7%|▋         | 6.00M/80.9M [00:23<05:25, 241kB/s]
  9%|▊         | 7.00M/80.9M [00:27<05:17, 244kB/s]
 10%|▉         | 8.00M/80.9M [00:32<05:29, 232kB/s]
 11%|█         | 9.00M/80.9M [00:37<05:24, 233kB/s]
 12%|█▏        | 10.0M/80.9M [00:44<06:21, 195kB/s]
 14%|█▎        | 11.0M/80.9M [00:50<06:31, 187kB/s]
 15%|█▍        | 12.0M/80.9M [00:56<06:31, 185kB/s]
 16%|█▌        | 13.0M/80.9M [01:04<07:15, 164kB/s]
 17%|█▋        | 14.0M/80.9M [01:08<06:15, 187kB/s]
 19%|█▊        | 15.0M/80.9M [01:14<06:15, 184kB/s]
 20%|█▉        | 16.0M/80.9M [01:17<05:21, 212kB/s]
 21%|██        | 17.0M/80.9M [01:21<04:52, 229kB/s]
 22%|██▏       | 18.0M/80.9M [01:24<04:16, 258kB/s]
 23%|██▎       | 19

Extract zip file

In [9]:
# Extracting the compressed dataset

from zipfile import ZipFile

dataset = 'sentiment140.zip'

with ZipFile(dataset , 'r') as zip:
    zip.extractall()
    print('The dataset is extracted')

The dataset is extracted


Importing the Dependencies

In [10]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [11]:
# Download Stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
# Printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Processing

In [14]:
# Loading the data from csv file to pandas dataframe
twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding = 'ISO-8859-1')

In [15]:
# Checking the number of rows and columns

twitter_data.shape

(1599999, 6)

In [16]:
# Printing first 5 rows of the dataframe
twitter_data.head

<bound method NDFrame.head of          0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  \
0        0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
1        0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
2        0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
3        0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4        0  1467811372  Mon Apr 06 22:20:00 PDT 2009  NO_QUERY   
...     ..         ...                           ...       ...   
1599994  4  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599995  4  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996  4  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997  4  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998  4  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

         _TheSpecialOne_  \
0          scotthamilton   
1               mattycus   
2                ElleCTF   
3                 Karoli   
4               joy_wolf   
...      

In [17]:
# Naming the columns and reading the dataset again

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv',names= column_names, encoding = 'ISO-8859-1')

In [18]:
# Checking the number of rows and columns after including the column names

twitter_data.shape

(1600000, 6)

In [19]:
# Printing first 5 rows of the dataframe
twitter_data.head

<bound method NDFrame.head of          target          id                          date      flag  \
0             0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1             0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2             0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3             0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4             0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
...         ...         ...                           ...       ...   
1599995       4  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996       4  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997       4  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998       4  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599999       4  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

                    user                                               text  
0        _TheSpecialOne_  @switchfoot h

In [20]:
# Check for Missing Values in the dataset

twitter_data.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [25]:
# Checking the distribution of target Column - how many positive tweets are there and how many negative tweets are there
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

Since the distribution of data is equal for both positive and negative sentiments it is fine otherwise we would have done upsampling and downsampling to equalize the dataset for ml model to perform accurately

In [26]:
# Convert the target '4' to '1'

twitter_data.replace({'target' : {4:1}}, inplace = True)

In [27]:
# Checking the distribution of target Column after converting 4 to 1 - how many positive tweets are there and how many negative tweets are there
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

0 --> Negative Tweet 

1 --> Positive Tweet

In [28]:
# Stemming - Reducing word to root word - For reducing dimensions and complexities

#Losding the porterstemmer module to the variable
port_stem = PorterStemmer()

In [30]:
# Building the template for the stemming function
def stemming(content):
     stemmed_content = re.sub('{^a-zA-Z}',' ', content)
     stemmed_content = stemmed_content.lower()
     stemmed_content = stemmed_content.split()
     stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]
     stemmed_content = ' '.join(stemmed_content)

     return stemmed_content

In [31]:
# Applying stemming to my data

twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)

# takes long time

In [32]:
# Displaying dataset after stemming
twitter_data.head

<bound method NDFrame.head of          target          id                          date      flag  \
0             0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1             0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2             0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3             0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4             0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
...         ...         ...                           ...       ...   
1599995       1  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996       1  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997       1  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998       1  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599999       1  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

                    user                                               text  \
0        _TheSpecialOne_  @switchfoot 

In [33]:
# stemmed_content(16 lakh tweets) and the target variable are required for training my model

print(twitter_data['stemmed_content'])

0          @switchfoot http://twitpic.com/2y1zl - awww, t...
1          upset can't updat facebook text it... might cr...
2          @kenichan dive mani time ball. manag save 50% ...
3                            whole bodi feel itchi like fire
4          @nationwideclass no, behav all. i'm mad. here?...
                                 ...                        
1599995                       woke up. school best feel ever
1599996    thewdb.com - cool hear old walt interviews! â...
1599997                      readi mojo makeover? ask detail
1599998    happi 38th birthday boo alll time!!! tupac ama...
1599999    happi #charitytuesday @thenspcc @sparkschar @s...
Name: stemmed_content, Length: 1600000, dtype: object


In [34]:
# Seperating the data - tweet(x) and label-target(y)

X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [35]:
print(X)
print(Y)

["@switchfoot http://twitpic.com/2y1zl - awww, that' bummer. shoulda got david carr third day it. ;d"
 "upset can't updat facebook text it... might cri result school today also. blah!"
 '@kenichan dive mani time ball. manag save 50% rest go bound' ...
 'readi mojo makeover? ask detail'
 'happi 38th birthday boo alll time!!! tupac amaru shakur'
 'happi #charitytuesday @thenspcc @sparkschar @speakinguph4h']
[0 0 0 ... 1 1 1]


Splitting the data to training and test data

In [36]:
X_train, X_test , Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)

In [37]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)


Before Going to ml model we need to convert textual data to numeric by feature extraction

In [38]:
# Converting the textual data to numerical data

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test) 

# words will get importance on the basis of repitition and corresponds to positive or negative sentiment

In [39]:
print(X_train)
print(X_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10033748 stored elements and shape (1280000, 578847)>
  Coords	Values
  (0, 548734)	0.272460245511403
  (0, 453116)	0.35795997954657904
  (0, 254267)	0.5248773503338933
  (0, 166830)	0.37842601975344053
  (0, 311959)	0.4213989894780769
  (0, 555967)	0.448720931098355
  (1, 226262)	0.9067376665899747
  (1, 247278)	0.4216951552804088
  (2, 166830)	0.46135196432581543
  (2, 184426)	0.19030000160550456
  (2, 514638)	0.18602125106063966
  (2, 190522)	0.2919883395488319
  (2, 513655)	0.32017162311057684
  (2, 545128)	0.32796751926832923
  (2, 129252)	0.3117944515681845
  (2, 556376)	0.3361458059516176
  (2, 349418)	0.24085594662276874
  (2, 516778)	0.15260594669966598
  (2, 245846)	0.16108271281506445
  (2, 214788)	0.1869628699148727
  (2, 193769)	0.2018446658815102
  (2, 375364)	0.16689112338294135
  (3, 513655)	0.2889371658693274
  (3, 223959)	0.4482843351756085
  (3, 215997)	0.27659430167424054
  :	:
  (1279996, 494789)	0.21762

# Training the ML Model - Logistic Regression 

In [40]:
model = LogisticRegression(max_iter=1000)

In [41]:
model.fit(X_train, Y_train)

Model Evaluation

In [42]:
# For Model Evaluation - We use Accuracy Score

# Acuracy Score on the training_data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [43]:
print('Accuracy Score on the training data : ', training_data_accuracy)

Accuracy Score on the training data :  0.8119328125


In [44]:
# Acuracy Score on the testing_data
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [51]:
print('Accuracy Score on the testing data : ', testing_data_accuracy)

Accuracy Score on the testing data :  0.781615625


Model Accuracy = testing_data_accuracy

# Saving the trained model

In [46]:
import pickle

In [47]:
filename = 'trained_model.sav'
pickle.dump(model, open(filename, 'wb'))

# Using the saved model for future predictions

In [48]:
# Loading the saved model
loaded_model = pickle.load(open('trained_model.sav', 'rb'))

In [49]:
X_new = X_test[200]
Y_new = Y_test[200]

prediction = loaded_model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
    print('Negative Tweet')
else:
    print('Positive Tweet')

[1]
Positive Tweet


In [50]:
# Saving the vectorizer

# Save the vectorizer after fitting on training data
pickle.dump(vectorizer, open('vectorizer.sav', 'wb'))