# TWITTER SENTIMENT ANALYSIS

This project analyzes the sentiment of tweets from the social networking website
Twitter. The dataset was scraped from Twitter and contains 1,600,000 tweets
extracted using the Twitter API. It is a labeled dataset in which each tweet is annotated with
its sentiment (0 = negative, 2 = neutral, 4 = positive). The copy used here contains only
the labels 0 and 4, with 800,000 tweets each.
It contains the following 6 fields:

1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet
3. date: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
4. flag: the query; if there is no query, this value is NO_QUERY
5. user: the user that tweeted
6. text: the text of the tweet


## Problem Statement

In today's digital age, social media platforms like Twitter have become a crucial medium for individuals to express their opinions and sentiments. For businesses, understanding the sentiment of Twitter users towards their brand is essential for maintaining a positive brand image and making informed marketing decisions. However, manually analyzing thousands of tweets to gauge sentiment is time-consuming and prone to human error. Therefore, there is a need for an automated system that can accurately analyze the sentiment of tweets related to a brand and provide actionable insights for brand management and marketing strategies.

Design a classification model that correctly predicts the polarity of the tweets provided in the dataset.



In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import string
from nltk.corpus import stopwords
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, f1_score, roc_auc_score, roc_curve)
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
import nltk
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import warnings
warnings.filterwarnings("ignore")

In [4]:
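# Names used in this notebook for the six dataset fields
# (target, ids, date, flag, user, text).
column_names = ['status', 'id', 'date', 'query', 'name', 'tweet']

# read_csv already returns a DataFrame; the column names are supplied explicitly.
df = pd.read_csv(r"D:\datascience\twitter_guvi_project\twitter_new.csv", names=column_names, encoding='latin1')

df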

Unnamed: 0,status,id,date,query,name,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [5]:
df.dtypes

status     int64
id         int64
date      object
query     object
name      object
tweet     object
dtype: object

In [6]:
# Keep only the label and the text; copy so that later edits do not touch df.
df1 = df[['status', 'tweet']].copy()
df1

Unnamed: 0,status,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
...,...,...
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...


In [7]:
# Map the positive label 4 to 1 so the target is binary (0 = negative, 1 = positive).
df1['status'] = df1['status'].replace(4, 1)
df1['tweet'] = df1['tweet'].str.lower()
df1['status'].value_counts()

status
0    800000
1    800000
Name: count, dtype: int64

Select a small, balanced subset of the large dataset for model creation: 20,000 negative and 20,000 positive tweets

In [8]:
df1_p = df1[df1['status'] == 1]
df1_n = df1[df1['status'] == 0]

# Two disjoint slices of 20,000 tweets per class.
df1p = df1_p.iloc[:20000]
df1p2 = df1_p.iloc[20000:40000]

df1n = df1_n.iloc[:20000]
df1n2 = df1_n.iloc[20000:40000]

# Use the second slice of each class; the first slice is the commented-out alternative.
df2 = pd.concat([df1n2, df1p2])
# df2 = pd.concat([df1n, df1p])

df2

Unnamed: 0,status,tweet
20000,0,@heather2711 good thing i didn't find any then...
20001,0,dea's are no fun
20002,0,@tommcfly no i haven't hey but you guys are b...
20003,0,i googled up remedies for jonzing and reading ...
20004,0,trying to fix my phone fail
...,...,...
839995,1,oficially done with drivers ed stuff. now i ju...
839996,1,@jimhunt thank you for sharing and caring! gl...
839997,1,@buberzionist ok objection sustained
839998,1,is twittin' while my baby girl sleeps


In [9]:
stop_words = stopwords.words('english')
', '.join(stop_words)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

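Cleaning the tweets: removing stop words, punctuation, repeated characters, and digits

In [10]:
def clean_text(tweet):
    # Drop English stop words (the tweets are already lower-cased).
    stop_words = set(stopwords.words('english'))
    tweet = " ".join([word for word in str(tweet).split() if word not in stop_words])

    # Strip all ASCII punctuation, including '@', ':', '/' and '.'.
    translator = str.maketrans('', '', string.punctuation)
    tweet = tweet.translate(translator)

    # Collapse every run of a repeated character to a single character
    # (this also shortens ordinary doubles, e.g. "good" -> "god").
    tweet = re.sub(r'(.)\1+', r'\1', tweet)

    # Mention and URL patterns. Punctuation is already gone at this point,
    # so '@', '.' and '://' no longer occur and these two substitutions match nothing.
    tweet = re.sub(r'@[^\s]+', ' ', tweet)
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', tweet)

    # Remove digits.
    tweet = re.sub(r'[0-9]+', '', tweet)

    return tweet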
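
The order of the cleaning steps matters: once punctuation has been stripped, a handle such as `@heather2711` can no longer be recognized as a mention, which is why `heather` still appears in the cleaned tweets. The following is a minimal sketch of an alternative ordering, not the function used for the results in this notebook. It assumes a tiny hypothetical stop-word set in place of NLTK's list, a hypothetical function name `clean_text_reordered`, and a different repeat rule that only squeezes runs of three or more characters.

```python
import re
import string

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "a", "the", "is", "that", "to", "my", "so"}

MENTION_RE = re.compile(r"@\S+")
URL_RE = re.compile(r"(www\.\S+)|(https?://\S+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
DIGITS_RE = re.compile(r"[0-9]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_reordered(tweet):
    tweet = str(tweet).lower()
    # Remove mentions and URLs first, while '@', '.' and '://' are still present.
    tweet = MENTION_RE.sub(" ", tweet)
    tweet = URL_RE.sub(" ", tweet)
    # Drop stop words, then punctuation and digits.
    tweet = " ".join(w for w in tweet.split() if w not in STOP_WORDS)
    tweet = tweet.translate(PUNCT_TABLE)
    tweet = DIGITS_RE.sub("", tweet)
    # Squeeze elongated words ("sooo" -> "soo") but keep ordinary doubles ("good").
    tweet = REPEAT_RE.sub(r"\1\1", tweet)
    # Normalize the whitespace left behind by the removals.
    return " ".join(tweet.split())


print(clean_text_reordered(
    "@heather2711 good thing I didn't find any then... http://t.co/x sooo happy!!!"
))
```

Changing the cleaning in this way would change the vocabulary seen by the model, so the accuracy reported later would have to be measured again.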

In [11]:
df2['tweet'] = df2['tweet'].apply(clean_text)
df2

Unnamed: 0,status,tweet
20000,0,heather god thing find none ones like come siz...
20001,0,deas fun
20002,0,tomcfly hey guys back england wel super great ...
20003,0,gogled remedies jonzing reading one sugestions...
20004,0,trying fix phone fail
...,...,...
839995,1,oficialy done drivers ed stuf wait june licens...
839996,1,jimhunt thank sharing caring glad like quotes
839997,1,buberzionist ok objection sustained
839998,1,twitin baby girl sleps


The tweets have to be tokenized and lemmatized before model creation.

Lemmatization is the process of reducing a word to its root form (its lemma). For example, the word 'leaves' is reduced to the root word 'leaf'. Tokenization is the process of dividing a sentence into tokens (single words) so that the words can be analyzed and used by the model.
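
As a toy illustration of the two steps, the sketch below tokenizes a sentence on whitespace and looks each token up in a small table. The table `LEMMA_TABLE` and both function names are hypothetical; a real lemmatizer consults a full dictionary instead of four hand-written entries.

```python
# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {
    "leaves": "leaf",
    "guys": "guy",
    "quotes": "quote",
    "remedies": "remedy",
}


def tokenize(text):
    """Split a sentence into whitespace-separated tokens."""
    return text.split()


def lemmatize_tokens(tokens):
    """Replace each token by its lemma when the table knows it, else keep it."""
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]


tokens = tokenize("hey guys thank you for the quotes")
print(tokens)
print(" ".join(lemmatize_tokens(tokens)))
```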

In the following process, I split each tweet into tokens and then lemmatize each token.

In [12]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each whitespace-separated token (treated as a noun by default)
    # and join the results back into one string.
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

df2['tweet'] = df2['tweet'].apply(lemmatize_text)
df2

Unnamed: 0,status,tweet
20000,0,heather god thing find none one like come size...
20001,0,dea fun
20002,0,tomcfly hey guy back england wel super great l...
20003,0,gogled remedy jonzing reading one sugestions t...
20004,0,trying fix phone fail
...,...,...
839995,1,oficialy done driver ed stuf wait june license...
839996,1,jimhunt thank sharing caring glad like quote
839997,1,buberzionist ok objection sustained
839998,1,twitin baby girl sleps


Analyzing the average number of words per tweet and the polarity balance of the selected dataset

In [13]:
# Average number of whitespace-separated words per cleaned tweet.
avg_words = df2['tweet'].str.split().str.len().mean()
print("Average length of each tweet : ", avg_words)

# Share of each class in the selected subset.
pos = int((df2['status'] == 1).sum())
neg = df2.shape[0] - pos
print("Percentage of tweets with positive sentiment is "+str(pos/df2.shape[0]*100)+"%")
print("Percentage of tweets with negative sentiment is "+str(neg/df2.shape[0]*100)+"%")

Average length of each tweet :  7.718925
Percentage of tweets with positive sentiment is 50.0%
Percentage of tweets with negative sentiment is 50.0%


Selecting the feature and target variable

In [14]:
tweets = df2['tweet'].values
labels = df2['status'].values

In [15]:
# Stratified split; with the default test_size, 25% of the tweets are held out for testing.
X_train, X_test, Y_train, Y_test = train_test_split(tweets, labels, stratify = labels)

In [16]:
vocab_size = 4000
oov_tok = '<OOV>'
embedding_dim = 100
max_length = 200
padding_type = 'post'
trunc_type = 'post'

# Fit the vocabulary on the training tweets only.
tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

# Convert the tweets to integer sequences and pad them to max_length.
train_sequences = tokenizer.texts_to_sequences(X_train)
train_padded = sequence.pad_sequences(train_sequences, padding = padding_type, truncating = trunc_type, maxlen = max_length)
test_sequences = tokenizer.texts_to_sequences(X_test)
test_padded = sequence.pad_sequences(test_sequences, padding = padding_type, truncating = trunc_type, maxlen = max_length)
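
To make the tokenizing and padding step concrete, here is a simplified plain-Python sketch of the idea: frequent words receive integer ids, unknown words share one out-of-vocabulary id, and every sequence is filled with zeros at the end up to a fixed length. The helper names `fit_vocab` and `texts_to_padded` are hypothetical, and the sketch does not reproduce every detail of the Keras `Tokenizer` (for example its punctuation filtering).

```python
from collections import Counter

OOV_INDEX = 1  # id shared by all out-of-vocabulary words
PAD_INDEX = 0  # id used for padding


def fit_vocab(texts, vocab_size):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    # Ids 0 and 1 are reserved, so keep at most vocab_size - 2 words.
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def texts_to_padded(texts, vocab, max_length):
    """Convert texts to id lists, then pad (or cut) each one at the end."""
    padded = []
    for text in texts:
        ids = [vocab.get(word, OOV_INDEX) for word in text.split()]
        ids = ids[:max_length]
        padded.append(ids + [PAD_INDEX] * (max_length - len(ids)))
    return padded


train = ["trying fix phone fail", "phone fail again"]
vocab = fit_vocab(train, vocab_size=4)  # keeps only the 2 most frequent words
print(vocab)
print(texts_to_padded(["fix phone now"], vocab, max_length=5))
```

Because the vocabulary is built from the training tweets only, words that first appear in the test tweets fall back to the out-of-vocabulary id.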

Here I use a Keras Sequential model with a bidirectional LSTM layer, an architecture commonly used in natural language processing, to create the classification model that predicts the polarity of the tweets

In [18]:
model = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(24, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 100)          400000    
                                                                 
 bidirectional_1 (Bidirecti  (None, 128)               84480     
 onal)                                                           
                                                                 
 dense_2 (Dense)             (None, 24)                3096      
                                                                 
 dense_3 (Dense)             (None, 1)                 25        
                                                                 
Total params: 487601 (1.86 MB)
Trainable params: 487601 (1.86 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


During each epoch, the model updates its parameters (weights and biases) based on the gradients of the loss function with respect to those parameters. The number of epochs is a hyperparameter that is typically set before training begins and can be adjusted based on the performance of the model on a validation dataset.
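
The idea of an epoch can be sketched on a much smaller problem. The example below is an assumption-laden toy: a logistic regression with one weight and one bias, trained on six made-up points, with one full-batch gradient step per epoch. `model.fit`, by contrast, applies many mini-batch updates within each epoch. Tracking the loss on held-out points after every epoch is what makes it possible to choose the number of epochs.

```python
import math

# Toy data: one feature per example, label 1 when the feature is positive.
train = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
val = [(-1.5, 0), (1.5, 1)]


def predict(w, b, x):
    """Sigmoid of the linear score."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def mean_loss(w, b, data):
    """Binary cross-entropy averaged over the data."""
    eps = 1e-12
    return -sum(
        y * math.log(predict(w, b, x) + eps)
        + (1 - y) * math.log(1 - predict(w, b, x) + eps)
        for x, y in data
    ) / len(data)


w, b, lr = 0.0, 0.0, 0.5
val_losses = []
for epoch in range(1, 11):
    # Gradients of the mean loss with respect to w and b.
    grad_w = sum((predict(w, b, x) - y) * x for x, y in train) / len(train)
    grad_b = sum((predict(w, b, x) - y) for x, y in train) / len(train)
    # One parameter update per epoch.
    w -= lr * grad_w
    b -= lr * grad_b
    val_losses.append(mean_loss(w, b, val))
    print(f"epoch {epoch:2d}  train loss {mean_loss(w, b, train):.4f}  "
          f"val loss {val_losses[-1]:.4f}")
```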

In [19]:
num_epochs = 10
history = model.fit(train_padded, Y_train, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [20]:
prediction = model.predict(test_padded)

# Threshold the sigmoid outputs at 0.5 to obtain class labels.
pred_labels = (prediction >= 0.5).astype(int).ravel()
print("Accuracy of prediction on test set : ", accuracy_score(Y_test, pred_labels))

Accuracy of prediction on test set :  0.7233
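
Accuracy is one summary of the thresholded predictions; precision and recall come from the same four confusion counts. The sketch below computes them in plain Python on made-up probabilities (the values in `y_true` and `probs` are hypothetical, not model output). scikit-learn's `confusion_matrix` and `classification_report`, imported at the top of this notebook, report the same quantities.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn probabilities into 0/1 labels."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.35, 0.7, 0.6, 0.1]

y_pred = threshold(probs)
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("accuracy", accuracy, "precision", precision, "recall", recall)
```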


In [21]:
# evaluate() returns the loss followed by the compiled metrics (here, accuracy).
loss, accuracy = model.evaluate(test_padded, Y_test)

print("Accuracy :", accuracy)

Accuracy : 0.7232999801635742


Testing the model on a few sample sentences

In [22]:
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the acting was amazing and refreshing"]

sequences = tokenizer.texts_to_sequences(sentence)

padded = sequence.pad_sequences(sequences, padding='post', maxlen=max_length)
prediction = model.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ",s)

The movie was very touching and heart whelming
Predicted sentiment :  Positive
I have never seen a terrible movie like this
Predicted sentiment :  Negative
the acting was amazing and refreshing
Predicted sentiment :  Positive


Saving and loading the model

In [36]:
import pickle

with open(r'D:\datascience\twitter_guvi_project\Twitter_Sentiment_Analysis.pkl', 'wb') as f:
    pickle.dump(model, f)

In [37]:
with open(r'D:\datascience\twitter_guvi_project\Twitter_Sentiment_Analysis.pkl', 'rb') as f:
    tw = pickle.load(f)
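
The pickled file holds only the model. The prediction cell that follows reuses the `tokenizer` that is still in memory, so a fresh session would also need the same word-to-index mapping. A minimal sketch of persisting such a mapping with the standard library is shown here; the dictionary contents and the file name `word_index.json` are hypothetical stand-ins for `tokenizer.word_index` and a real path.

```python
import json
import os
import tempfile

# Hypothetical stand-in for tokenizer.word_index.
word_index = {"<OOV>": 1, "phone": 2, "fail": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "word_index.json")  # hypothetical file name

    # Save the mapping next to the model file.
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(word_index, f)

    # Load it again, as a later session would.
    with open(vocab_path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == word_index)
```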



In [38]:
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the acting was amazing and refreshing"]

sequences = tokenizer.texts_to_sequences(sentence)

padded = sequence.pad_sequences(sequences, padding='post', maxlen=max_length)
prediction = tw.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ",s)

The movie was very touching and heart whelming
Predicted sentiment :  Positive
I have never seen a terrible movie like this
Predicted sentiment :  Negative
the acting was amazing and refreshing
Predicted sentiment :  Positive
