# Twitter Sentiment Analysis

<img src='twitter-sentiment.jpeg'>
<a href='https://iliyaz.hashnode.dev/twitter-sentiment-analysis'>Image source</a>

In this project, I apply sentiment analysis to Twitter messages. Here is the
<a href='https://raw.githubusercontent.com/amankharwal/Website-data/master/twitter.csv'>dataset source</a>.

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import re
import nltk

df = pd.read_csv('twitter-comments.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [5]:
# cleaning data with NLTK tools
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer('english')
from nltk.corpus import stopwords
import string
stopword = set(stopwords.words('english'))  # set gives fast membership tests

def clean(text):
    # raw strings (r'...') keep regex escapes like \S and \w from raising warnings
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                               # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                    # newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # tokens containing digits
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

    
df['tweet'] = df['tweet'].apply(clean)

df.head()


[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,rt mayasolov woman shouldnt complain clean ho...
1,1,3,0,3,0,1,rt boy dat coldtyga dwn bad cuffin dat hoe ...
2,2,3,0,3,0,1,rt urkindofbrand dawg rt ever fuck bitch sta...
3,3,3,0,2,1,1,rt cganderson vivabas look like tranni
4,4,6,0,6,0,1,rt shenikarobert shit hear might true might f...
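
To see what the regex steps in `clean` do to a single tweet, here is a minimal sketch. It is not the notebook's function: it assumes a tiny hand-picked stopword set (`TOY_STOPWORDS`, hypothetical) instead of the NLTK list, skips stemming, and uses `split()` so that empty tokens are dropped. The sample tweet is made up.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"this", "is", "a", "the"}

def clean_sketch(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                               # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # tokens containing digits
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    return " ".join(words)

sample = "!!! RT @user123: Check https://t.co/abc this is GREAT [link] <b>wow</b> 2day"
cleaned = clean_sketch(sample)
print(cleaned)  # rt check great wow
```

Note that the user handle disappears only because it contains digits; handles without digits survive as ordinary tokens, as in the cleaned output above.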


In [13]:
# calculating sentiment scores and setting them as features
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sentiments = SentimentIntensityAnalyzer()
df['Positive'] = [sentiments.polarity_scores(i)['pos'] for i in df['tweet']]
df['Negative'] = [sentiments.polarity_scores(i)['neg'] for i in df['tweet']]
df['Neutral'] = [sentiments.polarity_scores(i)['neu'] for i in df['tweet']]

df['class'].head()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mac/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


0    2
1    1
2    1
3    1
4    1
Name: class, dtype: int64

In [15]:
# selecting features that we will use
df = df[['tweet', 'Positive', 'Negative','Neutral']]
df.head(100)

Unnamed: 0,tweet,Positive,Negative,Neutral
0,rt mayasolov woman shouldnt complain clean ho...,0.147,0.157,0.696
1,rt boy dat coldtyga dwn bad cuffin dat hoe ...,0.000,0.280,0.720
2,rt urkindofbrand dawg rt ever fuck bitch sta...,0.000,0.577,0.423
3,rt cganderson vivabas look like tranni,0.333,0.000,0.667
4,rt shenikarobert shit hear might true might f...,0.154,0.407,0.440
...,...,...,...,...
95,causewereguy go back school suck dick hoe attend,0.000,0.508,0.492
96,causewereguy way fuck yo bitch year old,0.000,0.593,0.407
97,celeynichol whitethunduh come never bring food...,0.258,0.000,0.742
98,chadmfverbeck richnow doesnt show hella tinder...,0.406,0.000,0.594


In [29]:
# Let's determine which sentiment is in the majority
x = sum(df['Positive'])
y = sum(df['Negative'])
z = sum(df['Neutral'])

def sentiment_score(a,b,c):
    # elif keeps the branches exclusive, so exactly one label is printed
    if (a>b) and (a>c):
        print("Positive 😊 ")
    elif (b>a) and (b>c):
        print("Negative 😠 ")
    else:
        print("Neutral 🙂 ")
sentiment_score(x,y,z)


Neutral 🙂 
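
The same majority check can be sketched with `max` over a dictionary, returning the label instead of printing it. The function name `majority_sentiment` is hypothetical, and the inputs are the three totals reported in this notebook, rounded to three decimals.

```python
def majority_sentiment(positive, negative, neutral):
    """Return the label whose total score is largest (first label wins a tie)."""
    totals = {"Positive": positive, "Negative": negative, "Neutral": neutral}
    return max(totals, key=totals.get)

label = majority_sentiment(2880.086, 7201.021, 14696.888)
print(label)  # Neutral
```

Returning a value rather than printing makes the result easy to reuse or test.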


In [30]:
# Let's have a look at the total of each sentiment score
print('Positive: ', x)
print('Negative: ', y)
print('Neutral: ', z)

Positive:  2880.086000000009
Negative:  7201.020999999922
Neutral:  14696.887999999733


In [36]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
# fit_transform returns a large sparse document-term matrix
# (one row per tweet, one column per token), stored in sparse_matrix
sparse_matrix=cv.fit_transform(df['tweet'])

In [25]:
# add a 'review' label column: the sentiment with the highest score for each tweet.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (df is a column subset of the original frame).
df = df.assign(review=df[['Positive', 'Negative', 'Neutral']].idxmax(axis=1))


x=sparse_matrix
y=df['review']


from sklearn.model_selection import train_test_split as tts
x_train,x_test,y_train,y_test=tts(x,y,test_size=0.2,stratify=y,random_state=42)
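
To make the labeling step concrete, here is a minimal sketch of `idxmax(axis=1)` on the five score rows shown earlier in this notebook. The variable names `scores` and `labels` are only for illustration.

```python
import pandas as pd

# Score rows 0-4 from the table above
scores = pd.DataFrame({
    "Positive": [0.147, 0.000, 0.000, 0.333, 0.154],
    "Negative": [0.157, 0.280, 0.577, 0.000, 0.407],
    "Neutral":  [0.696, 0.720, 0.423, 0.667, 0.440],
})

# For each row, return the name of the column holding the largest value
labels = scores.idxmax(axis=1)
print(labels.tolist())  # ['Neutral', 'Neutral', 'Negative', 'Neutral', 'Neutral']
```

When two columns tie, `idxmax` returns the first one in column order, so the order of the column list matters for tied rows.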


In [26]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()
nb.fit(x_train,y_train)
nb_pred=nb.predict(x_test)

from sklearn.metrics import accuracy_score as score,classification_report as creport
print(" accuracy score of naive bayes is :", score(y_test,nb_pred))

 accuracy score of naive bayes is : 0.7651805527536817
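
The cell above imports `classification_report` as `creport` but does not call it. A per-class report can be more informative than accuracy alone when one label dominates. The sketch below uses made-up labels, not the notebook's predictions, and assumes scikit-learn 0.22 or later for the `zero_division` argument.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels for illustration only
toy_true = ["Neutral", "Neutral", "Negative", "Positive", "Neutral", "Negative"]
toy_pred = ["Neutral", "Neutral", "Neutral",  "Positive", "Neutral", "Negative"]

toy_accuracy = accuracy_score(toy_true, toy_pred)
report = classification_report(toy_true, toy_pred, output_dict=True, zero_division=0)

print(toy_accuracy)                   # 5 of 6 correct
print(report["Negative"]["recall"])   # 0.5: one of two Negative tweets was missed
```

In the notebook itself the equivalent call would be `print(creport(y_test, nb_pred))`.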


In [27]:
from sklearn.ensemble import RandomForestClassifier as RFC
random_forest=RFC(class_weight='balanced')
random_forest.fit(x_train,y_train)
random_pred=random_forest.predict(x_test)
print(" accuracy score of randomforest is :",score(y_test,random_pred))


 accuracy score of randomforest is : 0.778696792414767


In [28]:
from sklearn.neighbors import KNeighborsClassifier
knn_model=KNeighborsClassifier(n_neighbors=1)
knn_model.fit(x_train,y_train)
knn_pred=knn_model.predict(x_test)

print("accuracy score of KNN is: ",score(y_test,knn_pred))


accuracy score of KNN is:  0.6088359895097841


In [29]:
from sklearn.linear_model import LogisticRegression as LR
lr=LR()
lr.fit(x_train,y_train)
lr_pred=lr.predict(x_test)

print("accuracy score of logistic regression is: ",score(y_test,lr_pred))


accuracy score of logistic regression is:  0.8737139398829937


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
### We successfully performed sentiment analysis on the given Twitter comments data and tested several ML classifiers on it. LogisticRegression achieved the highest accuracy (about 0.87), ahead of random forest (0.78), naive Bayes (0.77), and KNN (0.61).