# Twitter Sentiment Analysis - Binary Classification with Machine Learning
In this machine learning project, we’ll build a binary classifier that assigns each tweet’s text to one of two categories: negative or positive sentiment. We’ll take a brief look at Bayes’ theorem and relax its requirements using the naive (conditional independence) assumption.
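
As a brief reminder of the idea, here is a sketch in our own notation. The symbols are assumptions chosen for illustration: $c$ is the sentiment class and $w_1, \dots, w_n$ are the words of a tweet. Bayes' theorem gives the posterior probability of a class given the words:

$$
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
$$

The naive assumption treats the words as conditionally independent given the class, so the joint likelihood factorises and the predicted class becomes:

$$
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The denominator can be dropped in the last step because it does not depend on $c$.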

In [229]:
# Import the required libraries
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB,ComplementNB, CategoricalNB
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
import pickle

In [230]:
dataset = pd.read_csv('/content/dataset.csv', usecols=[0,1,2],encoding="ISO-8859-1")
dataset.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


In [231]:
dataset.shape

(99982, 3)

In [232]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99982 entries, 0 to 99981
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ItemID         99982 non-null  object
 1   Sentiment      99982 non-null  object
 2   SentimentText  99982 non-null  object
dtypes: object(3)
memory usage: 2.3+ MB


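No column has missing values. Note, however, that every column (including `Sentiment`) was read as `object`, which suggests that some rows contain unexpected values.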

In [233]:
# Call the method (note the parentheses) to count the rows for each label;
# without them pandas only returns the bound method object
dataset.Sentiment.value_counts()

In [234]:
dataset.Sentiment.unique()

array(['0', '1', " we're jizztastic"], dtype=object)

Besides the labels `0` and `1`, the `Sentiment` column contains the stray value " we're jizztastic". Dropping the row with that value removes the inconsistency.

In [253]:
# Keep only rows with a valid label; .copy() avoids pandas' SettingWithCopyWarning on later assignments
dataset = dataset[dataset["Sentiment"].isin(["0", "1"])].copy()

In [255]:
dataset.Sentiment.unique()

array(['0', '1'], dtype=object)
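
The same kind of label filtering can be tried on a small, made-up frame. This is only a sketch: the `toy` data and the final cast to integers are illustrative assumptions, not part of the original notebook (which keeps the labels as strings).

```python
import pandas as pd

# Made-up rows: one of them has a corrupted label, as in the real dataset
toy = pd.DataFrame({
    "Sentiment": ["0", "1", " stray text", "1"],
    "SentimentText": ["sad", "happy", "broken row", "great"],
})

# Keep only rows whose label is one of the two expected values
valid = toy[toy["Sentiment"].isin(["0", "1"])].copy()

# Optional: once the column is clean, the labels can be cast to integers
valid["Sentiment"] = valid["Sentiment"].astype(int)

print(valid["Sentiment"].tolist())
```

Filtering with an allow-list of labels has the advantage that any other unexpected value would be dropped as well, not only the one we happened to spot.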

In [256]:
dataset.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


## Cleaning the Dataset

In [257]:
# Remove HTML tags such as <br /> from the text
def clean(text):
    tag_pattern = re.compile(r'<.*?>')
    return re.sub(tag_pattern, '', text)

In [258]:
dataset.SentimentText = dataset.SentimentText.apply(clean)
dataset.SentimentText[3]

"          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)..."
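
To see what the tag-stripping regex does, here is a small stand-alone sketch. The function name `strip_html`, the sample string, and the extra `html.unescape` step for entities such as `&amp;` are assumptions added for illustration; the notebook's `clean` only removes tags.

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time

def strip_html(text):
    # Remove tags first, then decode entities. Decoding first could turn
    # "&lt;3" into "<3", which the tag regex might then treat as a tag start.
    return html.unescape(TAG_RE.sub("", text))

print(strip_html("I &lt;3 this <br />so much &amp; more"))
```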

In [259]:
# Replace every non-alphanumeric character with a space
def is_special(text):
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [260]:
dataset.SentimentText = dataset.SentimentText.apply(is_special)
dataset.SentimentText[3]

'             Omgaga  Im sooo  im gunna CRy  I ve been at this dentist since 11   I was suposed 2 just get a crown put on  30mins    '
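
The output above contains long runs of spaces where punctuation used to be. The tokenizer ignores them later, but if a tidy string is wanted at this stage, a sketch like the following could be used. `replace_special` mirrors the notebook's `is_special`; `squeeze_spaces` is a hypothetical helper that is not in the original.

```python
def replace_special(text):
    # Same idea as is_special: non-alphanumeric characters become spaces
    return "".join(ch if ch.isalnum() else " " for ch in text)

def squeeze_spaces(text):
    # split() with no arguments splits on any whitespace run and drops empties
    return " ".join(text.split())

raw = ".. Omgaga. Im sooo  im gunna CRy."
print(squeeze_spaces(replace_special(raw)))
```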

String comparison in Python is case sensitive, so we convert all characters to lower case. This way "CRy" and "cry" are treated as the same word.

In [261]:
def to_lower(text):
    return text.lower()

In [118]:
dataset.SentimentText = dataset.SentimentText.apply(to_lower)
dataset.SentimentText[3]

'             omgaga  im sooo  im gunna cry  i ve been at this dentist since 11   i was suposed 2 just get a crown put on  30mins    '
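
A quick way to see why lower-casing matters is to count distinct strings before and after. This is a toy sketch with made-up values; it uses the vectorised pandas `.str.lower()` accessor rather than `apply`.

```python
import pandas as pd

tweets = pd.Series(["CRy", "cry", "Cry"])

print(tweets.nunique())              # distinct strings before lower-casing
print(tweets.str.lower().nunique())  # distinct strings after lower-casing
```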

In [262]:
# Tokenize each sentence and remove English stop words with NLTK
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    return [w for w in words if w not in STOP_WORDS]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [263]:
dataset.SentimentText = dataset.SentimentText.apply(remove_stopwords)
dataset.SentimentText[3]

['Omgaga',
 'Im',
 'sooo',
 'im',
 'gunna',
 'CRy',
 'I',
 'dentist',
 'since',
 '11',
 'I',
 'suposed',
 '2',
 'get',
 'crown',
 'put',
 '30mins']
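
The token list above still contains `'I'` and mixed-case words such as `'CRy'`, which suggests that the text reaching `remove_stopwords` in this run had not been lower-cased. NLTK's English stop words are all lower-case, so the membership test is case sensitive. The sketch below shows the effect under two assumptions: a small hand-written stop-word set stands in for NLTK's list, and `str.split` replaces `word_tokenize` so that nothing needs to be downloaded.

```python
# Hypothetical, hand-written stand-in for NLTK's English stop-word list
TOY_STOP_WORDS = {"i", "was", "to", "a", "on", "at", "this", "been"}

def drop_stopwords(text, stop_words=TOY_STOP_WORDS):
    # str.split is used here instead of word_tokenize (an assumption for brevity)
    return [w for w in text.split() if w not in stop_words]

sample = "I was at this dentist"
print(drop_stopwords(sample))          # 'I' is kept: it is not equal to 'i'
print(drop_stopwords(sample.lower()))  # 'i' is removed after lower-casing
```

Running the lower-casing cell before this step makes stop-word removal behave as intended.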

In [264]:
# Stem each token and join the tokens back into a single string
stemmer = nltk.stem.SnowballStemmer('english')  # created once and reused

def stem_word(tokens):
    return " ".join(stemmer.stem(w) for w in tokens)


In [265]:
dataset.SentimentText = dataset.SentimentText.apply(stem_word)
dataset.SentimentText[3]

'omgaga im sooo im gunna cri i dentist sinc 11 i supos 2 get crown put 30min'
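
The stemmer can be tried in isolation on a few of the tokens shown above. This is a minimal sketch that assumes only that `nltk` is installed; the Snowball stemmer itself needs no corpus download.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# A few tokens taken from the tweet shown in the notebook output
tokens = ["CRy", "since", "suposed", "30mins"]
print(" ".join(stemmer.stem(t) for t in tokens))
```

Note that the stemmer also lower-cases its input, which is why the stemmed tweet above is entirely in lower case.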

In [266]:
dataset.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,sad apl friend
1,2,0,i miss new moon trailer
2,3,1,omg alreadi 7 30 o
3,4,0,omgaga im sooo im gunna cri i dentist sinc 11 ...
4,5,0,think mi bf cheat t t


## Creating the Model


In [267]:
# Target labels (kept as the strings '0' and '1', as read from the CSV)
y = np.array(dataset.Sentiment.values)
# Bag-of-words features: counts of the 1000 most frequent tokens
cv = CountVectorizer(max_features=1000)
X = cv.fit_transform(dataset.SentimentText).toarray()
print("X.shape = ",X.shape)
print("y.shape = ",y.shape)

X.shape =  (99981, 1000)
y.shape =  (99981,)


In [268]:
print(X)
print(y)


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
['0' '0' '1' ... '0' '1' '1']
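
Because the printed matrix is almost entirely zeros, it can help to look at what `CountVectorizer` produces on a tiny corpus. The three documents below are made up for illustration (the first two resemble the cleaned tweets), and `toy_cv` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["sad apl friend", "i miss new moon trailer", "sad sad friend"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs).toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary token. With the default token pattern, single-character tokens such as `i` are dropped, so they never become features.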


### Creating the Train/Test Split

In [269]:
trainx,testx,trainy,testy = train_test_split(X,y,test_size=0.2,random_state=9)
print("Train shapes : X = {}, y = {}".format(trainx.shape,trainy.shape))
print("Test shapes : X = {}, y = {}".format(testx.shape,testy.shape))

Train shapes : X = (79984, 1000), y = (79984,)
Test shapes : X = (19997, 1000), y = (19997,)
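
The split above is purely random. If the two classes should keep the same proportions in the train and test sets, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up toy arrays and is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 5 per class
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["0"] * 5 + ["1"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=9, stratify=y_toy
)

# With stratification the 2 test samples contain one of each class
print(sorted(y_te))
```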


In [270]:
# Define the Naive Bayes models and train them
gnb = GaussianNB()
mnb = MultinomialNB(alpha=1.0, fit_prior=True)
bnb = BernoulliNB(alpha=1.0, fit_prior=True)
# ComplementNB and CategoricalNB are instantiated but not trained or evaluated below
conb = ComplementNB(alpha=1.0, fit_prior=True)
canb = CategoricalNB(alpha=1.0, fit_prior=True)
gnb.fit(trainx, trainy)
mnb.fit(trainx, trainy)
bnb.fit(trainx, trainy)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [271]:
ypg = gnb.predict(testx)
ypm = mnb.predict(testx)
ypb = bnb.predict(testx)

print("Gaussian = ",accuracy_score(testy,ypg))
print("Multinomial = ",accuracy_score(testy,ypm))
print("Bernoulli = ",accuracy_score(testy,ypb))

Gaussian =  0.691553733059959
Multinomial =  0.7282092313847077
Bernoulli =  0.7270090513577037
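
Accuracy alone does not show which class the errors fall on. A confusion matrix is one way to inspect that. The labels below are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (strings, like the notebook's y)
y_true = ["0", "0", "1", "1", "1"]
y_pred = ["0", "1", "1", "1", "0"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["0", "1"])
print(cm)
print(accuracy_score(y_true, y_pred))
```

In the notebook, the same call could be applied to `testy` and any of the prediction arrays, for example `ypm`.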


In [272]:
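# Persist the trained Bernoulli model; the context manager makes sure the file is closed
with open('model1.pkl', 'wb') as f:
    pickle.dump(bnb, f)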
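
A saved classifier is only usable together with the vocabulary it was trained on. One possible way to keep the two in sync is to wrap the vectorizer and the model in a scikit-learn `Pipeline` and pickle that single object. The sketch below uses made-up toy texts and hypothetical names (`toy_pipeline`, `restored`), and it pickles to memory rather than to a file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned training texts and their labels
texts = ["good great happi", "love good", "sad bad cri", "bad hate"]
labels = ["1", "1", "0", "0"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(texts, labels)

# Round-trip through pickle: vectorizer and model travel together
restored = pickle.loads(pickle.dumps(toy_pipeline))
print(restored.predict(["good love"]))
```

With this design, prediction code only needs the restored pipeline and raw (preprocessed) text; it never has to rebuild the feature vector by hand.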

In [323]:
# Save the fitted vocabulary once, rather than on every prediction
with open('bow.pkl', 'wb') as f:
    pickle.dump(cv.vocabulary_, f)

def SentimentAnalysis(txt):
    # Apply the same preprocessing steps used on the training data
    f1 = clean(txt)
    f2 = is_special(f1)
    f3 = to_lower(f2)
    f4 = remove_stopwords(f3)
    f5 = stem_word(f4)

    # Encode the text with the fitted vectorizer so that the columns
    # line up with the features the model was trained on
    inp = cv.transform([f5]).toarray()
    return mnb.predict(inp)


In [324]:
text = 'very positive'

In [325]:
SentimentAnalysis(text)

array(['1'], dtype='<U1')