# Sentiment Analysis

## Contents

* What is sentiment analysis?
* Data preparation
* Bayes method
* Naïve Bayes method for finding sentiments
* Accuracy and Confusion matrix


### What is Sentiment Analysis?

Sentiment analysis is the process of identifying and categorizing opinions expressed in a piece of text to determine whether the attitude towards a particular topic, product, etc. is positive, negative or neutral. It is also known as opinion mining, sentiment mining, verbatim analysis, subjectivity detection and by several other names. Blog post reviews, survey verbatim responses, service reviews and movie reviews are all typical inputs for sentiment analysis.

Generic algorithms might not work well on every type of text data. We need to train the model carefully, because generic parameters lead to low accuracy.

The Naive Bayes model is commonly used in sentiment analysis. Bayes theorem describes the probability of an event based on prior knowledge of conditions that are related to the event. Sentiment analysis is widely used because it helps to understand customer attitudes and make decisions accordingly.

Typical questions it answers:

* Is the service review positive or negative?
* Which responses in a survey verbatim are positive and which are negative?
* In the given context, is a statement positive or negative?
* Is that blog post positive or negative?
* How are people writing reviews for a movie? Positively or negatively?
* Also known as:
   *  Opinion mining
   *  Sentiment mining
   *  Verbatim analysis
   *  Subjectivity detection
        
### Limitations of Finding Sentiments

* Text data is itself unstructured or semi-structured.
* Sarcasm is very difficult to detect.
* Sometimes the training data contains no strong opinion, only neutral statements.
* Strong short documents are often overshadowed by large individual documents.
      

### Case Study: Movie Reviews data


In [1]:
#Data Import

In [1]:
import pandas as pd

In [2]:
input_data = pd.read_csv("User_movie_review.csv")

In [4]:
#Basic Details of the data

In [4]:
input_data.shape

(2000, 2)

In [5]:
input_data.columns

Index(['class', 'text'], dtype='object')

In [6]:
input_data.head(10)

| | class | text |
|---|---|---|
| 0 | Pos | stuart little is one of the best family ... |
| 1 | Neg | a movie like mortal kombat annihilation wor... |
| 2 | Neg | and just when you thought joblo was getting a... |
| 3 | Pos | every now and then a movie comes along from a... |
| 4 | Neg | for about twenty minutes into mission impossi... |
| 5 | Neg | for better or worse the appearance of basic... |
| 6 | Neg | i have a great idea for a movie one that ca... |
| 7 | Pos | if he doesn=92t watch out mel gibson is in ... |
| 8 | Pos | if there s one thing in common about all of h... |
| 9 | Neg | if you haven t plunked down your hard earned ... |


 <img src="Images/import.png" style="width: 500px;"/>
   

In [8]:
#Frequency of sentiment col

In [7]:
input_data['class'].value_counts()

Neg    1000
Pos    1000
Name: class, dtype: int64

In [8]:
#Creating Document Term Matrix

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

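# Build the document-term matrix: one row per review, one column per distinct token
countvec1 = CountVectorizer()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm_v1 = pd.DataFrame(countvec1.fit_transform(input_data['text']).toarray(), columns=countvec1.get_feature_names_out(), index=None)
dtm_v1['class'] = input_data['class']
dtm_v1.head()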

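| | 00 | 000 | 0009f | 007 | 00s | 03 | 04 | 05 | 05425 | 10 | ... | zukovsky | zulu | zundel | zurg | zus | zweibel | zwick | zwigoff | zycie | zzzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |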
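
To make the shape of a document-term matrix concrete, here is a minimal sketch on a hypothetical three-sentence corpus (`toy_docs` is invented for illustration and is not part of the movie review data). Each row is a document, each column is a term, and each cell counts how often the term occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration
toy_docs = [
    "good movie good cast",
    "bad movie",
    "good plot bad ending",
]

vectorizer = CountVectorizer()
toy_dtm = vectorizer.fit_transform(toy_docs)   # sparse matrix: rows = documents, columns = terms

# vocabulary_ maps each term to its column index; sort the terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
rows = toy_dtm.toarray().tolist()

print(vocab)
for row in rows:
    print(row)
```

The word "good" occurs twice in the first sentence, so its cell in the first row holds 2. Most cells are 0, which is why the matrix for 2000 reviews is very wide and very sparse.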



### Output

 <img src="Images/doctm.png" style="width: 500px;"/>

### Refining DTM

* We need to spend a lot of time on this step.
* The final result depends on the refinements made in this step.
* A few things we are going to do:
   *  Remove numbers
   *  Remove punctuation
   *  Remove stop words
   *  Stemming
* We will also set a lower bound on how many documents a word must appear in (`min_df`) to be kept in our matrix.
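
The effect of a frequency threshold can be sketched with the `min_df` parameter of `CountVectorizer`, which drops terms that appear in fewer than the given number of documents. The four-document corpus below is hypothetical and only meant to show the mechanism.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration
toy_docs = ["great film", "great acting", "dull film", "great soundtrack"]

all_terms = CountVectorizer().fit(toy_docs)               # default min_df=1 keeps every term
frequent_terms = CountVectorizer(min_df=2).fit(toy_docs)  # keep terms found in at least 2 documents

kept_all = sorted(all_terms.vocabulary_)
kept_frequent = sorted(frequent_terms.vocabulary_)

print(kept_all)
print(kept_frequent)
```

Rare terms mostly add columns without adding much signal, so a threshold of this kind shrinks the matrix considerably on a real corpus.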
      
### LAB: Refining DTM

In [5]:
import pandas as pd
import re
import nltk

In [6]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

In [14]:
#Writing a Custom Tokenizer

In [7]:
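stemmer = PorterStemmer()

def tokenize(text):
    # NOTE: stem() receives the whole review as if it were a single word,
    # so individual words are not stemmed (the vocabulary keeps 'abandon', 'abandoned', 'abandons')
    text = stemmer.stem(text)
    text = re.sub(r'\W+|\d+|_', ' ', text)    # replace punctuation, digits and underscores with spaces
    tokens = nltk.word_tokenize(text)         # split the cleaned text into word tokens
    return tokens

# min_df=5 keeps only tokens that occur in at least 5 reviews; English stop words are dropped
countvec = CountVectorizer(min_df=5, tokenizer=tokenize, stop_words=stopwords.words('english'))
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm = pd.DataFrame(countvec.fit_transform(input_data['text']).toarray(), columns=countvec.get_feature_names_out(), index=None)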

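
Note that `stemmer.stem(text)` in the tokenizer receives the whole review as one string, so individual words are not reduced to their stems. One possible variant, shown here only as a sketch, cleans and splits the text first and then stems token by token. The name `tokenize_and_stem` is hypothetical, and a plain `str.split` stands in for `nltk.word_tokenize` so that the example needs no downloaded NLTK data.

```python
import re
from nltk.stem.porter import PorterStemmer

token_stemmer = PorterStemmer()

def tokenize_and_stem(text):
    """Hypothetical variant: clean the text, split it into words, then stem each word."""
    cleaned = re.sub(r'\W+|\d+|_', ' ', text.lower())   # drop punctuation, digits, underscores
    return [token_stemmer.stem(token) for token in cleaned.split()]

stems = tokenize_and_stem("abandoned abandons abandon!")
print(stems)
```

With per-token stemming the three inflections collapse into a single column. If a function like this were passed as `tokenizer=` to `CountVectorizer`, the stop word list would ideally be stemmed the same way so that both sides match.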


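In [9]:
import nltk

In [None]:
# Downloads every NLTK data package, which is large; this lab only uses the stop word list and the tokenizer models
nltk.download('all')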


In [8]:
#Adding label Column
dtm['class'] = input_data['class']
dtm.head()

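| | aaron | abandon | abandoned | abandons | abby | abduction | abel | abilities | ability | able | ... | zoe | zombie | zombies | zone | zoo | zoom | zooms | zorro | zucker | zwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |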


 ### Output: 
 
 <img src="Images/refine.png" style="width: 500px;"/>

## Naïve Bayes model

### Bayes Theorem

Bayes theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
      
 <img src="Images/bayes.png" style="width: 500px;"/>
 
 ### Understanding Bayes Theorem
 
In a factory, two machines produce bolts. Given a faulty bolt, what is the probability that it was produced by machine 1?
      
 <img src="Images/bayes2.png" style="width: 500px;"/>  

     
  <img src="Images/bays3.png" style="width: 500px;"/>  
  
      
* The overall defect rate is 0.026; take that as the reference. What proportion of that 0.026 comes from M1, and what proportion comes from M2?
* Overall defectives = weighted defectives from M1 + weighted defectives from M2
* 0.026 = (0.01)*(0.6) + (0.05)*(0.4)
* P(B) = P(B/A1)*P(A1) + P(B/A2)*P(A2), where B means "the bolt is defective" and A1, A2 mean "the bolt comes from M1" and "the bolt comes from M2"
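
The arithmetic can be checked with a few lines of Python. This is a sketch that assumes the first term of the sum belongs to machine 1, that is, M1 makes 60% of the bolts with a 1% defect rate and M2 makes 40% with a 5% defect rate. The variable names are illustrative.

```python
# Assumed reading of the example: M1 -> 60% of output, 1% defective; M2 -> 40% of output, 5% defective
share = {"M1": 0.6, "M2": 0.4}            # P(A1), P(A2)
defect_rate = {"M1": 0.01, "M2": 0.05}    # P(B/A1), P(B/A2)

# Total probability of a defective bolt: sum of the weighted defect rates
p_defective = sum(share[m] * defect_rate[m] for m in share)

# Bayes theorem: P(machine / defective) = P(defective / machine) * P(machine) / P(defective)
posterior = {m: share[m] * defect_rate[m] / p_defective for m in share}

print(round(p_defective, 3))
for machine, prob in posterior.items():
    print(machine, round(prob, 3))
```

Under these assumptions M1 accounts for 0.006 of the 0.026 and M2 for 0.020, so a defective bolt is roughly three times more likely to come from M2 even though M2 produces fewer bolts.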
      
      
   <img src="Images/bayes4.png" style="width: 500px;"/>  
   
   <img src="Images/bayes5.png" style="width: 500px;"/>  
   
   <img src="Images/bayes7.png" style="width: 500px;"/>  
   
* Given that a bolt is defective, what is the probability that it came from a particular machine?
* Given a new document, what is the probability that it comes from the positive set or the negative set?
      
  <img src="Images/bayes8.png" style="width: 300px;"/>  
  
  
  ### Naïve Bayes theorem for sentiment analysis
  
* New document d
* Classes C = {c1, c2}
* Compute the Bayes probability that d belongs to each class c ∈ C
      
  <img src="Images/bayes9.png" style="width: 300px;"/>  
  
* P(d) -> probability of the words of the document across all documents
* P(d/c) -> probability of the words of the document within a specific class
* P(c) -> probability of a class
              
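The Naïve Bayes method assigns a document to the class with the higher probability, which gives us its positive or negative sentiment. It is called "naïve" because it treats the words of a document as independent of each other given the class.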
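
The calculation can be sketched from scratch on a tiny, hypothetical training set. The code below is an illustration under stated assumptions, not the lab's implementation: it uses add-one smoothing (the same idea as `alpha=1.0` in scikit-learn's `MultinomialNB`), works with logarithms to avoid multiplying many small numbers, and drops P(d) because it is the same for every class.

```python
import math
from collections import Counter

# Hypothetical labelled documents, already split into words
train = [
    ("Pos", ["great", "fun", "great"]),
    ("Pos", ["fun", "plot"]),
    ("Neg", ["dull", "plot"]),
    ("Neg", ["dull", "boring", "dull"]),
]

classes = sorted({label for label, _ in train})
vocab = sorted({word for _, words in train for word in words})
doc_counts = Counter(label for label, _ in train)     # documents per class, for P(c)
word_counts = {c: Counter() for c in classes}         # word totals per class, for P(w/c)
for label, words in train:
    word_counts[label].update(words)

def log_posterior(words, c, alpha=1.0):
    """log P(c) + sum of log P(w/c), with add-alpha smoothing."""
    total = sum(word_counts[c].values())
    score = math.log(doc_counts[c] / len(train))
    for word in words:
        if word in vocab:   # words never seen in training are ignored
            score += math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))
    return score

def classify(words):
    # Pick the class with the highest (log) posterior score
    return max(classes, key=lambda c: log_posterior(words, c))

print(classify(["great", "plot"]))
print(classify(["boring", "plot"]))
```

Smoothing matters here: without it, a single word that never occurred in a class would force that class's probability to zero regardless of the other words.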
   

### Building training and testing sets

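In [14]:
# Refined DTM: first 1900 reviews for training, last 100 for testing
df_train = dtm[:1900]
df_test = dtm[1900:]

In [10]:
# The same split on the unrefined DTM (dtm_v1); running this cell replaces the split above
df_train = dtm_v1[:1900]
df_test = dtm_v1[1900:]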
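
Slicing by position relies on the row order of the file. As an alternative sketch (not what the lab does), scikit-learn's `train_test_split` can shuffle the rows and keep the class balance through its `stratify` argument. The small `toy` frame below is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 10 'Pos' and 10 'Neg' rows
toy = pd.DataFrame({"feature": range(20), "class": ["Pos", "Neg"] * 10})

# Hold out 20% of the rows, keeping the Pos/Neg ratio the same in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["class"], random_state=0
)

test_counts = toy_test["class"].value_counts().to_dict()
print(toy_train.shape, toy_test.shape)
print(test_counts)
```

Fixing `random_state` makes the split repeatable, which helps when comparing two versions of the document-term matrix on the same test rows.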

### LAB:  Building Naive Bayes Model

In [16]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
X_train= df_train.drop(['class'], axis=1)

In [19]:
#Fitting model to our data

In [17]:
clf.fit(X_train, df_train['class'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
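
The same estimator can also be chained with a vectorizer so that raw text goes in and a label comes out. The following is a minimal sketch on a hypothetical four-review corpus; `toy_reviews`, `toy_labels` and `toy_model` are illustrative names and not part of the lab, while `make_pipeline` is scikit-learn's helper for chaining steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training reviews and labels
toy_reviews = ["great fun film", "fun and charming", "dull boring film", "boring and slow"]
toy_labels = ["Pos", "Pos", "Neg", "Neg"]

# The pipeline vectorizes the text and then fits Naive Bayes on the resulting counts
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_reviews, toy_labels)

# New text is vectorized with the vocabulary learned during fit
toy_pred = toy_model.predict(["charming film", "slow and dull"])
print(toy_pred)
```

Keeping the vectorizer and the classifier in one object guarantees that new documents are mapped to the same columns the model was trained on.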

In [21]:
#Accuracy

In [18]:
X_test= df_test.drop(['class'], axis=1)
clf.score(X_test,df_test['class'])

0.8

In [23]:
#Prediction

In [24]:
pred_sentiment=clf.predict(df_test.drop('class', axis=1))
print(pred_sentiment)

['Pos' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos'
 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg'
 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos'
 'Neg' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg' 'Neg'
 'Neg' 'Pos' 'Pos' 'Pos']


In [19]:
pred_sentiment=clf.predict(df_test.drop('class', axis=1))
print(pred_sentiment)

['Pos' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos'
 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg'
 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos' 'Pos'
 'Neg' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Pos']
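
Accuracy alone does not show which class the errors fall in. A confusion matrix breaks the predictions down by true and predicted label. The sketch below uses short hypothetical label lists; in the lab the same calls could be given `df_test['class']` and `pred_sentiment`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, used only for illustration
y_true = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
y_pred = ["Pos", "Neg", "Neg", "Neg", "Pos", "Pos"]

# Rows are true labels, columns are predicted labels, both in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["Neg", "Pos"])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

The diagonal holds the correct predictions for each class and the off-diagonal cells hold the two kinds of error. Accuracy is the sum of the diagonal divided by the total number of documents.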
