# SENTIMENT ANALYSIS USING INTERNET MOVIE DATABASE


## OBJECTIVE:
**The primary goal of this research is to identify the sentiment (positive or negative) that people express in movie reviews.**

Sentiment analysis is the study of customers' sentiments towards an object of interest. It provides powerful insight into how customers think about certain topics, services, new products, etc. Applying NLP techniques to large volumes of unstructured text gives executives valuable information for making informed decisions.

### Import the necessary libraries

In [39]:
import nltk
import pandas as pd
import numpy as np

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import metrics, linear_model, preprocessing
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import Lasso

from textblob import TextBlob

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS


## STEP 1: LOAD THE DATA

In [2]:
file_name = "C:\\Datasets\\sentiment labelled sentences\\imdb_labelled.txt"

In [3]:
# Each line holds a review and its label separated by a tab
with open(file_name) as f:
    data = [line.split("\t") for line in f]
data

[['A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ',
  '0\n'],
 ['Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  ',
  '0\n'],
 ['Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  ',
  '0\n'],
 ['Very little music or anything to speak of.  ', '0\n'],
 ['The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  ',
  '1\n'],
 ["The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  ",
  '0\n'],
 ['Wasted two hours.  ', '0\n'],
 ['Saw the movie today and thought it was a good effort, good messages for kids.  ',
  '1\n'],
 ['A bit predictable.  ', '0\n'],
 ['Loved the casting of Jimmy Buffet as the science teacher.  ', '1\n'],
 ['And those baby owls were ado

#### Convert the raw data into pandas dataframe

In [4]:
x = pd.DataFrame(data, columns=['review', 'label'])

In [5]:
df = x.astype({"label": int, "review": str})

In [6]:
df.groupby('label').count()

label,review
0,500
1,500


#### Initial raw observation - The labels are evenly distributed with 50% of positive reviews and 50% of negative reviews

## STEP 2: BASIC FEATURE EXTRACTION
- **Number of words** - One of the most basic features we can extract is the number of words in each movie review. The intuition is that negative reviews may tend to contain fewer words than positive ones; this is a hypothesis to check against the data, not a rule.
- **Number of characters** - Similarly, a smaller number of characters may indicate a negative sentiment. Here, we calculate the number of characters in each review, which is simply the length of the review string.
- **Average word length** - We will also extract the average word length of each review. This can also potentially help us in improving our model. Here, we simply take the sum of the lengths of all the words and divide it by the number of words in the review.
- **Number of stopwords** - Generally, while solving an NLP problem, the first thing we do is remove the stopwords. But sometimes the number of stopwords can give us extra information that we would otherwise lose.
- **Number of special characters** - Another interesting feature we can extract from a review is the number of special characters present in it. This also helps in extracting extra information from our text data.
- **Number of numerics** - Just like words, we calculate the number of numeric tokens present in each review.
- **Number of uppercase words** - Anger or rage is quite often expressed by writing words in UPPERCASE, which makes the count of such words a potentially useful feature.
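
The special-character and uppercase-word features described in the list can be sketched as below. This is a minimal illustration, not part of the original notebook: the helper names `count_special_chars` and `count_uppercase_words`, the sample sentence, and the choice of `string.punctuation` as the definition of "special character" are all assumptions.

```python
import string

def count_special_chars(text):
    """Count punctuation characters, as defined by string.punctuation (an assumption)."""
    return sum(1 for ch in text if ch in string.punctuation)

def count_uppercase_words(text):
    """Count words written entirely in uppercase and longer than one character."""
    return sum(1 for word in text.split() if word.isupper() and len(word) > 1)

# Hypothetical sample sentence, invented for illustration
sample = "Attempting artiness with black & white, the movie was AWFUL - a TOTAL waste!"
print(count_special_chars(sample))    # 4
print(count_uppercase_words(sample))  # 2
```

On the dataframe used here, the same helpers could be applied with `df['review'].apply(count_special_chars)` and `df['review'].apply(count_uppercase_words)`; the resulting column names are left to the reader.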


In [7]:
# Number of words in each review
# Note: split(" ") also counts the empty strings produced by trailing or repeated spaces
df['word_count'] = df['review'].apply(lambda x: len(str(x).split(" ")))
df[['review','word_count']].head()

Unnamed: 0,review,word_count
0,"A very, very, very slow-moving, aimless movie ...",15
1,Not sure who was more lost - the flat characte...,21
2,Attempting artiness with black & white and cle...,33
3,Very little music or anything to speak of.,10
4,The best scene in the movie was when Gerardo i...,23


In [8]:
# No of characters in each review
df['char_count'] = df['review'].str.len() ## this also includes spaces
df[['review','char_count']].head()

Unnamed: 0,review,char_count
0,"A very, very, very slow-moving, aimless movie ...",87
1,Not sure who was more lost - the flat characte...,99
2,Attempting artiness with black & white and cle...,188
3,Very little music or anything to speak of.,44
4,The best scene in the movie was when Gerardo i...,108


In [9]:
# Average word length
def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

df['avg_word'] = df['review'].apply(lambda x: avg_word(x))
df[['review','avg_word']].head()

Unnamed: 0,review,avg_word
0,"A very, very, very slow-moving, aimless movie ...",5.615385
1,Not sure who was more lost - the flat characte...,4.157895
2,Attempting artiness with black & white and cle...,5.032258
3,Very little music or anything to speak of.,4.375
4,The best scene in the movie was when Gerardo i...,4.095238


In [10]:
stop = stopwords.words('english')

In [11]:
# Number of stopwords in each review (case-sensitive match against the lowercase NLTK list)
df['stopwords'] = df['review'].apply(lambda x: len([w for w in x.split() if w in stop]))
df[['review','stopwords']].head()

Unnamed: 0,review,stopwords
0,"A very, very, very slow-moving, aimless movie ...",3
1,Not sure who was more lost - the flat characte...,8
2,Attempting artiness with black & white and cle...,10
3,Very little music or anything to speak of.,2
4,The best scene in the movie was when Gerardo i...,10


In [12]:
# Number of numeric tokens in each review
df['numerics'] = df['review'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
df[['review','numerics']].head()

Unnamed: 0,review,numerics
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,0


In [13]:
df.head()

Unnamed: 0,review,label,word_count,char_count,avg_word,stopwords,numerics
0,"A very, very, very slow-moving, aimless movie ...",0,15,87,5.615385,3,0
1,Not sure who was more lost - the flat characte...,0,21,99,4.157895,8,0
2,Attempting artiness with black & white and cle...,0,33,188,5.032258,10,0
3,Very little music or anything to speak of.,0,10,44,4.375,2,0
4,The best scene in the movie was when Gerardo i...,1,23,108,4.095238,10,0


## STEP 3: BASIC TEXT PRE-PROCESSING

- **Stemming** - Stemming is the process of reducing morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. There are two main kinds of error in stemming: over-stemming and under-stemming. Over-stemming occurs when two words that have different stems are reduced to the same root. Under-stemming occurs when two words that share a stem are reduced to different roots.
- **Ngrams** - N-grams are sets of co-occurring words within a given window; when computing n-grams you typically move the window one word forward at a time (although you can move X words forward in more advanced scenarios). N-grams are used to develop not just unigram models but also bigram and trigram models. The optimum length depends on the application: if your n-grams are too short, you may fail to capture important differences; if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
- **Tokenization** - Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words, numbers or punctuation marks. Tokenization does this by locating word boundaries, that is, the end of one word and the beginning of the next. Tokenization is also known as word segmentation.
- **Lemmatization** - Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
- **Term Frequency (TF)** - TF scores how frequently a word occurs in the current document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones, so the term frequency is often divided by the document length to normalize it:
TF = (Number of times term T appears in the document) / (Number of terms in that document)
- **Inverse Document Frequency (IDF)** - IDF scores how rare a word is across documents: the rarer the term, the higher its IDF score. The cells below compute it as IDF = log(Number of documents / Number of documents containing term T).
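
The sliding-window idea behind n-grams can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the TextBlob implementation used in the cells below; the function name `ngrams` and the sample tokens are assumptions.

```python
def ngrams(tokens, n):
    """Return all n-grams by sliding a window of size n one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list, invented for illustration
tokens = "a very slow-moving aimless movie".split()

print(ngrams(tokens, 2))  # bigrams: 4 pairs of adjacent tokens
print(ngrams(tokens, 3))  # trigrams: 3 triples of adjacent tokens
```

A text with fewer than `n` tokens yields an empty list, which is one practical reason why long n-grams become sparse on short reviews.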

In [14]:
# Stemming
stemmer = SnowballStemmer("english")
original_words = ['greatness', 'flies', 'running', 'mules', 'denied','agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'solving','sensational', 
           'traditional', 'reference', 'plotted']
stemmed_words = [stemmer.stem(word) for word in original_words]

pd.DataFrame(data={'original word': original_words, 'stemmed': stemmed_words})

Unnamed: 0,original word,stemmed
0,greatness,great
1,flies,fli
2,running,run
3,mules,mule
4,denied,deni
5,agreed,agre
6,owned,own
7,humbled,humbl
8,sized,size
9,meeting,meet


In [15]:
# Stemmer with the sentence
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['review'][:10].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0    A very, very, veri slow-moving, aimless movi a...
1    not sure who wa more lost - the flat charact o...
2    attempt arti with black & white and clever cam...
3               veri littl music or anyth to speak of.
4    the best scene in the movi wa when gerardo is ...
5    the rest of the movi lack art, charm, meaning....
6                                      wast two hours.
7    saw the movi today and thought it wa a good ef...
8                                   A bit predictable.
9    love the cast of jimmi buffet as the scienc te...
Name: review, dtype: object

In [16]:
# Ngrams
TextBlob(df['review'][0]).ngrams(2)

[WordList(['A', 'very']),
 WordList(['very', 'very']),
 WordList(['very', 'very']),
 WordList(['very', 'slow-moving']),
 WordList(['slow-moving', 'aimless']),
 WordList(['aimless', 'movie']),
 WordList(['movie', 'about']),
 WordList(['about', 'a']),
 WordList(['a', 'distressed']),
 WordList(['distressed', 'drifting']),
 WordList(['drifting', 'young']),
 WordList(['young', 'man'])]

In [17]:
# Tokenization
TextBlob(df['review'][1]).words

WordList(['Not', 'sure', 'who', 'was', 'more', 'lost', 'the', 'flat', 'characters', 'or', 'the', 'audience', 'nearly', 'half', 'of', 'whom', 'walked', 'out'])

In [18]:
# Lemmatization
from textblob import Word
df['review'] = df['review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['review'].head(n=10)

0    A very, very, very slow-moving, aimless movie ...
1    Not sure who wa more lost - the flat character...
2    Attempting artiness with black & white and cle...
3           Very little music or anything to speak of.
4    The best scene in the movie wa when Gerardo is...
5    The rest of the movie lack art, charm, meaning...
6                                    Wasted two hours.
7    Saw the movie today and thought it wa a good e...
8                                   A bit predictable.
9    Loved the casting of Jimmy Buffet a the scienc...
Name: review, dtype: object

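#### Comparing the outputs above shows the difference between stemming and lemmatization: stemming truncates words (e.g. "movi", "veri"), while the lemmatizer, used here without part-of-speech information, leaves most words intact but turns "was" into "wa" and "as" into "a"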

#### One more example of raw text before and after lemmatization and Tokenization

In [19]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
            
    return result

In [20]:
'''
Preview a document after preprocessing
'''
review_sample = 'This was a lovely movie. I would like to match again and again'

print("Original document: ")
words = []
for word in review_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(review_sample))

Original document: 
['This', 'was', 'a', 'lovely', 'movie.', 'I', 'would', 'like', 'to', 'match', 'again', 'and', 'again']


Tokenized and lemmatized document: 
['love', 'movi', 'like', 'match']


In [21]:
# Term Frequency
tf1 = (df['review'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1.head(n=10)

Unnamed: 0,words,tf
0,the,2
1,"audience,",1
2,half,1
3,nearly,1
4,walked,1
5,or,1
6,sure,1
7,flat,1
8,-,1
9,who,1


In [22]:
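# Inverse Document Frequency
# Note: str.contains does regex substring matching, so "the" also matches "there" or "other",
# which inflates the document counts; whole-token matching would be stricter
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(df.shape[0]/(len(df[df['review'].str.contains(word)])))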
tf1.head(n=10)

Unnamed: 0,words,tf,idf
0,the,2,0.70522
1,"audience,",1,6.907755
2,half,1,4.961845
3,nearly,1,6.907755
4,walked,1,6.214608
5,or,1,0.911303
6,sure,1,5.298317
7,flat,1,6.214608
8,-,1,2.171557
9,who,1,3.057608


In [23]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

Unnamed: 0,words,tf,idf,tfidf
0,the,2,0.70522,1.41044
1,"audience,",1,6.907755,6.907755
2,half,1,4.961845,4.961845
3,nearly,1,6.907755,6.907755
4,walked,1,6.214608,6.214608
5,or,1,0.911303,0.911303
6,sure,1,5.298317,5.298317
7,flat,1,6.214608,6.214608
8,-,1,2.171557,2.171557
9,who,1,3.057608,3.057608
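
The TF-IDF calculation above can also be written with whole-token matching for the document frequency. The sketch below is a minimal standard-library illustration on a hypothetical three-document corpus, not a replacement for the notebook's cells; the corpus and the helper name `idf` are assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration
docs = [
    "the movie was great",
    "the plot was thin",
    "there was nothing to like",
]
tokenized = [doc.split() for doc in docs]

def idf(term):
    """log(N / number of documents that contain the term as a whole token)."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / doc_freq)

tf = Counter(tokenized[0])  # raw term counts for the first document
tfidf = {term: count * idf(term) for term, count in tf.items()}
print(tfidf)
```

Here "the" is counted in two documents only, because "there" is a different token, and "was" gets a score of zero because it appears in every document.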


## Calculation of sentiments without using any Machine Learning algorithm - After cleaning the raw data (feature extraction, pre-processing), we can score the sentiment of each review directly with TextBlob


In [24]:
df['review'][:5].apply(lambda x: TextBlob(x).sentiment)

0                                 (0.18, 0.395)
1    (0.014583333333333337, 0.4201388888888889)
2    (-0.12291666666666666, 0.5145833333333333)
3                  (-0.24375000000000002, 0.65)
4                                    (1.0, 0.3)
Name: review, dtype: object

## Polarity & Subjectivity

In the above example, each tuple holds the polarity and the subjectivity of a review.

 - **Polarity** - The orientation of the emotion expressed in a sentence, reported by TextBlob as a float between -1 (negative) and 1 (positive). Emotions are closely related to sentiments. The strength of a sentiment or opinion is typically linked to the intensity of certain emotions, e.g., joy and anger.
 - **Subjectivity** - A subjective sentence expresses some personal feelings, views, or beliefs, for example “I like iPhone.” TextBlob reports subjectivity as a float between 0 (very objective) and 1 (very subjective). Subjective expressions come in many forms, e.g., opinions, allegations, desires, beliefs, suspicions, and speculations. A subjective sentence may not express any sentiment. For example, “I think that he went home” and “I want a camera that can take good photos” are subjective sentences, but they do not express any sentiment.

### Below we extract polarity, which indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment.

In [25]:
df['sentiment'] = df['review'].apply(lambda x: TextBlob(x).sentiment[0] )
df[['review','sentiment']].head(n=5)

Unnamed: 0,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0.18
1,Not sure who wa more lost - the flat character...,0.014583
2,Attempting artiness with black & white and cle...,-0.122917
3,Very little music or anything to speak of.,-0.24375
4,The best scene in the movie wa when Gerardo is...,1.0
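
To compare these polarity scores with the labelled data, one option is to threshold them into a 0/1 prediction. The sketch below uses the five polarity values and labels shown above; the cut-off at zero is an assumption made for illustration, not something the dataset prescribes.

```python
# Polarity scores and true labels of the first five reviews shown above
polarities = [0.18, 0.014583, -0.122917, -0.24375, 1.0]
labels = [0, 0, 0, 0, 1]

# Assumption: polarity > 0 is treated as positive (1), everything else as negative (0)
predicted = [1 if p > 0 else 0 for p in polarities]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(predicted)  # [1, 1, 0, 0, 1]
print(accuracy)   # 0.6
```

The first two reviews are negative but receive a slightly positive polarity, which suggests that a lexicon score alone may struggle with this data and motivates the supervised models in the later steps.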


### Below we extract subjectivity

In [26]:
# Store subjectivity in its own column so the polarity in 'sentiment' is not overwritten
df['subjectivity'] = df['review'].apply(lambda x: TextBlob(x).sentiment[1] )
df[['review','subjectivity']].head(n=5)

Unnamed: 0,review,subjectivity
0,"A very, very, very slow-moving, aimless movie ...",0.395
1,Not sure who wa more lost - the flat character...,0.420139
2,Attempting artiness with black & white and cle...,0.514583
3,Very little music or anything to speak of.,0.65
4,The best scene in the movie wa when Gerardo is...,0.3


## STEP 4: APPLY MACHINE LEARNING ALGORITHM

Since we have text data, we expect the Naive Bayes classifier to yield better results than other ML algorithms. We are going to compare it with an ensemble model (in this case a bagging approach, the Random Forest classifier).

- **Naive Bayes Classifier** 
- **Random Forest Classifier** 

In [27]:
class LemmaTokenizer(object) :
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc) :
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
    
n_top_words = 20
def print_top_words(model, feature_names, n_top_words) :
    for topic_idx, topic in enumerate(model.components_) :
        message = "Topic #%d: " %topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1: -1]] )
        print(message)
        print(topic)
    

### First split the data for training (70%) and testing (30%). The split ratio is a judgment call; here we are testing with a simple scenario

In [28]:
x_train, x_test, y_train, y_test = train_test_split(df.review, df.label, test_size=0.3, random_state=101)

### The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. 

In [29]:
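# Instantiate the TfidfVectorizer
# Note: token_pattern is ignored when a custom tokenizer is supplied, which is why
# punctuation tokens appear among the feature names below
tf_vectorizer = TfidfVectorizer(analyzer="word", token_pattern=r'\w{1,}', stop_words="english", tokenizer=LemmaTokenizer())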
train_documents = tf_vectorizer.fit_transform(x_train)

In [30]:
tf_vectorizer.get_feature_names()

['!',
 '$',
 '&',
 "'",
 "''",
 "'cover",
 "'d",
 "'film",
 "'ive",
 "'ll",
 "'m",
 "'must",
 "'re",
 "'s",
 "'titta",
 "'ve",
 '(',
 ')',
 '***spoilers***',
 ',',
 '-',
 '--',
 '-period',
 '.',
 '..',
 '...',
 '.a',
 '.an',
 '.stylized',
 '0/10',
 '1',
 '1-10',
 '1/10',
 '10',
 '10+',
 '10/10',
 '15',
 '18th',
 '1928',
 '1948',
 '1971',
 '1973',
 '1980',
 '1986',
 '1995',
 '1998',
 '2',
 '20.the',
 '2005',
 '2006',
 '20th',
 '25',
 '3',
 '4',
 '5',
 '5-year',
 '7.50',
 '70',
 '70000',
 '8',
 '8.15pm',
 '80',
 '8pm',
 '9',
 '9/10',
 '90',
 ':',
 ';',
 '?',
 '``',
 'a+',
 'aailiyah',
 'abandoned',
 'ability',
 'abroad',
 'absolutely',
 'abstruse',
 'abysmal',
 'academy',
 'accent',
 'accessible',
 'accolade',
 'accurately',
 'accused',
 'achievement',
 'achille',
 'ackerman',
 'act',
 'acted',
 'acting',
 'acting-wise',
 'action',
 'actor',
 'actress',
 'actually',
 'adam',
 'adaptation',
 'add',
 'added',
 'addition',
 'admins',
 'admitted',
 'adorable',
 'adorable.the',
 'aerial',
 'a

In [31]:
test_documents = tf_vectorizer.transform(x_test)
test_documents

<300x2118 sparse matrix of type '<class 'numpy.float64'>'
	with 1976 stored elements in Compressed Sparse Row format>

### Naive Bayes Algorithm

In [32]:
# Training Phase
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB().fit(train_documents, y_train)
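
One detail of the training step above is worth illustrating: with its default `binarize=0.0`, `BernoulliNB` turns every positive feature value into 1, so the TF-IDF weights are effectively reduced to word presence or absence. The sketch below checks this on a hypothetical toy corpus (the documents and labels are invented); it should print `True`, since both models see the same binary features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus, invented for illustration
docs = ["great movie great cast", "boring plot", "great plot", "boring boring movie"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer().fit(docs)
# Binary presence/absence features over the same vocabulary
binary_counts = CountVectorizer(binary=True, vocabulary=tfidf.vocabulary_)

nb_tfidf = BernoulliNB().fit(tfidf.transform(docs), labels)
nb_binary = BernoulliNB().fit(binary_counts.transform(docs), labels)

# Compare the learned per-feature log probabilities of the two models
same = np.allclose(nb_tfidf.feature_log_prob_, nb_binary.feature_log_prob_)
print(same)
```

If the TF-IDF weights themselves are meant to influence the model, a classifier that uses feature magnitudes, such as `MultinomialNB`, may be worth comparing.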

In [33]:
# Test Phase
pred = classifier.predict(tf_vectorizer.transform(x_test))
pred

array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1])

### Testing the classifier on a new sentence

In [34]:
pred1 = classifier.predict(tf_vectorizer.transform(["such a beautiful movie"]))
pred1

array([1])

### Evaluate the model, here we use simple classification report and accuracy

In [35]:
print("ACCURACY : "+str(accuracy_score(y_test, pred)))
report = classification_report(y_test, pred)
print("Report : \n", report)

ACCURACY : 0.79
Report : 
              precision    recall  f1-score   support

          0       0.74      0.88      0.80       146
          1       0.86      0.70      0.77       154

avg / total       0.80      0.79      0.79       300
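
The figures in the report can be reproduced from a confusion matrix. The counts below are inferred from the rounded report above (129 and 108 correct predictions for classes 0 and 1), so they are an assumption rather than notebook output; the helper name `class_metrics` is also hypothetical.

```python
# Confusion counts inferred from the rounded report (an assumption):
# rows are true labels, columns are predicted labels
confusion = [[129, 17],   # true 0: predicted 0, predicted 1
             [46, 108]]   # true 1: predicted 0, predicted 1

def class_metrics(confusion, cls):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix."""
    tp = confusion[cls][cls]
    predicted = confusion[0][cls] + confusion[1][cls]  # column total
    actual = sum(confusion[cls])                       # row total
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total
print(round(accuracy, 2))
for cls in (0, 1):
    print(cls, [round(m, 2) for m in class_metrics(confusion, cls)])
```

Read this way, the model misses 46 of the 154 positive reviews, which is where the lower recall for class 1 comes from.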



### Re-training the Naive Bayes classifier using cross validation

In [40]:
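# Note: a bare %time line times an empty statement (hence "Wall time: 0 ns"); use %%time as the first line of the cell to time the whole cell
%time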
scores = cross_val_score(classifier, train_documents , y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=1)
print (scores)


Wall time: 0 ns


[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   13.5s remaining:    8.9s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   17.5s finished


[ 0.78873239  0.76056338  0.71830986  0.78873239  0.68571429  0.8
  0.75362319  0.82608696  0.86956522  0.73913043]


In [41]:
scores.mean()

0.77304581109847492

### The cross-validated mean accuracy (about 0.77) is a couple of points lower than the accuracy on the simple train/test split (0.79). The 10 fold accuracies also range from about 0.69 to 0.87, which shows how much the result of a single random split can depend on the sample. Cross validation is therefore a more reliable way to estimate model performance
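
The spread of the fold scores can be summarised with the standard library. The sketch below copies the ten scores printed above; only the summary statistics are new.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
scores = [0.78873239, 0.76056338, 0.71830986, 0.78873239, 0.68571429,
          0.8, 0.75362319, 0.82608696, 0.86956522, 0.73913043]

print(f"mean={mean(scores):.3f} std={stdev(scores):.3f} "
      f"min={min(scores):.3f} max={max(scores):.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.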

### It's time to experiment with Random Forest Classifier

In [42]:
clf = RandomForestClassifier(n_estimators=50, criterion='gini')
clf.fit(train_documents, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [43]:
# Test Phase
y_pred = clf.predict(tf_vectorizer.transform(x_test))

In [44]:
print("ACCURACY : "+str(accuracy_score(y_test, y_pred)))
report = classification_report(y_test, y_pred)
print("Report :\n ", report)

ACCURACY : 0.703333333333
Report :
               precision    recall  f1-score   support

          0       0.68      0.74      0.71       146
          1       0.73      0.67      0.70       154

avg / total       0.71      0.70      0.70       300



### As expected, Naive Bayes does a better job than Random Forest here (0.79 versus 0.70 accuracy). Naive Bayes is widely regarded as a strong baseline for text data, and that holds in our case. Note, however, that we used plain default parameters for the random forest, so this comparison should be treated with caution

### Let's experiment with hyperparameter tuning using grid search

In [45]:
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

#classifier
rfc = RandomForestClassifier(random_state=42)

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(train_documents, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 500], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [4, 5, 6, 7, 8], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [46]:
CV_rfc.best_params_

{'criterion': 'entropy',
 'max_depth': 6,
 'max_features': 'auto',
 'n_estimators': 500}

### Use the hyperparameters suggested by grid search and re-train the model

In [47]:
# Based on the above best_params we obtained from Grid Search we create a classifier 
rfc1 = RandomForestClassifier(random_state=42, max_features='auto', n_estimators= 500, max_depth=6, criterion='entropy')

In [48]:
rfc1.fit(train_documents, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [49]:
y_pred = rfc1.predict(tf_vectorizer.transform(x_test))

In [50]:
print("Accuracy for tuned Random Forest on test data: ",accuracy_score(y_test, y_pred))
report = classification_report(y_test, y_pred)
print("Report :\n ", report)

Accuracy for tuned Random Forest on test data:  0.726666666667
Report :
               precision    recall  f1-score   support

          0       0.68      0.84      0.75       146
          1       0.80      0.62      0.70       154

avg / total       0.74      0.73      0.72       300



### Pipeline having Count Vectorizer, TF-IDF and Random Forest. Excluding Grid Search.

In [51]:

pipe = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier(random_state=42, max_features='auto', n_estimators= 500, max_depth=8, criterion='entropy')),
    ])

In [52]:
pipe.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...timators=500, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False))])

In [53]:
y_pred = pipe.predict(x_test)

In [54]:
print("Accuracy for Random Forest pipeline on test data: ",accuracy_score(y_test, y_pred))
report = classification_report(y_test, y_pred)
print("Report :\n ", report)

Accuracy for Random Forest pipeline on test data:  0.723333333333
Report :
               precision    recall  f1-score   support

          0       0.68      0.82      0.74       146
          1       0.78      0.64      0.70       154

avg / total       0.73      0.72      0.72       300



### Pipeline having Count Vectorizer, TF-IDF and Random Forest. Including Grid Search as well
#### NOTE: rf_grid.estimator.get_params().keys() will print all the parameter keys that are allowed in the pipeline. 
####  Remember that without a pipeline you can pass the parameters of the classifier itself, as in the grid search above; inside a pipeline each parameter is prefixed with the step name, e.g. clf__max_depth. 

In [55]:
pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier(random_state=42)),
    ])

params = { 
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_depth' : [4,5,6,7,8],
    'clf__criterion' :['gini', 'entropy']
}

# Grid Search Execute
rf_grid = GridSearchCV(estimator=pipeline , param_grid=params, cv=5)


In [56]:
rf_grid.estimator.get_params().keys()

dict_keys(['memory', 'steps', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__bootstrap', 'clf__class_weight', 'clf__criterion', 'clf__max_depth', 'clf__max_features', 'clf__max_leaf_nodes', 'clf__min_impurity_decrease', 'clf__min_impurity_split', 'clf__min_samples_leaf', 'clf__min_samples_split', 'clf__min_weight_fraction_leaf', 'clf__n_estimators', 'clf__n_jobs', 'clf__oob_score', 'clf__random_state', 'clf__verbose', 'clf__warm_start'])

In [57]:
rf_detector = rf_grid.fit(x_train, y_train)
print(rf_grid.cv_results_)

{'mean_fit_time': array([ 0.61269455,  1.33230753,  0.5638063 ,  1.27906957,  0.49635997,
        1.20035667,  0.54898748,  1.35443945,  0.53071694,  1.35857539,
        0.51726689,  1.2220376 ,  0.56393771,  1.32962813,  0.58443847,
        1.3428226 ,  0.50630364,  1.26476526,  0.54467735,  1.49427104,
        0.58368835,  1.45431991,  0.56804152,  1.33377547,  0.64613905,
        1.48336763,  0.62531986,  1.48045979,  0.53963795,  1.3165659 ,
        0.50399132,  1.37848058,  0.52176647,  1.29536376,  0.50663033,
        1.2503366 ,  0.57177649,  1.31760621,  0.5987792 ,  1.41819067,
        0.55493011,  1.2766222 ,  0.55320067,  1.39741502,  0.60183549,
        1.43824806,  0.57910285,  1.2650846 ,  0.55631266,  1.50403495,
        0.62841492,  1.39372458,  0.5566597 ,  1.40872531,  0.63775649,
        1.47325296,  0.66165314,  1.50322585,  0.50054874,  1.39256182]), 'std_fit_time': array([ 0.10278953,  0.10942847,  0.06686972,  0.0861379 ,  0.04585753,
        0.09766967,  0.06279

In [58]:
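# Note: this predicts with `pipe` (the fixed pipeline from In [51]), not the grid-searched
# model, so the results below repeat those of In [54]; use rf_grid.predict(x_test) to
# evaluate the tuned pipeline
y_pred = pipe.predict(x_test)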

In [59]:
print("Accuracy for Random Forest pipeline on test data: ",accuracy_score(y_test, y_pred))
report = classification_report(y_test, y_pred)
print("Report :\n ", report)

Accuracy for Random Forest pipeline on test data:  0.723333333333
Report :
               precision    recall  f1-score   support

          0       0.68      0.82      0.74       146
          1       0.78      0.64      0.70       154

avg / total       0.73      0.72      0.72       300



### Categorization of Topics using LDA (Latent Dirichlet Allocation) algorithm
- Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, modeled as Dirichlet distributions.
- Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
- LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore choosing the right corpus of data is crucial.
- It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.

In [60]:
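# Note: LDA is unsupervised, so the y_train argument is ignored by fit; LDA is also usually
# fitted on raw term counts, whereas train_documents holds TF-IDF weights
lda = LatentDirichletAllocation(n_components=10, max_iter=100, learning_method="online", learning_offset=50., random_state=123)
lda_model = lda.fit(train_documents, y_train)
lda_features = lda_model.transform(train_documents)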

In [61]:
len(lda_features)

700

In [62]:
# Use LDA model to extract topic and map extracted topics to the corresponding features from TFIDF
tf_features = tf_vectorizer.get_feature_names()

print_top_words(lda_model, tf_features, n_top_words)

Topic #0: 10/10 turn limited evidently casted reasonable atrocity ready role budget explanation widmark unintentionally comical attractive eye-pleasing maybe gem script set
[ 0.10002472  0.10000326  0.10000329 ...,  0.10000281  0.37639038
  0.10000236]
Topic #1: jerky hole . well-done aerial legendary camerawork movement time zillion charles ray away reality started semi truck annoying drive wind
[ 0.32690107  0.10000246  0.10000373 ...,  0.60009269  0.10000316
  0.10000307]
Topic #2: . director perplexing football sandra bullock history cinema girl lacked scene end easily 2 speed favourite depth comment talented imagination
[ 0.10000331  0.100003    0.10000299 ...,  0.10000309  0.10000301
  0.10000309]
Topic #3: cost avoid . chilly unconvincing kieslowski amaze cease bring wish save carrell range clear ruthless film ability pull talented hour
[ 0.10000262  0.39809211  0.10000264 ...,  0.10000282  0.10000279
  0.10000329]
Topic #4: journey , baby owl mishima uninteresting extremely ado

### Re-build Random Forest model using features from LDA

In [63]:
# classifier
clf = RandomForestClassifier(n_estimators=50)
clf.fit(lda_features, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [64]:
lda_test_features = lda_model.transform(tf_vectorizer.transform(x_test))
# Test Phase
y_pred = clf.predict(lda_test_features)

In [65]:
print("ACCURACY : "+str(accuracy_score(y_test, y_pred)))
report = classification_report(y_test, y_pred)
print("Report :\n ", report)

ACCURACY : 0.556666666667
Report :
               precision    recall  f1-score   support

          0       0.54      0.58      0.56       146
          1       0.57      0.54      0.56       154

avg / total       0.56      0.56      0.56       300



### Explore other techniques

- **Word2vec** - is a technique for learning continuous embeddings for words. It learns from reading massive amounts of text and memorizing which words tend to appear in similar contexts. This matters because, if we deploy the model built above, it will very likely encounter words that were not in our training set. A bag-of-words model cannot accurately classify such reviews, even if it has seen very similar words during training.

- To solve this problem, we need to capture the semantic meaning of words, meaning we need to understand that words like ‘good’ and ‘positive’ are closer than ‘apricot’ and ‘continent’.
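
"Closer" is usually measured with cosine similarity between word vectors. The sketch below uses hypothetical three-dimensional embeddings invented for illustration; real Word2vec vectors are learned from text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only
embeddings = {
    "good": [0.9, 0.8, 0.1],
    "positive": [0.8, 0.9, 0.2],
    "apricot": [0.1, 0.2, 0.9],
    "continent": [0.9, -0.3, 0.1],
}

sim_related = cosine_similarity(embeddings["good"], embeddings["positive"])
sim_unrelated = cosine_similarity(embeddings["apricot"], embeddings["continent"])
print(sim_related, sim_unrelated)
```

With these made-up vectors the first pair scores much higher than the second, which is the behaviour a trained embedding model is expected to show for related and unrelated words.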