# Naive Bayes
Naive Bayes is a classification method that uses Bayes' theorem to estimate the probability that an item belongs to a specific class.
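
For reference, the rule behind the method can be written out as follows. The symbols are assumptions introduced here for illustration: $c$ is a class (positive or negative) and $w_1, \dots, w_n$ are the words of a review.

$$
P(c \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid c)\, P(c)}{P(w_1, \dots, w_n)}
$$

The "naive" part is the assumption that words are conditionally independent given the class, which reduces the comparison between classes to:

$$
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The predicted class is the one with the largest value of the right-hand side.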

### Aim

The aim is to build a classifier for movie reviews from the IMDB database that labels each review as **positive** or **negative**.

Note: the **naive_bayes** module of the scikit-learn (sklearn) library is used to classify the movie reviews. <br>

### Data
The data was downloaded from Kaggle at the link below; the dataset author is Arunava. <br>
https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset

The dataset consists of **30000** text files, of which **5000** are reserved for testing. <br>
Both the training and test sets are split evenly between positive and negative reviews. <br>

### Training Data
The training data consists of **12500** positive reviews and **12500** negative reviews.
### Test Data
The test data consists of **2500** positive reviews and **2500** negative reviews.

**NOTE**: Execution may be slow because the training set for the text classifier is large.

### Libraries Used

In [1]:
import os
import io
import numpy as np
import collections
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB,GaussianNB
from sklearn.metrics import confusion_matrix

### Helper Functions to read the file and create DataFrames
This generator walks a directory tree, reads each file, and yields the file's path together with its contents as a single string.

In [2]:
def readFiles(path):
    # Walk the directory tree and yield (file path, file contents) pairs.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            # latin1 maps every byte to a character, so decoding never fails.
            with io.open(filepath, 'r', encoding='latin1') as f:
                lines = f.readlines()
            message = '\n'.join(lines)
            yield filepath, message

This function iterates over every file in a directory and builds a DataFrame with one row per review, indexed by file path.

In [3]:
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

### Loading the data 

The previously defined helper functions are used to load the data into a DataFrame called movieReview.

In [4]:
from pandas import concat  # DataFrame.append was removed in pandas 2.0

# Stack the positive and negative training reviews into one DataFrame.
movieReview = concat([
    dataFrameFromDirectory('Data/imdb-movie-reviews-dataset/train/pos/', 'pos'),
    dataFrameFromDirectory('Data/imdb-movie-reviews-dataset/train/neg/', 'neg'),
])
movieReview.head()

| | class | message |
|---|---|---|
| Data/imdb-movie-reviews-dataset/train/pos/0_9.txt | pos | Bromwell High is a cartoon comedy. It ran at t... |
| Data/imdb-movie-reviews-dataset/train/pos/10000_8.txt | pos | Homelessness (or Houselessness as George Carli... |
| Data/imdb-movie-reviews-dataset/train/pos/10001_10.txt | pos | Brilliant over-acting by Lesley Ann Warren. Be... |
| Data/imdb-movie-reviews-dataset/train/pos/10002_7.txt | pos | This is easily the most underrated film inn th... |
| Data/imdb-movie-reviews-dataset/train/pos/10003_8.txt | pos | This is not the typical Mel Brooks film. It wa... |


The variable movieReview now holds each review together with a label indicating whether it is a positive or a negative review.

### CountVectorizer
CountVectorizer splits each review into words and produces a sparse matrix holding the count of each word in each review.

In [5]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(movieReview['message'].values)

In [6]:
counts.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
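
To make the shape of that matrix concrete, here is a minimal pure-Python sketch of the same bag-of-words idea on two toy sentences. It is a simplified stand-in, not CountVectorizer's actual implementation: the function names and the toy reviews are made up for illustration. The tokenization rule (lowercase, words of two or more characters) mirrors CountVectorizer's defaults.

```python
import re
from collections import Counter

# Runs of two or more word characters, matching CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Return one row of word counts per document; unknown words are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical toy reviews, not taken from the dataset.
toy_reviews = ["A great, great movie", "Not a great movie at all"]
vocabulary = fit_vocabulary(toy_reviews)
matrix = transform(toy_reviews, vocabulary)

print(vocabulary)  # {'all': 0, 'at': 1, 'great': 2, 'movie': 3, 'not': 4}
print(matrix)      # [[0, 0, 2, 1, 0], [1, 1, 1, 1, 1]]
```

Each row is a review and each column is a word. With the full training set most entries are zero, which is why the real matrix is stored in sparse form.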

### Classifier 
A MultinomialNB classifier is created and fitted to the training data.

In [7]:
classifier = MultinomialNB()
targets = movieReview['class'].values
classifier.fit(counts, targets)
#classifier.fit(counts.toarray(), targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
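
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. The following is a rough pure-Python sketch of what fitting and predicting with a multinomial Naive Bayes model involves, under the assumption that documents are already tokenized. The helper names `train` and `predict` and the toy data are hypothetical; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    log_prior, log_likelihood = {}, {}
    for label, n_docs in class_docs.items():
        log_prior[label] = math.log(n_docs / len(documents))
        # Smoothing adds alpha to every word count so unseen words get
        # a small non-zero probability instead of zero.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        log_likelihood[label] = {
            tok: math.log((word_counts[label][tok] + alpha) / total)
            for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][tok]
            for tok in doc
            if tok in log_likelihood[label]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training data.
train_docs = [
    ["great", "movie"],
    ["great", "acting"],
    ["boring", "movie"],
    ["boring", "plot"],
]
train_labels = ["pos", "pos", "neg", "neg"]
log_prior, log_likelihood = train(train_docs, train_labels)

print(predict(["great", "movie"], log_prior, log_likelihood))    # pos
print(predict(["boring", "acting"], log_prior, log_likelihood))  # neg
```

Working in log space avoids numerical underflow when a review contains hundreds of words.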

### Prediction
The model can now predict whether a new review is positive or negative. A **positive review** is fed into the classifier as an example.

In [8]:
examples = ["I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
print("The review presented is : " + str(predictions))

The review presented is : ['pos']


According to the current model, the given review is positive. <br>
Having demonstrated the classifier on a single review, we will now evaluate it on the reserved test set.

## Testing

Now the model is evaluated on the test set to measure how well it performs.

### Reading the Dataset from file and predicting the feedback class
#### Positive Reviews

In [9]:
Testdata = dataFrameFromDirectory('Data/imdb-movie-reviews-dataset/test/pos/', 'pos')

# Extracting the messages from the dataframe for positive reviews
positive_counts = vectorizer.transform(Testdata['message'])

# Loading the prediction results for the positive reviews
pos_predictions = classifier.predict(positive_counts)
print(pos_predictions)

['pos' 'pos' 'pos' ..., 'pos' 'neg' 'pos']


#### Negative Reviews

In [10]:
Testdata = dataFrameFromDirectory('Data/imdb-movie-reviews-dataset/test/neg/', 'neg')

# Extracting the messages from the dataframe for negative reviews
negative_counts = vectorizer.transform(Testdata['message'])

# Loading the prediction results for the negative reviews
neg_predictions = classifier.predict(negative_counts)
print(neg_predictions)

['neg' 'neg' 'neg' ..., 'neg' 'neg' 'neg']


### Results

In [11]:
collections.Counter(pos_predictions)

Counter({'neg': 633, 'pos': 1867})

#### Positive
Out of **2500** positive reviews, <br> 
**1867** were correctly predicted as **positive** and <br> 
the remaining **633** were misclassified as **negative**.

In [12]:
collections.Counter(neg_predictions)

Counter({'neg': 2229, 'pos': 271})

#### Negative
Out of **2500** negative reviews, <br> 
**2229** were correctly predicted as **negative** and <br> 
the remaining **271** were misclassified as **positive**.

## Confusion Matrix


To evaluate our movie review classifier, we use a confusion matrix.

### Expected Results Array
A combined array of the labels expected for the test data:<br>
2500 positive followed by 2500 negative reviews.

In [13]:
expectedpos = np.full((2500), 'pos')
expectedneg = np.full((2500), 'neg')
# concatenating to get a single array
expected = np.concatenate((expectedpos, expectedneg), axis=0)
expected

array(['pos', 'pos', 'pos', ..., 'neg', 'neg', 'neg'],
      dtype='<U3')

### Predicted Results Array
A combined array of values which were predicted by the model for the testing data.

In [14]:
predictions = np.concatenate((pos_predictions, neg_predictions), axis=0)
predictions

array(['pos', 'pos', 'pos', ..., 'neg', 'neg', 'neg'],
      dtype='<U3')

### Calculating Confusion Matrix
The confusion matrix is computed from the expected and predicted results.


In [15]:
# Rows are actual labels and columns are predicted labels, sorted as ['neg', 'pos']
CM = confusion_matrix(expected, predictions)

### Extracting TN,FN,FP,TP
Extracting the true negative, false negative, false positive, and true positive counts from the confusion matrix.

In [16]:
ActualPositive = 2500
ActualNegative = 2500
Total = ActualPositive + ActualNegative
TrueNegative = CM[0][0]
FalseNegative = CM[1][0]
FalsePositive = CM[0][1]
TruePositive = CM[1][1]
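
The indexing above relies on the matrix layout: rows are actual labels, columns are predicted labels, and labels are sorted, so `'neg'` comes before `'pos'`. As a sanity check, here is a small pure-Python sketch that rebuilds the same layout from the counts reported in the Results section. The helper `confusion_counts` is a hypothetical illustration, not part of scikit-learn.

```python
def confusion_counts(expected, predicted):
    """Build a confusion matrix with rows = actual and columns = predicted labels."""
    labels = sorted(set(expected) | set(predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, guess in zip(expected, predicted):
        matrix[index[actual]][index[guess]] += 1
    return labels, matrix

# Reconstruct the test-set outcome from the counts in the Results section:
# 2500 positive reviews (1867 predicted pos, 633 predicted neg), then
# 2500 negative reviews (2229 predicted neg, 271 predicted pos).
expected = ["pos"] * 2500 + ["neg"] * 2500
predicted = ["pos"] * 1867 + ["neg"] * 633 + ["neg"] * 2229 + ["pos"] * 271

labels, matrix = confusion_counts(expected, predicted)
(tn, fp), (fn, tp) = matrix

print(labels)  # ['neg', 'pos']
print(matrix)  # [[2229, 271], [633, 1867]]
```

With this layout, position `[0][0]` holds the true negatives and `[1][1]` the true positives, which matches the assignments in the cell above.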

### Various metrics for evaluating our model

#### Accuracy: 
Overall, how often is the classifier correct?

In [17]:
(TruePositive+TrueNegative)/Total

0.81920000000000004

#### Misclassification Rate (Error Rate) : 
Overall, how often is it wrong? 

In [18]:
(FalsePositive+FalseNegative)/Total

0.18079999999999999

#### True Positive Rate: 
When it's actually yes, how often does it predict yes?

In [19]:
TruePositive/ActualPositive

0.74680000000000002

#### False Positive Rate: 
When it's actually no, how often does it predict yes?

In [20]:
FalsePositive/ActualNegative

0.1084

#### True Negative Rate: 
When it's actually no, how often does it predict no?

In [21]:
TrueNegative/ActualNegative

0.89159999999999995

#### Precision: 
When it predicts yes, how often is it correct?

In [22]:
TruePositive/(FalsePositive+TruePositive)

0.8732460243217961

#### Prevalence: 
How often does the yes condition actually occur in our sample?

In [23]:
ActualPositive/Total

0.5
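
The metrics above can also be gathered into a single helper. This is a minimal sketch, assuming the four counts have already been extracted from the confusion matrix. The function name `evaluate` and the dictionary keys are made up for illustration; the numbers are the ones reported in the Results section.

```python
def evaluate(tp, fp, fn, tn):
    """Compute the evaluation metrics used in this notebook from the four counts."""
    total = tp + fp + fn + tn
    actual_positive = tp + fn
    actual_negative = fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_positive,
        "false_positive_rate": fp / actual_negative,
        "true_negative_rate": tn / actual_negative,
        "precision": tp / (tp + fp),
        "prevalence": actual_positive / total,
    }

metrics = evaluate(tp=1867, fp=271, fn=633, tn=2229)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Deriving the totals from the counts, rather than hard-coding 2500, keeps the helper usable for test sets of a different size.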

### Conclusion

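A sentiment analyzer was built using the Naive Bayes method to classify movie reviews from the IMDB website. On the 5000 test reviews it reached an accuracy of about 81.9%, and the confusion matrix shows that it recognizes negative reviews (89.2%) more reliably than positive ones (74.7%).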