#Naive Bayes Tutorial

This notebook is used for explaining the steps involved in creating a Naive Bayes model 

1. Import the required libraries
2. Connect your google drive with Colab
3. Read the IMDB Dataset 
4. Observe the IMDB Dataset
5. Text Preprocessing 
6. Creating features and target sets
7. Splitting the data in train and test sets
8. Creating the model pipeline
9. Training the model
10. Making predictions on test set
10. Printing the classification report

## Note:- Before running this notebook, make sure you ran the Load Datasets notebook given in the website. 

##Import the required libraries

In [4]:
import matplotlib.pyplot as plt       # This library is used to plot curves
import pandas as pd                   # This library is used for data analysis
import numpy as np                     #This library is used for working with arrays and performing various linear algebra operations
from sklearn.model_selection import train_test_split  #This library is used for performing test_train_splits on the data         


import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import csv
import string
import random
import sklearn
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Connect your google drive with Colab

In [5]:
from google.colab import drive                #This library is used for mounting drive

In [6]:
drive.mount('/content/gdrive')                       #Mount google drive

Mounted at /content/gdrive


## Read the IMDB Dataset 

In [7]:
train = pd.read_csv("/content/gdrive/MyDrive/trainData.tsv", delimiter="\t")

## Display the IMDB Dataset 

In [8]:
train

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the ...


## Text Preprocessing

In [9]:
punc_list = string.punctuation
list_of_stopwords = stopwords.words('english')

In [10]:
def textProcessing(text):
    text_withoutPunctuation = []
    for i in text:
      if(i not in punc_list):
        text_withoutPunctuation.append(i)
    
    text_withoutPunctuation = ''.join(text_withoutPunctuation)
    list_of_words = text_withoutPunctuation.split()
    list_of_processed_words =[]

    for w in list_of_words:
      if(w not in list_of_stopwords):
        lowerCase_word = w.lower()
        list_of_processed_words.append(lowerCase_word) 
    return list_of_processed_words

## Creating features and target sets

In [11]:
y = train['sentiment']

In [12]:
train = train.drop('sentiment',1)
X = train.drop('id',1)

  """Entry point for launching an IPython kernel.
  


In [13]:
X = np.array(X)
y = np.array(y)

In [14]:
y.shape

(25000,)

In [15]:
X.shape

(25000, 1)

In [16]:
y

array([1, 1, 0, ..., 0, 0, 1])

## Splitting the data in train and test sets

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

## Creating the model pipeline

In [18]:
model = Pipeline([('bow', CountVectorizer(analyzer=textProcessing)),  ('tfidf', TfidfTransformer()),  ('classifier', MultinomialNB()), ])

## Training the model

In [19]:
model.fit(X_train,y_train)

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function textProcessing at 0x7fe90bbb7170>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])

## Making predictions on test set

In [20]:
y_pred = model.predict(X_test)

In [21]:
y_pred

array([1, 0, 1, ..., 1, 1, 1])

## Printing the classification report

In [22]:
print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

           0       0.87      0.89      0.88      2502
           1       0.88      0.86      0.87      2498

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000

