## Naive Bayes Classifier For Inshorts News Data
In this notebook we use the Naive Bayes classifier from scikit-learn. There are many methods for text classification, such as Naive Bayes, SVM, LDA and many more. This notebook walks through a simple implementation of a Naive Bayes classifier.

### Text Classification: Feature Representation
Most ML algorithms take numerical input, so for text classification we need a representation that maps words to numbers, which are then fed to the algorithm. The most popular representation is the __BOW__ (Bag of Words) model. In a bag-of-words model each column represents a feature (a word), each row represents a document, and each cell holds the number of times that word occurs in that document.   
 __EX :__  
   __Doc1__ = "This line contains word1 and word2."  
   __Doc2__ = "This sentence contains word2 and word2."   
   
|      | this | line | contains | word1 | and | word2 | sentence |
|------|------|------|----------|-------|-----|-------|----------|
| Doc1 | 1    | 1    | 1        | 1     | 1   | 1     | 0        |
| Doc2 | 1    | 0    | 1        | 0     | 1   | 2     | 1        |
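As a rough illustration, the table above can be reproduced with the standard library alone. The `bag_of_words` helper and its simple tokenizer (lowercase, runs of letters and digits) are assumptions made for this sketch; they are not how `CountVectorizer` is implemented.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Hypothetical tokenizer: lowercase, keep runs of letters and digits.
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
    # Vocabulary in order of first appearance, to match the table above.
    vocabulary = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocabulary:
                vocabulary.append(tok)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows

docs = [
    "This line contains word1 and word2.",
    "This sentence contains word2 and word2.",
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for name, row in zip(["Doc1", "Doc2"], rows):
    print(name, row)
```

Each printed row should match the corresponding row of the table, including the count of 2 for `word2` in Doc2.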

### Naive Bayes Model 
The Naive Bayes model is fast and often produces good results on text classification tasks. Bayes' formula involves three kinds of probability.  
>1. Prior probability (the class probability: how likely the class is to appear in the data).  
2. Likelihood (the probability of the document's words given the class; this is what we estimate from the training data).  
3. Posterior probability (computed from the prior and the likelihood).  

Note that the prior probability of the predictors (the evidence) is ignored here because it is constant across classes, so the posterior depends only on the class prior and the likelihood.  
![Bayes' rule](Bayes_rule.png "Bayes' rule")  
We calculate the posterior probability of every class for a given document D and predict the class with the highest value.    
#### argmax<sub>i</sub>( *P*( C<sub>i</sub> | D) )  
for i = 1, ..., m, where m is the total number of classes.
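The decision rule above can be sketched in a few lines. The sketch works in log space (sums of logs instead of products of small probabilities) and drops the evidence term, since it is the same for every class. The `predict` helper, the two class names and all probabilities are made up for illustration only.

```python
import math

def predict(doc_tokens, log_prior, log_likelihood):
    """Return the class with the highest log P(C) + sum of log P(w | C)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for tok in doc_tokens:
            # Words outside the vocabulary are skipped in this sketch.
            if tok in log_likelihood[c]:
                score += log_likelihood[c][tok]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Made-up numbers for two classes, for illustration only.
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihood = {
    "sports": {"match": math.log(0.6), "vote": math.log(0.1), "team": math.log(0.3)},
    "politics": {"match": math.log(0.1), "vote": math.log(0.7), "team": math.log(0.2)},
}

label, scores = predict(["team", "match", "match"], log_prior, log_likelihood)
print(label, scores)
```

With these numbers the unnormalised posterior is 0.5 × 0.3 × 0.6 × 0.6 for `sports` and 0.5 × 0.2 × 0.1 × 0.1 for `politics`, so `sports` wins.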

### Reading And Cleaning Data

In [284]:
import os
import pandas as pd
import numpy as np
import re

In [285]:
# Read every CSV file in the Data directory and combine them into one pandas DataFrame.
fnames = os.listdir('Data')
frames = [pd.read_csv(os.path.join('Data', name_)) for name_ in fnames]
data = pd.concat(frames, axis=0, ignore_index=True)
data.head(n=3)

|   | ind | catg | headline | body | imageurl | readmoreurl | inblockId |
|---|-----|------|----------|------|----------|-------------|-----------|
| 0 | 1 | automobile | Toyota gives increased comfort & safety featur... | Toyota's Etios and Liva have 14 Standard Safet... | http://images.newsinshorts.com.edgesuite.net/a... | https://youtu.be/-_PqPwvecyE | dysqqgqf-1 |
| 1 | 2 | automobile | Vodafone uses AR in PBL to promote new campaig... | Several pugs gatecrashed the finals of Vodafon... | http://images.newsinshorts.com.edgesuite.net/a... | https://youtu.be/yY_MRk3KEhw | dysqqgqf-1 |
| 2 | 3 | automobile | Second-Gen Audi Q5 launched at ₹53.25 Lakh | Audi has launched the all-new Q5 with a price ... | http://images.newsinshorts.com.edgesuite.net/a... | https://youtu.be/EiJPF_aVbyo | dysqqgqf-1 |


In [286]:
# Data shape and number of articles per category
print("Data size : ",data.shape)
data['catg'].value_counts()

Data size :  (9219, 7)


business         1076
world            1075
science          1075
national         1074
entertainment    1066
sports           1065
technology       1063
automobile        949
politics          776
Name: catg, dtype: int64

In [287]:
# Drop the columns that are not needed for classification
data = data.drop(['imageurl','headline','ind','readmoreurl','inblockId'],axis=1)
data.head()

|   | catg | body |
|---|------|------|
| 0 | automobile | Toyota's Etios and Liva have 14 Standard Safet... |
| 1 | automobile | Several pugs gatecrashed the finals of Vodafon... |
| 2 | automobile | Audi has launched the all-new Q5 with a price ... |
| 3 | automobile | Italian supercar maker Lamborghini on Thursday... |
| 4 | automobile | Japanese automaker Honda is recalling 22,084 u... |


In [288]:
# Remove punctuation and numbers: keep only ASCII letters and spaces.
def remove_punctuation(s):
    return re.sub('[^A-Za-z ]', '', s)
data['body'] = data['body'].apply(remove_punctuation)
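A quick check of what this pattern does to a string. Characters are deleted rather than replaced by a space, so words joined by an apostrophe or hyphen are fused; this is why the training rows shown further down contain tokens such as "NASAs" and "Xray". The `replace_punctuation` variant and the sample sentence are hypothetical, included only for comparison.

```python
import re

def remove_punctuation(s):
    # Same pattern as the cell above: keep only ASCII letters and spaces.
    return re.sub('[^A-Za-z ]', '', s)

def replace_punctuation(s):
    # Hypothetical variant: put a space where a character is removed,
    # then collapse repeated whitespace.
    return re.sub(r'\s+', ' ', re.sub('[^A-Za-z ]', ' ', s)).strip()

sample = "NASA's X-ray image, priced at 53.25 lakh"
print(repr(remove_punctuation(sample)))
print(repr(replace_punctuation(sample)))
```

Whether fused tokens help or hurt the classifier is an empirical question; the variant is only a sketch of one alternative.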

### Splitting Data into Train and Test Sets

In [289]:
# Split the data into two sets: one used for training and one used for testing.
# sklearn's train_test_split shuffles the data before splitting; the random_state argument makes the shuffle reproducible, so different models can be compared on the same split.

from sklearn.model_selection import train_test_split
splitter = 0.35
train,test = train_test_split(data,test_size=splitter,random_state=42)
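As a sanity check on the split sizes, the arithmetic can be done by hand. The sketch assumes that the test count is the requested fraction of the rows, rounded up, and that the training set receives the remainder; this agrees with the 3227 test rows reported in the confusion matrix margins later in the notebook.

```python
import math

n_samples = 9219      # rows reported by data.shape above
test_fraction = 0.35  # the `splitter` value

# Assumption: the test count is the fraction of rows, rounded up,
# and the training set receives the remainder.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_test, n_train)
```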

In [290]:
# Reset the index of both DataFrames to a 0-based index

train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [291]:
train.head()

|   | catg | body |
|---|------|------|
| 0 | automobile | Technology giant Apple is using the same model... |
| 1 | science | A new image from NASAs Chandra Xray Observator... |
| 2 | entertainment | The Mark Hamill late actress Carrie Fisher and... |
| 3 | automobile | Israeli startup StoreDot has created a battery... |
| 4 | world | Facebook has agreed to investigate the spread ... |


### Implementing CountVectorizer model from sklearn.

In [292]:
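from sklearn.feature_extraction.text import CountVectorizer
# Count 1- to 4-word n-grams, drop English stop words, and keep at most 20000 features.
cv = CountVectorizer(stop_words='english', ngram_range=(1, 4), max_features=20000)
# Learn the vocabulary from the training text and build the training count matrix.
train_cv = cv.fit_transform(train['body'])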
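With `ngram_range=(1,4)`, every run of one to four consecutive tokens becomes a candidate feature. A minimal sketch of that expansion follows; the `word_ngrams` helper and the token list are hypothetical and only illustrate the idea, not the library's internals.

```python
def word_ngrams(tokens, min_n=1, max_n=4):
    """Return all contiguous n-grams of the token list for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hypothetical token list, assumed to be already lowercased and stop-word filtered.
tokens = ["audi", "launched", "new", "q5"]
grams = word_ngrams(tokens)
for gram in grams:
    print(gram)
```

Four tokens already yield ten candidate features (4 unigrams, 3 bigrams, 2 trigrams, 1 four-gram), which is why a cap such as `max_features=20000` is useful on a full corpus.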

In [293]:
# Transform the test data into a count matrix using the vocabulary learned from the training data.
test_cv = cv.transform(test['body'])
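The reason the test set goes through `transform` rather than `fit_transform` is that the columns must stay the same as in training: tokens that never occurred in the training text have no column and are dropped. A small standard-library sketch of that behaviour, with hypothetical helper names and whitespace tokenization as a simplifying assumption:

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count only tokens already in the vocabulary; unknown tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows

train_docs = ["audi launched car", "team won match"]
test_docs = ["audi won award"]  # "award" never appears in training

vocabulary = fit_vocabulary(train_docs)
test_rows = transform(test_docs, vocabulary)
print(vocabulary)
print(test_rows)
```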

### Implementing MultinomialNB using sklearn.

In [294]:
# Create a new Naive Bayes classifier named nb_cv_clf and fit it on the training data to estimate the class and word probabilities.
from sklearn.naive_bayes import MultinomialNB
nb_cv_clf = MultinomialNB(alpha=0.01)
nb_cv_clf.fit(train_cv,train['catg'])

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
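The `alpha` parameter is the additive (Laplace/Lidstone) smoothing constant: it is added to every word count so that a word never seen with a class does not get zero probability. A minimal sketch of the smoothed estimate, using made-up counts; the `smoothed_likelihood` helper is hypothetical.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts: a class with 1000 tokens over a 20000-term vocabulary.
class_total, vocab_size = 1000, 20000

for alpha in (1.0, 0.01):
    unseen = smoothed_likelihood(0, class_total, vocab_size, alpha)
    seen = smoothed_likelihood(10, class_total, vocab_size, alpha)
    print(alpha, unseen, seen, seen / unseen)
```

A small `alpha` such as 0.01 keeps unseen words possible while letting observed words dominate far more strongly than `alpha=1.0` would.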

### Testing the Model on the Test Dataset and Analysing Misclassified Examples

In [295]:
print("Test Accuracy : ",nb_cv_clf.score(test_cv,test['catg'])*100,"%")
print("Train Accuracy : ",nb_cv_clf.score(train_cv,train['catg'])*100,"%")

Test Accuracy :  78.3700030989 %
Train Accuracy :  93.7082777036 %


In [296]:
test_cv_output = nb_cv_clf.predict(test_cv)
train_cv_output = nb_cv_clf.predict(train_cv)

In [297]:
print("Confusion Matrix for test data.")
pd.crosstab(test['catg'],test_cv_output,rownames=['true'],colnames=['predicted'],margins=True)

Confusion Matrix for test data.


| true \ predicted | automobile | business | entertainment | national | politics | science | sports | technology | world | All |
|------------------|------------|----------|---------------|----------|----------|---------|--------|------------|-------|------|
| automobile | 292 | 14 | 3 | 6 | 0 | 4 | 1 | 24 | 1 | 345 |
| business | 16 | 296 | 4 | 19 | 8 | 3 | 0 | 37 | 14 | 397 |
| entertainment | 0 | 1 | 303 | 39 | 5 | 0 | 17 | 3 | 14 | 382 |
| national | 3 | 23 | 17 | 227 | 49 | 6 | 2 | 4 | 32 | 363 |
| politics | 0 | 3 | 4 | 40 | 218 | 0 | 0 | 1 | 0 | 266 |
| science | 4 | 3 | 1 | 4 | 0 | 338 | 1 | 9 | 6 | 366 |
| sports | 3 | 1 | 24 | 10 | 3 | 1 | 303 | 2 | 13 | 360 |
| technology | 25 | 39 | 3 | 4 | 1 | 21 | 0 | 263 | 13 | 369 |
| world | 4 | 15 | 4 | 37 | 0 | 11 | 1 | 18 | 289 | 379 |
| All | 347 | 395 | 363 | 386 | 284 | 384 | 325 | 361 | 382 | 3227 |
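The confusion matrix contains everything needed to recompute the test accuracy and to break it down per class. The sketch below copies the counts from the matrix above (without the `All` margins) and derives accuracy, precision and recall with plain Python; the variable names are chosen for this example.

```python
labels = ["automobile", "business", "entertainment", "national", "politics",
          "science", "sports", "technology", "world"]
# Rows are true classes, columns are predicted classes (copied from the matrix above).
confusion = [
    [292, 14, 3, 6, 0, 4, 1, 24, 1],
    [16, 296, 4, 19, 8, 3, 0, 37, 14],
    [0, 1, 303, 39, 5, 0, 17, 3, 14],
    [3, 23, 17, 227, 49, 6, 2, 4, 32],
    [0, 3, 4, 40, 218, 0, 0, 1, 0],
    [4, 3, 1, 4, 0, 338, 1, 9, 6],
    [3, 1, 24, 10, 3, 1, 303, 2, 13],
    [25, 39, 3, 4, 1, 21, 0, 263, 13],
    [4, 15, 4, 37, 0, 11, 1, 18, 289],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
print("accuracy:", correct / total)
print("misclassified:", total - correct)

for i, label in enumerate(labels):
    recall = confusion[i][i] / sum(confusion[i])
    precision = confusion[i][i] / sum(row[i] for row in confusion)
    print(f"{label:14s} precision={precision:.3f} recall={recall:.3f}")
```

The diagonal sums to 2529 of 3227 documents, which matches the reported test accuracy. The largest off-diagonal entries are between `national` and `politics` (49 and 40 documents), which is where most of the confusion lies.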


In [298]:
# Collect (row index, predicted label, true label) for every misclassified test example.
missclassified_id_cv = [
    (i, pred, true)
    for i, (pred, true) in enumerate(zip(test_cv_output, test['catg']))
    if pred != true
]
print("Number of misclassified examples : ", len(missclassified_id_cv))

Number of misclassified examples :  698


### Using TF-IDF Features for Naive Bayes with TfidfTransformer from sklearn

In [299]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer()
# Learn the IDF weights from the training count matrix only.
tf_transformer.fit(train_cv)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [300]:
train_tfidf = tf_transformer.transform(train_cv)
test_tfidf = tf_transformer.transform(test_cv)
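A minimal sketch of the weighting that the settings printed above (`norm='l2'`, `smooth_idf=True`, `use_idf=True`, `sublinear_tf=False`) correspond to, following the formula given in the scikit-learn documentation: raw counts are multiplied by `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit L2 length. The `tfidf` helper is hypothetical and is applied here to the two-document count table from the bag-of-words section.

```python
import math

def tfidf(rows):
    """Smoothed-IDF weighting followed by L2 row normalisation."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[count * w for count, w in zip(row, idf)] for row in rows]
    normalised = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        normalised.append([v / norm if norm else 0.0 for v in row])
    return idf, normalised

# Count rows for Doc1 and Doc2 from the bag-of-words table.
# Columns: this, line, contains, word1, and, word2, sentence
counts = [
    [1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 2, 1],
]
idf, weights = tfidf(counts)
print([round(v, 3) for v in idf])
for row in weights:
    print([round(v, 3) for v in row])
```

Terms that occur in both documents get an IDF of exactly 1, while terms that occur in only one document are weighted up, so document-specific words contribute more to each row.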

### Naive bayes with TF-IDF feature Vector

In [301]:
nb_tfidf_clf = MultinomialNB(alpha = 0.01)
nb_tfidf_clf.fit(train_tfidf,train['catg'])

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

### Testing the TF-IDF Naive Bayes Model and Analysing Misclassified Examples

In [302]:
print("Test Accuracy : ",nb_tfidf_clf.score(test_tfidf,test['catg'])*100,"%")
print("Train Accuracy : ",nb_tfidf_clf.score(train_tfidf,train['catg'])*100,"%")

Test Accuracy :  78.4939572358 %
Train Accuracy :  93.5747663551 %


In [303]:
test_tfidf_output = nb_tfidf_clf.predict(test_tfidf)
train_tfidf_output = nb_tfidf_clf.predict(train_tfidf)

In [304]:
print("Confusion Matrix for test data.")
pd.crosstab(test['catg'],test_tfidf_output,rownames=['true'],colnames=['predicted'],margins=True)

Confusion Matrix for test data.


| true \ predicted | automobile | business | entertainment | national | politics | science | sports | technology | world | All |
|------------------|------------|----------|---------------|----------|----------|---------|--------|------------|-------|------|
| automobile | 296 | 12 | 3 | 6 | 0 | 3 | 1 | 23 | 1 | 345 |
| business | 16 | 296 | 4 | 17 | 8 | 3 | 0 | 39 | 14 | 397 |
| entertainment | 1 | 1 | 301 | 38 | 6 | 0 | 17 | 2 | 16 | 382 |
| national | 4 | 25 | 17 | 221 | 52 | 6 | 2 | 4 | 32 | 363 |
| politics | 1 | 3 | 4 | 40 | 218 | 0 | 0 | 0 | 0 | 266 |
| science | 4 | 3 | 0 | 3 | 0 | 341 | 2 | 7 | 6 | 366 |
| sports | 1 | 1 | 24 | 8 | 3 | 1 | 309 | 1 | 12 | 360 |
| technology | 24 | 39 | 4 | 3 | 1 | 23 | 0 | 264 | 11 | 369 |
| world | 3 | 16 | 4 | 38 | 0 | 12 | 1 | 18 | 287 | 379 |
| All | 350 | 396 | 361 | 374 | 288 | 389 | 332 | 358 | 379 | 3227 |
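To compare the two feature representations, the diagonals of the two confusion matrices can be placed side by side. The numbers below are copied from the count-based and TF-IDF matrices in this notebook; the variable names are chosen for this sketch.

```python
labels = ["automobile", "business", "entertainment", "national", "politics",
          "science", "sports", "technology", "world"]
# Correctly classified test documents per class (matrix diagonals).
correct_counts = [292, 296, 303, 227, 218, 338, 303, 263, 289]
correct_tfidf = [296, 296, 301, 221, 218, 341, 309, 264, 287]
n_test = 3227

for label, a, b in zip(labels, correct_counts, correct_tfidf):
    print(f"{label:14s} counts={a:4d} tfidf={b:4d} change={b - a:+d}")

net_change = sum(correct_tfidf) - sum(correct_counts)
print("net change:", net_change, "documents")
print("accuracy change:", 100 * net_change / n_test, "percentage points")
```

The net difference is only four documents out of 3227, so on this split the two representations perform almost identically, with gains in some classes offset by losses in others.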
