## Naive Bayes Classifier For Inshorts News Data
This notebook uses the Naive Bayes classifier from scikit-learn. There are many methods for text classification, such as Naive Bayes, SVM, and LDA. This notebook walks through a simple implementation of a Naive Bayes classifier.

### Text Classification: Feature Representation
Most ML algorithms take numerical input, so for text classification we need a representation that maps words to numbers, which are then fed to the algorithm. The most popular representation is the __BOW__ (Bag of Words) model. In a bag-of-words model, each column represents a feature (a word), each row represents a document, and each cell holds the number of times that word occurs in that document.   
 __EX :__  
   __Doc1__ = "This line contains word1 and word2."  
   __Doc2__ = "This sentence contains word2 and word2."   
   
|      | this | line | contains | word1 | and | word2 | sentence |
|------|------|------|----------|-------|-----|-------|----------|
| Doc1 | 1    | 1    | 1        | 1     | 1   | 1     | 0        |
| Doc2 | 1    | 0    | 1        | 0     | 1   | 2     | 1        |
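
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch of the bag-of-words idea, not the `CountVectorizer` used later in the notebook; the `tokenize` helper and the first-seen column order are assumptions chosen so that the columns line up with the table.

```python
import re
from collections import Counter

docs = {
    "Doc1": "This line contains word1 and word2.",
    "Doc2": "This sentence contains word2 and word2.",
}

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the vocabulary in first-seen order so columns match the table.
vocab = []
for text in docs.values():
    for tok in tokenize(text):
        if tok not in vocab:
            vocab.append(tok)

# One row of word counts per document, one column per vocabulary word.
bow = {
    name: [Counter(tokenize(text))[word] for word in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, row in bow.items():
    print(name, row)
```

Each row is a fixed-length count vector, which is the numerical input the classifier needs.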

### Naive Bayes Model 
The Naive Bayes model is fast and often produces very good results for text classification tasks. There are 3 types of probability in the Naive Bayes formula.  
> 1. Prior probability (the class probability: how likely this class is to appear in the data.)  
> 2. Likelihood (the probability of the document's words given a class; this is what we estimate from the training data.)  
> 3. Posterior probability (calculated from the prior probability and the likelihood.)  

Note that the prior probability of the predictors (the evidence, *P*(D)) is ignored here because it is constant for all classes, so the posterior depends only on the prior and the likelihood.  
![Bayes' rule](Bayes_rule.png "Bayes' rule")  
We calculate the posterior probability of every class for a given document D and predict the class with the highest value.    
#### argmax<sub>i</sub>( *P*( C<sub>i</sub> | D ) )  
for i = 1, ..., m, where m = total number of classes
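
The rule above can be sketched in plain Python. The toy corpus, the function names, and the handling of unseen words are all assumptions made for illustration; the smoothing value `alpha=1.0` matches the default that `MultinomialNB` reports later in this notebook. Scores are summed in log space because a product of many small probabilities underflows quickly.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (not taken from the Inshorts data).
train = [
    ("engine car launch", "automobile"),
    ("car price launch", "automobile"),
    ("match team win", "sports"),
    ("team score match", "sports"),
    ("player win match", "sports"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    # log prior: fraction of training documents that belong to this class
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word in vocab:  # words never seen in training are skipped
            # Laplace-smoothed likelihood P(word | class)
            score += math.log(
                (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            )
    return score  # log P(C) + sum of log P(w | C), up to the constant log P(D)

def predict(text):
    # argmax over classes of the (unnormalized) log posterior
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("car launch price"))
print(predict("team win"))
```

Dropping log *P*(D) does not change the argmax, which is why the evidence term can be ignored.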

In [1]:
import os
import pandas as pd
import numpy as np
import re

In [4]:
# Read every CSV file in the Data directory and concatenate them into one DataFrame
fnames = os.listdir('Data')
frames = [pd.read_csv(os.path.join('Data', name_)) for name_ in fnames]
data = pd.concat(frames,axis=0,ignore_index=True)
data.head()

Unnamed: 0,ind,catg,headline,body,imageurl,readmoreurl,inblockId
0,1,automobile,Toyota gives increased comfort & safety featur...,Toyota's Etios and Liva have 14 Standard Safet...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/-_PqPwvecyE,dysqqgqf-1
1,2,automobile,Vodafone uses AR in PBL to promote new campaig...,Several pugs gatecrashed the finals of Vodafon...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/yY_MRk3KEhw,dysqqgqf-1
2,3,automobile,Second-Gen Audi Q5 launched at ₹53.25 Lakh,Audi has launched the all-new Q5 with a price ...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/EiJPF_aVbyo,dysqqgqf-1
3,4,automobile,World's fastest SUV Lamborghini Urus launched ...,Italian supercar maker Lamborghini on Thursday...,http://images.newsinshorts.com.edgesuite.net/a...,http://www.india.com/auto/car-news/lamborghini...,dysqqgqf-1
4,5,automobile,22000 Honda City units recalled in India over ...,"Japanese automaker Honda is recalling 22,084 u...",http://images.newsinshorts.com.edgesuite.net/a...,http://www.livemint.com/Companies/gR3h813hD6ex...,dysqqgqf-1


In [7]:
# Drop the columns we do not need; only the category (catg) and the article body remain
data = data.drop(['imageurl','headline','ind','readmoreurl','inblockId'],axis=1)

In [8]:
data.head()

Unnamed: 0,catg,body
0,automobile,Toyota's Etios and Liva have 14 Standard Safet...
1,automobile,Several pugs gatecrashed the finals of Vodafon...
2,automobile,Audi has launched the all-new Q5 with a price ...
3,automobile,Italian supercar maker Lamborghini on Thursday...
4,automobile,"Japanese automaker Honda is recalling 22,084 u..."


In [23]:
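# Remove punctuation: keep only ASCII letters and spaces (this also strips digits)
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus to be downloaded
sw = set(stopwords.words('english'))
# Clean the stop words the same way so they match the cleaned text (e.g. "don't" becomes "dont")
sw = [(re.sub('[^A-Za-z ]','',x)) for x in sw]
data['body'] = data['body'].apply(lambda x:re.sub('[^A-Za-z ]','',x))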

In [24]:
data.head()

Unnamed: 0,catg,body
0,automobile,Toyotas Etios and Liva have Standard Safety f...
1,automobile,Several pugs gatecrashed the finals of Vodafon...
2,automobile,Audi has launched the allnew Q with a price ra...
3,automobile,Italian supercar maker Lamborghini on Thursday...
4,automobile,Japanese automaker Honda is recalling units o...


In [25]:
# Split the data into two sets: one for training and one for testing.
# sklearn's train_test_split shuffles the data before splitting; a fixed random_state makes the shuffle reproducible, so different models can be compared on the same split.

from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(data['body'],data['catg'],test_size=0.35,random_state=42)
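
Why a fixed seed gives the same split on every run can be shown with a small standard-library sketch. This is not scikit-learn's implementation; `seeded_split` is a hypothetical helper, and its rounding of the test size is an assumption.

```python
import random

def seeded_split(items, test_size=0.35, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle order
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train_a, test_a = seeded_split(range(20))
train_b, test_b = seeded_split(range(20))

print(test_a == test_b)           # identical test sets across calls
print(len(train_a), len(test_a))  # sizes of the two parts
```

With a different seed the shuffle order changes, so scores from two models would no longer be measured on the same test documents.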

In [29]:
# Since we are using a Naive Bayes classifier, we use raw word counts as features rather than tf-idf (multinomial Naive Bayes is defined over counts).
# sklearn provides a simple implementation of the BOW model: CountVectorizer. Here it counts 1- to 3-word n-grams and keeps the 20000 most frequent features.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = sw,ngram_range = (1,3),max_features = 20000)
xtrain_vec = cv.fit_transform(xtrain)

In [30]:
# Transform the test documents using the vocabulary learned from the training data
xtest_vec = cv.transform(xtest)

In [34]:
# Create a new Naive Bayes classifier named nb_clf and fit it on the training data; fit() estimates the class priors and word likelihoods.
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(xtrain_vec,ytrain)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
print(nb_clf.score(xtest_vec,ytest))

0.730530339226


In [36]:
test_output = nb_clf.predict(xtest_vec)

In [68]:
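def find_wrong():
    # Collect the 0-based positions in the test set where the prediction differs from the true label
    wrong_arr = []
    for i, actual in enumerate(ytest):
        if test_output[i] != actual:
            wrong_arr.append(i)
    return wrong_arr
wrong_arr = find_wrong()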

In [80]:
len(wrong_arr)

1128

In [82]:
len(ytest)

4186
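
As a quick sanity check, the two counts above should agree with the accuracy printed by `nb_clf.score`. The snippet below only redoes that arithmetic with the numbers shown in this notebook.

```python
n_test = 4186   # len(ytest), from the output above
n_wrong = 1128  # len(wrong_arr), from the output above

# Accuracy is the fraction of test documents that were classified correctly.
accuracy = (n_test - n_wrong) / n_test
print(accuracy)
```

The result matches the score of roughly 0.7305 reported earlier, so `wrong_arr` covers every misclassified test document.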

In [59]:
print(nb_clf.classes_)

['automobile' 'business' 'entertainment' 'hatke' 'miscellaneous' 'national'
 'politics' 'science' 'sports' 'startup' 'technology' 'world']


In [89]:
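# Inspect one misclassified test document: the prediction, the true label, the text, and the class probabilities
i=4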
print(wrong_arr[i])
print("My output :",test_output[wrong_arr[i]],"Actual output :",ytest.iloc[wrong_arr[i]])
print(xtest.iloc[wrong_arr[i]])
print(nb_clf.predict_proba(xtest_vec[wrong_arr[i]]))
print(nb_clf.predict_log_proba(xtest_vec[wrong_arr[i]]))

32
My output : startup Actual output : automobile
Online automobile classifieds portal CarTrade on Wednesday announced that it has raised about  crore  million in a funding round led by Temasek Holdings and a USbased family office Temasek which is a governmentowned investment firm from Singapore had led a  crore funding round in CarTrade in January  CarTrade also provides automobile reviews onroad prices and comparisons
[[  3.91049692e-23   2.50074669e-28   1.85836050e-36   4.29028609e-37
    9.58721828e-35   4.20270658e-38   1.21649822e-39   8.09756242e-39
    1.09065133e-39   1.00000000e+00   7.31365410e-22   1.68756182e-37]]
[[-51.59579268 -63.55579324 -82.2733687  -83.73929502 -78.33004747
  -86.06250479 -89.60484221 -87.70925555 -89.71404356   0.         -48.66712902
  -84.67236366]]


In [90]:
print(nb_clf.predict(cv.transform([xtest.iloc[wrong_arr[i]]])))
print(nb_clf.predict_proba(cv.transform([xtest.iloc[wrong_arr[i]]])))
print(nb_clf.predict_log_proba(cv.transform([xtest.iloc[wrong_arr[i]]])))

['startup']
[[  3.91049692e-23   2.50074669e-28   1.85836050e-36   4.29028609e-37
    9.58721828e-35   4.20270658e-38   1.21649822e-39   8.09756242e-39
    1.09065133e-39   1.00000000e+00   7.31365410e-22   1.68756182e-37]]
[[-51.59579268 -63.55579324 -82.2733687  -83.73929502 -78.33004747
  -86.06250479 -89.60484221 -87.70925555 -89.71404356   0.         -48.66712902
  -84.67236366]]
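
The two arrays above carry the same information: each probability is the exponential of the corresponding log-probability. The sketch below checks this for three of the classes, using values copied from the output above; the dictionary is only an illustration device.

```python
import math

# Log-probabilities copied from the predict_log_proba output above.
log_proba = {
    "automobile": -51.59579268,
    "startup": 0.0,
    "technology": -48.66712902,
}

# Exponentiating recovers the values printed by predict_proba.
proba = {label: math.exp(lp) for label, lp in log_proba.items()}
best = max(log_proba, key=log_proba.get)

for label in log_proba:
    print(label, log_proba[label], proba[label])
print("predicted:", best)
```

The log values are easier to compare: they show that the true class, automobile, is ranked below technology for this document, which is hard to see from probabilities as small as 1e-22.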


In [44]:
xtest.head()

9269     USbased startup Roman has launched a cloud pha...
10633    Photosharing app Snapchat went down for at lea...
1142     Diesel price touched a record high of  per lit...
9562     nChinabased ecommerce giant Alibabas Founder J...
7559     UKbased scientists have discovered trace fossi...
Name: body, dtype: object

In [99]:
# Confusion matrix: rows are the true categories (ytest), columns are the predicted categories
conf = pd.crosstab(ytest,test_output)

In [100]:
conf.head(n=12)

catg,automobile,business,entertainment,hatke,miscellaneous,national,politics,science,sports,startup,technology,world
automobile,274,13,2,8,0,1,1,0,1,17,12,2
business,11,288,3,11,1,12,4,1,0,15,27,11
entertainment,0,2,302,14,4,27,3,0,15,0,1,15
hatke,2,7,4,249,17,19,3,12,6,3,14,37
miscellaneous,2,3,6,82,61,3,1,32,2,2,4,6
national,2,31,18,24,1,211,61,3,3,2,2,28
politics,0,2,5,1,0,37,228,0,0,0,0,2
science,0,2,1,5,0,4,1,343,1,3,3,4
sports,0,3,17,9,2,10,1,0,335,0,1,7
startup,30,21,2,4,1,3,0,7,0,272,44,1
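
Each row of the confusion matrix gives the recall of one class: the diagonal entry divided by the row total. The sketch below does this for three rows copied from the table above; the helper names `recall` and `top_confusion` are hypothetical.

```python
classes = ["automobile", "business", "entertainment", "hatke", "miscellaneous",
           "national", "politics", "science", "sports", "startup",
           "technology", "world"]

# Rows copied from the crosstab above: true class -> counts per predicted class.
rows = {
    "automobile":    [274, 13, 2, 8, 0, 1, 1, 0, 1, 17, 12, 2],
    "miscellaneous": [2, 3, 6, 82, 61, 3, 1, 32, 2, 2, 4, 6],
    "science":       [0, 2, 1, 5, 0, 4, 1, 343, 1, 3, 3, 4],
}

def recall(label):
    # Correct predictions for this class divided by all documents of this class.
    counts = rows[label]
    return counts[classes.index(label)] / sum(counts)

def top_confusion(label):
    # The wrong class that this class is most often predicted as.
    counts = rows[label]
    others = [(n, c) for c, n in zip(classes, counts) if c != label]
    return max(others)[1]

for label in rows:
    print(label, round(recall(label), 3), "most confused with", top_confusion(label))
```

The miscellaneous row stands out: more of its documents are predicted as hatke than as miscellaneous, which accounts for a large share of the overall error.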


In [95]:
pd.Series(test_output).value_counts()

hatke            445
science          433
business         409
world            370
technology       367
sports           366
entertainment    363
automobile       345
national         345
startup          343
politics         306
miscellaneous     94
dtype: int64

In [102]:
pd.Series(ytest).value_counts()

national         386
startup          385
sports           385
business         384
entertainment    383
hatke            373
science          367
technology       365
world            348
automobile       331
politics         275
miscellaneous    204
Name: catg, dtype: int64