## Naive Bayes Classifier For Inshorts News Data
This notebook uses the Naive Bayes classifier from scikit-learn. There are many methods for text classification, such as Naive Bayes, SVM, and LDA. This notebook walks through a simple implementation of a Naive Bayes classifier.

### Text Classification: Feature Representation
Most ML algorithms take numerical input. For text classification we therefore need a representation that maps words to numbers, which are then fed to the algorithm. The most popular representation is the __BOW__ (Bag of Words) model, in which each column represents a feature (a word), each row represents a document, and each cell holds the number of times that word occurs in that document.  
 __Example:__  
   __Doc1__ = "This line contains word1 and word2."  
   __Doc2__ = "This sentence contains word2 and word2."  
   
|      | this | line | contains | word1 | and | word2 | sentence |
|------|------|------|----------|-------|-----|-------|----------|
| Doc1 | 1    | 1    | 1        | 1     | 1   | 1     | 0        |
| Doc2 | 1    | 0    | 1        | 0     | 1   | 2     | 1        |
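
The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the `tokenize` helper and its lowercase, letters-and-digits rule are assumptions made for illustration, and scikit-learn's `CountVectorizer` (used later in this notebook) applies its own tokenization and orders its vocabulary differently.

```python
import re
from collections import Counter

docs = {
    "Doc1": "This line contains word1 and word2.",
    "Doc2": "This sentence contains word2 and word2.",
}

def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters/digits, drop punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the vocabulary in order of first appearance across the documents.
vocab = []
for text in docs.values():
    for word in tokenize(text):
        if word not in vocab:
            vocab.append(word)

# One row of counts per document, one column per vocabulary word.
bow = {}
for name, text in docs.items():
    counts = Counter(tokenize(text))
    bow[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in bow.items():
    print(name, row)
```

Each printed row lines up with the corresponding row of the table; for example, Doc2 has a count of 2 in the `word2` column.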

### Naive Bayes Model 
The Naive Bayes model is fast and often produces very good results for text classification tasks. There are three kinds of probability in the Naive Bayes formula.  
> 1. Prior probability: the class probability, i.e. how likely the class is to appear in the data.  
> 2. Likelihood: the probability of the document's words given the class, estimated from the training data.  
> 3. Posterior probability: the probability of the class given the document, computed from the prior and the likelihood.  

Note that the prior probability of the predictors (the evidence, *P*(D)) is ignored because it is constant across classes, so the posterior depends only on the prior and the likelihood.  
![Bayes rule](Bayes_rule.png "Bayes rule")  
We calculate the posterior probability of every class for a given document D and predict the class with the highest posterior:    
#### argmax<sub>i</sub> *P*(C<sub>i</sub> | D)  
for i = 1, ..., m, where m is the total number of classes.
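
To make the argmax concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier. It is not the scikit-learn implementation: the toy training set, the labels, and the helper names `log_posterior` and `predict` are made up for illustration, and it assumes add-one (Laplace) smoothing and log probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

# Toy training set (made up for illustration, not taken from the Inshorts data).
train = [
    ("sports", "team wins match"),
    ("sports", "team loses match"),
    ("tech", "new phone launch"),
    ("tech", "phone team ships update"),
]

class_docs = Counter(label for label, _ in train)   # documents per class
word_counts = defaultdict(Counter)                  # word counts per class
for label, text in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    # log P(C) + sum of log P(w | C) with add-alpha smoothing.
    # The evidence P(D) is dropped because it is the same for every class.
    log_prior = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    log_likelihood = sum(
        math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        for w in text.split()
        if w in vocab  # words never seen in training are skipped
    )
    return log_prior + log_likelihood

def predict(text):
    # argmax over classes of the (unnormalized) log posterior.
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("team match"))
print(predict("phone update"))
```

With this toy data the first document is assigned to `sports` and the second to `tech`, because those classes give the respective words the higher smoothed likelihood while the priors are equal.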

In [1]:
import os
import pandas as pd
import numpy as np
import re

In [4]:
# Read every CSV file in the Data directory; the frames are concatenated below.
fnames = os.listdir('Data')
frames = [pd.read_csv(os.path.join('Data', name_)) for name_ in fnames]
data = pd.concat(frames,axis=0,ignore_index=True)
data.head()

Unnamed: 0,ind,catg,headline,body,imageurl,readmoreurl,inblockId
0,1,automobile,Toyota gives increased comfort & safety featur...,Toyota's Etios and Liva have 14 Standard Safet...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/-_PqPwvecyE,dysqqgqf-1
1,2,automobile,Vodafone uses AR in PBL to promote new campaig...,Several pugs gatecrashed the finals of Vodafon...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/yY_MRk3KEhw,dysqqgqf-1
2,3,automobile,Second-Gen Audi Q5 launched at ₹53.25 Lakh,Audi has launched the all-new Q5 with a price ...,http://images.newsinshorts.com.edgesuite.net/a...,https://youtu.be/EiJPF_aVbyo,dysqqgqf-1
3,4,automobile,World's fastest SUV Lamborghini Urus launched ...,Italian supercar maker Lamborghini on Thursday...,http://images.newsinshorts.com.edgesuite.net/a...,http://www.india.com/auto/car-news/lamborghini...,dysqqgqf-1
4,5,automobile,22000 Honda City units recalled in India over ...,"Japanese automaker Honda is recalling 22,084 u...",http://images.newsinshorts.com.edgesuite.net/a...,http://www.livemint.com/Companies/gR3h813hD6ex...,dysqqgqf-1


In [7]:
# Drop the columns that are not needed for classification (only catg and body are kept)
data = data.drop(['imageurl','headline','ind','readmoreurl','inblockId'],axis=1)

In [8]:
data.head()

Unnamed: 0,catg,body
0,automobile,Toyota's Etios and Liva have 14 Standard Safet...
1,automobile,Several pugs gatecrashed the finals of Vodafon...
2,automobile,Audi has launched the all-new Q5 with a price ...
3,automobile,Italian supercar maker Lamborghini on Thursday...
4,automobile,"Japanese automaker Honda is recalling 22,084 u..."


In [23]:
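# Remove punctuation and digits: keep only letters and spaces.
# (If the NLTK stop word list is missing, run nltk.download('stopwords') once.)
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
# Clean the stop words the same way so they match the cleaned text (e.g. "don't" -> "dont").
sw = [re.sub('[^A-Za-z ]', '', x) for x in sw]
data['body'] = data['body'].apply(lambda x: re.sub('[^A-Za-z ]', '', x))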

In [24]:
data.head()

Unnamed: 0,catg,body
0,automobile,Toyotas Etios and Liva have Standard Safety f...
1,automobile,Several pugs gatecrashed the finals of Vodafon...
2,automobile,Audi has launched the allnew Q with a price ra...
3,automobile,Italian supercar maker Lamborghini on Thursday...
4,automobile,Japanese automaker Honda is recalling units o...
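
The effect of the cleaning pattern is easier to see on a few short strings. The sketch below wraps the same regular expression in a hypothetical `clean` helper (the name is an assumption, it is not part of the notebook) and applies it to shortened samples in the style of the rows above.

```python
import re

def clean(text):
    # Keep only ASCII letters and spaces; every other character is deleted, not replaced.
    return re.sub('[^A-Za-z ]', '', text)

print(repr(clean("Audi has launched the all-new Q5")))  # hyphen and digit are removed
print(repr(clean("recalling 22,084 units")))            # the number vanishes, leaving two spaces
print(repr(clean("don't")))                             # apostrophes are removed as well
```

Because characters are deleted rather than replaced with a space, hyphenated words are joined ("all-new" becomes "allnew") and model names lose their digits ("Q5" becomes "Q"), which is worth keeping in mind when reading the vocabulary later.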


In [25]:
# Split the data into two sets: one for training and one for testing.
# train_test_split shuffles the data before splitting. Fixing the random_state argument makes the split reproducible, so different models can be compared on the same split.

from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(data['body'],data['catg'],test_size=0.35,random_state=42)
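
The role of a fixed seed can be illustrated without scikit-learn. The `split` function below is a hypothetical stand-in written for this note, not the algorithm `train_test_split` actually uses; it only shows that shuffling with the same seed yields the same partition every time.

```python
import random

def split(items, test_size=0.35, seed=42):
    # Hypothetical stand-in for train_test_split: shuffle a copy with a seeded RNG,
    # then cut off the last test_size fraction as the test set.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

docs = [f"doc{i}" for i in range(20)]
train_a, test_a = split(docs, seed=42)
train_b, test_b = split(docs, seed=42)

print(len(train_a), len(test_a))  # sizes of the two partitions
print(test_a == test_b)           # same seed, same split
```

Every document ends up in exactly one of the two partitions, and rerunning with the same seed reproduces the split, which is what makes comparisons between models fair.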

In [29]:
# Since we are going to use a Naive Bayes classifier, we use raw word counts as features rather than tf-idf (MultinomialNB models word counts; tf-idf may also work in practice, but counts match the model's assumptions).
# sklearn provides a very simple implementation of the BOW model (CountVectorizer).

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = sw,ngram_range = (1,3),max_features = 20000)
xtrain_vec = cv.fit_transform(xtrain)

In [30]:
# transform the test data using the vocabulary learned from the training data
xtest_vec = cv.transform(xtest)

In [34]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(xtrain_vec,ytrain)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
print(nb_clf.score(xtest_vec,ytest))

0.730530339226
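
The `score` method reports mean accuracy over all test documents, which can hide categories that are classified poorly. A per-category breakdown is one way to look closer. The sketch below is self-contained and uses made-up labels; the helper name `per_class_accuracy` and the `sports` and `technology` categories are assumptions for illustration. In the notebook, the inputs would be `ytest` and the output of `nb_clf.predict(xtest_vec)`.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    # Fraction of correctly predicted documents within each true category.
    totals, correct = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in totals}

# Made-up labels for illustration only.
y_true = ["automobile", "automobile", "sports", "sports", "sports", "technology"]
y_pred = ["automobile", "technology", "sports", "sports", "automobile", "technology"]

# Overall accuracy: the fraction of positions where prediction and truth agree.
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
acc = per_class_accuracy(y_true, y_pred)

print(overall)
print(acc)
```

In this toy example the overall accuracy is 4 out of 6, while the per-category values range from 0.5 to 1.0, showing how a single number can mask differences between categories.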


In [36]:
test_output = nb_clf.predict(xtest_vec)

In [None]:
def find_wrong():
    for i