### 20_NEWSGROUPS IMPLEMENTED AS PER ALGORITHM GIVEN BY TOM MITCHELL IN HIS BOOK

**ALGORITHM**
 - Calculate the size of the vocabulary in the newsgroups dataset.
 - Calculate P(vj) for each class.
 - To calculate P(wk/vj), combine all preprocessed docs of class vj into a single doc, then count the total number of words in it and the number of times word wk appears in it.
 - This gives P(wk/vj); we need it for every word of the test document.
 - Multiply the P(wk/vj) values together with the frequency of occurrence of class vj, i.e. P(vj).
 - The result is proportional to P(vj/test_example).
 - Repeat this for each of the 20 classes, then print the class with the highest probability as the predicted target class.
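
The steps above can be sketched end to end on a tiny, made-up corpus. This is only an illustration under stated assumptions: the two classes, the four training documents, and the helper names `p_word` and `classify` are hypothetical and are not part of the notebook's own code.

```python
from collections import Counter

# Hypothetical toy training set: (list of words, class label)
train = [
    (["goal", "puck", "ice", "goal"], "hockey"),
    (["puck", "team", "ice"], "hockey"),
    (["orbit", "rocket", "launch"], "space"),
    (["rocket", "orbit", "moon", "orbit"], "space"),
]

classes = sorted({label for _, label in train})
vocabulary = {w for doc, _ in train for w in doc}

# P(vj): fraction of training documents that belong to each class
prior = {c: sum(1 for _, l in train if l == c) / len(train) for c in classes}

# Merge all documents of a class into one and count its words
counts = {c: Counter() for c in classes}
for doc, label in train:
    counts[label].update(doc)
total = {c: sum(counts[c].values()) for c in classes}

def p_word(word, c):
    # Laplace-corrected estimate of P(wk/vj)
    return (counts[c][word] + 1) / (total[c] + len(vocabulary))

def classify(doc):
    scores = {}
    for c in classes:
        score = prior[c]
        for w in doc:
            if w in vocabulary:  # words outside the vocabulary are ignored
                score *= p_word(w, c)
        scores[c] = score
    # the class with the highest score is the prediction
    return max(scores, key=scores.get)

print(classify(["rocket", "moon", "ice"]))
```

With documents this short a plain product is safe; longer documents are usually scored with sums of logarithms instead.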
<br>
 
<font color=RED>*THIS IS AN IMPLEMENTATION OF THE NAIVE BAYES CLASSIFIER WITH LAPLACE CORRECTION WITHOUT USING MAJOR BUILT-IN FUNCTIONS FOR PROCESSING OR CLASSIFYING*</font>   
<BR>
The preprocessing is specific to this dataset; it was first evaluated on a single news article.

In [159]:
import pandas as pd
import numpy as np
import string 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn import datasets
import re

In [167]:
def process(raw_data):
    # Strip every character that is not a letter, a digit, or a space from each token
    text = [re.sub('[^ a-zA-Z0-9]', '', w) for w in raw_data.split()]
    words = [w.lower() for w in text]
    # Remove English stop words (a set makes the membership test fast)
    stop = set(stopwords.words('english'))
    words = [w for w in words if w not in stop]
    # Keep purely alphabetic tokens (this also drops empty strings)
    words = [w for w in words if w.isalpha()]

    # Remove every word that occurs only once in this document
    unique, freq = np.unique(words, return_counts=True)
    freq_dict = {u: f for u, f in zip(unique, freq) if f >= 2}
    return [w for w in words if w in freq_dict]
# we return each document as a list of words, not as a string

In [168]:
news = datasets.fetch_20newsgroups(subset='all', shuffle=True)
print(news.target_names)
print(len(news.data))

dataset_x = []
for i in news.data:           
    document = process(i)
    dataset_x.append(document)
        
# dataset_x contains processed data, news.target contains targets 
print(len(dataset_x))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846
18846


In [231]:
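# sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(dataset_x, news.target, test_size=0.25)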

In [232]:
print(len(X_train))

14134


In [233]:
corpus_x = [[] for _ in range(20)]

# merge all training documents of a class into one long list of words
for doc, label in zip(X_train, Y_train):
    corpus_x[label].extend(doc)
    

In [234]:
print(len(corpus_x))
len(corpus_x[0])

20


42382

In [235]:
vocab_ = []
for i in corpus_x:
    vocab_.extend(i)
# the vocabulary is the set of distinct words across all classes
vocabulary = np.unique(vocab_)
print(len(vocabulary))

33402


Stop words were removed, and any word occurring only once in a document was also removed. The resulting vocabulary contains *33402* words. Now we need to evaluate the formula:

<font color=blue>P(vj/test_doc) ∝ P(vj) * product over all words wk in test_doc of P(wk/vj)</font>
<br>
<font color=blue>P(wk/vj) = (no_of_times_word_present_class_vj + 1)/(no_of_words_in_trainingset_class_vj + vocabulary_size)</font>
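
As a rough worked example of the Laplace-corrected estimate, the sketch below plugs in two figures printed earlier in this notebook: 42382 words in class 0 and a vocabulary of 33402 words. The function name `laplace_estimate` and the word count of 10 are hypothetical, chosen only for illustration.

```python
def laplace_estimate(word_count, words_in_class, vocabulary_size):
    """Smoothed estimate of P(wk/vj)."""
    return (word_count + 1) / (words_in_class + vocabulary_size)

# Figures printed earlier in this notebook
words_in_class = 42382    # len(corpus_x[0])
vocabulary_size = 33402   # len(vocabulary)

# A word that never occurs in the class still gets a small nonzero probability
unseen = laplace_estimate(0, words_in_class, vocabulary_size)

# Hypothetical word that occurs 10 times in the class
seen_10 = laplace_estimate(10, words_in_class, vocabulary_size)

print(unseen)
print(seen_10 / unseen)
```

The correction matters because a single zero factor would otherwise force the whole product for that class to zero.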

> We have the *vocabulary_size*. 
<br>
> we need to get a frequency dictionary for each training class in corpus_x
<br>
> we also need a vector storing the number of words in each class in corpus_x

In [236]:
#this represents the vector stated above
words_class = []
for i in corpus_x:
    words_class.append(len(i))
#words_class

> - now we create a classwise dictionary of the words in the training set 
<br>
> - where keys = word, value = frequency in the class
<br>
> - name of this dictionary - glossary

In [237]:
glossary = []
for class_words in corpus_x:
    unique, freq = np.unique(class_words, return_counts=True)
    # map each word of the class to its frequency in that class
    glossary.append(dict(zip(unique, freq)))
print(type(glossary))
print(len(glossary))

<class 'list'>
20
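
The same per-class bookkeeping can also be sketched with `collections.Counter` from the standard library. The names `toy_corpus`, `toy_glossary`, and `toy_words_class` are hypothetical stand-ins for `corpus_x`, `glossary`, and `words_class`; this is an alternative illustration, not the approach used in this notebook.

```python
from collections import Counter

# Hypothetical stand-in for corpus_x: one merged word list per class
toy_corpus = [
    ["ice", "puck", "ice", "goal"],
    ["orbit", "rocket", "orbit"],
]

# One word -> frequency mapping per class, playing the role of glossary
toy_glossary = [Counter(words) for words in toy_corpus]

# Total number of words per class, playing the role of words_class
toy_words_class = [len(words) for words in toy_corpus]

print(toy_glossary[0]["ice"])   # frequency of "ice" in class 0
print(toy_glossary[1]["ice"])   # a Counter returns 0 for a missing word
print(toy_words_class)
```

A `Counter` returns 0 for a word it has never seen, which removes the need for an explicit membership check before the lookup.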


In [238]:
# we calculate the number of news_articles belonging to each class present in the training data
classes, no_of_docs = np.unique(Y_train, return_counts=True)
print(no_of_docs)
print(sum(no_of_docs))

[611 724 745 725 718 731 714 739 751 739 757 717 749 763 745 745 720 686
 578 477]
14134


In [239]:
test_doc = X_test[1000]
ans = Y_test[1000]

In [240]:
def predict(test_doc):
    max_prob = 0
    max_class = 0
    v_size = len(vocabulary)
    # P(vj): fraction of training documents in each class
    class_proba = no_of_docs/sum(no_of_docs)
    for j in range(20):
        # score the document against the jth class
        dictn = glossary[j]
        doc_freq = words_class[j]
        proba = 1
        for word in test_doc:
            # words that never occur in class j are skipped
            if(word in dictn):
                term_freq = dictn[word]
                # Laplace-corrected estimate of P(wk/vj)
                p = (term_freq+1)/(doc_freq+v_size)
                proba *= p
        proba *= class_proba[j]
        # keep the class with the highest score so far
        if(proba>max_prob):
            max_prob = proba
            max_class = j
    return max_class
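
Multiplying many small probabilities can underflow to 0.0 for a long document; when that happens for every class, `predict` falls back to class 0. A common remedy is to add logarithms instead of multiplying. The sketch below is a hypothetical variant, not the function used for the results in this notebook: the name `predict_log`, its parameters, and the toy model are assumptions, and unlike `predict` it applies the Laplace correction to words that are absent from a class instead of skipping them.

```python
import math

def predict_log(test_doc, glossary, words_class, class_proba, v_size):
    """Return the index of the class with the highest log-score."""
    best_class, best_score = None, -math.inf
    for j, word_counts in enumerate(glossary):
        # start from log P(vj)
        score = math.log(class_proba[j])
        for word in test_doc:
            # .get(word, 0) gives words unseen in class j a count of 0,
            # so they receive the smoothed probability 1/(n + v_size)
            count = word_counts.get(word, 0)
            score += math.log((count + 1) / (words_class[j] + v_size))
        if score > best_score:
            best_class, best_score = j, score
    return best_class

# Hypothetical toy model with two classes
toy_glossary = [{"ice": 2, "puck": 1}, {"orbit": 2, "rocket": 1}]
toy_words_class = [3, 3]
toy_priors = [0.5, 0.5]
toy_v_size = 4

long_doc = ["orbit", "rocket"] * 500   # 1000 words

# a plain product of 1000 small factors underflows to zero
print(0.25 ** 1000)

# the sum of logarithms stays finite and still ranks the classes
print(predict_log(long_doc, toy_glossary, toy_words_class, toy_priors, toy_v_size))
```

Because the logarithm is monotonic, the class that maximises the sum of logs is the same class that would maximise the product if it could be computed exactly.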

In [241]:
Y_pred = []
for i in X_test:
    ar = predict(i)
    Y_pred.append(ar)

In [242]:
# evaluating accuracy as the fraction of test documents classified correctly
count=0
for i in range(len(Y_test)):
    if(Y_test[i]==Y_pred[i]):
        count+=1
accuracy = count/len(X_test)
print(accuracy)

0.8913412563667233


### HENCE MY CODE LEARNS TO PREDICT THE CORRECT NEWSGROUP AN ARTICLE BELONGS TO OUT OF THE 20 CLASSES WITH ABOUT 89% ACCURACY ON THE HELD-OUT 25% TEST SPLIT