In [59]:
import pandas as pd 
import numpy as np 
import re 

In [60]:
data = pd.read_csv('datasets/r8-train-stemmed.txt', header=None, sep='\t')
data

      0  1
0  earn  champion product approv stock split champion p...
1   acq  comput termin system cpml complet sale comput ...
2  earn  cobanco inc cbco year net shr ct dlr net asset...
3  earn  intern inc qtr jan oper shr loss two ct profit...
4  earn  brown forman inc bfd qtr net shr dlr ct net ml...
5  earn  dean food see strong qtr earn dean food expect...
6  earn  brown forman bfdb set stock split up payout br...
7  earn  esquir radio and electron inc qtr shr profit c...
8  earn  unit presidenti corp upco qtr net shr ct ct ne...
9  earn  owen and minor inc obod rais qtly dividend qtl...


In [61]:
labels = data[0].unique()
train_data = data[1]
train_data

0       champion product approv stock split champion p...
1       comput termin system cpml complet sale comput ...
2       cobanco inc cbco year net shr ct dlr net asset...
3       intern inc qtr jan oper shr loss two ct profit...
4       brown forman inc bfd qtr net shr dlr ct net ml...
5       dean food see strong qtr earn dean food expect...
6       brown forman bfdb set stock split up payout br...
7       esquir radio and electron inc qtr shr profit c...
8       unit presidenti corp upco qtr net shr ct ct ne...
9       owen and minor inc obod rais qtly dividend qtl...
10      comput languag research clri qtr shr loss ct l...
11      cinram qtr net shr ct ct net mln sale mln mln ...
12      standard trustco see year standard trustco exp...
13      handi and harman hnh qtr loss shr loss ct loss...
14      chemlawn chem rise hope for higher bid chemlaw...
15      brazil anti inflat plan limp anniversari infla...
16      agenc report ship wait panama canal panama can...
17      americ

In [62]:
train_data.shape

(5485,)

## CountVectorizer 

CountVectorizer takes in the text corpus, learns its vocabulary, and returns a document-term matrix.

Every word in the vocabulary is assigned a fixed, unique integer id, and each cell of the matrix holds a word count. This is the bag-of-words (BoW) representation.

X_train_counts[i, j] is the number of times word j occurs in document i, where i indexes a document (here, a training example) and j is the integer id of a word in the vocabulary.
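
To make the matrix layout concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the `toy_*` names are invented for illustration and are not part of the R8 data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = ["apple banana apple", "banana cherry"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse document-term matrix

# vocabulary_ maps each word to its column index (alphabetical order)
print(toy_vect.vocabulary_)

# Dense view: one row per document, one column per vocabulary word
print(toy_counts.toarray())

# Count of "apple" in document 0
print(toy_counts[0, toy_vect.vocabulary_["apple"]])  # 2
```

Here "apple", "banana" and "cherry" get ids 0, 1 and 2, so the dense matrix is [[2, 1, 0], [0, 1, 1]].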



In [63]:
from sklearn.feature_extraction.text import CountVectorizer 
count_vect = CountVectorizer() 
X_train_counts = count_vect.fit_transform(train_data)  # learns the vocabulary and returns the document-term matrix
print (X_train_counts.shape)

(5485, 14575)


In [64]:
print(X_train_counts)

  (0, 10911)	1
  (0, 8329)	2
  (0, 1989)	1
  (0, 917)	1
  (0, 6376)	1
  (0, 8041)	1
  (0, 598)	1
  (0, 10608)	1
  (0, 13979)	1
  (0, 2674)	1
  (0, 680)	2
  (0, 10616)	1
  (0, 11662)	2
  (0, 11661)	2
  (0, 2660)	1
  (0, 4900)	2
  (0, 13409)	1
  (0, 3610)	1
  (0, 1503)	2
  (0, 6356)	1
  (0, 12167)	2
  (0, 12360)	3
  (0, 675)	2
  (0, 10164)	2
  (0, 2241)	2
  :	:
  (5484, 5448)	3
  (5484, 11721)	1
  (5484, 2972)	1
  (5484, 4750)	1
  (5484, 9116)	2
  (5484, 11396)	1
  (5484, 3766)	1
  (5484, 13201)	1
  (5484, 1859)	1
  (5484, 13114)	1
  (5484, 2064)	1
  (5484, 1291)	1
  (5484, 131)	2
  (5484, 7265)	1
  (5484, 14276)	1
  (5484, 12961)	1
  (5484, 9548)	1
  (5484, 7740)	1
  (5484, 9848)	1
  (5484, 14488)	2
  (5484, 561)	4
  (5484, 10911)	1
  (5484, 8329)	1
  (5484, 8041)	1
  (5484, 13409)	1


In [65]:
labels, indexed_train_labels = np.unique(data[0], return_inverse=True)
indexed_train_labels

array([2, 0, 2, ..., 2, 5, 6])
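
The behaviour of `np.unique(..., return_inverse=True)` is easier to see on a small array. The sketch below uses made-up labels: the first return value holds the sorted unique class names, and the second holds, for every element, its index into that sorted array.

```python
import numpy as np

# Made-up labels for illustration
toy_labels = np.array(["earn", "acq", "earn", "ship"])

classes, encoded = np.unique(toy_labels, return_inverse=True)

print(classes)           # sorted unique labels: ['acq' 'earn' 'ship']
print(encoded)           # integer id of each label: [1 0 1 2]
print(classes[encoded])  # indexing back recovers the original labels
```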

In [66]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() 
clf.fit(X_train_counts, indexed_train_labels)  #calling the fit method trains it
print ("Training Completed")

Training Completed
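
As a rough illustration of what `fit` learns, the sketch below trains MultinomialNB on a tiny hand-written count matrix (hypothetical data, two documents and three words). With the default `alpha=1.0`, the per-class word probabilities are the smoothed relative word counts.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 2 documents x 3 vocabulary words
toy_X = np.array([[2, 1, 0],
                  [0, 1, 2]])
toy_y = np.array([0, 1])  # one document per class

toy_clf = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
toy_clf.fit(toy_X, toy_y)

# Smoothed P(word | class); row 0 is (2+1)/6, (1+1)/6, (0+1)/6
toy_word_probs = np.exp(toy_clf.feature_log_prob_)
print(toy_word_probs)

# A document made only of word 0 goes to class 0, only of word 2 to class 1
toy_pred = toy_clf.predict(np.array([[3, 0, 0], [0, 0, 3]]))
print(toy_pred)  # [0 1]
```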


In [67]:
data1 = pd.read_csv('datasets/r8-test-stemmed.txt', header=None, sep='\t')
data1

       0  1
0  trade  asian export fear damag japan rift mount trade...
1  grain  china daili vermin eat pct grain stock survei ...
2   ship  australian foreign ship ban end nsw port hit t...
3    acq  sumitomo bank aim quick recoveri merger sumito...
4   earn  amatil propos two for bonu share issu amatil a...
5   earn  bowat pretax profit rise mln stg shr div make ...
6    acq  cra sold forrest gold for mln dlr whim creek w...
7   earn  bowat industri profit exce expect bowat indust...
8   earn  citibank norwai unit lose six mln crown citiba...
9   earn  vieill montagn condit unfavour sharp fall doll...


In [68]:
test_data = data1[1]
test_data.shape

(2189,)

In [69]:
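# Encoding the test labels separately matches the training ids only if both splits contain the same set of classes
labels, indexed_test_labels = np.unique(data1[0], return_inverse=True)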
indexed_test_labels.shape

(2189,)
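
Encoding each split with its own `np.unique` call gives matching integer ids only when both splits contain exactly the same classes. The sketch below, on made-up labels, shows how the ids drift apart when a class is missing from the test split, and one possible way to keep them aligned by reusing the classes found in the training split. The `label_to_id` dictionary is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

# Made-up labels; "acq" does not occur in the test split
toy_train_labels = np.array(["earn", "acq", "ship", "earn"])
toy_test_labels = np.array(["ship", "earn"])

classes, train_idx = np.unique(toy_train_labels, return_inverse=True)

# Separate encoding: ids are relative to the test split's own classes
_, naive_idx = np.unique(toy_test_labels, return_inverse=True)

# Aligned encoding: look every test label up in the training classes
label_to_id = {label: i for i, label in enumerate(classes)}
aligned_idx = np.array([label_to_id[label] for label in toy_test_labels])

print(naive_idx)    # [1 0], "ship" is 1 here but 2 in the training ids
print(aligned_idx)  # [2 1], consistent with the training ids
```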

The same count_vect object that was fitted on the training dataset is reused for the test dataset. Note that we call transform here, not fit_transform: fit_transform would first learn a new vocabulary and then return a document-term matrix, whereas transform only maps the test documents onto the vocabulary that has already been learned. The vocabulary must be learned from the training dataset only.

Remember: 
- fit_transform: learn the vocabulary dictionary and return the document-term matrix
- transform: transform documents to a document-term matrix using the already learned vocabulary
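
The difference can be seen on a toy example (made-up documents, hypothetical `toy_*` names): words that occur only in the test documents are not in the learned vocabulary, so `transform` ignores them, and the test matrix keeps the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
toy_train = ["stock split approv", "net profit rise"]
toy_test = ["profit rise sharpli", "merger talk"]

toy_vect = CountVectorizer()
toy_train_counts = toy_vect.fit_transform(toy_train)  # learns 6 words
toy_test_counts = toy_vect.transform(toy_test)        # reuses those 6 words

# Both matrices have one column per training-vocabulary word
print(toy_train_counts.shape, toy_test_counts.shape)  # (2, 6) (2, 6)

# "sharpli", "merger" and "talk" were never seen, so they are dropped;
# the second test document becomes an all-zero row
print(toy_test_counts.toarray())
```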

In [70]:
X_test_counts = count_vect.transform(test_data) #transforms test data to numerical form
print (X_test_counts.shape)

(2189, 14575)


In [71]:
predicted = clf.predict(X_test_counts)
print("Test Set Accuracy : ", np.sum(predicted == indexed_test_labels) / float(len(predicted)))

Test Set Accuracy :  0.9607126541799909
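
The accuracy above is the fraction of predictions that equal the true label ids. A small sketch on made-up arrays shows the same computation written with `np.mean`, and how integer predictions can be mapped back to class names by indexing the sorted class array (in this notebook that would presumably be `labels[predicted]`).

```python
import numpy as np

# Made-up values for illustration
toy_classes = np.array(["acq", "earn", "ship"])
toy_true = np.array([1, 0, 2, 2])
toy_pred = np.array([1, 0, 1, 2])

# Mean of a boolean array is the fraction of True entries
toy_accuracy = np.mean(toy_pred == toy_true)
print(toy_accuracy)  # 0.75

# Map integer predictions back to class names
toy_pred_names = toy_classes[toy_pred]
print(toy_pred_names)  # ['earn' 'acq' 'earn' 'ship']
```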
