## 20_NEWSGROUPS ANALYSIS USING INBUILT CLASSIFIERS 

 - In 'newsgroups_implementation', when we append the processed word lists from all files into one nested list, each row has a different length.
<br>
 - Sklearn estimators expect every sample to have the same number of features, so they cannot be fit on variable-length rows.
<br>
 - Hence none of the inbuilt classifiers can be used directly while the features are raw lists of words.
<br>
**To use the inbuilt naive_bayes classifier or SGDClassifier:**
<br>
> We need a vectorizer (CountVectorizer below, with NLTK's word_tokenize as its tokenizer), which builds a vocabulary from the entire training data
<br>
> It treats each example as a vector of size len(vocabulary)
<br>
> Each entry holds the number of times that word occurs in the example, and 0 for absent words
<br>
> X_train would have dimensions n_train_documents x len(vocabulary) (15076 x 214738 in the run below)
<br>
> This matrix can be used to fit the Multinomial naive_bayes classifier
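
To make the fixed-length representation concrete, here is a minimal sketch on a hypothetical three-document corpus (the sentences are invented for illustration and are not part of the newsgroups data). It uses CountVectorizer with its default tokenizer rather than word_tokenize, so it runs without any NLTK downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: three "documents" of different lengths.
docs = [
    "the graphics card renders the image",
    "the hockey team won",
    "graphics and hockey",
]

vect = CountVectorizer()       # default word-level tokenization, lowercased
X = vect.fit_transform(docs)   # sparse matrix: n_documents x len(vocabulary)

# CountVectorizer assigns column indices in sorted term order.
vocab = sorted(vect.vocabulary_)
dense = X.toarray()

print(vocab)        # the learned vocabulary
print(dense.shape)  # every row now has the same length
print(dense[0])     # counts for the first document
```

Every document becomes a row of the same width regardless of how many words it contains, and a repeated word (here "the" in the first document) contributes its count rather than a plain 1.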

In [12]:
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import datasets, metrics

LOADING DATA

In [3]:
news = datasets.fetch_20newsgroups(subset='all', shuffle=True)
print(news.target_names)
print(len(news.data))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846


In [31]:
train_data, test_data, train_target, test_target = train_test_split(news.data, news.target, test_size=0.20)

In [41]:
stop = stopwords.words('english')
punc = list(string.punctuation)
extras = ["``", "--", "''", '""', "...", '']
stop = stop+punc+extras

USING COUNT VECTORIZER TO VECTORIZE THE TEXT DOCUMENTS

In [57]:
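# Build the vocabulary from the training documents only, then reuse it for the test set
vect = CountVectorizer(analyzer='word', tokenizer=word_tokenize, stop_words=stop)
vect.fit(train_data)
X_train = vect.transform(train_data)
print(X_train.shape)
X_test = vect.transform(test_data)
print(X_test.shape)
# Column index assigned to the word 'algorithm' in the vocabulary
print(vect.vocabulary_.get('algorithm'))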

(15076, 214738)
(3770, 214738)
63004


USING TF-IDF WEIGHTS INSTEAD OF WORD COUNTS IN EACH DOCUMENT

In [58]:
tfidf_transformer = TfidfTransformer()
X_train_tf = tfidf_transformer.fit_transform(X_train)
X_test_tf = tfidf_transformer.transform(X_test)
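
As a rough illustration of what the TF-IDF step changes, the sketch below applies TfidfTransformer to a hypothetical toy corpus (invented sentences, default CountVectorizer tokenization). It checks two properties of the default settings: each row is L2-normalised, and a word that occurs in fewer documents receives a larger IDF weight.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, not taken from the newsgroups data.
docs = [
    "the graphics card renders the image",
    "the hockey team won",
    "graphics and hockey",
]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

tfidf = TfidfTransformer()            # defaults: norm='l2', smooth_idf=True
weights = tfidf.fit_transform(counts)

# Euclidean length of each document vector after the transform.
row_norms = np.sqrt(np.asarray(weights.multiply(weights).sum(axis=1)).ravel())

idf_the = tfidf.idf_[vect.vocabulary_["the"]]    # "the" occurs in 2 of 3 documents
idf_card = tfidf.idf_[vect.vocabulary_["card"]]  # "card" occurs in 1 of 3 documents

print(row_norms)
print(idf_the, idf_card)
```

The transformer is fit on the training counts only and then applied to the test counts, so the IDF weights never see the test documents.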

USING MULTINOMIAL NAIVE-BAYES WITH ALPHA=1 (DEFAULT)

In [59]:
from sklearn import naive_bayes
clf = naive_bayes.MultinomialNB()
clf.fit(X_train_tf, train_target)
Y_pred = clf.predict(X_test_tf)
clf.score(X_test_tf, test_target)

0.8835543766578249

CHECKING PERFORMANCE BY SGD CLASSIFIER 

In [51]:
from sklearn.linear_model import SGDClassifier

In [52]:
clf2 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)
clf2.fit(X_train_tf, train_target)
Y_pred_svm = clf2.predict(X_test_tf)
clf2.score(X_test_tf, test_target)

0.890185676392573

In [53]:
print(metrics.classification_report(test_target, Y_pred_svm, target_names=news.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.89      0.82      0.85       155
           comp.graphics       0.87      0.84      0.86       206
 comp.os.ms-windows.misc       0.83      0.88      0.86       213
comp.sys.ibm.pc.hardware       0.85      0.76      0.80       205
   comp.sys.mac.hardware       0.91      0.87      0.89       197
          comp.windows.x       0.89      0.89      0.89       207
            misc.forsale       0.85      0.88      0.86       203
               rec.autos       0.91      0.94      0.93       181
         rec.motorcycles       0.95      0.96      0.96       198
      rec.sport.baseball       0.95      0.94      0.95       212
        rec.sport.hockey       0.91      0.97      0.94       199
               sci.crypt       0.90      0.99      0.94       192
         sci.electronics       0.91      0.75      0.82       203
                 sci.med       0.92      0.96      0.94       194
         

We observe that the linear SVM (SGDClassifier with hinge loss) and Multinomial Naive-Bayes perform comparably on the TF-IDF features: about 89.0% and 88.4% accuracy respectively on this single 80/20 split, with the SVM slightly ahead.
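
The vectorize, re-weight and classify steps used in this notebook can also be chained with scikit-learn's `Pipeline`, which keeps the fit/transform order consistent between training and test data. The following is only a sketch under stated assumptions: the toy documents, labels and the helper name `build_text_pipeline` are hypothetical, and the default CountVectorizer tokenizer stands in for word_tokenize so that it runs without downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for newsgroup posts (0 = graphics, 1 = hockey).
train_docs = [
    "graphics card renders the image",
    "image pixels and graphics drivers",
    "hockey team scored a goal",
    "the goal decided the hockey game",
]
train_labels = [0, 0, 1, 1]
test_docs = ["graphics image", "hockey goal"]


def build_text_pipeline(classifier):
    """Chain counting, TF-IDF weighting and a classifier (hypothetical helper)."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])


nb_model = build_text_pipeline(MultinomialNB(alpha=1.0))
svm_model = build_text_pipeline(
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                  random_state=42, max_iter=5, tol=None)
)

# fit() learns the vocabulary and IDF weights from the training documents only.
nb_model.fit(train_docs, train_labels)
svm_model.fit(train_docs, train_labels)

nb_pred = nb_model.predict(test_docs)
svm_pred = svm_model.predict(test_docs)
print(nb_pred, svm_pred)
```

With the real data, `train_data` and `train_target` would take the place of the toy lists, and `score(test_data, test_target)` would report accuracy as in the cells above.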