# Mini-Project 2: Classification of Textual Data

## 1. Acquiring, preprocessing, and analyzing the data

### 1.1. Loading and looking at data

Importing the required libraries:

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

%pylab inline

Populating the interactive namespace from numpy and matplotlib


Importing the 20 newsgroups dataset and taking a look at the data structure:

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(subset = 'train', remove = ('headers', 'footers', 'quotes'))

# Take a look at the data:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
len(newsgroups_train.data)

11314

Printing the text of the first loaded document:


In [7]:
print(newsgroups_train.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [8]:
print(newsgroups_train.target_names[newsgroups_train.target[0]])

rec.autos


Taking a look at the first 10 target values:

In [9]:
newsgroups_train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

Taking a look at the corresponding category names:

In [10]:
for t in newsgroups_train.target[:10]:
    print(newsgroups_train.target_names[t])

rec.autos
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.graphics
sci.space
talk.politics.guns
sci.med
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
comp.sys.mac.hardware


### 1.2. Stopword removal using NLTK

NLTK is used to remove stopwords and punctuation.

In [12]:
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


[nltk_data] Downloading package stopwords to /Users/Nick/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/Nick/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The function for removing the punctuation from a dataset:

In [15]:
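def remove_punctuation(dataset):
    """Return the documents with every non-alphanumeric token removed."""
    filtered_dataset = []

    for sentence in dataset:
        word_tokens = word_tokenize(sentence)

        # Keep purely alphanumeric tokens; this drops punctuation marks and
        # also tokens such as "2-door" or "e-mail".
        filtered_words = [w for w in word_tokens if w.isalnum()]

        # Re-join with a trailing space after every word.
        filtered_sentence = ""
        for word in filtered_words:
            filtered_sentence += word + " "

        filtered_dataset.append(filtered_sentence)

    return filtered_dataset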
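To see which tokens the `isalnum()` filter keeps, here is a minimal sketch. It assumes a hypothetical regex tokenizer, `simple_tokenize`, as a stand-in for `word_tokenize` so that it runs without the NLTK data; the real tokenizer may split some inputs differently.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: words (optionally with
    # inner hyphens or apostrophes) and single punctuation marks are tokens.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


tokens = simple_tokenize("It was a 2-door sports car, please e-mail.")

# Same test as in remove_punctuation: keep purely alphanumeric tokens.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print("kept:   ", kept)
print("dropped:", dropped)
```

The filter removes the comma and the full stop, but it also discards whole words that merely contain a hyphen.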

The function for removing the stopwords is defined as follows:

In [16]:
def remove_stop_words(dataset):
    """Return the documents with NLTK English stopwords removed."""
    filtered_dataset = []

    for sentence in dataset:
        word_tokens = word_tokenize(sentence)

        # The comparison is case-sensitive and the stopword list is lowercase,
        # so capitalized tokens such as "I" or "The" are kept.
        filtered_words = [w for w in word_tokens if w not in stopwords]

        # Re-join with a trailing space after every word.
        filtered_sentence = ""
        for word in filtered_words:
            filtered_sentence += word + " "

        filtered_dataset.append(filtered_sentence)

    return filtered_dataset
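The NLTK stopword list is lowercase, so a plain membership test keeps capitalized stopwords. The sketch below contrasts that behaviour with a lowercased comparison. The tiny `stop_set` and the lowercased variant are illustrative assumptions, not part of the notebook's preprocessing.

```python
# Tiny illustrative subset; the real NLTK list is much longer.
stop_set = {"i", "it", "was", "the", "a", "if"}

words = "I was wondering if It was the car".split()

# Case-sensitive membership test: "I" and "It" survive.
case_sensitive = [w for w in words if w not in stop_set]

# Hypothetical variant: compare the lowercased token instead.
case_insensitive = [w for w in words if w.lower() not in stop_set]

print(case_sensitive)
print(case_insensitive)
```

Storing the stopwords in a `set` rather than a list also makes each membership test constant-time on average, which may shorten the filtering step on a corpus of this size.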

Now these functions can be applied to the training data. First, the punctuation is removed:

In [17]:
%time filtered_train = remove_punctuation(newsgroups_train.data)

CPU times: user 31.7 s, sys: 268 ms, total: 32 s
Wall time: 34.6 s


Checking the result on the first document:

In [18]:
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [19]:
filtered_train[0]

'I was wondering if anyone out there could enlighten me on this car I saw the other day It was a sports car looked to be from the late early 70s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please '

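The punctuation is gone. Note that tokens containing non-alphanumeric characters, such as `2-door`, `60s/`, and `e-mail`, are dropped entirely. Now the stopwords are filtered out: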

In [20]:
%time filtered_train = remove_stop_words(filtered_train)

CPU times: user 21.5 s, sys: 139 ms, total: 21.6 s
Wall time: 22 s


Checking once again on the same document:

In [21]:
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [22]:
filtered_train[0]

'I wondering anyone could enlighten car I saw day It sports car looked late early 70s It called Bricklin The doors really small In addition front bumper separate rest body This I know If anyone tellme model name engine specs years production car made history whatever info funky looking car please '

In [23]:
len(newsgroups_train.data)

11314

In [24]:
len(filtered_train)

11314

### 1.2. Extracting features from text files



Tokenizing text with scikit-learn:

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(filtered_train)
X_train_counts.shape

(11314, 77086)

In [26]:
count_vect.vocabulary_.get(u'algorithm')

9774

Obtaining frequencies from occurrences. First we obtain term frequencies (TF)...

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf = False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(11314, 77086)

... and then term frequencies times inverse document frequency (TF-IDF):

In [28]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 77086)
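As a worked example of the TF-IDF weighting, the sketch below recomputes it by hand on a made-up 3×3 count matrix and compares the result with `TfidfTransformer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), for which the inverse document frequency is idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 3]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf

weighted = counts * idf                     # raw tf times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # l2 rows

from_sklearn = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()

print(np.allclose(manual, from_sklearn))
```

The second term appears in every document, so its idf is exactly 1 and it receives the lowest weight per occurrence.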

### 1.3. Normalization

In [29]:
from sklearn.preprocessing import Normalizer

normalizer_train = Normalizer().fit(X=X_train_tfidf)
X_train_tfidf_normalized = normalizer_train.transform(X_train_tfidf)

X_train_tfidf_normalized.shape

(11314, 77086)
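`TfidfTransformer` applies `norm='l2'` by default, so its rows should already have unit length and the additional `Normalizer` step is expected to leave them unchanged. A minimal check on a made-up count matrix, assuming default parameters for both transformers:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import Normalizer

# Made-up term counts: 3 documents x 3 terms.
counts = csr_matrix(np.array([[3, 0, 1],
                              [0, 2, 2],
                              [1, 1, 0]]))

tfidf = TfidfTransformer().fit_transform(counts)

# Euclidean length of every row after TF-IDF.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Applying the l2 Normalizer on top of that.
renormalized = Normalizer().fit_transform(tfidf)

print(row_norms)
print(np.allclose(renormalized.toarray(), tfidf.toarray()))
```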

## 2. Training and testing the classifiers

Before we dive into training and testing, we need to define how cross-validation is performed. Cross-validation is essential for validating many models, and this case is no exception. We use $k=5$ folds, just like in Mini-Project 1.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def cross_validation(model, X, y, folds):
    pipeline_tfidf = Pipeline([
        ('tfidf',
         TfidfVectorizer(sublinear_tf = True,
                         smooth_idf = True,
                         norm = "l2",
                         lowercase = True,
                         max_features = 50000,
                         use_idf = True,
                         encoding = "utf-8",
                         decode_error = 'ignore',
                         strip_accents = 'unicode',
                         analyzer = "word")),
          ('clf', model)],
          verbose = True)

    scores = cross_val_score(pipeline_tfidf,
                             X,
                             y,
                             cv = folds,
                             scoring = "accuracy")

    print("Cross-validation scores:", scores)
    print("Cross-validation mean score:", scores.mean())

    return scores, scores.mean()
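Placing the vectorizer inside the pipeline means that the vocabulary and IDF weights are refit on the training part of every fold, so nothing from the held-out fold leaks into the features. The sketch below runs the same idea on a made-up ten-document corpus. It also illustrates that, for a classifier, an integer `cv` corresponds to a non-shuffled `StratifiedKFold`. The corpus and labels are purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up corpus: five "cars" documents and five "space" documents.
docs = [
    "the engine of this car is loud",
    "car brakes and engine oil",
    "new tyres for my car",
    "the car dealer sold the engine",
    "car wheel and brake repair",
    "the rocket reached orbit",
    "moon orbit and rocket launch",
    "the shuttle launch was delayed",
    "orbit of the moon and the shuttle",
    "rocket fuel for the launch",
]
labels = [0] * 5 + [1] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),      # refit on each training fold
    ("clf", LogisticRegression()),
])

implicit = cross_val_score(pipeline, docs, labels, cv=5, scoring="accuracy")
explicit = cross_val_score(pipeline, docs, labels,
                           cv=StratifiedKFold(n_splits=5), scoring="accuracy")

print(implicit, implicit.mean())
print(np.allclose(implicit, explicit))
```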

Now we can actually start training the classifiers.

### 2.1. Logistic regression

Logistic regression is used as the first classifier.

In [31]:
from sklearn.linear_model import LogisticRegression

#### 2.1.1. Training the logistic regression classifier

In [32]:
clf = LogisticRegression().fit(X_train_tfidf, newsgroups_train.target)

The model is tested on a few custom documents to check whether it predicts sensible categories:

In [33]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => rec.autos
'Korea Russia Iran ' => talk.politics.mideast


#### 2.1.2. Hyperparameter tuning for logistic regression

In [0]:
C_par = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]

mu_max = 0
C_max = C_par[0]

for c in C_par:
    %time scores, mu = cross_validation(model = LogisticRegression(C = c, max_iter = 500), X = filtered_train, y = newsgroups_train.target, folds = 5)
    
    if (mu > mu_max):
        mu_max = mu
        C_max = c

    print("-----------------------------------------------------------------")

print("Best hyperparameter C:", C_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  11.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  12.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  12.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  13.0s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  12.9s
Cross-validation scores: [0.71232877 0.71188688 0.72381794 0.72779496 0.72369584]
Cross-validation mean score: 0.7199048781126279
CPU times: user 1min 21s, sys: 47.3 s, total: 2min 8s
Wall time: 1min 9s
-----------------------------------------------------------------
[Pipeline] ..........
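The manual loop over `C` can also be expressed with `GridSearchCV`, which runs the same cross-validation for every candidate and records the best one. This is only a sketch of that alternative on a made-up corpus, not the procedure used to obtain the numbers in this notebook. The step name `clf` and the three-value grid are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus: five "cars" documents and five "space" documents.
docs = [
    "the engine of this car is loud",
    "car brakes and engine oil",
    "new tyres for my car",
    "the car dealer sold the engine",
    "car wheel and brake repair",
    "the rocket reached orbit",
    "moon orbit and rocket launch",
    "the shuttle launch was delayed",
    "orbit of the moon and the shuttle",
    "rocket fuel for the launch",
]
labels = [0] * 5 + [1] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=500)),
])

# "clf__C" addresses the C parameter of the pipeline step named "clf".
param_grid = {"clf__C": [0.5, 1.0, 3.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(docs, labels)

print("Best hyperparameter C:", search.best_params_["clf__C"])
print("Best accuracy:", search.best_score_)
```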

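Tuning the regularization penalty (the default `lbfgs` solver supports only `l2`, so the `l1` and `elasticnet` runs below fail with a `ValueError` and score `nan`):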

In [0]:
penalties = ['l1', 'l2', 'elasticnet']

mu_max = 0
penalty_best = penalties[0]

for p in penalties:
    %time scores, mu = cross_validation(model = LogisticRegression(C = 3, max_iter = 500, penalty = p), X = filtered_train, y = newsgroups_train.target, folds = 5)
    
    if (mu > mu_max):
        mu_max = mu
        penalty_best = p

    print("-----------------------------------------------------------------")

print("Best loss function:", penalty_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
Cross-validation scores: [nan nan nan nan nan]
Cross-validation mean score: nan
CPU times: user 4.69 s, sys: 6.87 ms, total: 4.7 s
Wall time: 4.72 s
-----------------------------------------------------------------


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  32.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  26.4s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  29.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  26.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  28.3s
Cross-validation scores: [0.73795846 0.74149359 0.75342466 0.74458683 0.7484527 ]
Cross-validation mean score: 0.7451832481393486
CPU times: user 2min 59s, sys: 1min 50s, total: 4min 49s
Wall time: 2min 30s
-----------------------------------------------------------------
[Pipeline] ......

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
Cross-validation scores: [nan nan nan nan nan]
Cross-validation mean score: nan
CPU times: user 4.76 s, sys: 2.61 ms, total: 4.76 s
Wall time: 4.78 s
-----------------------------------------------------------------
Best loss function: l2
Best accuracy: 0.7451832481393486


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.
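The errors above come from the solver, not from the penalties themselves. One way to compare all three penalties is to switch to the `saga` solver, which accepts `l1`, `l2`, and `elasticnet` (the latter needs an `l1_ratio`). The sketch below uses synthetic data from `make_classification`. The value `l1_ratio=0.5` is an arbitrary illustrative choice, and argument names may differ between scikit-learn versions, so check the documentation of the installed release.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the newsgroups features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

candidates = {
    "l1": LogisticRegression(penalty="l1", solver="saga",
                             C=3, max_iter=2000),
    "l2": LogisticRegression(penalty="l2", solver="saga",
                             C=3, max_iter=2000),
    # elasticnet mixes both penalties; l1_ratio=0.5 is an arbitrary choice.
    "elasticnet": LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                                     solver="saga", C=3, max_iter=2000),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(name, results[name])
```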



#### 2.1.3. Building a pipeline for logistic regression

A pipeline is built to make the model easier to work with:

In [34]:
from sklearn.pipeline import Pipeline

The model can now be trained much more simply:

In [35]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C = 3, max_iter = 500, penalty = 'l2')),
])
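With default parameters, `CountVectorizer` followed by `TfidfTransformer` produces the same features as a single `TfidfVectorizer`. The check below uses a made-up corpus. Note that the `cross_validation` helper configures its `TfidfVectorizer` with non-default options (`sublinear_tf=True`, `max_features=50000`), so the features used during tuning are not identical to those of this pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Made-up corpus for illustration.
docs = [
    "the engine of this car is loud",
    "moon orbit and rocket launch",
    "car wheel and brake repair",
]

# Two-step route: raw counts first, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs))

# One-step route with the combined vectorizer.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```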

In [36]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LogisticRegression(C=3, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
    

#### 2.1.4. Evaluation of the performance of logistic regression on the test set

Loading the test dataset, filtering it, and predicting its target values with logistic regression with the tuned parameter $C = 3$:



In [37]:
newsgroups_test = fetch_20newsgroups(subset = 'test', remove = ('headers', 'footers', 'quotes'))
docs_test = newsgroups_test.data

# Filtering:
filtered_test = remove_punctuation(docs_test)
filtered_test = remove_stop_words(filtered_test)

In [38]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.6802973977695167


In [39]:
from sklearn import metrics

In [40]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.49      0.46      0.48       319
           comp.graphics       0.62      0.68      0.65       389
 comp.os.ms-windows.misc       0.64      0.60      0.62       394
comp.sys.ibm.pc.hardware       0.63      0.64      0.63       392
   comp.sys.mac.hardware       0.74      0.68      0.71       385
          comp.windows.x       0.78      0.66      0.72       395
            misc.forsale       0.76      0.79      0.77       390
               rec.autos       0.74      0.71      0.72       396
         rec.motorcycles       0.50      0.81      0.62       398
      rec.sport.baseball       0.82      0.79      0.80       397
        rec.sport.hockey       0.90      0.86      0.88       399
               sci.crypt       0.83      0.68      0.75       396
         sci.electronics       0.57      0.61      0.59       393
                 sci.med       0.78      0.75      0.76       396
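For reference, the columns of the report can be recomputed by hand: precision is TP / (TP + FP), recall is TP / (TP + FN), F1 is their harmonic mean, and support is the number of true instances of the class. A small worked example on made-up labels (the helper `per_class_scores` is hypothetical):

```python
# Made-up true and predicted labels for six documents.
y_true = ["autos", "autos", "autos", "space", "space", "med"]
y_pred = ["autos", "autos", "space", "space", "autos", "med"]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, tp + fn


for label in ["autos", "space", "med"]:
    print(label, per_class_scores(y_true, y_pred, label))
```

For `autos` there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3 with a support of 3.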
         

### 2.2. Decision tree classifier

In [41]:
from sklearn.tree import DecisionTreeClassifier

#### 2.2.1. Training the decision tree classifier

In [42]:
clf = DecisionTreeClassifier().fit(X_train_tfidf, newsgroups_train.target)

The model is tested on a few custom documents to check whether it predicts sensible categories:

In [43]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

# Filtering:

docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)


X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => rec.autos
'Korea Russia Iran ' => rec.autos


#### 2.2.2. Hyperparameter tuning for decision trees

First we tune `min_samples_split`:



In [0]:
min_samples_splits = np.linspace(0.01, 0.1, 10, endpoint = True)

mu_max = 0
min_samples_split_max = min_samples_splits[0]

for min_samples_split in min_samples_splits:
    %time scores, mu = cross_validation(model = DecisionTreeClassifier(min_samples_split = min_samples_split), X = filtered_train, y = newsgroups_train.target, folds = 5)
    
    if (mu > mu_max):
        mu_max = mu
        min_samples_split_max = min_samples_split

    print("-----------------------------------------------------------------")

print("Best hyperparameter min_samples_splits:", min_samples_split_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   8.4s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   8.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   8.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   8.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   7.9s
Cross-validation scores: [0.47105612 0.45558992 0.48342908 0.48298719 0.47126437]
Cross-validation mean score: 0.4728653348977301
CPU times: user 46.6 s, sys: 0 ns, total: 46.6 s
Wall time: 46.7 s
-----------------------------------------------------------------
[Pipeline] ............. (s

Tuning `min_samples_leaf`:

In [0]:
min_samples_leafs = np.linspace(0.001, 0.01, 10, endpoint=True)

mu_max = 0
min_samples_leaf_max = min_samples_leafs[0]

for min_samples_leaf in min_samples_leafs:
    %time scores, mu = cross_validation(model = DecisionTreeClassifier(min_samples_leaf = min_samples_leaf, min_samples_split = 0.02), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        min_samples_leaf_max = min_samples_leaf

    print("-----------------------------------------------------------------")

print("Best hyperparameter min_samples_leaf:", min_samples_leaf_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   4.2s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   4.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   4.2s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   4.2s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   4.2s
Cross-validation scores: [0.43747238 0.43835616 0.44984534 0.44851966 0.45181256]
Cross-validation mean score: 0.44520122072958557
CPU times: user 26.8 s, sys: 23.7 ms, total: 26.8 s
Wall time: 26.9 s
-----------------------------------------------------------------
[Pipeline] ............
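When `min_samples_split` or `min_samples_leaf` is a float, scikit-learn treats it as a fraction of the training samples and uses `ceil(fraction * n_samples)` as the threshold. The short calculation below translates the fractions used later in the pipeline into sample counts for the full training set of 11314 documents. During cross-validation each fold trains on fewer documents, so the thresholds there are correspondingly smaller.

```python
import math

n_samples = 11314  # size of the training set loaded above

# Fractions chosen for the final decision tree pipeline.
min_split = math.ceil(0.02 * n_samples)   # samples needed to split a node
min_leaf = math.ceil(0.001 * n_samples)   # samples required in each leaf

print("min_samples_split ->", min_split)
print("min_samples_leaf  ->", min_leaf)
```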

#### 2.2.3. Building a pipeline for decision tree

A pipeline is built to make the model easier to work with:

In [44]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier(min_samples_leaf = 0.001, min_samples_split = 0.02)),
])

The model can now be trained much more simply:

In [45]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
              

#### 2.2.4. Evaluation of the performance of decision tree on the test set

Predicting the target values of the test set loaded and filtered above:

In [46]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.4114445034519384


In [47]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.17      0.24      0.20       319
           comp.graphics       0.35      0.45      0.39       389
 comp.os.ms-windows.misc       0.53      0.41      0.46       394
comp.sys.ibm.pc.hardware       0.42      0.41      0.41       392
   comp.sys.mac.hardware       0.60      0.37      0.46       385
          comp.windows.x       0.48      0.39      0.43       395
            misc.forsale       0.64      0.55      0.59       390
               rec.autos       0.45      0.42      0.44       396
         rec.motorcycles       0.64      0.45      0.53       398
      rec.sport.baseball       0.19      0.42      0.26       397
        rec.sport.hockey       0.56      0.62      0.59       399
               sci.crypt       0.68      0.40      0.50       396
         sci.electronics       0.23      0.42      0.30       393
                 sci.med       0.40      0.35      0.37       396
         

### 2.3. Support vector machine (SVM)

In [48]:
from sklearn.svm import LinearSVC

#### 2.3.1. Training the SVM classifier

In [49]:
clf = LinearSVC().fit(X_train_tfidf, newsgroups_train.target)

The model is tested on a few custom documents to check whether it predicts sensible categories:

In [50]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

# Filtering:
docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => rec.autos
'Korea Russia Iran ' => talk.politics.misc


#### 2.3.2. Hyperparameter tuning for SVM

Tuning the penalty parameter $C$:

In [0]:
C_par = [0.25, 0.5, 0.75, 1, 1.25, 1.5]

mu_max = 0
C_max = C_par[0]

for c in C_par:
    %time scores, mu = cross_validation(model = LinearSVC(C = c), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        C_max = c

    print("-----------------------------------------------------------------")

print("Best hyperparameter C:", C_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.7s
Cross-validation scores: [0.7410517  0.75740168 0.75695979 0.75475033 0.75906278]
Cross-validation mean score: 0.7538452552166419
CPU times: user 9.7 s, sys: 49 ms, total: 9.75 s
Wall time: 9.79 s
-----------------------------------------------------------------
[Pipeline] ............. (s

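Tuning the regularization penalty (with `loss='squared_hinge'` and `dual=True`, `penalty='l1'` is not supported, so that run fails and scores `nan`):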

In [0]:
penalties = ['l1', 'l2']

mu_max = 0
penalty_best = penalties[0]

for p in penalties:
    %time scores, mu = cross_validation(model = LinearSVC(C = 0.5, penalty = p), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        penalty_best = p

    print("-----------------------------------------------------------------")

print("Best loss function:", penalty_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s


ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s


ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s


ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s


ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
Cross-validation scores: [nan nan nan nan nan]
Cross-validation mean score: nan
CPU times: user 4.99 s, sys: 38.1 ms, total: 5.03 s
Wall time: 5.05 s
-----------------------------------------------------------------


ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True



[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.8s
Cross-validation scores: [0.74458683 0.76182059 0.75784357 0.75342466 0.75994695]
Cross-validation mean score: 0.7555245202783564
CPU times: user 10.4 s, sys: 37.8 ms, total: 10.4 s
Wall time: 10.5 s
-----------------------------------------------------------------
Best loss function: l2
B
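The `l1` run fails only because of the `dual=True` setting reported in the error message. Solving the primal problem with `dual=False` is a supported combination for `penalty='l1'` with the squared hinge loss. A minimal sketch on synthetic data, with `max_iter=5000` as an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data; not the newsgroups features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# penalty='l1' requires the primal formulation (dual=False).
l1_svm = LinearSVC(C=0.5, penalty="l1", dual=False, max_iter=5000)
l1_scores = cross_val_score(l1_svm, X, y, cv=5, scoring="accuracy")

print("Cross-validation scores:", l1_scores)
print("Cross-validation mean score:", l1_scores.mean())
```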

#### 2.3.3. Building a pipeline for SVM

A pipeline is built to make the model easier to work with:

In [51]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C = 0.5, penalty = 'l2')),
])

The model can now be trained much more simply:

In [52]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LinearSVC(C=0.5, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
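
As a rough illustration of what the pipeline saves, the sketch below fits the same three steps once by hand and once through a `Pipeline`, then checks that both routes give the same predictions. The toy corpus, the labels, and the names `toy_docs`, `toy_vect`, `toy_tfidf`, and `toy_pipeline` are made up for this example and are not part of the notebook; `random_state = 0` is set only to keep the two fits comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the engine and the brakes of this car",
    "car dealer sold the old engine",
    "the rocket reached orbit around the moon",
    "nasa launched a rocket into orbit",
]
toy_labels = [0, 0, 1, 1]
new_docs = ["a car with a new engine", "orbit of the moon"]

# Manual route: each step is fitted and applied by hand.
toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_counts = toy_vect.fit_transform(toy_docs)
X_tfidf = toy_tfidf.fit_transform(X_counts)
manual_clf = LinearSVC(C = 0.5, random_state = 0).fit(X_tfidf, toy_labels)
manual_pred = manual_clf.predict(toy_tfidf.transform(toy_vect.transform(new_docs)))

# Pipeline route: the same steps, chained behind a single fit/predict call.
toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C = 0.5, random_state = 0)),
])
toy_pipeline.fit(toy_docs, toy_labels)
pipeline_pred = toy_pipeline.predict(new_docs)

print("Same predictions:", np.array_equal(manual_pred, pipeline_pred))
```

The pipeline also guarantees that new documents pass through exactly the vectorizer and transformer that were fitted on the training data, which is easy to get wrong when the steps are applied by hand.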
                     

#### 2.3.4. Evaluation of the performance of SVM on the test set

Predicting the target values of the test dataset:

In [53]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.6891927774827403
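
The accuracy above is simply the fraction of test documents whose predicted label equals the true label. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical arrays, not notebook variables) shows how `np.mean` over a boolean comparison produces that fraction:

```python
import numpy as np

# Hypothetical true and predicted class indices for five documents.
y_true = np.array([7, 4, 4, 1, 14])
y_pred = np.array([7, 4, 1, 1, 16])

# The comparison yields a boolean array; its mean is the share of matches.
matches = (y_pred == y_true)
accuracy = np.mean(matches)

print(matches)   # three of the five entries are True
print(accuracy)  # 3 / 5 = 0.6
```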


In [54]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.53      0.46      0.49       319
           comp.graphics       0.64      0.68      0.66       389
 comp.os.ms-windows.misc       0.63      0.63      0.63       394
comp.sys.ibm.pc.hardware       0.66      0.65      0.65       392
   comp.sys.mac.hardware       0.74      0.71      0.72       385
          comp.windows.x       0.77      0.68      0.72       395
            misc.forsale       0.73      0.79      0.76       390
               rec.autos       0.79      0.70      0.75       396
         rec.motorcycles       0.78      0.77      0.78       398
      rec.sport.baseball       0.55      0.84      0.66       397
        rec.sport.hockey       0.89      0.87      0.88       399
               sci.crypt       0.81      0.71      0.76       396
         sci.electronics       0.61      0.58      0.59       393
                 sci.med       0.79      0.77      0.78       396
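
To help read the report, the sketch below shows how precision, recall, and F1 are obtained from counts of true positives, false positives, and false negatives. The counts are invented for illustration; the final check recombines the rounded precision and recall reported above for `rec.sport.baseball`, so it can only reproduce the reported F1 up to rounding.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)   # share of predicted members that are correct
    recall = tp / (tp + fn)      # share of actual members that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for a single class.
p, r, f = precision_recall_f1(tp = 80, fp = 20, fn = 40)
print(round(p, 2), round(r, 2), round(f, 2))

# Recombining the rounded values reported for rec.sport.baseball.
f1_baseball = 2 * 0.55 * 0.84 / (0.55 + 0.84)
print(round(f1_baseball, 2))
```

A class with low precision and high recall, such as `rec.sport.baseball` here, is one the classifier predicts too often: most of its true members are found, but many documents from other classes are assigned to it as well.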
         

### 2.4. AdaBoost classifier

In [55]:
from sklearn.ensemble import AdaBoostClassifier

#### 2.4.1. Training the AdaBoost classifier

In [56]:
clf = AdaBoostClassifier().fit(X_train_tfidf, newsgroups_train.target)

The model is tested on a few custom documents to check whether it predicts plausible categories for them:

In [57]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

# Filtering:
docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => sci.electronics
'Korea Russia Iran ' => sci.electronics


#### 2.4.2. Hyperparameter tuning for AdaBoost

First we tune the maximum number of estimators:

In [0]:
ns = [50, 100, 150, 200]

mu_max = 0
n_best = ns[0]

for n in ns:
    %time scores, mu = cross_validation(model = AdaBoostClassifier(n_estimators = n), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        n_best = n

    print("-----------------------------------------------------------------")

print("Best hyperparameter n_estimators:", n_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   6.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   6.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   6.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   6.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   6.7s
Cross-validation scores: [0.3804684  0.38709677 0.39726027 0.39946973 0.39434129]
Cross-validation mean score: 0.39172729485558044
CPU times: user 39.8 s, sys: 388 ms, total: 40.2 s
Wall time: 40.2 s
-----------------------------------------------------------------
[Pipeline] .............

Tuning the learning rate:

In [0]:
lrs = [0.01, 0.1, 0.5, 1, 1.5, 10]

mu_max = 0
lr_best = lrs[0]

for lr in lrs:
    %time scores, mu = cross_validation(model = AdaBoostClassifier(n_estimators = 150, learning_rate = lr), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        lr_best = lr

    print("-----------------------------------------------------------------")

print("Best learning rate:", lr_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  20.6s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  20.8s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  20.4s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  20.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  20.5s
Cross-validation scores: [0.27662395 0.29518338 0.27618206 0.28192665 0.28116711]
Cross-validation mean score: 0.2822166298814629
CPU times: user 1min 49s, sys: 1.07 s, total: 1min 51s
Wall time: 1min 51s
-----------------------------------------------------------------
[Pipeline] ........
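
The two loops above tune `n_estimators` first and then `learning_rate` with `n_estimators` held fixed. As an alternative sketch (not the `cross_validation` helper used in this notebook), scikit-learn's `GridSearchCV` can search both hyperparameters jointly. The synthetic data from `make_classification` and the deliberately small grid are assumptions chosen so that the example runs quickly; they are not the newsgroups data or the grids used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem, standing in for the real features.
X_toy, y_toy = make_classification(n_samples = 200, n_features = 20,
                                   n_informative = 5, n_classes = 3,
                                   random_state = 0)

# Every combination of the two hyperparameters is cross-validated.
param_grid = {
    'n_estimators': [10, 20],
    'learning_rate': [0.5, 1.0],
}

search = GridSearchCV(AdaBoostClassifier(random_state = 0), param_grid,
                      cv = 5, scoring = 'accuracy')
search.fit(X_toy, y_toy)

print("Best hyperparameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

A joint search costs more fits than two sequential loops, but it can find combinations that sequential tuning misses, since a lower learning rate often needs more estimators to reach the same accuracy.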

#### 2.4.3. Building a pipeline for AdaBoost

A pipeline is built to make the model easier to work with:

In [58]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', AdaBoostClassifier(n_estimators = 150, learning_rate = 0.5)),
])

The model can now be trained much more simply:

In [59]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                                    learning_rate=0.5, n_estimators=150,
  

#### 2.4.4. Evaluation of the performance of AdaBoost on the test set

Predicting the target values of the test dataset:

In [60]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.4486192246415295


In [61]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.39      0.29      0.33       319
           comp.graphics       0.52      0.41      0.46       389
 comp.os.ms-windows.misc       0.62      0.46      0.53       394
comp.sys.ibm.pc.hardware       0.45      0.46      0.46       392
   comp.sys.mac.hardware       0.78      0.41      0.54       385
          comp.windows.x       0.73      0.46      0.57       395
            misc.forsale       0.72      0.54      0.62       390
               rec.autos       0.58      0.43      0.50       396
         rec.motorcycles       0.91      0.46      0.61       398
      rec.sport.baseball       0.62      0.51      0.56       397
        rec.sport.hockey       0.88      0.51      0.65       399
               sci.crypt       0.81      0.45      0.58       396
         sci.electronics       0.11      0.71      0.19       393
                 sci.med       0.83      0.35      0.49       396
         

### 2.5. Random forest classifier

In [62]:
from sklearn.ensemble import RandomForestClassifier

#### 2.5.1. Training the random forest classifier

In [63]:
clf = RandomForestClassifier().fit(X_train_tfidf, newsgroups_train.target)

The model is tested on a few custom documents to check whether it predicts plausible categories for them:

In [64]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

# Filtering:
docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => rec.autos
'Korea Russia Iran ' => rec.autos


#### 2.5.2. Hyperparameter tuning for random forest

First we tune the maximum number of estimators:

In [0]:
ns = [50, 100, 150, 200]

mu_max = 0
n_best = ns[0]

for n in ns:
    %time scores, mu = cross_validation(model = RandomForestClassifier(n_estimators = n), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        n_best = n

    print("-----------------------------------------------------------------")

print("Best hyperparameter n_estimators:", n_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  16.5s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   0.9s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  16.3s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  16.2s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  16.3s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  16.0s
Cross-validation scores: [0.6243924  0.63543968 0.63499779 0.64604507 0.63881521]
Cross-validation mean score: 0.6359380305088627
CPU times: user 1min 27s, sys: 140 ms, total: 1min 27s
Wall time: 1min 28s
-----------------------------------------------------------------
[Pipeline] ........

Increasing the number of estimators lengthens training, while accuracy improves little beyond 150 estimators, so `n_estimators = 150` was chosen.

Next we tune `min_samples_split`:

In [0]:
min_samples_splits = np.linspace(0.001, 0.01, 10, endpoint=True)

mu_max = 0
min_samples_split_max = min_samples_splits[0]

for min_samples_split in min_samples_splits:
    %time scores, mu = cross_validation(model = RandomForestClassifier(n_estimators = 150, min_samples_split = min_samples_split), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        min_samples_split_max = min_samples_split

    print("-----------------------------------------------------------------")

print("Best hyperparameter min_samples_splits:", min_samples_split_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  22.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  23.0s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  22.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  23.0s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  23.3s
Cross-validation scores: [0.6548829  0.65841803 0.65753425 0.66106938 0.66489832]
Cross-validation mean score: 0.6593605743102139
CPU times: user 2min 2s, sys: 534 ms, total: 2min 3s
Wall time: 2min 3s
-----------------------------------------------------------------
[Pipeline] ...........

Tuning `min_samples_leaf`:

In [0]:
min_samples_leafs = np.linspace(0.001, 0.01, 10, endpoint=True)

mu_max = 0
min_samples_leaf_max = min_samples_leafs[0]

for min_samples_leaf in min_samples_leafs:
    %time scores, mu = cross_validation(model = RandomForestClassifier(n_estimators = 150, min_samples_split = 0.002, min_samples_leaf = min_samples_leaf), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        min_samples_leaf_max = min_samples_leaf

    print("-----------------------------------------------------------------")

print("Best hyperparameter min_samples_leaf:", min_samples_leaf_max)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   3.9s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   3.6s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   3.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   3.7s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   3.8s
Cross-validation scores: [0.60848431 0.60362351 0.61157755 0.6296951  0.61892131]
Cross-validation mean score: 0.6144603553962507
CPU times: user 25.7 s, sys: 43.5 ms, total: 25.7 s
Wall time: 25.8 s
-----------------------------------------------------------------
[Pipeline] .............
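
Both `min_samples_split` and `min_samples_leaf` are passed as floats above. According to the scikit-learn documentation, a float is read as a fraction of the training samples and converted to a count with `ceil(fraction * n_samples)`. The sketch below applies that rule to the full training set of 11314 documents; `effective_min_samples` is a hypothetical helper written for this illustration, and during 5-fold cross-validation the counts are somewhat smaller because each fit sees only part of the data.

```python
import math

n_samples = 11314  # size of the training set reported at the start of the notebook

def effective_min_samples(fraction, n_samples):
    """Convert a fractional min_samples_* value into a sample count."""
    return math.ceil(fraction * n_samples)

# Endpoints of the grid used above, plus the value 0.002 fixed for min_samples_split.
for fraction in [0.001, 0.002, 0.01]:
    print(fraction, "->", effective_min_samples(fraction, n_samples), "samples")
```

Seen this way, the grid covers a range of roughly a dozen to just over a hundred samples per split or leaf, which is far more restrictive than the integer defaults of 2 and 1.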

#### 2.5.3. Building a pipeline for random forest

A pipeline is built to make the model easier to work with:

In [65]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators = 150, min_samples_split = 0.002, min_samples_leaf = 0.001)),
])

The model can now be trained much more simply:

In [66]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                

#### 2.5.4. Evaluation of the performance on the test set

In [67]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.5872278279341476


In [68]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.53      0.27      0.35       319
           comp.graphics       0.54      0.56      0.55       389
 comp.os.ms-windows.misc       0.55      0.59      0.57       394
comp.sys.ibm.pc.hardware       0.56      0.56      0.56       392
   comp.sys.mac.hardware       0.70      0.57      0.63       385
          comp.windows.x       0.64      0.64      0.64       395
            misc.forsale       0.66      0.74      0.70       390
               rec.autos       0.62      0.63      0.63       396
         rec.motorcycles       0.32      0.73      0.44       398
      rec.sport.baseball       0.68      0.68      0.68       397
        rec.sport.hockey       0.74      0.83      0.78       399
               sci.crypt       0.74      0.64      0.69       396
         sci.electronics       0.48      0.40      0.43       393
                 sci.med       0.72      0.60      0.65       396
         

### 2.6. Multinomial Naive Bayes

The results from the previous sections are not very encouraging, except for the SVM classifier, which approaches 70% accuracy. Multinomial Naive Bayes was therefore tried next, as it is easy to implement and fast to train.

In [69]:
from sklearn.naive_bayes import MultinomialNB

#### 2.6.1. Training the multinomial naive Bayes classifier

In [70]:
clf = MultinomialNB().fit(X_train_tfidf, newsgroups_train.target)

In [71]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'Korea, Russia, Iran']

# Filtering:
docs_new = remove_punctuation(docs_new)
docs_new = remove_stop_words(docs_new)

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'God love ' => soc.religion.christian
'OpenGL GPU fast ' => comp.graphics
'Korea Russia Iran ' => talk.politics.mideast


#### 2.6.2. Hyperparameter tuning for multinomial naive Bayes

Tuning $\alpha$:

In [73]:
alphas = [0.001, 0.005, 0.01, 0.05, 0.1]

mu_max = 0
alpha_best = alphas[0]

for alpha in alphas:
    %time scores, mu = cross_validation(model = MultinomialNB(alpha = alpha), X = filtered_train, y = newsgroups_train.target, folds = 5)

    if (mu > mu_max):
        mu_max = mu
        alpha_best = alpha

    print("-----------------------------------------------------------------")

print("Best hyperparameter alpha:", alpha_best)
print("Best accuracy:", mu_max)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.6s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.3s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.3s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.5s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.1s
[Pipeline] ............. (step 1 of 2) Processing tfidf, total=   1.5s
[Pipeline] ............... (step 2 of 2) Processing clf, total=   0.1s
Cross-validation scores: [0.72425983 0.72470172 0.73884224 0.74900574 0.74535809]
Cross-validation mean score: 0.7364335270075285
CPU times: user 9.13 s, sys: 388 ms, total: 9.52 s
Wall time: 9.68 s
-----------------------------------------------------------------
[Pipeline] ............. 
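
In `MultinomialNB`, `alpha` is the additive (Laplace/Lidstone) smoothing parameter: the likelihood of feature $i$ in class $c$ is estimated as $(N_{ci} + \alpha) / (N_c + \alpha n)$, where $N_{ci}$ is the total weight of feature $i$ in class $c$, $N_c$ is the total weight of all features in that class, and $n$ is the vocabulary size. The sketch below applies this formula to made-up counts; `word_counts` and `smoothed_likelihood` are hypothetical names used only here.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary.
word_counts = np.array([3.0, 0.0, 1.0, 0.0])

def smoothed_likelihood(counts, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

for alpha in [0.001, 0.05, 1.0]:
    print(alpha, smoothed_likelihood(word_counts, alpha))
```

Words that never occur in a class still receive a non-zero probability, and that probability grows with `alpha`. Small values, such as those in the grid above, stay close to the raw frequencies, while `alpha = 1.0` pulls the estimates noticeably towards a uniform distribution.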

#### 2.6.3. Building a pipeline for multinomial naive Bayes

A pipeline is built to make the model easier to work with:

In [74]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha = 0.05)),
])


The model can now be trained much more simply:

In [75]:
text_clf.fit(filtered_train, newsgroups_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=0.05, class_prior=None, fit_prior=True))],
         verbose=False)

#### 2.6.4. Evaluation of the performance on the test set

In [76]:
predicted = text_clf.predict(filtered_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test.target))

Average accuracy: 0.694105151354222


In [77]:
print(metrics.classification_report(newsgroups_test.target, predicted,
                                    target_names = newsgroups_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.63      0.38      0.48       319
           comp.graphics       0.65      0.67      0.66       389
 comp.os.ms-windows.misc       0.69      0.56      0.62       394
comp.sys.ibm.pc.hardware       0.61      0.72      0.66       392
   comp.sys.mac.hardware       0.74      0.68      0.71       385
          comp.windows.x       0.78      0.73      0.76       395
            misc.forsale       0.83      0.74      0.78       390
               rec.autos       0.80      0.72      0.76       396
         rec.motorcycles       0.77      0.76      0.76       398
      rec.sport.baseball       0.91      0.83      0.87       397
        rec.sport.hockey       0.59      0.93      0.72       399
               sci.crypt       0.67      0.76      0.71       396
         sci.electronics       0.71      0.57      0.63       393
                 sci.med       0.85      0.77      0.81       396
         

## 3. Using normalized data

The accuracy of multinomial Naive Bayes is slightly better than that of SVM, but there is still room for improvement. In this section, multinomial Naive Bayes and SVM are applied to normalized data.

First the normalized dataset is loaded:

In [0]:
from sklearn.datasets import fetch_20newsgroups_vectorized

In [0]:
newsgroups_train_normalized = fetch_20newsgroups_vectorized(subset = 'train',
                                                            remove = ('headers', 'footers', 'quotes'),
                                                            normalize = True)
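
With `normalize = True`, scikit-learn documents that each document's feature vector is scaled to unit norm. The sketch below shows the idea on a tiny dense matrix, assuming the L2 norm (the default of `sklearn.preprocessing.normalize`); the `counts` matrix is invented for this example, whereas the real dataset is a large sparse matrix.

```python
import numpy as np

# Hypothetical term counts: 3 documents over a 3-word vocabulary.
counts = np.array([[3.0, 4.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [2.0, 2.0, 1.0]])

# Divide every row by its Euclidean length so that each document has norm 1.
row_norms = np.linalg.norm(counts, axis = 1, keepdims = True)
normalized = counts / row_norms

print(normalized)
print(np.linalg.norm(normalized, axis = 1))  # every row now has length 1
```

After this scaling, long and short documents contribute vectors of the same length, so only the relative proportions of the terms matter.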

### 3.1. Applying SVM to normalized data



#### 3.1.1. Training the SVM classifier on normalized data

In [0]:
clf = LinearSVC(C = 0.5).fit(newsgroups_train_normalized.data, newsgroups_train_normalized.target)

#### 3.1.2. Building a pipeline for SVM with normalized data

A pipeline is built to make the model easier to work with:

In [0]:
text_clf = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C = 0.5)),
])

The model can now be trained much more simply:

In [0]:
text_clf.fit(newsgroups_train_normalized.data, newsgroups_train_normalized.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LinearSVC(C=0.5, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)

#### 3.1.3. Evaluation of the performance of SVM on the normalized test set

Loading the test dataset and predicting its target values:

In [0]:
newsgroups_test_normalized = fetch_20newsgroups_vectorized(subset = 'test',
                                                remove = ('headers', 'footers', 'quotes'),
                                                normalize = True)
docs_test = newsgroups_test_normalized.data
predicted = text_clf.predict(docs_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test_normalized.target))

Average accuracy: 0.6980881571959638


In [0]:
print(metrics.classification_report(newsgroups_test_normalized.target, predicted,
                                    target_names = newsgroups_test_normalized.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.53      0.49      0.51       319
           comp.graphics       0.67      0.72      0.69       389
 comp.os.ms-windows.misc       0.63      0.62      0.63       394
comp.sys.ibm.pc.hardware       0.65      0.66      0.66       392
   comp.sys.mac.hardware       0.74      0.70      0.72       385
          comp.windows.x       0.82      0.71      0.76       395
            misc.forsale       0.77      0.80      0.78       390
               rec.autos       0.76      0.71      0.74       396
         rec.motorcycles       0.79      0.77      0.78       398
      rec.sport.baseball       0.55      0.83      0.66       397
        rec.sport.hockey       0.87      0.87      0.87       399
               sci.crypt       0.85      0.71      0.78       396
         sci.electronics       0.62      0.59      0.61       393
                 sci.med       0.79      0.77      0.78       396
         

### 3.2. Applying multinomial Naive Bayes to normalized data

#### 3.2.1. Training the multinomial Naive Bayes classifier on normalized data

In [0]:
clf = MultinomialNB(alpha = 0.01).fit(newsgroups_train_normalized.data, newsgroups_train_normalized.target)

#### 3.2.2. Building a pipeline for multinomial Naive Bayes with normalized data

A pipeline is built to make the model easier to work with:

In [0]:
text_clf = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha = 0.01)),
])

The model can now be trained much more simply:

In [0]:
text_clf.fit(newsgroups_train_normalized.data, newsgroups_train_normalized.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))],
         verbose=False)

#### 3.2.3. Evaluation of the performance of multinomial Naive Bayes on the normalized test set

In [0]:
predicted = text_clf.predict(docs_test)
print("Average accuracy:", np.mean(predicted == newsgroups_test_normalized.target))

Average accuracy: 0.7002124269782263


In [0]:
print(metrics.classification_report(newsgroups_test_normalized.target, predicted,
                                    target_names = newsgroups_test_normalized.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.57      0.44      0.50       319
           comp.graphics       0.65      0.71      0.68       389
 comp.os.ms-windows.misc       0.72      0.52      0.60       394
comp.sys.ibm.pc.hardware       0.59      0.71      0.65       392
   comp.sys.mac.hardware       0.72      0.70      0.71       385
          comp.windows.x       0.81      0.74      0.78       395
            misc.forsale       0.83      0.72      0.77       390
               rec.autos       0.75      0.73      0.74       396
         rec.motorcycles       0.76      0.73      0.75       398
      rec.sport.baseball       0.93      0.81      0.87       397
        rec.sport.hockey       0.60      0.93      0.73       399
               sci.crypt       0.72      0.75      0.74       396
         sci.electronics       0.72      0.58      0.64       393
                 sci.med       0.82      0.78      0.80       396
         

### 3.3. Normalization results

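Normalization has only a marginal effect on accuracy in this case: on the test set, SVM moves from about 0.689 to 0.698 and multinomial Naive Bayes from about 0.694 to 0.700.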
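
The size of the effect can be read off the test accuracies printed in sections 2.3.4, 2.6.4, 3.1.3, and 3.2.3. The short sketch below only restates those four numbers and computes the differences; the dictionary layout and key names are assumptions made for this illustration.

```python
# Test accuracies copied from the cell outputs above.
accuracies = {
    "SVM": {"section_2": 0.6891927774827403, "normalized": 0.6980881571959638},
    "Multinomial NB": {"section_2": 0.694105151354222, "normalized": 0.7002124269782263},
}

gains = {}
for name, acc in accuracies.items():
    gains[name] = acc["normalized"] - acc["section_2"]
    print(f"{name}: {acc['section_2']:.4f} -> {acc['normalized']:.4f} "
          f"(+{gains[name] * 100:.2f} percentage points)")
```

Both gains are below one percentage point, and the multinomial Naive Bayes runs also differ in `alpha` (0.05 versus 0.01), so the difference cannot be attributed to normalization alone.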