# Assignment 1

Date: 03-10-2020 <br>
Nick Radunovic (s2072724) <br>
Cheyenne Heath (s1647865) <br>

The goal of this assignment is to categorize the 20 Newsgroups dataset using several classifiers. For each classifier, multiple feature representations are tested and compared.
After analyzing the results, it should be clear which classifier performs best and with which features.

In [1]:
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn import metrics

from sklearn.datasets import fetch_20newsgroups

First, the train and test splits of the 20 Newsgroups dataset are fetched.

In [2]:
# Download the train and test splits into 'twenty_train' and 'twenty_test'
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

# Print global statistics of the data
print('Train:', len(twenty_train.data))
print('Test:', len(twenty_test.data))
total = len(twenty_train.data) + len(twenty_test.data)
print('Ratio:', round(len(twenty_train.data)/total*100, 2), '(train) /' , round(len(twenty_test.data)/total*100, 2), '(test)')

Train: 11314
Test: 7532
Ratio: 60.03 (train) / 39.97 (test)


Second, the pipelines for each classifier are initialized.
The classifiers that will be compared are Naive Bayes, Support Vector Machine (SVM) and Random Forest.
Each classifier is initialized three times, each time with a different feature representation: raw term counts (CountVectorizer), term frequencies (TF) and TF-IDF. This gives 9 pipelines in total.
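
To make the difference between the three representations concrete, here is a minimal hand-written sketch on a made-up toy corpus. It assumes the scikit-learn defaults as we understand them: TF is the count vector scaled to unit L2 length, and TF-IDF uses the smoothed weight `ln((1 + n) / (1 + df)) + 1` before the same normalization. The corpus and the helper names `count_row` and `l2_normalize` are illustrative only and are not used elsewhere in this notebook.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from 20 Newsgroups)
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def count_row(doc):
    """Return the raw term counts of one document, in vocabulary order."""
    c = Counter(doc)
    return [c[t] for t in vocab]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# 1) Counts: one row per document, one column per vocabulary term
counts = [count_row(doc) for doc in tokenized]

# 2) TF: counts scaled to unit length, so document length matters less
tf = [l2_normalize(row) for row in counts]

# 3) TF-IDF: counts weighted by a smoothed inverse document frequency,
#    then scaled to unit length; terms found in fewer documents weigh more
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
for name, matrix in (("counts", counts), ("tf", tf), ("tfidf", tfidf)):
    print(name, [round(v, 2) for v in matrix[1]])
```

In the second document, "cat" and "mat" both occur once, but "mat" appears in fewer documents and therefore receives the larger TF-IDF weight.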

In [3]:
# Initialize pipelines for NB
NB_clf_counts = Pipeline([('vect', CountVectorizer()), 
                          ('clf', MultinomialNB()),
                         ])

NB_clf_tf = Pipeline([('vect', CountVectorizer()), 
                      ('tf', TfidfTransformer(use_idf=False)),
                      ('clf', MultinomialNB()),
                      ])

NB_clf_tfidf = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])

# Initialize pipelines for LSVM
LSVM_clf_counts = Pipeline([('vect', CountVectorizer()), 
                          ('clf', LinearSVC()),
                         ])

LSVM_clf_tf = Pipeline([('vect', CountVectorizer()), 
                      ('tf', TfidfTransformer(use_idf=False)),
                      ('clf', LinearSVC()),
                      ])

LSVM_clf_tfidf = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf', LinearSVC()),
                         ])

# Initialize pipelines for RF
RF_clf_counts = Pipeline([('vect', CountVectorizer()), 
                          ('clf', RandomForestClassifier()),
                         ])

RF_clf_tf = Pipeline([('vect', CountVectorizer()), 
                      ('tf', TfidfTransformer(use_idf=False)),
                      ('clf', RandomForestClassifier()),
                     ])

RF_clf_tfidf = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf', RandomForestClassifier()),
                        ])  

The third step is to fit the classifiers and predict the labels of the test documents.
Note that precision and recall can be understood in the following way, per class: <br>

__High precision, low recall__ = the class is predicted for very few documents, but most of those predictions are correct when compared to the actual labels. <br>
__Low precision, high recall__ = the class is predicted for many documents, so most documents of that class are found, but many of those predictions are incorrect when compared to the actual labels.
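
The cells below report the support-weighted average of the per-class scores. The following sketch computes these averages by hand on a small made-up label set. The function name `weighted_scores` and the example labels are hypothetical. It also illustrates why weighted recall always equals accuracy: each class's recall is weighted by its share of the documents, so the weighted sum reduces to the fraction of correct predictions.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true documents per class
    n = len(y_true)
    w_precision = 0.0
    w_recall = 0.0
    for c in labels:
        # True positives: documents of class c that were predicted as c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / support[c] if support[c] else 0.0
        weight = support[c] / n  # class share of the true labels
        w_precision += weight * precision
        w_recall += weight * recall
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return accuracy, w_precision, w_recall

# Made-up labels for illustration
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

accuracy, precision, recall = weighted_scores(y_true, y_pred)
print("accuracy:", round(accuracy, 2))
print("precision:", round(precision, 2))
print("recall:", round(recall, 2))
```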

Below, the evaluation metrics for Naive Bayes are computed for each of the three feature representations.

In [4]:
# Naive Bayes
print("Naive Bayes")

features = ["Counts:", "TF:", "TF-IDF:"]
naive_bayes = [NB_clf_counts, NB_clf_tf, NB_clf_tfidf]
for label, clf in zip(features, naive_bayes):
    # Fit on the train split, then evaluate on the held-out test split
    nb = clf.fit(twenty_train.data, twenty_train.target)
    predicted = nb.predict(twenty_test.data)
    avg_metrics = metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names, output_dict=True)['weighted avg']
    print(label)
    print("accuracy:", round(np.mean(predicted == twenty_test.target), 2))
    for metric in ("precision", "recall", "f1-score"):
        print("%s: %r" % (metric, round(avg_metrics[metric], 2)))
    print()

Naive Bayes
Counts:
accuracy: 0.77
precision: 0.76
recall: 0.77
f1-score: 0.75

TF:
accuracy: 0.71
precision: 0.79
recall: 0.71
f1-score: 0.69

TF-IDF:
accuracy: 0.77
precision: 0.82
recall: 0.77
f1-score: 0.77



In the same manner, the evaluation metrics for the SVM are computed for each of the three feature representations.

In [8]:
# Linear Support Vector Machine
print("Linear Support Vector Machine")

support_vector_machine = [LSVM_clf_counts, LSVM_clf_tf, LSVM_clf_tfidf]
for label, clf in zip(features, support_vector_machine):
    # Fit on the train split, then evaluate on the held-out test split
    svm = clf.fit(twenty_train.data, twenty_train.target)
    predicted = svm.predict(twenty_test.data)
    avg_metrics = metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names, output_dict=True)['weighted avg']
    print(label)
    print("accuracy:", round(np.mean(predicted == twenty_test.target), 2))
    for metric in ("precision", "recall", "f1-score"):
        print("%s: %r" % (metric, round(avg_metrics[metric], 2)))
    print()

Linear Support Vector Machine
Counts:
accuracy: 0.79
precision: 0.79
recall: 0.79
f1-score: 0.78

TF:
accuracy: 0.83
precision: 0.83
recall: 0.83
f1-score: 0.82

TF-IDF:
accuracy: 0.85
precision: 0.85
recall: 0.85
f1-score: 0.85



Lastly, the evaluation metrics for Random Forest are computed for each of the three feature representations.

In [6]:
# Random Forest
print("Random Forest")

random_forest = [RF_clf_counts, RF_clf_tf, RF_clf_tfidf]
for label, clf in zip(features, random_forest):
    # Fit on the train split, then evaluate on the held-out test split
    rf = clf.fit(twenty_train.data, twenty_train.target)
    predicted = rf.predict(twenty_test.data)
    avg_metrics = metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names, output_dict=True)['weighted avg']
    print(label)
    print("accuracy:", round(np.mean(predicted == twenty_test.target), 2))
    for metric in ("precision", "recall", "f1-score"):
        print("%s: %r" % (metric, round(avg_metrics[metric], 2)))
    print()

Random Forest
Counts:
accuracy: 0.77
precision: 0.78
recall: 0.77
f1-score: 0.76

TF:
accuracy: 0.75
precision: 0.76
recall: 0.75
f1-score: 0.75

TF-IDF:
accuracy: 0.76
precision: 0.77
recall: 0.76
f1-score: 0.76



The linear SVM was the most accurate classifier on the 20 Newsgroups dataset, reaching an accuracy of 85% with TF-IDF features. <br>
However, other parameters can also be tuned, which may increase the accuracy further. <br>
Below, the SVM is evaluated with different parameter values for CountVectorizer. Only the SVM is tested here, because the previous experiments showed that it gives the best categorization.

In [6]:
#Linear Support Vector Machine
print("Linear Support Vector Machine (CountVectorizer)\n")

LSVM_clf_counts_lowercase_true = Pipeline([('vect', CountVectorizer()), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_false = Pipeline([('vect', CountVectorizer(lowercase=False)), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_stop_words_english = Pipeline([('vect', CountVectorizer(stop_words='english')), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_analyzer_word_12 = Pipeline([('vect', CountVectorizer(ngram_range = (1,2))), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_analyzer_word_22 = Pipeline([('vect', CountVectorizer(ngram_range = (2,2))), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_analyzer_char_11 = Pipeline([('vect', CountVectorizer(ngram_range = (1,1), analyzer = 'char')), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_analyzer_char_12 = Pipeline([('vect', CountVectorizer(ngram_range = (1,2), analyzer = 'char')), 
                          ('clf', LinearSVC()),
                         ])
LSVM_clf_counts_lowercase_analyzer_char_22 = Pipeline([('vect', CountVectorizer(ngram_range = (2,2), analyzer = 'char')), 
                          ('clf', LinearSVC()),
                         ])

LSVM = [LSVM_clf_counts_lowercase_true, LSVM_clf_counts_lowercase_false, LSVM_clf_counts_lowercase_stop_words_english,
             LSVM_clf_counts_lowercase_analyzer_word_12, LSVM_clf_counts_lowercase_analyzer_word_22,
             LSVM_clf_counts_lowercase_analyzer_char_11, LSVM_clf_counts_lowercase_analyzer_char_12,
             LSVM_clf_counts_lowercase_analyzer_char_22]

LSVM_labels = ["Lowercase True", "Lowercase False", "Stop-words = 'english'", "analyzer = 'word', ngram_range = (1,2)",
               "analyzer = 'word', ngram_range = (2,2)", "analyzer = 'char', ngram_range = (1,1)", 
               "analyzer = 'char', ngram_range = (1,2)", "analyzer = 'char', ngram_range = (2,2)"]

for label, clf in zip(LSVM_labels, LSVM):
    # Fit on the train split, then evaluate on the held-out test split
    svm = clf.fit(twenty_train.data, twenty_train.target)
    predicted = svm.predict(twenty_test.data)
    avg_metrics = metrics.classification_report(twenty_test.target, predicted,
                                                target_names=twenty_test.target_names, output_dict=True)['weighted avg']
    print(label)
    print("accuracy:", round(np.mean(predicted == twenty_test.target), 2))
    for metric in ("precision", "recall", "f1-score"):
        print("%s: %r" % (metric, round(avg_metrics[metric], 2)))
    print()

Linear Support Vector Machine (CountVectorizer)

Lowercase True
accuracy: 0.79
precision: 0.79
recall: 0.79
f1-score: 0.78

Lowercase False
accuracy: 0.79
precision: 0.79
recall: 0.79
f1-score: 0.79

Stop-words = 'english'
accuracy: 0.8
precision: 0.8
recall: 0.8
f1-score: 0.8

analyzer = 'word', ngram_range = (1,2)
accuracy: 0.81
precision: 0.81
recall: 0.81
f1-score: 0.81

analyzer = 'word', ngram_range = (2,2)
accuracy: 0.75
precision: 0.75
recall: 0.75
f1-score: 0.75

analyzer = 'char', ngram_range = (1,1)
accuracy: 0.12
precision: 0.33
recall: 0.12
f1-score: 0.1

analyzer = 'char', ngram_range = (1,2)
accuracy: 0.6
precision: 0.62
recall: 0.6
f1-score: 0.6

analyzer = 'char', ngram_range = (2,2)
accuracy: 0.61
precision: 0.61
recall: 0.61
f1-score: 0.61
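
A plausible explanation for the very low score with single characters is that the feature space then contains only the distinct characters of the text, which carry little topical information. The sketch below shows what word and character n-grams look like for one made-up sentence. The helpers `word_ngrams` and `char_ngrams` are simplified stand-ins written for illustration. They are not the CountVectorizer implementation, which tokenizes differently.

```python
def word_ngrams(text, n_min, n_max):
    """All word n-grams with n_min <= n <= n_max, using whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min, n_max):
    """All character n-grams with n_min <= n <= n_max, spaces included."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

# Made-up sentence for illustration
sentence = "Space shuttle launch"

print(word_ngrams(sentence, 1, 2))            # unigrams and bigrams of words
print(char_ngrams(sentence, 2, 2)[:6])        # first few character bigrams
print(len(set(char_ngrams(sentence, 1, 1))))  # number of distinct characters
```

Word bigrams such as "space shuttle" keep some context, while single characters are shared by almost every document regardless of its topic.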



To find the parameter combination that leads to the best categorization, a grid search is used.
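
An exhaustive grid search fits one model per parameter combination and per cross-validation fold, so it helps to count the fits before starting. The sketch below does this for the same grid values and the same `cv=5` as in the cell that follows. The variable names `combinations`, `cv_folds` and `n_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid values as in the grid search cell of this notebook
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    'vect__lowercase': (True, False),
    'vect__stop_words': (None, 'english'),
    'vect__analyzer': ('word', 'char', 'char_wb'),
    'vect__max_features': (None, 10, 50, 100, 500, 1000, 5000, 10000),
}

# Every combination of one value per parameter
combinations = list(product(*parameters.values()))
cv_folds = 5

# One fit per combination and fold (a final refit on all data may come on top)
n_fits = len(combinations) * cv_folds

print("Combinations:", len(combinations))
print("Fits with %d-fold cross-validation: %d" % (cv_folds, n_fits))
```

Because each fit vectorizes the training folds again, a grid of this size can take a long time to run. As far as we know, `stop_words` only has an effect when `analyzer='word'`, so part of the grid is likely redundant.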

In [None]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    'vect__lowercase': (True, False),
    'vect__stop_words': (None, 'english'),
    'vect__analyzer': ('word', 'char', 'char_wb'),
    'vect__max_features': (None, 10, 50, 100, 500, 1000, 5000, 10000),
}

gs_clf = GridSearchCV(LSVM_clf_counts_lowercase_true, parameters, cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

print("Best mean score:", gs_clf.best_score_)

for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

The grid search reports the CountVectorizer parameter combination with the best mean cross-validation score.
To see whether the accuracy of the SVM on the test set is higher with these parameter values, we compute the accuracy of the classifier.

In [10]:
LSVM = Pipeline([('vect', CountVectorizer(ngram_range = (1,3), stop_words='english')), 
                          ('clf', LinearSVC()),
                         ])

svm = LSVM.fit(twenty_train.data, twenty_train.target)
predicted = svm.predict(twenty_test.data)
print("accuracy:", round(np.mean(predicted == twenty_test.target), 2))

accuracy: 0.83


We see that the accuracy with these parameter values (0.83) is higher than the accuracy obtained with the default values (0.79).
However, the accuracy of the classifier when using TF-IDF (0.85) is still the highest.