# Overview

This notebook supports a study of classifier methods used in contemporary machine learning. The experiment provides Python implementations as a basis for comparing performance on the same dataset with the same evaluation criteria. It is by no means exhaustive, but it illustrates the differences between the methods and indicates which perform better on different types of data.

# Data

The data used for this experiment is the 20 newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics. Like the MNIST dataset, it comes cleansed and labeled, and it can be loaded through scikit-learn. A vectorized version is also available, with features that are ready to use with classifiers.

TODO - add some background and samples for the fetch20

In [1]:
from sklearn.datasets import fetch_20newsgroups_vectorized
data = fetch_20newsgroups_vectorized(subset='all')

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
newsdata = data
X = newsdata.data
y = newsdata.target

In [4]:
X.shape

(18846, 130107)

In [5]:
y.shape

(18846,)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.1)
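
The `stratify=y` argument asks `train_test_split` to keep each topic's share roughly the same in the training and test sets. The block below is a minimal standard-library sketch of that idea, not scikit-learn's implementation. `stratified_split` and the toy `labels` list are hypothetical names used only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx), sampling every class at the same rate."""
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of each class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with a 50/30/20 class balance (made up for this example).
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.1)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(test_counts.items()))
```

With a 10% test share, each class contributes a tenth of its samples, so the test set keeps the original 50/30/20 balance.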

In [7]:
y_train

array([19,  7, 11, ..., 19, 12, 18])

In [8]:
y_train[1]

7

In [9]:
X_train[:,1]

<16961x1 sparse matrix of type '<class 'numpy.float64'>'
	with 625 stored elements in Compressed Sparse Row format>
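
The output above reports Compressed Sparse Row (CSR) storage, which keeps only the non-zero entries of the matrix. As a rough illustration of that layout, the hand-written sketch below converts a small dense matrix into the three CSR arrays. It is not SciPy's code, and `to_csr` and `column_nnz` are hypothetical helpers.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def column_nnz(indices, col):
    """Count the stored (non-zero) entries in one column."""
    return sum(1 for c in indices if c == col)


# A tiny made-up matrix: 3 documents, 4 features.
dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.3, 0.4, 0.0, 0.0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
print(column_nnz(indices, 1))
```

Read the same way, the notebook's output of 625 stored elements in 16961 rows means that roughly 3.7% of the training posts have a non-zero value for feature 1.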

# Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression
logregress = LogisticRegression(solver='lbfgs')

In [11]:
logregress.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
logregress_predictions = logregress.predict(X_test)

In [13]:
logregress_score = logregress.score(X_test, y_test)
print(logregress_score)

0.820689655172
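
For classifiers, `score` reports mean accuracy: the fraction of test samples whose predicted label equals the true label. The block below is a minimal sketch of that calculation. The `accuracy` helper and the toy labels are made up for illustration and are not the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions are right.
toy_true = [0, 1, 2, 2, 1, 0, 1, 2]
toy_pred = [0, 1, 2, 1, 1, 0, 2, 2]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```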


In [14]:
logregress2 = LogisticRegression(solver='sag')

In [15]:
logregress2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False)

In [16]:
logregress_predictions2 = logregress2.predict(X_test)

In [17]:
logregress_score2 = logregress2.score(X_test, y_test)
print(logregress_score2)

0.820689655172


In [18]:
# Additional performance measures
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test, logregress_predictions))

             precision    recall  f1-score   support

          0       0.86      0.85      0.86        80
          1       0.71      0.81      0.76        97
          2       0.81      0.80      0.80        99
          3       0.76      0.66      0.71        98
          4       0.86      0.78      0.82        96
          5       0.75      0.80      0.77        99
          6       0.72      0.86      0.78        98
          7       0.82      0.93      0.87        99
          8       0.91      0.86      0.89       100
          9       0.84      0.88      0.86        99
         10       0.91      0.92      0.92       100
         11       0.90      0.85      0.88        99
         12       0.83      0.84      0.83        98
         13       0.78      0.80      0.79        99
         14       0.87      0.89      0.88        99
         15       0.77      0.88      0.82       100
         16       0.82      0.88      0.85        91
         17       0.86      0.87      0.87   
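
Each row of the report can be derived from a confusion matrix. Precision divides the diagonal entry by its column sum, recall divides it by its row sum, and F1 is the harmonic mean of the two. The block below is a small sketch on a made-up 3-class matrix, and `per_class_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class.

    cm[i][j] counts samples whose true class is i and predicted class is j,
    which is the orientation scikit-learn's confusion_matrix uses.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy confusion matrix (rows: true class, columns: predicted class).
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
toy_metrics = per_class_metrics(toy_cm)
for label, (p, r, f) in enumerate(toy_metrics):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same arithmetic agrees with the report: the first row of the logistic regression confusion matrix has 68 on the diagonal and sums to 80, so recall for class 0 is 68/80 = 0.85.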

In [19]:
# Confusion matrix: higher values along the diagonal are better
print("Confusion matrix:\n%s" % confusion_matrix(y_test, logregress_predictions))

Confusion matrix:
[[68  0  0  0  0  0  0  0  0  1  0  1  0  1  0  4  0  2  0  3]
 [ 0 79  3  1  0  6  2  0  0  0  0  0  1  1  1  1  0  1  0  1]
 [ 0  5 79  4  1  5  2  0  1  0  0  0  1  0  1  0  0  0  0  0]
 [ 0  4  8 65  4  1  4  1  1  0  1  1  3  2  1  0  0  1  1  0]
 [ 0  3  1  8 75  1  3  0  0  0  0  0  2  0  0  1  1  1  0  0]
 [ 0  6  7  3  1 79  0  1  1  1  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  1  1  1 84  3  0  2  1  2  1  1  0  1  0  0  0  0]
 [ 0  1  0  0  0  0  2 92  0  0  0  0  1  2  1  0  0  0  0  0]
 [ 0  0  0  0  1  0  3  6 86  1  0  0  0  0  1  0  1  0  1  0]
 [ 0  1  0  1  0  1  2  0  0 87  4  0  0  3  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  4  0  0  3 92  0  0  0  0  0  0  0  1  0]
 [ 0  2  0  0  1  4  2  0  1  2  1 84  0  0  0  0  1  1  0  0]
 [ 0  5  0  2  1  1  2  4  0  0  0  0 82  0  1  0  0  0  0  0]
 [ 0  1  0  0  0  0  3  3  2  1  0  1  1 79  3  1  1  2  1  0]
 [ 1  1  0  0  1  2  2  0  0  0  0  0  0  3 88  1  0  0  0  0]
 [ 1  3  0  0  1  3  0  0  0  0  1  0
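
Beyond the diagonal, the largest off-diagonal cells show which pairs of classes the model confuses most often. The sketch below ranks those cells for a made-up 4-class matrix. `top_confusions` is a hypothetical helper, and the counts are not taken from the notebook.

```python
def top_confusions(cm, k=3):
    """Return the k largest off-diagonal cells as (count, true, predicted)."""
    cells = [
        (count, true_label, pred_label)
        for true_label, row in enumerate(cm)
        for pred_label, count in enumerate(row)
        if true_label != pred_label and count > 0
    ]
    # Sort by count descending; ties are broken by (true, predicted) ascending.
    cells.sort(key=lambda c: (-c[0], c[1], c[2]))
    return cells[:k]


# Toy confusion matrix (rows: true class, columns: predicted class).
toy_cm = [
    [50, 2, 0, 1],
    [3, 40, 8, 0],
    [0, 6, 45, 2],
    [1, 0, 2, 47],
]
worst = top_confusions(toy_cm)
for count, true_label, pred_label in worst:
    print(f"true {true_label} predicted as {pred_label}: {count} times")
```

Among the rows visible in the matrix above, the largest off-diagonal count is 8, which appears twice: class 3 predicted as class 2, and class 4 predicted as class 3.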

# Stochastic Gradient Descent

In [20]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)
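
The parameters echoed above show `loss='hinge'` with `penalty='l2'`, so this `SGDClassifier` is fitting a linear SVM-style model by updating the weights one sample at a time. The sketch below shows a single update for the binary case with labels -1 and +1. It is a simplification: it uses a fixed learning rate `eta` instead of scikit-learn's `'optimal'` schedule, and `hinge_loss` and `sgd_step` are hypothetical names.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss for one sample; y must be -1 or +1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def sgd_step(w, b, x, y, eta=0.1, alpha=0.0001):
    """One SGD update on hinge loss plus an L2 penalty (alpha / 2) * ||w||^2."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # The L2 penalty always shrinks the weights slightly.
    new_w = [wi - eta * alpha * wi for wi in w]
    new_b = b
    if y * score < 1.0:
        # Margin violated: move the boundary toward classifying x correctly.
        new_w = [wi + eta * y * xi for wi, xi in zip(new_w, x)]
        new_b = b + eta * y
    return new_w, new_b


# Start from zero weights and take one step on a single positive sample.
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
loss_before = hinge_loss(w, b, x, y)
w, b = sgd_step(w, b, x, y)
loss_after = hinge_loss(w, b, x, y)
print(loss_before, loss_after)
```

After one step the loss on this sample falls from 1.0 to about 0.4. For the 20-class problem here, scikit-learn combines binary classifiers of this kind in a one-versus-rest scheme.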

In [21]:
sgd_predictions = sgd_clf.predict(X_test)

In [22]:
sgd_score = sgd_clf.score(X_test, y_test)
print(sgd_score)

0.872148541114


In [23]:
print(classification_report(y_test, sgd_predictions))

             precision    recall  f1-score   support

          0       0.87      0.85      0.86        80
          1       0.76      0.91      0.83        97
          2       0.82      0.87      0.84        99
          3       0.83      0.73      0.78        98
          4       0.88      0.84      0.86        96
          5       0.95      0.78      0.86        99
          6       0.77      0.91      0.84        98
          7       0.91      0.97      0.94        99
          8       0.97      0.92      0.94       100
          9       0.87      0.98      0.92        99
         10       0.98      0.98      0.98       100
         11       0.95      0.90      0.92        99
         12       0.92      0.82      0.86        98
         13       0.89      0.90      0.89        99
         14       0.89      0.94      0.92        99
         15       0.80      0.89      0.84       100
         16       0.89      0.92      0.91        91
         17       0.78      0.97      0.87   

In [24]:
print("Confusion matrix:\n%s" % confusion_matrix(y_test, sgd_predictions))

Confusion matrix:
[[68  0  0  0  0  0  0  0  0  1  0  1  0  0  0  3  0  2  0  5]
 [ 0 88  2  1  0  2  2  0  0  1  0  0  0  0  0  0  0  1  0  0]
 [ 0  5 86  2  2  1  2  0  0  0  0  0  0  0  1  0  0  0  0  0]
 [ 1  4  5 72  5  0  5  0  1  0  0  0  2  0  2  0  0  1  0  0]
 [ 0  3  2  4 81  0  3  0  0  0  0  0  1  0  0  0  0  2  0  0]
 [ 0  4  8  4  1 77  0  0  0  2  0  0  0  0  1  0  0  2  0  0]
 [ 0  0  1  2  0  0 89  1  0  1  0  1  2  0  1  0  0  0  0  0]
 [ 0  2  0  0  0  0  0 96  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  2  4 92  2  0  0  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0  0 97  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  1  0  0  1 98  0  0  0  0  0  0  0  0  0]
 [ 0  2  0  0  0  0  3  1  1  0  0 89  0  1  0  1  0  1  0  0]
 [ 0  3  0  2  2  0  4  3  0  0  0  1 80  1  2  0  0  0  0  0]
 [ 0  0  0  0  0  0  1  1  0  1  0  0  0 89  1  1  0  4  1  0]
 [ 1  2  0  0  1  0  1  0  0  0  0  0  0  1 93  0  0  0  0  0]
 [ 3  2  0  0  0  1  1  0  0  1  0  0

# Support Vector Machine

In [25]:
from sklearn.svm import LinearSVC

scalable_vec_machine = LinearSVC(C=1, loss="hinge")
scalable_vec_machine.fit(X_train, y_train)

LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)

In [28]:
svm_predictions = scalable_vec_machine.predict(X_test)

In [27]:
svm_score = scalable_vec_machine.score(X_test, y_test)
print(svm_score)

0.893899204244
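
With all three accuracy scores available, a side-by-side view makes the comparison easier to read. The sketch below only reuses numbers printed in this notebook. The test-set size of 1885 is an assumption derived from the 18846 total rows and the 16961 training rows shown earlier, and the dictionary layout is purely illustrative.

```python
# Assumed test-set size: all rows minus the training rows reported earlier.
n_test = 18846 - 16961

# Accuracy values copied from the notebook's printed output.
scores = {
    "LogisticRegression": 0.820689655172,
    "SGDClassifier": 0.872148541114,
    "LinearSVC": 0.893899204244,
}

# Convert each accuracy into an approximate count of correctly labeled posts.
correct = {name: round(acc * n_test) for name, acc in scores.items()}

# Rank the models from most to least accurate.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.4f} ({correct[name]} of {n_test} posts)")
```

Under that assumption, `LinearSVC` labels 138 more test posts correctly than logistic regression. These figures come from a single 90/10 split, so small gaps should be read with caution.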


In [29]:
print(classification_report(y_test, svm_predictions))

             precision    recall  f1-score   support

          0       0.84      0.88      0.86        80
          1       0.81      0.88      0.84        97
          2       0.83      0.87      0.85        99
          3       0.81      0.81      0.81        98
          4       0.93      0.86      0.90        96
          5       0.92      0.88      0.90        99
          6       0.84      0.89      0.87        98
          7       0.92      0.98      0.95        99
          8       0.97      0.95      0.96       100
          9       0.92      0.99      0.95        99
         10       0.96      0.99      0.98       100
         11       0.96      0.92      0.94        99
         12       0.91      0.86      0.88        98
         13       0.93      0.89      0.91        99
         14       0.93      0.94      0.93        99
         15       0.78      0.92      0.84       100
         16       0.90      0.95      0.92        91
         17       0.91      0.96      0.93   

In [30]:
print("Confusion matrix:\n%s" % confusion_matrix(y_test, svm_predictions))

Confusion matrix:
[[70  0  0  0  0  0  0  0  0  1  0  1  0  0  0  4  0  0  0  4]
 [ 0 85  2  1  0  4  2  0  0  1  0  0  1  0  0  0  0  1  0  0]
 [ 0  5 86  4  1  1  1  0  0  0  0  0  0  0  1  0  0  0  0  0]
 [ 0  3  5 79  3  1  3  0  1  0  0  0  1  1  1  0  0  0  0  0]
 [ 0  3  2  4 83  0  0  0  0  0  1  0  1  0  0  0  0  2  0  0]
 [ 0  2  6  3  0 87  0  0  0  1  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  0  3  0  0 87  2  0  0  0  1  2  0  1  1  0  0  0  0]
 [ 0  1  0  1  0  0  0 97  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  2  2 95  1  0  0  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0  0 98  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  1 99  0  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  3  0  1  0  0 91  0  0  1  1  0  1  0  0]
 [ 0  2  1  2  1  1  2  3  0  0  0  0 84  2  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  2  2  0  1  1  0  0 88  1  1  0  2  1  0]
 [ 1  1  0  0  1  0  1  0  1  0  0  0  0  1 93  0  0  0  0  0]
 [ 3  1  0  0  0  1  0  0  0  0  1  0