# `SVM` with `NB` features

This notebook shows how to use the NB-SVM classifier from [this paper](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf), which has become a classic (and competitive) baseline.

## toy dataset (binary)

The working example is borrowed from the [Stanford IR online text](https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).

| docID | text                                | label     |
|-------|-------------------------------------|-----------|
| 1     | chinese beijing chinese             | CHINA     |
| 2     | chinese chinese shanghai            | CHINA     |
| 3     | chinese macao                       | CHINA     |
| 4     | tokyo japan chinese                 | NOT_CHINA |
| 5     | chinese chinese chinese tokyo japan | ???       |


In [1]:
doc_one = "chinese beijing chinese"
label_one = 1

doc_two = "chinese chinese shanghai"
label_two = 1

doc_three = "chinese macao"
label_three = 1

doc_four = "tokyo japan chinese"
label_four = 0

all_docs = [doc_one, doc_two, doc_three, doc_four]
all_labels = [label_one, label_two, label_three, label_four]

In [2]:
from collections import OrderedDict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
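from sklearn.svm import LinearSVC
from sklearn.linear_model import (
    SGDClassifier, RidgeClassifier, Perceptron, PassiveAggressiveClassifier
)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier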
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import inspect
import os
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
os.sys.path.insert(0, os.path.join(parentdir, 'nb_transformer'))
from nb_transformer import NaiveBayesTransformer
from nb_enhanced_classifier import NaiveBayesEnhancedClassifier

## traditional `bernoulli` `vectorizer`

First, let's `vectorize` the documents in the training set using `CountVectorizer()`. <br>
The paper found binary (presence/absence) features to be more effective than raw counts, so we set `binary=True`.
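To make the difference between counts and binary features concrete, here is a minimal pure-Python sketch on the toy corpus. It only mimics the idea (whitespace tokenization, alphabetically sorted vocabulary); the helper names `count_features` and `binary_features` are illustrative and are not part of `sklearn` or this repository.

```python
from collections import Counter

docs = [
    "chinese beijing chinese",
    "chinese chinese shanghai",
    "chinese macao",
    "tokyo japan chinese",
]

# Alphabetically sorted vocabulary built from whitespace tokens.
vocab = sorted({token for doc in docs for token in doc.split()})


def count_features(doc):
    """Raw term counts for one document, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


def binary_features(doc):
    """Presence/absence indicators: every non-zero count is clipped to 1."""
    return [min(c, 1) for c in count_features(doc)]


print(vocab)
print(count_features(docs[0]))   # "chinese" occurs twice in the first document
print(binary_features(docs[0]))  # but it is recorded only once as a binary feature
```

With binary features, repeating a word inside a document adds no extra weight, which matters for the test document later on.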

In [3]:
vectorizer_bernoulli = CountVectorizer(
    binary=True
)
X_bernoulli = vectorizer_bernoulli.fit_transform(all_docs)

In [4]:
vectorizer_bernoulli.vocabulary_

{'beijing': 0, 'chinese': 1, 'japan': 2, 'macao': 3, 'shanghai': 4, 'tokyo': 5}

In [5]:
X_bernoulli.toarray()

array([[1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 1]], dtype=int64)

## `NaiveBayesTransformer`

This is a custom `transformer` that will use the Naive Bayes conditional probabilities as feature values.  See [the paper](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) for details.

The key thing to recognize is that this method is built on the assumption of a `binary` (two-label) problem.  It can be extended to `multiclass` problems with the `one-v-all` approach, but that requires a **different** transformation for **each** label, hence the `label_of_interest` argument to `NaiveBayesTransformer`.

In [6]:
nb_transformer = NaiveBayesTransformer(all_labels, 1)

### original feature space

In [7]:
original_features = vectorizer_bernoulli.fit_transform(all_docs)
original_features.toarray()

array([[1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 1]], dtype=int64)

### transformed feature space

In [8]:
nb_transformed = nb_transformer.transform(original_features)
nb_transformed.toarray()

array([[ 0.40546511,  0.40546511,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.40546511,  0.        ,  0.        ,  0.40546511,
         0.        ],
       [ 0.        ,  0.40546511,  0.        ,  0.40546511,  0.        ,
         0.        ],
       [ 0.        ,  0.40546511, -0.98082925,  0.        ,  0.        ,
        -0.98082925]])
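The transformed values above are consistent with the log-count ratio described in the paper, computed with add-one smoothing. The sketch below recomputes them with plain `numpy`; it is an independent illustration, not the source of `NaiveBayesTransformer`, and the smoothing value `alpha = 1.0` is an assumption chosen because it matches the output.

```python
import numpy as np

# Binary document-term matrix from the toy example (rows = docs, columns = vocabulary).
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
])
y = np.array([1, 1, 1, 0])
alpha = 1.0  # assumed smoothing constant

# Smoothed feature counts for the positive and negative class.
p = alpha + X[y == 1].sum(axis=0)
q = alpha + X[y == 0].sum(axis=0)

# Log-count ratio: log of the normalized positive counts over the normalized negative counts.
r = np.log((p / p.sum()) / (q / q.sum()))

# Each binary feature is scaled by the ratio of its vocabulary entry.
X_nb = X * r
print(np.round(X_nb, 8))
```

Words seen mostly in the positive class receive a positive value, while "japan" and "tokyo", which only occur in the negative document, receive a negative one.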

## Train

Given our features, let's train a classifier to predict `1` v. `0`.
The paper found an `SVM` with `squared hinge loss` and an `l2` penalty to be most effective.
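Written out, the objective for this configuration has roughly the following form, up to constant factors. The notation is assumed here for illustration: $f^{(i)}$ is the feature vector of document $i$, $y^{(i)} \in \{-1, 1\}$ is its label, and $C$ is the regularization trade-off.

$$
\min_{w,\,b} \; w^\top w + C \sum_i \max\left(0,\; 1 - y^{(i)}\left(w^\top f^{(i)} + b\right)\right)^2
$$

The first term is the `l2` penalty and the second is the squared hinge loss, which is what `loss='squared_hinge'` selects in `LinearSVC`.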

In [9]:
svm_clf = LinearSVC(            
    loss='squared_hinge',
    penalty='l2',
    random_state = 1,         # to ensure reproducible results
    class_weight='balanced',
    dual=False
)

Let's inspect its parameters, then build an instance of `NaiveBayesEnhancedClassifier`.

In [10]:
svm_clf.get_params()

{'C': 1.0,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': 1,
 'tol': 0.0001,
 'verbose': 0}

In [11]:
binary_example_clf = NaiveBayesEnhancedClassifier(
    list_of_possible_classes=[0,1],
    clf=svm_clf
)

Let's train it on our dataset by feeding the *original* vectorized features into `.fit()`

In [12]:
binary_example_clf.fit(original_features, all_labels)

Baked into `.fit()` are the following steps:
 1. `transform` the features using the `NaiveBayesTransformer`
 2. Fit the underlying `classifier` on the transformed features
 3. Interpolate the `.coef_` of the fitted `classifier` toward their mean magnitude using the formula below:
  - $w' = (1 - \beta)\bar{w} + \beta w$, where $\bar{w}$ is the mean magnitude of the weights (the average of $|w_i|$ over all features) and $\beta$ is the interpolation factor
  
Below are a few cells showing the calculations that are done in `NaiveBayesTransformer._interpolate()`

In [13]:
svm_clf.coef_

array([[ 0.3158278 ,  0.32634704,  0.45491727,  0.3158278 ,  0.3158278 ,
         0.45491727]])

In [14]:
interpolation_factor = 0.25

In [15]:
mean_magnitude = np.mean(np.abs(svm_clf.coef_))  # mean *magnitude*: average of absolute values
mean_magnitude_vector = np.full(svm_clf.coef_.shape, mean_magnitude)
mean_magnitude_vector

array([[ 0.36394416,  0.36394416,  0.36394416,  0.36394416,  0.36394416,
         0.36394416]])

In [16]:
svm_clf.coef_ = (1 - interpolation_factor) * mean_magnitude_vector + \
                                interpolation_factor * svm_clf.coef_
svm_clf.coef_

array([[ 0.35191507,  0.35454488,  0.38668744,  0.35191507,  0.35191507,
         0.38668744]])
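The cells above can be condensed into one small function. This is a sketch of the interpolation formula only, using the coefficients printed above as input; `interpolate_weights` is a hypothetical helper name, not a method of this repository.

```python
import numpy as np


def interpolate_weights(w, beta):
    """Blend each weight with the mean magnitude of all weights.

    beta = 1 keeps the fitted weights unchanged; beta = 0 replaces every
    weight with the mean magnitude.
    """
    w = np.asarray(w, dtype=float)
    mean_magnitude = np.abs(w).mean()
    return (1.0 - beta) * mean_magnitude + beta * w


# Coefficients copied from the fitted toy classifier above.
coef = np.array([[0.3158278, 0.32634704, 0.45491727,
                  0.3158278, 0.3158278, 0.45491727]])

print(interpolate_weights(coef, beta=0.25))
print(interpolate_weights(coef, beta=1.0))
```

A small `beta` pulls all weights toward a common value, which acts as a form of regularization on the fitted `SVM`.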

## Test

Now let's predict on a new data point.

In [17]:
test_doc = "chinese chinese chinese tokyo japan"

We first must `vectorize` the test item into the original feature space.

In [18]:
test_features_original = vectorizer_bernoulli.transform([test_doc])

Then we can use `NaiveBayesEnhancedClassifier.predict()` to (1) apply the `NB-transformation` and (2) make our prediction.

In [19]:
binary_example_clf.predict(test_features_original)

array([0])

We can see the "confidence" of this prediction using `NaiveBayesEnhancedClassifier.decision_function_predict_proba(test_features_original)` which will use `.decision_function()` or `.predict_proba()` of the original `classifier`.

In [20]:
binary_example_clf.decision_function_predict_proba(test_features_original)

array([-0.08810003])
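The fallback between `.decision_function()` and `.predict_proba()` can be pictured as a simple attribute check. The sketch below is an assumption about how such a dispatch might look, including the order of preference; `confidence_scores` and the two stand-in classes are hypothetical and only exist to exercise both branches.

```python
def confidence_scores(clf, X):
    """Return signed margins when available, otherwise class probabilities."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(X)
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba(X)
    raise AttributeError(
        "classifier exposes neither decision_function nor predict_proba"
    )


# Tiny stand-in classifiers (hypothetical), one per branch.
class MarginOnly:
    def decision_function(self, X):
        return [-0.5 for _ in X]


class ProbaOnly:
    def predict_proba(self, X):
        return [[0.7, 0.3] for _ in X]


print(confidence_scores(MarginOnly(), [[0, 1]]))
print(confidence_scores(ProbaOnly(), [[0, 1]]))
```

Margins and probabilities live on different scales, so scores from the two branches should not be compared with each other directly.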

The model predicts that the test document belongs to the `not China` class, but not with much confidence (the margin is close to zero).

Note: This is the opposite of the prediction made in the Stanford IR worked example. That example uses `multinomial NB`, where the three occurrences of "chinese" outweigh the other words; with binary features, "chinese" counts only once.
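The multinomial result mentioned in the note can be reproduced by hand. The sketch below is a from-scratch multinomial NB with add-one smoothing on the same toy data; `log_score` and the other names are illustrative and are not part of this repository or of `sklearn`.

```python
import math
from collections import Counter

train = [
    ("chinese beijing chinese", 1),
    ("chinese chinese shanghai", 1),
    ("chinese macao", 1),
    ("tokyo japan chinese", 0),
]
test_doc = "chinese chinese chinese tokyo japan"

vocab = {tok for doc, _ in train for tok in doc.split()}


def log_score(doc, label, alpha=1.0):
    """Log prior plus summed log likelihoods under add-alpha multinomial NB."""
    class_docs = [d for d, y in train if y == label]
    counts = Counter(tok for d in class_docs for tok in d.split())
    total = sum(counts.values())
    score = math.log(len(class_docs) / len(train))
    for tok in doc.split():
        # Every occurrence contributes, so "chinese" is counted three times.
        score += math.log((counts[tok] + alpha) / (total + alpha * len(vocab)))
    return score


scores = {label: log_score(test_doc, label) for label in (0, 1)}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because each repetition of "chinese" multiplies in another likelihood term, the positive class wins here, unlike in the binary-feature model above.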

## `multiclass` test



As explained above, this transformation is built for `binary` classification.  The `multiclass` version requires the application of `one-v-all` to a traditional `SVM` classifier.  This is all done automatically within the `NaiveBayesEnhancedClassifier` `class`.  An example is below.
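As a rough picture of what `one-v-all` means for the labels, each class gets its own binary problem in which that class is `1` and every other class is `0`. The sketch below shows only this relabelling step; `one_vs_all_targets` is a hypothetical helper, not a function from this repository.

```python
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]


def one_vs_all_targets(labels, label_of_interest):
    """1 where the label matches the class of interest, 0 everywhere else."""
    return [1 if y == label_of_interest else 0 for y in labels]


# One binary target vector per class; each would get its own NB transformation
# and its own binary classifier.
binary_problems = {
    c: one_vs_all_targets(labels, c) for c in sorted(set(labels))
}

for c, targets in binary_problems.items():
    print(c, targets)
```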

In [21]:
obvious_docs = [
    "cat",
    "cat",
    "cat",
    "pencil",
    "pencil",
    "pencil",
    "car",
    "car",
    "car",
]

obvious_labels = [
    0,
    0,
    0,
    1,
    1,
    1,
    2,
    2,
    2
]

test_docs = [
    "cat",
    "pen",
    "pencil",
    "car"
]

### traditional `bernoulli` `vectorizer`

In [22]:
multiclass_bernoulli_vectorizer = CountVectorizer(
    binary=True
)

multiclass_original_features = multiclass_bernoulli_vectorizer.fit_transform(obvious_docs)

In [23]:
multiclass_original_features.toarray()

array([[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]], dtype=int64)

### Train

In [24]:
multiclass_example_clf = NaiveBayesEnhancedClassifier(
    [0,1,2]      # we'll use the default settings for `clf`
)

In [25]:
multiclass_example_clf.fit(
    multiclass_bernoulli_vectorizer.fit_transform(obvious_docs),   # vectorize the original documents
    obvious_labels
)

### Test

Let's project the test documents into the same feature space as the training data.

In [26]:
multiclass_test_features = multiclass_bernoulli_vectorizer.transform(test_docs)
multiclass_test_features.toarray()

array([[0, 1, 0],
       [0, 0, 0],
       [0, 0, 1],
       [1, 0, 0]])

Then let's get predictions for each test document.

In [27]:
multiclass_example_clf.predict(multiclass_test_features)

array([0, 0, 1, 2])

Behind the scenes, one binary classifier per label scores how strongly each data point belongs to that label.  We can see these values by calling `NaiveBayesEnhancedClassifier.decision_function_predict_proba()`.

Below, each row comes from the classifier trained for a different label, and each value in the row is the signed margin of one test data point from that classifier's decision boundary.

In [28]:
for i, row in zip([0,1,2], multiclass_example_clf.decision_function_predict_proba(multiclass_test_features)):
    print(i, row)

0 [ 0.48603643 -0.46415139 -0.98429487 -1.06914084]
1 [-0.98429487 -0.46415139  0.48603643 -1.06914084]
2 [-0.98429487 -0.46415139 -0.98429487  0.64103138]


We then select the "most confident" classifier for each data point: the one with the largest margin, i.e. the largest *positive* value or, if they are all *negative*, the one closest to zero.  In this case, each column is a data point, so we take the `argmax` of each column.

Note: Since the second data point ("pen") was never seen in the training vocabulary, all three classifiers produce the same negative margin (implying a low level of confidence), and the prediction defaults to label `0`, as seen in the output above.
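The selection step can be reproduced directly from the margins printed above. This is a minimal `numpy` sketch of the column-wise `argmax`, not the code inside `NaiveBayesEnhancedClassifier`; the `classes` array is an assumption about how row positions map back to labels.

```python
import numpy as np

# Rows = one-v-all classifiers (labels 0, 1, 2); columns = test documents.
margins = np.array([
    [ 0.48603643, -0.46415139, -0.98429487, -1.06914084],
    [-0.98429487, -0.46415139,  0.48603643, -1.06914084],
    [-0.98429487, -0.46415139, -0.98429487,  0.64103138],
])
classes = np.array([0, 1, 2])

# For each column, pick the row with the largest margin.
# np.argmax returns the first index on ties, so "pen" falls to label 0.
predictions = classes[np.argmax(margins, axis=0)]
print(predictions)
```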

# Experiments on `20Newsgroups`

Below we attempt to recreate the `binary` classification problems in the experiments of the [original paper](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).  Whenever provided, we use the same hyperparameters.

We also compare different classifiers, and compare the NB transformation against (1) no transformation and (2) TF-IDF features from `TfidfVectorizer()`.

Note: It's likely that the data coming from `sklearn` for `20Newsgroups` is *not* exactly the same as the data used by the authors.  But the results are comparable.

## `alt.atheism` v. `talk.religion.misc` (`AthR`)

In [29]:
categories = [
    'alt.atheism', 'talk.religion.misc',
]

In [30]:
remove = (
    'headers',   # trailing comma makes this a one-element tuple rather than a plain string
)

## Data

In [31]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

idx2class_lookup = dict((i,c) for i,c in enumerate(data_train.target_names))
idx2class_lookup

{0: 'alt.atheism', 1: 'talk.religion.misc'}

In [32]:
# get label distribution
num_pos = list(data_train.target).count(1)
num_neg = list(data_train.target).count(0)
num_pos, num_neg

(377, 480)

### Vectorize

In [33]:
newsgroups_vectorizer_uni = CountVectorizer(
    binary=True
)

newsgroups_vectorizer_tfidf = TfidfVectorizer()

In [34]:
newsgroups_train_features_uni = newsgroups_vectorizer_uni.fit_transform(data_train.data)
newsgroups_train_features_uni

<857x17440 sparse matrix of type '<class 'numpy.int64'>'
	with 136539 stored elements in Compressed Sparse Row format>

In [35]:
newsgroups_test_features_uni = newsgroups_vectorizer_uni.transform(data_test.data)
newsgroups_test_features_uni

<570x17440 sparse matrix of type '<class 'numpy.int64'>'
	with 86232 stored elements in Compressed Sparse Row format>

### transformed - nb

In [36]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    random_state=1,
    dual=False
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [37]:
newsgroups_nb_plus_clfs = OrderedDict()
for clf_name, clf in all_classifiers.items():
    newsgroups_nb_plus_clfs[clf_name] = NaiveBayesEnhancedClassifier(
        list(idx2class_lookup.keys()),   # the class indices, e.g. [0, 1]
        clf,
        interpolation_factor=None
    )

In [38]:
# transformed - nb
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("training {} - nb transformed".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training svm_from_paper - nb transformed
training sgd - nb transformed
training ridge - nb transformed



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training perceptron - nb transformed
training passive_aggressive - nb transformed
training random_forest - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



training adaboost - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



In [39]:
newsgroups_nb_transformed_uni_results = OrderedDict()
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("testing transformed {}".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_nb_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_nb_transformed_uni_results

testing transformed svm_from_paper
testing transformed sgd
testing transformed ridge
testing transformed perceptron
testing transformed passive_aggressive
testing transformed random_forest
testing transformed adaboost


OrderedDict([('svm_from_paper', 0.83333333333333337),
             ('sgd', 0.81403508771929822),
             ('ridge', 0.79824561403508776),
             ('perceptron', 0.80701754385964908),
             ('passive_aggressive', 0.84736842105263155),
             ('random_forest', 0.77894736842105261),
             ('adaboost', 0.75964912280701757)])

### no transformation

In [40]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    dual=False,
    C=0.1,
    random_state=1
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01,
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [41]:
# no transformation count
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - no transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training baseline svm_from_paper - no transformation
training baseline sgd - no transformation
training baseline ridge - no transformation
training baseline perceptron - no transformation
training baseline passive_aggressive - no transformation
training baseline random_forest - no transformation



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline adaboost - no transformation
training baseline bernoulli_nb - no transformation
training baseline multinomial_nb - no transformation


In [42]:
newsgroups_no_transformation_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - no transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_no_transformation_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_no_transformation_uni_results

testing baseline svm_from_paper - no transformation
testing baseline sgd - no transformation
testing baseline ridge - no transformation
testing baseline perceptron - no transformation
testing baseline passive_aggressive - no transformation
testing baseline random_forest - no transformation
testing baseline adaboost - no transformation
testing baseline bernoulli_nb - no transformation
testing baseline multinomial_nb - no transformation


OrderedDict([('svm_from_paper', 0.77719298245614032),
             ('sgd', 0.756140350877193),
             ('ridge', 0.77017543859649118),
             ('perceptron', 0.76666666666666672),
             ('passive_aggressive', 0.77017543859649118),
             ('random_forest', 0.77894736842105261),
             ('adaboost', 0.75964912280701757),
             ('bernoulli_nb', 0.83157894736842108),
             ('multinomial_nb', 0.83684210526315794)])

### transformation - tfidf

In [43]:
newsgroups_train_features_uni = newsgroups_vectorizer_tfidf.fit_transform(data_train.data)
newsgroups_test_features_uni = newsgroups_vectorizer_tfidf.transform(data_test.data)

In [44]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    C=0.1,
    random_state=1,
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [45]:
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - tfidf transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)


In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline svm_from_paper - tfidf transformation
training baseline sgd - tfidf transformation
training baseline ridge - tfidf transformation
training baseline perceptron - tfidf transformation
training baseline passive_aggressive - tfidf transformation
training baseline random_forest - tfidf transformation
training baseline adaboost - tfidf transformation
training baseline bernoulli_nb - tfidf transformation
training baseline multinomial_nb - tfidf transformation


In [46]:
newsgroups_tfidf_transformed_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - tfidf transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_tfidf_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_tfidf_transformed_uni_results

testing baseline svm_from_paper - tfidf transformation
testing baseline sgd - tfidf transformation
testing baseline ridge - tfidf transformation
testing baseline perceptron - tfidf transformation
testing baseline passive_aggressive - tfidf transformation
testing baseline random_forest - tfidf transformation
testing baseline adaboost - tfidf transformation
testing baseline bernoulli_nb - tfidf transformation
testing baseline multinomial_nb - tfidf transformation


OrderedDict([('svm_from_paper', 0.79649122807017547),
             ('sgd', 0.81228070175438594),
             ('ridge', 0.81403508771929822),
             ('perceptron', 0.7929824561403509),
             ('passive_aggressive', 0.82105263157894737),
             ('random_forest', 0.75438596491228072),
             ('adaboost', 0.74210526315789471),
             ('bernoulli_nb', 0.83157894736842108),
             ('multinomial_nb', 0.83859649122807023)])

In [47]:
groups = [
    ("no_transform", newsgroups_no_transformation_uni_results),
    ("tfidf_transform", newsgroups_tfidf_transformed_uni_results), 
    ("nb_transform", newsgroups_nb_transformed_uni_results)
]

clfs = list(all_classifiers.keys())

traces = OrderedDict()
for transformation_name, transformation in groups:
    # classifiers missing from a group (e.g. the NB models under nb_transform) are plotted as 0.0
    scores = [transformation.get(label, 0.0) for label in clfs]
    traces["trace_{}".format(transformation_name)] = go.Bar(
        x=clfs,
        y=scores,
        name=transformation_name,
    )

data_ = [v for k, v in traces.items()]
layout_ = go.Layout(
    barmode='group',
    title='Transformation comparison: AthR (2.9)',
    yaxis=dict(
        range=[0, 1]
    )
)

fig_ = go.Figure(data=data_, layout=layout_)
iplot(fig_)    
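For readers without Plotly, the same comparison can be tabulated as plain text. The sketch below hard-codes the `AthR` accuracies reported in the three result dictionaries above, rounded to four decimal places; the layout and the `best` column are illustrative additions, and missing combinations are shown as `n/a` rather than `0.0`.

```python
# Accuracies copied from the AthR results above, rounded to four decimal places.
results = {
    "no_transform": {
        "svm_from_paper": 0.7772, "sgd": 0.7561, "ridge": 0.7702,
        "perceptron": 0.7667, "passive_aggressive": 0.7702,
        "random_forest": 0.7789, "adaboost": 0.7596,
        "bernoulli_nb": 0.8316, "multinomial_nb": 0.8368,
    },
    "tfidf_transform": {
        "svm_from_paper": 0.7965, "sgd": 0.8123, "ridge": 0.8140,
        "perceptron": 0.7930, "passive_aggressive": 0.8211,
        "random_forest": 0.7544, "adaboost": 0.7421,
        "bernoulli_nb": 0.8316, "multinomial_nb": 0.8386,
    },
    "nb_transform": {
        "svm_from_paper": 0.8333, "sgd": 0.8140, "ridge": 0.7982,
        "perceptron": 0.8070, "passive_aggressive": 0.8474,
        "random_forest": 0.7789, "adaboost": 0.7596,
    },
}

clf_names = list(results["no_transform"])
best = {}

header = "{:<20}".format("classifier")
header += "".join("{:>17}".format(t) for t in results) + "  best"
print(header)

for name in clf_names:
    row = [results[t].get(name) for t in results]
    cells = "".join(
        "{:>17}".format("n/a" if s is None else "{:.4f}".format(s)) for s in row
    )
    # Only compare transformations that were actually run for this classifier.
    available = {t: results[t][name] for t in results if name in results[t]}
    best[name] = max(available, key=available.get)
    print("{:<20}".format(name) + cells + "  " + best[name])
```

On these numbers, the linear models other than `ridge` score highest with the NB transformation, while the tree ensembles are unaffected by it.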

## `comp.graphics` v. `comp.windows.x` (`XGraph`)

In [48]:
categories = [
    'comp.graphics', 'comp.windows.x',
]

## Data

In [49]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

idx2class_lookup = dict((i,c) for i,c in enumerate(data_train.target_names))
idx2class_lookup

{0: 'comp.graphics', 1: 'comp.windows.x'}

In [50]:
# get label distribution
num_pos = list(data_train.target).count(1)
num_neg = list(data_train.target).count(0)
num_pos, num_neg

(593, 584)

## Unigrams

### Vectorize

In [51]:
newsgroups_vectorizer_uni = CountVectorizer(
    binary=True
)

newsgroups_vectorizer_tfidf = TfidfVectorizer()

In [52]:
newsgroups_train_features_uni = newsgroups_vectorizer_uni.fit_transform(data_train.data)
newsgroups_train_features_uni

<1177x21163 sparse matrix of type '<class 'numpy.int64'>'
	with 136868 stored elements in Compressed Sparse Row format>

In [53]:
newsgroups_test_features_uni = newsgroups_vectorizer_uni.transform(data_test.data)
newsgroups_test_features_uni

<784x21163 sparse matrix of type '<class 'numpy.int64'>'
	with 89865 stored elements in Compressed Sparse Row format>

### transformed - nb

In [54]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    random_state=1,
    dual=False
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [55]:
newsgroups_nb_plus_clfs = OrderedDict()
for clf_name, clf in all_classifiers.items():
    newsgroups_nb_plus_clfs[clf_name] = NaiveBayesEnhancedClassifier(
        list(idx2class_lookup.keys()),   # the class indices, e.g. [0, 1]
        clf,
        interpolation_factor=None
    )

In [56]:
# transformed - nb
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("training {} - nb transformed".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training svm_from_paper - nb transformed
training sgd - nb transformed
training ridge - nb transformed



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training perceptron - nb transformed
training passive_aggressive - nb transformed
training random_forest - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



training adaboost - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



In [57]:
newsgroups_nb_transformed_uni_results = OrderedDict()
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("testing transformed {}".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_nb_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_nb_transformed_uni_results

testing transformed svm_from_paper
testing transformed sgd
testing transformed ridge
testing transformed perceptron
testing transformed passive_aggressive
testing transformed random_forest
testing transformed adaboost


OrderedDict([('svm_from_paper', 0.83290816326530615),
             ('sgd', 0.84183673469387754),
             ('ridge', 0.83673469387755106),
             ('perceptron', 0.8482142857142857),
             ('passive_aggressive', 0.84693877551020413),
             ('random_forest', 0.80612244897959184),
             ('adaboost', 0.81122448979591832)])

### no transformation

In [58]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    dual=False,
    C=0.1,
    random_state=1
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01,
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [59]:
# no transformation count
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - no transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training baseline svm_from_paper - no transformation
training baseline sgd - no transformation
training baseline ridge - no transformation
training baseline perceptron - no transformation
training baseline passive_aggressive - no transformation
training baseline random_forest - no transformation



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline adaboost - no transformation
training baseline bernoulli_nb - no transformation
training baseline multinomial_nb - no transformation


In [60]:
newsgroups_no_transformation_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - no transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_no_transformation_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_no_transformation_uni_results

testing baseline svm_from_paper - no transformation
testing baseline sgd - no transformation
testing baseline ridge - no transformation
testing baseline perceptron - no transformation
testing baseline passive_aggressive - no transformation
testing baseline random_forest - no transformation
testing baseline adaboost - no transformation
testing baseline bernoulli_nb - no transformation
testing baseline multinomial_nb - no transformation


OrderedDict([('svm_from_paper', 0.82397959183673475),
             ('sgd', 0.8482142857142857),
             ('ridge', 0.83418367346938771),
             ('perceptron', 0.83290816326530615),
             ('passive_aggressive', 0.82908163265306123),
             ('random_forest', 0.81377551020408168),
             ('adaboost', 0.81122448979591832),
             ('bernoulli_nb', 0.81760204081632648),
             ('multinomial_nb', 0.86352040816326525)])

### transformation - tfidf

In [61]:
newsgroups_train_features_uni = newsgroups_vectorizer_tfidf.fit_transform(data_train.data)
newsgroups_test_features_uni = newsgroups_vectorizer_tfidf.transform(data_test.data)

In [62]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    C=0.1,
    random_state=1,
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [63]:
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - tfidf transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)


In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline svm_from_paper - tfidf transformation
training baseline sgd - tfidf transformation
training baseline ridge - tfidf transformation
training baseline perceptron - tfidf transformation
training baseline passive_aggressive - tfidf transformation
training baseline random_forest - tfidf transformation
training baseline adaboost - tfidf transformation
training baseline bernoulli_nb - tfidf transformation
training baseline multinomial_nb - tfidf transformation


In [64]:
newsgroups_tfidf_transformed_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - tfidf transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_tfidf_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_tfidf_transformed_uni_results

testing baseline svm_from_paper - tfidf transformation
testing baseline sgd - tfidf transformation
testing baseline ridge - tfidf transformation
testing baseline perceptron - tfidf transformation
testing baseline passive_aggressive - tfidf transformation
testing baseline random_forest - tfidf transformation
testing baseline adaboost - tfidf transformation
testing baseline bernoulli_nb - tfidf transformation
testing baseline multinomial_nb - tfidf transformation


OrderedDict([('svm_from_paper', 0.8482142857142857),
             ('sgd', 0.8571428571428571),
             ('ridge', 0.85459183673469385),
             ('perceptron', 0.84311224489795922),
             ('passive_aggressive', 0.85459183673469385),
             ('random_forest', 0.80867346938775508),
             ('adaboost', 0.78061224489795922),
             ('bernoulli_nb', 0.81760204081632648),
             ('multinomial_nb', 0.86734693877551017)])

In [65]:
groups = [
    ("no_transform", newsgroups_no_transformation_uni_results),
    ("tfidf_transform", newsgroups_tfidf_transformed_uni_results), 
    ("nb_transform", newsgroups_nb_transformed_uni_results)
]

clfs = list(all_classifiers.keys())

traces = OrderedDict()
for transformation_name, transformation in groups:
    # classifiers absent from a result set are plotted as 0.0
    scores = [transformation.get(label, 0.0) for label in clfs]
    traces["trace_{}".format(transformation_name)] = go.Bar(
        x=clfs,
        y=scores,
        name=transformation_name,
    )

data_ = list(traces.values())
layout_ = go.Layout(
    barmode='group',
    title='Transformation comparison: XGraph (1.8)',
    yaxis=dict(
        range=[0, 1]
    )
)

fig_ = go.Figure(data=data_, layout=layout_)
iplot(fig_)    

## `rec.sport.baseball` v. `sci.crypt` (`BbCrypt`)

In [66]:
categories = [
    'rec.sport.baseball', 'sci.crypt',
]

## Data

In [67]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

idx2class_lookup = dict(enumerate(data_train.target_names))
idx2class_lookup

{0: 'rec.sport.baseball', 1: 'sci.crypt'}

In [68]:
# get label distribution
num_pos = list(data_train.target).count(1)
num_neg = list(data_train.target).count(0)
num_pos, num_neg

(595, 597)

In [69]:
# intercept b = log(N+ / N-): log ratio of positive to negative training documents
intercept = np.log(num_pos/num_neg)
intercept

-0.0033557078469723042
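
The intercept above is one half of the Naive Bayes feature construction in the NB-SVM paper. The other half is the log-count ratio `r = log((p / ||p||_1) / (q / ||q||_1))`, where `p` and `q` are smoothed feature counts for the positive and negative class. The sketch below follows the paper's definitions on the toy dataset from the top of this notebook. The function name `nb_log_count_ratio` and the hand-built matrix are illustrative assumptions, and `NaiveBayesTransformer` may compute these quantities differently.

```python
import numpy as np

def nb_log_count_ratio(X, y, alpha=1.0):
    """Return (r, b) as defined in the NB-SVM paper (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = alpha + X[y == 1].sum(axis=0)   # smoothed counts, positive class
    q = alpha + X[y == 0].sum(axis=0)   # smoothed counts, negative class
    r = np.log((p / p.sum()) / (q / q.sum()))
    b = np.log((y == 1).sum() / (y == 0).sum())
    return r, b

# Binary (presence/absence) features for the four toy training documents.
# Column order: beijing, chinese, japan, macao, shanghai, tokyo
X_toy = np.array([
    [1, 1, 0, 0, 0, 0],  # chinese beijing chinese
    [0, 1, 0, 0, 1, 0],  # chinese chinese shanghai
    [0, 1, 0, 1, 0, 0],  # chinese macao
    [0, 1, 1, 0, 0, 1],  # tokyo japan chinese
])
y_toy = np.array([1, 1, 1, 0])

r, b = nb_log_count_ratio(X_toy, y_toy)
print(r)  # positive entries favour CHINA, negative entries favour NOT_CHINA
print(b)  # log(3 / 1)

# The same intercept formula with the class counts reported above (595 vs. 597).
b_newsgroups = np.log(595 / 597)
print(b_newsgroups)
```

With nearly balanced classes, as in this two-category subset, the intercept is close to zero and the per-feature ratios carry almost all of the signal.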

## Unigrams

### Vectorize

In [70]:
newsgroups_vectorizer_uni = CountVectorizer(
    binary=True
)

newsgroups_vectorizer_tfidf = TfidfVectorizer()

In [71]:
newsgroups_train_features_uni = newsgroups_vectorizer_uni.fit_transform(data_train.data)
newsgroups_train_features_uni

<1192x21197 sparse matrix of type '<class 'numpy.int64'>'
	with 176313 stored elements in Compressed Sparse Row format>

In [72]:
newsgroups_test_features_uni = newsgroups_vectorizer_uni.transform(data_test.data)
newsgroups_test_features_uni

<793x21197 sparse matrix of type '<class 'numpy.int64'>'
	with 97031 stored elements in Compressed Sparse Row format>

### transformed - nb

In [73]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    random_state=1,
    dual=False
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [75]:
newsgroups_nb_plus_clfs = OrderedDict()
for clf_name, clf in all_classifiers.items():
    newsgroups_nb_plus_clfs[clf_name] = NaiveBayesEnhancedClassifier(
        list(idx2class_lookup.keys()),  # class indices
        clf,
        interpolation_factor=None,
    )

In [76]:
# transformed - nb
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("training {} - nb transformed".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training svm_from_paper - nb transformed
training sgd - nb transformed
training ridge - nb transformed



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training perceptron - nb transformed
training passive_aggressive - nb transformed
training random_forest - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



training adaboost - nb transformed



the classifier you instantiated does not have an attribute `.coef_`interpolation will not occur



In [77]:
newsgroups_nb_transformed_uni_results = OrderedDict()
for clf_name, clf in newsgroups_nb_plus_clfs.items():
    print("testing transformed {}".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_nb_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_nb_transformed_uni_results

testing transformed svm_from_paper
testing transformed sgd
testing transformed ridge
testing transformed perceptron
testing transformed passive_aggressive
testing transformed random_forest
testing transformed adaboost


OrderedDict([('svm_from_paper', 0.96216897856242123),
             ('sgd', 0.9709962168978562),
             ('ridge', 0.96973518284993698),
             ('perceptron', 0.97351828499369486),
             ('passive_aggressive', 0.98108448928121061),
             ('random_forest', 0.94325346784363173),
             ('adaboost', 0.9319041614123581)])
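
The `.coef_` warnings above concern the interpolation step of the NB-SVM paper, which only applies to linear models. The paper blends the learned weight vector with its mean magnitude: `w' = (1 - beta) * w_bar + beta * w`, where `w_bar = ||w||_1 / |V|`. The sketch below implements that formula under the paper's definitions. The function name is hypothetical, and how `NaiveBayesEnhancedClassifier` handles `interpolation_factor` internally is not shown in this notebook.

```python
import numpy as np

def interpolate_weights(w, beta):
    """Blend weights with their mean magnitude (formula from the NB-SVM paper)."""
    w = np.asarray(w, dtype=float)
    w_bar = np.abs(w).sum() / w.size   # ||w||_1 / |V|
    return (1.0 - beta) * w_bar + beta * w

w_demo = np.array([1.0, -1.0, 2.0])          # made-up weights for illustration
w_interp = interpolate_weights(w_demo, beta=0.25)
print(w_interp)
```

Setting `beta=1` returns the weights unchanged, while smaller values pull every weight toward the shared magnitude `w_bar`. Random forests and AdaBoost expose no `.coef_`, which is why the step is skipped for them.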

### no transformation

In [78]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    dual=False,
    C=0.1,
    random_state=1
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01,
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01,
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [79]:
# no transformation count
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - no transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)

training baseline svm_from_paper - no transformation



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline sgd - no transformation
training baseline ridge - no transformation
training baseline perceptron - no transformation
training baseline passive_aggressive - no transformation
training baseline random_forest - no transformation
training baseline adaboost - no transformation
training baseline bernoulli_nb - no transformation
training baseline multinomial_nb - no transformation


In [80]:
newsgroups_no_transformation_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - no transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_no_transformation_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_no_transformation_uni_results

testing baseline svm_from_paper - no transformation
testing baseline sgd - no transformation
testing baseline ridge - no transformation
testing baseline perceptron - no transformation
testing baseline passive_aggressive - no transformation
testing baseline random_forest - no transformation
testing baseline adaboost - no transformation
testing baseline bernoulli_nb - no transformation
testing baseline multinomial_nb - no transformation


OrderedDict([('svm_from_paper', 0.95838587641866335),
             ('sgd', 0.95586380832282469),
             ('ridge', 0.94703656998738961),
             ('perceptron', 0.95208070617906682),
             ('passive_aggressive', 0.95964691046658257),
             ('random_forest', 0.94955863808322827),
             ('adaboost', 0.9319041614123581),
             ('bernoulli_nb', 0.94703656998738961),
             ('multinomial_nb', 0.98612862547288782)])

### transformation - tfidf

In [82]:
newsgroups_train_features_uni = newsgroups_vectorizer_tfidf.fit_transform(data_train.data)
newsgroups_test_features_uni = newsgroups_vectorizer_tfidf.transform(data_test.data)

In [83]:
all_classifiers = OrderedDict()

all_classifiers["svm_from_paper"] = LinearSVC(
    loss='squared_hinge',
    penalty='l2',
    C=0.1,
    random_state=1,
)

all_classifiers["sgd"] = SGDClassifier(
    n_iter=50,
    penalty='elasticnet',
    random_state=1,
)

all_classifiers["ridge"] = RidgeClassifier(
    tol=1e-2,
    solver='lsqr',
    random_state=1,
)

all_classifiers["perceptron"] = Perceptron(
    n_iter=50,
    random_state=1,
)

all_classifiers["passive_aggressive"] = PassiveAggressiveClassifier(
    n_iter=50,
    random_state=1,
)

all_classifiers["random_forest"] = RandomForestClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["adaboost"] = AdaBoostClassifier(
    n_estimators=100,
    random_state=1,
)

all_classifiers["bernoulli_nb"] = BernoulliNB(
    alpha=0.01
)

all_classifiers["multinomial_nb"] = MultinomialNB(
    alpha=0.01
)


n_iter parameter is deprecated in 0.19 and will be removed in 0.21. Use max_iter and tol instead.





In [84]:
for clf_name, clf in all_classifiers.items():
    print("training baseline {} - tfidf transformation".format(clf_name))
    clf.fit(newsgroups_train_features_uni, data_train.target)


In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



training baseline svm_from_paper - tfidf transformation
training baseline sgd - tfidf transformation
training baseline ridge - tfidf transformation
training baseline perceptron - tfidf transformation
training baseline passive_aggressive - tfidf transformation
training baseline random_forest - tfidf transformation
training baseline adaboost - tfidf transformation
training baseline bernoulli_nb - tfidf transformation
training baseline multinomial_nb - tfidf transformation


In [86]:
newsgroups_tfidf_transformed_uni_results = OrderedDict()
for clf_name, clf in all_classifiers.items():
    print("testing baseline {} - tfidf transformation".format(clf_name))
    preds = clf.predict(newsgroups_test_features_uni)
    newsgroups_tfidf_transformed_uni_results[clf_name] = accuracy_score(data_test.target, preds)
newsgroups_tfidf_transformed_uni_results

testing baseline svm_from_paper - tfidf transformation
testing baseline sgd - tfidf transformation
testing baseline ridge - tfidf transformation
testing baseline perceptron - tfidf transformation
testing baseline passive_aggressive - tfidf transformation
testing baseline random_forest - tfidf transformation
testing baseline adaboost - tfidf transformation
testing baseline bernoulli_nb - tfidf transformation
testing baseline multinomial_nb - tfidf transformation


OrderedDict([('svm_from_paper', 0.95208070617906682),
             ('sgd', 0.96343001261034045),
             ('ridge', 0.9709962168978562),
             ('perceptron', 0.95586380832282469),
             ('passive_aggressive', 0.97477931904161408),
             ('random_forest', 0.94955863808322827),
             ('adaboost', 0.91298865069356872),
             ('bernoulli_nb', 0.94703656998738961),
             ('multinomial_nb', 0.97982345523329129)])

In [87]:
groups = [
    ("no_transform", newsgroups_no_transformation_uni_results),
    ("tfidf_transform", newsgroups_tfidf_transformed_uni_results), 
    ("nb_transform", newsgroups_nb_transformed_uni_results)
]

clfs = list(all_classifiers.keys())

traces = OrderedDict()
for transformation_name, transformation in groups:
    # classifiers absent from a result set (here the NB models under nb_transform) are plotted as 0.0
    scores = [transformation.get(label, 0.0) for label in clfs]
    traces["trace_{}".format(transformation_name)] = go.Bar(
        x=clfs,
        y=scores,
        name=transformation_name,
    )

data_ = list(traces.values())
layout_ = go.Layout(
    barmode='group',
    title='Transformation comparison: BbCrypt (0.5)',
    yaxis=dict(
        range=[0, 1]
    )
)

fig_ = go.Figure(data=data_, layout=layout_)
iplot(fig_)    