We'll be working with cancer imaging data from the University of Wisconsin. Each row has an ID number, a binary diagnosis label ("M" for malignant, "B" for benign), and 30 numeric features computed from an image. We'll first fit quick random forest (RF) and SVM classifiers to get a baseline accuracy, then see what happens when we pass the feature set through a restricted Boltzmann machine (RBM) first.

In [102]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

cancer = pd.read_csv("http://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/WDBC.dat"
                      ,delimiter=",",header=None)

%matplotlib inline

In [103]:
cancer = cancer.drop(0,axis=1)
cancer = cancer.rename(columns={1:'diagnosis'})
cancer.head()

Unnamed: 0,diagnosis,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Let's create quick RFC and SVM classifiers to see how well we do with the baseline features.

In [106]:
X = cancer.drop('diagnosis',axis=1)
Y = cancer['diagnosis']

In [112]:
from sklearn.ensemble import RandomForestClassifier

# Not going to mess around with settings too much since we're hoping to just explore the performance boost
# from using RBM
RFC = RandomForestClassifier()

rfc_scores = cross_val_score(RFC,X,Y,scoring='accuracy')

print('RFC Accuracy: {}'.format(rfc_scores))

RFC Accuracy: [0.93157895 0.97368421 0.97354497]


In [126]:
from sklearn import preprocessing
from sklearn.svm import SVC

scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)

# Use a lowercase name so the instance does not shadow the imported SVC class
svc = SVC()

svc_scores = cross_val_score(svc,X_scaled,Y,scoring='accuracy')

print('SVC Accuracy: {}'.format(svc_scores))

SVC Accuracy: [0.96315789 0.98421053 0.97354497]


So vanilla versions of SVM and RFC already give us solid accuracy, roughly 93% to 98% per fold. Let's see how well we can do when we tune an RBM for the problem.
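One caveat about the SVC baseline: `X_scaled` was produced by fitting the scaler on every row before cross-validation, so each test fold has influenced the scaling statistics. A minimal sketch of the alternative is below, with the scaler placed inside a `Pipeline` so that it is refit on each training fold. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `scaled_svc` are stand-ins assumed for illustration, not the WDBC data, so the scores say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 30 features (an assumption, not the cancer data)
X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=10, random_state=0)

# The scaler is a pipeline step, so cross_val_score fits it on each training fold only
scaled_svc = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])

demo_scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=3, scoring='accuracy')
print('Pipeline SVC accuracy per fold: {}'.format(demo_scores))
```

On a dataset this small the difference from pre-scaling is usually minor, but the pipeline form keeps the test folds untouched by preprocessing.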

In [206]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.model_selection import GridSearchCV

rbm_rfc = Pipeline(steps=[('rbm', BernoulliRBM()), ('rfc', RandomForestClassifier())])

# Parameter grid for the grid search; each dict in the list is searched as its own sub-grid
param_grid = [
    {
        'rbm__learning_rate':[0.0000001,0.000001,0.00001],
        'rbm__n_components': [20000,2500,3000]
    },
    { # Empty dict: adds one candidate that uses the pipeline defaults.
      # Not really messing with RFC params since we're mostly interested in the RBM's impact
    },
]

GS = GridSearchCV(rbm_rfc,param_grid,scoring='accuracy')
GS.fit(X,Y)

print(GS.best_params_)
best_params = GS.best_params_

{'rbm__learning_rate': 1e-07, 'rbm__n_components': 3000}
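`best_params_` only reports the winning combination. To see how close the other candidates were, the fitted search also exposes `cv_results_`. The sketch below runs a much smaller grid on synthetic data so that it finishes quickly. The data, the grid values, the `MinMaxScaler` step, and the names `demo_pipe`, `demo_grid`, and `demo_search` are all assumptions made for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table (an assumption, not the cancer data)
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=10, random_state=0)

demo_pipe = Pipeline(steps=[
    ('minmax', MinMaxScaler()),  # assumed extra step: rescale features to [0, 1]
    ('rbm', BernoulliRBM(n_iter=5, random_state=0)),
    ('rfc', RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Deliberately tiny, hypothetical grid: 2 x 2 = 4 candidates
demo_grid = {'rbm__learning_rate': [0.01, 0.1], 'rbm__n_components': [8, 16]}

demo_search = GridSearchCV(demo_pipe, demo_grid, scoring='accuracy', cv=3)
demo_search.fit(X_demo, y_demo)

# One line per candidate: its parameters and mean cross-validated accuracy
results = demo_search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print('{} -> mean accuracy {:.3f}'.format(params, score))
```

If the candidates all score within a point or so of each other, the "best" parameters are probably not telling us much about the RBM.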


In [212]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.model_selection import GridSearchCV

# Using best params from grid search exercise
rbm_rfc = Pipeline(steps=[('rbm', BernoulliRBM(learning_rate=1e-07,n_components= 3000)), ('rfc', RandomForestClassifier())])

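# Note: the grid search above was fit on the unscaled X, while the scores below use X_scaled.
# cross_val_score clones and refits the pipeline on each fold, so no separate fit() call is needed.
rbm_rfc_scores = cross_val_score(rbm_rfc,X_scaled,Y,scoring='accuracy')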

print('rbm_rfc Accuracy: {}'.format(rbm_rfc_scores))

rbm_rfc Accuracy: [0.96315789 0.97368421 0.97883598]


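Accuracy is very similar to our vanilla RFC, with a slight improvement on the first and third folds, so the model does pretty well. But we're not seeing the vast improvement we might have expected, and since neither random forest uses a fixed random seed, a gap this small could simply be run-to-run noise.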
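One possible contributor, offered as a hypothesis rather than a diagnosis: `BernoulliRBM` is designed for binary inputs or values in [0, 1], while the raw features here run into the thousands and the standardized ones include negative values. The sketch below shows the rescaling step on made-up data with three columns on very different scales. The array `X_demo`, the column ranges, and the RBM settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)

# Hypothetical features on very different scales (not the cancer data)
X_demo = rng.uniform(low=[0.0, 100.0, 0.05],
                     high=[30.0, 2500.0, 0.3],
                     size=(100, 3))

# Rescale every column to the [0, 1] range before the RBM sees it
X_unit = MinMaxScaler().fit_transform(X_demo)

rbm = BernoulliRBM(n_components=4, n_iter=5, random_state=0)
hidden = rbm.fit_transform(X_unit)  # hidden-unit activation probabilities

print('input range: [{:.2f}, {:.2f}]'.format(X_unit.min(), X_unit.max()))
print('hidden representation shape: {}'.format(hidden.shape))
```

In a real run the `MinMaxScaler` would go inside the pipeline ahead of the `rbm` step so that it is fit on the training folds only.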

In [234]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rbm_svc = Pipeline(steps=[('rbm', BernoulliRBM()), ('svm', SVC())])

# Grid of parameters for our gridsearch optimization (done successively)
param_grid = [
    {
        'rbm__learning_rate':[0.1, 0.00000001],
        'rbm__n_components': [100,200,300,1000,2000]
    },
    {
    },
]

GS = GridSearchCV(rbm_svc,param_grid,scoring='accuracy')
GS.fit(X_scaled,Y)

print(GS.best_params_)
best_params = GS.best_params_

{'rbm__learning_rate': 0.1, 'rbm__n_components': 100}


In [235]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.model_selection import GridSearchCV

# Using best params from grid search exercise
rbm_svc = Pipeline(steps=[('rbm', BernoulliRBM(learning_rate=0.1,n_components= 100)), ('svc', SVC())])

rbm_svc.fit(X_scaled,Y)

rbm_svc_scores = cross_val_score(rbm_svc,X_scaled,Y,scoring='accuracy')

print('rbm_svc Accuracy: {}'.format(rbm_svc_scores))

rbm_svc Accuracy: [0.90526316 0.91578947 0.92063492]


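The RBM doesn't seem to be useful in conjunction with our SVM classifier: per-fold accuracy falls from roughly 0.96 to 0.98 for the plain SVC to roughly 0.91 to 0.92 with the RBM in front of it.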
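Comparing three fold scores by eye gets awkward across four models, so here is a small sketch that reduces each run to a mean and a sample standard deviation. The numbers are copied from the outputs printed above; the `summarize` helper and the dictionary names are hypothetical conveniences, not part of the original analysis.

```python
from statistics import mean, stdev

# Fold accuracies copied from the printed outputs above
fold_scores = {
    'RFC':     [0.93157895, 0.97368421, 0.97354497],
    'SVC':     [0.96315789, 0.98421053, 0.97354497],
    'rbm_rfc': [0.96315789, 0.97368421, 0.97883598],
    'rbm_svc': [0.90526316, 0.91578947, 0.92063492],
}

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of fold scores."""
    return mean(scores), stdev(scores)

summary = {name: summarize(scores) for name, scores in fold_scores.items()}

for name, (avg, spread) in summary.items():
    print('{:8s} mean={:.4f} std={:.4f}'.format(name, avg, spread))
```

The means come out to about 0.960 for RFC, 0.974 for SVC, 0.972 for RBM plus RFC, and 0.914 for RBM plus SVC. With only three folds and unseeded models, a gap of about one point is within what we'd expect from noise.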

### Concluding Thoughts on RBM
The task was to use an RBM on a dataset of image data to improve results, which we did, though only barely: results improved ever so slightly for RFC and got worse for SVC. It turns out this wasn't the best dataset for demonstrating what an RBM can do as a feature extractor. An RBM is particularly useful when the raw data is not conducive to modeling, but our variables were already quite ready to be modeled. As a result, the accuracy of the vanilla RFC and SVC was already extremely high, making it difficult to demonstrate a vast improvement from the RBM.

Just for fun, let's see how an MLP does on this same dataset.

In [250]:
# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model (tried different numbers of layers and layer sizes)
mlp = MLPClassifier(hidden_layer_sizes=(5000,5000,5000,))
mlp.fit(X, Y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5000, 5000, 5000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [251]:
mlp_scores = cross_val_score(mlp,X,Y,scoring='accuracy')

print('mlp Accuracy: {}'.format(mlp_scores))

mlp Accuracy: [0.68421053 0.93157895 0.37037037]


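This does OK on one fold, but accuracy swings from 0.37 to 0.93 across folds, so the MLP is far less stable than the RFC and SVM classifiers. Given the excessive runtime as well, it probably wouldn't be our model of choice. Neural networks typically like LOTS of data, and we only have ~500 or so observations, so that could be why this model is underperforming. The MLP was also fit on the unscaled features, which may contribute, since MLPs are sensitive to feature scaling.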
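Two inexpensive things to try before giving up on the MLP, both assumptions on our part rather than something tested in this notebook: standardize the inputs (`mlp` above was fit on the unscaled `X`) and use a much smaller hidden layer than three layers of 5000 units. The sketch below shows that setup on synthetic data. The data, the layer size of 32, and the names `small_mlp` and `demo_mlp_scores` are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 30 features (an assumption, not the cancer data)
X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=10, random_state=0)

# Scale inside the pipeline, then fit one small hidden layer with a fixed seed
small_mlp = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)),
])

demo_mlp_scores = cross_val_score(small_mlp, X_demo, y_demo, cv=3, scoring='accuracy')
print('small MLP accuracy per fold: {}'.format(demo_mlp_scores))
```

Whether this closes the gap on the cancer data is an open question until it is actually run there; the point is only that the configuration above is cheap to try.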