## Regression Comparison

Let's compare standard (effectively unregularized) logistic regression with its L2-penalized (Ridge) and L1-penalized (Lasso) variants, and see how they perform on breast cancer diagnosis data.

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score

%matplotlib inline
sns.set_style('white')

In [2]:
full_df = pd.read_csv('Breast Cancer Diagnosis/breast-cancer-wisconsin.data', 
                      names=['id','clumpthickness','cellsize','cellshape','adhesion','epithelialsize',
                             'barenuclei','chromatin','nucleoli','mitosis','class']
                     ) 
full_df['class'] = full_df['class'].replace({2:0, 4:1})
full_df['barenuclei'] = pd.to_numeric(full_df['barenuclei'], errors='coerce')
full_df = full_df.dropna()
full_df.shape

(683, 11)

In [3]:
full_df.head()

        id  clumpthickness  cellsize  cellshape  adhesion  epithelialsize  barenuclei  chromatin  nucleoli  mitosis  class
0  1000025               5         1          1         1               2         1.0          3         1        1      0
1  1002945               5         4          4         5               7        10.0          3         2        1      0
2  1015425               3         1          1         1               2         2.0          3         1        1      0
3  1016277               6         8          8         1               3         4.0          3         7        1      0
4  1017023               4         1          1         3               2         1.0          3         1        1      0


In [4]:
Y = full_df['class'].values
# Copy the slice so that new columns can be added to X without a SettingWithCopyWarning.
X = full_df.loc[:, 'clumpthickness':'mitosis'].copy()

In [5]:
X.shape

(683, 9)

Let's create some new features to broaden the analysis.

In [6]:
X['cell_epithelial'] = X['cellsize'] * X['epithelialsize']
X['bare_nucleoli'] = X['barenuclei'] * X['nucleoli']
X['clump_shape'] = X['clumpthickness'] * X['cellshape']
X['chromatin_adhesion'] = X['chromatin'] * X['adhesion']
X['mitosis_shape'] = X['mitosis'] * X['cellshape']
X['clump_bare'] = X['clumpthickness'] * X['barenuclei']
X['size_2'] = X['cellsize'] * X['cellsize']
X['nucleoli_2'] = X['nucleoli'] * X['nucleoli']

In [31]:
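# LogisticRegression() is regularized by default. C is the inverse of the regularization
# strength (C = 1/lambda), so a very large C makes the penalty negligible.
# The liblinear solver is set explicitly because it supports both L1 and L2 penalties;
# the default solver in recent scikit-learn versions does not support L1.

regr = linear_model.LogisticRegression(penalty='l2', C=1e5, fit_intercept=False, solver='liblinear')
ridg = linear_model.LogisticRegression(penalty='l2', C=0.05, fit_intercept=False, solver='liblinear')
lass = linear_model.LogisticRegression(penalty='l1', C=0.05, fit_intercept=False, solver='liblinear')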

regr.fit(X,Y)
ridg.fit(X,Y)
lass.fit(X,Y)

print(regr.score(X,Y))
print(ridg.score(X,Y))
print(lass.score(X,Y))

0.972181551977
0.96486090776
0.96486090776


In [42]:
n_folds = 6
# With 6 folds on 683 rows, each training split has about 570 data points and each test split about 114.
# Changing this parameter changed the variance in the accuracy, but not by much.

regr_score = cross_val_score(regr, X, Y, cv=n_folds)
ridg_score = cross_val_score(ridg, X, Y, cv=n_folds)
lass_score = cross_val_score(lass, X, Y, cv=n_folds)

print('Vanilla Regression: %.2f +/- %.2f' % (regr_score.mean(), 2*regr_score.std()))
print('Ridge Regression: %.2f +/- %.2f' % (ridg_score.mean(), 2*ridg_score.std()))
print('Lasso Regression: %.2f +/- %.2f' % (lass_score.mean(), 2*lass_score.std()))

Vanilla Regression: 0.95 +/- 0.05
Ridge Regression: 0.95 +/- 0.08
Lasso Regression: 0.95 +/- 0.07
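
As a quick check on the split sizes quoted in the cell above, here is a minimal sketch of how 683 rows divide across 6 folds. It assumes the plain k-fold rule, where the first `n_samples % n_folds` folds receive one extra row. `fold_sizes` is a hypothetical helper, and the stratified splitting that `cross_val_score` applies to classifiers may shift individual counts slightly.

```python
def fold_sizes(n_samples, n_folds):
    """Test-fold sizes under plain k-fold splitting (hypothetical helper)."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds take one additional sample each.
    return [base + 1 if i < extra else base for i in range(n_folds)]

n_samples, n_folds = 683, 6
test_sizes = fold_sizes(n_samples, n_folds)
train_sizes = [n_samples - t for t in test_sizes]

for fold, (n_train, n_test) in enumerate(zip(train_sizes, test_sizes), start=1):
    print('Fold %d: %d training rows, %d test rows' % (fold, n_train, n_test))
```

With test folds this small, a handful of misclassified rows moves a fold's accuracy by several percentage points, which is worth keeping in mind when reading the +/- values.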


Accuracy for these models on this data set is about equal. Let's see if anything interesting happened with the coefficients.

In [33]:
regr.coef_

array([[ 0.20466653, -0.03989445, -2.43359885,  0.21604814, -0.88656952,
         0.46797946,  0.32484814, -0.45764693, -5.52135748,  0.2023485 ,
        -0.05425013,  0.01879739,  0.04341705,  2.85696572,  0.02113579,
        -0.08781763,  0.09922987]])

In [34]:
ridg.coef_

array([[-0.48879288, -0.26684986, -0.38816821, -0.34472689, -0.68488374,
        -0.03299346, -0.2810587 , -0.35349209, -0.32446246,  0.18337765,
        -0.03666279,  0.12499736,  0.20809154,  0.33215703,  0.11419692,
        -0.06864741,  0.08153829]])

In [35]:
lass.coef_

array([[-0.51919287,  0.        , -0.04181794,  0.        , -1.18056651,
         0.        , -0.0612637 ,  0.        ,  0.        ,  0.23576338,
        -0.00638086,  0.10505863,  0.05079747,  0.02403591,  0.09561895,
        -0.07905688,  0.03139293]])
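
To see which columns those zeros correspond to, the coefficients can be paired with the feature names. The sketch below hard-codes the `lass.coef_` values printed above and assumes the columns are in the order they were created (the nine original features, then the eight engineered ones). In the notebook itself, `pd.Series(lass.coef_[0], index=X.columns)` should give the same pairing without the copying.

```python
# Feature names in the assumed column order of X.
feature_names = ['clumpthickness', 'cellsize', 'cellshape', 'adhesion', 'epithelialsize',
                 'barenuclei', 'chromatin', 'nucleoli', 'mitosis',
                 'cell_epithelial', 'bare_nucleoli', 'clump_shape', 'chromatin_adhesion',
                 'mitosis_shape', 'clump_bare', 'size_2', 'nucleoli_2']

# Values copied from the lass.coef_ output above.
lasso_coefs = [-0.51919287, 0.0, -0.04181794, 0.0, -1.18056651,
               0.0, -0.0612637, 0.0, 0.0, 0.23576338,
               -0.00638086, 0.10505863, 0.05079747, 0.02403591, 0.09561895,
               -0.07905688, 0.03139293]

# Split the features into those Lasso zeroed out and those it kept.
dropped = [name for name, coef in zip(feature_names, lasso_coefs) if coef == 0.0]
kept = [name for name, coef in zip(feature_names, lasso_coefs) if coef != 0.0]

print('Dropped (%d): %s' % (len(dropped), ', '.join(dropped)))
print('Kept (%d): %s' % (len(kept), ', '.join(kept)))
```

Under that ordering, every dropped feature still enters the model through at least one engineered product, so its information is not discarded entirely.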

Nothing too crazy: these values are all reasonable. Lasso did eliminate 5 of the 17 features, which is nice. I suspect this data set is too small to really see substantial differences between the models. I played with C values (C = 1/lambda, so a smaller C means stronger regularization), and I can increase the L1 regularization to the point where only 5-6 features remain and still get 94-95% accuracy. For a larger data set, this could be a powerful tool. The L2 regularization doesn't prove to be necessary for this particular model: none of the coefficients in the Vanilla Regression are getting out of control, and Vanilla performs at least as well as the penalized models, with a slightly higher training accuracy (0.972 vs. 0.965) and the same mean cross-validation accuracy (0.95) with the tightest spread.
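
The experiment with C values described above can be sketched as a simple sweep. This is a rough illustration rather than the exact code used for the numbers quoted: it runs on synthetic stand-in data from `make_classification` (an assumption, so that the block runs without the CSV file), and the list of C values is arbitrary. On the real data, `X` and `Y` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the breast cancer set):
# 17 features to match the width of X, only some of them informative.
X_demo, y_demo = make_classification(n_samples=600, n_features=17, n_informative=5,
                                     n_redundant=2, random_state=0)

C_values = [0.001, 0.01, 0.05, 0.5, 10.0]  # smaller C = stronger L1 penalty
nonzero_counts = []
for C in C_values:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear', fit_intercept=False)
    # Mean 6-fold accuracy, as in the cross-validation cell earlier.
    acc = cross_val_score(model, X_demo, y_demo, cv=6).mean()
    # Refit on all rows to count how many coefficients survive the penalty.
    model.fit(X_demo, y_demo)
    n_nonzero = int(np.count_nonzero(model.coef_))
    nonzero_counts.append(n_nonzero)
    print('C=%-6g  nonzero coefficients: %2d  CV accuracy: %.2f' % (C, n_nonzero, acc))
```

Reading the printed rows from bottom to top shows how many features can be dropped before accuracy starts to fall.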

These accuracy values are also very similar to what I was able to achieve with a simple Random Forest model.