# L2 Regularized Logistic Regression Model

This demo compares my L2-regularized logistic regression model, built with NumPy, to scikit-learn's L2-regularized logistic regression model. The task is to classify tumors in the Wisconsin Breast Cancer Dataset as benign or malignant. The misclassification error on the test set for both models is printed at the end of the notebook.

- **Dataset:** https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
- **Author:** Joel Stremmel (jstremme@uw.edu)
- **Credits:** University of Washington DATA 558 with Zaid Harchaoui and Corinne Jones

### Import Standard Scikit-Learn Functionality

In [1]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_recall_fscore_support as score

### Import Dataset

In [2]:
from sklearn.datasets import load_breast_cancer

### Import Methods for Model Training and Analysis

In [3]:
import numpy as np  # used explicitly below (np.argmax) rather than relying on the wildcard import
from l2_regularized_logistic_regression import *

### Import Scikit-Learn Model

In [4]:
from sklearn.linear_model import LogisticRegression

### Load Data and Separate Features and Targets

In [5]:
data = load_breast_cancer()
X = data.data
y = data.target
features = data.feature_names

### Train Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

### Scale Input Data and Transform Binary Targets to {-1, 1}

In [7]:
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)
y_train = transform_target(y_train)
y_test = transform_target(y_test)
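
The implementation of `transform_target` lives in the imported module and is not shown in this notebook. As a minimal sketch, assuming it simply maps the dataset's 0/1 labels to -1/+1, a hypothetical stand-in (`to_plus_minus_one` is an illustrative name, not the module's function) could look like this:

```python
import numpy as np

def to_plus_minus_one(y):
    """Map binary labels in {0, 1} to {-1, +1} (hypothetical stand-in)."""
    y = np.asarray(y)
    # 0 -> -1 and 1 -> +1
    return 2 * y - 1

labels = np.array([0, 1, 1, 0])
signed = to_plus_minus_one(labels)
print(signed)
```

Because the scikit-learn model is later fit on the same transformed targets, its predictions also come back as -1/+1, so both models can be scored against the same `y_test`.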

### Grid Search for the Regularization Strength: C (Scikit-Learn) and Lambda (My Model)

In [8]:
estimator = LogisticRegression(penalty='l2', solver='liblinear')
param_grid = {'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}

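# Note: the search is fit on the full, unscaled X and y rather than on X_train and y_train,
# so the test rows also influence the selected C.
clf = GridSearchCV(estimator=estimator, cv=5, param_grid=param_grid)
clf = clf.fit(X, y)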

results = clf.cv_results_
best_run = np.argmax(results['mean_test_score'])
best_c = results['params'][best_run]['C']
best_lambda = 1/best_c * 1/len(y_train) * 1/2
print('Best C: {}'.format(best_c))
print('Best lambda: {}'.format(best_lambda))

Best C: 5
Best lambda: 0.0002347417840375587
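
The conversion from C to lambda can be motivated as follows. With the `liblinear` solver, scikit-learn minimizes `0.5 * ||w||^2 + C * sum(log-loss)`. Assuming my model minimizes `(1/n) * sum(log-loss) + lambda * ||beta||^2` (an assumption, since the objective is defined in the imported module), dividing the scikit-learn objective by `C * n` gives `lambda = 1 / (2 * n * C)`. The sketch below uses a hypothetical helper, `c_to_lambda`, and the training-set size of 426 implied by the printed lambda:

```python
def c_to_lambda(c, n_samples):
    """Convert scikit-learn's C to lambda, assuming the objective
    (1/n) * sum(log-loss) + lambda * ||beta||^2 (hypothetical helper)."""
    return 1.0 / (2.0 * n_samples * c)

# C = 5 from the grid search, n = 426 training samples
lam = c_to_lambda(5, 426)
print(lam)
```

The two objectives match only approximately in practice: `liblinear` also penalizes the intercept, so small differences between the fitted coefficients would not be surprising.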


### Train Scikit-Learn Model with Best C

In [9]:
clf = LogisticRegression(penalty='l2', solver='liblinear', C=best_c, tol=0.001)
clf = clf.fit(X_train, y_train)

### Compute Misclassification Error

In [10]:
sklearn_preds = clf.predict(X_test)
error(sklearn_preds, y_test)

0.04895104895104896
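
The `error` function comes from the imported module. A minimal sketch of misclassification error, assuming it is the fraction of predictions that differ from the true labels (`misclassification_error` is a hypothetical name):

```python
import numpy as np

def misclassification_error(preds, targets):
    """Fraction of predictions that disagree with the targets (hypothetical helper)."""
    preds = np.asarray(preds)
    targets = np.asarray(targets)
    return float(np.mean(preds != targets))

# Two of the four toy predictions are wrong
err = misclassification_error([1, -1, 1, 1], [1, 1, 1, -1])
print(err)
```

The value printed above is consistent with 7 misclassified samples out of 143 (7/143 ≈ 0.04895), the test-set size expected from a 25% split of the 569-row dataset.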

### Train My Model with Best Lambda

In [11]:
beta_vals = l2_log_reg(X_train, y_train, lambda_penalty=best_lambda, eps=0.001, v=0)
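
The training routine `l2_log_reg` is defined in the imported module and is not reproduced here. As a rough sketch of the general technique, and not the module's actual code, the following runs plain fixed-step gradient descent on the objective `(1/n) * sum(log(1 + exp(-y_i * x_i @ beta))) + lam * ||beta||^2`. The fixed step size, the stopping rule on the gradient norm, the absence of an intercept, and all function names are assumptions made for illustration:

```python
import numpy as np

def objective(beta, X, y, lam):
    """Mean logistic loss plus an L2 penalty, for labels y in {-1, +1}."""
    margins = y * (X @ beta)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * (beta @ beta)

def gradient(beta, X, y, lam):
    """Gradient of the objective above."""
    margins = y * (X @ beta)
    # 1 / (1 + exp(margin)), computed in a numerically stable way
    weights = np.exp(-np.logaddexp(0.0, margins))
    return -(X.T @ (y * weights)) / len(y) + 2.0 * lam * beta

def gradient_descent(X, y, lam, step=0.5, eps=1e-3, max_iter=10000):
    """Fixed-step gradient descent; returns the list of iterates."""
    beta = np.zeros(X.shape[1])
    history = [beta]
    for _ in range(max_iter):
        grad = gradient(beta, X, y, lam)
        if np.linalg.norm(grad) <= eps:
            break
        beta = beta - step * grad
        history.append(beta)
    return history

# Toy data: labels are determined by a known linear rule
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy @ np.array([2.0, -1.0, 0.5]) > 0, 1, -1)

history = gradient_descent(X_toy, y_toy, lam=0.01)
final_beta = history[-1]
print(objective(history[0], X_toy, y_toy, 0.01), objective(final_beta, X_toy, y_toy, 0.01))
```

Returning the full list of iterates mirrors how the notebook reads the final coefficients from `beta_vals[-1]`.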

### Compute Misclassification Error

In [12]:
my_preds = predict(beta_vals[-1], X_test, threshold=0.5)
error(my_preds, y_test)

0.04895104895104896
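
The `threshold=0.5` argument suggests that `predict` turns a linear score into a probability and then into a class label. A minimal sketch of that idea, assuming a sigmoid link, no intercept, and -1/+1 output labels (`predict_labels` is a hypothetical name, not the module's function):

```python
import numpy as np

def predict_labels(beta, X, threshold=0.5):
    """Label a row +1 when sigmoid(x @ beta) >= threshold, else -1 (hypothetical helper)."""
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.where(probs >= threshold, 1, -1)

beta_toy = np.array([1.0, -1.0])
X_rows = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_labels = predict_labels(beta_toy, X_rows)
print(toy_labels)
```

With a threshold of 0.5, this is equivalent to taking the sign of the linear score `X @ beta`, with a score of exactly zero assigned to +1 under this sketch's convention.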