# Logistic Regression Example

In this tutorial, we'll walk through how to use our automatic differentiation engine to create and train a logistic regression model to predict breast cancer. 

The data can be found [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Our goal here is not to create the best model to fit our data, but to simply show how to use our automatic differentiation engine to create, train, and test a logistic regression model. Thus, our model will be very simplistic for pedagogical purposes.

# Data Preprocessing

We'll take a look at our data and then preprocess it into a form appropriate for use with our automatic differentiation engine.

Feel free to skip this section as it doesn't directly relate to how to use our automatic differentiation engine.

Let's first import some necessary libraries.

In [1]:
from autograd import Variable, LogisticRegressionLayer
import autograd
from tqdm import tqdm
import pandas as pd
import numpy as np
import random

Let's load the data. 

In [2]:
csv_path = './data/breast_cancer.csv'
df = pd.read_csv(csv_path, index_col=0)
df.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Let's clean our data.

Since this tutorial is focused on using our automatic differentiation engine rather than on machine learning itself, we won't do a deep dive into the data analysis that justifies the cleaning methods we use here. More examples of analysis on this dataset can be found [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/notebooks).

Let's start by dropping any rows that contain null values.

In [3]:
df.dropna(axis=0, inplace=True)
assert df.isnull().sum().sum() == 0

Let's make sure all of our data is numerical.

We're trying to predict whether or not the patient has breast cancer, so we'll need to convert the "diagnosis" column to a numerical value (where 1 denotes the presence of breast cancer and 0 denotes the absence of breast cancer).

In [4]:
df.diagnosis = df.diagnosis.map(lambda diagnosis: float(diagnosis=='M'))
assert {df[column].dtype for column in df.columns} == {np.dtype('float64')}

In [5]:
df.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,1.0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,1.0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,1.0,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,1.0,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,1.0,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Let's normalize our data.

In [6]:
for column in df.columns:
    if column != 'diagnosis':
        df[column] = (df[column] - df[column].mean()) / df[column].std()

df.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,1.0,1.0961,-2.071512,1.268817,0.98351,1.567087,3.280628,2.650542,2.530249,2.215566,...,1.885031,-1.358098,2.301575,1.999478,1.306537,2.614365,2.107672,2.294058,2.748204,1.935312
842517,1.0,1.828212,-0.353322,1.684473,1.90703,-0.826235,-0.486643,-0.023825,0.547662,0.001391,...,1.80434,-0.368879,1.533776,1.888827,-0.375282,-0.430066,-0.14662,1.086129,-0.243675,0.280943
84300903,1.0,1.578499,0.455786,1.565126,1.557513,0.941382,1.052,1.36228,2.03544,0.938859,...,1.510541,-0.023953,1.346291,1.455004,0.526944,1.08198,0.854222,1.953282,1.151242,0.201214
84348301,1.0,-0.768233,0.253509,-0.592166,-0.763792,3.280667,3.399917,1.914213,1.450431,2.864862,...,-0.281217,0.133866,-0.24972,-0.549538,3.391291,3.889975,1.987839,2.173873,6.040726,4.930672
84358402,1.0,1.748758,-1.150804,1.775011,1.824624,0.280125,0.538866,1.369806,1.427237,-0.009552,...,1.297434,-1.465481,1.337363,1.219651,0.220362,-0.313119,0.61264,0.728618,-0.86759,-0.396751
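The normalization above is a z-score standardization: each feature column is shifted to zero mean and scaled to unit standard deviation. As a minimal sketch, the same computation can be written in plain NumPy. The toy matrix below is made up for illustration and is not taken from the dataset. Note that pandas' `.std()` uses the sample standard deviation by default, which corresponds to `ddof=1` in NumPy.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples (rows), 2 features (columns).
features = np.array([
    [17.99, 10.38],
    [11.42, 20.38],
    [20.57, 17.77],
    [12.45, 15.70],
])

# ddof=1 gives the sample standard deviation, matching pandas' default .std().
column_means = features.mean(axis=0)
column_stds = features.std(axis=0, ddof=1)

# Broadcasting applies the per-column shift and scale to every row.
standardized = (features - column_means) / column_stds
```

After this step every column of `standardized` has mean 0 and sample standard deviation 1, which keeps features with large raw magnitudes (such as `area_mean`) from dominating the gradient updates.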


Our model will be very simple. We'll create a logistic regression model that takes in a value from each feature column and returns a prediction of whether or not the patient has breast cancer.

Our automatic differentiation engine expects numpy arrays. 

Let's convert the data into two numpy arrays, one for the inputs and one for the outputs.

In [7]:
input_df = df[[column for column in df.columns if column != 'diagnosis']]
output_df = df['diagnosis']
input_matrix = input_df.values
output_vector = output_df.values

In [8]:
type(input_matrix)

numpy.ndarray

In [9]:
input_matrix.shape

(569, 30)

In [10]:
type(output_vector)

numpy.ndarray

In [11]:
output_vector.shape

(569,)

Let's now split our data into a training set and test set. We'll use 50% of our data for training and 50% for testing. We'll manually set the random seeds for reproducibility purposes.

In [12]:
random.seed(0)
np.random.seed(0)

row_count = input_matrix.shape[0]
row_indices = list(range(row_count))
random.shuffle(row_indices)

last_training_row_position = round(row_count*0.50)
training_indices = row_indices[:last_training_row_position+1]
testing_indices = row_indices[last_training_row_position+1:]

training_input_matrix = input_matrix[training_indices]
training_output_vector = output_vector[training_indices]
testing_input_matrix = input_matrix[testing_indices]
testing_output_vector = output_vector[testing_indices]
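For comparison, a shuffled split can also be sketched with a NumPy permutation. The helper name `split_indices` is hypothetical, and because it uses a different random number generator it will not reproduce the exact shuffle produced by `random.shuffle` above. It also places the boundary slightly differently: with 569 rows, the cell above yields 285 training rows and 284 testing rows, while this sketch yields 284 and 285.

```python
import numpy as np

def split_indices(row_count, training_fraction, seed):
    """Return (training_indices, testing_indices) as disjoint index arrays."""
    rng = np.random.default_rng(seed)
    # A random ordering of 0..row_count-1 with no repeats.
    shuffled = rng.permutation(row_count)
    training_size = int(row_count * training_fraction)
    return shuffled[:training_size], shuffled[training_size:]

training_indices, testing_indices = split_indices(569, 0.5, seed=0)
```

The resulting index arrays can be used to index the input matrix and output vector in the same way as the lists in the cell above.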

A logistic regression model can be implemented as a dense linear layer with a sigmoid activation. 

Our automatic differentiation engine provides a layer that does exactly that.

In [13]:
logistic_regression_layer = LogisticRegressionLayer(30, 1)
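For intuition, the computation that a dense linear layer with a sigmoid activation performs can be sketched in plain NumPy. This is only an illustration of the math, not the engine's implementation: the function names, the random weight initialization, and the scalar bias are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_forward(x, weights, bias):
    """Linear combination of the inputs followed by a sigmoid."""
    return sigmoid(x @ weights + bias)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=30)  # one weight per input feature (assumed init)
bias = 0.0
x = rng.normal(size=30)                    # a made-up standardized sample

probability = logistic_forward(x, weights, bias)
```

The output can be read as the predicted probability that the sample belongs to class 1, which is why rounding it gives a 0 or 1 prediction.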

First, we'll need to establish some hyperparameters for training our logistic regression model. We'll be using stochastic gradient descent.

In practice, we might use various batch sizes or shuffle our training data between epochs, but we'll forgo this for the sake of simplicity, as the purpose of this tutorial is to show how to use our automatic differentiation engine rather than to find the best model possible.

In [14]:
number_of_epochs = 40
learning_rate = 1e-3
sgd = autograd.optimizer.SGD(learning_rate)

Let's write our training loop. We'll use cross entropy as the loss function along with L2 regularization.

In [15]:
epoch_iterator = tqdm(range(number_of_epochs))
for epoch_index in epoch_iterator:
    mean_loss = 0 
    for x, y in zip(training_input_matrix, training_output_vector):
        y_hat = logistic_regression_layer(x)
        loss = y_hat.bce_loss(y)
        mean_loss += loss.data.item()
        sgd.take_training_step(loss)
    mean_loss /= len(training_output_vector)
    epoch_iterator.set_description(f'Mean Loss: {mean_loss:.8f}')
    epoch_iterator.refresh()

Mean Loss: 0.10605361: 100%|██████████| 40/40 [00:03<00:00, 11.50it/s]
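To make the training step less opaque, here is a minimal NumPy sketch of what one stochastic gradient descent update on a binary cross entropy loss looks like when the gradient is written out by hand. It is an assumption-laden illustration rather than the engine's code: the helper names are made up, the gradient is derived manually instead of by automatic differentiation, and L2 regularization is left out to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y, epsilon=1e-12):
    """Binary cross entropy for one prediction; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, epsilon, 1.0 - epsilon)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def sgd_step(weights, bias, x, y, learning_rate):
    """One SGD update on a single (x, y) sample."""
    y_hat = sigmoid(x @ weights + bias)
    # For a sigmoid output with BCE loss, dLoss/dz simplifies to (y_hat - y).
    error = y_hat - y
    new_weights = weights - learning_rate * error * x
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

weights = np.zeros(3)
bias = 0.0
x = np.array([1.0, -2.0, 0.5])  # made-up sample
y = 1.0

loss_before = bce_loss(sigmoid(x @ weights + bias), y)
weights, bias = sgd_step(weights, bias, x, y, learning_rate=0.1)
loss_after = bce_loss(sigmoid(x @ weights + bias), y)
```

With all-zero parameters the model predicts 0.5, so `loss_before` equals ln 2; after one update the loss on the same sample is lower. An automatic differentiation engine computes the same gradients without requiring the derivative to be worked out by hand.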


Let's now see how our model performs on our test set.

In [16]:
true_positive_count = 0
true_negative_count = 0
false_positive_count = 0
false_negative_count = 0

for x, y in zip(testing_input_matrix, testing_output_vector):
    y_hat = logistic_regression_layer(x)
    y_hat = y_hat.round()
    is_correct = y_hat == y
    is_positive = y == 1
    if is_correct:
        if is_positive:
            true_positive_count += 1
        else:
            true_negative_count += 1
    else:
        if is_positive:
            # The model predicted 0 for a true label of 1.
            false_negative_count += 1
        else:
            # The model predicted 1 for a true label of 0.
            false_positive_count += 1

precision = true_positive_count/(true_positive_count+false_positive_count)
recall = true_positive_count/(true_positive_count+false_negative_count)
f1 = 2 * precision * recall / (precision + recall)

accuracy = (true_positive_count + true_negative_count) / (true_positive_count + true_negative_count + false_positive_count + false_negative_count)

print(f'True Positives:  {true_positive_count}')
print(f'True Negatives:  {true_negative_count}')
print(f'False Positives: {false_positive_count}')
print(f'False Negatives: {false_negative_count}')
print()
print(f'Precision: {precision}')
print(f'Recall:    {recall}')
print(f'F1:        {f1}')
print(f'Accuracy:  {accuracy}')

True Positives:  89
True Negatives:  188
False Positives: 6
False Negatives: 1

Precision: 0.9368421052631579
Recall:    0.9888888888888889
F1:        0.9621621621621621
Accuracy:  0.9753521126760564
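As a cross-check on the definitions used in these metrics, the four counts can also be computed in a vectorized way with NumPy. A false positive is a prediction of 1 when the true label is 0, and a false negative is a prediction of 0 when the true label is 1. The helper name `confusion_counts` and the toy predictions below are hypothetical and are not results from the model.

```python
import numpy as np

def confusion_counts(predictions, labels):
    """Return (TP, TN, FP, FN) for binary predictions and labels."""
    predictions = np.asarray(predictions).astype(bool)
    labels = np.asarray(labels).astype(bool)
    true_positives = int(np.sum(predictions & labels))
    true_negatives = int(np.sum(~predictions & ~labels))
    false_positives = int(np.sum(predictions & ~labels))
    false_negatives = int(np.sum(~predictions & labels))
    return true_positives, true_negatives, false_positives, false_negatives

# Made-up predictions and labels for six samples.
predictions = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(predictions, labels)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
```

In a screening setting such as this one, recall is often the metric to watch, since a false negative corresponds to a missed diagnosis.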
