# Logistic Regression Example

In this tutorial, we'll walk through how to use our automatic differentiation engine to create and train a logistic regression model to predict diabetes. 

The data can be found [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

Our goal here is not to create the best model to fit our data, but simply to show how to use our automatic differentiation engine to create, train, and test a logistic regression model. Thus, our model will be very simplistic for pedagogical purposes.

# Data Preprocessing

We'll first take a look at our data, validate it, and preprocess it into a form appropriate for use with our automatic differentiation engine.

Feel free to skip this section as it doesn't directly relate to how to use our automatic differentiation engine.

Let's first import some necessary libraries.

In [1]:
from autograd import Variable, LinearLayer
import autograd
from tqdm import tqdm
import pandas as pd
import numpy as np
import random

Let's load the data. 

In [2]:
csv_path = './data/diabetes.csv'
df = pd.read_csv(csv_path)
df.head()

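| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |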


Let's verify that there are no null values we need to worry about. 

In [3]:
assert df.isnull().sum().sum() == 0
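
A null check alone may not catch every missing value. In this dataset, a 0 in columns such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, or `BMI` is not physiologically plausible and likely stands in for a missing measurement (note the zeros in the preview above). The sketch below counts such zeros per column. It uses a small hand-made frame copied from the five preview rows rather than the real file, and the column list `zero_as_missing_columns` is an assumption made for illustration.

```python
import pandas as pd

# Hand-made sample copied from the five preview rows (not the full dataset).
sample_df = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns stands in for a missing measurement.
zero_as_missing_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count the zeros in each of those columns.
zero_counts = (sample_df[zero_as_missing_columns] == 0).sum()
print(zero_counts)
```

Since our goal is a simple demonstration rather than the best model, we leave these values untouched in the rest of the tutorial.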

Our model will be very simple. We'll create a logistic regression model that takes in a value from each feature column and returns a prediction for `Outcome`, i.e. whether the patient has diabetes.

Our automatic differentiation engine expects numpy arrays. 

Let's convert the data into two numpy arrays, one for the inputs and one for the outputs.

In [4]:
input_df = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
output_df = df['Outcome']
input_matrix = input_df.values
output_vector = output_df.values

In [5]:
type(input_matrix)

numpy.ndarray

In [6]:
input_matrix.shape

(768, 8)

In [7]:
type(output_vector)

numpy.ndarray

In [8]:
output_vector.shape

(768,)

Let's now split our data into a training set and test set. We'll use 70% of our data for training and 30% for testing. We'll manually set the random seeds for reproducibility purposes.

In [9]:
random.seed(0)
np.random.seed(0)

row_count = input_matrix.shape[0]
row_indices = list(range(row_count))
random.shuffle(row_indices)

# Number of rows in the training set (70% of all rows).
training_row_count = round(row_count*0.70)
training_indices = row_indices[:training_row_count]
testing_indices = row_indices[training_row_count:]

training_input_matrix = input_matrix[training_indices]
training_output_vector = output_vector[training_indices]
testing_input_matrix = input_matrix[testing_indices]
testing_output_vector = output_vector[testing_indices]
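
As a sanity check on the split logic, here is a minimal standalone sketch of the same idea using only NumPy. The helper name `split_indices` is hypothetical, and it uses NumPy's `default_rng` rather than the `random` module used above, so the resulting indices differ from the notebook's.

```python
import numpy as np

def split_indices(row_count, training_fraction, seed):
    """Return disjoint (training, testing) index arrays that cover all rows."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(row_count)
    training_row_count = round(row_count * training_fraction)
    return shuffled[:training_row_count], shuffled[training_row_count:]

training_idx, testing_idx = split_indices(row_count=768, training_fraction=0.70, seed=0)
print(len(training_idx), len(testing_idx))  # 538 230
```

The two index arrays never overlap, so no test row leaks into the training set.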

A logistic regression model can be implemented as a dense linear layer with a sigmoid activation. 

This is how we'll implement our model. 

In [153]:
sigmoid = lambda x: np.exp(x)/(np.exp(x)+1)
linear_layer = LinearLayer(8, 1)
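
To make the "linear layer plus sigmoid" idea concrete, here is a plain-NumPy sketch of the forward pass. It does not use our engine's `Variable` or `LinearLayer`; the names `stable_sigmoid` and `predict_probability` are hypothetical. Note that the expression `np.exp(x)/(np.exp(x)+1)` overflows for large `x` and yields `nan`, so the sketch uses the equivalent `tanh` form instead. Whether `np.tanh` works on a `Variable` is not something this tutorial establishes.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid written via tanh, which does not overflow for large |z|."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def predict_probability(x, weights, bias):
    """Dense linear layer (8 inputs, 1 output) followed by a sigmoid."""
    return stable_sigmoid(x @ weights + bias)

weights = np.zeros(8)  # illustrative parameters, not trained values
bias = 0.0
x = np.array([6, 148, 72, 35, 0, 33.6, 0.627, 50])  # first preview row
print(predict_probability(x, weights, bias))  # 0.5 with all-zero parameters
```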

First, we'll need to establish some hyperparameters for training our logistic regression model. We'll be using stochastic gradient descent.

In practice, we might use various batch sizes or shuffle our training data between epochs, but we'll forgo this for the sake of simplicity, as the purpose of this tutorial is to show how to use our automatic differentiation engine rather than to find the best model possible.

In [160]:
number_of_epochs = 3
learning_rate = 1e-50
sgd = autograd.optimizer.SGD(learning_rate)
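
To show what the number of epochs and the learning rate control, here is a hand-written NumPy sketch of per-example stochastic gradient descent for a logistic model. It is not how our engine's `SGD` class is implemented. It assumes a binary cross-entropy loss (the usual choice for logistic regression), runs on synthetic toy data rather than the diabetes dataset, and the function name `sgd_train` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def sgd_train(inputs, targets, number_of_epochs, learning_rate):
    """Per-example SGD on the binary cross-entropy loss of a logistic model."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(number_of_epochs):
        for x, y in zip(inputs, targets):
            p = sigmoid(x @ weights + bias)
            error = p - y  # gradient of the loss w.r.t. the pre-sigmoid value
            weights -= learning_rate * error * x
            bias -= learning_rate * error
    return weights, bias

# Synthetic, linearly separable toy data (not the diabetes dataset).
rng = np.random.default_rng(0)
toy_inputs = rng.normal(size=(200, 2))
toy_targets = (toy_inputs[:, 0] + toy_inputs[:, 1] > 0).astype(float)

weights, bias = sgd_train(toy_inputs, toy_targets, number_of_epochs=3, learning_rate=0.1)
accuracy = np.mean((sigmoid(toy_inputs @ weights + bias) > 0.5) == toy_targets)
print(f'training accuracy: {accuracy:.2f}')
```

Each step moves the parameters against the gradient by an amount proportional to the learning rate, so a learning rate that is extremely small leaves the parameters essentially unchanged.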

Let's write our training loop. 

In [163]:
a = Variable(np.ones(8)) / 100
b = Variable(1)

epoch_iterator = tqdm(range(number_of_epochs))
for epoch_index in epoch_iterator:
    for x, y in zip(training_input_matrix, training_output_vector):
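        # The logistic model (linear layer + sigmoid) is disabled here:
        #linear_result = linear_layer(x/100)
        #y_hat = sigmoid(linear_result)
        # Instead, an elementwise linear model is fit with a squared-error loss.
        y_hat = a * x + b
        diff = (y_hat - y)
        loss = diff ** 2
        var2grad = sgd.take_training_step(loss)
        # print(var2grad[linear_layer.matrix])
        # print(np.linalg.norm(var2grad[a]))
        print(loss.sum())
        # Stop this epoch early if the prediction has diverged to NaN.
        if np.isnan(y_hat.data).any():
            break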
    epoch_iterator.set_description(f'Loss: {loss.sum().data:.8f}')
    epoch_iterator.refresh()

Loss: 2.62849471:   0%|          | 0/3 [00:00<?, ?it/s]

2.2430080129
5.980238817600002
16.2046531796
13.591470862400001
13.2210229881
3.649918722500001
14.6138876964
4.0446747488999995
16.3016370656
3.52370929
20.4468045721
17.7527031841
16.3073781201
2.5022634481000003
16.9056117081
16.8617154976
14.4464255696
14.26165769
7.414776164900001
23.337674704900003
5.9563903720999996
2.7236568641000005
13.668135451600001
6.056662576900001
14.976209988899999
13.300116132100001
15.326895523600001
16.7638422144
16.5412498116
15.5268903001
2.7792899888999996
16.2806862801
17.0445663601
2.9996416024999997
14.684627722500002
18.042235288100002
26.642609728900005
15.016451755600002
2.1238333761000003
2.3813577296000004
14.234526496400001
2.2722824516000006
16.0409587025
10.5584410404
22.383713315599998
15.2733475225
20.3948565444
36.5485205524
16.1066444484
15.874071864400001
4.5041738084
15.914539640900001
13.884291878400001
6.7260221715999995
18.8590857156
3.9216341584000007
16.1071910225
13.734266360100001
18.213349395599998
13.6861150224
18.43379764

Loss: 2.62849471: 100%|██████████| 3/3 [00:00<00:00,  8.55it/s]


20.7123531601
27.4809785616
15.6031910544
2.7821053824
3.5607604944
3.3962956249
14.8850494249
14.1331927636
17.4475621961
5.293551176099999
15.969998857599998
12.2945835721
13.213942937600002
16.457998073600002
15.300127759999999
14.759107155599999
2.0300130995999996
15.153848700900003
39.66555270250001
21.58586804
37.443325280399996
16.6911496644
17.86606
16.302088182400002
3.6400288368999996
19.2398500025
2.0552598809
15.3769215225
15.5847307081
16.600399835599998
4.998214712399999
34.2722898724
15.056609273600001
14.5466019049
14.703864972900003
4.563542876900001
18.28883589
3.161182875600001
14.520657768900003
20.604672602500003
10.896189241599998
18.6080268121
16.8493801081
16.7323841876
2.2590650495999993
21.1297171801
16.179351472100002
16.724980000000002
1.6343178208999998
2.2066725600000003
17.5166159281
13.3298907956
2.0145821009000002
8.2983126464
15.177820359999998
7.276491758399999
1.9051501204000008
1.1973234256
22.568418490000003
19.5844847204
3.1906306688999995
12.053




In [146]:
a

Variable([-6.86993812e-002,              nan, -6.46680134e-003,
          -1.29964944e-002, -5.12157446e+208, -1.42329934e-002,
          -6.66093411e-002, -9.98507746e-003])

In [49]:
y_hat

Variable([[0.002]])

Let's now see how our model performs on our test set.

In [None]:
mean_testing_loss = 0
for x, y in zip(testing_input_matrix, testing_output_vector):
    y_hat = (a * x + b).sum()  # predict from the inputs x using the trained a and b
    loss = (y_hat - y) ** 2
    mean_testing_loss += loss.data
mean_testing_loss /= len(testing_output_vector)

In [None]:
mean_testing_loss
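
For a model that outputs probabilities, the mean loss is often reported alongside classification accuracy at a fixed threshold. The following NumPy sketch shows both computations on made-up numbers. The helper names and the 0.5 threshold are assumptions for illustration, not part of our engine.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return float(np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2))

def accuracy_at_threshold(probabilities, targets, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predicted_labels = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predicted_labels == np.asarray(targets)))

# Made-up probabilities and labels, for illustration only.
example_probabilities = np.array([0.9, 0.2, 0.6, 0.4])
example_targets = np.array([1, 0, 0, 1])

print(mean_squared_error(example_probabilities, example_targets))
print(accuracy_at_threshold(example_probabilities, example_targets))
```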

Let's look at a few random results to sanity check our model.

In [None]:
random_test_examples = random.sample(list(zip(testing_input_matrix, testing_output_vector)), 10)
for x, y in random_test_examples:
    y_hat = (a * x + b).sum()  # predict from the inputs x using the trained a and b
    print()
    print(f'y_hat: {y_hat.data}')
    print(f'y:     {y}')