# Linear Regression Example

In this tutorial, we'll walk through how to use our automatic differentiation engine to perform multivariate linear regression on a real estate dataset. 

The data can be found [here](https://www.kaggle.com/quantbruce/real-estate-price-prediction).

Our goal here is not to create the best model to fit our data, but to simply show how to use our automatic differentiation engine to create, train, and test a linear regression model. Thus, our model will be very simplistic for pedagogical purposes.

# Data Preprocessing

We'll first take look at our data, validate it, and preprocess it to be in a form appropriate to use with our automatic differentiation engine.

Feel free to skip this section as it doesn't directly relate to how to use our automatic differentiation engine.

Let's first import some necessary libraries.

In [1]:
from autograd import Variable
import autograd
from tqdm import tqdm
import pandas as pd
import numpy as np
import random

Let's load the data. 

In [2]:
csv_path = './data/real_estate.csv'
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1


Let's verify that there are no null values we need to worry about. 

In [3]:
assert df.isnull().sum().sum() == 0

Our model will be very simple. We'll create a linear regression model that takes in a value from each column and will return the predicted price.

Our automatic differentiation engine expects numpy arrays. 

Let's convert the data into two numpy arrays, one for the inputs and one for the outputs.

In [4]:
input_df = df[['X1 transaction date', 'X2 house age', 'X3 distance to the nearest MRT station', 'X4 number of convenience stores', 'X5 latitude', 'X6 longitude']]
output_df = df['Y house price of unit area']
input_matrix = input_df.values
output_vector = output_df.values

In [5]:
type(input_matrix)

numpy.ndarray

In [6]:
input_matrix.shape

(414, 6)

In [7]:
type(output_vector)

numpy.ndarray

In [8]:
output_vector.shape

(414,)

Let's now split our data into a training set and test set. We'll use 70% of our data for training and 30% for testing. We'll manually set the random seeds for reproducibility purposes.

In [9]:
random.seed(0)
np.random.seed(0)

row_count = input_matrix.shape[0]
row_indices = list(range(row_count))
random.shuffle(row_indices)

last_training_row_position = round(row_count*0.70)
training_indices = row_indices[:last_training_row_position+1]
testing_indices = row_indices[last_training_row_position+1:]

training_input_matrix = input_matrix[training_indices]
training_output_vector = output_vector[training_indices]
testing_input_matrix = input_matrix[testing_indices]
testing_output_vector = output_vector[testing_indices]

We have 6 input columns, so we'll need to learn 6 multiplicative weights and 6 biases for our linear regression model. 

Let's initialize those weights.

In [10]:
weights = Variable(np.random.rand(6))
biases = Variable(np.random.rand(6))

First, we'll need to establish some hyperparameters for training our linear regression model. We'll be using stochastic gradient descent.

In practice, we might use various batch sizes or shuffle our training data between epochs, but we'll forgo this for the sake of simplicity as the purposes of this tutorial is to show how to use our automatic differentiation engine rather than finding the best model possible.

In [14]:
number_of_epochs = 2_000
learning_rate = 1e-4
sgd = autograd.optimizer.SGD(learning_rate)

Let's write our training loop. 

In [15]:
for epoch_index in tqdm(range(number_of_epochs)):
    for x, y in zip(training_input_matrix, training_output_vector):
        y_hat = np.multiply(y, weights) + biases
        loss = np.subtract(y, y_hat) ** 2
        sgd.take_training_step(loss)
print(loss.data.sum())

100%|██████████| 2000/2000 [00:39<00:00, 50.23it/s]

-0.0005999999871072146





In [16]:
loss.data

array([-9.99999991e-05, -9.99999966e-05, -9.99999960e-05, -9.99999993e-05,
       -9.99999973e-05, -9.99999988e-05])