# Part 4 - Simple Linear Regression

Let's get our feet wet with a basic Machine Learning algorithm: Simple Linear Regression.  
Everyone remembers the equation of a line:  
  
$$
y = mx + b
$$ 
  
Given a value of `x`, we can compute `y` by multiplying `x` by the slope `m` and adding the y-intercept `b`.  
Linear Regression uses the same basic concept, except that `x` can be multidimensional, in which case it is denoted as a matrix `X`.  
This notebook deals only with one-dimensional `x`.
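
As a quick illustration of the paragraph above, here is a minimal NumPy sketch of the line equation for a single feature and of its multi-feature counterpart. The slope, intercept, weights, and inputs are made-up values chosen only for this example.

```python
import numpy as np

# Hypothetical slope and y-intercept, chosen only for illustration.
m, b = 2.0, 1.0

# Single feature: y = m*x + b, applied element-wise to an array of inputs.
x_single = np.array([0.0, 1.0, 2.0])
y_single = m * x_single + b
print(y_single)  # [1. 3. 5.]

# Multiple features: each row of X is one sample, each column one feature,
# and the slope becomes a vector of weights w (one weight per feature).
X_multi = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
w = np.array([0.5, -1.0])
y_multi = X_multi @ w + b
print(y_multi)  # [-0.5 -1.5]
```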

In [None]:
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline

## Let's load the data and remind ourselves of the contents

In [None]:
df = pd.read_csv('./data/rew_van_jan12_clean_engineered.csv')
df.head()

For this basic linear regression example we will choose only one variable as our independent variable `x`.  
In our EDA we discovered that `sqft` was highly correlated with `price` and they also shared a linear relationship.  
This makes `sqft` a great candidate for inferring `price` using linear regression.  

Recall our line equation: 
$$
y = mx + b
$$ 

We assign the `price` column of our DataFrame as the dependent variable `y` and the `sqft` column as the independent variable `x`.

In [None]:
x = df['sqft']
y = df['price']

Note:  
Indexing into a single column of a `pandas.DataFrame` will return a `pandas.Series` object while indexing multiple columns will return another `pandas.DataFrame`. Additionally, the `pandas.Series` object is a wrapper around the popular `numpy.ndarray` datatype. We can get the `numpy.ndarray` from the `Series` using `Series.values`.  
See below:

In [None]:
print("df type = {}".format(type(df)))
print("x type = {}".format(type(x)))
print("y type = {}".format(type(y)))
multi_indexed = df[['sqft', 'price']]
print("multi_indexed type = {}".format(type(multi_indexed)))
ndarray = x.values
print("x.values type = {}".format(type(ndarray)))

## Plot `price` vs `sqft`

In [None]:
sns.set(rc={'figure.figsize': (12, 8)})  # globally set our seaborn plot size to 12 by 8 inches
sns.regplot(x=x, y=y, fit_reg=False)  # pass x and y by keyword; recent seaborn releases do not accept them positionally

We observe a linear relationship between `x` and `y`. We could probably estimate a best-fitting line by eye, but let's determine it algorithmically using [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares).  

Ordinary Least Squares (OLS) is a great introduction to our first **optimization** problem, in which we **minimize** the sum of squared differences between the observed and predicted values in order to find the best-fit line for the data.  
  
Conceptually, the optimization can be pictured as an iterative search:
1. start with an initial *guess* at the slope `m` and y-intercept `b` of the line equation.
2. calculate the sum of squared differences between the predicted value of `y` and the actual value of `y`.
3. adjust the parameters (slope `m` and y-intercept `b`) such that the error gets *smaller*.
4. repeat steps 2-3 until the error cannot decrease any further.
  
See [here] for a quick, interactive tutorial if you're keen on learning more about OLS.
[here]: http://setosa.io/ev/ordinary-least-squares-regression/
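
The four steps above can be sketched as a small gradient-descent loop. This is only an illustration on synthetic data: the data, the `learning_rate`, and the fixed iteration count are assumptions made for this example, and the loop minimizes the *mean* squared error, which has the same minimizer as the sum. Library implementations such as Scikit-Learn's `LinearRegression` typically solve the least-squares problem directly rather than looping, so the sketch also compares against a direct fit from `np.polyfit`.

```python
import numpy as np

# Synthetic data, assumed for illustration: points near the line y = 3x + 2.
rng = np.random.default_rng(0)
x_syn = rng.uniform(0.0, 1.0, size=200)
y_syn = 3.0 * x_syn + 2.0 + rng.normal(0.0, 0.1, size=200)

m_hat, b_hat = 0.0, 0.0   # step 1: initial guess for slope and intercept
learning_rate = 0.5       # hypothetical step size, picked for this toy data

# Step 4: repeat (a fixed number of iterations stands in for a convergence check).
for _ in range(5000):
    residual = (m_hat * x_syn + b_hat) - y_syn   # step 2: predicted minus actual
    grad_m = 2.0 * np.mean(residual * x_syn)     # gradient of the mean squared error w.r.t. m
    grad_b = 2.0 * np.mean(residual)             # gradient of the mean squared error w.r.t. b
    m_hat -= learning_rate * grad_m              # step 3: adjust parameters to reduce the error
    b_hat -= learning_rate * grad_b

# Direct least-squares fit of a degree-1 polynomial, for comparison.
m_direct, b_direct = np.polyfit(x_syn, y_syn, deg=1)
print("iterative: m={:.4f}, b={:.4f}".format(m_hat, b_hat))
print("direct:    m={:.4f}, b={:.4f}".format(m_direct, b_direct))
```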

Rather than write our linear regressor from scratch, let's use the [LinearRegression] class from [Scikit-Learn], a popular machine learning library.  
[LinearRegression]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
[Scikit-Learn]: http://scikit-learn.org/stable/

In [None]:
regressor = LinearRegression()  # the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2

We can now simply call `fit()` on the `regressor` object to run OLS and fit the line to the training data.  
  
Note:  
`fit()` expects `x` to be a 2D array of shape `(num_samples, num_features)`; `y` may be of shape `(num_samples,)` or `(num_samples, num_targets)`, and here we will use `(num_samples, 1)`.  
Recall that we are only dealing with single-variable linear regression in this notebook, so our number of features is 1. In multiple linear regression the number of features can be very large. 

Let's first try with our `x` and `y` Series so we can get acquainted with the error message in case we encounter something similar in the future.

In [None]:
print("x.shape = {}".format(x.shape))
print("y.shape = {}".format(y.shape))
model = regressor.fit(x,y)

We get:  
`ValueError: Expected 2D array, got 1D array instead`  
Let's reshape our data into 2D `numpy.ndarray` objects.

In [None]:
num_samples = len(x)
assert len(x) == len(y) # be sure we have the same number of training samples as target samples
num_features = 1 # only a single feature, sqft
x_np = x.values.reshape((num_samples, num_features))
y_np = y.values.reshape((num_samples, 1))
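
For reference, there are a few equivalent ways to obtain the 2D column shape that `fit()` expects. The sketch below uses a tiny stand-in DataFrame named `demo` with made-up values rather than our real dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in DataFrame; the values are made up for illustration.
demo = pd.DataFrame({'sqft': [500, 750, 1000],
                     'price': [400000, 550000, 700000]})

via_reshape = demo['sqft'].values.reshape(-1, 1)      # -1 lets NumPy infer the number of rows
via_double_brackets = demo[['sqft']].values           # double brackets select a one-column DataFrame
via_newaxis = demo['sqft'].to_numpy()[:, np.newaxis]  # add a trailing axis to the 1D array

print(via_reshape.shape, via_double_brackets.shape, via_newaxis.shape)  # (3, 1) (3, 1) (3, 1)
```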

In [None]:
print("x_np.shape = {}".format(x_np.shape))
print("y_np.shape = {}".format(y_np.shape))
model = regressor.fit(x_np, y_np)

The regressor has been fit to our training data and saved into a variable named `model`.  
Now we can call `predict()` to predict `price` for a given `sqft`. For illustrative purposes we will predict on the same data we trained on.

We can evaluate the performance of a given model using Scikit-Learn's `mean_squared_error` function and the `LinearRegression.score()` method. Note that for a regressor, `score()` returns the coefficient of determination $R^2$ (at most 1, higher is better), not a classification accuracy.

In [None]:
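def evaluate_model(model, X, y):
    """Print the mean squared error and the R^2 score of `model` on (X, y)."""
    y_pred = model.predict(X) # predict y values from input X
    mse = mean_squared_error(y_true=y, y_pred=y_pred)
    print("Mean Squared Error: {}".format(mse))
    # score() returns the coefficient of determination R^2, not an accuracy
    print("R^2 score: {}".format(model.score(X, y)))
evaluate_model(model, x_np, y_np)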

As our model improves we expect the MSE to decrease and the $R^2$ score returned by `score()` to increase toward 1.
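
As a sketch of what these two numbers measure, the block below computes them by hand with NumPy on a few made-up values and checks the results against Scikit-Learn's `mean_squared_error` and `r2_score` (for a regressor, `score()` reports this same $R^2$). The arrays `y_true` and `y_hat` are illustrative and are not taken from our dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the average of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both hand-computed values should agree with the library functions.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
print("MSE: {:.3f}, R^2: {:.3f}".format(mse_manual, r2_manual))
```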

Let's try on a brand new input from https://www.rew.ca/properties/areas/vancouver-bc

In [None]:
# try brand new data
actual_price = '$5,688,000'
sqft = 3790
new_df = pd.DataFrame(data=[sqft])
predicted_price = model.predict(new_df)
print("predicted price: ${}M".format(predicted_price[0][0]/1e6))
print("actual price: {}".format(actual_price))
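
Under the hood, `predict()` for a linear model evaluates the line equation with the learned slope and intercept, which Scikit-Learn exposes as the `coef_` and `intercept_` attributes. A minimal sketch on synthetic, noise-free data (the names `x_line`, `y_line`, and `line_model` are made up for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 4x + 10 (illustration only).
x_line = np.arange(10, dtype=float).reshape(-1, 1)
y_line = 4.0 * x_line + 10.0

line_model = LinearRegression().fit(x_line, y_line)
m_fit = line_model.coef_[0][0]     # slope; coef_ is 2D here because y_line is 2D
b_fit = line_model.intercept_[0]   # y-intercept

# Evaluating y = m*x + b by hand should match predict().
x_new = np.array([[12.0]])
manual_prediction = m_fit * x_new + b_fit
assert np.allclose(manual_prediction, line_model.predict(x_new))
print("m={:.2f}, b={:.2f}".format(m_fit, b_fit))
```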

Plot the predicted values (red) and actual values (blue) on the same graph

In [None]:
y_pred = model.predict(x_np) # predict y values from input X
sns.regplot(x=x_np.ravel(), y=y_np.ravel(), fit_reg=False, color='blue') # actual values
sns.regplot(x=x_np.ravel(), y=y_pred.ravel(), fit_reg=False, color='red') # predicted values

Our model looks like it has fit nicely to the data!  
One immediate drawback you may notice is that the predictions are constrained to the regression line, which results in a large error for many inputs.  
Even more importantly: we have many more features other than `sqft` which can help predict `price`.
We address these problems in the next notebook by introducing **multiple linear regression**.

## Model Validation
In the above example we trained and tested our model with the same dataset. In practice this is a **big** mistake. We will get false confidence in our model's performance since we didn't validate its ability to generalize to new, unseen data from outside of the training set.  
This inability to generalize is called [overfitting] and is one of the most common problems Machine Learning Engineers face.  
  
New data is typically hard to come by, but we can do our best to detect overfitting by "holding out" some of our training data as *validation* data. Hopefully our dataset is diverse enough that a randomly sampled validation set can properly represent new, unseen test data.
[overfitting]: https://en.wikipedia.org/wiki/Overfitting
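
To make the idea of "holding out" data concrete, here is a minimal sketch that shuffles the row indices by hand and keeps 30% of a synthetic dataset aside for validation. The data and the names `x_syn`, `y_syn`, and `holdout_model` are assumptions made for this example. Comparing the two errors is one simple check: a validation error far above the training error would suggest overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data, assumed for illustration: y = 2x + 1 plus unit-variance noise.
rng = np.random.default_rng(42)
x_syn = rng.uniform(0.0, 10.0, size=(100, 1))
y_syn = 2.0 * x_syn + 1.0 + rng.normal(0.0, 1.0, size=(100, 1))

# Shuffle the row indices, then hold out the last 30% for validation.
indices = rng.permutation(len(x_syn))
n_train = len(x_syn) * 7 // 10
train_idx, val_idx = indices[:n_train], indices[n_train:]

# Fit on the training rows only, then measure the error on both subsets.
holdout_model = LinearRegression().fit(x_syn[train_idx], y_syn[train_idx])
train_mse = mean_squared_error(y_syn[train_idx], holdout_model.predict(x_syn[train_idx]))
val_mse = mean_squared_error(y_syn[val_idx], holdout_model.predict(x_syn[val_idx]))
print("train MSE: {:.3f}".format(train_mse))
print("validation MSE: {:.3f}".format(val_mse))
```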

Use Scikit-Learn's `train_test_split()` to divide our dataset into *training* and *validation* data.  
A common rule of thumb is 70% training, 30% validation.

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x_np, 
                                                  y_np, 
                                                  test_size=0.30, 
                                                  random_state=123) # split 70% train, 30% validation

Fit the model to our training set and evaluate using our validation set

In [None]:
model = regressor.fit(x_train, y_train)
evaluate_model(model, x_val, y_val)

The results are very similar to before, but don't expect to get this lucky in practice!  
Let's save our model using Python's built-in persistence library `pickle`. That way we can compare results with more complex models in a future notebook.

Since we are reasonably confident our model is not overfitting, let's retrain on the entire dataset before saving.

In [None]:
model = regressor.fit(x_np, y_np)

In [None]:
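import os
import pickle
os.makedirs('./models', exist_ok=True)  # create the output directory if it does not exist yet
with open('./models/simple_linear.pkl', 'wb') as f:
    pickle.dump(model, f)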

We can reload the model with:

In [None]:
with open('./models/simple_linear.pkl', 'rb') as f:
    model_load = pickle.load(f)
evaluate_model(model_load, x_val, y_val)
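
A quick way to convince ourselves that pickling preserves the model is an in-memory round trip, sketched below on a toy model (the names `x_toy`, `y_toy`, `original`, and `restored` are made up for this example). Keep in mind that a pickle file should only be loaded if it comes from a source you trust, and that a pickled Scikit-Learn model is best reloaded with the same library version that saved it.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on the line y = 2x, for illustration only.
x_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([[2.0], [4.0], [6.0]])
original = LinearRegression().fit(x_toy, y_toy)

payload = pickle.dumps(original)   # serialize the fitted model to bytes in memory
restored = pickle.loads(payload)   # rebuild an equivalent model from those bytes

# The restored model should make the same predictions as the original.
assert np.allclose(original.predict(x_toy), restored.predict(x_toy))
print("round trip OK, payload size: {} bytes".format(len(payload)))
```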

## We are done with Simple Linear Regression. Let's see if we can improve our model's performance by including more features!