# Linear Regression
In this notebook we will:
- load some data
- plot it with a scatterplot
- calculate the cost function with respect to a straight line.

In [0]:
!git clone https://github.com/ArctiqTeam/e-ml-workshop

## LR

In [0]:
# import pkgs
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [0]:
#load file
df = pd.read_csv('./e-ml-workshop/data/weight-height.csv')

In [0]:
df.head()

In [0]:
df.plot(kind='scatter',
        x='Height',
        y='Weight',
        title='Weight and Height in adults')

In [0]:
df.plot(kind='scatter',
        x='Height',
        y='Weight',
        title='Weight and Height in adults')

# Here we're plotting the red line 'by hand' with fixed values
# We'll try to learn this line with an algorithm below
# The endpoints are arbitrary, yet the line seems to fit pretty well

plt.plot([55, 78], [75, 250], color='red', linewidth=3)

In [0]:
# Let's define a function called "line" that takes some values x and applies the equation of a straight line:
# it multiplies x by the weight w and adds the bias b, both set to zero by default

def line(x, w=0, b=0):
    return x * w + b

In [0]:
# notice we can define a space of 100 points equally spaced between 55 and 80.
x = np.linspace(55, 80, 100)

In [0]:
x

In [0]:
yhat = line(x, w=0, b=0)

In [0]:
# Now let's look at the yhat we just calculated.
# It should be an array of zeros (because both w and b are zero)
yhat

In [0]:
# Let's plot yhat as a function of x on top of our data.
df.plot(kind='scatter',
        x='Height',
        y='Weight',
        title='Weight and Height in adults')
plt.plot(x, yhat, color='red', linewidth=3)

In [0]:
# It's a straight line passing through zero with zero slope.

### Cost Function
- Now we are going to calculate the cost.
- The cost is the mean squared error of the residuals of the data points from this line.

In [0]:
# This takes our true data (y_true) and our predicted data (y_pred), i.e. the points on the line.
# It takes the difference, squares it, saves it in a temporary variable called s,
# and then returns the mean of s: the Mean Squared Error.

def mean_squared_error(y_true, y_pred):
    s = (y_true - y_pred)**2
    return s.mean()
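To make the formula concrete, here is a small worked example. The three weights below are made-up numbers chosen for easy arithmetic (they are not from the dataset), and the prediction is the all-zeros line we plotted above.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    s = (y_true - y_pred) ** 2
    return s.mean()

# Hypothetical weights and a constant prediction of zero
y_true_demo = np.array([100.0, 150.0, 200.0])
y_pred_demo = np.zeros(3)

# Residuals are 100, 150, 200; their squares are 10000, 22500, 40000
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
print(mse_demo)  # 72500 / 3, roughly 24166.67
```

The cost is large because the zero line is far from every point; a better line should bring it down.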

In [0]:
# Now we define two arrays: X holds the values of height and y_true the values of weight.
# Using .values makes X and y_true numpy arrays rather than pandas objects.

X = df[['Height']].values
y_true = df['Weight'].values

In [0]:
# X and y_true are now numpy arrays.
y_true

In [0]:
X

In [0]:
# Now we can calculate the predicted y by applying the line function defined above to X
y_pred = line(X)

In [0]:
# Column array (because X was a column vector)
y_pred

In [0]:
mses=mean_squared_error(y_true, y_pred.ravel())
mses
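Why the call to ravel()? The predictions are a column array of shape (N, 1) while y_true has shape (N,), and numpy broadcasting would silently turn their difference into an N x N matrix. The sketch below shows the effect on three made-up values (the names with a _demo or _col suffix are ours, not part of the notebook).

```python
import numpy as np

y_true_demo = np.array([1.0, 2.0, 3.0])        # shape (3,), like y_true
y_pred_col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like line(X)

# Without ravel, broadcasting produces all pairwise differences
diff_wrong = y_true_demo - y_pred_col
print(diff_wrong.shape)   # (3, 3)

# With ravel, the shapes match and the difference is element-wise
diff_right = y_true_demo - y_pred_col.ravel()
print(diff_right.shape)   # (3,)

mse_wrong = (diff_wrong ** 2).mean()
mse_right = (diff_right ** 2).mean()
print(mse_wrong, mse_right)
```

Here the predictions are perfect, so the correct cost is zero, yet the broadcast version reports a positive cost. No error is raised, which is why this mistake is easy to miss.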

In [0]:
y_pred

### You do it!

Try changing the values of the parameters b and w in the line above and plot it again to see how the plot and the cost change.

In [0]:
plt.figure(figsize=(10, 5))

# we are going to draw 2 plots in the same figure
# first plot, data and a few lines
ax1 = plt.subplot(121)
df.plot(kind='scatter',
        x='Height',
        y='Weight',
        title='Weight and Height in adults', ax=ax1)

# let's explore the cost function for a few values of b between -100 and +150
bbs = np.array([-100, -50, 0, 50, 100, 150])

# we will append the values of the cost here, for each line
mses = []  
for b in bbs:
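    y_pred = line(X, w=2, b=b)
    # ravel() flattens the (N, 1) predictions to match y_true's shape (N,);
    # without it, broadcasting would yield an (N, N) matrix and a wrong cost
    mse = mean_squared_error(y_true, y_pred.ravel())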
    mses.append(mse)
    plt.plot(X, y_pred)

# second plot: Cost function
ax2 = plt.subplot(122)
plt.plot(bbs, mses, 'o-')
plt.title('Cost as a function of b')
plt.xlabel('b')

Of course, if we also start changing the slope, the cost can go down even further.
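As a rough sketch of that idea, we can scan a grid of slopes and intercepts and keep the pair with the lowest cost. The data below is synthetic (a made-up line plus noise standing in for the height/weight CSV), and the grid ranges are arbitrary choices for this illustration.

```python
import numpy as np

def line(x, w=0, b=0):
    return x * w + b

def mean_squared_error(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

# Synthetic stand-in for the height/weight data (assumed, not the real CSV)
rng = np.random.default_rng(0)
heights = np.linspace(55, 80, 200)
weights = 6.0 * heights - 250.0 + rng.normal(0, 10, size=heights.shape)

ws = np.linspace(0, 10, 41)       # candidate slopes, in steps of 0.25
bs = np.linspace(-400, 100, 51)   # candidate intercepts, in steps of 10

# costs[i, j] is the MSE for slope ws[i] and intercept bs[j]
costs = np.array([[mean_squared_error(weights, line(heights, w, b))
                   for b in bs] for w in ws])

i, j = np.unravel_index(costs.argmin(), costs.shape)
best_w, best_b = ws[i], bs[j]

# Best cost reachable when the slope is stuck at w=2, as in the plot above
best_cost_fixed_slope = costs[np.isclose(ws, 2.0)].min()
print(best_w, best_b, costs[i, j], best_cost_fixed_slope)
```

Letting the slope vary reaches a lower cost than any intercept can with the slope fixed at 2. A grid search like this gets expensive quickly as parameters are added, which is one reason to let an optimizer do the search.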

## Linear Regression with Tensorflow and Keras

Linear regression lets us carry out this process automatically, and that's what we're going to do next.

Let's do it with Keras!

Many packages implement linear regression, for example sklearn and scipy. We will do it in Keras because I want you to start getting familiar with the Keras API.

We will go through it in greater detail when we cover neural networks, but it's useful to start getting familiar with it now.

In [0]:
# The first thing we import is the type of model.
# It's called Sequential because we add elements to the model in a sequence.
# To build a linear regression we only need the Dense type of layer. The last thing we import is a couple of optimizers.
# These are the things that change our weights and biases, looking for the minimum cost.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, SGD
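To get a feel for what an optimizer does, here is a minimal sketch of plain gradient descent in numpy. This is not Adam and not how Keras implements SGD internally; it is the simplest version of the idea, run on a toy dataset of our own with a hand-picked learning rate.

```python
import numpy as np

# Toy data on a small scale (assumed for illustration): y = 3x + 1 exactly
x = np.linspace(0, 2, 50)
y = 3.0 * x + 1.0

w, b = 0.0, 0.0         # start from the zero line, as we did above
learning_rate = 0.1     # hypothetical value that works for this toy data

for step in range(2000):
    y_hat = x * w + b
    error = y_hat - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * (error * x).mean()
    grad_b = 2.0 * error.mean()
    # Move each parameter a small step against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_cost = ((x * w + b - y) ** 2).mean()
print(w, b, final_cost)
```

The loop ends close to w=3 and b=1, the line that generated the data. On the raw heights (values between 55 and 80) this naive version would need a much smaller learning rate, which is part of why adaptive optimizers such as Adam are convenient.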

In [0]:
# Define as Sequential
model = Sequential()

In [0]:
# Now add a "dense" layer to the model.
# The first parameter, 1, is the number of units, i.e. how many output values the model will have.
# Since this is a linear regression that takes one value x as input and gives one value y_hat as output,
# we only need one output value.

# The input shape is one number (x)

model.add(Dense(1, input_shape=(1,)))

In [0]:
# As you can see there is only one layer. It is a dense layer (named, for example, dense_1), its output shape is one number,
# and it has two parameters: the single weight and the bias.
# Notice that the output shape is given as the tuple (None, 1). The reason for this is that the model can accept multiple points at once.
# So instead of passing a single value for x we could ask for predictions for many values of x in one single call.

model.summary()
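The two parameters and the (None, 1) output shape can be reproduced by hand. The sketch below mimics what a one-unit dense layer without activation computes, using made-up values for the weight and bias (a real layer starts from its own initialization).

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
W_demo = np.array([[2.0]])   # weight matrix, shape (1, 1): one input, one unit
B_demo = np.array([0.5])     # bias, shape (1,)

# A batch of 4 heights, shape (4, 1); the 4 plays the role of "None"
X_batch = np.array([[60.0], [65.0], [70.0], [75.0]])

# With no activation, a dense layer computes X @ W + B
out = X_batch @ W_demo + B_demo
print(out.shape)   # (4, 1)
print(out.ravel())

n_params = W_demo.size + B_demo.size
print(n_params)    # 2
```

The batch size is left open (None) because the same weight and bias are applied to however many rows we pass in.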

In [0]:
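# Use the Adam optimizer with a learning rate of 0.8 and the mean squared error as the cost
# (recent Keras releases name this argument learning_rate instead of lr)
model.compile(Adam(lr=0.8), 'mean_squared_error')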

In [0]:
model.fit(X, y_true, epochs=40)

In [0]:
y_pred = model.predict(X)
y_pred

In [0]:
df.plot(kind='scatter',
        x='Height',
        y='Weight',
        title='Weight and Height in adults')
plt.plot(X, y_pred, color='red')

In [0]:
W, B = model.get_weights()

In [0]:
W

In [0]:
B
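One way to sanity-check the learned weight and bias is to compare them with the closed-form least-squares line, which np.polyfit computes directly. The sketch below runs on synthetic data (a made-up line plus noise), not on the CSV; with the real data you would pass X.ravel() and y_true instead.

```python
import numpy as np

# Synthetic stand-in data (assumed): a known line plus noise
rng = np.random.default_rng(42)
heights = rng.uniform(55, 80, size=500)
weights = 6.0 * heights - 250.0 + rng.normal(0, 10, size=500)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(heights, weights, deg=1)
print(slope, intercept)
```

If the values Keras reports in W and B are far from the least-squares slope and intercept, training probably needs more epochs or a different learning rate.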

## Evaluating Model Performance
- We also need to establish a score to compare different models.
- Unfortunately we cannot use the cost itself as a score because its value depends on the scale used to measure features and labels.
- Let's start by defining a better score that is commonly used for regression: the R-squared score.

In [0]:
# The R-squared score compares the sum of the squares of the residuals in our model with the sum of the squares
# of the residuals in a baseline model that always predicts the average of the labels (here, the average weight):
# R2 = 1 - (sum of squared residuals of the model) / (sum of squared residuals of the baseline)
# If the model is really good, its sum of squares will be small compared to the baseline's,
# and the fraction on the right will tend to zero.
# R2 close to one = good score
# R2 further below one = increasingly worse score
# R2 at zero = no better than always predicting the average; below zero = worse than that baseline

from sklearn.metrics import r2_score

In [0]:
print("The R2 score is {:0.3f}".format(r2_score(y_true, y_pred)))
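To see where that number comes from, here is a hand-rolled version of the score in numpy. The function name r2_manual and the four demo values are ours, and this is a sketch of the formula rather than sklearn's implementation.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()          # residuals of the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # residuals of the mean baseline
    return 1.0 - ss_res / ss_tot

y_demo = np.array([100.0, 150.0, 200.0, 250.0])

perfect = r2_manual(y_demo, y_demo)                       # predictions equal the labels
baseline = r2_manual(y_demo, np.full(4, y_demo.mean()))   # always predict the mean
worse = r2_manual(y_demo, np.zeros(4))                    # the all-zeros line
print(perfect, baseline, worse)   # 1.0, 0.0 and a negative number
```

Unlike the cost, these values do not change if we measure weight in kilograms instead of pounds, which is what makes the score comparable across models.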

### Train Test Split
- We split our data into training and test set to check how well our model is able to generalize.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y_true,
                                                    test_size=0.2)

In [0]:
len(X_train)

In [0]:
len(X_test)
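Under the hood, a random split amounts to shuffling the row indices and cutting them in two. The helper below, simple_split, is a hypothetical simplified stand-in written for this illustration; it is not sklearn's implementation, and the fixed seed is only there to make the shuffle repeatable.

```python
import numpy as np

def simple_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Ten made-up rows where each label is twice its feature
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = X_demo.ravel() * 2.0

X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo)
print(len(X_tr), len(X_te))   # 8 2
```

The same indices are applied to X and y, so every feature row stays paired with its own label.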

In [0]:
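# Reset the weight and the bias to zero so that training starts again from the zero line,
# this time using only the training set
W[0, 0] = 0.0
B[0] = 0.0
model.set_weights((W, B))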

In [0]:
model.fit(X_train, y_train, epochs=50, verbose=0)

In [0]:
y_train_pred = model.predict(X_train).ravel()
y_test_pred = model.predict(X_test).ravel()

In [0]:
from sklearn.metrics import mean_squared_error as mse

In [0]:
# Values that are close together are great because it means that our model generalizes from the training set to the test set.
print("The Mean Squared Error on the Train set is:\t{:0.1f}".format(mse(y_train, y_train_pred)))
print("The Mean Squared Error on the Test set is:\t{:0.1f}".format(mse(y_test, y_test_pred)))

In [0]:
# What matters here is not only how high your R-squared is (remember, the closer to one the better).
# What is most important is that the scores for train and test are close: this means your model is generalizing.

from sklearn.metrics import r2_score
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))

# Machine Learning Exercises

## Exercise 1

You've just been hired at a real estate investment firm and they would like you to build a model for pricing houses. You are given a dataset that contains data for house prices and a few features like number of bedrooms, size in square feet and age of the house. Let's see if you can build a model that is able to predict the price. In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:

1. Load the dataset ./e-ml-workshop/data/housing-data.csv
2. Plot the histograms for each feature
3. Create 2 variables called X and y: X shall be a matrix with 3 columns (sqft,bdrms,age) and y shall be a vector with 1 column (price)
4. Create a linear regression model in Keras with the appropriate number of inputs and output
5. Split the data into train and test with a 20% test size
6. Train the model on the training set and check its accuracy on training and test set
7. How's your model doing? Is the loss growing smaller?
8. Try to improve your model with these experiments:
    - normalize the input features with one of the rescaling techniques mentioned above
    - use a different value for the learning rate of your model
    - use a different optimizer
9. Once you're satisfied with training, check the R2 score on the test set
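As a hint for the normalization experiment, here is a sketch of two common rescaling techniques, standardization and min-max scaling. We are assuming these are the kind of techniques meant; the four rows below are made-up values for (sqft, bdrms, age), not rows from housing-data.csv.

```python
import numpy as np

# Hypothetical rows of (sqft, bdrms, age)
X_demo = np.array([[2100.0, 3.0, 10.0],
                   [1600.0, 3.0, 25.0],
                   [2400.0, 4.0,  5.0],
                   [1400.0, 2.0, 40.0]])

# Standardization: subtract each column's mean and divide by its standard deviation
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: map each column onto the range [0, 1]
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Either way, all three columns end up on a comparable scale, so no single feature dominates the gradient. When you work with a train/test split, compute the means, standard deviations, minima and maxima on the training set and reuse them for the test set.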

# You've just learned the basic ingredients of a neural network.

- You've learned about the hypothesis.
- You've learned about cost.
- You've learned about optimization.