## Linear Regression

Linear regression is a simple approach for supervised learning. We are going to use
linear regression to predict the price of a house given its area.


In [53]:
import numpy as np
import pandas as pd
from ipywidgets import interact
import matplotlib.pyplot as plt

%matplotlib widget

df = pd.read_csv('houses.csv')
df

| Unnamed: 0 | area | price |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |


The above dataset will be our training set. **area** is the input variable. Input variables are also referred to as
features, predictors or independent variables. **price** is our output or target
variable.

The essence of machine learning is to find a model that **maps** input variables to target variables.
Let $X$ denote the input variable space and $Y$ denote the output variable space, then the point of
machine learning is to find a function $h: X \to Y$ such that $h(x)$ is a good predictor
for the corresponding value of $y$. This function $h$ is called a **hypothesis**.


Plotting our training set

In [54]:
plt.scatter(df.area, df.price)
plt.title('House Prices')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


In this case, we can approximate $h$ as a linear function which means that it is a
straight line when plotted.

For simplicity, let's define $h(x)$ as $h(x) = \theta x$, where $x$ is the house area.

$\theta$ parameterizes $h$. The task now is to find $\theta$ such that we get a prediction value
closest to the real price value.

We can express the hypothesis function in Python as below

In [55]:
def hypothesis(theta, x):
    return theta * x


## Cost Function
The **cost function** measures how far our predictions are from the real values. It is defined as

$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h(x_i)-y_i)^2$ where $m$ is the number of entries in our training set and $h(x_i)$ is the prediction for the $i$-th entry.

Writing the cost function in Python

In [56]:
def cost_function(theta, training_data):
    m = len(training_data.index)  # number of training examples
    total = 0
    for _, training_set in training_data.iterrows():
        predicted = hypothesis(theta, training_set['area'])
        # accumulate the squared error of every row
        total += (predicted - training_set['price']) ** 2
    return total / (2 * m)


In this case, a $\theta$ value of 1 means that our prediction is exactly the same as the real
value, so the cost (half the mean squared difference between our predicted and real values) should be 0.
In the snippet below, we see that the further we move from 1, the less accurate our prediction becomes.

In [57]:
print('theta=1: result=', cost_function(1, df))
print('theta=1.5: result=', cost_function(1.5, df))
print('theta=2: result=', cost_function(2, df))

theta=1: result= 0.0
theta=1.5: result= 0.75
theta=2: result= 3.0
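
The same formula can also be evaluated without an explicit loop. The following is a minimal sketch, assuming the training data is the five rows shown earlier (inlined here as NumPy arrays rather than read from `houses.csv`); `cost_vectorized` is a name introduced only for this illustration.

```python
import numpy as np

# Stand-in for the training set: the five (area, price) rows shown earlier.
areas = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
prices = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

def cost_vectorized(theta, x, y):
    """J(theta) = 1/(2m) * sum((theta * x_i - y_i)^2), using array operations."""
    m = len(x)
    errors = theta * x - y          # prediction error for every row at once
    return np.sum(errors ** 2) / (2 * m)

for theta in (1.0, 1.5, 2.0):
    print(theta, cost_vectorized(theta, areas, prices))
```

Each printed value is the sum of all five squared errors divided by $2m$, so it only reaches 0 at $\theta = 1$.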


## Visualising Hypothesis and Cost Function

In [67]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(5, 5))

def plot_prediction(theta):
    axes[0].set_ylim(0, 5.0)

    axes[0].set_title('House Prices')
    axes[0].set_xlabel('Area(feet$^2$)')
    axes[0].set_ylabel('Price')

    axes[0].scatter(df.area, df.price)
    # the prediction line is h(x) = theta * x, evaluated on the areas
    axes[0].plot(df.area, hypothesis(theta, df.area), color='red', label='prediction')
    axes[0].legend(loc='upper left')

def plot_cost(theta):
    x2 = np.linspace(-0.5, 2.5)
    y2 = [cost_function(x, training_data=df) for x in x2]

    axes[1].set_ylim(0, 3.5)
    axes[1].set_xlim(-0.5, 2.5)

    axes[1].set_title('Cost Function')
    axes[1].set_xlabel(r'$\theta$')
    axes[1].set_ylabel(r'$J(\theta)$')

    axes[1].scatter(theta, cost_function(theta, df), label='cost')
    axes[1].legend(loc='upper left')
    axes[1].plot(x2, y2)
    
plot_prediction(theta=1.0)
plot_cost(theta=1.0)

def update(theta = 1.0):
    axes[0].clear()
    axes[1].clear()

    plot_prediction(theta)
    plot_cost(theta)

 
interact(update)


In the diagram above, we can see how the accuracy of our prediction changes as $\theta$ changes, and how the cost $J(\theta)$ changes with it. Our prediction is equal to the real values when $\theta = 1$, which is where the cost is 0.

## Parameter Learning
We have seen by inspection how the parameter $\theta$ affects our prediction. We also observed that the closer $\theta$ is to the cost function's global minimum, the better our prediction. What we are going to do next is implement **gradient descent**, an algorithm that learns the best value for $\theta$.


#### Gradient Descent Intuition
1. Start with a guess for $\theta$, say $\theta = 2$
2. Find the gradient of the cost function at $\theta = 2$
3. Descend the cost function by subtracting $\alpha \times gradient$ from $\theta$, where $\alpha$ is called the learning rate
4. Repeat until the solution converges.

#### Gradient Descent
Gradient descent helps us find the value of $\theta$ that will give us the best prediction. It works by starting with a guess value, say $\theta = 2$, which we plot on the cost curve.




In [69]:
def cost(theta):
    plt.figure()

    x2 = np.linspace(-0.5, 2.5)
    y2 = [cost_function(x, training_data=df) for x in x2]

    plt.ylim(0, 7)  # leave headroom so points far from the minimum stay visible
    plt.xlim(-0.5, 2.5)

    plt.title('Cost Function')
    plt.xlabel(r'$\theta$')
    plt.ylabel(r'$J(\theta)$')

    plt.scatter(theta, cost_function(theta, df), label='cost')
    plt.plot(x2, y2)
    plt.legend(loc='upper left')
    plt.show()

cost(theta=2)


From the current $\theta$ position we need to go down towards the global minimum. We can do that by
1. Getting the gradient at the current $\theta$ value
2. Subtracting the gradient from $\theta$ to get the new $\theta$ value

Let's indulge in some math

The cost function is defined as

$J(\theta) = \frac{1}{2m}\sum(h(x_i) - y_i)^2$

and our prediction function is defined as

$h(x_i) = \theta x_i$

Note that the cost function uses the prediction function, so we can replace $h(x_i)$ in the cost function with the expression $\theta x_i$. Rewriting the cost function:

$J(\theta) = \frac{1}{2m}\sum(\theta x_i - y_i)^2$

At $\theta$ = 2,

$J(2) = \frac{1}{2m}\sum(2x_i - y_i)^2$, where $m$ is the number of entries in our training data and $i$ indexes the $i$-th training example.

For the purpose of demonstration, let's assume we have only one training example, so $m = 1$, we won't need to sum anything, and we can do away with $i$. This simplifies our equation to

$J(2) = \frac{1}{2}(2x - y)^2$

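We can find the gradient by applying the **chain rule** to the single-example cost $J(\theta) = \frac{1}{2}(\theta x - y)^2$

$\frac{d}{d\theta}J(\theta) = \frac{d}{d\theta}\frac{1}{2}(\theta x - y)^2$

$= 2 \times \frac{1}{2}(\theta x - y)x$

$= (\theta x - y)x$

which at $\theta = 2$ equals $(2x - y)x$, where $x$ is the house area and $y$ the house price.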

If our training example is (1.5, 1.5), that is, a house with an area of 1.5 and a price of 1.5, then our new $\theta$ will be

$\theta = 2 - (2 \times 1.5 - 1.5) \times 1.5$

$\theta = 2 - 2.25 = -0.25$


Plotting our $\theta$



In [70]:
cost(theta=-0.25)


What the hell just happened? We overshot our destination, the global minimum. How do we fix this? We'll introduce a learning rate $\alpha$ that we multiply the gradient by, so that we take small steps towards the minimum. If you use a big learning rate you might never get to the global minimum; if you choose a small learning rate you'll spend 40 years in the wilderness before you reach the promised land.

Rewriting our equation with a learning rate

$\theta = 2 - \alpha (2 \times 1.5 - 1.5)1.5$

Let's set $\alpha$ to a reasonable value, say 0.25

$\alpha = 0.25$

$\theta = 2 - 0.25 \times (2 \times 1.5 - 1.5) \times 1.5$

Then our new $\theta$ will be

$\theta = 2 - 0.5625 = 1.4375$

Plotting our $\theta$



In [72]:
cost(theta=1.4375)


Nice, we are moving in the right direction and closer to the global minimum.
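
The two single-example updates worked through above can be reproduced in a few lines. This is only a sketch for checking the arithmetic; `single_example_step` is a hypothetical helper name, and it uses the one-example gradient $(\theta x - y)x$ derived earlier.

```python
def single_example_step(theta, x, y, learning_rate=1.0):
    """One gradient descent update using a single (x, y) training example."""
    gradient = (theta * x - y) * x  # derivative of 1/2 * (theta * x - y)^2
    return theta - learning_rate * gradient

# Area 1.5, price 1.5, starting guess theta = 2
print(single_example_step(2, 1.5, 1.5))                      # -0.25: overshoots the minimum at 1
print(single_example_step(2, 1.5, 1.5, learning_rate=0.25))  # 1.4375: a smaller step towards 1
```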

#### Generalizing Gradient Descent

What we saw was a particular solution for $\theta = 2$. To be able to apply this for any $\theta$ we need to generalize the gradient descent algorithm. Those who are mathematically inclined can go through the equations below.

$\theta = \theta - \alpha\frac{d}{d\theta} J(\theta)$

Since $J(\theta)=\frac{1}{2m}\sum(h(x_i)-y_i)^2$

Then $\theta = \theta - \alpha\frac{d}{d\theta} \frac{1}{2m}\sum(h(x_i)-y_i)^2$

Using the chain rule

$\frac{d}{d\theta} \frac{1}{2m}\sum(h(x_i)-y_i)^2 = \frac{2}{2m}\sum(h(x_i)-y_i) \times \frac{d}{d\theta} h(x_i)$

And since $h(x_i) = \theta x_i$, its derivative with respect to $\theta$ will be $x_i$

This reduces to

$\theta = \theta - \alpha \frac{1}{m}\sum(h(x_i) - y_i)x_i$

We can express the above equation in Python

In [78]:
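def gradient_descent(theta, learning_rate=0.15):
    """Perform one gradient descent update of theta using every row of df."""
    total = 0
    m = len(df.index)  # number of training examples

    for i, training_set in df.iterrows():
        prediction = hypothesis(theta, training_set['area'])
        error = prediction - training_set['price']
        total += error * training_set['area']  # (h(x_i) - y_i) * x_i

    # theta := theta - alpha * (1/m) * sum((h(x_i) - y_i) * x_i)
    new_theta = theta - ((learning_rate / m) * total)
    return new_theta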
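
One way to gain confidence in the derivative used in the update is to compare it with a finite-difference estimate of the slope of $J(\theta)$. The sketch below assumes the five-row training set shown earlier, inlined as plain lists; `cost`, `analytic_gradient` and `numerical_gradient` are names made up for this check and are not part of the notebook's code.

```python
# Stand-in for the training set: the five (area, price) rows shown earlier.
areas = [0.0, 1.0, 2.0, 3.0, 4.0]
prices = [0.0, 1.0, 2.0, 3.0, 4.0]

def cost(theta):
    """J(theta) = 1/(2m) * sum((theta * x_i - y_i)^2)."""
    m = len(areas)
    return sum((theta * x - y) ** 2 for x, y in zip(areas, prices)) / (2 * m)

def analytic_gradient(theta):
    """dJ/dtheta = 1/m * sum((theta * x_i - y_i) * x_i), from the derivation."""
    m = len(areas)
    return sum((theta * x - y) * x for x, y in zip(areas, prices)) / m

def numerical_gradient(theta, eps=1e-6):
    """Central finite-difference estimate of dJ/dtheta."""
    return (cost(theta + eps) - cost(theta - eps)) / (2 * eps)

for theta in (0.0, 1.0, 2.0, 3.0):
    print(theta, analytic_gradient(theta), numerical_gradient(theta))
```

The two columns should agree to several decimal places, and both are zero at $\theta = 1$, where the cost curve is flat.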


#### Taking gradient descent for a spin


In [81]:


def learn():
    theta = 3

    # apply ten gradient descent updates, printing theta after each one
    for _ in range(10):
        theta = gradient_descent(theta)
        print(theta)
    return theta

learn()

1.2000000000000002
1.02
1.002
1.0002
1.00002
1.000002
1.0000002000000001
1.00000002
1.000000002
1.0000000002


1.0000000002

As we can see, $\theta$ gets closer and closer to the desired value 1.
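
The run above uses a fixed number of iterations. A variant that instead repeats until the solution converges, and that makes the effect of the learning rate visible, could look like the sketch below. It assumes the five-row training set shown earlier (inlined as lists); `gradient_step`, `fit`, `tolerance` and `max_iterations` are hypothetical names and values chosen for illustration.

```python
# Stand-in for the training set: the five (area, price) rows shown earlier.
areas = [0.0, 1.0, 2.0, 3.0, 4.0]
prices = [0.0, 1.0, 2.0, 3.0, 4.0]

def gradient_step(theta, learning_rate):
    """theta := theta - alpha * (1/m) * sum((theta * x_i - y_i) * x_i)."""
    m = len(areas)
    gradient = sum((theta * x - y) * x for x, y in zip(areas, prices)) / m
    return theta - learning_rate * gradient

def fit(theta=3.0, learning_rate=0.15, tolerance=1e-9, max_iterations=1000):
    """Repeat updates until theta changes by less than `tolerance`.

    Returns the final theta and the number of updates performed.
    """
    for iteration in range(1, max_iterations + 1):
        new_theta = gradient_step(theta, learning_rate)
        if abs(new_theta - theta) < tolerance:
            return new_theta, iteration
        theta = new_theta
    return theta, max_iterations  # did not converge within the budget

for rate in (0.01, 0.15, 0.4):
    theta, iterations = fit(learning_rate=rate)
    print(f'learning_rate={rate}: theta={theta}, iterations={iterations}')
```

On this data each update multiplies the distance $\theta - 1$ by $(1 - 6\alpha)$, so a small rate such as 0.01 converges but needs far more iterations than 0.15, while any rate above $1/3$ (such as 0.4) moves $\theta$ further from 1 on every step.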
