# Linear Regression with a Single Feature

In this notebook, we implement [linear regression](https://en.wikipedia.org/wiki/Linear_regression) from scratch and compare our predictions with those of the [Scikit-Learn](https://scikit-learn.org/stable/) module.

A *linear regression* can be applied to datasets with a single feature or with multiple features. For now, let's focus on the implementation for a single feature.

First, let's import useful packages for the implementation:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn') # plotting style that works well for light or dark themes (named 'seaborn-v0_8' in Matplotlib >= 3.6)

For this notebook, we are going to use an open source dataset of features of houses in Boston to predict their prices.

So, let's read the CSV file into a pandas DataFrame.

In [None]:
data = pd.read_csv('../data/BostonHousing.csv')
data.head()

In [None]:
data.columns

Wow, the data has a lot of columns...

Let's see the definition of each column:

| Column Name | Description |
| :-----: |    :---:    |
|   id    | Identification number of the house |
|   date    | Date the information was acquired |
|   price    | Price of the house |
|   bedrooms    | Number of bedrooms in the house |
|   bathrooms    | Number of bathrooms in the house |
|   sqft_living    | Square footage of the living area |
|   sqft_lot    | Square footage of the lot area |
|   floors    | Number of floors in the house |
|   waterfront    | Whether the house has a waterfront |
|   view    | Whether the house has a view |
|   condition    | Condition of the house |
|   grade    | Grade given to the house |
|   sqft_above    | Square footage of the house above ground level (excluding the basement) |
|   sqft_basement    | Square footage of the basement |
|   yr_built    | Year the house was built |
|   yr_renovated    | Year of the last renovation of the house |
|   zipcode    | Zipcode |
|   lat    | Latitude of the house |
|   long    | Longitude of the house |
|   sqft_living15    | Square footage of the living area of the 15 nearest houses |
|   sqft_lot15    | Square footage of the lot area of the 15 nearest houses |

Let's take a look at the data:

In [None]:
data.info()

In [None]:
data.describe()

Let's plot a few of the features against the price:

In [None]:
features = list(data.columns[3:])

# sqft_living
data.plot(x = 'price', y = 'sqft_living', figsize = (20,5), kind = 'scatter')

In [None]:
# sqft_lot
data.plot(x = 'price', y = 'sqft_lot', figsize = (20,5), kind = 'scatter')

In [None]:
# waterfront
data.plot(x = 'price', y = 'waterfront', figsize = (20,5), kind = 'scatter')

In [None]:
# sqft_living15
data.plot(x = 'price', y = 'sqft_living15', figsize = (20,7), kind = 'scatter')

In [None]:
# sqft_lot15
data.plot(x = 'price', y = 'sqft_lot15', figsize = (20,7), kind = 'scatter')

In [None]:
# bedrooms
data.plot(x = 'price', y = 'bedrooms', figsize = (20,7), kind = 'scatter')

In [None]:
# bathrooms
data.plot(x = 'price', y = 'bathrooms', figsize = (20,7), kind = 'scatter')

In [None]:
# condition
data.plot(x = 'price', y = 'condition', figsize = (20,7), kind = 'scatter')

In [None]:
# grade
data.plot(x = 'price', y = 'grade', figsize = (20,7), kind = 'scatter')

### Plotting using a for loop

In [None]:
target = 'price' # target column

fig, ax = plt.subplots(len(features), 1, figsize = (20, len(features)*6))
for i, f in enumerate(features):
    ax[i].scatter(data[target].values, data[f].values, s = 15)
    ax[i].set_xlabel(target, fontsize = 14)
    ax[i].set_ylabel(f, fontsize = 14)

Or you can draw all the subplots in a single line by making **DataFrame.plot(kind = 'line')** look like a scatter plot. Check the *linestyle* and *marker* options among its parameters.

Now, follow these hints, *Google* the options, and create the subplots in a single line of code:

In [None]:
# YOUR CODE HERE #


Let's check the price of the houses by location, since the latitude and longitude features appear to matter:

In [None]:
plt.figure(figsize = (30,15))
data.sort_values(by = 'price', inplace = True) # draw the most expensive houses last, on top
plt.scatter(x = data['long'], y = data['lat'], s = 1500*data[target]/max(data[target]), alpha = 0.4, edgecolor = 'w', c = data[target], cmap = 'jet')
plt.xlabel('Longitude', fontsize = 16) # the x axis holds the 'long' column
plt.ylabel('Latitude', fontsize = 16) # the y axis holds the 'lat' column
cbar = plt.colorbar(orientation = 'vertical')
cbar.set_label('Price', fontsize = 20)

Nice! Judging by the plots, some features appear to be more strongly correlated with the price than others. Now, let's see how to make predictions using linear regression.
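
The visual impression can also be put into numbers with the Pearson correlation. The snippet below is only a sketch on synthetic data: the DataFrame `toy` and its `random_feature` column are made up for illustration and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical, not the real housing dataset)
rng = np.random.default_rng(0)
size = rng.uniform(500, 5000, 200)
toy = pd.DataFrame({
    'sqft_living': size,
    'random_feature': rng.uniform(0, 1, 200),          # unrelated to the price
    'price': 300 * size + rng.normal(0, 100000, 200),  # roughly linear in size
})

# Pearson correlation of each feature with the target column
corr_with_price = toy.corr()['price'].drop('price')

# Show the features with the strongest (absolute) correlation first
print(corr_with_price.abs().sort_values(ascending=False))
```

On the real data, the same call on `data` would presumably need the non-numeric columns (such as `date`) to be left out first.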

## Train/Testing Split

To validate a machine learning model, it is very common to split the data into a training and a validation (testing) set.

Let's do it. Let's randomly pick roughly 80% of the data for training and keep the remaining 20% or so for testing.

In [None]:
msk = np.random.rand(len(data)) < 0.8 # Boolean mask: each row is True with probability 0.8

train = data[msk]
test = data[~msk]

print(len(data), len(train), len(test), len(train) + len(test))
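
Because the mask is drawn at random, the split changes on every run, and it is only approximately 80/20 since each row is assigned independently. A minimal sketch of a reproducible variant, assuming a small synthetic DataFrame (`toy`) and an arbitrary seed of 42, neither of which comes from the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data (not the real dataset)
toy = pd.DataFrame({'sqft_living': np.arange(1000, 2000, 10),
                    'price': np.arange(100000, 200000, 1000)})

# A seeded generator makes the random mask repeatable across runs
rng = np.random.default_rng(42)       # the seed value is an arbitrary choice
toy_msk = rng.random(len(toy)) < 0.8  # boolean mask, True for about 80% of the rows

toy_train = toy[toy_msk]
toy_test = toy[~toy_msk]

# Every row lands in exactly one of the two sets
print(len(toy), len(toy_train), len(toy_test))
```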

# Linear Regression Method

We want to fit the line that best estimates the values in the training set. For that, we need to find 2 parameters: the slope and the intercept. The equation of the line is:

$$y_{i}^{'}(x_i) = w_0 + w_{1} x_i$$

Now, we want to find the parameters $w_0$ and $w_1$ that minimize the cost function, the residual sum of squares (RSS), i.e. the sum of the squared differences between the measured data $y_i$ and the predicted data $y_{i}^{'}$:

$$RSS(w_0,w_1) = \sum_{i=1}^{N}(y_i - y_{i}^{'})^2 = \sum_{i=1}^{N}(y_i - [w_0 + w_{1} x_i])^2$$

To minimize the cost function, we take its partial derivative with respect to each parameter ($w_0$ and $w_1$) and set it equal to zero. This leads to 2 simple formulas for $w_0$ and $w_1$:

$$w_0 = \frac{\sum_{i=1}^{N}y_i}{N} - w_{1}\frac{\sum_{i=1}^{N}x_i}{N}$$

and

$$w_1 = \frac{\sum_{i=1}^{N}y_{i}x_{i} - \frac{\sum_{i=1}^{N}y_{i}\sum_{i=1}^{N}x_{i}}{N}}{\sum_{i=1}^{N}x_{i}^2 - \frac{\sum_{i=1}^{N}x_{i}\sum_{i=1}^{N}x_{i}}{N}}$$
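
As a sketch of the intermediate step, setting the two partial derivatives of $RSS$ to zero gives the pair of equations below. Solving the first one for $w_0$ and substituting the result into the second one yields the two formulas above.

$$\frac{\partial RSS}{\partial w_0} = -2\sum_{i=1}^{N}(y_i - [w_0 + w_{1} x_i]) = 0$$

$$\frac{\partial RSS}{\partial w_1} = -2\sum_{i=1}^{N}x_i(y_i - [w_0 + w_{1} x_i]) = 0$$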

With the equations above, it is possible to compute the intercept ($w_0$) and the slope ($w_1$) that best predict the output $(y_{i}^{'})$ given the input $x_i$ and the measured data $y_i$ (for one feature only).
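
The slope can also be written in an equivalent centered form, $w_1 = \sum_{i}(x_i - \bar{x})(y_i - \bar{y}) / \sum_{i}(x_i - \bar{x})^2$, where $\bar{x}$ and $\bar{y}$ are the means of $x_i$ and $y_i$. This form is handy for checking results. Below is a small sketch on made-up numbers (the arrays `x_toy` and `y_toy` are illustrative, not taken from the dataset), cross-checked against `np.polyfit`:

```python
import numpy as np

# Made-up points that lie exactly on the line y = 1 + 2x
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = 1.0 + 2.0 * x_toy

# Centered form of the slope, then the intercept from the two means
x_mean, y_mean = x_toy.mean(), y_toy.mean()
w1 = np.sum((x_toy - x_mean) * (y_toy - y_mean)) / np.sum((x_toy - x_mean) ** 2)
w0 = y_mean - w1 * x_mean

# np.polyfit returns the coefficients from the highest degree down: [slope, intercept]
slope_np, intercept_np = np.polyfit(x_toy, y_toy, 1)

print(w0, w1)
print(np.allclose([w0, w1], [intercept_np, slope_np]))
```

Since the points sit exactly on a line, both approaches should recover an intercept of 1 and a slope of 2, up to floating-point rounding.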

Now, let's create the function that takes the input feature $x_i$ and the measured data $y_i$ of the training set, and returns the intercept $w_0$ and the slope $w_1$.

Click [here](https://www.w3schools.com/python/python_functions.asp) to learn how to define a function in python.

Note: you need to complete the code in some places.

In [None]:
def linear_regression_single(input_feature, measured_data):
    # First, let's compute the sums that appear in the equations for the parameters
    n = len(input_feature)
    Isum = input_feature.sum()
    Msum = measured_data.sum()
    IMsum = (input_feature*measured_data).sum() # sum of the products x_i*y_i
    IIsum = (input_feature*input_feature).sum() # sum of the squares x_i*x_i

    # We need to compute the slope first: numerator and denominator of the w_1 equation
    num = IMsum-(1./n*(Isum*Msum))
    den = IIsum-(1./n*(Isum*Isum))
    
    # YOUR CODE HERE #
    slope = 
    
    # Now that we have the slope, we can compute the intercept
    intercept = (1./len(input_feature))*Msum-slope*Isum*(1./len(input_feature))
    
    # Return the parameters
    return intercept, slope

Let's test our function:

In [None]:
intercept_1, slope_1 = linear_regression_single(train['sqft_living'].values, train['price'].values)

print("Intercept: " + str(intercept_1))
print("Slope: " + str(slope_1))

The next step is to compute the estimates $y_{i}^{'}$ using the linear equation:

$$y_{i}^{'}(x_i) = w_0 + w_{1} x_i$$

Let's create a function that takes the input feature $x_i$ and the measured data $y_i$. It needs to estimate the intercept $w_0$ and the slope $w_1$, and then compute the predictions $y_{i}^{'}$ using the linear equation above.

In [None]:
def regression_predictions_single(input_feature, measured_data):
    # First, we need to estimate the intercept and the slope
    intercept, slope = linear_regression_single(input_feature, measured_data)
    
    # Now, compute the predictions
    # YOUR CODE HERE #
    predicted_values = 
        
    # Return outputs
    return predicted_values, intercept, slope

Let's test our function:

In [None]:
predictions, intercept, slope = regression_predictions_single(train['sqft_living'].values, train['price'].values)

plt.figure(figsize = (20,7))
plt.plot(train['sqft_living'], train['price']/1000000,'.', color = 'blue', label = 'Training data')
plt.plot(train['sqft_living'], predictions/1000000,'-', color = 'orange', label = 'Predictions')
plt.ylabel('Price (Millions USD$)')
plt.xlabel('House Size (Square Feet)')
xl = plt.xlim()
yl = plt.ylim()
plt.legend()
plt.show()

Above, we fitted the linear regression on the training data and plotted it. Below, use the slope and intercept estimated from the training data to make predictions for the testing data:

In [None]:
# YOUR CODE HERE #
predictions_test = 

plt.figure(figsize = (20,7))
plt.plot(test['sqft_living'], test['price']/1000000,'.', color = 'blue', label = 'Testing data')
plt.plot(test['sqft_living'], predictions_test/1000000,'-', color = 'orange', label = 'Predictions')
plt.ylabel('Price (Millions USD$)')
plt.xlabel('House Size (Square Feet)')
plt.xlim(xl)
plt.ylim(yl)
plt.legend()
plt.show()

Now, use the calculated slope and intercept to predict the price for a single size value:

In [None]:
size_to_predict = 15000 # sqft_living to be predicted

# YOUR CODE HERE #
prediction = 

print('Size (sqft): %.i' % size_to_predict)
print('Price (USD$): %.i' % prediction)

### Congratulations!!!!

You just created your own linear regression pipeline for a single variable!

Now, let's compare it with the [linear regression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from the [Scikit-Learn](https://scikit-learn.org/stable/index.html) module.

# Linear regression with Scikit-Learn

The [Scikit-Learn](https://scikit-learn.org/stable/index.html) module is usually installed automatically with **Anaconda**. If you don't have it, create a new cell and run:

```python
! pip install -U scikit-learn
```

For the linear regression, we will use the class [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from the [sklearn.linear_model](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) submodule.

In [None]:
from sklearn.linear_model import LinearRegression

First, create a model with the desired parameters. Here, let's just use the default ones.

In [None]:
reg = LinearRegression()

Now, fit the model to the training set:

In [None]:
reg.fit(train['sqft_living'].values.reshape(-1, 1), train[target])
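
The `reshape(-1, 1)` call is there because Scikit-Learn estimators expect the inputs as a 2-D array of shape `(n_samples, n_features)`, while a single DataFrame column gives a 1-D array. A minimal sketch with made-up values (the array `sizes` is illustrative only):

```python
import numpy as np

sizes = np.array([1180, 2570, 770])  # 1-D array, shape (3,)
X = sizes.reshape(-1, 1)             # 2-D column, shape (3, 1): 3 samples, 1 feature

print(sizes.shape, X.shape)
```

The `-1` asks NumPy to infer that dimension from the array length, so the same call works for any number of rows.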

Compute the predictions:

In [None]:
reg_pred = reg.predict(test['sqft_living'].values.reshape(-1, 1))

Let's compare the curves:

In [None]:
plt.figure(figsize = (20,7))
plt.plot(test['sqft_living'], test['price']/1000000,'.', color = 'blue', label = 'Testing data')
plt.plot(test['sqft_living'], predictions_test/1000000,'-', color = 'orange', label = 'Our Predictions')
plt.plot(test['sqft_living'], reg_pred/1000000,'-.', color = 'green', alpha = 0.5, label = 'Sklearn Predictions')
plt.ylabel('Price (Millions USD$)')
plt.xlabel('House Size (Square Feet)')
plt.xlim(xl)
plt.ylim(yl)
plt.legend()
plt.show()

Predicting one single value:

In [None]:
sing_pred = reg.predict(np.array(size_to_predict).reshape(1, -1))

Compare to what we had before:

In [None]:
print('Size (sqft): %.i' % size_to_predict)
print('Price (USD$): %.i (our prediction)' % prediction)
print('Price (USD$): %.i (sklearn prediction)' % sing_pred[0]) # reg.predict returns an array, so take its single element
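
Beyond comparing curves and single values by eye, the fitted parameters themselves can be compared numerically. The following is a sketch on synthetic data (the arrays `x_syn` and `y_syn` and the numbers used to generate them are made up), applying the closed-form formulas from earlier in the notebook next to `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): price is roughly linear in size, plus noise
rng = np.random.default_rng(1)
x_syn = rng.uniform(500, 5000, 100)
y_syn = 50000 + 280 * x_syn + rng.normal(0, 20000, 100)

# Closed-form least-squares fit built from the sums in the formulas
n = len(x_syn)
num = np.sum(x_syn * y_syn) - x_syn.sum() * y_syn.sum() / n
den = np.sum(x_syn ** 2) - x_syn.sum() ** 2 / n
w1 = num / den
w0 = y_syn.mean() - w1 * x_syn.mean()

# Same fit with Scikit-Learn; coef_ holds one slope per feature
sk = LinearRegression().fit(x_syn.reshape(-1, 1), y_syn)

print(np.isclose(w0, sk.intercept_), np.isclose(w1, sk.coef_[0]))
```

Both approaches solve the same least-squares problem, so the intercepts and slopes are expected to agree up to floating-point rounding.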