# Linear Regression - Housing Prices

For those of you unfamiliar with Jupyter notebooks, here is a concise explanation.
Each "cell" of the notebook (a rectangular area) contains either explanatory text ("Markdown," like this cell) or Python commands ("Code"). Clicking on a cell selects it, and its border is highlighted. When you press "__Run__" in the menu, Jupyter executes the contents of the selected cell and moves on to the next one. Read the command in the next cell and press __Run__ again; the result of the command (if any) appears below the cell. Proceed through the notebook in this fashion, returning to previous cells whenever necessary (either to re-read an explanation or command, or to change parameters). Please note that if you want to rerun the entire notebook, you have to start again from the top cell.

Let's start by importing the required libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Linear Regression is usually the first machine learning algorithm that a data scientist comes across. It is a simple model, but nonetheless worth understanding, as it lays the foundation for other machine learning algorithms. In statistics, Linear Regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables.
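In symbols, and assuming the conventional textbook notation (these symbols are not used elsewhere in this notebook), a linear regression model with $p$ independent variables can be sketched as:

$$
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon
$$

Here $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is the error term. With a single independent variable, this reduces to the equation of a straight line, $y = \beta_0 + \beta_1 x$.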

Next, we will load the housing data.

In [None]:
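# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston_dataset = load_boston()
print("Data loaded.")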

We print the description of the __boston_dataset__ (its `DESCR` attribute) to understand what it contains.

In [None]:
print(boston_dataset.DESCR)

The house price, given by the variable __MEDV__, is our target variable. The remaining variables are the features, based on which we will predict the value of a house.

We will now load the data into a dataframe. To get an idea of the generated dataframe, we then print the first 5 rows of the data.

In [None]:
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.head()

It's a good practice to explore your data before starting with the linear regression. Next, we will use some visualizations to understand the relationship of the target variable (house prices) with other features in the data.

Let’s first plot the distribution of the target variable __MEDV__.

In [None]:
# Histogram of the target variable, split into 30 bins
plt.hist(boston['MEDV'], bins=30)
plt.title('Distribution of house prices')
plt.xlabel('House price (*$1000)')
plt.ylabel('Frequency')
plt.show()

We can now observe that the house prices are roughly normally distributed, with a few outliers. We suspect that the number of rooms in a house, represented by __RM__, is strongly correlated with the price of the house. Visualizing this relationship in a scatterplot may confirm our suspicion.

In [None]:
plt.figure(figsize=(20, 5))

features = boston['RM']
target = boston['MEDV']

plt.scatter(features, target)
plt.title('House prices VS Number of rooms')
plt.ylabel('House price (*$1000)')
plt.xlabel('Number of rooms')
plt.show()
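The visual impression from a scatterplot can also be expressed as a number, for example with the Pearson correlation coefficient. The following is a minimal sketch on synthetic stand-in data rather than the Boston values; the names `rooms` and `prices` and the numbers used to generate them are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the Boston values):
# prices rise with the room count, plus random noise.
rng = np.random.default_rng(seed=0)
rooms = rng.uniform(4.0, 9.0, size=200)
prices = 9.0 * rooms - 34.0 + rng.normal(0.0, 5.0, size=200)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(rooms, prices)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```

A value close to 1 indicates a strong positive linear relationship. In this notebook, the same call could be made with `boston['RM']` and `boston['MEDV']` in place of the synthetic arrays.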

Looks like we were right! The house prices seem to increase as the number of rooms increases. Now we can build our Linear Regression model. First, we copy the relevant data (in our case the house prices __MEDV__ and the number of rooms __RM__) into separate variables.

In [None]:
# Double brackets keep X as a DataFrame (2-D), which scikit-learn expects for features
X = boston[['RM']]
Y = boston['MEDV']
print("Data selected.")

Then, we can set up a Linear Regression model and fit it to the data.

In [None]:
lin_model = LinearRegression()
lin_model.fit(X, Y)
print("Model fitted.")
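After fitting, scikit-learn's `LinearRegression` stores the learned parameters in its `coef_` and `intercept_` attributes. The sketch below illustrates this on synthetic data that lies exactly on a made-up line; the names `demo_model`, `rooms` and `prices`, as well as the slope and intercept values, are assumptions for illustration and not results from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption): points exactly on y = 9x - 34.
rooms = np.linspace(4.0, 9.0, 50).reshape(-1, 1)  # shape (n_samples, 1)
prices = 9.0 * rooms.ravel() - 34.0

demo_model = LinearRegression().fit(rooms, prices)

slope = demo_model.coef_[0]        # change in price per additional room
intercept = demo_model.intercept_  # predicted price at zero rooms
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

For the model in this notebook, the corresponding values would be read from `lin_model.coef_` and `lin_model.intercept_`.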

Lastly, we generate a prediction by the model and visualize the results.

In [None]:
Y_pred = lin_model.predict(X)
plt.figure(figsize=(20, 5))
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.title('Linear Regression - Prediction')
plt.ylabel('House price (*$1000)')
plt.xlabel('Number of rooms')
plt.show()

The red line in the plot shows how our model predicts house prices will change, based on the number of rooms available in a house. The line for which the error between the predicted values and the observed values (blue dots) is minimal, measured as the sum of the squared errors, is called the best-fit line or the regression line.

These errors are also called residuals. The residuals can be visualized as the vertical 'distance' between the observed data values and the regression line.

![](images/residuals.png)
>_http://wiki.engageeducation.org.au/further-maths/data-analysis/residuals/_
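To make the idea of residuals and the best-fit line concrete, here is a minimal sketch that computes an ordinary least-squares line by hand with NumPy and then inspects its residuals. It runs on synthetic stand-in data, not the Boston values; all variable names and the numbers used to generate the data are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the Boston values).
rng = np.random.default_rng(seed=1)
x = rng.uniform(4.0, 9.0, size=100)
y = 9.0 * x - 34.0 + rng.normal(0.0, 5.0, size=100)

# Closed-form least-squares estimates for a single feature:
# slope = covariance(x, y) / variance(x),
# intercept = mean(y) - slope * mean(x).
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: observed value minus predicted value, one per data point.
y_pred = intercept + slope * x
residuals = y - y_pred

# Mean squared error of the fitted line.
mse = np.mean(residuals ** 2)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse:.2f}")

# Shifting the line away from the least-squares solution increases the error.
mse_shifted = np.mean((y - (intercept + 1.0 + slope * x)) ** 2)
print(f"MSE after shifting the line up by 1: {mse_shifted:.2f}")
```

The residuals of a least-squares line with an intercept sum to (numerically) zero, and any other line, such as the shifted one above, gives a larger mean squared error on the same data.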