# Linear Regression with Boston House Prices

![Photo from Ryan Mercier in Unsplash](../../resources/ryan-mercier-3U7tnALnvas-unsplash.jpg)  
*Photo from Ryan Mercier in Unsplash*


The dataset contains the following columns:

- `CRIM`: per capita crime rate by town
- `ZN`: proportion of residential land zoned for lots over 25,000 sq.ft.
- `INDUS`: proportion of non-retail business acres per town
- `CHAS`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- `NOX`: nitric oxides concentration (parts per 10 million)
- `RM`: average number of rooms per dwelling
- `AGE`: proportion of owner-occupied units built prior to 1940
- `DIS`: weighted distances to five Boston employment centres
- `RAD`: index of accessibility to radial highways
- `TAX`: full-value property-tax rate per \$10,000
- `PTRATIO`: pupil-teacher ratio by town
- `B`: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- `LSTAT`: % lower status of the population
- `MEDV`: median value of owner-occupied homes in \$1000's


## Downloading the dataset

import os
import sys
module_path = os.path.abspath(os.path.join('..', '..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from kaggle_utils.utils import KaggleUtils

dataset_name = 'vikrishnan/boston-house-prices'
with KaggleUtils() as api:
    api.kaggle_download_dataset(dataset_name)

## Problem Statement

Suppose you are an engineer at Capsule State Corp., a real estate company. Capsule State Corp. is looking for new investments in Boston. Your task is to **create an automated system to estimate the cost of houses**, using the dataset provided by the data compilation team. The estimates of the system will be used as a reference for the selling price of a house.

## Data Preparation and Cleaning

For our analysis, we are going to use the dataset contained in the file `housing.csv`, which was downloaded from Kaggle. We need to prepare our data by filling in any missing values. First, we create a pandas dataframe from the downloaded file.

In [None]:
%pip install pandas --quiet

In [6]:
import numpy as np
import pandas as pd

In [7]:
housing_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'Bk', 'LSTAT', 'MEDV']

In [8]:
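# The file is whitespace-delimited and has no header row; a raw string keeps the regex escape intact
housing_df = pd.read_csv('housing.csv', delimiter=r'\s+', header=None, names=housing_columns)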

In [9]:
housing_df.head()

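| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | Bk | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |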


Let us look at the information contained in our dataset. The objective of the system is to estimate the median value of owner-occupied homes (MEDV) using our dataset.

In [None]:
housing_df.info()

The dataset contains 506 rows and 14 columns. Each row of the dataset contains information about one census tract in the Boston area. Also, it seems that our dataset does not have any missing values, how wonderful! Let's look at the basic statistics of our data.
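
The absence of missing values can also be confirmed with an explicit count rather than by reading the `info()` output. The sketch below assumes a small hypothetical frame, `sample_df`, standing in for `housing_df`; one value is deliberately set to `NaN` so that the check has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample standing in for housing_df;
# one RM value is deliberately missing to show what the check reports
sample_df = pd.DataFrame({
    "RM": [6.575, 6.421, np.nan],
    "LSTAT": [4.98, 9.14, 4.03],
    "MEDV": [24.0, 21.6, 34.7],
})

# Number of missing values per column; all zeros means there is nothing to fill
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame
has_missing = bool(sample_df.isna().any().any())
print("Any missing values:", has_missing)
```

Running the same two expressions on `housing_df` should report zero for every column if the reading of `info()` above is right.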

In [None]:
housing_df.describe()

Only CRIM, AGE, and Bk seem to be skewed, as their median (50th percentile) and minimum values differ significantly.
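
Skewness can also be checked numerically instead of by eye. A minimal sketch, assuming a hypothetical two-column frame rather than the housing data: a mean far above the median and a large positive value from `DataFrame.skew()` both point to a right-skewed column.

```python
import pandas as pd

# Hypothetical data: one symmetric column and one with a long right tail
sample_df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [0.1, 0.2, 0.3, 0.5, 50.0],
})

# Compare mean and median side by side, together with the sample skewness
summary = pd.DataFrame({
    "mean": sample_df.mean(),
    "median": sample_df.median(),
    "skew": sample_df.skew(),
})
print(summary)
```

The same summary can be built from `housing_df` to compare CRIM, AGE, and Bk against the remaining columns.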

## Exploratory Data Analysis

Before training our ML model, we need to explore and analyze our dataset. The objective of this step is to help us understand the distributions and correlations in our data.

In [None]:
%pip install plotly matplotlib seaborn nbformat --quiet

In [3]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
import plotly.io as pio
pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'vscode'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

The following settings will improve the default style and font sizes for our charts.

In [None]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (15, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
#pio.renderers.default = 'vscode'

### MEDV

First, we take a look at our main variable of interest, the median value of owner-occupied homes (MEDV). This is the value that we are trying to predict.

In [10]:
fig = px.histogram(housing_df, x="MEDV", marginal='box')
fig.show()

It appears that we have some outliers, but inside the IQR we have what seems to be *a normal distribution* with a **median value of \$21,200**. The presence of outliers shows that the median is a better measure of central tendency than the mean.
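
The points drawn outside the whiskers of a box plot are commonly those lying more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule to a hypothetical array of values (in \$1000s), not to the real MEDV column; the quartile method used here is NumPy's default, which may differ slightly from the one plotly uses.

```python
import numpy as np

# Hypothetical home values in $1000s, with one clearly extreme entry
values = np.array([15.0, 18.5, 20.0, 21.2, 22.0, 23.5, 25.0, 50.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)

# The mean is pulled toward the extreme value, the median is not
print("Mean:", values.mean(), "Median:", np.median(values))
```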

### Correlations

One of the most useful things to look at when analyzing data for an ML regression model is the Pearson correlation coefficient. Let's calculate it for our dataset.

In [None]:
corr_matrix = housing_df.corr()[['MEDV']]
corr_matrix.transpose()
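
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand on a hypothetical five-row sample and compares the result with `Series.corr`, which uses the Pearson method by default.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample with the same column names as the dataset
sample_df = pd.DataFrame({
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

x = sample_df["LSTAT"].to_numpy()
y = sample_df["MEDV"].to_numpy()

# Center both variables around their means
x_centered = x - x.mean()
y_centered = y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same value as computed by pandas
r_pandas = sample_df["LSTAT"].corr(sample_df["MEDV"])
print(r_manual, r_pandas)
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.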

In [None]:
plt.figure(figsize=(4, 4))
sns.heatmap(corr_matrix, cmap='Reds', annot=True);

We are only interested in somewhat strong correlations, which here means an absolute value of at least 0.4.

In [None]:
corr_matrix.loc[(corr_matrix["MEDV"] >= 0.4) | (corr_matrix["MEDV"] <= -0.4)].sort_values(by="MEDV").T

We have 6 parameters that have some correlation with MEDV. The strongest correlation is with the [proportion of population that is lower status (LSTAT)](https://opendata.stackexchange.com/questions/15740/what-does-lower-status-mean-in-boston-house-prices-dataset), closely followed by the average number of rooms (RM) and then the pupil-teacher ratio by town (PTRATIO). The other parameters seem to be helpful, although their correlations are not as strong as those previously mentioned.
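
The same selection can be written with an absolute value, which also makes it easy to drop the target's correlation with itself and to rank features by strength. This is a sketch on illustrative numbers (hypothetical, not the notebook's output); `corr_with_medv` and `strong` are names introduced here.

```python
import pandas as pd

# Illustrative correlations with MEDV (hypothetical numbers, not the notebook's output)
corr_with_medv = pd.DataFrame(
    {"MEDV": [1.0, -0.74, 0.70, -0.51, 0.18, 0.25]},
    index=["MEDV", "LSTAT", "RM", "PTRATIO", "CHAS", "DIS"],
)

strong = (
    corr_with_medv.drop(index="MEDV")                 # the target always correlates 1.0 with itself
    .loc[lambda df: df["MEDV"].abs() >= 0.4]          # keep |r| >= 0.4, whatever the sign
    .sort_values(by="MEDV", key=lambda s: s.abs(), ascending=False)  # strongest first
)
print(strong.T)
```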

In [None]:
sns.pairplot(housing_df, y_vars=["MEDV"], x_vars=housing_df.columns[:5]);

In [None]:
sns.pairplot(housing_df, y_vars=["MEDV"], x_vars=housing_df.columns[5:10]);

In [None]:
sns.pairplot(housing_df, y_vars=["MEDV"], x_vars=housing_df.columns[10:-1]);

The pair plots let us visualize the aforementioned correlations: LSTAT, PTRATIO, INDUS, TAX, NOX, and RM show the strongest relationships with MEDV.

### LSTAT

In [None]:
fig = px.histogram(housing_df, x="LSTAT", marginal='box')
fig.show()

In [None]:
fig = px.scatter(housing_df,
                 x="LSTAT", 
                 y="MEDV", 
                 color="MEDV", 
                 opacity=0.7,
                 #trendline="ols",
                 #trendline_options=dict(log_x=True),
                 width=1200, 
                 height=600)
fig.show()

### RM

In [None]:
fig = px.histogram(housing_df, x="RM", marginal='box')
fig.show()

In [None]:
fig = px.scatter(housing_df,
                 x="RM", 
                 y="MEDV", 
                 color="MEDV", 
                 opacity=0.7,
                 #trendline="ols",
                 width=1200, 
                 height=600)
fig.show()

## Linear Regression

In [None]:
%pip install statsmodels --quiet

In [None]:
%pip install scikit-learn --quiet

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
independent_variables = ["LSTAT", "PTRATIO", "INDUS", "TAX", "NOX", "RM"]
inputs = housing_df[independent_variables]
targets = housing_df["MEDV"]
print('inputs.shape :', inputs.shape)
print('targets.shape :', targets.shape)

In [None]:
model.fit(inputs, targets)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
predictions = model.predict(inputs)

In [None]:
mean_squared_error(targets, predictions)
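
The mean squared error is expressed in squared units of the target, which makes it hard to read. Taking its square root gives the RMSE, which is in the same units as MEDV (\$1000s). A minimal sketch on hypothetical target and prediction arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, in $1000s
targets_sample = np.array([24.0, 21.6, 34.7, 33.4])
predictions_sample = np.array([26.0, 20.6, 31.7, 33.4])

# MSE: average of the squared errors
mse = mean_squared_error(targets_sample, predictions_sample)

# RMSE: square root of the MSE, back in the units of the target
rmse = np.sqrt(mse)
print("MSE :", mse)
print("RMSE:", rmse)
```

An RMSE of about 1.87 here would mean that predictions are off by roughly \$1,870 on a typical sample.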

In [None]:
model.coef_

In [None]:
model.intercept_
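
The fitted coefficients and intercept fully describe the model: a prediction is the intercept plus the sum of each input multiplied by its coefficient. The sketch below checks this on synthetic, noise-free data (an assumption made so that the true coefficients are known); `sketch_model` is a name introduced here and is separate from `model` above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients: y = 3 + 2*x0 - 1.5*x1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]

sketch_model = LinearRegression().fit(X, y)

# Prediction written out by hand: intercept + inputs @ coefficients
manual_predictions = sketch_model.intercept_ + X @ sketch_model.coef_

print("coef_     :", sketch_model.coef_)
print("intercept_:", sketch_model.intercept_)
print("matches predict():", np.allclose(manual_predictions, sketch_model.predict(X)))
```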

### Model Improvement

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit(housing_df[independent_variables])

In [None]:
scaler.mean_

In [None]:
scaler.var_

In [None]:
scaled_inputs = scaler.transform(housing_df[independent_variables])
scaled_inputs_df = pd.DataFrame(data=scaled_inputs, columns=independent_variables)
scaled_inputs_df.head()

In [None]:
scaled_inputs_df.describe()
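
`StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, which is the square root of `var_`. The sketch below reproduces the transformation by hand on a small hypothetical array; `sketch_scaler` is a name introduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column input on very different scales
X = np.array([
    [4.98, 296.0],
    [9.14, 242.0],
    [4.03, 242.0],
    [2.94, 222.0],
])

sketch_scaler = StandardScaler().fit(X)
scaled = sketch_scaler.transform(X)

# z = (x - mean) / sqrt(variance), computed column by column
manual_scaled = (X - sketch_scaler.mean_) / np.sqrt(sketch_scaler.var_)

print(np.allclose(scaled, manual_scaled))
print("column means:", scaled.mean(axis=0))
print("column stds :", scaled.std(axis=0))
```

After scaling, every column has a mean of approximately 0 and a standard deviation of approximately 1, which is what `describe()` should show for the scaled frame.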

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
inputs_train, inputs_test, targets_train, targets_test = train_test_split(scaled_inputs_df, targets, test_size=0.3)

In [None]:
model_improved = LinearRegression()
model_improved.fit(inputs_train, targets_train)

In [None]:
# Predict with the model trained on the scaled training split
predictions_improved = model_improved.predict(inputs_test)

In [None]:
model_improved.coef_

In [None]:
model_improved.intercept_

In [None]:
mean_squared_error(targets_test, predictions_improved)
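
One possible refinement, sketched here on synthetic data rather than on the housing dataset, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only, so that no statistics from the test rows leak into training. The fixed `random_state` is an assumption added for reproducibility, and all variable names are introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on different scales and a noisy linear target
rng = np.random.default_rng(42)
X = rng.normal(loc=[10.0, 300.0], scale=[5.0, 100.0], size=(200, 2))
y = 30.0 - 0.9 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Split before any scaling so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation when predicting
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, pipeline.predict(X_test))
print("Test MSE:", test_mse)
```

Evaluating on rows the model never saw during fitting gives a more honest estimate of how the system would perform on new houses.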

In [None]:
sns.kdeplot(scaled_inputs_df);

In [None]:
sns.kdeplot(housing_df[independent_variables]);

In [None]:
sns.kdeplot(housing_df[independent_variables], log_scale=True);