---

<center> <h1> <span style='color:#292D78'> CREWES Data Science Training </span> </h1> </center>

<center> <h2> <span style='color:#DF7F00'> Lab 4: Boston Housing Price Prediction </span> </h2> </center>

---

In this [Jupyter Notebook](https://jupyter.org/install) we will predict the **Boston housing price** using regression models.

# Boston Housing Price

### Context

Real estate is a reasonably stable investment, and predicting housing prices from a house's features is extremely important for deciding which region to invest in for a higher profit. In this project, you will identify the qualities of a house in Boston and predict its retail price.

### Objective

Analyze the Boston housing dataset, identify which house qualities have the greatest impact on the price, and train and test a regression model to predict house prices.

### Content

You are provided with house prices and features in Boston evaluated from 2014 to 2015.

> File: BostonHousing.csv

* **id**: identification of the house
* **date**: date of the house evaluation
* **price**: price of the house in US$
* **bedrooms**: number of bedrooms
* **bathrooms**: number of bathrooms
* **sqft_living**: square footage of the living area
* **sqft_lot**: square footage of the lot
* **floors**: number of floors
* **waterfront**: if it contains a waterfront
* **view**: if it has a good view
* **condition**: house's condition
* **grade**: house's score
* **sqft_above**: square footage of the upper level (if any)
* **sqft_basement**: square footage of the basement (if any)
* **yr_built**: year built
* **yr_renovated**: year renovated (if any)
* **zipcode**: zipcode
* **lat**: latitude
* **long**: longitude
* **sqft_living15**: average square footage of the living area of the 15 nearest houses
* **sqft_lot15**: average square footage of the lot of the 15 nearest houses

Loading packages:

In [None]:
# Core
import numpy as np
import pandas as pd

# Suppressing scientific notation in Pandas
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# print plots
%matplotlib inline 

# Machine Learning models and tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR, LinearSVR
from sklearn.neural_network import MLPRegressor

# Metrics
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score

# To suppress warnings and deprecation messages
import warnings

warnings.filterwarnings("ignore")

Reading the Boston Housing file:

In [None]:
data = pd.read_csv("BostonHousing.csv")
print(data.shape)
data

The data contains $21613$ rows and $21$ columns.

Checking data types:

In [None]:
data.info()

There are no missing values in the data. Except for `date`, which is of type *object*, all columns are numeric.

* `waterfront` indicates whether the house has a waterfront or not, so it is a boolean feature
* `view`, `condition`, and `grade` should be categorical

In [None]:
print(data.waterfront.unique())
print(data.view.unique())
print(data.condition.unique())
print(data.grade.unique())

Checking for duplicates:

In [None]:
data.duplicated().sum()

There are no duplicated rows.

Checking again for missing values:

In [None]:
data.isna().sum()

There are no missing values.

Let's drop the columns `id`, `date`, `zipcode` as we won't be using them:

In [None]:
data.drop(columns = ["id", "date", "zipcode"], inplace = True)
data

### Converting Column Types

* `waterfront` > boolean
* `view` > category
* `condition` > category
* `grade` > category

In [None]:
data["waterfront"] = data["waterfront"].astype("bool")

cat = ["view", "condition", "grade"]

for i in cat:
    data[i] = data[i].astype("category")

data.info()

Statistical description of the data:

In [None]:
data.describe().T

* *price* has a minimum of \$75k and a maximum of \$7.7M, and it may have a right-skewed distribution
* The median number of bedrooms is 3 (at least 50% of the houses have 3 or more), and the maximum is 33 bedrooms
* At least 50% of the houses have 2.25 or more bathrooms
* At least 50% of the houses have 1910 sqft or more of living area
* The mean lot square footage is larger than the 75% quantile, suggesting a strongly right-skewed distribution
* The oldest house in the data was built in 1900, while the newest one was built in 2015
* A large number of houses were never renovated
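
The "mean above the 75% quantile" check used for the lot size can be reproduced on synthetic data. The sketch below is only an illustration: the series `toy_lot` is a made-up log-normal sample, not the housing data, and its parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly right-skewed sample (hypothetical, NOT the housing data)
rng = np.random.default_rng(0)
toy_lot = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=5000), name="toy_lot")

mean = toy_lot.mean()
median = toy_lot.median()
q75 = toy_lot.quantile(0.75)
skewness = toy_lot.skew()  # positive values indicate a right-skewed distribution

# A few very large values pull the mean above the 75% quantile
print(f"median={median:.0f}  q75={q75:.0f}  mean={mean:.0f}  skew={skewness:.2f}")
```

When a handful of very large values dominates the sum, the mean can exceed the 75% quantile even though three quarters of the observations lie below it.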

In [None]:
data.describe(exclude = np.number).T

* Most of the houses have no waterfront, no good view, are in average condition, and received an average grade.

# EDA

## Univariate Analysis

### Numeric Features

Let's check the distribution of the numeric columns.

In [None]:
def hist_box(data, feature, figsize=(12, 7)):

    # Subplot canvas
    fig, (ax_box, ax_hist) = plt.subplots(nrows = 2, sharex = True, gridspec_kw = {"height_ratios": (0.25, 0.75)}, figsize = figsize)

    # Boxplot on top
    sns.boxplot(data = data, x = feature, ax = ax_box, showmeans = True, color = "pink")  # showmeans adds a marker at the mean value of the column

    # Histogram on bottom
    sns.histplot(data = data, x = feature, ax = ax_hist)

    # Add mean and median to histogram
    ax_hist.axvline(data[feature].mean(), color = "green") # mean
    ax_hist.axvline(data[feature].median(), color = "orange") # median

    # Title
    fig.suptitle("Distribution of " + feature, fontsize=16)

In [None]:
# Get numerical columns:
cols_num = list(data.select_dtypes(include = ["int64", "float64"]))

print(cols_num)

Now let's plot all the numerical columns using a for loop:

In [None]:
for i in cols_num:
    hist_box(data, i)

* `price` is right skewed, with outliers at prices over \$1M.
* Most of the distributions are right skewed; `sqft_lot` and `sqft_lot15` are strongly right skewed, so we will apply the log transformation to them.
* `sqft_basement` and `yr_renovated` contain a large number of zeros.

### Categorical Features

In [None]:
cols_cat = ["waterfront", "view", "condition", "grade"]

for i in cols_cat:
    sns.countplot(data = data, x = i);
    plt.title(i)
    plt.show()

* Most of the houses don't have a waterfront or a good view
* Only a few houses are in poor condition
* 7 and 8 are the most common grades.

## Bivariate Analysis

### Numerical vs Price

In [None]:
for i in cols_num:
    plt.figure(figsize = (15, 7))
    sns.scatterplot(data = data, x = i, y = "price")
    plt.title(i)
    plt.show()

* `price` seems to have a positive correlation with almost all the numerical variables, except for `floors`, `sqft_lot`, and `sqft_lot15`.
* Location (`lat`, `long`) also has an influence on the price of the houses.
* `sqft_lot` and `sqft_lot15` seem to have a non-linear relationship with price, which a linear model will hardly capture.

### Categorical vs Price

In [None]:
for i in cat:
    plt.figure(figsize = (15, 7))
    sns.boxplot(data = data, x = "price", y = i, showfliers = False)
    plt.title(i)
    plt.show()

* The better the view, the higher the price of the house
* Houses in better condition tend to be more expensive, although this does not seem to hold between conditions 3 and 4
* The higher the grade of the house, the higher the price

### Log transformation

We will apply the log transformation to the `sqft_lot` and `sqft_lot15` columns:

In [None]:
log_cols = ["sqft_lot", "sqft_lot15"]

for i in log_cols:
    data[i + "_log"] = np.log(data[i] + 1)
    plt.figure(figsize = (15, 7))
    sns.scatterplot(data = data, x = i + "_log", y = "price")
    plt.title(i + "_log")
    plt.show()

The logs of `sqft_lot` and `sqft_lot15` now seem to have a positive correlation with the price of the house.
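
To see why the log transformation helps with strongly right-skewed columns, here is a minimal sketch on synthetic data. The series `toy_lot` is a hypothetical log-normal sample standing in for a lot-size column; `np.log1p(x)` is NumPy's equivalent of the `np.log(x + 1)` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a lot-size column
rng = np.random.default_rng(1)
toy_lot = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000))

# log(x + 1): defined at x = 0 and compresses the long right tail
toy_lot_log = np.log1p(toy_lot)

skew_before = toy_lot.skew()
skew_after = toy_lot_log.skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed column is much closer to symmetric, which is what makes a linear relationship with the target easier to see in a scatter plot.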

## Multivariate Analysis

### Correlation Plot

In [None]:
plt.figure(figsize = (15,10))
sns.heatmap(data = data.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f", cmap = "Spectral");

In [None]:
plt.figure(figsize = (15,10))
sns.heatmap(data = data.corr(method = "spearman"), vmin = -1, vmax = 1, annot = True, fmt = ".2f", cmap = "Spectral");

* `price` has a relatively high positive correlation with the size of the house, the number of bedrooms and bathrooms, and the number of floors.
* The square footage features have a high correlation with each other.
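
The two heatmaps above use the Pearson (default) and Spearman correlations. As a rough illustration of how they differ, the sketch below builds a made-up, perfectly monotonic but non-linear relationship; the column names `x` and `y` are placeholders, not columns of the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=1000)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A feature whose Spearman correlation with the target is clearly higher than its Pearson correlation is a candidate for a transformation such as the log.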

Since we now have their log-transformed versions, let's drop the original `sqft_lot` and `sqft_lot15` columns:

In [None]:
data.drop(columns = ["sqft_lot", "sqft_lot15"], inplace = True)

### Pair Plot

In [None]:
sns.pairplot(data = data.drop(columns = ["waterfront"]))

The pair plot leads to an interpretation similar to that of the correlation plots.

## Data Transformation

### One Hot Encoding

Let's apply one-hot encoding to the categorical features:

In [None]:
data = pd.get_dummies(data, drop_first = True)
data
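
As a small illustration of what `pd.get_dummies` with `drop_first = True` produces, here is a sketch on a made-up four-row table (the values are invented for the example, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy table with one categorical column
toy = pd.DataFrame({
    "price": [300000, 450000, 520000, 610000],
    "condition": pd.Categorical([3, 4, 3, 5]),
})

# Each category becomes its own indicator column; drop_first removes the first
# level (condition_3), which is then implied when all other indicators are 0
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

Dropping the first level avoids one indicator column being an exact linear combination of the others, which matters for linear models.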

# Regression

Meta-data:

In [None]:
features = list(data.drop("price", axis = 1))
target = "price"
print(features)

## Splitting the data into train and test

In [None]:
temp, test = train_test_split(data, test_size = 0.2, random_state = 10)
train, val = train_test_split(temp, test_size = 0.25, random_state = 10)

print(train.shape[0], val.shape[0], test.shape[0], data.shape[0])
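
The two-step split above gives a 60/20/20 partition: 20% is held out for test, and 25% of the remaining 80% (that is, 20% of the total) goes to validation. A minimal sketch on a placeholder array of 1000 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, one column (not the housing data)
toy = np.arange(1000).reshape(-1, 1)

# Step 1: hold out 20% for test
temp, test = train_test_split(toy, test_size=0.2, random_state=10)

# Step 2: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2)
train, val = train_test_split(temp, test_size=0.25, random_state=10)

sizes = (len(train), len(val), len(test))
print(sizes)
```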

### Z-Transform

Most regression models compute some kind of "distance" between the predicted values (obtained from the features) and the target. Features commonly have different units, and the numbers in one column may be much larger than those in another; the columns with large values will then have a greater impact on the calculated distance, which may bias the model.

To avoid this, we can "normalize" the columns or apply some kind of standardization, like the **z-transform**.

The **z-transform** standardizes each column by subtracting the column mean from each element $x$ and then dividing by the standard deviation of the column:

$$z = \frac{x - u}{s}$$

where $u$ is the mean and $s$ is the standard deviation of the column.
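
The formula can be checked by hand against `StandardScaler`. The sketch below uses a small made-up matrix (`X_train`, `X_new` are hypothetical values); note that `StandardScaler` uses the population standard deviation (`ddof = 0`), and that $u$ and $s$ come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features: living area (sqft) and number of floors
X_train = np.array([[1000.0, 1.0], [1500.0, 2.0], [2500.0, 3.0], [3000.0, 2.0]])
# A new observation to be transformed with the training statistics
X_new = np.array([[2000.0, 2.0]])

# Manual z-transform: u and s are estimated on the training data only
u = X_train.mean(axis=0)
s = X_train.std(axis=0)  # ddof=0 (population std), as StandardScaler does
z_manual = (X_new - u) / s

# Same computation with scikit-learn
scaler = StandardScaler().fit(X_train)
z_sklearn = scaler.transform(X_new)

print(z_manual, z_sklearn)
```

Here the new observation sits exactly at the training mean of both columns, so both of its z-scores are zero.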

In [None]:
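# Resetting indexes (drop = True avoids adding the old index as an extra "index" column)
train.reset_index(drop = True, inplace = True)
test.reset_index(drop = True, inplace = True)
val.reset_index(drop = True, inplace = True)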

# Saving the prices of the houses
temp1 = train["price"]
temp2 = test["price"]
temp3 = val["price"]

# Creating the scaler operator
scaler = StandardScaler()

# Fitting the scaler on the training data and transforming it
train = scaler.fit_transform(train.drop("price", axis = 1))

# Transforming test and validation with the scaler fitted on the training data
test = scaler.transform(test.drop("price", axis = 1))
val = scaler.transform(val.drop("price", axis = 1))

# Recovering the columns names
train = pd.DataFrame(train, columns = scaler.feature_names_in_)
test = pd.DataFrame(test, columns = scaler.feature_names_in_)
val = pd.DataFrame(val, columns = scaler.feature_names_in_)

# Price back to dataframes
train["price"] = temp1
test["price"] = temp2
val["price"] = temp3

In [None]:
train.head()

In [None]:
val.head()

In [None]:
test.head()

## Linear Regression

In [None]:
# Creating the model
model_lr = LinearRegression()

# Training the model on the train data
model_lr.fit(train[features], train[target])

Let's check how the model behaves on the train data:

In [None]:
# Calculating predictions
pred = model_lr.predict(train[features])

# Metrics
print("R2 Score:", r2_score(train[target], pred))
print("MSE:", mean_squared_error(train[target], pred))
print("Expl. Var.:", explained_variance_score(train[target], pred))
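
For reference, the three metrics can be written out by hand. The sketch below uses made-up values for `y_true` and `y_pred` and follows the standard definitions: MSE is the mean squared residual, R2 is one minus the ratio of the residual sum of squares to the total sum of squares, and explained variance is one minus the ratio of the residual variance to the target variance.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 9.5])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(y_true)
evs = 1.0 - np.var(residuals) / np.var(y_true)

print(mse, r2, evs)
```

R2 and explained variance coincide when the residuals have zero mean; in this toy case the predictions are biased on average, so the explained variance is slightly higher than R2.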

### Metrics DataFrame

In [None]:
def regression_metrics(model, data, features = features, target = target):

    # Computing prediction
    pred = model.predict(data[features])

    # Computing metrics
    r2 = r2_score(data[target], pred)
    mse = mean_squared_error(data[target], pred)
    evs = explained_variance_score(data[target], pred)

    # DataFrame
    df = pd.DataFrame([r2, mse, evs], index = ["R2 Score", "MSE", "Exp. Var."], columns = ["Values"])
    
    return df
    

In [None]:
print("Train Data")
regression_metrics(model_lr, train)

In [None]:
print("Validation Data")
regression_metrics(model_lr, val)

In [None]:
print("Test Data")
regression_metrics(model_lr, test)

What happened to the test predictions?

In [None]:
tmp = pd.DataFrame({"True": test[target].values, "Predictions": model_lr.predict(test[features])})
tmp

In [None]:
tmp.describe().T

In [None]:
tmp[tmp.Predictions > 10000000]

Apparently, the model badly mispredicted the price of one house.

In [None]:
test.iloc[1:4,:]

The data looks good for this point...

Apparently, there is some non-linear relationship not properly captured by the model for this house. Let's drop it to check the metrics.

In [None]:
print("Test Data")
regression_metrics(model_lr, test.drop(2))

Without this specific house, the metrics are good.
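
Instead of reading the row label off the table, the worst prediction can also be located programmatically. The following is only a sketch on invented numbers (the `toy` frame mimics the `True`/`Predictions` layout used above); it also shows how much a single extreme error inflates the MSE.

```python
import numpy as np
import pandas as pd

# Hypothetical true prices and predictions, with one extreme misprediction
toy = pd.DataFrame({
    "True": [300000.0, 450000.0, 520000.0, 610000.0],
    "Predictions": [310000.0, 440000.0, 9.0e7, 600000.0],
})

# Row label of the largest absolute error
abs_error = (toy["True"] - toy["Predictions"]).abs()
worst = abs_error.idxmax()

# MSE with and without that row
mse_all = np.mean((toy["True"] - toy["Predictions"]) ** 2)
kept = toy.drop(worst)
mse_without = np.mean((kept["True"] - kept["Predictions"]) ** 2)

print(worst, mse_all, mse_without)
```

Because the errors are squared, one very large residual can dominate the metric for the whole set.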

## Support Vector Regressor

First, let's try the linear SVR.

In [None]:
model_linearsvr = LinearSVR(loss = 'squared_epsilon_insensitive', random_state = 10, dual = False)

model_linearsvr.fit(train[features], train[target])

Checking the metrics:

In [None]:
print("Train Data")
print(regression_metrics(model_linearsvr, train))
print(20 * "-")

print("Validation Data")
print(regression_metrics(model_linearsvr, val))
print(20 * "-")

print("Test Data")
print(regression_metrics(model_linearsvr, test))
print(20 * "-")

The results are similar to those of the linear regression, but without the strange prediction in the test data.

## Neural Networks

Now let's try a non-linear model to predict the price of the houses:

In [None]:
model_nn = MLPRegressor(
    hidden_layer_sizes = (20, 20),
    activation = "relu",
    solver = "lbfgs",
    max_iter = 500,
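    early_stopping = True,  # note: only effective when solver is "sgd" or "adam"
    n_iter_no_change = 10,  # note: only effective when solver is "sgd" or "adam"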
    random_state = 10
).fit(train[features], train[target])

print("Train Data")
print(regression_metrics(model_nn, train))
print(20 * "-")

print("Validation Data")
print(regression_metrics(model_nn, val))
print(20 * "-")

print("Test Data")
print(regression_metrics(model_nn, test))
print(20 * "-")

The neural network model performs better than the linear regression and the linear SVR.

### Plot True vs Predictions

In [None]:
plt.figure(figsize = (8,7))
plt.axline((0,0), (1000000, 1000000), color = "lightgray")
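sns.scatterplot(x = model_lr.predict(val[features]), y = val[target])  # x and y must be keyword arguments in recent seaborn versions
plt.xlabel("Predicted price")
plt.ylabel("True price")
plt.title("Linear Regression")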

In [None]:
plt.figure(figsize = (8,7))
plt.axline((0,0), (1000000, 1000000), color = "lightgray")
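sns.scatterplot(x = model_linearsvr.predict(val[features]), y = val[target])  # x and y must be keyword arguments in recent seaborn versions
plt.xlabel("Predicted price")
plt.ylabel("True price")
plt.title("Linear SVR")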

In [None]:
plt.figure(figsize = (8,7))
plt.axline((0,0), (1000000, 1000000), color = "lightgray")
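sns.scatterplot(x = model_nn.predict(val[features]), y = val[target])  # x and y must be keyword arguments in recent seaborn versions
plt.xlabel("Predicted price")
plt.ylabel("True price")
plt.title("Neural Networks")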

# Summary

## EDA

* We removed some columns not used for the modeling.
* The size of the house and the numbers of bedrooms and bathrooms proved to be strong differentiators of the price of the houses.
* Location has an influence on the price of the house.
* The categorical features (waterfront, view, condition, and grade) also showed a high influence on the price of the house.
* sqft_lot and sqft_lot15 were log-transformed, which gave a better linear correlation with the price.
* We applied one-hot encoding to the categorical features.
* The z-transform was used to bring all the features with different units to the same scale.

## Regression

* Linear regression worked well, with good metrics, but produced one strange prediction.
* The linear SVR had metrics similar to those of the linear regression and proved to be more stable.
* The neural network had the best performance, showing that housing price prediction is not a strictly linear problem.