# Regression Case Study: House Price Prediction

In this case study, we develop basic regression models in Python to predict the price of a house from its various features, for King County in the state of Washington. To build the models, we use [Kaggle's King County house dataset](https://www.kaggle.com/harlfoxem/housesalesprediction) which contains house sales data for the county. First we import some standard Python libraries.

### Import standard Python libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)

### Load dataset

We load the dataset from the file "kc_house_data.csv" into a pandas DataFrame object, and use its methods to obtain a summary of the dataset.
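A minimal sketch of the loading step. So that it runs without the CSV file, it reads a tiny in-memory stand-in whose rows are invented for illustration; with the real file, the `read_csv` call would simply take the file name (assuming it sits in the working directory).

```python
import io
import pandas as pd

# Stand-in for kc_house_data.csv so the snippet runs without the file.
# These three rows are made up for illustration, not real sales records.
csv_text = """id,price,bedrooms,sqft_living
1,250000,3,1200
2,540000,4,2500
3,310000,2,1500
"""

# With the real file: df = pd.read_csv('kc_house_data.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first few rows
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics of the numeric columns
```

`head`, `info` and `describe` together give a quick picture of the column names, types, missing values and value ranges before any modelling.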

In [None]:
%matplotlib inline
import seaborn as sns; sns.set()

# Histogram for prices


## Univariate Linear Regression

As a warmup, we train a univariate linear regression model that predicts the price of a house using only its living square footage (the 'sqft_living' feature). 

### Prepare Data

First we prepare the data by extracting the living square footage feature (the input) and the price (the output) of all examples in the dataset.

#### Create a pandas DataFrame consisting of the living square footage of all houses in the dataset

#### Extract the prices of all houses in the dataset into a pandas Series

#### Split the data set into a training set (80%) and a test set (20%)

The **training set** is the subset of data used to train the model. The **test set** is the subset of data used to test (i.e. evaluate) the trained model. After splitting, **X_train** and **y_train** respectively contain the input features and true output values of the data in the training set, called the *training examples*. *Both* **X_train** and **y_train** are used for training the model, which makes this an example of **Supervised Learning**. Likewise, **X_test** and **y_test** respectively contain the input features and true output values of the data in the test set, called the *test examples*. After the model is trained, and provided that it performs well on the training set, it is used to make predictions on **X_test**. These predictions are compared against the true output values in **y_test**, and the result of the comparison serves as an estimate of the model's performance on unseen data.
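The extraction and splitting steps might look like the sketch below. It uses a synthetic stand-in for the dataset (100 made-up houses) so that it runs on its own; the `random_state` value is an arbitrary choice, not something prescribed by the case study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 100 made-up houses.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=100)
df = pd.DataFrame({
    'sqft_living': sqft,
    'price': 280 * sqft + rng.normal(0, 50_000, size=100),
})

X = df[['sqft_living']]  # double brackets keep X a two-dimensional DataFrame
y = df['price']          # single brackets give a one-dimensional Series

# test_size=0.2 holds out 20% of the rows for testing; random_state
# (arbitrary here) makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```

Rows are shuffled before splitting, and each row of **X_train** keeps the same index label as its price in **y_train**.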

In [None]:
from sklearn.model_selection import train_test_split



In [None]:
print('X_train.shape = ', X_train.shape)
print('y_train.shape = ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ', y_test.shape)

#### Plot price against living square footage for the training set

### Train the Model

Now that the data is prepared, we are ready to train the model.

#### Import the LinearRegression class from sklearn.linear_model

#### Create a LinearRegression model object

#### Fit the model to the training set

As noted above, both **X_train** which contains the input features and **y_train** which contains the price of every training example are used for training the model. This is an example of **Supervised Learning**.

#### Parameters of the Model

We can read the parameters of the model from its **coef_** and **intercept_** attributes. In the plane, a line is described by an equation of the form

\begin{equation}
y = ax + b
\end{equation}

The coefficient (slope) $a$ is given by the **coef_** attribute, and the intercept $b$ by the **intercept_** attribute. Therefore, the univariate linear model predicts the price of a house as

\begin{equation}
\text{predicted price} \; = \; \text{coef} \: \times \: \text{sqft living} \: + \: \text{intercept}
\end{equation}
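As a quick check of this formula, the sketch below fits a model on synthetic stand-in data (made-up square footages and prices, not the King County data) and confirms that `coef_ * sqft_living + intercept_` reproduces what `predict` returns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in training data: price is roughly linear in square footage.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
X_train = pd.DataFrame({'sqft_living': sqft})
y_train = pd.Series(280 * sqft - 40_000 + rng.normal(0, 50_000, size=200),
                    name='price')

lr = LinearRegression()
lr.fit(X_train, y_train)

a = lr.coef_[0]    # coef_ is an array holding one slope per input feature
b = lr.intercept_  # intercept_ is a single number when y is one-dimensional

# Apply the line equation by hand and compare with the model's own output.
manual = a * X_train['sqft_living'].to_numpy() + b
print(np.allclose(manual, lr.predict(X_train)))
```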

#### Make a prediction

Now that the model is trained, we can use it to make a prediction, as follows. In the example below, the input to the method **predict** is a DataFrame with a single column consisting of the square footage of three houses. The output is a one-dimensional NumPy array consisting of the predicted prices of the three houses. Note that the input to the method **predict** must be two-dimensional, even if there is only one feature (column).
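A sketch of such a prediction, again on a model fitted to synthetic stand-in data; the three square footages are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit a model on synthetic stand-in data (made-up houses).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
X_train = pd.DataFrame({'sqft_living': sqft})
y_train = 280 * sqft + rng.normal(0, 50_000, size=200)
lr = LinearRegression().fit(X_train, y_train)

# Three hypothetical houses: one column, three rows, so the input is 2-D.
new_houses = pd.DataFrame({'sqft_living': [1000, 2000, 3000]})
predicted = lr.predict(new_houses)

print(new_houses.shape)  # two-dimensional input: (3, 1)
print(predicted.shape)   # one-dimensional output: (3,)
print(predicted)
```

Using the same column name as in training keeps scikit-learn from warning about mismatched feature names.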

Now that the model has been trained, let's see how well it works on the data used to train the model.

#### Use the trained model to predict the prices of houses in the training set

#### Plot the <font color='red'> predicted </font> house prices (the <font color='red'> red </font> line) and the <font color='blue'> true </font> house prices (the <font color='blue'> blue </font> dots) for the training set

In [None]:
plt.scatter(X_train['sqft_living'], y_train)

x_min = X_train['sqft_living'].min()
x_max = X_train['sqft_living'].max()
x_plot = np.linspace(x_min, x_max, 10000)
y_plot = lr.coef_ * x_plot + lr.intercept_
plt.plot(x_plot, y_plot, color='r')
plt.xlabel('Living Square Footage')
plt.ylabel('Price')

#### Compute the $R^2$-score of the model on the training set 

The $R^2$-score, also known as the coefficient of determination, is a measure of the performance of a regression model. Its best possible value is 1.
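Concretely, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared differences between true and predicted values and $SS_{tot}$ is the sum of squared differences between the true values and their mean. The sketch below computes it by hand on four made-up prices and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted prices, for illustration only.
y_true = np.array([300_000, 450_000, 520_000, 610_000], dtype=float)
y_pred = np.array([320_000, 430_000, 500_000, 650_000], dtype=float)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))

# A model that always predicts the mean price scores 0;
# a model that predicts every price exactly scores 1.
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline), r2_score(y_true, y_true))
```

Note the argument order of `r2_score`: true values first, predictions second.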

In [None]:
from sklearn.metrics import r2_score



Save the training data and labels for later use.

In [None]:
X_train_sqft = X_train.copy()
y_train_sqft = y_train.copy()

### Remark

The above plot and the low $R^2$-score indicate that the simple univariate linear regression model performs poorly on the training set, *even though* the true price of every house in the training set was given for training. This is to be expected, because the model uses only the living square footage to predict the price of a house, while many other features affect the price. This phenomenon, in which an ML model fits even its training data poorly, is known as **underfitting**. It implies that the current model is insufficient, and that an improved model using more features is needed for better predictions.

## Multivariate Linear Regression

We now build a multivariate linear regression model to predict the price of a house, using (almost) all the relevant features in the dataset.

### Prepare Data

First we prepare the data by extracting the features of a house relevant for predicting its price.

#### Create a pandas DataFrame consisting of the features relevant for predicting the house price.

#### Split the data set into a training set (80%) and a test set (20%)

In [None]:
print('X_train.shape = ', X_train.shape)
print('y_train.shape = ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ', y_test.shape)

### Train the Model

Now that the data is prepared, we are ready to train the model.

#### Create a LinearRegression model object

#### Fit the model to the training set

#### Parameters of the Model

We can read the parameters of the model from its **coef_** and **intercept_** attributes. For a multivariate linear regression model, **coef_** is a one-dimensional NumPy array with one coefficient for every input feature. If
$$
\text{coef} \;=\; \left[ c_1, \ldots, c_n \right]
$$

and the input features are 
$$ \left[x_1, \ldots, x_n \right] $$ 

then the multivariate linear model predicts the price of a house as

$$
\text{predicted price} \; = \; \sum_{i=1}^{n} c_i x_i \: + \: \text{intercept} 
$$
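The sum above is a dot product between each row of features and **coef_**. The sketch below checks this on synthetic stand-in data; the three columns are an illustrative subset, not necessarily the feature set chosen in this case study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with three illustrative features.
rng = np.random.default_rng(0)
n = 300
X_train = pd.DataFrame({
    'sqft_living': rng.uniform(500, 4000, size=n),
    'bedrooms': rng.integers(1, 6, size=n),
    'bathrooms': rng.integers(1, 4, size=n),
})
y_train = (250 * X_train['sqft_living']
           + 10_000 * X_train['bedrooms']
           + 15_000 * X_train['bathrooms']
           + rng.normal(0, 30_000, size=n))

lr = LinearRegression().fit(X_train, y_train)
print(lr.coef_.shape)  # one coefficient per feature: (3,)

# sum_i c_i * x_i + intercept, computed for every row at once.
manual = X_train.to_numpy() @ lr.coef_ + lr.intercept_
print(np.allclose(manual, lr.predict(X_train)))
```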

#### Make a prediction

Now that the model is trained, we can use it to make a prediction. The input to the method **predict** is a pandas DataFrame or a two-dimensional NumPy array, where each row consists of the input features of a single house. The output is a one-dimensional NumPy array consisting of the predicted prices of the houses.

#### Use the trained model to predict the prices of houses in the training set

#### Compute the $R^2$-score of the model on the training set

#### Compare predicted prices and true prices

While it's hard to plot and visualize the result of a high dimensional model, we use the following DataFrame to compare the prices predicted by the model against the true prices.
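One possible layout for such a comparison is sketched below. The prices are made up, and the column names `true_price`, `predicted_price`, `error` and `pct_error` are hypothetical choices for illustration.

```python
import pandas as pd

# Made-up values; in the notebook these would be y_train and the output
# of the model's predict method on X_train.
y_true = pd.Series([300_000.0, 450_000.0, 520_000.0], name='price')
y_pred = [320_000.0, 430_000.0, 560_000.0]

# The predictions are matched to the true prices row by row, by position.
comparison = pd.DataFrame({'true_price': y_true, 'predicted_price': y_pred})
comparison['error'] = comparison['predicted_price'] - comparison['true_price']
comparison['pct_error'] = 100 * comparison['error'] / comparison['true_price']

print(comparison.round(1))
```

Sorting by the absolute value of `pct_error` is a quick way to surface the houses the model gets most wrong.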

### Remark

We see a considerable improvement in the training $R^2$-score, from about 0.5 for the previous univariate linear model to about 0.7 for the multivariate linear model we just built. However, a training $R^2$-score of 0.7 is still unsatisfactory, and the comparison above shows that for some houses the predicted price is very far from the actual price. The result indicates that **underfitting** still occurs and calls for an even better model. The underfitting is likely due to the nonlinear nature of the data; in this case, no linear model would fit the data well.

## Random Forest

We now employ a more sophisticated regression model known as **Random Forest** to achieve significantly better house price predictions than the above multivariate linear model. I omit the details of Random Forest for now but will cover them later in the course.

### Import the RandomForestRegressor class from sklearn.ensemble 

### Using the single feature 'sqft_living' only

To visualize its effectiveness, we first train a Random Forest model using the 'sqft_living' feature only as we did in univariate linear regression, and compare its result with that of the univariate linear regression model.
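The comparison can be sketched as follows on synthetic stand-in data in which price grows nonlinearly with square footage (an assumption made purely for illustration). `random_state=0` is added only for reproducibility; all other settings are the defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data: price grows quadratically with square footage.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=500)
X_train_sqft = pd.DataFrame({'sqft_living': sqft})
y_train_sqft = 0.08 * sqft ** 2 + rng.normal(0, 40_000, size=500)

lr = LinearRegression().fit(X_train_sqft, y_train_sqft)
rf = RandomForestRegressor(random_state=0).fit(X_train_sqft, y_train_sqft)

# Training-set R^2 of each model on the same single feature.
r2_lr = r2_score(y_train_sqft, lr.predict(X_train_sqft))
r2_rf = r2_score(y_train_sqft, rf.predict(X_train_sqft))
print(f'linear regression: {r2_lr:.3f}')
print(f'random forest:     {r2_rf:.3f}')
```

The scores printed here describe the synthetic data only and are not the figures reported for the King County dataset.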

#### Create a RandomForestRegressor model object with the default settings

#### Fit the model to the training set used for the univariate regression model

#### Use the trained model to predict the prices of houses in the training set

#### Compute the $R^2$-score of the model on the training set

#### Plot the <font color='red'> predicted </font> house prices (the <font color='red'> red </font> line) and the <font color='blue'> true </font> house prices (the <font color='blue'> blue </font> dots) for the training set

In [None]:
plt.scatter(X_train_sqft, y_train_sqft)

x_plot_dict = {'sqft_living': x_plot}
x_plot_df = pd.DataFrame(x_plot_dict)
y_plot = rf.predict(x_plot_df)
plt.plot(x_plot, y_plot, color='r')
plt.xlabel('Living Square Footage')
plt.ylabel('Price')

#### Remark

Despite the expected low $R^2$-score (which is about 0.62) of the Random Forest model trained with just the 'sqft_living' feature, the score is a lot higher than that of the corresponding univariate Linear Regression model (which is about 0.49). Moreover, the above plot shows that random forest fits the data better than linear regression; in particular, random forest seems to capture some nonlinear patterns of the dataset.

### Using all relevant features

We now train a Random Forest model using all relevant features, as we did in multivariate linear regression.

#### Create a RandomForestRegressor model object with the default settings

#### Fit the model to the training set

Notice that Random Forest regression takes a lot longer to train than linear regression.

#### Use the trained model to predict the prices of houses in the training set

#### Compute the $R^2$-score of the model on the training set

Wow! The training $R^2$-score of the Random Forest regressor is above **0.98**!

#### Compare predicted prices and true prices in the training set

#### Compute and sort (in descending order) the relevance of the features

In [None]:
feature_importance = rf.feature_importances_

sorted_idx = np.argsort(feature_importance)[::-1]
print("Feature Importance:\n")
for i in range(len(X.columns)):
    idx = sorted_idx[i]
    print(f'{X.columns[idx]:20} {feature_importance[idx]:.3f}')

### Test the model

Now that we have trained a model which performs well on the training set, it's time to evaluate its performance on the test set. Notice that the model has never seen the true prices of the houses in the test set. Therefore, the performance of the model on the test set is an estimate of how well the model generalizes to unseen data.
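A sketch of the evaluation on synthetic stand-in data (two illustrative features, made-up prices): the model is fit on the training split only, and the $R^2$-score is then computed separately on each split.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with noisy, nonlinear prices.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    'sqft_living': rng.uniform(500, 4000, size=n),
    'bedrooms': rng.integers(1, 6, size=n),
})
y = (0.08 * X['sqft_living'] ** 2 + 10_000 * X['bedrooms']
     + rng.normal(0, 80_000, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The model only ever sees the training split during fitting.
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

train_r2 = r2_score(y_train, rf.predict(X_train))
test_r2 = r2_score(y_test, rf.predict(X_test))
print(f'training R^2: {train_r2:.3f}')
print(f'test R^2:     {test_r2:.3f}')
```

The difference between the two printed scores is the train/test gap; its size on this synthetic data says nothing about its size on the real dataset.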

#### Use the trained model to predict the prices of the houses in the test set

#### Compute the $R^2$-score of the model on the test set

#### Compare predicted prices and true prices in the test set

### Remark

The Random Forest regressor achieves a good $R^2$-score, above 0.85, on the test set. At the same time, we notice a gap between the training score (above 0.98) and the test score (above 0.85). In general, an ML model is expected to perform worse on the test set than on the training set: the model was fit to the input features and true output values of the training examples, whereas it never saw the true output values of the test examples. The goal is therefore to close the gap between training performance and test performance, provided that the model performs well on the training set. A noticeable gap between the two is a sign of **overfitting**, an important topic we will address soon in the course.

## Model Persistence

After training a model that performs well in both training and testing, it is desirable to save the model so that you can reuse it without retraining. Suppose that you are satisfied with the performance of the Random Forest regressor. Here we use joblib, an efficient package for saving the model to disk and loading it back.

### Save the model to the disk

### Load the model from the disk
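Both steps can be sketched together as below. The model is a small stand-in trained on synthetic data, the file name `rf_model.joblib` is a hypothetical choice, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 280 * X[:, 0] + rng.normal(0, 50_000, size=100)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_model.joblib')  # hypothetical name
    joblib.dump(rf, model_path)          # save the fitted model to disk
    rf_loaded = joblib.load(model_path)  # load it back into memory

# The reloaded model makes exactly the same predictions as the original.
same = np.array_equal(rf.predict(X), rf_loaded.predict(X))
print(same)
```

Because joblib relies on pickle, only load model files from sources you trust, and preferably with the same scikit-learn version that was used to save them.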