In [None]:
%reset


# House Prices: Advanced Regression Techniques (Kaggle Competition)

## Framing the Problem

**Objective:** To accurately predict the final price of homes in Ames, Iowa.

**Use of Solution**: The solution may be used to predict home prices in Ames, but it is mostly a way for me to experiment, learn and improve my ML skills.

**Current Solutions**: Experts in real estate give their best estimate of house prices given their knowledge of markets.

**Type of Problem**: This is a regression problem that should be solved with offline supervised learning in a model-based approach.

**Performance Measure**: As determined by the Kaggle competition, the performance of the model will be measured by the Root Mean Squared Logarithmic Error. While working on the model, however, I will evaluate it with a range of metrics, including the Mean Absolute Error, to gauge how well the model is performing in real terms of house prices.
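
As a rough sketch of the two metrics mentioned above, the functions below compute them with NumPy. The names `rmsle` and `mae` are my own, the `log(1 + x)` form is an assumption (the competition's scoring may use a plain logarithm), and the prices are made up purely for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, using log(1 + x) (an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error, in the same units as the target (dollars here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up prices, not taken from the competition data
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

print("RMSLE:", rmsle(actual, predicted))
print("MAE:", mae(actual, predicted))
```

The RMSLE penalises relative errors, so a $10,000 miss on a cheap house counts for more than the same miss on an expensive one, while the MAE reports the average miss directly in dollars.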

**Minimum Performance for Business Objective Success**: I learned something.

## Get the Data

### Download it

In [None]:
from zipfile import ZipFile

# Having some trouble with Kaggle API at the moment, but in future try to download data programmatically if possible

ZIP_PATH = "data/house-prices-advanced-regression-techniques.zip"

# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(ZIP_PATH, 'r') as zip_file:
    zip_file.extractall('data')

### Read it in

In [None]:
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## Exploratory Data Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
eda_data = train.copy()
eda_data.shape

In [None]:
eda_data.info(verbose=True)

In [None]:
pd.set_option('display.max_columns', None)
eda_data.describe()

In [None]:
eda_data.head(10)

In [None]:
eda_data_num = eda_data.select_dtypes(include=['int64', 'float64'])
eda_data_cat = eda_data.select_dtypes(include=['object'])

In [None]:
eda_data_num.hist(bins=50, figsize=(20,60), layout=(10,4))
plt.show()

### Notes on Histograms of Numerical Data

* The `2ndFlrSF` variable has many 0 values because most homes don't have a 2nd floor, and I'm not sure how to deal with this. Should I create a categorical variable for it as well? I will look into this. The same goes for garages: some homes have no garage at all. How should this be dealt with?
* `BsmtFinSF1`, `BsmtFinSF2` and `BsmtUnfSF` can probably be ignored in favour of `TotalBsmtSF`.
* There are four different types of porch, recorded in the fields `OpenPorchSF`, `EnclosedPorch`, `3SsnPorch` and `ScreenPorch`, each giving the square footage of the respective porch. First I'll see if any of these porch types has a noticeable impact on the sale price, but I will most likely turn them into categorical variables and keep a single numerical variable for the total porch square footage across all types.
* `PoolArea` is rarely populated: `value_counts()` shows only a few non-zero entries. A categorical variable for whether or not there is a pool should be sufficient.
* All other fields at this point seem unimportant or should be fine to include.
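
One possible way to act on these notes is sketched below. The helper `add_presence_features` and the new column names (`Has2ndFlr`, `HasGarage`, `HasPool`, `TotalPorchSF`) are hypothetical, and using `GarageArea` to detect a garage is my assumption. The two-row frame is made-up data so the example runs on its own.

```python
import pandas as pd

PORCH_COLS = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]

def add_presence_features(df):
    """Return a copy with 0/1 presence flags and a combined porch area."""
    out = df.copy()
    # A square footage of 0 is taken to mean the feature is absent
    out["Has2ndFlr"] = (out["2ndFlrSF"] > 0).astype(int)
    out["HasGarage"] = (out["GarageArea"] > 0).astype(int)
    out["HasPool"] = (out["PoolArea"] > 0).astype(int)
    # One numerical variable covering porches of all kinds
    out["TotalPorchSF"] = out[PORCH_COLS].sum(axis=1)
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({
    "2ndFlrSF": [0, 854],
    "GarageArea": [548, 0],
    "PoolArea": [0, 0],
    "OpenPorchSF": [61, 0],
    "EnclosedPorch": [0, 272],
    "3SsnPorch": [0, 0],
    "ScreenPorch": [0, 0],
})
engineered = add_presence_features(toy)
print(engineered[["Has2ndFlr", "HasGarage", "HasPool", "TotalPorchSF"]])
```

Keeping the original square-footage columns alongside the flags lets a model use both the presence of a feature and its size.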


### The Target Variable

In [None]:
eda_data_num.hist(column='SalePrice', bins=100, figsize=(16,6))

In [None]:
num_corr_matrix = eda_data_num.corr()
num_corr_matrix['SalePrice'].sort_values(ascending=False)

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
sns.boxplot(x=eda_data_num['OverallQual'], y=eda_data_num['SalePrice']).set_title("Overall Quality vs SalePrice")

I'm assuming that this is a subjective rating given by a surveyor. Although it is meant to rate "the overall material and finish of the house" alone, I wouldn't be surprised if it carries some inherent bias that follows the surveyor's general valuation of the home. Alternatively, quality may simply correlate with the other factors that determine a home's value, since houses in expensive locations are probably more likely to have a more expensive, high-quality finish. Either way it is a very strong predictor of SalePrice, so, whatever the reason, we'll keep it.

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
sns.regplot(x=eda_data_num['GrLivArea'], y=eda_data_num['SalePrice'], scatter_kws={'alpha':0.2}).set_title("Ground Living Area vs SalePrice")

As expected, the living area in square feet of the home is well correlated with its value.

In [None]:
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x=eda_data_num['LotArea'], y=eda_data_num['SalePrice'], scatter_kws={'alpha':0.2}).set_title("Sale Price as a result of Lot Size (all data)")

In [None]:
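import numpy as np

def is_outlier(points, thresh=3.5):
    """Flag outliers using the modified z-score (based on the median absolute deviation)."""
    # Convert to a NumPy array first: a pandas Series does not support points[:, None]
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    # Euclidean distance of each observation from the median
    diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))
    med_abs_deviation = np.median(diff)
    if med_abs_deviation == 0:
        # Avoid dividing by zero when at least half the values equal the median
        return np.zeros(len(points), dtype=bool)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh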

In [None]:
no_la_outliers = eda_data_num[~is_outlier(eda_data_num['LotArea'])]

fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x=no_la_outliers['LotArea'], y=no_la_outliers['SalePrice'], scatter_kws={'alpha':0.2}).set_title("Sale Price as a result of Lot Size (no LotArea outliers)")

I find this strange: I would have expected `LotArea` to have a much stronger positive correlation with `SalePrice`. `GrLivArea` is very highly correlated though, as expected.
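
One way to probe a weaker-than-expected correlation like this is to compare the Pearson coefficient with a rank-based (Spearman) one, which is much less sensitive to a few extreme lot sizes. The sketch below is only a suggestion, not something the analysis above relies on. The helper `compare_correlations` is hypothetical, Spearman is computed as Pearson on ranks, and the numbers are made up.

```python
import pandas as pd

def compare_correlations(df, feature, target):
    """Return the Pearson and Spearman correlations between two columns."""
    pearson = df[feature].corr(df[target])
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = df[feature].rank().corr(df[target].rank())
    return {"pearson": float(pearson), "spearman": float(spearman)}

# Made-up values: price rises steadily with lot size, but one lot is enormous
toy = pd.DataFrame({
    "LotArea": [1, 2, 3, 4, 100],
    "SalePrice": [1, 2, 3, 4, 5],
})
result = compare_correlations(toy, "LotArea", "SalePrice")
print(result)
```

If the Spearman value is clearly higher than the Pearson value on the real data, the relationship is probably monotonic but not linear, which would point towards transforming `LotArea` rather than discarding it.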

In [None]:
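for fld in eda_data_cat.columns:
    print("\n" + fld)
    # value_counts() keeps each label paired with its own count;
    # unique() returns labels in order of appearance, which would mislabel the bars
    counts = eda_data_cat[fld].value_counts(dropna=True)
    fig, ax = plt.subplots(figsize=(8,4))
    plt.bar(counts.index, counts.values)
    plt.title(fld)
    plt.show()
    print("{} missing values".format(eda_data_cat[fld].isna().sum()))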

### Notes on Bar Charts of Categorical Data
* The vast majority of values in `MSZoning` are some kind of residential zoning; the rest are commercial. It may be worth investigating whether commercial vs residential zoning has an impact on price and, if so, re-categorizing the values as just those two groups.
* There are three different levels of irregularity for `LotShape`, which seems more detail than needed. Just collapse them into a single one-hot categorical variable, `Irregular`.
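
A minimal sketch of both re-categorizations follows. The helper `simplify_categories` and the column name `IsResidential` are hypothetical. The zoning codes treated as residential and the use of `Reg` for a regular lot are assumptions taken from my reading of the competition's data description, and the three rows are made up.

```python
import pandas as pd

# Assumed residential codes, including floating village residential (FV)
RESIDENTIAL_ZONES = {"RL", "RM", "RH", "RP", "FV"}

def simplify_categories(df):
    """Return a copy with binary zoning and lot-shape indicators."""
    out = df.copy()
    # 1 for any residential zoning, 0 otherwise (missing values also become 0)
    out["IsResidential"] = out["MSZoning"].isin(RESIDENTIAL_ZONES).astype(int)
    # 1 for any value other than "Reg" (a missing value would also count as irregular)
    out["Irregular"] = (out["LotShape"] != "Reg").astype(int)
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({
    "MSZoning": ["RL", "C (all)", "FV"],
    "LotShape": ["Reg", "IR1", "IR3"],
})
simplified = simplify_categories(toy)
print(simplified)
```

Missing values would need handling before these flags are trusted, since this sketch silently folds them into one of the two groups.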


In [None]:
for fld in eda_data_cat.columns:
    print("\n" + fld)
    median_increasing = eda_data.groupby(by=[fld])['SalePrice'].median().sort_values(ascending=True).index

    fig, ax = plt.subplots(figsize=(8,4))
    sns.boxplot(x=eda_data_cat[fld], y=eda_data_num['SalePrice'], order=median_increasing).set_title("{} vs SalePrice".format(fld))
    plt.show()