# Home Price Prediction

This notebook walks through building a simple pipeline that goes from model training to prediction. There isn't much in the way of EDA or feature engineering, and many details of the data set will slip by (for instance, `MoSold` is treated as a numerical variable rather than a categorical one). There is also no hyperparameter tuning or any other model tuning, and model evaluation is limited to basic metrics.

### Steps
1. Transform the data
2. Handle missing values
3. Train model
4. Make predictions

In [None]:
import os
import sklearn
import scipy

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

# Transform the Data

First we need to read in the data and transform it so that it is model ready. This involves converting categorical variables such as `Utilities` and `LotShape` into one-hot encoded vectors that a model can read. We'll start by reading in the data set from our local directory.

In [None]:
dat = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

In [None]:
dat = dat.drop('Id', axis=1)

In [None]:
dat.head()

With the data set loaded, we need to separate the categorical columns from the numeric ones, as the categorical data will need to be transformed before prediction. (It's good practice to transform the numeric data too, either to standardize it or to identify outliers.)

In [None]:
# Identify numerical data type columns
num_cols = list(dat.columns[dat.dtypes == np.int64])
num_cols += list(dat.columns[dat.dtypes == np.float64])

In [None]:
# Identify categorical data type columns
cat_cols = list(dat.columns[dat.dtypes == object])
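As an alternative to comparing dtypes one at a time, pandas' `select_dtypes` can do the same split in two calls. The sketch below uses a small made-up frame (the values are illustrative, only the column names are borrowed from the data set) and also shows one possible way to treat a month-like column such as `MoSold` as categorical, by casting it to strings first.

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the data set.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "MoSold": [2, 5, 9],
    "LotShape": ["Reg", "Reg", "IR1"],
})

# Cast the month to strings so it is no longer picked up as numeric.
toy["MoSold"] = toy["MoSold"].astype(str)

# Numeric columns are selected directly; everything else is treated as categorical.
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()
toy_cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_num_cols)
print(toy_cat_cols)
```

Selecting by `"number"` catches every integer and float width, not only `int64` and `float64`, which can matter if a column is read in with a narrower dtype.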

# Handle Missing Values
With our categorical and numeric data separated, we need to address the issue of missing values within the data set. We'll start by looking at which variables have missing data and how much.

In [None]:
# Quantify amount of missing data: count per column, keeping only columns with any missing values
null_counts = dat.isna().sum()
null_values = null_counts[null_counts > 0]
prop = null_values / dat.shape[0] * 100
missing_data = pd.concat([null_values, prop], axis=1, keys=['Total', 'Percentage'])
missing_data.sort_values(by='Total', ascending=False)

Variables such as `PoolQC` or `MiscFeature` have more missing values than observed ones. Missing data can be imputed using a variety of techniques, but in this case we'll simply drop the variables with a large share of missing values, as they don't provide much information. `SalePrice` is added to the same list because it is the target, so it must not end up among the predictor columns.

In [None]:
drops = ['Alley', 'PoolQC', 'Fence', 'FireplaceQu', 'MiscFeature', 'SalePrice']

In [None]:
# Remove the dropped columns while keeping the original column order
cat_cols = [c for c in cat_cols if c not in drops]
num_cols = [c for c in num_cols if c not in drops]
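The list of columns to drop above is written out by hand. A minimal sketch of deriving such a list from a cutoff instead is shown below, on a made-up frame. The `max_missing` threshold of 0.4 is a hypothetical value chosen for illustration, not one used in this notebook.

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the data set.
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

max_missing = 0.4  # hypothetical cutoff: drop columns with more than 40% missing

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Columns above the cutoff, and the columns that remain.
sparse_cols = missing_frac[missing_frac > max_missing].index.tolist()
kept_cols = [c for c in toy.columns if c not in sparse_cols]

print(sparse_cols)
print(kept_cols)
```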

In [None]:
# Collect the distinct observed values of each categorical column, in the same order as cat_cols
cat_categories = []
for c in cat_cols:
    cat_categories.append(list(set(dat[c])))

After dropping the variables that were missing a substantial amount of data, we can go on to encode the categorical variables via one-hot encoding.

In [None]:
enc = OneHotEncoder(categories=cat_categories, handle_unknown='ignore')
enc.fit(dat[cat_cols])
enc.transform(dat[cat_cols]).toarray()
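To make the effect of `handle_unknown='ignore'` concrete, here is a small sketch on made-up data (the frames `train_toy` and `test_toy` are illustrative and not part of the notebook). A category that was never seen during fitting is encoded as a row of zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "IR3" appears only in the second frame.
train_toy = pd.DataFrame({"LotShape": ["Reg", "IR1", "Reg"]})
test_toy = pd.DataFrame({"LotShape": ["IR1", "IR3"]})

toy_enc = OneHotEncoder(handle_unknown="ignore")
toy_enc.fit(train_toy)

# One output column per category learned during fit.
encoded = toy_enc.transform(test_toy).toarray()

print(toy_enc.categories_)
print(encoded)
```

The first row has a 1 in the `IR1` column, while the unseen `IR3` row is all zeros, so the encoded test data always has the same number of columns as the training data.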

With the data encoded and the encoder fit to the training data it's time to impute the missing values of the data set. We'll use KNN imputation to fill in any missing values in the data set.

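In [None]:
# Combine the one-hot columns with the numeric columns
dat_imp = pd.concat([pd.DataFrame(enc.transform(dat[cat_cols]).toarray()), dat[num_cols]], axis=1)
dat_imp.columns = dat_imp.columns.astype(str)  # recent scikit-learn versions reject mixed int/str column names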

In [None]:
imputer = KNNImputer(n_neighbors=5)
x_train = imputer.fit_transform(dat_imp)
y_train = dat.SalePrice
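To illustrate what KNN imputation does, the sketch below runs `KNNImputer` on a tiny made-up array (not the notebook's data). Each missing entry is replaced by the mean of that feature over the nearest rows, where distance is computed from the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up array with two missing entries.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)

# The missing value in row 0 becomes the mean of its two nearest rows' values (3 and 5).
print(X_filled)
```

Because the distances are computed on the raw feature values, columns with a large numeric range may dominate the neighbor search when they sit next to 0/1 dummy columns, which is one reason standardizing numeric data can help.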

# Train the Model

There is usually a bit more to training a model than passing it a training set and letting it build. The default settings work reasonably well here, though, so we'll train a gradient boosted regressor on our newly transformed and imputed data set.

In [None]:
model = GradientBoostingRegressor()
model.fit(x_train, y_train)

In [None]:
plt.scatter(y_train, model.predict(x_train))
plt.xlabel('Observed Prices')
plt.ylabel('Predicted Prices')
plt.title('Observed vs Predicted Home Sale Prices')
plt.show()

In [None]:
model.score(x_train, y_train)

Overall, the model doesn't appear to have done too poorly at its task of predicting home prices, though these metrics are computed on the training set, so they are expected to look good and likely overstate how well the model will do on unseen data. With the model trained, let's move on to making predictions on the test data.
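One common way to get a less optimistic estimate is to hold out part of the training data and score the model on it. The sketch below is not part of this notebook's pipeline: it uses synthetic data from `make_regression` and hypothetical variable names, purely to show the pattern of comparing training and validation scores.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features and prices.
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

syn_model = GradientBoostingRegressor(random_state=0)
syn_model.fit(X_fit, y_fit)

# R^2 on the rows used for fitting versus the held-out rows.
train_r2 = syn_model.score(X_fit, y_fit)
val_r2 = syn_model.score(X_val, y_val)

print(f"train R^2: {train_r2:.3f}")
print(f"validation R^2: {val_r2:.3f}")
```

A noticeable gap between the two scores is a sign that the training score alone paints too rosy a picture.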

# Making Predictions

The test data set needs to be prepared exactly as the training data set was, with the same columns in the same order, for the model to work properly. We'll reuse the encoder and the imputer fit above to format the test data.

In [None]:
test_dat = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

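In [None]:
# Encode with the already-fit encoder, then impute with the already-fit imputer
test_imp = pd.concat([pd.DataFrame(enc.transform(test_dat[cat_cols]).toarray()), test_dat[num_cols]], axis=1)
test_imp.columns = test_imp.columns.astype(str)  # keep all column names the same type
x_test = imputer.transform(test_imp)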
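The fit-on-train, transform-on-test steps can also be bundled so that they are harder to get out of sync. The following is a sketch of that idea using scikit-learn's `ColumnTransformer` and `Pipeline` on made-up data. It is an alternative arrangement, not what this notebook does, and the frames and the `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test frames (illustrative values only).
train_toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg", "IR1", "Reg", "IR1"],
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0, 14260.0, 14115.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
test_toy = pd.DataFrame({
    "LotShape": ["Reg", "IR3"],  # "IR3" is unseen in training
    "LotArea": [np.nan, 10084.0],
})

toy_cat_cols = ["LotShape"]
toy_num_cols = ["LotArea"]
features = toy_cat_cols + toy_num_cols

# One-hot encode the categorical columns and pass the numeric ones through unchanged.
prep = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
        ("num", "passthrough", toy_num_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array, which KNNImputer needs
)

# Encoding, imputation, and the model are fit together and reused together.
pipe = Pipeline([
    ("prep", prep),
    ("impute", KNNImputer(n_neighbors=2)),
    ("model", GradientBoostingRegressor(random_state=0)),
])

pipe.fit(train_toy[features], train_toy["SalePrice"])
toy_preds = pipe.predict(test_toy[features])

print(toy_preds)
```

With this arrangement, a single `predict` call applies the same encoder and imputer that were fit on the training rows.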

With the test data formatted, we can now generate the output for a Kaggle submission. To do this, we'll use pandas to build a dataframe with each `Id` and its corresponding `SalePrice` prediction.

In [None]:
output = pd.DataFrame({'Id': test_dat.Id, 'SalePrice': model.predict(x_test)})

In [None]:
output.head()

In [None]:
#output.to_csv('my_submission.csv', index=False)