# House prices

## Todo

* Finish feature importance graph
    * Convert to DMatrix to maintain feature names :/
    * Remove one-hot encoded features if they add too much noise?
* Remove any features for which fewer than 50% (?) of rows have values
* Feature scaling
* Try `seaborn` graphs
* Better analysis methods
* (?) Use built-in pipeline CV rather than splitting manually
    * Hope this cuts down code complexity (slightly)
    * What graphs does this allow me to make?
* Merging train & validation data for final model seems to make the final model worse, not better! Why?
* Back up this kernel on GitHub
    * Use the [Kaggle API](https://github.com/Kaggle/kaggle-api) tool to edit locally & run against remote images
    * Script to pull all my kernels and back them up? Want to save the notebooks and the datasets ideally (maybe not the data if it's huge)
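
For the Todo item about dropping sparsely populated features, a minimal sketch with pandas. The 50% threshold is the tentative figure from the list above; `drop_sparse_columns` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, min_fraction=0.5):
    """Keep only columns where at least `min_fraction` of rows are non-missing."""
    filled_fraction = frame.notna().mean()  # per-column share of non-missing rows
    return frame.loc[:, filled_fraction >= min_fraction]


# Toy data (hypothetical values): "PoolQC" is mostly missing, "Alley" is half filled
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
})
kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

Running this before one-hot encoding would probably be simplest, since the missing values are still visible per original column at that point.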

In [None]:
# Imports

import warnings

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from xgboost.sklearn import XGBRegressor

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style="white", font_scale=0.9)

In [None]:
# Load data

random_state = 3
initial_data = pd.read_csv("../input/train.csv")
test_data = pd.read_csv("../input/test.csv")

# Align (unencoded) categoricals 
category_names = ['MSSubClass', 'MSZoning', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
       'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
for category_name in category_names:
    # Build one shared dtype from the values seen in either dataset.
    # (Passing categories positionally to astype() does not set them.)
    all_data = pd.concat([initial_data[category_name], test_data[category_name]])
    category_dtype = pd.CategoricalDtype(categories=all_data.dropna().unique())
    initial_data[category_name] = initial_data[category_name].astype(category_dtype)
    test_data[category_name] = test_data[category_name].astype(category_dtype)

# One hot encode categorical columns
# Need to join data so all categories are encoded the same way in both datasets
initial_data_length = len(initial_data.index)
data = pd.concat([initial_data, test_data], sort=True)
data = pd.get_dummies(data)
initial_data = data[:initial_data_length]
test_data = data[initial_data_length:]

# Set up features
static_column_names_to_drop = ['Id', 'SalePrice']
encoded_categorical_column_names_to_drop = [x for x in data.columns.values if x.startswith(("SaleType", "SaleCondition"))]
predictor_names = data.columns.drop(static_column_names_to_drop + encoded_categorical_column_names_to_drop).values
X_initial = initial_data[predictor_names]
X_test = test_data[predictor_names]

# Set up target
target_name = "SalePrice"
y_initial = np.log(initial_data[target_name])

# Split datasets
X_train, X_validate, y_train, y_validate = train_test_split(X_initial, y_initial, random_state=random_state)
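
The category alignment and joint one-hot encoding in the cell above exist because a value that appears in only one dataset would otherwise produce mismatched columns. A small sketch of the problem and one possible fix, assuming `pd.CategoricalDtype` as the way to share categories; the two toy frames are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"]})  # "Grvl" never appears here

# Encoding each frame on its own yields different column sets
separate_train_columns = list(pd.get_dummies(train).columns)
separate_test_columns = list(pd.get_dummies(test).columns)
print(separate_train_columns, separate_test_columns)

# Sharing one categorical dtype makes the encoded columns line up
all_values = pd.concat([train["Street"], test["Street"]]).dropna().unique()
shared = pd.CategoricalDtype(categories=all_values)
train_encoded = pd.get_dummies(train.astype({"Street": shared}))
test_encoded = pd.get_dummies(test.astype({"Street": shared}))
print(list(train_encoded.columns), list(test_encoded.columns))
```

With the shared dtype, the test frame gets a `Street_Grvl` column of zeros rather than no column at all, so a model fitted on the training columns can score it directly.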

In [None]:
# Examine data/features

print("Features")
print(predictor_names)
print("\n")

print(X_train.head())
print(X_train.describe())
print("\n")

print(y_train.head())
print(y_train.describe())
print("\n")

In [None]:
# Build pipeline components

imputer = SimpleImputer()
def createEstimator(params=None):
    # Fresh XGBRegressor with the shared random_state plus any overrides
    return XGBRegressor(random_state=random_state, **(params or {}))

In [None]:
# # Find optimal parameters (only needs running once for each model change)

# fixedParams = {
#     "n_estimators": 550,
#     "max_depth": 3,
# #     "early_stopping_rounds": 1,
# #     "learning_rate": 0.1,
#     "eval_set": [(X_validate, y_validate)],
# }

# base_pipeline = Pipeline([("imputer", imputer), ("estimator", createEstimator(fixedParams))])

# param_grid = {
# #     "estimator__n_estimators": [300, 500, 550, 600, 800],
#     "estimator__early_stopping_rounds": [1, 2, 3],
#     "estimator__learning_rate": [0.08, 0.1, 0.12],
# #     "estimator__max_depth": [1, 3, 5, 10],
# }
# grid = GridSearchCV(base_pipeline, param_grid, n_jobs=-1, cv=5)
# grid.fit(X_train, y_train)

# pipeline = grid.best_estimator_

# print(pipeline.get_params().keys())
# print(grid.best_score_)
# print(grid.best_params_)

In [None]:
# Quick model training (only run once optimal parameters have been found)

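params = {
    "n_estimators": 550,
    "max_depth": 3,
    # Early stopping only has an effect when an eval_set is passed at fit time,
    # and recent xgboost versions may raise an error without one, so it is disabled here.
#     "early_stopping_rounds": 1,
    "learning_rate": 0.1,
}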
pipeline = Pipeline([("imputer", imputer), ("estimator", createEstimator(params))])
pipeline.fit(X_train, y_train)

In [None]:
# Analyse

predictions = pipeline.predict(X_validate)

readable_predictions = np.exp(predictions)
readable_y_validate = np.exp(y_validate)

from sklearn.metrics import mean_squared_log_error
rmsle = np.sqrt(mean_squared_log_error(readable_y_validate, readable_predictions))
print("Root Mean Squared Log Error")
print(rmsle)
print("")

print("First 5")
print([float(x) for x in readable_predictions[:5]])
print([float(x) for x in readable_y_validate[:5]])
print("")

print("Last 5")
print([float(x) for x in readable_predictions[-5:]])
print([float(x) for x in readable_y_validate[-5:]])
print("")
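
Because the model is trained on `np.log(SalePrice)`, the RMSLE reported above should be very close to a plain RMSE on the log scale. A minimal NumPy sketch of that relationship, assuming the `log(1 + x)` form that scikit-learn's `mean_squared_log_error` uses; the `rmsle` helper and the prices are hypothetical, not taken from the dataset.

```python
import numpy as np


def rmsle(actual, predicted):
    """Root mean squared log error, using log(1 + x) on both inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))


# Hypothetical prices for illustration only
actual_prices = np.array([200000.0, 150000.0, 320000.0])
predicted_prices = np.array([210000.0, 140000.0, 300000.0])

score = rmsle(actual_prices, predicted_prices)

# RMSE computed directly on log prices, the scale the model is trained on
log_rmse = float(np.sqrt(np.mean((np.log(predicted_prices) - np.log(actual_prices)) ** 2)))
print(score, log_rmse)
```

At house-price magnitudes the extra `+ 1` barely changes the logarithm, so the two numbers agree to several decimal places.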

In [None]:
# Graph: Actual vs predicted prices
warnings.filterwarnings("ignore")

prices_joint_grid = sns.jointplot(x=readable_y_validate, y=readable_predictions, kind="reg", height=20);
prices_joint_grid.set_axis_labels('Actual prices ($)', 'Predicted prices ($)')

warnings.resetwarnings()

In [None]:
# Graph: Feature importances
from xgboost import plot_importance

# Column names are lost in the pipeline somewhere, so re-add them to the plot
# Assumption here is that column order is preserved even when the names are lost :fingers_crossed:
column_names = X_train.columns.values
label_map = {"f{0}".format(index): name for index, name in enumerate(column_names)}
fscore = pipeline.named_steps['estimator'].get_booster().get_fscore()
named_values = {label_map[k]: v for k, v in fscore.items()}

importance_fig, importance_ax = plt.subplots()
plot_importance(named_values, ax=importance_ax, color='red')
importance_fig.set_size_inches((20, 50))
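
The renaming step above can be tried in isolation. A small sketch with hypothetical column names and scores standing in for `X_train.columns` and the booster's `get_fscore()` output; it relies on the same assumption as the cell, namely that column order survives the pipeline.

```python
# Hypothetical stand-ins for X_train.columns and get_fscore()
column_names = ["LotArea", "YearBuilt", "GrLivArea"]
fscore = {"f0": 12, "f2": 40}  # features never used in a split are absent

# "f0", "f1", ... are the default names when the booster sees unnamed columns
label_map = {"f{0}".format(index): name for index, name in enumerate(column_names)}
named_values = {label_map[key]: value for key, value in fscore.items()}
print(named_values)
```

Only the keys present in `fscore` are looked up, so unused features simply do not show up in the plot.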

Merging train & validation data for final model seems to make the final model worse, not better! Why?
```python
X = pd.concat([X_train, X_validate])
y = pd.concat([y_train, y_validate])
pipeline.fit(X, y)
```

In [None]:
# Submit
test_predictions = pipeline.predict(X_test)
submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': np.exp(test_predictions)})
submission.to_csv('submission.csv', index=False)

## Notes

* Don't need to manually split CV data with `StratifiedKFold`, `KFold` or similar; `GridSearchCV` does it automatically (defaults to 5-fold since scikit-learn 0.22, 3-fold before that).
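
A minimal sketch of letting `GridSearchCV` handle the splitting for a whole pipeline, with `cv=5` set explicitly. It uses synthetic data, and scikit-learn's `GradientBoostingRegressor` is swapped in for `XGBRegressor` purely to keep the example dependency-light; the grid values are illustrative, not tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with a few missing values for the imputer to fill
rng = np.random.RandomState(3)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)
X[rng.rand(120, 4) < 0.05] = np.nan

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("estimator", GradientBoostingRegressor(random_state=3)),
])
param_grid = {
    "estimator__max_depth": [1, 3],
    "estimator__learning_rate": [0.08, 0.1],
}

# Each parameter combination is scored on 5 train/validation splits made internally
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)  # mean R^2 across the folds for the best combination
```

Because the imputer sits inside the pipeline, it is refitted on each training fold, which avoids leaking validation-fold statistics into the imputation.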

## References

* [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
* [Mean squared log error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html)