# Predicting Housing Prices

## Getting the data:

In [134]:
import pandas as pd
housing = pd.read_csv("C:/Users/Nicoo/Documents/GitHub/Kaggle-Notebooks/data/housing_train.csv", index_col='Id')

Remove rows with a missing target and separate the target from the features:

In [135]:
housing = housing.dropna(axis=0, subset=['SalePrice'])
y = housing.SalePrice
housing = housing.drop(['SalePrice'], axis=1)

**Categorical Features:**

Select all columns with a non-numerical dtype (`object`) as categorical columns:

In [136]:
categorical = [column for column in housing.columns if housing[column].dtype == "object"]
len(categorical)

43

A good way to handle categorical data is one-hot encoding (OHE), which replaces each categorical column with one binary column per category.<br>Let's see how many columns OHE will produce:

In [137]:
# Use a name that does not shadow the built-in sum()
total_categories = 0
for column in categorical:
    total_categories += housing[column].nunique()
total_categories

252

There are 252 categories in total across the 43 categorical columns, so one-hot encoding replaces those 43 columns with 252 binary columns, a net increase of 252 - 43 = **209** columns.
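To make the column arithmetic concrete, here is a minimal sketch on toy data. The `Color` and `Size` columns are hypothetical and not part of the housing data, and `pd.get_dummies` is used only as a quick stand-in for a one-hot encoder:

```python
import pandas as pd

# Hypothetical toy data: two categorical columns with 3 and 2 distinct categories
toy = pd.DataFrame({
    "Color": ["red", "green", "blue", "green"],
    "Size": ["S", "L", "S", "L"],
})

# Total number of categories across all categorical columns
n_categories = sum(toy[column].nunique() for column in toy.columns)

# One-hot encode: each original column is replaced by one column per category
encoded = pd.get_dummies(toy)

# Net increase = encoded columns minus the original columns they replace
net_new_columns = encoded.shape[1] - toy.shape[1]
print(n_categories, encoded.shape[1], net_new_columns)
```

The same subtraction is what gives 209 for the housing data.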

**Numerical Features:**

Selecting numerical columns as well:

In [138]:
numerical = [column for column in housing.columns if housing[column].dtype in ['int64', 'float64']]
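As a possible alternative to comparing dtypes by name, pandas offers `select_dtypes`. The sketch below uses a small made-up frame (the column names and values are illustrative assumptions, not read from the housing file). One difference worth noting: `include="number"` also picks up numeric dtypes other than `int64` and `float64`.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl"],
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
})

# Everything that is not numeric is treated as categorical
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()
# All numeric dtypes (int and float of any width) are treated as numerical
toy_numerical = toy.select_dtypes(include="number").columns.tolist()

print(toy_categorical, toy_numerical)
```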

**All columns:**

In [139]:
selected = categorical + numerical

## Data Preprocessing:

An easy way to combine many preprocessing steps is to use a Pipeline. The steps in this Pipeline will first impute the missing data and then one-hot encode the categorical features.<br>
Numerical features will be imputed with the median values, while categorical features will be imputed with the most frequent values.
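Here is a small sketch of what the two imputation strategies do, using made-up values (the column names and numbers are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing value
toy_num = pd.DataFrame({"LotArea": [8000.0, np.nan, 10000.0, 12000.0]})
toy_cat = pd.DataFrame({"Street": ["Pave", "Pave", np.nan, "Grvl"]}, dtype=object)

# Median strategy: the gap is filled with the median of the observed values
num_filled = SimpleImputer(strategy="median").fit_transform(toy_num)

# Most-frequent strategy: the gap is filled with the most common category
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy_cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for skewed columns.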

**Setting up Pipeline with One Hot Encoder:**

In [140]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

numerical_transformer = SimpleImputer(strategy='median')
# Impute first, then one-hot encode. scikit-learn >= 1.2 calls the dense-output
# parameter `sparse_output`; on older versions use `sparse=False` instead.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical),
    ('cat', categorical_transformer, categorical),
])
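To see what a preprocessor of this shape produces, here is a minimal sketch on toy data. The `Area` and `Roof` columns are hypothetical, and `sparse_threshold=0` is set here only so the result is always a dense array regardless of the scikit-learn version:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data with one missing numerical value
train = pd.DataFrame({
    "Area": [50.0, np.nan, 70.0],
    "Roof": ["Flat", "Gable", "Flat"],
})
# New row with a category ("Dome") that was never seen during fitting
new = pd.DataFrame({"Area": [60.0], "Roof": ["Dome"]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Area"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Roof"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: imputed Area, then one column per Roof category seen in training
train_out = toy_preprocessor.fit_transform(train)
# With handle_unknown="ignore", the unseen category is encoded as all zeros
new_out = toy_preprocessor.transform(new)

print(train_out)
print(new_out)
```

Encoding unseen categories as all zeros is what lets the fitted preprocessor handle test data containing categories that never appeared in the training split.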

**Train-Test-Split:**

In [141]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(housing, y, test_size=0.27, random_state=42)
X_train = X_train[selected]
X_test = X_test[selected]

## Setting up Full Pipeline with XGBoost

In [142]:
# Set up the model with XGBRegressor
from xgboost import XGBRegressor
xgb = XGBRegressor(random_state=42, n_estimators=1500, learning_rate=0.03)

In [143]:
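# Combine preprocessing and model into the final Pipeline.
# clone() gives each pipeline its own unfitted copies, so fitting one does not refit the other.
from sklearn.base import clone
full_pipeline_test = Pipeline(steps=[('preprocessor', clone(preprocessor)), ('model', clone(xgb))])
full_pipeline_final = Pipeline(steps=[('preprocessor', clone(preprocessor)), ('model', clone(xgb))])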
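A Pipeline stores references to the step objects it is given rather than copies of them. The sketch below uses a plain `SimpleImputer` as a stand-in step (an assumption made to keep the example small) to show the difference between sharing an estimator and using `sklearn.base.clone`:

```python
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

shared = SimpleImputer(strategy="median")

# Two pipelines built from the same instance hold the very same object
pipe_a = Pipeline(steps=[("imputer", shared)])
pipe_b = Pipeline(steps=[("imputer", shared)])
same_object = pipe_a.named_steps["imputer"] is pipe_b.named_steps["imputer"]

# clone() returns a new, unfitted estimator with identical parameters
pipe_c = Pipeline(steps=[("imputer", clone(shared))])
independent = pipe_c.named_steps["imputer"] is not shared

print(same_object, independent)
```

With shared steps, fitting one pipeline overwrites the fitted state seen by the other, which is easy to miss when one pipeline is meant for evaluation and the other for the final fit.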

## Testing Pipeline:

In [144]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

full_pipeline_test.fit(X_train, y_train)

y_pred = full_pipeline_test.predict(X_test)

print(f"MAE: {mean_absolute_error(y_test, y_pred)}")
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2: {r2_score(y_test, y_pred)}")

MAE: 16001.102966772152
MSE: 734671820.4923061
R2: 0.890619015530036


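An R2 of about 0.89 means the model explains roughly 89% of the variance in sale prices on the held-out test set, with a mean absolute error of about 16,000.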
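The MSE is in squared price units, which makes it hard to read. Taking the square root gives the RMSE, which is on the same scale as `SalePrice`. A quick sketch using the MSE value printed above:

```python
import math

# MSE as reported in the test output above
mse = 734671820.4923061

# RMSE is the square root of the MSE, in the same units as the target
rmse = math.sqrt(mse)
print(round(rmse))
```

This comes out to roughly 27,100, noticeably higher than the MAE, which suggests that a few predictions have large errors.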

## Final Predictions:

In [145]:
X_pred = pd.read_csv("C:/Users/Nicoo/Documents/GitHub/Kaggle-Notebooks/data/housing_test.csv", index_col='Id')
X_pred = X_pred[selected]
full_pipeline_final.fit(housing[selected], y)
test_predictions = full_pipeline_final.predict(X_pred)

In [146]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_pred.index,
                       'SalePrice': test_predictions})
output.to_csv('simple_pipeline.csv', index=False)
output

|  | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 124658.000000 |
| 1 | 1462 | 160156.140625 |
| 2 | 1463 | 184922.687500 |
| 3 | 1464 | 192026.468750 |
| 4 | 1465 | 187671.937500 |
| ... | ... | ... |
| 1454 | 2915 | 82611.632812 |
| 1455 | 2916 | 83011.257812 |
| 1456 | 2917 | 163022.562500 |
| 1457 | 2918 | 112538.343750 |
