Imports

In [1]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline

Import the Iowa data and define the target (y) vs. rest (X) of the data which could be used for building the model to predict y.

In [2]:
main_file_path = './data/' # this is the path to the Iowa data that you will use
data = pd.read_csv(main_file_path + 'train.csv') # load the training dataset
data.dropna(axis=0, subset=['SalePrice'], inplace=True) # if a row is missing the sale price, drop it

# Read the test data
test = pd.read_csv(main_file_path + 'test.csv') # read in the Kaggle competition evaluation data

y = data.SalePrice # define the target
X = data.drop(['SalePrice'], axis=1) # get rid of the target from list of features

Create one hot encoded categorical variables

In [3]:
one_hot_encoded_training = pd.get_dummies(X) # perform one hot encoding on training data

one_hot_encoded_test = pd.get_dummies(test) # also perform it on final evaluation data

final_train, final_test = one_hot_encoded_training.align(one_hot_encoded_test, join='inner', axis=1) # ensure that the columns are synced between the two datasets


Split the data into training and test data.

In [4]:
train_X, val_X, train_y, val_y = train_test_split(final_train, y, random_state = 0)

Create an Imputer -> XGBRegressor pipeline. The values were determined by some manual tuning. In the future I would hope to use a more rigorous approach as I did before with the RandomForest in order to tune the hyperparameters.

The pipeline is generated, trained and then used to predict on the test set.

In [19]:
pipeline = make_pipeline(Imputer(), XGBRegressor(n_estimators=1000, learning_rate=0.08, early_stopping_rounds=8, eval_set=[(val_X, val_y)], verbose=False))
pipeline.fit(train_X, train_y)

y_preds = pipeline.predict(val_X)

output_score = mean_absolute_error(val_y, y_preds)
print(output_score)

15577.478788527398


Submission generation code

In [20]:
# Use the model to make predictions
predicted_prices = pipeline.predict(final_test)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

[123090.33  170380.52  190820.97  ... 145931.64  108412.266 226818.34 ]
