<h1> This notebook covers every step from importing packages and loading data to creating a submission .csv file </h1>
<p>The goal here is to show a basic example script that runs reasonably fast. Therefore, non-numeric columns are dropped and the RandomForest is fitted with a relatively low number of estimators. </p>
<ol>
<li> Packages are imported</li>
<li> Configuration is set </li>
<li> Data is loaded </li>
<li> Data is prepared for model  </li>
<li> RandomForestRegressor is trained </li>
<li> Results are evaluated </li>
<li> Prediction submission file is made </li>
</ol>

<h2> 1 | Packages </h2>

In [None]:
import pandas as pd
import numpy as np
import os
import re
import random
import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

<h2> 2 | Configuration </h2>

In [None]:
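random.seed(3)  #Seed Python's built-in random module
np.random.seed(3)  #Seed NumPy's global generator, which scikit-learn uses when no random_state is given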
pd.options.mode.chained_assignment = None  #To turn off specific warnings
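<p> A minimal sketch of why the choice of seed matters here: Python's <code>random</code> module and NumPy keep separate generator states, and scikit-learn draws from NumPy's global generator when no <code>random_state</code> is passed. The variable names below are illustrative only. </p>

```python
import random
import numpy as np

# Seeding Python's random module does not reset NumPy's generator,
# so two draws after the same random.seed() call still differ.
random.seed(3)
draw_a = np.random.rand(3)
random.seed(3)
draw_b = np.random.rand(3)
python_seed_repeats = np.array_equal(draw_a, draw_b)

# Seeding NumPy itself makes the draws repeat exactly.
np.random.seed(3)
draw_c = np.random.rand(3)
np.random.seed(3)
draw_d = np.random.rand(3)
numpy_seed_repeats = np.array_equal(draw_c, draw_d)

print(python_seed_repeats, numpy_seed_repeats)
```

<p> An alternative that avoids global state is to pass <code>random_state</code> directly to the estimator. </p>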

<h2> 3 | Load Data </h2>


In [None]:
train = pd.read_csv(r"../input/train_2016_v2.csv")   #The parcelid's with their outcomes
props = pd.read_csv(r"../input/properties_2016.csv")  #The properties dataset
samp = pd.read_csv(r"../input/sample_submission.csv")  #The parcelid's for the testset

<h2> 4 | Prepare Data </h2>

In [None]:
props = props.select_dtypes(exclude=[object])  #For this example, we take only numerical data, since strings require more processing
props.fillna(-1,inplace=True)  #Fill missing data so we can run the model
train = train.loc[:,['parcelid','logerror']].merge(props,how='left',left_on='parcelid',right_on='parcelid')
train_x = train.drop(['parcelid','logerror'],axis=1,inplace=False)
train_y = train['logerror']

test = samp.loc[:,['ParcelId']].merge(props,how='left',left_on='ParcelId',right_on='parcelid')
test_x = test.drop(['ParcelId','parcelid'],axis=1,inplace=False)
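<p> Because missing values are filled before the merge, any ID without a match in the properties table would still produce NaN rows after a left join. The sketch below, on made-up toy tables (not the competition data), shows one way to count such rows with the <code>indicator</code> option of <code>merge</code>. </p>

```python
import pandas as pd

# Toy stand-ins for props and samp; the values are invented for illustration.
props_demo = pd.DataFrame({"parcelid": [1, 2, 3],
                           "bedroomcnt": [2.0, None, 4.0]})
samp_demo = pd.DataFrame({"ParcelId": [1, 2, 4]})

# Same preparation step as above: fill gaps in the properties table first.
props_demo = props_demo.fillna(-1)

# indicator=True adds a "_merge" column telling where each row came from.
merged_demo = samp_demo.merge(props_demo, how="left",
                              left_on="ParcelId", right_on="parcelid",
                              indicator=True)

# Rows marked "left_only" had no matching parcelid and carry NaN features.
n_unmatched = int((merged_demo["_merge"] == "left_only").sum())
n_nan_features = int(merged_demo["bedroomcnt"].isna().sum())
print(n_unmatched, n_nan_features)
```

<p> If the count is non-zero, filling the merged frame again (or filling after the merge instead of before) would cover those rows. </p>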

<h2> 5 | Fit RandomForestRegressor </h2>

In [None]:
parameters = {'n_estimators':[5,10,15],'n_jobs':[-1],'oob_score':[False]}  # this can be extended
model = RandomForestRegressor()
grid = GridSearchCV(model,param_grid=parameters,scoring='neg_mean_absolute_error',cv=3)  
grid.fit(train_x,train_y)
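<p> As a rough sketch of how the parameter grid could be extended, the example below searches over <code>max_depth</code> and <code>min_samples_leaf</code> as well. It runs on small synthetic data, and the chosen parameters and values are assumptions for illustration, not tuned recommendations. </p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (stand-in for train_x / train_y).
rng = np.random.RandomState(3)
X_demo = rng.rand(120, 4)
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Every combination of these values is evaluated: 2 * 2 * 2 = 8 candidates.
param_grid_demo = {
    "n_estimators": [5, 10],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Fixed settings go on the estimator itself rather than into the grid.
grid_demo = GridSearchCV(
    RandomForestRegressor(random_state=3, n_jobs=1),
    param_grid=param_grid_demo,
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

n_candidates = len(grid_demo.cv_results_["params"])
print(n_candidates, grid_demo.best_params_)
```

<p> The number of fits grows multiplicatively with each added parameter, so larger grids quickly become slow on the full dataset. </p>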

<h2> 6 | Evaluate </h2>
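<p> The cross-validation table shows the test scores. Because the scoring is <code>neg_mean_absolute_error</code>, the scores are negative and values closer to zero are better. The bar chart shows the 20 most important features. </p>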

In [None]:
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[["param_n_estimators","mean_test_score","std_test_score"]])

feat_imps = grid.best_estimator_.feature_importances_
fi = pd.DataFrame.from_dict({'feat':train_x.columns,'imp':feat_imps})
fi.set_index('feat',inplace=True,drop=True)
fi = fi.sort_values('imp',ascending=False)
fi.head(20).plot.bar()
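<p> To read the scores as ordinary mean absolute errors, the sign can be flipped. The sketch below uses an invented results table with the same column names; the numbers are placeholders, not results from this notebook. </p>

```python
import pandas as pd

# Placeholder values shaped like the cv_results table above.
cv_demo = pd.DataFrame({
    "param_n_estimators": [5, 10, 15],
    "mean_test_score": [-0.080, -0.075, -0.073],
    "std_test_score": [0.002, 0.002, 0.001],
})

# neg_mean_absolute_error is the MAE multiplied by -1, so negate it back.
cv_demo["mean_mae"] = -cv_demo["mean_test_score"]

# The best candidate is the one with the smallest MAE.
best_row = cv_demo.loc[cv_demo["mean_mae"].idxmin()]
print(best_row["param_n_estimators"], best_row["mean_mae"])
```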

<h2> 7 | Predict and Make Submission File </h2>
<p> The submission file name includes a datetime stamp so that it is not overwritten by later submission files and it remains clear when the file was generated. For now, we assume the prediction is the same for each month. </p>

In [None]:
test_y = grid.predict(test_x)
test_y = pd.DataFrame(test_y)
test_y[1] = test_y[0]
test_y[2] = test_y[0]
test_y[3] = test_y[0]
test_y[4] = test_y[0]
test_y[5] = test_y[0]  #For simplicity make identical predictions for all months
test_y.columns = ["201610","201611","201612","201710","201711","201712"]
submission = test_y.copy()
submission["parcelid"] = samp["ParcelId"].copy()
cols = ["parcelid","201610","201611","201612","201710","201711","201712"]
submission = submission[cols]
filename = "Prediction_" + str(submission.columns[0]) + re.sub("[^0-9]", "",str(datetime.datetime.now())) + '.csv'
print(filename)
submission.to_csv(filename,index=False)
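<p> The same table layout can be built more compactly. The sketch below is one possible alternative, shown on made-up predictions and IDs; the names ending in <code>_demo</code> are hypothetical. </p>

```python
import numpy as np
import pandas as pd

months = ["201610", "201611", "201612", "201710", "201711", "201712"]

# Made-up stand-ins for grid.predict(test_x) and samp["ParcelId"].
pred_demo = np.array([0.01, -0.02, 0.03])
ids_demo = pd.Series([11, 12, 13], name="ParcelId")

# One column per month, each holding the same prediction vector.
submission_demo = pd.DataFrame({month: pred_demo for month in months})

# Put the ID column first, matching the column order used above.
submission_demo.insert(0, "parcelid", ids_demo.values)
print(submission_demo)
```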
