## In this kernel we use a random forest to predict house prices.  

There are 6 parts to this kernel:

1. Import the libraries and data 
1. Prepare the data 
1. Train a decision tree
1. Train the random forest 
1. Understand the forest 
1. Create predictions

## 1a. Import the libraries we are going to use
Here we need two full libraries:
**numpy** (linear algebra and mathematics) and **pandas** (data manipulation and i/o)

We also need some parts of **sklearn**: in particular the DecisionTreeRegressor, the RandomForestRegressor and the preprocessing module.

It is good practice to import only the parts you need from sklearn, as it is quite a big library.

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor # import the random forest model
from sklearn import  preprocessing # used for label encoding and imputing NaNs

import datetime as dt # we will need this to convert the date to a number of days since some point

from sklearn.tree import export_graphviz
# import pydotplus
import six

import matplotlib.pyplot as plt
%matplotlib inline

## 1b. Next we import the data

In [None]:
df_train = pd.read_csv('../input/train.csv', parse_dates=['timestamp'])
df_test = pd.read_csv('../input/test.csv', parse_dates=['timestamp'])
df_macro = pd.read_csv('../input/macro.csv', parse_dates=['timestamp'])

df_train.head()

##  2. Data preparation

 - Create a vector containing the ids for our predictions
 - Create a vector of the target variables in the training set
 - Create a joint train and test set so that data wrangling is quicker and consistent across both
 - Remove the id (could it be a useful source of leakage?)
 - Convert the date into a number (of days since some point)
 - Deal with categorical variables
 - Deal with missing values

In [None]:
# Create a vector containing the ids for our predictions
id_test = df_test.id

# Create a vector of the target variables in the training set
# Transform the target variable so that the loss function is correct (i.e. RMSE on the transformed target gives RMSLE)
# ylog1p_train will be log(1+y), as suggested by https://github.com/dmlc/xgboost/issues/446#issuecomment-135555130
ylog1p_train = np.log1p(df_train['price_doc'].values)
df_train = df_train.drop(["price_doc"], axis=1)

# Create joint train and test set to make data wrangling quicker and consistent on train and test
df_train["trainOrTest"] = "train"
df_test["trainOrTest"] = "test"
df_all = pd.concat([df_train, df_test])

# Removing the id (could it be a useful source of leakage?)
df_all = df_all.drop("id", axis=1)
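
As a small check of the comment about the loss function, the sketch below uses made-up prices (not the competition data) to show that RMSE computed on log1p-transformed values is the same number as RMSLE on the original values. The helper name `rmsle` is hypothetical and only used for this illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original price scale."""
    return np.sqrt(np.mean((np.log(1 + y_pred) - np.log(1 + y_true)) ** 2))

# Made-up prices for illustration only
y_true = np.array([1_000_000.0, 5_500_000.0, 12_000_000.0])
y_pred = np.array([1_200_000.0, 5_000_000.0, 11_000_000.0])

# Plain RMSE, but computed on the log1p-transformed values
rmse_on_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle(y_true, y_pred), rmse_on_log)  # the two numbers agree
```

This is why the model can be trained with an ordinary squared-error criterion on ylog1p_train.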

In [None]:
# Convert the date into a number (of days since some point)
fromDate = min(df_all['timestamp'])
df_all['timedelta'] = (df_all['timestamp'] - fromDate).dt.days.astype(int)
print(df_all[['timestamp', 'timedelta']].head())
df_all.drop('timestamp', axis = 1, inplace = True)
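
To make the date conversion concrete, here is a minimal sketch on a toy frame with made-up dates. It follows the same idea as the cell above (subtract the earliest date, keep the number of days); the names `toy` and `from_date` are assumptions for this example.

```python
import pandas as pd

# Toy frame with made-up dates
toy = pd.DataFrame({'timestamp': pd.to_datetime(['2011-08-20', '2011-08-21', '2011-09-20'])})

# Days elapsed since the earliest date in the column
from_date = toy['timestamp'].min()
toy['timedelta'] = (toy['timestamp'] - from_date).dt.days

print(toy)
```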

### Encoding categorical features
We will take a naive approach and assign a numeric value to each category of every categorical feature in our training and test sets. 
Sklearn's preprocessing module has a tool called LabelEncoder() which can do just that for us. 

In [None]:
for c in df_all.columns:
    if df_all[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df_all[c].values)) 
        df_all[c] = lbl.transform(list(df_all[c].values))
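
The sketch below shows what LabelEncoder does on a small made-up list. The classes are stored in sorted order and each value is replaced by the position of its class, which is why the trainOrTest column ends up with 'test' as 0 and 'train' as 1.

```python
from sklearn import preprocessing

# Made-up categorical values
values = ['train', 'test', 'train', 'test']

lbl = preprocessing.LabelEncoder()
encoded = lbl.fit_transform(values)

print(list(lbl.classes_))  # classes are stored in sorted order
print(list(encoded))       # each value is replaced by the index of its class
```

Note that the integers carry an arbitrary ordering of the categories, which is part of why the approach is called naive.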

### Addressing problems with NaN in the data

As we saw from our EDA there were quite a lot of NaN in the data. Our model won't know what to do with these so we need to replace them with something sensible.

There are quite a few options we can use: the mean, the median, the most frequent value, or a constant such as 0. Playing with these will give different results; for now I have it set to use the most frequent value (the mode).

 This uses the mode of the column in which the missing value is located. 

In [None]:
# Create a list of columns that have missing values and an index (True / False)
df_missing = df_all.isnull().sum(axis = 0).reset_index()
df_missing.columns = ['column_name', 'missing_count']
idx_ = df_missing['missing_count'] > 0
df_missing = df_missing.loc[idx_]  # .ix has been removed from pandas; .loc accepts the boolean mask
cols_missing = df_missing.column_name.values
idx_cols_missing = df_all.columns.isin(cols_missing)

In [None]:
# Instantiate an imputer
# (preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')

# Fit the imputer using all of our data (but not any dates)
imputer.fit(df_all.loc[:, idx_cols_missing])

# Apply the imputer
df_all.loc[:, idx_cols_missing] = imputer.transform(df_all.loc[:, idx_cols_missing])

In [None]:
# See the results - note how all missing are replaced with the mode
df_all.head()
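
To see how the choice of imputation strategy changes the filled-in value, the sketch below applies three strategies to one made-up column. It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`; the toy numbers are not taken from the competition data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column whose last entry is missing
col = np.array([[1.0], [1.0], [2.0], [8.0], [np.nan]])

filled = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that replaced the NaN in the last row
    filled[strategy] = imputer.fit_transform(col)[-1, 0]

print(filled)
```

On a skewed column like this one, the three strategies give noticeably different replacements, so it is worth trying more than one.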

In [None]:
# Prepare separate train and test datasets
# The label encoder sorted the classes alphabetically, so 'test' became 0 and 'train' became 1
idx_train = df_all['trainOrTest'] == 1
idx_test = df_all['trainOrTest'] == 0

x_train = df_all[idx_train]
x_test = df_all[idx_test]
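
The combine-then-split pattern used in this section can be sketched on a toy example. This is only an illustration with made-up columns: it keeps the marker as a string and drops it after splitting, which is one possible variation rather than what the cell above does.

```python
import pandas as pd

# Made-up train and test frames, each tagged with a marker column
toy_train = pd.DataFrame({'area': [30, 45], 'trainOrTest': 'train'})
toy_test = pd.DataFrame({'area': [60], 'trainOrTest': 'test'})

# Combine so that any cleaning step is applied identically to both
toy_all = pd.concat([toy_train, toy_test])

# Split back using the marker, then drop it so it is not used as a feature
back_train = toy_all[toy_all['trainOrTest'] == 'train'].drop(columns='trainOrTest')
back_test = toy_all[toy_all['trainOrTest'] == 'test'].drop(columns='trainOrTest')

print(len(back_train), len(back_test))
```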

## The three step process below is common across many sklearn models

**Step 1:** We create an object of the type of model we want to fit (we have called this object "Model"). In this case we are dealing with a regression problem and want to fit a random forest, so we choose RandomForestRegressor. "Model" is then an instance of RandomForestRegressor.

**Step 2:** We train the model. We do this with our x and y training data. Remember that the y training data (ylog1p_train) is just the quantity we would like to predict: in this instance the log-transformed price, price_doc. The x_train data is the information we are going to use to make that prediction. 

**Step 3:** Once we have fit the model we can use it to make a prediction. We do this by calling Model.predict. We are looking to predict the house prices for our test data, so we pass the test data to the predict method and assign the result to ylog_pred. This will contain our predicted set of (log-transformed) house prices. 
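
The three steps can be sketched end to end on synthetic data. Everything here (the random data and the names `X_demo`, `y_demo`, `demo_model`) is made up for illustration and is separate from the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends on the first feature plus noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=10, random_state=0)  # Step 1: instantiate
demo_model.fit(X_demo, y_demo)                                       # Step 2: train
demo_pred = demo_model.predict(X_demo)                               # Step 3: predict

print(demo_pred.shape)
```

Swapping RandomForestRegressor for another sklearn regressor leaves steps 2 and 3 unchanged.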

## 3. Let's practise on a simple decision tree

In [None]:
# Step 1: Instantiate a decision tree regressor
# Choose a depth for the tree - something like 3, 4 or 5 - not too large
Model = DecisionTreeRegressor(max_depth = 3)

In [None]:
# Step 2: Train the tree
# The .fit method takes two main arguments, the features (in our case x_train) and 
# the target variable (in our case ylog1p_train)
# Fill them in below and submit the code to train the tree
Model.fit(X = x_train, y = ylog1p_train)

In [None]:
# Step 3: Make predictions 
# The predict method takes one main argument - the examples for which
# we want to predict the target variable.  Here we will use the training data 
# itself i.e. x_train.  Fill this in below
ylog_pred = Model.predict(X = x_train)

In [None]:
# Check the training error

# Is the training error a reasonable estimate of how this tree will perform on unseen data?
np.sqrt(np.mean((ylog_pred - ylog1p_train)**2))
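
One way to explore the question in the comment above is to hold back part of the data and compare errors. The sketch below does this on synthetic data with a fully grown tree (no depth limit, unlike the max_depth = 3 tree above) so that the gap is easy to see; the data and the helper `rmse` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(500, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.3, size=500)

# Hold back 30% of the rows as a validation set
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(random_state=0)  # no depth limit, so it can memorise the noise
tree.fit(X_fit, y_fit)

def rmse(y_true, y_hat):
    return np.sqrt(np.mean((y_hat - y_true) ** 2))

train_rmse = rmse(y_fit, tree.predict(X_fit))
val_rmse = rmse(y_val, tree.predict(X_val))
print(train_rmse, val_rmse)
```

The training error is far lower than the validation error here, which suggests that training error alone is an optimistic estimate.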

In [None]:
# !!DO NOT RUN!!
# The code below will not work in this kernel because pydotplus is not available
# Now plot the tree 
dotfile = six.StringIO()

export_graphviz(Model, 
                out_file = dotfile, 
                max_depth = 3,
                feature_names = x_train.columns,
                filled = True,
                rounded = True)

pydotplus.graph_from_dot_data(dotfile.getvalue()).write_png('DTR.png')

## 4. Now a random forest

The parameter n_estimators below sets the number of trees we would like in our forest. The first time you run this kernel, we suggest you use something small, between 10 and 50, just to check that the run time is not too slow. If the run time is reasonable you can increase it (to 100 or more, but not now!) in order to get better performance.

In [None]:
# Step 1: Instantiate a random forest regressor
Model = RandomForestRegressor(n_estimators = 30, 
                              random_state = 2017, 
                              oob_score = True, 
                              max_features = 20,
                              min_samples_leaf = 8)

In [None]:
# Step 2: Train the forest
# Again fill in X and y below with x_train and ylog1p_train
Model.fit(X = x_train, y = ylog1p_train)

In [None]:
# Step 3: Make predictions 
# Create predictions for the examples in x_train
ylog_pred = Model.predict(X = x_train)

### Check the performance of the random forest

In [None]:
# Check the training error
np.sqrt(np.mean((ylog_pred - ylog1p_train)**2)) # about 0.37 (if you use 100 trees)

The training error looks pretty good. But it is measured only on the training data, so it does not tell us how well this forest does on data it has not seen. Well, actually we can find out. Each tree is grown on a bootstrap sample drawn with replacement, so only about 2/3 of the distinct training examples (roughly 63%) are used for any one tree. We can use the remaining "out-of-bag" (OOB) examples to estimate performance on unseen data.
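
The "about 2/3" figure can be checked with a quick simulation. The sketch below draws one bootstrap sample of row indices and counts how many distinct rows it contains; the sample size and seed are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.RandomState(2017)
n = 10_000

# Draw n row indices with replacement, as a bootstrap sample does
sample = rng.randint(0, n, size=n)

# Fraction of distinct rows that made it into the sample
in_bag_fraction = len(np.unique(sample)) / n

print(in_bag_fraction)   # close to the theoretical value below
print(1 - np.exp(-1))    # limit of 1 - (1 - 1/n)^n for large n
```

The rows left out of the sample are the out-of-bag examples for that tree.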

In [None]:

np.sqrt(np.mean((Model.oob_prediction_ - ylog1p_train)**2)) # 0.47 slightly better than a simple tree.
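
For comparison, here is a minimal sketch on synthetic data that computes both the training error and the OOB error from the same fitted forest. The data and names (`X_demo`, `y_demo`, `forest`) are made up; only `oob_score = True` and the `oob_prediction_` attribute mirror the cells above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic noisy data
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Error on rows the trees have seen versus rows each tree has not seen
train_rmse = np.sqrt(np.mean((forest.predict(X_demo) - y_demo) ** 2))
oob_rmse = np.sqrt(np.mean((forest.oob_prediction_ - y_demo) ** 2))

print(train_rmse, oob_rmse)
```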

## 5a. What do these trees look like?

In [None]:
# !!DO NOT RUN!!
# The code below will not work in this kernel because pydotplus is not available
for idx in range(3):
    dotfile = six.StringIO()
    
    export_graphviz(Model.estimators_[idx], 
                    out_file = dotfile, 
                    max_depth = 2,
                    feature_names = x_train.columns)
    
    pydotplus.graph_from_dot_data(dotfile.getvalue()).write_png('dtree'+ str(idx) + '.png')

## 5b. Actual versus expected

In [None]:
fig, ax = plt.subplots()
plt.scatter(Model.oob_prediction_, ylog1p_train)
x = np.linspace(*ax.get_xlim())
ax.plot(x, x, color = 'black')
plt.show()

## 5c. Variable importance

In [None]:
# Create a dataframe of the variable importances
# (the importances are in the same order as the columns the model was trained on)
df_ = pd.DataFrame(x_train.columns, columns = ['feature'])
df_['fscore'] = Model.feature_importances_

In [None]:
# Plot the relative importance of the top 10 features
df_['fscore'] = df_['fscore'] / df_['fscore'].max()
# Keep the 10 largest scores, then sort ascending so the largest bar is drawn at the top
df_ = df_.sort_values('fscore', ascending = False).head(10).sort_values('fscore', ascending = True)
df_.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('Random forest feature importance', fontsize = 24)
plt.xlabel('')
plt.ylabel('')
plt.xticks([], [])
plt.yticks(fontsize=20)
plt.show()
#plt.gcf().savefig('feature_importance_xgb.png')
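
To get a feel for what feature_importances_ contains, the sketch below fits a small forest on synthetic data where only the first feature drives the target. The data and names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only column 0 influences the target
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.05, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_
print(importances)        # one value per feature, in column order
print(importances.sum())  # the values are normalised to sum to 1
```

Dividing by the maximum, as the plot above does, only rescales the values so that the top feature has a score of 1.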

## 6. Create the predictions

In [None]:
# Create the predictions

ylog_pred = Model.predict(x_test)
y_pred = np.expm1(ylog_pred) # exp(x) - 1, the inverse of the np.log1p transform applied to the target
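
Because the model was trained on log(1 + price), predictions have to be mapped back to the price scale. The short sketch below, on made-up prices, shows that np.expm1 undoes np.log1p.

```python
import numpy as np

# Made-up prices for illustration only
prices = np.array([0.0, 990_000.0, 7_100_000.0])

# Forward transform used for the target, then the inverse
round_trip = np.expm1(np.log1p(prices))

print(round_trip)
```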

### Output the data to CSV for submission
Finally we take the id_test vector we created earlier and combine it with our predictions (y_pred) to create our CSV for output. 

We use a pandas DataFrame to do this; its "to_csv" method writes the file out.

In [None]:
output = pd.DataFrame({'id': id_test, 'price_doc': y_pred})

output.to_csv('RandomForest_2.csv', index=False)