In [4]:
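import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Melbourne housing data and drop every row with a missing value
melbourne_data = pd.read_csv('../data/melb_data.csv')
melbourne_data = melbourne_data.dropna(axis=0)

# Prediction target and input features
# ('Lattitude' and 'Longtitude' are spelled this way in the dataset)
y = melbourne_data.Price
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
            'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[features]

# Hold out part of the data for validation
train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=0)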

## Introduction

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction comes from historical data for only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data. <br>
<b>So finding the right balance is hard.</b> Many models, however, use clever ideas that can lead to better predictions. Here we'll look at the random forest as an example.
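The trade-off described above can be seen in a small experiment. The sketch below is only an illustration: it uses synthetic data from `make_regression` rather than the Melbourne dataset, and the names `X_demo`, `y_demo`, and `tree_errors` are made up for this example. It fits trees of different sizes and prints the training and validation error for each.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; this is not the Melbourne dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=7,
                                 noise=20.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# max_leaf_nodes=None lets the tree grow until every leaf is pure
tree_errors = {}
for max_leaf_nodes in [5, 50, 500, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    valid_mae = mean_absolute_error(va_y, tree.predict(va_X))
    tree_errors[max_leaf_nodes] = (train_mae, valid_mae)
    print(f"max_leaf_nodes={max_leaf_nodes}: "
          f"train MAE={train_mae:.1f}, validation MAE={valid_mae:.1f}")
```

A training error that keeps shrinking while the validation error stops improving is the usual sign of overfitting.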

<b>How does a random forest work, and why is it more accurate?</b> The random forest uses many trees and makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.
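The averaging step can be checked by hand. The sketch below assumes synthetic data from `make_regression` and a small forest of 10 trees (both choices are for illustration only). It compares the mean of the individual tree predictions, taken from the fitted model's `estimators_` attribute, with the output of the forest's own `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; this is not the Melbourne dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=7,
                                 noise=10.0, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=10, random_state=1)
demo_forest.fit(X_demo, y_demo)

# Predict the first five rows with each component tree, then average by hand
sample = X_demo[:5]
per_tree = np.stack([tree.predict(sample) for tree in demo_forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The hand-computed average should agree with the forest's prediction
forest_prediction = demo_forest.predict(sample)
print(np.allclose(manual_average, forest_prediction))
```

Each tree is trained on a different bootstrap sample of the rows, so their individual errors partly cancel when averaged.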

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Fit a random forest with default parameters (fixed seed for reproducibility)
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)

# Evaluate on the held-out validation data
predictions = forest_model.predict(valid_X)
print(mean_absolute_error(valid_y, predictions))

191669.7536453626


There is still room for improvement, but this is already a big improvement over the best decision tree error of 250,000. There are parameters that let you change the performance of the random forest, much as we changed the maximum depth of the single decision tree. But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
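As a rough sketch of what such tuning could look like, the example below varies two `RandomForestRegressor` parameters, `n_estimators` and `max_depth`, and compares the validation error. It runs on synthetic data from `make_regression`, not the Melbourne dataset, and the parameter values tried are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the Melbourne dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=7,
                                 noise=20.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# Try a few combinations of forest size and tree depth
forest_errors = {}
for n_estimators in [10, 100]:
    for max_depth in [3, None]:
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=1)
        model.fit(tr_X, tr_y)
        mae = mean_absolute_error(va_y, model.predict(va_X))
        forest_errors[(n_estimators, max_depth)] = mae
        print(f"n_estimators={n_estimators}, max_depth={max_depth}: "
              f"validation MAE={mae:.1f}")
```

On real data, the same loop could be run with `train_X`, `valid_X`, `train_y`, and `valid_y` from the cells above in place of the synthetic split.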