## Random Forests

A random forest builds many decision trees and makes a prediction by averaging the predictions of its component trees. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.
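
To make the averaging concrete, here is a minimal sketch on synthetic data (the dataset and the names `X_demo`, `y_demo`, and `demo_forest` are assumptions for illustration, not part of the housing example). It predicts with each fitted tree in scikit-learn's `estimators_` list, averages the results, and compares them with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small forest keeps the example fast
demo_forest = RandomForestRegressor(n_estimators=10, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Predict with every component tree, then average across trees
per_tree_preds = np.array([tree.predict(X_demo) for tree in demo_forest.estimators_])
averaged_preds = per_tree_preds.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction
forest_preds = demo_forest.predict(X_demo)
print(np.allclose(averaged_preds, forest_preds))
```

Each tree is trained on a different bootstrap sample of the rows, so individual trees disagree; averaging them smooths out much of that variation.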

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# File path:
melbourne_file_path = 'C:/Users/Willians/Desktop/Python/ML/Kaggle_ML_Data-Exercises/Melbourne_housing_FULL.csv'

# Read the data into a DataFrame named melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

# Drop rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)

# Assign X (features -> columns) and y (Dependent Variable)
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

# Split data into training and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We build a random forest model the same way we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))



186838.26687668767
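
The number above is a mean absolute error (MAE): the average size of the gap between predicted and actual prices. As a rough sketch of what `mean_absolute_error` computes, the block below reproduces it by hand on three made-up prices (the values are invented for illustration and are not taken from the Melbourne data).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only
actual = np.array([200000.0, 350000.0, 500000.0])
predicted = np.array([210000.0, 330000.0, 560000.0])

# MAE is the mean of the absolute differences between actual and predicted values
manual_mae = np.mean(np.abs(actual - predicted))

# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

The absolute errors here are 10,000, 20,000, and 60,000, so both calculations give an MAE of 30,000.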


## Conclusion

There is likely room for further improvement, but the MAE of about 187,000 is already a big improvement over the best decision tree error of 250,000. Some parameters let you change the performance of the random forest, much as we changed the maximum depth of the single decision tree. But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
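
The kind of tuning mentioned above can be sketched as a simple sweep over one parameter. The example below is an illustration under stated assumptions: it uses synthetic data rather than the housing data, and the helper `forest_mae` and the candidate values are hypothetical choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

def forest_mae(max_leaf_nodes):
    """Fit a small forest with the given leaf limit and return its validation MAE."""
    model = RandomForestRegressor(n_estimators=20, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# None means no limit on leaves, which is the scikit-learn default
candidates = [5, 25, 100, None]
scores = {c: forest_mae(c) for c in candidates}

for c, mae in scores.items():
    print(f"max_leaf_nodes={c}: validation MAE = {mae:,.1f}")
```

The value with the lowest validation MAE would be the one to keep. On real data the default is often already close to the best result.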

You'll soon learn the XGBoost model, which can perform better when well tuned, though finding the right parameters takes some skill.

## Iowa Data

In [12]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load the data:
iowa_file_path = 'C:/Users/Willians/Desktop/Python/ML/Kaggle_ML_Data-Exercises/iowa_houseprices.csv'
home_data = pd.read_csv(iowa_file_path)

# Assign the target (y) and the features (X) for regression
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into training and validation data sets:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify the model:
iowa_model = RandomForestRegressor(random_state=1)

# Fit the model:
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Refit with max_leaf_nodes=100 for comparison (the output below shows a slightly higher MAE than the default)
rf_iowa_model = RandomForestRegressor(max_leaf_nodes=100, random_state=1)
rf_iowa_model.fit(train_X, train_y)
rf_val_predictions = rf_iowa_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(rf_val_mae))

Validation MAE when not specifying max_leaf_nodes: 22,762
Validation MAE for best value of max_leaf_nodes: 22,838




Link to the course result:
https://www.kaggle.com/ianbernardino/exercise-machine-learning-competitions/notebook?scriptVersionId=18171762