# kaggle - Learn: Intro to Machine Learning
- https://www.kaggle.com/learn/intro-to-machine-learning
## 6. Random Forests
- Using a more sophisticated machine learning algorithm.
  (Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting.)

### Random Forests Model 
- A random forest uses many trees and makes a prediction by averaging the predictions of each component tree.
- It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.
> If you keep modeling, you can learn more models with even better performance, *but many of those are sensitive to getting the right parameters*.
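
The averaging described above can be checked directly. The following is a minimal sketch on synthetic data (the dataset generated by `make_regression` and all variable names are assumptions for illustration, not part of the course). It fits a small forest and compares its prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the course dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=20, random_state=1)
demo_forest.fit(X_demo, y_demo)

# Prediction of the whole forest for the first five rows
forest_pred = demo_forest.predict(X_demo[:5])

# Prediction of each component tree, then the mean across trees
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in demo_forest.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

# Reports whether both ways of computing the prediction agree
print(np.allclose(forest_pred, mean_of_trees))
```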

### Example
At the beginning of the analysis we load the dataset, and at the end we have:
- train_X (training features), val_X (validation features)
- train_y (training target), val_y (validation target, also known as y_true, as opposed to y_pred)
    - Use train_X and train_y (the training data) to .fit the model.
    - Use val_X to .predict the unknown target.
    - Use val_y to calculate the MAE: (val_y, val_pred), that is, (y_true, y_pred).
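
The MAE mentioned in the last bullet is the mean of the absolute differences between y_true and y_pred. A small sketch with made-up numbers (the values below are assumptions for illustration only) shows that computing it by hand matches `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE by hand: mean of |y_true - y_pred|
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same metric via scikit-learn; the argument order is (y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```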

Let's load the data:

In [35]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('min_melb_data.csv')
df.dropna(inplace=True)

y = df.Price
X = df[['Rooms', 'Bathroom', 'Landsize', 'BuildingArea','YearBuilt', 'Lattitude', 'Longtitude']]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# display(train_X)
# print(train_y)
# display(val_X)
# print(val_y)


Now we build a random forest model much as we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)

val_pred = forest_model.predict(val_X)
mae = mean_absolute_error(val_y, val_pred)

print(f'MAE: {mae:,.2f}')

MAE: 214,229.71


### More
- There is likely room for further improvement, but this is a big improvement over the best decision tree error we saw previously.
- There are parameters that let you change the performance of the Random Forest, much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.
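
As a hedged illustration of such a parameter, the sketch below varies `n_estimators` (the number of trees; 100 is the default in recent scikit-learn versions) on synthetic data and records the validation MAE for each value. The data, the candidate values, and the variable names are assumptions chosen for the example, not results from the course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (not the course dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# Validation MAE for a few forest sizes: {n_estimators: mae}
mae_by_n_trees = {}
for n_trees in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    model.fit(tr_X, tr_y)
    mae_by_n_trees[n_trees] = mean_absolute_error(va_y, model.predict(va_X))

for n_trees, mae in mae_by_n_trees.items():
    print(f'n_estimators={n_trees}: MAE={mae:,.2f}')
```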

## Exercise: Random Forests
### 0 - import libraries, load the data, and prepare train and val features and targets

In [37]:
## All commented out because they were imported previously
# import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import mean_absolute_error

home_data = pd.read_csv('train.csv')

y = home_data['SalePrice']      # bracket indexing instead of dot notation
X = home_data[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

### 1.- make random_forest model, calculate forest_mae

In [38]:
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_y = forest_model.predict(val_X)
forest_mae = mean_absolute_error(val_y, forest_y)
print(f'MAE using RandomForest model: {forest_mae:,.2f}')

MAE using RandomForest model: 23,009.21



### 2.- make decision_tree model (default params), calculate dtree_mae

In [39]:
from sklearn.tree import DecisionTreeRegressor

dtree_model = DecisionTreeRegressor(random_state=1)
dtree_model.fit(train_X, train_y)
dtree_y = dtree_model.predict(val_X)
dtree_mae = mean_absolute_error(val_y, dtree_y)
print(f'MAE using DecisionTree model: {dtree_mae:,.2f}')

MAE using DecisionTree model: 32,966.45


### 3.- make decision_tree model (max_leaf_nodes param for best case), calculate dbtree_mae

In [42]:
# function to get mae with max_leaf_nodes value as a parameter:
def get_mae(t_X, v_X, t_y, v_y, mlnodes):
    model = DecisionTreeRegressor(max_leaf_nodes=mlnodes, random_state=0)
    model.fit(t_X, t_y)
    y_pred = model.predict(v_X)
    mae = mean_absolute_error(v_y, y_pred)
    return mae

resdic = dict()        # results dictionary {mlnodes1: mae1, mlnodes2: mae2, ..., mlnodesN: maeN}

for mlnds in range(2, 400):
    mae = get_mae(train_X, val_X, train_y, val_y, mlnds)
    resdic[mlnds] = mae

# key with the lowest MAE (the first one, if there is a tie)
mlnodes_dbtree = min(resdic, key=resdic.get)
dbtree_mae = resdic[mlnodes_dbtree]

print(f'min MAE: {dbtree_mae:,.2f}  <-- max_leaf_nodes = {mlnodes_dbtree}')


min MAE: 27,203.78  <-- max_leaf_nodes = 82


In [41]:
print('Model Validation (via MAE) COMPARATIVE')
print('--------------------------------------')
print(f'RandomForestRegressor __default: {forest_mae:,.2f}')
print(f'DecisionTreeRegressor __default: {dtree_mae:,.2f}')
print(f'DecisionTreeRegressor __best: {dbtree_mae:,.2f}  <-- max_leaf_nodes = {mlnodes_dbtree}')

Model Validation (via MAE) COMPARATIVE
--------------------------------------
RandomForestRegressor __default: 23,009.21
DecisionTreeRegressor __default: 32,966.45
DecisionTreeRegressor __best: 27,203.78  <-- max_leaf_nodes = 82
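
To put the comparison in relative terms, the following sketch computes by how much the random forest lowers the MAE with respect to each decision tree. The three MAE values are copied from the output printed above; `relative_reduction` is a hypothetical helper introduced only for this example.

```python
# MAE values copied from the comparison printed above
forest_mae = 23009.21
dtree_mae = 32966.45
dbtree_mae = 27203.78

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline

vs_default_tree = relative_reduction(dtree_mae, forest_mae)
vs_best_tree = relative_reduction(dbtree_mae, forest_mae)

print(f'Forest vs default tree: {vs_default_tree:.1f}% lower MAE')
print(f'Forest vs best tree:    {vs_best_tree:.1f}% lower MAE')
```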
