[Random Forests](https://www.kaggle.com/dansbecker/random-forests)

## Introduction

A deep tree with lots of leaves will overfit because each prediction comes from historical data for only the few houses at its leaf.

But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting.
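The tension can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: it uses synthetic data (not the housing data) and an arbitrary list of `max_leaf_nodes` values, and it compares training and validation error as the tree is allowed more leaves.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing data (an assumption for illustration):
# a noisy nonlinear target built from two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 50 * np.sin(X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 15, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
for max_leaf_nodes in [2, 20, 200, None]:  # None lets the tree grow fully
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, tree.predict(train_X))
    val_mae = mean_absolute_error(val_y, tree.predict(val_X))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"max_leaf_nodes={max_leaf_nodes}: "
          f"train MAE {train_mae:.1f}, validation MAE {val_mae:.1f}")
```

Training error keeps falling as leaves are added, while validation error typically stops improving at some point. That gap is the overfitting described above.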

However, many models have clever ideas that can lead to better performance.

We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.

It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.

If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
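The averaging step can be checked directly in scikit-learn. This is a minimal sketch on synthetic data (an assumption, not the housing data): it collects the prediction of each fitted tree from the forest's `estimators_` attribute and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X, y)

# Predictions of every fitted component tree, shape (n_trees, n_samples)
per_tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging across trees should reproduce the forest's own prediction
manual_average = per_tree_preds.mean(axis=0)
forest_preds = forest.predict(X[:5])
print(np.allclose(manual_average, forest_preds))
```

Each individual tree can still overfit, but the errors of different trees partly cancel out when averaged.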

## Example

You've already seen the code to load the data a few times.

At the end of data loading, we have the following variables:

* `train_X`  
* `val_X`  
* `train_y`  
* `val_y`

In [1]:
import pandas as pd
# The data is available at:
#  https://www.kaggle.com/dansbecker/melbourne-housing-snapshot    
# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data = pd.read_csv('melb_data.csv') 

## Data fields

Here is a brief version of what you'll find in the data description file for the Iowa (Ames) housing data used in the exercise below. The Melbourne data loaded above has different columns.

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.  
* MSSubClass: The building class  
* MSZoning: The general zoning classification  
* LotFrontage: Linear feet of street connected to property  
* LotArea: Lot size in square feet  
* Street: Type of road access  
* Alley: Type of alley access  
* LotShape: General shape of property  
* LandContour: Flatness of the property  
* Utilities: Type of utilities available  
* LotConfig: Lot configuration  
* LandSlope: Slope of property  
* Neighborhood: Physical locations within Ames city limits  
* Condition1: Proximity to main road or railroad  
* Condition2: Proximity to main road or railroad (if a second is present)  
* BldgType: Type of dwelling  
* HouseStyle: Style of dwelling  
* OverallQual: Overall material and finish quality  
* OverallCond: Overall condition rating  
* YearBuilt: Original construction date  
* YearRemodAdd: Remodel date  
* RoofStyle: Type of roof  
* RoofMatl: Roof material  
* Exterior1st: Exterior covering on house  
* Exterior2nd: Exterior covering on house (if more than one material)  
* MasVnrType: Masonry veneer type  
* MasVnrArea: Masonry veneer area in square feet  
* ExterQual: Exterior material quality  
* ExterCond: Present condition of the material on the exterior  
* Foundation: Type of foundation  
* BsmtQual: Height of the basement  
* BsmtCond: General condition of the basement  
* BsmtExposure: Walkout or garden level basement walls  
* BsmtFinType1: Quality of basement finished area  
* BsmtFinSF1: Type 1 finished square feet  
* BsmtFinType2: Quality of second finished area (if present)  
* BsmtFinSF2: Type 2 finished square feet  
* BsmtUnfSF: Unfinished square feet of basement area  
* TotalBsmtSF: Total square feet of basement area  
* Heating: Type of heating  
* HeatingQC: Heating quality and condition  
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level  
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

In [2]:
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)

In [3]:
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

Split the data into training and validation sets, for both the features and the target.

The split is based on a random number generator.

Supplying a numeric value to the `random_state` argument guarantees we get the same split every time we run this script.

In [4]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
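To see what a fixed `random_state` does, here is a small sketch on toy arrays (`X_demo` and `y_demo` are hypothetical names used only for this illustration). It calls `train_test_split` twice with the same seed and checks that both calls return the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays (hypothetical, for illustration only)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state produce identical splits
split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# With the default test_size, a quarter of the rows go to validation
train_X_demo, val_X_demo, train_y_demo, val_y_demo = split_a
print(train_X_demo.shape, val_X_demo.shape)
```

Without a fixed seed, each run could pick different validation rows, so error values would not be comparable between runs.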

We build a random forest model similarly to how we built a decision tree in scikit-learn, this time using the `RandomForestRegressor` class instead of `DecisionTreeRegressor`.

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

191669.7536453626


## Conclusion

There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000.

There are parameters that let you change the performance of the random forest, much as we changed the maximum depth of the single decision tree.

But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
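As a rough illustration of those parameters, the sketch below fits `RandomForestRegressor` with a few settings of `n_estimators`, `max_depth`, and `min_samples_leaf`. The data is synthetic and the parameter values are hypothetical, chosen only to show the knobs. They are not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the Melbourne data)
X, y = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Hypothetical settings chosen only to show the knobs, not tuned values
candidate_settings = [
    {},  # library defaults
    {"n_estimators": 50, "max_depth": 5},
    {"n_estimators": 200, "min_samples_leaf": 3},
]

tuned_maes = []
for params in candidate_settings:
    model = RandomForestRegressor(random_state=1, **params)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    tuned_maes.append(mae)
    print(params, round(mae, 1))
```

Comparing the printed validation errors against the default run gives a sense of how much, or how little, tuning changes the result on a given dataset.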

## Your Turn

Try [using a random forest model](https://www.kaggle.com/dansbecker/random-forests) yourself and see how much it improves your model.

## Exercise: Random Forests

In [6]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [7]:
# Path of the file to read
# iowa_file_path = '../input/home-data-for-ml-course/train.csv'
iowa_file_path = 'train.csv' 
home_data = pd.read_csv(iowa_file_path)

In [8]:
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

In [9]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [10]:
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)

In [11]:
# Fit Model
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(random_state=1)

In [12]:
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

Validation MAE when not specifying max_leaf_nodes: 29,653


In [13]:
# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

Validation MAE for best value of max_leaf_nodes: 27,283


## Exercises

Data science isn't always this easy.

But replacing the decision tree with a random forest is going to be an easy win.

## Step 1: Use a Random Forest

In [14]:
import pandas as pd 
from sklearn.metrics import mean_absolute_error 
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestRegressor

In [15]:
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state = 1)

In [16]:
# fit your model
home_data = pd.read_csv('train.csv')
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
y = home_data.SalePrice
train_X,test_X,train_y,test_y = train_test_split(X,y, random_state=1)
rf_model.fit(train_X,train_y) 

RandomForestRegressor(random_state=1)

In [17]:
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_predic = rf_model.predict(test_X)
rf_val_mae = mean_absolute_error(test_y, rf_predic)

print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))


Validation MAE for Random Forest Model: 21857.15912981083
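The fit, predict, and score steps repeated above for each model can be wrapped in one function. This is a sketch, not part of the original exercise: `validation_mae` is a hypothetical helper, and synthetic data stands in for `train.csv` so the snippet runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def validation_mae(model, train_X, val_X, train_y, val_y):
    """Fit the model on the training split and return its validation MAE."""
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic stand-in for the Iowa data (an assumption for illustration)
X, y = make_regression(n_samples=400, n_features=7, noise=25.0, random_state=0)
splits = train_test_split(X, y, random_state=1)

# train_test_split returns (train_X, val_X, train_y, val_y), matching the helper
tree_mae = validation_mae(DecisionTreeRegressor(random_state=1), *splits)
forest_mae = validation_mae(RandomForestRegressor(random_state=1), *splits)
print(f"Decision tree MAE: {tree_mae:,.0f}")
print(f"Random forest MAE: {forest_mae:,.0f}")
```

Using the same split for both models keeps the comparison fair, since both are scored on exactly the same validation rows.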
