# Regression Models: Random Forest Regression

### Importing libraries and functions:

Importing libraries

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

Defining helper functions

In [0]:
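# Scaling function: standardizes each feature to zero mean and unit variance.
# Note: a new StandardScaler is fitted on every call.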
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

### Exploring and preparing the **data**

Loading the dataset for our study. This is a difficult regression task in which the goal is to predict the burned area of forest fires in Portugal from meteorological data.
Source: [UCI](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/intelligentagents/aprendizagem-supervisionada/master/data/forestfires.csv')

Exploring the dataset:

In [4]:
# Exploring the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
X        517 non-null int64
Y        517 non-null int64
month    517 non-null object
day      517 non-null object
FFMC     517 non-null float64
DMC      517 non-null float64
DC       517 non-null float64
ISI      517 non-null float64
temp     517 non-null float64
RH       517 non-null int64
wind     517 non-null float64
rain     517 non-null float64
area     517 non-null float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB


In [5]:
# Viewing the summary statistics of the dataset's numeric columns
df.describe()

| | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 |
| mean | 4.669246 | 4.299807 | 90.644681 | 110.87234 | 547.940039 | 9.021663 | 18.889168 | 44.288201 | 4.017602 | 0.021663 | 12.847292 |
| std | 2.313778 | 1.2299 | 5.520111 | 64.046482 | 248.066192 | 4.559477 | 5.806625 | 16.317469 | 1.791653 | 0.295959 | 63.655818 |
| min | 1.0 | 2.0 | 18.7 | 1.1 | 7.9 | 0.0 | 2.2 | 15.0 | 0.4 | 0.0 | 0.0 |
| 25% | 3.0 | 4.0 | 90.2 | 68.6 | 437.7 | 6.5 | 15.5 | 33.0 | 2.7 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 91.6 | 108.3 | 664.2 | 8.4 | 19.3 | 42.0 | 4.0 | 0.0 | 0.52 |
| 75% | 7.0 | 5.0 | 92.9 | 142.4 | 713.9 | 10.8 | 22.8 | 53.0 | 4.9 | 0.0 | 6.57 |
| max | 9.0 | 9.0 | 96.2 | 291.3 | 860.6 | 56.1 | 33.3 | 100.0 | 9.4 | 6.4 | 1090.84 |


Viewing the dataset

In [6]:
df.head(10)

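| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |
| 5 | 8 | 6 | aug | sun | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 | 0.0 |
| 6 | 8 | 6 | aug | mon | 92.3 | 88.9 | 495.6 | 8.5 | 24.1 | 27 | 3.1 | 0.0 | 0.0 |
| 7 | 8 | 6 | aug | mon | 91.5 | 145.4 | 608.2 | 10.7 | 8.0 | 86 | 2.2 | 0.0 | 0.0 |
| 8 | 8 | 6 | sep | tue | 91.0 | 129.5 | 692.6 | 7.0 | 13.1 | 63 | 5.4 | 0.0 | 0.0 |
| 9 | 7 | 5 | sep | sat | 92.5 | 88.0 | 698.6 | 7.1 | 22.8 | 40 | 4.0 | 0.0 | 0.0 |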


Some columns contain categorical data, so we encode the values of the categorical variables as numeric values.

In [0]:
# Encode each categorical column as integer labels (classes are numbered in sorted order)
le = LabelEncoder()
df['month'] = le.fit_transform(df['month'])
df['day'] = le.fit_transform(df['day'])
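`LabelEncoder` numbers the distinct labels in sorted (alphabetical) order, which is why `aug` receives a smaller code than `mar`. The sketch below illustrates this on a small hypothetical sample and shows one possible alternative, an explicit mapping that preserves calendar order. The names `sample`, `calendar_order`, and `month_to_number` are assumptions introduced for illustration only and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical sample using the same month abbreviations as the dataset
sample = pd.DataFrame({"month": ["mar", "oct", "aug", "sep", "jan"]})

# LabelEncoder sorts the distinct labels before assigning integer codes
le_month = LabelEncoder()
sample["month_label"] = le_month.fit_transform(sample["month"])
label_classes = list(le_month.classes_)  # labels in the order they were numbered

# Alternative (assumption): explicit mapping that keeps calendar order, 1 = January
calendar_order = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
month_to_number = {name: i + 1 for i, name in enumerate(calendar_order)}
sample["month_calendar"] = sample["month"].map(month_to_number)

print(label_classes)
print(sample)
```

For a tree-based model either encoding can be split on, but the calendar mapping keeps neighbouring months numerically adjacent, which may make the splits easier to interpret.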

Viewing the data again after the transformations:

In [9]:
df.head(10)

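| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 7 | 0 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | 10 | 5 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | 10 | 2 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | 7 | 0 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | 7 | 3 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |
| 5 | 8 | 6 | 1 | 3 | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 | 0.0 |
| 6 | 8 | 6 | 1 | 1 | 92.3 | 88.9 | 495.6 | 8.5 | 24.1 | 27 | 3.1 | 0.0 | 0.0 |
| 7 | 8 | 6 | 1 | 1 | 91.5 | 145.4 | 608.2 | 10.7 | 8.0 | 86 | 2.2 | 0.0 | 0.0 |
| 8 | 8 | 6 | 11 | 5 | 91.0 | 129.5 | 692.6 | 7.0 | 13.1 | 63 | 5.4 | 0.0 | 0.0 |
| 9 | 7 | 5 | 11 | 2 | 92.5 | 88.0 | 698.6 | 7.1 | 22.8 | 40 | 4.0 | 0.0 | 0.0 |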


Defining the independent (`X`) and dependent (`y`) variables:

In [0]:
X = df.iloc[:,:12].values
y = df.iloc[:,12].values

Creating the training and test subsets:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
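As a quick sanity check on the split, the following sketch applies the same `test_size` and `random_state` to placeholder arrays with the dataset's shape (517 rows, 12 features) and prints the resulting subset sizes. The arrays `X_demo` and `y_demo` are stand-ins invented for this illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the dataset (517 rows, 12 features)
X_demo = np.zeros((517, 12))
y_demo = np.zeros(517)

# Same split parameters as in the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Row counts of the training and test subsets
print(X_tr.shape, X_te.shape)
```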

Standardizing the features:

In [0]:
X_train = feature_scaling(X_train)
X_test = feature_scaling(X_test)
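Because `feature_scaling` fits a new `StandardScaler` on each call, the test set above is standardized with its own mean and variance rather than those of the training set. A common alternative, sketched here on synthetic data, fits the scaler on the training set only and reuses it for the test set. The arrays `X_train_demo` and `X_test_demo` are hypothetical, and applying this change to the notebook would alter the results reported further down.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=2.0, size=(80, 3))
X_test_demo = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the train statistics

# The test set is scaled with the training mean and standard deviation
expected_test = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_scaled, expected_test))
```

This keeps both subsets on the same scale, so a given raw feature value maps to the same standardized value whether it appears in training or in testing.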

### Model Training and Validation Stage

Instantiating and training the regression model on the training set:

In [0]:
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)
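To make the role of `n_estimators` concrete, the sketch below trains a 10-tree forest on synthetic data and checks that the forest's prediction equals the average of the individual trees' predictions, read from the fitted `estimators_` attribute. The data and the names `X_demo`, `y_demo`, and `forest` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(100, 4))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Predictions of each fitted decision tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# The forest prediction is the mean over its trees
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, forest.predict(X_demo))
print(per_tree.shape, matches)
```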

Predicting the results on the test set:

In [0]:
y_pred = regressor.predict(X_test)

y_pred

array([2.61700000e+01, 1.18600000e+00, 7.66800000e+00, 0.00000000e+00,
       3.67800000e+00, 1.30360000e+01, 8.28900000e+00, 9.86000000e+00,
       2.04370000e+01, 5.27300000e+00, 5.68000000e+00, 1.20939333e+01,
       5.07500000e+00, 6.65400000e+00, 5.02900000e+00, 6.44300000e+00,
       2.74000000e+00, 2.21000000e+00, 1.12680000e+01, 1.12070000e+01,
       7.88000000e+00, 3.30670000e+01, 3.00006667e+01, 1.64950000e+01,
       2.99980000e+01, 3.38100000e+01, 5.71700000e+00, 1.34340000e+01,
       7.82200000e+00, 6.85000000e-01, 4.51500000e+00, 1.29410000e+01,
       4.44100000e+00, 1.65140000e+01, 2.37028000e+02, 1.93300000e+00,
       9.19800000e+00, 4.23700000e+00, 3.30600000e+01, 8.50390000e+01,
       1.20540000e+01, 8.82500000e+00, 6.22300000e+00, 1.04900000e+00,
       7.50900000e+00, 1.50510000e+01, 3.48626000e+02, 2.13000000e-01,
       1.05200000e+00, 1.68016667e+01, 2.88160000e+01, 7.97600000e+00,
       3.97680000e+01, 6.56000000e+00, 8.48700000e+00, 2.39500000e+00,
      

Evaluating the model with the R² metric:

In [0]:
regressor.score(X_test, y_test)

-0.13485407683254502
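A negative R² means the model's squared error on the test set is larger than that of a constant predictor that always outputs the mean of `y_test`. The sketch below computes R² by hand on a small made-up target vector to show both cases; the arrays and the helper `r2_manual` are hypothetical and serve only to illustrate the formula.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets with one large value, loosely mimicking a skewed target
y_true = np.array([0.0, 0.0, 1.0, 5.0, 50.0])
y_mean_pred = np.full_like(y_true, y_true.mean())  # constant baseline
y_bad_pred = np.array([10.0, 20.0, 30.0, 5.0, 0.0])  # worse than the baseline

def r2_manual(y, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

r2_baseline = r2_manual(y_true, y_mean_pred)  # zero for the mean predictor
r2_bad = r2_manual(y_true, y_bad_pred)        # negative: worse than the mean
print(r2_baseline, r2_bad, r2_score(y_true, y_bad_pred))
```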

Evaluating the model with the mean squared error (MSE); its square root gives the RMSE:

In [0]:
mean_squared_error(y_test, y_pred)

13717.25835636175
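`mean_squared_error` returns the MSE, so the RMSE requires one more step: taking the square root. For the value above that gives roughly 117, expressed in the same units as `area`. A minimal sketch on made-up arrays (`y_true_demo` and `y_pred_demo` are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values
y_true_demo = np.array([0.0, 2.0, 10.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 6.0, 7.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)  # mean of squared errors
rmse = np.sqrt(mse)                                 # back in the target's units
print(mse, rmse)
```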