# Random Forests Regression

* A random forest bases its prediction on the combined predictions of many decision trees.



* The observations for each tree are drawn by bootstrap sampling (random sampling with replacement). The independent variables are selected by the random subspace method.



* At each node of a decision tree, the best splitting variable is chosen from a randomly selected subset of the independent variables, not from all of them.


* Roughly 2/3 of the observations end up in each tree's bootstrap sample and are used to grow that tree. The remaining (out-of-bag) observations are used to evaluate the trees and to determine variable importance.



* For the final prediction, the estimates of the individual trees are combined. Some descriptions weight each tree by its previously calculated error rate; scikit-learn's `RandomForestRegressor`, used below, takes the plain unweighted average of the tree predictions.
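
The points above can be checked on a small synthetic data set. The sketch below is illustrative only: the data from `make_regression`, the sample sizes, and all variable names are assumptions and not part of the Hitters analysis. It shows that a bootstrap sample contains roughly 63% of the distinct observations (the "about 2/3" above), that scikit-learn can score the forest on the left-out observations via `oob_score=True`, and that `RandomForestRegressor.predict` returns the plain average of the individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the Hitters data).
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=10.0, random_state=0)

# 1) Bootstrap: draw n row indices with replacement and count distinct rows.
rng = np.random.default_rng(0)
n = X_demo.shape[0]
boot_idx = rng.integers(0, n, size=n)
in_bag_fraction = np.unique(boot_idx).size / n   # close to 1 - 1/e

# 2) Out-of-bag evaluation: each row is scored only by trees that
#    did not see it during training.
forest = RandomForestRegressor(n_estimators=200, max_features=3,
                               oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)
oob_r2 = forest.oob_score_

# 3) The forest prediction is the unweighted mean over its trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
same_as_forest = np.allclose(manual_mean, forest.predict(X_demo))

print(f"in-bag fraction: {in_bag_fraction:.3f}")
print(f"OOB R^2: {oob_r2:.3f}")
print(f"mean of trees equals forest.predict: {same_as_forest}")
```

The out-of-bag score gives an error estimate without touching a separate test set, which is why it is often reported alongside the test error.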

## 1-) Data Preprocessing

In [1]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
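hit = pd.read_csv("Hitters.csv")
df = hit.copy()
df = df.dropna()  # drop rows with missing values
# One-hot encode the three categorical columns
dms = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df["Salary"]  # target variable
# Numeric predictors only
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Keep one dummy column per categorical variable to avoid redundant columns
X = pd.concat([X_, dms[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)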
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)
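
The cell above builds all dummy columns and then keeps one per categorical variable by hand. As a possible shortcut, `pd.get_dummies` can do this in one step with `drop_first=True`. The sketch below uses a hypothetical four-row frame that only mimics the three categorical columns; the rows and the `toy` and `encoded` names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows that mimic the three categorical columns.
toy = pd.DataFrame({
    "League": ["A", "N", "A", "N"],
    "Division": ["E", "W", "W", "E"],
    "NewLeague": ["A", "N", "N", "A"],
    "Hits": [81, 130, 141, 87],
})

# drop_first=True drops the first level of each categorical column,
# leaving one 0/1 indicator per two-level variable.
encoded = pd.get_dummies(toy, columns=["League", "Division", "NewLeague"],
                         drop_first=True, dtype="float64")
print(encoded.columns.tolist())
```

Which level is dropped depends on the sorted order of the categories, so it is worth checking the resulting column names before modelling.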

## 2-) Model

In [3]:
from sklearn.ensemble import RandomForestRegressor

In [4]:
rf_model = RandomForestRegressor(random_state = 42)

In [5]:
rf_model.fit(X_train, y_train)

RandomForestRegressor(random_state=42)

## 3-) Prediction

In [9]:
from sklearn.metrics import mean_squared_error

In [10]:
y_pred = rf_model.predict(X_test)

In [11]:
test_error_before = np.sqrt(mean_squared_error(y_test, y_pred))
test_error_before  # RMSE on the test set before model tuning

345.00286717448006
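
The value above is the root mean squared error (RMSE): the square root of the average squared difference between observed and predicted values, expressed in the same units as `Salary`. As a quick check of the formula, the sketch below computes it by hand with NumPy on made-up numbers and compares the result with the `np.sqrt(mean_squared_error(...))` expression used in the cell. The two arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values, for illustration only.
y_true_demo = np.array([500.0, 750.0, 1200.0, 300.0])
y_pred_demo = np.array([450.0, 800.0, 1000.0, 350.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a single large miss (here the third observation) dominates the result.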

## 4-) Model Tuning

* In this section, we look for the best values of **max_depth**, **max_features**, and **n_estimators** with the GridSearchCV method.

* GridSearchCV: grid search with cross-validation.

* We then build the final model with the best values found.

* **max_depth**, **max_features**, and **n_estimators** are hyperparameters: they are not learned from the data, so we have to set them ourselves.

* Instead of relying on intuition to pick these values, we let the grid search find them.

* **max_depth** ==>> Maximum depth of each tree

* **max_features** ==>> Number of independent variables considered when searching for the best split at each node

* **n_estimators** ==>> Number of trees in the forest
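
Before choosing a grid it can help to see what the untuned model uses. A minimal sketch, assuming a reasonably recent scikit-learn: `get_params()` lists the current hyperparameter values of an estimator. The default for `max_features` has changed between scikit-learn versions, so it is printed here rather than assumed.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameter values of an untuned forest, as a plain dict.
defaults = RandomForestRegressor(random_state=42).get_params()

for name in ("max_depth", "max_features", "n_estimators"):
    print(name, "=", defaults[name])
```

A `max_depth` of `None` means the trees are grown until the leaves cannot be split further, which is the behaviour of the untuned model in section 2.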

In [12]:
from sklearn.model_selection import GridSearchCV

In [13]:
rf_params = {'max_depth': list(range(1,10)),
            'max_features': [3,5,10,15],
            'n_estimators' : [100, 200, 500, 1000, 2000]}


In [15]:
rf_model = RandomForestRegressor(random_state = 42)

In [16]:
rf_cv_model = GridSearchCV(rf_model,
                           rf_params,
                           cv=10,
                           n_jobs=-1)

* Grid search tries every combination of the listed parameter values and keeps the combination with the best cross-validated score.

* This process takes a long time. Setting the **n_jobs=-1** parameter shortens it by running the fits in parallel on all available CPU cores.
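
To see why the search is slow, the number of model fits can be counted before running it. The sketch below uses scikit-learn's `ParameterGrid` on the same `rf_params` dictionary and multiplies by the 10 folds; the variable names `n_combinations` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

rf_params = {'max_depth': list(range(1, 10)),
             'max_features': [3, 5, 10, 15],
             'n_estimators': [100, 200, 500, 1000, 2000]}

# Every combination of the listed values: 9 * 4 * 5.
n_combinations = len(ParameterGrid(rf_params))

# Each combination is fitted once per cross-validation fold (cv = 10).
n_fits = n_combinations * 10

print(n_combinations, n_fits)
```

With the default `refit=True`, one more fit on the full training set follows the search. Trimming the largest `n_estimators` values is usually the cheapest way to shorten it.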

In [17]:
rf_cv_model.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'max_features': [3, 5, 10, 15],
                         'n_estimators': [100, 200, 500, 1000, 2000]})

In [18]:
rf_cv_model.best_params_

{'max_depth': 8, 'max_features': 3, 'n_estimators': 100}
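
Two details of `GridSearchCV` are worth keeping in mind when reading this result. With no `scoring` argument, a regressor is ranked by its default `score` method, which is R², not RMSE. With the default `refit=True`, the best combination is refit on the whole training set and exposed as `best_estimator_`, so it does not have to be rebuilt by hand. The sketch below shows both on synthetic data with a deliberately tiny grid; the data, grid values, and variable names are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only for illustration (not the Hitters data).
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# A small hypothetical grid so the example runs quickly.
small_grid = {"max_depth": [2, 4], "max_features": [3, 5]}

# Rank candidates by RMSE instead of the default R^2.
# scikit-learn maximizes scores, hence the "neg_" prefix.
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      small_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(-search.best_score_)            # mean cross-validated RMSE

# The refit model is ready to use; no need to rebuild it manually.
best_model = search.best_estimator_
test_pred = best_model.predict(X_te)
```

Choosing the scoring metric to match the metric reported on the test set keeps the tuning and the final evaluation consistent.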

### 4.1) Tuned Model

In [22]:
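# Best values found by the grid search. No random_state is set here,
# so the fitted forest and its test error can vary between runs.
rf_tuned = RandomForestRegressor(max_depth=8,
                                 max_features=3,
                                 n_estimators=100)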

In [23]:
rf_tuned.fit(X_train, y_train)

RandomForestRegressor(max_depth=8, max_features=3)

In [24]:
y_pred1 = rf_tuned.predict(X_test)

In [25]:
test_error_after = np.sqrt(mean_squared_error(y_test, y_pred1))
test_error_after  # RMSE on the test set after model tuning

337.9853190716199
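
To put the two test errors side by side, the sketch below computes the absolute and relative change from the values printed above. Because `rf_tuned` was created without `random_state`, the tuned figure can differ from run to run, so the comparison is indicative only.

```python
# RMSE values copied from the outputs above.
test_error_before = 345.00286717448006   # untuned model, section 3
test_error_after = 337.9853190716199     # tuned model, section 4.1

abs_change = test_error_before - test_error_after
rel_change_pct = 100 * abs_change / test_error_before

print(f"RMSE reduced by {abs_change:.2f} ({rel_change_pct:.2f}%)")
```

An improvement of this size is small enough that repeating the fit with several seeds would be a reasonable check before concluding that tuning helped.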