# Random Forest Classifier
This notebook reviews the basic concepts you need in order to
successfully create a random forest model. We will use the
```RandomForestClassifier``` from **sklearn**.

We start by importing the functions we need.

In [1]:
#Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier

#Import the function that creates the data we will split
from mlb_misc_functions import create_clf_table_1

#Numpy
import numpy as np

## Data prep

Now, we will create our modeling table (```model_data```). This is a dummy/fake table that resembles the format of real data. The table has 7 columns: one column is a unique row identifier, five columns are features, and the last column is the target. In this case the target takes the values 0 and 1.

In [2]:
#Call the function that creates the table. 10,000 rows
model_data = create_clf_table_1(10000)
model_data.head()

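| | row_id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | target |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | -8 | 6 | 5 | -5 | 0 |
| 1 | 1 | 3 | 3 | -4 | 6 | -5 | 1 |
| 2 | 2 | -4 | 0 | -8 | 0 | 8 | 0 |
| 3 | 3 | -1 | -7 | -3 | -10 | 9 | 0 |
| 4 | 4 | -2 | -5 | 6 | 3 | 0 | 1 |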


Now, we will split the table into training and testing sets. For more info about how to split your data into training and testing sets, click [here](https://github.com/sebaszb/DataScienceSimple/blob/master/machine_learning_basics/training_testing_split.ipynb).

In [3]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

#split the data into training and testing
model_train, model_test = train_test_split(model_data, test_size=0.20, random_state=11)
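
As a quick check of what the split produces, here is a minimal sketch on a hypothetical 10-row table (```toy_data``` is made up for illustration; it is not the notebook's ```model_data```). With ```test_size=0.20```, 20% of the rows go to the testing set and the rest to the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table, used only to illustrate the split sizes
toy_data = pd.DataFrame({
    "row_id": range(10),
    "feature_1": [3, -1, 4, 1, -5, 9, 2, -6, 5, 3],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Same arguments as in the cell above
toy_train, toy_test = train_test_split(toy_data, test_size=0.20, random_state=11)

# 10 rows split 80/20
print(len(toy_train), len(toy_test))  # 8 2

# No row ends up in both sets
shared_rows = set(toy_train["row_id"]) & set(toy_test["row_id"])
print(shared_rows)  # set()
```

Applied to the 10,000-row ```model_data```, the same call leaves 8,000 rows for training and 2,000 rows for testing.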

Now we need to prepare the data for the **Random Forest Classifier**. This is done in the following block of code.

In [4]:
#Lists with the name of the features (columns), and the name of the target
features = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
target = ["target"]

#Create arrays with the "x values" (features) and the "y values" (targets) for the train data
train_x = model_train[features].values
train_y = np.ravel(model_train[target].values)

#Create arrays with the "x values" (features) and the "y values" (targets) for the test data
test_x = model_test[features].values
test_y = np.ravel(model_test[target].values)

Before we continue, let’s see what happened in the block of code above. The first two lines just create two lists (```features``` and ```target```) containing the column names for the features and the target.

After that, four more variables are created: ```train_x```, ```test_x```, ```train_y```, and ```test_y```. ```train_x``` and ```test_x``` are two-dimensional arrays (arrays of arrays), where each inner array holds the five feature values of one row, for the train and test data respectively. On the other hand, ```train_y``` and ```test_y``` are simple one-dimensional arrays containing the targets (0s or 1s) for the train and test data, respectively. ```np.ravel``` is what flattens the single-column target into one dimension.
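
To make the shapes concrete, here is a small sketch on a hypothetical three-row table (the names ```toy```, ```toy_x```, and ```toy_y``` are illustrative only). It follows the same pattern as the cell above: select columns with a list of names, take ```.values```, and flatten the target with ```np.ravel```.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row table with two features and a target
toy = pd.DataFrame({
    "feature_1": [0, 3, -4],
    "feature_2": [-8, 3, 0],
    "target": [0, 1, 0],
})

toy_features = ["feature_1", "feature_2"]
toy_target = ["target"]

# 2-D array: one inner array per row, one value per feature
toy_x = toy[toy_features].values

# Still 2-D (one column), because toy_target is a list of column names
toy_y_2d = toy[toy_target].values

# Flattened to the 1-D shape that fit() expects for the target
toy_y = np.ravel(toy_y_2d)

print(toy_x.shape)     # (3, 2)
print(toy_y_2d.shape)  # (3, 1)
print(toy_y.shape)     # (3,)
```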

## Classifier training

Now we will create our classifier and train it on the training data (```train_x``` and ```train_y```), see below:

In [108]:
#Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier

#create classifier and train it
rf_clf = RandomForestClassifier(n_estimators=10, max_depth=5)
rf_clf.fit(train_x, train_y)

It is important to notice that we are initializing our random forest classifier with ```n_estimators=10``` and ```max_depth=5```. **n_estimators** defines the number of trees in the forest (in the classifier), while **max_depth** defines the maximum depth of each tree. As you change their values, the performance of your classifier will change as well.
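
The effect of these two hyperparameters can be inspected on a fitted model. The sketch below is illustrative only: it uses synthetic data from ```make_classification``` rather than the notebook's table, and it sets ```random_state=0``` (not used in the cell above) to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only (not the notebook's model_data)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
demo_clf.fit(X, y)

# n_estimators: the fitted forest holds exactly this many trees
n_trees = len(demo_clf.estimators_)
print(n_trees)  # 10

# max_depth: no individual tree grows deeper than this limit
tree_depths = [tree.get_depth() for tree in demo_clf.estimators_]
print(max(tree_depths) <= 5)  # True
```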

There are several other hyperparameters that can be specified, and they can be tuned in order to obtain optimal performance. We will talk about how to tune them in a different notebook. For now, you can read about all the other hyperparameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

## Predictions and probabilities

Once your random forest classifier is trained, it is ready to make predictions on new data. You can obtain either predicted classes for the target (0s or 1s) with ```predict```, or class probabilities (numbers between 0 and 1) with ```predict_proba```.

In [105]:
predictions = rf_clf.predict(test_x)
probabilities = rf_clf.predict_proba(test_x)
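
To see how the two outputs relate, here is a minimal sketch on synthetic data (the data and the ```demo_``` names are assumptions for illustration, not part of the notebook). ```predict_proba``` returns one column per class, ordered as in ```classes_```, and ```predict``` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
demo_clf.fit(X, y)

demo_predictions = demo_clf.predict(X[:5])
demo_probabilities = demo_clf.predict_proba(X[:5])

# One row per sample, one column per class (here: class 0 and class 1)
print(demo_probabilities.shape)  # (5, 2)

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(demo_probabilities.sum(axis=1), 1.0)
print(rows_sum_to_one)  # True

# The predicted class is the one with the largest probability
from_probabilities = demo_clf.classes_[np.argmax(demo_probabilities, axis=1)]
print(np.array_equal(demo_predictions, from_probabilities))  # True
```

Working with the probabilities directly is useful when you want a decision threshold other than the default behavior of ```predict```.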

In [106]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix
print("Precision: ", precision_score(test_y, predictions))
print("Recall:    ", recall_score(test_y, predictions))
print("############")
print("Confusion matrix")
print(confusion_matrix(test_y, predictions))

Precision:  0.8293413173652695
Recall:     0.7446236559139785
############
Confusion matrix
[[1142  114]
 [ 190  554]]
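
The precision and recall printed above can be recovered by hand from the confusion matrix. In sklearn's layout, rows are the true classes and columns are the predicted classes, so for a 0/1 target the matrix reads ```[[TN, FP], [FN, TP]]```. The sketch below only redoes that arithmetic with the numbers from the output; the variable names are illustrative.

```python
# Confusion matrix copied from the output above
# Rows: true class (0, 1). Columns: predicted class (0, 1).
cm = [[1142, 114],
      [190, 554]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

# Precision: of all rows predicted as 1, the fraction that really are 1
precision = tp / (tp + fp)

# Recall: of all rows that really are 1, the fraction predicted as 1
recall = tp / (tp + fn)

print(round(precision, 4))  # 0.8293
print(round(recall, 4))     # 0.7446

# The four cells add up to the size of the test set
print(tn + fp + fn + tp)    # 2000
```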
