## Random Forest Classifier

A random forest is a supervised learning algorithm built from many decision trees. Although it can be used for both classification and regression, this tutorial focuses on how to create a random forest classifier in Python.

### What is the random forest algorithm?

Random forest is a popular supervised machine learning algorithm based on ensemble learning, in which multiple classifiers are used collectively to solve a problem. The algorithm trains multiple decision trees and then takes the class that receives the most votes across the trees' predictions as its result.

Random forest classifiers are applied in many domains. Examples include credit card fraud detection, diabetes prediction, breast cancer prediction, and Bitcoin price prediction.

### How does the algorithm work?

It works in four steps:

1. Draw random samples, with replacement, from the given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
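
The four steps above can be sketched by hand with individual decision trees. This is only an illustration under stated assumptions, not how scikit-learn implements the algorithm: the forest size of 25, the seed, and the helper name `majority_vote` are hypothetical choices made for this sketch. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # fixed seed, chosen only so the sketch is repeatable

n_trees = 25  # hypothetical forest size for this sketch
trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that sample, considering a random
    # subset of features at each split
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect every tree's vote for each row -> shape (n_trees, n_samples)
votes = np.stack([tree.predict(X) for tree in trees])


def majority_vote(column):
    """Return the most common class index among one row's votes."""
    return np.bincount(column, minlength=3).argmax()


# Step 4: the most-voted label per row is the final prediction
forest_pred = np.apply_along_axis(majority_vote, axis=0, arr=votes)
print("Agreement with the true labels on the training data:", (forest_pred == y).mean())
```

The agreement printed here is measured on the same rows the trees were trained on, so it says nothing about how well the forest generalizes.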

In this workshop, you will build a random forest (RF) model on the Iris flower data set, a well-known classification data set. It comprises the sepal length, sepal width, petal length, and petal width of each flower, together with its species. You can access it by importing the `datasets` module from scikit-learn and loading it with `load_iris()`.

### Let's Dive into Some Python Code Now

We now build a model on the Iris data set. Start by importing the `datasets` module from `sklearn` and loading the Iris data set.

In [1]:
from sklearn import datasets
# Load dataset
iris = datasets.load_iris()

Let's print the target and feature variable names to make sure that we are using the right dataset.

In [2]:
print(iris.target_names)
print(iris.feature_names)

['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
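
Beyond the names, it can help to confirm the size of the data and how the classes are distributed. A minimal sketch, using only attributes of the object returned by `load_iris()`:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# (number of samples, number of features)
print(iris.data.shape)
# Number of samples for each class index 0, 1, 2
print(np.bincount(iris.target))
```

The data set holds 150 flowers with 4 measurements each, and the three species are balanced at 50 samples apiece.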


Now we create a DataFrame from the Iris dataset. Since the species is what we want to classify, we then separate the columns into the dependent variable (`species`) and the independent variables (the four measurements).  
The steps are as follows:

In [8]:
import pandas as pd
data = pd.DataFrame({
    'sepal_length': iris.data[:,0],
    'sepal_width': iris.data[:,1],
    'petal_length': iris.data[:,2],
    'petal_width': iris.data[:,3],
    'species': iris.target
})

data.head(5)

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [9]:
# Separate columns into dependent and independent variables
y = data['species']
X = data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
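
As an aside, scikit-learn can also return the features and target directly as pandas objects, which skips the manual DataFrame construction. This is an alternative sketch rather than the approach used in this tutorial; the names `X_alt` and `y_alt` are hypothetical, and the columns keep scikit-learn's original names such as `sepal length (cm)`.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_frame = load_iris(as_frame=True)
X_alt = iris_frame.data    # DataFrame with the four measurement columns
y_alt = iris_frame.target  # Series of class indices (0, 1, 2)

print(X_alt.columns.tolist())
print(X_alt.shape, y_alt.shape)
```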

We then use the `train_test_split` function to split the data into a training set and a test set (75% for training and 25% for testing), train the model on the training set, and make predictions on the test set. Don't forget to import `RandomForestClassifier` from `sklearn.ensemble`.

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create the classifier with 100 trees and fit it on the training set
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X_train, y_train)
# Make predictions on the test set using the RF model
y_pred = classifier.predict(X_test)
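
Both the split and the forest are randomized, so results change between runs unless a seed is fixed. The sketch below shows one way to make the run repeatable and to see which measurements the fitted forest relies on most. The seed value 42, the use of `stratify`, and the variable name `clf` are assumptions made for this example and are not part of the tutorial's code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# random_state=42 is an arbitrary seed chosen for this sketch;
# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Impurity-based importances: one value per feature, summing to 1
ranked = sorted(zip(X.columns, clf.feature_importances_), key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Fixing `random_state` in both places means a reader who reruns the cell should get the same split and the same forest each time.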

We now examine the RF model's accuracy by comparing the actual species values in `y_test` with the values predicted by the model.

In [22]:
from sklearn import metrics

print("The accuracy for RF model is: ", round(metrics.accuracy_score(y_test, y_pred),4))

The accuracy for RF model is:  0.9474
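
A single accuracy figure hides which species get confused with each other. One way to look deeper, sketched here under the assumption of a fixed seed of 0 and a freshly trained model named `model`, is a confusion matrix together with a per-class report:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
# random_state=0 is an arbitrary seed chosen for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Off-diagonal entries in the matrix count the misclassified flowers, so a perfect classifier would produce zeros everywhere except on the diagonal.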


The accuracy is high for such a simple model. Because the train/test split and the forest itself are randomized and no seed is set, the exact figure will vary from run to run. To make a prediction for a single item, we can also use the `predict()` function.  
For example:

- sepal length = 3
- sepal width = 6
- petal length = 6
- petal width = 4

Now we can predict which type of iris flower it is, as shown below.

In [24]:
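# Pass a DataFrame with the training column names to avoid a feature-name warning
classifier.predict(pd.DataFrame([[3, 6, 6, 4]], columns=X.columns))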



array([2])

Here, the output is 2, which corresponds to `virginica`, the third entry in `iris.target_names`.
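
Instead of reading the class index by eye, the index can be looked up in `iris.target_names`, and `predict_proba()` shows how confident the forest is. The following is a minimal sketch. For brevity it trains on the full data set with an arbitrary seed of 0, which differs from the train/test workflow above, and the names `model`, `sample`, and `feature_cols` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = pd.DataFrame(iris.data, columns=feature_cols)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, iris.target)

# The single flower from the example, with matching column names
sample = pd.DataFrame([[3, 6, 6, 4]], columns=feature_cols)

label = model.predict(sample)[0]
proba = model.predict_proba(sample)[0]

print(iris.target_names[label])                   # species name for the predicted index
print(dict(zip(iris.target_names, proba.round(3))))  # estimated probability per species
```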

Congratulations! You have made it this far and now know what a typical random forest classifier in Python looks like. If you are interested in learning more about the random forest algorithm, there are many more readings and tutorials online to explore.