# **Demo: Random Forest Algorithm In Python**

## Problem Definition:
We are using the 'breast cancer' dataset that ships with the sklearn package. The dataset has 2 classes (benign and malignant), each instance described by 30 measurements.
We have to predict which class an instance belongs to based on its measurements. We are using the Random Forest algorithm to solve this binary classification problem.

## Objective
>* **Classify:** We want to predict whether a patient has a malignant or a benign tumor.
>* **Understanding Random Forest:** For classification here we are using Random Forest Classifier, so let's see how it works.
>* **Collecting the data**
>* **Splitting the dataset for training and testing:** Since we want to know how good our model is, we will split the main dataset into training and testing datasets. The test data will be used later for evaluating.
>* **Implementing Random Forest Classifier using sklearn**
>* **Training the model:** We will create the model by training the algorithm on the training dataset (which contains the actual labels).
>* **Testing the model:** We will test the model on the test dataset to check how well our model works when it sees a new sample.
>* **Model Performance:** We will calculate our model's performance by comparing our predicted values with the actual values.

## Understanding Random Forest
Random Forest is an ensemble method: it uses multiple decision trees for prediction.

It builds each decision tree on a different random subset (a bootstrap sample) of the training data, and each split considers only a random subset of the features. For the final prediction it takes a vote across all the trees.
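The idea above can be sketched by hand with plain decision trees. This is a minimal illustration, not how sklearn implements it internally: the names `X_demo`, `y_demo`, `trees` and the choice of 15 trees are assumptions made for this sketch, and sklearn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_trees = 15                    # odd number, so a binary vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split look at a random subset of features
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(0, 1_000_000)),
    )
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# One row of predictions per tree for the first 5 samples
votes = np.array([t.predict(X_demo[:5]) for t in trees])

# Labels are 0/1, so the majority class is 1 when more than half the trees say 1
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
```

Each tree sees a slightly different dataset, so the trees make different mistakes, and voting tends to cancel those mistakes out.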

## Importing the libraries

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Loading the data

In [3]:
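cancer = load_breast_cancer()
# Keep the feature names as column labels so later output is easier to read
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# A 1-D Series avoids the column-vector warning that fit() raises for a one-column DataFrame
y = pd.Series(cancer.target)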

In [4]:
X.shape

(569, 30)

## Splitting the dataset for training and testing

In [5]:
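# test_size=30 is an integer, so exactly 30 samples are held out for testing
# (a float such as 0.3 would instead hold out 30% of the data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=30, random_state=1)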
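The meaning of `test_size` depends on its type, which is easy to overlook. The sketch below contrasts an integer with a float; the `stratify` argument in the second call is an optional extra that is not used in this notebook, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Integer test_size: an absolute number of test samples
_, X_test_int, _, _ = train_test_split(
    X_demo, y_demo, test_size=30, random_state=1
)

# Float test_size: a fraction of the dataset; stratify keeps the class ratio
# roughly the same in the training and test parts
_, X_test_frac, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(X_test_int.shape)   # (30, 30)
print(X_test_frac.shape)  # (171, 30)
```

With only 30 test samples, a single misclassified sample changes the accuracy by about 3.3 percentage points, so a larger test fraction gives a finer-grained estimate.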

## Implementing Random Forest using sklearn

In [6]:
rfc = RandomForestClassifier()     # n_estimators defaults to 100 trees; no random_state is set, so the score can vary between runs
rfc.fit(X_train,y_train)

y_pred=rfc.predict(X_test)

print(accuracy_score(y_test,y_pred))

0.9666666666666667
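Accuracy is a single number, so it does not say which class the errors fall in. A possible way to compare predicted and actual values in more detail is a confusion matrix and a per-class report, sketched below. The split mirrors the one used in this notebook; `random_state=0` for the forest and the variable names are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=30, random_state=1
)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are actual classes, columns are predicted classes
# (in this dataset 0 = malignant, 1 = benign)
cm = confusion_matrix(y_te, pred, labels=[0, 1])
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_te, pred, labels=[0, 1],
                            target_names=data.target_names))
```

For a medical problem like this one, a malignant tumor predicted as benign is usually the more costly error, which the off-diagonal cells make visible.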


## Let us try to understand the parameters of Random Forest Classifier

**n_estimators** represents the number of trees used by the random forest.

In [7]:
for trees in range(10, 150, 10):
    # Train a forest with the given number of trees and score it on the test set
    rf = RandomForestClassifier(n_estimators=trees, random_state=7)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    print(trees, accuracy_score(y_test, y_pred))

10 0.9333333333333333
20 0.9333333333333333
30 0.9333333333333333
40 0.9333333333333333
50 0.9333333333333333
60 0.9333333333333333
70 0.9333333333333333
80 0.9333333333333333
90 0.9333333333333333
100 0.9333333333333333
110 0.9333333333333333
120 0.9333333333333333
130 0.9333333333333333
140 0.9333333333333333
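Every score above equals 28/30, because a 30-sample test set can only resolve accuracy in steps of 1/30. One possible way to see smaller differences between settings is cross-validation, sketched here with `cross_val_score`. The choice of `cv=5` and of the three tree counts is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = load_breast_cancer(return_X_y=True)

scores = {}
for n_trees in (10, 50, 100):
    rf_cv = RandomForestClassifier(n_estimators=n_trees, random_state=7)
    # 5-fold cross-validation: every sample is used for testing exactly once
    fold_scores = cross_val_score(rf_cv, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[n_trees] = fold_scores.mean()
    print(n_trees, round(scores[n_trees], 4))
```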


**criterion:** the function used to measure the quality of a split, for example gini or entropy.

**max_depth:** the maximum depth of each decision tree.
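To make the two criterion options concrete, here is a small sketch that computes Gini impurity and entropy for a list of class labels. The helper functions `gini` and `entropy` are written for this illustration only and are not part of sklearn's public API.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = [1, 1, 1, 1]    # one class only: no impurity
mixed = [0, 0, 1, 1]   # 50/50 split: maximum impurity for two classes

print(gini(pure), gini(mixed))
print(entropy(pure), entropy(mixed))
```

Both measures are zero for a pure node and largest for an even class mix; a split is considered good when it lowers the impurity of the resulting child nodes.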

In [8]:
r=RandomForestClassifier(n_estimators=30,criterion='entropy',max_depth=3,random_state=7)     #by default criterion is gini
r.fit(X_train,y_train)
y_pred=r.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.9333333333333333


In [9]:
r=RandomForestClassifier(n_estimators=30,criterion='gini',max_depth=13,random_state=7)
r.fit(X_train,y_train)
y_pred=r.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.9333333333333333
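To check what max_depth actually does to the trees, we can inspect the fitted trees through the forest's `estimators_` list. This is a sketch fitted on the full dataset purely for inspection; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

shallow = RandomForestClassifier(n_estimators=30, max_depth=3, random_state=7)
shallow.fit(X_demo, y_demo)

# max_depth=None (the default) lets each tree grow until its leaves are pure
unlimited = RandomForestClassifier(n_estimators=30, max_depth=None, random_state=7)
unlimited.fit(X_demo, y_demo)

# estimators_ holds the individual fitted decision trees
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
unlimited_depths = [tree.get_depth() for tree in unlimited.estimators_]

print("max depth with max_depth=3:   ", max(shallow_depths))
print("max depth with max_depth=None:", max(unlimited_depths))
```

If the unrestricted trees do not grow deeper than a given limit on this data, setting max_depth above that value changes nothing.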


**max_features** represents the maximum number of features considered when looking for the best split at each node of a decision tree.

In [10]:
rfc2=RandomForestClassifier(n_estimators=30,criterion='gini',max_features=5,random_state=7)
rfc2.fit(X_train,y_train)
y_pred=rfc2.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.9333333333333333
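Besides an integer, max_features also accepts strings such as "sqrt" and the value None (use all features). The sketch below reads the resolved value from a fitted tree's `max_features_` attribute; the variable names and the small forest size are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# "sqrt": consider about sqrt(n_features) features at each split
rf_sqrt = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=7)
rf_sqrt.fit(X_demo, y_demo)

# None: consider all features at each split
rf_all = RandomForestClassifier(n_estimators=10, max_features=None, random_state=7)
rf_all.fit(X_demo, y_demo)

print(rf_sqrt.estimators_[0].max_features_)  # 5
print(rf_all.estimators_[0].max_features_)   # 30
```

With 30 features, "sqrt" resolves to 5, so max_features=5 gives the same setting as "sqrt" on this dataset.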


In [11]:
rfc2=RandomForestClassifier(n_estimators=30, oob_score=True, random_state=111)

rfc2.fit(X_train,y_train)

y_pred=rfc2.predict(X_test)

print(accuracy_score(y_test,y_pred))

print(rfc2.oob_score_)

0.9333333333333333
0.9554730983302412
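The second number is the out-of-bag (OOB) score. Each tree is trained on a bootstrap sample, so some training rows are never seen by that tree, and those rows can be used to evaluate it without a separate test set. A minimal sketch of how many rows a single bootstrap sample leaves out; the seed and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 539  # training rows in this notebook: 569 samples minus 30 test samples

# Bootstrap sample: draw n_rows indices with replacement
sampled = rng.integers(0, n_rows, size=n_rows)

# Rows that were never drawn are "out of bag" for this tree
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[sampled] = False

oob_fraction = oob_mask.mean()
print(round(oob_fraction, 3))
```

On average about 1 - 1/e, roughly 37%, of the rows are out of bag for any one tree, which is why `oob_score_` can estimate accuracy using the training data alone.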



The **feature_importances_** attribute tells us which features are important for prediction. It can further be used for feature selection in large datasets.

In [13]:
print(rfc.feature_importances_)

[0.02233103 0.01651001 0.0576928  0.05666557 0.00908835 0.00999851
 0.05971653 0.1082808  0.00284243 0.00371258 0.01907734 0.00435095
 0.00473628 0.0310185  0.0054387  0.00520413 0.00966317 0.00337525
 0.003989   0.00333746 0.08738531 0.01477041 0.10394331 0.13692927
 0.01042421 0.01520646 0.04197656 0.13751913 0.00799141 0.00682454]


The above output shows the importance of each column; for example, the importance of the first column is about 2.2%.

In [14]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')
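The importances are easier to read when paired with the feature names and sorted. A possible way to do that with a pandas Series is sketched below; the model here is refitted on the full dataset with `random_state=0`, which is an assumption for this sketch, so the exact values will differ from the output above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=data.feature_names)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction across the forest.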