This blog will cover four parts:

1. *What is random forest*
2. *Why we want to use it*
3. *How it works*
4. *Its implementation in Python sklearn*


### What is random forest?


Random forest is a more advanced version of the decision tree. It combines (ensembles) a number of decision trees to make a prediction, instead of relying on just one.
Note: If you don't know what a decision tree is and are interested in the rationale behind it, please refer to my previous blog on decision trees: https://medium.com/bite-sized-machine-learning/decision-tree-classifier-explained-9543dd052746

![title](Capture0.PNG)

### What's the benefit of using an ensemble method?

If you are not familiar with ensemble methods (ensemble learning), you may ask why you need to use one.

An *ensemble method*, in general, is a predictive model that makes its predictions based on a number of different models. The three most common ensemble architectures are bagging, boosting and stacking. Random forest is an example of bagging, which is the focus of this blog.

*Bagging* *trains each model on a random subset of the training set* (typically drawn with replacement), instead of on the entire training set, and then combines the models' predictions.
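
To make "random subsets" concrete, here is a minimal sketch of how the subsets could be drawn. It assumes sampling with replacement (bootstrap sampling), which is the usual bagging setup; the names `n_rows` and `n_subsets` and their values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_rows = 10_000   # hypothetical training-set size
n_subsets = 3     # hypothetical number of models to train

# Each subset is a bootstrap sample: n_rows row indices drawn WITH replacement,
# so some rows repeat and others are left out.
subsets = [rng.integers(0, n_rows, size=n_rows) for _ in range(n_subsets)]

for i, idx in enumerate(subsets):
    unique_share = np.unique(idx).size / n_rows
    print(f"subset {i}: {idx.size} rows drawn, "
          f"{unique_share:.1%} of the original rows appear at least once")
```

Each subset has the same size as the training set, but only around 63% of the original rows show up in it, so every model sees a slightly different view of the data.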

*Why do we need an ensemble method?* An individual model usually suffers from bias or variance. By combining different individual models, the ensemble model tends to be more flexible (less bias) and less data-sensitive (less variance).

* Bias is when the model does not pay enough attention to the training data and is not flexible enough to adapt to it (e.g. the model sticks with a wrong assumption)
* Variance is when the model pays too much attention to the training data and captures a lot of the noise in the data instead of the pattern
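
One rough way to see the two failure modes is to compare a very rigid tree with a fully grown one. The sketch below is only an illustration under my own assumptions: a smaller `make_moons` sample than the one used later in this blog, and `max_depth=1` versus `max_depth=None` as stand-ins for "too rigid" and "too flexible".

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=2000, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=1 allows a single split (rigid, leans towards bias);
# max_depth=None grows the tree until it fits the training data (leans towards variance).
results = {}
for name, depth in [("stump", 1), ("full tree", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name:9s} train accuracy={results[name][0]:.3f} "
          f"test accuracy={results[name][1]:.3f}")
```

A large gap between training and test accuracy for the full tree is the sign of variance: the tree has memorised noise that does not carry over to new data.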



### How does random forest work?

*Random forest* is an ensemble model that uses *bagging* as the ensemble architecture and *decision trees* as the individual models (the base learners).

*Here are the four steps of building a random forest model:*

![title](Capture1.PNG)


**Step 1:** Select n (e.g. 1000) random subsets from the training set.

**Step 2:** Train n decision trees.
1. Each random subset is used to train one decision tree.
2. The optimal split at each node of a tree is chosen from a random subset of the features (e.g. with 10 features in total, 5 randomly chosen features are considered for the split).

**Step 3:** Each tree independently predicts the records/candidates in the test set.

**Step 4:** For each record/candidate in the test set, take the majority vote of the n trees as the final decision.
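
The four steps can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn implements it internally: the number of trees (25), the smaller data set and the use of bootstrap samples for the random subsets are all my assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=2000, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote over two classes cannot tie

trees = []
for i in range(n_trees):
    # Step 1: draw one random subset (bootstrap sample) of the training rows.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree on that subset; max_features="sqrt" makes the
    # tree look at a random subset of the features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: every tree predicts every test record independently.
votes = np.array([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)

# Step 4: majority vote per test record (the labels here are 0 and 1).
y_pred = (votes.sum(axis=0) > n_trees / 2).astype(int)

accuracy = (y_pred == y_test).mean()
print(f"hand-rolled forest accuracy: {accuracy:.3f}")
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees instead of counting hard votes, so its results can differ slightly from this sketch.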


### Implementation in Python sklearn

In [1]:
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Step 1: Create the data set
X, y = make_moons(n_samples=10000, noise=.5, random_state=0)

In [3]:
# Step 2: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Step 3: Fit a Decision Tree model as comparison
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7555

In [5]:
# Step 4: Fit a Random Forest model; compared to the Decision Tree model,
# accuracy goes up by about 4 percentage points (roughly 5% in relative terms)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7965

**n_estimators** is how many trees you want to grow. In other words, how many random subsets you want to draw and train on.
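
To get a feel for this parameter, one option is to fit the same model with a few different values and compare the test accuracy. The values 1, 10 and 100 and the smaller data set below are arbitrary choices for illustration; the exact numbers will depend on your data and scikit-learn version.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=2000, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for n in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    scores[n] = accuracy_score(y_test, forest.predict(X_test))
    # forest.estimators_ holds the fitted trees, one per estimator.
    print(f"n_estimators={n:3d} trees grown={len(forest.estimators_):3d} "
          f"test accuracy={scores[n]:.3f}")
```

More trees cost more training time, so in practice this is a trade-off between accuracy and speed.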

**max_features** is the number of randomly chosen features that each decision tree considers when looking for the optimal split.

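* If “auto”, then max_features=sqrt(n_features). This option was deprecated in scikit-learn 1.1 and removed in 1.3, so use “sqrt” on recent versions.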
* If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
* If “log2”, then max_features=log2(n_features).
* If None, then max_features=n_features.
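
To see what these settings mean in actual feature counts, here is a small sketch. `resolve_max_features` is a hypothetical helper written for this blog, not a scikit-learn function; it assumes the result is rounded down and is never less than 1, which is my understanding of how scikit-learn treats these options.

```python
import math

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: turn a max_features setting into a feature count."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported setting in this sketch: {max_features!r}")

for n_features in (2, 10, 100):
    for setting in ("sqrt", "log2", None):
        count = resolve_max_features(setting, n_features)
        print(f"n_features={n_features:3d} max_features={setting!s:5s} "
              f"-> {count} features per split")
```

For the two-feature moons data used above, both "sqrt" and "log2" come out as 1 feature per split under this assumption, which is why the setting matters more on wider data sets.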

