# Machine Learning Notebook 7: Random Forest Regression

### Compiled by Amit Purswani
LinkedIn: https://www.linkedin.com/in/amit-purswani-2a073777/

<b>GitHub Repositories</b>
1. Data Analysis:
https://github.com/kranemetal/Data-Analysis-Projects

2. Machine Learning:
https://github.com/kranemetal/MachineLearning

<b>Notes on Random Forest:</b>
1. Random Forest is a type of Ensemble Learning.
2. In Ensemble Learning, you combine several machine learning algorithms, or the same algorithm run multiple times, to build a model that is more powerful than any single one of them.

<b>Random Forest prediction steps:</b>
1. Pick 'k' data points at random from the training set.
2. Build a Decision Tree on these 'k' data points.
3. Choose the number of trees, Ntree, that you want to build, and repeat steps 1 and 2 for each tree.
4. For a new data point, have each of the Ntree trees predict the value of Y.
5. Assign to Y the average of the values predicted by all the trees in step 4.
6. The result is therefore not a single prediction for Y but the average of many predictions, which is usually more stable, and often more accurate, than the prediction of any single tree.
7. Because several trees are used for the prediction, the method is known as <b>Random Forest.</b>
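
The steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`. This is a minimal illustration, not how `RandomForestRegressor` is implemented internally. The helper names `fit_forest` and `predict_forest` are hypothetical, and drawing the 'k' points with replacement (a bootstrap sample) is an assumption, since the notes only say the points are picked randomly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Levels and salaries from Position_Salaries.csv, typed in so the sketch is self-contained.
levels = np.arange(1, 11).reshape(-1, 1)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000])

def fit_forest(x, y, n_trees=10, k=None, seed=0):
    """Steps 1-3: fit n_trees trees, each on k rows drawn at random (assumed: with replacement)."""
    rng = np.random.default_rng(seed)
    k = len(x) if k is None else k
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(x), size=k)             # step 1: pick k random rows
        tree = DecisionTreeRegressor(random_state=0)
        trees.append(tree.fit(x[rows], y[rows]))           # step 2: one tree per sample
    return trees

def predict_forest(trees, x_new):
    """Steps 4-5: collect one prediction per tree, then average them."""
    per_tree = np.array([tree.predict(x_new) for tree in trees])  # shape (n_trees, n_points)
    return per_tree.mean(axis=0)

trees = fit_forest(levels, salaries)
prediction = predict_forest(trees, [[6.5]])
print(prediction)
```

The exact number printed depends on the random samples drawn, so it will generally differ from the value scikit-learn's own forest returns later in this notebook.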

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import dataset

In [2]:
df = pd.read_csv(r'C:\Users\krane\Desktop\datasets\Position_Salaries.csv')  # raw string so backslashes are not treated as escapes

### Basic sanity checks on dataset

In [3]:
df.head()

| | Position | Level | Salary |
|---|---|---|---|
| 0 | Business Analyst | 1 | 45000 |
| 1 | Junior Consultant | 2 | 50000 |
| 2 | Senior Consultant | 3 | 60000 |
| 3 | Manager | 4 | 80000 |
| 4 | Country Manager | 5 | 110000 |


In [4]:
df.shape

(10, 3)

In [5]:
df

| | Position | Level | Salary |
|---|---|---|---|
| 0 | Business Analyst | 1 | 45000 |
| 1 | Junior Consultant | 2 | 50000 |
| 2 | Senior Consultant | 3 | 60000 |
| 3 | Manager | 4 | 80000 |
| 4 | Country Manager | 5 | 110000 |
| 5 | Region Manager | 6 | 150000 |
| 6 | Partner | 7 | 200000 |
| 7 | Senior Partner | 8 | 300000 |
| 8 | C-level | 9 | 500000 |
| 9 | CEO | 10 | 1000000 |


In [6]:
df.describe()

| | Level | Salary |
|---|---|---|
| count | 10.0 | 10.0 |
| mean | 5.5 | 249500.0 |
| std | 3.02765 | 299373.883668 |
| min | 1.0 | 45000.0 |
| 25% | 3.25 | 65000.0 |
| 50% | 5.5 | 130000.0 |
| 75% | 7.75 | 275000.0 |
| max | 10.0 | 1000000.0 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Position  10 non-null     object
 1   Level     10 non-null     int64 
 2   Salary    10 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 368.0+ bytes


In [8]:
df.isnull().sum()

Position    0
Level       0
Salary      0
dtype: int64

### Splitting independent variable X and dependent variable Y

In [9]:
x = df.iloc[:, 1:-1].values  # feature matrix: the Level column only, kept 2-D
y = df.iloc[:, -1].values    # target vector: the last column, Salary
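
The slice `1:-1` is used instead of the plain index `1` because scikit-learn estimators expect the features as a 2-D array of shape (rows, features). The small sketch below shows the difference on a three-row stand-in for the dataset; the `demo` frame and the names `x_2d` and `x_1d` are made up for illustration.

```python
import pandas as pd

# A three-row stand-in for the dataset, so the example runs without the CSV file.
demo = pd.DataFrame({
    "Position": ["Business Analyst", "Junior Consultant", "Senior Consultant"],
    "Level": [1, 2, 3],
    "Salary": [45000, 50000, 60000],
})

x_2d = demo.iloc[:, 1:-1].values  # slice of columns  -> 2-D array
x_1d = demo.iloc[:, 1].values     # single column     -> 1-D array
y_demo = demo.iloc[:, -1].values  # target column     -> 1-D array

print(x_2d.shape, x_1d.shape, y_demo.shape)  # (3, 1) (3,) (3,)
```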

#### Feature Scaling is not required for Random Forest. Each Decision Tree splits the data recursively on feature thresholds, so there is no equation in which the numeric magnitude of a feature could adversely affect the output.
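
One way to check this claim is to fit the same forest on the raw levels and on the levels multiplied by 1000, then compare predictions. This is a sketch under assumptions: the data is typed in from the dataset shown earlier, the factor 1000 and the query points are arbitrary choices, and the query points are picked so that they do not sit exactly on a split threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Levels and salaries from Position_Salaries.csv, typed in so the sketch is self-contained.
levels = np.arange(1, 11, dtype=float).reshape(-1, 1)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000])

# Same settings and seed for both models; only the scale of the feature differs.
raw = RandomForestRegressor(n_estimators=10, random_state=0).fit(levels, salaries)
scaled = RandomForestRegressor(n_estimators=10, random_state=0).fit(levels * 1000, salaries)

queries = np.array([[2.25], [6.25], [9.75]])
pred_raw = raw.predict(queries)
pred_scaled = scaled.predict(queries * 1000)  # queries must be rescaled the same way

print(pred_raw)
print(pred_scaled)
print(np.allclose(pred_raw, pred_scaled))
```

Rescaling moves every split threshold by the same factor, so each point falls into the same leaf and the two models should agree.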

### Training Random Forest model on whole dataset

In [10]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0)  # n_estimators is the number of trees
regressor.fit(x, y)

RandomForestRegressor(n_estimators=10, random_state=0)

### Predict result for a new value 6.5

In [11]:
regressor.predict([[6.5]])

array([167000.])

According to the dataset, the salaries for Levels 6 and 7 are 150,000 and 200,000 respectively. The prediction of 167,000 for Level 6.5 falls between these two values, so it seems reasonable.
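
To see where the forest's value comes from, the individual trees can be queried through the fitted model's `estimators_` attribute and their predictions averaged by hand. The sketch below refits the model on data typed in from the dataset so that it runs on its own; the name `per_tree` is introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Levels and salaries from Position_Salaries.csv, typed in so the sketch is self-contained.
levels = np.arange(1, 11).reshape(-1, 1)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000])

regressor = RandomForestRegressor(n_estimators=10, random_state=0).fit(levels, salaries)

# One prediction per fitted tree for Level 6.5.
per_tree = np.array([tree.predict([[6.5]])[0] for tree in regressor.estimators_])
forest_prediction = regressor.predict([[6.5]])[0]

print(per_tree)                            # the ten individual tree predictions
print(per_tree.mean(), forest_prediction)  # the forest returns the average of its trees
```

Individual trees can disagree noticeably because each one saw a different random sample of the ten rows; the averaging is what smooths this out.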

### <center>The End</center>