# Random Forest 

## Overview
Random Forest is a supervised ensemble method that trains many decision trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and combines their predictions: a majority vote for classification and an average for regression. Averaging many decorrelated trees makes the model less prone to overfitting than a single decision tree, and it usually gives a reasonable baseline with little tuning, which is one reason it is a popular choice in Kaggle competitions.
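
As a rough illustration of how an ensemble of trees produces one prediction, the sketch below builds a small forest by hand. It is not how scikit-learn implements `RandomForestClassifier` internally (that class averages class probabilities rather than counting hard votes); the synthetic dataset, the number of trees (25), and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic binary data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # odd number of trees, so a 0/1 vote cannot tie
    # bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# majority vote: labels are 0/1, so the mean vote is thresholded at 0.5
votes = np.stack([t.predict(X_test) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Hand-rolled forest accuracy:", (forest_pred == y_test).mean())
```

Each tree sees a slightly different view of the data, so individual errors tend to differ between trees and partly cancel in the vote.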

## References
- [Random Forest](https://en.wikipedia.org/wiki/Random_forest)

In [30]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [2]:
# load the data 
df = sns.load_dataset('tips')
df.head()


   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [3]:
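# label-encode every column whose dtype is object or category
# (one encoder per column, so each column's mapping can be inverted later)
encoders = {}
for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype == 'category':
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])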

df.head()


   total_bill   tip  sex  smoker  day  time  size
0       16.99  1.01    0       0    2     0     2
1       10.34  1.66    1       0    2     0     3
2       21.01  3.50    1       0    2     0     3
3       23.68  3.31    1       0    2     0     2
4       24.59  3.61    0       0    2     0     4
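
The integer codes in the table come from `LabelEncoder` sorting the distinct labels and numbering them from 0. The minimal sketch below shows that rule on a toy list that mirrors the values of the `day` column; the list and the variable names are assumptions for illustration, not data loaded from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# toy values mirroring the `day` column (assumption for illustration)
days = ['Sun', 'Sat', 'Thur', 'Fri', 'Sun']

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Fri': 0, 'Sat': 1, 'Sun': 2, 'Thur': 3}

# inverse_transform recovers the original labels from the codes
print([str(label) for label in encoder.inverse_transform(codes)])
```

This is why `Sun` appears as `2`. The codes imply an ordering that has no real meaning; tree-based models tolerate this reasonably well, but one-hot encoding is usually the safer choice for linear or distance-based models.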


In [29]:
# classification task: predict `sex` from the remaining columns
X = df.drop('sex', axis=1)
y = df['sex']
# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train shape: ", X_train.shape)
print("Test shape: ", X_test.shape)
# create the model: 200 trees, entropy split criterion, bootstrap sampling
model = RandomForestClassifier(n_estimators=200, criterion='entropy', max_depth=100, random_state=42, bootstrap=True)
# train the model
model.fit(X_train, y_train)
# make predictions on the test set
y_pred = model.predict(X_test)
# evaluate the model
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))


Train shape:  (195, 6)
Test shape:  (49, 6)
Accuracy:  0.6122448979591837
Classification Report: 
               precision    recall  f1-score   support

           0       0.50      0.37      0.42        19
           1       0.66      0.77      0.71        30

    accuracy                           0.61        49
   macro avg       0.58      0.57      0.57        49
weighted avg       0.60      0.61      0.60        49

Confusion Matrix: 
 [[ 7 12]
 [ 7 23]]
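
The figures in the classification report can be re-derived from the confusion matrix alone, which also makes it easy to compare the accuracy with a trivial baseline. The sketch below hard-codes the matrix printed above; the variable names are assumptions for illustration.

```python
import numpy as np

# confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[7, 12],
               [7, 23]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column totals)
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row totals)
# baseline: always predict the most frequent true class
baseline = cm.sum(axis=1).max() / cm.sum()

print("accuracy :", round(float(accuracy), 4))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("baseline :", round(float(baseline), 4))
```

Both accuracy and the majority-class baseline come out to 30/49, so on this split the classifier does no better than always predicting class 1. With only 49 test rows, a single split is also a noisy estimate.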


In [34]:
# regression task: predict `tip` from the remaining columns
X = df.drop('tip', axis=1)
y = df['tip']
# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train shape: ", X_train.shape)
print("Test shape: ", X_test.shape)

# create the model with default hyperparameters
model = RandomForestRegressor(random_state=42)
# train the model
model.fit(X_train, y_train)
# make predictions on the test set
y_pred = model.predict(X_test)

# evaluate the model
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))
print("R2 Score: ", r2_score(y_test, y_pred))
print('root mean squared error:',np.sqrt(mean_squared_error(y_test,y_pred)))

Train shape:  (195, 6)
Test shape:  (49, 6)
Mean Absolute Error:  0.7750510204081635
Mean Squared Error:  0.9625607446938791
R2 Score:  0.2299337514142753
root mean squared error: 0.9811018013916186
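
The regression metrics are linked: RMSE is the square root of MSE, and R2 compares the model's MSE with that of a baseline that always predicts the mean of the test targets, via R2 = 1 - MSE_model / MSE_baseline. The sketch below rearranges that relation using the numbers printed above; the variable names are assumptions for illustration.

```python
import numpy as np

# values copied from the output above
mse = 0.9625607446938791
r2 = 0.2299337514142753

# RMSE is in the same units as the target (the tip amount)
rmse = np.sqrt(mse)

# rearranging R2 = 1 - mse / baseline_mse gives the MSE of a model
# that always predicts the mean of y_test
baseline_mse = mse / (1 - r2)
baseline_rmse = np.sqrt(baseline_mse)

print("RMSE          :", rmse)
print("baseline MSE  :", baseline_mse)
print("baseline RMSE :", baseline_rmse)
```

An R2 of about 0.23 therefore means the forest's squared error is roughly 23% lower than that of the constant mean prediction, a modest improvement.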
