# 1. Random Forest
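Random Forest is a `supervised learning algorithm`. As its name suggests, it builds a `forest` and adds randomness to how that forest is grown. The forest is an `ensemble of Decision Trees`, usually trained with the “bagging” (bootstrap aggregating) method: each tree is fit on a random sample of the training rows drawn with replacement, and each split typically considers only a random subset of the features. The idea behind bagging is that combining many models that make different errors usually gives a better result than any single model.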

Put simply: a random forest builds multiple decision trees and combines their predictions to get a more accurate and stable result.
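
As a rough illustration of the bagging idea, the sketch below builds a tiny forest by hand: each tree is trained on a bootstrap sample and the trees vote on the final class. The synthetic dataset from `make_classification`, the tree count of 25, and all variable names are assumptions made for this example. scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplification and not the library's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(Xb_train), size=len(Xb_train))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(Xb_train[idx], yb_train[idx])
    trees.append(tree)

# Aggregate: each tree votes, the majority class wins
votes = np.stack([t.predict(Xb_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (trees[0].predict(Xb_test) == yb_test).mean()
bagged_acc = (bagged_pred == yb_test).mean()
print("single tree accuracy:", single_acc)
print("bagged trees accuracy:", bagged_acc)
```

Comparing the two printed accuracies gives a feel for how much the vote stabilises a single tree on this particular dataset; the gap will vary with the data and the seed.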

In [20]:
# import libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [21]:
# load the data
df = sns.load_dataset('tips')
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [22]:
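# encode categorical variables with LabelEncoder
# (one encoder per column, kept in a dict so codes can be mapped back to labels)
encoders = {}
for col in df.select_dtypes(include=['object', 'category']).columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
df.head()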

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,2,0,2
1,10.34,1.66,1,0,2,0,3
2,21.01,3.5,1,0,2,0,3
3,23.68,3.31,1,0,2,0,2
4,24.59,3.61,0,0,2,0,4
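
To make the integer codes above easier to read, this small sketch shows how `LabelEncoder` assigns them: the distinct labels are sorted and numbered from 0, which is why `Sun` becomes 2 in the `day` column. The toy list of days and the name `day_encoder` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

day_encoder = LabelEncoder()  # hypothetical name: one encoder for one column
codes = day_encoder.fit_transform(["Sun", "Sat", "Thur", "Fri", "Sun"])

# Labels are sorted alphabetically, then numbered from 0
print(day_encoder.classes_.tolist())  # ['Fri', 'Sat', 'Sun', 'Thur']
print(codes.tolist())                 # [2, 1, 3, 0, 2]

# inverse_transform maps codes back to the original labels
decoded = day_encoder.inverse_transform([2, 0])
print(decoded.tolist())               # ['Sun', 'Fri']
```

Note that the resulting order (Fri < Sat < Sun < Thur) is alphabetical rather than chronological, so the numbers should be read as identifiers, not as a meaningful ranking.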


In [23]:
# Split the data into X and y for classification
X = df.drop('sex', axis=1)
y = df['sex']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
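
As a quick check on what this split produces, the sketch below applies the same `test_size=0.2` and `random_state=42` to a stand-in array of 244 rows, the number of rows in the tips dataset. The array `rows` and the other names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(244)  # stand-in for the 244 rows of the tips dataset
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
print(len(train_rows), len(test_rows))  # 195 49

# The same random_state reproduces exactly the same split
train_again, test_again = train_test_split(rows, test_size=0.2, random_state=42)
print(np.array_equal(test_rows, test_again))  # True
```

The 49 test rows agree with the total support shown in the classification report.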


In [26]:
# Create and train the model
model_cl = RandomForestClassifier(n_estimators=100, random_state=42)
model_cl.fit(X_train, y_train)


parameter,value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [None]:
# Predict on the test set
y_pred_cl = model_cl.predict(X_test)

In [29]:
# Evaluate the model
print('accuracy score:', accuracy_score(y_test, y_pred_cl))
print(confusion_matrix(y_test, y_pred_cl))
print(classification_report(y_test, y_pred_cl))

accuracy score: 0.5918367346938775
[[ 6 13]
 [ 7 23]]
              precision    recall  f1-score   support

           0       0.46      0.32      0.38        19
           1       0.64      0.77      0.70        30

    accuracy                           0.59        49
   macro avg       0.55      0.54      0.54        49
weighted avg       0.57      0.59      0.57        49
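
The figures in the report can be recomputed from the confusion matrix alone, which helps when reading it. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative. It also computes the accuracy of always predicting the majority class, a simple reference point that is not part of the original analysis.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted)
cm = np.array([[6, 13],
               [7, 23]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually in class
f1 = 2 * precision * recall / (precision + recall)

# Reference point: always predict the most frequent true class
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(float(accuracy), 4))           # 0.5918
print(np.round(precision, 2).tolist())     # [0.46, 0.64]
print(np.round(recall, 2).tolist())        # [0.32, 0.77]
print(round(float(majority_baseline), 4))  # 0.6122
```

On this split the forest's accuracy (about 0.59) is slightly below the majority-class reference (about 0.61), which suggests the available features carry little information about `sex`.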



### Using Random Forest for Regression 


In [30]:
# Split the data into X and y for regression (target: tip)
X = df.drop('tip', axis=1)
y = df['tip']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

parameter,value
n_estimators,100
criterion,'squared_error'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,1.0
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [31]:
y_pred = model.predict(X_test)
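
For regression, the forest's prediction is the average of its trees' predictions. The sketch below checks this on synthetic data from `make_regression` (an assumption for the example, as are the variable names), using the fitted trees exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for this sketch)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_reg, y_reg)

# Predict with every individual tree, then average across trees
per_tree = np.stack([tree.predict(X_reg[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The hand-computed average matches the forest's own prediction
same = np.allclose(manual_mean, forest.predict(X_reg[:5]))
print(same)  # True
```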

In [32]:
# Evaluate the model
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('R^2 Score:', r2_score(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

Mean Squared Error: 0.9625189093877565
Mean Absolute Error: 0.7745428571428574
R^2 Score: 0.2299672204264097
Root Mean Squared Error: 0.9810804805864586
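
To show how these four metrics relate to each other, the sketch below computes them by hand with NumPy. The two small arrays are hypothetical values chosen only to keep the arithmetic simple; they are not taken from the tips data.

```python
import numpy as np

# Hypothetical true and predicted tips, chosen to keep the arithmetic simple
y_true_demo = np.array([3.0, 2.0, 4.0, 1.5])
y_pred_demo = np.array([2.5, 2.0, 3.0, 2.0])

errors = y_true_demo - y_pred_demo
mse = np.mean(errors ** 2)     # mean squared error
mae = np.mean(np.abs(errors))  # mean absolute error
rmse = np.sqrt(mse)            # root of the MSE, in the same units as the target

# R^2 compares the model's squared error with that of predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(float(mse), float(mae))  # 0.375 0.5
print("RMSE:", float(rmse))
print("R^2:", float(r2))
```

Read the same way, the results above say that the forest's tip predictions are off by roughly 0.98 (RMSE) in the units of `tip`, and that the model explains about 23% of the variance in the test-set tips.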
