# **Random Forest**

## **Random Forest for `Classification`**

#### **1. Import Libraries:**

In [1]:
# import libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#### **2. Import Dataset:**

In [2]:
# load the data
df = sns.load_dataset('tips')
df.head()

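| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |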


In [10]:
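# summarize column dtypes and non-null counts
# note: this cell was executed as In [10], after the encoding step (In [3]),
# so the categorical columns already appear as int32
df.info()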

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    int32  
 3   smoker      244 non-null    int32  
 4   day         244 non-null    int32  
 5   time        244 non-null    int32  
 6   size        244 non-null    int64  
dtypes: float64(2), int32(4), int64(1)
memory usage: 9.7 KB


#### **3. Encoding Features which are Categorical / Object:**

In [3]:
# label-encode every categorical or object column
# (the single encoder is refit per column, so `le` ends up holding only the last column's mapping)
le = LabelEncoder()
for i in df.columns:
    if df[i].dtype == 'object' or df[i].dtype == 'category':
        df[i] = le.fit_transform(df[i])
df.head()

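| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |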
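
Because the loop above refits one `LabelEncoder` for every column, only the last column's mapping is still available afterwards. A minimal sketch of one way to keep every mapping, using a small hand-made frame in place of the `tips` data; the names `toy`, `encoders` and `mapping` are illustrative and not part of the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# small stand-in for the tips data (illustrative values only)
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
    "size": [2, 3, 3, 4],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in toy.select_dtypes(exclude="number").columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder sorts the labels, so the integer codes follow alphabetical order
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# decode a column back to its original labels
decoded_sex = encoders["sex"].inverse_transform(toy["sex"])
print(list(decoded_sex))
```

Keeping one encoder per column makes it possible to translate predictions such as `0` and `1` back into `Female` and `Male` later on.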


#### **4. Splitting the Data into Features(X) and Target(y) for `Classification`:**

In [4]:
# split the data into X and y for classification
X = df.drop('sex', axis = 1)
y = df['sex']

#### **5. Train & Test Split:**

In [5]:
# Train Test Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
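
With 244 rows and `test_size = 0.2`, scikit-learn rounds the test set up to 49 rows, which matches the total support reported in the evaluation step. The sketch below checks this on synthetic stand-in data and also shows the `stratify` option, which the cell above does not use but which keeps the class ratio the same in both splits. The array names and the 160/84 class split are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(244, 3))        # 244 rows, like the tips data
y_demo = np.array([1] * 160 + [0] * 84)   # imbalanced binary target (illustrative)

# plain split with the same settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# stratified split: class proportions are preserved in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_demo.mean(), y_tr_s.mean(), y_te_s.mean())
```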

#### **6. Build the Model:**

In [6]:
# Create and train the model
model_cl = RandomForestClassifier(n_estimators=200, random_state=42) # n_estimators is the number of trees in the forest
model_cl.fit(X_train, y_train)
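
As the comment above notes, `n_estimators` sets the number of trees. The following sketch, on synthetic data rather than the tips data, shows that a fitted forest holds that many trees in `estimators_` and that its class probabilities are the average of the individual trees' probabilities. `X_demo`, `y_demo` and `forest` are illustrative names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# one fitted decision tree per estimator
print(len(forest.estimators_))

# average the per-tree class probabilities by hand
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# compare the manual average with the forest's own probabilities
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```

Averaging many trees, each grown on a bootstrap sample with random feature subsets, is what reduces the variance of a single decision tree.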

#### **7. Predict with the Model:**

In [7]:
# Predict labels for the test set
y_pred = model_cl.predict(X_test)

#### **8. Model Evaluation:**

In [8]:
# Evaluate the model
print('accuracy score: ', accuracy_score(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('classification report:\n', classification_report(y_test, y_pred))

accuracy score:  0.6122448979591837
confusion matrix:
 [[ 7 12]
 [ 7 23]]
classification report:
               precision    recall  f1-score   support

           0       0.50      0.37      0.42        19
           1       0.66      0.77      0.71        30

    accuracy                           0.61        49
   macro avg       0.58      0.57      0.57        49
weighted avg       0.60      0.61      0.60        49



#### **Observations from the Output:**
| Metric | Class 0 | Class 1 | Overall | Description |
| --- | --- | --- | --- | --- |
| Precision | 0.50 | 0.66 | - | Of all the instances the model predicted as a given class, the fraction that actually belong to it |
| Recall | 0.37 | 0.77 | - | Of all the actual instances of a given class, the fraction the model correctly identified |
| F1-Score | 0.42 | 0.71 | - | The harmonic mean of Precision and Recall, which balances the two in a single number |
| Support | 19 | 30 | 49 | The number of actual occurrences of the class in the specified dataset |
| Accuracy | - | - | 0.61 | The proportion of the total number of predictions that were correct |
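
The per-class values in the table above follow directly from the confusion matrix `[[7, 12], [7, 23]]`. As a worked check, this short sketch recomputes them with NumPy, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes:

```python
import numpy as np

# confusion matrix from the output above: rows = actual class, columns = predicted class
cm = np.array([[7, 12],
               [7, 23]])

true_pos = np.diag(cm)                 # correctly predicted count per class
precision = true_pos / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)     # divide by row totals (actual counts = support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```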

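In `summary`, the model performs better on Class 1 (Male, since `LabelEncoder` assigns codes in alphabetical order) than on Class 0 (Female): recall is 0.77 versus 0.37. The overall accuracy of 61% is weak rather than moderate, because 30 of the 49 test rows belong to Class 1, so always predicting Class 1 would also score 30/49 ≈ 0.61. Most of the error comes from the 12 Class 0 rows that were classified as Class 1. Whether this is acceptable depends on the context and the cost of misclassification, but there is clear room for improvement, especially in identifying Class 0. Techniques worth trying include adjusting the class weights, resampling the dataset, tuning the model's hyperparameters, or trying different algorithms.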
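
One way to put an accuracy figure in context is to compare it with a majority-class baseline, and one of the remedies mentioned above is class weighting. The sketch below does both on synthetic imbalanced data, so its numbers say nothing about the tips model. `DummyClassifier` and `class_weight="balanced"` are standard scikit-learn options; the data and variable names are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic binary problem with roughly a 35/65 class split
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.35, 0.65], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# forest with class weights inversely proportional to class frequencies
weighted = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
weighted.fit(X_tr, y_tr)
pred = weighted.predict(X_te)

print("baseline accuracy:", baseline_acc)
print("forest accuracy:  ", accuracy_score(y_te, pred))
print("recall per class: ", recall_score(y_te, pred, average=None))
```

A model is only informative to the extent that it beats the baseline, and per-class recall shows whether the minority class benefits from the weighting.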

## **Random Forest for `Regression`**

In [9]:
# Use Random Forest for Regression task
X = df.drop('tip', axis = 1)
y = df['tip']

# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

# create and train the model, then predict on the test set
# (no random_state is set, so the metrics below vary slightly between runs)
model_reg = RandomForestRegressor()
model_reg.fit(X_train, y_train)
y_pred = model_reg.predict(X_test)

# evaluate the model
print('mean squared error: ', mean_squared_error(y_test, y_pred))
print('mean absolute error: ', mean_absolute_error(y_test, y_pred))
print('r2 score: ', r2_score(y_test, y_pred))
print('root mean squared error: ', np.sqrt(mean_squared_error(y_test, y_pred)))

mean squared error:  0.9404607542857156
mean absolute error:  0.7677183673469391
r2 score:  0.24761414904238266
root mean squared error:  0.9697735582525003


#### **Interpretation of the Regression Metrics:**

| Metric | Value | Description | Interpretation |
| --- | --- | --- | --- |
| Mean Squared Error (MSE) | 0.940 | The average of the squared differences between the actual and predicted values | Expressed in squared units of `tip`, so it is not directly comparable to a tip amount; it penalizes large errors heavily and is mainly useful for comparing models |
| Mean Absolute Error (MAE) | 0.768 | The average of the absolute differences between the actual and predicted values | On average, a predicted tip is about 0.768 units away from the actual tip |
| R2 Score | 0.248 | The proportion of the variance in the target that is explained by the model | Only about 24.8% of the variance in `tip` on the test set is explained by the model, which is quite low |
| Root Mean Squared Error (RMSE) | 0.970 | The square root of the MSE | Expressed in the same units as `tip`; a typical prediction error is about 0.970 units, with large errors weighted more heavily than in the MAE |
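
The definitions in the table can be checked by computing each metric by hand and comparing it with scikit-learn. The values in `y_true` and `y_hat` below are made up for illustration and do not come from the tips model; the last line only confirms that the RMSE printed above is the square root of the printed MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up actual and predicted values (not from the tips model)
y_true = np.array([3.0, 2.0, 4.5, 1.5, 5.0])
y_hat = np.array([2.5, 2.4, 3.5, 2.0, 4.0])

errors = y_true - y_hat
mse = np.mean(errors ** 2)        # mean of squared errors
mae = np.mean(np.abs(errors))     # mean of absolute errors
rmse = np.sqrt(mse)               # back in the units of the target
# R2 = 1 - (residual sum of squares / total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# compare the hand-computed values with scikit-learn's functions
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))

# RMSE reported above, recomputed from the reported MSE
rmse_from_output = np.sqrt(0.9404607542857156)
print(rmse_from_output)
```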

From these metrics, the model's performance is weak. The tips in the sample rows shown earlier range from roughly 1 to 3.6, so an MAE of 0.77 and an RMSE of 0.97 are large relative to the values being predicted. The R2 score of 0.25 means the model explains only about a quarter of the variance in `tip` on the test set.

There is considerable room for improvement. Feature engineering, hyperparameter tuning, or trying different regression algorithms are reasonable next steps.
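
As a rough illustration of the hyperparameter tuning mentioned above, the sketch below runs a small cross-validated grid search over a `RandomForestRegressor`. It uses synthetic data from `make_regression`, and the grid values are arbitrary choices for the example rather than recommended settings for the tips data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic regression data (stand-in, not the tips data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# small, arbitrary grid; real searches usually cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the negation
)
search.fit(X_tr, y_tr)

print("best parameters:     ", search.best_params_)
print("cross-validated MAE: ", -search.best_score_)
print("test R2:             ", search.best_estimator_.score(X_te, y_te))
```

Fixing `random_state` on the regressor keeps the comparison between parameter settings reproducible from run to run.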