<h1>Fine-Tuning Random Forest Model</h1>
<p>Author: HendryHB</p>

<p>There are several ways to fine-tune a Random Forest classifier to improve its performance. Fine-tuning means adjusting hyperparameters to find the best configuration for your specific dataset and problem. This notebook covers the hyperparameters that are most commonly adjusted and the methods used to tune them. Before starting with hyperparameter tuning, however, it helps to know a little about what a Random Forest is.</p>

<h3>Random (Decision) Forest</h3>

<p>
    Random Forest is an ensemble learning method primarily used for classification and regression tasks. It was developed by Tin Kam Ho in 1995. This technique combines multiple decision trees to produce a more accurate and stable prediction compared to individual decision trees.[1]

<strong>Random Forest Concept</strong>
<ol>
    <li>Dataset Splitting<br>
    The Random Forest algorithm starts by creating multiple subsets (bootstrapped samples) of the original dataset. Each subset is generated by randomly sampling the data with replacement.</li>
    <li>Building Decision Trees<br>
    A decision tree is built for each bootstrapped sample. Unlike traditional decision trees, Random Forest introduces two key modifications:
    <ol><li><strong>Random Feature Selection:</strong> At each node of the tree, a random subset of features is selected. The best feature for splitting the data is chosen from this subset, not from the full feature set.</li>
        <li><strong>Bootstrapped Samples:</strong> Each tree is built using a different subset of the data, ensuring diversity among the trees.</li></ol></li>
    <li>Tree Independence<br>
    Each tree in the forest is grown independently and without pruning. This independence helps capture different patterns in the data, leading to a more robust model.</li>
    <li>Aggregating Results
        <ul><li>For classification tasks, each tree in the forest votes for a class, and the class with the majority of votes is selected as the final prediction.</li>
            <li>For regression tasks, the average of the outputs from all trees is computed to produce the final prediction.</li></ul>
    [2]</li>
</ol>

<img src="https://github.com/hendryhb/kecakbali/blob/main/rf_1.png?raw=true">

<strong>Bibliography:</strong><br>
[1] Tin Kam Ho, “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, Montreal, Que., Canada: IEEE Comput. Soc. Press, 1995, pp. 278–282. doi: 10.1109/ICDAR.1995.598994.<br>
[2] S. Raschka and V. Mirjalili, Python machine learning: machine learning and deep learning with Python, scikit-learn, and TensorFlow 2, Third edition. in Expert insight. Birmingham Mumbai: Packt, 2019.
</p>
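<p>The conceptual steps above can be sketched by hand with plain decision trees. This is a minimal illustration on synthetic data, not the implementation used later in this notebook: the dataset, <code>n_trees</code>, and all variable names are assumptions made for the example. Note that scikit-learn's own <code>RandomForestClassifier</code> averages class probabilities across trees rather than counting hard votes, so the majority vote below follows the textbook description.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic 3-class data standing in for a real dataset (assumption)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

n_trees = 25
trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample, drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: an unpruned tree that considers a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: majority vote over the trees for every sample
votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("Training accuracy of the hand-rolled forest:", (forest_pred == y).mean())
```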

In [2]:
from google.colab import drive  # for google colab
import zipfile  # for google colab
import sys
import os
import numpy as np
from PIL import Image
import pandas as pd
import tensorflow as tf
import itertools
from tensorflow.keras.preprocessing.image import ImageDataGenerator

from keras.utils import to_categorical
from keras.preprocessing.image import array_to_img

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV

np.random.seed(42)

In [104]:
drive.mount('/content/drive')  # for google colab

In [None]:
with zipfile.ZipFile("/content/drive/MyDrive/raw_batik_v2.1.zip") as zip_ref:  # for colab
  zip_ref.extractall("./")

# Constants

In [2]:
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
COLOR_CHANNELS = 3
BATCH_SIZE = 32

# Data Gathering

In [3]:
# Directory containing the extracted dataset.
# Assumption: the archive unpacks to a folder with this name; adjust if yours differs.
DATA_DIR = "/content/raw_batik_v2.1"

common_datagen = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    fill_mode='constant',
    rescale=1./255 
)

# Set up data generators for training, validation, and testing
train_generator = common_datagen.flow_from_directory(
    directory=os.path.join(DATA_DIR, "train"),
    target_size=(IMAGE_HEIGHT, IMAGE_WIDTH),  # Set the target image size
    batch_size=BATCH_SIZE,
    color_mode='rgb', # Set color mode to RGB
    class_mode='categorical'  # For multi classes
)

test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
    directory=os.path.join(DATA_DIR, "test"),
    target_size=(IMAGE_HEIGHT, IMAGE_WIDTH),
    batch_size=BATCH_SIZE,
    color_mode='rgb', # Set color mode to RGB
    class_mode='categorical'
)

Found 640 images belonging to 20 classes.
Found 160 images belonging to 20 classes.


# Data Exploration

## Train Dataset

<p><code>.n</code> is the generator attribute holding the total number of images found, and <code>.num_classes</code> is the number of classes (one per subdirectory).</p>

In [4]:
x_train_images = train_generator.n
y_train_classes = train_generator.num_classes
print(f"Number of images: {train_generator.n}, Number of classes:{train_generator.num_classes}")

Number of images: 640, Number of classes:20


In [5]:
x_train_all = []
y_train_all = []

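# Collect every batch once. Floor division covers all images only because 640 is a
# multiple of BATCH_SIZE; otherwise add 1 (or use len(train_generator)).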
for _ in range(x_train_images // BATCH_SIZE ):
    x_batch, y_batch = next(train_generator)
    x_train_all.append(x_batch)
    y_train_all.append(y_batch)

# Concatenate
x_train = np.concatenate(x_train_all, axis=0)
y_train = np.concatenate(y_train_all, axis=0)

# Display the shapes of x and y_train
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)

x_train shape: (640, 224, 224, 3)
y_train shape: (640, 20)
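<p>The loop above reads exactly <code>n // BATCH_SIZE</code> batches, which is complete only when the image count divides evenly by the batch size. A small sketch of the difference between floor and ceiling division follows; <code>num_batches</code> is a hypothetical helper introduced for illustration, and the count of 650 is a made-up example.</p>

```python
import math


def num_batches(n_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once (last batch may be smaller)."""
    return math.ceil(n_images / batch_size)


print(num_batches(640, 32))  # 20, same as 640 // 32 because 640 is a multiple of 32
print(num_batches(650, 32))  # 21, the last batch holds the remaining 10 images
print(650 // 32)             # 20, floor division would silently drop those 10 images
```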


## Test Dataset

<p><code>.n</code> is the generator attribute holding the total number of images found, and <code>.num_classes</code> is the number of classes (one per subdirectory).</p>

In [8]:
x_test_images = test_generator.n
y_test_classes = test_generator.num_classes
print(f"Number of images: {test_generator.n}, Number of classes:{test_generator.num_classes}")

Number of images: 160, Number of classes:20


In [9]:
x_test_all = []
y_test_all = []

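# Collect every batch once. Floor division covers all images only because 160 is a
# multiple of BATCH_SIZE; otherwise add 1 (or use len(test_generator)).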
for _ in range(x_test_images // BATCH_SIZE):
    x_test_batch, y_test_batch = next(test_generator)
    x_test_all.append(x_test_batch)
    y_test_all.append(y_test_batch)

# Concatenate
x_test = np.concatenate(x_test_all, axis=0)
y_test = np.concatenate(y_test_all, axis=0)

# Display the shapes of x_test and y_test
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)

x_test shape: (160, 224, 224, 3)
y_test shape: (160, 20)


<h1>Flattening the images for RandomForestClassifier</h1>
<p>
    <ul>
        <li><code>x_train</code> and <code>x_test</code> are flattened because <code>RandomForestClassifier</code> in scikit-learn expects 2D input of shape (samples, features).</li>
        <li>The labels <code>y_train</code> and <code>y_test</code> are converted from one-hot encoded format to class indices using <code>np.argmax</code>.</li></ul>
</p>

In [12]:
x_train_flat = x_train.reshape(x_train.shape[0], -1)
x_test_flat = x_test.reshape(x_test.shape[0], -1)

In [13]:
# Convert one-hot encoded labels to integer labels
y_train_int = np.argmax(y_train, axis=1)
y_test_int = np.argmax(y_test, axis=1)
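<p>To make the two conversions concrete, here is a small sketch on dummy arrays. The array contents and the names <code>images</code>, <code>flat</code>, <code>one_hot</code>, and <code>labels</code> are illustrative assumptions; only the shapes mirror the 224 x 224 RGB images used in this notebook.</p>

```python
import numpy as np

# Four dummy RGB images with the same spatial size as in this notebook
images = np.zeros((4, 224, 224, 3), dtype=np.float32)

# Keep the sample axis and collapse height, width, and channels into one feature axis
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (4, 150528), since 224 * 224 * 3 = 150528

# One-hot rows become the index of their single 1
one_hot = np.array([[0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0]])
labels = np.argmax(one_hot, axis=1)
print(labels)  # [1 0 2 1]
```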

<h1>Hyperparameters for Fine-Tuning</h1>
<ol>
        <li><code>n_estimators</code>: Number of trees in the forest. Increasing this generally improves performance up to a plateau, but also increases computational cost.</li>
        <li><code>max_depth</code>: Maximum depth of each tree. Limiting this can help prevent overfitting.</li>
        <li><code>min_samples_split</code>: Minimum number of samples required to split an internal node. Higher values help reduce overfitting.</li>
        <li><code>min_samples_leaf</code>: Minimum number of samples required to be at a leaf node. Useful for smoothing the model.</li>
        <li><code>max_features</code>: Number of features to consider when looking for the best split. Can be an integer, a float (fraction of total features), <code>'sqrt'</code>, <code>'log2'</code>, or <code>None</code> (all features). The older <code>'auto'</code> option was removed in scikit-learn 1.3.</li>
        <li><code>bootstrap</code>: Whether bootstrap samples are used when building trees. Default is True.</li>
        <li><code>criterion</code>: Function to measure the quality of a split. Options include 'gini' for the Gini impurity and 'entropy' (or the equivalent 'log_loss') for information gain.</li>
</ol>
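<p>The hyperparameters listed above map directly onto constructor arguments. The sketch below spells them out explicitly on a small synthetic dataset (an assumption for illustration); the values shown are the defaults documented for recent scikit-learn releases, so they are a starting point rather than a recommendation.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for real features (assumption)
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=None,        # grow each tree until its leaves are pure
    min_samples_split=2,   # minimum samples needed to split a node
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # sample the training data with replacement
    criterion="gini",      # split quality measure
    random_state=42,       # reproducible results
)
model.fit(X, y)

print("Trees in the forest:", len(model.estimators_))
```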
    <p>Details on all <code>RandomForestClassifier</code> parameters can be found in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">scikit-learn documentation</a>.</p>
    <h3>Methods for Fine-Tuning</h3>
    <ol>
        <li>Grid Search Cross-Validation (GridSearchCV)</li>
        <li>Random Search Cross-Validation (RandomizedSearchCV)</li>
    </ol>
    <p>Below is a sample of fine-tuning with Grid Search Cross-Validation.</p>
    

In [92]:
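n_estimators = [64, 100, 128, 200]
max_features = [32, 40, 48, 64, 70]
bootstrap = [True, False]
# Note: oob_score=True is only valid with bootstrap=True, so the
# bootstrap=False / oob_score=True combinations fail during the search.
oob_score = [True, False]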

In [93]:
params_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'bootstrap': bootstrap,
    'oob_score': oob_score,
}

In [94]:
rfc = RandomForestClassifier()
# GridSearchCV uses 5-fold cross-validation by default
grid = GridSearchCV(rfc, params_grid)

In [95]:
grid.fit(x_train_flat, y_train_int)

100 fits failed out of a total of 400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/miniconda3/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 434, in fit
    raise ValueError("Out of bag estimation only available if bootstrap=True")
ValueError: Out of bag estimation only available if bootstrap=True

 0.2296875 0.175     0.1703125 0.1875    0.18125   0.1921875 0.1765625
 0.1953125 0.209375  0.1671875 0.1765625 0.1890625 0.1890625 0.2
 0.1765625 0.1921875 0.22343
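<p>The 100 failed fits reported above come from the 20 grid points that pair <code>bootstrap=False</code> with <code>oob_score=True</code>, each tried on 5 folds. One way to avoid them is to pass <code>GridSearchCV</code> a list of grids so that invalid pairs are never generated. The sketch below uses a small synthetic dataset and reduced value lists (both assumptions, chosen so it runs quickly); it is not the search that produced the results in this notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the flattened images (assumption)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

shared = {"n_estimators": [30, 60], "max_features": [4, 8]}

# Two sub-grids: out-of-bag scoring is only explored where bootstrap=True
param_grid = [
    {**shared, "bootstrap": [True], "oob_score": [True, False]},
    {**shared, "bootstrap": [False], "oob_score": [False]},
]

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    error_score="raise",  # surface any remaining invalid combination immediately
)
grid.fit(X, y)

print(len(grid.cv_results_["params"]))  # 12 candidates: 8 with bootstrap, 4 without
print(grid.best_params_)
```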

In [96]:
grid.best_params_

{'bootstrap': False,
 'max_features': 64,
 'n_estimators': 200,
 'oob_score': False}
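<p>The second method listed earlier, <code>RandomizedSearchCV</code>, samples a fixed number of candidates from parameter distributions instead of trying every combination. A minimal sketch follows, again on synthetic data; the ranges, <code>n_iter</code>, and <code>cv</code> values are illustrative assumptions, not tuned settings for the batik dataset.</p>

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the flattened images (assumption)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Integer ranges are sampled uniformly; lists are sampled as discrete choices
param_distributions = {
    "n_estimators": randint(20, 80),
    "max_features": randint(2, 12),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=8,         # number of sampled candidates
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

<p>Because the number of fits is fixed by <code>n_iter</code> rather than by the grid size, this approach tends to scale better when many hyperparameters are searched at once.</p>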

<h3>Summary</h3>
<p>Systematically exploring and fine-tuning these parameters can improve the performance of a Random Forest model. The selected values are then passed to the model constructor, for example <code>model = RandomForestClassifier(n_estimators=200, max_features=64, random_state=42)</code>, as in the <a href="https://github.com/hendryhb/kecakbali/blob/main/random_forest_classification.ipynb">random_forest_classification</a> notebook. Note that the search above also selected <code>bootstrap=False</code>, which this call leaves at its default of <code>True</code>.</p>
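<p>One way to check whether tuned values actually help is to score the resulting model on held-out data. The sketch below does this on a synthetic dataset (an assumption, sized so that <code>max_features=64</code> is valid); with the real data, the flattened test images and integer test labels prepared earlier would take the place of the synthetic split.</p>

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data with 80 features, so max_features=64 is allowed (assumption)
X, y = make_classification(
    n_samples=400, n_features=80, n_informative=20, n_classes=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Parameter values taken from the grid search result reported in this notebook
model = RandomForestClassifier(
    n_estimators=200, max_features=64, bootstrap=False, random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))
```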