## Establishing a Baseline

### Objective

The purpose of this section is to establish a baseline model against which we will benchmark subsequent models. A baseline provides a reference point for judging whether later changes (such as feature engineering, hyperparameter tuning, or a different algorithm) actually deliver meaningful improvements.

For this task, we will use a Random Forest Classifier due to its robustness, ability to handle high-dimensional data, and minimal preprocessing requirements.

### Data Extraction
We first load all the data required. A helper function handles this and also drops the id column, which is not needed for modelling.

In [None]:
from src.loader.data_loader import load_dataset
import pandas as pd

data: pd.DataFrame = load_dataset(filepath="data/ClassifyProducts.csv")


### Data Preparation

We begin by splitting our dataset into two key components:

- `data_features`: Contains all feature columns used as inputs for the model.
- `data_targets`: Contains the target variable we aim to predict — in this case, the product category.

In [None]:
data_features: pd.DataFrame = data.drop(columns=["target"])
data_targets: pd.Series = data["target"]

### Train-Test Split

To evaluate our model's ability to generalize to unseen data, we divide the dataset into a training set and a testing set. We allocate 80% of the data for training and 20% for testing.

We also apply stratification to ensure that the distribution of classes in the target variable remains consistent across both sets. This is particularly important when dealing with potential class imbalances.

In [None]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_targets, test_targets = train_test_split(
    data_features, data_targets, test_size=0.2, random_state=42, stratify=data_targets
)
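To confirm that stratification behaved as described, the class proportions of the two splits can be compared. The snippet below is a minimal sketch: `class_proportions` is a hypothetical helper (not part of the project code), and the toy label lists stand in for the real targets.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_proportions(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the share of each class among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sort by label so that two splits print in the same order
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels for illustration only; they are not taken from the dataset
toy_train_labels = ["A"] * 8 + ["B"] * 2
toy_test_labels = ["A"] * 4 + ["B"] * 1

print(class_proportions(toy_train_labels))
print(class_proportions(toy_test_labels))
```

In the notebook, the same helper could be called as `class_proportions(train_targets)` and `class_proportions(test_targets)`; with `stratify=data_targets`, the two results should be nearly identical.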

### Training the Baseline Model

We initialize a `RandomForestClassifier` with default parameters, setting only the number of trees (`n_estimators`) and a `random_state` for reproducibility.

The first challenge is to choose `n_estimators`. To do so, we plot test set accuracy against the number of trees and look for the point of diminishing returns.

In [None]:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from typing import List

# Define the range of tree counts to evaluate
number_of_trees_list: List[int] = [10, 50, 100, 200, 300, 400, 500, 700, 1000]

# Initialize a list to store accuracy scores (We will draw the curve from these)
test_set_accuracies: List[float] = []

for number_of_trees in number_of_trees_list:
    random_forest_model: RandomForestClassifier = RandomForestClassifier(
        n_estimators=number_of_trees,
        random_state=65
    )

    random_forest_model.fit(train_features, train_targets)

    test_set_predictions = random_forest_model.predict(test_features)

    test_set_accuracy: float = accuracy_score(test_targets, test_set_predictions)
    test_set_accuracies.append(test_set_accuracy)

    print(f"Random Forest with {number_of_trees} trees --> Test Set Accuracy: {test_set_accuracy:.4f}")

# Plot the curve of test set accuracy
plt.figure(figsize=(10, 6))
plt.plot(number_of_trees_list, test_set_accuracies, marker='o')
plt.title("Test Set Accuracy vs Number of Trees in Random Forest")
plt.xlabel("Number of Trees (n_estimators)")
plt.ylabel("Test Set Accuracy")
plt.grid(True)
plt.show()

Interpretation of results:
- A significant improvement in accuracy is observed when increasing the number of trees from 10 to 50, where accuracy rises from 0.7830 to 0.8086.
- Further increases up to 200-300 trees yield marginal gains, peaking at an accuracy of 0.8129.
- Beyond 300 trees, additional trees do not contribute to meaningful improvements in predictive performance. Minor fluctuations in accuracy (within ±0.0015) are attributable to randomness and model variance, despite the stability offered by ensemble methods.
- The curve clearly demonstrates a performance plateau after approximately 300 trees, indicating the point of diminishing returns.
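The plateau can also be located with a simple rule instead of reading it off the plot: pick the smallest tree count whose accuracy lies within a small tolerance of the best accuracy observed. The sketch below is one possible rule, not the method used in this notebook; the function name, the tolerance of 0.002, and the toy numbers are all assumptions made for illustration.

```python
from typing import Sequence


def smallest_count_near_best(
    tree_counts: Sequence[int],
    accuracies: Sequence[float],
    tolerance: float = 0.002,
) -> int:
    """Return the smallest tree count whose accuracy is within
    `tolerance` of the best accuracy in `accuracies`."""
    if not tree_counts or len(tree_counts) != len(accuracies):
        raise ValueError("tree_counts and accuracies must be non-empty and equally long")

    best_accuracy = max(accuracies)
    # Walk through the candidates from fewest to most trees
    for count, accuracy in sorted(zip(tree_counts, accuracies)):
        if best_accuracy - accuracy <= tolerance:
            return count
    # Unreachable: the best accuracy always satisfies the condition
    raise RuntimeError("no candidate found")


# Toy values for illustration only; they are not results from this notebook
toy_counts = [10, 50, 100, 200]
toy_accuracies = [0.70, 0.78, 0.80, 0.801]

chosen_count = smallest_count_near_best(toy_counts, toy_accuracies)
print(chosen_count)
```

Applied to the notebook's own variables, the call would be `smallest_count_near_best(number_of_trees_list, test_set_accuracies)`. The result depends on the tolerance chosen, so it should be read alongside the plot rather than as a replacement for it.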

For the baseline model, we will use 300 trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Initialize and train baseline Random Forest
baseline_model = RandomForestClassifier(n_estimators=300, random_state=65)
baseline_model.fit(train_features, train_targets)

# Predict on test set
test_predictions = baseline_model.predict(test_features)

# Print classification report
print(classification_report(test_targets, test_predictions))

              precision    recall  f1-score   support

     Class_1       0.79      0.43      0.56       386
     Class_2       0.72      0.88      0.80      3224
     Class_3       0.64      0.51      0.57      1601
     Class_4       0.87      0.43      0.58       538
     Class_5       0.97      0.97      0.97       548
     Class_6       0.93      0.94      0.94      2827
     Class_7       0.77      0.59      0.67       568
     Class_8       0.88      0.94      0.91      1693
     Class_9       0.85      0.89      0.87       991

    accuracy                           0.81     12376
   macro avg       0.83      0.73      0.76     12376
weighted avg       0.81      0.81      0.80     12376
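The gap between the macro-average and weighted-average recall (0.73 versus 0.81) follows from how the two are computed: the macro average weights every class equally, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values printed in the report, so the results are approximate.

```python
# Per-class recall and support, copied from the classification report above
# (Class_1 through Class_9, in order)
recall_per_class = [0.43, 0.88, 0.51, 0.43, 0.97, 0.94, 0.59, 0.94, 0.89]
support_per_class = [386, 3224, 1601, 538, 548, 2827, 568, 1693, 991]

total_support = sum(support_per_class)

# Macro average: plain mean over classes, regardless of class size
macro_recall = sum(recall_per_class) / len(recall_per_class)

# Weighted average: each class contributes in proportion to its support
weighted_recall = (
    sum(r * s for r, s in zip(recall_per_class, support_per_class)) / total_support
)

print(f"Total support:   {total_support}")
print(f"Macro recall:    {macro_recall:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
```

The macro figure is pulled down by the smaller classes with low recall (Class_1 and Class_4 at 0.43), whereas the weighted figure is dominated by the large classes such as Class_2 and Class_6.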