# Random Forest with One-vs-Rest


In this notebook, I build a random forest model adapted for multi-label classification. I chose it because our data appears to be imbalanced and because random forest was one of the best-performing baseline models.

Random Forest can be adapted for multi-label classification using a one-vs-rest (OvR) approach or libraries like scikit-multilearn.

In [20]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV


In [21]:
import os
# Set the working directory
os.chdir(r'/Users/saram/Desktop/Erdos_Institute/project/Data')

## Load and Preprocess the data

In [22]:
# Read train features
mars_data = pd.read_csv("../Data/train_features_new_with_PCA.csv")
mars_data.set_index(mars_data.sample_id, inplace=True)
mars_data

sample_id (index),sample_id,basalt,carbonate,chloride,iron_oxide,oxalate,oxychlorine,phyllosilicate,silicate,sulfate,...,2.12,0.13,1.13,2.13,0.14,1.14,2.14,0.15,1.15,2.15
S0000,S0000,0,0,0,0,0,0,0,0,1,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0001,S0001,0,1,0,0,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0002,S0002,0,0,0,0,0,1,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0003,S0003,0,1,0,1,0,0,0,0,1,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0004,S0004,0,0,0,1,0,1,1,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
S0749,S0749,0,0,0,0,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0750,S0750,0,0,0,0,0,0,1,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0751,S0751,0,0,0,0,0,0,0,1,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0752,S0752,0,0,0,1,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15


In [23]:
print(mars_data.columns)

Index(['sample_id', 'basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate',
       'oxychlorine', 'phyllosilicate', 'silicate', 'sulfate', 'sulfide', '0',
       '1', '2', '0.1', '1.1', '2.1', '0.2', '1.2', '2.2', '0.3', '1.3', '2.3',
       '0.4', '1.4', '2.4', '0.5', '1.5', '2.5', '0.6', '1.6', '2.6', '0.7',
       '1.7', '2.7', '0.8', '1.8', '2.8', '0.9', '1.9', '2.9', '0.10', '1.10',
       '2.10', '0.11', '1.11', '2.11', '0.12', '1.12', '2.12', '0.13', '1.13',
       '2.13', '0.14', '1.14', '2.14', '0.15', '1.15', '2.15'],
      dtype='object')


In [24]:
# Data preprocessing 
# Drop 'sample_id' and separate features and target labels
X = mars_data.drop(columns=['sample_id', 'basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate', 'oxychlorine',
                       'phyllosilicate', 'silicate', 'sulfate', 'sulfide'])
y = mars_data[['basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate', 'oxychlorine',
          'phyllosilicate', 'silicate', 'sulfate', 'sulfide']]

In [25]:
# Ensure we have correct dimensions
print(X.shape)
print(y.shape)

(754, 48)
(754, 10)
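
Since the choice of model is motivated by label imbalance, it can help to look at how often each label occurs. The sketch below uses a small made-up label matrix (`y_toy` is hypothetical and only stands in for the notebook's `y`); the same two lines should apply to any 0/1 label DataFrame.

```python
import pandas as pd

# Toy stand-in for the label matrix `y` (hypothetical values, not the real data).
y_toy = pd.DataFrame(
    {
        "basalt": [1, 0, 0, 0, 1, 0, 0, 0],
        "carbonate": [0, 0, 1, 0, 0, 0, 0, 0],
        "sulfate": [1, 1, 0, 1, 0, 1, 0, 0],
    }
)

# Fraction of samples carrying each label; values far below 0.5 suggest imbalance.
prevalence = y_toy.mean().sort_values()

# Number of positive labels per sample; rows may carry several labels or none.
labels_per_sample = y_toy.sum(axis=1)

print(prevalence)
print(labels_per_sample.value_counts().sort_index())
```

In the notebook itself, replacing `y_toy` with `y` would give the per-label prevalence for the 10 target columns.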


In [26]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data
scalar = StandardScaler()
X_train = scalar.fit_transform(X_train)
X_test = scalar.transform(X_test)

In [27]:
labels = ['basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate', 'oxychlorine',
          'phyllosilicate', 'silicate', 'sulfate', 'sulfide']  # Label columns

## The model

One-vs-Rest Approach:

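scikit-learn's RandomForestClassifier can be fit directly on a multilabel indicator matrix, but it then grows a single forest whose trees predict all labels jointly.
OneVsRestClassifier instead trains one independent Random Forest per label, effectively creating a separate binary classifier for each of the 10 labels.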
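
To make the "one model per label" idea concrete, here is a minimal sketch on synthetic data. The names `X_demo`, `Y_demo`, and `ovr_demo` are illustrative assumptions, and `make_multilabel_classification` merely stands in for the real features and labels.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data (assumption: a stand-in for X_train / y_train).
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=8, n_classes=4, random_state=0
)

# Wrap a small forest so that one binary forest is trained per label column.
ovr_demo = OneVsRestClassifier(RandomForestClassifier(n_estimators=20, random_state=0))
ovr_demo.fit(X_demo, Y_demo)

# `estimators_` holds the fitted per-label classifiers.
n_forests = len(ovr_demo.estimators_)

# Predictions come back as a 0/1 indicator matrix with one column per label.
pred_demo = ovr_demo.predict(X_demo[:5])
print(n_forests, pred_demo.shape)
```

Because each label gets its own forest, training cost grows roughly linearly with the number of labels.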

In [28]:
# Define the Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Wrap with OneVsRestClassifier for multilabel classification
ovr_classifier = OneVsRestClassifier(rf_model)

# Train the model
ovr_classifier.fit(X_train, y_train)

# Make predictions
predictions = ovr_classifier.predict(X_test)

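# Evaluate the model
hamming = hamming_loss(y_test, predictions)          # fraction of individual label cells predicted wrongly
f1 = f1_score(y_test, predictions, average='micro')  # F1 computed over all labels pooled together
accuracy = accuracy_score(y_test, predictions)       # subset accuracy: every label in a row must match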

print("Hamming Loss:", hamming)
print("F1 Score:", f1)
print("Accuracy:", accuracy)


Hamming Loss: 0.07880794701986756
F1 Score: 0.6860158311345647
Accuracy: 0.5231788079470199
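
The gap between a low Hamming loss and a much lower accuracy is expected for multilabel targets, because `accuracy_score` counts a sample as correct only if every label matches. A tiny worked example with made-up arrays (`y_true_toy` and `y_pred_toy` are hypothetical) shows how the three metrics differ:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Two samples, three labels; the prediction misses one label in the first row.
y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0]])
y_pred_toy = np.array([[1, 0, 0],
                       [0, 1, 0]])

# 1 wrong cell out of 6 label cells.
hl = hamming_loss(y_true_toy, y_pred_toy)

# Pooled counts: TP=2, FP=0, FN=1, so precision=1 and recall=2/3.
f1m = f1_score(y_true_toy, y_pred_toy, average="micro")

# Only the second row matches exactly, so 1 of 2 samples is correct.
acc = accuracy_score(y_true_toy, y_pred_toy)

print(round(hl, 4), round(f1m, 4), acc)  # 0.1667 0.8 0.5
```

A single missed label costs one sixth in Hamming loss here but half of the subset accuracy.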


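The grid search in the next cell addresses the forest's hyperparameters through names prefixed with `estimator__`. That prefix comes from scikit-learn's nested-parameter convention: `OneVsRestClassifier` stores the wrapped model in its `estimator` attribute. The sketch below (the variable name `wrapped` is only illustrative) lists those names and sets two of them by hand:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

wrapped = OneVsRestClassifier(RandomForestClassifier(random_state=42))

# Parameters of the inner forest are exposed with an `estimator__` prefix.
forest_params = sorted(k for k in wrapped.get_params() if k.startswith("estimator__"))
print(forest_params[:5])

# set_params accepts the same names; GridSearchCV relies on this mechanism.
wrapped.set_params(estimator__max_depth=10, estimator__n_estimators=50)
print(wrapped.estimator.max_depth, wrapped.estimator.n_estimators)
```

Any key in a `param_grid` for the wrapped model should therefore appear in `wrapped.get_params()`.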
In [32]:
# Define hyperparameter grid
param_grid = {
    'estimator__n_estimators': [50, 100, 150],
    'estimator__max_depth': [10, 20],
    'estimator__min_samples_split': [2, 5, 10],
}

# Wrap the base model for GridSearchCV
grid_search = GridSearchCV(
    estimator=OneVsRestClassifier(RandomForestClassifier(random_state=42)),
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=3
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters and evaluation
print("Best Parameters:", grid_search.best_params_)

# Use the best estimator
best_rf_model = grid_search.best_estimator_
predictions = best_rf_model.predict(X_test)

# Evaluate
hamming = hamming_loss(y_test, predictions)
f1 = f1_score(y_test, predictions, average='micro')
accuracy = accuracy_score(y_test, predictions)

print("Hamming Loss:", hamming)
print("F1 Score:", f1)
print("Accuracy:", accuracy)


Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV 1/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=50;, score=0.398 total time=   1.1s
[CV 2/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=50;, score=0.428 total time=   1.0s
[CV 3/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=50;, score=0.388 total time=   1.0s
[CV 1/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=100;, score=0.398 total time=   2.4s
[CV 2/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=100;, score=0.418 total time=   2.2s
[CV 3/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=100;, score=0.423 total time=   3.3s
[CV 1/3] END estimator__max_depth=10, estimator__min_samples_split=2, estimator__n_estimators=150;, score=0.423 total time=   5.0s
[CV 2/3] END estimator__m