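A random forest classifier is an ensemble learning method that combines many decision trees to make more accurate and more stable predictions than a single tree. Each tree is trained on a different random sample of the training data, usually considering only a random subset of the features at each split. The final prediction combines the trees' outputs, typically by majority vote or by averaging the trees' votes (or predicted class probabilities) and picking the most supported class.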

Here's a summary of the steps involved in training and using a random forest classifier in Python:

- 1. Initialize the random forest classifier with the desired hyperparameters, such as the number of trees (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of samples required to split a node (min_samples_split).

- 2. For each tree, draw a random sample of the training data (usually a bootstrap sample, drawn with replacement and the same size as the training set) and train a decision tree on it.

- 3. For each tree, make predictions on the test data and store the predictions.

- 4. Combine the predictions of the individual trees into the final prediction. For binary 0/1 labels, the average of the trees' predictions is the fraction of trees voting for class 1, and thresholding that fraction at 0.5 gives the majority vote.

- 5. Evaluate the performance of the random forest classifier using metrics such as accuracy, precision, recall, and F1-score.
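
As a point of reference, the five steps above can be sketched with scikit-learn's built-in `sklearn.ensemble.RandomForestClassifier`. This is only an illustration under stated assumptions: the synthetic data from `make_classification`, the hyperparameter values, and the names `SklearnRandomForest`, `X_demo`, `forest`, and `demo_scores` are chosen for this sketch and are not part of the notebook. The import is aliased so it does not clash with a class of the same name defined elsewhere.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SklearnRandomForest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption made for this sketch)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 1: initialize the forest with the chosen hyperparameters
forest = SklearnRandomForest(
    n_estimators=50, max_depth=5, min_samples_split=2, random_state=0
)

# Step 2: fit; the library draws a bootstrap sample for each tree by default
forest.fit(X_tr, y_tr)

# Steps 3-4: predict; the library combines the per-tree outputs internally
demo_pred = forest.predict(X_te)

# Step 5: evaluate with the usual classification metrics
demo_scores = {
    "accuracy": accuracy_score(y_te, demo_pred),
    "precision": precision_score(y_te, demo_pred),
    "recall": recall_score(y_te, demo_pred),
    "f1": f1_score(y_te, demo_pred),
}
for name, value in demo_scores.items():
    print(f"{name}: {value:.3f}")
```

The library version returns hard class labels directly, so no thresholding step is needed before computing the metrics.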



In [4]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomForestClassifier:
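    # max_features='sqrt' replaces 'auto', which recent scikit-learn releases
    # no longer accept for DecisionTreeClassifier
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', random_state=0):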
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state
        self.trees = []
    
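    def fit(self, X, y):
        # use a local generator so fitting does not reseed NumPy's global state
        rng = np.random.RandomState(self.random_state)
        # start from an empty ensemble so repeated calls to fit do not accumulate trees
        self.trees = []

        # create decision trees
        # simplification: every tree is fitted on the full training set (no
        # bootstrap sampling), so the trees differ only through max_features
        # and their individual random seeds
        for _ in range(self.n_estimators):
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                max_features=self.max_features,
                random_state=rng.randint(0, 1000)
            )
            tree.fit(X, y)
            self.trees.append(tree)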
    
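    def predict(self, X):
        # collect each tree's prediction: shape (n_estimators, n_samples)
        predictions = [tree.predict(X) for tree in self.trees]
        # for 0/1 labels this is the fraction of trees voting for class 1
        # (a float in [0, 1]), not a hard label; threshold it, e.g. at 0.5,
        # to obtain the majority vote
        return np.mean(predictions, axis=0)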
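
The class above fits every tree on the same data. A minimal sketch of the bootstrap variant, in which each tree sees a sample drawn with replacement and the ensemble returns a majority vote, might look like the following. The helper names `fit_bootstrap_trees` and `predict_majority`, the synthetic data, and the hyperparameter values are hypothetical and chosen for this sketch; it also assumes binary labels encoded as 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bootstrap_trees(X, y, n_estimators=25, max_depth=None, random_state=0):
    """Fit each tree on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        # draw n_samples row indices with replacement
        idx = rng.randint(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",
            random_state=rng.randint(0, 1000),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Majority vote for 0/1 labels; ties at exactly 0.5 go to class 0."""
    votes = np.array([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic data, an assumption made only to exercise the helpers
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)
toy_trees = fit_bootstrap_trees(X_toy, y_toy, n_estimators=25, random_state=1)
toy_pred = predict_majority(toy_trees, X_toy)
print("Training accuracy:", np.mean(toy_pred == y_toy))
```

Because a bootstrap sample leaves out roughly a third of the rows on average, the trees disagree more with one another than trees fitted on identical data, which is the source of the variance reduction that averaging provides.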

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load the breast cancer dataset from scikit-learn
data = load_breast_cancer()
X = data['data']
y = data['target']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [5]:
# create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)

# fit the classifier to the training data
clf.fit(X_train, y_train)

# make predictions on the test data
y_pred = clf.predict(X_test)

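# compare the averaged votes with the true labels
# note: y_pred holds the fraction of trees voting for class 1, not hard labels,
# so only samples on which every tree votes for the true class count as correct
# and the value printed below understates the forest's accuracy
accuracy = np.mean(y_pred == y_test)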
print('Accuracy:', accuracy)

Accuracy: 0.46853146853146854
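
The low figure comes from comparing averaged votes (floats) with integer labels. A small worked example, using made-up votes from four hypothetical trees on five samples, shows how the exact-equality check differs from a thresholded majority vote. All values and names here are illustrative assumptions, not data from the notebook.

```python
import numpy as np

# Hypothetical 0/1 votes: one row per tree, one column per sample
votes = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
])
y_true = np.array([1, 0, 1, 1, 0])

# Fraction of trees voting for class 1: [1.0, 0.0, 0.75, 0.75, 0.25]
vote_fraction = votes.mean(axis=0)

# Exact equality only matches the unanimous columns (samples 0 and 1)
naive_accuracy = np.mean(vote_fraction == y_true)

# Thresholding at 0.5 turns the fractions into majority-vote labels
hard_labels = (vote_fraction > 0.5).astype(int)
thresholded_accuracy = np.mean(hard_labels == y_true)

print("Naive accuracy:", naive_accuracy)              # 0.4
print("Thresholded accuracy:", thresholded_accuracy)  # 1.0
```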


In [14]:
from sklearn.metrics import precision_score, recall_score, f1_score

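# convert the averaged votes to hard labels: class 1 only if more than half
# of the trees voted for it (ties at exactly 0.5 go to class 0)
split_value = 0.5
y_pred = (y_pred > split_value).astype(int)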

# calculate precision, recall, and f1-score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Precision:', precision)
print('Recall:', recall)
print('F1-score:', f1)

Precision: 0.9886363636363636
Recall: 0.9666666666666667
F1-score: 0.9775280898876404
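
To make the three metrics concrete, the sketch below derives them by hand from confusion-matrix counts and checks them against scikit-learn's functions. The label arrays are made up for illustration and are not taken from the breast cancer results above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 true positives, 1 false negative,
# 1 false positive, 2 true negatives
y_true_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

manual_precision = tp / (tp + fp)  # of the predicted positives, how many are right
manual_recall = tp / (tp + fn)     # of the actual positives, how many are found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print("Precision:", manual_precision)
print("Recall:", manual_recall)
print("F1-score:", manual_f1)

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(manual_precision, precision_score(y_true_demo, y_pred_demo)))
print(np.isclose(manual_recall, recall_score(y_true_demo, y_pred_demo)))
print(np.isclose(manual_f1, f1_score(y_true_demo, y_pred_demo)))
```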
