<div style="background-color: black; color: white; padding: 10px;">
<h1>Exploring Random Forest Performance with ROC Curve Visualization and Parameter Tuning</h1>
</div>


In [None]:
%pip install numpy pandas matplotlib scikit-learn ipywidgets


In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_blobs
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from ipywidgets import interact, FloatSlider, IntSlider, Dropdown, Output
from sklearn.cluster import KMeans


This Python code helps you understand how a machine learning model called Random Forest behaves on synthetic data, that is, generated data that stands in for a real-world dataset. We split the data into two parts: a training set that the model learns from and a test set that shows how well it learned.
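
As a rough sketch of the split described above, the snippet below generates a small imbalanced dataset and holds out 20% of it for testing. The `stratify` argument is an addition that the notebook's own cells do not use; it is assumed here because it keeps the share of the rare class about the same in both parts. The variable names are chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 90% class 0 and 10% class 1
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_classes=2,
                                     weights=[0.9, 0.1], random_state=42)

# Hold out 20% of the rows for testing; stratify (an assumption, not used in
# the notebook's cells) preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}")
print(f"share of class 1 in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```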

The main part of the code draws a plot called an ROC curve. It shows how well the model tells the two classes apart by plotting the true positive rate against the false positive rate at every decision threshold. For example, in a medical test, it shows how many sick people the test finds compared with how many healthy people it wrongly flags as sick.
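
To make the two axes of the ROC curve concrete, here is a small hand-made example. The labels and scores are invented for illustration, and `rates_at` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Invented example: 1 = sick, 0 = healthy, with a model score for each person
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

def rates_at(threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    predicted_sick = scores >= threshold
    tp = np.sum(predicted_sick & (y_true == 1))  # sick people correctly flagged
    fp = np.sum(predicted_sick & (y_true == 0))  # healthy people wrongly flagged
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

for t in (0.25, 0.5, 0.75):
    tpr, fpr = rates_at(t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each threshold gives one point on the curve. Lowering the threshold catches more sick people but also flags more healthy ones, and sweeping it from high to low traces out the whole ROC curve.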

The code also lets you adjust two settings:

Class Weight: This tells the model how much each class counts during training, which matters when one class has far more examples than the other. For instance, if the data has many more healthy people than sick ones, the model might predict that everyone is healthy. Giving the sick class more weight counters that. In the cell below, the slider sets the weight of class 1 (the rare class) and class 0 gets 1 minus that value, so 0.5 weights both classes equally and smaller values weight the rare class less.
Number of Trees (n_estimators): This is how many decision trees the Random Forest combines. More trees usually make the predictions more stable, but training and prediction take longer.

The sliders let you change these settings, and the plot updates to show how each change affects the ROC curve. It works like a playground where you can try different settings and see how they influence the model's performance.
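
To see how class weights can counteract imbalance, the sketch below asks scikit-learn for its "balanced" weights on an invented label vector of 90 healthy and 10 sick cases. The label vector and variable names are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 90 healthy (0) and 10 sick (1)
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" assigns each class n_samples / (n_classes * count_of_that_class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)
balanced_weights = {0: weights[0], 1: weights[1]}
print(balanced_weights)
```

Here the rare class ends up with nine times the weight of the common class, so a mistake on a sick case costs the model as much as nine mistakes on healthy cases.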

In [25]:

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

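def visualize_model(class_weight=0.1, n_estimators=100):
    # Create and train the model. class_weight is the weight of class 1 (the
    # rare class); class 0 gets 1 - class_weight, so values below 0.5 weight
    # the rare class less than the common class
    model = RandomForestClassifier(n_estimators=n_estimators, class_weight={0: 1-class_weight, 1: class_weight}, random_state=42)
    model.fit(X_train, y_train)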

    # Predict probabilities
    y_probs = model.predict_proba(X_test)[:, 1]

    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)

    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.title('Receiver Operating Characteristic', fontsize=16)
    plt.legend(loc='lower right', fontsize=12)
    plt.grid(True)
    plt.tick_params(axis='both', which='major', labelsize=12)
    plt.show()

# Create interactive sliders for parameters
class_weight_slider = FloatSlider(min=0.01, max=0.5, step=0.01, value=0.1, description='Class Weight')
n_estimators_slider = IntSlider(min=10, max=200, step=10, value=100, description='n_estimators')

# Display the interactive dashboard
interact(visualize_model, class_weight=class_weight_slider, n_estimators=n_estimators_slider);


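
The same exploration can be done without widgets. The loop below is a minimal sketch that retrains the forest for a few slider positions and records the test AUC. The particular values tried are arbitrary choices, and the data is regenerated under new names so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cell above
X_s, y_s = make_classification(n_samples=1000, n_features=10, n_classes=2,
                               weights=[0.9, 0.1], random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42)

results = {}
for w in (0.1, 0.5):        # weight of class 1; class 0 gets 1 - w
    for n in (10, 100):     # number of trees
        model = RandomForestClassifier(n_estimators=n,
                                       class_weight={0: 1 - w, 1: w},
                                       random_state=42)
        model.fit(X_s_train, y_s_train)
        probs = model.predict_proba(X_s_test)[:, 1]
        # roc_auc_score gives the same area as auc(fpr, tpr) from roc_curve
        results[(w, n)] = roc_auc_score(y_s_test, probs)

for (w, n), score in results.items():
    print(f"class_weight={w:.2f}  n_estimators={n:3d}  test AUC={score:.3f}")
```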

In [26]:

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def visualize_model(**kwargs):
    # Create a Random Forest classifier
    rf = RandomForestClassifier(random_state=42, **kwargs)

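    # param_grid is empty, so no search over alternatives happens here:
    # GridSearchCV scores the single configuration chosen in the dropdowns
    # with 5-fold cross-validation, then refits it on the full training set
    grid_search = GridSearchCV(estimator=rf, param_grid={}, cv=5, scoring='roc_auc')
    grid_search.fit(X_train, y_train)

    # Get the refitted model and its parameters
    best_model = grid_search.best_estimator_
    best_params = best_model.get_params()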

    # Evaluate the best model
    y_probs = best_model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)

    # Plot the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.title('Receiver Operating Characteristic', fontsize=16)
    plt.legend(loc='lower right', fontsize=12)
    plt.grid(True)
    plt.tick_params(axis='both', which='major', labelsize=12)
    plt.show()

# Create interactive dropdowns for hyperparameters
n_estimators_dropdown = Dropdown(options=[10, 50, 100, 200], value=100, description='n_estimators')
max_depth_dropdown = Dropdown(options=[None, 5, 10, 20], value=None, description='max_depth')
min_samples_split_dropdown = Dropdown(options=[2, 5, 10], value=2, description='min_samples_split')
min_samples_leaf_dropdown = Dropdown(options=[1, 2, 4], value=1, description='min_samples_leaf')
class_weight_dropdown = Dropdown(options=['balanced', None], value=None, description='class_weight')

# Display the interactive dashboard
interact(visualize_model,
         n_estimators=n_estimators_dropdown,
         max_depth=max_depth_dropdown,
         min_samples_split=min_samples_split_dropdown,
         min_samples_leaf=min_samples_leaf_dropdown,
         class_weight=class_weight_dropdown);


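
The cell above passes an empty `param_grid`, so `GridSearchCV` only cross-validates the one configuration picked in the dropdowns. For comparison, here is a minimal sketch of a search over several candidates at once. The grid values and the use of 3 folds are assumptions chosen to keep the run short, and the data is regenerated under new names so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_g, y_g = make_classification(n_samples=1000, n_features=10, n_classes=2,
                               weights=[0.9, 0.1], random_state=42)
X_g_train, X_g_test, y_g_train, y_g_test = train_test_split(
    X_g, y_g, test_size=0.2, random_state=42)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate configurations
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, cv=3, scoring="roc_auc")
search.fit(X_g_train, y_g_train)

print("best parameters:", search.best_params_)
print(f"mean cross-validated AUC of the best candidate: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is the best candidate refitted on the full training set, so it can be evaluated on the test set in the same way as `best_model` in the cell above.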

The code below generates synthetic data with two features using the make_blobs function from scikit-learn. It then clusters the data with KMeans and plots the data points, colored by cluster, together with the cluster centroids. A dropdown menu sets the number of clusters, so you can explore different clustering configurations interactively.

In [27]:

# Generate synthetic data with 2 features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

def visualize_clusters(n_clusters=4):
    # Create a KMeans clustering model
    # n_init is set explicitly because its default differs between scikit-learn versions
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(X)

    # Plot data points and cluster centroids
    plt.figure(figsize=(10, 7))
    plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='viridis')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', c='red', s=200, label='Centroids')
    plt.title(f'KMeans Clustering with {n_clusters} Clusters')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True)
    plt.show()

# Create interactive dropdown for number of clusters
n_clusters_dropdown = Dropdown(options=[2, 3, 4, 5, 6], value=4, description='n_clusters')

# Display the interactive dashboard
interact(visualize_clusters, n_clusters=n_clusters_dropdown);


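
Besides judging the plot by eye, each choice of `n_clusters` can be given a number. The sketch below uses the silhouette score for this, which is a swapped-in technique that the notebook itself does not use; higher values mean tighter, better separated clusters. The data is regenerated under a new name so the block runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic blobs as in the cell above
X_b, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

scores_by_k = {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_b)
    # Silhouette score ranges from -1 (poor) to 1 (well separated)
    scores_by_k[k] = silhouette_score(X_b, labels)

for k, score in scores_by_k.items():
    print(f"n_clusters={k}  silhouette={score:.3f}")
```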