Random Forest Classification with Python and Scikit-Learn

In this project, a Random Forest Classifier is built to predict company bankruptcy. The model is trained with default parameters (apart from a fixed random seed) and evaluated on a held-out test set. The implementation uses Python and Scikit-Learn on a custom dataset (data.csv) prepared from financial metrics.

Table of Contents

Introduction to Random Forest Algorithm

Random Forest Algorithm Intuition

Ensemble Learning and Feature Importance

Problem Statement

Dataset Description

Import Libraries

Import Dataset

Exploratory Data Analysis

Data Preprocessing

Split Data into Separate Training and Test Set


Random Forest Classifier with Default Parameters

Confusion Matrix

Classification Report

Results and Conclusion


1. Introduction to Random Forest Algorithm

The Random Forest algorithm is a popular ensemble machine learning technique that extends decision trees for improved accuracy and robustness. It belongs to the supervised learning category and is widely used for classification tasks, such as predicting company bankruptcy.

2. Random Forest Algorithm Intuition

Random Forest operates by constructing multiple decision trees during training and combining their outputs through majority voting (for classification). Each tree is trained on a bootstrap sample of the rows (drawn with replacement), and each split considers only a random subset of the features. This decorrelates the trees, which reduces overfitting and improves generalization.
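
The idea can be sketched by hand with plain decision trees. This is a minimal illustration, not the notebook's model: the synthetic data from make_classification, the tree count of 25, and all variable names are assumptions made for the example. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch shows the textbook voting scheme only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the bankruptcy dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Majority vote: every tree predicts a class and the most common class wins.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the voted ensemble:", (majority == y).mean())
```

Each individual tree sees a different resample of the rows, so the trees make different mistakes and the vote smooths them out.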

3. Ensemble Learning and Feature Importance

Random Forest leverages ensemble learning by aggregating predictions from multiple trees. It also provides feature importance scores, indicating which features (e.g., financial metrics) most influence the bankruptcy prediction.
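
As a rough illustration of how the importance scores are read, the sketch below fits a forest on synthetic data in which only the first three columns carry signal. The data, the feature names, and the parameter values are assumptions for the example and do not come from data.csv.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in the first 3 columns;
# the remaining 5 columns are pure noise.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

These scores are impurity-based and describe what the fitted trees used for splitting; they do not by themselves show a causal link between a feature and the target.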

4. Problem Statement

The goal is to predict whether a company will go bankrupt based on financial metrics. This project builds a Random Forest Classifier using Python and Scikit-Learn, utilizing a custom dataset (data.csv) derived from financial indicators.

5. Dataset Description

The dataset (data.csv) contains financial metrics such as ROA, Operating Profit Rate, Debt Ratios, and others, used to predict bankruptcy status (Bankrupt?). It was prepared from a collection of company financial data, with structural information removed to focus on key attributes. The target variable is binary (0 for not bankrupt, 1 for bankrupt).

6. Import Libraries

Import necessary Python libraries for data manipulation, visualization, and modeling.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import streamlit as st



7. Import Dataset

Load the dataset from data.csv into a pandas DataFrame.

In [None]:
# Load dataset
df = pd.read_csv('data.csv')
df

8. Exploratory Data Analysis

Perform initial data exploration, including summary statistics, missing value checks, and visualizations. Each plot is saved to a PNG file rather than displayed inline.

In [None]:
# Exploratory Data Analysis (EDA)
eda_results = {}

# 1. Summary statistics (mean, median, mode, etc.)
eda_results['summary_stats'] = df.describe()
eda_results['mode'] = df.mode().iloc[0]

# 2. Data types and unique value counts
eda_results['data_types'] = df.dtypes
eda_results['unique_values'] = df.nunique()

# 3. Missing value analysis
eda_results['missing_values'] = df.isnull().sum()

# 4. Correlation analysis
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='coolwarm', annot=False)
plt.title('Correlation Matrix')
plt.savefig('correlation_matrix.png')
plt.close()

# 5. Distribution of target variable
plt.figure(figsize=(6, 4))
sns.countplot(x='Bankrupt?', data=df)
plt.title('Distribution of Bankruptcy Status')
plt.savefig('bankruptcy_distribution.png')
plt.close()

# 6. Percentage of bankrupt vs. non-bankrupt (pie chart)
plt.figure(figsize=(6, 6))
# sort_index() fixes the order to 0, 1 so the labels always match the slices
df['Bankrupt?'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', labels=['Not Bankrupt', 'Bankrupt'])
plt.title('Percentage of Bankruptcy Status')
plt.savefig('bankruptcy_pie.png')
plt.close()

# 7. Feature distribution (first 5 numerical features, excluding the target)
numerical_cols = df.select_dtypes(include=np.number).columns.drop('Bankrupt?', errors='ignore')[:5]
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.savefig('feature_distributions.png')
plt.close()

# 8. Box plots for outlier detection (top 3 features)
plt.figure(figsize=(15, 5))
for i, col in enumerate(numerical_cols[:3], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.savefig('boxplots.png')
plt.close()

# 9. Pairwise relationships (top 4 numerical features)
# Use only numerical columns for pairplot axes, with Bankrupt? as hue
sns.pairplot(df, vars=numerical_cols[:4], hue='Bankrupt?')
plt.savefig('pairplot.png')
plt.close()

# 10. Feature importance via correlation with target
correlations = df.corr()['Bankrupt?'].sort_values(ascending=False)
eda_results['feature_correlations'] = correlations

# 11. Distribution of top 3 correlated features with target
top_correlated = correlations.index[1:4]  # Exclude Bankrupt? itself
plt.figure(figsize=(15, 5))
for i, col in enumerate(top_correlated, 1):
    plt.subplot(1, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col} (High Correlation)')
plt.tight_layout()
plt.savefig('top_correlated_distributions.png')
plt.close()

# 12. Grouped aggregations by bankruptcy status
eda_results['group_by_bankruptcy'] = df.groupby('Bankrupt?')[numerical_cols].mean()

# 13. Skewness and kurtosis analysis
eda_results['skewness'] = df[numerical_cols].skew()
eda_results['kurtosis'] = df[numerical_cols].kurtosis()

# 14. Variance analysis
eda_results['variance'] = df[numerical_cols].var()

# 15. Violin plots for top 3 features by bankruptcy status
plt.figure(figsize=(15, 5))
for i, col in enumerate(numerical_cols[:3], 1):
    plt.subplot(1, 3, i)
    sns.violinplot(x='Bankrupt?', y=col, data=df)
    plt.title(f'Violin Plot of {col}')
plt.tight_layout()
plt.savefig('violin_plots.png')
plt.close()

# Print EDA results
print("Summary Statistics:\n", eda_results['summary_stats'])
print("\nMode:\n", eda_results['mode'])
print("\nData Types:\n", eda_results['data_types'])
print("\nUnique Values:\n", eda_results['unique_values'])
print("\nMissing Values:\n", eda_results['missing_values'])
print("\nFeature Correlations with Bankrupt?:\n", eda_results['feature_correlations'])
print("\nGroup by Bankruptcy Status:\n", eda_results['group_by_bankruptcy'])
print("\nSkewness:\n", eda_results['skewness'])
print("\nKurtosis:\n", eda_results['kurtosis'])
print("\nVariance:\n", eda_results['variance'])

9. Data Preprocessing and Split Data into Separate Training and Test Set

In [None]:
# Data Preprocessing
# Handle missing values (impute with median for numerical)
numerical_cols = df.select_dtypes(include=np.number).columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

# Remove outliers using IQR method (features only). The binary target is excluded:
# when fewer than 25% of rows are bankrupt its IQR is 0, so every bankrupt row
# would be flagged as an outlier and dropped.
feature_cols = numerical_cols.drop('Bankrupt?')
Q1 = df[feature_cols].quantile(0.25)
Q3 = df[feature_cols].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df[feature_cols] < (Q1 - 1.5 * IQR)) | (df[feature_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

# No categorical variables, so no encoding needed

# Split features and target
X = df.drop('Bankrupt?', axis=1)
y = df['Bankrupt?']

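# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numerical features; fit on the training set only so that
# test-set statistics do not leak into the preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)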

# Save scaler for Streamlit app
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Print preprocessing results
print("\nData preprocessing completed. Shapes:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")

# Save preprocessed data for model training
with open('preprocessed_data.pkl', 'wb') as f:
    pickle.dump((X_train, X_test, y_train, y_test, X.columns), f)

10. Random Forest Classifier with Default Parameters

Train a Random Forest Classifier with default settings and evaluate its performance.

11. Confusion Matrix

Generate a confusion matrix for the model to assess true positives, false positives, true negatives, and false negatives.
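
A small worked example may help in reading the matrix. The labels below are made up for illustration (they are not model output); 1 stands for bankrupt and 0 for not bankrupt, matching the target encoding described earlier.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 10 companies.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=6, FP=1, FN=1, TP=2
```

With the notebook's own variables, the equivalent call would be confusion_matrix(y_test, y_pred) once the predictions have been computed.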

12. Classification Report

Produce a classification report for the model, including precision, recall, and F1-score for bankruptcy prediction.
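
The three metrics in the report follow directly from the confusion-matrix counts. The sketch below computes them by hand for the positive (bankrupt) class on a few made-up labels and compares the result with scikit-learn's own functions; the labels are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = bankrupt, 0 = not bankrupt).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bankrupt, predicted bankrupt
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, predicted bankrupt
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bankrupt, predicted healthy

precision = tp / (tp + fp)  # share of predicted bankruptcies that were real
recall = tp / (tp + fn)     # share of real bankruptcies that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

On an imbalanced target, these per-class figures are usually more informative than overall accuracy, since a model that always predicts "not bankrupt" can still score high accuracy.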

In [None]:
# Load preprocessed data
with open('preprocessed_data.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test, feature_names = pickle.load(f)

# Ensure the target labels are integers (0 = not bankrupt, 1 = bankrupt)
y_train = y_train.astype(int)
y_test = y_test.astype(int)

# Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

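# Evaluate model
from sklearn.metrics import confusion_matrix  # not part of the imports cell above

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
report = classification_report(y_test, y_pred)

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Print evaluation results
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score: {f1:.4f}")
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)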

# Feature importance
feature_importance = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print("\nTop 10 Feature Importances:\n", feature_importance.head(10))

13. Results and Conclusion

In [None]:


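# User input for predictions
st.subheader("Make a Prediction")
# The scaler and model expect every training feature, so start from the feature
# means stored in the fitted scaler and let the user override the first 10.
input_data = dict(zip(feature_names, scaler.mean_))
for feature in feature_names[:10]:  # Limit to the first 10 features for simplicity
    input_data[feature] = st.number_input(f"{feature}", value=float(input_data[feature]))

if st.button("Predict"):
    # Prepare input data with columns in the order used during training
    input_df = pd.DataFrame([input_data], columns=feature_names)
    input_scaled = scaler.transform(input_df)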
    prediction = model.predict(input_scaled)
    prob = model.predict_proba(input_scaled)[0][1]
    
    # Display result
    result = "Bankrupt" if prediction[0] == 1 else "Not Bankrupt"
    st.write(f"Prediction: **{result}**")
    st.write(f"Probability of Bankruptcy: **{prob:.2%}**")

# Conclusion
st.header("Conclusion")
st.write("""
Key takeaways:
- The dataset is imbalanced, with fewer bankrupt companies, requiring careful model evaluation.
- Features like Net Income to Total Assets and Debt Ratios are highly correlated with bankruptcy.
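- The Random Forest model can support real-time screening, provided its test-set F1-score and confusion matrix are reviewed first.
- Outlier removal changes which companies the model is trained on; scaling has little effect on tree-based models and is kept mainly for a consistent input pipeline.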
- Future improvements could include handling class imbalance with SMOTE or testing other models.
""")