SECTION 3: Train Model 1: Stacking Method

employs a Random Forest classifier to predict group membership (represented by cluster IDs) in a dataset. The process involves selecting features, splitting the data into training and testing sets, training the classifier, and then evaluating its performance using accuracy and a detailed classification report. The aim is to assess how effectively the model can categorize data points into predefined clusters based on their features.








In [20]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

In [21]:
X = data[fifty_features]
y_cluster = data['Cluster_ID']

# Splitting data for training group model
X_train_cluster, X_test_cluster, y_train_cluster, y_test_cluster = train_test_split(X, y_cluster, test_size=0.2, random_state=42)

In [22]:
# Model to predict the group
group_model = RandomForestClassifier(n_estimators=100, random_state=42)
group_model.fit(X_train_cluster, y_train_cluster)

In [23]:
# Evaluate the model
y_pred_cluster = group_model.predict(X_test_cluster)
print("Accuracy of group prediction model:", accuracy_score(y_test_cluster, y_pred_cluster))
print(classification_report(y_test_cluster, y_pred_cluster))

Accuracy of group prediction model: 0.9578313253012049
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       829
           1       0.95      0.90      0.92       332
           2       1.00      1.00      1.00         1

    accuracy                           0.96      1162
   macro avg       0.97      0.96      0.97      1162
weighted avg       0.96      0.96      0.96      1162



Cluster 0: High precision (96%) and recall (98%), resulting in an F1-score of 97%.
Cluster 1: Good precision (95%) and recall (90%), with an F1-score of 92%.
Cluster 2: Perfect precision, recall, and F1-score of 100%

In [24]:
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Preparing data for bankruptcy prediction
y_bankruptcy = data['Bankrupt?']

# We should ideally use train_test_split here to ensure we're not leaking data across models
X_train, X_test, y_train, y_test = train_test_split(X, y_bankruptcy, test_size=0.2, random_state=42)


In [25]:
# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42))
]

Base Model Configuration: A list of base models for the stacking ensemble is defined, which includes:

1. A Random Forest Classifier with 100 trees.
2. A Gradient Boosting Classifier, also with 100 estimators.
3. A Decision Tree Classifier, all initialized with a fixed random state for reproducibility.

This setup prepares for the application of a Stacking Classifier, which will combine these base models' predictions to make a final prediction on the bankruptcy status, aiming to leverage the strengths of each individual model for improved overall performance.








In [26]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Dictionary to store accuracy scores
accuracy_scores = {}



# Perform 3-fold cross-validation for each model and calculate the average accuracy
total_average = 0  # to store the sum of average accuracies
for name, model in base_models:
    scores = cross_val_score(model, X, y_bankruptcy, cv=3, scoring='accuracy')
    average_accuracy = scores.mean()
    accuracy_scores[name] = average_accuracy
    total_average += average_accuracy
    print(f"Average accuracy for {name}: {average_accuracy:.4f}")

# Calculate total average accuracy for all base models
total_average /= len(base_models)
print(f"Total average accuracy for all base models: {total_average:.4f}")

Average accuracy for rf: 0.9681
Average accuracy for gb: 0.7592
Average accuracy for dt: 0.9277
Total average accuracy for all base models: 0.8850


In [27]:
# Meta-model
meta_model = LogisticRegression()

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Prepare the base model predictions as new features
X_train_transformed = np.column_stack([
    model.fit(X_train, y_train).predict(X_train) for _, model in base_models
])
X_test_transformed = np.column_stack([
    model.predict(X_test) for _, model in base_models
])

# Train and evaluate the Logistic Regression meta-model on these new features
meta_model = LogisticRegression()
meta_model.fit(X_train_transformed, y_train)
y_pred_meta = meta_model.predict(X_test_transformed)

# Calculate the accuracy of the meta-model
print("Accuracy of Logistic Regression meta-model:", accuracy_score(y_test, y_pred_meta))


Accuracy of Logistic Regression meta-model: 0.9664371772805508


A Logistic Regression model is chosen as the meta-model. This model will use the output from the base models as input features to make the final prediction.

In [29]:
# Stacking classifier
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_model.fit(X_train, y_train)

In [30]:
# Evaluate the stacking model
y_pred = stacking_model.predict(X_test)
print("Accuracy of stacking model:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy of stacking model: 0.96815834767642
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1124
           1       0.53      0.21      0.30        38

    accuracy                           0.97      1162
   macro avg       0.75      0.60      0.64      1162
weighted avg       0.96      0.97      0.96      1162



**0.9681** is the **accuarcy of the STACKING MODEL**

In [37]:
# Save the trained model to a file
model_filename = 'model-3.joblib'
dump(pipeline, model_filename)

print("Model saved successfully to", model_filename)

Model saved successfully to model-3.joblib


SECTION 4: Train Model 2: k-fold Cross Validation

In [46]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as IMBPipeline

In [47]:
selected_features_s4 = data.columns[1:41]  # using 40 featurs
y = data['Bankrupt?']  # Ensure this is the correct target variable

In [48]:
# Setup pipeline with SMOTE and scaling
smote = SMOTE()
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000, random_state=42)  # Increased max_iter

pipeline = IMBPipeline(steps=[
    ('smote', smote),
    ('scaler', scaler),  # Adding a scaler
    ('classifier', model)
])

SMOTE (Synthetic Minority Over-sampling Technique): This component is used to address class imbalance by generating synthetic samples from the minority class. SMOTE helps in providing a more balanced dataset, which can improve model performance, especially in cases where the minority class is significantly underrepresented.

In [53]:
# k-Fold Cross-Validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_results = cross_val_score(pipeline, data[selected_features], y, cv=kfold, scoring=make_scorer(accuracy_score))


In [54]:
# Output the results
print("Mean CV accuracy:", cv_results.mean())

Mean CV accuracy: 0.879457097401072


**Mean CV accuracy: 0.879457097401072:** The output of the mean cross-validation accuracy is  high,  suggesting that the model is performing extremely well in predicting whether a company will go bankrupt based on the given features.

AS WE CAN SEE, THE MODEL TRAINED IN SECTION 3 IS GIVING THE HIGHEST ACCURACY AS COMPARED TO SECTION 4, HENCE WE WILL BE USING THE MODEL FROM SECTION 3 ON THE GIVEN TEST DATA


In [55]:
from joblib import dump

# Save the trained model to a file
model_filename = 'model-4.joblib'
dump(pipeline, model_filename)

print("Model saved successfully to", model_filename)


Model saved successfully to model-4.joblib
