## Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are supervised machine learning algorithms used for classification and regression. For classification, an SVM finds the hyperplane that separates the classes with the largest possible margin, where the margin is the distance between the hyperplane and the closest data points from each class. Those closest points are called the support vectors.
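As a rough illustration of the margin and the support vectors, here is a minimal sketch on a tiny, made-up 2-D dataset (the points are hypothetical and unrelated to the bankruptcy data). It assumes a linear kernel and a very large `C`, which approximates a hard-margin SVM when the classes are separable.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane

# Distance from the hyperplane to the closest points of either class.
margin = 1.0 / np.linalg.norm(w)

print("support vectors:")
print(clf.support_vectors_)
print("margin:", margin)
```

For this toy data the closest points are (0, 1) and (1, 0) on one side and (3, 3) on the other, so the margin should come out close to 5 / (2√2) ≈ 1.77, up to solver tolerance.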

In [2]:
import pandas as pd

# for upsampling
from sklearn.utils import resample

# for the SVM model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

In [None]:
# Load the data from CSV file
df = pd.read_csv("bankruptcy.csv")
# df.head()

# Separate majority and minority classes
df_majority = df[df["Bankrupt?"] == 0]
df_minority = df[df["Bankrupt?"] == 1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled["Bankrupt?"].value_counts()

0    6599
1    6599
Name: Bankrupt?, dtype: int64

In [3]:
# select columns that are not numeric (any dtype outside pandas' "number" family)
non_numeric_cols = df_upsampled.select_dtypes(exclude="number")

# print the non-numerical columns
if not non_numeric_cols.empty:
    print(f"The non-numerical columns are: {', '.join(non_numeric_cols.columns)}")
else:
    print("All columns are numerical.")

All columns are numerical.


In [5]:
# Preprocess the data
X = df_upsampled.drop("Bankrupt?", axis=1)  # input features
y = df_upsampled["Bankrupt?"]  # target variable
X = pd.get_dummies(X, drop_first=True)  # encode categorical variables (a no-op here: all columns are numeric)
X = StandardScaler().fit_transform(X)  # standardize features to zero mean and unit variance

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train an SVM model with RBF kernel
model = SVC(kernel='rbf')
model.fit(X_train, y_train)

# Evaluate the model on the testing set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.9446    0.9140    0.9291      1268
           1     0.9229    0.9504    0.9364      1372

    accuracy                         0.9330      2640
   macro avg     0.9337    0.9322    0.9328      2640
weighted avg     0.9333    0.9330    0.9329      2640
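One caveat about the workflow above: because the minority class is upsampled with replacement before the split, copies of the same row can end up in both the training and the test set, and fitting `StandardScaler` on the full data lets test-set statistics influence training. Both effects can make the scores look better than they would on unseen data. Below is a minimal sketch of a variant that splits first and then upsamples and scales using training rows only. The helper name `fit_svm_without_leakage` and the synthetic data are hypothetical, it assumes a binary target, and it has not been run on the bankruptcy data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample


def fit_svm_without_leakage(df, target, test_size=0.2, random_state=0):
    """Split first, then upsample and scale using the training rows only."""
    # Stratify so the rare class is represented in both splits.
    train_df, test_df = train_test_split(
        df, test_size=test_size, random_state=random_state, stratify=df[target]
    )

    # Upsample the minority class inside the training set only (binary target assumed).
    counts = train_df[target].value_counts()
    majority = train_df[train_df[target] == counts.idxmax()]
    minority = train_df[train_df[target] == counts.idxmin()]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    train_bal = pd.concat([majority, minority_up])

    # The pipeline fits the scaler on training data and reuses it at predict time.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(train_bal.drop(columns=target), train_bal[target])
    return model, test_df


# Synthetic stand-in data (hypothetical; not the bankruptcy dataset).
rng = np.random.default_rng(0)
n_major, n_minor = 500, 40
features = np.vstack([rng.normal(0.0, 1.0, (n_major, 3)),
                      rng.normal(2.0, 1.0, (n_minor, 3))])
demo = pd.DataFrame(features, columns=["f1", "f2", "f3"])
demo["label"] = [0] * n_major + [1] * n_minor

model, test_df = fit_svm_without_leakage(demo, "label")
y_pred = model.predict(test_df.drop(columns="label"))
print(classification_report(test_df["label"], y_pred, digits=4))
```

With this arrangement the test set keeps the original class imbalance, so per-class precision and recall for the minority class are usually the numbers to watch rather than overall accuracy.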



Precision: The precision is the ratio of true positive predictions to the total number of positive predictions (true positives plus false positives). It measures how many of the predicted positives are correct. A high precision indicates that the model produces few false positives.

Recall: The recall is the ratio of true positive predictions to the total number of actual positive instances (true positives plus false negatives). It measures how many of the actual positives the model finds. A high recall indicates that the model has a low false negative rate.

F1-score: The F1-score is the harmonic mean of precision and recall. It combines both metrics into a single score that balances precision and recall.

Support: The support is the number of actual occurrences of each class in the test set.

Accuracy: The accuracy is the ratio of correct predictions to the total number of predictions.

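Macro/weighted averages: The macro average calculates the metric for each class and then takes the unweighted mean, so every class counts equally. The weighted average also averages the per-class metrics, but weights each class by its support. The micro average, which calculates the metric globally by counting the total true positives, false negatives, and false positives, is equal to the accuracy in a single-label problem like this one, which is why the report shows an accuracy row instead.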
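To make these definitions concrete, the sketch below recomputes the report's numbers by hand. The confusion-matrix counts are not printed anywhere in the notebook; they are back-calculated from the recall and support columns of the report above, so treat them as an inference. The helper name `per_class_metrics` is made up for illustration.

```python
# Confusion-matrix counts inferred from the report (not notebook output).
# Rows are true classes, columns are predicted classes.
confusion = [[1159, 109],   # true 0: 1159 predicted as 0, 109 predicted as 1
             [68, 1304]]    # true 1: 68 predicted as 0, 1304 predicted as 1


def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn


rows = [per_class_metrics(confusion, k) for k in range(2)]
total = sum(r[3] for r in rows)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(confusion[k][k] for k in range(2)) / total

# Macro average: unweighted mean of the per-class values.
macro = [sum(r[j] for r in rows) / len(rows) for j in range(3)]

# Weighted average: per-class values weighted by support.
weighted = [sum(r[j] * r[3] for r in rows) / total for j in range(3)]

for k, (p, r, f, s) in enumerate(rows):
    print(f"class {k}: precision={p:.4f} recall={r:.4f} f1={f:.4f} support={s}")
print(f"accuracy={accuracy:.4f}")
print("macro avg:   ", [f"{v:.4f}" for v in macro])
print("weighted avg:", [f"{v:.4f}" for v in weighted])
```

Rounded to four digits, these values should match the rows of the classification report, and the weighted-average recall coincides with the accuracy.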