Credit Risks
Step 1: Download the dataset from the following link: https://www.openml.org/d/31

Step 2: Read the dataset into a pandas dataframe.

In [2]:
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("dataset_31_credit-g.arff")
df = pd.DataFrame(data)
#print(df.head(6))


Step 3: Feature Selection
Choose the features relevant to our analysis.
Numeric Attributes: duration, age, residence_since, credit_amount
Nominal Attributes: credit_history, employment, job

In [3]:
# Step 3: Feature Selection
numeric_features = ['duration', 'age', 'residence_since', 'credit_amount']
nominal_features = ['credit_history', 'employment', 'job']
selected_features = numeric_features + nominal_features

# Create a new DataFrame with selected features
df_selected = df[selected_features].copy()
print("Selected Features:")
print(df_selected.head())

Selected Features:
   duration   age  residence_since  credit_amount  \
0       6.0  67.0              4.0         1169.0   
1      48.0  22.0              2.0         5951.0   
2      12.0  49.0              3.0         2096.0   
3      42.0  45.0              4.0         7882.0   
4      24.0  53.0              4.0         4870.0   

                      credit_history employment                    job  
0  b'critical/other existing credit'     b'>=7'             b'skilled'  
1                   b'existing paid'  b'1<=X<4'             b'skilled'  
2  b'critical/other existing credit'  b'4<=X<7'  b'unskilled resident'  
3                   b'existing paid'  b'4<=X<7'             b'skilled'  
4              b'delayed previously'  b'1<=X<4'             b'skilled'  
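The b'...' prefixes in the output above appear because scipy's ARFF loader returns nominal values as byte strings. As a minimal sketch of how the attribute types can be inspected and the nominal columns decoded right after loading, the example below parses a tiny in-memory ARFF document. The relation name, attributes, and rows in arff_text are hypothetical and are not taken from the credit dataset.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF file, used only to illustrate the loader's behavior
arff_text = """@relation toy_credit
@attribute duration numeric
@attribute job {skilled,unskilled}
@attribute class {good,bad}
@data
6,skilled,good
48,unskilled,bad
"""

# loadarff accepts a file-like object as well as a file path
data, meta = arff.loadarff(io.StringIO(arff_text))
df_toy = pd.DataFrame(data)

# meta lists each attribute name together with its type ('numeric' or 'nominal')
nominal_cols = [name for name, kind in zip(meta.names(), meta.types())
                if kind == "nominal"]

# Nominal values arrive as bytes; decode them to ordinary strings
for col in nominal_cols:
    df_toy[col] = df_toy[col].str.decode("utf-8")

print(df_toy)
```

Decoding by attribute type avoids having to inspect the first value of every column, at the cost of keeping the meta object around.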


Step 4: Preprocessing
Preprocess the chosen features: decode byte strings, impute any NaN values, one-hot encode the nominal features, and standardize the numeric features.

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Decode byte strings if necessary
for col in df_selected.columns:
    if df_selected[col].dtype == object:
        # Only decode columns whose values are byte strings
        if isinstance(df_selected[col].iloc[0], bytes):
            df_selected[col] = df_selected[col].str.decode('utf-8')

# Handle NaN values
# Numeric: Impute with mean
imputer_num = SimpleImputer(strategy='mean')
df_selected[numeric_features] = imputer_num.fit_transform(df_selected[numeric_features])

# Nominal: Impute with most_frequent
imputer_cat = SimpleImputer(strategy='most_frequent')
df_selected[nominal_features] = imputer_cat.fit_transform(df_selected[nominal_features])

# Encoding Nominal Features
# sparse_output=False returns a dense array that can go straight into a DataFrame
# (this parameter name needs scikit-learn >= 1.2; older versions call it sparse)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_nominal = encoder.fit_transform(df_selected[nominal_features])
# Create DataFrame for encoded variables with proper column names
encoded_nominal_df = pd.DataFrame(encoded_nominal, columns=encoder.get_feature_names_out(nominal_features))

# Scaling Numeric Features
scaler = StandardScaler()
scaled_numeric = scaler.fit_transform(df_selected[numeric_features])
scaled_numeric_df = pd.DataFrame(scaled_numeric, columns=numeric_features)

# Combine all features
df_processed = pd.concat([scaled_numeric_df, encoded_nominal_df], axis=1)

# Process Target Variable
target = df['class']
if target.dtype == object and isinstance(target.iloc[0], bytes):
    target = target.str.decode('utf-8')

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target)

print("Processed Data Shape:", df_processed.shape)
print("Target Shape:", y.shape)
#print(df_processed.head())

Processed Data Shape: (1000, 18)
Target Shape: (1000,)
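The cell above fits the imputers, the scaler, and the encoder on all 1000 rows, so statistics from rows that later land in the validation and test sets influence the training features. One possible alternative, shown here only as a sketch, is to wrap the same steps in scikit-learn's ColumnTransformer and fit them on the training rows alone. The toy DataFrame below is a hypothetical stand-in with random values; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["duration", "age", "residence_since", "credit_amount"]
nominal_features = ["credit_history", "employment", "job"]

# Hypothetical random data that reuses the notebook's column names
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "duration": rng.integers(4, 72, n).astype(float),
    "age": rng.integers(19, 75, n).astype(float),
    "residence_since": rng.integers(1, 5, n).astype(float),
    "credit_amount": rng.integers(250, 18000, n).astype(float),
    "credit_history": rng.choice(["existing paid", "delayed previously"], n),
    "employment": rng.choice([">=7", "1<=X<4", "4<=X<7"], n),
    "job": rng.choice(["skilled", "unskilled resident"], n),
})
toy[nominal_features] = toy[nominal_features].astype(object)
toy_y = rng.integers(0, 2, n)  # hypothetical binary target

# Same steps as the cell above, grouped per column type
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
nominal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_features),
     ("nom", nominal_pipe, nominal_features)],
    sparse_threshold=0,  # always return a dense array
)

# Split first, then learn the preprocessing statistics from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)
X_tr_proc = preprocess.fit_transform(X_tr)
X_te_proc = preprocess.transform(X_te)  # reuses the training means and categories

print(X_tr_proc.shape, X_te_proc.shape)
```

With this layout, handle_unknown='ignore' has a real job: a category that only occurs in held-out rows is encoded as all zeros instead of raising an error.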


Step 5: Splitting the Data
Split the data into 80% training, 10% validation, and 10% test sets.

In [5]:
from sklearn.model_selection import train_test_split

# Split: 80% Train, 20% Temp (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(df_processed, y, test_size=0.2, random_state=42)

# Split Temp: 50% Validation, 50% Test (which is 10% and 10% of total)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Validation set shape: {X_val.shape}, {y_val.shape}")
print(f"Test set shape: {X_test.shape}, {y_test.shape}")

Training set shape: (800, 18), (800,)
Validation set shape: (100, 18), (100,)
Test set shape: (100, 18), (100,)
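The split above is purely random, so the class proportions can drift between the three subsets, which matters most for the small validation and test sets. A hedged sketch of the same 80/10/10 split with the stratify argument of train_test_split is shown below. The labels are hypothetical: 300 samples of class 0 and 700 of class 1 are assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (assumed 30% class 0, 70% class 1)
y_demo = np.array([0] * 300 + [1] * 700)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# 80% train, 20% temp, keeping the class ratio in both parts
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Temp split in half: 10% validation, 10% test, again stratified
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, share of class 1 = {part.mean():.2f}")
```

In the notebook itself this would amount to passing stratify=y to the first call and stratify=y_temp to the second.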


Step 6: Training Classifiers
Train a k-nearest neighbors (KNN) classifier on the training set. Choose the best k by trying different values and comparing their accuracy on the validation set.
Classification Metrics: Print the accuracy score and the confusion matrix of the final classifier on the test set.

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

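# Choose the best k: try the odd values 1, 3, ..., 19 and keep the one with the
# highest validation accuracy (on ties, the smallest such k is kept)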
best_k = 0
best_accuracy = 0
accuracies = []

for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    val_pred = knn.predict(X_val)
    accuracy = accuracy_score(y_val, val_pred)
    accuracies.append((k, accuracy))
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_k = k

print(f"Best k found: {best_k} with Validation Accuracy: {best_accuracy:.4f}")

# Train final classifier with best k
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_train, y_train)

# Evaluate on Test set
test_pred = final_knn.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
conf_matrix = confusion_matrix(y_test, test_pred)

print(f"\nFinal Test Accuracy: {test_accuracy:.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Best k found: 19 with Validation Accuracy: 0.7300

Final Test Accuracy: 0.7200
Confusion Matrix:
[[ 2 26]
 [ 2 70]]
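Accuracy alone hides how the errors are distributed across the two classes. The sketch below recomputes the accuracy from the confusion matrix printed above and adds the recall of each class and the accuracy of always predicting the more frequent class. In scikit-learn's confusion_matrix, rows are true classes and columns are predicted classes. LabelEncoder numbers the classes in sorted order, so row 0 is assumed here to correspond to 'bad' and row 1 to 'good'. The class names are an assumption, since the notebook does not print them.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
conf_matrix = np.array([[2, 26],
                        [2, 70]])

total = conf_matrix.sum()
true_counts = conf_matrix.sum(axis=1)  # number of test samples per true class

# Correct predictions sit on the diagonal
accuracy = np.trace(conf_matrix) / total

# Recall per class: correctly predicted samples / all samples of that class
recall_per_class = np.diag(conf_matrix) / true_counts

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = true_counts.max() / total

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall per class: {np.round(recall_per_class, 4)}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

On this test set the classifier labels only 2 of the 28 class-0 samples correctly, and its accuracy equals that of the majority-class baseline, so the accuracy figure should be read together with the per-class recall.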
