# Data Science for Business - Predicting Credit Card Default with KNN

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [41]:
np.random.seed(42)
pd.options.display.float_format = '{:.2f}'.format # Turn off scientific notation for large numbers

## Case description

### Credit card default

Credit card default occurs when a cardholder fails to pay the minimum due on their credit card bill for a certain period of time. This can lead to a number of negative consequences, including late fees, increased interest rates, and damage to the cardholder's credit score. In some cases, credit card default can even result in legal action being taken against the cardholder.

## Load Data
Importing the dataset from a CSV file.

In [21]:
df = pd.read_csv('https://raw.githubusercontent.com/olivermueller/ds4b-2024/96d117a1f864c0a2701580f784645b1e409fb7b0/Session_01/default.csv')

In [None]:
df.head()

## Summary Statistics
Generating summary statistics of the numerical variables `balance` and `income` and cross-tabulation for the `default` variable.

In [None]:
df.describe()

In [None]:
pd.crosstab(df['default'], columns='default')


## Visualizations
Use seaborn to visually explore the dataset. We create a scatterplot with `balance` and `income` on the x and y axes and `default` as hue.

In [None]:
sns.scatterplot(x='balance', y='income', hue='default', alpha=0.5, data=df)
plt.xlabel('Balance')
plt.ylabel('Income')
plt.show()

## Machine Learning - KNN
Training a K-Nearest Neighbors model to predict credit card default.

We first split the data into features and labels as well as training and testing sets.

In [26]:
X = df[['balance', 'income']]
y = df['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

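Scale the features using `StandardScaler`. KNN classifies observations by distance, so a feature measured on a larger numeric range would otherwise dominate the distance calculation. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.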

In [27]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
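
To make the scale sensitivity concrete, here is a minimal sketch on made-up data (not the real `default.csv`). The ranges chosen for `balance` and `income`, the `_demo` names, and the helper `income_share_of_sq_distance` are assumptions for illustration only. It measures how much of the total pairwise squared Euclidean distance comes from the income column, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: value ranges are invented for illustration.
balance_demo = rng.uniform(0, 2_500, size=200)
income_demo = rng.uniform(10_000, 70_000, size=200)
X_demo = np.column_stack([balance_demo, income_demo])


def income_share_of_sq_distance(X):
    """Fraction of the summed pairwise squared distance contributed by column 1."""
    diffs = X[:, None, :] - X[None, :, :]          # shape (n, n, 2)
    sq_per_feature = (diffs ** 2).sum(axis=(0, 1))  # one total per feature
    return sq_per_feature[1] / sq_per_feature.sum()


raw_share = income_share_of_sq_distance(X_demo)
scaled_share = income_share_of_sq_distance(StandardScaler().fit_transform(X_demo))

print(f"Income share of squared distance (raw):    {raw_share:.3f}")
print(f"Income share of squared distance (scaled): {scaled_share:.3f}")
```

In raw units almost all of the distance comes from income. After standardization each feature has unit variance, so both contribute equally.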

Fit the KNN model to the training data using k=5.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

Use the fitted model to make predictions on the test data. More specifically, we calculate the probabilities of the positive class (default).

In [29]:
# Columns follow knn.classes_ (sorted labels), so column 1 is 'Yes' here
y_prob = knn.predict_proba(X_test)[:, 1]

In [None]:
y_prob

Let's plot the distribution of predicted probabilities for default.

In [None]:
sns.histplot(y_prob)
plt.xlabel('Probability of Default')
plt.ylabel('Frequency')
plt.show()

Transform probabilities into binary outcomes using a threshold of 0.5.

In [36]:
decision_threshold = 0.5
y_pred = np.where(y_prob > decision_threshold, 'Yes', 'No')
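
With the default uniform weights, `predict_proba` returns the fraction of the k nearest neighbors that belong to each class, so with k=5 the predicted probability can only take the values 0, 0.2, 0.4, 0.6, 0.8, or 1. The sketch below illustrates what a threshold then means in terms of neighbor votes. The names `k_demo`, `possible_probs`, and `min_votes_needed`, and the example thresholds, are chosen for illustration.

```python
import numpy as np

k_demo = 5
# Every probability a 5-neighbor model with uniform weights can output
possible_probs = np.arange(k_demo + 1) / k_demo

min_votes_needed = {}
for threshold in (0.1, 0.3, 0.5):
    labels = np.where(possible_probs > threshold, 'Yes', 'No')
    # Index of the first 'Yes' equals the number of 'Yes' neighbors required
    min_votes_needed[threshold] = int(np.argmax(labels == 'Yes'))

for threshold, votes in min_votes_needed.items():
    print(f"threshold {threshold}: at least {votes} of {k_demo} neighbors must be 'Yes'")
```

Lowering the threshold therefore means that fewer defaulting neighbors are enough to flag a customer as a likely default.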

Plot a confusion matrix to evaluate the model's performance.

In [None]:
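labels = ['No', 'Yes']  # fix the class order so the axes can be labeled
conf_matrix = confusion_matrix(y_test, y_pred, labels=labels)

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()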

Print a classification report with precision, recall, F1-score, and accuracy.
* The macro average F1 score is the arithmetic mean of the per-class F1 scores. That is, all classes are treated equally regardless of their support (the number of true instances of each class).
* The weighted average F1 score is computed by averaging the F1 scores of each class, weighted by their support values.

In [None]:
print(classification_report(y_test, y_pred))
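
The macro and weighted averages described above can be reproduced by hand. The sketch below uses a small made-up set of labels (8 `No`, 2 `Yes`) rather than the model's predictions. All `_demo` and `_manual` names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 8 non-defaulters, 2 defaulters, one defaulter missed
y_true_demo = np.array(['No'] * 8 + ['Yes'] * 2)
y_pred_demo = np.array(['No'] * 8 + ['Yes', 'No'])

class_labels = ['No', 'Yes']
per_class_f1 = f1_score(y_true_demo, y_pred_demo, labels=class_labels, average=None)
support = np.array([(y_true_demo == label).sum() for label in class_labels])

# Macro: plain mean of the per-class scores
macro_manual = per_class_f1.mean()
# Weighted: mean of the per-class scores, weighted by support
weighted_manual = np.average(per_class_f1, weights=support)

macro_sklearn = f1_score(y_true_demo, y_pred_demo, average='macro')
weighted_sklearn = f1_score(y_true_demo, y_pred_demo, average='weighted')

print(f"Per-class F1: {dict(zip(class_labels, per_class_f1.round(3)))}")
print(f"Macro F1:    manual={macro_manual:.3f}, sklearn={macro_sklearn:.3f}")
print(f"Weighted F1: manual={weighted_manual:.3f}, sklearn={weighted_sklearn:.3f}")
```

Because the majority class has the higher F1 score in this example, the weighted average comes out higher than the macro average.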

## Your Turn!

Experiment with the above code and:

1.  Change `n_neighbors` and observe how the accuracy of the classifier changes.

2.  Change `decision_threshold` and observe how the accuracy of the classifier changes.
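
One possible way to organize both experiments is a small loop over candidate values. The sketch below is only a starting point. It runs on synthetic, imbalanced stand-in data from `make_classification` (class 1 plays the role of `Yes`), so its numbers say nothing about the real dataset. The candidate values for k and the threshold, and all variable names, are assumptions. Recall for the positive class is reported next to accuracy, since both appear in the classification report.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic two-feature data with roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Same preprocessing as in the notebook: fit on train, apply to test
scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

rows = []
for k in (1, 5, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    prob_pos = model.predict_proba(X_te)[:, 1]
    for threshold in (0.3, 0.5):
        pred = (prob_pos > threshold).astype(int)
        rows.append({
            'k': k,
            'threshold': threshold,
            'accuracy': accuracy_score(y_te, pred),
            'recall': recall_score(y_te, pred),
        })

results = pd.DataFrame(rows)
print(results)
```

To run the same loop on the credit card data, replace the synthetic arrays with the scaled `X_train`, `X_test`, `y_train`, and `y_test` from above and compare against the `'Yes'` label instead of `1`.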