## Dataset

This dataset was sourced from the Kaggle dataset 'Retail Credit Bank Data'.

A retail credit bank would like us to build a credit default model for their credit card portfolio. The dataset consists of 13,444 observations and 14 variables, described below:

1. *CARDHLDR* Dummy variable, 1 if application for credit card accepted, 0 if not
2. *DEFAULT* 1 if defaulted, 0 if not (observed when CARDHLDR=1, 10,499 observations)
3. *AGE* Age in years plus twelfths of a year
4. *ACADMOS* Months living at current address
5. ADEPCNT 1 + number of dependents
6. MAJORDRG Number of major derogatory reports
7. MINORDRG Number of minor derogatory reports
8. OWNRENT 1 if owns their home, 0 if rents
9. *INCOME* Monthly income (divided by 10,000)
10. SELFEMPL 1 if self employed, 0 if not
11. *INCPER* Income divided by number of dependents
12. EXP_INC Ratio of monthly credit card expenditure to yearly income
13. SPENDING Average monthly credit card expenditure (for CARDHLDR = 1)
14. LOGSPEND Log of spending




## Tasks

1. Identify the accuracy of the model at predicting a default

2. Identify which factors most significantly increase or decrease the likelihood of credit default

## Table of Contents
1. Exploratory Data Analysis
2. Data Transformation
3. Data Training and Testing
4. Train Logistic Regression Model
5. Insight Summary
6. Identify which attributes are the most important predictors

## 1. Exploratory Data Analysis

First and foremost, let's familiarize ourselves with the data. There are always questions to ask of your data:
+ Is the dataset clean?
+ What are the data types?
+ What does the actual data look like?
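
The three questions above can be answered in one pass. The following is a minimal sketch, not part of the original notebook: `summarize_quality` is a hypothetical helper, and the toy frame uses made-up values in place of the real file. It reports each column's dtype, its missing-value count, and the number of cells that hold only whitespace, which `isna()` alone would not flag.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, blank-string count."""
    # Count cells that are strings made only of whitespace (or empty).
    blank_counts = frame.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isna().sum(),
            "n_blank": blank_counts,
        }
    )


# Toy data standing in for the real file (values are made up).
toy = pd.DataFrame(
    {
        "AGE": [27.25, 40.83, None],
        "SPENDING": ["  ", "121.98", "96.85"],
    }
)
report = summarize_quality(toy)
print(report)
```

Calling `summarize_quality(df)` on the loaded data would show whether any column was read as text because of blank cells.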

In [1]:
# Import the necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

DATA_FILE = 'credit_data.csv'
df = pd.read_csv(DATA_FILE)
print(df.head())
print(df.info())
print(df.describe())

   CARDHLDR  DEFAULT        AGE  ACADMOS  ADEPCNT  MAJORDRG  MINORDRG  \
0         0        0  27.250000        4        0         0         0   
1         0        0  40.833332      111        3         0         0   
2         1        0  37.666668       54        3         0         0   
3         1        0  42.500000       60        3         0         0   
4         1        0  21.333334        8        0         0         0   

   OWNRENT       INCOME  SELFEMPL   INCPER   EXP_INC     SPENDING   LOGSPEND   
0        0  1200.000000         0  18000.0  0.000667                           
1        1  4000.000000         0  13500.0  0.000222                           
2        1  3666.666667         0  11300.0  0.033270  121.9896773  4.8039364   
3        1  2000.000000         0  17250.0  0.048427   96.8536213  4.5732008   
4        0  2916.666667         0  35000.0  0.016523   48.1916700  3.8751862   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13444 entries, 0 to 13443
Data 

Now that we know some information about the dataset's attributes, we can select our target variable and features.

In [2]:
# Target variable: Predicting whether an accepted cardholder defaults.
TARGET_COLUMN_NAME = 'DEFAULT'

# Features relevant to the default decision: the emphasized columns from the
# variable list above, plus other indicators such as MAJORDRG and MINORDRG.
# NOTE: SPENDING and LOGSPEND are only observed when CARDHLDR=1, so they are
# usable here only because the data is filtered to cardholders first.
FEATURES_TO_USE = [
    'AGE', 'ACADMOS', 'ADEPCNT', 'MAJORDRG', 'MINORDRG',
    'OWNRENT', 'INCOME', 'SELFEMPL', 'INCPER', 'EXP_INC',
    'SPENDING', 'LOGSPEND '
]
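
The feature list above spells `'LOGSPEND '` with a trailing space, which presumably mirrors a padded header in the CSV file. If that is the case, one option is to strip whitespace from the column names once, so later lookups do not depend on the padding. The sketch below uses a made-up toy frame, not the real data, and the padded header is an assumption.

```python
import pandas as pd

# Toy frame with a padded header, standing in for the real CSV (assumption).
toy = pd.DataFrame({"SPENDING": [121.99], "LOGSPEND ": [4.80]})

# Strip leading/trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()

# Normalize the requested feature names the same way, then check they all exist.
features = [name.strip() for name in ["SPENDING", "LOGSPEND "]]
missing = [name for name in features if name not in toy.columns]
print(list(toy.columns), missing)
```

An empty `missing` list confirms that every requested feature is present before the data is filtered.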

## 2. Data Transformation
Here we will
+ Filter the data based on the required Features and the Target column
+ Clean the data

In [3]:
# Crucial Step: Filter the data set to only include observations
# where the DEFAULT status is known (i.e., card application was accepted).
df_filtered = df[df['CARDHLDR'] == 1].copy()
print(f"   Original samples: {len(df)}")
print(f"   Filtered samples (CARDHLDR=1): {len(df_filtered)}\n")
# a) Select the required Features and Target
X_raw = df_filtered[FEATURES_TO_USE]
y = df_filtered[TARGET_COLUMN_NAME]

# b) Handle Missing Values (NaNs)
# Simple approach: Drop rows with ANY missing data in the selected features or target.
data_combined = pd.concat([X_raw, y], axis=1).dropna()
X = data_combined[FEATURES_TO_USE]
y = data_combined[TARGET_COLUMN_NAME]
print(f"   Final samples after dropping NaNs: {len(X)}")

   Original samples: 13444
   Filtered samples (CARDHLDR=1): 10499

   Final samples after dropping NaNs: 10499


## 3. Data Training and Testing

Split the data into training and testing sets.

In [4]:
print(f"   Features (X) shape: {X.shape}, Target (y) shape: {y.shape}\n")


# Step 3: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Step 3: Splitting Data...")
print(f"   Training samples: {len(X_train)}")
print(f"   Testing samples: {len(X_test)}\n")


   Features (X) shape: (10499, 12), Target (y) shape: (10499,)

Step 3: Splitting Data...
   Training samples: 7349
   Testing samples: 3150
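
Defaults are a minority class in this data, so a purely random split can leave the training and testing sets with slightly different default rates. A possible refinement, shown here on synthetic data rather than the notebook's `X` and `y`, is to pass `stratify` to `train_test_split` so both sets keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): 1000 rows, 10% positive class.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

In the notebook this would amount to adding `stratify=y` to the existing call. The reported results were produced without it.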



## 4. Train Logistic Regression Model

We're going to initialize and train the Logistic Regression model.

In [5]:
#4.A Initialize and Train the Logistic Regression Model
model = LogisticRegression(random_state=42, solver='liblinear')
print("Step 4: Training Logistic Regression Model...")
model.fit(X_train, y_train)
print("   Model training complete.\n")

Step 4: Training Logistic Regression Model...
   Model training complete.



### Checking for Model Accuracy

In [6]:
from sklearn.metrics import confusion_matrix


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Evaluating Model Performance...")
print(f"   Model Accuracy on Test Set: {accuracy * 100:.2f}%")
print("\n--- Confusion Matrix ---")
print("   [[True Negatives (Correctly predicted NOT to default), False Positives (Incorrectly predicted to default)]")
print("    [False Negatives (Incorrectly predicted NOT to default), True Positives (Correctly predicted to default)]]")
print(conf_matrix)

Evaluating Model Performance...
   Model Accuracy on Test Set: 90.73%

--- Confusion Matrix ---
   [[True Negatives (Correctly predicted NOT to default), False Positives (Incorrectly predicted to default)]
    [False Negatives (Incorrectly predicted NOT to default), True Positives (Correctly predicted to default)]]
[[2858    0]
 [ 292    0]]


## 5. Insight Summary
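The model reaches 90.73% accuracy on the test set. The confusion matrix shows, however, that it never predicts a default: all 292 defaulters in the test set are classified as non-defaulters, so the accuracy simply equals the share of non-defaulters (2,858 of 3,150).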
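
One way to put the 90.73% figure in context is to compare it with a baseline that always predicts the majority class. The sketch below is illustrative only: it rebuilds a label vector from the counts in the confusion matrix above (2,858 non-defaults, 292 defaults) and fits scikit-learn's `DummyClassifier` on those same labels, whereas a real baseline would be fit on the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the confusion matrix above: 2858 non-defaults, 292 defaults.
y_true = np.array([0] * 2858 + [1] * 292)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predict the most frequent class (non-default).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_recall = recall_score(y_true, y_base, zero_division=0)
print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
print(f"Baseline recall on defaults: {baseline_recall:.2f}")
```

The baseline matches the model's accuracy while catching no defaults. For this task, recall on the default class is therefore a more informative measure than accuracy alone.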

## 6. Identify which attributes are the most important predictors

In [7]:
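from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance.
# NOTE: the model trained in In [5] is not refit on this scaled data, so the
# predictions and coefficients below still come from the unscaled features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)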
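
For scaling to affect the coefficients, the model has to be fit on the scaled features. A minimal sketch of one way to do that is to chain the scaler and the classifier in a scikit-learn pipeline, so the scaler is fit on the training rows only. The data below is synthetic and the column names `small_scale` and `large_scale` are hypothetical. This is not the notebook's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up): two features on very different scales.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {
        "small_scale": rng.normal(0, 1, 500),
        "large_scale": rng.normal(0, 1000, 500),
    }
)
logits = 1.0 * toy_X["small_scale"] + 0.001 * toy_X["large_scale"]
toy_y = (logits + rng.normal(0, 1, 500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

# The scaler is fit on the training rows, then the model is fit on the scaled rows.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=42),
)
pipe.fit(X_tr, y_tr)

# Coefficients are now per standard deviation of each feature.
scaled_coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=toy_X.columns
)
print(scaled_coefs)
```

With standardized inputs, the coefficient magnitudes can be compared across features regardless of their original units.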

### A. Switch from predict to predict_proba to get probability scores

In [8]:
probabilities = model.predict_proba(X_test)[:, 1] 

### B. Set the threshold at 20%: predict default if the probability is greater than 20%

In [9]:
custom_pred = (probabilities > 0.2).astype(int)

### C. Evaluate the model

In [10]:
custom_conf_matrix = confusion_matrix(y_test, custom_pred)
print("\nConfusion Matrix using 20% Probability Threshold:")
print(custom_conf_matrix)


Confusion Matrix using 20% Probability Threshold:
[[2800   58]
 [ 270   22]]
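
The matrix above can be turned into accuracy, precision, and recall for the default class. The sketch below copies the four counts from the printed output and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix reported above for the 20% threshold:
# rows are actual classes, columns are predicted classes.
cm = np.array([[2800, 58], [270, 22]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted defaults that are real defaults
recall = tp / (tp + fn)     # share of real defaults that are caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

At this threshold the model catches 22 of the 292 defaulters (recall of about 7.5%) with a precision of 27.5%, while accuracy drops slightly to about 89.6%.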


### D. Inspect the model coefficients

After the model is trained, the features with the largest absolute coefficient values (positive or negative) are the most important predictors, provided the features are on comparable scales.

In [11]:
print("\n--- Model Coefficients (Feature Importance) ---")
coefficients = pd.Series(model.coef_[0], index=X.columns)
print(coefficients.sort_values(ascending=False))


--- Model Coefficients (Feature Importance) ---
MINORDRG     0.012643
MAJORDRG     0.007143
ADEPCNT      0.001732
SELFEMPL     0.000788
ACADMOS      0.000780
INCPER      -0.000018
EXP_INC     -0.000114
SPENDING    -0.000123
INCOME      -0.000422
OWNRENT     -0.003291
AGE         -0.020991
LOGSPEND    -0.040245
dtype: float64
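
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three of the coefficients printed above. It illustrates the conversion only and does not re-fit the model.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above (a subset, for illustration).
coefs = pd.Series({"MINORDRG": 0.012643, "AGE": -0.020991, "LOGSPEND": -0.040245})

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(4))
```

An odds ratio above 1 means the odds of default rise with the feature, and a ratio below 1 means they fall. Values close to 1, as here, indicate a small change per unit of the feature.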


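## Conclusion
MINORDRG has the largest positive coefficient (0.0126), so more minor derogatory reports are associated with a higher likelihood of default. LOGSPEND has the most negative coefficient (-0.0402), so higher log spending is associated with a lower likelihood of default. Because the model was fit on unscaled features, these magnitudes depend on each feature's units and should be compared with care.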