# Credit Scoring Dataset – First Look
In this session, you will:
- Explore a raw dataset for credit scoring.
- Identify and fix common data quality issues.
- Engineer new features useful for predicting credit risk.

The dataset mimics real-world bank data, with deliberate issues (missing values, outliers, duplicates).

> **Note for Students**  
> This dataset is **synthetic** and created for **academic purposes only**.  
> It is not real customer data, but it mimics some of the real issues (like missing values, inconsistent categories, and outliers) that we often face in actual credit scoring use cases.


In [14]:
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


# Section 1: Quick EDA (Exploratory Data Analysis)
Before cleaning and feature engineering, let's understand the dataset.

Goals:
- Check dataset shape and column types
- Inspect distributions of numeric variables
- Explore categorical variables
- Identify issues: missing values, outliers, inconsistencies

### Step 1.1 – Dataset Overview
- Check shape and column types
- Preview first rows

In [15]:
# TODO: Read the CSV file into a DataFrame

In [None]:
# TODO: Preview the first 10 rows of the dataset

### Step 1.2 – Summary Statistics
- Describe numeric variables
- Count categorical values

In [None]:
# TODO: check data types and non-null counts

In [None]:
# TODO: Check value counts for 'employment_type'

### Step 1.3 – Missing Values
- Identify columns with NaN
- Check proportions of missing


*Explanation*: Missing values can bias results. If only a few rows are affected, they can be dropped; important fields are usually imputed instead (for example, with the median or an explicit "Unknown" category).
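
One way to check both the count and the proportion of missing values is sketched below. The `toy` DataFrame and its values are hypothetical stand-ins, not the course dataset; only `isna()`, `sum()`, and `mean()` matter here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the course dataset)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [3000.0, 4200.0, np.nan, np.nan],
    "employment_type": ["salaried", None, "self-employed", "salaried"],
})

# Count and share of missing values per column, largest share first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("share_missing", ascending=False)

print(missing)
```

The mean of a boolean mask is the fraction of `True` values, which is why `isna().mean()` gives the proportion directly.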

In [None]:
# TODO: Find missing values in each column

### Step 1.4 – Distributions & Outliers
- Plot histograms for numeric columns
- Use boxplots to detect outliers


*Explanation*: Visualizing distributions helps detect skewness, outliers, or unrealistic values (e.g., negative income).
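
As a numeric complement to the plots, the sketch below flags outliers with the common 1.5 × IQR rule, the same convention box plots typically use for their whiskers. The `balance` values are made up for illustration, and the 1.5 multiplier is a convention, not a requirement.

```python
import pandas as pd

# Hypothetical monthly balances with one very high and one very negative value
balance = pd.Series(
    [120, 150, 180, 200, 210, 250, 300, 5000, -4000], name="monthly_balance"
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidate outliers to inspect, not to delete blindly
outliers = balance[(balance < lower) | (balance > upper)]
print(outliers)
```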

In [None]:
fig = px.histogram(df, x="age", nbins=30, title="Age Distribution")
fig.show()

In [None]:
# TODO: Create a histogram to visualize income distribution

In [None]:

fig = px.histogram(df, x="monthly_balance", nbins=30, title="Monthly Balance Distribution")
fig.show()


In [None]:
# TODO: Create a boxplot to check for outliers in monthly_balance

**Interpretation:**

- Income varies widely: most customers sit in the lower ranges, with a smaller number of high earners.
- Monthly balance has outliers: some customers have very negative or very high balances.

### Step 1.5 – Target Variable
- Check distribution of the target
- See if classes are balanced (default vs non-default)

*Explanation*: Checking class balance is critical. If imbalanced, advanced techniques like resampling or weighted models may be needed.

In [None]:

# TODO: Print raw counts of default vs non-default instead of percentages


In [None]:

# Distribution of default vs non-default
default_dist = df['default'].value_counts(normalize=True).reset_index()
default_dist.columns = ['default', 'proportion']

fig = px.bar(
    default_dist,
    x='default',
    y='proportion',
    text='proportion',
    title="Default vs Non-default Distribution",
    labels={'default': 'Class', 'proportion': 'Proportion'}
)
fig.update_traces(texttemplate='%{text:.2%}', textposition='outside')
fig.update_yaxes(tickformat=".0%")
fig.show()



### Class Imbalance Note

In this dataset, about 20% of customers are **defaults** and 80% are **non-defaults**. This reflects a realistic imbalance: most customers repay their loans. Because of this imbalance, accuracy alone is misleading, so we also need to look at precision, recall, and F1.
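
To see why accuracy misleads with an 80/20 split, consider a baseline that always predicts non-default. The sketch below uses hypothetical labels and assumes the coding 1 = default, 0 = non-default.

```python
import numpy as np

# Hypothetical labels with an 80/20 split (assumed coding: 1 = default)
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts non-default

# Accuracy: share of correct predictions
accuracy = (y_pred == y_true).mean()

# Recall on the default class: share of actual defaults that were caught
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline scores 80% accuracy while catching none of the defaults, which is exactly the case a credit model must not miss.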

# Section 2: Data Cleaning
Now we fix issues found in EDA.

Goals:
- Handle missing values, duplicates, and outliers.
- Standardize categories and check data consistency.

### Step 2.1 – Handle Missing Values
Choose a strategy:
- Drop rows
- Fill with mean/median/mode
- Use domain knowledge (e.g., employment type → "Unknown")


*Explanation*: Imputation choice depends on distribution. Median is safer with skewed data (like income), while mean works for symmetric data.
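
The difference between the two statistics is easy to see on a small skewed sample. The incomes below are invented for illustration; a single large value is enough to pull the mean far from the typical customer.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed incomes with one missing value
income = pd.Series([2000, 2500, 3000, 3500, 50000, np.nan])

# mean() and median() skip NaN by default
print("mean:", income.mean())
print("median:", income.median())

# Median imputation keeps the filled value close to a typical customer
filled_median = income.fillna(income.median())
print(filled_median)
```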

In [None]:
# TODO: Fill missing values for 'age' and 'employment_type'

In [None]:
df['income'] = df['income'].fillna(df['income'].median())

### Step 2.2 – Remove Duplicates
- Check duplicate rows
- Drop if necessary

*Explanation*: Duplicates usually arise from repeated entries. Always confirm before dropping, since some repeats may be legitimate.
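
A minimal sketch of "confirm before dropping", using a hypothetical `toy` DataFrame: inspect exact duplicates first, then look separately at rows that share an ID but differ in other fields.

```python
import pandas as pd

# Hypothetical toy data: customer 2 is an exact repeat, customer 3 has two different rows
toy = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "income": [3000, 4000, 4000, 5000, 5200],
})

# keep=False marks every copy, so whole duplicate groups can be reviewed together
exact_dupes = toy[toy.duplicated(keep=False)]

# Same ID but different values: possibly an update, not a true duplicate
same_id = toy[toy.duplicated(subset="customer_id", keep=False)]

# Drop only rows that are identical in every column
deduped = toy.drop_duplicates()

print(exact_dupes, same_id, deduped, sep="\n\n")
```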

In [None]:

# TODO: Remove duplicate rows from the dataset


### Step 2.3 – Standardize Categories
- Fix inconsistent labels
- Ensure uniform formatting (e.g., lowercase)


*Explanation*: Standardizing categories avoids treating 'Self employed' and 'self-employed' as separate groups.
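
The sketch below shows one possible normalization on a hypothetical Series. Collapsing spaces, hyphens, and underscores into a single underscore is an assumption made for illustration; the right target spelling depends on the labels actually present in the data.

```python
import pandas as pd

# Hypothetical messy labels
emp = pd.Series(["Self employed", "self-employed ", " SALARIED", "Salaried", None])

cleaned = (
    emp.str.strip()                                # remove leading/trailing spaces
       .str.lower()                                # uniform case
       .str.replace(r"[\s_-]+", "_", regex=True)   # one separator style
)

# Missing labels stay missing and can be handled separately
print(cleaned.value_counts(dropna=False))
```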

In [None]:

# TODO: Standardize categories in 'employment_type'
# Hint: lowercase and strip spaces


### Step 2.4 – Treat Outliers
- Cap extreme values
- Or replace with thresholds

*Explanation*: Outliers can heavily influence models. Options include capping, transformation (log), or removal.
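
Two of these options are sketched below on made-up values. The 1st/99th percentile cut-offs are an illustrative choice, and the log transform is shown on a non-negative column because `np.log1p` is not defined for values below -1.

```python
import numpy as np
import pandas as pd

# Hypothetical balances with extremes in both tails
balance = pd.Series([-9000, 100, 200, 300, 400, 500, 600, 700, 800, 20000], dtype=float)

# Capping (winsorizing) both tails with clip
low, high = balance.quantile([0.01, 0.99])
capped = balance.clip(lower=low, upper=high)

# Log transform for a non-negative, right-skewed column; log1p(0) is 0
income = pd.Series([0, 1000, 3000, 100000], dtype=float)
log_income = np.log1p(income)

print(capped)
print(log_income)
```

Capping keeps every row while limiting the influence of extreme values; the log transform compresses the right tail instead of cutting it.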

In [None]:
# Solved example: cap outliers at 99th percentile
cap = df['monthly_balance'].quantile(0.99)
df['monthly_balance'] = np.where(df['monthly_balance'] > cap, cap, df['monthly_balance'])


### Step 2.5 – Validate Logic
- Check impossible values (e.g., negative age)


In [None]:
# Treat negative ages as sign errors and flip them to positive values
df['age'] = df['age'].abs()
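
Before correcting values, it can help to count how many rows break each rule. The sketch below is one way to do that; the `toy` data and the 18 to 100 age range are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with a few impossible values
toy = pd.DataFrame({
    "age": [34, -27, 150, 45],
    "income": [3000, 4000, -10, 5200],
})

# One boolean column per rule; True means the row violates the rule
violations = pd.DataFrame({
    "age_out_of_range": ~toy["age"].between(18, 100),
    "negative_income": toy["income"] < 0,
})

print(violations.sum())                 # number of violations per rule
flagged = toy[violations.any(axis=1)]   # rows that break at least one rule
print(flagged)
```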

# Section 3: Feature Engineering
We create new features that add business insight.

Goals:
- Encode categorical variables.
- Build useful ratios (e.g., debt-to-income, credit utilization).
- Group variables (e.g., age buckets).

### Step 3.1 – Encode Categorical Variables
- Convert categories into numeric form (e.g., one-hot encoding)
- Avoid implying order in non-ordinal categories

*Explanation*: Most models need numeric inputs. Integer codes (0, 1, 2) would suggest a ranking, so for non-ordinal categories one-hot encoding creates one 0/1 column per category instead.
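
The effect of `drop_first=True` is easiest to see on a tiny example. The category names below are hypothetical; the dropped category becomes the baseline against which the remaining columns are interpreted.

```python
import pandas as pd

# Hypothetical education levels
toy = pd.DataFrame({"education": ["high_school", "bachelor", "master", "bachelor"]})

# One column per category
full = pd.get_dummies(toy, columns=["education"])

# drop_first=True removes the first category (alphabetical order for strings)
reduced = pd.get_dummies(toy, columns=["education"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one column avoids perfectly redundant inputs, which matters for linear models such as logistic regression.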


In [None]:
# Solved example: encode 'education'
df = pd.get_dummies(df, columns=['education'], drop_first=True)


In [None]:

# TODO: Encode other categorical variables like 'employment_type' and 'gender'


### Step 3.2 – Create Ratios
- Debt-to-Income Ratio
- Credit Utilization (%)

*Explanation*: Ratios like debt-to-income capture relative financial health better than raw numbers.
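
Ratios need a safe denominator. Adding a small constant is one option; the sketch below shows an alternative that turns a zero denominator into a missing value so the ratio is not silently distorted. The `toy` values are hypothetical, and the column names follow the ones mentioned in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last customer has a credit limit of 0
toy = pd.DataFrame({
    "balance": [500.0, 1900.0, 300.0],
    "credit_limit": [1000.0, 2000.0, 0.0],
})

# Replace 0 with NaN first so the division yields NaN instead of inf
toy["credit_utilization"] = toy["balance"] / toy["credit_limit"].replace(0, np.nan)

print(toy)
```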


In [None]:
# Solved example: Debt-to-Income ratio
# The +1 in the denominator guards against division by zero when income is 0
df['debt_to_income'] = df['total_debt'] / (df['income'] + 1)


In [None]:
# TODO: Create credit utilization (balance / credit_limit)


### Step 3.3 – Group Continuous Variables
- Age buckets
- Income ranges

*Explanation*: Bucketing continuous variables can reveal non-linear relationships (e.g., young borrowers may behave differently).  
It also makes it easier to compare customers across categories (e.g., low-income vs high-income).
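
Two bucketing styles are sketched below on hypothetical values: fixed edges with `pd.cut` and equal-sized groups with `pd.qcut`. With `pd.cut`, the lowest edge is excluded by default, so an age of exactly 18 would become `NaN` unless `include_lowest=True` is set. The labels and the choice of three income groups are illustrative.

```python
import pandas as pd

# Fixed edges: intervals are (30, 50] style, so include_lowest keeps age 18
age = pd.Series([18, 25, 30, 31, 50, 72])
age_group = pd.cut(
    age, bins=[18, 30, 50, 100], labels=["Young", "Mid", "Senior"], include_lowest=True
)

# Quantile-based groups: roughly the same number of customers in each bucket
income = pd.Series([1200, 1800, 2500, 3200, 4100, 9000])
income_group = pd.qcut(income, q=3, labels=["Low", "Medium", "High"])

print(age_group.tolist())
print(income_group.tolist())
```

Fixed edges are easier to explain to the business; quantile buckets adapt to the data and avoid near-empty groups.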


In [None]:
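# include_lowest=True keeps customers aged exactly 18 in the 'Young' bucket
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 50, 100], labels=['Young', 'Mid', 'Senior'], include_lowest=True)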

# TODO: Create alternative age groups with different splits

# TODO: Create income ranges


## 🔍 Feature Insights After Engineering

Now that we have created **debt-to-income ratio, credit utilization, age groups, and income groups**, we can visualize how these engineered features relate to default risk.

The goal is to see whether these transformations reveal clearer patterns compared to the raw features.

In [None]:
# Solved example: Debt-to-Income vs Default
fig1 = px.box(
    df,
    x='default',
    y='debt_to_income',
    title='Debt-to-Income Ratio vs Default',
)
fig1.show()

# TODO: Plot Credit Utilization vs Default
# Hint: Use px.box with 'default' on x-axis and 'credit_utilization' on y-axis

# Solved example: Default rate by Income Group
# observed=True skips empty categories if income_group is a categorical column
income_group_default = df.groupby('income_group', observed=True)['default'].mean().reset_index()
fig3 = px.bar(
    income_group_default,
    x='income_group',
    y='default',
    title='Default Rate by Income Group'
)
fig3.show()

# TODO: Plot Default rate by Age Group
# Hint: Group by 'age_group' and calculate mean default rate, then use px.bar


**Interpretation:**

- Customers with higher **debt-to-income ratios** are more likely to default.
- **Credit utilization** above 0.8 signals higher default risk, and values close to 1.0 (a maxed-out credit limit) signal a much higher default probability.
- Lower income groups show visibly higher default rates.

# Section 4: Simple Modelling

Now that the dataset is cleaned and features engineered, we can try a simple model.
- Split data into train/test sets
- Fit a Logistic Regression model
- Evaluate using accuracy, precision, recall, and F1-score

⚠️ Note: The goal is not to achieve the best model here, but to connect the data preparation steps with prediction.

In [None]:

# Define target and features
y = df['default']
X = df.drop(columns=['default','customer_id'])

# One-hot encode categorical variables if any remain
X = pd.get_dummies(X, drop_first=True)

# Split into train and test; stratify=y keeps the default rate the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train Logistic Regression
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Default','Default'], yticklabels=['Non-Default','Default'])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()