# Credit Scoring Dataset – First Look
In this session, you will:
- Explore a raw dataset for credit scoring.
- Identify and fix common data quality issues.
- Engineer new features useful for predicting credit risk.

The dataset mimics real-world bank data, with deliberate issues (missing values, outliers, duplicates).

> **Note for Students**  
> This dataset is **synthetic** and created for **academic purposes only**.  
> It is not real customer data, but it mimics some of the real issues (like missing values, inconsistent categories, and outliers) that we often face in actual credit scoring use cases.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For scaling and modeling later
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Display settings
pd.set_option('display.max_columns', None)



# Section 1: Quick EDA (Exploratory Data Analysis)
Before cleaning and feature engineering, let's understand the dataset.

Goals:
- Check dataset shape and column types
- Inspect distributions of numeric variables
- Explore categorical variables
- Identify issues: missing values, outliers, inconsistencies

### Step 1.1 – Dataset Overview
- Check shape and column types
- Preview first rows

In [2]:
# TODO: Read the CSV file into a DataFrame

In [None]:
# TODO: Preview the first 10 rows of the dataset
# Hint: use df.head(10)

In [None]:
# Solved example: dataset shape and info
print("Shape:", df.shape)
df.info()


### Step 1.2 – Summary Statistics
- Describe numeric variables
- Count categorical values

In [None]:
# TODO: Use df.describe() to summarize the numeric columns (df.info() was already shown in Step 1.1)

In [None]:
# TODO: Check value counts for 'employment_type'

### Step 1.3 – Missing Values
- Identify columns with NaN
- Check proportions of missing


*Explanation*: Missing values can bias results. If only a small share of rows is affected, those rows can be dropped; important fields are usually imputed instead (e.g., with the median or a dedicated "unknown" category).
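To make "proportions of missing" concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the course dataset). It assumes the column names `age`, `income`, and `employment_type` used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the course dataset (values are made up).
toy = pd.DataFrame({
    "age": [25, np.nan, 41, 37],
    "income": [52000, 61000, np.nan, np.nan],
    "employment_type": ["salaried", None, "self-employed", "salaried"],
})

# isnull() gives True/False per cell; the column mean of that is the
# fraction of missing entries in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column with a high missing share usually deserves a closer look before you decide between dropping and imputing.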

In [6]:

# TODO: Find missing values in each column
# Hint: use df.isnull().sum()


### Step 1.4 – Distributions & Outliers
- Plot histograms for numeric columns
- Use boxplots to detect outliers


*Explanation*: Visualizing distributions helps detect skewness, outliers, or unrealistic values (e.g., negative income).
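A boxplot draws its whiskers using the interquartile range (IQR). The same rule can be applied numerically, which is handy when you want a list of suspicious rows rather than a picture. This is a sketch on made-up balances; the 1.5 multiplier is the common boxplot convention, not a requirement.

```python
import pandas as pd

# Made-up balances with one obviously extreme value.
balances = pd.Series([120, 150, 130, 145, 160, 155, 140, 5000],
                     name="monthly_balance")

# Quartiles and interquartile range.
q1, q3 = balances.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = balances[(balances < lower) | (balances > upper)]
print(flagged)
```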

In [None]:
# Solved example: histogram of numeric features
df[['age','income','monthly_balance']].hist(bins=30, figsize=(10,6))
plt.show()


In [None]:
# TODO: Create a boxplot to check for outliers in monthly_balance

### Step 1.5 – Target Variable
- Check distribution of the target
- See if classes are balanced (default vs non-default)

*Explanation*: Checking class balance is critical. If defaults are rare, accuracy alone becomes misleading, and techniques such as resampling or class-weighted models may be needed.
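As a rough illustration of what "weighted models" means, the sketch below computes inverse-frequency class weights on a made-up target. The formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn applies for `class_weight="balanced"`; the data is invented for the example.

```python
import pandas as pd

# Made-up target: 8 non-defaults (0) and 2 defaults (1).
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="default")

counts = target.value_counts()

# Inverse-frequency weights: the rarer class gets the larger weight.
weights = len(target) / (counts.size * counts)
print(weights.to_dict())
```

Here the minority class ends up weighted four times as heavily as the majority class, which compensates for it being four times rarer.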

In [None]:

# TODO: Print raw counts of default vs non-default instead of percentages


In [None]:

# Solved example: target variable distribution
df['default'].value_counts(normalize=True).plot(kind='bar')
plt.title("Default vs Non-default Distribution")
plt.show()


# Section 2: Data Cleaning
Now we fix issues found in EDA.

Goals:
- Handle missing values, duplicates, and outliers.
- Standardize categories and check data consistency.


### Step 2.1 – Handle Missing Values
Choose a strategy:
- Drop rows
- Fill with mean/median/mode
- Use domain knowledge (e.g., a missing `employment_type` → "unknown")


*Explanation*: Imputation choice depends on distribution. Median is safer with skewed data (like income), while mean works for symmetric data.
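The difference between mean and median is easy to see on a small skewed sample. The incomes below are made up; one very large value is enough to pull the mean far away from the typical customer.

```python
import numpy as np
import pandas as pd

# Made-up incomes: four typical values, one extreme value, one missing.
income = pd.Series([30000, 32000, 35000, 38000, 1_000_000, np.nan])

# mean() and median() skip NaN by default.
print("mean:  ", income.mean())    # pulled up by the single extreme value
print("median:", income.median())  # stays near the typical values

# Filling with the median keeps the imputed row close to a typical customer.
filled_median = income.fillna(income.median())
```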

In [None]:

# TODO: Fill missing values for 'age' and 'employment_type'


In [None]:
# Solved example: Fill missing income with median
df['income'] = df['income'].fillna(df['income'].median())

### Step 2.2 – Remove Duplicates
- Check duplicate rows
- Drop if necessary

*Explanation*: Duplicates usually arise from repeated entries. Always confirm before dropping, since some repeats may be legitimate.
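One way to "confirm before dropping" is to look at every copy of a repeated row first. The sketch below uses a made-up frame; `customer_id` is a hypothetical column used only for illustration and may not exist in the course dataset.

```python
import pandas as pd

# Made-up frame; customer_id is a hypothetical column for this example.
toy = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "income": [40000, 55000, 55000, 61000],
})

# keep=False marks every member of a duplicate group, so all copies are visible.
repeats = toy[toy.duplicated(keep=False)]
print(repeats)

# After inspection, keep the first copy of each fully identical row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```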

In [None]:

# TODO: Remove duplicate rows from the dataset


### Step 2.3 – Standardize Categories
- Fix inconsistent labels
- Ensure uniform formatting (e.g., lowercase)


*Explanation*: Standardizing categories avoids treating 'Self employed' and 'self-employed' as separate groups.
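Lowercasing and stripping spaces handles most cases, but separators such as spaces versus hyphens need one more step. The sketch below runs on made-up labels; mapping every separator to an underscore is just one possible convention.

```python
import pandas as pd

# Made-up labels showing typical inconsistencies.
raw = pd.Series(["Self employed", "self-employed", " SALARIED ", "salaried"])

clean = (
    raw.str.strip()                                 # remove leading/trailing spaces
       .str.lower()                                 # uniform case
       .str.replace(r"[\s_-]+", "_", regex=True)    # unify separators
)
print(clean.value_counts())
```

Comparing `value_counts()` before and after is a quick check that labels were merged as intended and nothing was merged by accident.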

In [None]:

# TODO: Standardize categories in 'employment_type'
# Hint: lowercase and strip spaces


### Step 2.4 – Treat Outliers
- Cap extreme values
- Or replace with thresholds

*Explanation*: Outliers can heavily influence models. Options include capping, transformation (log), or removal.
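To compare two of the options named above, the sketch below applies percentile capping and a log transform to the same made-up balances. `np.log1p` computes log(1 + x), which keeps zero balances valid; it assumes the values are non-negative.

```python
import numpy as np
import pandas as pd

# Made-up balances with one extreme value.
balance = pd.Series([100.0, 200.0, 300.0, 400.0, 50_000.0])

# Option 1: cap at the 99th percentile (clip leaves smaller values untouched).
capped = balance.clip(upper=balance.quantile(0.99))

# Option 2: log transform compresses large values but preserves their order.
logged = np.log1p(balance)

print(capped.max(), round(logged.max(), 2))
```

On a sample this small the 99th percentile sits close to the maximum, so capping changes little; the log transform shrinks the gap between typical and extreme values much more strongly.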

In [None]:

# TODO: Experiment with removing rows instead of capping outliers


In [None]:

# Solved example: cap outliers at 99th percentile
cap = df['monthly_balance'].quantile(0.99)
df['monthly_balance'] = np.where(df['monthly_balance'] > cap, cap, df['monthly_balance'])


### Step 2.5 – Validate Logic
- Check impossible values (e.g., negative age)

In [None]:

# Solved example: fix negative ages
# Assumes a negative age is a sign-entry error, so the absolute value is the intended age
df['age'] = df['age'].abs()


# Section 3: Feature Engineering
We create new features that add business insight.

Goals:
- Encode categorical variables.
- Build useful ratios (e.g., debt-to-income, credit utilization).
- Group variables (e.g., age buckets).

### Step 3.1 – Encode Categorical Variables
- Convert categories into numeric form (e.g., one-hot encoding)
- Avoid implying order in non-ordinal categories

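*Explanation*: Most models need numeric inputs. One-hot encoding gives each category its own 0/1 column, so no artificial order is implied. Setting `drop_first=True` drops one column per variable, because the remaining columns already determine it.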
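The effect of one-hot encoding is easiest to see on a tiny frame. The category labels below are made up for illustration; the course dataset may use different values for `education`.

```python
import pandas as pd

# Made-up education labels.
toy = pd.DataFrame({"education": ["highschool", "bachelor", "master", "bachelor"]})

# One indicator column per category; drop_first removes the first category
# in sorted order ("bachelor"), which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Rows whose original value was the dropped category show zeros (or `False`) in every remaining indicator column.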


In [None]:

# TODO: Encode other categorical variables like 'employment_type' and 'gender'


In [None]:
# Solved example: encode 'education'
df = pd.get_dummies(df, columns=['education'], drop_first=True)


### Step 3.2 – Create Ratios
- Debt-to-Income Ratio
- Credit Utilization (%)

*Explanation*: Ratios like debt-to-income capture relative financial health better than raw numbers.
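Ratios break when the denominator is zero. The solved example in this step adds 1 to income to avoid that; an alternative, sketched here on made-up numbers, is to treat a zero denominator as undefined and handle the resulting NaN explicitly. Which option is better depends on how zero incomes should be interpreted.

```python
import numpy as np
import pandas as pd

# Made-up values; the second customer reports zero income.
toy = pd.DataFrame({
    "total_debt": [5000, 12000, 3000],
    "income": [50000, 0, 30000],
})

# Replace zero income with NaN so the ratio becomes NaN instead of infinity.
toy["debt_to_income"] = toy["total_debt"] / toy["income"].replace(0, np.nan)
print(toy)
```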


In [None]:

# TODO: Create a new feature: debt minus balance (remaining debt)


In [None]:

# TODO: Create credit utilization (balance / credit_limit)


In [None]:
# Solved example: Debt-to-Income ratio
# Adding 1 to income avoids division by zero for customers with zero income
df['debt_to_income'] = df['total_debt'] / (df['income'] + 1)


### Step 3.3 – Group Continuous Variables
- Age buckets
- Income ranges

*Explanation*: Bucketing continuous variables can reveal non-linear relationships (e.g., young borrowers may behave differently).  
It also makes it easier to compare customers across categories (e.g., low-income vs high-income).
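Once a variable is bucketed, comparing the default rate per bucket shows whether the relationship is non-linear. The sketch below uses made-up ages and outcomes, so the pattern it prints is illustrative only and says nothing about the course dataset.

```python
import pandas as pd

# Made-up ages and default flags.
toy = pd.DataFrame({
    "age": [19, 24, 28, 35, 42, 48, 55, 63],
    "default": [1, 1, 0, 0, 0, 0, 0, 1],
})

# include_lowest=True keeps customers aged exactly 18 in the first bucket.
toy["age_group"] = pd.cut(
    toy["age"], bins=[18, 30, 50, 100],
    labels=["Young", "Mid", "Senior"], include_lowest=True,
)

# The mean of a 0/1 target per bucket is the default rate in that bucket.
rate = toy.groupby("age_group", observed=False)["default"].mean()
print(rate)
```

In this toy sample the middle bucket has a lower rate than both neighbours, a U-shaped pattern that a single linear term on `age` could not capture.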


In [None]:
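# include_lowest=True keeps age 18 in the first bucket; ages outside 18-100 become NaN
df['age_group'] = pd.cut(df['age'], bins=[18,30,50,100], labels=['Young','Mid','Senior'], include_lowest=True)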

# TODO: Create alternative age groups with different splits

# TODO: Create income ranges


## Bonus Homework – Logistic Regression Model

For extra practice, try building a simple model to predict `default`:

Steps you might consider:
- Split the dataset into train/test sets.
- Choose relevant features (hint: numeric ones work directly, categories need encoding).
- Fit a Logistic Regression model.
- Evaluate using accuracy, precision, and recall.

   *Hint*: Look at scikit-learn's `LogisticRegression` and `train_test_split`.
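To clarify what the three metrics measure, here is a small worked sketch on made-up labels and predictions (no model is fitted). It uses `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions (1 = default).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Accuracy: share of all predictions that are correct.
acc = accuracy_score(y_true, y_pred)
# Precision: of the customers predicted to default, how many actually did.
prec = precision_score(y_true, y_pred)
# Recall: of the customers who actually defaulted, how many were caught.
rec = recall_score(y_true, y_pred)

print(acc, prec, rec)
```

With 3 true positives, 1 false positive, and 1 false negative, precision and recall are both 3/4, while accuracy is 8/10.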

This is optional and meant as bonus homework, not part of the live session.
