# Data Management & Analytics in Python

### WU Executive Academy: Data Science

### Post Module - João Reis

***

*In the Advanced track, you will do a mini end-to-end Data Science project on a topic of your choice using an existing publicly available data set. The notebook you hand in should be roughly structured as follows:*

### 1. Case Description

***Motivation***

*This project focuses on analyzing bank personal loan data to develop predictive models that can enhance the decision-making process in retail banking. In today's competitive banking environment, accurately assessing loan applications and determining appropriate loan amounts is crucial for both risk management and customer satisfaction. By leveraging machine learning techniques, banks can potentially automate and improve their loan approval processes while maintaining prudent risk management practices.*

***Research Questions***


1. *Loan Approval Prediction:*
   - Can we accurately predict whether a customer will accept a personal loan offer based on their demographic and financial characteristics?
   - Which customer attributes are the most significant predictors of loan acceptance?

2. *Maximum Loan Amount Estimation:*
   - What is the optimal maximum loan amount that can be safely offered to different customer segments?
   - How do factors such as income, education level, and existing financial obligations influence the recommended loan amount?

3. *Customer Segmentation:*
   - Are there distinct customer profiles that emerge from the data?
   - How do different customer segments vary in their loan acceptance patterns and creditworthiness?

4. *Risk Assessment:*
   - What combination of factors indicates higher or lower risk in personal loan customers?
   - How can we balance maximizing loan approvals while minimizing potential defaults?

*This analysis aims to provide insights that could help banks optimize their personal loan offerings while maintaining appropriate risk levels and improving customer targeting strategies.*


### 2. Dataset

The dataset was downloaded from Kaggle (https://www.kaggle.com/datasets/samira1992/bank-loan-intermediate-dataset) and contains information about bank customers, including their demographic and financial characteristics.

I chose this dataset because it is simple and already cleaned. Since I am only starting my journey in Data Science, I preferred to spend my effort on the modeling problems and on applying the techniques learned in class rather than on data cleaning.

### 3. Related work

References to literature and other notebooks available online that you have
looked into.

### 4. Step 1: Load Data

In [24]:
import numpy as np # linear algebra
import pandas as pd # data processing

# Import data
df = pd.read_csv("https://raw.githubusercontent.com/joao-reis25/dataSciencePython/refs/heads/main/csv/Bank_Personal_Loan_Modelling.csv")

### 5. Step 2: Prepare and explore the data

Before starting the analysis, it is important to understand the data. First, we can check the basic information about the dataset and then explore it further with visualizations.

In [None]:
# Explore the data
# Basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
df.info()

In [None]:
# Display first few rows and basic statistics
print("\nFirst 5 rows:")
display(df.head())
print("\nBasic Statistics:")
display(df.describe())

In [None]:
# Check for missing values
print("\nMissing Values:")
display(df.isnull().sum())

In [None]:
# Display unique values in each column
print("\nUnique Values per Column:")
for column in df.columns:
    print(f"\n{column}:")
    print(df[column].value_counts().head())
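
For the classification task it also helps to look at how balanced the target is, since a rare positive class makes plain accuracy misleading. The following is a minimal sketch on a toy series; `class_balance` and `toy_target` are hypothetical names used only for illustration, and the values are not taken from the dataset.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return absolute counts and relative shares of each class in a target column."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})

# Hypothetical toy target: 1 = accepted the loan offer, 0 = declined
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Personal Loan")
balance = class_balance(toy_target)
print(balance)
```

On the real data the same helper could be called as `class_balance(df['Personal Loan'])`.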

In [None]:
# Create visualizations
import matplotlib as mpl # visualizations
import matplotlib.pyplot as plt
import seaborn as sns

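# Set up the plotting style
# (the 'seaborn' style name is no longer available in recent Matplotlib versions;
#  sns.set_theme applies the seaborn look directly)
sns.set_theme(style="darkgrid", palette="husl")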

# Create a figure with multiple subplots
fig = plt.figure(figsize=(15, 10))

# Distribution of Personal Loan (target variable)
plt.subplot(2, 2, 1)
sns.countplot(data=df, x='Personal Loan')
plt.title('Distribution of Personal Loan')

# Income distribution
plt.subplot(2, 2, 2)
sns.histplot(data=df, x='Income', bins=30)
plt.title('Distribution of Income')

# Age distribution
plt.subplot(2, 2, 3)
sns.histplot(data=df, x='Age', bins=30)
plt.title('Distribution of Age')

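# Finish and show the distribution plots before starting a new figure
plt.tight_layout()
plt.show()

# Correlation heatmap (separate figure)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()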

# Additional insights: Education level vs Personal Loan
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Education', hue='Personal Loan')
plt.title('Education Level vs Personal Loan')
plt.show()

# Income vs Personal Loan with Education
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Personal Loan', y='Income', hue='Education')
plt.title('Income Distribution by Loan Status and Education')
plt.show()
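
The heatmap shows all pairwise correlations, but for the research questions the column of correlations with the target is the most relevant part. A small sketch of how that column could be extracted and ranked, using a made-up toy frame (`toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "Income": [40, 60, 80, 120, 150, 200],
    "Age": [25, 52, 33, 47, 29, 41],
    "Personal Loan": [0, 0, 0, 1, 1, 1],
})

# Correlation of every numeric feature with the target,
# sorted by absolute strength (strongest first)
target_corr = (
    toy.corr(numeric_only=True)["Personal Loan"]
    .drop("Personal Loan")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Correlation only captures linear association, so this ranking is a first impression rather than a measure of predictive importance.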

### 6. Step 3: ML Modeling

The first model is a classification model that predicts whether a customer will take a personal loan. All columns except `ID` and `ZIP Code` are used as features, including income, age, education, mortgage, and average credit card spending.

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Prepare the data
X = df.drop(['Personal Loan', 'ID', 'ZIP Code'], axis=1)  # Remove ID and ZIP Code as they're likely not relevant
y = df['Personal Loan']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and evaluate multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),  # fixed seed for reproducible results
    'SVM': SVC()
}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"\n{name} Results:")
    print(classification_report(y_test, y_pred))
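
With an imbalanced target, a stratified split and a confusion matrix make the per-class errors easier to read than a single accuracy figure. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption standing in for the loan data); the names `X_demo`, `y_demo`, and `clf` are hypothetical, and the pipeline is one possible setup rather than the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# stratify keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only, then the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```

Wrapping the scaler and the model in one pipeline also avoids having to keep separate scaled copies of the train and test sets.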

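After predicting whether a customer will take a personal loan, I want to estimate the maximum loan amount a customer can afford. The target here is synthetic: it is computed in the cell below from income, mortgage, and average credit card spending using a simplified affordability rule. The regression models then learn to reproduce this value from age, experience, income, education, mortgage, and credit card usage.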


In [None]:
# Create a synthetic maximum loan amount based on common banking rules
# This is a simplified example - actual banking rules are more complex
def calculate_theoretical_max_loan(row):
    # Common rules of thumb:
    # - Maximum monthly payment = 40% of monthly income
    # - Loan term = 3 years
    # - Consider existing obligations (mortgage, credit card)
    
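    # Income is assumed to be annual and in thousands; CCAvg monthly and in thousands
    monthly_income = row['Income'] * 1000 / 12
    # Mortgage is used as-is and spread over 12 months: a rough proxy, not a real repayment schedule
    existing_obligations = (row['Mortgage'] / 12) + (row['CCAvg'] * 1000)
    available_monthly_payment = (monthly_income * 0.4) - existing_obligations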
    
    # Present value of an annuity: 3-year loan at 7% annual interest
    r = 0.07 / 12  # monthly interest rate
    n = 36         # number of months
    max_loan = available_monthly_payment * ((1 - (1 + r)**(-n)) / r)
    
    return max(0, max_loan)  # Ensure non-negative

# Create target variable
df['Theoretical_Max_Loan'] = df.apply(calculate_theoretical_max_loan, axis=1)

# Prepare features for regression
features = ['Age', 'Experience', 'Income', 'CCAvg', 'Education', 'Mortgage']
X = df[features]
y = df['Theoretical_Max_Loan']

# Split and scale data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)  # fixed seed for reproducible results
}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"R2 Score: {r2_score(y_test, y_pred):.3f}")
    print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")

# Function to predict max loan for new customers
def predict_max_loan(model, customer_data):
    customer_scaled = scaler.transform(customer_data)
    predicted_amount = model.predict(customer_scaled)[0]
    return max(0, predicted_amount)
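
The present-value step inside `calculate_theoretical_max_loan` can be checked in isolation. The sketch below works through one made-up customer, using the 40% payment share, the 36-month term, and the 7% rate from `r = 0.07 / 12`; `annuity_present_value` is a hypothetical helper name and the customer figures are illustrative assumptions.

```python
def annuity_present_value(monthly_payment: float, annual_rate: float, months: int) -> float:
    """Present value of a fixed monthly payment made for `months` periods."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        # Without interest the loan is simply the sum of all payments
        return monthly_payment * months
    return monthly_payment * (1 - (1 + r) ** (-months)) / r

# Hypothetical customer: $60,000 annual income, no mortgage,
# $1,000 average monthly credit card spending
monthly_income = 60_000 / 12                      # $5,000 per month
available_payment = monthly_income * 0.4 - 1_000  # $1,000 left for the loan
max_loan = annuity_present_value(available_payment, annual_rate=0.07, months=36)
print(f"Maximum loan: ${max_loan:,.2f}")
```

Because interest is paid on top of the principal, the result is lower than the 36,000 dollars of total payments.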

In [None]:
# Test different customer profiles
test_customers = pd.DataFrame({
    'Age':        [25,    35,    45,    55],
    'Experience': [3,     10,    20,    30],
    'Income':     [40,    80,    120,   200],
    'CCAvg':      [1,     2,     4,     6],
    'Education':  [1,     2,     3,     3],  # 1=Undergrad, 2=Graduate, 3=Advanced/Professional
    'Mortgage':   [0,     100,   200,   300]  # in $000, assuming the same units as the dataset's Mortgage column
})

print("Predicted maximum loan amounts for different customer profiles:")
print("-" * 80)
for idx, customer in test_customers.iterrows():
    customer_df = pd.DataFrame([customer])
    max_loan = predict_max_loan(models['Random Forest'], customer_df)
    print(f"\nCustomer Profile {idx + 1}:")
    print(f"Age: {customer['Age']}, Experience: {customer['Experience']} years")
    print(f"Income: ${customer['Income']*1000:,.0f}, Education: {customer['Education']}")
    print(f"Credit Card Avg: ${customer['CCAvg']*1000:,.0f}, Mortgage: ${customer['Mortgage']*1000:,.0f}")
    print(f"Predicted maximum loan amount: ${max_loan:,.2f}")