Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction

Problem Statement:
The goal is to build a predictive model that estimates the probability of loan default.
A loan is treated as a default when the borrower is unable to repay within the agreed time (for example, 90+ days past due) or goes bankrupt.

- Types of Data Needed:

1) Applicant details – age, employment, income, dependents, residency, education.

2) Credit bureau data – credit scores, utilization, bankruptcies.

3) Loan details – amount, term, interest rate, collateral, repayment schedule.

4) Repayment history – on-time payments, late payments.

5) Banking/transactional data – inflows/outflows, income stability.

- Sources of Data:

1) Internal records from loan origination systems, core banking systems, repayment logs, and customer databases.

2) Credit bureaus for external credit history.

3) Open Banking APIs or payroll verification services for income and cashflow data.

Conclusion:
By combining internal records, credit bureau data, and external income and cashflow data (Open Banking or payroll services), the project can build a robust dataset to train and evaluate a loan default prediction model.

Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction

In [3]:
import pandas as pd
df = pd.read_csv("train_dataset.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [11]:
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

Features likely to be most relevant for predicting loan default:

1) Credit_History -> likely the strongest predictor (a clean history signals lower risk).
2) ApplicantIncome + CoapplicantIncome -> TotalIncome (combined household income).
3) LoanAmount & Loan_Amount_Term -> convert to a monthly instalment (EMI). As a simplification that ignores interest, EMI = LoanAmount / Loan_Amount_Term. From this we can calculate:
 -  DTI (Debt-to-Income) = EMI / TotalIncome: the share of household income taken up by the monthly payment. Higher DTI -> higher risk of default.
 -  LTI (Loan-to-Income) = LoanAmount / TotalIncome: how large the loan is compared to the borrower's income.
4) Dependents -> more dependents mean more financial obligations.
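The derived features above can be sketched in pandas. This is a minimal illustration that rebuilds only the five rows shown by df.head() instead of reading train_dataset.csv. The new column names (TotalIncome, EMI, DTI, LTI) are assumptions introduced here, and the EMI ignores interest as noted above. The units of LoanAmount are not stated in the source, so the ratios are only meaningful for comparing rows with each other.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), restricted to the columns used here.
df = pd.DataFrame(
    {
        "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
        "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0],
        "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
        "Loan_Amount_Term": [360.0, 360.0, 360.0, 360.0, 360.0],
    }
)

# Household income: applicant plus co-applicant.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Simplified instalment: principal spread evenly over the term, interest ignored.
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]

# Ratio features; a missing LoanAmount propagates as NaN.
df["DTI"] = df["EMI"] / df["TotalIncome"]
df["LTI"] = df["LoanAmount"] / df["TotalIncome"]

print(df[["TotalIncome", "EMI", "DTI", "LTI"]])
```

The first row has no LoanAmount, so its EMI, DTI and LTI stay missing. Missing values would need to be imputed or dropped before training.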



Exercise 3 : Training, Evaluating, and Optimizing the Model
Since this is a supervised ML problem (the data is labeled with Loan_Status, Y/N), we can use the following models:

1) Logistic Regression -> simple, interpretable baseline.

2) Decision Trees -> can capture non-linear rules, easy to visualize.

Evaluate model's performance:

We split the data into a training set (to fit the model) and a test set (to check it on unseen data).
Then we look at these numbers:

1) Accuracy -> % correct overall.

2) Recall -> how many real defaulters we actually caught (very important for banks).

3) Precision -> of the ones we predicted as default, how many were correct.

4) F1-score -> balance of precision & recall.
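A minimal sketch of this workflow with scikit-learn is shown here: split the data, fit both candidate models, and compute the four metrics. It runs on synthetic data from make_classification as a stand-in for the prepared loan features, so the numbers it prints say nothing about the real dataset. The class weights, max_depth and random_state values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features (assumption, not real data).
# Class 1 plays the role of "default" and is the minority class.
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    weights=[0.8, 0.2],
    random_state=42,
)

# Stratify so train and test keep the same share of defaulters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Because missing a real defaulter is costly for a bank, recall on the default class is usually the first number to compare between the two models.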


Exercise 4 : Designing Machine Learning Solutions for Specific Problems

For each of these scenarios, decide which type of machine learning would be most suitable, and explain why.

1) Predicting Stock Prices : predict future prices -> supervised learning -> regression model, since we predict a continuous value.
2) Organizing a Library of Books : group books into genres or categories based on similarities -> unsupervised learning (clustering), since we group items without labels.
3) Programming a robot to navigate and find the shortest path in a maze -> reinforcement learning, since the robot learns actions through rewards.

Exercise 5 : Designing an Evaluation Strategy for Different ML Models

1. Supervised learning: Classification model

Example: Predict if an email is spam or not spam.

- Data is already labeled.
- The model is trained on “questions + answers” and then tested on new data.

Evaluation strategy:

1) Accuracy -> out of all emails (say 100), how many did the model classify correctly, spam and not spam together?
2) Precision -> of the emails marked "spam", how many were truly spam?
3) Recall -> Of all the spam emails, how many did it catch?
4) F1-score -> A mix of precision and recall.
5) Cross-validation -> Train and test on different slices of the data to be more reliable.
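Cross-validation can be sketched with scikit-learn as follows. This is an illustration on synthetic data (the features are not real emails), and the choice of 5 folds, the F1 scoring and the logistic regression model are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for labeled emails (assumption): class 1 = spam, about 20%.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the spam share roughly equal in every slice.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 score per fold: train on four slices, test on the remaining one.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print("mean F1:", scores.mean().round(3), "std:", scores.std().round(3))
```

A large spread between folds suggests that a single train/test split could give a misleadingly good or bad score.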

Challenge: Sometimes accuracy is misleading.
Example: If 95% of emails are not spam, a naive model that always says “not spam” will get 95% accuracy while catching no spam at all (recall = 0).
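The 95% example can be checked by hand with a few lines of plain Python. The 100-email sample size is an assumption chosen to make the arithmetic easy to follow.

```python
# 100 emails: 95 legitimate (0) and 5 spam (1), matching the 95% example.
y_true = [0] * 95 + [1] * 5
# Baseline that never flags anything as spam.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate kept
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.95 recall=0.00 precision=0.00
```

Accuracy looks excellent, yet every spam email is missed, which is why recall and precision are reported alongside it.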


2. Unsupervised learning: Clustering model
Example: Doctors group patients by symptoms or genetic data without knowing the disease in advance.

Clusters might be:
- Patients with similar disease patterns.
- Similar types of cells in research.

Evaluation strategy:

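1) Silhouette score -> Tells how well each patient fits into their own cluster compared with the nearest other cluster (ranges from -1 to 1, higher is better).

2) Elbow method -> Helps choose how many patient groups make sense: plot the within-cluster variance against the number of clusters and look for the point where adding more clusters stops helping much.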

3) Domain knowledge -> Doctors check if the clusters are medically meaningful (e.g., one cluster = patients with heart disease symptoms, another cluster = patients with lung disease symptoms).
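The first two checks can be sketched with scikit-learn. The example uses synthetic points from make_blobs in place of patient data, and KMeans as the clustering algorithm. Both are assumptions for illustration, as is the range of k values tried.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for patient measurements (assumption, not real data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=7)

inertias = {}     # within-cluster sum of squares, used for the elbow method
silhouettes = {}  # mean silhouette score per number of clusters

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=7)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, labels)
    print(f"k={k} inertia={inertias[k]:.1f} silhouette={silhouettes[k]:.3f}")

# Candidate number of groups: the k with the highest mean silhouette.
best_k = max(silhouettes, key=silhouettes.get)
print("k with best silhouette:", best_k)
```

Inertia keeps falling as k grows, so the elbow is read from where the drop flattens out. The silhouette score gives a second opinion, and the final choice still needs the domain check described above.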

Challenge: There is no single “correct” clustering — two doctors (or two algorithms) might group patients differently, and both could be useful.

3. Reinforcement learning
Example: Treatment recommendation system

- Agent = AI recommending treatments.
- Actions = suggest treatment or dosage.
- Reward = patient health improves.
- Penalty = health worsens or side effects.

Evaluation strategy:

1) Cumulative Reward -> Over many patients, is health improving overall?
2) Convergence -> Does the system settle on a good treatment strategy instead of constantly changing?
3) Exploration vs. Exploitation -> Is it testing new treatments (exploration) while also using proven ones (exploitation)?
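These three ideas can be illustrated with a small simulation in plain Python. The sketch below is an epsilon-greedy multi-armed bandit, which is a deliberately simplified stand-in for reinforcement learning (there are no states or long-term effects). The three "treatments", their success probabilities, the epsilon value and the number of steps are all hypothetical.

```python
import random

random.seed(0)

# Hypothetical success probability of each simulated treatment (assumption).
true_success = [0.3, 0.5, 0.7]
epsilon = 0.1    # share of steps spent exploring
n_steps = 5000

counts = [0] * len(true_success)       # times each action was chosen
estimates = [0.0] * len(true_success)  # running mean reward per action
cumulative = []                        # cumulative reward after each step
total = 0

for _ in range(n_steps):
    if random.random() < epsilon:
        # Exploration: try a random treatment.
        action = random.randrange(len(true_success))
    else:
        # Exploitation: use the treatment with the best estimate so far.
        action = max(range(len(estimates)), key=estimates.__getitem__)

    reward = 1 if random.random() < true_success[action] else 0
    counts[action] += 1
    # Incremental update of the mean reward for the chosen action.
    estimates[action] += (reward - estimates[action]) / counts[action]
    total += reward
    cumulative.append(total)

print("times chosen:", counts)
print("estimated success:", [round(e, 2) for e in estimates])
print("average reward:", round(total / n_steps, 3))
```

The cumulative list is the cumulative reward curve, the counts show whether the agent has settled on one action (convergence), and epsilon controls the exploration versus exploitation balance.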

Challenge: In medicine, testing directly on real patients is risky. Often, reinforcement learning is tested first in simulations or with strong safety controls. Designing the right reward function is also very hard (how do you measure “improved health” correctly?).
