In [1]:
import pandas as pd

# Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction
**Instructions**
- Write a clear problem statement for predicting loan defaults.
- Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
- Discuss the sources where you can collect this data (e.g., financial institution’s internal records, credit bureaus).

**Expected Output**: A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

### Problem Statement for Predicting Loan Defaults

The goal is to predict, at application time, whether a loan applicant is likely to default on a loan. A machine learning model trained on historical loan outcomes can help financial institutions reduce credit risk, make more consistent decisions, and automate parts of the loan approval process.

### Required Data Types

To build this model, the following data would be needed:

**Applicant Information**
- Age
- Employment status
- Self-employed or not
- Education level
- Marital status

**Financial Data**
- Applicant income
- Co-applicant income
- Total loan amount
- Loan term
- Existing debts

**Credit-Related Data**
- Credit history / credit score
- Past repayment behavior
- Previous defaults
  
**Loan Details**
- Loan purpose
- Interest rate
- Collateral (if any)

### Data Sources
- **Internal bank records**: loan applications, repayment history, income documents
- **Credit bureaus** *(e.g., Experian, Equifax)*: credit score, past credit behavior
- **Government or regulatory agencies**: verified identity and employment data
- **Customer-provided documents**: pay slips, tax reports, bank statements

# Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction
**Instructions**
- From the Loan Prediction dataset, identify which features might be most relevant for predicting loan defaults.
- Justify your choice of features.

In [2]:
train = pd.read_csv('train_u6lujuX_CVtuZ9i (1).csv')

In [5]:
train.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


Based on the dataset, the following features are likely the most relevant for predicting loan defaults:

**1. ApplicantIncome**

- Higher income increases a borrower’s ability to repay the loan.
- Low or unstable income often correlates with a higher probability of default.

**2. CoapplicantIncome**

- If a co-applicant contributes to the household income, the total repayment capacity increases.
- Combined income is a strong financial stability indicator.

**3. LoanAmount**

- This represents the total size of the loan.
- Borrowers with high loan amounts relative to their income may be at higher risk of default because the repayment burden becomes heavier.

**4. Loan_Amount_Term**

- Shorter loan terms imply higher monthly payments.
- Higher monthly obligations may increase default risk if income is insufficient.

**5. Credit_History**

- This is one of the strongest predictors.
- Borrowers with a positive credit history have proven reliability in repaying past loans.
- A poor credit history is strongly associated with default risk.

**6. Self_Employed**

- Employment type can influence income stability.
- Self-employed individuals may have more variable income compared to salaried employees, which can increase the likelihood of repayment issues.
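
The reasoning above about combined income and loan size relative to income can be sketched as derived features. This is only an illustration: the five rows are retyped from the `train.head()` output, and the column names `TotalIncome`, `LoanToIncome`, and `AmountPerTermUnit` are hypothetical, not part of the dataset. The notebook does not state the units of `LoanAmount` or `Loan_Amount_Term`, so the ratios are only meaningful relative to each other.

```python
import numpy as np
import pandas as pd

# The five rows shown by train.head(), retyped by hand so the sketch is self-contained.
sample = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],  # first row is missing
    "Loan_Amount_Term": [360.0] * 5,
})

# Hypothetical derived features (assumed names, not columns of the original data).
sample["TotalIncome"] = sample["ApplicantIncome"] + sample["CoapplicantIncome"]
sample["LoanToIncome"] = sample["LoanAmount"] / sample["TotalIncome"]
sample["AmountPerTermUnit"] = sample["LoanAmount"] / sample["Loan_Amount_Term"]

print(sample[["TotalIncome", "LoanToIncome", "AmountPerTermUnit"]])
```

The missing `LoanAmount` in the first row propagates as `NaN` into both ratios, so missing values would need to be handled before any of these features reach a model.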

# Exercise 3 : Training, Evaluating, and Optimizing the Model
**Instructions**
- Which model(s) would you pick for loan prediction?
- Outline the steps to evaluate the model’s performance, mentioning specific metrics that would be relevant to evaluate the model.

### Models for Loan Prediction
For a loan prediction problem, the task is to classify whether an applicant is likely to default or successfully repay the loan. This is a binary classification problem, so several machine learning models are suitable.

**Recommended Models**:

- Logistic Regression<br>
This model is a strong baseline because it is simple, interpretable, and provides clear insights into how each feature affects the prediction. It works well when the relationship between features and the target is mostly linear.

- Decision Tree–based Models (e.g., Random Forest, Gradient Boosting)<br>
These models can capture more complex, non-linear relationships in the data.
    - Random Forest helps reduce overfitting and works well with mixed data.
    - Gradient Boosting models (e.g., XGBoost, LightGBM, CatBoost) often achieve the highest predictive performance on tabular datasets.

Using at least one linear model (Logistic Regression) and one tree-based model provides a strong comparison.
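
A minimal sketch of such a comparison, assuming scikit-learn is available. The data here is synthetic (generated with `make_classification`, roughly 20% positives standing in for defaults), not the loan dataset, and the hyperparameters are arbitrary placeholders rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 600 rows, about 20% in the positive class.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

models = {
    # Scaling helps the linear model converge; the forest does not need it.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_recall = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
    cv_recall[name] = scores.mean()
    print(f"{name}: mean recall over 5 folds = {scores.mean():.3f}")
```

Which model comes out ahead depends on the data, so the same loop would need to be rerun on the real, preprocessed loan features before drawing any conclusion.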

### Relevant Metrics for Evaluating the Model
Loan default prediction usually involves imbalanced classes: far fewer defaults than non-defaults.
Because of this, accuracy is not a reliable metric. A model that predicts "no default" for every applicant can reach high accuracy while catching no defaulters at all.

The most important goal for a bank is not to miss potential defaulters, that is, to keep false negatives low.
Therefore, the key metrics are:

**Recall (Sensitivity) — Primary Metric**

- Recall measures the proportion of actual defaulters the model correctly identifies.
- A high recall ensures that the model does not overlook risky applicants.

**F1-score — Balanced Evaluation Metric**

- F1 combines precision and recall into a single score and is especially useful for imbalanced datasets.
- Because it also accounts for precision, it penalizes a model that reaches high recall simply by classifying everyone as a defaulter.
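
To make the point about accuracy, recall, and F1 concrete, here is a small sketch in plain Python. The labels are made up for illustration (10 defaulters among 100 applicants), and `classification_metrics` is a hypothetical helper written for this example, not a library function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = default, 0 = repaid. 10 defaulters out of 100 applicants.
y_true = [1] * 10 + [0] * 90

always_repay = [0] * 100    # never flags anyone
always_default = [1] * 100  # flags everyone

print(classification_metrics(y_true, always_repay))
# accuracy 0.9, recall 0.0, f1 0.0: high accuracy, yet every defaulter is missed
print(classification_metrics(y_true, always_default))
# accuracy 0.1, recall 1.0, precision 0.1, f1 about 0.18: perfect recall, poor F1
```

The two degenerate models show why recall and F1 are read together: the first looks good only on accuracy, the second only on recall.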

# Exercise 4 : Designing Machine Learning Solutions for Specific Problems
**Instructions**<br>
For each of these scenarios, decide which type of machine learning would be most suitable, and explain why.

- Predicting Stock Prices : predict future prices
- Organizing a Library of Books : group books into genres or categories based on similarities.
- Program a robot to navigate and find the shortest path in a maze.

### 1. Predicting Stock Prices

Type: **Supervised Learning**

Explanation:
- Stock price prediction involves forecasting a continuous numerical value based on historical data — such as past prices, volumes, indicators, or external factors.
- Because the training dataset includes known target values (previous stock prices), the model learns a mapping from features to outcomes.
- Therefore, supervised learning, specifically regression, is the most appropriate method.

### 2. Organizing a Library of Books

Type: **Unsupervised Learning**

Explanation:
- In this scenario, the goal is to group books based on similarities without pre-labeled categories.
- Unsupervised learning, especially clustering algorithms, can automatically detect natural groupings — such as genre, writing style, themes, or author characteristics.
- This allows the system to discover structure in the data without predefined labels.

### 3. A Robot Navigating a Maze and Finding the Shortest Path

Type: **Reinforcement Learning**

Explanation:
- A robot learning to navigate a maze must make sequential decisions and adjust its actions based on feedback from the environment.
- Reinforcement learning is ideal because the robot receives rewards or penalties depending on its actions (e.g., reaching the correct path or bumping into a wall).
- Over time, it learns an optimal strategy (a policy) for reaching the goal through trial and error.

# Exercise 5 : Designing an Evaluation Strategy for Different ML Models
**Instructions**
- Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning.
- For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
- For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
- For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
- Address the challenges and limitations of evaluating models in each category.

### 1. Supervised Learning Model
**Chosen model**: Logistic Regression (classification)

**Metrics**:
- Accuracy – general performance
- Precision & Recall – important when false positives/negatives matter
- F1-score – balances precision and recall

**Methods**:
- Cross-validation to test model stability on different data splits
- ROC Curve & AUC to evaluate how well the model separates classes.

**Challenges**
- Imbalanced classes can make accuracy misleading
- Cross-validation may be computationally expensive
- Choosing the right threshold affects precision/recall trade-off
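
The threshold trade-off can be illustrated with a few hand-made scores. This is a sketch in plain Python: the labels and predicted probabilities are invented, and `rates_at_threshold` is a hypothetical helper. Each threshold also yields one (false positive rate, recall) pair, which is one point on the ROC curve.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, false_positive_rate) when predicting 1 for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Invented example: 4 positives and 6 negatives with model scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.3, 0.5, 0.7):
    precision, recall, fpr = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f}")
# threshold=0.3: precision=0.571 recall=1.000 fpr=0.500
# threshold=0.5: precision=0.750 recall=0.750 fpr=0.167
# threshold=0.7: precision=1.000 recall=0.750 fpr=0.000
```

Lowering the threshold from 0.5 to 0.3 catches every positive but admits more false positives, which is the trade-off a recall-focused application has to accept deliberately.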

### 2. Unsupervised Learning Model
**Chosen model**: K-Means Clustering<br><br>
**Evaluation Strategy**
- Silhouette Score – shows how well clusters are separated
- Elbow Method – helps choose the optimal number of clusters
- Cluster Validation Metrics (e.g., Davies–Bouldin Index)

**Challenges**
- No ground truth labels → evaluation is subjective
- Results depend heavily on feature scaling
- Different cluster shapes may be poorly handled by K-Means
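
A sketch of how these checks might be run together, assuming scikit-learn. The data is synthetic (`make_blobs` with three generated centers), so the numbers say nothing about any real dataset; the point is only the shape of the evaluation loop.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# K-Means is distance-based, so features are scaled first.
X = StandardScaler().fit_transform(X)

cluster_scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    cluster_scores[k] = {
        "inertia": km.inertia_,                                 # plotted against k for the elbow method
        "silhouette": silhouette_score(X, km.labels_),          # higher is better, range [-1, 1]
        "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
    }

for k, s in cluster_scores.items():
    print(f"k={k}: inertia={s['inertia']:.1f} "
          f"silhouette={s['silhouette']:.3f} davies_bouldin={s['davies_bouldin']:.3f}")
```

The three measures do not always agree on the best `k`, which is one practical form of the subjectivity noted in the challenges above.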

### 3. Reinforcement Learning Model
**Chosen model**: Q-Learning<br><br>
**Evaluation Strategy**
- Cumulative Reward – measures how much total reward the agent collects
- Convergence Rate – how quickly the policy becomes stable
- Exploration vs. Exploitation – ensuring the agent still explores enough to avoid suboptimal policies.

**Challenges**
- Performance depends heavily on reward design
- Long training times
- Balancing exploration and exploitation is difficult
- Harder to reproduce and evaluate compared to supervised models
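
As an illustration of tracking cumulative reward and an exploration schedule, here is a minimal tabular Q-learning sketch in plain Python. Everything in it is an assumption made for the example: the 4x4 maze layout, the reward values (+10 at the goal, -1 for hitting a wall, -0.1 per step), the linear epsilon decay, and the hyperparameters.

```python
import random

# Hypothetical maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S..#",
    ".#..",
    "...#",
    "#..G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    row, col = state[0] + action[0], state[1] + action[1]
    if not (0 <= row < 4 and 0 <= col < 4) or MAZE[row][col] == "#":
        return state, -1.0, False  # bumped into a wall or the border: stay put
    if MAZE[row][col] == "G":
        return (row, col), 10.0, True
    return (row, col), -0.1, False  # small step cost favors short paths


def train(episodes=500, alpha=0.5, gamma=0.9, eps_start=1.0, eps_end=0.05, seed=0):
    rng = random.Random(seed)
    q = {}        # (state, action_index) -> estimated value
    returns = []  # cumulative reward per episode
    for ep in range(episodes):
        eps = max(eps_end, eps_start * (1 - ep / episodes))  # explore less over time
        state, total = (0, 0), 0.0
        for _ in range(100):  # cap on episode length
            if rng.random() < eps:
                a = rng.randrange(4)  # explore
            else:
                a = max(range(4), key=lambda i: q.get((state, i), 0.0))  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q.get((nxt, i), 0.0) for i in range(4))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state, total = nxt, total + reward
            if done:
                break
        returns.append(total)
    return q, returns


q_table, returns = train()
early = sum(returns[:50]) / 50
late = sum(returns[-50:]) / 50
print(f"mean return, first 50 episodes: {early:.2f}")
print(f"mean return, last 50 episodes:  {late:.2f}")
```

Comparing the average return of early and late episodes is a rough convergence check; fixing the random seed, as done here, addresses part of the reproducibility concern, though results can still vary with the reward design.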