<a href="https://colab.research.google.com/github/moaadouhmad99-ops/DI_Bootcamp/blob/main/Week6/Day1/ExercisesXP_W6_D1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises XP: Student Notebook

For each exercise, the **Instructions** from the platform are provided, and the **Guidance** explains exactly what you must do to complete the task.

## What you will learn
- How to clearly define and articulate a machine learning problem statement.

- The process of data collection, including identifying relevant data types and potential data sources.

- Skills in feature selection and justification for machine learning models, particularly in the context of loan default prediction.

- Understanding of different types of machine learning models and their suitability for various real-world scenarios.

- Techniques and strategies for evaluating the performance of different machine learning models, including choosing appropriate metrics and understanding their implications.

## What you will create
- A detailed problem statement and data collection plan for a loan default prediction project, including identification of key data types and sources.
- A comprehensive feature selection analysis for a hypothetical loan default prediction dataset.
- A theoretical evaluation strategy for three different types of machine learning models, addressing the unique challenges and metrics relevant to each model type.
- Thoughtful analyses and justifications for choosing specific machine learning approaches for varied scenarios such as stock price prediction, library organization, and robot navigation.
- A document or presentation that showcases your understanding and approach to evaluating and optimizing machine learning models in diverse contexts.

## 🌟 Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction

### Instructions
- Write a clear problem statement for predicting loan defaults.
- Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
- Discuss the sources where you can collect this data (e.g., financial institution's internal records, credit bureaus).

**Expected Output:** A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

### Guidance
- Please write your answer as a short document. Begin by stating the prediction objective in a complete sentence that names the target variable and the decision it will support. Then, describe the data types you would collect in complete sentences. For each data type, explain in one sentence why it could help predict loan defaults.

- After that, name realistic data sources in complete sentences, and briefly describe how you would obtain or integrate each source.

- Finally, include one paragraph that explains risks and constraints such as privacy, regulation, data quality, sampling bias, and governance.

### Your answer

The goal is to predict whether a loan applicant will default (yes/no) to support loan approval and risk decisions.

I would collect demographic, financial, credit history, loan, and banking behavior data because they directly affect repayment ability and risk.

Data would be collected from internal bank systems, credit bureaus via secure APIs, and government financial data portals.

Key risks include privacy compliance, data quality issues, sampling bias, and the need for strong data governance.

## 🌟 Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction

### Instructions
From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

### Guidance
- First, identify the features that you believe are most relevant, and write their names in a sentence.
Then, provide a justification in complete sentences that explains how each selected feature relates to the likelihood of default.

- If you decide to exclude common features, write one sentence for each excluded feature to explain why it is not appropriate in this context.

- Conclude with two complete sentences that explain how you would encode categorical features and how you would impute missing values.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It provides a simple template you can modify while writing your justification.

import pandas as pd

# This placeholder DataFrame allows the cell to run even if you did not load a dataset yet.
example_columns = [
    "age","employment_length","annual_income","credit_score","loan_amount","interest_rate",
    "debt_to_income","num_delinquencies","num_open_accounts","total_utilization","home_ownership",
    "purpose","term","application_type","state","zip_code"
]
df = pd.DataFrame(columns=example_columns)

# Please replace this list with the actual columns that you select.
selected_features = [
    # e.g., "credit_score","debt_to_income","annual_income","loan_amount","interest_rate",
    # "employment_length","num_delinquencies","total_utilization"
]

print("You will now justify the selected features in complete sentences below.")

You will now justify the selected features in complete sentences below.


### Your justification

The most relevant features are credit_score, debt_to_income, annual_income, loan_amount, interest_rate, employment_length, num_delinquencies, and total_utilization.

These features are important because they reflect credit behavior, income stability, debt burden, and financial stress, which strongly influence default risk.

I excluded zip_code and state because location can act as a proxy for protected characteristics and introduce bias, and I excluded application_type because it has low direct predictive value.

I would use one-hot encoding for categorical variables and median or mode imputation for missing values.

## 🌟 Exercise 3 : Training, Evaluating, and Optimizing the Model

### Instructions
Which model(s) would you pick for loan default prediction?
Outline the steps to evaluate the model's performance, mentioning specific metrics that would be relevant to evaluate the model.

### Guidance
- Begin by naming one or two candidate models in a complete sentence and explain why each model is suitable for this problem.

- Next, describe an evaluation plan in complete sentences that covers the data split, the cross-validation strategy, the metrics you will report, and how you will choose a decision threshold.

- Then, explain in complete sentences how you will address class imbalance using stratification, class weights, or resampling.

- Finally, state in one or two complete sentences how you would iterate on hyperparameters to improve performance while avoiding data leakage.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It demonstrates standard classification metrics for binary loan default prediction.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, confusion_matrix, classification_report

# Please replace these placeholders with your true labels and predicted probabilities.
y_true = [0,1,0,1,0,0,1,0,1,0]            # placeholder labels
y_pred_proba = [0.05,0.80,0.10,0.65,0.20,0.15,0.70,0.30,0.85,0.25]  # placeholder probabilities

# You should set a decision threshold that reflects the precision-recall trade-off for your business case.
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
print("\nClassification report:\n", classification_report(y_true, y_pred, zero_division=0))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0

Confusion matrix:
 [[6 0]
 [0 4]]

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       1.00      1.00      1.00         4

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10



### Your answer

I would use Logistic Regression and Gradient Boosting because they work well for binary classification and capture both linear and non-linear patterns.

I would use stratified train/validation/test splits, k-fold cross-validation, and evaluate using precision, recall, F1-score, ROC-AUC, and PR-AUC with a business-driven threshold.
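One possible way to make the "business-driven threshold" concrete is sketched below. It assumes synthetic data from `make_classification` and a hypothetical business rule (catch at least 80% of defaults); neither comes from the exercise itself. The threshold is picked on the validation split so that the test split stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a loan dataset (about 10% positives).
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=42)

# Stratified 60/20/20 split keeps the default rate similar in each part.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; the final
# point (recall 0) has no threshold attached, so it is dropped here.
precision, recall, thresholds = precision_recall_curve(y_val, val_proba)

min_recall = 0.80  # hypothetical business requirement
meets_rule = recall[:-1] >= min_recall
# Among thresholds that satisfy the rule, keep the most precise one.
best = int(np.argmax(np.where(meets_rule, precision[:-1], -1.0)))
threshold = thresholds[best]

print("Chosen threshold:", round(float(threshold), 4))
print("Validation precision:", round(float(precision[best]), 4))
print("Validation recall:", round(float(recall[best]), 4))
```

The same threshold would then be applied once to the test split to report the final precision and recall.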

I would handle class imbalance using class weights and resampling methods.

I would tune hyperparameters using the validation data only, keeping the test set untouched until the final evaluation, to avoid data leakage.
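A minimal sketch of leakage-aware tuning, assuming synthetic data and a small illustrative grid over the regularization strength `C`: preprocessing lives inside the pipeline, so each cross-validation fold fits its own scaler, and the test split is scored only once at the end. Class weights are used here as one of the imbalance options mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a loan dataset.
X, y = make_classification(n_samples=1500, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling is part of the pipeline, so it is re-fitted inside every fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)  # the test split is never seen during tuning

test_ap = average_precision_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best C:", search.best_params_["clf__C"])
print("Cross-validated PR-AUC:", round(search.best_score_, 4))
print("Test PR-AUC:", round(test_ap, 4))
```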

## 🌟 Exercise 4 : Designing Machine Learning Solutions for Specific Problems

### Instructions
For each of these scenarios, decide which type of machine learning would be most suitable, and explain your choice.

- Predicting stock prices: predict future prices.
- Organizing a library of books: group books into genres or categories based on similarities.
- Programming a robot to navigate and find the shortest path in a maze.

### Guidance
Please identify the appropriate machine learning paradigm for each scenario in complete sentences and justify your choice.

For each scenario, write one complete sentence that describes the input data, one complete sentence that describes the output, and one complete sentence that describes the learning signal or objective.

### Your answer

For stock price prediction, supervised time-series learning is best; the input is historical market data, the output is future prices, and the objective is to minimize prediction error.

For organizing books, unsupervised clustering is best; the input is book metadata, the output is clusters, and the objective is to group similar books.

For robot navigation, reinforcement learning is best; the input is sensor and position data, the output is actions, and the objective is to maximize cumulative reward.
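To illustrate how the stock price scenario becomes a supervised problem, the sketch below turns a price series into lagged input features and a target. The random-walk prices and the choice of three lags are assumptions made for illustration only; `TimeSeriesSplit` is used so that every validation fold lies strictly after its training fold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic random-walk prices (not real market data).
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, size=60).cumsum(), name="close")

# Inputs: the previous n_lags prices. Output: the current price.
n_lags = 3  # hypothetical choice
frame = pd.DataFrame({f"lag_{k}": prices.shift(k) for k in range(1, n_lags + 1)})
frame["target"] = prices
frame = frame.dropna()  # the first n_lags rows have no complete history

X = frame.drop(columns="target")
y = frame["target"]

# Chronological folds: training indices always precede test indices.
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```

A regression model fitted on `X` and `y` would then minimize a prediction error such as mean squared error, which is the learning objective named in the answer.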


## 🌟 Exercise 5 : Designing an Evaluation Strategy for Different ML Models

### Instructions
- Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning.
- For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
- For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
- For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
- Address the challenges and limitations of evaluating models in each category.

### Guidance
- Please write a separate paragraph for each of the three model categories.
- In the supervised paragraph, describe your validation plan and list the metrics you will report in complete sentences.
- In the unsupervised paragraph, explain how you would measure cluster quality or structure in complete sentences and mention any diagnostic plots.
- In the reinforcement learning paragraph, describe how you would track cumulative reward, assess convergence, and balance exploration and exploitation using complete sentences.
- Conclude with one complete sentence per category that states a key evaluation challenge.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Supervised classification metrics template with placeholders.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score

# Replace these placeholders with your real outputs.
y_true = [0,1,1,0,1,0,0,1,0,1]
y_pred_proba = [0.1,0.7,0.8,0.2,0.6,0.3,0.4,0.9,0.2,0.85]
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0


In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Unsupervised clustering metrics template with synthetic data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=42)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print("Silhouette score (higher is better):", round(sil, 4))

print("Please explain in complete sentences when you would use the elbow method and how you would interpret it.")

Silhouette score (higher is better): 0.848
Please explain in complete sentences when you would use the elbow method and how you would interpret it.


### My answer

For supervised models, I would use cross-validation and report accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC, with class imbalance as the main challenge.

For unsupervised models, I would use silhouette score and the elbow method to evaluate clusters, with the main challenge being the lack of labels.
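The elbow method mentioned above can be sketched on the same synthetic blobs as the prefilled cell. The range of `k` values is an illustrative assumption: the idea is to fit K-Means for several values of `k`, record the inertia (within-cluster sum of squares), and look for the point after which adding clusters stops reducing it much.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as the prefilled clustering cell.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = list(range(1, 7))  # candidate numbers of clusters (illustrative)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Reduction in inertia gained by going from k to k + 1 clusters.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
print("Drop per extra cluster:", [round(d, 1) for d in drops])
```

On data like this, the drop should shrink sharply once `k` passes the number of true groups, and that bend is the "elbow". On real data the bend is often less clear, which is why it is usually read alongside the silhouette score.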

For reinforcement learning, I would track cumulative reward and convergence while balancing exploration and exploitation, with unstable training as the main challenge.
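As a small illustration of tracking cumulative reward and convergence, here is a sketch of tabular Q-learning on a tiny maze. Everything in it is an assumption made for illustration: the 4x4 layout, the reward of -1 per step, and the learning rate, discount, and epsilon schedule. Exploration starts high and decays toward a small floor, which is one common way to balance exploration and exploitation.

```python
import numpy as np

# Hypothetical 4x4 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = ["S..#",
        ".#..",
        ".#.#",
        "...G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)

def step(state, action):
    """Move one cell; bumping into a wall or the border leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 4 and 0 <= nc < 4) or MAZE[nr][nc] == "#":
        nr, nc = r, c
    return (nr, nc), -1.0, (nr, nc) == GOAL  # every step costs 1

rng = np.random.default_rng(0)
Q = np.zeros((4, 4, 4))            # Q[row, col, action]
alpha, gamma = 0.5, 0.95           # learning rate and discount (assumed)
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
returns = []                       # cumulative reward per episode

for episode in range(2000):
    state, total = START, 0.0
    for _ in range(50):            # cap on episode length
        if rng.random() < epsilon:
            action = int(rng.integers(4))        # explore
        else:
            action = int(np.argmax(Q[state]))    # exploit
        nxt, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * Q[nxt].max())
        Q[state][action] += alpha * (target - Q[state][action])
        state, total = nxt, total + reward
        if done:
            break
    returns.append(total)
    epsilon = max(eps_min, epsilon * eps_decay)

def greedy_path_length(max_steps=50):
    """Follow the learned greedy policy and count steps to the goal."""
    state, steps = START, 0
    while state != GOAL and steps < max_steps:
        state, _, _ = step(state, int(np.argmax(Q[state])))
        steps += 1
    return steps

print("Mean return, first 100 episodes:", np.mean(returns[:100]))
print("Mean return, last 100 episodes:", np.mean(returns[-100:]))
print("Greedy path length:", greedy_path_length())
```

Plotting a moving average of `returns` would give the usual convergence curve: it should rise and then flatten once the policy stops improving.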
