<p style="text-align:center">
    <a href="https://www.ict.mahidol.ac.th/en/" target="_blank">
    <img src="https://www3.ict.mahidol.ac.th/ICTSurveysV2/Content/image/MUICT2.png" width="400" alt="Faculty of ICT">
    </a>
</p>

# Lab08: ML Basics: Classification (Extra Assignment)

**If you have finished the First Lab Assignment, you may start on Extra Assignment: Part 4.**

This assignment covers a common and difficult classification problem: making important decisions from an imbalanced dataset, where examples of one class are limited.

This dataset is on deciding whether to **Lend Money**.

**Steps:**
1. Stratification
2. Cost Sensitive Models
3. Many Models
4. Undersampling
5. Oversampling
6. Curriculum Learning (Easy, Medium, Hard)



__Instructions:__
1. Append your ID at the end of this jupyter file name. For example, ```ITCS227_Lab0X_Assignment_Extra_6788123.ipynb```
2. Complete each task and question in the lab.
3. Once finished, raise your hand to call a TA.
4. The TA will check your work and give you an appropriate score.
5. Submit your IPYNB source code to MyCourses as record-keeping.

## Library of Helper Functions

In [None]:
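import warnings

import matplotlib.pyplot as plt
import numpy as np   # used by the plotting helpers below
import pandas as pd  # used to return the classification report as a DataFrame
from sklearn.exceptions import FitFailedWarning
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import learning_curve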

def _show_classification_report(model, y_true, y_pred, target_names):
    '''
        Function to print performance metrics
    '''
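    accuracy = accuracy_score(y_true, y_pred)
    # Use the default average='binary' so that pos_label takes effect;
    # with average='weighted', pos_label is ignored and both values come out identical.
    sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall of label 1
    specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of label 0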
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Sensitivity: {sensitivity:.4f}")
    print(f"Specificity: {specificity:.4f}")
    print("Classification Report:")
    class_report = classification_report(y_true, y_pred, target_names=target_names)
    print(class_report)
    res = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
    return pd.json_normalize(res, sep='_')
    
def _show_confusion_matrix(model, y_true, y_pred, target_names):
    '''
        Function to plot confusion matrix
    '''
    cm = confusion_matrix(y_true, y_pred, labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=target_names)
    disp.plot(cmap=plt.cm.Blues,)
    plt.gcf().set_size_inches(3.5, 3.5)
    disp.ax_.set_title(f'Confusion Matrix for {model.__class__.__name__}', fontsize=8)
    plt.show()
    
def _plot_histogram_of_frequencies(data, ax=None):
    '''
        Function to plot the count of each unique value and print the class split
    '''
    if ax is None:
        fig, ax = plt.subplots(figsize=(7,2.5))
    unique_values, counts = np.unique(data, return_counts=True)
    bars = ax.bar(unique_values, counts)  # draw on the given axes, not the current pyplot axes
    ax.set_xlabel("Values")
    ax.set_ylabel("Frequency")
    ax.set_title("Histogram")
    ax.set_xticks(unique_values)
    ax.bar_label(bars, fmt='%d')  # counts are whole numbers
    ax.set_ylim(bottom=0, top=1.25*max(counts))
    print('Class Split:', counts/sum(counts))
    plt.show()
    
def _make_learning_curve(model, X_train, y_train, scoring="f1_weighted", num_training_sizes=10):
    def _plot_learning_curve(model, train_sizes, train_scores, valid_scores, metric='F1 Score', plt_text='', ax=None):
        if not ax:
            fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 3), sharey=True, sharex=True)
        train_errors = train_scores.mean(axis=1)
        valid_errors = valid_scores.mean(axis=1)
        ax.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
        ax.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
        ax.set_xlabel("Training set size")
        ax.set_ylabel(f'{metric}')
        # plt.gca().set_xscale("log", nonpositive='clip')
        ax.grid()
        ax.legend(loc="upper right")
        ax.set_ylim(bottom=0, top=1.25*max([1]))
        ax.set_title(f'{model.__class__.__name__}\n{plt_text}', fontsize=8)
        plt.show()
        
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=FitFailedWarning)
        train_sizes, train_scores, valid_scores = learning_curve( model, 
                                                    X_train, y_train, 
                                                    train_sizes=np.linspace(0.01, 1.0, num_training_sizes), # `num_training_sizes` evenly spaced fractions, from 1% to 100% of the training folds
                                                    cv=5,     # cv=5 means  Train = 80%  , Validation = 20% in each fold.
                                                              # cv=10 means Train = 90%  , Validation = 10% in each fold.
                                                              #   - The fit/predict is repeated once per fold, each time holding out a different (stratified) fold of X/y.
                                                              #   - The resulting score is the average across all 5 folds; so a smoother and fairer result than a single hold-out split.
                                                    scoring=scoring,
                                                    n_jobs=-1
                                                )
    _plot_learning_curve(model, train_sizes, train_scores, valid_scores, metric=scoring.replace('_',' ').title(), plt_text='')



**Recommended Steps:**
* Think about the techniques you know that might improve the model's F1 Score to 0.95 or above.
* Starting with the **simplest first** (Occam's Razor!), try those approaches to improve the F1 Score until it reaches 0.95 or above, and report back on which worked for you:

## Q1: Which techniques / steps led to improving the F1 Score to 0.95 or above?

Ans: _________

## Q2: Include a Screenshot of your final Orange Workspace, including its best F1-Score:

Ans: _________


# Part 4 - Analysing the Lending Dataset: Deciding whether to lend. [ML Task: Handling Imbalanced Data]

Fill in the code (find examples in Tutorial and in slides) and answer the questions according to the steps below:

**Steps:**

- **Step 0:** 	Define the Dataset and Objective:
    - Objective:
        - Predict applicants that will fail to pay a loan, before lending.
    - Consider Actions:
        * 0 → Loan fully paid (`Low Risk`) :  `POSITIVE` Class (`P`) -> Lend to Borrower.
        * 1 → Loan not fully paid (`High Risk`) :  `NEGATIVE` Class (`N`) -> Do not Lend to Borrower.
- **Step 1:** 	Identify which cell is most crucial.
    - TP `"Loan fully paid"` - Lender will gain customer, Borrower will receive loan.
    - TN `"Loan not fully paid"` - Lender will lose customer, Borrower will not receive loan.
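    - FP (predicted `"Loan fully paid"`, actually not fully paid) - Lender will lose money, Borrower may go bankrupt.
    - FN (predicted `"Loan not fully paid"`, actually fully paid) - Lender will lose customer, Borrower will not be given loan.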
- **Step 2:** 	Define Positive (P) and Negative (N)
    - Consider the two classes are:
        - `Loan fully paid` (P)
        - `Loan not fully paid` (N)
- **Step 3:** 	State True and False
    - ... fit model, evaluate model, count errors ...
- **Step 4:** 	Calculate the metric
- **Step 5:** 	Decide if objective met.
    - Consider Minimum Metric Score is **0.9 or 90%**.
    - If above this level, it can be used by the financial institution to make lending choices.
    - If not, **Return to Step 3**: ... run more trials to improve performance.
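
Which predictions count as FP and which as FN depends entirely on which class is declared Positive. The following is a minimal sketch, using made-up labels rather than the loan data, that counts the four cells by hand for both choices. The helper names `confusion_cells` and `precision_recall_f1` are hypothetical and are not part of the lab's helper library.

```python
def confusion_cells(y_true, y_pred, positive):
    """Count (TP, FP, FN, TN), treating `positive` as the Positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from the cell counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration: 0 = fully paid, 1 = not fully paid.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

cells_pos0 = confusion_cells(toy_true, toy_pred, positive=0)
cells_pos1 = confusion_cells(toy_true, toy_pred, positive=1)
print("Positive = 0 -> (TP, FP, FN, TN):", cells_pos0)
print("Positive = 1 -> (TP, FP, FN, TN):", cells_pos1)

_, recall_pos1, _ = precision_recall_f1(*cells_pos1[:3])
print("Recall when Positive = 1:", recall_pos1)
```

Switching the Positive class swaps the FP and FN counts, so it is worth fixing the definition before choosing a metric.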

## Q1: Identify which Class is most important:
(Either - `Loan fully paid` (P) or `Loan not fully paid` (N) )
* Ans: _______

## Q2: Identify which Error (FP or FN) is most important:

(Either (1)`FP and FN` or (2) `FP` or (3) `FN` is most important. Explain why in a sentence.)

* Ans: _______

## Q3: Which Metric helps measure that condition?

* Ans: _______

##### Domain Knowledge:
For a financial institution, this kind of task falls under both "saving money" and "making money". The bank/lender wishes to lend money to an applicant who will successfully repay the loan over the longest (maximum) period of time.
* `Low Risk` borrower: Repays the loan early, so the net income on the loan is smaller.
* `Low` (-Medium) `Risk` borrower: Repays the loan over a longer lending period (i.e. the maximum period), so more interest is paid, leading to greater net income for the bank.
* `High Risk` borrower: Fails to repay the loan. The bank reclaims the applicant's collateral (assets) to resell, which leads to extra work and, in most cases, a **severe loss in net income**.

The task is to predict whether a borrower is likely to fail to fully repay a loan (`not.fully.paid = 1`) based on their financial history and credit-related factors. This model helps lenders assess loan default risk, enabling better decision-making in loan approvals, interest rate adjustments, and risk mitigation strategies.

##### Dataset Description

This dataset contains loan data from LendingClub.com, a platform connecting borrowers with investors, spanning the years 2007 to 2010. It includes information on over 9,500 loans, detailing loan structure, borrower characteristics, and loan repayment status. The data is derived from publicly available information on LendingClub.com.
**Source:** [Kaggle - Loan Data](https://www.kaggle.com/itssuru/loan-data)


| Variable           | Explanation                                                                                                                                                                                                    |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| credit_policy      | 1 if the customer meets the credit underwriting criteria; 0 otherwise.                                                                                                                                       |
| purpose            | The purpose of the loan.                                                                                                                                                                                      |
| int_rate           | The interest rate of the loan (higher rates indicate higher risk).                                                                                                                                            |
| installment        | The monthly installments owed by the borrower if the loan is funded.                                                                                                                                         |
| log_annual_inc     | The natural logarithm of the borrower's self-reported annual income.                                                                                                                                          |
| dti                | The borrower's debt-to-income ratio (debt divided by annual income).                                                                                                                                         |
| fico               | The borrower's FICO credit score.                                                                                                                                                                            |
| days_with_cr_line  | The number of days the borrower has had a credit line.                                                                                                                                                        |
| revol_bal          | The borrower's revolving balance (unpaid amount at the end of the credit card billing cycle).                                                                                                                |
| revol_util         | The borrower's revolving line utilization rate (credit line used relative to total available credit).                                                                                                       |
| inq_last_6mths     | The borrower's number of credit inquiries in the last 6 months.                                                                                                                                              |
| delinq_2yrs        | The number of times the borrower was 30+ days past due on a payment in the past 2 years.                                                                                                                       |
| pub_rec            | The borrower's number of derogatory public records.                                                                                                                                                         |
| not_fully_paid     | 1 if the loan was not fully paid; 0 otherwise.                                                                                                                                                              |


### Load Dataset

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load a DataFrame with a specific version of a CSV
target_names = ['Fully Paid', 'Not Fully Paid'] # [0,1]
target_name = 'not.fully.paid'
fn = "Dataset_Loan_Data_2007-2010__Imbalance_Kaggle-ITSSURU_2021/loan_data.csv"
df = pd.read_csv( fn )
df.head(1)

##### Explore Dataset:

In [None]:
# Display basic info
print(df.info())
df.describe()

##### Feature Distribution Analysis

In [None]:
# Histogram for key numerical features
num_features = ["fico", "int.rate", "dti", "installment"]
df[num_features].hist(figsize=(12, 8), bins=30, edgecolor="black")
plt.suptitle("Feature Distributions")
plt.show()

##### Boxplots for outlier detection

In [None]:
plt.figure(figsize=(12, 6))
for i, feature in enumerate(num_features):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df[feature], color="skyblue")
    plt.title(f"Boxplot of {feature}")
plt.tight_layout()
plt.show()

##### Encode categorical column 'purpose' using Label Encoding
* Now we can treat `purpose` as a numerical column.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["purpose"] = le.fit_transform(df["purpose"])
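
As a small illustration of what label encoding does, the sketch below encodes a few toy strings (made up for this example, not read from the dataset) and prints the resulting mapping. `LabelEncoder` assigns integer codes in sorted label order, so the codes imply an ordering that the categories do not really have; one-hot encoding is a common alternative when that matters.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only (an assumption, not taken from the CSV).
toy_purposes = ["debt_consolidation", "credit_card", "all_other", "credit_card"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_purposes)

# classes_ holds the sorted unique labels; a label's code is its index in classes_.
toy_mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}
print(toy_mapping)
print(toy_codes.tolist())

# inverse_transform recovers the original strings from the integer codes.
print(toy_encoder.inverse_transform(toy_codes).tolist())
```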

In [None]:
# Target class counts (`not.fully.paid`) for each encoded `purpose` value
sns.countplot(data=df, x='purpose', hue=target_name)

### Measure Class Balance:

In [None]:
_plot_histogram_of_frequencies(df[target_name])

### Checkpoint: The number of records per class is heavily imbalanced!
* In `Part 2` you were recommended to use `StratifiedShuffleSplit` for the mildly imbalanced dataset.
* This dataset has a stronger imbalance, with an **84:16 ratio**.
* In the `slides` example, we artificially `undersampled` a balanced dataset (from 1:1 to 99:1).
* In this task, you are asked to implement a solution to `Imbalanced Classification` (see the slides) that goes the other way: **from 84:16 to 1:1**.
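
To see what stratification does at this ratio, here is a minimal sketch on synthetic labels with an exact 84:16 split (an assumption standing in for the real target column). It compares the minority share in the train and test sets when `stratify` is passed to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 840 of class 0 and 160 of class 1 (84:16).
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify=y, both splits keep (almost exactly) the original class ratio.
train_minority_share = y_tr_demo.mean()
test_minority_share = y_te_demo.mean()
print(f"Minority share in train: {train_minority_share:.3f}")
print(f"Minority share in test:  {test_minority_share:.3f}")
```

Without `stratify`, the minority share in a small test set can drift away from 16% purely by chance.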

### Select `X` (input features) and `y` (target feature) and Split as X and y:
* Define your X and y data, from features in the dataset.
* Split X / y into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=[target_name])
y = df[target_name]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, #stratify=y
)

### Define a Reusable Function to Evaluate a Model:
- Below is a reusable function to evaluate several models. This will let us experiment with many variations to find the best.
- Run the cell to use it later.

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import StandardScaler

def evaluate_a_model( model, X_train, y_train, X_test, y_test, target_names, scaling=True, return_pd=False):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        
        # Scale the X data using StandardScaler. (This can improve the performance for this dataset.)
        if scaling:
            scaler = StandardScaler()
            _X_train = scaler.fit_transform(X_train)
            _X_test = scaler.transform(X_test)
        else:
            _X_train = X_train
            _X_test = X_test
            
        # Evaluate the model using stratified shuffle-split cross-validation:
        kf = StratifiedShuffleSplit(n_splits=5)
        train_f1_scores = cross_val_score(model, _X_train, y_train, cv=kf, scoring='f1_weighted')
        print('Stratified CV - F1 Weighted Score: ', train_f1_scores)
        model.fit(_X_train, y_train)
        y_pred = model.predict(_X_test)
        
        # Show the model performance:
        res = _show_classification_report(model, y_test, y_pred, target_names)
        res['Model'] = model.__class__.__name__
        _show_confusion_matrix(model, y_test, y_pred, target_names)
        _make_learning_curve(model, _X_train, y_train)
        if return_pd:
            return res

### Measure the model using a Baseline Classifier Algorithm:

In [None]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
evaluate_a_model( model, X_train, y_train, X_test, y_test, target_names)

## Q4.i: What is the metric score (for the metric you identified above), based on the model's performance on the `test` set?
*(example:  F1 Score and 0.95)*
* Ans: _______

### Q4.ii: What is the model's `weighted avg F1-Score`?

Ans: _________

### Q4.iii: Are the model's `weighted avg Precision` and `weighted avg Recall` approximately equal?

Ans: _________


### Q4.iv: Do the `Majority` and `Minority` class(es) have approximately equal F1 Score?

Ans: _________

## Q5: Is the objective met?
- Considering we defined the **Minimum Metric Score is 0.9 or 90%.**
- Does the score from your model meet the objective, so that the model can be delivered to a financial institution to make lending choices?
 
* Ans: _______

## Next Steps: - Experimentation:

See the code examples in the Lab Tutorial (and Slides), to apply these steps. 

1. Stratification
2. Cost Sensitive Models
3. Many Models
4. Undersampling
5. Oversampling
6. Curriculum Learning (Easy, Medium, Hard)

Your task is to find a model that can reach the **Minimum Metric Score** objective.

### 1. Different Classifier Algorithms:
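
One possible way to compare several algorithms is to loop over a dictionary of candidates and score each with the same stratified cross-validation. The sketch below runs on synthetic data from `make_classification` with roughly an 84:16 split (an assumption, not the loan dataset), and the choice of candidate models is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly an 84:16 class split.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=0
)

candidates_demo = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Use the same stratified folds for every model so the scores are comparable.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_demo = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring="f1_weighted").mean()
    for name, model in candidates_demo.items()
}

for name, score in sorted(scores_demo.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {score:.3f}")
```

In the lab itself, the same loop could call `evaluate_a_model` on the real train and test sets instead of `cross_val_score`.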

In [None]:
# insert your code here


### 2. Cost-Sensitive Algorithms with Scaling and Stratified `train_test_split`:
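
Many scikit-learn classifiers accept a `class_weight` argument, which is one common way to make a model cost-sensitive. The sketch below is an illustration on synthetic data (an assumption, not the loan dataset): it prints the weights that `class_weight="balanced"` implies and compares minority-class recall with and without them. `LogisticRegression` is used only as an example model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in data with roughly an 84:16 class split.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rarer class receives the larger weight.
weights_demo = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_tr_demo
)
print("Balanced class weights [class 0, class 1]:", weights_demo)

plain_model = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)
weighted_model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr_demo, y_tr_demo)

recall_plain = recall_score(y_te_demo, plain_model.predict(X_te_demo), pos_label=1)
recall_weighted = recall_score(y_te_demo, weighted_model.predict(X_te_demo), pos_label=1)
print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, balanced:   {recall_weighted:.3f}")
```

Class weighting usually trades some majority-class precision for minority-class recall, so check the metric you chose in Q3 rather than accuracy alone.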

In [None]:
# insert your code here


### 3. Cost-Sensitive Algorithms with Scaling:

In [None]:
# insert your code here


### 4. Cost-Sensitive Algorithms without Scaling:

In [None]:
# insert your code here


### 5. Undersampling:
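
Random undersampling keeps every minority row and draws an equally sized random subset of each larger class. The following is a minimal NumPy-only sketch on synthetic 84:16 labels; the function name `random_undersample` is hypothetical, and the Lab Tutorial may use a different tool for the same step.

```python
import numpy as np


def random_undersample(X, y, random_state=0):
    """Return a class-balanced subset by sampling each class down to the minority size."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, pick n_min row indices without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the rows grouped by class
    return X[keep], y[keep]


# Synthetic stand-in labels: 840 of class 0 and 160 of class 1 (84:16).
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_under, y_under = random_undersample(X_demo, y_demo)
counts_under = np.bincount(y_under)
print("Class counts after undersampling:", counts_under)
```

The function expects NumPy arrays, so a DataFrame would need `.to_numpy()` first. Resample only the training split; the test set should keep the original class ratio.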

In [None]:
# insert your code here


### 6. Oversampling:
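
Random oversampling does the opposite: it duplicates minority rows (sampling with replacement) until the classes are the same size. The sketch below uses `sklearn.utils.resample` on synthetic 84:16 labels as one simple way to do this; it is an illustration under those assumptions, and the Lab Tutorial may use a different method.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in labels: 840 of class 0 and 160 of class 1 (84:16).
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_major, y_major = X_demo[y_demo == 0], y_demo[y_demo == 0]
X_minor, y_minor = X_demo[y_demo == 1], y_demo[y_demo == 1]

# Draw minority rows with replacement until they match the majority count.
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=len(y_major), random_state=0
)

X_over = np.vstack([X_major, X_minor_up])
y_over = np.concatenate([y_major, y_minor_up])
counts_over = np.bincount(y_over)
print("Class counts after oversampling:", counts_over)
```

Because the duplicated rows are exact copies, oversample only after splitting off the test set; otherwise copies of the same row can land in both train and test and inflate the score.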

In [None]:
# insert your code here


### 7. Curriculum Learning:

In [None]:
# insert your code here



### Q: Describe which model and technique / step led to the highest `Metric Score`.

Ans: _________


### Q: Include a Screenshot of its `Classification Report and Confusion Matrix`, including your chosen `Metric` and its `best Metric Score`:

Ans: _________

<p style="text-align:center;">That's it! Extra Congratulations!! <br> 
    Now, call an LA to check your solution. Then, upload your code on MyCourses.</p>