## ***Supervised Learning Models***

## **Introduction**

##### This is Day 3 of the AI and ML Learning Plan. Today's theme is **Supervised Learning Models (Regression and Classification)**.

### Agenda

- Train regression and classification models using scikit-learn
- Correctly split data into train / validation / test sets
- Evaluate models using appropriate metrics
- Compare multiple models objectively

## **Supervised Learning Fundamentals & Data Splitting**

### **Objectives**

- Understand supervised learning workflow
- Avoid data leakage
- Establish correct experiment structure

### **Hands-on Tasks**

#### Load dataset (reuse Titanic dataset)

In [59]:
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('../Day1/datasets/titanic/train.csv')

# Quick verification of the loaded dataset
print(dataset.head())
print(dataset.shape)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
(8

#### Separate features and target

In [60]:
# For titanic, the target is 'Survived' column
target = "Survived"

# Select features

features = [
    "Pclass",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
]

# Store PassengerIDs for later use
passenger_ids = dataset["PassengerId"]

# Key Concept reinforcement: X = input features, y = ground truth labels
X = dataset[features].copy()
y = dataset[target]

#### Perform 70 / 15 / 15 split using `train_test_split`

In [61]:
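# Minimal preprocessing
# Note: the median is computed on the full dataset before splitting, which
# leaks a little validation/test information into training. A stricter
# version would compute it on X_train only, after the split.
X["Age"] = X["Age"].fillna(X["Age"].median())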

# Encode categorical features
X = pd.get_dummies(X, columns=["Sex"], drop_first=True)

# Two stage splitting to create training, validation, and testing sets
# Split into 70% training and 30% for second split (testing and validation)
X_train, X_temp, y_train, y_temp, pid_train, pid_temp = train_test_split(
    X,
    y,
    passenger_ids,
    test_size=0.3,
    random_state=42,
    stratify=y
)

# Split the 30% temp set into 15% validation and 15% testing
X_val, X_test, y_val, y_test, pid_val, pid_test = train_test_split(
    X_temp,
    y_temp,
    pid_temp,
    test_size=0.5,
    random_state=42,
    stratify=y_temp
)

#### Verify data shapes

In [62]:
# Verify the splits: 70% train, 15% val, 15% test
print("Train set:", X_train.shape, y_train.shape)
print("Validation set:", X_val.shape, y_val.shape)
print("Test set:", X_test.shape, y_test.shape)

Train set: (623, 6) (623,)
Validation set: (134, 6) (134,)
Test set: (134, 6) (134,)


### **Observations**

- The dataset was loaded with `pd.read_csv()` and verified with `.head()` and `.shape`.
- The selected Titanic features are Pclass, Sex, Age, SibSp, Parch and Fare. All are numerical except Sex, which is categorical and is one-hot encoded with `pd.get_dummies()`. These features are used to predict the target, *"Survived"*.
- Minimal preprocessing was required before splitting because `Age` contains missing values, which were filled with the median.
- `.copy()` was used when assigning the feature columns to ***X*** so that X is an independent copy and later changes do not affect `dataset`.
- The dataset was split in two stages into three parts: 70% training, 15% validation and 15% testing.
    - First stage: the dataset is split into 70% training and a 30% temporary set.
    - Second stage: the 30% temporary set is split in half, giving 15% validation and 15% testing.
- Both stages use `stratify`, so each split keeps roughly the same proportion of survivors.
- Data shapes were printed to verify the split: 623 / 134 / 134 rows, each with 6 feature columns.
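
One way to keep the imputation step free of leakage is to learn the fill value from the training rows only and then reuse it on the held-out rows. The sketch below is illustrative and is not the notebook's code: the toy `Age` values and the names `toy`, `X_tr`, `X_te` are assumptions made up for the example, and `SimpleImputer` is swapped in for the `fillna` call used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values) standing in for the Titanic "Age" column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 40.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Split first, so the held-out rows play no part in preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["Age"]],
    toy["Survived"],
    test_size=0.3,
    random_state=0,
    stratify=toy["Survived"],
)

# Learn the median from the training rows only
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)

# Reuse that same training median on the held-out rows
X_te_filled = imputer.transform(X_te)

print("Median learned from training rows:", imputer.statistics_[0])
```

The same idea extends to a three-way split: fit the imputer on the training set, then call `transform` on the validation and test sets.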

## **Regression Models (Baseline & Linear Models)**

### **Objectives**

- Build baseline regression models
- Understand model assumptions
- Evaluate performance quantitatively

### **Hands-on Tasks**

#### Train baseline model (Logistic Regression, since the target is binary)

In [63]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    solver="lbfgs"
)

# Model training phase
model.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


#### Predict on validation set

In [70]:
# Prediction phase on validation set
y_val_pred = model.predict(X_val)

print("Validation Predictions:", y_val_pred)

# Display validation predictions alongside PassengerId for reference
val_results = pd.DataFrame({
    "PassengerId": pid_val,
    "Actual": y_val,
    "Predicted": y_val_pred
})

# Display first 10 results
print(val_results.head(10))

Validation Predictions: [1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1
 0 0 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0
 1 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0
 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1]
     PassengerId  Actual  Predicted
377          378       0          1
244          245       0          0
72            73       0          0
815          816       0          0
841          842       0          0
23            24       1          1
27            28       0          1
239          240       0          0
386          387       0          0
347          348       1          1


#### Compute RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error)

In [72]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print("Validation RMSE:", rmse)

mae = mean_absolute_error(y_val, y_val_pred)
print("Validation MAE:", mae)

Validation RMSE: 0.3863337046431279
Validation MAE: 0.14925373134328357


#### Interpret coefficients

Before interpreting the results, we first define the two computed metrics:
- **RMSE**
    - penalizes **large errors more strongly** than **MAE** because errors are squared before averaging
    - always satisfies **RMSE** ≥ **MAE**
    - **RMSE** closer to *0* indicates better performance
    - with 0/1 labels every error has size 1, so **RMSE** is just the square root of the misclassification rate; it says nothing about the model's confidence
- **MAE**
    - measures the **average absolute difference** between prediction and ground truth
    - *0* means perfect predictions; for a binary target, a value around *0.5* is roughly as bad as random guessing
    - with 0/1 labels it equals the **misclassification rate** (1 - accuracy)

In our example, the computed results were: **RMSE** = ***0.3863*** and **MAE** = ***0.1493***.
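
The two numbers are tied together when the target is 0/1. As a small check, and assuming 20 of the 134 validation predictions are wrong (the count implied by MAE × 134), the sketch below rebuilds both metrics from the error count alone. The arrays are synthetic placeholders, not the notebook's `y_val` and `y_val_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

n_samples = 134  # size of the validation set
n_errors = 20    # assumed number of wrong predictions (MAE x 134)

# Synthetic labels: which rows are wrong does not matter, only how many
y_true = np.zeros(n_samples, dtype=int)
y_pred = y_true.copy()
y_pred[:n_errors] = 1  # flip the first 20 predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("MAE  (= error rate):      ", round(mae, 4))
print("RMSE (= sqrt(error rate)):", round(rmse, 4))
print("Accuracy (= 1 - MAE):     ", round(1 - mae, 4))
```

Both values agree with the reported 0.1493 and 0.3863 to four decimals, which shows that for hard class labels RMSE carries no information beyond MAE.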

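These are reasonable baseline results rather than exceptional ones. Here are the observations:
- **MAE** of *0.1493* means about 15% of the validation passengers (20 of 134) are misclassified, so accuracy is about 85%.
- **RMSE** of *0.3863* is the square root of that same error rate, so it does not measure how confident the model is.
- The misclassification rate is low, so most predicted labels are correct. Because both metrics are computed on hard class labels, classification metrics such as precision and recall are more informative for this target.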
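
The section title mentions coefficients, which the cells above do not print. A minimal sketch of how logistic regression coefficients can be read is shown below. It uses synthetic data with two made-up features (`helps`, `hurts`), so the numbers say nothing about Titanic. On the notebook's own model, the analogous call would be `pd.Series(model.coef_[0], index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): one feature raises the odds, one lowers them
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "helps": rng.normal(size=500),
    "hurts": rng.normal(size=500),
})
signal = 2.0 * X_demo["helps"] - 2.0 * X_demo["hurts"]
y_demo = (signal + rng.normal(size=500) > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target.
# exp(coefficient) is the multiplicative change in the odds per unit increase.
coef_table = pd.DataFrame({
    "feature": X_demo.columns,
    "coefficient": demo_model.coef_[0],
    "odds_ratio": np.exp(demo_model.coef_[0]),
}).sort_values("coefficient", ascending=False)

print(coef_table)
```

A positive coefficient (odds ratio above 1) pushes predictions toward class 1, a negative one toward class 0. Magnitudes are only comparable across features when the features are on similar scales.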

### **Observations**

- The baseline model (Logistic Regression) is trained on the feature values in `X_train` and the labels in `y_train`.
- The validation features in `X_val` are passed to the trained model to predict the target, and the predictions are compared against `y_val`.
- PassengerId was kept separately so that actual vs. predicted values can be displayed per passenger.
- Computed **RMSE** and **MAE** to indicate model performance.
- Based on the computed results, most predictions are correct, but not all of them.
- Because the predictions are 0/1 labels, every error has the same size, so **RMSE** and **MAE** carry the same information here.

## **Classification Models**

### **Objectives**

- Understand classification decision boundaries
- Train multiple classifiers
- Evaluate classification performance correctly

### **Hands-on Tasks**

#### Train Logistic Regression classifier

In [87]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_val)
y_pred_lr_test = log_reg.predict(X_test)

#### Train kNN classifier

In [88]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_val)
y_pred_knn_test = knn.predict(X_test)

#### Generate confusion matrices

In [None]:
# Logistic Regression and kNN Predictions on Validation Set
lr_confmat_val = confusion_matrix(y_val, y_pred_lr)
lr_class_report_val = classification_report(y_val, y_pred_lr)
knn_confmat_val = confusion_matrix(y_val, y_pred_knn)
knn_class_report_val = classification_report(y_val, y_pred_knn)

# Display confusion matrices and classification reports
print("Logistic Regression (Validation Set):")
print("Confusion Matrix:") 
print(lr_confmat_val)
print("Classification Report:")
print(lr_class_report_val)
print("\nK-Nearest Neighbors (Validation Set):")
print("Confusion Matrix:")
print(knn_confmat_val)
print("Classification Report:")
print(knn_class_report_val)

Logistic Regression (Validation Set):
Confusion Matrix:
[[73  9]
 [11 41]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88        82
           1       0.82      0.79      0.80        52

    accuracy                           0.85       134
   macro avg       0.84      0.84      0.84       134
weighted avg       0.85      0.85      0.85       134


K-Nearest Neighbors (Validation Set):
Confusion Matrix:
[[64 18]
 [24 28]]
Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.78      0.75        82
           1       0.61      0.54      0.57        52

    accuracy                           0.69       134
   macro avg       0.67      0.66      0.66       134
weighted avg       0.68      0.69      0.68       134


In [None]:
# Logistic Regression and kNN Predictions on Testing Set
lr_confmat_test = confusion_matrix(y_test, y_pred_lr_test)
knn_confmat_test = confusion_matrix(y_test, y_pred_knn_test)
lr_class_report_test = classification_report(y_test, y_pred_lr_test)
knn_class_report_test = classification_report(y_test, y_pred_knn_test)

#Display confusion matrices and classification reports for testing set
print("\nLogistic Regression (Testing Set):")
print("Confusion Matrix:")
print(lr_confmat_test)
print("Classification Report:")
print(lr_class_report_test)
print("\nK-Nearest Neighbors (Testing Set):")
print("Confusion Matrix:")
print(knn_confmat_test)
print("Classification Report:")
print(knn_class_report_test)


Logistic Regression (Testing Set):
Confusion Matrix:
[[65 18]
 [16 35]]
Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.78      0.79        83
           1       0.66      0.69      0.67        51

    accuracy                           0.75       134
   macro avg       0.73      0.73      0.73       134
weighted avg       0.75      0.75      0.75       134


K-Nearest Neighbors (Testing Set):
Confusion Matrix:
[[63 20]
 [22 29]]
Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.76      0.75        83
           1       0.59      0.57      0.58        51

    accuracy                           0.69       134
   macro avg       0.67      0.66      0.67       134
weighted avg       0.68      0.69      0.69       134



#### Compare metrics

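- In the validation confusion matrices (rows = actual class, columns = predicted class, positive class = 1, *Survived*):
    - TP : Logistic Regression is higher (41 vs 28)
    - FN : kNN is higher (24 vs 11)
    - FP : kNN is higher (18 vs 9)
    - TN : Logistic Regression is higher (73 vs 64)

- For precision, recall, f1-score and accuracy:
    - For both Class 0 and Class 1 : Logistic Regression is higher, on the validation set and on the test set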
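
For labels `[0, 1]`, scikit-learn lays out the binary confusion matrix as `[[TN, FP], [FN, TP]]`, which is easy to misread. The sketch below unpacks the Logistic Regression validation matrix printed above and recomputes the class 1 metrics by hand. The matrix values are copied from the output, and the variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix on the validation set (from the output above)
cm = np.array([[73, 9],
               [11, 41]])

# Rows are actual classes, columns are predicted classes, so ravel() yields
# TN, FP, FN, TP in that order when class 1 is the positive class
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)        # of actual survivors, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision(1)={precision_1:.2f} recall(1)={recall_1:.2f} accuracy={accuracy:.2f}")
```

The hand-computed values round to 0.82, 0.79 and 0.85, matching the class 1 row and the accuracy line of the classification report.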

### **Observations**

- Both classification models follow the same ML pipeline:
    1. Train on `X_train` and `y_train` using the `.fit()` method.
    2. Predict on `X_val` (and `X_test`) using the `.predict()` method, then compare the predictions with `y_val` (and `y_test`).

## **Model Comparison & Experiment Tracking**

### **Objectives**

- Compare models objectively
- Understand why one model outperforms another
- Document experiments clearly

### **Hands-on Tasks**

#### Create comparison table (metrics)

Validation set, with class 1 (*Survived*) as the positive class:

| Classification Models | TP | FP | FN | TN | Precision (macro) | Recall (macro) | F1-score (macro) | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 41 | 9 | 11 | 73 | 0.84 | 0.84 | 0.84 | 0.85 |
| k-Nearest Neighbors | 28 | 18 | 24 | 64 | 0.67 | 0.66 | 0.66 | 0.69 |

Which model balances bias and variance better?
- Logistic Regression balances bias and variance better because it outperforms k-Nearest Neighbors on both the validation and test sets (accuracy 0.85 and 0.75 vs 0.69 and 0.69).
- When comparing the two models:
    - Logistic Regression: moderate bias, moderate variance (accuracy drops from 0.85 on validation to 0.75 on test).
    - k-Nearest Neighbors: apparently high bias, low variance (accuracy is low, 0.69, on both validation and test). Part of this may come from the features not being scaled, since kNN relies on distances.

Any performance patterns across metrics?
- Logistic Regression:
    - Class imbalance effect: Class 1, the minority class (52 of 134 validation rows), has lower precision and recall than class 0.
    - Validation → test drop: Accuracy falls from 0.85 to 0.75, a noticeable decrease that indicates some variance. With only 134 rows per split, part of the drop may be sampling noise.
    - Balanced performance: Model is decent on both classes, though slightly biased toward class 0.

- k-Nearest Neighbor:
    - High bias: Consistently low performance on both validation and test.
    - Class 1 performance weak: Precision, recall, and F1 are much lower than for class 0.
    - No variance issues: Accuracy is 0.69 on both sets and the other metrics barely change.
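
One possible contributor to kNN's weak scores, not verified in this notebook, is that the features were left unscaled, so a wide-ranging column such as `Fare` can dominate the distance computation. The sketch below illustrates the effect on synthetic data where a small-scale feature carries the signal and a large-scale feature is pure noise. The data and all names are assumptions, and whether scaling helps on Titanic would need to be checked on the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): signal in a small-scale column, noise in a large-scale one
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
informative = rng.normal(loc=4.0 * labels, scale=1.0)  # class means 0 and 4
noisy = rng.normal(loc=0.0, scale=1000.0, size=n)      # unrelated to the label
X_syn = np.column_stack([informative, noisy])

X_a, X_b, y_a, y_b = train_test_split(
    X_syn, labels, test_size=0.25, random_state=0, stratify=labels
)

# kNN on raw features: distances are dominated by the noisy column
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_a, y_a)

# kNN after standardization: the scaler is fit on the training rows only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_a, y_a)

acc_raw = raw_knn.score(X_b, y_b)
acc_scaled = scaled_knn.score(X_b, y_b)
print(f"kNN accuracy without scaling: {acc_raw:.2f}")
print(f"kNN accuracy with scaling:    {acc_scaled:.2f}")
```

Wrapping the scaler and the classifier in one pipeline also keeps the scaling statistics from leaking out of the training set.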

#### Select best model based on validation performance

Based on validation performance, the Logistic Regression model is suggested as the best model because of the following:

- Higher accuracy compared to kNN (0.85 vs 0.69)
- Higher precision for both classes, so its positive predictions are more reliable
- Higher recall for both classes, so it misses fewer actual survivors and non-survivors
- Higher F1-score, meaning a better overall balance between precision and recall
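
The selection step can also be made mechanical: fit every candidate on the same training split, score each on the same validation split, and collect the results in one table. The sketch below assumes synthetic data from `make_classification` in place of the Titanic splits, and the names `candidates` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train/validation split (assumption)
X_syn, y_syn = make_classification(n_samples=600, n_features=6, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

rows = []
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)         # same training data for every model
    pred = estimator.predict(X_va)    # same validation data for every model
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_va, pred),
        "f1_macro": f1_score(y_va, pred, average="macro"),
    })

# Rank by macro F1 on the validation set; the test set is not touched here
results = pd.DataFrame(rows).sort_values("f1_macro", ascending=False)
best_model_name = results.iloc[0]["model"]

print(results.to_string(index=False))
print("Selected on validation:", best_model_name)
```

Keeping the test set out of this loop means its score remains an unbiased final check on the selected model.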

#### Write conclusions in markdown

***Best Model Choice***
Based on validation performance, the **best model** is **Logistic Regression**.  
- Achieves higher **accuracy** (0.85 vs 0.69) and macro **F1-score** (0.84 vs 0.66) compared to k-Nearest Neighbors (kNN).  
- Shows better **precision and recall** for both classes, particularly for the minority class (class 1).  
- Provides a more **balanced prediction** across classes.

***Bias-Variance Analysis***
- **Logistic Regression**
  - **Bias:** Moderate. Slight underfitting is observed for class 1.  
  - **Variance:** Moderate. Accuracy drops from 0.85 on the validation set to 0.75 on the test set.  
  - **Interpretation:** Generalizes reasonably well with fairly balanced errors.  

- **k-Nearest Neighbors (kNN)**
  - **Bias:** High. Accuracy and F1-score are consistently low on both validation and test sets.  
  - **Variance:** Low. The difference between validation and test metrics is minimal.  
  - **Interpretation:** Appears to underfit the data, failing to capture patterns, especially for class 1. Unscaled features may contribute, since kNN relies on distances.  

***Observed Tradeoffs***
- **Logistic Regression:** Performs better on both classes but loses about 10 points of accuracy between validation and test, so its validation score is somewhat optimistic.  
- **kNN:** Stable but weak. Its scores barely change between validation and test, yet they are consistently lower than those of Logistic Regression.  
- **Class-level tradeoff:** Both models struggle more with class 1, indicating possible class imbalance or features that do not fully separate the classes.

## **Interview Readiness Tip (Important)**

##### You should now be able to confidently answer

- Why can accuracy alone be misleading?
    - Accuracy alone can be misleading, particularly on imbalanced datasets, because it does not show how well the model predicts each class. A model that always predicts the majority class can score high accuracy while never detecting the minority class. Metrics such as precision, recall, and F1-score provide a more complete picture of class-specific performance and of the model's suitability for a given use case.

- When to use RMSE vs MAE?
    - RMSE and MAE are both used to evaluate regression models. RMSE squares errors before averaging, so it penalizes large errors more strongly and is useful when large errors are particularly undesirable. MAE weights each error in proportion to its size and is more robust to outliers. The choice depends on whether you want to emphasize large errors (RMSE) or weight errors linearly (MAE).

- How do train/validation/test splits prevent leakage?
    - Train/validation/test splits prevent data leakage by ensuring that the model is trained, tuned, and evaluated on separate data. The training set is used to fit the model, the validation set is used to tune hyperparameters and select models, and the test set evaluates performance on completely unseen data. This separation keeps information from the evaluation data out of training and tuning, so the test score reflects true generalization ability. Preprocessing statistics, such as an imputation median, should also be learned from the training set only.

- How to compare models fairly?
    - To compare models fairly, the same train/validation/test splits and evaluation metrics should be used for all models. For classification, metrics like accuracy, precision, recall, F1-score, and confusion matrices are commonly used. Model selection should rely on the validation set, with the test set used once at the end to report final performance and to check how well the selected model generalizes.
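
The RMSE vs MAE answer can be made concrete with a small worked example on a true regression target. The sketch below uses made-up numbers: two sets of predictions have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up regression targets and two hypothetical sets of predictions
y_true = np.array([10.0, 10.0, 10.0, 10.0])
pred_even = np.array([11.0, 11.0, 11.0, 11.0])   # four errors of size 1
pred_spiky = np.array([10.0, 10.0, 10.0, 14.0])  # one error of size 4

scores = {}
for name, pred in [("even", pred_even), ("spiky", pred_spiky)]:
    mae = mean_absolute_error(y_true, pred)
    rmse = np.sqrt(mean_squared_error(y_true, pred))
    scores[name] = (mae, rmse)
    print(f"{name:>5}: MAE={mae:.2f} RMSE={rmse:.2f}")
```

Both prediction sets have an MAE of 1.00, but the RMSE is 1.00 for the even errors and 2.00 for the single large miss. That is why RMSE is preferred when large errors are especially costly, and MAE when outliers should not dominate the score.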

## **Day 3 Deliverables Checklist**

* ✅ One Jupyter Notebook: **`Day3_Supervised_Learning_Models.ipynb`**
* ✅ At least:
    - 1 regression model
    - 2 classification models
* ✅ Evaluation metrics clearly presented
* ✅ Written comparison and justification