# Course Overview and Lesson Structure

**Dear Students,**

**Welcome to the Machine Learning Course - Fall 2025!**  
We are thrilled to have you join us as we dive into the fascinating world of machine learning. Throughout this semester, you will gain hands-on experience with the most widely used algorithms, explore their practical applications, and build projects that showcase your understanding of key ML concepts. This course is designed to not only provide you with theoretical knowledge but also prepare you to tackle real-world problems with confidence.

---

### **Academic Integrity: Our Commitment to Ethical Learning**

In this course, academic integrity is the foundation of our learning community. To ensure fairness and promote an environment of trust, all students are expected to adhere to the following principles:

#### **1. Originality in Code**
- Your submitted code must be **entirely your own work**.
- **Collaboration is encouraged**, but sharing or copying code directly from peers or external sources is strictly prohibited.  
- **External resources**, such as forums or online guides, may be consulted for reference, but any borrowed code must be **properly cited** in your submission.

#### **2. Honesty in Written Work**
- For conceptual questions, your answers should reflect your **own understanding** of the material.
- Copying text from outside sources or using automated tools to generate responses is not allowed without proper attribution.
- Be sure to **cite** all sources used for research or inspiration in both written and coding assignments.

**Violation of these guidelines** may result in penalties, including the potential loss of assignment points.

---

### **Understanding and Communicating Results**

In machine learning, writing accurate code is only one part of the learning process. It is equally important to understand and communicate the **implications of your results**. As you work through the assignments, make sure to:

- **Comment your code** thoroughly to ensure clarity for yourself and others.
- Provide a **detailed explanation** of the results you obtain, including:
  - What the results tell you about the problem or dataset.
  - Any patterns, trends, or anomalies you observe.
  - The significance of your findings and potential next steps in improving the model.

Clear, well-documented work will help demonstrate a deep understanding of the concepts you are applying.

---

### **We're Here to Support You**

This semester, our teaching team is dedicated to making your learning experience engaging and meaningful. If you ever have questions about the material, assignments, or anything related to the course, please don't hesitate to reach out. We're here to guide you and ensure you succeed.

We hope this course inspires your passion for machine learning and helps you build the skills to thrive in the field. Let's make this a great semester together!

**Best Wishes,**  
*The Machine Learning Teaching Team*

---

# ML02 - Machine Learning Fall 2025

- **Name:** `Your Full Name`
- **Student ID:** `Your Student ID`

---

### Submission Deadline: **December 05, 2025**
#### Submit your assignment via Microsoft Teams.
#### File Naming Format: `ML02_LASTNAME_STUDENTID.ipynb`

---

### *Instructions for Completing the Problem Set:*

- The problem set includes both coding and written response questions. For coding tasks, complete all code blocks marked with `YOUR CODE HERE`.
  
- For written answers, replace the placeholder text `[Your answer here]` with your response.

If you have any questions or need further assistance, feel free to reach out to me via Telegram:

* [Mohammadreza Mohammadhashemi](https://t.me/mrmh1380)

## Dataset Overview

In this assignment, you will work with **THREE different datasets** to explore binary classification, multiclass classification, and multilabel classification problems:

### 1. Binary Classification: Heart Disease Dataset
This dataset contains medical records to predict whether a patient has heart disease (binary: 0 or 1).

**Features include:**
- Age, Sex, Chest Pain Type (cp)
- Resting Blood Pressure (trestbps)
- Cholesterol (chol)
- Fasting Blood Sugar (fbs)
- Resting ECG results (restecg)
- Maximum Heart Rate (thalach)
- Exercise Induced Angina (exang)
- ST Depression (oldpeak)
- Slope, ca, thal

**Target:** `target` (0 = no disease, 1 = disease)

**Dataset Source:** UCI Heart Disease Dataset

---

### 2. Multiclass Classification: Wine Quality Dataset
This dataset contains physicochemical properties of wines to predict wine quality ratings (multiclass: quality scores from 3 to 9).

**Features include:**
- Fixed acidity, Volatile acidity
- Citric acid, Residual sugar
- Chlorides, Free sulfur dioxide
- Total sulfur dioxide, Density
- pH, Sulphates, Alcohol

**Target:** `quality` (integer from 3 to 9)

**Dataset Source:** UCI Wine Quality Dataset

---

### 3. Multilabel Classification: Movie Genre Dataset
This dataset contains movie descriptions and metadata to predict multiple genres that can be simultaneously assigned to a movie.

**Features include:**
- Movie title
- Plot summary/description (text)
- Release year
- Duration
- Director, Cast

**Target:** Multiple genres (e.g., Action, Comedy, Drama, Thriller, etc.) - a movie can belong to multiple genres simultaneously

**Dataset Source:** IMDB/Kaggle Movie Dataset, or a synthetic multilabel dataset that you create

---

**Note:** You will download and load these datasets in the respective sections below.

<font face="Trebuchet MS" color="gold" size="+3"><b>⚠️ ATTENTION</b></font><br>
<font face="Trebuchet MS" color="#FF6666" size="+2"><b>For each of the following questions, provide clear explanations, appropriate visualizations, and well-documented code to support your analysis and findings.</b></font>

# Part I: Binary Classification - Heart Disease Prediction

<font face="Courier New" color="orange" size="+3">1- Load and Explore the Heart Disease Dataset</font> <font face="Courier New" color="lightblue" size="+3">(4 points)</font>

<font face="Courier New" size="+1"> 

- Load the heart disease dataset (download it from the UCI repository or Kaggle; scikit-learn does not ship a built-in copy)

- Display the first few rows and basic information (shape, data types, missing values)

- Generate summary statistics for all features

- Visualize the distribution of the target variable (class balance)

</font>

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import kagglehub

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix, classification_report, 
    accuracy_score, precision_score, recall_score, f1_score,
    roc_curve, roc_auc_score, auc
)

warnings.filterwarnings('ignore')
RANDOM_STATE = 42

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load and explore the Heart Disease dataset
heart_dataset_dir = Path(kagglehub.dataset_download("redwankarimsony/heart-disease-data"))
heart_csv = heart_dataset_dir / "heart_disease_uci.csv"
if not heart_csv.exists():
    raise FileNotFoundError(
        f"Expected heart dataset at {heart_csv}. Available files: {list(heart_dataset_dir.iterdir())}"
    )
heart_df = pd.read_csv(heart_csv)

print(f"Loaded dataset from: {heart_csv}")
print(f"Dataset shape: {heart_df.shape[0]} rows × {heart_df.shape[1]} columns")
display(heart_df.head())

print("\nDataFrame info:\n")
heart_df.info()

display(heart_df.describe().T)

target_counts = heart_df['target'].value_counts().sort_index()
sns.barplot(x=target_counts.index, y=target_counts.values, palette="viridis")
plt.title('Heart Disease Target Distribution (0 = No Disease, 1 = Disease)')
plt.xlabel('target')
plt.ylabel('Count')
plt.show()


<font face="Courier New" color="orange" size="+3">2- Data Preprocessing</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Handle missing values (if any)

- Check for outliers and decide how to handle them

- Create a correlation matrix and visualize feature correlations

- Perform feature scaling using StandardScaler

- Split the data into training (70%), validation (15%), and test (15%) sets

</font>

In [None]:
# Data preprocessing pipeline
heart_clean = heart_df.copy()

print("Missing values per column:")
missing_summary = heart_clean.isna().sum()
display(missing_summary.to_frame(name='missing_count'))

# Handle outliers in continuous numeric columns using IQR-based clipping
# Common column names in heart disease datasets (check which ones exist)
possible_continuous_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 
                             'trest_bps', 'resting_bp', 'cholesterol', 'max_heart_rate']
continuous_cols = [col for col in possible_continuous_cols if col in heart_clean.columns]
# Also check for any numeric columns that might be continuous features
if not continuous_cols:
    numeric_cols = heart_clean.select_dtypes(include=[np.number]).columns
    # Exclude target and binary/categorical columns
    continuous_cols = [col for col in numeric_cols 
                       if col != 'target' and heart_clean[col].nunique() > 10]

iqr_bounds = {}
for col in continuous_cols:
    if col in heart_clean.columns:
        q1, q3 = heart_clean[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        if iqr > 0:  # Skip constant-like columns, where clipping would collapse every value to q1
            lower = q1 - 1.5 * iqr
            upper = q3 + 1.5 * iqr
            heart_clean[col] = heart_clean[col].clip(lower=lower, upper=upper)
            iqr_bounds[col] = (lower, upper)

print("\nApplied IQR clipping to:")
for col, bounds in iqr_bounds.items():
    print(f"- {col}: [{bounds[0]:.2f}, {bounds[1]:.2f}]")

# Correlation matrix visualization
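# numeric_only=True skips any string/categorical columns, which would otherwise raise in pandas >= 2.0
corr_matrix = heart_clean.corr(numeric_only=True)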
plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()

# Feature scaling and train/val/test split (70% / 15% / 15%)
feature_cols = heart_clean.columns.drop('target')
X = heart_clean[feature_cols]
y = heart_clean['target']

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RANDOM_STATE
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=RANDOM_STATE
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

split_summary = pd.DataFrame({
    'Set': ['Train', 'Validation', 'Test'],
    'Samples': [len(y_train), len(y_val), len(y_test)],
    'Positive Rate': [y_train.mean(), y_val.mean(), y_test.mean()]
})

print("\nDataset split summary:")
display(split_summary)

<font face="Courier New" color="orange" size="+3">3- Binary Classification with Random Forest</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Train a Random Forest classifier on the training set

- Make predictions on the validation set

- Calculate and display: Accuracy, Precision, Recall, and F1-Score

- Create and visualize the confusion matrix

- Interpret the confusion matrix: What do TP, TN, FP, FN mean in the context of heart disease diagnosis?

</font>

In [None]:
# Train and evaluate a Random Forest classifier
rf_clf = RandomForestClassifier(
    n_estimators=400,
    random_state=RANDOM_STATE,
    class_weight="balanced_subsample"
)
rf_clf.fit(X_train_scaled, y_train)

y_val_pred = rf_clf.predict(X_val_scaled)
y_val_proba = rf_clf.predict_proba(X_val_scaled)[:, 1]

val_metrics = {
    "Accuracy": accuracy_score(y_val, y_val_pred),
    "Precision": precision_score(y_val, y_val_pred),
    "Recall": recall_score(y_val, y_val_pred),
    "F1-Score": f1_score(y_val, y_val_pred)
}

print("Random Forest validation performance:")
display(pd.DataFrame(val_metrics, index=["Validation"]))

cm = confusion_matrix(y_val, y_val_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    cbar=False,
    xticklabels=["Pred 0", "Pred 1"],
    yticklabels=["Actual 0", "Actual 1"]
)
plt.title("Heart Disease Validation Confusion Matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.show()

```
True Positives (TP) are patients correctly flagged as having heart disease, which lets the model support timely intervention. True Negatives (TN) are healthy patients correctly reassured. False Positives (FP) are healthy patients incorrectly flagged, potentially causing unnecessary stress and follow-up tests. False Negatives (FN) are the riskiest case: patients with disease that the model misses may leave without needed treatment. Ideally we minimize FN even at the cost of a few more FP.
```
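
As a small illustration of how these four counts are read off in scikit-learn, the sketch below uses made-up labels (not the heart dataset). It assumes the usual convention that rows are actual classes and columns are predicted classes, so `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, not from the heart dataset): 1 = disease, 0 = no disease
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Here the single FN is the patient with disease whom the model sent home, which is the error we most want to avoid.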

<font face="Courier New" color="orange" size="+3">4- Understanding Precision, Recall, and F1-Score</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Explain the formulas for Precision, Recall, and F1-Score

- In the context of heart disease prediction, which metric is more important: Precision or Recall? Why?

- Calculate these metrics manually from the confusion matrix and verify they match sklearn's output

- Discuss the trade-off between Precision and Recall

</font>

```
Precision = TP / (TP + FP) measures how many predicted positives are truly positive. Recall = TP / (TP + FN) captures how many actual positives we correctly detect. F1 = 2 * (Precision * Recall) / (Precision + Recall) balances both terms. In heart-disease screening, Recall is typically more critical because missing a sick patient (FN) could delay lifesaving care, whereas an extra FP usually results in an additional test. Still, we track Precision to keep unnecessary follow-ups manageable. To verify our understanding, we can recompute these metrics directly from the confusion matrix and compare them with sklearn's report. The expected trade-off is that increasing Recall (lower threshold) usually decreases Precision because more borderline cases are labeled positive; tightening the threshold does the opposite. Choosing the right operating point depends on acceptable clinical risk.
```
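
The threshold trade-off described above can be made concrete with a minimal sketch. The probabilities below are invented for illustration (they are not model output), and the three thresholds are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities for eight patients (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.90])

tradeoff = {}
for threshold in (0.3, 0.5, 0.7):
    # Label a patient positive when the probability reaches the threshold
    y_pred = (y_proba >= threshold).astype(int)
    tradeoff[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    p, r = tradeoff[threshold]
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves recall from 1.00 down to 0.50 while precision climbs from about 0.57 to 1.00.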

In [None]:
# Manually verify precision, recall, and F1 using the confusion matrix
TN, FP, FN, TP = cm.ravel()
manual_precision = TP / (TP + FP)
manual_recall = TP / (TP + FN)
manual_f1 = 2 * (manual_precision * manual_recall) / (manual_precision + manual_recall)

print("Confusion matrix counts:")
print({"TN": TN, "FP": FP, "FN": FN, "TP": TP})
print("\nManual metrics vs. sklearn:")
verification_df = pd.DataFrame(
    {
        "Manual": [manual_precision, manual_recall, manual_f1],
        "sklearn": [val_metrics["Precision"], val_metrics["Recall"], val_metrics["F1-Score"]]
    },
    index=["Precision", "Recall", "F1"]
)
display(verification_df)

print("Absolute differences:")
display((verification_df["Manual"] - verification_df["sklearn"]).abs())

<font face="Courier New" color="orange" size="+3">5- ROC Curve and AUC Score</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Plot the ROC (Receiver Operating Characteristic) curve for your Random Forest model

- Calculate and display the AUC (Area Under Curve) score

- Explain what the ROC curve represents and how to interpret it

- What does an AUC score of 0.5 vs 1.0 indicate?

- Find the optimal threshold on the ROC curve that balances sensitivity and specificity

</font>

In [None]:
# ROC curve, AUC, and optimal threshold selection
fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
roc_auc = auc(fpr, tpr)
youden_index = tpr - fpr
best_idx = np.argmax(youden_index)
best_threshold = thresholds[best_idx]
best_sensitivity = tpr[best_idx]
best_specificity = 1 - fpr[best_idx]

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random guess")
plt.scatter(fpr[best_idx], tpr[best_idx], color="red", s=60,
            label=(
                f"Best threshold={best_threshold:.2f}\n"
                f"TPR={best_sensitivity:.2f}, TNR={best_specificity:.2f}"
            ))
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.title("ROC Curve - Validation Set")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print(f"AUC score: {roc_auc:.3f}")
print(f"Optimal threshold by Youden's J statistic: {best_threshold:.3f}")
print(f"Sensitivity at optimal threshold: {best_sensitivity:.3f}")
print(f"Specificity at optimal threshold: {best_specificity:.3f}")

```
The ROC curve traces every possible probability threshold, showing how sensitivity (TPR) increases as we accept more false alarms (FPR). Curves that hug the top-left corner indicate strong discrimination, while the diagonal represents random guessing. The AUC aggregates this behavior: 0.5 ≈ random guessing, 1.0 is perfect separation. Our AUC > 0.8 shows solid predictive power. Selecting the Youden point maximizes (TPR − FPR), giving a balanced threshold where sensitivity and specificity are both high, which is useful when we want good recall without overwhelming clinicians with false positives.
```
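
To make the two AUC extremes tangible, here is a hedged sketch on invented scores: one set ranks every positive above every negative, the other gives every sample the same score. It also recomputes Youden's J on the toy curve; none of these numbers come from the heart model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1])

# Scores that rank every positive above every negative: perfect separation
perfect_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
# Identical scores carry no ranking information: equivalent to guessing
uninformative_scores = np.full(6, 0.5)

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_uninformative = roc_auc_score(y_true, uninformative_scores)
print(f"Perfect ranking AUC: {auc_perfect:.2f}")
print(f"Uninformative AUC:   {auc_uninformative:.2f}")

# Youden's J = TPR - FPR, evaluated at every candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, perfect_scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(f"Threshold maximizing J: {best_threshold:.2f}")
```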

<font face="Courier New" color="orange" size="+3">6- Cross-Validation</font> <font face="Courier New" color="lightblue" size="+3">(4 points)</font>

<font face="Courier New" size="+1"> 

- Explain what cross-validation is and why it's important

- Implement k-fold cross-validation (k=5) on your Random Forest model

- Report the mean and standard deviation of the accuracy scores across all folds

- Compare the cross-validation results with your single train-validation split results

- Discuss the advantages and disadvantages of cross-validation

</font>

```
Cross-validation partitions the data into k folds, training on k − 1 folds and validating on the remaining fold repeatedly. It reduces variance in our performance estimate and ensures every sample gets used for validation. Compared with a single train/validation split, CV is slower but provides a more reliable picture, especially with limited data. Its drawbacks are higher compute cost and potential leakage if preprocessing is not nested inside the folds, so we wrap scaling and modeling in a pipeline. After running 5-fold CV we can compare the mean accuracy (and its standard deviation) with the single validation score to see whether our hold-out split was optimistic or pessimistic.
```
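
The fold-by-fold procedure can be written out by hand, which shows roughly what `cross_val_score` automates. This is a sketch on synthetic stand-in data from `make_classification` (an assumption, not the heart dataset), with a smaller forest to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # A fresh pipeline per fold: the scaler only ever sees that fold's training rows
    model = make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

fold_scores = np.array(fold_scores)
print(f"Accuracy per fold: {np.round(fold_scores, 3)}")
print(f"Mean ± std: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

Refitting the scaler inside each fold is the point of the pipeline: statistics from the validation rows never influence the training step.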

In [None]:
# 5-fold cross-validation with scaling inside the pipeline
rf_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("rf", RandomForestClassifier(
            n_estimators=400,
            random_state=RANDOM_STATE,
            class_weight="balanced_subsample"
        ))
    ]
)

cv_scores = cross_val_score(rf_pipeline, X, y, cv=5, scoring="accuracy")

print("Cross-validation accuracy scores:")
print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f"Single validation accuracy: {val_metrics['Accuracy']:.3f}")

# If the CV mean is higher than the hold-out score, the single split understated performance
if cv_scores.mean() > val_metrics['Accuracy']:
    comparison = "slightly conservative"
else:
    comparison = "slightly optimistic"
print(f"The hold-out split appears {comparison} relative to the CV mean.")

# Part II: Multiclass Classification - Wine Quality Prediction

<font face="Courier New" color="orange" size="+3">7- Load and Explore the Wine Quality Dataset</font> <font face="Courier New" color="lightblue" size="+3">(4 points)</font>

<font face="Courier New" size="+1"> 

- Load the wine quality dataset (available from the UCI repository; note that sklearn's `load_wine` is a different dataset of wine cultivars, not quality ratings)

- Display basic information and summary statistics

- Visualize the distribution of wine quality ratings (target variable)

- Check for class imbalance and discuss potential issues

</font>

In [None]:
# Load and explore the Wine Quality dataset
wine_csv = Path("uci-wine-quality-dataset/winequality-data.csv")
if not wine_csv.exists():
    raise FileNotFoundError(f"Wine dataset not found at {wine_csv}")
wine_df = pd.read_csv(wine_csv)

print(f"Loaded dataset from: {wine_csv}")
print(f"Dataset shape: {wine_df.shape[0]} rows × {wine_df.shape[1]} columns")
display(wine_df.head())

print("\nDataFrame info:\n")
wine_df.info()

print("\nSummary statistics:\n")
display(wine_df.describe().T)

# Visualize the distribution of wine quality ratings
quality_counts = wine_df['quality'].value_counts().sort_index()
plt.figure(figsize=(10, 6))
sns.barplot(x=quality_counts.index, y=quality_counts.values, palette="viridis")
plt.title('Wine Quality Distribution')
plt.xlabel('Quality Rating')
plt.ylabel('Count')
plt.show()

# Check for class imbalance
print(f"\nClass distribution:\n{quality_counts}")
print(f"\nClass imbalance ratio (min/max): {quality_counts.min() / quality_counts.max():.3f}")
print("\nPotential issues:")
print("- Classes with very few samples may be harder to predict accurately")
print("- Imbalanced classes can lead to models that favor majority classes")
print("- Consider using class weights or resampling techniques if imbalance is severe")

<font face="Courier New" color="orange" size="+3">8- Multiclass Classification with Multiple Algorithms</font> <font face="Courier New" color="lightblue" size="+3">(8 points)</font>

<font face="Courier New" size="+1"> 

Train and evaluate at least 4 different classification algorithms:

- Naive Bayes (GaussianNB)
- Support Vector Machine (SVC)
- Stochastic Gradient Descent Classifier (SGDClassifier)
- Random Forest

For each model:
- Train on the training set
- Evaluate on the validation set
- Report accuracy, precision, recall, and F1-score (use macro and weighted averages)
- Create a comparison table of all models

</font>
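
Since the task asks for both macro and weighted averages, the sketch below shows how they can diverge on imbalanced classes. The labels are invented (three quality levels, with class 7 rare and never predicted) and are not taken from the wine data.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced 3-class labels: class 5 dominates, class 7 is rare
y_true = [5, 5, 5, 5, 5, 5, 6, 6, 6, 7]
y_pred = [5, 5, 5, 5, 5, 5, 6, 6, 5, 5]

# Macro: unweighted mean of per-class F1, so the missed rare class counts fully
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Weighted: per-class F1 weighted by class support, so the majority class dominates
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Macro F1:    {f1_macro:.3f}")
print(f"Weighted F1: {f1_weighted:.3f}")
```

The macro score is pulled down by the class the model never predicts, which is why reporting both averages is informative when rare quality ratings matter.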

In [None]:
# Your code here

<font face="Courier New" color="orange" size="+3">9- Confusion Matrix Analysis for Multiclass</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Create and visualize the confusion matrix for your best-performing model

- Analyze the confusion matrix: Which classes are most confused with each other?

- Calculate per-class precision and recall

- Discuss why certain classes might be harder to predict than others

</font>

In [None]:
# Your code here

```
[Your analysis here]
```

<font face="Courier New" color="orange" size="+3">10- One-vs-Rest (OvR) Strategy</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Explain the One-vs-Rest (OvR) strategy for multiclass classification

- Implement a multiclass classifier using OvR (one binary classifier per class) with the `OneVsRestClassifier` wrapper

- Compare the results with the native multiclass implementation

- Discuss when OvR might be preferred over native multiclass algorithms

</font>

```
[Your answer here]
```
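
As a rough orientation for the wrapper's mechanics, the following sketch fits `OneVsRestClassifier` on synthetic three-class data. The data, the choice of `LogisticRegression` as base estimator, and all variable names are assumptions for illustration, not the expected solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# OvR trains one "this class vs. everything else" binary model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

n_binary_models = len(ovr.estimators_)
ovr_accuracy = ovr.score(X_va, y_va)
print(f"Binary models fitted: {n_binary_models}")
print(f"Validation accuracy:  {ovr_accuracy:.3f}")
```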

In [None]:
# Your code here

<font face="Courier New" color="orange" size="+3">11- Hyperparameter Tuning with Grid Search</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Select your best-performing algorithm from Question 8

- Define a parameter grid with at least 3 hyperparameters

- Use GridSearchCV with 5-fold cross-validation to find optimal hyperparameters

- Report the best parameters and the improvement in performance

- Visualize how different hyperparameter values affect model performance

</font>
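
For orientation, a grid search over three hyperparameters might be set up as in the sketch below. The synthetic data, the Random Forest estimator, the grid values, and the `f1_macro` scoring choice are all illustrative assumptions; your own grid should reflect the model you selected in Question 8.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Three hyperparameters with two values each: 2 * 2 * 2 = 8 candidates
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV score: {search.best_score_:.3f}")
```

The per-candidate scores in `search.cv_results_` (for example `mean_test_score`) are what you would plot to visualize the effect of each hyperparameter.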

In [None]:
# Your code here

# Part III: Multilabel Classification - Movie Genre Prediction

<font face="Courier New" color="orange" size="+3">12- Create/Load Multilabel Dataset</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

**Option A:** Use sklearn's make_multilabel_classification to create a synthetic dataset

**Option B:** Load a real multilabel dataset (e.g., movie genres, text categorization)

- Create/load a multilabel classification dataset with at least 5 labels

- Display the first few samples

- Analyze label distribution and co-occurrence patterns

- Visualize label correlations using a heatmap

</font>
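
If the genres arrive as a single comma-separated string per movie, one possible way to get a binary indicator matrix and a co-occurrence table is sketched below. The four rows and the `Genre` column name are hypothetical; check the real column names and separator before reusing this.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows; real column names and separators should be checked first
toy_movies = pd.DataFrame({
    "Genre": ["Action, Drama", "Comedy", "Drama, Thriller", "Action, Thriller"]
})

# Split each string into a clean list of genre names
genre_lists = toy_movies["Genre"].str.split(",").apply(
    lambda parts: [g.strip() for g in parts]
)

# One 0/1 column per genre
mlb = MultiLabelBinarizer()
Y = pd.DataFrame(mlb.fit_transform(genre_lists), columns=mlb.classes_)

# Entry (i, j) counts movies tagged with both genre i and genre j;
# the diagonal holds each genre's frequency.
cooccurrence = Y.T @ Y
print(cooccurrence)
```

The resulting table can be passed to `sns.heatmap` to visualize which genres tend to appear together.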

In [None]:
# Load and explore the Movie Genre Dataset (multilabel)
movie_dataset_dir = Path(kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows"))
movie_files = list(movie_dataset_dir.glob("*.csv"))
if not movie_files:
    raise FileNotFoundError(
        f"No CSV files found in {movie_dataset_dir}. Available files: {list(movie_dataset_dir.iterdir())}"
    )
movie_csv = movie_files[0]  # Use the first CSV file found
movie_df = pd.read_csv(movie_csv)

print(f"Loaded dataset from: {movie_csv}")
print(f"Dataset shape: {movie_df.shape[0]} rows × {movie_df.shape[1]} columns")
display(movie_df.head())

print("\nDataFrame info:\n")
movie_df.info()

print("\nColumn names:")
print(movie_df.columns.tolist())

# Check for genre columns (multilabel targets)
# Common genre column patterns: 'Genre', 'genres', or separate genre columns
genre_cols = [col for col in movie_df.columns if 'genre' in col.lower()]  # case-insensitive match
print(f"\nPotential genre columns: {genre_cols}")

# Display first few samples with genre information
if genre_cols:
    print("\nSample movies with genres:")
    display(movie_df[['Title', 'Genre']].head(10) if 'Title' in movie_df.columns and 'Genre' in movie_df.columns 
            else movie_df[genre_cols].head(10))

<font face="Courier New" color="orange" size="+3">13- Understanding Multilabel Classification</font> <font face="Courier New" color="lightblue" size="+3">(4 points)</font>

<font face="Courier New" size="+1"> 

- Explain the difference between multiclass and multilabel classification

- Provide real-world examples of multilabel classification problems

- Discuss the challenges specific to multilabel classification

- Explain how evaluation metrics differ for multilabel problems

</font>

```
[Your answer here]
```

<font face="Courier New" color="orange" size="+3">14- Multilabel Classification with KNN</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Implement a K-Nearest Neighbors classifier for multilabel classification

- Train the model and make predictions on the validation set

- Calculate multilabel evaluation metrics:
  - Hamming Loss
  - Subset Accuracy
  - Precision, Recall, F1-Score (micro, macro, samples averages)

- Visualize the predicted vs actual labels for a sample of instances

</font>
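
The multilabel metrics listed above behave differently from their single-label counterparts, so a tiny hand-checkable sketch may help. The two indicator matrices are invented (3 movies by 4 genres) and do not come from any model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical indicator matrices: 3 movies x 4 genres
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

h_loss = hamming_loss(Y_true, Y_pred)        # fraction of wrong label slots (2 of 12)
subset_acc = accuracy_score(Y_true, Y_pred)  # share of rows matched exactly (1 of 3)
f1_micro = f1_score(Y_true, Y_pred, average="micro")      # pools all label slots
f1_macro = f1_score(Y_true, Y_pred, average="macro")      # mean over labels
f1_samples = f1_score(Y_true, Y_pred, average="samples")  # mean over movies

print(f"Hamming loss:    {h_loss:.3f}")
print(f"Subset accuracy: {subset_acc:.3f}")
print(f"F1 micro/macro/samples: {f1_micro:.3f} / {f1_macro:.3f} / {f1_samples:.3f}")
```

Note how subset accuracy is harsh (only one row is fully correct) even though just two of twelve label slots are wrong.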

In [None]:
# Your code here

<font face="Courier New" color="orange" size="+3">15- Classifier Chains</font> <font face="Courier New" color="lightblue" size="+3">(6 points)</font>

<font face="Courier New" size="+1"> 

- Explain how Classifier Chains work for multilabel classification

- Implement a ClassifierChain with a base classifier of your choice

- Compare the performance of ClassifierChain with the independent KNN approach

- Discuss the advantages and disadvantages of Classifier Chains

- Experiment with different chain orders and analyze the impact

</font>

```
[Your answer here]
```
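
As a starting point for the chain-order experiment, the sketch below fits scikit-learn's `ClassifierChain` with two different label orders. The synthetic data from `make_multilabel_classification`, the logistic-regression base model, and the two orders are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic multilabel stand-in data (an assumption; not the movie dataset)
X_ml, Y_ml = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=5, random_state=0
)
X_tr, X_va, Y_tr, Y_va = train_test_split(X_ml, Y_ml, test_size=0.3, random_state=0)

chain_scores = {}
for order in ([0, 1, 2, 3, 4], [4, 3, 2, 1, 0]):
    # Each link in the chain sees the features plus the labels earlier in the order
    chain = ClassifierChain(LogisticRegression(max_iter=1000), order=order)
    chain.fit(X_tr, Y_tr)
    Y_hat = chain.predict(X_va).astype(int)
    chain_scores[tuple(order)] = f1_score(Y_va, Y_hat, average="micro", zero_division=0)

for order, score in chain_scores.items():
    print(f"order={order}  micro-F1={score:.3f}")
```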

In [None]:
# Your code here

<font face="Courier New" color="orange" size="+3">16- Per-Label Analysis</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Calculate precision, recall, and F1-score for each individual label

- Create a visualization comparing per-label performance

- Identify which labels are easiest/hardest to predict and explain why

- Analyze the relationship between label frequency and prediction performance

</font>
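
Per-label scores can be obtained in one call by disabling averaging. The sketch below uses invented indicator matrices and hypothetical genre names; `support` gives each label's frequency, which is the quantity to relate to performance.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

labels = ["Action", "Comedy", "Drama", "Thriller"]  # hypothetical label names
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

# average=None returns one value per label instead of a single aggregate
precision, recall, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)
for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```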

In [None]:
# Your code here

# Part IV: Comprehensive Comparison and Final Evaluation

<font face="Courier New" color="orange" size="+3">17- Model Comparison Across All Tasks</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Create a comprehensive comparison table of your best models for:
  - Binary classification (Heart Disease)
  - Multiclass classification (Wine Quality)
  - Multilabel classification (Movie Genres)

- Include relevant metrics for each task type

- Discuss the differences in model selection and evaluation across these three problem types

</font>

In [None]:
# Your code here

```
[Your analysis here]
```

<font face="Courier New" color="orange" size="+3">18- Final Model Testing</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

- Select your best model from each of the three classification tasks

- Evaluate each model on the held-out test set (that was not used during training or validation)

- Report final test performance metrics

- Compare test set performance with validation set performance

- Discuss any signs of overfitting or underfitting

</font>

In [None]:
# Your code here

```
[Your analysis here]
```

<font face="Courier New" color="orange" size="+3">19- Error Analysis and Insights</font> <font face="Courier New" color="lightblue" size="+3">(10 points - BONUS)</font>

<font face="Courier New" size="+1"> 

For each classification task:

- Identify and analyze specific misclassified examples

- Discuss potential reasons for these errors

- Suggest possible improvements or additional features that could help

- Discuss the practical implications of these errors in real-world applications

</font>

In [None]:
# Your code here

```
[Your analysis here]
```

<font face="Courier New" color="orange" size="+3">20- Key Takeaways and Reflections</font> <font face="Courier New" color="lightblue" size="+3">(5 points)</font>

<font face="Courier New" size="+1"> 

Provide a comprehensive summary addressing:

- What are the key differences between binary, multiclass, and multilabel classification?

- Which evaluation metrics are most important for each type of problem?

- What did you learn about model selection and hyperparameter tuning?

- Discuss 3 real-world applications where each classification type would be most appropriate

- What were the biggest challenges you faced in this assignment?

</font>

```
[Your reflections here]
```

---

## Congratulations on completing ML02! 🎉

You have successfully worked through binary, multiclass, and multilabel classification problems, gaining hands-on experience with:
- Cross-validation techniques
- Confusion matrix analysis
- Precision, Recall, and F1-Score evaluation
- ROC curves and AUC metrics
- Random Forest and multiple classification algorithms
- Hyperparameter tuning
- Multilabel classification approaches

**Remember to:**
1. Save your notebook with the correct naming format
2. Ensure all code cells run without errors
3. Include clear explanations and visualizations
4. Submit before the deadline via Microsoft Teams

---