# Prediction of Liver Disease [117 points total]

**Liver Disease (LD)** is a serious condition in which the liver’s ability to detoxify, metabolize, and synthesize proteins is impaired. In this pre-lab you will use clinical features to **predict the presence of liver disease**. You will build two models:

- **K-Nearest Neighbors (K-NN)** with feature scaling  
- **Random Forest** (no scaling needed)

> **Provided:** The dataset is already loaded for you as a pandas DataFrame named `df`.  
> **Target column:** `disease` (binary: `1` = liver disease, `0` = healthy).  
> **Features:** numeric only. Convert `Gender_Male` to 0/1.

In [None]:
from pathlib import Path
import pandas as pd

FILENAME   = "ilpd_clean.csv"
LOCAL_DIR  = None  # e.g. r"/Users/you/data/ilpd_release" (leave as None to use current folder)

try:
    # --- Colab case ---
    import google.colab  # type: ignore
    from google.colab import drive  # type: ignore

    drive.mount("/content/drive", force_remount=False)
    base = Path("/content/drive/MyDrive")
    csv_path = (base / "ilpd_release" / FILENAME)
    if not csv_path.exists():
        csv_path = base / FILENAME  # fall back to MyDrive root
except ImportError:
    # --- Local case ---
    if LOCAL_DIR:
        base = Path(LOCAL_DIR).expanduser()
        # LOCAL_DIR may be a folder or a direct path to the CSV file
        csv_path = base / FILENAME if base.is_dir() else base
    else:
        csv_path = Path.cwd() / FILENAME  # assume file is next to the notebook

# Final load
if not csv_path.exists():
    raise FileNotFoundError(f"Couldn't find {FILENAME} at: {csv_path}\n"
                            f"Put the file next to the notebook, or set LOCAL_DIR to its folder.")
df = pd.read_csv(csv_path)
print("Shape:", df.shape)

df.head(10)


## Section 1 — Set up the Features and Target *(10 points)*

**Goal.** Identify the target (`disease`) and define the feature set for modeling (numeric features only).

**What you should do (conceptually):**
- Verify the dataset’s basic structure (shape, first few rows, data types).
- Confirm the target column is `disease` and encoded as **0/1** integers.
- Define the feature columns as **all non-target** variables.
- If a boolean indicator like `Gender_Male` exists, ensure it’s represented as **0/1** (without altering non-boolean values).
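The steps above can be sketched on a small made-up frame. This is only an illustration under stated assumptions: `toy` stands in for the real `df`, and the `Age` and `Total_Bilirubin` column names and all values are invented for the example.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

# Toy stand-in for `df`; values and the Age/Total_Bilirubin names are made up.
toy = pd.DataFrame({
    "Age": [65, 62, 58, 72],
    "Total_Bilirubin": [0.7, 10.9, 1.0, 3.9],
    "Gender_Male": [False, True, True, True],
    "disease": [1, 1, 0, 1],
})

TARGET = "disease"

# Basic structure checks: shape, dtypes, missing values.
print(toy.shape)
print(toy.dtypes)
print("Missing values:", int(toy.isna().sum().sum()))

# The target should be 0/1 integers.
toy[TARGET] = toy[TARGET].astype(int)
assert set(toy[TARGET].unique()) <= {0, 1}

# Convert only boolean columns to 0/1; numeric columns are left untouched.
bool_cols = [c for c in toy.columns if is_bool_dtype(toy[c])]
for col in bool_cols:
    toy[col] = toy[col].astype(int)

# Features are all non-target columns.
feature_cols = [c for c in toy.columns if c != TARGET]
print(feature_cols)
```

Filtering on `is_bool_dtype` is one way to make sure that only true boolean indicators are re-encoded.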

**Checklist for completion:**
- Target column correctly identified and encoded as 0/1.
- Feature set excludes the target and contains all other variables, including gender.
- Any boolean indicators are consistently represented as 0/1.
- Brief note on whether any missing values are present.

**Grading (10 points):**
- Target identification & encoding (0–3)
- Proper feature selection (target excluded) (0–4)
- Boolean handling for Gender feature (0–2)
- Brief data sanity checks (shape/dtypes/any missing values) (0–1)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from pandas.api.types import is_bool_dtype


# ML
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, accuracy_score, f1_score, classification_report,
    ConfusionMatrixDisplay, precision_recall_curve, average_precision_score, RocCurveDisplay
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# ----------------------------
# 1) Set up the features and target
# ----------------------------


## Section 2 — Train / Test Split *(10 points)*

**Goal.** Create a stratified train/test split that preserves class proportions and keeps the test set untouched for final evaluation.

**What you should do (conceptually):**
- Define **X** as the selected numeric feature matrix and **y** as the `disease` label.
- Perform a **75/25 split** (25% test) **with stratification** on `y` to maintain class balance.
- Use a **fixed random seed** for reproducibility (42).
- Keep the **test set strictly untouched** until the final evaluation.
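A minimal sketch of such a split, assuming synthetic data in place of the real features: the frame `X`, its column names `f1` to `f3`, and the 140/60 label counts are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows with a 70/30 class mix (made-up data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 140 + [0] * 60, name="disease")

# 75/25 split, stratified on y, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Class proportions should match across the full set, train, and test.
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With `stratify=y`, the three printed proportions should agree up to rounding, which is the class-balance check the checklist asks for.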

**Checklist for completion:**
- Clear identification of X (features) and y (target).
- Stratified split performed with the specified test size.
- Random seed documented and used.
- Class balance verified (at least on the full set; ideally also on train vs. test).
- Statement that the test set will not be used during model selection/tuning.

**Grading (10 points):**
- Correct definition of X and y (0–2)
- Proper **stratified** split at the requested ratio (0–4)
- Reproducibility via fixed random state (0–2)
- Class balance check & “no peeking” note for the test set (0–2)


In [None]:
# ----------------------------
# 2) Train / Test split
# ----------------------------



## Section 3 — 5-Fold CV + Grid Search for K-NN *(12 points)*

**Goal.** Select the best **K-NN** hyperparameters using **5-fold Stratified cross-validation**:
- Tune **k** (number of neighbors) over a small, odd-valued set.
- Compare **distance weighting**: `uniform` vs. `distance`.
- Use **`f1_macro`** as the primary scoring metric (treats classes evenly).

**What you should do (conceptually):**
- Build a pipeline that standardizes features and applies K-NN.
- Define a **grid** for `k` (e.g., 3–17, odd values) and `weights` (`uniform`, `distance`).
- Use **Stratified 5-fold CV**, shuffled with a fixed random seed for reproducibility.
- Run a **grid search**, select the configuration with the best mean **`f1_macro`**.
- Refit the **best model on the full training set** (standard practice after selection).
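One way these steps might fit together is sketched below. It runs on synthetic data from `make_classification` rather than the lab's training split, and the step names `scaler` and `knn` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (made-up data, roughly 30/70 classes).
X_train, y_train = make_classification(
    n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=42
)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold.
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": list(range(3, 18, 2)),  # odd k: 3, 5, ..., 17
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# refit=True (the default) retrains the best configuration on all of X_train.
knn_search = GridSearchCV(
    knn_pipe, param_grid, scoring="f1_macro", cv=cv, refit=True
)
knn_search.fit(X_train, y_train)

print("Best params:", knn_search.best_params_)
print("Best mean CV f1_macro:", round(knn_search.best_score_, 3))
```

Keeping the scaler inside the pipeline means each fold's validation data never influences the scaling statistics.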

**Checklist for completion:**
- Pipeline includes **scaling → K-NN**.
- Hyperparameter grid covers **k** and **weights** clearly.
- CV is **Stratified**, **5 folds**, **shuffled**, **seeded**.
- Primary scoring is **`f1_macro`** (with brief rationale).
- Best model **refit** on train.

**Grading (12 points):**
- Proper pipeline (scaling → K-NN) (0–2)
- Correct CV design (Stratified, 5-fold, shuffled, seeded) (0–4)
- Sensible hyperparameter grid (k range + weights) (0–4)
- Proper scoring choice and justification (`f1_macro`) (0–3)
- Best model refit on the full training set (0–1)


In [None]:
# ----------------------------
# 3) Use 5-fold CV with grid search to identify the best
# K-NN parameters (k value and weighting: "uniform" or "distance")
# ----------------------------



## Section 4 — Apply Best Model to Test Data *(5 points)*

**Goal.** Use the **refit best estimator** from the grid search to generate **test-set predictions**.

**What you should do (as used here):**
- Retrieve the refit best pipeline/estimator from the search object.
- Report the **best hyperparameters**.
- Produce **`y_pred`** by predicting on **`X_test`**.
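As a rough sketch of this retrieval step, the snippet below fits a deliberately tiny search on synthetic data only so that a fitted search object exists. The variable names `search` and `best_model` and the two-value grid are assumptions for the example, not the lab's required setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (made up for illustration).
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A tiny search, present only so there is a fitted search object to query.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5]}, scoring="f1_macro", cv=5
)
search.fit(X_train, y_train)

# best_estimator_ has already been refit on all of X_train (refit=True is the default).
best_model = search.best_estimator_
print("Best params:", search.best_params_)

# The test set is touched only here, after model selection is finished.
y_pred = best_model.predict(X_test)
```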

**Checklist for completion:**
- Best estimator retrieved and identified.
- Best parameters reported.
- Test-set predictions (`y_pred`) generated from the best estimator.

**Grading (5 points):**
- Retrieve refit best estimator (0–2)  
- Report best params (0–1)  
- Predict on `X_test` to obtain `y_pred` (0–2)


In [None]:
# ----------------------------
# 4) Apply the best model to the test data to get y_pred
# ----------------------------




## Section 5 — Class Evaluation Table & Confusion Matrix *(10 points)*

**Goal.** Summarize test-set performance with a **per-class metrics table** and a **confusion matrix**.

**What you should do (as used here):**
- Generate a **classification report** showing **precision, recall, F1-score, support** for each class.
- Plot a **confusion matrix** using the test labels and predictions.
- Provide a short interpretation (e.g., which class is misclassified more often).

**Checklist for completion:**
- Classification report produced for the **test set** (per-class metrics visible).
- Confusion matrix plotted for **`y_test` vs. `y_pred`**.
- Clear title/labels and a brief interpretation of errors.

**Grading (10 points):**
- Correct classification report on test set (0–4)  
- Correct confusion matrix on test set (0–4)  
- Brief, accurate interpretation (0–2)

- **Provide a short interpretation:** <mark>Which class is misclassified more often (describe your interpretation)?</mark>



In [None]:
# ----------------------------
# 5) Class Evaluation Table and Confusion Matrix 
# ----------------------------



## Section 6 — Visualize ROC & Precision–Recall *(10 points)*

**Goal.** For binary classification, summarize threshold-independent performance with **ROC** and **Precision–Recall** visualizations.

**What you should do (as used here):**
- Obtain **predicted probabilities** for the positive class (if the model provides them).
- Compute and **report ROC AUC** on the test set.
- Plot the **Precision–Recall (PR) curve** and display **Average Precision (AP)**.
- Plot the **ROC curve** for the test set.

**Checklist for completion:**
- Positive-class probabilities obtained for the test set.
- **ROC AUC** reported.
- **PR curve** plotted with **AP** labeled.
- **ROC curve** plotted with an appropriate title.

**Grading (10 points):**
- Correct probability extraction for the positive class (0–3)  
- ROC AUC computed and reported (0–2)  
- PR curve + AP shown (0–3)  
- ROC curve shown (0–2)


In [None]:
# ----------------------------
# 6) Visualize ROC and Precision–Recall
# ----------------------------




## Section 7 — Interpret Your Results *(10 points)*

**Goal.** Demonstrate understanding of the evaluation metrics/plots without repeating earlier confusion-matrix analysis.

**Answer briefly (1–2 sentences each):**
1. **Primary metric choice:** Given the ~71/29 class mix, explain why **F1-macro** (or balanced accuracy) is preferable to plain accuracy.
- **Provide your answer:** <mark>Question 1?</mark>
 
2. **ROC-AUC vs. PR (AP):** Which is more informative for this dataset and why?
- **Provide your answer:** <mark>Question 2?</mark>


 
3. **Thresholding:** If prioritizing recall for disease, would you raise or lower the 0.5 threshold, and what trade-off do you expect?
- **Provide your answer:** <mark>Question 3?</mark>


**Grading (10 points):**
- Metric choice justification (0–4)  
- ROC-AUC vs. PR insight (0–3)  
- Threshold trade-off explanation (0–3)





## Random Forest

## Section 8 — 5-Fold CV + Grid Search for Random Forest *(15 points)*

**Goal.** Use **5-fold Stratified cross-validation** to select Random Forest hyperparameters and refit the best model.

**What you should do (as used here):**
- Build a pipeline with a **RandomForestClassifier**.
- Define **StratifiedKFold(n_splits=5, shuffle=True, random_state=42)**.
- Specify the hyperparameter grid:
  - **n_estimators:** {200, 400}
  - **max_depth:** {None, 10, 20}
  - **min_samples_split:** {2, 10}
  - **min_samples_leaf:** {1, 5}
  - **max_features:** {"sqrt", 0.5}
- Run a **grid search**, select the configuration with the best mean **`f1_macro`**.
- Refit the **best model on the full training set** (standard practice after selection).
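The setup above could look roughly like the following sketch. Several things here are assumptions for illustration: the data is synthetic, the pipeline step name `rf` is arbitrary, `random_state=42` on the classifier is added for reproducibility, and `demo_grid` shrinks `n_estimators` to `[10, 20]` purely so the example runs in seconds. The assignment's grid is the one stored in `param_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training split (made-up data).
X_train, y_train = make_classification(
    n_samples=200, n_features=8, weights=[0.3, 0.7], random_state=42
)

# No scaler: tree-based models do not depend on feature scale.
rf_pipe = Pipeline([("rf", RandomForestClassifier(random_state=42))])

# Grid as listed in the assignment, prefixed with the step name "rf".
param_grid = {
    "rf__n_estimators": [200, 400],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 10],
    "rf__min_samples_leaf": [1, 5],
    "rf__max_features": ["sqrt", 0.5],
}

# Demo-only override so the sketch finishes quickly; use param_grid in the lab.
demo_grid = {**param_grid, "rf__n_estimators": [10, 20]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_search = GridSearchCV(
    rf_pipe, demo_grid, scoring="f1_macro", cv=cv, refit=True
)
rf_search.fit(X_train, y_train)

print("Best params:", rf_search.best_params_)
print("Best mean CV f1_macro:", round(rf_search.best_score_, 3))
```

The grid has 2 × 3 × 2 × 2 × 2 = 48 combinations, so 5-fold CV fits 240 models plus one final refit. With the full `n_estimators` values this can take a while.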



**Checklist for completion:**
- RF pipeline defined.
- Stratified 5-fold CV (shuffled, seeded) used.
- Grid exactly as listed above.
- Best model refit on training data.

**Grading (15 points):**
- Correct CV setup (Stratified, 5-fold, shuffle, seed) (0–4)
- Hyperparameter grid exactly matches the spec (0–4)
- Proper Grid Search (0–4)
- Best-model refit stated (0–3)


In [None]:
# ----------------------------
# 8) Use 5-fold CV with grid search to identify the best Random Forest parameters
#
# "n_estimators": [200, 400]
# "max_depth": [None, 10, 20]
# "min_samples_split": [2, 10]
# "min_samples_leaf": [1, 5]
# "max_features": ["sqrt", 0.5]
# ----------------------------



## Section 9 — Apply Best Model to the Test Set *(5 points)*

**Goal.** Use the refit **best model** (selected by cross-validation) to generate **predictions on the held-out test set**.

**What you should do (conceptually):**
- Retrieve the **best-performing** model from your hyperparameter search (the one automatically refit on the full training data).
- Note the **best hyperparameters** selected (for reporting).
- Produce **test-set predictions** using the untouched **X_test**.

**Checklist for completion:**
- Best model clearly identified and referenced.
- Best hyperparameters briefly reported.
- Test-set predictions generated (and stored) for downstream evaluation.

**Grading (5 points):**
- Identify and use the refit best model (0–2)  
- Report best hyperparameters (0–1)  
- Generate predictions on the test set (0–2)


In [None]:
# ----------------------------
# 9) Apply the best model to the test data to get y_pred
# ----------------------------




## Section 10 — Class Evaluation Table & Confusion Matrix *(10 points)*

**Goal.** Summarize **test-set** performance of your **Random Forest** with a per-class metrics table and a confusion matrix.

**What you should do (conceptually):**
- Produce a **classification report** on the test set showing **precision, recall, F1-score, support** for each class.
- Plot a **confusion matrix** comparing **true labels vs. predictions**.
- Provide a **brief interpretation**: which class is misclassified more often, and what that implies (e.g., lower recall for the minority class).

**Checklist for completion:**
- Classification report generated for the **test set**.
- Confusion matrix plotted and clearly labeled (title reflects Random Forest).
- One–two sentence interpretation referencing a specific metric or cell in the matrix.

**Grading (10 points):**
- Correct classification report on test set (0–4)  
- Correct confusion matrix on test set (0–4)  
- Clear, concise interpretation (0–2)

- **Provide a short interpretation:** <mark>Which class is misclassified more often (describe your interpretation)?</mark>


In [None]:
# ----------------------------
# 10) Class Evaluation Table and Confusion Matrix 
# ----------------------------




## Section 11 — Visualize ROC & Precision–Recall *(10 points)*

**Goal.** Use **threshold-independent** plots to evaluate your **Random Forest** on the test set.

**What you should do (conceptually):**
- Obtain **predicted probabilities** for the **positive** class from the best model.
- Compute and **report ROC-AUC** on the test set.
- Plot the **Precision–Recall (PR) curve** and report **Average Precision (AP)**.
- Plot the **ROC curve** and label axes and title appropriately.

**Checklist for completion:**
- Positive-class probabilities extracted for **X_test**.
- **ROC-AUC** reported (test set).
- **PR curve** shown with **AP** displayed.
- **ROC curve** shown with clear title/labels referencing Random Forest.

**Grading (10 points):**
- Correct probability extraction for the positive class (0–3)  
- ROC-AUC computed and reported (0–2)  
- PR curve + AP displayed (0–3)  
- ROC curve displayed (0–2)


In [None]:
# ----------------------------
# 11) Visualize ROC and Precision–Recall
# ----------------------------




## Section 12 — Top Features by Importance *(10 points)*

**Goal.** Identify and visualize the **top 10 features** that contribute most to the **Random Forest** model.

**What you should do (conceptually):**
- Obtain the **feature importance scores** from the trained Random Forest.
- Align these scores with the **exact feature names** used to fit the model.
- Select the **top 10** features by importance and **order them descending** (largest at the top).
- Create a **horizontal bar chart** of these top features.
- Add a concise caption/observation about which variables dominate and any domain-plausible reasons.
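A minimal sketch of the ranking and plot, assuming synthetic data and hypothetical feature names (`feat_0` to `feat_11`). A plain `RandomForestClassifier` is fitted directly here rather than taken from a grid search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=42
)
feature_names = [f"feat_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ is ordered like the columns passed to fit().
importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)

# barh draws the first row at the bottom, so reverse to put the largest on top.
ax = top10.iloc[::-1].plot.barh()
ax.set_xlabel("Feature importance")
ax.set_ylabel("Feature")
ax.set_title("Random Forest: top 10 features")
```

If the forest sits inside a `Pipeline`, the fitted classifier can be reached through `named_steps`, for example `best_model.named_steps["rf"]` when the step was named `rf` (a hypothetical name here).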

**Checklist for completion:**
- Correct mapping: importance scores ↔ feature names.
- Top **10** features selected and sorted (highest to lowest).
- Horizontal bar plot with **clear labels** (feature names on the y-axis, “Feature importance” on the x-axis).
- Brief interpretation (1–2 sentences).

**Grading (10 points):**
- Proper name–importance alignment (0–3)  
- Correct selection/sorting of top 10 (0–3)  
- Clear, labeled plot (0–2)  
- Brief interpretation (0–2)


In [None]:
# ----------------------------
# 12) Get and plot the top 10 features ranked in descending order according to importance
# -----------------------------


