<a href="https://colab.research.google.com/github/mppenfold/AIHC5010-Winter-2026/blob/main/Assignment1_Colab_Workflow_JZ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1 — Colab Workflow (GitHub + Pre-commit + Submission Validation)

This notebook teaches the standard workflow used throughout the course:

1. Clone your team repo
2. Install dependencies
3. Install **pre-commit** and enable a hook to strip notebook outputs
4. Run this notebook end-to-end
5. Validate `predictions.csv`
6. Commit + push + tag


In [1]:
# (Colab) show python and system info
import sys, platform
print(sys.version)
print(platform.platform())


3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Linux-6.6.105+-x86_64-with-glibc2.35


## 1) Clone Repo

Log in to your personal GitHub account and make a fork of: https://github.com/TLKline/AIHC-5010-Winter-2026

Create a Personal Access Token (PAT) in GitHub (30-second guide):

* Go to GitHub → Settings
* Developer settings
* Personal access tokens
* Choose:
  * Fine-grained
  * Grant it access to your fork with Contents: Read/Write (needed for the push in step 6)

You can clone using HTTPS.

Repo HTTPS URL (e.g., `https://github.com/TLKline/AIHC-5010-Winter-2026.git`)

In [2]:
# TODO: Change the following to your github repo path
repo_path = 'https://github.com/joezein71/AIHC-5010-Winter-2026.git'
!git clone {repo_path} student_repo

fatal: destination path 'student_repo' already exists and is not an empty directory.
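The `fatal: destination path 'student_repo' already exists` message above appears whenever the clone cell is re-run in the same Colab session. A minimal sketch of a re-runnable clone is shown below. `clone_if_missing` is a hypothetical helper, not part of the course template, and it assumes `git` is available on the PATH.

```python
import subprocess
from pathlib import Path


def clone_if_missing(repo_url: str, dest: str = "student_repo") -> bool:
    """Clone repo_url into dest unless dest already contains a git checkout.

    Returns True if a clone was run, False if it was skipped.
    """
    if (Path(dest) / ".git").is_dir():
        print(f"'{dest}' is already a git checkout; skipping clone.")
        return False
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(["git", "clone", repo_url, dest], check=True)
    return True


# Hypothetical usage in the clone cell:
# clone_if_missing(repo_path, "student_repo")
```

The check looks for a `.git` directory rather than just the folder, so an empty or unrelated `student_repo` folder still triggers a real clone attempt.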


In [3]:
# Move into repo
%cd student_repo

# Repo git info
!git status

# Where are we?
print('----------')
print('We are at:')
!pwd


/content/student_repo
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   predictions.csv

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	AIHC-5010-Winter-2026/
	student_repo/

no changes added to commit (use "git add" and/or "git commit -a")
----------
We are at:
/content/student_repo


## 2) Install dependencies

This installs the packages listed in `Project-1/readmit30/requirements.txt`.


In [4]:
!pip -q install -r Project-1/readmit30/requirements.txt

## 3) Enable pre-commit hook to strip notebook outputs

This prevents giant notebooks and reduces merge/diff pain.

One-time per clone:
- `pre-commit install`

After that, every `git commit` will strip outputs from `*.ipynb`.
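To make "strip outputs" concrete, here is a rough sketch of the transformation on a notebook's JSON. It is an illustration only: the actual hook is whatever the repo's pre-commit configuration specifies, and `strip_outputs` is a hypothetical name. The sketch assumes the nbformat 4 layout, where each code cell carries `outputs` and `execution_count` fields.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny made-up notebook: one markdown cell and one executed code cell.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1)"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
stripped = strip_outputs(example)
print(stripped["cells"][1]["outputs"], stripped["cells"][1]["execution_count"])
```

Only outputs and execution counters change; the cell sources stay intact, which is why diffs of committed notebooks show code edits and nothing else.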


In [5]:
!pip -q install pre-commit
!pre-commit install


pre-commit installed at .git/hooks/pre-commit


#MAINSTART

## 4) Submission Notebook (Template)

Replace the baseline model with your team’s approach.

In [6]:
import os
from pathlib import Path

TRAIN_PATH = os.environ.get("TRAIN_PATH", "Project-1/readmit30/scripts/data/public/train.csv")
DEV_PATH   = os.environ.get("DEV_PATH",   "Project-1/readmit30/scripts/data/public/dev.csv")
TEST_PATH  = os.environ.get("TEST_PATH",  "Project-1/readmit30/scripts/data/public/public_test.csv")
OUT_PATH   = os.environ.get("OUT_PATH",   "predictions.csv")

print("TRAIN_PATH:", TRAIN_PATH)
print("DEV_PATH:", DEV_PATH)
print("TEST_PATH:", TEST_PATH)
print("OUT_PATH:", OUT_PATH)

TRAIN_PATH: Project-1/readmit30/scripts/data/public/train.csv
DEV_PATH: Project-1/readmit30/scripts/data/public/dev.csv
TEST_PATH: Project-1/readmit30/scripts/data/public/public_test.csv
OUT_PATH: predictions.csv
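Because the paths come from environment variables with repo-relative defaults, a wrong working directory only shows up later as a `FileNotFoundError` in `pd.read_csv`. A small, optional check is sketched below; `missing_inputs` is a hypothetical helper and the dictionary simply repeats the defaults from the cell above.

```python
import os
from pathlib import Path


def missing_inputs(paths: dict) -> list:
    """Return the names of entries in `paths` whose file does not exist."""
    return [name for name, p in paths.items() if not Path(p).is_file()]


# Same environment-variable pattern as above; defaults are the repo-relative paths.
inputs = {
    "TRAIN_PATH": os.environ.get("TRAIN_PATH", "Project-1/readmit30/scripts/data/public/train.csv"),
    "DEV_PATH": os.environ.get("DEV_PATH", "Project-1/readmit30/scripts/data/public/dev.csv"),
    "TEST_PATH": os.environ.get("TEST_PATH", "Project-1/readmit30/scripts/data/public/public_test.csv"),
}
print("Missing input files:", missing_inputs(inputs) or "none")
```

If any name is listed, the usual cause is running the cell before `%cd student_repo`.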


In [7]:
import numpy as np
import pandas as pd
np.random.seed(42)

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

assert "row_id" in train.columns and "readmit30" in train.columns
assert "row_id" in test.columns

X_train = train.drop(columns=["readmit30"])
y_train = train["readmit30"].astype(int)

### Summary of Training Dataset (Numerical Features)

In [8]:
display(train.describe())

statistic,encounter_id,patient_nbr,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,row_id,readmit30
count,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0,65003.0
mean,165276600.0,54407740.0,2.022537,3.714152,5.74012,4.37666,43.027553,1.334769,15.983216,0.366383,0.199698,0.627417,7.421704,165276600.0,0.111626
std,102553300.0,38676050.0,1.444042,5.289158,4.061814,2.968984,19.655092,1.700345,8.069052,1.253617,0.989302,1.247288,1.933656,102553300.0,0.314908
min,12522.0,729.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,12522.0,0.0
25%,85490130.0,23436260.0,1.0,1.0,1.0,2.0,31.0,0.0,10.0,0.0,0.0,0.0,6.0,85490130.0,0.0
50%,152431300.0,45515590.0,1.0,1.0,7.0,4.0,44.0,1.0,15.0,0.0,0.0,0.0,8.0,152431300.0,0.0
75%,230272000.0,87716620.0,3.0,3.0,7.0,6.0,57.0,2.0,20.0,0.0,0.0,1.0,9.0,230272000.0,0.0
max,443867200.0,189481500.0,8.0,28.0,25.0,14.0,129.0,6.0,81.0,40.0,76.0,21.0,16.0,443867200.0,1.0


### Summary of Training Dataset (Categorical Features)

In [9]:
for column in train.select_dtypes(include='object').columns:
    print(f"\nValue counts for column: {column}")
    display(train[column].value_counts())


Value counts for column: race


race,count
Caucasian,48447
AfricanAmerican,12381
?,1471
Hispanic,1304
Other,984
Asian,416



Value counts for column: gender


gender,count
Female,35011
Male,29990
Unknown/Invalid,2



Value counts for column: age


age,count
[70-80),16725
[60-70),14550
[80-90),10940
[50-60),10832
[40-50),6153
[30-40),2453
[90-100),1770
[20-30),1034
[10-20),438
[0-10),108



Value counts for column: weight


weight,count
?,63018
[75-100),808
[50-75),572
[100-125),400
[125-150),91
[25-50),57
[0-25),31
[150-175),18
[175-200),6
>200,2



Value counts for column: payer_code


payer_code,count
?,25733
MC,20677
HM,4057
SP,3178
BC,2868
MD,2292
CP,1656
UN,1579
CM,1247
OG,671



Value counts for column: medical_specialty


medical_specialty,count
?,31809
InternalMedicine,9460
Emergency/Trauma,4827
Family/GeneralPractice,4746
Cardiology,3356
...,...
Proctology,1
SportsMedicine,1
Dermatology,1
Perinatology,1



Value counts for column: diag_1


diag_1,count
428,4367
414,4167
786,2558
410,2327
486,2225
...,...
833,1
895,1
506,1
149,1



Value counts for column: diag_2


diag_2,count
276,4402
428,4219
250,3917
427,3159
401,2430
...,...
379,1
235,1
V69,1
215,1



Value counts for column: diag_3


diag_3,count
250,7444
401,5293
276,3323
428,2922
427,2485
...,...
757,1
841,1
622,1
463,1



Value counts for column: max_glu_serum


max_glu_serum,count
Norm,1672
>200,944
>300,803



Value counts for column: A1Cresult


A1Cresult,count
>8,5249
Norm,3168
>7,2375



Value counts for column: metformin


metformin,count
No,52268
Steady,11717
Up,665
Down,353



Value counts for column: repaglinide


repaglinide,count
No,64044
Steady,859
Up,76
Down,24



Value counts for column: nateglinide


nateglinide,count
No,64565
Steady,411
Up,20
Down,7



Value counts for column: chlorpropamide


chlorpropamide,count
No,64952
Steady,46
Up,4
Down,1



Value counts for column: glimepiride


glimepiride,count
No,61717
Steady,2939
Up,220
Down,127



Value counts for column: acetohexamide


acetohexamide,count
No,65002
Steady,1



Value counts for column: glipizide


glipizide,count
No,56944
Steady,7205
Up,514
Down,340



Value counts for column: glyburide


glyburide,count
No,58261
Steady,5896
Up,507
Down,339



Value counts for column: tolbutamide


tolbutamide,count
No,64990
Steady,13



Value counts for column: pioglitazone


pioglitazone,count
No,60253
Steady,4516
Up,164
Down,70



Value counts for column: rosiglitazone


rosiglitazone,count
No,60956
Steady,3872
Up,117
Down,58



Value counts for column: acarbose


acarbose,count
No,64809
Steady,185
Up,6
Down,3



Value counts for column: miglitol


miglitol,count
No,64983
Steady,14
Down,4
Up,2



Value counts for column: troglitazone


troglitazone,count
No,65002
Steady,1



Value counts for column: tolazamide


tolazamide,count
No,64976
Steady,26
Up,1



Value counts for column: examide


examide,count
No,65003



Value counts for column: citoglipton


citoglipton,count
No,65003



Value counts for column: insulin


insulin,count
No,30268
Steady,19776
Down,7732
Up,7227



Value counts for column: glyburide-metformin


glyburide-metformin,count
No,64516
Steady,476
Up,7
Down,4



Value counts for column: glipizide-metformin


glipizide-metformin,count
No,64990
Steady,13



Value counts for column: glimepiride-pioglitazone


glimepiride-pioglitazone,count
No,65002
Steady,1



Value counts for column: metformin-rosiglitazone


metformin-rosiglitazone,count
No,65002
Steady,1



Value counts for column: metformin-pioglitazone


metformin-pioglitazone,count
No,65002
Steady,1



Value counts for column: change


change,count
No,35098
Ch,29905



Value counts for column: diabetesMed


diabetesMed,count
Yes,50097
No,14906


### Missingness in Training and Test Datasets

In [10]:
import pandas as pd

# Calculate missing values for training set
missing_train_count = train.isnull().sum()
missing_train_percent = (train.isnull().sum() / len(train) * 100)
missing_train_df = pd.DataFrame({
    'Train Missing Count': missing_train_count,
    'Train Missing Percentage (%)': missing_train_percent
})

# Calculate missing values for test set
missing_test_count = test.isnull().sum()
missing_test_percent = (test.isnull().sum() / len(test) * 100)
missing_test_df = pd.DataFrame({
    'Test Missing Count': missing_test_count,
    'Test Missing Percentage (%)': missing_test_percent
})

# Merge the two dataframes side-by-side
combined_missing_df = pd.merge(
    missing_train_df,
    missing_test_df,
    left_index=True,
    right_index=True,
    how='outer' # Outer join keeps columns that exist in only one of the two sets
).fillna(0) # Columns absent from one set are NaN after the join; report them as 0

# Display the combined table
print("Combined Missing Data Analysis (Training vs. Test Set - All Variables):")
# Sort by training missing count and show only the 10 columns with the most missing values
display(combined_missing_df.sort_values(by='Train Missing Count', ascending=False).head(10))

Combined Missing Data Analysis (Training vs. Test Set - All Variables):


variable,Train Missing Count,Train Missing Percentage (%),Test Missing Count,Test Missing Percentage (%)
max_glu_serum,61584,94.740243,15465.0,94.795881
A1Cresult,54211,83.397689,13514.0,82.836827
acarbose,0,0.0,0.0,0.0
admission_source_id,0,0.0,0.0,0.0
admission_type_id,0,0.0,0.0,0.0
age,0,0.0,0.0,0.0
change,0,0.0,0.0,0.0
chlorpropamide,0,0.0,0.0,0.0
citoglipton,0,0.0,0.0,0.0
diabetesMed,0,0.0,0.0,0.0
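The table above counts only `NaN` values. The value counts earlier show that several columns also use the string `?` as a missing-value placeholder (for example, `weight` is `?` in 63,018 of 65,003 training rows, about 97%), and `isnull()` does not see those. A minimal sketch that counts both kinds is given below; `missing_summary` is a hypothetical helper and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame, placeholder: str = "?") -> pd.DataFrame:
    """Count NaN and placeholder values per column, treating both as missing."""
    rows = {}
    for col in df.columns:
        is_nan = df[col].isna()
        # Cast to object so numeric columns compare element-wise (always False).
        is_placeholder = df[col].astype(object).eq(placeholder)
        rows[col] = {
            "nan_count": int(is_nan.sum()),
            "placeholder_count": int(is_placeholder.sum()),
            "total_missing_pct": float((is_nan | is_placeholder).mean() * 100),
        }
    summary = pd.DataFrame.from_dict(rows, orient="index")
    return summary.sort_values("total_missing_pct", ascending=False)


# Toy data (made-up values) mixing both styles of missingness.
toy = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "A1Cresult": [np.nan, ">8", np.nan, "Norm"],
    "time_in_hospital": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Applied as `missing_summary(train)`, this would surface `weight`, `payer_code`, `medical_specialty`, and `race` alongside `max_glu_serum` and `A1Cresult`.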


In [None]:
import pandas as pd

dev = pd.read_csv(DEV_PATH) # Load the development data

print("Missing values in 'readmit30' column (training data):", train['readmit30'].isnull().sum())
print("Missing values in 'readmit30' column (development data):", dev['readmit30'].isnull().sum())

### Summary of Variables with Outliers

Based on the box plots, the following numerical covariates show significant outliers:

*   **`time_in_hospital`**: Has data points far above the upper whisker, indicating unusually long hospital stays.
*   **`num_lab_procedures`**: Has several points beyond the whiskers, suggesting exceptionally high numbers of lab procedures for some patients.
*   **`num_medications`**: Has a clearly right-skewed distribution with many data points above the upper whisker, implying a considerable number of medications for some patients.
*   **`number_outpatient`**: Has a highly skewed distribution with numerous outliers, indicating that a small fraction of patients have a very high number of outpatient visits.
*   **`number_emergency`**: Also has many outliers, with some patients having a significantly higher number of emergency visits.
*   **`number_inpatient`**: Like outpatient and emergency visits, has many outliers, suggesting a subset of patients with a much higher frequency of inpatient admissions.
*   **`number_diagnoses`**: Has some outliers at the higher end, representing patients with an unusually large number of diagnoses.

In [None]:
import numpy as np

# Numeric feature columns, excluding the target and the row identifier.
# (Defined here so the cell runs on its own; it was previously assumed to exist.)
numerical_cols = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
for excluded in ('readmit30', 'row_id'):
    if excluded in numerical_cols:
        numerical_cols.remove(excluded)

outlier_counts = {}

for col in numerical_cols:
    Q1 = train[col].quantile(0.25)
    Q3 = train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count values outside the 1.5 * IQR fences
    is_outlier = (train[col] < lower_bound) | (train[col] > upper_bound)
    outlier_counts[col] = int(is_outlier.sum())

print("Number of outliers per numerical variable (using IQR method):")
for col, count in outlier_counts.items():
    print(f"{col}: {count}")
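One caveat when reading these counts: for zero-inflated columns the IQR rule degenerates. In the `describe()` output above, `number_outpatient` and `number_emergency` have 25th, 50th, and 75th percentiles all equal to 0, so the IQR is 0 and every nonzero value is flagged. The toy example below (made-up values, hypothetical helper `iqr_bounds`) shows the effect.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a numeric series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy zero-inflated counts (made-up values): most patients have no prior visits.
visits = pd.Series([0] * 10 + [1, 3])
lower, upper = iqr_bounds(visits)
n_flagged = int(((visits < lower) | (visits > upper)).sum())
print(lower, upper, n_flagged)
```

Both fences collapse to 0 here, so the two patients with any prior visit are counted as outliers. For such columns the count is better read as "number of patients with at least one visit" than as a sign of bad data.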

### Number of Duplicates per Variable

In [None]:
duplicate_counts = {}

for col in train.columns:
    # Count duplicates (excluding the first occurrence)
    duplicate_counts[col] = train[col].duplicated().sum()

print("Number of duplicates per variable (excluding first occurrence):")
for col, count in duplicate_counts.items():
    print(f"{col}: {count}")
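Per-column duplicate counts are large for any low-cardinality column (`gender` has three distinct values, so 65,000 of its 65,003 entries count as duplicates), which makes them hard to interpret. Exact duplicate rows and repeat encounters for the same patient are usually the more informative checks. The following is a sketch under that assumption; `duplicate_report` is a hypothetical helper and the toy frame uses made-up values.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, patient_col: str = "patient_nbr") -> dict:
    """Summarize exact duplicate rows and repeat encounters per patient."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "repeat_patient_encounters": int(df[patient_col].duplicated().sum()),
        "unique_patients": int(df[patient_col].nunique()),
    }


# Toy data (made-up values): patient 10 appears in two encounters.
toy_encounters = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "patient_nbr": [10, 10, 11, 12],
    "time_in_hospital": [3, 5, 2, 2],
})
report = duplicate_report(toy_encounters)
print(report)
```

Repeat patients matter for evaluation: if the same `patient_nbr` appears in both training and dev data, dev scores may look better than they would on unseen patients.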

# Task
Analyze potential data leakage in the `train` dataset by visualizing the relationship between each feature and the target variable `readmit30`. Specifically:

1.  For numerical features (excluding `encounter_id`, `patient_nbr`, `row_id`), create box plots or violin plots comparing their distributions for patients readmitted (`readmit30=1`) versus not readmitted (`readmit30=0`).
2.  For categorical features (excluding `encounter_id`, `patient_nbr`, `row_id`), create stacked bar charts or count plots to show the proportion of `readmit30` outcomes within each category.

Finally, summarize any features that exhibit an unusually strong or direct relationship with `readmit30`, indicating potential data leakage or highly predictive features that warrant further investigation.
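The per-category proportions described in item 2 can also be tabulated directly, which makes it easy to spot categories whose readmission rate is at or near 0% or 100%. A minimal sketch follows; `readmit_rate_by_category` is a hypothetical helper and the toy frame uses made-up values.

```python
import pandas as pd


def readmit_rate_by_category(df: pd.DataFrame, feature: str,
                             target: str = "readmit30") -> pd.DataFrame:
    """Return row count and readmission rate for each level of `feature`."""
    summary = df.groupby(feature)[target].agg(n="count", readmit_rate="mean")
    return summary.sort_values("readmit_rate", ascending=False)


# Toy data (made-up values) with one category that never leads to readmission.
toy_rates = pd.DataFrame({
    "discharge_disposition_id": [1, 1, 1, 1, 11, 11, 6, 6],
    "readmit30": [0, 1, 0, 0, 0, 0, 1, 1],
})
rate_table = readmit_rate_by_category(toy_rates, "discharge_disposition_id")
print(rate_table)
```

Reading `n` next to the rate helps separate a genuinely deterministic category from one that has an extreme rate only because it contains a handful of rows.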

### Summary of Features with Potential Data Leakage or High Predictiveness

Based on the visualizations of numerical and categorical features against the `readmit30` target variable, several features show a strong relationship, which could indicate either high predictiveness or potential data leakage:

**Numerical Features (from Box Plots):**

*   **`discharge_disposition_id`**: Certain `discharge_disposition_id` values show a clear separation in `readmit30` distributions. A patient who dies, or who is discharged to hospice, is very unlikely to be readmitted, so those rows are almost all `readmit30=0`. This is a strong indicator of **data leakage**: the discharge disposition is recorded at the end of the hospital stay, and some outcomes directly preclude readmission. Assuming the data follows the ID mapping published with the UCI "Diabetes 130-US hospitals" dataset (verify against the mapping file for this course's data), the relevant codes are '11' (Expired), '13' (Hospice / home), '14' (Hospice / medical facility), '19' (Expired at home, Medicaid only, hospice), '20' (Expired in a medical facility, Medicaid only, hospice), and '21' (Expired, place unknown, Medicaid only, hospice).
*   **`number_outpatient`, `number_emergency`, `number_inpatient`**: Higher counts are associated with a higher likelihood of readmission, most visibly for `number_outpatient` and `number_emergency`. The relationship is less direct than for `discharge_disposition_id`. A very high number of prior outpatient, emergency, or inpatient visits for the same patient (`patient_nbr`) may simply be highly predictive; it would hint at **data leakage** only if this information is recorded or becomes known *after* the decision for the current admission but before the end of the 30-day readmission window.
*   **`time_in_hospital`**: Patients with longer hospital stays tend to have a slightly higher median for `readmit30=1`, though the overlap is significant. This could be a genuine predictor rather than leakage.

**Categorical Features (from Stacked Bar Charts):**

*   **`discharge_disposition_id`**: As with its numerical interpretation, several categories within `discharge_disposition_id` correlate directly with `readmit30=0` (no readmission). Discharge dispositions indicating death or hospice care (e.g., categories '11', '13', '14', '19', '20', '21', assuming the standard UCI ID mapping) show a readmission rate at or near 0%. This is the most significant **data leakage** observed.
*   **`age`**: The `[70-80)` and `[80-90)` age groups appear to have a slightly higher proportion of readmissions compared to younger groups, suggesting it's a predictive feature.
*   **`diag_1`, `diag_2`, `diag_3`**: Some specific diagnostic codes might show very high or very low readmission rates. If certain diagnoses are closely tied to the discharge outcome that prevents readmission, they could also reflect leakage indirectly through `discharge_disposition_id`. Without detailed medical knowledge, it's hard to distinguish true causality from leakage for these.
*   **`payer_code`**: Certain payer codes might correlate with different healthcare access or follow-up care, leading to varying readmission rates, but no extremely stark differences indicating direct leakage.

**Key Takeaway for Leakage:**
The feature **`discharge_disposition_id`** is the clearest and most significant indicator of **data leakage**. The information contained in this feature directly determines whether a patient can be readmitted (e.g., if they expired or were transferred to another inpatient facility, they cannot be readmitted to the same hospital within 30 days). Using this feature directly in a predictive model without careful handling would lead to an artificially inflated performance.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# TODO: Add any new imports for your own method here
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

method = 4

cat_cols = [c for c in X_train.columns if X_train[c].dtype == "object"]
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ],
)

if method==1:
    # Use logistic regression model
    clf = Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=200)),
    ])

if method==2:
    # Use logistic regression with balanced class weights
    clf = Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=200,class_weight='balanced')),
    ])

if method==3:
    # Use SVC (i.e. SVM model)
    clf = Pipeline(
        [
            ("preprocess", preprocess),
            ("scaler", StandardScaler(with_mean=False)), # Add StandardScaler here
            ("model", SVC(gamma="auto",max_iter=1000,probability=True)),
        ]
    )

if method == 4:
    # Preprocess for HGB: ordinal-encode categories (HGB needs numeric inputs)
    preprocess_hgb = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("imputer", SimpleImputer(strategy="median")),
            ]), num_cols),
            ("cat", Pipeline([
                ("imputer", SimpleImputer(strategy="most_frequent")),
                ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
            ]), cat_cols),
        ],
        remainder="drop",
    )

    clf = Pipeline([
        ("preprocess", preprocess_hgb),
        ("model", HistGradientBoostingClassifier(
            max_depth=6,
            learning_rate=0.05,
            max_iter=300,
            l2_regularization=1.0,
            early_stopping=True,
            random_state=42,
            class_weight='balanced',
        )),
    ])

clf.fit(X_train, y_train)
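Methods 2 and 4 pass `class_weight='balanced'`. In scikit-learn this mode weights each class by `n_samples / (n_classes * np.bincount(y))`. With roughly 11% positives in `train` (the `readmit30` mean above is 0.1116), that works out to a weight of about 4.5 for readmitted rows and about 0.56 for the rest. The sketch below reproduces the formula in plain Python on made-up labels; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels) -> dict:
    """Compute n_samples / (n_classes * class_count) for each class label."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with 25% positives (made-up values).
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

Reweighting changes the effective class balance seen during training, so the predicted probabilities tend to sit above the true readmission rate. That is worth keeping in mind when reading the Brier score and the probability histogram for the dev set.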

In [None]:
p_test = clf.predict_proba(test)[:, 1]
pred = pd.DataFrame({"row_id": test["row_id"].astype(int), "prob_readmit30": p_test.astype(float)})
pred.to_csv(OUT_PATH, index=False)
pred.head()

In [None]:
# Validate output format (required for students before tagging)
!python Project-1/readmit30/scripts/validate_submission.py --pred {OUT_PATH} --test {TEST_PATH}


In [None]:
# Calculate metrics for the dev set
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss
import matplotlib.pyplot as plt

dev = pd.read_csv(DEV_PATH)

X_dev = dev.drop(columns=["readmit30"])
y_dev = dev["readmit30"].astype(int)

# Calculate metrics
y_true = y_dev.astype(int)
y_pred = clf.predict_proba(X_dev)[:, 1]

auroc = roc_auc_score(y_true, y_pred)
auprc = average_precision_score(y_true, y_pred)
brier = brier_score_loss(y_true, y_pred)

print(f'AUROC: {auroc:.4f}')
print(f'AUPRC: {auprc:.4f}')
print(f'Brier Score: {brier:.4f}')

# Create figures
plt.figure(figsize=(10, 6))

# Histogram of predicted probabilities
plt.hist(y_pred, bins=20, alpha=0.7, label='Predicted Probabilities')
plt.title('Histogram of Predicted Probabilities')
plt.xlabel('Probability')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Scatter plot of true vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_true, y_pred, alpha=0.5, label='True vs Predicted')
plt.title('True vs Predicted Probabilities')
plt.xlabel('True Labels')
plt.ylabel('Predicted Probabilities')
plt.legend()
plt.show()

# Create ROC Curve
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_true, y_pred)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'AUROC = {auroc:.4f}')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

# Create Precision-Recall Curve
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, y_pred)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, label=f'AUPRC = {auprc:.4f}')
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()

# Create Confusion Matrix Heatmap
from sklearn.metrics import confusion_matrix
import seaborn as sns

threshold = 0.5  # Default threshold for binary classification
y_pred_binary = (y_pred >= threshold).astype(int)
cm = confusion_matrix(y_true, y_pred_binary)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Readmit', 'Readmit'], yticklabels=['No Readmit', 'Readmit'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
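To judge whether the metrics above are good, it helps to compare them with an uninformative reference. A model that always predicts the prevalence `p` has a Brier score of `p * (1 - p)`, an average precision equal to `p`, and an AUROC of 0.5. The sketch below evaluates these using the training prevalence; it assumes the dev set has a similar positive rate, which the notebook does not verify, and `constant_baselines` is a hypothetical helper.

```python
def constant_baselines(prevalence: float) -> dict:
    """Reference scores for a model that always predicts the prevalence."""
    return {
        # Brier score of a constant prediction p on labels with mean p is p * (1 - p).
        "brier": prevalence * (1.0 - prevalence),
        # Average precision of a constant prediction equals the positive rate.
        "auprc": prevalence,
        # AUROC of an uninformative ranking is 0.5 regardless of prevalence.
        "auroc": 0.5,
    }


# 0.111626 is the mean of readmit30 in train, taken from the describe() output.
baselines = constant_baselines(0.111626)
print({name: round(value, 4) for name, value in baselines.items()})
```

A Brier score near 0.099 or an AUPRC near 0.11 would therefore indicate little gain over always predicting the base rate.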

#MAINEND

## 5) Validate the predictions file format

This checks:
- required columns
- probabilities in [0, 1]
- row_ids match the test file

It assumes the submission notebook wrote `predictions.csv` in the repo root.
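The three checks listed above can be approximated in a few lines of pandas, which is handy for debugging before running the official script. This is only a sketch of what such checks might look like, not the contents of `validate_submission.py`; `check_predictions` is a hypothetical helper, and the column names `row_id` and `prob_readmit30` are taken from the cell that writes `predictions.csv`.

```python
import pandas as pd


def check_predictions(pred: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    required = ["row_id", "prob_readmit30"]
    missing = [c for c in required if c not in pred.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    probs = pd.to_numeric(pred["prob_readmit30"], errors="coerce")
    if probs.isna().any():
        problems.append("non-numeric or missing probabilities")
    if ((probs < 0) | (probs > 1)).any():
        problems.append("probabilities outside [0, 1]")
    if pred["row_id"].duplicated().any():
        problems.append("duplicate row_id values")
    if set(pred["row_id"]) != set(test["row_id"]):
        problems.append("row_id values do not match the test file")
    return problems


# Toy frames (made-up values): one valid submission and one with two problems.
toy_test = pd.DataFrame({"row_id": [1, 2, 3]})
good = pd.DataFrame({"row_id": [1, 2, 3], "prob_readmit30": [0.1, 0.5, 0.9]})
bad = pd.DataFrame({"row_id": [1, 2, 4], "prob_readmit30": [0.1, 1.5, 0.9]})
print(check_predictions(good, toy_test))
print(check_predictions(bad, toy_test))
```

The official validator remains the source of truth for what counts as a valid submission.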


In [None]:
from pathlib import Path
pred_path = Path("predictions.csv")
test_path = Path("Project-1/readmit30/scripts/data/public/public_test.csv")

if not pred_path.exists():
    print("predictions.csv not found. Run the submission cells in section 4 first.")
else:
    # Reuse the paths defined above instead of repeating the literals
    !python Project-1/readmit30/scripts/validate_submission.py --pred {pred_path} --test {test_path}


## 6) Commit + push + tag

You will:
- add changes
- commit (pre-commit hook runs here)
- push
- tag a milestone (example: `milestone_wk3`) and push tags



You will need a Personal Access Token (PAT) for the following step. See instructions above.

In [None]:
# ==== Colab -> GitHub commit/push for a specific notebook path (PAT auth) ====
# What this does:
#  1) clones the repo into the Colab VM
#  2) overwrites the target notebook file with the *currently open* Colab notebook
#  3) commits the change
#  4) asks you for a GitHub PAT and pushes to the target branch
#  5) (optional) creates a git tag and pushes the tag
#
# Notes:
#  - PAT is read via getpass (not echoed). It is only used for this runtime session.
#  - This overwrites the file at TARGET_REL with the *current Colab notebook contents*.

import os
import json
import subprocess
import getpass
from google.colab import _message

# ==========================
# START USER-EDITABLE SETTINGS
# ==========================
# Repo settings
REPO_HTTPS = "https://github.com/joezein71/AIHC-5010-Winter-2026.git"  # full https clone URL ending in .git
REPO_DIR   = "AIHC-5010-Winter-2026"                                # folder name to clone into (or reuse)

# Git settings
BRANCH     = "main"                                                 # branch to commit/push to
COMMIT_MSG = "Update Assignment1_Colab_Workflow.ipynb from Colab test5"    # commit message

# File to overwrite inside the repo (relative to repo root)
TARGET_REL = "Project-1/readmit30/notebooks/Assignment1_Colab_Workflow.ipynb"

# Identity for commits
GIT_USER_NAME  = "Joe Zein"
GIT_USER_EMAIL = "zein.timothy@mayo.edu"

# (Optional) If you want to push to a different remote than REPO_HTTPS, set it here.
# Leave as None to use REPO_HTTPS.
PUSH_REMOTE_HTTPS = None  # e.g. "https://github.com/<user>/<repo>.git"

# Set TAG_NAME to something like "assignment1-submission-v1".
# Leave as "" (empty string) to skip tagging.
TAG_NAME    = "assignment1-submission-v01"  # e.g. "assignment1-submission-v1"
TAG_MESSAGE = "Assignment 1 submission"  # used only for annotated tags
TAG_ANNOTATED = True  # True = annotated tag (-a -m). False = lightweight tag.
# ==========================
# END USER-EDITABLE SETTINGS
# ==========================


def run(cmd, cwd=None, check=True):
    """Run a command (argument list, no shell) and print its captured output."""
    print(f"\n$ {' '.join(cmd)}")
    p = subprocess.run(cmd, cwd=cwd, text=True, capture_output=True)
    if p.stdout:
        print(p.stdout)
    if p.stderr:
        print(p.stderr)
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {p.returncode}: {' '.join(cmd)}")
    return p


def github_authed_remote(https_remote: str, token: str) -> str:
    """
    Convert https://github.com/OWNER/REPO.git into https://TOKEN@github.com/OWNER/REPO.git
    Works for standard GitHub HTTPS remotes.
    """
    if https_remote.startswith("https://"):
        return "https://" + token + "@" + https_remote[len("https://"):]
    raise ValueError("Expected an https remote URL (starting with https://).")


def tag_exists_locally(tag_name: str, cwd: str) -> bool:
    p = subprocess.run(["git", "tag", "-l", tag_name], cwd=cwd, text=True, capture_output=True)
    return p.stdout.strip() == tag_name


REMOTE_FOR_PUSH = PUSH_REMOTE_HTTPS or REPO_HTTPS

# 1) Clone (or reuse existing clone)
if not os.path.isdir(REPO_DIR):
    run(["git", "clone", REPO_HTTPS, REPO_DIR])
else:
    print(f"Repo directory already exists: {REPO_DIR}")

# Ensure we're on the right branch and up-to-date
run(["git", "checkout", BRANCH], cwd=REPO_DIR)
run(["git", "pull", "origin", BRANCH], cwd=REPO_DIR)

# 2) Get the currently-open notebook JSON from Colab
nb = _message.blocking_request("get_ipynb", timeout_sec=30)["ipynb"]

# 3) Overwrite the target file in the clone
target_abs = os.path.join(os.getcwd(), REPO_DIR, TARGET_REL)
os.makedirs(os.path.dirname(target_abs), exist_ok=True)
with open(target_abs, "w", encoding="utf-8") as f:
    json.dump(nb, f, ensure_ascii=False, indent=1)

print("Wrote current Colab notebook to:")
print(" ", target_abs)

# 4) Configure git identity
run(["git", "config", "user.name", GIT_USER_NAME], cwd=REPO_DIR)
run(["git", "config", "user.email", GIT_USER_EMAIL], cwd=REPO_DIR)

# 5) Show status; if no changes, stop early
status = run(["git", "status", "--porcelain"], cwd=REPO_DIR, check=True).stdout.strip()
if not status:
    print("\nNo changes detected in the repo after writing the notebook.")
    print("Double-check that you're running this cell inside the notebook you edited,")
    print("and that TARGET_REL points to the correct path inside the repo.")
else:
    # 6) Add + commit
    run(["git", "add", TARGET_REL], cwd=REPO_DIR)

    commit_proc = subprocess.run(
        ["git", "commit", "-m", COMMIT_MSG],
        cwd=REPO_DIR, text=True, capture_output=True
    )
    if commit_proc.stdout:
        print(commit_proc.stdout)
    if commit_proc.stderr:
        print(commit_proc.stderr)

    combined = (commit_proc.stdout + commit_proc.stderr).lower()
    if commit_proc.returncode != 0 and "nothing to commit" not in combined:
        raise RuntimeError("git commit failed unexpectedly")

    # 7) Ask for PAT and push
    print("\nEnter a GitHub Personal Access Token (PAT) with permission to push to this repo.")
    print("Recommended: fine-grained token with access to the repo and Contents: Read/Write.")
    token = getpass.getpass("GitHub PAT (input hidden): ").strip()
    if not token:
        raise ValueError("No token entered.")

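    # Temporarily set authenticated remote URL for this push only (and for tag push).
    # Do not use run() here: it echoes the full command, which would print the
    # token-bearing URL into the notebook output.
    authed_remote = github_authed_remote(REMOTE_FOR_PUSH, token)
    set_url = subprocess.run(
        ["git", "remote", "set-url", "origin", authed_remote],
        cwd=REPO_DIR, text=True, capture_output=True,
    )
    if set_url.returncode != 0:
        raise RuntimeError("git remote set-url failed (URL omitted because it contains the token)")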

    try:
        # Push commits
        run(["git", "push", "origin", BRANCH], cwd=REPO_DIR)
        print(f"\n Pushed successfully to {BRANCH}.")

        # 8) OPTIONAL: Create + push tag
        if TAG_NAME.strip():
            tag_name = TAG_NAME.strip()

            # If tag already exists locally, don't recreate
            if tag_exists_locally(tag_name, REPO_DIR):
                print(f"Tag already exists locally: {tag_name}")
            else:
                if TAG_ANNOTATED:
                    run(["git", "tag", "-a", tag_name, "-m", TAG_MESSAGE], cwd=REPO_DIR)
                else:
                    run(["git", "tag", tag_name], cwd=REPO_DIR)
                print(f"Created tag: {tag_name}")

            # Push just this tag (or use --tags to push all tags)
            run(["git", "push", "origin", tag_name], cwd=REPO_DIR)
            print(f" Pushed tag: {tag_name}")
        else:
            print("Skipping tag creation (TAG_NAME is empty).")

        print("\nDone. Check GitHub for the new commit (and tag, if set).")

    finally:
        # Restore remote URL without token
        run(["git", "remote", "set-url", "origin", REPO_HTTPS], cwd=REPO_DIR, check=False)
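Because the `run` helper prints every command it executes, any argument that embeds the PAT should be masked before it is echoed or included in an error message. One possible approach is sketched below. `redact` and `printable_command` are hypothetical helpers, not part of the cell above, and the token shown is a made-up placeholder.

```python
def redact(text: str, secret: str, mask: str = "***") -> str:
    """Replace every occurrence of `secret` in `text` with `mask`."""
    return text.replace(secret, mask) if secret else text


def printable_command(cmd: list, secret: str) -> str:
    """Join a command for display with the secret masked out."""
    return " ".join(redact(part, secret) for part in cmd)


# Made-up token value, for illustration only.
fake_token = "github_pat_EXAMPLE"
cmd = ["git", "remote", "set-url", "origin",
       f"https://{fake_token}@github.com/OWNER/REPO.git"]
print("$", printable_command(cmd, fake_token))
```

The same masking could be applied to captured stdout and stderr before printing them, since git may repeat the remote URL in its messages.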


## Done ✅

If you hit issues:
- Make sure you pulled the latest course template; an out-of-date fork may be missing files.
- Make sure `Project-1/readmit30/scripts/data/public/*` exists in your repo (or your instructor provided it separately).
