# Lab 1 Week 1: Datasheet and Audit

Notebook: 01_datasheet_and_audit.ipynb

Student Name: [Your Name]

Date: February 4, 2026

Dataset: Surgical-deepnet.csv from Kaggle (https://www.kaggle.com/datasets/omnamahshivai/surgical-dataset-binary-classification)

Task: Binary Classification - Predict 'complication' (0/1) from patient features.

This notebook covers Part A (Datasheet + Quality Audit + Leakage Note) and Part B (Leakage-safe Split + Pipeline + Baseline).

## Imports and Setup

Import required libraries and set random seed for reproducibility.

In [None]:
# TODO: Import pandas, numpy, sklearn, matplotlib
# TODO: Set random_state = 42
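
A minimal sketch of the setup cell, assuming pandas, NumPy, scikit-learn, and matplotlib are installed. The constant name `RANDOM_STATE` is a naming choice made here, not something the lab prescribes.

```python
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One seed reused for every random operation in the notebook.
RANDOM_STATE = 42

# Seed Python's and NumPy's global generators; scikit-learn calls
# additionally receive random_state=RANDOM_STATE explicitly.
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
```

Passing `random_state=RANDOM_STATE` to each scikit-learn call is still needed, because seeding the global generators alone does not make an estimator or splitter with its own `random_state` argument self-documenting.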

## Part A: Dataset Datasheet

### Motivation
[TODO: Brief paragraph on why this dataset exists / purpose, e.g., to predict post-surgical complications for better risk assessment.]

### Target Definition
- Target: 'complication' (binary: 0 = no complication, 1 = complication)
- Positive class (1): presence of a surgical complication. Error costs are asymmetric: a false negative misses a real complication and can lead to patient harm, while a false positive typically only triggers extra review.

### Data Source and License
- Source: Kaggle dataset uploaded by user 'omnamahshivai'; origin likely anonymized medical records (exact source unknown).
- Download Link: https://www.kaggle.com/datasets/omnamahshivai/surgical-dataset-binary-classification/download
- License/Terms: Unknown. No license was verified for this upload, so check the license field on the Kaggle dataset page before reuse; use here is assumed to be limited to educational ML.

### Brief Feature Dictionary
Dataset has 14,635 rows and 25 columns: 24 features plus the target (all numeric after loading).

Columns (with types and brief descriptions):
- bmi (float): Body Mass Index
- Age (float): Patient age in years
- asa_status (int): ASA physical status classification (0-2?)
- baseline_cancer (int): Binary indicator for baseline cancer
- baseline_charlson (int): Charlson Comorbidity Index
- baseline_cvd (int): Binary for cardiovascular disease
- baseline_dementia (int): Binary for dementia
- baseline_diabetes (int): Binary for diabetes
- baseline_digestive (int): Binary for digestive issues
- baseline_osteoart (int): Binary for osteoarthritis
- baseline_psych (int): Binary for psychiatric conditions
- baseline_pulmonary (int): Binary for pulmonary issues
- ahrq_ccs (int): AHRQ Clinical Classification System code
- ccsComplicationRate (float): Complication rate from CCS
- ccsMort30Rate (float): 30-day mortality rate from CCS
- complication_rsi (float): Risk score index for complication
- dow (int): Day of week (0-4)
- gender (int): Binary (0/1)
- hour (float): Time of day (hour)
- month (int): Month (0-11?)
- moonphase (int): Moon phase (0-3?)
- mort30 (int): Binary 30-day mortality
- mortality_rsi (float): Risk score index for mortality
- race (int): Categorical (0-2)
- complication (int): Target

### Known Limitations/Risks
- Imbalanced target: roughly 75% no complication, 25% complication.
- Potential biases: race and gender distributions may not represent diverse populations.
- Anonymized data: no direct identifiers are present, but medical predictions still carry ethical risk (e.g., harm from false negatives).
- Size: 14,635 rows, of which 2,902 are duplicates.


## Part A: Data-Quality Audit

In [None]:
# TODO: Load the dataset (pd.read_csv('Surgical-deepnet.csv'))
# TODO: Print shape, columns, dtypes
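
A possible shape for the loading cell. The helper name `describe_frame` and the three-row stand-in CSV are illustrative assumptions so the sketch runs without the Kaggle file; the stand-in values are made up and are not rows from the dataset.

```python
import io

import pandas as pd


def describe_frame(df: pd.DataFrame) -> None:
    """Print the basic structure of a loaded dataset."""
    print("shape:", df.shape)
    print("columns:", list(df.columns))
    print(df.dtypes)


# Stand-in CSV text (made-up values) so the sketch is runnable anywhere.
sample_csv = io.StringIO(
    "bmi,Age,gender,complication\n"
    "19.3,59.2,0,0\n"
    "18.7,59.1,0,1\n"
    "21.9,59.0,1,0\n"
)
sample_df = pd.read_csv(sample_csv)
describe_frame(sample_df)

# With the real file in the working directory:
# df = pd.read_csv("Surgical-deepnet.csv")
# describe_frame(df)
```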

### Missingness Summary

In [None]:
# TODO: df.isna().sum() and percentages
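
One way to produce counts and percentages in a single table. `missingness_summary` is a hypothetical helper name, and the toy frame below contains made-up values purely to show the output shape.

```python
import numpy as np
import pandas as pd


def missingness_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column."""
    counts = df.isna().sum()
    percent = (df.isna().mean() * 100).round(2)
    summary = pd.DataFrame({"n_missing": counts, "pct_missing": percent})
    # Columns with the most missing values come first.
    return summary.sort_values("n_missing", ascending=False)


# Made-up toy data with one missing value in each of two columns.
toy = pd.DataFrame(
    {
        "bmi": [19.3, np.nan, 21.9, 25.0],
        "Age": [59.2, 59.1, 59.0, np.nan],
        "complication": [0, 1, 0, 0],
    }
)
toy_missing = missingness_summary(toy)
print(toy_missing)
```

On the real data the call would be `missingness_summary(df)`.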

### Duplicate Rows Check

In [None]:
# TODO: df.duplicated().sum()
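
A small sketch of the duplicate check, assuming that "duplicate" means an exact match across all columns (the pandas default). `duplicate_report` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame) -> dict:
    """Count rows that exactly repeat an earlier row."""
    # duplicated() flags every repeat after the first occurrence.
    n_dup = int(df.duplicated().sum())
    return {
        "n_rows": len(df),
        "n_duplicates": n_dup,
        "pct_duplicates": round(100 * n_dup / len(df), 2),
    }


# Made-up toy data: rows 1 and 3 repeat rows 0 and 2.
toy = pd.DataFrame(
    {"bmi": [19.3, 19.3, 21.9, 21.9, 25.0], "complication": [0, 0, 1, 1, 0]}
)
toy_dups = duplicate_report(toy)
print(toy_dups)
```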

### Target Distribution

In [None]:
# TODO: df['complication'].value_counts(normalize=True)
# TODO: Optional: plot histogram or pie
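
The class balance can be tabulated with counts and shares side by side. This is a sketch on a made-up four-value series; `target_distribution` is a hypothetical helper name.

```python
import pandas as pd


def target_distribution(y: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, ordered by class label."""
    counts = y.value_counts().sort_index()
    shares = y.value_counts(normalize=True).sort_index().round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Made-up labels with a 75/25 split, used only to show the output format.
toy_y = pd.Series([0, 0, 0, 1], name="complication")
toy_dist = target_distribution(toy_y)
print(toy_dist)

# Optional plot on the real data (requires matplotlib):
# target_distribution(df["complication"])["share"].plot(kind="bar")
```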

### One Bias/Ethics Consideration

[TODO: Brief 3-5 sentences, e.g., Race is categorical (0-2), check distribution: if skewed (e.g., 92% race=1), model may not generalize to underrepresented groups. Ethical risk: Deployed model could disadvantage minority patients in complication prediction. Mitigation: Check for disparate impact later.]

In [None]:
# TODO: Code to support ethics note, e.g., df['race'].value_counts(normalize=True)
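
To back the ethics note with numbers, it may help to report each group's share of the data together with its observed complication rate. The sketch below assumes `race` is the group column; `group_summary` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def group_summary(df: pd.DataFrame, group_col: str, target_col: str) -> pd.DataFrame:
    """Return size, data share, and positive rate for each group."""
    grouped = df.groupby(group_col)[target_col]
    out = pd.DataFrame({"n": grouped.size(), "positive_rate": grouped.mean()})
    out["share"] = out["n"] / len(df)
    return out


# Made-up toy data in which group 1 dominates.
toy = pd.DataFrame(
    {"race": [0, 1, 1, 1, 1, 2], "complication": [1, 0, 0, 1, 0, 0]}
)
toy_groups = group_summary(toy, "race", "complication")
print(toy_groups)
```

Small groups produce noisy positive rates, so the `n` column should be read alongside `positive_rate` before drawing conclusions.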

## Part A: Leakage-Risk Note

Plausible leakage vectors:
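1. Post-outcome fields: 'ccsComplicationRate' and 'complication_rsi' appear to be derived from complication outcomes; 'ccsMort30Rate', 'mortality_rsi', and 'mort30' describe a related outcome (30-day mortality) that is only known after surgery.
2. Time-related: 'dow', 'hour', 'month', 'moonphase'. If the data is chronological, a random split would let later records inform predictions on earlier ones.
3. Duplicates/near-duplicates: 2,902 duplicate rows. Identical rows landing in both train and test would inflate test scores, so remove them before splitting.
4. Implicit grouping: there are no explicit IDs, but near-identical rows may describe the same patient and end up on both sides of the split.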

Prevention:
- Drop leakage columns before split: ['ccsComplicationRate', 'ccsMort30Rate', 'complication_rsi', 'mortality_rsi', 'mort30']
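- If the data turns out to be temporal (e.g., sortable by month/hour), use a time-based split: train on early records, test on late ones.
- Remove duplicates before splitting.
- Justification for the split: no time ordering is evident ('month', 'dow', and 'hour' carry no year or date, so they cannot order rows chronologically), so use a stratified random split.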

## Part B: Leakage-Safe Preprocessing and Baselines

### Data Cleaning (Pre-Split)

In [None]:
# TODO: Drop duplicates
# TODO: Drop leakage columns
# TODO: Define X (features), y (target='complication')
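
A sketch of the pre-split cleaning, using the leakage columns named in the Leakage-Risk Note. `clean_and_split_xy` is a hypothetical helper, and the toy frame is made up; `errors="ignore"` is used so the helper also works on frames that lack some of the listed columns.

```python
import pandas as pd

TARGET = "complication"
LEAKAGE_COLS = [
    "ccsComplicationRate",
    "ccsMort30Rate",
    "complication_rsi",
    "mortality_rsi",
    "mort30",
]


def clean_and_split_xy(df: pd.DataFrame):
    """Drop exact duplicates and leakage columns, then separate X and y."""
    cleaned = df.drop_duplicates().reset_index(drop=True)
    cleaned = cleaned.drop(columns=LEAKAGE_COLS, errors="ignore")
    X = cleaned.drop(columns=[TARGET])
    y = cleaned[TARGET]
    return X, y


# Made-up toy data: the first two rows are exact duplicates.
toy = pd.DataFrame(
    {"bmi": [19.3, 19.3, 21.9], "mort30": [0, 0, 1], "complication": [0, 0, 1]}
)
X_toy, y_toy = clean_and_split_xy(toy)
print(X_toy.shape, list(X_toy.columns), y_toy.tolist())
```

Dropping columns can create new exact duplicates among the remaining features, so re-running `X.duplicated().sum()` after this step is a reasonable extra check.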

### Leakage-Safe Split

In [None]:
# TODO: If temporal: sort by potential time col, split last 20% test
# ELSE: from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
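
The stratified branch can be checked on synthetic data before it is applied to the real frame. Everything named `*_demo` below is made up for illustration; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in: 200 rows with a 25% positive rate.
X_demo = pd.DataFrame(
    {"bmi": rng.normal(28, 5, 200), "Age": rng.normal(60, 10, 200)}
)
y_demo = pd.Series([1] * 50 + [0] * 150, name="complication")

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

# Stratification keeps the positive rate the same in both partitions.
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```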

### Build scikit-learn Pipeline

In [None]:
# TODO: Identify numeric/categorical cols (all numeric here, but use ColumnTransformer if needed)
# TODO: Pipeline with StandardScaler() + model
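
If some integer-coded columns are treated as categorical, a `ColumnTransformer` keeps the handling explicit. The sketch below assumes `race` is one such column, which is a modelling choice rather than something the dataset dictates; `build_pipeline` is a hypothetical helper and the demo rows are made up. `sparse_threshold=0.0` forces dense output because `GaussianNB` does not accept sparse input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols) -> Pipeline:
    """Scale numeric columns, one-hot encode categorical ones, then classify."""
    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array
    )
    return Pipeline([("preprocess", preprocess), ("model", GaussianNB())])


# Made-up demo rows, used only to confirm the pipeline fits end to end.
demo_X = pd.DataFrame(
    {
        "bmi": [19.3, 30.1, 25.0, 35.2, 22.4, 28.8],
        "Age": [59, 70, 45, 66, 38, 52],
        "race": [0, 1, 1, 2, 1, 0],
    }
)
demo_y = [0, 1, 0, 1, 0, 1]

demo_pipe = build_pipeline(["bmi", "Age"], ["race"])
demo_pipe.fit(demo_X, demo_y)
demo_pred = demo_pipe.predict(demo_X)
```

Because all preprocessing lives inside the pipeline, scaling and encoding statistics are learned only from whatever data is passed to `fit`.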

### Train Simple Baseline Model

Choose one: GaussianNB(), KNeighborsClassifier(), etc.

In [None]:
# TODO: pipe = Pipeline([('scaler', StandardScaler()), ('model', GaussianNB())])
# TODO: pipe.fit(X_train, y_train)
# TODO: y_pred = pipe.predict(X_test)
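
The fit/predict steps can be rehearsed on synthetic data. The arrays below are made up (two Gaussian features, with the positive class shifted), so the resulting predictions say nothing about the surgical dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in: 150 negatives, 50 positives shifted by +2 on both features.
y_demo = np.array([0] * 150 + [1] * 50)
X_demo = rng.normal(size=(200, 2)) + y_demo[:, None] * 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

pipe = Pipeline([("scaler", StandardScaler()), ("model", GaussianNB())])

# The scaler's mean and variance are learned from X_train only.
pipe.fit(X_train, y_train)

# predict() reuses the training statistics to transform X_test.
y_pred = pipe.predict(X_test)
print(y_pred[:10])
```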

### Report Primary Metric and Artifact

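Primary: F1-score on the positive class, since the target is imbalanced (about 25% positives) and accuracy would reward always predicting 0.
Artifact: Confusion matrix.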

In [None]:
# TODO: from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay
# TODO: Print f1_score(y_test, y_pred)
# TODO: Plot confusion matrix
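
A small worked example of the two reporting calls, on made-up label arrays chosen so the numbers are easy to verify by hand (2 true positives, 1 false positive, 1 false negative, 4 true negatives).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# F1 for the positive class: precision = recall = 2/3 here.
f1_demo = f1_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(f"F1: {f1_demo:.3f}")
print(cm_demo)

# To draw the matrix (requires matplotlib):
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay(confusion_matrix=cm_demo).plot()
```

In the notebook, `y_test` and `y_pred` from the baseline cell would replace the demo arrays.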

## End of Week 1

Ready for Thursday check-in: Show datasheet, audit outputs, leakage note, split code, baseline metric + confusion matrix, git commits.