# Checkpoint 1

**Part A**

https://www.kaggle.com/datasets/omnamahshivai/surgical-dataset-binary-classification


## **Subtask 1**

Dataset name + link: https://www.kaggle.com/datasets/omnamahshivai/surgical-dataset-binary-classification

License/terms: The underlying study was published by Anesthesia & Analgesia (Wolters Kluwer / International Anesthesia Research Society) and is not open access. All rights are reserved, and reuse, redistribution, or reproduction beyond personal or fair use requires prior permission from the publisher.

Prediction task + target definition: Predict whether a surgery will have a complication, given the features. Target: `complication`

Intended use / decision context: We want to see whether certain features, such as an individual's biodata, are good predictors of a surgery having complications.

Feature dictionary (5-10 key features): https://www.causeweb.org/tshs/datasets/Surgery%20Timing%20Data%20Dictionary.pdf
* bmi
* baseline_cancer
* baseline_cvd
* baseline_dementia
* baseline_diabetes
* Age
* baseline_osteoart
* baseline_digestive
* baseline_psych
* baseline_pulmonary

Known limitations/risks (2-3 bullets):
* Because the study is not randomized, it can show association but not causation
* Focusing on 30-day mortality may miss longer-term complications or functional outcomes


In [None]:
import kagglehub
import pandas as pd
import os

# Download dataset
path = kagglehub.dataset_download(
    "omnamahshivai/surgical-dataset-binary-classification"
)

print("Dataset downloaded to:", path)

# List files in the dataset directory
files = os.listdir(path)

print("Files:", files)

# Load the CSV file (update name if needed)
csv_file = "Surgical-deepnet.csv"  # this is the main file in the dataset
csv_path = os.path.join(path, csv_file)
df = pd.read_csv(csv_path)

# Basic info
print("Dataset shape:", df.shape)
print(df.head())

# Check for duplicate rows
num_duplicates = df.duplicated().sum()

print("Number of duplicate rows:", num_duplicates)

# View a sample of the duplicate rows
if num_duplicates > 0:
    duplicates = df[df.duplicated()]
    print("Sample duplicate rows:")
    print(duplicates.head())

# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("Shape after removing duplicates:", df_no_duplicates.shape)

# Check for missing values in each column
missing_per_column = df_no_duplicates.isna().sum()

print("Missing values per column:")
print(missing_per_column)

# Total number of missing values in the dataset
total_missing = missing_per_column.sum()
print("\nTotal missing values in dataset:", total_missing)

# Percentage of missing values per column
missing_percentage = (missing_per_column / len(df_no_duplicates)) * 100

print("\nMissing percentage per column:")
print(missing_percentage)

# Show only columns that actually have missing values
columns_with_missing = missing_per_column[missing_per_column > 0]

if not columns_with_missing.empty:
    print("\nColumns with missing values:")
    print(columns_with_missing)
else:
    print("\nNo missing values found in the dataset.")

complication_sum = df_no_duplicates["complication"].sum()
mort30_sum = df_no_duplicates["mort30"].sum()

print("Sum of complication:", complication_sum)
print("Sum of mort30:", mort30_sum)

Using Colab cache for faster access to the 'surgical-dataset-binary-classification' dataset.
Dataset downloaded to: /kaggle/input/surgical-dataset-binary-classification
Files: ['Surgical-deepnet.csv']
Dataset shape: (14635, 25)
     bmi   Age  asa_status  baseline_cancer  baseline_charlson  baseline_cvd  \
0  19.31  59.2           1                1                  0             0   
1  18.73  59.1           0                0                  0             0   
2  21.85  59.0           0                0                  0             0   
3  18.49  59.0           1                0                  1             0   
4  19.70  59.0           1                0                  0             0   

   baseline_dementia  baseline_diabetes  baseline_digestive  \
0                  0                  0                   0   
1                  0                  0                   0   
2                  0                  0                   0   
3                  0                  1

## **Subtask 2**

*Data-Quality Audit*

Missingness Summary
* The dataset is clean with respect to completeness, as there are 0 missing
  values across all 25 columns. Since each feature has a 100% fill rate, no imputation strategies are necessary.


Duplicate Rows Check
* A significant number of duplicates was identified: 2902 rows (~19.8% of the original 14,635 entries).
* These rows were removed, leaving a dataset of 11,733 unique rows.

Target Distribution
* Primary target (`complication`): There are 3690 positive cases out of 11,733. This represents a 31.4% complication rate. This is a relatively healthy distribution for a binary classifier given the size of the dataset.
* Secondary target (`mort30`): There are only 58 positive cases (~0.5%). This is highly imbalanced, and the column should likely be dropped.
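
The percentages quoted in this audit follow directly from the reported counts. A quick arithmetic check, using only the numbers stated above (the variable names are ours, chosen for illustration):

```python
# Counts reported in the data-quality audit
n_original = 14635      # rows before de-duplication
n_duplicates = 2902     # duplicate rows found
n_unique = n_original - n_duplicates

n_complication = 3690   # positive cases of `complication`
n_mort30 = 58           # positive cases of `mort30`

# Shares derived from the counts
duplicate_share = n_duplicates / n_original
complication_rate = n_complication / n_unique
mort30_rate = n_mort30 / n_unique

print(f"Unique rows: {n_unique}")
print(f"Duplicate share: {duplicate_share:.1%}")
print(f"Complication rate: {complication_rate:.1%}")
print(f"mort30 rate: {mort30_rate:.2%}")
```

The same rates can be read off the loaded data with `df_no_duplicates["complication"].mean()` and `df_no_duplicates["mort30"].mean()`, since both columns are 0/1 indicators.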

Ethical Considerations
* The main ethical consideration for this dataset is how race is classified. The dataset uses only three categories: Caucasian, African American, and Other. It is concerning that many racial groups are lumped into "Other". If race matters for prediction, this grouping adds noise to those data points instead of identifying them precisely, which we feel could bias the model against the "Other" group. For that reason, we are considering dropping the race column.

## Subtask 3

Possible leakage vectors include `mort30`, because this column records an outcome observed after the surgery. If we are trying to predict whether a surgery had a complication, `mort30` is likely highly indicative of the complication and directly related to it. Additionally, it is not data we would have access to before surgery, which is ideally when this model would be used. To close off this vector, we drop the `mort30` column so it is not included in our prediction.

Other risks of leakage are related to how the data is preprocessed. To prevent these, we should split the data first and fit every preprocessing step on the training set only, then apply the fitted steps to the test set.
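
A minimal sketch of leakage-safe preprocessing, in which the scaler's statistics are estimated from the training rows only and then reused on the test rows. The data here is synthetic and purely illustrative (`X_demo`, `y_demo`, and the other names are assumptions, not part of the surgical dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT the surgical dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean/std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training statistics

print("Scaler means (from training rows):", scaler.mean_)
```

Wrapping the scaler and the model in a scikit-learn `Pipeline` and calling `fit` on the training split achieves the same thing with less room for mistakes.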

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression  # or any model

# Drop unwanted columns
df_clean = df_no_duplicates.drop(
    columns=["mort30", "moonphase", "month", "dow", "hour"],
    errors="ignore"
)

# Split features / target
X = df_clean.drop(columns=["complication"])
y = df_clean["complication"]

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, shuffle=True
)

The dataset does not include the year, so we do not know the true chronological order of the surgeries. That is why we use a random split rather than a time-based one.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
knn_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(
        n_neighbors=71,
        weights="distance",
        metric="euclidean"
    ))
])

# Fit ONLY on training data
knn_pipeline.fit(X_train, y_train)

# Accuracy
knn_pipeline.score(X_test, y_test)

0.749893481039625

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = knn_pipeline.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1512  100]
 [ 487  248]]


### Part B
We dropped the `mort30` column because it is only known after the fact, so keeping it would be a form of data leakage. We dropped moonphase, month, dow, and hour to prevent overfitting. Our pipeline then consists of a standard scaler followed by KNN classification using the Euclidean metric. We got an accuracy of 74.99%, which is not great: the class split is roughly 70/30, so always predicting the majority class would already score close to 70%. It does show some promise, though, as other people's models only got around 80%.
The confusion matrix is:
[[1512  100]
 [ 487  248]]
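
Accuracy alone hides how the errors are distributed, so it can help to derive a few more figures from the confusion matrix. The sketch below uses only the four counts reported above and relies on scikit-learn's convention that `confusion_matrix` returns `[[TN, FP], [FN, TP]]` for labels 0 and 1; the variable names are ours:

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]]
cm = [[1512, 100],
      [487, 248]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
recall = tp / (tp + fn)        # share of actual complications that were caught
precision = tp / (tp + fp)     # share of predicted complications that were real

# Accuracy of always predicting the more common class in the test set
majority_baseline = max(tn + fp, fn + tp) / total

print(f"Accuracy:          {accuracy:.4f}")
print(f"Recall:            {recall:.4f}")
print(f"Precision:         {precision:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

On these counts the model catches roughly a third of the actual complications (recall of about 0.34), and the majority-class baseline is about 0.69, which puts the 74.99% accuracy in context.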