# Project: SVM CLASSIFICATION

## About the project: predicting the probability of death

In this project, you have to predict the probability of death of a patient who is entering an ICU (Intensive Care Unit).

The dataset comes from the MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* corresponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). The column HOSPITAL_EXPIRE_FLAG indicates death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to the vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.); their explanation can be found in *mimic_patient_metadata.csv*.

Please don't use any feature that would not be known on a patient's first day in the ICU (e.g., the date of death).

Note that the main cause/disease behind the patient's condition is encoded in the *ICD9_diagnosis* column. The meaning of each code can be found in *MIMIC_metadata_diagnose.csv*. **But** this is only the main one; a patient can have co-occurring diseases (comorbidities). These secondary codes can be found in *extra_data/MIMIC_diagnoses.csv*.

As the performance metric, you can use *AUC* for the binary classification case, but feel free to also report any other metric if you can justify that it is particularly suitable for this case.

## Tasks

Main tasks are:
+ Using the *mimic_train.csv* file, build a predictive model for *HOSPITAL_EXPIRE_FLAG*.
+ For this analysis there is an extra test dataset, *mimic_X_test.csv*. Apply your final model to this extra dataset and generate predictions following the same format as *test_kaggle1.csv*. Once ready, you can submit to our Kaggle competition and iterate to improve the accuracy.

As a *bonus*, try different algorithms for neighbor search and for distance, and justify your final selection. Also try different weights to cope with class imbalance and to balance neighbor proximity. Try to assess, in some way, a confidence interval for the predictions.

## Tips

You can follow these **steps** in your first implementation:
1. *Explore* and understand the dataset.
2. Manage missing data: discard any non-numeric columns and columns with a high proportion of missing data. Find a simple way to impute or remove any remaining missing data. Remember that you cannot remove test set observations, or else Kaggle won't be able to give you a score.

3. Manage categorical features, e.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.

4. Do any further necessary preprocessing.

5. Build a prediction model. Fit an SVM model, explicitly setting the arguments we have talked about in class.

6. Assess expected accuracy and tune your models using *cross-validation*.

7. Test the performance on the test file and report accuracy, following the same preparation steps (missing data, dummies, etc.). Remember that you should be able to yield a prediction for all the rows of the test dataset.

8. Submit to Kaggle and receive a score.

Feel free to reduce the training dataset if you experience computational constraints.
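Steps 2, 3 and 7 above can be sketched on toy data as follows. This is only a minimal illustration under stated assumptions: the two small frames, the `MostlyMissing` column and the 0.5 missing-share cut-off are made up for the example, and only `HeartRate_Mean` and `ADMISSION_TYPE` are real column names from the dataset. The idea is that every decision (which columns to drop, which medians to impute, which dummy columns exist) is learned from the training data and then applied unchanged to the test data, so no test row is lost.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test files (values are made up for illustration).
train = pd.DataFrame({
    "HeartRate_Mean": [102.7, np.nan, 70.6, 80.4],
    "MostlyMissing": [np.nan, np.nan, np.nan, 1.0],   # hypothetical sparse column
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "ELECTIVE", "URGENT"],
})
test = pd.DataFrame({
    "HeartRate_Mean": [np.nan, 91.0],
    "MostlyMissing": [np.nan, np.nan],
    "ADMISSION_TYPE": ["ELECTIVE", "NEWBORN"],        # category unseen in training
})

# 1. Drop columns whose missing share in the TRAINING data exceeds a threshold.
max_missing = 0.5  # assumed cut-off, tune as needed
keep = train.columns[train.isna().mean() <= max_missing]
train, test = train[keep].copy(), test[keep].copy()

# 2. Impute numeric gaps with training medians, so no test row is ever dropped.
num_cols = train.select_dtypes(include="number").columns
medians = train[num_cols].median()
train[num_cols] = train[num_cols].fillna(medians)
test[num_cols] = test[num_cols].fillna(medians)

# 3. One-hot encode, then align the test columns to the training columns.
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

print(X_test)
```

Categories that appear only in the test data (here `NEWBORN`) end up as all-zero dummy rows, which is comparable to what `OneHotEncoder(handle_unknown='ignore')` does inside a scikit-learn pipeline.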

## Data

- The dataset comes from the MIMIC project (https://mimic.physionet.org/).
- MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
- Each row of mimic_train.csv corresponds to one ICU stay (hadm_id+icustay_id) of one patient (subject_id). The column HOSPITAL_EXPIRE_FLAG indicates death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise. The remaining columns correspond to the vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.); their explanation can be found in mimic_patient_metadata.csv.
- There is an extra test dataset, mimic_X_test.csv. Apply your final model to this extra dataset and produce a prediction .csv file in the same format as test_kaggle1.csv.

## Evaluation

The evaluation metric for this competition is ROC AUC (Area Under the Curve). The AUC, commonly used in binary classification models, measures the area under a curve that is obtained by varying the threshold for binary classification (0.5 by default) and computing True Positive Rates and False Positive Rates (http://en.wikipedia.org/wiki/Receiver_operating_characteristic).
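To make the metric concrete, here is a small dependency-free sketch. It relies on a standard equivalent view of ROC AUC: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The function name `roc_auc_by_ranking` and the four-point example are illustrative only; in practice `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_by_ranking(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, no classification threshold needs to be chosen, and rescaling the scores monotonically leaves the AUC unchanged.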

## Main criteria for grading (extended project!)

These components are only related to the extended projects:
+ Code runs - 20%
+ Data preparation - 35%
+ SVM method(s) have been used - 10%
+ Probability of death for each test patient is computed - 10%
+ Accuracy of predictions (in class - kaggle) - 5%
+ Accuracy of predictions for test patients is calculated (kaggle) - 10%
+ Hyperparameter optimization - 10%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%

## Submission file

Submission Format
For every patient in the dataset, submission files should contain two columns: icustay_id (this identifies the individual prediction, and it is taken from the test dataset) and HOSPITAL_EXPIRE_FLAG (a float between 0 and 1, the probability of death).

The file should contain a header and have the following format:

``` {python}
icustay_id,HOSPITAL_EXPIRE_FLAG
2,0.651
5,0.004
6,0.104
etc.
```

1) Create a pandas DataFrame with two columns, one with the test set "icustay_id" values and the other with your predicted "HOSPITAL_EXPIRE_FLAG" for each observation.

2) Use the pandas <i>.to_csv</i> method to create a csv file. Passing <i>index=False</i> is important to ensure the <i>.csv</i> is in the format Kaggle expects.
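The two steps above could look roughly like this. The ids and probabilities are the three example rows from the format shown earlier, used here as hypothetical stand-ins for the real test ids and model output; the file name in the last comment is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-ins: in the notebook these would come from the test set
# and from the fitted model's positive-class probabilities.
test_ids = [2, 5, 6]
death_prob = [0.651, 0.004, 0.104]

submission = pd.DataFrame({
    "icustay_id": test_ids,
    "HOSPITAL_EXPIRE_FLAG": death_prob,
})

# With no path given, to_csv returns the CSV text, which is handy for a quick check.
csv_text = submission.to_csv(index=False)
print(csv_text)

# Writing the actual file:
# submission.to_csv("submission.csv", index=False)
```

Without `index=False`, pandas would write the row index as an extra unnamed first column, which does not match the expected two-column header.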

# 0. Packages

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
# Evaluate performance
from sklearn.metrics import classification_report, confusion_matrix

# 1. Importing the data

In [5]:
import pandas as pd

# Training dataset
df=pd.read_csv('data/mimic_train.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,6733,0,77502,151200,299699,89.0,116.0,102.677419,97.0,150.0,...,-63883.7834,EMERGENCY,Medicare,PROTESTANT QUAKER,DIVORCED,BLACK/AFRICAN AMERICAN,ASTHMA;CHRONIC OBST PULM DISEASE,49121,MICU,6.1397
1,15798,0,44346,140114,250021,74.0,114.0,92.204082,87.0,160.0,...,-56421.13544,EMERGENCY,Private,EPISCOPALIAN,DIVORCED,WHITE,S/P PEDESTRIAN STRUCK,80620,TSICU,10.2897
2,2129,0,92438,118589,288511,59.0,89.0,70.581395,88.0,160.0,...,-60754.35504,EMERGENCY,Medicare,CATHOLIC,MARRIED,WHITE,STERNAL WOUND INFECTION,99859,CSRU,5.808
3,17053,1,83663,125553,278204,75.0,86.0,80.4,74.0,102.0,...,-56609.91884,EMERGENCY,Private,CATHOLIC,MARRIED,BLACK/AFRICAN AMERICAN,SEPSIS,27652,MICU,2.3536
4,11609,0,85941,181409,292581,77.0,107.0,91.020408,95.0,150.0,...,-59200.37377,EMERGENCY,Medicaid,NOT SPECIFIED,DIVORCED,WHITE,BILE LEAK,9974,MICU,19.3935


In [2]:
print(df.head())

   Unnamed: 0  HOSPITAL_EXPIRE_FLAG  subject_id  hadm_id  icustay_id  \
0        6733                     0       77502   151200      299699   
1       15798                     0       44346   140114      250021   
2        2129                     0       92438   118589      288511   
3       17053                     1       83663   125553      278204   
4       11609                     0       85941   181409      292581   

   HeartRate_Min  HeartRate_Max  HeartRate_Mean  SysBP_Min  SysBP_Max  ...  \
0           89.0          116.0      102.677419       97.0      150.0  ...   
1           74.0          114.0       92.204082       87.0      160.0  ...   
2           59.0           89.0       70.581395       88.0      160.0  ...   
3           75.0           86.0       80.400000       74.0      102.0  ...   
4           77.0          107.0       91.020408       95.0      150.0  ...   

          Diff  ADMISSION_TYPE  INSURANCE           RELIGION  MARITAL_STATUS  \
0 -63883.78340    

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 45 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            8000 non-null   int64  
 1   HOSPITAL_EXPIRE_FLAG  8000 non-null   int64  
 2   subject_id            8000 non-null   int64  
 3   hadm_id               8000 non-null   int64  
 4   icustay_id            8000 non-null   int64  
 5   HeartRate_Min         7167 non-null   float64
 6   HeartRate_Max         7167 non-null   float64
 7   HeartRate_Mean        7167 non-null   float64
 8   SysBP_Min             7160 non-null   float64
 9   SysBP_Max             7160 non-null   float64
 10  SysBP_Mean            7160 non-null   float64
 11  DiasBP_Min            7160 non-null   float64
 12  DiasBP_Max            7160 non-null   float64
 13  DiasBP_Mean           7160 non-null   float64
 14  MeanBP_Min            7167 non-null   float64
 15  MeanBP_Max           

In [None]:
# Test dataset (to produce predictions)
data_test=pd.read_csv('mimic_X_test.csv')
data_test.sort_values('icustay_id').head()

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
8505,13837,76603,179633,200024,101.0,123.0,112.222222,75.0,125.0,102.0,...,-42216.12842,EMERGENCY,Medicare,NOT SPECIFIED,MARRIED,BLACK/AFRICAN AMERICAN,URINARY TRACT INFECTION;PNEUMONIA,5070,MICU,0.3812
2976,4817,41710,181955,200028,66.0,113.0,84.232143,74.0,166.0,115.614035,...,-44240.38784,ELECTIVE,Medicare,CATHOLIC,MARRIED,WHITE,FIDELIS LEAD FRACTURE\IMPLANTABLE CARDIOVERER ...,99604,CCU,2.9038
12647,20515,98276,145866,200034,72.0,103.0,87.3125,85.0,161.0,122.15625,...,-63804.09402,ELECTIVE,Private,UNOBTAINABLE,UNKNOWN (DEFAULT),OTHER,BRAIN ANEURYSM/SDA,4373,SICU,3.217
8600,13982,74282,121149,200061,83.0,106.0,96.633333,128.0,161.0,146.925926,...,-45262.81681,EMERGENCY,Private,UNOBTAINABLE,SINGLE,OTHER,PANCREATITIS,5770,SICU,2.0142
1942,3148,85350,122992,200067,68.0,92.0,79.862069,86.0,136.0,108.448276,...,-41760.03551,EMERGENCY,Medicare,CATHOLIC,SINGLE,WHITE,CHRONIC OBSTRUCTIVE PULMONARY DISEASE,49121,MICU,2.8853


In [None]:
# Sample output prediction file
pred_sample=pd.read_csv('test_kaggle1.csv')
pred_sample.sort_values('icustay_id').head()

Unnamed: 0,icustay_id,HOSPITAL_EXPIRE_FLAG
8505,200024,0.19763
2976,200028,0.099115
12647,200034,0.0
8600,200061,0.092823
1942,200067,0.0


# 2. EDA (Exploratory Data Analysis)

# 3. Preprocessing

## 3.1. Handling missing data

## 3.2. Dealing with categorical features

# 4. Feature selection
https://www.kaggle.com/code/prashant111/comprehensive-guide-on-feature-selection

# 5. Building prediction model

# 6. Evaluating results

In [None]:
# Your code here:


# mySVM=.....
# y_hat_test= mySVM.predict_proba(X_test)

In [51]:
# Load your dataset
# df = pd.read_csv('your_data.csv')

# Convert dates and calculate age
df['DOB'] = pd.to_datetime(df['DOB'])
df['ADMITTIME'] = pd.to_datetime(df['ADMITTIME'])
# df['Age'] = (df['ADMITTIME'] - df['DOB']).dt.days / 365.25

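# Drop identifiers, raw dates, diagnosis columns and the target
# Note: icustay_id stays in X (it is needed later for the submission), so the
# pipeline below also treats it as a numeric feature.
columns_to_drop = [
    'Unnamed: 0', 'subject_id', 'hadm_id',
    'DOB', 'DOD', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
    'DIAGNOSIS', 'ICD9_diagnosis', 'HOSPITAL_EXPIRE_FLAG'
]
X = df.drop(columns=columns_to_drop)
y = df['HOSPITAL_EXPIRE_FLAG']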

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Define numerical and categorical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), numerical_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ])

# Full pipeline with SMOTE and SVM
pipeline = imbpipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('svm', SVC(kernel='rbf', class_weight='balanced', random_state=42, probability=True))
])

# Train the model
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.85      0.88      1417
           1       0.26      0.42      0.32       183

    accuracy                           0.80      1600
   macro avg       0.59      0.63      0.60      1600
weighted avg       0.84      0.80      0.82      1600


Confusion Matrix:
[[1203  214]
 [ 107   76]]
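The classification report above is based on hard 0/1 predictions, whereas the competition is scored on ROC AUC. A possible way to tune and assess an SVM directly on that metric is sketched below. It is a self-contained illustration on synthetic data generated with `make_classification`, not the notebook's pipeline: the grid values, fold count and sample size are assumptions chosen to keep the example fast, and SMOTE is left out for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the MIMIC features (about 10% positives).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])

# Small illustrative grid; the values are assumptions, not recommendations.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_tr, y_tr)

# decision_function scores are enough for AUC, which only depends on ranking.
holdout_auc = roc_auc_score(y_ho, search.decision_function(X_ho))
print(search.best_params_, round(search.best_score_, 3), round(holdout_auc, 3))
```

For the Kaggle file itself, probabilities are still needed, so the selected configuration would have to be refitted with `probability=True` (as in the pipeline above) before calling `predict_proba`.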


In [17]:
df_test=pd.read_csv('data/mimic_X_test.csv')
df_test

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,1,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,...,-61961.78470,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,2,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,...,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,3,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,...,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,4,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,...,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,5,28424,127337,225281,,,,,,,...,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12880,20879,28519,140024,241050,,,,,,,...,-62710.83211,EMERGENCY,Medicare,GREEK ORTHODOX,MARRIED,WHITE,CEREBROVASCULAR ACCIDENT,431,SICU,1.0487
12881,20880,2338,145012,233204,80.0,85.0,81.947368,81.0,107.0,94.176471,...,-54809.57459,EMERGENCY,Private,NOT SPECIFIED,MARRIED,WHITE,S/P ARREST,4271,MICU,0.8011
12882,20881,28043,135417,244530,65.0,92.0,78.500000,60.0,160.0,110.976190,...,-60714.92678,EMERGENCY,Medicare,CATHOLIC,MARRIED,WHITE,ALTERED MENTAL STATUS,3229,MICU,11.6116
12883,20883,47492,152608,274507,58.0,97.0,76.933333,94.0,131.0,112.037037,...,-39830.10848,EMERGENCY,Private,PROTESTANT QUAKER,DIVORCED,BLACK/AFRICAN AMERICAN,HYPOGLYCEMIA,24980,MICU,1.8830


In [54]:
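# Reload the Kaggle test set and drop the same columns as for training
# (the test file has no HOSPITAL_EXPIRE_FLAG column).
# Note: this reuses the name X_test, replacing the hold-out split from above.
df_test = pd.read_csv('data/mimic_X_test.csv')
columns_to_drop = [
    'Unnamed: 0', 'subject_id', 'hadm_id',
    'DOB', 'DOD', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
    'DIAGNOSIS', 'ICD9_diagnosis'
]
X_test = df_test.drop(columns=columns_to_drop)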

In [23]:
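# Class probabilities for the Kaggle test set, one column per class (0, 1).
# Note: this overwrites the hold-out labels y_test from the earlier split.
y_test = pipeline.predict_proba(X_test)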

In [24]:
y_test

array([[0.18628185, 0.81371815],
       [0.99615267, 0.00384733],
       [0.82009484, 0.17990516],
       ...,
       [0.02712411, 0.97287589],
       [0.96780634, 0.03219366],
       [0.99817909, 0.00182091]])

In [27]:
# .copy() so that a column can be added later without a SettingWithCopyWarning
submission = df_test[['icustay_id']].copy()

In [28]:
submission

Unnamed: 0,icustay_id
0,228357
1,221004
2,296315
3,245557
4,225281
...,...
12880,241050
12881,233204
12882,244530
12883,274507


In [30]:
test_list = list(y_test)

In [36]:
probabilities_test = pd.DataFrame(data = y_test)
probabilities_test

Unnamed: 0,0,1
0,0.186282,0.813718
1,0.996153,0.003847
2,0.820095,0.179905
3,0.921504,0.078496
4,0.264276,0.735724
...,...,...
12880,0.912136,0.087864
12881,0.940210,0.059790
12882,0.027124,0.972876
12883,0.967806,0.032194


In [37]:
# The DataFrame columns are labelled with the integers 0 and 1, not the string '1'
submission['HOSPITAL_EXPIRE_FLAG'] = probabilities_test[1]
submission

In [55]:
# Predictions (use probability for the positive class)
y_prob = pipeline.predict_proba(X_test)[:, 1]

# Build the submission table (it is written to CSV in a later cell)
submission = pd.DataFrame({'icustay_id': X_test['icustay_id'], 'HOSPITAL_EXPIRE_FLAG': y_prob})

In [56]:
submission

Unnamed: 0,icustay_id,HOSPITAL_EXPIRE_FLAG
0,228357,0.833339
1,221004,0.002876
2,296315,0.241426
3,245557,0.071118
4,225281,0.580200
...,...,...
12880,241050,0.095118
12881,233204,0.037649
12882,244530,0.974699
12883,274507,0.049863


In [57]:
submission.to_csv('svm_predictions.csv', index=False)