# Project: SVM CLASSIFICATION
Kaggle competition: https://www.kaggle.com/competitions/cmldl-bse-2425-svm/overview

## Programming project: probability of death

In this project, you will predict the probability of death for a patient entering an ICU (Intensive Care Unit).

The dataset comes from the MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* corresponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). The column *HOSPITAL_EXPIRE_FLAG* indicates death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to the vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.). They are described in *mimic_patient_metadata.csv*.

Do not use any feature that would not be known on a patient's first day in the ICU.

Note that the main cause/disease behind the patient's condition is encoded in the *ICD9_diagnosis* column. The meaning of this code can be found in *MIMIC_metadata_diagnose.csv*. **But** this is only the main diagnosis; a patient can have co-occurring diseases (comorbidities). These secondary codes can be found in *extra_data/MIMIC_diagnoses.csv*.
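One possible way to use the secondary codes is to count how many distinct diagnoses each admission has and join that count onto the stays. The sketch below runs on toy data; the column names of the diagnoses table (`HADM_ID`, `ICD9_CODE`) and the feature name `n_diagnoses` are assumptions, so check them against the real file before reusing the idea.

```python
import pandas as pd

# Toy stand-ins for mimic_train.csv and extra_data/MIMIC_diagnoses.csv.
# HADM_ID and ICD9_CODE are assumed column names, not taken from the real file.
stays = pd.DataFrame({
    "hadm_id": [101, 102, 103],
    "ICD9_diagnosis": ["49121", "0389", "4373"],
})
diagnoses = pd.DataFrame({
    "HADM_ID": [101, 101, 101, 102, 102],
    "ICD9_CODE": ["49121", "4280", "5849", "0389", "5849"],
})

# Number of distinct diagnosis codes recorded for each hospital admission.
n_codes = (
    diagnoses.groupby("HADM_ID")["ICD9_CODE"]
    .nunique()
    .rename("n_diagnoses")
    .reset_index()
)

# A left join keeps every stay; admissions without any listed code get 0.
stays = stays.merge(n_codes, how="left", left_on="hadm_id", right_on="HADM_ID")
stays["n_diagnoses"] = stays["n_diagnoses"].fillna(0).astype(int)
stays = stays.drop(columns="HADM_ID")
print(stays)
```

The same join has to be applied to the test set so that both tables end up with identical feature columns.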

As the performance metric, you can use *AUC* for the binary classification case, but feel free to report any other metric as well if you can justify that it is particularly suitable for this case.
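As a minimal illustration of the metric, the snippet below computes AUC with scikit-learn on made-up labels and scores. Average precision is included only as one example of an alternative metric that is sometimes preferred when the positive class is rare; it is not prescribed by the project.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and predicted probabilities (made up for illustration only).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.10, 0.35, 0.80, 0.20, 0.40, 0.05, 0.50, 0.90])

# AUC depends only on how the scores rank the patients, not on calibration.
auc = roc_auc_score(y_true, y_prob)

# Average precision summarizes the precision-recall curve.
ap = average_precision_score(y_true, y_prob)

print(f"AUC = {auc:.3f}, average precision = {ap:.3f}")
```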

The main tasks are:
+ Using the *mimic_train.csv* file, build a predictive model for *HOSPITAL_EXPIRE_FLAG*.
+ For this analysis there is an extra test dataset, *mimic_X_test.csv*. Apply your final model to this extra dataset and generate predictions following the same format as *test_kaggle1.csv*. Once ready, you can submit to our Kaggle competition and iterate to improve the accuracy.

As a *bonus*, try different algorithms for neighbor search and for distance, and justify your final selection. Also try different weights to cope with class imbalance and to balance neighbor proximity. Try to assess a confidence interval for your predictions.
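One possible reading of the confidence-interval item is to quantify the uncertainty of the validation AUC with a percentile bootstrap. The sketch below uses synthetic labels and scores; `bootstrap_auc_ci` is a hypothetical helper written for this example, not part of any library.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic validation labels and scores (illustration only).
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

def bootstrap_auc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, rng=rng):
    """Percentile bootstrap interval for the AUC (hypothetical helper)."""
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when only one class is drawn
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lower, upper

lower, upper = bootstrap_auc_ci(y_true, y_prob)
print(f"AUC = {roc_auc_score(y_true, y_prob):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

This interval describes the metric rather than individual predictions; per-patient intervals would need a different approach, such as refitting the model on bootstrap samples of the training data.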

You can follow these **steps** in your first implementation:
1. *Explore* and understand the dataset.
2. Manage missing data: discard any non-numeric columns and columns with a high proportion of missing data. Find a simple way to impute or remove any remaining missing data. Remember that you cannot remove test set observations, or else Kaggle won't be able to give you a score.

3. Manage categorical features, e.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.

4. Do any further necessary preprocessing.

5. Build a prediction model. Fit an SVM model, explicitly setting the arguments we have talked about in class.

6. Assess expected accuracy and tune your models using *cross-validation*.

7. Test the performance on the test file and report accuracy, following the same preparation steps (missing data, dummies, etc.). Remember that you should be able to yield a prediction for every row of the test dataset.

8. Submit to Kaggle and receive a score.
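The steps above can be sketched as a single scikit-learn pipeline. This is only one possible arrangement, run on synthetic data that borrows three column names from the training file. The imputation strategy, the hyperparameter grid and the made-up outcome are all assumptions chosen to keep the example small, not the settings discussed in class.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for mimic_train.csv (values are random, not real data).
X = pd.DataFrame({
    "HeartRate_Mean": rng.normal(85, 15, n),
    "SysBP_Mean": rng.normal(120, 20, n),
    "ADMISSION_TYPE": rng.choice(["EMERGENCY", "ELECTIVE", "URGENT"], n),
})
# Made-up outcome: higher heart rate and lower blood pressure raise the risk.
risk = 0.06 * (X["HeartRate_Mean"] - 85) - 0.05 * (X["SysBP_Mean"] - 120) - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-risk))).astype(int)
# Knock out some values to mimic missing vitals.
X.loc[rng.random(n) < 0.1, "HeartRate_Mean"] = np.nan

numeric = ["HeartRate_Mean", "SysBP_Mean"]
categorical = ["ADMISSION_TYPE"]

# Steps 2-4: impute and scale numeric columns, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Step 5: SVM with probability estimates and class weights for imbalance.
model = Pipeline([
    ("prep", preprocess),
    ("svm", SVC(kernel="rbf", probability=True,
                class_weight="balanced", random_state=0)),
])

# Step 6: tune C and gamma with stratified cross-validation, scored by AUC.
search = GridSearchCV(
    model,
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Column 1 of predict_proba is the estimated probability of class 1 (death).
proba = search.predict_proba(X)[:, 1]
```

Keeping the preprocessing inside the pipeline means the imputer, scaler and encoder are refitted on each training fold, and the fitted `search` object applies exactly the same transformations to the test file when `predict_proba` is called on it.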

Feel free to reduce the training dataset if you experience computational constraints.
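If you do reduce the training set, a stratified subsample keeps the proportion of deaths unchanged. The sketch below shows one way to do this with `train_test_split` on synthetic data; the 25% fraction is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the training table (roughly 10% positives).
data = pd.DataFrame({
    "HeartRate_Mean": rng.normal(85, 15, 2000),
    "HOSPITAL_EXPIRE_FLAG": (rng.random(2000) < 0.1).astype(int),
})

# Keep 25% of the rows while preserving the proportion of deaths.
subset, _ = train_test_split(
    data,
    train_size=0.25,
    stratify=data["HOSPITAL_EXPIRE_FLAG"],
    random_state=0,
)

print(len(subset),
      round(subset["HOSPITAL_EXPIRE_FLAG"].mean(), 3),
      round(data["HOSPITAL_EXPIRE_FLAG"].mean(), 3))
```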



## Main criteria for grading (extended project!)
These components are only related to the extended projects:
+ Code runs - 20%
+ Data preparation - 35%
+ SVM method(s) have been used - 10%
+ Probability of death for each test patient is computed - 10%
+ Accuracy of predictions (in class - kaggle) - 5%
+ Accuracy of predictions for test patients is calculated (kaggle) - 10%
+ Hyperparameter optimization - 10%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%

In [None]:
import pandas as pd

# Training dataset

## Mount Google Drive if running from Google Colab
#from google.colab import drive
#drive.mount('/content/drive')
#import os
#os.chdir('/content/drive/MyDrive/.../kaggleData')  # choose the folder that contains all the Kaggle data
data = pd.read_csv('mimic_train.csv')
data.head()



Unnamed: 0.1,Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,6733,0,77502,151200,299699,89.0,116.0,102.677419,97.0,150.0,...,-63883.7834,EMERGENCY,Medicare,PROTESTANT QUAKER,DIVORCED,BLACK/AFRICAN AMERICAN,ASTHMA;CHRONIC OBST PULM DISEASE,49121,MICU,6.1397
1,15798,0,44346,140114,250021,74.0,114.0,92.204082,87.0,160.0,...,-56421.13544,EMERGENCY,Private,EPISCOPALIAN,DIVORCED,WHITE,S/P PEDESTRIAN STRUCK,80620,TSICU,10.2897
2,2129,0,92438,118589,288511,59.0,89.0,70.581395,88.0,160.0,...,-60754.35504,EMERGENCY,Medicare,CATHOLIC,MARRIED,WHITE,STERNAL WOUND INFECTION,99859,CSRU,5.808
3,17053,1,83663,125553,278204,75.0,86.0,80.4,74.0,102.0,...,-56609.91884,EMERGENCY,Private,CATHOLIC,MARRIED,BLACK/AFRICAN AMERICAN,SEPSIS,27652,MICU,2.3536
4,11609,0,85941,181409,292581,77.0,107.0,91.020408,95.0,150.0,...,-59200.37377,EMERGENCY,Medicaid,NOT SPECIFIED,DIVORCED,WHITE,BILE LEAK,9974,MICU,19.3935


In [None]:
# Test dataset (to produce predictions)
data_test = pd.read_csv('mimic_X_test.csv')
data_test.sort_values('icustay_id').head()

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
8505,13837,76603,179633,200024,101.0,123.0,112.222222,75.0,125.0,102.0,...,-42216.12842,EMERGENCY,Medicare,NOT SPECIFIED,MARRIED,BLACK/AFRICAN AMERICAN,URINARY TRACT INFECTION;PNEUMONIA,5070,MICU,0.3812
2976,4817,41710,181955,200028,66.0,113.0,84.232143,74.0,166.0,115.614035,...,-44240.38784,ELECTIVE,Medicare,CATHOLIC,MARRIED,WHITE,FIDELIS LEAD FRACTURE\IMPLANTABLE CARDIOVERER ...,99604,CCU,2.9038
12647,20515,98276,145866,200034,72.0,103.0,87.3125,85.0,161.0,122.15625,...,-63804.09402,ELECTIVE,Private,UNOBTAINABLE,UNKNOWN (DEFAULT),OTHER,BRAIN ANEURYSM/SDA,4373,SICU,3.217
8600,13982,74282,121149,200061,83.0,106.0,96.633333,128.0,161.0,146.925926,...,-45262.81681,EMERGENCY,Private,UNOBTAINABLE,SINGLE,OTHER,PANCREATITIS,5770,SICU,2.0142
1942,3148,85350,122992,200067,68.0,92.0,79.862069,86.0,136.0,108.448276,...,-41760.03551,EMERGENCY,Medicare,CATHOLIC,SINGLE,WHITE,CHRONIC OBSTRUCTIVE PULMONARY DISEASE,49121,MICU,2.8853


In [None]:
# Sample output prediction file
pred_sample = pd.read_csv('test_kaggle1.csv')
pred_sample.sort_values('icustay_id').head()

Unnamed: 0,icustay_id,HOSPITAL_EXPIRE_FLAG
8505,200024,0.19763
2976,200028,0.099115
12647,200034,0.0
8600,200061,0.092823
1942,200067,0.0


In [None]:
# Your code here:


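# mySVM = ...  (e.g. an sklearn SVC created with probability=True, which predict_proba requires)
# y_hat_test = mySVM.predict_proba(X_test)[:, 1]  # column 1 = probability of death (class 1)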

### Kaggle Predictions Submissions

Once you have produced test-set predictions, you can submit them to *Kaggle* to see how your model performs.

The following code shows an example of generating a *.csv* file to submit to Kaggle:
1) Create a pandas DataFrame with two columns: one with the test-set *icustay_id* values and the other with your predicted *HOSPITAL_EXPIRE_FLAG* for each observation.

2) Use the pandas *.to_csv* method to create a CSV file. Setting *index = False* is important to ensure that the *.csv* file is in the format Kaggle expects.

In [None]:
# Produce .csv for kaggle testing
test_predictions_submit = pd.DataFrame({"icustay_id": data_test["icustay_id"], "HOSPITAL_EXPIRE_FLAG": y_hat_test})
test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)
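Before uploading, it can help to check that the submission has the expected layout. The sketch below uses a hypothetical helper, `check_submission`, and toy ids and predictions in place of `data_test` and `y_hat_test`. The checks assume the two-column format of *test_kaggle1.csv* and that predictions are probabilities in [0, 1].

```python
import numpy as np
import pandas as pd

def check_submission(submission, test_ids):
    """Raise AssertionError if the submission does not match the expected layout."""
    assert list(submission.columns) == ["icustay_id", "HOSPITAL_EXPIRE_FLAG"]
    assert len(submission) == len(test_ids), "one row per test stay is required"
    assert set(submission["icustay_id"]) == set(test_ids), "ids must match the test set"
    probs = submission["HOSPITAL_EXPIRE_FLAG"]
    assert probs.notna().all(), "missing predictions"
    assert probs.between(0, 1).all(), "probabilities must lie in [0, 1]"

# Toy ids and predictions standing in for data_test["icustay_id"] and y_hat_test.
test_ids = pd.Series([200024, 200028, 200034])
y_hat = np.array([0.20, 0.10, 0.05])
submission = pd.DataFrame({"icustay_id": test_ids, "HOSPITAL_EXPIRE_FLAG": y_hat})

check_submission(submission, test_ids)
print("submission looks well formed")
```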