# HW1

## Overview

Preparing the data, computing basic statistics, and constructing simple models are essential steps in data science practice. In this homework, you will use clinical data as raw input to perform **Heart Failure Prediction**. **Python** programming is required. Use the attached skeleton code as a starting point for the programming questions.

This homework assumes familiarity with Pandas. If you need a Pandas crash course, we recommend working through [100 Pandas Puzzles](https://github.com/ajcr/100-pandas-puzzles); the solutions are also available at that link.

In [2]:
import os
import sys

DATA_PATH = "../HW1-lib/data/"
TRAIN_DATA_PATH = DATA_PATH + "train/"
VAL_DATA_PATH = DATA_PATH + "val/"
    
sys.path.append("../HW1-lib")

## About Raw Data

For this homework, we will be using a clinical dataset synthesized from [MIMIC-III](https://www.nature.com/articles/sdata201635).

Navigate to `TRAIN_DATA_PATH`. It contains three CSV files that serve as the input data for this homework.

In [None]:
!ls $TRAIN_DATA_PATH

**events.csv**

The data provided in *events.csv* are event sequences. Each line of this file consists of a tuple with the format *(pid, event_id, vid, value)*. 

For example, 

```
33,DIAG_244,0,1
33,DIAG_414,0,1
33,DIAG_427,0,1
33,LAB_50971,0,1
33,LAB_50931,0,1
33,LAB_50812,1,1
33,DIAG_425,1,1
33,DIAG_427,1,1
33,DRUG_0,1,1
33,DRUG_3,1,1
```

- **pid**: De-identified patient identifier. For example, the patient in the example above has pid 33.
- **event_id**: Clinical event identifier. For example, DIAG_244 means the patient was diagnosed with a disease with ICD9 code [244](http://www.icd9data.com/2013/Volume1/240-279/240-246/244/244.htm); LAB_50971 means that the laboratory test with code 50971 was conducted on the patient; and DRUG_0 means that a drug with code 0 was prescribed to the patient. The corresponding lab and drug names can be found in `{DATA_PATH}/lab_list.txt` and `{DATA_PATH}/drug_list.txt`, respectively.
- **vid**: Visit identifier. For example, the patient above has two visits in total (vid 0 and vid 1). Note that vid is ordinal: visits with a larger vid occur after those with a smaller vid.
- **value**: Contains the value associated with an event (always 1 in the synthesized dataset).
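
To make the tuple format concrete, the following sketch loads the ten example rows above into a DataFrame and splits each `event_id` into its type prefix and code. It assumes the rows carry no header line, so the column names are passed explicitly; the real *events.csv* may already include a header. The `event_type` and `code` columns are illustrative names and are not part of the dataset.

```python
import io

import pandas as pd

# The ten example rows shown above (assumed here to have no header line).
raw = """33,DIAG_244,0,1
33,DIAG_414,0,1
33,DIAG_427,0,1
33,LAB_50971,0,1
33,LAB_50931,0,1
33,LAB_50812,1,1
33,DIAG_425,1,1
33,DIAG_427,1,1
33,DRUG_0,1,1
33,DRUG_3,1,1
"""

events_example = pd.read_csv(
    io.StringIO(raw), names=["pid", "event_id", "vid", "value"]
)

# Split e.g. "DIAG_244" into a type prefix ("DIAG") and a code ("244").
# `event_type` and `code` are illustrative column names.
events_example[["event_type", "code"]] = events_example["event_id"].str.split(
    "_", n=1, expand=True
)

# Number of events of each type for this patient.
type_counts = events_example["event_type"].value_counts().to_dict()
print(type_counts)
```

Keeping the prefix in its own column makes it easy to filter by event type, for example to look only at diagnoses.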

**hf_events.csv**

The data provided in *hf_events.csv* contains the pids of patients who have been diagnosed with heart failure (i.e., DIAG_398, DIAG_402, DIAG_404, DIAG_428) in at least one visit. Each line is a tuple with the format *(pid, vid, label)*. For example,

```
156,0,1
181,1,1
```

The vid indicates the index of the first visit with heart failure of that patient and a label of 1 indicates the presence of heart failure. **Note that only patients with heart failure are included in this file. Patients who are not mentioned in this file have never been diagnosed with heart failure.**
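
Because the file lists only positive cases, a label for every patient has to be derived by treating absent pids as negative. The sketch below shows one way to do this with a left merge. The two heart failure rows are the example rows above; patient 33 is assumed, for illustration only, to be absent from *hf_events.csv*, and the frame names `all_pids` and `hf_example` are hypothetical.

```python
import pandas as pd

# Patients seen in the event data (illustrative list of pids).
all_pids = pd.DataFrame({"pid": [33, 156, 181]})

# The two example rows from hf_events.csv shown above.
hf_example = pd.DataFrame({"pid": [156, 181], "vid": [0, 1], "label": [1, 1]})

# A left merge keeps every patient; those missing from hf_example get NaN.
labeled = all_pids.merge(hf_example[["pid", "label"]], on="pid", how="left")

# Patients not mentioned in the file never had heart failure, so NaN becomes 0.
labeled["label"] = labeled["label"].fillna(0).astype(int)
print(labeled)
```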

**event_feature_map.csv**

The *event_feature_map.csv* is a map from an event_id to an integer index. This file contains *(idx, event_id)* pairs for all event ids.
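
As a rough illustration of how such a map might be used, the sketch below converts event ids into integer indices with a dictionary lookup. The index values here are hypothetical; the real ones come from *event_feature_map.csv*.

```python
import pandas as pd

# Hypothetical (idx, event_id) pairs; the real indices are defined in
# event_feature_map.csv and will differ from these.
feature_map_example = pd.DataFrame(
    {"idx": [0, 1, 2], "event_id": ["DIAG_244", "LAB_50971", "DRUG_0"]}
)

# Build an event_id -> idx lookup table.
event_to_idx = dict(
    zip(feature_map_example["event_id"], feature_map_example["idx"])
)

# Translate a sequence of event ids into their integer indices.
event_ids = pd.Series(["DRUG_0", "DIAG_244", "LAB_50971"])
indices = event_ids.map(event_to_idx).tolist()
print(indices)
```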

## 1 Descriptive Statistics [20 points]

Before starting analytic modeling, it is good practice to compute descriptive statistics of the raw input data. In this question, you will write code that computes various metrics on the data described above. Skeleton code is provided as a starting point.

The terms used in the result table are defined below:

- **Event count**: Number of events recorded for a given patient.
- **Encounter count**: Number of visits recorded for a given patient.

Note that every line in the input file is an event, while each visit consists of multiple events.
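
The difference between the two counts can be seen on a toy example. The sketch below uses the *(pid, vid)* pairs of the ten example rows for patient 33 shown earlier, plus a hypothetical patient 99 added only for contrast. It is an illustration of the definitions, not a required approach.

```python
import pandas as pd

# (pid, vid) pairs: ten rows for patient 33 from the earlier example,
# plus two rows for a hypothetical patient 99 with a single visit.
toy = pd.DataFrame(
    {
        "pid": [33] * 10 + [99] * 2,
        "vid": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] + [0, 0],
    }
)

# Event count = number of rows per patient;
# encounter count = number of distinct visit ids per patient.
per_patient = toy.groupby("pid").agg(
    event_count=("vid", "size"),
    encounter_count=("vid", "nunique"),
)
print(per_patient)
```

Patient 33 has ten events spread over two visits, so the event count is 10 while the encounter count is 2.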

**Complete the following code cell to implement the required statistics.**

Please be aware that **you are NOT allowed to change the filename or any existing function declarations.** Only `numpy`, `scipy`, `scikit-learn`, `pandas`, and Python's built-in modules will be available for you to use. We suggest using the `pandas` library.

In [None]:
import time
import pandas as pd
import numpy as np
import datetime

# PLEASE USE THE GIVEN FUNCTION NAME, DO NOT CHANGE IT.

def read_csv(filepath=TRAIN_DATA_PATH):

    '''
    Read the events.csv and hf_events.csv files. 
    Variables returned from this function are passed as input to the metric functions.
    
    NOTE: remember to use `filepath` whose default value is `TRAIN_DATA_PATH`.
    '''
    
    events = pd.read_csv(filepath + 'events.csv')
    hf = pd.read_csv(filepath + 'hf_events.csv')

    return events, hf

def event_count_metrics(events, hf):

    '''
    TODO : Implement this function to return the event count metrics.
    
    Event count is defined as the number of events recorded for a given patient.
    '''
    ## your code here
    
    # Count events per patient
    event_counts = events['pid'].value_counts().reset_index()
    event_counts.columns = ['pid', 'event_count']
    
    # Merge event counts with HF status
    patient_df = event_counts.merge(hf, on='pid', how='left')
    normal_patients = patient_df[patient_df['label'].isna()]
    hf_patients = patient_df[patient_df['label']==1]
    
    # Calculate metrics for HF patients
    avg_hf_event_count = hf_patients['event_count'].mean() if not hf_patients.empty else None
    max_hf_event_count = hf_patients['event_count'].max() if not hf_patients.empty else None
    min_hf_event_count = hf_patients['event_count'].min() if not hf_patients.empty else None
    
    # Calculate metrics for normal patients
    avg_norm_event_count = normal_patients['event_count'].mean() if not normal_patients.empty else None
    max_norm_event_count = normal_patients['event_count'].max() if not normal_patients.empty else None
    min_norm_event_count = normal_patients['event_count'].min() if not normal_patients.empty else None

    return avg_hf_event_count, max_hf_event_count, min_hf_event_count, \
           avg_norm_event_count, max_norm_event_count, min_norm_event_count

def encounter_count_metrics(events, hf):

    '''
    TODO : Implement this function to return the encounter count metrics.
    
    Encounter count is defined as the number of visits recorded for a given patient. 
    '''
    # your code here
    
    vid_counts = events.groupby('pid')['vid'].nunique().reset_index()
    vid_counts.columns = ['pid', 'encounter_count']

    patient_df = vid_counts.merge(hf, on='pid', how='left')
    normal_patients = patient_df[patient_df['label'].isna()]
    hf_patients = patient_df[patient_df['label']==1]
    
    avg_hf_encounter_count = hf_patients['encounter_count'].mean() if not hf_patients.empty else None
    max_hf_encounter_count = hf_patients['encounter_count'].max() if not hf_patients.empty else None
    min_hf_encounter_count = hf_patients['encounter_count'].min() if not hf_patients.empty else None
    avg_norm_encounter_count = normal_patients['encounter_count'].mean() if not normal_patients.empty else None
    max_norm_encounter_count = normal_patients['encounter_count'].max() if not normal_patients.empty else None
    min_norm_encounter_count = normal_patients['encounter_count'].min() if not normal_patients.empty else None
    
    return avg_hf_encounter_count, max_hf_encounter_count, min_hf_encounter_count, \
           avg_norm_encounter_count, max_norm_encounter_count, min_norm_encounter_count

In [None]:
events, hf = read_csv(TRAIN_DATA_PATH)

#Compute the event count metrics
start_time = time.time()
event_count = event_count_metrics(events, hf)
end_time = time.time()
print(f"Time to compute event count metrics: {end_time - start_time}s")
print(event_count)

#Compute the encounter count metrics
start_time = time.time()
encounter_count = encounter_count_metrics(events, hf)
end_time = time.time()
print(f"Time to compute encounter count metrics: {end_time - start_time}s")
print(encounter_count)