# Introduction

Health care is moving towards a value-based care model that rewards health care providers for delivering quality care, as measured by improvements in patients' health outcomes. This model involves understanding the health care needs of patients, integrating a multidisciplinary team to develop solutions that impact population health, and measuring patients' health outcomes to drive ongoing quality improvement projects.

With the advent of electronic health data and recent innovations in machine learning, health care professionals have used large and complex clinical data to answer pressing problems. Machine learning methods have been used to improve patient risk stratification for specific infections, predict disease, and streamline hospital operations.

The Centers for Medicare & Medicaid Services (CMS) has created various value-based programs that incentivize health care providers to provide better care and value to patients. For instance, the [Hospital Readmissions Reduction Program](https://www.cms.gov/index.php/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/Value-Based-Programs/HRRP/Hospital-Readmission-Reduction-Program) (HRRP) focuses on improving care coordination in order to reduce readmission rates, lowering health care costs for both the provider and the patient.

The program defines readmission as:

1. unplanned readmissions that occur within 30 days of discharge from the initial admission, and
2. readmissions to the same hospital, or to another applicable acute care hospital, for any reason.

This notebook will focus on applying classification models to identify patients who are at risk for readmission, as defined by the HRRP. The data used for this project can be found in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).
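As a minimal sketch of how the 30-day definition could be turned into a binary classification target, the example below uses a toy frame rather than the real data. It assumes the `readmitted` column uses the labels `<30`, `>30`, and `NO`; those labels and the `readmitted_30d` column name are assumptions for illustration and are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the `readmitted` column.
# Assumption: the labels are '<30', '>30', and 'NO'.
toy = pd.DataFrame({"readmitted": ["NO", ">30", "<30", "NO", "<30"]})

# 1 if the patient was readmitted within 30 days, 0 otherwise.
# `readmitted_30d` is a hypothetical column name.
toy["readmitted_30d"] = (toy["readmitted"] == "<30").astype(int)

print(toy)
```

Readmissions after 30 days are grouped with non-readmissions here because they fall outside the HRRP window.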

## Data Set Information

The data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. This data has been used to study the impact of HbA1c measurement on hospital readmission rates (see [here](https://www.hindawi.com/journals/bmri/2014/781670/)).

As mentioned in the article, the data represents 10 years of clinical care at 130 US hospitals and integrated delivery networks. The data has been de-identified, and each observation in the dataset satisfies the following criteria:

1. an inpatient encounter (i.e., hospital admission),
2. a diabetic encounter (i.e., any kind of diabetes was recorded as one of the patient's diagnoses),
3. the length of stay was at least 1 day and at most 14 days,
4. laboratory tests were performed during the encounter, and
5. medications were administered during the encounter.
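Some of these criteria can be sanity-checked once the data is loaded. The sketch below does so on a toy frame. It assumes that `time_in_hospital`, `num_lab_procedures`, and `num_medications` (column names that appear in this dataset) correspond to criteria 3, 4, and 5; that mapping is an assumption, not something the article states.

```python
import pandas as pd

# Toy rows standing in for a few encounters.
toy = pd.DataFrame({
    "time_in_hospital": [1, 3, 14, 7],
    "num_lab_procedures": [41, 59, 11, 44],
    "num_medications": [1, 18, 13, 16],
})

# Each check is True only if every row satisfies the criterion.
checks = {
    "stay_1_to_14_days": bool(toy["time_in_hospital"].between(1, 14).all()),
    "has_lab_tests": bool((toy["num_lab_procedures"] > 0).all()),
    "has_medications": bool((toy["num_medications"] > 0).all()),
}

print(checks)
```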

More information about the data can be found in the article in the following [link](https://www.hindawi.com/journals/bmri/2014/781670/).

# Methodology

## Data Exploration

In [1]:
# Importing packages.
import pandas as pd
import numpy as np

In [2]:
# Load data into a pandas dataframe.
diabetes_df = pd.read_csv("diabetic_data.csv", low_memory=False)

# Returning number of observations and features of the data.
print(f'The dataset has {diabetes_df.shape[0]} observations and {diabetes_df.shape[1]} features.')

The dataset has 101766 observations and 50 features.


In [4]:
# Returning data types for the features.
diabetes_df.dtypes.value_counts()

object    37
int64     13
dtype: int64

In [3]:
# Returning categorical features.
diabetes_df.columns[diabetes_df.dtypes=='object']

Index(['race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty',
       'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [6]:
# Returning numerical features.
diabetes_df.columns[diabetes_df.dtypes!='object']

Index(['encounter_id', 'patient_nbr', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient',
       'number_diagnoses'],
      dtype='object')

Of our 50 features, 37 are `object` (categorical) features and 13 are `int64` (numerical) features.
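The same split can also be obtained with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. This is a minimal sketch on a toy frame. The toy values are made up, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd

# Toy frame with two text columns and one integer column.
toy = pd.DataFrame({
    "race": ["Caucasian", "Asian"],
    "gender": ["Female", "Male"],
    "time_in_hospital": [1, 3],
})

# Numeric columns, and everything that is not numeric.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```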

A quick glance at our features reveals several things that set the direction for our analysis:
* `diabetes_df['age']` is stored as `object` because it holds interval labels such as `[0-10)`, but should be converted to `int64`;
* `diabetes_df['readmitted']` is our output value that will be used for our classification models;
* `diabetes_df['encounter_id']` and `diabetes_df['patient_nbr']` are patient identifiers that are not useful for analysis;
* `diabetes_df['admission_type_id']`, `diabetes_df['discharge_disposition_id']`, and `diabetes_df['admission_source_id']` are seen as `int64`, but should be `object`.

Data cleaning is therefore needed to convert our features to the proper data types and to remove predictors that aren't useful for our analysis.
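The cleaning steps listed above might look roughly like the sketch below, shown on a toy frame rather than the full dataset. Two choices here are assumptions and not the notebook's method: the ID-coded columns are cast to `category` (casting to `object` would also work), and each `age` interval is represented by its lower bound.

```python
import pandas as pd

# Toy frame using a handful of the dataset's column names.
toy = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "patient_nbr": [9001, 9002, 9003],
    "admission_type_id": [1, 3, 1],
    "discharge_disposition_id": [1, 6, 1],
    "admission_source_id": [7, 1, 7],
    "age": ["[0-10)", "[10-20)", "[70-80)"],
})

id_cols = ["encounter_id", "patient_nbr"]
code_cols = ["admission_type_id", "discharge_disposition_id", "admission_source_id"]

# Drop identifiers that carry no predictive meaning.
cleaned = toy.drop(columns=id_cols)

# Treat the integer-coded ID columns as categorical rather than numeric.
cleaned = cleaned.astype({col: "category" for col in code_cols})

# Assumption: represent each age interval by its lower bound, e.g. "[70-80)" -> 70.
cleaned["age"] = cleaned["age"].str.extract(r"\[(\d+)-", expand=False).astype(int)

print(cleaned.dtypes)
```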

Let's take a closer look at our features.

### Categorical Features

In [4]:
# Assigning categorical features into a variable.
categorical_cols = diabetes_df.columns[diabetes_df.dtypes=='object']

# Assigning an array of unique values for each categorical feature.
unique_cat_cols = [diabetes_df[i].unique() for i in categorical_cols]

# Creating a dataframe to view results.
pd.DataFrame({'Unique Values': unique_cat_cols}, index=categorical_cols)

| | Unique Values |
| --- | --- |
| race | `[Caucasian, AfricanAmerican, ?, Other, Asian, ...` |
| gender | `[Female, Male, Unknown/Invalid]` |
| age | `[[0-10), [10-20), [20-30), [30-40), [40-50), [...` |
| weight | `[?, [75-100), [50-75), [0-25), [100-125), [25-...` |
| payer_code | `[?, MC, MD, HM, UN, BC, SP, CP, SI, DM, CM, CH...` |
| medical_specialty | `[Pediatrics-Endocrinology, ?, InternalMedicine...` |
| diag_1 | `[250.83, 276, 648, 8, 197, 414, 428, 398, 434,...` |
| diag_2 | `[?, 250.01, 250, 250.43, 157, 411, 492, 427, 1...` |
| diag_3 | `[?, 255, V27, 403, 250, V45, 38, 486, 996, 197...` |
| max_glu_serum | `[None, >300, Norm, >200]` |


In [41]:
(diabetes_df['race'] == '?').value_counts().loc[True]

2273

In [42]:
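# Counting '?' placeholders for each categorical feature.
# Summing the boolean mask returns 0 for columns that contain no '?',
# whereas .value_counts().loc[True] raises KeyError for those columns.
num_of_missing = [(diabetes_df[i] == '?').sum() for i in categorical_cols]

pd.DataFrame({'Count of ?': num_of_missing}, index=categorical_cols)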
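Because missing values in this data are encoded as `?`, another option is to have pandas treat them as missing at load time through the `na_values` parameter of `pd.read_csv`, and then count them with `isna().sum()`. This is a minimal sketch on a small in-memory CSV. The rows are made up for illustration and are not taken from `diabetic_data.csv`.

```python
import io

import pandas as pd

# Made-up CSV content that uses '?' as the missing-value placeholder.
csv_text = (
    "race,gender,weight\n"
    "Caucasian,Female,?\n"
    "?,Male,?\n"
    "Asian,Female,[75-100)\n"
)

# na_values="?" converts every '?' cell to NaN while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Count missing values per column; columns without any report 0.
missing_counts = toy.isna().sum()

print(missing_counts)
```

With this approach the standard missing-data tools (`isna`, `dropna`, `fillna`) apply directly, and columns without any `?` simply show a count of 0.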