# Diabetes Dataset Process and Results
Nathaniel Richards - 10/23/18

From UCI ML Repository    
["Diabetes 130-US hospitals for years 1999-2008 Data Set"](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#)

**Goal**: build model(s) to predict which patients will be re-hospitalized within 30 days

**Evaluate**: using AUROC

Notes:
- 'encounter_id' - unique admissions
- ignore 'patient_nbr' - treat all encounters independent
- 'readmitted' - treat 'NO' as '>30', giving a binary target ('<30' vs. everything else)
- Attributes: 55
- Samples: >100k
- Features: numerical, categorical
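
A minimal sketch of the label recoding described in the notes above. The function name `to_binary_label` is hypothetical; the three raw values ('<30', '>30', 'NO') are the ones referred to in these notes.

```python
def to_binary_label(readmitted: str) -> int:
    """Return 1 for readmission within 30 days, 0 otherwise ('>30' and 'NO')."""
    value = readmitted.strip().upper()
    if value == "<30":
        return 1
    if value in (">30", "NO"):
        return 0
    # Fail loudly on anything unexpected rather than silently mislabeling it
    raise ValueError(f"unexpected 'readmitted' value: {readmitted!r}")


labels = [to_binary_label(v) for v in ["NO", ">30", "<30", "NO"]]
print(labels)  # [0, 0, 1, 0]
```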

## Basic Approach

1. Exploration
    1. feature names
    1. feature types
    1. missing values
    1. numerical histograms
    1. categorical value counts
1. Preprocessing
    1. drop rows
        - missing data
        - other insights from exploration
    1. recode categorical features
        - reduce dimensionality
    1. convert types
        - numerical to categorical
        - categorical to numerical
    1. determine feature subset
    1. upsampling / downsampling
    1. train / validation / test split
1. Modeling
    1. Baseline models
        - logistic regression
        - SVC
    1. Advanced models
        - DNN w/ categorical embedding

## Considerations

1. Imbalanced classes
    - '<30' (the positive class) accounts for only 11% of the data
1. Upsampling / downsampling
    - Upsampling must be done carefully: if applied before the split, information leaks between the train and test sets
1. Train / val / test split
    - Ensure no information leaks between the splits; otherwise results will be overly optimistic
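
One way to respect these considerations is to split first (keeping the class ratio in each part) and then balance only the training portion. The sketch below uses plain random duplication as a stand-in for SMOTE and runs on toy data; the helper names `stratified_split` and `oversample_minority` are hypothetical, not from any library.

```python
import random
from collections import Counter


def label_of(row):
    """Toy rows are (encounter_id, label) tuples."""
    return row[1]


def group_by_class(rows):
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    return by_class


def stratified_split(rows, test_frac=0.2, seed=0):
    """Split rows into train/test while keeping class proportions in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_class(rows).values():
        members = members[:]  # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_minority(rows, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = group_by_class(rows)
    target = max(len(m) for m in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Toy data with 11% positives, mirroring the imbalance noted above
rows = [(i, 1 if i < 110 else 0) for i in range(1000)]
train, test = stratified_split(rows)
train_balanced = oversample_minority(train)  # the test set is never resampled

print(Counter(label_of(r) for r in test))
print(Counter(label_of(r) for r in train_balanced))
```

Because the duplication happens after the split, no encounter in `test` can also appear in `train_balanced`.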

## Literature Survey

After planning out how I would approach this problem, I researched how others have performed on this open dataset.  Listed below are the articles I found, with notes on what I took from each and my comments.

### [How to use machine learning to predict hospital readmissions?](https://medium.com/@uraza/how-to-use-machine-learning-to-predict-hospital-readmissions-part-1-bd137cbdba07)   
by Usman Raza

Reported test accuracy: 94%

Notes
- remove weight, payer_code, medical_specialty
- recode diagnoses {1,2,3}
- group similar admission/discharge categories
- convert age ranges to numerical mean
- drop each patient's second, third, and later visits

Pros
- extremely thorough set of blog posts
- insightful feature engineering/reduction
- references the paper associated with dataset
- uses pandas
- strong statistical background

Cons
- **overinflated train/test performance**

Why? Because of how he upsampled.  He applied SMOTE *before* the train/test split, so synthetic samples interpolated from the same original records landed in both the training and test sets.  The test set was therefore no longer independent of the training data, which is how he achieved such high accuracy/AUROC compared to the other literature.
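
To make the leak concrete, the toy experiment below balances the classes first and splits afterwards, then measures how many positive test rows share an encounter with the training set. It is a simplification: plain duplication stands in for SMOTE (which interpolates new points rather than copying rows), and the data is synthetic.

```python
import random

rng = random.Random(0)

# Toy data: (encounter_id, label) with 11% positives
rows = [(i, 1 if i < 110 else 0) for i in range(1000)]
positives = [r for r in rows if r[1] == 1]
negatives = [r for r in rows if r[1] == 0]

# WRONG order: balance the classes first, then split
extra = rng.choices(positives, k=len(negatives) - len(positives))
balanced = rows + extra
rng.shuffle(balanced)
cut = int(0.8 * len(balanced))
train, test = balanced[:cut], balanced[cut:]

# Count positive test rows whose encounter is also present in train
train_ids = {r[0] for r in train}
test_pos = [r for r in test if r[1] == 1]
leaked = [r for r in test_pos if r[0] in train_ids]
leak_fraction = len(leaked) / len(test_pos)
print(f"{leak_fraction:.0%} of positive test rows also appear in train")
```

With each positive copied several times, nearly every positive test row has a twin in the training set, so the model is scored on examples it has effectively memorized.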


### [STATS701 Project](https://jrfarrer.github.io/stat701_miniproject/)
by Jordan Farrer

Reported AUROC: <0.64

Notes
- confirms that modeling is difficult/impossible with such unbalanced classes

Pros
- great visualization, exploration

Cons
- did not perform upsampling/downsampling, left classes unbalanced
- poor model performance as a result
- similar performance to naive model (only predicting one class)
- unbalanced confusion matrix
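
The comparison to a naive model can be checked with a small sketch: on data with 11% positives, a model that always predicts the negative class scores 89% accuracy but only 0.5 AUROC. The `auroc` helper below is a hypothetical pairwise implementation written for clarity, not speed (it is quadratic in the number of samples).

```python
def auroc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1] * 11 + [0] * 89      # 11% positives
always_negative = [0.0] * 100     # naive model: same score for everyone

accuracy = sum(int(s >= 0.5) == y for y, s in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                         # 0.89
print(auroc(y_true, always_negative))   # 0.5
```

This is why accuracy alone is misleading here and AUROC is the evaluation metric.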


### [Predicting Hospital Readmission for Patients with Diabetes Using Scikit-Learn](https://towardsdatascience.com/predicting-hospital-readmission-for-patients-with-diabetes-using-scikit-learn-a2e359b15f0)
by Andrew Long

Reported AUROC: ~0.65

Notes
- recode medical_specialty categories
- most weight values missing, but adds feature 'has_weight' as presence of weight record
- many models evaluated
- reports AUC
- performs hyperparameter optimization
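
A minimal sketch of the 'has_weight' indicator mentioned above. It assumes missing weights are encoded as `?` (or left empty) in the raw data; the function name `has_weight` simply mirrors the feature name.

```python
def has_weight(weight) -> int:
    """Return 1 if a weight value was recorded, 0 if it is missing."""
    if weight is None:
        return 0
    # Assumption: '?' or an empty string marks a missing value
    return 0 if str(weight).strip() in ("", "?") else 1


raw_weights = ["?", "[75-100)", "?", None]
print([has_weight(w) for w in raw_weights])  # [0, 1, 0, 0]
```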

Pros
- follows clear data science process
- train / val / test split
- properly performs subsampling after train/val split
- performs continuous feature scaling

Cons
- uses one-hot encoding for categorical features, though this may not actually be a drawback


## Other Resources

### [The Right Way to Oversample in Predictive Modeling](https://beckernick.github.io/oversampling-modeling/)
by Nick Becker

This article details the subtle-yet-dangerous pitfall of performing upsampling incorrectly.  One should apply SMOTE *after* the train/test split, and only to the training set.  Otherwise information leaks between the train and test sets and performance is inflated, because the model has effectively already seen the test data during training.

### [Building A Logistic Regression in Python, Step by Step](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)
by Susan Li

- Converts categorical features to one-hot
- Performs SMOTE on one-hot
- Properly performs SMOTE on train set after train/test split
- Recursive feature elimination (RFE)

## Expectations

I first came upon the results from [Usman Raza](https://medium.com/@uraza/how-to-use-machine-learning-to-predict-hospital-readmissions-part-1-bd137cbdba07) that reported 94% test accuracy on the Diabetes dataset.

However, those results were inflated (see above); the other results show test performance around AUROC = 0.65.  This is the target I aim to meet or exceed in my own model evaluation.