# Case Study 2 - Predicting Hospital Readmittance

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Content
* [Business Understanding](#business-understanding)
    - [Abstract](#abstract)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Feature Removal](#feature-removal)
    - [Exploratory Data Analysis (EDA)](#eda)
    - [Assumptions](#assumptions)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-metrics)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Sampling Methodology](#sampling-methodology)
    - [Model](#model)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations and Model Enhancements](#model-enhancements)
    - [Alternative Modeling Approaches](#alternative-modeling-approaches)

# Business Understanding & Executive Summary <a id='business-understanding'/>

What are we trying to solve for and why is it important?


### Objective<a id='scope'/>
This case study involves analyzing a ten-year study tracking diabetic patient readmission into hospitals. The goal of the analysis is to predict when diabetic patients are likely to be readmitted to a hospital based on the available data and determine if any of the provided factors are particularly indicative of a high chance of readmission.

### Introduction <a id='introduction'/>
Diabetes is a common disease that affects the body's ability to regulate blood sugar levels naturally. According to the American Heart Association [1], there are two main types of diabetes which involve issues with insulin. This hormone regulates how cells absorb glucose from the bloodstream. Type 1 diabetes describes the chronic form that is usually identified at a young age in which the body is unable to produce sufficient insulin. Type 2 diabetes, the most common condition, can arise later in life and occurs when the body develops an "insulin resistance" or its insulin production begins to diminish.
Complications involving the effects of diabetes on the heart and circulatory system can lead to dire conditions that require hospitalization. Patients are often released from the hospital but have to be readmitted soon after with recurring issues, causing additional strain on both the patients' livelihoods and the efficiency of the hospital. The data provided for this case study comes from a ten-year study on hospital readmissions of diabetic patients. This analysis aims to predict, from the provided information, whether a diabetic patient is likely to be readmitted to the hospital after a visit and whether that readmission will occur in less than 30 days. Based on a predictive model, the most influential indicators of probable readmission can be identified so that medical professionals can make better judgments on whether a diabetic patient is ready to be released. Since diabetes is estimated to affect 463 million people worldwide [2], any improvement in this regard would have far-reaching benefits.

### EDA <a id='eda'/>
Two CSV files are used for the analysis, "diabetic_data.csv" and "IDs_mapping.csv". The "diabetic_data" dataset is based on the multi-year study of diabetes patients, and each row represents an individual hospital admission. The patients' information was systematically collected from the point of entry to that of discharge. It can be broadly classified into a few categories:

* Personal information
* The admission situations and conditions
* The laboratory tests conducted
* The physician's diagnosis
* The treatments and medications
* The discharge conditions

This dataset contains 50 features and 101,766 rows, including the response variable (readmission information). The team identified missing data in the diabetic data file. A few observations follow:

* The response variable is unbalanced, with an 89:11 split.
* An imbalance was also observed between the male and female samples.
* The collected data contains more samples from Caucasian patients than from any other race. However, the ratio of patients taking diabetic medication to total patients within each race is very similar (ranging from 74% to 80%). Hospitals are typically required by state law to collect demographic information like race and ethnicity. The 1964 Civil Rights Act allows hospitals to collect this data to ensure no discrimination in care.
* Pair plots of the continuous variables did not show any obvious correlation among them.

The team also reviewed the "IDs_mapping.csv" file. This file contains the IDs for multiple categorical variables together with a brief description of each. During exploration of this file, the team found values that represent missing information, such as NULL, Not Available, Unknown/Invalid, and Not Mapped. These values were replaced with NaN, and the mapping information was then merged with "diabetic_data".
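
The cleanup and merge described above can be sketched as follows. This is a minimal illustration on made-up rows, not the team's actual code: the helper `clean_mapping`, the `description` column name, and the toy values are assumptions, and the real "IDs_mapping.csv" may need extra parsing before it reaches this shape.

```python
import numpy as np
import pandas as pd

# Placeholder strings treated as missing (taken from the list in the text).
MISSING_LABELS = ["NULL", "Not Available", "Unknown/Invalid", "Not Mapped"]

def clean_mapping(mapping: pd.DataFrame) -> pd.DataFrame:
    """Replace missing-value placeholders in the description column with NaN."""
    cleaned = mapping.copy()
    desc = cleaned["description"].str.strip()
    is_missing = desc.str.lower().isin([m.lower() for m in MISSING_LABELS])
    cleaned["description"] = desc.mask(is_missing, np.nan)
    return cleaned

# Toy stand-ins for the mapping file and the encounter data (illustrative only).
mapping = pd.DataFrame({
    "admission_type_id": [1, 2, 5, 6],
    "description": ["Emergency", "Urgent", "Not Available", "NULL"],
})
encounters = pd.DataFrame({
    "encounter_id": [10, 11, 12],
    "admission_type_id": [1, 5, 6],
})

# Left join keeps every encounter, even when its ID maps to a missing description.
merged = encounters.merge(clean_mapping(mapping), on="admission_type_id", how="left")
merged = merged.rename(columns={"description": "admission_type_id_new_mapping"})
```

A left join is used here so that no encounter is dropped; rows whose ID maps to a placeholder simply carry NaN into the later imputation step.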

### Methods <a id='methods'/>
The objective of our analysis is to classify diabetes patients who have the highest risk of being readmitted to the hospital within 30 days using logistic regression modeling. L2 regularization was used to prevent overfitting and maintain model performance. The team observed that L2 ran faster than L1 on this dataset, which has a large number of dummy-coded variables.

The metrics used to evaluate model performance were precision and recall. To balance precision and recall, we used the F1 score, which is the harmonic mean of the two: F1 = 2 × (precision × recall) / (precision + recall).

One school of thought argues that higher recall should be prioritized, because readmissions should be prevented wherever possible. There is no free lunch, however: tightening one metric forces the other to bear the cost. With that in mind, another option is to choose the operating point at the intersection of the precision and recall curves.
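
To make the trade-off concrete, the sketch below computes precision, recall, and F1 at the decision threshold where precision and recall are closest. The helper `crossover_threshold` and the toy labels and scores are assumptions for illustration; they are not taken from the team's notebook.

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
)

def crossover_threshold(y_true, y_score):
    """Return the score threshold where precision and recall are closest."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds;
    # drop the final (precision=1, recall=0) point so the arrays line up.
    gap = np.abs(precision[:-1] - recall[:-1])
    return thresholds[np.argmin(gap)]

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])

threshold = crossover_threshold(y_true, y_score)
y_pred = (y_score >= threshold).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

Lowering the threshold below the crossover point would raise recall at the expense of precision, which is the choice favored by the recall-first argument.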

### Train and Test Data Split <a id='Train and Test Data Split'/>
The response variable is unbalanced, with an 89:11 split, so the team used a stratified 70/30 split between the training and test sets. Stratified splitting preserves the class proportions of the full dataset in both sets, which matters because the dataset is unbalanced. The dataset was split into training and test sets before imputation. After imputation, the continuous variables were scaled so that high-magnitude variables would not influence the outcome more than low-magnitude variables.
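
A minimal sketch of the split and scaling steps is shown below, using synthetic data with roughly the same class imbalance. The random seed, the toy feature values, and fitting the scaler on the training set only are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an 89:11 class split (illustrative only).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "num_medications": rng.integers(1, 80, size=n),
})
y = pd.Series(np.r_[np.ones(110, dtype=int), np.zeros(n - 110, dtype=int)])

# stratify=y keeps the 89:11 ratio in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```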

## Missing Values <a id='missing-values'>
No missing values were observed in any continuous variable. All of the missing values were observed in categorical features.
    
The main challenge encountered with this data is the missing values. Missing values observed in the IDs_mapping file were replaced with NaN, and these NaN values were carried into the diabetic data file through the merge. The missing values were then mitigated as stated in the table below.
    
The team investigated whether these missing values have any internal correlation or dependency. The heatmap does not show any overlap or correlation among the missing values.
For the diagnosis columns diag_1, diag_2, and diag_3, ICD code ranges were used to reduce the number of categories.
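
One way to collapse the diagnosis codes is to bin them by ICD-9 chapter range, as sketched below. The exact bins the team used are not stated, so the ranges, the category labels, and the helper `icd9_group` are assumptions for illustration.

```python
import pandas as pd

# Assumed grouping by ICD-9 chapter ranges; the team's actual bins may differ.
ICD9_RANGES = [
    (140, 239, "Neoplasms"),
    (390, 459, "Circulatory"),
    (460, 519, "Respiratory"),
    (520, 579, "Digestive"),
    (580, 629, "Genitourinary"),
    (710, 739, "Musculoskeletal"),
    (800, 999, "Injury"),
]

def icd9_group(code):
    """Map a raw ICD-9 code to a broad category; missing values pass through."""
    if pd.isna(code):
        return code  # left for the imputation step
    text = str(code).strip()
    if text[:1].upper() in ("E", "V"):
        return "Other"  # supplementary E and V codes
    try:
        number = int(float(text))  # "250.83" -> 250
    except ValueError:
        return "Other"
    if number == 250:
        return "Diabetes"
    for low, high, label in ICD9_RANGES:
        if low <= number <= high:
            return label
    return "Other"

# Toy diagnosis column (illustrative values only).
diag = pd.Series(["250.83", "428", "486", "V45", None])
grouped = diag.map(icd9_group)
```

Applying the same function to diag_1, diag_2, and diag_3 reduces hundreds of distinct codes to a handful of levels before dummy coding.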
    
The following table lists the variables with missing values and the methodology the team used to mitigate them:

| Missing Variable                 | # of missing values | Mitigation Methodology                                     |
|----------------------------------|---------------------|------------------------------------------------------------|
| weight                           | 98,569              | ~97% of data was missing so variable was dropped.          |
| race                             | 2,273               | Impute by Mode                                             |
| payer_code                       | 40,256              | Impute by Mode                                             |
| medical_specialty                | 49,949              | Group by admission_type_id_new_mapping then Impute by Mode |
| disharge_disposition_new_mapping | 4,680               | Impute by Mode                                             |
| admission_source_new_mapping     | 7,067               | Impute by Mode                                             |
| diag_1                           | 21                  | Recategorize and Impute by Mode                            |
| diag_2                           | 358                 | Recategorize and Impute by Mode                            |
| diag_3                           | 2,423               | Recategorize and Impute by Mode                            |

The race variable was identified as a potential ethical concern and a sensitive discussion topic, and the team decided to fill its missing values with the mode. The occurrence of missing information for race is relatively low, so there should not be much concern about inadvertently creating artificial trends in the model.

KNN Imputation Methodology: The team also considered KNN imputation for the missing values, as no perfect correlation was observed among the variables with missing values. The team found that KNN imputation was not appropriate given the number of categorical variables in the dataset, so it was dropped in favor of the imputation approach listed above.
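
The mode-based imputation from the table can be sketched as follows. The helper names (`fit_modes`, `fit_group_modes`, `impute`) and the toy rows are hypothetical, and learning the modes from the training set only is an assumption that follows from splitting the data before imputation.

```python
import numpy as np
import pandas as pd

def fit_modes(train, columns):
    """Most frequent value of each column in the training data."""
    return {col: train[col].mode(dropna=True).iloc[0] for col in columns}

def fit_group_modes(train, target, by):
    """Most frequent `target` value within each `by` group of the training data."""
    known = train.dropna(subset=[target])
    return known.groupby(by)[target].agg(lambda s: s.mode().iloc[0])

def impute(df, modes, group_modes, target, by):
    """Fill `target` by group mode first, then fall back to overall modes."""
    out = df.copy()
    out[target] = out[target].fillna(out[by].map(group_modes))
    return out.fillna(modes)

# Toy training and test frames (illustrative values only).
by = "admission_type_id_new_mapping"
train = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "AfricanAmerican", np.nan],
    by: ["Emergency", "Emergency", "Elective", "Elective"],
    "medical_specialty": ["InternalMedicine", np.nan, "Surgery-General", "Surgery-General"],
})
test = pd.DataFrame({
    "race": [np.nan, "Hispanic"],
    by: ["Emergency", "Elective"],
    "medical_specialty": [np.nan, np.nan],
})

modes = fit_modes(train, ["race", "medical_specialty"])
group_modes = fit_group_modes(train, "medical_specialty", by)
train_imp = impute(train, modes, group_modes, "medical_specialty", by)
test_imp = impute(test, modes, group_modes, "medical_specialty", by)
```

The fallback to the overall mode covers test rows whose group never appears with a known value in the training data.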
 
### Other Data Cleanup <a id='Other Data Cleanup'/>
The team reviewed the data in detail after addressing the missing values. Features that are not meaningful, namely encounter_id, patient_nbr, examide, and citoglipton, were removed from the analysis. encounter_id and patient_nbr are ID codes and will not add value to the model. The examide and citoglipton variables contain all zeros, so they carry no information.
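
The removal of identifier and constant columns can be sketched as below. The helper `drop_uninformative` and the toy frame are assumptions for illustration; detecting constant columns automatically is one possible approach, not necessarily how the team found them.

```python
import pandas as pd

# Identifier columns named in the text.
ID_COLUMNS = ["encounter_id", "patient_nbr"]

def drop_uninformative(df, id_columns=ID_COLUMNS):
    """Drop ID columns and any column that holds a single constant value."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    to_drop = [col for col in id_columns if col in df.columns]
    to_drop += [col for col in constant if col not in to_drop]
    return df.drop(columns=to_drop), to_drop

# Toy frame mirroring the columns discussed above (illustrative values only).
df = pd.DataFrame({
    "encounter_id": [1, 2, 3],
    "patient_nbr": [100, 101, 100],
    "examide": [0, 0, 0],
    "citoglipton": [0, 0, 0],
    "metformin": [0, 1, 0],
})

cleaned, dropped = drop_uninformative(df)
```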

### Logistic Regression Modeling <a id='Logistic Regression Modeling'/>
Logistic regression is a classification algorithm used to predict a binary outcome from a set of independent variables. It is appropriate when the target variable is dichotomous, that is, when it falls into one of two categories (such as “yes” or “no”, “pass” or “fail”, 0 or 1). The response variable in this analysis is dichotomous, so the team decided to use logistic regression to analyze the diabetic data.

Baseline Model:
There were no missing values in the continuous variables, so the team built the baseline logistic regression model using the continuous variables only. The team used this baseline to assess the value of adding the multi-level categorical variables and to see how the model behaved without dummy coding or imputation.

The second logistic regression model was built using all of the variables, on the dataset obtained after imputing the missing values. The model used a penalty of “l2”, a C value of 1, and class_weight set to “balanced”. The team chose the balanced class_weight to deal with the imbalance of the target variable: it reweights the classes so that the minority positive class is not overwhelmed by the negative class during fitting.
    
The accuracy of the model was 68%.
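
A minimal sketch of the full-model setup is given below on synthetic data, so its scores are not comparable to the figure reported above. The toy features, the synthetic target, and `max_iter=1000` are assumptions; the class-weight option is spelled `"balanced"` in scikit-learn, and L2 is the library's default penalty, so it is not passed explicitly here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(1)
n = 2000
data = pd.DataFrame({
    "number_inpatient": rng.poisson(0.6, size=n),
    "time_in_hospital": rng.integers(1, 15, size=n),
    "race": rng.choice(["Caucasian", "AfricanAmerican", "Hispanic"], size=n),
})
logit = -2.8 + 0.6 * data["number_inpatient"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Dummy-code the categorical column; cast to float so scaling can overwrite columns.
X = pd.get_dummies(data, columns=["race"], drop_first=True).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Scale only the continuous columns, using statistics from the training set.
num_cols = ["number_inpatient", "time_in_hospital"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# L2 penalty is the scikit-learn default; C=1 and balanced class weights as in the text.
model = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```

With balanced class weights, accuracy tends to drop relative to an unweighted model because more negatives are flagged as positive, which is why precision, recall, and F1 are the more informative metrics for this problem.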
 
### Results <a id='results'/>


### References <a id='References'/>
[1] American Heart Association. What is Diabetes? https://www.heart.org/en/health-topics/diabetes/about-diabetes

[2] International Diabetes Federation. IDF Diabetes Atlas, Ninth Edition (2019). www.diabetesatlas.org. Retrieved 18 May 2020.

[3] CareerFoundry. What is Logistic Regression? A Beginner's Guide (2022). careerfoundry.com

 

# Data Evaluation <a id='data-evaluation'>
    

Summarize data being used?

Are there missing values?

Which variables are needed and which are not?

What assumptions or conclusions are you drawing about your data?

In [1]:
# standard libraries
import pandas as pd
import numpy as np
import os
from IPython.display import Image

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate

# data pre-processing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# prediction models
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
#from kneed import KneeLocator
from scipy import stats

# import warnings filter (disabled; uncomment to silence warnings)
# import warnings
# warnings.filterwarnings('ignore')
# from warnings import simplefilter
# simplefilter(action='ignore', category=FutureWarning)



## Loading Data <a id='loading-data'>

In [None]:
# load the ID-to-description mapping for the categorical ID columns
adm_type = pd.read_csv('dataset_diabetes/IDs_mapping.csv')

## Data Summary <a id='data-summary'>

| Feature name                | Type    | Description and values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|-----------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Encounter ID                | Numeric | Unique identifier of an encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Patient number              | Numeric | Unique identifier of a patient                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Race                        | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| Gender                      | Nominal | Values: male, female, and unknown/invalid                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Age                         | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) |
| Weight                      | Numeric | Weight in pounds.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Admission type              | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Discharge disposition       | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Admission source            | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Time in hospital            | Numeric | Integer number of days between admission and discharge                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Payer code                  | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Medical specialty           | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Number of lab procedures    | Numeric | Number of lab tests performed during the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| Number of procedures        | Numeric | Number of procedures (other than lab tests) performed during the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Number of medications       | Numeric | Number of distinct generic names administered during the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Number of emergency visits  | Numeric | Number of emergency visits of the patient in the year preceding the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| Number of inpatient visits  | Numeric | Number of inpatient visits of the patient in the year preceding the encounter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| Diagnosis 1                 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Diagnosis 2                 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Diagnosis 3                 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Number of diagnoses         | Numeric | Number of diagnoses entered to the system                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Glucose serum test result   | Nominal | Indicates the range of the result or if the test was not taken. Values: >200, >300, normal, and none if not measured                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| A1c test result             | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.                                                                                                                                                                                                                                                                                                                                                                                                              |
| Change of medications       | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Diabetes medications        | Nominal | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 24 features for medications | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed |
| Readmitted                  | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
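
Because the `Readmitted` feature has three levels while the modeling questions later in this notebook describe a binary classification, the target would need to be collapsed at some point. The sketch below shows one possible encoding on toy values; treating only "<30" as the positive class is an assumption made for illustration, not a decision stated in this notebook.

```python
import pandas as pd

# Toy values mirroring the three levels described in the table above
readmitted = pd.Series(["<30", ">30", "No", "No", "<30"])

# Assumption: only readmission within 30 days counts as the positive class
y_binary = (readmitted.str.strip() == "<30").astype(int)

print(y_binary.tolist())
```

Grouping ">30" with "No" is only one option; a study focused on any readmission would instead map both "<30" and ">30" to 1.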

## Feature Removal <a id='feature-removal'/>

## Exploratory Data Analysis (EDA) <a id='eda'/>


In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
cont_vars = df.describe().columns
cont_vars

In [None]:
cat_vars = df.columns.drop(cont_vars)
cat_vars

In [None]:
df[cat_vars].describe()

In [None]:
df[cat_vars].head()
# A diag_1 value such as 250.83 is likely an ICD-9 diagnosis code, so treat it as a category, not a float
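
If the diagnosis values are indeed codes, one way to handle them is to keep them as strings and group them by the part before the decimal point. The sketch below uses toy codes; the grouping rule and the names `diag` and `diag_group` are assumptions for illustration, not the approach taken in this notebook.

```python
import pandas as pd

# Toy diagnosis codes; in the notebook these would come from df["diag_1"]
diag = pd.Series(["250.83", "428", "V57"])

# Keep the codes as strings and take the part before the decimal point
diag_group = diag.astype(str).str.split(".").str[0]

print(diag_group.tolist())
```

Grouping reduces the number of distinct levels, which matters once the categorical variables are one-hot encoded.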

### Feature Collinearity <a id='feature-collinearity'/>
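
A common first check for collinearity is the pairwise correlation between the continuous variables. The sketch below runs on synthetic data; the 0.8 cutoff and the column names are assumptions chosen for illustration. In the notebook, the same steps could be applied to `df[cont_vars]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                     # independent of the others
})

# Absolute pairwise Pearson correlations
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Assumption: flag pairs with |r| above 0.8
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```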


### Feature Outliers 
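
One simple way to screen a continuous feature for outliers is the 1.5 × IQR rule. The sketch below applies it to a toy series; the rule and the 1.5 multiplier are conventional choices assumed here, not something this notebook prescribes.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

# Interquartile range of the toy feature
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Flag points more than 1.5 * IQR outside the quartiles
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
outliers = values[mask]

print(outliers.tolist())
```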
 

## Assumptions <a id='assumptions'/>

# Model Preparations <a id='model-preparations'/>

What methods did you use (or not) to solve the problem?

Why are the methods you chose appropriate given the business objective?

How did you decide your approach was useful? If you used more than one method, which one performed better, and why?

What evaluation metrics are most useful given that the problem is a binary classification (e.g., accuracy, F1-score, precision, recall, AUC)?
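
As a reference for the metrics named above, the sketch below computes each of them with scikit-learn on a small set of made-up labels and predicted probabilities. The 0.5 decision threshold is an assumption; none of the numbers relate to the readmission data.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.2, 0.9, 0.1, 0.3]

# Assumption: classify as positive when the probability is at least 0.5
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),  # AUC uses probabilities, not hard labels
}
print(metrics)
```

When the positive class is rare, accuracy can look good even for a model that never predicts it, which is why precision, recall, and AUC are usually reported alongside it.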



## Sampling & Scaling Data <a id='sampling-scaling-data' />

## Need to add one hot encoding for categorical variables

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the target y and the feature matrix X
y = np.array(df['readmitted'])
X = df.drop(['readmitted'], axis=1)

# TODO: one-hot encode the categorical columns of X here;
# StandardScaler below requires numeric input

# Split the data 70/30 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234321)

# Center and scale X using statistics from the training set only
scl = StandardScaler()
scl.fit(X_train)
X_train_scaled = scl.transform(X_train)  # apply to the training set
X_test_scaled = scl.transform(X_test)  # apply to the test set (without snooping)

# Keep the feature names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
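
The one-hot encoding step noted above could be sketched with `pandas.get_dummies`. The toy frame and its column names (`num_visits`, `med_change`, `diabetes_med`) are hypothetical stand-ins, and dropping the first level of each variable is an assumption made to avoid redundant columns in a logistic regression.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two nominal columns
toy = pd.DataFrame({
    "num_visits": [1, 3, 5],
    "med_change": ["change", "no change", "change"],
    "diabetes_med": ["yes", "no", "yes"],
})

# Select the non-numeric columns and expand them into 0/1 indicators
cat_cols = toy.select_dtypes(exclude="number").columns
encoded = pd.get_dummies(toy, columns=list(cat_cols), drop_first=True, dtype=int)

print(encoded)
```

Encoding before the train/test split keeps the column set identical in both partitions, while scaling is still fit on the training rows only.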

## Proposed Method <a id='proposed-metrics' />

## Evaluation Metrics <a id='evaluation-metrics' />

### Baseline Model
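
A majority-class predictor is a common baseline for classification. The sketch below uses scikit-learn's `DummyClassifier` on toy labels with an assumed 80/20 class split; the split is illustrative and not taken from the readmission data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: the features are ignored by the dummy model
X_toy = np.zeros((10, 1))
y_toy = np.array([0] * 8 + [1] * 2)

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc)
```

Any candidate model should beat this accuracy, and its recall on the minority class, to be considered useful.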

## Feature Selection <a id='feature-selection' />

# Model Building & Evaluations <a id='model-building'/>

The primary task is building a logistic regression model to predict hospital readmissions.

How did you handle missing values?

Specify your sampling methodology

Set up your models - highlights of any important parameters

Analysis of your model's performance

## Sampling Methodology <a id='sampling-methodology'/>

#### Per the code above, we used a 70/30 train/test split, with 5-fold internal cross-validation on the training set
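
The sketch below illustrates this sampling scheme on synthetic data: a 70/30 split followed by a logistic regression whose regularization strength is chosen by 5-fold cross-validation on the training portion. The use of `LogisticRegressionCV`, the stratified split, and the synthetic dataset are assumptions for illustration, not the notebook's final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded readmission features
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)

# 70/30 train/test split (stratification is an added assumption)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1234321, stratify=y_syn
)

# Scale, then fit a logistic regression tuned by 5-fold internal CV
model = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5, max_iter=1000))
model.fit(X_tr, y_tr)

test_acc = model.score(X_te, y_te)
print(test_acc)
```

Putting the scaler inside the pipeline means the held-out test rows never influence the scaling statistics.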

## Model's Performance Analysis <a id='performance-analysis'/>

# Model Interpretability & Explainability <a id='model-explanation'/>

Which variables were more important and why?

How did you come to the conclusion that these variables were important, and how should the audience interpret this?

## Examining Feature Importance <a id='examining-feature-importance'/>
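
For a logistic regression fit on standardized features, the magnitude of each coefficient gives a rough ranking of feature importance. The sketch below shows the idea on synthetic data with hypothetical feature names (`strong`, `weak`, `noise`); it is an illustration under those assumptions, not a result from the readmission model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(400, 3)), columns=["strong", "weak", "noise"])

# Synthetic outcome driven mostly by "strong", slightly by "weak"
logit = 3.0 * X_toy["strong"] + 0.5 * X_toy["weak"]
y_toy = (logit + rng.logistic(size=400) > 0).astype(int)

# Fit on standardized features so the coefficients are comparable
clf = LogisticRegression().fit(StandardScaler().fit_transform(X_toy), y_toy)

# Rank features by absolute coefficient size
importance = pd.Series(clf.coef_[0], index=X_toy.columns)
importance = importance.sort_values(key=np.abs, ascending=False)
print(importance)
```

The sign of each coefficient indicates the direction of the association, and exponentiating it gives the odds ratio per one standard deviation of the feature.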

# Conclusion <a id='conclusion'/>

What are you proposing to the audience with your models and why?

How should your audience interpret your conclusion, and where should they go moving forward on the topic?

What other approaches do you recommend exploring?

Bring it all home!

### Final Model Proposal <a id='final-model-proposal'/>

### Future Considerations and Model Enhancements <a id='model-enhancements'/>

### Alternative Modeling Approaches <a id='alternative-modeling-approaches'/>