# Title

## Introduction

## Preprocessing

### Feature Reduction

The papers used to inform the elimination of columns were "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records", found at https://downloads.hindawi.com/journals/bmri/2014/781670.pdf, and "Risk factors associated with 30-day readmission and length of stay in patients with type 2 diabetes", found at https://www.sciencedirect.com/science/article/abs/pii/S1056872716307383.

#### Overview of Support from First Article

Cut weight and payer code. The paper finds gender not to be statistically significant, so cut that as well. Use only one encounter per patient, ideally the first encounter, to ensure statistical independence; this may not be necessary for all models, but it is still ideal for our purposes. Remove all encounters that ended in discharge to hospice or death. The HbA1c measurement is more useful as a binary variable that records whether the test was administered at all, rather than the result of the test.
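The first-encounter rule and the binary HbA1c flag described above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual cell: the rows are made up, the `A1C_tested` column name is hypothetical, and it assumes that a lower `encounter_id` means an earlier encounter, which should be checked against the data before relying on it.

```python
import pandas as pd

# Toy encounters (made-up values, not rows from diabetic_data_train.csv).
toy = pd.DataFrame({
    "encounter_id": [30, 10, 20, 40],
    "patient_nbr": [1, 1, 2, 3],
    "A1Cresult": [">8", "None", "Norm", "None"],
})

# Sort so that "first" means the lowest encounter_id (assumed to be the
# earliest encounter), then keep one row per patient.
first_encounters = (
    toy.sort_values("encounter_id")
       .drop_duplicates(subset="patient_nbr", keep="first")
       .reset_index(drop=True)
)

# Hypothetical column: 1 if any HbA1c result was recorded, 0 otherwise.
first_encounters["A1C_tested"] = (
    first_encounters["A1Cresult"].isin([">7", ">8", "Norm"]).astype(int)
)

print(first_encounters)
```

Sorting before `drop_duplicates` makes the choice of "first" explicit instead of depending on the row order of the CSV file.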

In [153]:
#Import packages 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#import training data and delete weight, payer code, and gender
df=pd.read_csv("diabetic_data_train.csv").drop(columns=['weight','gender','payer_code'])

#import and sort id_codes for admission type, discharge disposition, and admission source

dfid=pd.read_csv("IDs_mapping.csv")
df_admissionTypeID = dfid.iloc[0:8]
df_dischargeID=dfid.iloc[10:40].reset_index(drop=True)
df_dischargeID.columns=['discharge_disposition_id', "description"]
df_dischargeID.dropna(inplace=True)
df_admissionSourceID=dfid.iloc[42:].reset_index(drop=True)
df_admissionSourceID.columns=['admission_source_id', "description"]

#ensure only one encounter per patient by only using the first encounter listed

df=df.drop_duplicates(subset=['patient_nbr'],keep='first').reset_index(drop=True)

#remove all encounters that ended in discharge to hospice or death

#the function commented out below searched the discharge codes to find the ones that contain hospice and expired,
#which can be interpreted as death.
#hospice=df_dischargeID['description'].str.contains('hospice')
#Hospice=df_dischargeID['description'].str.contains('Hospice')
#expired=df_dischargeID['description'].str.contains('expired')
#Expired=df_dischargeID['description'].str.contains('Expired')
#death_array=[]
#hospiceArr=df_dischargeID[hospice].discharge_disposition_id.to_numpy()
#for i in range(len(hospiceArr)):
#    death_array.append(hospiceArr[i])
#HospiceArr=df_dischargeID[Hospice].discharge_disposition_id.to_numpy()
#for i in range(len(HospiceArr)):
#    death_array.append(HospiceArr[i])
#expiredArr=df_dischargeID[expired].discharge_disposition_id.to_numpy()
#for i in range(len(expiredArr)):
#    death_array.append(expiredArr[i])
#ExpiredArr=df_dischargeID[Expired].discharge_disposition_id.to_numpy()
#for i in range(len(ExpiredArr)):
#    death_array.append(ExpiredArr[i])
#np.unique(np.array(death_array))
#The results show that discharge codes 11,13,14,19,20,21 contain references to death and hospice
df=df[~df['discharge_disposition_id'].isin([11,13,14,19,20,21])]

#now change HbA1C into a binary variable 1 = received testing, 0 = did not receive testing

#isin() also maps missing values to 0, which matters if read_csv parsed the string 'None' as NaN
df['A1Cresult'] = df['A1Cresult'].isin(['>7','>8','Norm']).astype(int)
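The commented-out keyword search in the cell above can be written more compactly. The following is a sketch under stated assumptions rather than the code that produced the list of discharge codes: the lookup rows are toy values, not the contents of IDs_mapping.csv, and the names `toy_discharge`, `toy_encounters`, `excluded_ids`, and `kept` are hypothetical.

```python
import pandas as pd

# Toy lookup table (illustrative rows, not the contents of IDs_mapping.csv).
toy_discharge = pd.DataFrame({
    "discharge_disposition_id": [1, 6, 11, 13],
    "description": ["Discharged to home", None, "Expired", "Hospice / home"],
})

# Case-insensitive match on either keyword; na=False treats a missing
# description as "no match" instead of propagating NaN into the mask.
is_death_or_hospice = toy_discharge["description"].str.contains(
    "hospice|expired", case=False, na=False
)
excluded_ids = toy_discharge.loc[
    is_death_or_hospice, "discharge_disposition_id"
].tolist()

# Toy encounters, filtered with a single isin() mask.
toy_encounters = pd.DataFrame({"discharge_disposition_id": [1, 11, 13, 6]})
kept = toy_encounters[
    ~toy_encounters["discharge_disposition_id"].isin(excluded_ids)
]

print(excluded_ids)
print(kept)
```

A single case-insensitive pattern replaces the four separate `str.contains` calls and the append loops, and the resulting list can be passed straight to `isin()`.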

In [12]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')