# Reducing Bias in Machine Learning Models - Pre-processing


#### Kristen Lo - BrainStation
---

### Table of Contents
- [Introduction](#intro)
- [Part 2: Pre-processing](#clean)
    - 1.1: [Housekeeping](#house)
    - 1.2: [Pre-processing Demographic Data](#demo)
    - 1.3: [Pre-processing Triage Numerical Data](#num)
    - 1.4: [Pre-processing Hospital Usage Data](#huse)
    - 1.5: [Pre-processing Historical Lab Data](#lab) 
    - 1.6: [Pre-processing Meds Data](#med) 
- [Conclusion](#conc)


---
### <a id = 'intro'></a> Introduction

In this notebook, we pre-process the data to prepare it for modeling. I plan to convert all of the columns into binary columns so that the data is easier to model.

There is a need to predict hospital admission rates for diabetic patients. However, traditional machine learning models can perpetuate health disparities when the training data is biased with respect to demographic attributes (e.g. race, age, income, insurance). These biases need to be addressed prior to modeling so that they are not carried into the model. Building on the work of Raza, S., who aimed to predict, diagnose, and mitigate health disparities in hospital re-admission, my aim is to replicate that study and create my own model that is able to screen for biases and predict admission rates for diabetics visiting the ER.


Data was sourced from all adult Emergency Department visits from March 2014 to July 2017 at one academic and two community emergency rooms that are part of the Yale New Haven Health system. These visits resulted in either admission to their respective hospital or discharge.

A total of 972 variables were extracted per patient visit from 560,486 patient visits.

Courtesy of:
 "Hong WS, Haimovich AD, Taylor RA (2018) Predicting hospital admission at emergency department triage using machine learning. PLoS ONE 13(7): e0201016." (https://doi.org/10.1371/journal.pone.0201016)




-----

## <a id = 'clean'></a> Part 2: Pre-processing

---
#### <a id = 'house'></a> 1.1 Housekeeping

Loading the necessary libraries

In [599]:
import pandas as pd
import numpy as np

Loading the clean csv file

In [600]:
health_data = pd.read_csv('Data/clean_health_data_pt1.csv')

In [601]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)  # Show all rows

In [602]:
health_data.head()

Unnamed: 0,demo_age,demo_gender,demo_race,demo_employstatus,demo_insurance_status,disposition,huse_previousdispo,pmh_2ndarymalig,pmh_abdomhernia,pmh_acqfootdef,pmh_acrenlfail,pmh_acutecvd,pmh_acutemi,pmh_acutphanm,pmh_adltrespfl,pmh_alcoholrelateddisorders,pmh_amniosdx,pmh_anemia,pmh_aneurysm,pmh_anxietydisorders,pmh_artembolism,pmh_asppneumon,pmh_asthma,pmh_attentiondeficitconductdisruptivebeha,pmh_biliarydx,pmh_bladdercncr,pmh_blindness,pmh_bnignutneo,pmh_bonectcncr,pmh_bph,pmh_brainnscan,pmh_breastcancr,pmh_breastdx,pmh_brnchlngca,pmh_cardiaarrst,pmh_carditis,pmh_cataract,pmh_chfnonhp,pmh_chrkidneydisease,pmh_coaghemrdx,pmh_coloncancer,pmh_comabrndmg,pmh_complicdevi,pmh_complicproc,pmh_conduction,pmh_contraceptiv,pmh_copd,pmh_coronathero,pmh_crushinjury,pmh_cysticfibro,pmh_deliriumdementiaamnesticothercognitiv,pmh_developmentaldisorders,pmh_diabmelnoc,pmh_diabmelwcm,pmh_disordersusuallydiagnosedininfancych,pmh_diverticulos,pmh_dizziness,pmh_dminpreg,pmh_dysrhythmia,pmh_ecodesadverseeffectsofmedicalcare,pmh_ecodesfall,pmh_ecodesfirearm,pmh_ecodesmotorvehicletrafficmvt,pmh_ecodesotherspecifiedandclassifiable,pmh_encephalitis,pmh_endometrios,pmh_epilepsycnv,pmh_esophcancer,pmh_esophgealdx,pmh_eyeinfectn,pmh_femgenitca,pmh_feminfertil,pmh_fluidelcdx,pmh_fuo,pmh_fxarm,pmh_fxhip,pmh_fxskullfac,pmh_gangrene,pmh_gasduoulcer,pmh_gastritis,pmh_gastroent,pmh_giconganom,pmh_gihemorrhag,pmh_giperitcan,pmh_glaucoma,pmh_goutotcrys,pmh_guconganom,pmh_hdnckcancr,pmh_headachemig,pmh_hemorrpreg,pmh_hodgkinsds,pmh_hrtvalvedx,pmh_htn,pmh_htncomplicn,pmh_htninpreg,pmh_hyperlipidem,pmh_immunitydx,pmh_inducabortn,pmh_infectarth,pmh_influenza,pmh_infmalegen,pmh_intestinfct,pmh_intobstruct,pmh_intracrninj,pmh_jointinjury,pmh_kidnyrnlca,pmh_lateeffcvd,pmh_leukemias,pmh_liveribdca,pmh_lowbirthwt,pmh_lungexternl,pmh_lymphenlarg,pmh_maintchemr,pmh_maligneopls,pmh_meningitis,pmh_menstrualdx,pmh_miscellaneousmentalhealthdisorders,pmh_mooddisorders,pmh_mouthdx,pmh_ms,pmh_multmyeloma,pmh_mycoses,
pmh_neoplsmunsp,pmh_nephritis,pmh_nonepithca,pmh_nonhodglym,pmh_nutritdefic,pmh_opnwndextr,pmh_osteoarthros,pmh_osteoporosis,pmh_otaftercare,pmh_otbnignneo,pmh_otbonedx,pmh_otcirculdx,pmh_otcomplbir,pmh_otconntiss,pmh_otdxbladdr,pmh_otdxkidney,pmh_otdxstomch,pmh_otendodsor,pmh_otfemalgen,pmh_othbactinf,pmh_othcnsinfx,pmh_othematldx,pmh_othercvd,pmh_othereardx,pmh_otheredcns,pmh_othereyedx,pmh_othergidx,pmh_othergudx,pmh_otherinjury,pmh_otherpregnancyanddeliveryincludingnormal,pmh_otherscreen,pmh_othfracture,pmh_othheartdx,pmh_othliverdx,pmh_othlowresp,pmh_othmalegen,pmh_othnervdx,pmh_othveindx,pmh_otinflskin,pmh_otitismedia,pmh_otjointdx,pmh_otnutritdx,pmh_otprimryca,pmh_otrespirca,pmh_otupprresp,pmh_otuprspin,pmh_ovariancyst,pmh_pancreascan,pmh_pancreasdx,pmh_paralysis,pmh_parkinsons,pmh_pathologfx,pmh_peripathero,pmh_peritonitis,pmh_personalitydisorders,pmh_phlebitis,pmh_pid,pmh_pleurisy,pmh_pneumonia,pmh_poisnotmed,pmh_precereoccl,pmh_prevcsectn,pmh_prolapse,pmh_prostatecan,pmh_pulmhartdx,pmh_rctmanusca,pmh_rehab,pmh_respdistres,pmh_retinaldx,pmh_rheumarth,pmh_schizophreniaandotherpsychoticdisorde,pmh_screeningandhistoryofmentalhealthan,pmh_septicemia,pmh_sexualinfxs,pmh_shock,pmh_sicklecell,pmh_skininfectn,pmh_skinmelanom,pmh_socialadmin,pmh_spincorinj,pmh_stomchcancr,pmh_substancerelateddisorders,pmh_suicideandintentionalselfinflictedin,pmh_superficinj,pmh_syncope,pmh_teethdx,pmh_testiscancr,pmh_thyroiddsor,pmh_tia,pmh_tonsillitis,pmh_ulceratcol,pmh_ulcerskin,pmh_unclassified,pmh_urinyorgca,pmh_uteruscancr,pmh_uti,pmh_varicosevn,pmh_viralinfect,pmh_whtblooddx,huse_n_edvisits,huse_n_admissions,triage_vital_hr,triage_vital_sbp,triage_vital_dbp,triage_vital_rr,triage_vital_temp,huse_n_surgeries,cc_abdominalcramping,cc_abdominaldistention,cc_abdominalpain,cc_abdominalpainpregnant,cc_abnormallab,cc_abscess,cc_addictionproblem,cc_alcoholintoxication,cc_alcoholproblem,cc_allergicreaction,cc_alteredmentalstatus,cc_animalbite,cc_ankleinjury,cc_anklepain,cc_anxiety,cc_ar
minjury,cc_armpain,cc_assaultvictim,cc_asthma,cc_backpain,cc_bleeding/bruising,cc_bodyfluidexposure,cc_breastpain,cc_breathingdifficulty,cc_burn,cc_cardiacarrest,cc_chestpain,cc_coldlikesymptoms,cc_confusion,cc_conjunctivitis,cc_constipation,cc_cough,cc_cyst,cc_decreasedbloodsugar-symptomatic,cc_dehydration,cc_dentalpain,cc_depression,cc_detoxevaluation,cc_diarrhea,cc_dizziness,cc_drug/alcoholassessment,cc_drugproblem,cc_dyspnea,cc_dysuria,cc_earpain,cc_earproblem,cc_edema,cc_elbowpain,cc_elevatedbloodsugar-nosymptoms,cc_elevatedbloodsugar-symptomatic,cc_emesis,cc_epigastricpain,cc_epistaxis,cc_exposuretostd,cc_extremitylaceration,cc_extremityweakness,cc_eyeinjury,cc_eyepain,cc_eyeproblem,cc_eyeredness,cc_facialinjury,cc_faciallaceration,cc_facialpain,cc_facialswelling,cc_fall,cc_fall>65,cc_fatigue,cc_femaleguproblem,cc_fever,cc_fever-75yearsorolder,cc_fever-9weeksto74years,cc_feverimmunocompromised,cc_fingerinjury,cc_fingerpain,cc_fingerswelling,cc_flankpain,cc_follow-upcellulitis,cc_footinjury,cc_footpain,cc_footswelling,cc_foreignbodyineye,cc_fulltrauma,cc_generalizedbodyaches,cc_gibleeding,cc_giproblem,cc_groinpain,cc_handinjury,cc_handpain,cc_headache,cc_headache-newonsetornewsymptoms,cc_headache-recurrentorknowndxmigraines,cc_headachere-evaluation,cc_headinjury,cc_headlaceration,cc_hemoptysis,cc_hippain,cc_hypertension,cc_hypotension,cc_insectbite,cc_irregularheartbeat,cc_jawpain,cc_jointswelling,cc_kneeinjury,cc_kneepain,cc_laceration,cc_leginjury,cc_legpain,cc_legswelling,cc_lethargy,cc_lossofconsciousness,cc_maleguproblem,cc_mass,cc_medicalscreening,cc_medicationproblem,cc_medicationrefill,cc_migraine,cc_modifiedtrauma,cc_motorvehiclecrash,cc_multiplefalls,cc_nasalcongestion,cc_nausea,cc_nearsyncope,cc_neckpain,cc_neurologicproblem,cc_otalgia,cc_other,cc_overdose-intentional,cc_pain,cc_panicattack,cc_pelvicpain,cc_rapidheartrate,cc_rash,cc_rectalbleeding,cc_rectalpain,cc_respiratorydistress,cc_ribinjury,cc_ribpain,cc_seizure-newonset,cc_seizure-priorhxof,cc
_shortnessofbreath,cc_shoulderinjury,cc_shoulderpain,cc_sinusproblem,cc_skinirritation,cc_skinproblem,cc_sorethroat,cc_stdcheck,cc_strokealert,cc_suicidal,cc_suture/stapleremoval,cc_syncope,cc_tachycardia,cc_testiclepain,cc_thumbinjury,cc_tickremoval,cc_toeinjury,cc_toepain,cc_unresponsive,cc_uri,cc_urinaryfrequency,cc_urinaryretention,cc_urinarytractinfection,cc_vaginalbleeding,cc_vaginaldischarge,cc_vaginalpain,cc_weakness,cc_wheezing,cc_withdrawal-alcohol,cc_woundcheck,cc_woundinfection,cc_woundre-evaluation,cc_wristinjury,cc_wristpain,hist_glucose_median,meds_antihyperglycemics,meds_anti-obesitydrugs,meds_hormones,triage_vital_hr_imputed_flag,triage_vital_sbp_imputed_flag,triage_vital_dbp_imputed_flag,triage_vital_rr_imputed_flag,triage_vital_temp_imputed_flag,hist_glucose_median_imputed_flag
0,80-89,Female,Hispanic or Latino,Retired,Medicare,0,Admit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,no_prior_visits,No_prior_admis,normal_hr,hypertension(high)_sbp,pre-hypertension_dbp,normal_rr,normal_temp,low_Surg,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Normal,no_antihyperglycemics,0.0,no_hormones,0,0,0,0,0,1
1,80-89,Female,Hispanic or Latino,Retired,Medicare,1,Discharge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,low_prior_visit,No_prior_admis,normal_hr,hypertension(high)_sbp,normal_dbp,normal_rr,normal_temp,low_Surg,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Normal,1-to-2_antihyperglycemics,0.0,no_hormones,0,0,0,0,0,0
2,80-89,Female,Hispanic or Latino,Retired,Medicare,0,Admit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,low_prior_visit,low_prior_admis,normal_hr,hypertension(high)_sbp,normal_dbp,normal_rr,normal_temp,moderate_Surg,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Normal,no_antihyperglycemics,0.0,no_hormones,0,0,0,0,0,0
3,50-59,Male,Hispanic or Latino,Disabled,Medicare,1,No previous dispo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,no_prior_visits,No_prior_admis,normal_hr,pre-hypertension_sbp,pre-hypertension_dbp,normal_rr,normal_temp,low_Surg,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Normal,no_antihyperglycemics,0.0,no_hormones,1,1,1,1,1,1
4,50-59,Male,Hispanic or Latino,Disabled,Medicare,1,Admit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,low_prior_visit,low_prior_admis,normal_hr,pre-hypertension_sbp,pre-hypertension_dbp,normal_rr,normal_temp,low_Surg,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Normal,no_antihyperglycemics,0.0,no_hormones,1,1,1,1,1,1


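Upon initial inspection, I can see that the `prior medical history` supercategory columns are floats, although they only hold 0/1 indicators. I will convert them, along with the `chief complaint` columns, to integers so that both supercategories share the same type.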

In [603]:
pmh = health_data.filter(like='pmh_').columns
cc = health_data.filter(like='cc_').columns

In [604]:
# Convert columns to integer type
health_data[pmh] = health_data[pmh].astype(int)
health_data[cc] = health_data[cc].astype(int)
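
Casting with `astype(int)` raises an error when a column contains missing values and silently truncates floats that are not whole numbers. A minimal sketch of a check that could run before the cast, using a small made-up frame (`toy` and the helper `safe_to_int` are illustrative names, not part of this notebook):

```python
import pandas as pd

# Toy stand-in for the pmh_/cc_ indicator columns (values are made up)
toy = pd.DataFrame({"pmh_a": [0.0, 1.0, 0.0], "cc_b": [1.0, 0.0, 1.0]})

def safe_to_int(df, columns):
    """Cast columns to int only after confirming they hold whole, non-missing numbers."""
    block = df[columns]
    if block.isna().any().any():
        raise ValueError("Missing values present; impute or drop before casting.")
    if not (block == block.round()).all().all():
        raise ValueError("Non-whole values present; casting would truncate them.")
    # astype with a dict casts only the listed columns and returns a new frame
    return df.astype({c: int for c in columns})

toy_int = safe_to_int(toy, ["pmh_a", "cc_b"])
print(toy_int.dtypes)
```

If either check fails, the error points to a cleaning step that is still outstanding rather than letting the cast hide it.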

---
#### <a id = 'demo'></a> 1.2 Pre-Processing Demographic Data

Before we can start dummying our data, we first need to clean up the categories. We will do this supercategory by supercategory, starting with the `demographic` supercategory.

In [605]:
demographic = health_data.filter(like='demo_').columns
print(f'The columns that make up the demographic super category are : {demographic}')

The columns that make up the demographic super category are : Index(['demo_age', 'demo_gender', 'demo_race', 'demo_employstatus',
       'demo_insurance_status'],
      dtype='object')


Let's now get the count of unique entries of each of these. 

In [606]:
# Initialize an empty dictionary to store counts for each column
counts_dict = {}

# Iterate through each column in the 'demographic' list
for column in demographic:
    # Get counts of unique entries for the current column
    counts = health_data[column].value_counts()
    
    # Store the counts in the dictionary
    counts_dict[column] = counts

# Print or use the counts for each column
for column, counts in counts_dict.items():
    print(f"Counts for {column}:")
    print(counts)
    print()

Counts for demo_age:
demo_age
50-59      23894
60-69      22236
70-79      17939
40-49      13088
80-89      13019
30-39       7397
90-99       4723
18-29       3195
100-109      136
Name: count, dtype: int64

Counts for demo_gender:
demo_gender
Female    59101
Male      46526
Name: count, dtype: int64

Counts for demo_race:
demo_race
White or Caucasian                           51082
Black or African American                    34317
Hispanic or Latino                           19405
Asian                                          660
American Indian or Alaska Native               104
Native Hawaiian or Other Pacific Islander       59
Name: count, dtype: int64

Counts for demo_employstatus:
demo_employstatus
Retired                    42205
Not Employed               22417
Disabled                   21139
Full Time                  13310
Part Time                   4538
Self Employed               1546
Student - Full Time          402
Student - Part Time           66
On Active Military

Age
- is ready to be dummied. No further processing is required.

Gender
- needs to be binarized: Male = 1 and Female = 0

Race
- needs to be binarized: White = 1 and Person_of_Colour = 0

EmployStatus
- We will change the groupings to Employed (Full Time, Part Time, Self Employed, and On Active Military Duty), Student (both full and part time), and Not Employed (Not Employed, Disabled, and Retired)
- These will then be dummied

Insurance Status
- We will change the groupings to Federal (Medicare and Medicaid) and Private (Commercial, Other, and Self pay)

---
Let's start with Age. Although no regrouping is needed, we still need to dummy this column. The category with the highest occurrence will be dropped to avoid multicollinearity. In this case, it will be 50-59, which also serves as the reference level.

In [607]:
(health_data['demo_age'].value_counts()/health_data['demo_age'].count())*100

demo_age
50-59      22.621110
60-69      21.051436
70-79      16.983347
40-49      12.390771
80-89      12.325447
30-39       7.002944
90-99       4.471395
18-29       3.024795
100-109     0.128755
Name: count, dtype: float64
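
The same percentages can also be obtained with the `normalize` argument of `value_counts`, which, like `count()`, ignores missing values by default. A minimal sketch on made-up data (the series `ages` is illustrative only):

```python
import pandas as pd

ages = pd.Series(["50-59", "50-59", "60-69", "70-79"], name="demo_age")

# normalize=True returns proportions of the non-missing values;
# multiplying by 100 turns them into percentages
pct = ages.value_counts(normalize=True) * 100
print(pct)
```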

In [608]:
# Create dummy variables
dummy_age = pd.get_dummies(health_data['demo_age'], prefix='demo_age', dtype=int)

# Concatenate the dummy variables with the original DataFrame
health_data = pd.concat([health_data, dummy_age], axis=1)

In [609]:
#identify dummy columns
dummy_age = health_data.filter(like='demo_age_').columns 

# Identify the column with the most instances
most_instances_column = health_data[dummy_age].sum().idxmax()

print(f'The dummy column with the most instances is: {most_instances_column}')

The dummy column with the most instances is: demo_age_50-59


In [610]:
# Drop the dummy column with the most instances (it becomes the reference level)
health_data = health_data.drop(most_instances_column, axis=1)

In [611]:
#Drop the original column 
health_data = health_data.drop(columns='demo_age')
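
The dummy, find-the-largest, drop sequence above is repeated for several columns in this notebook, so it could be wrapped in a small helper. The following is only a sketch on made-up data; the function name `dummy_with_reference` and the frame `toy` are hypothetical and not part of the notebook:

```python
import pandas as pd

def dummy_with_reference(df, column, prefix=None):
    """One-hot encode `column`, then drop its most frequent level as the reference."""
    prefix = prefix or column
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    # Pick the reference only among the dummies created for this column
    reference = dummies.sum().idxmax()
    dummies = dummies.drop(columns=reference)
    # Replace the original column with the remaining dummies
    out = pd.concat([df.drop(columns=column), dummies], axis=1)
    return out, reference

toy = pd.DataFrame({"demo_age": ["50-59", "50-59", "60-69", "70-79"], "y": [1, 0, 1, 0]})
encoded, ref = dummy_with_reference(toy, "demo_age")
print(ref)
print(encoded.columns.tolist())
```

Returning the name of the dropped level makes it easy to record which category each coefficient will later be compared against.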

---
With Gender, we will binarize with Male = 1 and Female = 0

In [612]:
(health_data['demo_gender'].value_counts()/health_data['demo_gender'].count())*100

demo_gender
Female    55.95255
Male      44.04745
Name: count, dtype: float64

In [613]:
#Map the columns accordingly
health_data['demo_gender'] = health_data['demo_gender'].map({'Male': 1, 'Female': 0})

#Ensure the mapping worked accordingly
health_data['demo_gender'].value_counts()

demo_gender
0    59101
1    46526
Name: count, dtype: int64

---
Now moving onto Race, we will binarize with White = 1 and Person_of_Colour = 0

In [614]:
(health_data['demo_race'].value_counts()/health_data['demo_race'].count())*100

demo_race
White or Caucasian                           48.360741
Black or African American                    32.488852
Hispanic or Latino                           18.371250
Asian                                         0.624840
American Indian or Alaska Native              0.098460
Native Hawaiian or Other Pacific Islander     0.055857
Name: count, dtype: float64

In [615]:
# Map the categories accordingly
health_data['demo_race'] = health_data['demo_race'].map({
    'White or Caucasian': 1, 'Black or African American': 0, 'Hispanic or Latino': 0,
    'Asian': 0, 'American Indian or Alaska Native': 0, 'Native Hawaiian or Other Pacific Islander': 0})

#Making sure the mapping worked 
health_data['demo_race'].value_counts()

demo_race
0    54545
1    51082
Name: count, dtype: int64
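
An alternative way to build a one-versus-rest indicator is to compare each value against the reference label directly. One difference worth keeping in mind is that a missing value becomes 0 under this comparison, whereas `map` would leave it as NaN. A sketch on made-up data (the series `race` is illustrative only):

```python
import pandas as pd

race = pd.Series(["White or Caucasian", "Asian", "Hispanic or Latino", "White or Caucasian"])

# 1 where the value equals the reference label, 0 otherwise
race_binary = race.eq("White or Caucasian").astype(int)
print(race_binary.tolist())
```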

---
Now moving on to employment status, we will change the groupings to Employed (Full Time, Part Time, Self Employed, and On Active Military Duty), Student (both full and part time), and Not Employed (Not Employed, Disabled, and Retired). Afterwards, we will dummy the regrouped column and, as before, drop the dummy with the most instances to reduce multicollinearity. That group becomes the reference level; in this case, it is individuals who are not employed.

In [616]:
(health_data['demo_employstatus'].value_counts()/health_data['demo_employstatus'].count())*100

demo_employstatus
Retired                    39.956640
Not Employed               21.222793
Disabled                   20.012875
Full Time                  12.600945
Part Time                   4.296250
Self Employed               1.463641
Student - Full Time         0.380585
Student - Part Time         0.062484
On Active Military Duty     0.003787
Name: count, dtype: float64

In [617]:
# Map the categories to their new groups
# (keys must match the raw labels exactly; labels without a matching key become NaN)
health_data['demo_employstatus'] = health_data['demo_employstatus'].map({
    'Full Time': 'Employed', 'Part Time': 'Employed', 'Self Employed': 'Employed', 'On Active Military Duty': 'Employed',
    'Student - Full Time': 'Student', 'Student - Part Time': 'Student',
    'Retired': 'Not_Employed', 'Not Employed': 'Not_Employed', 'Disabled': 'Not_Employed'})

# Make sure the mapping worked
health_data['demo_employstatus'].value_counts()

demo_employstatus
Not_Employed    85761
Employed        19398
Student           468
Name: count, dtype: int64
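
`Series.map` returns NaN for any value that has no matching key in the dictionary, so a typo or a stray space in a key silently removes rows from the regrouped column. A minimal sketch of a coverage check on made-up data (`status` and `grouping` are illustrative names, and the mapping is deliberately incomplete):

```python
import pandas as pd

status = pd.Series(["Full Time", "Retired", "Not Employed", "Student - Full Time"])

# Deliberately incomplete mapping to show what the check catches
grouping = {"Full Time": "Employed", "Retired": "Not_Employed"}

# Categories present in the data but absent from the mapping keys
unmapped = sorted(set(status.dropna().unique()) - set(grouping))
print(unmapped)

mapped = status.map(grouping)
# Rows that became NaN because their category had no key
n_lost = int(mapped.isna().sum() - status.isna().sum())
print(n_lost)
```

Comparing the total of `value_counts()` before and after the mapping gives the same signal: the totals should match when every category is covered.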

We will now need to convert these into dummy columns

In [618]:
# Create dummy variables
dummy_employ = pd.get_dummies(health_data['demo_employstatus'], prefix='demo_employstatus', dtype=int)

# Concatenate the dummy variables with the original DataFrame
health_data = pd.concat([health_data, dummy_employ], axis=1)

In [619]:
# Identify dummy columns
dummy_employ = health_data.filter(like='demo_employstatus_').columns

# Identify the column with the most instances
most_instances_column = health_data[dummy_employ].sum().idxmax()

print(f'The dummy column with the most instances is: {most_instances_column}')

The dummy column with the most instances is: demo_employstatus_Not_Employed


In [620]:
# Drop the dummy column with the most instances (it becomes the reference level)
health_data = health_data.drop(most_instances_column, axis=1)

In [621]:
#Drop the original column 
health_data = health_data.drop(columns='demo_employstatus')

---
Finally, for insurance status, we will change the groupings to Federal (Medicare and Medicaid) and Private (Commercial, Other, and Self pay).

Afterwards, we will dummy the regrouped column and drop the dummy with the most instances to reduce multicollinearity, so that the largest group becomes the reference level.

In [622]:
(health_data['demo_insurance_status'].value_counts()/health_data['demo_insurance_status'].count())*100

demo_insurance_status
Medicare      38.352883
Commercial    30.030201
Medicaid      27.797817
Other          3.537921
Self pay       0.281178
Name: count, dtype: float64

In [623]:
# Map the categories to their new groups
health_data['demo_insurance_status'] = health_data['demo_insurance_status'].map({
    'Medicare': 'Federal', 'Medicaid': 'Federal',
    'Commercial': 'Private', 'Other': 'Private', 'Self pay': 'Private'})

#Making sure the mapping worked 
health_data['demo_insurance_status'].value_counts()

demo_insurance_status
Federal    69873
Private    35754
Name: count, dtype: int64

We will now need to convert these into dummy columns

In [624]:
# Create dummy variables
dummy_insurance = pd.get_dummies(health_data['demo_insurance_status'], prefix='demo_insurance_status', dtype=int)

# Concatenate the dummy variables with the original DataFrame
health_data = pd.concat([health_data, dummy_insurance], axis=1)

In [625]:
# Identify dummy columns
dummy_insurance = health_data.filter(like='demo_insurance_status_').columns

# Identify the column with the most instances
most_instances_column = health_data[dummy_insurance].sum().idxmax()

print(f'The dummy column with the most instances is: {most_instances_column}')

The dummy column with the most instances is: demo_insurance_status_Federal


In [626]:
# Drop the dummy column with the most instances (it becomes the reference level)
health_data = health_data.drop(most_instances_column, axis=1)

In [627]:
#Drop the original column 
health_data = health_data.drop(columns='demo_insurance_status')

------
#### <a id = 'num'></a> 1.3 Pre-Processing Triage Numerical Data


As with the demographic data, we need to clean up the categories before dummying. Now let's move on to the triage vitals supercategory (`triage_num`).

In [628]:
triage_num = health_data.filter(like='triage_vital').columns
print(f'The columns that make up the super category are : {triage_num}')

The columns that make up the super category are : Index(['triage_vital_hr', 'triage_vital_sbp', 'triage_vital_dbp',
       'triage_vital_rr', 'triage_vital_temp', 'triage_vital_hr_imputed_flag',
       'triage_vital_sbp_imputed_flag', 'triage_vital_dbp_imputed_flag',
       'triage_vital_rr_imputed_flag', 'triage_vital_temp_imputed_flag'],
      dtype='object')


In [629]:
# Keep only the five vital-sign columns; the *_imputed_flag columns are already binary
triage_num = ['triage_vital_hr', 'triage_vital_sbp', 'triage_vital_dbp',
              'triage_vital_rr', 'triage_vital_temp']

Let's explore the values in these columns to get a better idea of which categories will become the reference levels once we dummy the columns and drop the most frequent category.

In [630]:
for column in triage_num:
    # Calculate the value counts percentage for each unique value in the column
    value_counts_percentage = (health_data[column].value_counts() / health_data[column].count()) * 100

    # Print the results
    print(f"Value counts percentage for column '{column}':\n{value_counts_percentage}\n")

Value counts percentage for column 'triage_vital_hr':
triage_vital_hr
normal_hr               85.940148
tachycardia(high)_hr    11.678832
bradycardia(low)_hr      2.340311
critical_hr              0.040709
Name: count, dtype: float64

Value counts percentage for column 'triage_vital_sbp':
triage_vital_sbp
hypertension(high)_sbp    48.451627
pre-hypertension_sbp      33.904210
normal_sbp                13.492762
critical_sbp               3.554016
hypotension(low)_sbp       0.597385
Name: count, dtype: float64

Value counts percentage for column 'triage_vital_dbp':
triage_vital_dbp
normal_dbp                51.875941
pre-hypertension_dbp      28.942411
hypertension(high)_dbp    13.831691
hypotension(low)_dbp       4.603936
critical_dpb               0.746021
Name: count, dtype: float64

Value counts percentage for column 'triage_vital_rr':
triage_vital_rr
normal_rr             97.356736
tachypnea(high)_rr     2.480426
Critical_rr            0.138222
bradypnea(low)_rr      0.024615
Name:

We will now convert these into dummy columns. This needs to be done one column at a time so that we can remove the most frequent category for each column, which I can do with a loop.

In [631]:
# Loop through each column
for column in triage_num:
    # Create dummy variables
    dummy_col = pd.get_dummies(health_data[column], prefix='dum_triage_vital_', dtype=int)

    # Concatenate the dummy variables with the original DataFrame
    health_data = pd.concat([health_data, dummy_col], axis=1)

    # Identify the dummy with the most instances, looking only at the dummies
    # created for the current column (not at dummies left over from earlier iterations)
    most_instances_column = dummy_col.sum().idxmax()

    # Drop the column with the most instances
    health_data = health_data.drop(columns=most_instances_column)

    print(f"Column '{most_instances_column}' with the most instances for '{column}' was dropped.")

Column 'dum_triage_vital__normal_hr' with the most instances for 'triage_vital_hr' was dropped.
Column 'dum_triage_vital__hypertension(high)_sbp' with the most instances for 'triage_vital_sbp' was dropped.
Column 'dum_triage_vital__normal_dbp' with the most instances for 'triage_vital_dbp' was dropped.
Column 'dum_triage_vital__normal_rr' with the most instances for 'triage_vital_rr' was dropped.
Column 'dum_triage_vital__normal_temp' with the most instances for 'triage_vital_temp' was dropped.


In [632]:
#Drop the original column 
health_data = health_data.drop(columns=triage_num)
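
One thing worth checking in a loop like this is which columns are searched when picking the reference level. The sketch below uses made-up data (the frame `toy` and the function `encode` are hypothetical) to contrast a search over every column that shares the `dum_` prefix with a search limited to the dummies just created for the current column:

```python
import pandas as pd

toy = pd.DataFrame({
    "a": ["x"] * 5 + ["y"] * 4,
    "b": ["p"] * 3 + ["q"] * 2 + ["r"] * 2 + ["s"] * 2,
})

def encode(df, columns, per_column):
    """Dummy each column, drop one reference per column, and return the dropped names."""
    out = df.copy()
    dropped = []
    for column in columns:
        dummies = pd.get_dummies(out[column], prefix="dum", dtype=int)
        out = pd.concat([out, dummies], axis=1)
        # per_column=True searches only the new dummies;
        # per_column=False searches every remaining 'dum_' column
        candidates = dummies if per_column else out.filter(like="dum_")
        reference = candidates.sum().idxmax()
        out = out.drop(columns=reference)
        dropped.append(reference)
    return dropped

print(encode(toy, ["a", "b"], per_column=False))
print(encode(toy, ["a", "b"], per_column=True))
```

With the shared-prefix search, a leftover dummy from column `a` outnumbers every level of `b`, so `b` keeps all of its dummies and `a` loses both. Limiting the search to the current column's dummies avoids depending on the relative sizes of categories across columns.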

---
#### <a id = 'huse'></a> 1.4 Pre-Processing Hospital Usage Data

Now let's move on to the `Hospital Use` supercategory.

In [633]:
hos_use = health_data.filter(like='huse_').columns
print(f'The columns that make up the super category are : {hos_use}')

The columns that make up the super category are : Index(['huse_previousdispo', 'huse_n_edvisits', 'huse_n_admissions',
       'huse_n_surgeries'],
      dtype='object')


In [634]:
for column in hos_use:
    # Calculate the value counts percentage for each unique value in the column
    value_counts_percentage = (health_data[column].value_counts() / health_data[column].count()) * 100

    # Print the results
    print(f"Value counts percentage for column '{column}':\n{value_counts_percentage}\n")

Value counts percentage for column 'huse_previousdispo':
huse_previousdispo
Discharge                       42.838479
Admit                           36.883562
No previous dispo               17.641323
Transfer to Another Facility     1.029093
LWBS after Triage                0.612533
AMA                              0.601172
Eloped                           0.219641
LWBS before Triage               0.095619
Observation                      0.070058
Send to L&D                      0.008521
Name: count, dtype: float64

Value counts percentage for column 'huse_n_edvisits':
huse_n_edvisits
low_prior_visit         63.127799
no_prior_visits         25.948858
moderate_prior_visit     9.912238
high_prior_visit         1.011105
Name: count, dtype: float64

Value counts percentage for column 'huse_n_admissions':
huse_n_admissions
low_prior_admis         49.308415
No_prior_admis          47.970689
moderate_prior_admis     2.315696
high_prior_admis         0.266977
vhigh_prior_admis        0.138

As we can see, there are still too many categories in the previous disposition column. I will collapse the outcomes into three groups: previously admitted, previously discharged, and no previous disposition. Patients who LWBS before/after triage (left without being seen), left AMA (against medical advice), Eloped (left the hospital when doing so may present an imminent threat to the patient's health), or were under Observation (kept in the ED for an extended period to determine the need for admission) are considered discharged. Patients who were transferred to another facility or sent to L&D are considered admitted. 

In [635]:
#Map every category explicitly: .map() turns any value missing from the dictionary into NaN
health_data['huse_previousdispo'] = health_data['huse_previousdispo'].map({
    'Admit':'prev_dispo_Admit', 'Send to L&D':'prev_dispo_Admit', 'Transfer to Another Facility':'prev_dispo_Admit',
    'Discharge':'prev_dispo_Discharge', 'LWBS after Triage':'prev_dispo_Discharge', 'AMA':'prev_dispo_Discharge',
    'LWBS before Triage':'prev_dispo_Discharge', 'Eloped':'prev_dispo_Discharge', 'Observation':'prev_dispo_Discharge',
    'No previous dispo':'prev_dispo_None'})

In [636]:
#Making sure the mapping worked 
health_data['huse_previousdispo'].value_counts()

huse_previousdispo
prev_dispo_Discharge    46938
prev_dispo_Admit        40055
prev_dispo_None         18634
Name: count, dtype: int64

We can now dummy the columns in a loop, dropping the most frequent level of each column so that it serves as the reference category.

In [637]:
# Loop through each column 
for column in hos_use:
    # Create dummy variables
    dummy_col = pd.get_dummies(health_data[column], prefix='dum_huse_', dtype=int)

    # Identify the most frequent level among this column's dummies only
    most_instances_column = dummy_col.sum().idxmax()

    # Concatenate the dummies with the original DataFrame, leaving out the most frequent level
    health_data = pd.concat([health_data, dummy_col.drop(columns=most_instances_column)], axis=1)
    
    print(f"Column '{most_instances_column}' with the most instances for '{column}' was dropped.")

Column 'dum_huse__prev_dispo_Discharge' with the most instances for 'huse_previousdispo' was dropped.
Column 'dum_huse__low_prior_visit' with the most instances for 'huse_n_edvisits' was dropped.
Column 'dum_huse__low_prior_admis' with the most instances for 'huse_n_admissions' was dropped.
Column 'dum_huse__low_Surg' with the most instances for 'huse_n_surgeries' was dropped.
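
Because `Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, it can be worth confirming that the disposition mapping covered every category before dummying. A minimal sketch on toy data follows; the series and the deliberately incomplete `dispo_map` are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy data standing in for health_data['huse_previousdispo'] (illustrative values).
dispo = pd.Series(['Discharge', 'Admit', 'Eloped', 'No previous dispo', 'Observation'])

# Hypothetical mapping that deliberately omits 'Observation'.
dispo_map = {
    'Discharge': 'prev_dispo_Discharge',
    'Admit': 'prev_dispo_Admit',
    'Eloped': 'prev_dispo_Discharge',
    'No previous dispo': 'prev_dispo_None',
}

mapped = dispo.map(dispo_map)

# Categories present in the data but missing from the dictionary become NaN.
unmapped = sorted(set(dispo.dropna().unique()) - set(dispo_map))
print(unmapped)             # ['Observation']
print(mapped.isna().sum())  # 1
```

An empty `unmapped` list means every category was assigned a group.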


In [638]:
#Drop the original column 
health_data = health_data.drop(columns=hos_use)

---
#### <a id = 'lab'></a> 1.5 Pre-Processing Historical Lab Data

Now let's repeat the process for the `historical labs` supercategory.

In [639]:
hist_labs = health_data.filter(like='hist_').columns
print(f'The columns that make up the super category are : {hist_labs}')

The columns that make up the super category are : Index(['hist_glucose_median', 'hist_glucose_median_imputed_flag'], dtype='object')


In [640]:
hist_labs = ['hist_glucose_median'] #the other one is an imputed flag

Let's look at the distribution of its values.

In [641]:
(health_data['hist_glucose_median'].value_counts()/health_data['hist_glucose_median'].count())*100

hist_glucose_median
Normal             76.987891
>200(high)         15.821712
>300(very high)     7.190396
Name: count, dtype: float64

In [642]:
# Create dummy variables
dummy_glucose = pd.get_dummies(health_data['hist_glucose_median'], prefix='dum_hist_glucose_median_', dtype=int)

# Concatenate the dummy variables with the original DataFrame
health_data = pd.concat([health_data, dummy_glucose], axis=1)

In [643]:
#Identify dummy columns
dummy_glucose_cols = health_data.filter(like='dum_hist_glucose_median_').columns 

# Identify the column with the most instances
most_instances_column = health_data[dummy_glucose_cols].sum().idxmax()

print(f'The dummy column with the most instances is: {most_instances_column}')

The dummy column with the most instances is: dum_hist_glucose_median__Normal


In [644]:
# Drop the dummy column with the most instances (the reference category)
health_data = health_data.drop(most_instances_column, axis=1)

In [645]:
#Drop the original column 
health_data = health_data.drop(columns='hist_glucose_median')

---
#### <a id = 'med'></a> 1.6 Pre-Processing Meds Data

Now let's repeat the process for the `meds` supercategory.

In [646]:
meds = health_data.filter(like='meds_').columns
print(f'The columns that make up the super category are : {meds}')

The columns that make up the super category are : Index(['meds_antihyperglycemics', 'meds_anti-obesitydrugs', 'meds_hormones'], dtype='object')


In [647]:
for column in meds:
    # Calculate the value counts percentage for each unique value in the column
    value_counts_percentage = (health_data[column].value_counts() / health_data[column].count()) * 100

    # Print the results
    print(f"Value counts percentage for column '{column}':\n{value_counts_percentage}\n")

Value counts percentage for column 'meds_antihyperglycemics':
meds_antihyperglycemics
no_antihyperglycemics        73.168792
1-to-2_antihyperglycemics    22.280288
3-to-6_antihyperglycemics     4.547133
7-plus_antihyperglycemics     0.003787
Name: count, dtype: float64

Value counts percentage for column 'meds_anti-obesitydrugs':
meds_anti-obesitydrugs
0.0    99.966865
1.0     0.033135
Name: count, dtype: float64

Value counts percentage for column 'meds_hormones':
meds_hormones
no_hormones        95.042934
1-to-2_hormones     4.909730
3-plus_hormones     0.047336
Name: count, dtype: float64
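
The percentage calculation used in these loops can also be written with the `normalize=True` option of `value_counts`, which returns proportions of the non-null values. A minimal sketch on a toy column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy column standing in for one of the meds_ columns (illustrative values).
col = pd.Series(['no_hormones'] * 3 + ['1-to-2_hormones'] + [None])

# The notebook's approach: counts divided by the number of non-null rows.
manual_pct = col.value_counts() / col.count() * 100

# Equivalent shortcut: normalize=True returns proportions of non-null values.
shortcut_pct = col.value_counts(normalize=True) * 100

print(shortcut_pct)
```

Both versions ignore missing values by default, so they give the same percentages.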



Although the anti-obesity drugs column is already binary, it is stored as floats, so I will convert it to integers. 

In [648]:
health_data['meds_anti-obesitydrugs'] = health_data['meds_anti-obesitydrugs'].astype(int)
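
Since `astype(int)` raises an error on missing values and truncates fractional ones, a quick check before casting can be useful. The sketch below uses a toy series as a stand-in for the column (illustrative values only):

```python
import pandas as pd

# Toy stand-in for health_data['meds_anti-obesitydrugs'] (illustrative values).
flag = pd.Series([0.0, 1.0, 0.0, 0.0])

# Confirm the column holds only 0.0 and 1.0 (NaN fails this check too)
# before casting, so nothing is truncated or raises during the conversion.
assert flag.isin([0.0, 1.0]).all(), "column is not a clean 0/1 flag"

flag = flag.astype(int)
print(flag.tolist())  # [0, 1, 0, 0]
```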

In [649]:
meds = ['meds_antihyperglycemics', 'meds_hormones'] #as the anti-obesitydrugs are already binary

In [650]:
# Loop through each column 
for column in meds:
    # Create dummy variables (note: the 'dum_huse_' prefix from the hospital usage section is reused here)
    dummy_col = pd.get_dummies(health_data[column], prefix='dum_huse_', dtype=int)

    # Identify the most frequent level among this column's dummies only
    most_instances_column = dummy_col.sum().idxmax()

    # Concatenate the dummies with the original DataFrame, leaving out the most frequent level
    health_data = pd.concat([health_data, dummy_col.drop(columns=most_instances_column)], axis=1)
    
    print(f"Column '{most_instances_column}' with the most instances for '{column}' was dropped.")

Column 'dum_huse__no_antihyperglycemics' with the most instances for 'meds_antihyperglycemics' was dropped.
Column 'dum_huse__no_hormones' with the most instances for 'meds_hormones' was dropped.
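
The dummy-and-drop pattern repeated across these sections could be wrapped in one helper. Dropping the most frequent level makes it the reference category and avoids perfectly collinear dummy columns. The function name `dummy_drop_most_frequent` and the `dum_meds` prefix below are hypothetical, and the data is a toy example:

```python
import pandas as pd

def dummy_drop_most_frequent(df, column, prefix):
    """One-hot encode `column` and drop its most frequent level (the reference)."""
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    # Look only at this column's dummies so earlier dummies cannot be picked.
    reference = dummies.sum().idxmax()
    dummies = dummies.drop(columns=reference)
    # Replace the original column with the remaining dummy columns.
    out = pd.concat([df.drop(columns=column), dummies], axis=1)
    return out, reference

# Toy data (illustrative values, not the notebook's data).
toy = pd.DataFrame({'meds_hormones': ['no_hormones', 'no_hormones', '1-to-2_hormones']})
encoded, dropped = dummy_drop_most_frequent(toy, 'meds_hormones', prefix='dum_meds')
print(dropped)                # dum_meds_no_hormones
print(list(encoded.columns))  # ['dum_meds_1-to-2_hormones']
```

Computing the most frequent level from the freshly created dummies, rather than from every column sharing the prefix, keeps the choice independent of columns encoded earlier.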


In [651]:
#Drop the original column 
health_data = health_data.drop(columns=meds)

---
### <a id = 'conc'></a> Conclusion

Let's first confirm that every column has been successfully dummied and converted to an integer type.

In [652]:
health_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105627 entries, 0 to 105626
Columns: 443 entries, demo_gender to dum_huse__3-plus_hormones
dtypes: int64(443)
memory usage: 357.0 MB
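
`info()` confirms that every column has an integer dtype, but it does not by itself show that every value is 0 or 1. A minimal sketch of an explicit check, using a toy frame in place of `health_data` (illustrative values only):

```python
import pandas as pd

# Toy frame standing in for health_data (illustrative values).
toy = pd.DataFrame({'a': [0, 1, 1], 'b': [0, 0, 1], 'c': [0, 2, 1]})

# Collect the columns whose values fall outside {0, 1}.
non_binary = [col for col in toy.columns if not toy[col].isin([0, 1]).all()]
print(non_binary)  # ['c']
```

An empty list would indicate that every column is strictly binary.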


As we can see, all 443 columns are now `int64`, so the data is ready to be used for modeling. We will now export it to a new CSV file to begin vanilla modeling.

In [653]:
health_data.to_csv('clean_health_data_pt2.csv', index=False) 

----