# Reducing Bias in Machine Learning Models:
### Modeling Part 1: Vanilla Models (Basic and Optimized)


#### Kristen Lo - BrainStation
---

### Table of Contents
- [Introduction](#intro)
- [Part 3: Vanilla Models (Basic and Optimized)](#clean)
    - 1.1: [Housekeeping](#housekeeping)
    - 1.2: [Logistic Regression](#van)
- [Conclusion](#conc)


---
### <a id = 'intro'></a> Introduction

In this notebook, we build a vanilla (baseline) model and then tune it for performance.

There is a need to predict hospital admission for diabetic patients. However, traditional machine learning models can perpetuate health disparities when the training data is biased with respect to demographic attributes (e.g. race, age, income, or insurance status). These biases need to be identified and mitigated before modelling so that they are not carried into the model. Building on the work of Raza, S., who aimed to predict, diagnose, and mitigate health disparities in hospital re-admission, my aim is to replicate that study and create my own model that screens for bias and predicts admission for diabetic patients visiting the ER.


Data was sourced from all adult Emergency Department visits from March 2014 to July 2017 at one academic and two community emergency rooms that are part of the Yale New Haven Health system. Each visit resulted in either admission to the respective hospital or discharge.

A total of 972 variables were extracted per patient visit, across 560,486 patient visits.

Courtesy of:
 "Hong WS, Haimovich AD, Taylor RA (2018) Predicting hospital admission at emergency department triage using machine learning. PLoS ONE 13(7): e0201016." (https://doi.org/10.1371/journal.pone.0201016)




-----

## <a id = 'clean'></a> Part 3: Vanilla Modelling (Basic and Optimized) 

---
#### <a id = 'housekeeping'></a> 1.1 Housekeeping

Loading the necessary libraries

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler




Loading the clean csv file

In [2]:
health_data = pd.read_csv('Data/clean_health_data_pt2.csv')

In [6]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None) #Show all rows

In [7]:
health_data.head()

Unnamed: 0,demo_gender,demo_race,disposition,pmh_2ndarymalig,pmh_abdomhernia,pmh_acqfootdef,pmh_acrenlfail,pmh_acutecvd,pmh_acutemi,pmh_acutphanm,pmh_adltrespfl,pmh_alcoholrelateddisorders,pmh_amniosdx,pmh_anemia,pmh_aneurysm,pmh_anxietydisorders,pmh_artembolism,pmh_asppneumon,pmh_asthma,pmh_attentiondeficitconductdisruptivebeha,pmh_biliarydx,pmh_bladdercncr,pmh_blindness,pmh_bnignutneo,pmh_bonectcncr,pmh_bph,pmh_brainnscan,pmh_breastcancr,pmh_breastdx,pmh_brnchlngca,pmh_cardiaarrst,pmh_carditis,pmh_cataract,pmh_chfnonhp,pmh_chrkidneydisease,pmh_coaghemrdx,pmh_coloncancer,pmh_comabrndmg,pmh_complicdevi,pmh_complicproc,pmh_conduction,pmh_contraceptiv,pmh_copd,pmh_coronathero,pmh_crushinjury,pmh_cysticfibro,pmh_deliriumdementiaamnesticothercognitiv,pmh_developmentaldisorders,pmh_diabmelnoc,pmh_diabmelwcm,pmh_disordersusuallydiagnosedininfancych,pmh_diverticulos,pmh_dizziness,pmh_dminpreg,pmh_dysrhythmia,pmh_ecodesadverseeffectsofmedicalcare,pmh_ecodesfall,pmh_ecodesfirearm,pmh_ecodesmotorvehicletrafficmvt,pmh_ecodesotherspecifiedandclassifiable,pmh_encephalitis,pmh_endometrios,pmh_epilepsycnv,pmh_esophcancer,pmh_esophgealdx,pmh_eyeinfectn,pmh_femgenitca,pmh_feminfertil,pmh_fluidelcdx,pmh_fuo,pmh_fxarm,pmh_fxhip,pmh_fxskullfac,pmh_gangrene,pmh_gasduoulcer,pmh_gastritis,pmh_gastroent,pmh_giconganom,pmh_gihemorrhag,pmh_giperitcan,pmh_glaucoma,pmh_goutotcrys,pmh_guconganom,pmh_hdnckcancr,pmh_headachemig,pmh_hemorrpreg,pmh_hodgkinsds,pmh_hrtvalvedx,pmh_htn,pmh_htncomplicn,pmh_htninpreg,pmh_hyperlipidem,pmh_immunitydx,pmh_inducabortn,pmh_infectarth,pmh_influenza,pmh_infmalegen,pmh_intestinfct,pmh_intobstruct,pmh_intracrninj,pmh_jointinjury,pmh_kidnyrnlca,pmh_lateeffcvd,pmh_leukemias,pmh_liveribdca,pmh_lowbirthwt,pmh_lungexternl,pmh_lymphenlarg,pmh_maintchemr,pmh_maligneopls,pmh_meningitis,pmh_menstrualdx,pmh_miscellaneousmentalhealthdisorders,pmh_mooddisorders,pmh_mouthdx,pmh_ms,pmh_multmyeloma,pmh_mycoses,pmh_neoplsmunsp,pmh_nephritis,pmh_nonepithca,pmh_nonhodglym,pmh_nutr
itdefic,pmh_opnwndextr,pmh_osteoarthros,pmh_osteoporosis,pmh_otaftercare,pmh_otbnignneo,pmh_otbonedx,pmh_otcirculdx,pmh_otcomplbir,pmh_otconntiss,pmh_otdxbladdr,pmh_otdxkidney,pmh_otdxstomch,pmh_otendodsor,pmh_otfemalgen,pmh_othbactinf,pmh_othcnsinfx,pmh_othematldx,pmh_othercvd,pmh_othereardx,pmh_otheredcns,pmh_othereyedx,pmh_othergidx,pmh_othergudx,pmh_otherinjury,pmh_otherpregnancyanddeliveryincludingnormal,pmh_otherscreen,pmh_othfracture,pmh_othheartdx,pmh_othliverdx,pmh_othlowresp,pmh_othmalegen,pmh_othnervdx,pmh_othveindx,pmh_otinflskin,pmh_otitismedia,pmh_otjointdx,pmh_otnutritdx,pmh_otprimryca,pmh_otrespirca,pmh_otupprresp,pmh_otuprspin,pmh_ovariancyst,pmh_pancreascan,pmh_pancreasdx,pmh_paralysis,pmh_parkinsons,pmh_pathologfx,pmh_peripathero,pmh_peritonitis,pmh_personalitydisorders,pmh_phlebitis,pmh_pid,pmh_pleurisy,pmh_pneumonia,pmh_poisnotmed,pmh_precereoccl,pmh_prevcsectn,pmh_prolapse,pmh_prostatecan,pmh_pulmhartdx,pmh_rctmanusca,pmh_rehab,pmh_respdistres,pmh_retinaldx,pmh_rheumarth,pmh_schizophreniaandotherpsychoticdisorde,pmh_screeningandhistoryofmentalhealthan,pmh_septicemia,pmh_sexualinfxs,pmh_shock,pmh_sicklecell,pmh_skininfectn,pmh_skinmelanom,pmh_socialadmin,pmh_spincorinj,pmh_stomchcancr,pmh_substancerelateddisorders,pmh_suicideandintentionalselfinflictedin,pmh_superficinj,pmh_syncope,pmh_teethdx,pmh_testiscancr,pmh_thyroiddsor,pmh_tia,pmh_tonsillitis,pmh_ulceratcol,pmh_ulcerskin,pmh_unclassified,pmh_urinyorgca,pmh_uteruscancr,pmh_uti,pmh_varicosevn,pmh_viralinfect,pmh_whtblooddx,cc_abdominalcramping,cc_abdominaldistention,cc_abdominalpain,cc_abdominalpainpregnant,cc_abnormallab,cc_abscess,cc_addictionproblem,cc_alcoholintoxication,cc_alcoholproblem,cc_allergicreaction,cc_alteredmentalstatus,cc_animalbite,cc_ankleinjury,cc_anklepain,cc_anxiety,cc_arminjury,cc_armpain,cc_assaultvictim,cc_asthma,cc_backpain,cc_bleeding/bruising,cc_bodyfluidexposure,cc_breastpain,cc_breathingdifficulty,cc_burn,cc_cardiacarrest,cc_chestpain,cc_coldlikesymptoms,cc_confu
sion,cc_conjunctivitis,cc_constipation,cc_cough,cc_cyst,cc_decreasedbloodsugar-symptomatic,cc_dehydration,cc_dentalpain,cc_depression,cc_detoxevaluation,cc_diarrhea,cc_dizziness,cc_drug/alcoholassessment,cc_drugproblem,cc_dyspnea,cc_dysuria,cc_earpain,cc_earproblem,cc_edema,cc_elbowpain,cc_elevatedbloodsugar-nosymptoms,cc_elevatedbloodsugar-symptomatic,cc_emesis,cc_epigastricpain,cc_epistaxis,cc_exposuretostd,cc_extremitylaceration,cc_extremityweakness,cc_eyeinjury,cc_eyepain,cc_eyeproblem,cc_eyeredness,cc_facialinjury,cc_faciallaceration,cc_facialpain,cc_facialswelling,cc_fall,cc_fall>65,cc_fatigue,cc_femaleguproblem,cc_fever,cc_fever-75yearsorolder,cc_fever-9weeksto74years,cc_feverimmunocompromised,cc_fingerinjury,cc_fingerpain,cc_fingerswelling,cc_flankpain,cc_follow-upcellulitis,cc_footinjury,cc_footpain,cc_footswelling,cc_foreignbodyineye,cc_fulltrauma,cc_generalizedbodyaches,cc_gibleeding,cc_giproblem,cc_groinpain,cc_handinjury,cc_handpain,cc_headache,cc_headache-newonsetornewsymptoms,cc_headache-recurrentorknowndxmigraines,cc_headachere-evaluation,cc_headinjury,cc_headlaceration,cc_hemoptysis,cc_hippain,cc_hypertension,cc_hypotension,cc_insectbite,cc_irregularheartbeat,cc_jawpain,cc_jointswelling,cc_kneeinjury,cc_kneepain,cc_laceration,cc_leginjury,cc_legpain,cc_legswelling,cc_lethargy,cc_lossofconsciousness,cc_maleguproblem,cc_mass,cc_medicalscreening,cc_medicationproblem,cc_medicationrefill,cc_migraine,cc_modifiedtrauma,cc_motorvehiclecrash,cc_multiplefalls,cc_nasalcongestion,cc_nausea,cc_nearsyncope,cc_neckpain,cc_neurologicproblem,cc_otalgia,cc_other,cc_overdose-intentional,cc_pain,cc_panicattack,cc_pelvicpain,cc_rapidheartrate,cc_rash,cc_rectalbleeding,cc_rectalpain,cc_respiratorydistress,cc_ribinjury,cc_ribpain,cc_seizure-newonset,cc_seizure-priorhxof,cc_shortnessofbreath,cc_shoulderinjury,cc_shoulderpain,cc_sinusproblem,cc_skinirritation,cc_skinproblem,cc_sorethroat,cc_stdcheck,cc_strokealert,cc_suicidal,cc_suture/stapleremoval,cc_syncope,cc_tachycardi
a,cc_testiclepain,cc_thumbinjury,cc_tickremoval,cc_toeinjury,cc_toepain,cc_unresponsive,cc_uri,cc_urinaryfrequency,cc_urinaryretention,cc_urinarytractinfection,cc_vaginalbleeding,cc_vaginaldischarge,cc_vaginalpain,cc_weakness,cc_wheezing,cc_withdrawal-alcohol,cc_woundcheck,cc_woundinfection,cc_woundre-evaluation,cc_wristinjury,cc_wristpain,meds_anti-obesitydrugs,triage_vital_hr_imputed_flag,triage_vital_sbp_imputed_flag,triage_vital_dbp_imputed_flag,triage_vital_rr_imputed_flag,triage_vital_temp_imputed_flag,hist_glucose_median_imputed_flag,demo_age_100-109,demo_age_18-29,demo_age_30-39,demo_age_40-49,demo_age_60-69,demo_age_70-79,demo_age_80-89,demo_age_90-99,demo_employstatus_Employed,demo_employstatus_Student,demo_insurance_status_Private,dum_triage_vital__bradycardia(low)_hr,dum_triage_vital__critical_hr,dum_triage_vital__tachycardia(high)_hr,dum_triage_vital__critical_sbp,dum_triage_vital__hypotension(low)_sbp,dum_triage_vital__normal_sbp,dum_triage_vital__pre-hypertension_sbp,dum_triage_vital__critical_dpb,dum_triage_vital__hypertension(high)_dbp,dum_triage_vital__hypotension(low)_dbp,dum_triage_vital__pre-hypertension_dbp,dum_triage_vital__Critical_rr,dum_triage_vital__bradypnea(low)_rr,dum_triage_vital__tachypnea(high)_rr,dum_triage_vital__fever(high_temp),dum_triage_vital__hypothermia(low)_temp,dum_huse__prev_dispo_Admit,dum_huse__prev_dispo_Discharge,dum_huse__high_prior_visit,dum_huse__moderate_prior_visit,dum_huse__no_prior_visits,dum_huse__No_prior_admis,dum_huse__high_prior_admis,dum_huse__moderate_prior_admis,dum_huse__vhigh_prior_admis,dum_huse__high_Surg,dum_huse__moderate_Surg,dum_huse__no_prior_surg,dum_hist_glucose_median__>200(high),dum_hist_glucose_median__>300(very high),dum_huse__1-to-2_antihyperglycemics,dum_huse__3-to-6_antihyperglycemics,dum_huse__7-plus_antihyperglycemics,dum_huse__1-to-2_hormones,dum_huse__3-plus_hormones
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
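The first column in the output above, `Unnamed: 0`, looks like a row index that was written into the CSV when the cleaned data was saved. Because the cells below build `X` by dropping only `disposition`, that column would remain in the feature matrix. Here is a minimal sketch of two ways to avoid this, using a small in-memory CSV as a stand-in for the real file (the toy values and the `csv_text` name are illustrative assumptions). Whether removing the column changes the scores reported below has not been checked here.

```python
import io
import pandas as pd

# Toy stand-in for the saved file: the first, unnamed column is the old index
csv_text = ",demo_gender,disposition\n0,0,0\n1,0,1\n2,1,1\n"

# Default read: the unnamed index column comes back as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Option 1: tell pandas that the first column is the index
by_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after loading
dropped = raw.drop(columns=['Unnamed: 0'])

print(by_index.columns.tolist())
print(dropped.columns.tolist())
```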


---
#### <a id = 'van'></a> 1.2 Logistic Regression 

Given that the data in the dataframe, including the target `disposition`, is binary (either 1 or 0), logistic regression is a suitable vanilla model. I will first run the model as is: split the data into train and test sets, scale it, and evaluate the performance. I will then build a pipeline and tune the hyperparameters to see whether the accuracy of the model can be improved.

In [39]:
# Step 1: Define the features and target variables
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Scale the data 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Model training with scaled data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Step 5: Model evaluation with scaled data
y_pred = model.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.7964593392028779
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.82     11348
           1       0.80      0.75      0.77      9778

    accuracy                           0.80     21126
   macro avg       0.80      0.79      0.79     21126
weighted avg       0.80      0.80      0.80     21126
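
To put the accuracy in context, it helps to compare it with a majority-class baseline, i.e. always predicting the more common class. This quick sketch uses only the support counts and the accuracy printed in the report above (the variable names are illustrative):

```python
# Support counts and accuracy copied from the classification report above
support = {0: 11348, 1: 9778}
model_accuracy = 0.7964593392028779

total = sum(support.values())
# Accuracy obtained by always predicting the majority class
baseline_accuracy = max(support.values()) / total

print(f"Test rows: {total}")
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Improvement over baseline: {model_accuracy - baseline_accuracy:.3f}")
```

The baseline works out to roughly 0.54, so the vanilla model's 0.80 is a real improvement over guessing the majority class.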



In [46]:
# Step 1: Define the features and target variables
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check for multicollinearity
corr_matrix = X_train.corr()

# Identify highly correlated feature pairs (np.where returns integer positions)
highly_correlated = np.where(np.abs(corr_matrix) > 0.9)

# Drop one feature from each correlated pair, tracking column names rather than positions
features_to_drop = set()
for i, j in zip(*highly_correlated):
    col1, col2 = X_train.columns[i], X_train.columns[j]
    if col1 != col2 and col1 not in features_to_drop:
        features_to_drop.add(col2)
        print(f"Highly correlated features: {col1} and {col2}")

# Drop the features
X_train = X_train.drop(columns=features_to_drop)
X_test = X_test.drop(columns=features_to_drop)
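
One detail worth noting for this step: `np.where` returns integer positions, whereas `DataFrame.drop(columns=...)` expects column labels, so positions have to be mapped back to names before dropping. An alternative sketch avoids positions altogether by keeping only the upper triangle of the correlation matrix, so each pair is considered once. The toy frame and the `find_correlated_columns` helper are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

def find_correlated_columns(df, threshold=0.9):
    """Return names of columns whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is ignored
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    'a': [0, 1, 0, 1, 1, 0],
    'a_copy': [0, 1, 0, 1, 1, 0],  # perfectly correlated with 'a'
    'b': [1, 1, 0, 0, 1, 0],
})

to_drop = find_correlated_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

The list of names should be computed on the training set and then applied to both the training and test sets, as the cell above does.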


In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Step 1: Define the features and target variables
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Check for multicollinearity
corr_matrix = X_train.corr()

# Identify highly correlated feature pairs (np.where returns integer positions)
highly_correlated = np.where(np.abs(corr_matrix) > 0.9)

# Drop one feature from each correlated pair, tracking column names rather than positions
features_to_drop = set()
for i, j in zip(*highly_correlated):
    col1, col2 = X_train.columns[i], X_train.columns[j]
    if col1 != col2 and col1 not in features_to_drop:
        features_to_drop.add(col2)

# Drop the features by name
X_train = X_train.drop(columns=list(features_to_drop))
X_test = X_test.drop(columns=list(features_to_drop))

# Step 4: Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Step 5: Create and train the scikit-learn Logistic Regression model on the scaled data
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train_scaled, y_train)

# Step 6: Add a constant term to the features for the intercept
X_train_sm = sm.add_constant(X_train)

# Create and train the Logistic Regression model with statsmodels
logreg_model_sm = sm.Logit(y_train, X_train_sm)
result = logreg_model_sm.fit()

# Extract coefficients, odds ratios, and p-values
coefficients = result.params
odds_ratios = np.exp(coefficients)
p_values = result.pvalues

# Build a DataFrame of feature names, coefficients, odds ratios, and p-values.
# result.params is indexed by name (including 'const'), so the names come from its index.
results_df = pd.DataFrame({
    'Feature': coefficients.index,
    'Coefficient': coefficients.values,
    'Odds Ratio': odds_ratios.values,
    'P-value': p_values.values
})

# Print or inspect the results
print("Feature, Coefficients, Odds Ratios, and P-values:")
print(results_df)

In [29]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Define the features and target variables
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train LinearSVC model
svc_model = LinearSVC()
svc_model.fit(X_train_scaled, y_train)

# Predictions
y_pred = svc_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

classification_rep = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_rep)




Accuracy: 0.7787560352172678
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.82      0.80     11348
           1       0.78      0.73      0.75      9778

    accuracy                           0.78     21126
   macro avg       0.78      0.78      0.78     21126
weighted avg       0.78      0.78      0.78     21126





In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Separate Features and Target
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Train-Test Split (stratified on the target to preserve the class balance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


# Step 3: Scale the features (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [9]:

# Step 4: Hyperparameter Tuning with a simplified GridSearchCV grid
param_grid = {'C': [0.1, 1], 'penalty': ['l2']}

grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best hyperparameters
best_params = grid_search.best_params_


In [10]:

# Step 5: Model Training with Best Hyperparameters
best_model = LogisticRegression(max_iter=1000, **best_params)
best_model.fit(X_train_scaled, y_train)

# Step 6: Model Evaluation
X_test_scaled = scaler.transform(X_test)
y_pred = best_model.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Best Hyperparameters:", best_params)


Accuracy: 0.791867840575594
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.81     11414
           1       0.79      0.74      0.77      9712

    accuracy                           0.79     21126
   macro avg       0.79      0.79      0.79     21126
weighted avg       0.79      0.79      0.79     21126

Best Hyperparameters: {'C': 0.1, 'penalty': 'l2'}
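
`best_params_` only reports the winning candidate. To see how close the candidates were, the fitted `GridSearchCV` object also exposes `cv_results_`. The sketch below uses synthetic data from `make_classification` as a stand-in for the scaled training set, so the printed numbers will not match the notebook's. The `demo_` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_scaled / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1]},
    cv=3,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the cross-validated accuracy
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table.sort_values('rank_test_score'))
```

If the mean scores differ by less than their standard deviations, the choice of `C` is unlikely to matter much for accuracy.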


In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Separate Features and Target
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Train-Test Split (stratified on the target to preserve the class balance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 3: Create a preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), X.columns)  # Scale all columns
    ],
    remainder='passthrough'
)

# Step 4: Create a pipeline for Logistic Regression
logreg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500))
])

# Step 5: Create a pipeline for Decision Tree
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

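# Step 6: Hyperparameter Tuning with GridSearchCV for Logistic Regression
# Note: LogisticRegression's default solver (lbfgs) does not support the 'l1' penalty,
# so the 'l1' candidates in this grid fail to fit and are scored as nan
logreg_param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'classifier__penalty': ['l2','l1']}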
logreg_grid_search = GridSearchCV(logreg_pipeline, logreg_param_grid, cv=3, scoring='accuracy', n_jobs=-1)
logreg_grid_search.fit(X_train, y_train)

# Best hyperparameters for Logistic Regression
logreg_best_params = logreg_grid_search.best_params_

# Step 7: Hyperparameter Tuning with GridSearchCV for Decision Tree
dt_param_grid = {'classifier__max_depth': [None, 5, 10, 15], 'classifier__min_samples_split': [2, 5, 10]}
dt_grid_search = GridSearchCV(dt_pipeline, dt_param_grid, cv=3, scoring='accuracy', n_jobs=-1)
dt_grid_search.fit(X_train, y_train)

# Best hyperparameters for Decision Tree
dt_best_params = dt_grid_search.best_params_

# Step 8: Model Evaluation for Logistic Regression
logreg_best_model = logreg_grid_search.best_estimator_
y_logreg_pred = logreg_best_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_logreg_pred))
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_logreg_pred))
print("Logistic Regression Best Hyperparameters:", logreg_best_params)

# Step 9: Model Evaluation for Decision Tree
dt_best_model = dt_grid_search.best_estimator_
y_dt_pred = dt_best_model.predict(X_test)
print("\nDecision Tree Accuracy:", accuracy_score(y_test, y_dt_pred))
print("Decision Tree Classification Report:\n", classification_report(y_test, y_dt_pred))
print("Decision Tree Best Hyperparameters:", dt_best_params)


21 fits failed out of a total of 42.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
21 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/kristenlo/anaconda3/envs/klo-BS_HS-Bia/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/kristenlo/anaconda3/envs/klo-BS_HS-Bia/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kristenlo/anaconda3/envs/klo-BS_HS-Bia/lib/python3.11/site-packages/sklearn/pipeline.py", line 427, in fit
    self._final_estimator.fit(Xt, y, **

Logistic Regression Accuracy: 0.7921045157625675
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.81     11414
           1       0.79      0.74      0.77      9712

    accuracy                           0.79     21126
   macro avg       0.79      0.79      0.79     21126
weighted avg       0.79      0.79      0.79     21126

Logistic Regression Best Hyperparameters: {'classifier__C': 0.01, 'classifier__penalty': 'l2'}

Decision Tree Accuracy: 0.7620941020543406
Decision Tree Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.88      0.80     11414
           1       0.82      0.62      0.71      9712

    accuracy                           0.76     21126
   macro avg       0.77      0.75      0.75     21126
weighted avg       0.77      0.76      0.76     21126

Decision Tree Best Hyperparameters: {'classifier__max_depth': 15, 'classifie
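
The 21 failed fits are exactly half of the 42 (7 values of `C` × 2 penalties × 3 folds), which is consistent with every `'l1'` candidate failing: scikit-learn's default `lbfgs` solver for `LogisticRegression` supports only L2 (or no) regularization. One possible way to search both penalties is to use a solver that supports them, such as `liblinear`. This is a sketch on synthetic data, so the scores are illustrative only, and the smaller grid and the `demo_` names are assumptions made to keep it quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    # liblinear supports both 'l1' and 'l2' penalties
    ('classifier', LogisticRegression(solver='liblinear', max_iter=500)),
])

demo_grid = {
    'classifier__C': [0.01, 0.1, 1],
    'classifier__penalty': ['l1', 'l2'],
}

# error_score='raise' surfaces any failing fit immediately instead of recording nan
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3,
                           scoring='accuracy', error_score='raise')
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print("Any nan scores:", np.isnan(demo_search.cv_results_['mean_test_score']).any())
```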

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Separate Features and Target
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Train-Test Split (stratified on the target to preserve the class balance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 3: Create a preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), X.columns)  # Scale all columns
    ],
    remainder='passthrough'
)

# Step 4: Create a pipeline for Logistic Regression
logreg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500))
])

# Step 5: Create a pipeline for Decision Tree
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

# Step 6: Hyperparameter Tuning with GridSearchCV for Logistic Regression
logreg_param_grid = {'classifier__C': [0.01, 0.1, 1, 10], 'classifier__penalty': ['l2']}
logreg_grid_search = GridSearchCV(logreg_pipeline, logreg_param_grid, cv=3, scoring='accuracy', n_jobs=-1)
logreg_grid_search.fit(X_train, y_train)

# Best hyperparameters for Logistic Regression
logreg_best_params = logreg_grid_search.best_params_

# Step 7: Hyperparameter Tuning with GridSearchCV for Decision Tree
dt_param_grid = {'classifier__max_depth': [None, 5, 10, 15, 20], 'classifier__min_samples_split': [2, 5, 10]}
dt_grid_search = GridSearchCV(dt_pipeline, dt_param_grid, cv=3, scoring='accuracy', n_jobs=-1)
dt_grid_search.fit(X_train, y_train)

# Best hyperparameters for Decision Tree
dt_best_params = dt_grid_search.best_params_

# Step 8: Model Evaluation for Logistic Regression
logreg_best_model = logreg_grid_search.best_estimator_
y_logreg_pred = logreg_best_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_logreg_pred))
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_logreg_pred))
print("Logistic Regression Best Hyperparameters:", logreg_best_params)

# Step 9: Model Evaluation for Decision Tree
dt_best_model = dt_grid_search.best_estimator_
y_dt_pred = dt_best_model.predict(X_test)
print("\nDecision Tree Accuracy:", accuracy_score(y_test, y_dt_pred))
print("Decision Tree Classification Report:\n", classification_report(y_test, y_dt_pred))
print("Decision Tree Best Hyperparameters:", dt_best_params)

Logistic Regression Accuracy: 0.7921045157625675
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.81     11414
           1       0.79      0.74      0.77      9712

    accuracy                           0.79     21126
   macro avg       0.79      0.79      0.79     21126
weighted avg       0.79      0.79      0.79     21126

Logistic Regression Best Hyperparameters: {'classifier__C': 0.01, 'classifier__penalty': 'l2'}

Decision Tree Accuracy: 0.7613840764934204
Decision Tree Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.88      0.80     11414
           1       0.81      0.62      0.71      9712

    accuracy                           0.76     21126
   macro avg       0.77      0.75      0.75     21126
weighted avg       0.77      0.76      0.76     21126

Decision Tree Best Hyperparameters: {'classifier__max_depth': 15, 'classifie
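
Since the wider aim of this project is to screen for bias, one natural follow-up is to break the same metrics down by a demographic column rather than reporting a single overall score. This is a hedged sketch on a tiny hand-made frame. The `group`, `y_true` and `y_pred` columns are hypothetical stand-ins for a column such as `demo_race`, for `y_test`, and for the model predictions:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hand-made example: two groups with four visits each
eval_df = pd.DataFrame({
    'group':  [0, 0, 0, 0, 1, 1, 1, 1],
    'y_true': [1, 0, 1, 0, 1, 1, 0, 0],
    'y_pred': [1, 0, 1, 0, 0, 1, 0, 1],
})

rows = []
for group_value, part in eval_df.groupby('group'):
    rows.append({
        'group': group_value,
        'n': len(part),
        'accuracy': accuracy_score(part['y_true'], part['y_pred']),
        'recall': recall_score(part['y_true'], part['y_pred']),
    })

per_group = pd.DataFrame(rows).set_index('group')
print(per_group)
```

A large gap between groups on accuracy or recall would be a signal to investigate further before relying on the overall score.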

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Separate Features and Target
X = health_data.drop(['disposition'], axis=1)
y = health_data['disposition']

# Step 2: Train-Test Split (stratified on the target to preserve the class balance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 3: Create a preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), X.columns)  # Scale all columns
    ],
    remainder='passthrough'
)

# Step 4: Create a pipeline for Random Forest
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Step 5: Hyperparameter Tuning with GridSearchCV for Random Forest
rf_param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [15, 20, 25],
    'classifier__min_samples_split': [5, 10, 15],
    'classifier__min_samples_leaf': [1, 2, 5]
}
rf_grid_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=3, scoring='accuracy', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)

# Best hyperparameters for Random Forest
rf_best_params = rf_grid_search.best_params_

# Step 6: Model Evaluation for Random Forest
rf_best_model = rf_grid_search.best_estimator_
y_rf_pred = rf_best_model.predict(X_test)
print("\nRandom Forest Accuracy:", accuracy_score(y_test, y_rf_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_rf_pred))
print("Random Forest Best Hyperparameters:", rf_best_params)

# Step 7: Feature Importance
feature_importance = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_best_model.named_steps['classifier'].feature_importances_})
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
print("\nFeature Importance:\n", feature_importance)



Random Forest Accuracy: 0.7872763419483101
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.84      0.81     11414
           1       0.80      0.72      0.76      9712

    accuracy                           0.79     21126
   macro avg       0.79      0.78      0.78     21126
weighted avg       0.79      0.79      0.79     21126

Random Forest Best Hyperparameters: {'classifier__max_depth': 25, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 200}

Feature Importance:
                                  Feature    Importance
437  dum_huse__1-to-2_antihyperglycemics  1.912753e-01
438  dum_huse__3-to-6_antihyperglycemics  4.703619e-02
440            dum_huse__1-to-2_hormones  3.580138e-02
428             dum_huse__No_prior_admis  2.713441e-02
355                 cc_shortnessofbreath  1.939866e-02
..                                   ...           ...
237           