##### 4_featureEngineering

This notebook takes the CSV compiled by the notebook 3_compileDischargesHospitalCensusData and processes it further by:
- Dealing with categorical variables
- Engineering new features

The final features in this dataset are listed below, organized by the category of data related to NYS hospitals and APR DRGs.

**From NYS Health, SPARCS individual APR DRG level:**
- 'yr' : year the individual discharge occurred
- 'Health Service Area' : one of the service areas in New York State
- 'Facility Id' : NYS uid for hospitals
- 'Length of Stay' : number of days stayed; the maximum value in the data set is 120
- 'APR DRG Code' : All Patient Refined Diagnosis Related Groups (APR–DRG) that will be utilized for the payment of the Medicaid, Workers Compensation and No–Fault rates.
- 'APR DRG Description' : Description of the code / procedure
- 'APR Severity of Illness Code' : integer on a scale of 0 to 4
- 'Total Charges' : dollars the hospital charged
- 'Total Costs' : dollars it cost the hospital to provide the treatment or procedure

**From CMS / (Provider) Hospital Related:**
- 'Provider ID' : CMS used uid for hospitals
- 'Hospital overall rating' 
- 'Mortality national comparison' 
- 'Safety of care national comparison'
- 'Readmission national comparison'
- 'Patient experience national comparison'
- 'Effectiveness of care national comparison'
- 'Timeliness of care national comparison'
- 'Efficient use of medical imaging national comparison'
- 'fte_employees_on_payroll' : number of full time equivalent employees on payroll
- 'number_of_beds': number of beds in the hospital
- 'number_of_interns_and' : number of interns and residents training at the hospital

**From Census API and PUMA:**
- 'puma' : puma area number
- 'totalPopulation' : total population estimate from census api
- 'totalInsured': total insured population estimate from census api
- 'totalFertile': total estimate of women aged 15 to 50 years from census api
- 'lat' : latitude
- 'long' : longitude
- 'lonlat' : point
- 'geometry' : geometry

**Added Features Engineered:**
- 'countPerDrgPerFacIdPerYr': count of the DRG discharge at each facility each year
- 'countPerFacIdPerYr': count of all the discharges at each facility each year
- 'ratioDrgToFacility': ratio of the DRG discharge count to the all discharges count
- 'sumChargesPerFacIdPerYr' : sum of the charges per facility per year (across all DRGs)
- 'days' : numeric conversion of the length of stay; the maximum value, '120 +', becomes 120
- 'ratioSumChargesPerFacPerYrToBedDays' : ratio of the sum of charges to the number of possible bed days (number of beds * 365 days per year)
- 'payment_Medicare' 
- 'payment_Blue Cross/Blue Shield'
- 'payment_Private Health Insurance'
- 'payment_Self-Pay'
- 'payment_Medicaid'
- 'payment_Federal/State/Local/VA'
- 'payment_Miscellaneous/Other'
- 'payment_Department of Corrections'
- 'payment_Managed Care, Unspecified'
- 'numberOfPaymentTypes' : count of the number of payment types used
- 'Type of Admission_Elective'
- 'Type of Admission_Emergency'
- 'Type of Admission_Newborn'
- 'Type of Admission_Not Available'
- 'Type of Admission_Trauma'
- 'Type of Admission_Urgent'
- 'numberUrgentAdmits' : number of emergency, trauma, or urgent admissions per DRG per facility per year
- 'emergencyRoom'
- 'Hospital Ownership_Government - Hospital District or Authority'
- 'Hospital Ownership_Government - Local'
- 'Hospital Ownership_Government - State'
- 'Hospital Ownership_Proprietary'
- 'Hospital Ownership_Voluntary non-profit - Church'
- 'Hospital Ownership_Voluntary non-profit - Other'
- 'Hospital Ownership_Voluntary non-profit - Private'
- 'rural_versus_urban_R'
- 'rural_versus_urban_U'
- 'ratioInsuredTotalPopulation' : census estimate of the insured population / total population estimate
- 'ratioFertilityTotalPopulation' : census estimate of women aged 15 to 50 years / total population estimate
- 'ratioChargesCosts' : charges / costs per individual discharge
- 'chargesPerDay' : charges / number of days stayed per individual discharge
- 'avgNumberDays' : average number of days for that specific DRG code per facility per year
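
The count and ratio features described in the list above are not computed in this notebook, so the following is only a minimal sketch of how they could be derived with `groupby(...).transform`. The toy values are made up for illustration; the column names follow the list, and the approach is an assumption rather than the code that produced the compiled CSV.

```python
import pandas as pd

# Toy discharge-level data; values are made up for illustration only.
toy = pd.DataFrame({
    'yr': [2015, 2015, 2015, 2015],
    'Facility Id': [1.0, 1.0, 1.0, 2.0],
    'APR DRG Code': [302, 302, 540, 540],
    'Total Charges': [100.0, 200.0, 300.0, 400.0],
    'number_of_beds': [2, 2, 2, 4],
})

facilityYear = ['yr', 'Facility Id']
drgFacilityYear = facilityYear + ['APR DRG Code']

# transform() broadcasts each group statistic back onto every row of the group
toy['countPerDrgPerFacIdPerYr'] = toy.groupby(drgFacilityYear)['APR DRG Code'].transform('size')
toy['countPerFacIdPerYr'] = toy.groupby(facilityYear)['APR DRG Code'].transform('size')
toy['ratioDrgToFacility'] = toy['countPerDrgPerFacIdPerYr'] / toy['countPerFacIdPerYr']

toy['sumChargesPerFacIdPerYr'] = toy.groupby(facilityYear)['Total Charges'].transform('sum')
# possible bed days = number of beds * 365 days per year
toy['ratioSumChargesPerFacPerYrToBedDays'] = (
    toy['sumChargesPerFacIdPerYr'] / (toy['number_of_beds'] * 365)
)
```

Using `transform` keeps the result aligned with the discharge rows, so no merge back onto the original frame is needed.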


In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
aprHospFeat = pd.read_csv('dataFiles/compiledDischargesHospitalFeaturesCensus.csv', low_memory=False)

In [3]:
aprHospFeat.drop(['Unnamed: 0'], axis=1, inplace=True)

In [4]:
aprHospFeat.shape

(2209446, 41)

#### Recoding the categorical variables as one-hot columns

##### Payment Types 1,2,3
There are 3 columns for payment types. For each payment type, a 'payment_<type>' column is set to 1 on every row where that type appears in any of the 3 columns.

In [5]:
aprHospFeat['Payment Typology 1'].unique()[:-1]

array(['Medicare', 'Blue Cross/Blue Shield', 'Private Health Insurance',
       'Self-Pay', 'Medicaid', 'Federal/State/Local/VA',
       'Miscellaneous/Other', 'Department of Corrections',
       'Managed Care, Unspecified'], dtype=object)

In [6]:
# payment types seen in 'Payment Typology 1'; the last unique value is skipped, as in the cell above
paymentTypes = aprHospFeat['Payment Typology 1'].unique()[:-1]
typologyColumns = ['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3']
for paymentType in paymentTypes:
    # set the flag to 1 where any of the 3 typology columns matches this payment type
    aprHospFeat.loc[aprHospFeat[typologyColumns].eq(paymentType).any(axis=1), 'payment_' + paymentType] = 1

In [7]:
# replace the NaNs with 0 (assign back instead of calling fillna(inplace=True) on a column selection)
paymentColumns = aprHospFeat.filter(like = 'payment').columns
aprHospFeat[paymentColumns] = aprHospFeat[paymentColumns].fillna(0)

In [8]:
aprHospFeat['numberOfPaymentTypes'] = aprHospFeat.filter(like = 'payment').sum(axis=1)
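
To make the payment flag logic concrete, here is a small sketch on made-up rows. The `toy` frame and its values are hypothetical; only the column names come from this notebook. Each flag is built directly as 0/1, which is a slight variant of the "set 1, then fill NaN with 0" steps used in the cells.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first discharge uses three payers, the others use one.
toy = pd.DataFrame({
    'Payment Typology 1': ['Medicare', 'Medicaid', 'Self-Pay'],
    'Payment Typology 2': ['Medicaid', np.nan, np.nan],
    'Payment Typology 3': ['Self-Pay', np.nan, np.nan],
})
typologyCols = ['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3']

for paymentType in toy['Payment Typology 1'].dropna().unique():
    # 1 if the payment type appears in any of the three typology columns, else 0
    toy['payment_' + paymentType] = toy[typologyCols].eq(paymentType).any(axis=1).astype(int)

# the row-wise sum of the flags is the number of distinct payment types used
flagCols = toy.filter(like='payment_').columns
toy['numberOfPaymentTypes'] = toy[flagCols].sum(axis=1)
```

Because each flag is either 0 or 1, a payer that appears in more than one typology column of the same row is counted once.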

##### Type of Admission
One-hot encode the type of admission, and count the Urgent, Emergency, and Trauma admissions per DRG per facility per year.

In [9]:
aprHospFeat['Type of Admission'].unique()

array(['Urgent', 'Elective', 'Emergency', 'Newborn', 'Not Available',
       'Trauma'], dtype=object)

In [10]:
# count the urgent-type admissions per year, facility, and DRG
urgentAdmits = aprHospFeat[aprHospFeat['Type of Admission'].isin(['Urgent', 'Emergency', 'Trauma'])]\
.groupby(['yr', 'Facility Id', 'APR DRG Code'])[['APR DRG Description']].count().reset_index()\
.rename(columns={"APR DRG Description":"numberUrgentAdmits"})

In [11]:
urgentAdmits.shape

(4672, 4)

In [12]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['Type of Admission'])

In [13]:
aprHospFeat = aprHospFeat.merge(urgentAdmits, how = 'outer',
                                left_on=['yr','Facility Id' ,'APR DRG Code'],
                                right_on=['yr','Facility Id' ,'APR DRG Code'])

In [14]:
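# groups with no urgent-type admissions have no match in the merge, so set them to 0
aprHospFeat['numberUrgentAdmits'] = aprHospFeat['numberUrgentAdmits'].fillna(0)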
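
The count-then-merge pattern used for 'numberUrgentAdmits' can be checked on a few made-up rows. This sketch is an assumption-laden variant, not the notebook's exact code: it uses `size()` and a left merge (`how='left'`), which keeps the original row order, whereas the cells above use `count()` and an outer merge.

```python
import pandas as pd

# Made-up discharges: facility 1 has two urgent-type admissions for DRG 302.
toy = pd.DataFrame({
    'yr': [2015, 2015, 2015, 2015],
    'Facility Id': [1.0, 1.0, 1.0, 2.0],
    'APR DRG Code': [302, 302, 302, 540],
    'Type of Admission': ['Urgent', 'Emergency', 'Elective', 'Elective'],
})
keys = ['yr', 'Facility Id', 'APR DRG Code']

# count only the urgent-type rows per (year, facility, DRG)
isUrgent = toy['Type of Admission'].isin(['Urgent', 'Emergency', 'Trauma'])
urgentCounts = toy[isUrgent].groupby(keys).size().reset_index(name='numberUrgentAdmits')

# attach the group count to every discharge row; unmatched groups get 0
toy = toy.merge(urgentCounts, on=keys, how='left')
toy['numberUrgentAdmits'] = toy['numberUrgentAdmits'].fillna(0)
```

Note that the elective discharge in the first group also receives the group count of 2, since the feature is defined per DRG per facility per year rather than per discharge.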

##### Emergency Department Indicator

In [15]:
aprHospFeat['Emergency Department Indicator'].unique()

array(['Y', 'N'], dtype=object)

In [16]:
aprHospFeat.loc[aprHospFeat['Emergency Department Indicator'] == 'Y', 'emergencyRoom' ]= 1

In [17]:
# rows not flagged 'Y' are still NaN, so set them to 0
aprHospFeat['emergencyRoom'] = aprHospFeat['emergencyRoom'].fillna(0)
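
Since the indicator only takes the values 'Y' and 'N', the same flag can also be built in a single step. This is a sketch on made-up rows; note that it yields integer 0/1 values, whereas the two cells above yield floats.

```python
import pandas as pd

# Made-up rows using the two values seen in the data ('Y' and 'N').
toy = pd.DataFrame({'Emergency Department Indicator': ['Y', 'N', 'Y']})

# the boolean comparison is cast to 0/1 in one step
toy['emergencyRoom'] = (toy['Emergency Department Indicator'] == 'Y').astype(int)
```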

##### Hospital Ownership

In [18]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['Hospital Ownership'])

##### Rural vs Urban Setting

In [19]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['rural_versus_urban'])
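
As a quick illustration of what `pd.get_dummies` does with the `columns` argument, the sketch below encodes a made-up frame. The encoded column is replaced by one column per category, named `<column>_<value>`, and the other columns are left untouched. The dtype of the new columns depends on the pandas version (boolean in recent releases), so the sketch casts to `int` when inspecting them.

```python
import pandas as pd

# Made-up hospitals; only the column names come from this notebook.
toy = pd.DataFrame({
    'rural_versus_urban': ['R', 'U', 'U'],
    'number_of_beds': [25, 300, 410],
})

encoded = pd.get_dummies(toy, columns=['rural_versus_urban'])

# one indicator column per category, named '<column>_<value>'
urbanFlags = encoded['rural_versus_urban_U'].astype(int).tolist()
```

Passing `dtype=int` to `pd.get_dummies` is one way to get 0/1 integers directly, if that is preferred for modeling.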

In [20]:
aprHospFeat.drop(['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3', 
                 'Emergency Department Indicator'], 
                axis=1, inplace=True)

#### Feature Engineering & Creating New Variables

##### Ratio # insured to # total population

In [21]:
aprHospFeat['ratioInsuredTotalPopulation'] = aprHospFeat['totalInsured']/ aprHospFeat['totalPopulation']

##### Ratio # fertile women to # total population

In [22]:
aprHospFeat['ratioFertilityTotalPopulation'] = aprHospFeat['totalFertile']/ aprHospFeat['totalPopulation']

##### Ratio total charges to total costs

In [23]:
aprHospFeat['ratioChargesCosts'] = aprHospFeat['Total Charges']/ aprHospFeat['Total Costs']
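
One thing to watch with ratio features is division by zero: pandas returns `inf` rather than raising an error when a float is divided by zero. Whether 'Total Costs' ever equals zero in this data set is not checked here, so the following is only a hedged sketch on made-up values showing how such rows could be turned into missing values.

```python
import numpy as np
import pandas as pd

# Made-up values; the second row has a zero cost to trigger the edge case.
toy = pd.DataFrame({'Total Charges': [1000.0, 500.0], 'Total Costs': [400.0, 0.0]})

ratio = toy['Total Charges'] / toy['Total Costs']
# treat infinite ratios (division by zero) as missing
toy['ratioChargesCosts'] = ratio.replace([np.inf, -np.inf], np.nan)
```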

##### Charges per day of stay

In [24]:
aprHospFeat.loc[aprHospFeat['Length of Stay']== '120 +', 'days'] = 120

In [25]:
aprHospFeat.loc[aprHospFeat['Length of Stay'] != '120 +', 'days'] = aprHospFeat['Length of Stay']

In [26]:
aprHospFeat['days'] = pd.to_numeric(aprHospFeat['days'])

In [27]:
aprHospFeat['chargesPerDay'] = aprHospFeat['Total Charges']/ aprHospFeat['days']
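
The length-of-stay conversion can be sketched on a few made-up rows. This variant replaces the '120 +' label with '120' and converts in one expression, which should give the same 'days' values as the three cells above.

```python
import pandas as pd

# Made-up rows; '120 +' is the maximum label used in 'Length of Stay'.
toy = pd.DataFrame({
    'Length of Stay': ['3', '120 +', '10'],
    'Total Charges': [300.0, 2400.0, 500.0],
})

# map the '120 +' label to 120, then convert the strings to numbers
toy['days'] = pd.to_numeric(toy['Length of Stay'].replace('120 +', '120'))
toy['chargesPerDay'] = toy['Total Charges'] / toy['days']
```

For stays recorded as '120 +', 'chargesPerDay' is an upper bound, because the true number of days is at least 120.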

##### Average number of days per year per facility id per drg

In [28]:
# select 'days' before averaging so the non-numeric columns are not passed to mean()
avgDays = aprHospFeat.groupby(['yr', 'Facility Id', 'APR DRG Code'])[['days']].mean().reset_index()\
.rename(columns={"days":"avgNumberDays"})

In [29]:
avgDays.shape

(4961, 4)

In [30]:
aprHospFeat = aprHospFeat.merge(avgDays, left_on=['yr','Facility Id' ,'APR DRG Code'],
                  right_on=['yr','Facility Id' ,'APR DRG Code'], how = 'outer')
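
An equivalent way to attach a group average to each row, without building a separate frame and merging it, is `groupby(...).transform('mean')`. The sketch below uses made-up values and is offered as an alternative, not as the approach this notebook takes.

```python
import pandas as pd

# Made-up discharges: two stays at facility 1 and one at facility 2.
toy = pd.DataFrame({
    'yr': [2015, 2015, 2015],
    'Facility Id': [1.0, 1.0, 2.0],
    'APR DRG Code': [302, 302, 302],
    'days': [2, 4, 5],
})

# the group mean is broadcast back to every row of its group
toy['avgNumberDays'] = (
    toy.groupby(['yr', 'Facility Id', 'APR DRG Code'])['days'].transform('mean')
)
```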

##### Split into C-Section & Knee Replacement

In [31]:
aprHospFeat.to_csv('dataFiles/top10AprHospFeat.csv')

In [38]:
aprHospFeat

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2209446 entries, 0 to 2209445
Data columns (total 67 columns):
Health Service Area                                               object
Facility Id                                                       float64
Length of Stay                                                    object
APR DRG Code                                                      int64
APR DRG Description                                               object
APR Severity of Illness Code                                      float64
Total Charges                                                     float64
Total Costs                                                       float64
yr                                                                int64
Provider ID                                                       int64
Hospital overall rating                                           float64
Mortality national comparison                                     float64
Safety o

In [32]:
kneeAprHospFeat = aprHospFeat[aprHospFeat['APR DRG Code'] == 302]

In [33]:
kneeAprHospFeat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112842 entries, 3468 to 2207575
Data columns (total 67 columns):
APR DRG Code                                                      112842 non-null int64
APR DRG Description                                               112842 non-null object
APR Severity of Illness Code                                      112842 non-null float64
Facility Id                                                       112842 non-null float64
Health Service Area                                               112842 non-null object
Length of Stay                                                    112842 non-null object
Total Charges                                                     112842 non-null float64
Total Costs                                                       112842 non-null float64
yr                                                                112842 non-null int64
Provider ID                                                       112842 non-null 

In [34]:
kneeAprHospFeat.to_csv('dataFiles/kneeAprHospFeat.csv')

In [35]:
csecAprHospFeat = aprHospFeat[aprHospFeat['APR DRG Code'] == 540]

In [36]:
csecAprHospFeat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215245 entries, 152 to 2206571
Data columns (total 67 columns):
APR DRG Code                                                      215245 non-null int64
APR DRG Description                                               215245 non-null object
APR Severity of Illness Code                                      215245 non-null float64
Facility Id                                                       215245 non-null float64
Health Service Area                                               215245 non-null object
Length of Stay                                                    215245 non-null object
Total Charges                                                     215245 non-null float64
Total Costs                                                       215245 non-null float64
yr                                                                215245 non-null int64
Provider ID                                                       215245 non-null i

In [38]:
csecAprHospFeat.to_csv('dataFiles/cSectionAprHospFeat.csv')

In [39]:
aprHospFeat.to_csv('dataFiles/top10AprHospFeat.csv')
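
The per-DRG split can also be expressed as a loop over a mapping, sketched here on made-up rows. The `drgSubsets` dictionary is a hypothetical helper; the codes 302 (knee) and 540 (C-section) are the ones used above. The sketch also writes with `index=False`: by default `to_csv` saves the index as an unnamed column, which is what shows up as 'Unnamed: 0' when the file is read back.

```python
import io

import pandas as pd

# Made-up discharges; only the DRG codes 302 and 540 come from this notebook.
toy = pd.DataFrame({
    'APR DRG Code': [302, 540, 540, 139],
    'Total Charges': [1.0, 2.0, 3.0, 4.0],
})

# hypothetical mapping from a subset name to its APR DRG code
drgSubsets = {'knee': 302, 'cSection': 540}
subsets = {name: toy[toy['APR DRG Code'] == code].copy()
           for name, code in drgSubsets.items()}

# write one subset to an in-memory buffer (a file path works the same way)
buffer = io.StringIO()
subsets['knee'].to_csv(buffer, index=False)
buffer.seek(0)
roundTrip = pd.read_csv(buffer)
```

With `index=False`, the file read back has exactly the original columns, so no 'Unnamed: 0' column needs to be dropped afterwards.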