##### 4_featureEngineering

This notebook takes the CSV compiled by the notebook 3_compileDischargesHospitalCensusData and processes it further:
- Dealing with Categorical Variables

- Feature Engineering

The final features in this dataset are:


In [81]:
import os
import pandas as pd
import numpy as np

In [82]:
aprHospFeat = pd.read_csv('dataFiles/compiledDischargesHospitalFeaturesCensus.csv', low_memory=False)

In [83]:
aprHospFeat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2209446 entries, 0 to 2209445
Data columns (total 42 columns):
Unnamed: 0                                              int64
Health Service Area                                     object
Facility Id                                             float64
Length of Stay                                          object
Type of Admission                                       object
APR DRG Code                                            int64
APR DRG Description                                     object
APR Severity of Illness Code                            float64
Payment Typology 1                                      object
Payment Typology 2                                      object
Payment Typology 3                                      object
Emergency Department Indicator                          object
Total Charges                                           float64
Total Costs                                             float64
yr 

#### Recoding the categorical vars into OneHot

##### Payment Types 1,2,3
There are 3 columns for payment types. This step looks at the different payment types across the 3 columns and creates one indicator column per payment type, set to "1" when that type appears in any of the 3 columns.

In [84]:
aprHospFeat['Payment Typology 1'].unique()[:-1]

array(['Medicare', 'Blue Cross/Blue Shield', 'Private Health Insurance',
       'Self-Pay', 'Medicaid', 'Federal/State/Local/VA',
       'Miscellaneous/Other', 'Department of Corrections',
       'Managed Care, Unspecified'], dtype=object)

In [85]:
# payment types found in the first column; [:-1] skips the last unique value, as in the cell above
paymentTypes = aprHospFeat['Payment Typology 1'].unique()[:-1]
paymentCols = ['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3']
for paymentType in paymentTypes:
    # flag rows where this payment type appears in any of the three columns
    aprHospFeat.loc[(aprHospFeat[paymentCols] == paymentType).any(axis=1), 'payment_' + paymentType] = 1

In [86]:
# replace the NaNs with 0 (assign back instead of an inplace fillna on a column selection)
paymentFlags = aprHospFeat.filter(like = 'payment').columns
aprHospFeat[paymentFlags] = aprHospFeat[paymentFlags].fillna(0)

In [87]:
aprHospFeat['numberOfPaymentTypes'] = aprHospFeat.filter(like = 'payment').sum(axis=1)
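
The payment-type encoding above can be checked on a small example. This is a minimal sketch on made-up data: the `toy` frame and its values are assumptions for illustration, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the three payment columns
toy = pd.DataFrame({
    'Payment Typology 1': ['Medicare', 'Medicaid', 'Self-Pay'],
    'Payment Typology 2': ['Medicaid', np.nan, np.nan],
    'Payment Typology 3': ['Self-Pay', np.nan, np.nan],
})
paymentCols = ['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3']

# One indicator per payment type: 1 if the type appears in any of the three columns
for paymentType in toy['Payment Typology 1'].dropna().unique():
    toy['payment_' + paymentType] = (toy[paymentCols] == paymentType).any(axis=1).astype(int)

# Number of distinct payment types recorded for each discharge
toy['numberOfPaymentTypes'] = toy.filter(like='payment_').sum(axis=1)
print(toy.filter(like='payment_').join(toy['numberOfPaymentTypes']))
```

Because `filter(like=...)` is case-sensitive, the original `Payment Typology` columns are not picked up by the lowercase `payment_` prefix.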

##### Type of Admission
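One-hot encode the type of admission. Before encoding, count the Urgent, Emergency, and Trauma admissions per year, facility, and APR DRG code.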

In [88]:
aprHospFeat['Type of Admission'].unique()

array(['Urgent', 'Elective', 'Emergency', 'Newborn', 'Not Available',
       'Trauma'], dtype=object)

In [100]:
urgentAdmits = aprHospFeat[aprHospFeat['Type of Admission'].isin(['Urgent', 'Emergency', 'Trauma'])]\
.groupby(['yr', 'Facility Id', 'APR DRG Code']).count().reset_index()\
.rename(columns={'Unnamed: 0':'numberUrgentAdmits'})[['yr', 'Facility Id', 'APR DRG Code', 'numberUrgentAdmits']]

In [102]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['Type of Admission'])

In [105]:
aprHospFeat = aprHospFeat.merge(urgentAdmits, left_on=['yr','Facility Id' ,'APR DRG Code'],
                  right_on=['yr','Facility Id' ,'APR DRG Code'])
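
The merge above uses the pandas default `how='inner'`, which keeps only rows whose key appears in both frames. Any (year, facility, DRG) group with no urgent-type admissions is therefore dropped from `aprHospFeat`. Whether that is intended is not stated here. If it is not, a left merge is one option, sketched below on made-up data (the frames and values are assumptions for illustration).

```python
import pandas as pd

# Toy data (hypothetical): the group for Facility Id 2.0 has no urgent admissions
discharges = pd.DataFrame({
    'yr': [2016, 2016, 2016],
    'Facility Id': [1.0, 1.0, 2.0],
    'APR DRG Code': [140, 140, 140],
})
urgentCounts = pd.DataFrame({
    'yr': [2016], 'Facility Id': [1.0], 'APR DRG Code': [140],
    'numberUrgentAdmits': [2],
})
keys = ['yr', 'Facility Id', 'APR DRG Code']

inner = discharges.merge(urgentCounts, on=keys)             # default how='inner'
left = discharges.merge(urgentCounts, on=keys, how='left')  # keeps every discharge
left['numberUrgentAdmits'] = left['numberUrgentAdmits'].fillna(0)

print(len(inner), len(left))
```

Comparing the row count before and after a merge is a quick way to see which behavior occurred.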

##### Emergency Department Indicator

In [106]:
aprHospFeat['Emergency Department Indicator'].unique()

array(['Y', 'N'], dtype=object)

In [107]:
aprHospFeat.loc[aprHospFeat['Emergency Department Indicator'] == 'Y', 'emergencyRoom' ]= 1

In [108]:
aprHospFeat['emergencyRoom'] = aprHospFeat['emergencyRoom'].fillna(0)
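
Since the indicator only takes the values 'Y' and 'N', the same flag can be built in one step from a boolean comparison. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({'Emergency Department Indicator': ['Y', 'N', 'Y']})

# True/False from the comparison, cast to 1/0
toy['emergencyRoom'] = (toy['Emergency Department Indicator'] == 'Y').astype(int)
print(toy)
```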

##### Hospital Ownership

In [109]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['Hospital Ownership'])

##### Rural vs Urban Setting

In [110]:
aprHospFeat = pd.get_dummies(aprHospFeat, columns=['rural_versus_urban'])

In [111]:
aprHospFeat.drop(['Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3', 
                 'Emergency Department Indicator'], 
                axis=1, inplace=True)

#### Feature Engineering & Creating New Variables

##### 1. ratio # insured to # total population

In [112]:
aprHospFeat['ratioInsuredTotalPopulation'] = aprHospFeat['totalInsured']/ aprHospFeat['totalPopulation']

##### 2. ratio # fertile women to # total population

In [113]:
aprHospFeat['ratioFertilityTotalPopulation'] = aprHospFeat['totalFertile']/ aprHospFeat['totalPopulation']

##### 3. ratio total charges to total costs

In [114]:
aprHospFeat['ratioChargesCosts'] = aprHospFeat['Total Charges']/ aprHospFeat['Total Costs']
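
With float columns, pandas returns `inf` rather than raising an error when a denominator is 0. Whether zeros occur in these columns is not checked here. If they do, one option is to turn the infinite ratios into missing values, as in this sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row has a zero denominator
toy = pd.DataFrame({'Total Charges': [1000.0, 500.0],
                    'Total Costs': [400.0, 0.0]})

ratio = toy['Total Charges'] / toy['Total Costs']  # second value is inf
toy['ratioChargesCosts'] = ratio.replace([np.inf, -np.inf], np.nan)
print(toy)
```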

##### 4. charges per day of stay

In [115]:
aprHospFeat.loc[aprHospFeat['Length of Stay']== '120 +', 'days'] = 120

In [116]:
aprHospFeat.loc[aprHospFeat['Length of Stay'] != '120 +', 'days'] = aprHospFeat['Length of Stay']

In [117]:
aprHospFeat['days'] = pd.to_numeric(aprHospFeat['days'])

In [118]:
aprHospFeat['chargesPerDay'] = aprHospFeat['Total Charges']/ aprHospFeat['days']
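
The length-of-stay conversion above treats the top-coded value '120 +' as 120 days. The same steps can be sketched compactly on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical): 'Length of Stay' is stored as text and top-coded at '120 +'
toy = pd.DataFrame({'Length of Stay': ['3', '120 +', '14'],
                    'Total Charges': [900.0, 240000.0, 2800.0]})

# Map the top-coded label to 120, then convert the whole column to numbers
toy['days'] = pd.to_numeric(toy['Length of Stay'].replace('120 +', '120'))
toy['chargesPerDay'] = toy['Total Charges'] / toy['days']
print(toy)
```

For stays recorded as '120 +', `chargesPerDay` is an upper bound, because the true stay is at least 120 days.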

##### 5. average number of days per year per facility id per drg

In [123]:
# average the numeric 'days' column (not the 'Unnamed: 0' row index) within each year, facility, and APR DRG code
avgDays = aprHospFeat.groupby(['yr', 'Facility Id', 'APR DRG Code'])['days'].mean().reset_index()\
.rename(columns={'days':'avgNumberDaysDrg'})

In [124]:
aprHospFeat = aprHospFeat.merge(avgDays, left_on=['yr','Facility Id' ,'APR DRG Code'],
                  right_on=['yr','Facility Id' ,'APR DRG Code'])
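
A per-group average of `days` can also be attached without a separate merge by using `groupby(...).transform`, which returns one value per original row. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'yr': [2016, 2016, 2016],
    'Facility Id': [1.0, 1.0, 2.0],
    'APR DRG Code': [140, 140, 140],
    'days': [2, 4, 10],
})
keys = ['yr', 'Facility Id', 'APR DRG Code']

# transform('mean') broadcasts each group's mean back onto its rows
toy['avgNumberDaysDrg'] = toy.groupby(keys)['days'].transform('mean')
print(toy)
```

This keeps the row count unchanged, so no rows can be dropped or duplicated by a join.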

##### 7. Number of discharges per year per facility id per APR DRG for Urgent, Emergency and Trauma admissions

In [None]:
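# count urgent-type discharges per year, facility, and APR DRG code
aprHospFeat[(aprHospFeat['Type of Admission_Emergency'] == 1) |
           (aprHospFeat['Type of Admission_Trauma'] == 1) |
           (aprHospFeat['Type of Admission_Urgent'] == 1)].groupby(['yr', 'Facility Id', 'APR DRG Code']).size()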

In [25]:
aprHospFeat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2551443 entries, 0 to 2551442
Data columns (total 66 columns):
Unnamed: 0                                                        int64
Health Service Area                                               object
Facility Id                                                       float64
Length of Stay                                                    object
APR DRG Code                                                      float64
APR DRG Description                                               object
APR Severity of Illness Code                                      float64
Total Charges                                                     float64
Total Costs                                                       float64
yr                                                                float64
Provider ID                                                       float64
Hospital overall rating                                           float64
Mort