## Group by Hospitals ##
***
When I grouped the data by zip code, EDA showed only very weak correlations between HHC quality measures and hospital readmission ratios.  I concluded that more observations might be needed to improve the model.  <br><br>
To get more observations, I kept each hospital observation as its own row and attached the HHC ratings data to it: the HHC data is first aggregated by zip code and then merged onto the hospital rows through the shared zip code.

In [3]:
# Import Necessary Tools
import pandas as pd
import numpy as np

In [4]:
# Read in Data
df = pd.read_csv('HHC_Agencies_Cleaned.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,state,name,zip,nursing_care,physical_therapy,occupational_therapy,pathology_services,medical_soc_services,home_health_aid,...,move_buff,in_out_bed_buff,bathing_buff,move_pain_debuff,breathing_buff,oral_rx_buff,hospital_admit,urgent_noadmit,readmit_expectation,er_admit_expectation
0,0,AL,ALACARE HOME HEALTH & HOSPICE,35216,1,1,1,1,1,1,...,79.4,75.4,83.5,85.9,81.3,72.4,18.3,11.4,1,1
1,1,AL,KINDRED AT HOME,36330,1,1,1,0,0,1,...,77.6,71.4,80.3,83.6,79.3,59.9,15.5,15.1,2,1
2,2,AL,AMEDISYS HOME HEALTH,35031,1,1,1,1,1,1,...,81.3,72.8,82.1,78.0,85.7,68.5,18.9,12.1,2,2
3,3,AL,SOUTHEAST ALABAMA HOMECARE,36330,1,1,1,1,1,0,...,85.8,79.0,87.9,91.5,87.2,80.6,16.9,11.9,2,2
4,4,AL,KINDRED AT HOME,35906,1,1,1,1,1,1,...,82.8,73.9,85.2,80.8,85.0,66.0,22.2,10.2,1,2


In [5]:
# Count Number of HHC Agencies by Zip Code
count = df.groupby('zip').count()
count = count.state.reset_index()
count.columns=['zip', 'hhc_count']


In [6]:
# Initialize Merger DataFrame with the unique zip codes
gmerger = count.drop_duplicates().reset_index(drop=True)

In [7]:
#  Loop Through Columns Intended to be Grouped by Zip to get Grouped Mean
mean_cols = ['nursing_care','physical_therapy','occupational_therapy','pathology_services','medical_soc_services',
               'home_health_aid','star_rating','timeliness','rx_ed','fall_risk','depression_check','flu_shot','pneumonia_shot',
               'd_foot_care','move_buff','in_out_bed_buff','bathing_buff','move_pain_debuff','breathing_buff','oral_rx_buff',
               'hospital_admit','urgent_noadmit']

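# groupby('zip') returns groups sorted by zip, the same order as the rows of gmerger,
# so the grouped means line up with gmerger positionally.
for col in mean_cols:
    gmerger[col] = np.array(df.groupby('zip')[col].mean())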
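
The loop above assigns each grouped mean by position, which works only while both results are sorted by zip in the same way. A minimal sketch of an alternative that lets pandas align on the zip key instead is shown below. The `toy` frame, its values, and the names `toy_mean_cols` and `toy_merger` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical stand-in for the HHC agency table (values are invented).
toy = pd.DataFrame({
    "zip": [36330, 35216, 36330, 35031],
    "state": ["AL", "AL", "AL", "AL"],
    "star_rating": [3.0, 4.0, 5.0, 2.5],
    "timeliness": [90.0, 80.0, 70.0, 95.0],
})

toy_mean_cols = ["star_rating", "timeliness"]

grouped = toy.groupby("zip")

# Means for all columns at once; the result is indexed by zip.
toy_merger = grouped[toy_mean_cols].mean()

# size() is also indexed by zip, so this assignment aligns on the key
# rather than on row position.
toy_merger["hhc_count"] = grouped.size()

# Turn the zip index back into an ordinary column.
toy_merger = toy_merger.reset_index()

print(toy_merger)
```

Because every piece is indexed by zip before it is combined, the result does not depend on the row order of any intermediate frame.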

In [8]:
# Loop Through Columns Intended to be Grouped by Zip to get Grouped Mode
mode_cols = ['readmit_expectation', 'er_admit_expectation']

# Mode was Chosen as These are Categorical Indicators
for col in mode_cols:
    df[col] = df[col].astype(int)
    # value_counts() sorts by frequency, so index[0] is the most common value within each zip
    gmerger[col] = np.array(df.groupby('zip')[col].agg(lambda x: x.value_counts().index[0]))
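
When two categories are equally common within a zip code, the value picked by `value_counts().index[0]` depends on how `value_counts()` orders tied entries. A minimal sketch of a deterministic tie-break follows, assuming that taking the smallest of the tied values is acceptable. That rule is my assumption, not something the analysis above specifies, and the `toy` frame and the `smallest_mode` helper are hypothetical.

```python
import pandas as pd

# Hypothetical data: zip 36330 has a tie between categories 1 and 2.
toy = pd.DataFrame({
    "zip": [36330, 36330, 35216, 35216, 35216],
    "readmit_expectation": [2, 1, 0, 2, 2],
})


def smallest_mode(values: pd.Series) -> int:
    """Return the most frequent value, breaking ties by taking the smallest."""
    # Series.mode() returns every most-frequent value, sorted ascending.
    return int(values.mode().iloc[0])


toy_mode = toy.groupby("zip")["readmit_expectation"].agg(smallest_mode)
print(toy_mode)
```

Any other explicit rule (for example, the largest tied value) would work the same way; the point is only that the choice is stated rather than left to sort order.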

In [9]:
gmerger.head()

Unnamed: 0,zip,hhc_count,nursing_care,physical_therapy,occupational_therapy,pathology_services,medical_soc_services,home_health_aid,star_rating,timeliness,...,move_buff,in_out_bed_buff,bathing_buff,move_pain_debuff,breathing_buff,oral_rx_buff,hospital_admit,urgent_noadmit,readmit_expectation,er_admit_expectation
0,740,1,1,1.0,0.0,0.0,1.0,1.0,3.0,98.4,...,60.8,66.3,63.8,68.0,79.4,64.7,18.8,8.8,2,2
1,917,1,1,1.0,1.0,1.0,1.0,1.0,4.5,96.3,...,85.9,80.6,90.3,95.0,86.8,73.2,12.9,10.7,2,1
2,968,1,1,1.0,1.0,1.0,1.0,1.0,3.0,81.2,...,74.4,70.5,75.6,77.6,81.4,51.9,6.7,10.8,0,0
3,970,1,1,1.0,1.0,1.0,1.0,1.0,3.5,96.9,...,74.0,66.6,75.4,89.2,73.0,50.3,13.3,12.5,2,2
4,1001,1,1,1.0,1.0,1.0,1.0,1.0,5.0,98.6,...,80.4,78.0,89.7,90.0,88.0,71.6,17.1,14.3,2,1


In [10]:
# Import Hospital Data
hospital = pd.read_csv('Readmissions_2.csv', index_col=0)
hospital.head()

Unnamed: 0,hospital_name,state,readmission_ratio,predicted_rate,expected_rate,zip
0,NORTHEAST ALABAMA REGIONAL MEDICAL CENTER,AL,1.4044,6.1,4.3,36207
1,NORTHEAST ALABAMA REGIONAL MEDICAL CENTER,AL,0.9653,16.7,17.3,36207
2,NORTHEAST ALABAMA REGIONAL MEDICAL CENTER,AL,0.9243,20.3,22.0,36207
3,NORTHEAST ALABAMA REGIONAL MEDICAL CENTER,AL,1.0895,15.9,14.6,36207
4,NORTHEAST ALABAMA REGIONAL MEDICAL CENTER,AL,0.9232,16.9,18.3,36207


In [11]:
# Merge grouped HHC data to cleaned Hospital observations via zip code.
final_test = pd.merge(hospital,gmerger, on='zip')
final_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5882 entries, 0 to 5881
Data columns (total 31 columns):
hospital_name           5882 non-null object
state                   5882 non-null object
readmission_ratio       5882 non-null float64
predicted_rate          5882 non-null float64
expected_rate           5882 non-null float64
zip                     5882 non-null int64
hhc_count               5882 non-null int64
nursing_care            5882 non-null int64
physical_therapy        5882 non-null float64
occupational_therapy    5882 non-null float64
pathology_services      5882 non-null float64
medical_soc_services    5882 non-null float64
home_health_aid         5882 non-null float64
star_rating             5882 non-null float64
timeliness              5882 non-null float64
rx_ed                   5882 non-null float64
fall_risk               5882 non-null float64
depression_check        5882 non-null float64
flu_shot                5882 non-null float64
pneumonia_shot          588

5882 observations!  This is significantly more than I had when everything was grouped by zip code.  Let's find out whether the extra observations produce any interesting results.
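
One thing worth checking before moving on: `pd.merge` defaults to an inner join, so any hospital row whose zip code has no HHC agency is silently dropped from the result. The sketch below shows one way to count such rows, using a left join with `indicator=True`. The toy frames and their values are invented for illustration; only the `zip` and `hhc_count` column names come from the notebook.

```python
import pandas as pd

# Hypothetical hospital rows; hospital "C" sits in a zip with no HHC agency.
toy_hospital = pd.DataFrame({
    "hospital_name": ["A", "A", "B", "C"],
    "zip": [36207, 36207, 35216, 99999],
})

# Hypothetical zip-level HHC summary (one row per zip).
toy_hhc = pd.DataFrame({
    "zip": [36207, 35216, 36330],
    "hhc_count": [2, 1, 3],
})

checked = pd.merge(
    toy_hospital,
    toy_hhc,
    on="zip",
    how="left",              # keep every hospital row, matched or not
    indicator=True,          # adds a '_merge' column describing each row's match
    validate="many_to_one",  # raises if a zip appears more than once in toy_hhc
)

# Rows that an inner join would have dropped.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(toy_hospital), len(checked), len(unmatched))
```

Applied to the real frames, the number of `left_only` rows would show how many hospital observations the inner join above leaves out.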

In [12]:
# Save New Data Set to use for EDA.
final_test.to_csv('final_by_hospital.csv')