# Capstone: Supervised Learning

## Instructions: 

First: Go out and find a dataset of interest. It could be from one of our recommended resources, some other aggregation, or scraped yourself. Just make sure it has lots of variables in it, including an outcome of interest to you.

Second: Explore the data. Get to know the data. Spend a lot of time going over its quirks and peccadilloes. You should understand how it was gathered, what's in it, and what the variables look like.

Third: Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power and experiment with both.

Please execute the three tasks above in a Jupyter notebook that you will submit to the grading team below.

Next, to prepare for your presentation, create a slide deck and 15-minute presentation that guides viewers through your model. Be sure to cover a few specific things:

    A specified research question your model addresses
    How you chose your model specification and what alternatives you compared it to
    The practical uses of your model for an audience of interest
    Any weak points or shortcomings of your model

This presentation is not a drill. You'll be presenting this slide deck live to a group as the culmination of all your work so far on supervised learning. As a secondary matter, your slides and the Jupyter notebook should be worthy of inclusion as examples of your work product when applying to jobs.

Supervised Learning Capstone Presentation Details:

    You should have a slide deck and 15 minute presentation that guides your assessor through the different models you tried and be able to speak to the best performing model and why it’s the best performing model.

    The presentation flow should be

        A quick intro about the context/topic of the project

        Information on the data (where it came from, how it was obtained, missingness, quick stats on the data, etc)

        A specified research question your model addresses

        How you chose your model specification and what alternatives you compared it to

        The practical uses of your model for an audience of interest

        Any weak points or shortcomings of your model

    After the presentation there will be a 5-10 minute Q&A where other students and mentors can ask questions about your project, so make sure you really understand the data and modeling you used!

## Data Source and information

data source: https://healthdata.gov/dataset/hospital-readmission-reduction/resource/f3830eb1-2d22-496c-b663-46b54e175d9f

https://healthdata.gov/dataset/hospital-readmission-reduction

In October 2012, CMS began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio, by dividing a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
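As a quick arithmetic check of the ratio described above, the sketch below divides the predicted rate by the expected rate. The three value pairs are heart failure rows that appear later in this notebook's `readm_red_df` output; the function name `excess_readmission_ratio` is a hypothetical helper, not part of any CMS tooling.

```python
def excess_readmission_ratio(predicted_rate, expected_rate):
    """Ratio of predicted to expected 30-day readmission rates."""
    return predicted_rate / expected_rate

# (predicted rate, expected rate, ratio reported in the data file)
rows = [
    (23.9788, 22.2578, 1.0773),
    (19.6816, 20.2355, 0.9726),
    (19.8545, 20.8696, 0.9514),
]

ratios = []
for predicted, expected, reported in rows:
    ratio = excess_readmission_ratio(predicted, expected)
    ratios.append(ratio)
    # A ratio above 1 indicates excess readmissions.
    flag = "excess" if ratio > 1 else "no excess"
    print(f"{ratio:.4f} (reported {reported}): {flag}")
```

Only the first hospital has a ratio above 1, so it is the only one of the three flagged for excess readmissions.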

https://data.medicare.gov/data/hospital-compare
Hospital Compare is a consumer-oriented website that provides information on the quality of care hospitals are providing to their patients. This information can help consumers make informed decisions about health care. Hospital Compare allows consumers to select multiple hospitals and directly compare performance measure information related to heart attack, emergency department care, preventive care, stroke care, and other conditions. The Centers for Medicare & Medicaid Services (CMS) created the Hospital Compare website to better inform health care consumers about a hospital’s quality of care. Hospital Compare provides data on over 4,000 Medicare-certified hospitals, including acute care hospitals, critical access hospitals (CAHs), children’s hospitals, Veterans Health Administration (VHA) Medical Centers, and hospital outpatient departments. Hospital Compare is part of an Administration-wide effort to increase the availability and accessibility of information on quality, utilization, and costs for effective, informed decision-making. More information about Hospital Compare can be found by visiting the CMS.gov website and performing a search for Hospital Compare. To access the Hospital Compare website, please visit www.medicare.gov/hospitalcompare. 

https://www.medicare.gov/hospitalcompare/Data/Data-Updated.html#%20
measures & current data period

## Hospital Compare overall hospital rating

Source: https://www.medicare.gov/hospitalcompare/Data/Hospital-overall-ratings-calculation.html

n = 4,573; distribution of stars: N/A, 1-5

The methodology uses a statistical model known as a latent variable model. Seven different latent variable models are used to calculate scores for 7 groups of measures:

- Mortality
- Safety of Care
- Readmission
- Patient Experience
- Effectiveness of Care
- Timeliness of Care
- Efficient Use of Medical Imaging

A hospital summary score is then calculated by taking the weighted average of these group scores. If a hospital is missing a measure category or group, the weights are redistributed amongst the qualifying measure categories or groups.
Finally, the overall hospital rating is calculated using the hospital summary score.
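
The weighted average with redistributed weights can be sketched as follows. This is a minimal illustration only: the weights in `EXAMPLE_WEIGHTS` are hypothetical placeholders rather than the values CMS uses, and redistributing weight in proportion to the remaining weights is an assumption, since the text above does not say how the redistribution is done.

```python
import math

# Hypothetical weights for illustration only; not taken from the CMS methodology.
EXAMPLE_WEIGHTS = {
    "Mortality": 0.2,
    "Safety of Care": 0.2,
    "Readmission": 0.2,
    "Patient Experience": 0.2,
    "Effectiveness of Care": 0.1,
    "Timeliness of Care": 0.05,
    "Efficient Use of Medical Imaging": 0.05,
}

def summary_score(group_scores, weights):
    """Weighted average of the available group scores.

    Missing groups (None or NaN) are skipped and the remaining weights are
    rescaled so that they sum to 1 (assumed proportional redistribution).
    """
    available = {
        group: score
        for group, score in group_scores.items()
        if score is not None and not math.isnan(score)
    }
    total_weight = sum(weights[group] for group in available)
    if total_weight == 0:
        return float("nan")
    return sum(weights[g] / total_weight * s for g, s in available.items())

# Toy group scores: one group scored 1.0, one missing, the rest 0.0.
scores = {group: 0.0 for group in EXAMPLE_WEIGHTS}
scores["Mortality"] = 1.0
scores["Safety of Care"] = None
example = summary_score(scores, EXAMPLE_WEIGHTS)
print(example)
```

With "Safety of Care" missing, the remaining weights sum to 0.8, so the Mortality weight grows from 0.2 to 0.2 / 0.8 under this assumed scheme.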

## Load data files

In [None]:
C:\Users\katec\Thinkful\data_collections\capstone_2\

In [None]:
hcahps_ratings.csv, hosp_gen_info.csv, hosp_mortality_measures.csv, hosp_readm_reduct_prog.csv, unplan_hosp_readm.csv

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
unplan_readm_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\unplan_hosp_readm.csv')

In [3]:
readm_red_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\hosp_readm_reduct_prog.csv')

In [4]:
mort_meas_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\hosp_mortality_measures.csv')

In [5]:
hcahps_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\hcahps_ratings.csv')

In [6]:
gen_info_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\hosp_gen_info.csv')
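
The five `read_csv` cells above repeat the same directory. One possible way to keep the directory in a single place is sketched below; `FILES` and `load_tables` are hypothetical names introduced here, and the file names are the ones used in the cells above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from dataframe name to the file names used above.
FILES = {
    "unplan_readm_df": "unplan_hosp_readm.csv",
    "readm_red_df": "hosp_readm_reduct_prog.csv",
    "mort_meas_df": "hosp_mortality_measures.csv",
    "hcahps_df": "hcahps_ratings.csv",
    "gen_info_df": "hosp_gen_info.csv",
}

def load_tables(data_dir, files=FILES):
    """Read every CSV in `files` from `data_dir` into a dict of dataframes."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in files.items()}
```

Calling `load_tables` with the data directory returns a dictionary, so `tables["hcahps_df"]` would play the role of `hcahps_df` above.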

## Cleaning data
Clean the data in each file individually before merging the dataframes into one file for modeling.

### Reduce each file to relevant data only, address missing values, convert to numeric, and dummy categorical values
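
The steps named in this heading are applied cell by cell below. As a compact sketch of the same sequence, the hypothetical helper `clean_measure` filters to one measure, coerces numeric columns, and dummies one categorical column. The toy frame is illustrative data, not rows from the real files.

```python
import pandas as pd

def clean_measure(df, measure_id, numeric_cols, cat_col, prefix):
    """Filter to one measure, coerce numerics, and dummy one categorical column."""
    out = df[df["Measure ID"] == measure_id].copy()
    # Non-numeric placeholders such as 'Not Available' become NaN.
    out[numeric_cols] = out[numeric_cols].apply(pd.to_numeric, errors="coerce")
    dummies = pd.get_dummies(out[cat_col], prefix=prefix, drop_first=True)
    return pd.concat([out.drop(columns=[cat_col]), dummies], axis=1)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Provider ID": ["10001", "10001", "10005", "10006"],
    "Measure ID": ["READM_30_HF", "READM_30_PN", "READM_30_HF", "READM_30_HF"],
    "Compared to National": [
        "No Different Than the National Rate",
        "Not Available",
        "Not Available",
        "Worse Than the National Rate",
    ],
    "Score": ["22.6", "Not Available", "Not Available", "24.1"],
})

cleaned = clean_measure(
    toy, "READM_30_HF", ["Score"], "Compared to National", "comp_nat"
)
print(cleaned)
```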

In [7]:
unplan_readm_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Measure Start Date,Measure End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Hospital return days for heart attack patients,EDAC_30_AMI,Average Days per 100 Discharges,742,-0.8,-11.6,10.6,,7/1/2015,6/30/2018
1,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Hospital return days for heart failure patients,EDAC_30_HF,More Days Than Average per 100 Discharges,1114,17.8,4.5,31.5,,7/1/2015,6/30/2018
2,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Hospital return days for pneumonia patients,EDAC_30_PN,Average Days per 100 Discharges,604,-7.8,-21.3,6.0,,7/1/2015,6/30/2018


In [8]:
unplan_readm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54093 entries, 0 to 54092
Data columns (total 18 columns):
Provider ID             54093 non-null object
Hospital Name           54093 non-null object
Address                 54093 non-null object
City                    54093 non-null object
State                   54093 non-null object
ZIP Code                54093 non-null int64
County Name             54093 non-null object
Phone Number            54093 non-null int64
Measure Name            54093 non-null object
Measure ID              54093 non-null object
Compared to National    54093 non-null object
Denominator             54093 non-null object
Score                   54093 non-null object
Lower Estimate          54093 non-null object
Higher Estimate         54093 non-null object
Footnote                19755 non-null object
Measure Start Date      54093 non-null object
Measure End Date        54093 non-null object
dtypes: int64(2), object(16)
memory usage: 7.4+ MB


In [9]:
#drop unnecessary columns
unplan_readm_df = unplan_readm_df.drop(['Address', 'City', 'County Name', 'Phone Number', 'Measure Name', 'Footnote', 'Measure Start Date', 'Measure End Date'], axis=1)

In [10]:
#list the Measure IDs present; only the heart failure (HF) readmission measure will be kept
unplan_readm_df['Measure ID'].unique()

array(['EDAC_30_AMI', 'EDAC_30_HF', 'EDAC_30_PN', 'OP_32', 'READM_30_AMI',
       'READM_30_CABG', 'READM_30_COPD', 'READM_30_HF',
       'READM_30_HIP_KNEE', 'READM_30_HOSP_WIDE', 'READM_30_PN'],
      dtype=object)

In [11]:
#drop all but HF values
unplan_readm_df = unplan_readm_df[unplan_readm_df['Measure ID'].isin(['READM_30_HF'])] 
unplan_readm_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,State,ZIP Code,Measure ID,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate
7,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,READM_30_HF,No Different Than the National Rate,1114,22.6,20.5,24.8
18,10005,MARSHALL MEDICAL CENTER SOUTH,AL,35957,READM_30_HF,No Different Than the National Rate,341,21.3,18.2,24.9
29,10006,NORTH ALABAMA MEDICAL CENTER,AL,35630,READM_30_HF,No Different Than the National Rate,793,20.4,18.1,22.9


In [12]:
#change name of Measure ID columns to identify the original df and measure
unplan_readm_df.rename(columns={'Measure ID':'READM_30_HF'}, inplace=True)

In [13]:
#change value of 'READM_30_HF' to numeric
unplan_readm_df.loc[unplan_readm_df.READM_30_HF == 'READM_30_HF', 'READM_30_HF'] = 1

In [14]:
unplan_readm_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,State,ZIP Code,READM_30_HF,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate
7,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,1,No Different Than the National Rate,1114,22.6,20.5,24.8
18,10005,MARSHALL MEDICAL CENTER SOUTH,AL,35957,1,No Different Than the National Rate,341,21.3,18.2,24.9
29,10006,NORTH ALABAMA MEDICAL CENTER,AL,35630,1,No Different Than the National Rate,793,20.4,18.1,22.9


In [15]:
unplan_readm_df.isnull().sum()

Provider ID             0
Hospital Name           0
State                   0
ZIP Code                0
READM_30_HF             0
Compared to National    0
Denominator             0
Score                   0
Lower Estimate          0
Higher Estimate         0
dtype: int64

In [16]:
unplan_readm_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4930 entries, 7 to 54089
Data columns (total 10 columns):
Provider ID             4930 non-null object
Hospital Name           4930 non-null object
State                   4930 non-null object
ZIP Code                4930 non-null int64
READM_30_HF             4930 non-null int64
Compared to National    4930 non-null object
Denominator             4930 non-null object
Score                   4930 non-null object
Lower Estimate          4930 non-null object
Higher Estimate         4930 non-null object
dtypes: int64(2), object(8)
memory usage: 423.7+ KB


In [17]:
#convert to numeric
cols = ['Denominator', 'Score', 'Lower Estimate', 'Higher Estimate']
unplan_readm_df[cols] = unplan_readm_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')
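
Because `errors='coerce'` silently turns anything unparseable into NaN, it can help to look at which strings failed before trusting the result. The sketch below does this on a toy Series; the value `'Too Few to Report'` is a made-up placeholder. It also shows that `downcast='float'` stores the result as float32, which is likely why values such as 22.2578 display as 22.257799 later in this notebook.

```python
import pandas as pd

# Toy values for illustration; 'Too Few to Report' is a hypothetical placeholder.
raw = pd.Series(
    ["1114", "22.6", "Not Available", "Not Available", "Too Few to Report"],
    name="Score",
)

converted = pd.to_numeric(raw, errors="coerce")

# Strings that were present in the raw data but could not be parsed as numbers.
failed = raw[converted.isna() & raw.notna()].value_counts()
print(failed)

# Same conversion with the downcast used above: the dtype becomes float32.
small = pd.to_numeric(raw, errors="coerce", downcast="float")
print(small.dtype)
```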

In [18]:
unplan_readm_df['Compared to National'].unique()

array(['No Different Than the National Rate', 'Number of Cases Too Small',
       'Not Available', 'Worse Than the National Rate',
       'Better Than the National Rate'], dtype=object)

In [19]:
#build the dummy columns once, keep their names, then append them to the dataframe
comp_nat_dummies = pd.get_dummies(unplan_readm_df['Compared to National'], prefix='comp_nat', drop_first=True)
dummy_column_names = list(comp_nat_dummies.columns)
unplan_readm_df = pd.concat([unplan_readm_df, comp_nat_dummies], axis=1)

In [20]:
unplan_readm_df = unplan_readm_df.drop(['Compared to National'], axis=1)
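
With `drop_first=True`, `get_dummies` drops the first level in sorted order, which here is 'Better Than the National Rate', so that level becomes the implicit baseline. If a different baseline is wanted, one option (a sketch, not the approach used in this notebook) is to pass a categorical with an explicit level order. The choice of 'No Different Than the National Rate' as baseline below is an assumption made for illustration.

```python
import pandas as pd

# The five levels seen in the unique() output above, one row each.
toy = pd.Series(
    [
        "No Different Than the National Rate",
        "Number of Cases Too Small",
        "Not Available",
        "Worse Than the National Rate",
        "Better Than the National Rate",
    ],
    name="Compared to National",
)

# Default behaviour: the alphabetically first level is dropped.
default = pd.get_dummies(toy, prefix="comp_nat", drop_first=True)

# Assumed alternative: put the desired baseline first in the category order.
levels = [
    "No Different Than the National Rate",
    "Better Than the National Rate",
    "Worse Than the National Rate",
    "Number of Cases Too Small",
    "Not Available",
]
releveled = pd.get_dummies(
    pd.Categorical(toy, categories=levels), prefix="comp_nat", drop_first=True
)

print(list(default.columns))
print(list(releveled.columns))
```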

In [21]:
unplan_readm_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4930 entries, 7 to 54089
Data columns (total 13 columns):
Provider ID                                     4930 non-null object
Hospital Name                                   4930 non-null object
State                                           4930 non-null object
ZIP Code                                        4930 non-null int64
READM_30_HF                                     4930 non-null int64
Denominator                                     3692 non-null float32
Score                                           3692 non-null float32
Lower Estimate                                  3692 non-null float32
Higher Estimate                                 3692 non-null float32
comp_nat_No Different Than the National Rate    4930 non-null uint8
comp_nat_Not Available                          4930 non-null uint8
comp_nat_Number of Cases Too Small              4930 non-null uint8
comp_nat_Worse Than the National Rate           4930 non-null uin

In [22]:
readm_red_df.head(3)

Unnamed: 0,Hospital Name,Provider ID,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_AMI_HRRP,Not Available,1 - The number of cases/patients is too few to...,Not Available,Not Available,Not Available,Not Available,7/1/2014,6/30/2017
1,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_CABG_HRRP,Not Available,1 - The number of cases/patients is too few to...,Not Available,Not Available,Not Available,Not Available,7/1/2014,6/30/2017
2,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_COPD_HRRP,217,,1.0195,20.9722,20.5712,47,7/1/2014,6/30/2017


In [23]:
readm_red_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19674 entries, 0 to 19673
Data columns (total 12 columns):
Hospital Name                 19674 non-null object
Provider ID                   19674 non-null int64
State                         19674 non-null object
Measure Name                  19674 non-null object
Number of Discharges          19674 non-null object
Footnote                      8157 non-null object
Excess Readmission Ratio      19674 non-null object
Predicted Readmission Rate    19674 non-null object
Expected Readmission Rate     19674 non-null object
Number of Readmissions        19674 non-null object
Start Date                    19674 non-null object
End Date                      19674 non-null object
dtypes: int64(1), object(11)
memory usage: 1.8+ MB


In [24]:
#drop unnecessary columns
readm_red_df = readm_red_df.drop(['Footnote', 'Start Date', 'End Date'], axis=1)

In [25]:
readm_red_df['Measure Name'].unique()

array(['READM_30_AMI_HRRP', 'READM_30_CABG_HRRP', 'READM_30_COPD_HRRP',
       'READM_30_HF_HRRP', 'READM_30_HIP_KNEE_HRRP', 'READM_30_PN_HRRP'],
      dtype=object)

In [26]:
#drop all but HF values
readm_red_df = readm_red_df[readm_red_df['Measure Name'].isin(['READM_30_HF_HRRP'])] 
readm_red_df.head(3)

Unnamed: 0,Hospital Name,Provider ID,State,Measure Name,Number of Discharges,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions
3,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_HF_HRRP,259,1.0773,23.9788,22.2578,67
9,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_HF_HRRP,75,0.9726,19.6816,20.2355,13
15,HARMON HOSPITAL,290042,NV,READM_30_HF_HRRP,Not Available,Not Available,Not Available,Not Available,Not Available


In [27]:
#change name of Measure Name column to identify the original df and measure
readm_red_df.rename(columns={'Measure Name':'READM_30_HF_HRRP'}, inplace=True)

In [28]:
#change value of READM_30_HF_HRRP to numeric
readm_red_df.loc[readm_red_df.READM_30_HF_HRRP == 'READM_30_HF_HRRP', 'READM_30_HF_HRRP'] = 1

In [29]:
readm_red_df.head(3)

Unnamed: 0,Hospital Name,Provider ID,State,READM_30_HF_HRRP,Number of Discharges,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions
3,BYRD REGIONAL HOSPITAL,190164,LA,1,259,1.0773,23.9788,22.2578,67
9,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,1,75,0.9726,19.6816,20.2355,13
15,HARMON HOSPITAL,290042,NV,1,Not Available,Not Available,Not Available,Not Available,Not Available


In [30]:
readm_red_df.isnull().sum()

Hospital Name                 0
Provider ID                   0
State                         0
READM_30_HF_HRRP              0
Number of Discharges          0
Excess Readmission Ratio      0
Predicted Readmission Rate    0
Expected Readmission Rate     0
Number of Readmissions        0
dtype: int64

In [31]:
#replace the 'Not Available' placeholder with NaN so missing values can be counted and dropped
readm_red_df = readm_red_df.replace('Not Available', np.nan)

In [32]:
readm_red_df.isnull().sum()

Hospital Name                   0
Provider ID                     0
State                           0
READM_30_HF_HRRP                0
Number of Discharges          602
Excess Readmission Ratio      397
Predicted Readmission Rate    397
Expected Readmission Rate     397
Number of Readmissions        609
dtype: int64
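
The 'Not Available' placeholder could also be handled when the file is read, using the `na_values` parameter of `pd.read_csv`. The sketch below uses an in-memory CSV built from three rows shown in the `readm_red_df.head()` output above; it is an alternative to the replacement step, not what this notebook ran.

```python
import io

import pandas as pd

# In-memory CSV built from rows shown earlier in this notebook.
csv_text = """Hospital Name,Number of Discharges,Excess Readmission Ratio
BYRD REGIONAL HOSPITAL,259,1.0773
HARMON HOSPITAL,Not Available,Not Available
GRAND ITASCA CLINIC AND HOSPITAL,75,0.9726
"""

# Treat the placeholder as missing at load time.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["Not Available"])

print(toy.isnull().sum())
print(toy.dtypes)
```

With this approach the numeric columns are parsed as floats directly, so a separate `pd.to_numeric` pass may not be needed for them.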

In [33]:
#drop rows with NaN: most are missing the number of discharges and readmissions, and predicted values are not useful without actual readmission rates
#dataset is large enough to support the loss of 609 records
readm_red_df = readm_red_df.dropna(how='any',axis=0)
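
Before dropping rows it can be useful to record how many are lost and what share of the data that is. The sketch below shows one way to do so on toy data; the values are illustrative, not the real file.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Number of Discharges": [259, np.nan, 75, np.nan],
    "Excess Readmission Ratio": [1.0773, np.nan, 0.9726, 1.01],
    "Number of Readmissions": [67, np.nan, 13, np.nan],
})

rows_before = len(toy)
missing_per_column = toy.isnull().sum()

# Same call as above: drop any row with at least one missing value.
complete = toy.dropna(how="any", axis=0)

rows_dropped = rows_before - len(complete)
share_dropped = rows_dropped / rows_before
print(missing_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows ({share_dropped:.0%})")
```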
 

In [34]:
readm_red_df.isnull().sum()

Hospital Name                 0
Provider ID                   0
State                         0
READM_30_HF_HRRP              0
Number of Discharges          0
Excess Readmission Ratio      0
Predicted Readmission Rate    0
Expected Readmission Rate     0
Number of Readmissions        0
dtype: int64

In [35]:
readm_red_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2670 entries, 3 to 19671
Data columns (total 9 columns):
Hospital Name                 2670 non-null object
Provider ID                   2670 non-null int64
State                         2670 non-null object
READM_30_HF_HRRP              2670 non-null int64
Number of Discharges          2670 non-null object
Excess Readmission Ratio      2670 non-null object
Predicted Readmission Rate    2670 non-null object
Expected Readmission Rate     2670 non-null object
Number of Readmissions        2670 non-null object
dtypes: int64(2), object(7)
memory usage: 208.6+ KB


In [36]:
cols = ['Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions']
readm_red_df[cols] = readm_red_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [37]:
readm_red_df.head(3)

Unnamed: 0,Hospital Name,Provider ID,State,READM_30_HF_HRRP,Number of Discharges,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions
3,BYRD REGIONAL HOSPITAL,190164,LA,1,259.0,1.0773,23.9788,22.257799,67.0
9,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,1,75.0,0.9726,19.681601,20.2355,13.0
21,CHESHIRE MEDICAL CENTER,300019,NH,1,303.0,0.9514,19.8545,20.8696,57.0


In [38]:
readm_red_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2670 entries, 3 to 19671
Data columns (total 9 columns):
Hospital Name                 2670 non-null object
Provider ID                   2670 non-null int64
State                         2670 non-null object
READM_30_HF_HRRP              2670 non-null int64
Number of Discharges          2670 non-null float32
Excess Readmission Ratio      2670 non-null float32
Predicted Readmission Rate    2670 non-null float32
Expected Readmission Rate     2670 non-null float32
Number of Readmissions        2670 non-null float32
dtypes: float32(5), int64(2), object(2)
memory usage: 156.4+ KB


In [39]:
mort_meas_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Measure Start Date,Measure End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Rate of complications for hip/knee replacement...,COMP_HIP_KNEE,No Different Than the National Rate,292,3.2,2.1,4.8,,4/1/2015,3/31/2018
1,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Death rate for heart attack patients,MORT_30_AMI,No Different Than the National Rate,688,13.0,11.0,15.5,,7/1/2015,6/30/2018
2,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Death rate for CABG surgery patients,MORT_30_CABG,No Different Than the National Rate,291,4.3,2.6,6.8,,7/1/2015,6/30/2018


In [40]:
mort_meas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91889 entries, 0 to 91888
Data columns (total 18 columns):
Provider ID             91889 non-null object
Hospital Name           91889 non-null object
Address                 91889 non-null object
City                    91889 non-null object
State                   91889 non-null object
ZIP Code                91889 non-null int64
County Name             91889 non-null object
Phone Number            91889 non-null int64
Measure Name            91889 non-null object
Measure ID              91889 non-null object
Compared to National    91889 non-null object
Denominator             91889 non-null object
Score                   91889 non-null object
Lower Estimate          91889 non-null object
Higher Estimate         91889 non-null object
Footnote                37628 non-null object
Measure Start Date      91889 non-null object
Measure End Date        91889 non-null object
dtypes: int64(2), object(16)
memory usage: 12.6+ MB


In [41]:
mort_meas_df = mort_meas_df.drop(['Address', 'City', 'County Name', 'Phone Number', 'Measure Name', 'Footnote', 'Measure Start Date', 'Measure End Date'], axis=1)

In [42]:
mort_meas_df['Measure ID'].unique()

array(['COMP_HIP_KNEE', 'MORT_30_AMI', 'MORT_30_CABG', 'MORT_30_COPD',
       'MORT_30_HF', 'MORT_30_PN', 'MORT_30_STK', 'PSI_10_POST_KIDNEY',
       'PSI_11_POST_RESP', 'PSI_12_POSTOP_PULMEMB_DVT',
       'PSI_13_POST_SEPSIS', 'PSI_14_POSTOP_DEHIS', 'PSI_15_ACC_LAC',
       'PSI_3_ULCER', 'PSI_4_SURG_COMP', 'PSI_6_IAT_PTX',
       'PSI_8_POST_HIP', 'PSI_90_SAFETY', 'PSI_9_POST_HEM'], dtype=object)

In [43]:
#drop all but HF values
mort_meas_df = mort_meas_df[mort_meas_df['Measure ID'].isin(['MORT_30_HF'])] 
mort_meas_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,State,ZIP Code,Measure ID,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate
4,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,MORT_30_HF,No Different Than the National Rate,869,12.7,10.7,15.0
23,10005,MARSHALL MEDICAL CENTER SOUTH,AL,35957,MORT_30_HF,No Different Than the National Rate,318,14.4,11.4,17.9
42,10006,NORTH ALABAMA MEDICAL CENTER,AL,35630,MORT_30_HF,No Different Than the National Rate,671,12.9,10.6,15.4


In [44]:
#change name of Measure ID columns to identify the original df and measure
mort_meas_df.rename(columns={'Measure ID':'MORT_30_HF'}, inplace=True)

In [45]:
#change value of MORT_30_HF to numeric
mort_meas_df.loc[mort_meas_df.MORT_30_HF == 'MORT_30_HF', 'MORT_30_HF'] = 1

In [46]:
mort_meas_df.isnull().sum()

Provider ID             0
Hospital Name           0
State                   0
ZIP Code                0
MORT_30_HF              0
Compared to National    0
Denominator             0
Score                   0
Lower Estimate          0
Higher Estimate         0
dtype: int64

In [47]:
mort_meas_df['Compared to National'].nunique()

5

In [48]:
mort_meas_df = pd.concat([mort_meas_df, pd.get_dummies(mort_meas_df['Compared to National'], prefix='nat_comp', drop_first=True)], axis =1)

In [49]:
mort_meas_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,State,ZIP Code,MORT_30_HF,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,nat_comp_No Different Than the National Rate,nat_comp_Not Available,nat_comp_Number of Cases Too Small,nat_comp_Worse Than the National Rate
4,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,1,No Different Than the National Rate,869,12.7,10.7,15.0,1,0,0,0
23,10005,MARSHALL MEDICAL CENTER SOUTH,AL,35957,1,No Different Than the National Rate,318,14.4,11.4,17.9,1,0,0,0
42,10006,NORTH ALABAMA MEDICAL CENTER,AL,35630,1,No Different Than the National Rate,671,12.9,10.6,15.4,1,0,0,0


In [50]:
cols = ['Denominator', 'Score', 'Lower Estimate', 'Higher Estimate']
mort_meas_df[cols] = mort_meas_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [51]:
mort_meas_df = mort_meas_df.drop(['Compared to National'], axis=1)

In [52]:
mort_meas_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4930 entries, 4 to 91874
Data columns (total 13 columns):
Provider ID                                     4930 non-null object
Hospital Name                                   4930 non-null object
State                                           4930 non-null object
ZIP Code                                        4930 non-null int64
MORT_30_HF                                      4930 non-null int64
Denominator                                     3617 non-null float32
Score                                           3617 non-null float32
Lower Estimate                                  3617 non-null float32
Higher Estimate                                 3617 non-null float32
nat_comp_No Different Than the National Rate    4930 non-null uint8
nat_comp_Not Available                          4930 non-null uint8
nat_comp_Number of Cases Too Small              4930 non-null uint8
nat_comp_Worse Than the National Rate           4930 non-null uin

In [53]:
hcahps_df.head(5)

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,HCAHPS Measure ID,HCAHPS Question,...,Patient Survey Star Rating Footnote,HCAHPS Answer Percent,HCAHPS Answer Percent Footnote,HCAHPS Linear Mean Value,Number of Completed Surveys,Number of Completed Surveys Footnote,Survey Response Rate Percent,Survey Response Rate Percent Footnote,Measure Start Date,Measure End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_A_P,"Patients who reported that their nurses ""Alway...",...,,72,,Not Applicable,526,,21,,10/1/2017,9/30/2018
1,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_SN_P,"Patients who reported that their nurses ""Somet...",...,,9,,Not Applicable,526,,21,,10/1/2017,9/30/2018
2,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_U_P,"Patients who reported that their nurses ""Usual...",...,,19,,Not Applicable,526,,21,,10/1/2017,9/30/2018
3,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_LINEAR_SCORE,Nurse communication - linear mean score,...,,Not Applicable,,87,526,,21,,10/1/2017,9/30/2018
4,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_STAR_RATING,Nurse communication - star rating,...,,Not Applicable,,Not Applicable,526,,21,,10/1/2017,9/30/2018


In [54]:
hcahps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455235 entries, 0 to 455234
Data columns (total 22 columns):
Provider ID                              455235 non-null object
Hospital Name                            455235 non-null object
Address                                  455235 non-null object
City                                     455235 non-null object
State                                    455235 non-null object
ZIP Code                                 455235 non-null int64
County Name                              455235 non-null object
Phone Number                             455235 non-null int64
HCAHPS Measure ID                        455235 non-null object
HCAHPS Question                          455235 non-null object
HCAHPS Answer Description                455235 non-null object
Patient Survey Star Rating               455235 non-null object
Patient Survey Star Rating Footnote      15103 non-null object
HCAHPS Answer Percent                    455235 non-null obj

In [55]:
hcahps_df = hcahps_df.drop(['Address', 'City', 'County Name', 'Phone Number', 'HCAHPS Answer Percent Footnote',
              'Patient Survey Star Rating Footnote', 'Number of Completed Surveys Footnote',
              'Survey Response Rate Percent Footnote', 'Measure Start Date', 'Measure End Date'], axis=1)

#### Unable to use this dataset without significant wrangling due to its current structure; only the patient star rating and HCAHPS linear mean value will be used as features at this time

In [56]:
cols = ['Patient Survey Star Rating', 'HCAHPS Linear Mean Value']
hcahps_df[cols] = hcahps_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [57]:
hcahps_df = hcahps_df.loc[(hcahps_df['Patient Survey Star Rating'] >0) | 
                          (hcahps_df['HCAHPS Linear Mean Value'] >0)] 

In [58]:
hcahps_df.head()

Unnamed: 0,Provider ID,Hospital Name,State,ZIP Code,HCAHPS Measure ID,HCAHPS Question,HCAHPS Answer Description,Patient Survey Star Rating,HCAHPS Answer Percent,HCAHPS Linear Mean Value,Number of Completed Surveys,Survey Response Rate Percent
3,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,H_COMP_1_LINEAR_SCORE,Nurse communication - linear mean score,Nurse communication - linear mean score,,Not Applicable,87.0,526,21
4,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,H_COMP_1_STAR_RATING,Nurse communication - star rating,Nurse communication - star rating,2.0,Not Applicable,,526,21
17,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,H_COMP_2_LINEAR_SCORE,Doctor communication - linear mean score,Doctor communication - linear mean score,,Not Applicable,90.0,526,21
18,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,H_COMP_2_STAR_RATING,Doctor communication - star rating,Doctor communication - star rating,3.0,Not Applicable,,526,21
31,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,36301,H_COMP_3_LINEAR_SCORE,Staff responsiveness - linear mean score,Staff responsiveness - linear mean score,,Not Applicable,76.0,526,21


In [59]:
hcahps_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73962 entries, 3 to 454955
Data columns (total 12 columns):
Provider ID                     73962 non-null object
Hospital Name                   73962 non-null object
State                           73962 non-null object
ZIP Code                        73962 non-null int64
HCAHPS Measure ID               73962 non-null object
HCAHPS Question                 73962 non-null object
HCAHPS Answer Description       73962 non-null object
Patient Survey Star Rating      38742 non-null float32
HCAHPS Answer Percent           73962 non-null object
HCAHPS Linear Mean Value        35220 non-null float32
Number of Completed Surveys     73962 non-null object
Survey Response Rate Percent    73962 non-null object
dtypes: float32(2), int64(1), object(9)
memory usage: 6.8+ MB


In [60]:
hcahps_df['HCAHPS Measure ID'].unique()

array(['H_COMP_1_LINEAR_SCORE', 'H_COMP_1_STAR_RATING',
       'H_COMP_2_LINEAR_SCORE', 'H_COMP_2_STAR_RATING',
       'H_COMP_3_LINEAR_SCORE', 'H_COMP_3_STAR_RATING',
       'H_COMP_5_LINEAR_SCORE', 'H_COMP_5_STAR_RATING',
       'H_COMP_6_LINEAR_SCORE', 'H_COMP_6_STAR_RATING',
       'H_COMP_7_LINEAR_SCORE', 'H_COMP_7_STAR_RATING',
       'H_CLEAN_LINEAR_SCORE', 'H_CLEAN_STAR_RATING',
       'H_QUIET_LINEAR_SCORE', 'H_QUIET_STAR_RATING',
       'H_HSP_RATING_LINEAR_SCORE', 'H_HSP_RATING_STAR_RATING',
       'H_RECMND_LINEAR_SCORE', 'H_RECMND_STAR_RATING', 'H_STAR_RATING'],
      dtype=object)

In [None]:
hcahps_df = hcahps_df.drop(['HCAHPS Question', 'HCAHPS Answer Percent', 'Survey Response Rate Percent'], axis=1)

#### original dataframe info:

- **unplan_readm_df:** 54093 entries reduced to 4930; measure column name = READM_30_HF, value = 1; no nulls; 'Compared to National' has 5 possible values (convert to dummies)
- **readm_red_df:** 19674 entries reduced to 2670; no ZIP code column; measure column name = READM_30_HF_HRRP, value = 1; nulls removed
- **mort_meas_df:** 91889 entries reduced to 4930; measure column name = MORT_30_HF, value = 1; no nulls; 'Compared to National' has 5 possible values (convert to dummies)
- **hcahps_df:** 455235 entries reduced to 
- **gen_info_df:** 

In [None]:
gen_info_df.head(3)

In [None]:
gen_info_df.info()

In [None]:
gen_info_df = gen_info_df.drop(['Address', 'City', 'County Name', 'Phone Number', 'Hospital overall rating footnote',
                'Mortality national comparison footnote', 'Safety of care national comparison footnote',
                'Readmission national comparison footnote', 'Patient experience national comparison footnote',
                'Effectiveness of care national comparison footnote', 'Timeliness of care national comparison footnote',
                'Efficient use of medical imaging national comparison footnote'], axis=1)

In [None]:
#convert to dummies
dummy_list = ['Hospital Type', 'Hospital Ownership', 'Emergency Services', 'Meets criteria for meaningful use of EHRs', 
'Hospital overall rating', 'Mortality national comparison', 'Safety of care national comparison', 'Readmission national comparison', 
'Patient experience national comparison', 'Effectiveness of care national comparison', 'Timeliness of care national comparison', 
'Efficient use of medical imaging national comparison']

In [None]:
for column in dummy_list:
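    # prefix must be a plain string here: with a Series input, a list prefix is formatted literally into the new column names
    gen_info_df = pd.concat([gen_info_df, pd.get_dummies(gen_info_df[column], prefix=column, drop_first=True)], axis=1)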


In [None]:
gen_info_df.head(3)

In [None]:
gen_info_df.info()

In [None]:
gen_info_df = gen_info_df.drop(['Hospital Type', 'Hospital Ownership', 'Emergency Services', 'Meets criteria for meaningful use of EHRs', 
'Hospital overall rating', 'Mortality national comparison', 'Safety of care national comparison', 'Readmission national comparison', 
'Patient experience national comparison', 'Effectiveness of care national comparison', 'Timeliness of care national comparison', 
'Efficient use of medical imaging national comparison'], axis=1)
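
The encode-then-drop steps above can also be written as a single call. This is a minimal sketch on a made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the hospital files): passing `columns=` to `pd.get_dummies` encodes the listed columns, prefixes each new column with its source column name, and removes the originals.

```python
import pandas as pd

# Hypothetical toy data standing in for gen_info_df
toy = pd.DataFrame({
    'Provider ID': [1, 2, 3],
    'Hospital Type': ['Acute Care', 'Critical Access', 'Acute Care'],
    'Emergency Services': ['Yes', 'No', 'Yes'],
})

# Encode both categorical columns and drop the originals in one step;
# drop_first=True omits the first level of each column to avoid redundancy
encoded = pd.get_dummies(
    toy,
    columns=['Hospital Type', 'Emergency Services'],
    drop_first=True,
)
print(encoded.columns.tolist())
```

With this form the separate `drop` cell would no longer be needed, since the source columns are already gone from the result.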

In [None]:
gen_info_df.info()

### convert objects to floats

In [None]:
# unplan_readm_df: convert cols
# readm_red_df: convert cols + Provider ID to non-null object
# mort_meas_df: convert cols
# hcahps_df: convert cols
# gen_info_df: Provider ID to non-null object

In [None]:
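# Convert only the columns listed in cols to numeric; values that cannot be parsed become NaN.
# x is passed to pd.to_numeric as downcast ('float', 'integer', ...).
def num_convert(cols, data, x):
    for col in cols:
        data[col] = pd.to_numeric(data[col], errors='coerce', downcast=x)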

In [None]:
cols = ['Compared to National', 'Denominator', 'Score', 'Lower Estimate']
data = unplan_readm_df
x = 'float'
num_convert(cols, data, x)

In [None]:
cols = ['Provider ID']
data = unplan_readm_df
x = 'integer'
num_convert(cols, data, x)

In [None]:
unplan_readm_df.info()

In [None]:
unplan_readm_df['Provider ID'].isnull().sum()

In [None]:
unplan_readm_df['Provider ID'] = unplan_readm_df['Provider ID'].astype(np.int64)

In [None]:
# check dtypes: if every column came back as float64, the conversion loop ran over all columns of data rather than only those listed in cols
unplan_readm_df.info()
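
On the dtype check above: iterating directly over a DataFrame yields its column labels, so a loop of the form `for name in data:` visits every column no matter which list was passed in. The sketch below illustrates this on a made-up frame and shows a helper that touches only the requested columns (`toy` and `convert_selected` are hypothetical names used for illustration).

```python
import pandas as pd

# Hypothetical toy data: one numeric-looking column, one text column
toy = pd.DataFrame({'Score': ['1.5', '2.5'], 'State': ['AL', 'AK']})

# Iterating over a DataFrame yields its column labels, not the caller's list
print(list(toy))

def convert_selected(cols, data, downcast=None):
    """Convert only the columns named in cols; other columns are left untouched."""
    for col in cols:
        data[col] = pd.to_numeric(data[col], errors='coerce', downcast=downcast)

convert_selected(['Score'], toy, downcast='float')
print(toy.dtypes)
```

Only `Score` becomes numeric here; `State` keeps its text values instead of being coerced to NaN.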

### merge dataframes

In [None]:
hosp_readmin_df = pd.merge(unplan_readm_df, readm_red_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.head(3)

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, mort_meas_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, hcahps_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = hosp_readmin_df.drop(['Compared to National_x'], axis=1)

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, gen_info_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')
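
One thing worth checking after these merges is the row count. If one frame has a single row per hospital and another has several rows per hospital (hcahps_df holds many measure IDs per provider), the merge repeats the per-hospital values once for every matching row. A minimal sketch on made-up frames (`per_hospital`, `per_measure`, and all values are hypothetical) uses the `indicator` and `validate` parameters of `pd.merge` to surface this.

```python
import pandas as pd

# Hypothetical toy frames: one row per hospital vs. several rows per hospital
per_hospital = pd.DataFrame({'Provider ID': [10001, 10005], 'Score': [21.5, 19.8]})
per_measure = pd.DataFrame({
    'Provider ID': [10001, 10001, 10005],
    'HCAHPS Measure ID': ['H_COMP_1_LINEAR_SCORE', 'H_COMP_2_LINEAR_SCORE',
                          'H_COMP_1_LINEAR_SCORE'],
    'HCAHPS Linear Mean Value': [87.0, 90.0, 91.0],
})

# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(per_hospital, per_measure, on='Provider ID',
                  how='outer', indicator=True)
print(len(per_hospital), len(merged))
print(merged['_merge'].value_counts())

# validate raises MergeError when the key relationship is not the expected one
try:
    pd.merge(per_hospital, per_measure, on='Provider ID',
             how='outer', validate='one_to_one')
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False
print('one-to-one merge:', one_to_one)
```

If one row per hospital is the goal, the per-measure frame would need to be reshaped to one row per provider before merging; whether that fits this analysis is a design decision left open here.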

### missing values

In [None]:
# hcahps_df: will need to fill NaN with dummy values, do not drop rows
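
A minimal sketch of the note above, reading "dummy values" as a placeholder value that stands in for a missing entry (that reading, the `SENTINEL` value, and the `toy` frame are assumptions). Rows are kept, the NaN is replaced, and a flag column records which entries were originally missing so a model can tell them apart from real ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing star rating
toy = pd.DataFrame({
    'Provider ID': [10001, 10005, 10006],
    'Patient Survey Star Rating': [2.0, np.nan, 4.0],
})

SENTINEL = -1  # assumed placeholder; any value outside the valid rating range works
col = 'Patient Survey Star Rating'

# Record which rows were missing before overwriting them
toy[col + ' missing'] = toy[col].isnull().astype(int)

# Fill the gaps without dropping any rows
toy[col] = toy[col].fillna(SENTINEL)
print(toy)
```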

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df.isnull().sum()

In [None]:
hosp_readmin_df.head()

In [None]:
hosp_readmin_df = hosp_readmin_df.drop(['Measure Start Date_x', 'Measure Start Date_y', 'Measure End Date_x', 'Measure End Date_y', 
'Start Date', 'End Date', 'Measure Start Date', 'Measure End Date'], axis=1)

In [None]:
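#cut: 'Measure Start Date_x' is already dropped in the cell above; errors='ignore' keeps a rerun from raising KeyError
hosp_readmin_df = hosp_readmin_df.drop(['Measure Start Date_x'], axis=1, errors='ignore')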

In [None]:
hosp_readmin_df.head()

In [None]:
# Preview selected columns; skip any that were dropped in earlier cells or are absent after the merges
preview_cols = ['Measure Name_y', 'Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions', 'Measure Name', 
'Measure ID_y', 'Compared to National_y',  'Denominator_y',  'Score_y',  'Lower Estimate_y',  'Higher Estimate_y', 'HCAHPS Measure ID', 
'HCAHPS Question', 'HCAHPS Answer Description', 'Patient Survey Star Rating', 'HCAHPS Answer Percent', 'HCAHPS Linear Mean Value', 
'Number of Completed Surveys', 'Survey Response Rate Percent', 'Hospital Type', 'Hospital Ownership']
preview_cols = [c for c in preview_cols if c in hosp_readmin_df.columns]
print(hosp_readmin_df[preview_cols].head(10))

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df.isnull()

In [None]:
hosp_readmin_df.isnull().values.any(axis=1)

### creating additional features
Excess Readmission Ratio: the ratio of a hospital's "predicted" readmissions to the readmissions CMS "expected" for that hospital. 

Calculate the **Actual Readmission Rate** = 'actual_rrate' (number of readmissions / number of discharges, expressed as a percentage)

Calculate the **Actual Readmission Ratio** = 'actual_rratio' ('actual_rrate' / 'Expected Readmission Rate') so that it can be compared with the Excess Readmission Ratio

In [None]:
df_readmin['actual_rrate'] = df_readmin['Number of Readmissions']/df_readmin['Number of Discharges'] * 100

In [None]:
df_readmin['actual_rratio'] = df_readmin['actual_rrate']/df_readmin['Expected Readmission Rate']
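
As a quick sanity check of the two formulas, here is a worked example on made-up numbers (the `toy` frame and its values are hypothetical, and 'Expected Readmission Rate' is assumed to be expressed in percent, matching the `* 100` above).

```python
import pandas as pd

# Hypothetical toy data: two hospitals
toy = pd.DataFrame({
    'Number of Discharges': [200, 400],
    'Number of Readmissions': [50, 80],
    'Expected Readmission Rate': [20.0, 25.0],  # percent
})

# Actual readmission rate as a percentage of discharges
toy['actual_rrate'] = toy['Number of Readmissions'] / toy['Number of Discharges'] * 100

# Ratio of actual to expected: above 1 means more readmissions than expected
toy['actual_rratio'] = toy['actual_rrate'] / toy['Expected Readmission Rate']
print(toy[['actual_rrate', 'actual_rratio']])
```

The first hospital readmits 25% of its discharges against an expected 20%, a ratio of 1.25; the second readmits 20% against an expected 25%, a ratio of 0.8.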