## Data source and information

data source: https://healthdata.gov/dataset/hospital-readmission-reduction/resource/f3830eb1-2d22-496c-b663-46b54e175d9f

https://healthdata.gov/dataset/hospital-readmission-reduction

In October 2012, CMS began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected” for an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
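
As a rough illustration of the ratio described above, the sketch below divides a predicted count by an expected count. The function name `excess_readmission_ratio` and the numbers are hypothetical and are not taken from CMS data.

```python
def excess_readmission_ratio(predicted, expected):
    """Ratio of predicted to expected 30-day readmissions."""
    if expected <= 0:
        raise ValueError("expected readmissions must be positive")
    return predicted / expected

# Hypothetical hospital: 55 predicted vs. 50 expected readmissions.
ratio = excess_readmission_ratio(55, 50)
has_excess = ratio > 1  # a ratio above 1 indicates excess readmissions
print(f"ratio = {ratio:.2f}, excess readmissions: {has_excess}")
```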

https://data.medicare.gov/data/hospital-compare
Hospital Compare is a consumer-oriented website that provides information on the quality of care hospitals are providing to their patients. This information can help consumers make informed decisions about health care. Hospital Compare allows consumers to select multiple hospitals and directly compare performance measure information related to heart attack, emergency department care, preventive care, stroke care, and other conditions. The Centers for Medicare & Medicaid Services (CMS) created the Hospital Compare website to better inform health care consumers about a hospital’s quality of care. Hospital Compare provides data on over 4,000 Medicare-certified hospitals, including acute care hospitals, critical access hospitals (CAHs), children’s hospitals, Veterans Health Administration (VHA) Medical Centers, and hospital outpatient departments. Hospital Compare is part of an Administration-wide effort to increase the availability and accessibility of information on quality, utilization, and costs for effective, informed decision-making. More information about Hospital Compare can be found by visiting the CMS.gov website and performing a search for Hospital Compare. To access the Hospital Compare website, please visit www.medicare.gov/hospitalcompare. 

https://www.medicare.gov/hospitalcompare/Data/Data-Updated.html#%20
measures & current data period

## Hospital Compare overall hospital rating

https://www.medicare.gov/hospitalcompare/Data/Hospital-overall-ratings-calculation.html

n = 4,573
distribution of stars (N/A, 1-5)

The methodology uses a statistical model known as a latent variable model. Seven different latent variable models are used to calculate scores for 7 groups of measures:

- Mortality
- Safety of Care
- Readmission
- Patient Experience
- Effectiveness of Care
- Timeliness of Care
- Efficient Use of Medical Imaging

A hospital summary score is then calculated by taking the weighted average of these group scores. If a hospital is missing a measure category or group, the weights are redistributed amongst the qualifying measure categories or groups.
Finally, the overall hospital rating is calculated using the hospital summary score.
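
The weighted average with redistributed weights can be sketched as follows. This is only an illustration: the weights and scores are made up (they are not the CMS values), the function name `summary_score` is hypothetical, and proportional redistribution of the missing group's weight is an assumption about how "redistributed" is meant.

```python
def summary_score(group_scores, weights):
    """Weighted average of the available group scores.

    Groups whose score is None are treated as missing; their weight is
    redistributed proportionally across the remaining groups (assumption).
    """
    available = {g: s for g, s in group_scores.items() if s is not None}
    if not available:
        raise ValueError("no group scores available")
    total_weight = sum(weights[g] for g in available)
    return sum(s * weights[g] / total_weight for g, s in available.items())

# Hypothetical weights and scores, for illustration only.
weights = {"Mortality": 0.25, "Readmission": 0.25, "Patient Experience": 0.5}
scores = {"Mortality": 0.2, "Readmission": None, "Patient Experience": -0.4}

result = summary_score(scores, weights)
print(result)
```

Dividing each weight by the total weight of the available groups keeps the effective weights summing to 1, so hospitals with a missing group remain comparable to hospitals with all groups.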

## Load data files

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
hcahps_df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\capstone_3\hcahps_ratings.csv')

In [None]:
# HCAHPS data source: https://healthdata.gov/dataset/hcahps-hospital

## Cleaning data

### reduce each file to relevant data only

In [4]:
hcahps_df.head(3)

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,HCAHPS Measure ID,HCAHPS Question,...,Patient Survey Star Rating Footnote,HCAHPS Answer Percent,HCAHPS Answer Percent Footnote,HCAHPS Linear Mean Value,Number of Completed Surveys,Number of Completed Surveys Footnote,Survey Response Rate Percent,Survey Response Rate Percent Footnote,Measure Start Date,Measure End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_A_P,"Patients who reported that their nurses ""Alway...",...,,72,,Not Applicable,526,,21,,10/1/2017,9/30/2018
1,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_SN_P,"Patients who reported that their nurses ""Somet...",...,,9,,Not Applicable,526,,21,,10/1/2017,9/30/2018
2,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,H_COMP_1_U_P,"Patients who reported that their nurses ""Usual...",...,,19,,Not Applicable,526,,21,,10/1/2017,9/30/2018


In [5]:
hcahps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455235 entries, 0 to 455234
Data columns (total 22 columns):
Provider ID                              455235 non-null object
Hospital Name                            455235 non-null object
Address                                  455235 non-null object
City                                     455235 non-null object
State                                    455235 non-null object
ZIP Code                                 455235 non-null int64
County Name                              455235 non-null object
Phone Number                             455235 non-null int64
HCAHPS Measure ID                        455235 non-null object
HCAHPS Question                          455235 non-null object
HCAHPS Answer Description                455235 non-null object
Patient Survey Star Rating               455235 non-null object
Patient Survey Star Rating Footnote      15103 non-null object
HCAHPS Answer Percent                    455235 non-null obj

In [6]:
hcahps_df = hcahps_df.drop(['Address', 'City', 'ZIP Code', 'County Name', 'Phone Number', 'HCAHPS Answer Percent Footnote',
              'Patient Survey Star Rating Footnote', 'Number of Completed Surveys Footnote',
              'Survey Response Rate Percent Footnote', 'Measure Start Date', 'Measure End Date'], axis=1)

In [None]:
#keeping all questions for now; still need to decide how to use 'Number of Completed Surveys':
#weight responses based on total surveys? get_dummies?
#also, fill NaN star values with a placeholder, or repeat the same star rating for each subcategory?

In [7]:
hcahps_df['HCAHPS Measure ID'].unique()

array(['H_COMP_1_A_P', 'H_COMP_1_SN_P', 'H_COMP_1_U_P',
       'H_COMP_1_LINEAR_SCORE', 'H_COMP_1_STAR_RATING',
       'H_NURSE_RESPECT_A_P', 'H_NURSE_RESPECT_SN_P',
       'H_NURSE_RESPECT_U_P', 'H_NURSE_LISTEN_A_P', 'H_NURSE_LISTEN_SN_P',
       'H_NURSE_LISTEN_U_P', 'H_NURSE_EXPLAIN_A_P',
       'H_NURSE_EXPLAIN_SN_P', 'H_NURSE_EXPLAIN_U_P', 'H_COMP_2_A_P',
       'H_COMP_2_SN_P', 'H_COMP_2_U_P', 'H_COMP_2_LINEAR_SCORE',
       'H_COMP_2_STAR_RATING', 'H_DOCTOR_RESPECT_A_P',
       'H_DOCTOR_RESPECT_SN_P', 'H_DOCTOR_RESPECT_U_P',
       'H_DOCTOR_LISTEN_A_P', 'H_DOCTOR_LISTEN_SN_P',
       'H_DOCTOR_LISTEN_U_P', 'H_DOCTOR_EXPLAIN_A_P',
       'H_DOCTOR_EXPLAIN_SN_P', 'H_DOCTOR_EXPLAIN_U_P', 'H_COMP_3_A_P',
       'H_COMP_3_SN_P', 'H_COMP_3_U_P', 'H_COMP_3_LINEAR_SCORE',
       'H_COMP_3_STAR_RATING', 'H_CALL_BUTTON_A_P', 'H_CALL_BUTTON_SN_P',
       'H_CALL_BUTTON_U_P', 'H_BATH_HELP_A_P', 'H_BATH_HELP_SN_P',
       'H_BATH_HELP_U_P', 'H_COMP_5_A_P', 'H_COMP_5_SN_P', 'H_COMP_5_U_P',


# Slack question:
Currently each of these values (above and below) represents a ROW in the dataframe that is a QUESTION in a survey. All of these values are contained in the column 'HCAHPS Measure ID'.

Each question/row essentially contains the answer. For example, the question doesn't ask "Did your nurse communicate well with you?" and then give respondents the choice of answering "Always," "Sometimes," or "Usually." Instead, the question reads: "Patients who reported that their nurses "Always" communicated well" or "Patients who reported that their nurses "Sometimes" communicated well", etc. Every question/answer combination is therefore a separate row, and 'HCAHPS Answer Percent' holds the percentages for all possible answers in one column but in different rows.

I would like to use the answers (7 possible answers) as features in my model. I THINK I will need to create a column for each answer. I would like to combine the 3 "questions" into one row and somehow transfer the "answer" to one of the 7 new columns like so:

| hospital | question | always | sometimes/never | usually | hosp <6 | hosp 7-8 | hosp 9-10 | recommend |
|---|---|---|---|---|---|---|---|---|
| ABC Hospital | Nurse communicate | 42 | 75 | 67 | 0 | 0 | 0 | 0 |

instead of:

| hospital | question | answer |
|---|---|---|
| ABC hosp | nurse always communicate | 42 |
| ABC hosp | nurse sometimes communicate | 75 |
| ABC hosp | nurse usually communicate | 67 |

HOW DO I DO THAT??
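
One possible way to get the wide layout described above is to split each measure ID into a question part and an answer part, then pivot. The sketch below uses a tiny hypothetical frame with the example numbers from the question. It assumes that percent measures end in `_<answer code>_P` with codes `A`, `SN`, and `U`; other measure families (yes/no, agree/disagree, rating bands) would need their own patterns.

```python
import pandas as pd

# Hypothetical long-format rows shaped like the HCAHPS file.
long_df = pd.DataFrame({
    "Hospital Name": ["ABC Hospital"] * 3,
    "HCAHPS Measure ID": ["H_COMP_1_A_P", "H_COMP_1_SN_P", "H_COMP_1_U_P"],
    "HCAHPS Answer Percent": [42, 75, 67],
})

# Split each measure ID into a question part and an answer code.
parts = long_df["HCAHPS Measure ID"].str.extract(
    r"^(?P<question>.+)_(?P<answer>A|SN|U)_P$"
)
long_df = long_df.join(parts)

# One row per hospital and question, one column per answer code.
wide_df = (
    long_df.pivot_table(
        index=["Hospital Name", "question"],
        columns="answer",
        values="HCAHPS Answer Percent",
        aggfunc="first",
    )
    .rename(columns={"A": "always", "SN": "sometimes/never", "U": "usually"})
    .reset_index()
)
wide_df.columns.name = None
print(wide_df)
```

`pivot_table` with `aggfunc="first"` is used here instead of `pivot` so that an accidental duplicate row does not raise an error; whether silently keeping the first value is acceptable depends on the data.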

In [None]:
#question 1: nurses communication
('H_COMP_1_A_P', 'H_COMP_1_SN_P', 'H_COMP_1_U_P',
 'H_COMP_1_LINEAR_SCORE', 'H_COMP_1_STAR_RATING')

In [None]:
('H_NURSE_RESPECT_A_P', 'H_NURSE_RESPECT_SN_P',
 'H_NURSE_RESPECT_U_P')

In [None]:
('H_NURSE_LISTEN_A_P', 'H_NURSE_LISTEN_SN_P',
 'H_NURSE_LISTEN_U_P')

In [None]:
('H_NURSE_EXPLAIN_A_P',
 'H_NURSE_EXPLAIN_SN_P', 'H_NURSE_EXPLAIN_U_P')

In [None]:
('H_COMP_2_A_P',
 'H_COMP_2_SN_P', 'H_COMP_2_U_P', 'H_COMP_2_LINEAR_SCORE',
 'H_COMP_2_STAR_RATING')

In [None]:
('H_DOCTOR_RESPECT_A_P',
 'H_DOCTOR_RESPECT_SN_P', 'H_DOCTOR_RESPECT_U_P')

In [None]:
('H_DOCTOR_LISTEN_A_P', 'H_DOCTOR_LISTEN_SN_P',
 'H_DOCTOR_LISTEN_U_P')

In [None]:
('H_DOCTOR_EXPLAIN_A_P',
 'H_DOCTOR_EXPLAIN_SN_P', 'H_DOCTOR_EXPLAIN_U_P')

In [None]:
('H_COMP_3_A_P',
 'H_COMP_3_SN_P', 'H_COMP_3_U_P', 'H_COMP_3_LINEAR_SCORE',
 'H_COMP_3_STAR_RATING')

In [None]:
('H_CALL_BUTTON_A_P', 'H_CALL_BUTTON_SN_P',
 'H_CALL_BUTTON_U_P')

In [None]:
('H_BATH_HELP_A_P', 'H_BATH_HELP_SN_P',
 'H_BATH_HELP_U_P')

In [None]:
('H_COMP_5_A_P', 'H_COMP_5_SN_P', 'H_COMP_5_U_P',
 'H_COMP_5_LINEAR_SCORE', 'H_COMP_5_STAR_RATING')

In [None]:
('H_MED_FOR_A_P',
 'H_MED_FOR_SN_P', 'H_MED_FOR_U_P')

In [None]:
('H_SIDE_EFFECTS_A_P',
 'H_SIDE_EFFECTS_SN_P', 'H_SIDE_EFFECTS_U_P')

In [None]:
#yes/no, yes = always, no = sometimes/never
('H_COMP_6_N_P',
 'H_COMP_6_Y_P', 'H_COMP_6_LINEAR_SCORE', 'H_COMP_6_STAR_RATING')

In [None]:
('H_DISCH_HELP_N_P', 'H_DISCH_HELP_Y_P')

In [None]:
('H_SYMPTOMS_N_P',
 'H_SYMPTOMS_Y_P')

In [None]:
('H_COMP_7_A', 'H_COMP_7_D_SD', 'H_COMP_7_SA',
 'H_COMP_7_LINEAR_SCORE', 'H_COMP_7_STAR_RATING')

In [None]:
('H_CT_PREFER_A',
 'H_CT_PREFER_D_SD', 'H_CT_PREFER_SA')

In [None]:
('H_CT_UNDER_A',
 'H_CT_UNDER_D_SD', 'H_CT_UNDER_SA')

In [None]:
('H_CT_MED_A', 'H_CT_MED_D_SD',
 'H_CT_MED_SA')

In [None]:
('H_CLEAN_HSP_A_P', 'H_CLEAN_HSP_SN_P',
 'H_CLEAN_HSP_U_P', 'H_CLEAN_LINEAR_SCORE', 'H_CLEAN_STAR_RATING')

In [None]:
('H_QUIET_HSP_A_P', 'H_QUIET_HSP_SN_P', 'H_QUIET_HSP_U_P',
 'H_QUIET_LINEAR_SCORE', 'H_QUIET_STAR_RATING')

In [None]:
('H_HSP_RATING_0_6',
 'H_HSP_RATING_7_8', 'H_HSP_RATING_9_10',
 'H_HSP_RATING_LINEAR_SCORE', 'H_HSP_RATING_STAR_RATING')

In [None]:
('H_RECMND_DN', 'H_RECMND_DY', 'H_RECMND_PY',
 'H_RECMND_LINEAR_SCORE', 'H_RECMND_STAR_RATING', 'H_STAR_RATING')

### convert objects to floats

In [None]:
# conversions needed per dataframe:
# unplan_readm_df: convert cols
# readm_red_df: convert cols + Provider ID to non-null object
# mort_meas_df: convert cols
# hcahps_df: convert cols
# gen_info_df: Provider ID to non-null object

In [None]:
# convert only the columns listed in `cols` to numeric; looping over `data`
# itself would visit every column and convert all of them
def num_convert(cols, data, x):
    for col in cols:
        data[col] = pd.to_numeric(data[col], errors='coerce', downcast=x)

In [None]:
cols = ['Compared to National', 'Denominator', 'Score', 'Lower Estimate']
data = unplan_readm_df
x = 'float'
num_convert(cols, data, x)

In [None]:
cols = ['Provider ID']
data = unplan_readm_df
x = 'integer'
num_convert(cols, data, x)

In [None]:
unplan_readm_df.info()

In [None]:
unplan_readm_df['Provider ID'].isnull().sum()

In [None]:
unplan_readm_df['Provider ID'] = unplan_readm_df['Provider ID'].astype(np.int64)

In [None]:
#every column ends up as float when the conversion loops over `data` (all columns) instead of only the names in `cols`
unplan_readm_df.info()

In [None]:
#convert to numeric
cols = ['Compared to National', 'Denominator', 'Score', 'Lower Estimate']
unplan_readm_df[cols] = unplan_readm_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [None]:
cols = ['Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions']
readm_red_df[cols] = readm_red_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [None]:
readm_red_df['Provider ID'] = readm_red_df['Provider ID'].astype(str)

In [None]:
readm_red_df.info()

In [None]:
cols = ['Denominator', 'Score', 'Lower Estimate', 'Higher Estimate']
mort_meas_df[cols] = mort_meas_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [None]:
cols = ['Patient Survey Star Rating', 'HCAHPS Answer Percent', 'HCAHPS Linear Mean Value', 'Number of Completed Surveys', 'Survey Response Rate Percent']
hcahps_df[cols] = hcahps_df[cols].apply(pd.to_numeric, errors = 'coerce', downcast = 'float')

In [None]:
gen_info_df['Provider ID'] = gen_info_df['Provider ID'].astype(str)
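
To illustrate what `errors='coerce'` does in the cells above, here is a minimal sketch on a made-up frame. `demo_df` and its values are hypothetical; only the "Not Applicable" placeholder is taken from the HCAHPS preview shown earlier.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings, plus a text placeholder.
demo_df = pd.DataFrame({
    "Provider ID": ["10001", "10005"],
    "Score": ["15.2", "Not Applicable"],
    "Denominator": ["526", "301"],
})

numeric_cols = ["Score", "Denominator"]
# Convert only the listed columns; strings that cannot be parsed become NaN.
demo_df[numeric_cols] = demo_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo_df.dtypes)
print(demo_df)
```

Because the placeholder text is silently turned into NaN, it can be worth comparing `isnull().sum()` before and after the conversion to see how many values were coerced.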

### merge dataframes

In [None]:
hosp_readmin_df = pd.merge(unplan_readm_df, readm_red_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.head(3)

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, mort_meas_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, hcahps_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df = hosp_readmin_df.drop(['Compared to National_x'], axis=1)

In [None]:
hosp_readmin_df = pd.merge(hosp_readmin_df, gen_info_df, on=['Provider ID', 'Hospital Name', 'State'], how='outer')
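
The conversion cells above cast 'Provider ID' to `int64` in one dataframe and to `str` in others. pandas typically refuses to merge an integer key with a string key, so the key dtype should match on both sides. The sketch below shows one way to align the key and check the result with `indicator=True`; the frames and values are hypothetical.

```python
import pandas as pd

# Hypothetical frames: the same key stored as int in one and str in the other.
left_df = pd.DataFrame({"Provider ID": [10001, 10005], "Score": [15.2, 14.8]})
right_df = pd.DataFrame({
    "Provider ID": ["10001", "10007"],
    "Hospital Type": ["Acute Care", "Critical Access"],
})

# Align the key dtype before merging.
left_df["Provider ID"] = left_df["Provider ID"].astype(str)

# indicator=True adds a `_merge` column showing where each row came from.
merged_df = pd.merge(left_df, right_df, on="Provider ID", how="outer", indicator=True)
print(merged_df[["Provider ID", "_merge"]])
```

With `how='outer'`, rows found on only one side are kept and padded with NaN, which is one source of the missing values examined in the next section.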

### missing values

In [None]:
# hcahps_df: will need to fill NaN with dummy values; do not drop rows

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df.isnull().sum()

In [None]:
hosp_readmin_df.head()

In [None]:
hosp_readmin_df = hosp_readmin_df.drop(['Measure Start Date_x', 'Measure Start Date_y', 'Measure End Date_x', 'Measure End Date_y',
'Start Date', 'End Date', 'Measure Start Date', 'Measure End Date'], axis=1)

In [None]:
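#cut: 'Measure Start Date_x' is already dropped in the cell above; errors='ignore' avoids a KeyError
hosp_readmin_df = hosp_readmin_df.drop(['Measure Start Date_x'], axis=1, errors='ignore')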

In [None]:
hosp_readmin_df.head()

In [None]:
print(hosp_readmin_df[['Measure Name_y', 'Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions', 'Measure Name', 
'Measure ID_y', 'Compared to National_y',  'Denominator_y',  'Score_y',  'Lower Estimate_y',  'Higher Estimate_y', 'HCAHPS Measure ID', 
'HCAHPS Question', 'HCAHPS Answer Description', 'Patient Survey Star Rating', 'HCAHPS Answer Percent', 'HCAHPS Linear Mean Value', 
'Number of Completed Surveys', 'Survey Response Rate Percent', 'Hospital Type', 'Hospital Ownership']].head(10))

In [None]:
hosp_readmin_df.info()

In [None]:
hosp_readmin_df.isnull()

In [None]:
hosp_readmin_df.isnull().values.any(axis=1)

### creating additional features
Excess Readmission Ratio: the hospital's "predicted" number of readmissions divided by the CMS "expected" number of readmissions.

Calculate the **Actual Readmission Rate** = 'actual_rrate' (number of readmissions / number of discharges, expressed as a percentage)

Calculate the **Actual Readmission Ratio** = 'actual_rratio' ('actual_rrate' / 'Expected Readmission Rate') in order to compare it with the Excess Readmission Ratio

In [None]:
df_readmin['actual_rrate'] = df_readmin['Number of Readmissions']/df_readmin['Number of Discharges'] * 100

In [None]:
df_readmin['actual_rratio'] = df_readmin['actual_rrate']/df_readmin['Expected Readmission Rate']
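
A small worked example of the two features, using a made-up frame. `demo_readmin` and its numbers are hypothetical, and the sketch assumes 'Expected Readmission Rate' is expressed in percent, which is why the actual rate is multiplied by 100.

```python
import pandas as pd

# Hypothetical values for two hospitals; rates are in percent.
demo_readmin = pd.DataFrame({
    "Number of Discharges": [200.0, 400.0],
    "Number of Readmissions": [40.0, 60.0],
    "Expected Readmission Rate": [16.0, 20.0],
})

# Actual readmission rate as a percentage of discharges.
demo_readmin["actual_rrate"] = (
    demo_readmin["Number of Readmissions"] / demo_readmin["Number of Discharges"] * 100
)
# Ratio of the actual rate to the expected rate; above 1 means more
# readmissions than expected.
demo_readmin["actual_rratio"] = (
    demo_readmin["actual_rrate"] / demo_readmin["Expected Readmission Rate"]
)
print(demo_readmin[["actual_rrate", "actual_rratio"]])
```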