Find a dataset with at least four continuous variables and one categorical variable. Create one master plot that gives insight into the variables and their interrelationships, including:
    + Probability distributions
    + Bivariate relationships
    + Whether the distributions or the relationships vary across groups

Accompany your plot with a written description of what you see.

data source: 
http://www.usap-dc.org/view/dataset/609273

Data Source:
https://data.medicare.gov/Hospital-Compare/Hospital-Readmissions-Reduction-Program/9n3s-kdb3

Data Description:
In October 2012, CMS began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio, calculated by dividing a hospital’s number of “predicted” 30-day readmissions for heart attack (AMI), heart failure (HF), pneumonia, chronic obstructive pulmonary disease (COPD), hip/knee replacement (THA/TKA), and coronary artery bypass graft surgery (CABG) by the number that would be “expected,” based on an average hospital with similar patients.
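
As a quick check of this definition, the ratio can be recomputed by hand. The sketch below uses three rows copied from the `head()` output of this notebook and assumes the published ratio equals the predicted rate divided by the expected rate; `excess_ratio` is a hypothetical helper name, not part of the dataset or any library.

```python
# (measure, predicted rate, expected rate, published ratio) for BYRD REGIONAL HOSPITAL
rows = [
    ("READM_30_COPD_HRRP", 20.9722, 20.5712, 1.0195),
    ("READM_30_HF_HRRP", 23.9788, 22.2578, 1.0773),
    ("READM_30_PN_HRRP", 19.2445, 17.4459, 1.1031),
]


def excess_ratio(predicted, expected):
    """Hypothetical helper: predicted readmissions divided by expected readmissions."""
    return predicted / expected


for measure, predicted, expected, published in rows:
    ratio = excess_ratio(predicted, expected)
    # A ratio above 1 means more readmissions than expected for a comparable hospital
    print(f"{measure}: computed {ratio:.4f}, published {published}")
```

For these rows the computed and published values agree to four decimal places.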

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\Hospital_Readmissions_Reduction_Program.csv')

###### Remove missing values by converting 'Not Available' values to NaN, then dropping NaN values.

In [3]:
df_with_nans = df.applymap(lambda elem: float('NaN') if elem == "Not Available" else elem)
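
A possible alternative to the element-wise lambda is to replace the sentinel string in one call, or to declare it when reading the file. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for the real file; only the `'Not Available'` marker comes from this notebook.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the real CSV (values are illustrative, not from the dataset)
csv_text = """Hospital Name,Number of Discharges,Excess Readmission Ratio
HOSPITAL A,217,1.0195
HOSPITAL B,Not Available,Not Available
"""

# Option 1: replace the sentinel after loading; the columns still hold text
raw = pd.read_csv(io.StringIO(csv_text))
replaced = raw.replace("Not Available", np.nan)

# Option 2: declare the sentinel at load time, so numeric columns parse as floats
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["Not Available"])

print(replaced.dtypes)
print(parsed.dtypes)
```

With the second option the later conversion step would likely be unnecessary, since the numeric columns are parsed as floats from the start.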

In [4]:
df_with_nans.head(10)

Unnamed: 0,Hospital Name,Provider ID,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_AMI_HRRP,,1 - The number of cases/patients is too few to...,,,,,7/1/2014,6/30/2017
1,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_CABG_HRRP,,1 - The number of cases/patients is too few to...,,,,,7/1/2014,6/30/2017
2,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_COPD_HRRP,217.0,,1.0195,20.9722,20.5712,47.0,7/1/2014,6/30/2017
3,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_HF_HRRP,259.0,,1.0773,23.9788,22.2578,67.0,7/1/2014,6/30/2017
4,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_HIP_KNEE_HRRP,,1 - The number of cases/patients is too few to...,,,,,7/1/2014,6/30/2017
5,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_PN_HRRP,213.0,,1.1031,19.2445,17.4459,47.0,7/1/2014,6/30/2017
6,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_AMI_HRRP,,1 - The number of cases/patients is too few to...,,,,,7/1/2014,6/30/2017
7,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_CABG_HRRP,,5 - Results are not available for this reporti...,,,,,7/1/2014,6/30/2017
8,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_COPD_HRRP,,5 - Results are not available for this reporti...,1.0024,18.0061,17.963,,7/1/2014,6/30/2017
9,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_HF_HRRP,75.0,,0.9726,19.6816,20.2355,13.0,7/1/2014,6/30/2017


#### Instead of changing the df, take a copy (rather than a slice) to avoid errors and warnings

In [5]:
df_readmin = df_with_nans.dropna(subset=df_with_nans.columns.drop('Footnote')).copy() 
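
To illustrate the idea, here is a minimal sketch on a made-up frame: rows are dropped when any column other than the footnote is missing, and the explicit `.copy()` keeps later column assignments away from the original. The column names `ratio`, `footnote`, and `ratio_pct` are hypothetical.

```python
import pandas as pd

# Illustrative frame; values and column names are made up
df = pd.DataFrame({
    "ratio": [1.02, None, 0.97],
    "footnote": [None, "1 - too few cases", None],
})

# Require every column except the footnote to be present
required = df.columns.drop("footnote")
clean = df.dropna(subset=required).copy()

# New columns land on the copy only; the original frame is untouched
clean["ratio_pct"] = clean["ratio"] * 100

print(list(clean.index))
print(list(df.columns))
```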

In [6]:
df_readmin.head(5)

Unnamed: 0,Hospital Name,Provider ID,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
2,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_COPD_HRRP,217,,1.0195,20.9722,20.5712,47,7/1/2014,6/30/2017
3,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_HF_HRRP,259,,1.0773,23.9788,22.2578,67,7/1/2014,6/30/2017
5,BYRD REGIONAL HOSPITAL,190164,LA,READM_30_PN_HRRP,213,,1.1031,19.2445,17.4459,47,7/1/2014,6/30/2017
9,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_HF_HRRP,75,,0.9726,19.6816,20.2355,13,7/1/2014,6/30/2017
11,GRAND ITASCA CLINIC AND HOSPITAL,240064,MN,READM_30_PN_HRRP,153,,0.9719,14.1502,14.5594,20,7/1/2014,6/30/2017


In [7]:
df_readmin.groupby('Measure Name')['Excess Readmission Ratio'].describe()

Measure Name,count,unique,top,freq
READM_30_AMI_HRRP,1666,1195,1.0498,6
READM_30_CABG_HRRP,596,550,1.0562,3
READM_30_COPD_HRRP,2551,1564,0.9765,7
READM_30_HF_HRRP,2670,1760,0.9943,6
READM_30_HIP_KNEE_HRRP,1301,1141,1.1385,4
READM_30_PN_HRRP,2748,1806,1.0213,6


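Why doesn't this provide mean, std, etc.? Because the numeric columns were read as strings (dtype `object`), `describe()` falls back to count, unique, top, and freq.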
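
The difference is easy to reproduce on a small series. In this sketch the same illustrative values are summarized once as text and once as numbers:

```python
import pandas as pd

# The same values stored as text and as numbers (illustrative values)
as_text = pd.Series(["1.0195", "1.0773", "1.0195"], dtype=object)
as_number = pd.to_numeric(as_text)

# Text columns get a categorical-style summary
print(as_text.describe().index.tolist())

# Numeric columns get mean, std, and quantiles
print(as_number.describe().index.tolist())
```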

In [8]:
df_readmin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11532 entries, 2 to 19673
Data columns (total 12 columns):
Hospital Name                 11532 non-null object
Provider ID                   11532 non-null int64
State                         11532 non-null object
Measure Name                  11532 non-null object
Number of Discharges          11532 non-null object
Footnote                      15 non-null object
Excess Readmission Ratio      11532 non-null object
Predicted Readmission Rate    11532 non-null object
Expected Readmission Rate     11532 non-null object
Number of Readmissions        11532 non-null object
Start Date                    11532 non-null object
End Date                      11532 non-null object
dtypes: int64(1), object(11)
memory usage: 1.1+ MB


###### Convert selected columns to numeric

In [9]:
for column in ['Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions']:
    df_readmin[column] = pd.to_numeric(df_readmin[column], errors = 'coerce', downcast = 'float')
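
A small sketch of what the two keyword arguments do, using made-up values: `errors='coerce'` turns unparseable entries into NaN instead of raising, and `downcast='float'` shrinks the result to the smallest float dtype that fits.

```python
import pandas as pd

# Illustrative values, including one entry that cannot be parsed
raw = pd.Series(["217", "Not Available", "1.0195"], dtype=object)

# Unparseable entries become NaN rather than raising an error
coerced = pd.to_numeric(raw, errors="coerce")

# Same conversion, additionally downcast to a smaller float dtype
small = pd.to_numeric(raw, errors="coerce", downcast="float")

print(coerced.dtype, small.dtype)
print(int(coerced.isna().sum()))
```

The smaller dtype saves memory but keeps only about seven significant digits, which is usually enough for rates and ratios reported to four decimals.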

In [10]:
df_readmin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11532 entries, 2 to 19673
Data columns (total 12 columns):
Hospital Name                 11532 non-null object
Provider ID                   11532 non-null int64
State                         11532 non-null object
Measure Name                  11532 non-null object
Number of Discharges          11532 non-null float32
Footnote                      15 non-null object
Excess Readmission Ratio      11532 non-null float32
Predicted Readmission Rate    11532 non-null float32
Expected Readmission Rate     11532 non-null float32
Number of Readmissions        11532 non-null float32
Start Date                    11532 non-null object
End Date                      11532 non-null object
dtypes: float32(5), int64(1), object(6)
memory usage: 946.0+ KB


In [11]:
df_readmin.groupby('Measure Name')['Excess Readmission Ratio'].describe()

Measure Name,count,mean,std,min,25%,50%,75%,max
READM_30_AMI_HRRP,1666.0,1.005694,0.068506,0.7479,0.96205,1.0062,1.049875,1.2927
READM_30_CABG_HRRP,596.0,1.022244,0.118904,0.7428,0.940425,1.02185,1.10055,1.7072
READM_30_COPD_HRRP,2551.0,1.004937,0.062906,0.8126,0.96435,1.0022,1.04175,1.3222
READM_30_HF_HRRP,2670.0,1.003497,0.076744,0.7467,0.953625,1.00095,1.052,1.3394
READM_30_HIP_KNEE_HRRP,1301.0,1.022134,0.15325,0.5982,0.9115,1.0108,1.1169,1.8256
READM_30_PN_HRRP,2748.0,1.004802,0.081388,0.761,0.9484,0.9988,1.055625,1.3801


##### Find a dataset you'd like to explore. This can be something you're familiar with or something new. Create a Jupyter notebook and then:

    Choose one variable and plot that variable four different ways.
    Choose two continuous variables, and plot them three different ways.
    Choose one continuous variable and one categorical variable, and plot them six different ways.
    Give the pros and cons of each plot you create. You can use variables from multiple datasets if you like.


In [12]:
df_readmin['Measure Name'].unique()

array(['READM_30_COPD_HRRP', 'READM_30_HF_HRRP', 'READM_30_PN_HRRP',
       'READM_30_AMI_HRRP', 'READM_30_HIP_KNEE_HRRP',
       'READM_30_CABG_HRRP'], dtype=object)

In [15]:
sns.set()
# sns.load_dataset() only fetches seaborn's online example datasets by name,
# so it cannot load a local DataFrame; use df_readmin directly instead.
readmission = df_readmin

In [14]:

# The frame has no 'size' column; mapping marker size to 'Number of Discharges'
# is one possible choice, not a requirement.
sns.relplot(x='Measure Name', y='Excess Readmission Ratio', size='Number of Discharges',
            data=df_readmin);

In [None]:
# Creating variables for each of the 'Measure Name' to graph
copd = df_readmin.loc[(df_readmin['Measure Name']== 'READM_30_COPD_HRRP')] 
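
Rather than writing one filter per measure, the groups could be collected in a dictionary with `groupby`. This is a sketch on a made-up frame that reuses the notebook's column names; `by_measure` is a hypothetical variable name.

```python
import pandas as pd

# Illustrative rows with the notebook's column names (values are made up)
demo = pd.DataFrame({
    "Measure Name": ["READM_30_COPD_HRRP", "READM_30_HF_HRRP", "READM_30_COPD_HRRP"],
    "Number of Readmissions": [47.0, 67.0, 20.0],
})

# One sub-frame per measure, keyed by the measure name
by_measure = {name: group for name, group in demo.groupby("Measure Name")}

copd_demo = by_measure["READM_30_COPD_HRRP"]
print(copd_demo["Number of Readmissions"].tolist())
```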


In [None]:
# plt.hist takes the values to bin as its first argument and has no 'y' parameter
plt.hist(copd['Number of Readmissions'], color='red', alpha=.5, label='COPD Readmissions')
plt.xlabel('Number of Readmissions')
plt.legend(loc='upper right')
plt.title('Plot 1: COPD Histogram')
plt.show()

# Note: df_mortality is not defined anywhere in this notebook; it must be loaded first.
df_copd = df_mortality[df_mortality['Measure ID'].isin(['MORT_30_COPD'])]