**Cleaning of Readmissions Data**
***
The goal of this cleaning is to reduce the data from ratings by facility to ratings by state.<br>
The rating for each state is the average rating across all facilities within that state.<br>
The number of facilities in each state is counted and added as a column to the final dataset.<br>
Each facility can have up to six excessive readmission ratings based on individual measures.
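
As a rough illustration of the target shape, the sketch below reduces facility-level rows to one row per state. The toy values and the names `toy` and `by_state` are made up for illustration and are not taken from `Readmissions.csv`.

```python
import pandas as pd

# Hypothetical toy data: three facilities (provider numbers 1, 2, 3) in two states.
# Facility 1 has two measures, so it appears twice.
toy = pd.DataFrame({
    'provider_number': [1, 1, 2, 3],
    'state': ['AK', 'AK', 'AK', 'AL'],
    'readmission_ratio': [0.9, 1.1, 1.0, 1.2],
})

# One row per state: number of distinct facilities and the mean ratio.
by_state = toy.groupby('state').agg(
    hospital_count=('provider_number', 'nunique'),
    readmission_ratio=('readmission_ratio', 'mean'),
).reset_index()

print(by_state)
```

Counting with `nunique` rather than `count` keeps a facility with several measures from being counted more than once.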

In [12]:
# Importing Necessary Tools
import pandas as pd
import numpy as np

# Pull File Into DataFrame and Set Column Names
col = ['hospital_name', 'provider_number', 'state', 'measure', 'discharges','footnote',
           'readmission_ratio','predicted_rate','expected_rate','readmissions','starte_date','end_Date']
df = pd.read_csv('Readmissions.csv')
df.columns=col

**Initial removal of the following columns:**<br>
-  Measure: Observations are grouped at the state level to get the overall state readmission ratio, which makes this column unnecessary for this analysis.
-  Footnote: Footnotes are associated with a lack of information, most of which is removed in the cleaning process.
-  Start_Date: Does not provide any useful information for this analysis; it is the same for all rows.
-  End_Date: Does not provide any useful information for this analysis; it is the same for all rows.
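
One way the removal could be written explicitly is with `DataFrame.drop`. This is only a sketch: the notebook itself reaches the same result by selecting the columns it keeps, and the one-row frame `example` below uses made-up values with the notebook's column names (including their original spellings).

```python
import pandas as pd

# Hypothetical single row with the notebook's column names; all values are made up.
example = pd.DataFrame([{
    'hospital_name': 'Example Hospital', 'provider_number': 10001, 'state': 'AL',
    'measure': 'measure_a', 'discharges': '250', 'footnote': None,
    'readmission_ratio': '1.02', 'predicted_rate': '15.1', 'expected_rate': '14.8',
    'readmissions': '40', 'starte_date': 'start', 'end_Date': 'end',
}])

# Columns listed above as unnecessary for the state-level analysis.
drop_cols = ['measure', 'footnote', 'starte_date', 'end_Date']
trimmed = example.drop(columns=drop_cols)

print(trimmed.columns.tolist())
```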

In [13]:
# Explore Data Pre-Clean
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19830 entries, 0 to 19829
Data columns (total 12 columns):
hospital_name        19830 non-null object
provider_number      19830 non-null int64
state                19830 non-null object
measure              19830 non-null object
discharges           19830 non-null object
footnote             5435 non-null float64
readmission_ratio    19830 non-null object
predicted_rate       19830 non-null object
expected_rate        19830 non-null object
readmissions         19830 non-null object
starte_date          19830 non-null object
end_Date             19830 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 1.8+ MB


In [14]:
# Get Hospital Count By State From Unique Provider Numbers
hospital_count = df.groupby('state').provider_number.nunique()



In [15]:
# Coerce Discharges, Readmission Ratios, Predicted Rates, Expected Rates, and Readmissions to get NaNs
tonumeric=['discharges','readmission_ratio','predicted_rate','expected_rate','readmissions']
dfa = df[tonumeric].apply(pd.to_numeric, errors='coerce')
# Setting Up Additional Columns to Concatenate
dfb = df[['hospital_name','provider_number','state']]


In [16]:
# Concatenating Data Back Together and Confirming DataFrame Integrity
df2= pd.concat([dfb,dfa], axis=1)
df2.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19830 entries, 0 to 19829
Data columns (total 8 columns):
hospital_name        19830 non-null object
provider_number      19830 non-null int64
state                19830 non-null object
discharges           11758 non-null float64
readmission_ratio    14411 non-null float64
predicted_rate       14411 non-null float64
expected_rate        14411 non-null float64
readmissions         11638 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.2+ MB
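
The lower non-null counts after conversion (for example, `discharges` goes from 19830 to 11758) come from `errors='coerce'`, which turns any value that cannot be parsed as a number into `NaN`. A minimal sketch of that behavior follows; the placeholder string `'Not Available'` is an assumption for illustration and may not match the text actually used in the file.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with a non-numeric placeholder.
raw = pd.Series(['25', 'Not Available', '1.05'])

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(raw, errors='coerce')
n_missing = int(coerced.isna().sum())

print(coerced)
print(n_missing)
```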


In [17]:
# Build Dictionary of States and Assign them to 0
dictionary = {state: 0 for state in df2['state'].unique()}
# Count the Number of Excessive Readmissions (Ratio > 1) Per State
# NaN ratios compare as False, so rows without a ratio are not counted
for state, ratio in zip(df2['state'], df2['readmission_ratio']):
    if ratio > 1:
        dictionary[state] += 1
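
The per-state count of ratios above 1 can also be computed without an explicit loop. The sketch below is an alternative under the same rule (a row counts as excessive when its ratio is greater than 1), shown on made-up toy data rather than the notebook's `df2`.

```python
import pandas as pd

# Hypothetical toy data; the NaN row stands in for a coerced missing ratio.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL', 'AZ'],
    'readmission_ratio': [1.2, 0.8, 1.1, float('nan'), 0.9],
})

# Boolean mask (ratio > 1), summed within each state; NaN compares as False.
excessive_count = (
    toy['readmission_ratio'].gt(1)
    .groupby(toy['state'])
    .sum()
    .rename('excessive_count')
    .reset_index()
)

print(excessive_count)
```

States with no excessive rows still appear with a count of 0, which matches the behavior of initializing every state to 0 in a dictionary.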


In [18]:
# Convert Dictionary to DataFrame for Merger
data = pd.DataFrame(list(dictionary.items()))
data.columns=['state','excessive_count']


In [19]:
# Initializing and Creating New Dataframe To Group By State
cleaned= pd.DataFrame(hospital_count)
cleaned.columns=['hospital_count']
cleaned['readmission_ratio'] = df2.groupby('state').readmission_ratio.mean()
cleaned['discharges']= df2.groupby('state').discharges.sum()
cleaned['predicted_rate']= df2.groupby('state').predicted_rate.sum()
cleaned['expected_rate'] = df2.groupby('state').expected_rate.sum()
cleaned['readmissions'] = df2.groupby('state').readmissions.sum()

In [20]:
# Reset Index to Get State Column 
cleaned = cleaned.reset_index()


In [21]:
# Merge Excessive Readmission Count with Cleaned DataFrame
final = pd.merge(cleaned,data, on='state')
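
Because `pd.merge` defaults to an inner join, a state missing from either frame would be dropped silently. A hedged sketch of a sanity check, using two rows copied from the final output and the illustrative names `cleaned_toy` and `data_toy`:

```python
import pandas as pd

# Small stand-ins for `cleaned` and `data`, using the AK and AL values from the final table.
cleaned_toy = pd.DataFrame({'state': ['AK', 'AL'], 'hospital_count': [8, 85]})
data_toy = pd.DataFrame({'state': ['AK', 'AL'], 'excessive_count': [11, 188]})

# Outer join with an indicator column, and a check that each state appears once per frame.
final_toy = pd.merge(
    cleaned_toy, data_toy, on='state',
    how='outer', validate='one_to_one', indicator=True,
)

# Rows that did not match on both sides; expected to be empty here.
unmatched = final_toy[final_toy['_merge'] != 'both']
print(len(unmatched))
```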

In [23]:
# Save and Print Final DataFrame Heading
final.to_csv('Readmissions_Cleaned.csv')
final.head()

Unnamed: 0,state,hospital_count,readmission_ratio,discharges,predicted_rate,expected_rate,readmissions,excessive_count
0,AK,8,0.969563,5019.0,530.2,548.7,606.0,11
1,AL,85,1.017475,95303.0,5351.5,5308.2,15305.0,188
2,AR,45,1.032275,61703.0,2973.3,2879.7,9965.0,127
3,AZ,63,0.988116,76353.0,3930.2,3990.3,10290.0,104
4,CA,297,1.000689,303151.0,19823.2,19733.7,49252.0,580
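
The `Unnamed: 0` header in the table above is what pandas typically produces when a CSV written with its index is read back. If that extra column is not wanted in `Readmissions_Cleaned.csv`, passing `index=False` to `to_csv` is one option. The sketch below shows the difference on a toy frame with in-memory buffers, and is not a change to the notebook's saved file.

```python
import io
import pandas as pd

# Toy frame with the AK and AL hospital counts from the final table.
toy = pd.DataFrame({'state': ['AK', 'AL'], 'hospital_count': [8, 85]})

# Written with the index (the default): the index comes back as an unnamed column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with = pd.read_csv(buf_with_index)

# Written without the index: only the data columns come back.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
reloaded_without = pd.read_csv(buf_without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```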
