# Data Provenance

This dataset describes Veterans Health Administration (VHA) hospitals. For each hospital it includes the name, address, city, state, ZIP code, county name, phone number, and condition, along with measures of the timeliness and effectiveness of care, such as wait times and services provided. I selected this data because it is relevant to my term project. Although it might not let me fulfill my initial vision to its fullest extent, I can still analyze which hospitals would be optimal for different patients based on the hospitals' scores and the convenience of their location.

In [1]:
# import dependencies
from dash import Dash, dcc, html, Input, Output, callback
import pandas as pd
import plotly.express as px

In [2]:
df = pd.read_csv("data.csv")
df.head()
                 

Unnamed: 0,index,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Condition,Measure ID,Measure Name,Score,Sample,Footnote,Measure Start Date,Measure End Date,Location
0,0,39012F,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PITTSBURGH,PA,15240,ALLEGHENY,4126886100,Emergency Department (ED) Measures,ED-1b,Median time from ED arrival to ED departure fo...,286,256,2 - Data submitted were based on a sample of c...,4/1/2016,3/31/2017,"UNIVERSITY DRIVE\nPITTSBURGH, PA\n"
1,1,39012F,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PITTSBURGH,PA,15240,ALLEGHENY,4126886100,Emergency Department (ED) Measures,ED-2b,Admit Decision Time to ED Departure Time for A...,118,251,2 - Data submitted were based on a sample of c...,4/1/2016,3/31/2017,"UNIVERSITY DRIVE\nPITTSBURGH, PA\n"
2,2,39012F,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PITTSBURGH,PA,15240,ALLEGHENY,4126886100,Preventative Care Measures,IMM-2,Patients assessed and given influenza vaccination,71,596,2 - Data submitted were based on a sample of c...,10/1/2016,3/31/2017,"UNIVERSITY DRIVE\nPITTSBURGH, PA\n"
3,3,39012F,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PITTSBURGH,PA,15240,ALLEGHENY,4126886100,Emergency Department (ED) Measures,OP-18b,Median time from ED arrival to ED departure fo...,Not Available,Not Available,7 - No cases met the criteria for this measure.,4/1/2016,3/31/2017,"UNIVERSITY DRIVE\nPITTSBURGH, PA\n"
4,4,39012F,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PITTSBURGH,PA,15240,ALLEGHENY,4126886100,Emergency Department (ED) Measures,OP-20,Door to diagnostic eval,Not Available,Not Available,7 - No cases met the criteria for this measure.,4/1/2016,3/31/2017,"UNIVERSITY DRIVE\nPITTSBURGH, PA\n"


In [3]:
dropped_cols = ["Sample", "Location", "City", "Provider ID", "Measure ID", "Phone Number", "Footnote", "Measure Start Date", "Measure End Date"]
df = df.drop(dropped_cols, axis=1)
df.head()

Unnamed: 0,index,Hospital Name,Address,State,ZIP Code,County Name,Condition,Measure Name,Score
0,0,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Median time from ED arrival to ED departure fo...,286
1,1,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Admit Decision Time to ED Departure Time for A...,118
2,2,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Preventative Care Measures,Patients assessed and given influenza vaccination,71
3,3,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Median time from ED arrival to ED departure fo...,Not Available
4,4,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Door to diagnostic eval,Not Available


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1651 entries, 0 to 1650
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          1651 non-null   int64 
 1   Hospital Name  1651 non-null   object
 2   Address        1651 non-null   object
 3   State          1651 non-null   object
 4   ZIP Code       1651 non-null   int64 
 5   County Name    1651 non-null   object
 6   Condition      1651 non-null   object
 7   Measure Name   1651 non-null   object
 8   Score          1651 non-null   object
dtypes: int64(2), object(7)
memory usage: 116.2+ KB


In [5]:
df.isna().sum()

index            0
Hospital Name    0
Address          0
State            0
ZIP Code         0
County Name      0
Condition        0
Measure Name     0
Score            0
dtype: int64

There are 0 NaN values in the dataset, but the "Score" column clearly contains the string "Not Available", so we can convert these entries to missing values and count how many there are.

In [6]:
df["Score"] = df["Score"].replace("Not Available", pd.NA)
df.head()

Unnamed: 0,index,Hospital Name,Address,State,ZIP Code,County Name,Condition,Measure Name,Score
0,0,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Median time from ED arrival to ED departure fo...,286.0
1,1,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Admit Decision Time to ED Departure Time for A...,118.0
2,2,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Preventative Care Measures,Patients assessed and given influenza vaccination,71.0
3,3,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Median time from ED arrival to ED departure fo...,
4,4,VA PITTSBURGH HEALTHCARE SYSTEM - UNIV DR,UNIVERSITY DRIVE,PA,15240,ALLEGHENY,Emergency Department (ED) Measures,Door to diagnostic eval,


In [7]:
df.isna().sum()

index               0
Hospital Name       0
Address             0
State               0
ZIP Code            0
County Name         0
Condition           0
Measure Name        0
Score            1277
dtype: int64

There are 1277 missing values in the "Score" column. We can drop these rows because they would be unhelpful for our purposes.

In [8]:
df = df.dropna()
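As an alternative to the replace-then-drop steps above, the sentinel string can be declared as missing when the file is read. The sketch below uses a small hypothetical extract in place of `data.csv` (the hospital and measure names are made up) and relies on the `na_values` parameter of `pd.read_csv` and the `subset` parameter of `dropna`.

```python
import io

import pandas as pd

# Hypothetical three-row extract standing in for data.csv.
raw = io.StringIO(
    "Hospital Name,Measure Name,Score\n"
    "HOSPITAL A,Measure X,286\n"
    "HOSPITAL A,Measure Y,Not Available\n"
    "HOSPITAL B,Measure X,118\n"
)

# na_values tells the parser to treat the sentinel string as missing,
# so Score is read as a numeric column instead of object.
sample = pd.read_csv(raw, na_values=["Not Available"])

# Count the missing scores, then drop only the rows where Score is missing.
n_missing = int(sample["Score"].isna().sum())
clean = sample.dropna(subset=["Score"]).copy()
clean["Score"] = clean["Score"].astype(int)
```

Restricting `dropna` to the Score column keeps rows that might be missing some other, less important field.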


We should convert the Score column to an integer type and the ZIP Code column to an object type. Score is a numeric variable that represents how the hospital performed on a measure, so it should be stored as a number. The ZIP code is an identifier rather than a quantity for our purposes, so it can be converted to an object.

In [9]:
df["Score"] = df["Score"].astype(int)    
df["ZIP Code"] = df["ZIP Code"].astype(object)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 374 entries, 0 to 1647
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          374 non-null    int64 
 1   Hospital Name  374 non-null    object
 2   Address        374 non-null    object
 3   State          374 non-null    object
 4   ZIP Code       374 non-null    object
 5   County Name    374 non-null    object
 6   Condition      374 non-null    object
 7   Measure Name   374 non-null    object
 8   Score          374 non-null    int32 
dtypes: int32(1), int64(1), object(7)
memory usage: 27.8+ KB
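One caveat about the ZIP Code conversion: `astype(object)` keeps the values as integers inside an object column. If the ZIP codes are meant to be used as text labels, a string conversion with zero-padding may be safer, because ZIP codes that begin with 0 lose the leading zero when they are read as integers. The following is a minimal sketch on hypothetical values. I have not checked whether this dataset contains such ZIP codes.

```python
import pandas as pd

# Hypothetical ZIP codes as pandas would read them from a CSV (integers).
zips = pd.DataFrame({"ZIP Code": [15240, 2130, 6516]})

# Convert to text and left-pad to five characters so that 2130 becomes "02130".
zips["ZIP Code"] = zips["ZIP Code"].astype(str).str.zfill(5)

print(zips["ZIP Code"].tolist())  # ['15240', '02130', '06516']
```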


In [10]:
df.describe()

Unnamed: 0,index,Score
count,374.0,374.0
mean,860.751337,171.76738
std,505.526784,117.731948
min,0.0,0.0
25%,419.5,84.0
50%,887.5,128.0
75%,1332.75,256.0
max,1647.0,685.0


In [11]:
# Count distinct values; swap in a commented line to check another column.
df["Measure Name"].nunique()
# df["Condition"].nunique()
# df["County Name"].nunique()

9

There were originally 1651 observations in this dataset. However, the Score column, which is the most important column in this case, had 1277 rows marked "Not Available", which I treated as missing. I converted these to missing values and dropped the rows, leaving 374 observations. There are 9 unique measure names, 4 unique conditions, and 117 unique county names. Other than the missing values in the Score column, there was no missing data. The main numeric variable is Score. The summary above shows a standard deviation of about 118 and a maximum of 685 against a 75th percentile of 256, which points to outliers on the upper end of the distribution. I am keeping these outliers because they show the stark difference in hospital scores on the higher end. Had the outliers been on the lower end, I would have removed them, because they would not help with my goal of choosing high-scoring hospitals nearby.
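The summary above pools all nine measures, and the measure names suggest they are not on the same scale (median times versus a vaccination rate). A per-measure view may therefore describe the outliers more fairly. The sketch below is one way to do that on hypothetical data. The measure labels, the scores, and the 1.5 × IQR rule are all assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Hypothetical scores for two measures with different scales.
toy = pd.DataFrame({
    "Measure Name": ["ED time"] * 5 + ["Flu vaccination"] * 5,
    "Score": [100, 120, 130, 150, 600, 70, 72, 75, 78, 80],
})

# Summary statistics computed separately for each measure.
per_measure = toy.groupby("Measure Name")["Score"].describe()


def flag_outliers(scores: pd.Series) -> pd.Series:
    """Return True where a score lies more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    return (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)


# Apply the rule within each measure so the two scales are not mixed.
toy["Outlier"] = toy.groupby("Measure Name")["Score"].transform(flag_outliers)
```

In this toy data only the score of 600 is flagged, and only relative to the other "ED time" scores.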

# Data Dictionary
- Hospital Name (String): Name of the VHA hospital. 
- Address (String): Street address of the VHA hospital. 
- State (String): State where the VHA hospital is located. 
- ZIP Code (String): ZIP code of the VHA hospital. 
- County Name (String): County where the VHA hospital is located. 
- Condition (String): Category of care that the measure belongs to (for example, Emergency Department (ED) Measures).
- Measure Name (String): Name of the measure on which the hospital is scored.
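- Score (Integer): Value the VHA hospital recorded on the measure. The direction depends on the measure: higher is better for rates such as influenza vaccination, while for the time-based ED measures a lower value indicates faster care.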

# UI Components
- Typography that formats hospital names in a different font for each state, to characterize the hospitals and group them together
- Color schemes for hospitals within varying score ranges (lighter = worse score, darker = better score)
- Icons for each condition
- Button or dropdown for selecting a state or county name when creating data visualizations
- Grid output for hospitals and their address information
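The color scheme item in the list above could be prototyped by binning scores into ranges and mapping each range to a shade. This is a minimal sketch using `pd.cut`. The bin edges loosely follow the quartiles reported by `df.describe()`, and the hex colors are illustrative choices, not a final design.

```python
import pandas as pd

# Hypothetical scores; bin edges and hex colors are illustrative only.
scores = pd.Series([12, 85, 130, 260, 640], name="Score")

bins = [0, 84, 128, 256, float("inf")]  # roughly the quartiles from df.describe()
colors = ["#deebf7", "#9ecae1", "#4292c6", "#08519c"]  # light to dark blue

# include_lowest=True keeps a score of exactly 0 in the first bin.
shade = pd.cut(scores, bins=bins, labels=colors, include_lowest=True)
```

Each score now carries the color of its range, which could be passed to a chart or a styled table.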

# Data Visualizations
- Bar graph of average hospital scores by state, to compare hospital performance across states
- Side-by-side box plots of scores for different ZIP codes (selected with a multi-select dropdown menu)
- Bar graph of hospital scores in a single ZIP code (selected with a dropdown menu)
- Histogram of hospital scores for a selected state
- Grid listing hospital names and addresses for a state selected from a dropdown menu