Adding two more data sets changed my research question to: How does exposure to extreme heat events affect hospital admissions for cardiovascular disease in the United States, while controlling for air quality and access to green spaces?
Hypothesis: Exposure to extreme heat events, as measured by high temperature and heat index values, will increase the risk of hospital admissions for cardiovascular disease, with potentially greater effects in certain regions and for certain demographic groups. This relationship will be influenced by air pollution and access to green spaces, with higher levels of air pollution and lack of access to green spaces exacerbating the health effects of extreme heat events.
Datasets:
1.	Extreme heat event data from the National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Information, downloaded from https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00761.
2.	Cardiovascular disease hospitalization data from the Centers for Disease Control and Prevention (CDC), downloaded from https://www.cdc.gov/dhdsp/data_statistics/index.htm.
3.	Demographic and socioeconomic data from the American Community Survey (ACS), downloaded from https://www.census.gov/programs-surveys/acs.

Added data sets:

4.	Air quality data from the Environmental Protection Agency (EPA) Air Quality System (AQS), downloaded from https://www.epa.gov/aqs.

5.	Green spaces data, downloaded from https://www.mphonline.org/green-states/

 Variables: 
• Extreme heat events: daily maximum temperature for each state in the United States.
• Cardiovascular disease hospitalizations: number of hospitalizations for cardiovascular disease for each state (2009-2011). 
• Demographic and socioeconomic: population, median household income, and education level for each state. 
• Air quality: air toxics indicator, measured as the concentration of benzene in micrograms per cubic meter (µg/m³).
• Green spaces: rank of each state based on four equally weighted measures: green energy prevalence (25%), open spaces and natural beauty (25%), waste diversion and recycling (25%), and social justice and access to clean outdoors (25%).
I will load the initially cleaned, merged data set from PS2 as well as the two additional data sets.

In [1]:
import pandas as pd

In [2]:
df1 = pd.read_excel("C:/Users/User/Desktop/merged_data.xlsx")

In [3]:
print(df1)

                   State  Hospitalizations  Max_temp median_income  \
0                Alabama              62.0     75.01        40,474   
1                Arizona              48.8     72.14        46,789   
2               Arkansas              61.5     73.84        38,307   
3             California              42.8     68.33        57,708   
4               Colorado              35.8     56.98        54,046   
5            Connecticut              52.0     57.92        64,032   
6               Delaware              52.5     63.07        55,847   
7   District of Columbia              48.8     65.18        60,903   
8                Florida              57.5     80.99        44,409   
9                Georgia              55.7     75.45        46,430   
10                 Idaho              31.8     52.01        43,490   
11              Illinois              58.7     62.82        52,972   
12               Indiana              58.1     62.46        44,613   
13                  
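The median_income column prints with thousands separators (for example 40,474), which suggests it may have been read as text rather than as a number. The sketch below shows one way to convert such a column, assuming it really is stored as strings; the toy frame `sample` is a hypothetical stand-in built from three rows of the output above, not the actual df1.

```python
import pandas as pd

# Toy rows copied from the printed output above; `sample` stands in for df1.
sample = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Arkansas"],
    "median_income": ["40,474", "46,789", "38,307"],
})

# Remove the thousands separator, then convert the strings to numbers.
sample["median_income"] = pd.to_numeric(
    sample["median_income"].str.replace(",", "", regex=False)
)

print(sample.dtypes)
```

If the column is already numeric in df1, this step is unnecessary; checking `df1.dtypes` first would confirm which case applies.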

In [6]:
df2 = pd.read_excel("C:/Users/User/Desktop/Greenspaces.xlsx")

In [7]:
print(df2)

             State  Overall Rank  Green Energy Rank  Open Spaces and Beauty  \
0          Alabama            31                 27                      29   
1           Alaska            49                 50                      19   
2          Arizona            38                 30                      13   
3         Arkansas            30                 32                      35   
4       California             3                 25                       1   
5         Colorado            35                 42                      15   
6      Connecticut            22                 31                      16   
7         Delaware             9                  1                      21   
8          Florida             6                 18                       6   
9          Georgia            24                 19                      37   
10          Hawaii             2                  1                       2   
11           Idaho            17                  1 

In [9]:
df3 = pd.read_excel("C:/Users/User/Desktop/Airquality.xlsx")

In [10]:
print(df3)

             State  Year  Air Quality           Pollutant
0          Alabama  2011         0.49  Pollutant: Benzene
1          Arizona  2011         0.43  Pollutant: Benzene
2         Arkansas  2011         0.46  Pollutant: Benzene
3       California  2011         0.64  Pollutant: Benzene
4         Colorado  2011         0.87  Pollutant: Benzene
5      Connecticut  2011         0.86  Pollutant: Benzene
6         Delaware  2011         0.69  Pollutant: Benzene
7          Florida  2011         0.53  Pollutant: Benzene
8          Georgia  2011         0.55  Pollutant: Benzene
9           Hawaii  2011         0.42  Pollutant: Benzene
10           Idaho  2011         0.55  Pollutant: Benzene
11        Illinois  2011         0.66  Pollutant: Benzene
12         Indiana  2011         0.66  Pollutant: Benzene
13            Iowa  2011         0.55  Pollutant: Benzene
14          Kansas  2011         0.56  Pollutant: Benzene
15        Kentucky  2011         0.56  Pollutant: Benzene
16       Louis

In the green spaces worksheet, we will remove the following variables: 1) Green Energy Rank, 2) Open Spaces and Beauty, 3) Waste Diversion and Recycling, and 4) Racial Justice And Access to Clean Outdoors.
We will also rename the variable Overall Rank to Green_spaces.

In [11]:
df2.drop("Green Energy Rank", axis=1, inplace=True)

In [13]:
df2.drop("Open Spaces and Beauty", axis=1, inplace=True)

In [14]:
df2.drop("Waste Diversion and Recycling", axis=1, inplace=True)

In [16]:
df2.drop("Racial Justice And Access to Clean Outdoors", axis=1, inplace=True)

In [17]:
df2=df2.rename(columns={'Overall Rank': 'Green_spaces'})

In [18]:
print(df2)

             State  Green_spaces
0          Alabama            31
1           Alaska            49
2          Arizona            38
3         Arkansas            30
4       California             3
5         Colorado            35
6      Connecticut            22
7         Delaware             9
8          Florida             6
9          Georgia            24
10          Hawaii             2
11           Idaho            17
12        Illinois            45
13         Indiana            46
14            Iowa            27
15          Kansas            44
16        Kentucky            28
17       Louisiana            50
18           Maine             5
19        Maryland            14
20   Massachusetts            13
21        Michigan            12
22       Minnesota            11
23     Mississippi            42
24        Missouri            23
25         Montana            40
26        Nebraska            32
27          Nevada            16
28   New Hampshire             7
29      Ne
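
The drop and rename steps above can also be written as a single expression that selects the columns to keep instead of dropping the others one by one. This is only a sketch of an equivalent approach: `toy` is a hypothetical two-row stand-in for df2 that carries only some of its columns, with values taken from the output printed earlier.

```python
import pandas as pd

# Hypothetical stand-in for df2, limited to columns whose values appear above.
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Overall Rank": [31, 49],
    "Green Energy Rank": [27, 50],
    "Open Spaces and Beauty": [29, 19],
})

# Keep only the two columns of interest and rename the rank column.
green = toy[["State", "Overall Rank"]].rename(
    columns={"Overall Rank": "Green_spaces"}
)

print(green)
```

Selecting by name leaves the original frame untouched, so the cell can be re-run without raising an error for a column that was already dropped.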

In the Airquality worksheet, we will remove the following variables: 1) Year and 2) Pollutant.
We will also rename the variable Air Quality to Air_quality.

In [19]:
df3.drop("Year", axis=1, inplace=True)

In [20]:
df3.drop("Pollutant", axis=1, inplace=True)

In [21]:
df3=df3.rename(columns={'Air Quality': 'Air_quality'})

In [22]:
print(df3)

             State  Air_quality
0          Alabama         0.49
1          Arizona         0.43
2         Arkansas         0.46
3       California         0.64
4         Colorado         0.87
5      Connecticut         0.86
6         Delaware         0.69
7          Florida         0.53
8          Georgia         0.55
9           Hawaii         0.42
10           Idaho         0.55
11        Illinois         0.66
12         Indiana         0.66
13            Iowa         0.55
14          Kansas         0.56
15        Kentucky         0.56
16       Louisiana         0.52
17           Maine         0.51
18        Maryland         0.80
19   Massachusetts         0.81
20        Michigan         0.67
21       Minnesota         0.72
22     Mississippi         0.42
23        Missouri         0.54
24         Montana         0.39
25        Nebraska         0.55
26          Nevada         0.65
27   New Hampshire         0.62
28      New Jersey         0.91
29      New Mexico         0.45
30      

Now that my data is clean, I will merge the three data sets.
State is a common variable across all three data sets, so I will first merge the data set from the earlier problem set (df1) with Greenspaces (df2) on "State".

Then I will merge the resulting merged_df with the Air Quality data set (df3) on "State".


In [23]:
merged_df = pd.merge(df1, df2, on='State')

In [24]:
merged_df = pd.merge(merged_df, df3, on='State')

In [25]:
print(merged_df)

             State  Hospitalizations  Max_temp median_income  \
0          Alabama              62.0     75.01        40,474   
1          Arizona              48.8     72.14        46,789   
2         Arkansas              61.5     73.84        38,307   
3       California              42.8     68.33        57,708   
4         Colorado              35.8     56.98        54,046   
5      Connecticut              52.0     57.92        64,032   
6         Delaware              52.5     63.07        55,847   
7          Florida              57.5     80.99        44,409   
8          Georgia              55.7     75.45        46,430   
9            Idaho              31.8     52.01        43,490   
10        Illinois              58.7     62.82        52,972   
11         Indiana              58.1     62.46        44,613   
12            Iowa              43.0     59.02        47,961   
13          Kansas              47.2     68.80        48,257   
14        Kentucky              70.6    
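
District of Columbia appears in the df1 output but not in the merged output above. That is consistent with pd.merge defaulting to an inner join, which keeps only the states present in both frames. The sketch below shows one way to list the states that fail to match, using an outer join with `indicator=True`. The frames `left` and `right` are hypothetical toy stand-ins for df1 and df2, holding a few rows copied from the outputs above.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; values are copied from the printed outputs.
left = pd.DataFrame({
    "State": ["Alabama", "District of Columbia", "Florida"],
    "Hospitalizations": [62.0, 48.8, 57.5],
})
right = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Florida"],
    "Green_spaces": [31, 49, 6],
})

# An outer join keeps every state; the _merge column records where each row came from.
check = pd.merge(left, right, on="State", how="outer", indicator=True)

# Rows not marked "both" would be dropped by the default inner join.
unmatched = check.loc[check["_merge"] != "both", ["State", "_merge"]]
print(unmatched)
```

Running the same check on the real frames before the inner merge would show which states are lost, so the drop is a deliberate choice rather than a silent one.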

In [26]:
merged_df.to_excel("C:/Users/User/Desktop/newmerged_data.xlsx", index=False)