# Megan's Data Compilation

###### The following explains how to create a subset from the original data sheet.

The original data sheet is county health data for 2014 and 2015. It includes all states, regions, counties, etc. I extracted the physical inactivity and access to exercise opportunities data for Alabama (AL) and Arkansas (AR).

The purpose of this is to see the relationship between physical inactivity and accessibility of exercise opportunities. I chose two states relatively close together to get more data for the southern region.

## Getting Started

1. First, you need to mount your Google Drive in the notebook so that the notebook can access your data. Do this by running the following code:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


2. Next, import the Python packages you'll need (pandas and NumPy). Import them under the aliases pd and np so that they're easier to use later on. Do this by running the following code:

In [None]:
import numpy as np
import pandas as pd

3. Now, we have to create a dataframe. Use the function pd.read_csv() and assign its result to df. The file path to the data goes inside the parentheses. Do this by running the code below:

In [None]:
df = pd.read_csv('gdrive/My Drive/ColabNotebooks/CountyHealthData_2014-2015.csv')
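
As a side note, pd.read_csv() can also load only the columns you plan to use, through its usecols parameter. The sketch below is not part of the original workflow: it reads a tiny made-up CSV from memory instead of the real file, and the names csv_text, wanted, and small_df are hypothetical.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the county health CSV (values are made up).
csv_text = """State,County,Year,Physical inactivity,Access to exercise opportunities
AL,County A,1/1/2014,0.30,0.50
AR,County B,1/1/2015,0.35,0.40
"""

# usecols limits which columns are parsed into the dataframe.
wanted = ["State", "Year", "Physical inactivity", "Access to exercise opportunities"]
small_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# The "County" column was never loaded.
print(small_df.columns.tolist())
```

With the real file, the same usecols argument could be passed alongside the file path, assuming the column names match those in the CSV header.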

To see what a random row of this dataframe looks like, we can use df.sample():

In [None]:
df.sample()

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
5119,TX,South,West South Central,Hockley County,48219,48219,Insuff Data,1/1/2014,7338.0,0.162,...,,0.31,0.147,10588.0,0.074,39.0,46010,0.516,,0.146


## Creating the Subset

Now we can create a subset of the original data sheet and filter out what we don't need. We can use df.loc to select a range of rows by index label. Note that, unlike ordinary Python slicing, a df.loc range includes both of its endpoints. The range below covers the rows that hold the Alabama data. The row labels go first inside the brackets.

We can also decide which columns to keep. We are looking at the relationship between physical inactivity and access to exercise opportunities in 2014 and 2015 and between the two states. To see all of this information, we can put "Physical inactivity", "Access to exercise opportunities", "Year", and "State" in the brackets as a list after the row range.

Running the code will show the new subset:

In [None]:
df.loc[48:179,["Physical inactivity", "Access to exercise opportunities", "Year", "State"]]

Unnamed: 0,Physical inactivity,Access to exercise opportunities,Year,State
48,0.276,0.580,1/1/2014,AL
49,0.252,0.736,1/1/2015,AL
50,0.350,0.115,1/1/2014,AL
51,0.323,0.441,1/1/2015,AL
52,0.370,0.183,1/1/2014,AL
...,...,...,...,...
175,0.309,0.157,1/1/2015,AL
176,0.330,0.058,1/1/2014,AL
177,0.316,0.291,1/1/2015,AL
178,0.351,0.666,1/1/2014,AL
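
Because df.loc selects by label, its ranges behave differently from position-based slicing. The following is a minimal sketch on a made-up dataframe (toy, by_label, and by_position are hypothetical names, not part of the original notebook) that contrasts df.loc with df.iloc:

```python
import pandas as pd

# Toy dataframe with a default integer index (0 to 5); values are made up.
toy = pd.DataFrame({"State": ["AL"] * 6, "value": [10, 11, 12, 13, 14, 15]})

# Label-based: both endpoints are included, so this returns rows 2, 3 and 4.
by_label = toy.loc[2:4, ["value"]]

# Position-based: the end is excluded, so this returns rows 2 and 3 only.
by_position = toy.iloc[2:4]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

This is why the end number in a df.loc range should be the label of the last row you want to keep.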


Now, you can do the same thing with a different range to see Arkansas. Change the numbers in the brackets to 182 and 329.

Run the code to see the subset:

In [None]:
df.loc[182:329,["Physical inactivity", "Access to exercise opportunities", "Year", "State"]]

Unnamed: 0,Physical inactivity,Access to exercise opportunities,Year,State
182,0.374,0.473,1/1/2014,AR
183,0.333,0.554,1/1/2015,AR
184,0.309,0.326,1/1/2014,AR
185,0.298,0.786,1/1/2015,AR
186,0.266,0.640,1/1/2014,AR
...,...,...,...,...
325,0.280,0.549,1/1/2015,AR
326,0.309,0.230,1/1/2014,AR
327,0.338,0.230,1/1/2015,AR
328,0.287,0.537,1/1/2014,AR
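
Hard-coded row ranges depend on the exact row order of the file. As an alternative (not the approach used in this notebook), rows can be selected by the value in the "State" column with a boolean mask. The sketch below uses a small made-up dataframe; toy, columns, al_rows, and both are hypothetical names.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this notebook.
toy = pd.DataFrame({
    "State": ["AL", "AL", "AR", "TX"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2014"],
    "Physical inactivity": [0.30, 0.31, 0.35, 0.25],
    "Access to exercise opportunities": [0.50, 0.55, 0.40, 0.70],
})
columns = ["Physical inactivity", "Access to exercise opportunities", "Year", "State"]

# Keep only the rows where State is "AL".
al_rows = toy.loc[toy["State"] == "AL", columns]

# Keep rows for either state in one step.
both = toy.loc[toy["State"].isin(["AL", "AR"]), columns]

print(al_rows)
print(both)
```

On the real dataframe, the same pattern would be df.loc[df["State"] == "AL", ...], assuming the state codes are stored exactly as "AL" and "AR".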


## Exporting Subsets

First, define the two subsets by assigning each one to a variable. The Alabama subset will be called ALSet and the Arkansas subset will be called ARSet. Run the two cells below to do this:

In [None]:
ALSet = df.loc[48:179,["Physical inactivity", "Access to exercise opportunities", "Year", "State"]]

In [None]:
ARSet = df.loc[182:329,["Physical inactivity", "Access to exercise opportunities", "Year", "State"]]

Finally, we can export the subsets by calling the .to_csv method on each of our now-defined subsets. Passing index=False leaves the row index labels out of the file. Do this by running the two cells below:

In [None]:
ALSet.to_csv("ALSet.csv", index=False)

In [None]:
ARSet.to_csv("ARSet.csv", index=False)
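
To check an export, you can read the file back in and compare it with the subset. The sketch below is a hedged illustration on made-up data: it writes to a temporary folder rather than to the notebook's working directory, and subset, path, and reloaded are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for one of the subsets.
subset = pd.DataFrame({
    "Physical inactivity": [0.30, 0.35],
    "Access to exercise opportunities": [0.50, 0.40],
    "Year": ["1/1/2014", "1/1/2015"],
    "State": ["AL", "AL"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subset_check.csv")
    # index=False keeps the row labels out of the file.
    subset.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns and same number of rows as the subset we wrote.
print(reloaded.columns.tolist() == subset.columns.tolist())
print(len(reloaded) == len(subset))
```

If index=False were left out, the row labels would be written as an extra unnamed first column, which pandas labels "Unnamed: 0" when the file is read back.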

You're done!