# **Creating a Dataframe Tutorial**

This notebook will walk you through how to create and analyze a subset of data using Python. This tutorial and its accompanying dataset pertain to County Health Data collected from 2014 to 2015. Our objective is to determine whether adult obesity is more prevalent in some regions of the United States than in others and, if so, to hypothesize why this may be.

# **Importing and Downloading Software**

**Pandas**

To use Python, open Google Colab and create a new notebook. Packages are not part of basic Python, but they provide additional functions that aid our analysis. Pandas lets us store and view data in rows and columns with headers. It also helps us manipulate the data so that we can isolate and analyze the subset we need. Pandas is imported with the line `import pandas as pd`, which is shown in the cell below:




In [None]:
import pandas as pd
df = pd.read_csv("CountyHealthData_2014-2015.csv")

Since we're using pandas to manipulate data, we need to tell it which data to use. To do this, make sure you have the dataset downloaded as a CSV file. It can then be added under the Files section of Colab by selecting the "content" folder and uploading the file from your computer. The `read_csv` line in the cell above then tells pandas which data to use and what to call it, which in this case is "df", short for "dataframe." The line `df = pd.read_csv("CountyHealthData_2014-2015.csv")` tells pandas to read the given file and load it into a dataframe that we can work with.
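As a minimal sketch of what `pd.read_csv` does, the cell below reads a tiny CSV from an in-memory string rather than from the real file. The three rows and their values are made up for illustration and are not taken from the County Health dataset; `toy_df` and `csv_text` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; these rows are made up for illustration.
csv_text = """State,Region,Adult obesity
AA,West,0.25
BB,South,0.30
CC,Midwest,0.35
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

# The first line of the CSV becomes the column headers.
print(toy_df.columns.tolist())
print(len(toy_df))
```

With the real dataset, the file name is passed instead of the `StringIO` object.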






**Numpy**

NumPy is another package that helps with mathematical computations. Below is the cell to import it with `import numpy as np`.

In [None]:
import numpy as np
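As a small illustration of the kind of computation NumPy handles, the sketch below averages a few values. The numbers are made up for illustration and `rates` is a hypothetical name.

```python
import numpy as np

# Made-up rates, for illustration only.
rates = np.array([0.25, 0.30, 0.35])

# np.mean computes the arithmetic mean of the array.
average = np.mean(rates)
print(average)
```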

# **Data Attributes**

It is important that we explore and visualize the attributes of our data to answer our research question. We can use the format of the code below to explore different attributes.

```
DataFrame name.attribute name
```



For example, it might be useful to know how many rows and columns are in the dataframe, so we would use the cell below with the attribute `.shape`.

In [None]:
df.shape

(6109, 64)

Similarly, the `.size` attribute gives the total number of values in the dataframe (rows multiplied by columns), as seen below.

In [None]:
df.size

390976
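The two attributes are related: 6109 rows times 64 columns gives 390976 values. A minimal sketch of the same relationship on a toy dataframe (values made up for illustration, `toy_df` a hypothetical name):

```python
import pandas as pd

# Toy dataframe with made-up values: 3 rows and 2 columns.
toy_df = pd.DataFrame({
    "Region": ["West", "South", "Midwest"],
    "Adult obesity": [0.25, 0.30, 0.35],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# size is the total number of cells, i.e. rows times columns.
print(toy_df.size == n_rows * n_cols)
```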

When formulating our research question, it is helpful to know which variables are available in the dataframe so we can narrow down what we want to ask and analyze. We can do this by listing the names of all of the columns. The cell that does this is seen below.

In [None]:
df.columns

Index(['State', 'Region', 'Division', 'County', 'FIPS', 'GEOID', 'SMS Region',
       'Year', 'Premature death', 'Poor or fair health',
       'Poor physical health days', 'Poor mental health days',
       'Low birthweight', 'Adult smoking', 'Adult obesity',
       'Food environment index', 'Physical inactivity',
       'Access to exercise opportunities', 'Excessive drinking',
       'Alcohol-impaired driving deaths', 'Sexually transmitted infections',
       'Teen births', 'Uninsured', 'Primary care physicians', 'Dentists',
       'Mental health providers', 'Preventable hospital stays',
       'Diabetic screening', 'Mammography screening', 'High school graduation',
       'Some college', 'Unemployment', 'Children in poverty',
       'Income inequality', 'Children in single-parent households',
       'Social associations', 'Violent crime', 'Injury deaths',
       'Air pollution - particulate matter', 'Drinking water violations',
       'Severe housing problems', 'Driving alone to work'
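To narrow a long column list down to candidate variables, one option is to separate numeric columns from text columns. The sketch below does this with `select_dtypes` on a toy dataframe; the values are made up for illustration, and treating text columns as grouping candidates is an assumption rather than a rule.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy_df = pd.DataFrame({
    "State": ["AA", "BB", "CC"],
    "Region": ["West", "South", "Midwest"],
    "Adult obesity": [0.25, 0.30, 0.35],
})

# Columns holding numbers (candidates for averaging).
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else, here the text columns (candidates for grouping).
text_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```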

# **Filtering the Dataframe**

We saw when we explored the `.shape` attribute that the dataframe is very large, so we need to filter through it and create a subset of data that we can work with.

**Indexing**

Our research question pertains to regions of the US and Adult obesity, so we need to filter the dataframe down to just those two variables as our only columns.

We can select columns with the code `df[["column 1", "column 2"]]`. Note the double square brackets: the inner pair makes a list of column names, and the outer pair indexes the dataframe with that list.

For our dataframe, we will run this cell with Region and Adult obesity. Below is the line of code used to look at just those variables.

In [None]:
df[["Region", "Adult obesity"]]

| | Region | Adult obesity |
|---|---|---|
| 0 | West | 0.300 |
| 1 | West | 0.329 |
| 2 | West | 0.257 |
| 3 | West | 0.268 |
| 4 | West | 0.315 |
| ... | ... | ... |
| 6104 | West | 0.293 |
| 6105 | West | 0.241 |
| 6106 | West | 0.242 |
| 6107 | West | 0.313 |
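The number of square brackets matters when selecting columns. In the sketch below, run on a toy dataframe with made-up values, a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy_df = pd.DataFrame({
    "State": ["AA", "BB", "CC"],
    "Region": ["West", "South", "Midwest"],
    "Adult obesity": [0.25, 0.30, 0.35],
})

# One name in single brackets gives a one-dimensional Series.
one_column = toy_df["Region"]

# A list of names (double brackets) gives a DataFrame with those columns.
two_columns = toy_df[["Region", "Adult obesity"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```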


For our research question, we want to take the average of the adult obesity values for each of the different regions, so we want to use all 6109 rows. First, we need to know which regions appear in the dataframe and how many rows each one has. To do this, we run the cell `df.Region.value_counts()`. The line of code and the results are seen below.

In [None]:
df.Region.value_counts()

South        2803
Midwest      2038
West          834
Northeast     434
Name: Region, dtype: int64
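As a quick sanity check, the four counts above add up to 6109, the number of rows in the dataframe. The sketch below shows `value_counts` next to `nunique`, which returns only the number of distinct values, on a toy column with made-up entries.

```python
import pandas as pd

# Toy column with made-up entries.
toy_df = pd.DataFrame({"Region": ["West", "South", "South", "Midwest", "South"]})

# Rows per region, most frequent first.
counts = toy_df["Region"].value_counts()

# Number of distinct regions.
n_regions = toy_df["Region"].nunique()

print(counts)
print(n_regions)

# The counts always sum to the number of non-missing rows.
print(counts.sum() == len(toy_df))
```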

## **Creating Our Subset**

We want to create a final dataframe that holds the average Adult obesity for each of the four regions: South, Midwest, West, and Northeast. To do this, we assign the result to a new variable as we take the averages. The code `fdf = df[['Region', 'Adult obesity']]` selects the two columns we need, where "fdf" is a new variable representing our final dataframe. To take the mean within each region, we append `.groupby('Region').mean()`. The cell below performs these steps together.

In [None]:
fdf = df[['Region', 'Adult obesity']].groupby('Region').mean()

Great! Now we have created our subset of data. To view and analyze it for our research, we use the cell `fdf.head()`, as seen below. `head()` shows the first five rows, so all four regions are displayed.

In [None]:
fdf.head()

| Region | Adult obesity |
|---|---|
| Midwest | 0.311058 |
| Northeast | 0.277198 |
| South | 0.322495 |
| West | 0.259058 |
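To see what `groupby('Region').mean()` does step by step, the sketch below applies it to a toy dataframe where the averages are easy to check by hand. The values are made up for illustration, and the final sort is an optional extra that is not part of the cell above.

```python
import pandas as pd

# Toy dataframe with made-up values: two rows per region.
toy_df = pd.DataFrame({
    "Region": ["West", "West", "South", "South"],
    "Adult obesity": [0.20, 0.30, 0.30, 0.40],
})

# Split the rows by Region, then average Adult obesity within each group.
toy_means = toy_df[["Region", "Adult obesity"]].groupby("Region").mean()

# Sorting makes the highest and lowest regions easy to spot.
toy_sorted = toy_means.sort_values("Adult obesity", ascending=False)
print(toy_sorted)
```

After grouping, `Region` is no longer a regular column: it becomes the index (the row labels) of the result.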


# **Findings**

Now that we can view the average Adult obesity for each Region, we can compare them. The values are fairly close, ranging from about 0.26 in the West to about 0.32 in the South, so according to this dataset Adult obesity is most prevalent in the South and least prevalent in the West.

One possible reason is the difference in culture between the regions. Perhaps healthier food is more readily available in the West. It's also possible that healthy lifestyles are expensive and that more people in the West can afford them than in other regions. More research would need to be done on the variables that influence this data.
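One way to begin that follow-up is to average other columns by region alongside Adult obesity. The sketch below does this for `Physical inactivity` and `Food environment index`, two columns from the list printed earlier. The choice of these two columns is an assumption for illustration, and the values are made up rather than taken from the dataset.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy_df = pd.DataFrame({
    "Region": ["West", "West", "South", "South"],
    "Adult obesity": [0.20, 0.30, 0.30, 0.40],
    "Physical inactivity": [0.15, 0.25, 0.25, 0.35],
    "Food environment index": [8.0, 7.0, 7.0, 6.0],
})

# Columns to compare side by side (an illustrative choice).
cols = ["Adult obesity", "Physical inactivity", "Food environment index"]

# One row per region, one averaged column per variable.
by_region = toy_df.groupby("Region")[cols].mean()
print(by_region)
```

Regional averages that move together can suggest a hypothesis, but they do not show that one variable causes another.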

# **Exporting the Dataframe**

Next we need to export our data. To do so, we save it as a CSV file and give that file a name. To export as a CSV file we use the method `fdf.to_csv`, and to name the file we pass the name to it: `('Average_of_AO_by_Region.csv')`. The complete line of code is seen below.

In [None]:
fdf.to_csv('Average_of_AO_by_Region.csv')

Once this is completed, you've filtered the original dataframe, created a new one from it, and exported the result! The file can now be found under the "content" folder in Google Colab and can be downloaded to your computer.
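To confirm that an export worked, the file can be read back in. The sketch below writes a toy dataframe to a temporary folder and reloads it; the values, the file name `toy_means.csv`, and the variable names are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy result with made-up values, indexed by Region like a grouped dataframe.
toy_means = pd.DataFrame(
    {"Adult obesity": [0.35, 0.25]},
    index=pd.Index(["South", "West"], name="Region"),
)

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_means.csv")
    # to_csv writes the Region index as the first column of the file.
    toy_means.to_csv(path)
    # index_col restores Region as the row labels when reading back.
    restored = pd.read_csv(path, index_col="Region")

print(restored)
```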