Overview of the Project


Your task is to use computational methods to build a dataset for examining the correlation between adult smoking and both adult obesity and poor mental health in California, Virginia, North Carolina, and Louisiana. The purpose is to examine how public health varies across these states and to help determine whether they should change their regulation of smoking in light of its relationship to obesity and mental health. You will draw this dataset from the CSV file "CountyHealthData_2014-2015.csv", which can be found on Canvas.

Setting Up

Like spreadsheets in Microsoft Excel, Pandas allows you to store your data in tabular, two-dimensional objects (dataframes) with familiar features like rows, columns, and headers. This is useful because it makes management, manipulation, and cleaning of large datasets much easier than would be the case using Python's built-in data structures such as lists. Pandas also provides a wide range of useful tools for working with data once it has been stored and structured.

Begin by importing the pandas package (along with NumPy), then read the CSV file into a dataframe named `df`:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("CountyHealthData_2014-2015.csv")

A good first step in exploring your dataframe is to examine some of its basic attributes. Attributes contain **values** that provide helpful information about the dataframe and guide your interaction with it. In pandas, you access attributes with the following syntax:

`<DataFrame name>.<attribute name>`

You can use the `.shape` attribute to determine how many rows and columns (in that order) are available. The `.size` attribute gives you the number of cells in the dataframe (rows * columns).
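
As a quick illustration of these two attributes, here is a minimal sketch on a small made-up dataframe. The name `toy` and its three rows are hypothetical stand-ins, not the real county health file:

```python
import pandas as pd

# Hypothetical three-row dataframe standing in for the real data
toy = pd.DataFrame({
    "State": ["CA", "CA", "VA"],
    "Adult smoking": [0.106, 0.212, 0.212],
})

n_rows, n_cols = toy.shape   # a tuple: (rows, columns)
n_cells = toy.size           # rows * columns

print(toy.shape)  # (3, 2)
print(toy.size)   # 6
```

Running `df.shape` and `df.size` on the real dataframe works the same way.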

Your next step is to filter the data for the desired states: California, Virginia, North Carolina, and Louisiana. (The order in which you filter them isn't relevant.)

To do this, type a command of the form `df["State"][start:stop]`, where `start` is the number of the first row you want and `stop` is one past the last row you want.

To start, filter the rows for California by typing in the following command:

In [4]:
df["State"][361:458]

361    CA
362    CA
363    CA
364    CA
365    CA
       ..
453    CA
454    CA
455    CA
456    CA
457    CA
Name: State, Length: 97, dtype: object
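
Hard-coded row numbers only work as long as the rows in the file stay in the same order. One possible alternative, sketched here on a small made-up dataframe (`toy` and its values are hypothetical), is to select rows by the value in the `State` column using a Boolean mask:

```python
import pandas as pd

# Hypothetical data; only the column names mirror the real file
toy = pd.DataFrame({
    "State": ["AL", "CA", "CA", "LA", "VA"],
    "Adult smoking": [0.25, 0.106, 0.212, 0.236, 0.212],
})

# True for rows whose State is "CA", False for every other row
is_ca = toy["State"] == "CA"

# Keep only the rows where the mask is True
ca_rows = toy[is_ca]
print(ca_rows)
```

This assumes the `State` column holds two-letter abbreviations, as the output above suggests.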

You will also need to select multiple columns.
The columns you will select are "State", "Adult smoking", "Adult obesity", and "Poor mental health days".
To do this, use the same row range as in the previous step, but in place of the single column name put a list of the column names you want, separated by commas, inside double square brackets.
Once again, start with the data for California.



In [11]:
df[["State","Adult smoking","Adult obesity","Poor mental health days"]][361:458]

Unnamed: 0,State,Adult smoking,Adult obesity,Poor mental health days
361,CA,0.106,0.205,3.2
362,CA,0.212,0.230,4.2
363,CA,0.212,0.248,4.2
364,CA,0.192,0.243,4.5
365,CA,0.192,0.244,4.5
...,...,...,...,...
453,CA,0.135,0.265,3.3
454,CA,0.140,0.225,3.4
455,CA,0.140,0.224,3.4
456,CA,0.171,0.308,3.9
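
The two-step pattern above (pick columns, then slice rows) can also be written as a single `.loc` call. The sketch below uses a made-up dataframe (`toy` and the `County` column are hypothetical) to show that both forms return the same rows. Note one difference: a plain slice such as `[1:3]` stops before row 3, while a `.loc` slice such as `1:2` includes its end label.

```python
import pandas as pd

# Hypothetical data; "County" is only here to show a column being left out
toy = pd.DataFrame({
    "State": ["AL", "CA", "CA", "LA"],
    "County": ["w", "x", "y", "z"],
    "Adult smoking": [0.25, 0.106, 0.212, 0.236],
    "Adult obesity": [0.31, 0.205, 0.230, 0.323],
})

cols = ["State", "Adult smoking", "Adult obesity"]

# Column selection followed by a positional row slice (row 3 is excluded)
chained = toy[cols][1:3]

# One step: rows by label (end label 2 is included), columns by name
one_step = toy.loc[1:2, cols]

print(chained.equals(one_step))  # True
```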


Now repeat this for the other states by replacing the California row numbers with the row numbers for each of the other states.

In [14]:
df[["State", "Adult smoking", "Adult obesity", "Poor mental health days"]][2186:2313]

Unnamed: 0,State,Adult smoking,Adult obesity,Poor mental health days
2186,LA,0.236,0.323,3.4
2187,LA,0.213,0.376,4.8
2188,LA,0.213,0.347,4.8
2189,LA,0.194,0.345,3.2
2190,LA,0.194,0.332,3.2
...,...,...,...,...
2308,LA,0.295,0.379,3.4
2309,LA,,0.349,
2310,LA,,0.366,
2311,LA,0.207,0.342,2.9


In [15]:
df[["State", "Adult smoking","Adult obesity","Poor mental health days"]][3244:3443]

Unnamed: 0,State,Adult smoking,Adult obesity,Poor mental health days
3244,NC,0.238,0.332,3.6
3245,NC,0.260,0.272,4.6
3246,NC,0.260,0.283,4.6
3247,NC,0.271,0.247,4.4
3248,NC,0.271,0.235,4.4
...,...,...,...,...
3438,NC,0.121,0.373,3.1
3439,NC,0.255,0.297,4.6
3440,NC,0.255,0.301,4.6
3441,NC,0.214,0.287,4.1


In [16]:
df[["State","Adult smoking","Adult obesity","Poor mental health days"]][5438:5703]

Unnamed: 0,State,Adult smoking,Adult obesity,Poor mental health days
5438,VA,0.212,0.350,3.2
5439,VA,0.124,0.270,2.3
5440,VA,0.124,0.256,2.3
5441,VA,0.092,0.195,2.1
5442,VA,0.092,0.204,2.1
...,...,...,...,...
5698,VA,0.330,0.319,6.6
5699,VA,0.242,0.288,3.6
5700,VA,0.242,0.301,3.6
5701,VA,0.075,0.282,2.5


Your next step is to combine these four datasets.
To start, assign each one to a variable.
Type any name you like (this example uses each state's abbreviation), then an `=`, followed by the code you typed to produce the dataset.

In [19]:
ca=df[["State","Adult smoking","Adult obesity","Poor mental health days"]][361:458]

In [20]:
la=df[["State", "Adult smoking", "Adult obesity", "Poor mental health days"]][2186:2313]

In [21]:
nc=df[["State", "Adult smoking","Adult obesity","Poor mental health days"]][3244:3443]

In [22]:
va=df[["State","Adult smoking","Adult obesity","Poor mental health days"]][5438:5703]

Now that you've defined each dataset, use the `pd.concat` function to combine them.
Type `pd.concat([...])`, putting the names of your datasets, separated by commas, inside the square brackets.
(If you named the datasets after the state abbreviations, the command is `pd.concat([ca,la,nc,va])`.)
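
To see what `pd.concat` does to the row labels, here is a minimal sketch with two tiny dataframes. The names `ca_toy` and `va_toy` are hypothetical, and each holds only a row or two for illustration:

```python
import pandas as pd

# Hypothetical miniature versions of two state datasets
ca_toy = pd.DataFrame(
    {"State": ["CA", "CA"], "Adult smoking": [0.106, 0.212]},
    index=[361, 362],
)
va_toy = pd.DataFrame(
    {"State": ["VA"], "Adult smoking": [0.212]},
    index=[5438],
)

# By default the original row labels are preserved
combined = pd.concat([ca_toy, va_toy])
print(list(combined.index))    # [361, 362, 5438]

# ignore_index=True renumbers the rows 0, 1, 2, ...
renumbered = pd.concat([ca_toy, va_toy], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
```

Keeping the original labels makes it easy to trace a row back to the source file; renumbering is a matter of preference.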

In [23]:
pd.concat([ca,la,nc,va])

Unnamed: 0,State,Adult smoking,Adult obesity,Poor mental health days
361,CA,0.106,0.205,3.2
362,CA,0.212,0.230,4.2
363,CA,0.212,0.248,4.2
364,CA,0.192,0.243,4.5
365,CA,0.192,0.244,4.5
...,...,...,...,...
5698,VA,0.330,0.319,6.6
5699,VA,0.242,0.288,3.6
5700,VA,0.242,0.301,3.6
5701,VA,0.075,0.282,2.5
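
Since the goal stated in the overview is to look at correlation, one way to check it once the data is combined is the dataframe's `.corr()` method. The sketch below runs on made-up numbers (`toy` is hypothetical), so its values say nothing about the real states. It also includes missing entries, like those in the Louisiana output, to show that `.corr()` still runs when some cells are empty.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan marks a missing entry
toy = pd.DataFrame({
    "Adult smoking": [0.10, 0.20, np.nan, 0.30],
    "Adult obesity": [0.20, 0.25, 0.35, 0.33],
    "Poor mental health days": [3.0, 4.0, np.nan, 4.5],
})

# Pairwise Pearson correlation between every pair of columns;
# rows with a missing value are skipped for the pairs they affect
corr = toy.corr()
print(corr)
```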


The last step is to export this data as a CSV file.
First, give the combined dataset a name; here it is named `healthdata`, but any name will do.

In [24]:
healthdata=pd.concat([ca,la,nc,va])

To turn this data into a CSV file, you will use the `.to_csv` method, written after the name of your dataset. Typed on its own, without parentheses, it does not write a file; it only displays the method itself:

In [35]:
healthdata.to_csv

<bound method NDFrame.to_csv of      State  Adult smoking  Adult obesity  Poor mental health days
361     CA          0.106          0.205                      3.2
362     CA          0.212          0.230                      4.2
363     CA          0.212          0.248                      4.2
364     CA          0.192          0.243                      4.5
365     CA          0.192          0.244                      4.5
...    ...            ...            ...                      ...
5698    VA          0.330          0.319                      6.6
5699    VA          0.242          0.288                      3.6
5700    VA          0.242          0.301                      3.6
5701    VA          0.075          0.282                      2.5
5702    VA          0.075          0.283                      2.5

[688 rows x 4 columns]>

To actually write the CSV file to your working folder, call the method by adding the file name in parentheses at the end: `("healthdata.csv")`

In [37]:
healthdata.to_csv("healthdata.csv") 
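
By default, `to_csv` also writes the row labels as an extra first column. If you would rather leave them out, the `index=False` option does that. Here is a minimal sketch on a made-up dataframe (`toy` and the file name `toy.csv` are hypothetical), which writes to a temporary folder and reads the file back to confirm its contents:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataframe with non-default row labels
toy = pd.DataFrame(
    {"State": ["CA", "VA"], "Adult smoking": [0.106, 0.212]},
    index=[361, 5438],
)

# Use a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.csv")
    toy.to_csv(path, index=False)   # index=False omits the row labels
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['State', 'Adult smoking']
```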

The CSV file "healthdata.csv" should now be in your folder. Congratulations, you are finished!