# Crunching the Numbers

## A guide to Statistical Research Findings 

An informational guide to exploring the County Health Dataset in Python. 
### Overview of This Python Notebook Guide 
*The tutorial steps will consist of*
1. Importing the pandas package
2. Using functions to filter through the *County Health Data* dataset
3. Creating a subset and exporting it as a new .csv file

*The informational steps will consist of*

1. Filtering through our data to find patterns, trends, and statistical research. 
2. Using our data to find events that correlate with each other, and using those findings to inform changes to regulations, precautions, etc.

## Python Basics

The following steps help illustrate some of the basic principles of coding in Python.

First, we will go over how to create a directory for your python notebooks on your computer, and how to launch the Jupyter Lab coding environment.

### Creating Your Python Notebook

To create your python notebooks, you will first need to navigate to the appropriate directory on your computer.

#### Getting Started

To get started, you will need to create the directory, and launch the Jupyter Lab environment.

Notice that the file is an `.ipynb` file, a Python notebook. This is the file type we will use to write our code and document our methods during the data compilation process.

Save this file in a safe place on your computer so that it is easily accessible while you work with your data repository.

2. Launch Jupyter Lab to create a new `.ipynb` file that will be saved to your directory. 

On a Mac, use the terminal to navigate to the directory you've created and launch Jupyter Lab.

> Referring to my file path above, I will now use the `cd` command to navigate into my directory: 

    `cd Desktop`
    `cd UNC-Writing-105`
    `cd Python_Lessons`
    
Along the way, I can check my whereabouts by using the `ls` command to list the contents of my current directory.

Once in the appropriate directory, I can use the command `jupyter lab` to launch the environment in my browser.

On a PC, use the Anaconda Navigator to launch Jupyter Lab.

3. Create a new notebook file, using the graphical user interface.
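Once the notebook is open, it can help to confirm from inside Python which directory it is running in, since relative file paths are resolved from there. The following is a minimal sketch using only the standard library; the variable names `current_dir` and `file_names` are chosen here for illustration.

```python
from pathlib import Path

# Path.cwd() returns the directory the notebook (or script) is running from.
current_dir = Path.cwd()
print(current_dir)

# List the names of everything in that directory, much like `ls` in the terminal.
file_names = sorted(p.name for p in current_dir.iterdir())
print(file_names)
```

If the data file you expect does not appear in the printed list, the notebook was probably launched from a different directory.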

Now that you have created a Python notebook, you must locate our dataset, County Health Data. 

### Where can this Data set be found? 

The dataset can be found in the README file of the GitHub repository. 

1. Download the dataset and open it in Excel. The dataset should download as a csv file when you click on it. 
2. After this step, you are all set to begin setting up your Python notebook. 

## Step 1 : Importing the Pandas Package

You'll begin by importing the packages in Jupyter Lab. Before we begin working with our dataset, we must import the pandas package into the notebook to help us work with the data. 

Notice that we load pandas with the usual `import pandas` and an extra `as pd` statement. This allows us to call functions from `pandas` with `pd.<function>` instead of `pandas.<function>` for convenience. `as pd` is **not** necessary to load the package.

Note that we also import the `numpy` package, which helps pandas do some of its math.

In [1]:
import numpy as np
import pandas as pd

We'll also need to create our dataframe object by using pandas to read in our dataset. 

Our dataset is stored in a .csv file. 

### Where can I find the dataset? 
The dataset is linked in the README file of the GitHub repository, under the data tab, listed as *"County Health Data"*. 


Now that you have the dataset saved on your computer as a csv file, we must load it into our Jupyter Lab notebook so that we can code and filter through our data. 

We will use the `pd.read_csv` function, which reads the tabular data from a Comma Separated Values (csv) file into a dataframe object that we'll define as `df`.

To create our dataframe object, we'll define `df` by executing the `pd.read_csv()` function on our data file, inserting the relative file path inside the parentheses.

To load your CSV file into your Python notebook, type in

`df = pd.read_csv("Name of your file")`

There is an example below. Make sure to write this in a Code cell. If your cell is currently set to Markdown or Raw, go to the dropdown at the top of the notebook and select Code. This is the setting you will stay on while working with the data. 



In [3]:
df=pd.read_csv('CountyHealthData_2014-2015 (2).csv')
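After loading a file, a quick look at its size and column names is a useful check that it was read correctly. The sketch below is self-contained, so it reads a tiny made-up CSV from a string instead of the real file; the county names and numbers in `csv_text` are invented for illustration and are not taken from the County Health Data.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file, so this example runs on its own.
csv_text = io.StringIO(
    "State,County,Teen births,Uninsured\n"
    "NC,Example County A,40.0,0.20\n"
    "NC,Example County B,55.0,0.25\n"
    "VA,Example County C,30.0,0.15\n"
)
demo_df = pd.read_csv(csv_text)

# shape is (number of rows, number of columns).
print(demo_df.shape)

# The column names must be typed exactly when filtering later on.
print(demo_df.columns.tolist())

# head() shows the first few rows.
print(demo_df.head())
```

With the real dataset, the same three checks can be run on `df` directly.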

### This tutorial aims to look into factors that affect North Carolina's number of teen births each year. It walks through the steps for finding that information.

After your CSV file has been loaded into your Jupyter notebook, we are going to start finding specific trends in our data. 

This dataset has information from all 50 states, but we are only interested in data from North Carolina. Let's reduce our data down to North Carolina by selecting only the rows for that state. 

## Step 2: Filtering our Data

We can use the "Series" feature of our dataframes to isolate single columns from our tabular dataset, using either dot notation `df.Region` or bracket notation `df["Region"]`.

We can also filter our dataset by using logical conditions (based on true or false). These can be added using nested square brackets. 

Note the example below.

- The inner statement, `df["State"] == "NC"`, looks at the `State` column and checks, row by row, whether the value equals `"NC"`

Since we want to separate out only the data for the state of North Carolina, we will wrap that inner statement in an outer set of brackets: 

`df[df["State"] == "NC"]`

This statement drops every row whose `State` value is not "NC", leaving only data from North Carolina in the following table. 


In [4]:
df[df["State"]== "NC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176
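To see what the nested brackets are doing, the filter can be split into two steps. This is a minimal sketch on a small made-up dataframe; `demo_df`, `is_nc`, and `nc_rows` are names chosen here for illustration, and the county names are invented.

```python
import pandas as pd

# Small made-up dataframe so the example runs on its own.
demo_df = pd.DataFrame({
    "State": ["NC", "VA", "NC"],
    "County": ["Example County A", "Example County B", "Example County C"],
})

# Step 1: the inner comparison gives one True/False value per row.
is_nc = demo_df["State"] == "NC"
print(is_nc.tolist())

# Step 2: the outer brackets keep only the rows where the value is True.
nc_rows = demo_df[is_nc]
print(nc_rows)
```

Writing `demo_df[demo_df["State"] == "NC"]` does both steps in a single line.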


## Step 3: Creating our Subset 

**We just filtered our first set of data! To keep working with the filtered data without repeating the filter over and over, we will save it as a copy with the** `.copy()` **method, which also helps avoid the** *"SettingWithCopyWarning"* **warning.**

#### `SettingWithCopyWarning` and filtered data

Since we are going to keep using only the North Carolina rows, it would take a long time to write out `df[df["State"] == "NC"]` every time. Instead, we will save the filtered rows under a shorter variable name. Adding `.copy()` makes the subset its own independent dataframe, which helps avoid a `SettingWithCopyWarning` if we change the subset later. 

Making a copy is simple! We can copy our data as follows, giving the subset a shorter name. Since the name is shorter, it will be easy to write out at any time! 

Create the subset as follows:

In [5]:
NC_subset = df[df["State"] == "NC"].copy()

Once you have run that code, you can access your subset with a shorter name. Enter `NC_subset` as shown below to see your more limited data. 

In [6]:
NC_subset 

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176
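The effect of `.copy()` is easiest to see by changing the subset and then checking the original. The sketch below uses a small made-up dataframe; the `"Uninsured percent"` column is a hypothetical addition for illustration and is not part of the County Health Data.

```python
import pandas as pd

# Small made-up dataframe so the example runs on its own.
demo_df = pd.DataFrame({
    "State": ["NC", "VA", "NC"],
    "Uninsured": [0.20, 0.15, 0.25],
})

# Filter the rows, then copy so the subset is independent of demo_df.
demo_subset = demo_df[demo_df["State"] == "NC"].copy()

# Add a new column to the copy only.
demo_subset["Uninsured percent"] = demo_subset["Uninsured"] * 100

print(demo_subset.columns.tolist())  # includes the new column
print(demo_df.columns.tolist())      # the original is unchanged
```

Because the subset is a separate copy, edits made while exploring North Carolina never alter the full dataset in `df`.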


## Now, let's save our North Carolina data set as a csv file. 

Since we will be filtering and narrowing data exclusively for North Carolina, we will want to save a copy of our North Carolina subset. Running the code below saves that data as a csv file on your computer. 

`NC_subset.to_csv("NC_subset.csv", index=False)`

In [7]:
NC_subset.to_csv("NC_subset.csv", index=False)
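A quick way to check an export is to read the file back in. The sketch below writes a small made-up dataframe to a temporary folder, once with `index=False` and once without, to show what that argument changes; the file names and data are invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up dataframe so the example runs on its own.
demo_subset = pd.DataFrame({
    "County": ["Example County A", "Example County B"],
    "Teen births": [40.0, 55.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # index=False leaves the row labels out of the file.
    no_index_path = Path(tmp_dir) / "demo_no_index.csv"
    demo_subset.to_csv(no_index_path, index=False)
    reloaded = pd.read_csv(no_index_path)

    # Without index=False, the row labels are saved as an extra, unnamed column.
    with_index_path = Path(tmp_dir) / "demo_with_index.csv"
    demo_subset.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

The second list starts with `Unnamed: 0`, which is the name pandas gives to a column that has no header when it reads a file.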

## Step 4: Filtering our Dataset even narrower to find trends 

Now that you know how to filter your data by state, let's look at how to filter our data by column. We will want to filter by column so that we are not comparing 64 different factors at once. Although all of the data is important, we don't need all 64 columns at the moment. 

We can select any column if we type in its exact name. Since we want to look at rates of teen births, counties in North Carolina, and the percent of uninsured residents, we will only type these columns into our code. The code to write is shown below:


`df[["Name of Column", "Name of Column", "Name of Column"]]`


In [8]:
NC_subset[["County", "Teen births", "Uninsured"]]


Unnamed: 0,County,Teen births,Uninsured
3243,Alamance County,42.4,0.206
3244,Alamance County,40.3,0.203
3245,Alexander County,44.2,0.195
3246,Alexander County,42.1,0.194
3247,Alleghany County,53.8,0.272
...,...,...,...
3438,Wilson County,57.3,0.209
3439,Yadkin County,48.8,0.209
3440,Yadkin County,46.8,0.201
3441,Yancey County,40.2,0.228
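The double square brackets matter here: the inner pair is a Python list of column names, and the outer pair does the selecting. A minimal sketch on a made-up dataframe (invented county names and values) shows the difference between one pair of brackets and two.

```python
import pandas as pd

# Small made-up dataframe so the example runs on its own.
demo_df = pd.DataFrame({
    "County": ["Example County A", "Example County B"],
    "Teen births": [40.0, 55.0],
    "Uninsured": [0.20, 0.25],
})

# One pair of brackets with a single name returns a Series (one column).
one_column = demo_df["County"]
print(type(one_column).__name__)

# Two pairs of brackets with a list of names returns a smaller DataFrame.
several_columns = demo_df[["County", "Teen births"]]
print(type(several_columns).__name__)
print(several_columns.columns.tolist())
```

The columns come back in the order they are listed, so the list can also be used to rearrange columns.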


## Now, let's sample our rows. 

As you might have guessed, it may be hard to compare 200 rows of data. To compare only a portion of the data, we will take a random sample. Since we have about 200 rows of data, the most rows we should sample is 20, following the guideline that a sample should be no more than 10% of the data. 

## For our data set, we will only look at 19 rows. 

We will use the format below to select the columns we want and draw the sample. The sample size is the number of rows we want `(n= # of rows)`

`NC_subset[["Column Name", "Column Name", "Column Name"]].sample(n=# of rows)` 

In [9]:
NC_subset[["County", "Uninsured", "Teen births"]].sample(n=19)

Unnamed: 0,County,Uninsured,Teen births
3281,Cherokee County,0.216,45.7
3290,Columbus County,0.213,57.1
3402,Rowan County,0.207,49.2
3389,Pitt County,0.169,29.0
3398,Robeson County,0.263,70.7
3366,Montgomery County,0.236,69.1
3334,Hertford County,0.197,53.1
3377,Orange County,0.164,11.0
3355,Madison County,0.186,31.4
3415,Swain County,0.232,66.8
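Because `.sample()` is random, running it again returns different rows each time. One option, sketched below on a made-up 200-row dataframe, is to pass pandas' `random_state` argument so the same rows come back on every run; the seed value 42 is an arbitrary choice, and the data is invented for illustration.

```python
import pandas as pd

# Made-up 200-row dataframe standing in for the North Carolina subset.
demo_df = pd.DataFrame({
    "County": [f"Example County {i}" for i in range(200)],
    "Teen births": [30.0 + (i % 40) for i in range(200)],
})

# 10% of the rows, using whole-number division.
max_sample_size = len(demo_df) // 10
print(max_sample_size)

# The same random_state gives the same sample every time the cell is run.
sample_a = demo_df.sample(n=19, random_state=42)
sample_b = demo_df.sample(n=19, random_state=42)
print(sample_a.equals(sample_b))
```

Leaving `random_state` out keeps the sample different on every run, which is what the cell above does.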


## Let's name our sample data in Python! 


Just as we did with our reduced subset earlier, we can save our sample under its own name. That way we do not have to retype the whole line of code when we export the sample to a csv file. We can do this by following the code 

`Nameofsubset= NC_subset [["Column Name", "Column Name", "Column Name"]].sample(n= # of rows)`

I am going to call my sample `NC_samplesubset`

In [10]:
NC_samplesubset = NC_subset[["County", "Uninsured", "Teen births"]].sample(n=19)

## Last Step: Turn our sample subset into a csv! 

Now that we have a name for our sample dataset, we will save it as a csv file on our computer. We will follow the same code we used earlier to export our data as a csv. Here is a refresher: 

Follow this code: `Nameofsubset.to_csv("Nameofsubset.csv", index=False)`

In [11]:
NC_samplesubset.to_csv("NC_samplesubset.csv", index=False)
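With a named sample in hand, one possible next step toward the overview's goal of finding factors that correlate is to compute a correlation coefficient between two columns. The sketch below is only an illustration: it retypes the ten sampled rows printed in the sample output above into a new dataframe called `demo_sample` (a name chosen here) and uses pandas' `Series.corr`, which computes the Pearson correlation by default.

```python
import pandas as pd

# The ten sampled rows shown in the sample output above, retyped by hand.
demo_sample = pd.DataFrame({
    "County": [
        "Cherokee County", "Columbus County", "Rowan County", "Pitt County",
        "Robeson County", "Montgomery County", "Hertford County",
        "Orange County", "Madison County", "Swain County",
    ],
    "Uninsured": [0.216, 0.213, 0.207, 0.169, 0.263,
                  0.236, 0.197, 0.164, 0.186, 0.232],
    "Teen births": [45.7, 57.1, 49.2, 29.0, 70.7,
                    69.1, 53.1, 11.0, 31.4, 66.8],
})

# Pearson correlation: values near 1 or -1 suggest a strong linear
# relationship, and values near 0 suggest a weak one.
correlation = demo_sample["Uninsured"].corr(demo_sample["Teen births"])
print(correlation)
```

A correlation from ten rows is only a starting point; it describes how the two columns move together in this sample and does not show that one causes the other.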