# Crunching the Numbers

## A guide to Statistical Research Findings 


This notebook walks through a quick tutorial on using the pandas package for data analysis with Python.

### Overview of Tutorial
*The first steps will consist of*
1. importing the pandas package 
2. creating a dataframe
3. exploring our dataframe's attributes 

*The second set of steps will consist of*

4. using functions to filter our data
5. using functions to merge and join our data
6. creating a subset and exporting as a new .csv file

## Python Basics

The following cells will help illustrate some of the basic principles of coding in Python.

First, we will go over how to create a directory for your python notebooks on your computer, and how to launch the Jupyter Lab coding environment.

### Creating Your Python Notebook

To create your python notebooks, you will first need to navigate to the appropriate directory on your computer.

#### Getting Started

To get started, you will need to create the directory, and launch the Jupyter Lab environment.

Notice that the file is an `.ipynb` file, a Python notebook. This is the file type we will use to write our code and document our methods during the data compilation process.

Save this file in a safe place on your computer so that it is easily accessible while you work with your data repository. 

2. Launch Jupyter Lab to create a new `.ipynb` file that will be saved to your directory. 

On a Mac, use the terminal to navigate to the directory you've created and launch Jupyter Lab.

> Referring to my file path above, I will now use the `cd` command to navigate into my directory: 

    `cd Desktop`
    `cd UNC-Writing-105`
    `cd Python_Lessons`
    
Along the way, I can check my whereabouts by using the `ls` command to list the contents of my current directory location.

Once in the appropriate directory, I can use the command `jupyter lab` to launch the environment in my browser.

On a PC, use the Anaconda Navigator to launch jupyter.

3. Create a new notebook file, using the graphical user interface.

Now that you have created a Python notebook, we need a data set to work with. In this example, we will use the County Health Data Repository set. 

Open your downloaded data set in Excel. 
Then, in Excel, use the File menu to save the data as a CSV file. We use a CSV file because it is a plain-text format that pandas can read directly. After this step, you are all set to begin setting up your Python notebook. 

### Setup

You'll begin by importing the packages we need into the notebook created in the earlier steps. Before we can work with the data, we must import the pandas package, which provides the tools we will use to load and analyze it. 

Notice that we load pandas with the usual `import pandas` and an extra `as pd` statement. This allows us to call functions from `pandas` with `pd.<function>` instead of `pandas.<function>` for convenience. `as pd` is **not** necessary to load the package.

Note, we also imported the `numpy` package, which is going to help pandas do some of its math.

In [116]:
import numpy as np
import pandas as pd

We'll also need to create our dataframe object again, by using pandas to read in our .csv file.

`pd.read_csv` reads the tabular data from a Comma Separated Values (csv) file into a dataframe object that we'll define as `df`.

To create our dataframe object, we'll define `df` by calling the `pd.read_csv()` function on our data file, inserting the relative file path inside the parentheses.

We are examining data from the County Health Data Repository set. Let's import the data that we previously saved as a CSV file. 

To load your CSV file into your Python notebook, type

**df = pd.read_csv('Name of your file')**

There is an example below. Make sure to enter this in a Code cell. If the cell is currently set to Markdown or Raw, use the dropdown at the top of the notebook to select Code. This is the setting you will stay on while working with the data. 

In [117]:
df=pd.read_csv('CountyHealthData_2014-2015 (2).csv')
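
Before moving on, it can help to confirm that the file loaded as expected. The sketch below is only an illustration: it reads a tiny stand-in for the CSV from a string (using `io.StringIO`) so that it runs without the real file, and it assumes just four of the columns. The three rows are copied from this notebook's own output. With the real file, you would pass the file path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV file (assumption: only four columns are shown).
csv_text = """State,Region,County,Premature death
NC,South,Alamance County,7123.0
NC,South,Alexander County,7974.0
NC,South,Alleghany County,8817.0
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column names read from the header row
print(sample.head(2))        # the first two rows
```

Checking `shape` and `columns` right after loading is a cheap way to catch a wrong file name or a file that was saved in the wrong format.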

### This tutorial aims to look into factors that affect North Carolina's number of premature deaths each year. This tutorial will show the instructions to finding the information.

After your CSV file has been loaded into your Jupyter notebook, we can start looking for specific trends in our data. 

This data set has information from all 50 states, but we are only interested in data from North Carolina. Let's reduce our data to North Carolina by selecting only the rows whose state is North Carolina. 

### Let's separate our data by state.

#### Filtering

We've already discussed how to use the "Series" feature of our dataframes to isolate single columns from our tabular dataset, using either dot notation `df.Region` or bracket notation `df["Region"]`.

We can also filter our dataset using logical conditions (based on true or false). These are written with nested square brackets. 

Note the example below.

- The inner statement, `df["State"]=="RI"` looks for a column name and checks if it equals `"RI"`
- The outer statement `df[ ... ]` uses the resulting column of `True/False` values to select rows
- When combined, these two commands call all of the data contained in rows where the value of the `State` field is equal to `"RI"`
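
The two steps in the list above can be run separately to see what each one produces. This is a minimal sketch on a small hand-made dataframe, filtering for `"NC"` rather than `"RI"`; the four values are copied from outputs in this notebook, and the names `small`, `mask`, and `nc_rows` exist only for this example.

```python
import pandas as pd

# A small hand-made dataframe standing in for the full dataset.
small = pd.DataFrame({
    "State": ["AK", "AL", "NC", "NC"],
    "Premature death": [6827.0, 9915.0, 7123.0, 7291.0],
})

# Inner statement: compares every value in the column and returns True/False per row.
mask = small["State"] == "NC"
print(mask.tolist())

# Outer statement: keeps only the rows where the mask is True.
nc_rows = small[mask]
print(nc_rows)
```

The filtered rows keep their original row labels (2 and 3 here), which is why the North Carolina rows in the full dataset are still labeled in the 3000s after filtering.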

Since we want to separate out only the data for North Carolina, we will use the statement 

`df[df["State"] == "NC"]` 

to filter the data by state. The code is shown below. 

In [118]:
df[df["State"]== "NC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176


**We just filtered our first set of data!** To keep working with the filtered data without repeating this step over and over, we can save it as a subset. Doing so can trigger a warning called *"SettingWithCopyWarning"*, which is explained next.

#### `SettingWithCopyWarning` and filtered data

When we use the notation above to filter a DataFrame, we may run into a `SettingWithCopyWarning` later on in our code if we save this object and then modify it. 

That's because pandas cannot always tell whether the filtered object is a view that refers back to the original dataframe or an independent copy of it, unless we explicitly use the `.copy()` method. 

If you want to use the subset later on, you should create it as follows:

In [120]:
NC_subset = df[df["State"] == "NC"].copy()
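
To see what `.copy()` buys us, the sketch below changes a value in a copied subset and checks that the original dataframe is untouched. The small dataframe stands in for the real one, and the edited value (0.0) is made up purely for illustration.

```python
import pandas as pd

original = pd.DataFrame({
    "State": ["AL", "NC", "NC"],
    "Premature death": [9915.0, 7123.0, 7291.0],
})

# .copy() gives the subset its own data, independent of `original`.
subset = original[original["State"] == "NC"].copy()

# Change one value in the copy (hypothetical edit, for demonstration only).
subset.loc[subset.index[0], "Premature death"] = 0.0

print(subset)     # the copy shows the new value
print(original)   # the original still holds 7123.0
```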

Once you have run this code, you can access your subset with a shorter name. Enter `NC_subset`, as shown below, to see your more limited data. 

In [121]:
NC_subset

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176


Awesome job! Now that you know how to filter your data by state, let's look at how to select a single column. We can do this for any column as long as we type its exact name. Since we want to find rates of premature death, we would type in 

df["Name of Column"]

as follows below 

In [122]:
df ["Premature death"]

0           NaN
1           NaN
2        6827.0
3        6856.0
4       13345.0
         ...   
6104     7436.0
6105     6580.0
6106     7572.0
6107     5633.0
6108     7819.0
Name: Premature death, Length: 6109, dtype: float64

Now we are looking only at the premature death column. But since we also want to see how this differs in North Carolina and by region, we need to bring more columns into this table. 

We can reduce our data to several columns at once, leaving out any information in the data set that is not important for our research. We can keep as few or as many columns as we need. To look at multiple columns, we will use the format 

df[["Column Name1", "Column Name2", "Column Name3"]]

In [123]:
df[["State", "Premature death","Region"]] [72:92]

Unnamed: 0,State,Premature death,Region
72,AL,9915.0,South
73,AL,8959.0,South
74,AL,10738.0,South
75,AL,10438.0,South
76,AL,8403.0,South
77,AL,8015.0,South
78,AL,10591.0,South
79,AL,10657.0,South
80,AL,12942.0,South
81,AL,13362.0,South


If you are interested in only looking at specific columns, you can do this with the same notation 

df[["ColumnName1", "ColumnName2", "ColumnName3"]]

Say you also want to look only at rows 72 through 91. You would then add the range of row numbers after the column selection, as shown below. The end of the range, 92, is not included.

df[["ColumnName1", "ColumnName2", "ColumnName3", "ColumnName4"]][72:92] 

In [124]:
df [["State", "Region", "Premature death"]] [72:92]

Unnamed: 0,State,Region,Premature death
72,AL,South,9915.0
73,AL,South,8959.0
74,AL,South,10738.0
75,AL,South,10438.0
76,AL,South,8403.0
77,AL,South,8015.0
78,AL,South,10591.0
79,AL,South,10657.0
80,AL,South,12942.0
81,AL,South,13362.0


### Be wary that the first row is numbered 0, not 1, and that the end of a range is not included. So `[0:20]` returns the first twenty rows (rows 0 through 19). If you also want to include the row numbered 20, you have to ask for `[0:21]`.

In [125]:
df [["State", "Premature death"]][0:21] 

Unnamed: 0,State,Premature death
0,AK,
1,AK,
2,AK,6827.0
3,AK,6856.0
4,AK,13345.0
5,AK,12864.0
6,AK,9699.0
7,AK,15057.0
8,AK,5740.0
9,AK,5862.0
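
The counting rule for row ranges is easy to check on a small example. The sketch below uses a made-up dataframe of 30 numbered rows, so each row's label matches its position.

```python
import pandas as pd

# 30 rows labeled 0 through 29 (made-up data for illustration).
numbers = pd.DataFrame({"value": range(30)})

first_twenty = numbers[0:20]   # rows 0 through 19; the end of the range is left out
through_20 = numbers[0:21]     # rows 0 through 20

print(len(first_twenty), first_twenty.index[-1])
print(len(through_20), through_20.index[-1])
```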


## You can also do the same for the subset we made earlier 

If you only want to see your selected columns in a specific subset, this is done in exactly the same way. Follow this guideline with your data to see your subset as shown below. 

Name of subset [["Column Name", "Column Name 2", "Column Name 3"]]

In [126]:
NC_subset [["State", "Region", "Premature death"]] 

Unnamed: 0,State,Region,Premature death
3243,NC,South,7123.0
3244,NC,South,7291.0
3245,NC,South,7974.0
3246,NC,South,8079.0
3247,NC,South,8817.0
...,...,...,...
3438,NC,South,8028.0
3439,NC,South,7893.0
3440,NC,South,7258.0
3441,NC,South,6872.0


While the NC data takes up rows 3243-3442 of the whole data set, the subset contains only these rows, and a range in square brackets counts positions from the top of the subset. So if you want to see the first 20 rows of this data, you would use [0:20] as your notation rather than [3243:3263].

This is shown below. 

In [127]:
NC_subset [["State", "Region", "Premature death"]] [0:20]

Unnamed: 0,State,Region,Premature death
3243,NC,South,7123.0
3244,NC,South,7291.0
3245,NC,South,7974.0
3246,NC,South,8079.0
3247,NC,South,8817.0
3248,NC,South,7324.0
3249,NC,South,10220.0
3250,NC,South,9599.0
3251,NC,South,7521.0
3252,NC,South,7900.0


In [128]:
 NC_subset [["State", "Premature death", "Region"]][0:16]

Unnamed: 0,State,Premature death,Region
3243,NC,7123.0,South
3244,NC,7291.0,South
3245,NC,7974.0,South
3246,NC,8079.0,South
3247,NC,8817.0,South
3248,NC,7324.0,South
3249,NC,10220.0,South
3250,NC,9599.0,South
3251,NC,7521.0,South
3252,NC,7900.0,South
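
The difference between a row's label and its position can be seen directly. In this sketch the subset is a stand-in with five rows whose labels start at 3243, like `NC_subset`; `.loc` is included as the label-based counterpart to a plain range.

```python
import pandas as pd

# Stand-in for NC_subset: five rows that kept their original labels 3243-3247.
subset = pd.DataFrame(
    {"Premature death": [7123.0, 7291.0, 7974.0, 8079.0, 8817.0]},
    index=range(3243, 3248),
)

by_position = subset[0:2]          # the first two rows, counted from the top
by_label = subset.loc[3243:3244]   # rows labeled 3243 through 3244 (.loc includes the end label)
empty = subset[3243:3245]          # positions 3243 and 3244 do not exist here

print(by_position)
print(by_label)
print(len(empty))
```

Both `by_position` and `by_label` return the same two rows, while the plain range with the large numbers comes back empty because it is read as positions, not labels.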


Typing all this out is a little time consuming. Indexers such as *'.iloc'* and *'.loc'* help shorten the long process of writing out all of our columns. 

If we use the *'.iloc'* indexer before our brackets, pandas accepts two numbers separated by a comma. The first number is for rows and the second for columns. Both are positions counted from 0. 

Below, we select the row at position 2 and the column at position 3 (the third row and the fourth column). 

In my example, I am using the NC_subset, though you can also use the full data set by typing in 

df.iloc[2, 3]

In [129]:
NC_subset.iloc[2,3]

'Alexander County'

We can use a colon to select multiple columns or rows at once. Note the examples below. 

Since we want to look at differing rates of adult obesity and child mortality rates per county in North Carolina, we are going to start merging our data and looking at specifics in our data. To look at specific columns and rows at once, we first need to find the position numbers of the columns we want to use. You can do this by looking at your original Excel file and counting across to the selected column.

The columns are numbered left to right, starting from 0, so the column on the very left is column 0. For example, if we want to find the name of a column by using Python, we can do so as shown below, though we have to specify which rows and which columns we want. 

NC_subset.iloc[Rows:Rows, Columns:Columns]

If we only want to look at one row or one column, we can put its number in place of the range. [Row, Column]

In [130]:
NC_subset.iloc[0:3, 1]

3243    South
3244    South
3245    South
Name: Region, dtype: object

### Region is the name of the column at position 1 in our dataset (State is at position 0). 

Using this Excel process, we can find the number of each of the columns we want to see. 

The numbers of the columns I would like to see are 
- 1 for Region 
- 3 for County
- 8 for premature death 
- 13 for adult smoking 
- 14 for adult obesity 
- 20 for sexually transmitted infections 
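
Instead of counting columns in Excel, pandas can report a column's position. The sketch below uses `columns.get_loc` on a one-row stand-in that contains only the first nine column names of this dataset, in the order implied by the outputs in this notebook (`State` at position 0, `Premature death` at position 8). The values are copied from the first Alamance County row.

```python
import pandas as pd

columns = ["State", "Region", "Division", "County", "FIPS",
           "GEOID", "SMS Region", "Year", "Premature death"]
row = ["NC", "South", "South Atlantic", "Alamance County", 37001,
       37001, "Region 20", "1/1/2014", 7123.0]
sample = pd.DataFrame([row], columns=columns)

# get_loc returns the zero-based position of a column name.
positions = {name: sample.columns.get_loc(name)
             for name in ["Region", "County", "Premature death"]}
print(positions)

# The position can then be passed to .iloc as the column number.
print(sample.iloc[0, positions["County"]])
```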

Let's begin by pulling out one column at a time. For the rows, we are going to insert 0:200, since there are 200 rows in the NC subset. 

In [131]:
NC_subset.iloc[0:200,1]

3243    South
3244    South
3245    South
3246    South
3247    South
        ...  
3438    South
3439    South
3440    South
3441    South
3442    South
Name: Region, Length: 200, dtype: object

#### Merging Data

When combining two dataframes that share a common set of rows but contain different columns, we usually can't just concatenate our dataframes together. Instead, we use certain key variables to make sure the same records end up in the same row. Here, every series comes from the same subset and keeps the same row labels, so pandas can line the rows up by those labels.

So let's take our selections from earlier and save each one as a series. Enter the subset name with the rows and column you want into the template below. 

series0 = subsetname.iloc[Rows, Column]

Now that you know the coding template, let's get ready to merge! Create at least two data series as below.

In [132]:
series00= NC_subset.iloc[0:200, 1]

In [133]:
series01= NC_subset.iloc[0:200, 8]

Now, with the `pd.concat` function, we will merge these two series. We will do this with the template 

pd.concat([series, series], axis=1)

In [134]:
pd.concat([series00,series01], axis=1)

Unnamed: 0,Region,Premature death
3243,South,7123.0
3244,South,7291.0
3245,South,7974.0
3246,South,8079.0
3247,South,8817.0
...,...,...
3438,South,8028.0
3439,South,7893.0
3440,South,7258.0
3441,South,6872.0
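
Here is the same pattern on a small scale. The sketch builds two series by hand (values copied from the first three North Carolina rows) and joins them with `pd.concat`; the names `region` and `deaths` are only for this example.

```python
import pandas as pd

labels = [3243, 3244, 3245]
region = pd.Series(["South", "South", "South"], index=labels, name="Region")
deaths = pd.Series([7123.0, 7291.0, 7974.0], index=labels, name="Premature death")

# axis=1 places the series side by side as columns, matching rows by index label.
combined = pd.concat([region, deaths], axis=1)
print(combined)
```

Because both series share the same index labels, each row lines up with itself, and each series name becomes a column name.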


## Your columns have been merged! Now let's do this with all the separate columns to compile a data set. 

In [135]:
series00= NC_subset.iloc[0:200,1]

In [136]:
series02=NC_subset.iloc[0:200,3]

In [137]:
series01= NC_subset.iloc[0:200,8]

In [138]:
series03=NC_subset.iloc[0:200,13]

In [139]:
series04=NC_subset.iloc[0:200, 14]

In [140]:
series05=NC_subset.iloc[0:200,20]

In [141]:
pd.concat([series00,series01,series02,series03,series04,series05], axis=1)

Unnamed: 0,Region,Premature death,County,Adult smoking,Adult obesity,Sexually transmitted infections
3243,South,7123.0,Alamance County,0.238,0.341,459.9
3244,South,7291.0,Alamance County,0.238,0.332,471.0
3245,South,7974.0,Alexander County,0.260,0.272,213.0
3246,South,8079.0,Alexander County,0.260,0.283,206.2
3247,South,8817.0,Alleghany County,0.271,0.247,208.1
...,...,...,...,...,...,...
3438,South,8028.0,Wilson County,0.121,0.373,651.1
3439,South,7893.0,Yadkin County,0.255,0.297,188.1
3440,South,7258.0,Yadkin County,0.255,0.301,168.0
3441,South,6872.0,Yancey County,0.214,0.287,152.5



## 200 rows is still a lot of data to look through.
So let's compare two counties by narrowing down the data just a little more. Let's compare the difference between data in a western part of North Carolina and an eastern part of North Carolina. 

Let's compare Henderson County, which is located in western North Carolina, and Beaufort County, which is in eastern North Carolina, by narrowing down our merged data. 

As we did earlier, we can narrow down the rows of our data instead of having 200. Looking at the row labels of our data, we can find the row numbers that lead us directly to the counties we are focused on. 

Beaufort County data is located on rows 3255 and 3256. 

Henderson County data is located on rows 3331 and 3332.

To transfer this information onto our data subset, since our data subset only contains North Carolina information, we need to subtract 3243 from the row numbers of our selected counties, since 3243-3442 is the span of all the North Carolina data. 

3255-3243= 12 and 3256-3243= 13 (Beaufort County is at positions 12 and 13 in our NC subset, so we use 12:14.)
3331-3243= 88 and 3332-3243= 89 (Henderson County is at positions 88 and 89 in our NC subset, so we use 88:90.) 
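
The subtraction can also be left to pandas. In this sketch, `index.get_loc` looks up the position of a row label. The dataframe is a stand-in that only reproduces the 200 index labels of the North Carolina subset (3243 to 3442, as shown in the outputs earlier), and its `placeholder` column is made up.

```python
import pandas as pd

# Stand-in with the same 200 index labels as the North Carolina subset.
subset = pd.DataFrame({"placeholder": range(200)}, index=range(3243, 3443))

first_label = subset.index[0]
print(3255 - first_label, 3256 - first_label)   # Beaufort County positions
print(3331 - first_label, 3332 - first_label)   # Henderson County positions

# get_loc gives the same answer without manual arithmetic.
beaufort_pos = subset.index.get_loc(3255)
henderson_pos = subset.index.get_loc(3331)
print(beaufort_pos, henderson_pos)

# The end of a range is excluded, so 12:14 selects positions 12 and 13.
print(list(subset.iloc[12:14].index))
```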

## The first step is to copy the data from earlier. 
Since the earlier series names are already in use, we give the new series different names. You can just change yours to different numbers. I will first do Beaufort County's statistics. 

In [142]:
series06=NC_subset.iloc[12:14,1]

In [143]:
series07=NC_subset.iloc[12:14,3]

In [144]:
series08= NC_subset.iloc[12:14,8]

In [145]:
series09=NC_subset.iloc[12:14,13]

In [146]:
series10=NC_subset.iloc[12:14, 14]

In [147]:
series11=NC_subset.iloc[12:14,20]

In [148]:
pd.concat([series06,series07,series08,series09,series10,series11], axis=1)

Unnamed: 0,Region,County,Premature death,Adult smoking,Adult obesity,Sexually transmitted infections
3255,South,Beaufort County,9400.0,0.283,0.343,463.4
3256,South,Beaufort County,8962.0,0.283,0.319,686.2


## Let's do the same for Henderson County. 

In [149]:
series12=NC_subset.iloc[88:90,1]

In [150]:
series13=NC_subset.iloc[88:90,3]

In [151]:
series14=NC_subset.iloc[88:90,8]

In [152]:
series15=NC_subset.iloc[88:90,13]

In [153]:
series16=NC_subset.iloc[88:90,14] 

In [154]:
series17=NC_subset.iloc[88:90,20]

In [155]:
pd.concat([series12,series13,series14,series15,series16,series17], axis=1)

Unnamed: 0,Region,County,Premature death,Adult smoking,Adult obesity,Sexually transmitted infections
3331,South,Henderson County,6873.0,0.174,0.219,194.6
3332,South,Henderson County,6755.0,0.174,0.218,211.5


Now that we have our information, we will want to merge the two to compare statistics. Typing both out again would be quite tedious, so let's save the result of each 

pd.concat([series]) 

call under a name of its own. Note that the cells below leave out `axis=1`, so pandas stacks the series end to end into one long series instead of placing them side by side as columns. This is shown below

In [156]:
series20=pd.concat([series06,series07,series08,series09,series10,series11])

In [157]:
series20

3255              South
3256              South
3255    Beaufort County
3256    Beaufort County
3255             9400.0
3256             8962.0
3255              0.283
3256              0.283
3255              0.343
3256              0.319
3255              463.4
3256              686.2
dtype: object
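
The long output above comes from the default `axis=0`. The sketch below contrasts the two settings on two short hand-made series (values copied from the Beaufort County rows); the variable names are only for this example.

```python
import pandas as pd

labels = [3255, 3256]
county = pd.Series(["Beaufort County", "Beaufort County"], index=labels, name="County")
deaths = pd.Series([9400.0, 8962.0], index=labels, name="Premature death")

stacked = pd.concat([county, deaths])               # default axis=0: one long series
side_by_side = pd.concat([county, deaths], axis=1)  # axis=1: one column per series

print(stacked.shape)
print(side_by_side.shape)
```

With `axis=0` the row labels repeat (3255, 3256, 3255, 3256) and the series names are lost, which is why the stacked output no longer says which measure each value belongs to.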

In [171]:
series21=pd.concat([series12,series13,series14,series15,series16,series17])

### Now all we have to do is concat our two data sets together to get a table and compare the results between the two counties! 

We can do this in two ways. We can 
1. Type out all of the series into our pd.concat call, as shown below
2. concat the two stacked series together in order to look at the two counties in one place 

When typing out all the series, make sure to use this format 

pd.concat([series,series,series,series],axis=1,sort=True)

If you want the column names replaced by numbers, you can add the option ignore_index=True. 

pd.concat([series,series,series,series,series], axis=1, ignore_index=True, sort=True)

Since we want to see the column names, we will not use this option. 

- `ignore_index=True` discards the labels along the axis being joined and numbers them 0, 1, 2, and so on. With `axis=1` that means the column names; with `axis=0` it would mean the row labels.
- `sort=True` sorts the labels on the other axis (here, the row labels) when the series being combined do not share the same index. `sort=False` asks pandas to leave them unsorted.
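
To see what these two options do when joining side by side, here is a small sketch with one series per county (values copied from the outputs in this notebook). Henderson County is deliberately listed first so that the effect of `sort=True` on the row labels is visible.

```python
import pandas as pd

beaufort = pd.Series([9400.0, 8962.0], index=[3255, 3256], name="Premature death")
henderson = pd.Series([6873.0, 6755.0], index=[3331, 3332], name="Premature death")

# Keeps the series names as column names; rows that a series lacks become NaN.
named = pd.concat([henderson, beaufort], axis=1, sort=True)

# Replaces the column names with 0, 1, ...
numbered = pd.concat([henderson, beaufort], axis=1, sort=True, ignore_index=True)

print(list(named.index))       # row labels in sorted order
print(list(named.columns))     # series names kept as column names
print(list(numbered.columns))  # column names replaced by numbers
```

Because the two counties have different row labels, each column is filled only on its own county's rows, which matches the blank cells in the Option 1 table.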



## Option 1 

In [180]:
pd.concat([series06,series07,series08,series09,series10,series11,series12,series13,series14,series15,series16,series17],axis=1,sort=True)

Unnamed: 0,Region,County,Premature death,Adult smoking,Adult obesity,Sexually transmitted infections,Region.1,County.1,Premature death.1,Adult smoking.1,Adult obesity.1,Sexually transmitted infections.1
3255,South,Beaufort County,9400.0,0.283,0.343,463.4,,,,,,
3256,South,Beaufort County,8962.0,0.283,0.319,686.2,,,,,,
3331,,,,,,,South,Henderson County,6873.0,0.174,0.219,194.6
3332,,,,,,,South,Henderson County,6755.0,0.174,0.218,211.5


In [187]:
series22=pd.concat([series06,series07,series08,series09,series10,series11,series12,series13,series14,series15,series16,series17],axis=1,sort=True)

In [188]:
series22

Unnamed: 0,Region,County,Premature death,Adult smoking,Adult obesity,Sexually transmitted infections,Region.1,County.1,Premature death.1,Adult smoking.1,Adult obesity.1,Sexually transmitted infections.1
3255,South,Beaufort County,9400.0,0.283,0.343,463.4,,,,,,
3256,South,Beaufort County,8962.0,0.283,0.319,686.2,,,,,,
3331,,,,,,,South,Henderson County,6873.0,0.174,0.219,194.6
3332,,,,,,,South,Henderson County,6755.0,0.174,0.218,211.5


## Option 2

### This time we concat the two stacked series, `series20` and `series21`. With `axis=0`, the second series is placed underneath the first, giving one long column of values for comparison. See below 

In [182]:
pd.concat([series20,series21], axis=0)

3255               South
3256               South
3255     Beaufort County
3256     Beaufort County
3255              9400.0
3256              8962.0
3255               0.283
3256               0.283
3255               0.343
3256               0.319
3255               463.4
3256               686.2
3331               South
3332               South
3331    Henderson County
3332    Henderson County
3331              6873.0
3332              6755.0
3331               0.174
3332               0.174
3331               0.219
3332               0.218
3331               194.6
3332               211.5
dtype: object

Either way, you can now compare your data as a whole and see the differences between the two counties' rates of premature death, adult smoking, adult obesity, and sexually transmitted infections.
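
As an alternative sketch, not the approach used in this notebook, the two counties can also be pulled out with the filtering technique from earlier, using the `isin` method to match either county name. The small dataframe below is a hand-made stand-in with values copied from the outputs above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "County": ["Alamance County", "Beaufort County", "Beaufort County",
                   "Henderson County", "Henderson County"],
        "Premature death": [7123.0, 9400.0, 8962.0, 6873.0, 6755.0],
        "Adult smoking": [0.238, 0.283, 0.283, 0.174, 0.174],
    },
    index=[3243, 3255, 3256, 3331, 3332],
)

# isin returns True for rows whose County is any of the listed names.
wanted = ["Beaufort County", "Henderson County"]
comparison = sample[sample["County"].isin(wanted)]
print(comparison)
```

This keeps both counties in the same columns, so their values can be read straight down each column.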

### You may be wondering how to export this information. 

### Exporting our New Subsets

Once we've finished manipulating our datasets and creating more **usable** or **useful** subsets for further analysis, we can export them as new .csv files, giving us readymade and openly accessible outputs to share with the public on our GitHub repositories.

#### Exporting to .csv file

To do this, we can use the method `.to_csv()`, adding the filename and extension within the parentheses at the end.

So, for example, for our filtered subset we would run `NC_subset.to_csv("NC_subset.csv")`, which will export a `.csv` file to our working directory.

By default, this `.csv` will include a column of index labels, the row numbers that pandas assigned when we read the original file into our notebook using `.read_csv`. 

To leave these out, we can add `index=False` to our statement, which tells pandas not to write those index numbers.

`NC_subset.to_csv("NC_subset.csv", index=False)`

In [193]:
NC_subset.to_csv("NC_subset.csv", index=False)
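
To check what `index=False` changes without writing any files, the sketch below relies on the fact that `.to_csv()` returns the CSV text as a string when no filename is given. The two-row dataframe is a hand-made stand-in for `NC_subset`.

```python
import pandas as pd

subset = pd.DataFrame(
    {"County": ["Alamance County", "Alexander County"],
     "Premature death": [7123.0, 7974.0]},
    index=[3243, 3245],
)

# With no filename, to_csv returns the CSV text instead of writing a file.
with_index = subset.to_csv()
without_index = subset.to_csv(index=False)

print(with_index.splitlines()[0])      # header begins with an empty name for the index column
print(without_index.splitlines()[0])   # header holds only the real column names
```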

### We can also do this with our merged data series. 

In [192]:
series22.to_csv("series22.csv", index=False)