# Creating a subset from a data source

# Overview


You will be using Python to import and filter the data set "air emissions 2021" provided by the Environmental Protection Agency (EPA).

>The finished product will be a *data subset* that displays **carbon dioxide (CO2),** **methane (CH4),** and **nitrous oxide (N2O) emissions** in the **state where emissions occurred** in the year 2021.



>This new data set is specific to the *southern states* of the United States, because the data shows that these states tend to have the highest emissions in the country. The subset is useful because it lets the user navigate the data more easily.

---

# Getting Started

1. The best way to set up persistent access to your data in Colab is to **mount your Google Drive** in the notebook. The .csv files you work with are then stored in Drive and stay available between sessions.
>This can be done by running the following code:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


2. Create a folder in the file explorer on your device for this project.
>You can name it whatever is easy to remember. This step will help with organizing the data in a central location.
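3. Download the **air emissions data set** as a .csv file from the [EPA website](https://www.epa.gov/ghgreporting).
>Drag the file into the folder you created. To read it with the path used later in this guide, the file also needs to be in the `Colab Notebooks` folder of your Google Drive.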
4. Import the Pandas and NumPy packages as seen below.
>Include `as pd` after the Pandas import statement so that its functions are shorter to call later. The same applies to `as np` for the NumPy package.

In [None]:
import numpy as np
import pandas as pd

5. Read the file with Pandas and display the data to make sure it is working properly.
>The file can be read with the `pd.read_csv()` function, with the *file name* in parentheses and the result assigned to `df`.
>Make sure the file name written in the code exactly matches the name of the file, including the .csv extension.
6. Once the function call and file path are typed in, run the cell to read the file.

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/air_emissions_2021.csv')

7. Next, you should be able to view the full data set by running `df` on its own in a cell.
>If you are experiencing issues, double check the file name you put in the `pd.read_csv()` function to ensure it perfectly matches the file name in the folder you created.

In [None]:
df

Unnamed: 0,Facility Id,FRS Id,Facility Name,State where Emissions Occur,Reported City,Reported State,Reported Zip Code,Reported Address,Reported County,Reported Latitude,Reported Longitude,Total reported direct emissions from Local Distribution Companies,CO2 emissions (non-biogenic),Methane (CH4) emissions,Nitrous Oxide (N2O) emissions,Does the facility employ continuous emissions monitoring?
0,1008026,1.100710e+11,Ameren Illinois,IL,Peoria,IL,61602,300 Liberty Street,PEORIA COUNTY,40.690450,-89.592400,76659.696,1102.6,75556.50,0.596,N
1,1004034,1.100710e+11,Ameren Missouri,MO,St. Louis,MO,63103,1901 Chouteau Avenue,ST. LOUIS CITY,38.620667,-90.211086,14890.650,17.9,14872.75,,N
2,1007872,1.100710e+11,Atlanta Gas Light Company,GA,Atlanta,GA,30309,Ten Peachtree Place,FULTON COUNTY,33.797114,-84.380489,200092.900,11793.4,188299.50,,N
3,1004794,1.100700e+11,Atmos Energy Corporation - Colorado,CO,Dallas,TX,75240,5430 LBJ Freeway,DALLAS COUNTY,32.925488,-96.816137,16014.800,19.3,15995.50,,N
4,1001388,1.100710e+11,Atmos Energy Corporation - Kansas,KS,Dallas,TX,75240,5430 LBJ Freeway,DALLAS COUNTY,32.925488,-96.816137,23302.850,28.1,23274.75,,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,1007850,1.100710e+11,Washington Gas Light Company (VA),VA,Springfield,VA,22151,6801 Industrial Rd,FAIRFAX COUNTY,38.798440,-77.179300,49393.100,72.6,49320.50,,N
161,1011115,1.100110e+11,"West Texas Gas Utility, LLC",TX,Midland,TX,79705,303 Veterans Airpark Ln. Ste 5000,MIDLAND COUNTY,31.999060,-102.076880,31463.650,37.9,31425.75,,N
162,1002025,1.100710e+11,Wisconsin Power & Light Gas Distribution,WI,Madison,WI,53718,4902 North Biltmore Lane,DANE COUNTY,43.152170,-89.295360,22191.100,162.6,22028.50,,N
163,1002250,1.100710e+11,Wisconsin Public Service Corporation,WI,Green Bay,WI,54307,700 North Adams,BROWN COUNTY,44.518347,-88.011616,39715.648,444.6,39270.75,0.298,N
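If the file cannot be found, it can help to list the .csv files that actually exist in the folder and compare the names character by character. The following is a minimal sketch of that check. The helper name `list_csv_files` is hypothetical, and a temporary folder stands in for the Drive folder so the example runs anywhere.

```python
from pathlib import Path
import tempfile


def list_csv_files(folder):
    """Return the names of all .csv files directly inside `folder`, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.csv"))


# A temporary folder stands in for the Drive folder (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "air_emissions_2021.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "notes.txt").write_text("not a csv")
    found = list_csv_files(tmp)
    print(found)
```

In Colab, the same helper could be pointed at the mounted folder, for example `list_csv_files('/content/gdrive/MyDrive/Colab Notebooks')`, assuming the Drive is mounted at `/content/gdrive` as shown earlier.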


8. It can also be helpful to use `df.dtypes` to check the data type of each column, or `df.iloc[:, n]` (where `n` is the column's position, counting from 0) to view the values of one column without the excess data.

In [None]:
df.dtypes

Unnamed: 0,0
Facility Id,int64
FRS Id,float64
Facility Name,object
State where Emissions Occur,object
Reported City,object
Reported State,object
Reported Zip Code,int64
Reported Address,object
Reported County,object
Reported Latitude,float64


In [None]:
df.iloc[:, 12]  # all rows of the column at position 12 (counting from 0)

Unnamed: 0,CO2 emissions (non-biogenic)
0,1102.6
1,17.9
2,11793.4
3,19.3
4,28.1
...,...
160,72.6
161,37.9
162,162.6
163,444.6
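Selecting a column by position with `.iloc` can be cross-checked against selecting it by name. The sketch below uses a small made-up table, so the column names `state`, `co2`, and `ch4` are placeholders rather than the EPA headers.

```python
import pandas as pd

# Small made-up table; the column names are placeholders, not the EPA headers.
demo = pd.DataFrame({
    "state": ["TN", "NY", "TX"],
    "co2": [29.4, 617.5, 3.2],
    "ch4": [100.0, 250.0, 75.0],
})

by_position = demo.iloc[:, 1]            # second column, counted from zero
by_name = demo["co2"]                    # the same column, selected by label
position = demo.columns.get_loc("co2")   # look up a column's position by name

print(position)
print(by_position.equals(by_name))
```

Looking up the position with `columns.get_loc` avoids counting columns by hand in a wide table.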


---

# Creating a data subset

Now that you have all the data, we can begin by isolating the values in the columns we want to include in our subset.
1. We want to first determine which *states have the highest levels of emissions*.
>To do this, we will use `.value_counts()`, which counts how many rows (reporting facilities) list each state as the place where emissions occurred.
>Again, make sure that you are typing the column names exactly as they appear in the data to avoid errors when running the code.

In [None]:
df["State where Emissions Occur"].value_counts()

State where Emissions Occur,count
TN,11
NY,8
IL,7
PA,7
IN,6
VA,6
TX,6
KY,6
LA,5
MO,5


> Of the states among the highest value counts, **TN, VA, TX, and KY** are located in the south.
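Keep in mind that `.value_counts()` counts rows, not emission amounts. As a point of comparison only (this is not the method used in this guide), the sketch below contrasts row counts with summed emissions per state. The table and its column names `state` and `co2` are made up for illustration.

```python
import pandas as pd

# Made-up data: TN has the most rows, NY has the largest total.
demo = pd.DataFrame({
    "state": ["TN", "TN", "TN", "NY", "NY", "TX"],
    "co2": [1.0, 2.0, 3.0, 500.0, 100.0, 50.0],
})

# Number of rows (facilities) per state, most frequent first.
facility_counts = demo["state"].value_counts()

# Total CO2 per state, largest first.
total_co2 = demo.groupby("state")["co2"].sum().sort_values(ascending=False)

print(facility_counts)
print(total_co2)
```

In this toy data the two rankings disagree, which shows why the choice between counting and summing matters when deciding which states to keep.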

2. We can store these states in a list and give it an easy-to-remember variable name.

In [None]:
southern_states = ['TN', 'VA', 'TX', 'KY']

3. Next we will isolate the `CO2 emissions (non-biogenic) ` data using the following code. Note that the column name, as written in the code, ends with a space inside the quotes.
>The code keeps only the rows where the emissions occur in the southern states, then selects and returns the CO2 emissions column for those rows.

In [None]:
result = df[df['State where Emissions Occur'].isin(southern_states)]["CO2 emissions (non-biogenic) "]
result

Unnamed: 0,CO2 emissions (non-biogenic)
5,29.4
8,21.5
9,617.5
10,3.2
14,1.7
20,35.1
25,2.0
28,200.4
31,11.2
36,1.8


**Boom!** Just like that, your dataset has been narrowed down drastically.
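The same filter can return several columns at once, for example the state together with each gas. The following is a sketch under assumptions: the table is made up and the short column names are placeholders, so the real EPA headers would need to be substituted exactly as they appear in `df.columns`.

```python
import pandas as pd

# Made-up table; column names are placeholders for the real EPA headers.
demo = pd.DataFrame({
    "state": ["TN", "NY", "TX", "KY", "IL"],
    "co2": [29.4, 617.5, 3.2, 1.7, 1102.6],
    "ch4": [100.0, 250.0, 75.0, 40.0, 900.0],
    "n2o": [None, 0.3, None, None, 0.596],
})
southern_states = ["TN", "VA", "TX", "KY"]

# Boolean mask: True for rows whose state is in the list.
mask = demo["state"].isin(southern_states)

# .loc takes the row mask first, then the list of columns to keep.
subset = demo.loc[mask, ["state", "co2", "ch4", "n2o"]]
print(subset)
```

Keeping the state column in the result means each emission value can still be traced back to its state after export.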

---

# Exporting

Exporting uses the `.to_csv()` method.
1. The name of the file goes inside the parentheses along with `index=False`. This prevents Pandas from writing its index labels to the file as an extra column.
>Make sure you name the file something that will be easy to remember.

In [None]:
result.to_csv('/content/gdrive/MyDrive/Colab Notebooks/State_CO2_emissions.csv', index=False)

2. This file should now appear inside the folder you created for this project.

>If it is not appearing as a .csv file, double check the name you established for the file and make sure it includes the .csv extension at the end.
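One way to confirm the export worked is to check that the file exists and read it back. The sketch below does this with a made-up `result` series and a temporary folder in place of the Drive path, so the values and the column name `co2` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the filtered result; the index mimics leftover row labels.
result = pd.Series([29.5, 21.5, 617.5], index=[5, 8, 9], name="co2")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "State_CO2_emissions.csv"
    result.to_csv(out_path, index=False)   # index=False leaves out the row labels
    file_exists = out_path.exists()
    check = pd.read_csv(out_path)          # read the file back to inspect it

print(file_exists)
print(check)
```

The file read back contains only the data column, which confirms that the original row labels 5, 8, and 9 were not written.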

## Congratulations on successfully creating a new compiled subset of data!