# Exercise Solutions: NOTE FOR POSTING
# Queries into the state of electricity production and CO2 emissions in the United States

In data science, we often need to understand the idiosyncrasies of a dataset and how they relate to the questions we are trying to answer, and then use that understanding to decide which approach, such as machine learning, to apply to achieve our goal. This exercise provides practice in exploring a dataset and answering questions that might arise from applications related to the data.

**Data**. The data for this problem can be found in the `data` folder. The filename is `egrid2016.csv`. This dataset is the U.S. Environmental Protection Agency's (EPA) [Emissions & Generation Resource Integrated Database (eGRID)](https://www.epa.gov/energy/emissions-generation-resource-integrated-database-egrid), which contains information about all power plants in the United States: the amount of electricity they generate, the fuel they use, the emissions they produce, the location of each plant, and many more quantities. We'll be using a subset of those data.

The fields we'll be using include:

|field    |description|
|:-----   |:-----|
|SEQPLT16 |eGRID2016 Plant file sequence number (the index)| 
|PSTATABB |Plant state abbreviation|
|PNAME    |Plant name |
|LAT      |Plant latitude |
|LON      |Plant longitude|
|PLPRMFL  |Plant primary fuel |
|NAMEPCAP |Plant nameplate capacity (Megawatts MW)|
|PLNGENAN |Plant annual net generation (Megawatt-hours MWh)|
|PLCO2EQA |Plant annual CO2 equivalent emissions (tons)|

For more details on the data, you can refer to the [eGRID technical documents](https://www.epa.gov/sites/default/files/2021-02/documents/egrid2019_technical_guide.pdf). For example, you may want to review page 45 and the section "Plant Primary Fuel (PLPRMFL)", which gives the full names of the fuel types, including WND for wind, NG for natural gas, and BIT for bituminous coal. Codebooks like this are common in data science when descriptive variable names would otherwise be onerously long.

In [28]:
import pandas as pd

egrid = pd.read_csv('./data/egrid2016.csv')

In [2]:
egrid

|Unnamed: 0|SEQPLT16|PSTATABB|PNAME|LAT|LON|PLPRMFL|NAMEPCAP|PLNGENAN|PLCO2EQA|
|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|
|0|1|AK|7-Mile Ridge Wind Project|63.210689|-143.247156|WND|1.8|0.00|0.00|
|1|2|AK|Agrium Kenai Nitrogen Operations|60.673200|-151.378400|NG|21.6|0.00|0.00|
|2|3|AK|Alakanuk|62.683300|-164.654400|DFO|2.6|1213.00|1049.86|
|3|4|AK|Allison Creek Hydro|61.084444|-146.353333|WAT|6.5|881.00|0.00|
|4|5|AK|Ambler|67.087980|-157.856719|DFO|1.1|1316.00|1087.88|
|...|...|...|...|...|...|...|...|...|...|
|9704|9705|WY|Western Sugar Coop - Torrington|42.046900|-104.186300|NG|2.0|6893.79|2377.68|
|9705|9706|WY|Wygen I|44.285800|-105.383300|SUB|90.0|710524.00|926215.62|
|9706|9707|WY|Wygen II|44.291900|-105.381100|SUB|95.0|734354.00|909354.80|
|9707|9708|WY|Wygen III|44.291900|-105.380600|SUB|116.0|821699.00|975147.43|
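
An `Unnamed: 0` column like the one in the output above typically appears when a CSV was saved together with its index and then read back without naming an index column. The following is a minimal sketch of that behavior, assuming a small made-up CSV string (`toy_csv`, with hypothetical plant names) in place of the real file:

```python
import io
import pandas as pd

# Toy CSV for illustration only; it mimics a file whose first,
# unnamed column holds a previously saved index.
toy_csv = io.StringIO(
    ",SEQPLT16,PSTATABB,PNAME\n"
    "0,1,AK,Plant A\n"
    "1,2,AK,Plant B\n"
)

# Default read: the unnamed first column becomes a data column 'Unnamed: 0'
with_extra = pd.read_csv(toy_csv)

# Rewind the buffer and read again, using the first column as the index
toy_csv.seek(0)
clean = pd.read_csv(toy_csv, index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

None of the queries in this exercise depend on that column, so either way of loading the data gives the same answers.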


**Your objective**. For this dataset, your goal is to answer the following questions about electricity generation in the United States by constructing appropriate queries of the data:

1. Which power plant generated the most energy in 2016 (measured in MWh)? Since there is a column with the annual generation ('PLNGENAN'), consider how this can be used to answer this question.

In [3]:
egrid.sort_values(by='PLNGENAN', ascending=False)['PNAME'].iloc[0]

'Palo Verde'

2. Which power plant produced the most CO2 emissions (measured in tons)? 

In [4]:
egrid.sort_values(by='PLCO2EQA', ascending=False)['PNAME'].iloc[0]

'James H Miller Jr'

3. In what state is the plant with the most CO2 emissions (question 2) located?

In [5]:
egrid.sort_values(by='PLCO2EQA', ascending=False)['PSTATABB'].iloc[0]

'AL'

4. What is the primary fuel of the plant with the most CO2 emissions?

In [6]:
egrid.sort_values(by='PLCO2EQA', ascending=False)['PLPRMFL'].iloc[0]

'SUB'
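
Questions 2 through 4 sort the same column three times to read three fields from one row. As an alternative sketch, the row can be located once with `idxmax()` and its fields read from the result. The data frame `toy` below is made up for illustration and does not contain real eGRID values:

```python
import pandas as pd

# Toy data for illustration only (not real eGRID values)
toy = pd.DataFrame({
    'PNAME':    ['Plant A', 'Plant B', 'Plant C'],
    'PSTATABB': ['AK', 'AL', 'WY'],
    'PLPRMFL':  ['NG', 'SUB', 'WAT'],
    'PLCO2EQA': [10.0, 500.0, 0.0],
})

# idxmax returns the index label of the row with the largest value
top = toy.loc[toy['PLCO2EQA'].idxmax()]

# One row lookup answers the name, state, and fuel questions together
print(top['PNAME'], top['PSTATABB'], top['PLPRMFL'])
```

Note that `idxmax()` returns only the first row when several rows tie for the maximum, which is the same limitation as taking `.iloc[0]` after sorting.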

5. What is the name of the northern-most power plant in the United States? (hint: latitude is the quantity that measures how far north or south a location is across the globe)

In [7]:
egrid.sort_values(by='LAT', ascending=False)['PNAME'].iloc[0]

'Barrow'

6. In what state is the northern-most power plant in the United States located?

In [8]:
egrid.sort_values(by='LAT', ascending=False)['PSTATABB'].iloc[0]

'AK'

7. Which state has the largest number of hydroelectric plants? In this case, each power plant counts once so regardless of how large the power plant is, we want to determine which state has the most of them. Note the primary fuel for hydroelectric plants is listed as water in the documentation.

In [9]:
hydro = egrid[egrid['PLPRMFL']=='WAT']
hydro_grouped = hydro.groupby(by='PSTATABB').count().reset_index()
hydro_grouped.sort_values(by='PNAME', ascending=False)["PSTATABB"].iloc[0]

'CA'

8. How many hydroelectric plants does the state with the most (which you identified in the last question) have?

In [10]:
hydro_grouped.sort_values(by='PNAME', ascending=False)["PNAME"].iloc[0]

264
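
Questions 7 and 8 can also be approached with `value_counts()`, which returns the number of rows per state in one step. This is a sketch on made-up data (`toy` is hypothetical), not the method used above:

```python
import pandas as pd

# Toy data for illustration only (not real eGRID values)
toy = pd.DataFrame({
    'PSTATABB': ['CA', 'CA', 'WA', 'CA', 'WA', 'OR'],
    'PLPRMFL':  ['WAT', 'WAT', 'WAT', 'WAT', 'NG', 'WAT'],
})

# Keep the state column for hydroelectric plants, then count rows per state
hydro_counts = toy.loc[toy['PLPRMFL'] == 'WAT', 'PSTATABB'].value_counts()

top_state = hydro_counts.idxmax()      # state with the most hydro plants
top_count = int(hydro_counts.max())    # how many plants that state has

print(top_state, top_count)
```

Because each row is one plant, counting rows is the same as counting plants, regardless of plant size.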

9. Which state(s) generated the most *energy* (MWh) using coal? If there is more than one, list the state abbreviations in alphabetical order, separated with commas (but no spaces). You may also want to explore the documentation for the `pandas` `isin()` method. Note: the eGRID documentation lists multiple types of coal; be sure to include each type.

In [11]:
coal = ['BIT','LIG','RC','SUB','WC']
# Keep only plants whose primary fuel is one of the coal types
coal_plants = egrid[egrid['PLPRMFL'].isin(coal)]
# Total annual net generation (MWh) per state, summing only the column we need
coal_plants_grouped = coal_plants.groupby(by='PSTATABB', as_index=False)['PLNGENAN'].sum()
coal_plants_grouped.sort_values(by='PLNGENAN', ascending=False)['PSTATABB'].iloc[0]

'TX'

10. How much energy (in MWh) did the coal plants in the state(s) identified in question 9 generate in total? Please round to the nearest whole number.

In [12]:
coal_plants_grouped.sort_values(by='PLNGENAN', ascending=False)['PLNGENAN'].iloc[0]

122545095.0
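
Taking `.iloc[0]` after sorting returns a single state even if several states tie for the maximum, while question 9 allows for more than one. The sketch below, on made-up data with a deliberate tie, shows one way to collect every state that matches the maximum and format the answer as requested. It assumes an exact equality comparison is acceptable for the totals; the names `by_state`, `leaders`, `answer`, and `total` are illustrative:

```python
import pandas as pd

# Toy data for illustration only (not real eGRID values)
toy = pd.DataFrame({
    'PSTATABB': ['TX', 'TX', 'IN', 'WY', 'AK'],
    'PLPRMFL':  ['LIG', 'SUB', 'BIT', 'SUB', 'NG'],
    'PLNGENAN': [100.0, 50.0, 150.0, 90.0, 999.0],
})
coal = ['BIT', 'LIG', 'RC', 'SUB', 'WC']

# Total coal generation per state (the NG plant in AK is excluded)
by_state = toy[toy['PLPRMFL'].isin(coal)].groupby('PSTATABB')['PLNGENAN'].sum()

# Keep every state whose total equals the maximum, not just the first one
leaders = by_state[by_state == by_state.max()]

# Alphabetical order, comma-separated, no spaces
answer = ','.join(sorted(leaders.index))

# Rounded total for the leading state(s)
total = int(round(leaders.iloc[0]))

print(answer, total)
```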

11. Which states have EXACTLY 1 coal plant? List the state abbreviations in alphabetical order, separated with commas (but no spaces).

In [13]:
coal = ['BIT','LIG','RC','SUB','WC']
coal_plants = egrid[egrid['PLPRMFL'].isin(coal)]
coal_plants_grouped = coal_plants.groupby(by='PSTATABB').count().reset_index().drop(columns=['PNAME','LAT','LON','PLPRMFL','SEQPLT16'])
# Keep states with exactly one coal plant, then sort the abbreviations alphabetically
one_coal_plant = coal_plants_grouped[coal_plants_grouped['PLNGENAN']==1]
one_coal_plant['PSTATABB'].sort_values().values

array(['DE', 'HI', 'ID', 'NH', 'OR', 'SD'], dtype=object)

In [31]:
# If we wanted EXACTLY 0 coal plants:
coal_plants = egrid[egrid['PLPRMFL'].isin(coal)]
coal_states = coal_plants['PSTATABB'].unique()
non_coal_plant_states = egrid[~egrid['PSTATABB'].isin(coal_states)]
non_coal_plant_states['PSTATABB'].unique()

array(['DC', 'ME', 'RI', 'VT'], dtype=object)
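
The "exactly 1" and "exactly 0" cases can also be handled by one computation: summing a boolean "is coal" flag per state keeps states with no coal plants in the result, with a count of zero. This is a sketch on made-up data, with illustrative names (`is_coal`, `coal_counts`), rather than the approach used in the cells above:

```python
import pandas as pd

# Toy data for illustration only (not real eGRID values)
toy = pd.DataFrame({
    'PSTATABB': ['SD', 'TX', 'TX', 'VT', 'DE', 'DE'],
    'PLPRMFL':  ['SUB', 'LIG', 'SUB', 'WAT', 'BIT', 'NG'],
})
coal = ['BIT', 'LIG', 'RC', 'SUB', 'WC']

# True for coal plants, False otherwise
is_coal = toy['PLPRMFL'].isin(coal)

# Summing booleans per state counts coal plants; states without coal get 0.
# groupby sorts the state labels, so the results come out alphabetically.
coal_counts = is_coal.groupby(toy['PSTATABB']).sum()

exactly_one = ','.join(coal_counts[coal_counts == 1].index)
exactly_zero = ','.join(coal_counts[coal_counts == 0].index)

print(exactly_one)
print(exactly_zero)
```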

12. Which primary fuel produced the *most* CO2 emissions in the United States? We would like to compare natural gas, coal, oil, and renewables but the current categories are much more specific than that. As a first step, group the data as shown below, replacing the existing labels with the replacements suggested. For example, BIT and LIG should be replaced with COAL.
- COAL = BIT, LIG, RC, SUB, WC
- OIL = DFO, JF, KER, RFO, WO
- GAS = BFG, COG, LFG, NG, OG, PG, PRG 
- RENEW = GEO, SUN, WAT, WDL, WDS, WND

You may want to create a function that does this replacement prior to running your code. You can check whether or not it was successful by verifying that each of the values that should be replaced has been replaced - check that before moving on with the question.

You will want to use 'PLCO2EQA' to answer this question as it's the quantity of emissions each plant generates.

In [14]:
replacements = {
    'COAL': ['BIT', 'LIG', 'RC', 'SUB', 'WC'],
    'OIL': ['DFO', 'JF', 'KER', 'RFO', 'WO'],
    'GAS': ['BFG', 'COG', 'LFG', 'NG', 'OG', 'PG', 'PRG'],
    'RENEW': ['GEO','SUN','WAT','WDL','WDS','WND']
}

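def replace(df, replacement_dict):
    # Replace each specific fuel code with its broader group label
    for key in replacement_dict:
        # Use the dictionary passed in, not the global `replacements`
        df = df.replace(to_replace = replacement_dict[key], value = key)
    return df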

egrid = replace(egrid, replacements)
egrid

|Unnamed: 0|SEQPLT16|PSTATABB|PNAME|LAT|LON|PLPRMFL|NAMEPCAP|PLNGENAN|PLCO2EQA|
|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|:-----|
|0|1|AK|7-Mile Ridge Wind Project|63.210689|-143.247156|RENEW|1.8|0.00|0.00|
|1|2|AK|Agrium Kenai Nitrogen Operations|60.673200|-151.378400|GAS|21.6|0.00|0.00|
|2|3|AK|Alakanuk|62.683300|-164.654400|OIL|2.6|1213.00|1049.86|
|3|4|AK|Allison Creek Hydro|61.084444|-146.353333|RENEW|6.5|881.00|0.00|
|4|5|AK|Ambler|67.087980|-157.856719|OIL|1.1|1316.00|1087.88|
|...|...|...|...|...|...|...|...|...|...|
|9704|9705|WY|Western Sugar Coop - Torrington|42.046900|-104.186300|GAS|2.0|6893.79|2377.68|
|9705|9706|WY|Wygen I|44.285800|-105.383300|COAL|90.0|710524.00|926215.62|
|9706|9707|WY|Wygen II|44.291900|-105.381100|COAL|95.0|734354.00|909354.80|
|9707|9708|WY|Wygen III|44.291900|-105.380600|COAL|116.0|821699.00|975147.43|
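
The question suggests verifying that every specific code was replaced before moving on. The sketch below shows one way to do that check, and also limits the replacement to the `PLPRMFL` column so that a value in another column that happens to match a fuel code is left alone. The data frame `toy`, the plant names, and the helper names `code_to_group` and `leftover` are hypothetical:

```python
import pandas as pd

replacements = {
    'COAL': ['BIT', 'LIG', 'RC', 'SUB', 'WC'],
    'OIL': ['DFO', 'JF', 'KER', 'RFO', 'WO'],
    'GAS': ['BFG', 'COG', 'LFG', 'NG', 'OG', 'PG', 'PRG'],
    'RENEW': ['GEO', 'SUN', 'WAT', 'WDL', 'WDS', 'WND'],
}

# Invert the dictionary: each specific code maps to its group label
code_to_group = {code: group
                 for group, codes in replacements.items()
                 for code in codes}

# Toy data for illustration only (not real eGRID values)
toy = pd.DataFrame({
    'PNAME':   ['NG', 'Plant B', 'Plant C'],  # hypothetical name that looks like a code
    'PLPRMFL': ['WND', 'NG', 'NUC'],          # NUC is outside the four groups
})

# Replace only within the fuel column so other columns are untouched
toy['PLPRMFL'] = toy['PLPRMFL'].replace(code_to_group)

# Verification: none of the specific codes should remain in the fuel column
leftover = set(toy['PLPRMFL']) & set(code_to_group)

print(toy['PLPRMFL'].tolist())
print(leftover)
```

An empty `leftover` set indicates that every code listed in the dictionary was replaced; codes outside the four groups are left unchanged.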


In [15]:
# Sum CO2-equivalent emissions (tons) within each fuel group
co2emitters = egrid.groupby(by='PLPRMFL', as_index=False)['PLCO2EQA'].sum()
result = co2emitters.sort_values(by='PLCO2EQA', ascending=False)
result['PLPRMFL'].iloc[0]

'COAL'