# USA Electricity Generating Exploratory Analysis

In [1]:
import pandas as pd
import numpy as np

df_2018 = pd.read_csv("raw_data/y2018.csv")
# Questions to answer
## Explore the data

## In 2018
# What is the breakdown of capacity by technology
# What is the breakdown of usable electricity by technology
# What is the breakdown of capacity by renewables, fossil fuel, nuclear
# What is the breakdown of capacity by technology by state
# Geolocation of plants by different technologies - nuclear, coal, NG, hydro
## Larger questions
# How does energy makeup change throughout the years


In [2]:
print(df_2018.head(10), "\n")

print(f"Electricity generating technologies in this dataset: {df_2018.Technology.unique()}")

print(f"Number of power plants in this dataset: {df_2018.Plant_ID.nunique()}")

   Nameplate     Technology   Latitude  Longitude State  Plant_ID
0       53.9  Hydroelectric  33.458665 -87.356823    AL         2
1      153.1    Natural Gas  31.006900 -88.010300    AL         3
2      153.1    Natural Gas  31.006900 -88.010300    AL         3
3      403.7           Coal  31.006900 -88.010300    AL         3
4      788.8           Coal  31.006900 -88.010300    AL         3
5      170.1    Natural Gas  31.006900 -88.010300    AL         3
6      170.1    Natural Gas  31.006900 -88.010300    AL         3
7      195.2    Natural Gas  31.006900 -88.010300    AL         3
8      170.1    Natural Gas  31.006900 -88.010300    AL         3
9      170.1    Natural Gas  31.006900 -88.010300    AL         3 

Electricity generating technologies in this dataset: ['Hydroelectric' 'Natural Gas' 'Coal' 'Other' 'Nuclear' 'Wind' 'Solar']
Number of power plants in this dataset: 9062


Columns include the nameplate capacity, the generating technology, the location, and the Plant_ID.

######  Nameplate Capacity
The nameplate capacity is the "rated" capacity of the power plant, but this is only a theoretical maximum. For example, a solar plant could be rated for 10 MW in perfect, sunny conditions, but in reality it generates only a fraction of that on average because the sun doesn't always shine. The ratio of actual output to the nameplate maximum has a name: [Capacity Factor](https://en.wikipedia.org/wiki/Capacity_factor).
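
As a rough illustration of the arithmetic, the sketch below scales a nameplate rating by a capacity factor to get an average output, then multiplies by the hours in a year to get annual energy. The helper names and the 0.25 capacity factor are hypothetical, chosen only to keep the numbers easy to follow; they are not taken from the dataset.

```python
HOURS_PER_YEAR = 8760  # assumption: a non-leap year


def average_output_mw(nameplate_mw, capacity_factor):
    """Average output over the year, in MW (hypothetical helper)."""
    return nameplate_mw * capacity_factor


def annual_energy_mwh(nameplate_mw, capacity_factor, hours=HOURS_PER_YEAR):
    """Energy delivered over the year, in MWh (hypothetical helper)."""
    return average_output_mw(nameplate_mw, capacity_factor) * hours


# A 10 MW solar plant with an illustrative capacity factor of 0.25
solar_avg = average_output_mw(10, 0.25)
solar_energy = annual_energy_mwh(10, 0.25)

print(solar_avg)     # 2.5
print(solar_energy)  # 21900.0
```

So a plant "rated" for 10 MW behaves, on average, like a 2.5 MW plant that runs all the time.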

Below are capacity factors taken from Wikipedia for 2018. Natural gas is assumed to be mostly combined cycle (Cogen), and "Other" is assumed to be biomass, geothermal, etc. Solar is assumed to be photovoltaic (PV).

###### Repeated Rows
Notice rows 1-9 in the output above. They look like repeats, right (especially 5 and 6)? Numerous rows look like duplicates but are actually different modules of one power plant. For example, they could be two separate nuclear reactors at one site or two cogeneration trains at a natural gas plant. For the purposes of this analysis, it is best to sum these rows by plant and technology to get a sense of the total electricity generating capacity at each facility.
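
A minimal sketch of that aggregation, using only the ten rows printed by `head(10)` above (location columns are left out for brevity). Because it sees only those ten rows, the natural gas total here is smaller than the total for the full dataset.

```python
import pandas as pd

# The first ten rows of the 2018 data, as printed above
sample = pd.DataFrame({
    "Nameplate": [53.9, 153.1, 153.1, 403.7, 788.8,
                  170.1, 170.1, 195.2, 170.1, 170.1],
    "Technology": ["Hydroelectric", "Natural Gas", "Natural Gas", "Coal", "Coal",
                   "Natural Gas", "Natural Gas", "Natural Gas", "Natural Gas",
                   "Natural Gas"],
    "Plant_ID": [2, 3, 3, 3, 3, 3, 3, 3, 3, 3],
})

# One row per (plant, technology), with module capacities summed
per_plant = sample.groupby(["Plant_ID", "Technology"], as_index=False)["Nameplate"].sum()
print(per_plant)
```

The ten module-level rows collapse into three: one hydroelectric row for plant 2, plus one coal row and one natural gas row for plant 3.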

In [3]:
capacity_factor_dict = {"Hydroelectric": .428, 
                        "Wind": .374, 
                        "Solar": .261, 
                        "Coal": .54, 
                        "Natural Gas":.5,
                        "Nuclear": .887, 
                        "Other": .5}
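
Since these factors are later applied by looking up each row's technology, a technology missing from the dictionary would silently produce a missing value. The check below is an optional sketch, not part of the original analysis; the `technologies` list is copied from the printed output above, and the variable names are illustrative.

```python
capacity_factors = {"Hydroelectric": .428, "Wind": .374, "Solar": .261,
                    "Coal": .54, "Natural Gas": .5, "Nuclear": .887, "Other": .5}

# Technologies reported by df_2018.Technology.unique() in the output above
technologies = ["Hydroelectric", "Natural Gas", "Coal", "Other",
                "Nuclear", "Wind", "Solar"]

# Technologies with no capacity factor, and factors outside the valid 0-1 range
missing = sorted(set(technologies) - set(capacity_factors))
out_of_range = {k: v for k, v in capacity_factors.items() if not 0 <= v <= 1}

print(missing)       # []
print(out_of_range)  # {}
```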

In [11]:
# Wrapped in a function so the same data cleansing can be reused on other years
def cleanse_df(df, capacity_factors=capacity_factor_dict):
    df = df.groupby(["Plant_ID", "Technology", "Latitude", "Longitude", "State"], as_index=False).sum()
    # Nuclear gets its own class, matching the renewables / fossil fuel / nuclear question above
    electricity_classification_dict = {"Hydroelectric": "Renewable", "Wind": "Renewable", "Solar": "Renewable", "Other": "Renewable",
                        "Nuclear": "Nuclear",
                        "Coal": "Fossil Fuel", "Natural Gas": "Fossil Fuel"}
    df["Classification"] = df["Technology"].map(electricity_classification_dict)
    # Use the capacity_factors argument rather than the global dict so callers can override it
    df["Electricity Output"] = df["Nameplate"] * df["Technology"].map(capacity_factors)
    return df

df_2018_clean = cleanse_df(df_2018)
with pd.option_context('expand_frame_repr', False):
    print(df_2018_clean.columns)
    print(df_2018_clean.head())
    
# If a power plant has two modes of generation (e.g. Coal and NG), the Plant_ID will still be repeated, once per technology

Index(['Plant_ID', 'Technology', 'Latitude', 'Longitude', 'State', 'Nameplate',
       'Classification', 'Electricity Output'],
      dtype='object')
   Plant_ID     Technology   Latitude  Longitude State  Nameplate Classification  Electricity Output
0         2  Hydroelectric  33.458665 -87.356823    AL       53.9      Renewable             23.0692
1         3           Coal  31.006900 -88.010300    AL     1192.5    Fossil Fuel            643.9500
2         3    Natural Gas  31.006900 -88.010300    AL     1377.0    Fossil Fuel            688.5000
3         4  Hydroelectric  32.583889 -86.283056    AL      225.0      Renewable             96.3000
4         7    Natural Gas  34.012800 -85.970800    AL      138.0    Fossil Fuel             69.0000
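
The remaining repeats of `Plant_ID` can be checked directly, and the same cleaned table answers the first question on the list: the breakdown of capacity by technology. The sketch below uses only the five cleaned rows printed above, so the shares are illustrative and not the national figures.

```python
import pandas as pd

# The five cleaned rows printed above (location columns omitted)
clean_head = pd.DataFrame({
    "Plant_ID": [2, 3, 3, 4, 7],
    "Technology": ["Hydroelectric", "Coal", "Natural Gas",
                   "Hydroelectric", "Natural Gas"],
    "Nameplate": [53.9, 1192.5, 1377.0, 225.0, 138.0],
})

# Plants that still appear more than once because they use several technologies
tech_counts = clean_head.groupby("Plant_ID")["Technology"].nunique()
multi_tech_plants = tech_counts[tech_counts > 1].index.tolist()
print(multi_tech_plants)  # [3]

# Capacity by technology, and each technology's share of the total
capacity_by_tech = clean_head.groupby("Technology")["Nameplate"].sum()
share = capacity_by_tech / capacity_by_tech.sum()
print(share.round(3))
```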
