In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from os.path import join
import sys

cwd = os.getcwd()

# Set data path
data_path = join(cwd, '..', '..', 'data')

In [2]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [3]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

# add the 'src' directory as one where we can import modules
src_dir = join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

In [4]:
%aimport data.clean_import
from data.clean_import import import_epa_emissions, import_plant_capacity, import_plant_generation

## Import data
The import functions used here were written based on lessons from the `2 - Explore file imports` notebook. They do all of the processing work, which keeps this notebook cleaner.

In [5]:
epa_path = join(data_path, 'external', 'epa_emissions_2016.txt')
cap_path = join(data_path, 'external', '3_1_Generator_Y2016.xlsx')
gen_path = join(data_path, 'external', 'EIA923_Schedules_2_3_4_5_M_12_2016_Final_Revision.xlsx')

In [6]:
epa = import_epa_emissions(epa_path)
cap = import_plant_capacity(cap_path)
gen = import_plant_generation(gen_path)

## View each dataset

## Join data
Joining (or merging) different datasets on a shared key is a powerful tool. It's a common SQL operation but difficult to implement in Excel (VLOOKUP, anyone?). Methods 2 and 3 are from [this Stack Overflow post](https://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns).

### Method 1: Two independent joins

I'm going to do a "left" merge here, where all rows from the left dataframe will be kept. When no corresponding values exist in the right dataframe (EPA emissions), pandas will insert `np.nan`.

#### Join generation with EPA emissions
I'm using the `pd.merge` function to join two dataframes. One is specified as "left" and the other as "right".
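
A minimal sketch of a left merge with `pd.merge`, using small made-up dataframes rather than the real `gen` and `epa` data. The `_demo` names, the values, and the `net_generation_mwh` column are assumptions for illustration; `plant_id` and `month` are assumed to be the shared keys.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (all values are made up)
gen_demo = pd.DataFrame({
    'plant_id': [1, 1, 2],
    'month': [1, 2, 1],
    'net_generation_mwh': [100.0, 120.0, 80.0],  # hypothetical column name
})
epa_demo = pd.DataFrame({
    'plant_id': [1, 1],
    'month': [1, 2],
    'co2_short_tons': [50.0, 60.0],
})

# Keep every row of the left dataframe; rows with no match in the
# right dataframe get NaN in the right-hand columns
gen_epa_demo = pd.merge(
    left=gen_demo,
    right=epa_demo,
    how='left',
    on=['plant_id', 'month'],
)
print(gen_epa_demo)
```

Plant 2 has no emissions record in the toy data, so its `co2_short_tons` value comes back as `NaN` while the generation row itself is kept.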

#### Join gen_epa with capacity data
The `merge` function can also be used as a method of the dataframe, which is automatically considered the "left" object in the join.
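
A small sketch of the method form, again on made-up data. The `_demo` dataframes and their values are hypothetical, and the example assumes the capacity table has a single row per `plant_id`; if it holds several rows per plant, the merge would repeat each left-hand row once per match.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (values are made up)
gen_epa_demo = pd.DataFrame({
    'plant_id': [1, 1, 2],
    'month': [1, 2, 1],
    'co2_short_tons': [50.0, 60.0, np.nan],
})
cap_demo = pd.DataFrame({
    'plant_id': [1, 2],
    'state': ['TX', 'CA'],
    'nameplate_capacity_mw': [500.0, 250.0],
})

# The calling dataframe (gen_epa_demo) is the "left" side of the join
merged_demo = gen_epa_demo.merge(cap_demo, how='left', on='plant_id')
print(merged_demo)
```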

### Method 2: Chain the `join` method
I'm also going to limit the columns that are kept in some of the dataframes.

In [23]:
epa_keep = [
    'plant_id', 'month', 'gross_load_mwh', 'so2_tons',
    'nox_tons', 'co2_short_tons', 'heat_input_mmbtu'
]

cap_keep = [
    'plant_id', 'state', 'nameplate_capacity_mw', 'summer_capacity_mw',
    'winter_capacity_mw', 'minimum_load_mw', 'technology'
]
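
One way the chained `join` could look, sketched on made-up data. `DataFrame.join` matches on the index of the other dataframe, so the key columns are moved into the index first. The `_demo` names, the shortened keep lists, and the `net_generation_mwh` column are assumptions for illustration, not the notebook's actual cell.

```python
import pandas as pd

# Toy stand-ins for gen, epa and cap (values are made up)
gen_demo = pd.DataFrame({
    'plant_id': [1, 1, 2],
    'month': [1, 2, 1],
    'net_generation_mwh': [100.0, 120.0, 80.0],  # hypothetical column name
})
epa_demo = pd.DataFrame({
    'plant_id': [1, 1],
    'month': [1, 2],
    'co2_short_tons': [50.0, 60.0],
    'unused_column': ['a', 'b'],  # dropped by the keep list below
})
cap_demo = pd.DataFrame({
    'plant_id': [1, 2],
    'state': ['TX', 'CA'],
    'nameplate_capacity_mw': [500.0, 250.0],
})

# Shortened versions of the keep lists, for the toy data only
epa_keep_demo = ['plant_id', 'month', 'co2_short_tons']
cap_keep_demo = ['plant_id', 'state', 'nameplate_capacity_mw']

keys = ['plant_id', 'month']
combined_demo = (
    gen_demo.set_index(keys)
    # index-on-index left join with the monthly emissions
    .join(epa_demo[epa_keep_demo].set_index(keys), how='left')
    .reset_index()
    # match the caller's plant_id column against the capacity index
    .join(cap_demo[cap_keep_demo].set_index('plant_id'), on='plant_id', how='left')
)
print(combined_demo)
```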

### Method 3: Reduce function
This approach might not work well when the joins need different types (inner/left/right) or different key columns, because the same merge call is applied to every pair of dataframes.
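
A rough sketch of the reduce pattern, using `functools.reduce` with `pd.merge` on made-up data. To keep the keys identical across all three dataframes, the toy tables here have one row per plant and share only `plant_id`; the `_demo` names and values are hypothetical.

```python
from functools import reduce

import pandas as pd

# Toy dataframes that all share the same key column (values are made up)
gen_demo = pd.DataFrame({
    'plant_id': [1, 2],
    'net_generation_mwh': [220.0, 80.0],  # hypothetical column name
})
epa_demo = pd.DataFrame({
    'plant_id': [1],
    'co2_short_tons': [110.0],
})
cap_demo = pd.DataFrame({
    'plant_id': [1, 2],
    'state': ['TX', 'CA'],
})

dfs = [gen_demo, epa_demo, cap_demo]

# reduce merges the first two dataframes, then merges that result with
# the third, using the same `on` and `how` at every step
reduced_demo = reduce(
    lambda left, right: pd.merge(left, right, on='plant_id', how='left'),
    dfs,
)
print(reduced_demo)
```

Because the lambda is fixed, every step is a left join on `plant_id`; mixing join types or keys would need a different function for each pair.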

## Export the combined data

In [21]:
out_path = join(data_path, 'processed', 'facility_gen_cap_emissions.csv')