# Initial exploration of each data file
Pandas makes importing data from files easy. But sometimes the file contents are poorly formatted or can hold hidden surprises. Make sure that the data - and data types - are what you expect them to be before starting your analysis.

In [1]:
import pandas as pd
import numpy as np
import os
from os.path import join

cwd = os.getcwd()
data_path = join(cwd, '..', '..', 'data')

I sometimes find it helpful to raise the Pandas display limits for the maximum number of rows and columns.

In [3]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

### Don't write absolute paths
An absolute path is something like `/Users/Home/Documents/GitHub/python-data-analysis-class/data/epa_emissions_2016.txt`. Or in Windows it might be `C:\Users\gschivley\Documents\GitHub\python-data-analysis-class\data\epa_emissions_2016.txt`.

Use relative paths and Python built-in tools to write paths.

## Set file paths

In [6]:
# Paths to each of the data files (epa emissions, eia capacity by generator, and eia generation)

epa_path = join(data_path, 'external', 'epa_emissions_2016.txt')
cap_path = join(data_path, 'external', '3_1_Generator_Y2016.xlsx')
gen_path = join(data_path, 'external', 'EIA923_Schedules_2_3_4_5_M_12_2016_Final_Revision.xlsx')

## Load EPA emissions data
Let's load the file and see what needs to be done to make sure the data is in good shape and accessible.
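
As a minimal sketch of this first look, the snippet below reads a small in-memory table in place of the real file. The column names and values are made up for illustration, and the real text file may need a different `sep` argument.

```python
import io
import pandas as pd

# Hypothetical stand-in for the emissions file; real columns will differ.
sample = io.StringIO(
    "plant_id,month,co2_tons\n"
    "3,1,1200.5\n"
    "3,2,1100.0\n"
    "7,1,300.25\n"
)

# With the real file this would be something like pd.read_csv(epa_path)
epa = pd.read_csv(sample)

print(epa.shape)   # (rows, columns)
print(epa.head())  # first few rows
```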

### Access parts of the dataframe

Look at the column names
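
A short sketch of a few common ways to look at pieces of a dataframe. The dataframe here is a hypothetical example, not the EPA data.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "plant_id": [3, 3, 7],
    "month": [1, 2, 1],
    "co2_tons": [1200.5, 1100.0, 300.25],
})

column_names = df.columns.tolist()  # list of column names
first_rows = df.head(2)             # first two rows
one_column = df["co2_tons"]         # a single column as a Series

# Rows for one plant, restricted to two columns
subset = df.loc[df["plant_id"] == 3, ["month", "co2_tons"]]

print(column_names)
print(subset)
```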

### Data types of each column
Numeric columns will either be `int` or `float`. If a column is of type `object` it is either all strings or a mix of types. Watch out for columns that should be numeric but show up as `object`.
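
To illustrate, here is a small hypothetical dataframe where one stray string turns an otherwise numeric column into `object`. The column names are invented for the example.

```python
import pandas as pd

# Hypothetical data: the blank string in power_factor forces dtype object
df = pd.DataFrame({
    "plant_id": [3, 7, 9],
    "capacity_mw": [10.5, 20.0, 5.25],
    "power_factor": [0.9, " ", 0.85],
})

print(df.dtypes)

# List only the columns that pandas stored as object
object_columns = df.select_dtypes(include="object").columns.tolist()
print(object_columns)
```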

### Basic statistics of the data
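
A quick sketch of summary statistics with `describe`, again on made-up data. By default only numeric columns are summarized; `include="all"` adds the others.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "plant_id": [3, 3, 7],
    "co2_tons": [1200.0, 1100.0, 300.0],
    "state": ["PA", "PA", "OH"],
})

stats = df.describe()                   # count, mean, std, min, quartiles, max
all_stats = df.describe(include="all")  # also summarizes non-numeric columns

print(stats)
```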

## Load capacity data

### Check the column names and data types


### It looks like several columns we would expect to be numeric are `object`

Pandas will only list the type of a column as `int` or `float` if all items can be cast that way. `object` means that either all items are strings (or another non-numeric, non-categorical type) or that the values are a mix of types.

Having numeric columns as `object` is a problem for us because numeric aggregations, such as a `sum` after a `groupby`, won't work correctly on non-numeric columns.

### Finding non-numeric values with code
A search like this can also be done in the Excel or CSV file, but sometimes it's easier to do with code.

Let's start by looking at the `Nameplate Power Factor` column and finding non-numeric entries. I'm using [this stackoverflow post](https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas) as a template for my code below.

First use the `map` method to apply a function to every element of the Series. `map` (for Series), `apply` (for rows or columns of DataFrames), and `applymap` (for every element in a DataFrame; renamed `DataFrame.map` in pandas 2.1) are powerful tools.

We'll use a `lambda` function (which is a way to define functions inline) here as part of the `map` method.
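
A minimal sketch of the idea, using a small hypothetical dataframe in place of the capacity file. The `Plant Code` column and the placeholder values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with two non-numeric entries mixed in
cap = pd.DataFrame({
    "Plant Code": [1, 2, 3, 4],
    "Nameplate Power Factor": [0.95, " ", 0.9, "X"],
})

# True wherever the value is not an int or float
is_non_numeric = cap["Nameplate Power Factor"].map(
    lambda value: not isinstance(value, (int, float))
)

# Keep only the offending rows and list the distinct bad values
non_numeric_rows = cap.loc[is_non_numeric]
bad_values = non_numeric_rows["Nameplate Power Factor"].unique()

print(non_numeric_rows)
print(bad_values)
```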

### Try loading capacity data again
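
One way to do this, assuming the non-numeric entries turn out to be a known placeholder, is to pass that placeholder to the `na_values` parameter when reading the file. The sketch below uses `read_csv` on in-memory text with a hypothetical `.` placeholder; `read_excel` accepts `na_values` as well.

```python
import io
import pandas as pd

# Hypothetical file contents; "." stands in for whatever placeholder
# the search for non-numeric values turned up.
raw = "Plant Code,Nameplate Power Factor\n1,0.95\n2,.\n3,0.9\n"

# First attempt: the placeholder keeps the column from being numeric
first_try = pd.read_csv(io.StringIO(raw))

# Second attempt: treat the placeholder as missing, so the column is float
second_try = pd.read_csv(io.StringIO(raw), na_values=["."])

print(first_try.dtypes)
print(second_try.dtypes)
```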

### Group capacity by plant code
EIA reports generation by fuel type and prime mover type, not necessarily by the actual prime mover. The EPA emissions data file I've included in this repository also groups data to the facility level. So let's group data to the facility level here to make everything easier.
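
A sketch of grouping generator-level rows to the facility level. The data and column names below are assumptions for illustration and may not match the EIA file exactly.

```python
import pandas as pd

# Hypothetical generator-level data: two generators at plant 1, one at plant 2
cap = pd.DataFrame({
    "Plant Code": [1, 1, 2],
    "Technology": ["Coal", "Gas", "Wind"],
    "Nameplate Capacity (MW)": [100.0, 50.0, 20.0],
    "Summer Capacity (MW)": [90.0, 45.0, 18.0],
})

capacity_columns = ["Nameplate Capacity (MW)", "Summer Capacity (MW)"]

# Sum the capacity columns for every plant
plant_cap = cap.groupby("Plant Code")[capacity_columns].sum().reset_index()

print(plant_cap)
```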

### What technology makes up the largest fraction of capacity at each plant?

There is probably a faster/more clever way to do this, but I'm going to:
- Sum capacity by Technology at each plant
- Loop through every plant in a grouped object
- Identify the index with the largest summer capacity value and return the "Technology"
- If no summer capacity is given, `idxmax` will raise an error - use the nameplate capacity instead
- Build a new dataframe with this data
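
The steps above can be sketched roughly as follows. The data and column names are hypothetical. Rather than catching an error from `idxmax`, this version checks for missing summer capacity first, since the behavior of `idxmax` on all-missing values differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: plant 2 has no summer capacity reported
cap = pd.DataFrame({
    "Plant Code": [1, 1, 1, 2, 2],
    "Technology": ["Coal", "Gas", "Gas", "Wind", "Solar"],
    "Nameplate Capacity (MW)": [100.0, 60.0, 70.0, 30.0, 10.0],
    "Summer Capacity (MW)": [90.0, 55.0, 65.0, np.nan, np.nan],
})
capacity_columns = ["Nameplate Capacity (MW)", "Summer Capacity (MW)"]

# Sum capacity by Technology at each plant; min_count=1 keeps an
# all-missing group as NaN instead of turning it into 0
tech_cap = cap.groupby(["Plant Code", "Technology"], as_index=False)[
    capacity_columns
].sum(min_count=1)

records = []
for plant_code, group in tech_cap.groupby("Plant Code"):
    column = "Summer Capacity (MW)"
    if group[column].isna().all():
        # No summer capacity at this plant: fall back to nameplate
        column = "Nameplate Capacity (MW)"
    top_idx = group[column].idxmax()
    records.append({
        "Plant Code": plant_code,
        "Technology": group.loc[top_idx, "Technology"],
    })

# New dataframe with the dominant technology at each plant
primary_tech = pd.DataFrame(records)
print(primary_tech)
```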

## Load generation data

### Melt generation data to tidy format and groupby facility
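
A minimal sketch of the melt-then-group pattern, assuming a wide table with one column per month. The column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format generation data (one column per month)
gen = pd.DataFrame({
    "Plant Id": [1, 1, 2],
    "Reported Fuel Type Code": ["NG", "BIT", "WND"],
    "January": [10.0, 5.0, 3.0],
    "February": [12.0, 4.0, 2.0],
})

# Tidy format: one row per plant, fuel, and month
tidy = gen.melt(
    id_vars=["Plant Id", "Reported Fuel Type Code"],
    var_name="month",
    value_name="generation_mwh",
)

# Total generation per facility and month
facility_gen = tidy.groupby(["Plant Id", "month"], as_index=False)[
    "generation_mwh"
].sum()

print(facility_gen)
```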

### Convert month columns to integer values
The EPA emissions data also has a `month` column but the values are integers. We'll use a built-in list of months from the `calendar` module to create a mapping of names to integers.
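
A sketch of that mapping, assuming the month column holds full English month names. `calendar.month_name` follows the current locale, and its first entry is an empty string, which is skipped here.

```python
import calendar
import pandas as pd

# Map month names to integers, e.g. "January" to 1; index 0 is an empty string
month_to_int = {
    name: number
    for number, name in enumerate(calendar.month_name)
    if name
}

# Hypothetical tidy data with month names
tidy = pd.DataFrame({"month": ["January", "February", "December"]})
tidy["month"] = tidy["month"].map(month_to_int)

print(tidy)
```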

### What fuel is used to generate the most electricity at each plant?
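
One possible approach, sketched on hypothetical data: sum generation by fuel at each plant, then use a grouped `idxmax` to pick the row with the largest total. This is a vectorized alternative to looping over plants.

```python
import pandas as pd

# Hypothetical tidy generation data
gen = pd.DataFrame({
    "plant_id": [1, 1, 1, 2, 2],
    "fuel": ["NG", "BIT", "NG", "WND", "SUN"],
    "generation_mwh": [10.0, 25.0, 12.0, 3.0, 8.0],
})

# Total generation by fuel at each plant
fuel_gen = gen.groupby(["plant_id", "fuel"], as_index=False)[
    "generation_mwh"
].sum()

# Index label of the largest total within each plant
top_index = fuel_gen.groupby("plant_id")["generation_mwh"].idxmax()
primary_fuel = fuel_gen.loc[top_index, ["plant_id", "fuel"]].reset_index(drop=True)

print(primary_fuel)
```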