# Initial exploration of each data file
Pandas makes importing data from files easy, but file contents are sometimes poorly formatted or hold hidden surprises. Make sure that the data - and data types - are what you expect them to be before starting your analysis.

In [1]:
import pandas as pd
import numpy as np
import os
from os.path import join

cwd = os.getcwd()
data_path = join(cwd, '..', '..', 'data')

I sometimes find it helpful to change the Pandas display options for the maximum number of rows and columns shown.

In [2]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

### Don't write absolute paths
An absolute path is something like `/Users/Home/Documents/GitHub/python-data-analysis-class/data/epa_emissions_2016.txt`. Or in Windows it might be `C:\Users\gschivley\Documents\GitHub\python-data-analysis-class\data\epa_emissions_2016.txt`.

Use relative paths and Python built-in tools to write paths.

In [4]:
cwd

'/Users/Home/Documents/GitHub/python-data-analysis-class/notebooks/Pandas'

In [3]:
data_path

'/Users/Home/Documents/GitHub/python-data-analysis-class/notebooks/Pandas/../../data'
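The `..` segments in `data_path` are harmless, but they can be collapsed if you want a tidier string. A minimal sketch, using a hypothetical working directory (`project/notebooks/Pandas`) rather than the real one:

```python
from os.path import join, normpath

# Hypothetical working directory, used only for illustration
cwd = join('project', 'notebooks', 'Pandas')
data_path = join(cwd, '..', '..', 'data')

# normpath collapses the '..' segments as plain string manipulation;
# it does not check whether the folders exist
clean_path = normpath(data_path)
print(clean_path)
```

Because `join` and `normpath` use the separator of the operating system the code runs on, the same lines work on macOS, Linux, and Windows.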

## Define function to clean column names

In [15]:
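def clean_columns(columns):
    'Remove special characters and convert to snake case'
    # regex is set explicitly: pandas 2.0 changed the default of
    # str.replace from regex=True to regex=False
    clean = (columns.str.lower()
                    .str.replace(r'[^0-9a-zA-Z\-]+', ' ', regex=True)
                    .str.replace('-', '', regex=False)
                    .str.strip()
                    .str.replace(' ', '_', regex=False))
    return clean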
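To see what the function does, here is a small sketch that applies it to a few made-up column names. The raw names are assumptions chosen for illustration, not copied from the data files, and the function is repeated (with `regex` set explicitly) so the cell runs on its own:

```python
import pandas as pd

def clean_columns(columns):
    'Remove special characters and convert to snake case'
    clean = (columns.str.lower()
                    .str.replace(r'[^0-9a-zA-Z\-]+', ' ', regex=True)
                    .str.replace('-', '', regex=False)
                    .str.strip()
                    .str.replace(' ', '_', regex=False))
    return clean

# Hypothetical raw column names with spaces, parentheses, and a hyphen
raw_columns = pd.Index([' Facility ID (ORISPL)',
                        ' Gross Load (MW-h)',
                        ' CO2 (short tons)'])

cleaned = clean_columns(raw_columns)
print(list(cleaned))
# ['facility_id_orispl', 'gross_load_mwh', 'co2_short_tons']
```

Hyphens are kept in the first replacement and then removed outright, so `MW-h` becomes `mwh` rather than `mw_h`.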

## Set file paths

In [5]:
# Paths to each of the data files (epa emissions, eia capacity by generator, and eia generation)

epa_path = join(data_path, 'external', 'epa_emissions_2016.txt')
gen_path = join(data_path, 'external', 'EIA923_Schedules_2_3_4_5_M_12_2016_Final_Revision.xlsx')

## Load EPA emissions data
Let's load the file and see what needs to be done to make sure the data is in good shape and accessible.

### Access parts of the dataframe

Look at the column names

### Data types of each column
Numeric columns will be either `int` or `float`. If a column is of type `object`, it holds either all strings or a mix of types. Watch out for columns that should be numeric but show up as `object`.

In [21]:
epa.dtypes

state                  object
facility_name          object
facility_id_orispl      int64
month                   int64
year                    int64
gross_load_mwh        float64
so2_tons              float64
nox_tons              float64
co2_short_tons        float64
heat_input_mmbtu      float64
dtype: object
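If a column that should be numeric does show up as `object`, one possible fix is `pd.to_numeric`. The sketch below uses a toy dataframe with invented values; the `'N/A'` entry is a hypothetical example of the kind of stray text that causes the problem:

```python
import pandas as pd

# Toy data with invented values: the stray string makes the column object
df = pd.DataFrame({'so2_tons': [1.5, 2.0, 'N/A']})
print(df.dtypes)

# errors='coerce' converts anything that can't be parsed into NaN
df['so2_tons'] = pd.to_numeric(df['so2_tons'], errors='coerce')
print(df.dtypes)
print(df['so2_tons'].isna().sum())  # number of values that failed to parse
```

Check how many values were coerced to `NaN` before moving on, since a large count usually means the column needs a closer look.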

## Basic statistics of the data

## Load generation data

### Melt generation data to tidy format and groupby facility
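As a rough sketch of this step, the example below melts a small wide table (one column per month) into tidy format and then sums by facility. The column names `plant_id`, `fuel`, and `generation_mwh` and all of the values are invented for illustration; the real file's columns may differ.

```python
import pandas as pd

# Toy wide-format data with invented values and hypothetical column names
wide = pd.DataFrame({
    'plant_id': [1, 1, 2],
    'fuel': ['NG', 'SUN', 'COL'],
    'january': [100.0, 10.0, 500.0],
    'february': [90.0, 15.0, 450.0],
})

# Melt: one row per plant, fuel, and month
tidy = wide.melt(id_vars=['plant_id', 'fuel'],
                 var_name='month',
                 value_name='generation_mwh')

# Group by facility and month, summing generation across fuels
facility_gen = (tidy.groupby(['plant_id', 'month'], as_index=False)
                    ['generation_mwh'].sum())
print(facility_gen)
```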

### Convert month columns to integer values
The EPA emissions data also has a `month` column, but its values are integers. We'll use the built-in list of month names from the standard-library `calendar` module to create a mapping of names to integers.
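One way to build that mapping is sketched below. It assumes the month names are lowercase English words and that Python is running with its default locale (`calendar.month_name` follows the current locale); the sample values are hypothetical.

```python
import calendar
import pandas as pd

# calendar.month_name[0] is an empty string, so skip index 0
month_map = {name.lower(): num
             for num, name in enumerate(calendar.month_name)
             if num}

# Hypothetical month values as they might appear after melting
months = pd.Series(['january', 'february', 'december'])
month_nums = months.map(month_map)
print(month_nums.tolist())
```

Any name that is missing from `month_map` becomes `NaN` after `map`, which makes misspelled or unexpected month labels easy to spot.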

### What fuel is used to generate the most electricity at each plant?
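A possible approach, sketched on toy data: total the generation for each plant and fuel, then use `idxmax` within each plant to pick the fuel with the largest total. The column names and values here are invented for illustration.

```python
import pandas as pd

# Toy tidy data with invented values and hypothetical column names
gen = pd.DataFrame({
    'plant_id': [1, 1, 2, 2],
    'fuel': ['NG', 'SUN', 'COL', 'NG'],
    'generation_mwh': [190.0, 25.0, 950.0, 40.0],
})

# Total generation for each plant and fuel combination
fuel_totals = (gen.groupby(['plant_id', 'fuel'], as_index=False)
                  ['generation_mwh'].sum())

# Row label of the largest total within each plant
top_idx = fuel_totals.groupby('plant_id')['generation_mwh'].idxmax()
top_fuel = fuel_totals.loc[top_idx]
print(top_fuel)
```

When two fuels tie, `idxmax` keeps only the first one it finds, so check for ties if they matter for your analysis.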