# What patterns exist between energy consumption and generation?

## Goals

By the end of this case, you will be familiar with various data visualizations like heat maps, line plots, strip plots, and box plots. You will also begin to develop a sense of which plots to use for displaying certain types of information. This will help you write better reports and communicate your ideas and analyses much more clearly to non-data professionals.

## Introduction

**Business Context.** Energy supply and demand is a hotly debated topic across world governments and political parties. You are an analyst for the Department of Energy (DoE), and are responsible for discerning patterns in electric power generation and consumption across different energy sources as well as across sectors of the U.S. economy in order to help drive government initiatives.

**Business Problem.** Your boss would like you to answer the following question: **"Given patterns in energy consumption across sectors and time, how should we allocate government resources towards nuclear electricity generation?"** He needs to explain your findings to non-technical politicians so they can allocate resources appropriately across the country.

**Analytical Context.** You are given data in CSV format from the [Energy Information Administration](https://www.eia.gov/totalenergy/data/monthly/index.php) (EIA) for both energy consumption and net electricity generation, where energy consumption is broken down by sector and electricity generation is broken down by source. In this case, you will explore relationships between energy consumption in the electric power sector and electricity generation from nuclear electric power; identify any patterns in energy usage and generation and how they change over time; and finally, use plots to determine which sectors consume the most energy and how this has evolved over time.

## Getting started with the Energy Information Administration (EIA) data

The consumption and generation data are each given on a monthly basis in ```data/energy_consumption.csv``` and ```data/electricity_generation.csv```. The datasets' contents and some useful characteristics to note about the data are as follows:

The `data/energy_consumption.csv` table:
- Contains monthly energy consumption by sector for the U.S.
- Energy consumption is the use of energy as a source of heat or power or as an input in the manufacturing process
- Primary energy is energy in the form in which it is first accounted for in a statistical energy balance, before any transformation to secondary or tertiary forms of energy
- Total energy consumption in sectors consists of primary energy consumption, electricity retail sales, and electrical system energy losses

The `data/electricity_generation.csv` table:
- Contains monthly net electricity generation for all sectors in the U.S.
- Net electricity generation is the amount of gross electricity generation less station use (the electric energy consumed at the generating station(s) for station service or auxiliaries)
- Btu stands for British Thermal Unit

The columns in the tables are:

1. **MSN:** Mnemonic Series Names (see https://www.eia.gov/state/seds/sep_use/notes/use_a.pdf for more information)
2. **YYYYMM:** The month of the energy use
3. **Value:** The amount of energy consumed/generated
4. **Column_Order:** The order of columns used in the official EIA reports
5. **Description:** The description of which sector consumed/generated the electricity
6. **Unit:** The unit of energy used for the value

(Source: https://www.eia.gov/totalenergy/data/monthly/pdf/sec13.pdf)

Hereinafter, we'll be referring to the energy consumption dataset as **`energy_df`** and the electricity generation dataset as **`electricity_df`**.

The ```Description``` column gives the sector (energy consumption) or source (electricity generation) description. Let's look at all the available description values for each dataset to understand what data is available.
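As a rough sketch of how this inspection might be done with pandas, the snippet below parses a small in-memory sample (standing in for `data/energy_consumption.csv`) and lists the distinct `Description` values. The name `sample_csv` and the placeholder row are assumptions made for illustration; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for data/energy_consumption.csv. The first two rows are copied from
# the energy_df table in this section; the last row is a made-up placeholder so
# that more than one description appears.
sample_csv = """MSN,YYYYMM,Value,Column_Order,Description,Unit
TXRCBUS,194913,4460.588,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
TXRCBUS,195013,4829.528,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
PLACEHOLDER,194913,1.0,2,Total Energy Consumed by the Residential Sector,Trillion Btu
"""

# With the real file: energy_df = pd.read_csv("data/energy_consumption.csv")
energy_df = pd.read_csv(io.StringIO(sample_csv))

# Distinct descriptions, in order of first appearance.
descriptions = energy_df["Description"].unique().tolist()
print(descriptions)
```

The same two calls apply to the electricity generation file.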

These are the first rows from `energy_df`:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>MSN</th>      <th>YYYYMM</th>      <th>Value</th>      <th>Column_Order</th>      <th>Description</th>      <th>Unit</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>TXRCBUS</td>      <td>194913</td>      <td>4460.588</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>1</th>      <td>TXRCBUS</td>      <td>195013</td>      <td>4829.528</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>2</th>      <td>TXRCBUS</td>      <td>195113</td>      <td>5104.680</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>3</th>      <td>TXRCBUS</td>      <td>195213</td>      <td>5158.406</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>4</th>      <td>TXRCBUS</td>      <td>195313</td>      <td>5052.749</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>    </tr>  </tbody></table>

And these are the unique descriptions available:

| Description 	|
|-	|
| Primary Energy Consumed by the Residential Sector 	|
| Total Energy Consumed by the Residential Sector 	|
| Primary Energy Consumed by the Commercial Sector 	|
| Total Energy Consumed by the Commercial Sector 	|
| Primary Energy Consumed by the Industrial Sector 	|
| Total Energy Consumed by the Industrial Sector 	|
| Primary Energy Consumed by the Transportation Sector 	|
| Total Energy Consumed by the Transportation Sector 	|
| Primary Energy Consumed by the Electric Power Sector 	|
| Energy Consumption Balancing Item 	|
| Primary Energy Consumption Total 	|


These are the first rows from `electricity_df`:


<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>MSN</th>      <th>YYYYMM</th>      <th>Value</th>      <th>Column_Order</th>      <th>Description</th>      <th>Unit</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>CLETPUS</td>      <td>194913</td>      <td>135451.32</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1</th>      <td>CLETPUS</td>      <td>195013</td>      <td>154519.994</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>2</th>      <td>CLETPUS</td>      <td>195113</td>      <td>185203.657</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>3</th>      <td>CLETPUS</td>      <td>195213</td>      <td>195436.666</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>4</th>      <td>CLETPUS</td>      <td>195313</td>      <td>218846.325</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>    </tr>  </tbody></table>

And these are the unique descriptions available:

| Description 	|
|-	|
| Electricity Net Generation From Coal, All Sectors 	|
| Electricity Net Generation From Petroleum, All Sectors 	|
| Electricity Net Generation From Natural Gas, All Sectors 	|
| Electricity Net Generation From Other Gases, All Sectors 	|
| Electricity Net Generation From Nuclear Electric Power, All Sectors 	|
| Electricity Net Generation From Hydroelectric Pumped Storage, All Sectors 	|
| Electricity Net Generation From Conventional Hydroelectric Power, All Sectors 	|
| Electricity Net Generation From Wood, All Sectors 	|
| Electricity Net Generation From Waste, All Sectors 	|
| Electricity Net Generation From Geothermal, All Sectors 	|
| Electricity Net Generation From Solar, All Sectors 	|
| Electricity Net Generation From Wind, All Sectors 	|
| Electricity Net Generation Total (including from sources not shown), All Sectors 	|

Here we see that we have a variety of energy consumption sectors, as well as a variety of electricity generation sources, each aggregated across all sectors. We are specifically interested in nuclear electric power generation and electric power sector consumption, as the DoE is considering nuclear power projects.

## Pre-processing data to simplify analysis moving forward

In the real world, we often do not have the luxury of dealing with perfectly clean and formatted data. Therefore, we will need to perform some operations on provided data to get it into a format amenable to further analysis.

### Exercise 1

Take a look at the tables above and come up with ideas about what could need some cleaning up. Remember that this should serve the purpose of creating plots to tackle the business problem at hand.

**Hint:** You might find it useful to download the CSV files and open them in Excel or another spreadsheet package to inspect them. It is also useful to know the data types of the columns.

In `energy_df`:

| Column 	| dtype 	|
|-	|-	|
| MSN 	| object 	|
| YYYYMM 	| int64 	|
| Value 	| float64 	|
| Column_Order 	| int64 	|
| Description 	| object 	|
| Unit 	| object 	|

In `electricity_df`:

| Column 	| dtype 	|
|-	|-	|
| MSN 	| object 	|
| YYYYMM 	| int64 	|
| Value 	| object 	|
| Column_Order 	| int64 	|
| Description 	| object 	|
| Unit 	| object 	|


**Answer.** 
We need to deal with some missing values in the `Value` column in `electricity_df`. These missing values are encoded as the "Not Available" string, which makes this an `object` column instead of the more natural `float64` found in the `energy_df` dataset. Moreover, the `YYYYMM` column is not very convenient to work with, since it packs the year and the month into a single integer and we usually like being able to filter by year and month separately. The `Description` column has cells that are quite long, which can make our plots less readable for the final user. Finally, we will also need to remove the `MSN` and `Column_Order` columns because they are not relevant to our data analysis.

-------


### Missing values


The ```Value``` column holds numerical values but is currently stored as strings in `electricity_df`, which makes performing math operations on it more difficult. This is because some values in this column are missing and are recorded as "Not Available" (there are 830 of them). For instance:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>MSN</th>      <th>YYYYMM</th>      <th>Value</th>      <th>Column_Order</th>      <th>Description</th>      <th>Unit</th>    </tr>  </thead>  <tbody>    <tr>      <th>1890</th>      <td>NGETPUS</td>      <td>201905</td>      <td>116366.35</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1891</th>      <td>NGETPUS</td>      <td>201906</td>      <td>136995.382</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1892</th>      <td>NGETPUS</td>      <td>201907</td>      <td>174345.342</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1893</th>      <td>NGETPUS</td>      <td>201908</td>      <td>176454.286</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1894</th>      <td>NGETPUS</td>      <td>201909</td>      <td>150741.814</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1895</th>      <td>NGETPUS</td>      <td>201910</td>      <td>133685.005</td>      <td>3</td>      <td>Electricity Net Generation From Natural Gas, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1896</th>      <td>OJETPUS</td>      <td>194913</td>      <td>Not Available</td>      <td>4</td>      <td>Electricity Net Generation From Other Gases, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1897</th>      <td>OJETPUS</td>      <td>195013</td>      <td>Not Available</td>      <td>4</td>      <td>Electricity Net Generation From Other Gases, A...</td>      <td>Million 
Kilowatthours</td>    </tr>    <tr>      <th>1898</th>      <td>OJETPUS</td>      <td>195113</td>      <td>Not Available</td>      <td>4</td>      <td>Electricity Net Generation From Other Gases, A...</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1899</th>      <td>OJETPUS</td>      <td>195213</td>      <td>Not Available</td>      <td>4</td>      <td>Electricity Net Generation From Other Gases, A...</td>      <td>Million Kilowatthours</td>    </tr>  </tbody></table>

This time, those missing values are not really relevant, so we decide to **drop them**.
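One possible way to do this in pandas is sketched below. The small frame is a stand-in for `electricity_df` built from four of the rows shown above; `pd.to_numeric` with `errors="coerce"` is one option among several for turning the placeholder strings into `NaN`.

```python
import pandas as pd

# Toy stand-in for electricity_df, using rows taken from the table above.
electricity_df = pd.DataFrame({
    "YYYYMM": [201909, 201910, 194913, 195013],
    "Value": ["150741.814", "133685.005", "Not Available", "Not Available"],
    "Description": [
        "Electricity Net Generation From Natural Gas, All Sectors",
        "Electricity Net Generation From Natural Gas, All Sectors",
        "Electricity Net Generation From Other Gases, All Sectors",
        "Electricity Net Generation From Other Gases, All Sectors",
    ],
})

# Count the placeholder strings before converting.
n_missing = int((electricity_df["Value"] == "Not Available").sum())

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
electricity_df["Value"] = pd.to_numeric(electricity_df["Value"], errors="coerce")

# Drop the rows whose Value could not be parsed.
electricity_df = electricity_df.dropna(subset=["Value"]).reset_index(drop=True)
print(n_missing, electricity_df["Value"].dtype, len(electricity_df))
```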

### Shortening `Description` and removing columns

Moreover, notice that the existing descriptions are quite long. We might as well use some abbreviations:

- PEC: Primary Energy Consumption
- TEC: Total Energy Consumption
- ENG: Electricity Net Generation

Let's change the ```Description``` column to use the abbreviated form and reduce the clutter of the output. This will be useful later on when we are making plots and want clean organized figures. We will also remove columns ```MSN``` and ```Column_Order```:

This is for `energy_df`:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>YYYYMM</th>      <th>Value</th>      <th>Description</th>      <th>Unit</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>194913</td>      <td>4460.588</td>      <td>PEC Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>1</th>      <td>195013</td>      <td>4829.528</td>      <td>PEC Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>2</th>      <td>195113</td>      <td>5104.680</td>      <td>PEC Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>3</th>      <td>195213</td>      <td>5158.406</td>      <td>PEC Residential Sector</td>      <td>Trillion Btu</td>    </tr>    <tr>      <th>4</th>      <td>195313</td>      <td>5052.749</td>      <td>PEC Residential Sector</td>      <td>Trillion Btu</td>    </tr>  </tbody></table>

This is for `electricity_df`:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>YYYYMM</th>      <th>Value</th>      <th>Description</th>      <th>Unit</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>194913</td>      <td>135451.32</td>      <td>ENG Coal</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>1</th>      <td>195013</td>      <td>154519.994</td>      <td>ENG Coal</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>2</th>      <td>195113</td>      <td>185203.657</td>      <td>ENG Coal</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>3</th>      <td>195213</td>      <td>195436.666</td>      <td>ENG Coal</td>      <td>Million Kilowatthours</td>    </tr>    <tr>      <th>4</th>      <td>195313</td>      <td>218846.325</td>      <td>ENG Coal</td>      <td>Million Kilowatthours</td>    </tr>  </tbody></table>
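A minimal sketch of how the abbreviation and column removal could be implemented follows. The helper name `tidy` and the regular expressions are assumptions chosen to reproduce the abbreviated labels shown in the tables above. They are not necessarily the rules used to produce them, and descriptions that do not match any pattern are left unchanged.

```python
import pandas as pd

# Regex pattern -> replacement. These rules are illustrative assumptions.
ABBREVIATIONS = {
    r"^Primary Energy Consumed by the ": "PEC ",
    r"^Total Energy Consumed by the ": "TEC ",
    r"^Electricity Net Generation From ": "ENG ",
    r", All Sectors$": "",
}

def tidy(df):
    """Drop MSN and Column_Order, then shorten the Description labels."""
    out = df.drop(columns=["MSN", "Column_Order"])
    for pattern, short in ABBREVIATIONS.items():
        out["Description"] = out["Description"].str.replace(pattern, short, regex=True)
    return out

# One row from each dataset, copied from the tables earlier in this section.
sample = pd.DataFrame({
    "MSN": ["TXRCBUS", "CLETPUS"],
    "YYYYMM": [194913, 194913],
    "Value": [4460.588, 135451.32],
    "Column_Order": [1, 1],
    "Description": [
        "Primary Energy Consumed by the Residential Sector",
        "Electricity Net Generation From Coal, All Sectors",
    ],
    "Unit": ["Trillion Btu", "Million Kilowatthours"],
})

cleaned = tidy(sample)
print(cleaned["Description"].tolist())
```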



### Splitting `YYYYMM` and figuring out month 13

Another suboptimal characteristic of the data as provided is that the ```YYYYMM``` column for the month and year is difficult to use. Let's split this column into a `YYYY` and a `MM` column. For `energy_df`:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>MSN</th>      <th>YYYYMM</th>      <th>Value</th>      <th>Column_Order</th>      <th>Description</th>      <th>Unit</th>      <th>YYYY</th>      <th>MM</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>TXRCBUS</td>      <td>194913</td>      <td>4460.588</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>      <td>1949</td>      <td>13</td>    </tr>    <tr>      <th>1</th>      <td>TXRCBUS</td>      <td>195013</td>      <td>4829.528</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>      <td>1950</td>      <td>13</td>    </tr>    <tr>      <th>2</th>      <td>TXRCBUS</td>      <td>195113</td>      <td>5104.680</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>      <td>1951</td>      <td>13</td>    </tr>    <tr>      <th>3</th>      <td>TXRCBUS</td>      <td>195213</td>      <td>5158.406</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>      <td>1952</td>      <td>13</td>    </tr>    <tr>      <th>4</th>      <td>TXRCBUS</td>      <td>195313</td>      <td>5052.749</td>      <td>1</td>      <td>Primary Energy Consumed by the Residential Sector</td>      <td>Trillion Btu</td>      <td>1953</td>      <td>13</td>    </tr>  </tbody></table>

And for `electricity_df`:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>MSN</th>      <th>YYYYMM</th>      <th>Value</th>      <th>Column_Order</th>      <th>Description</th>      <th>Unit</th>      <th>YYYY</th>      <th>MM</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>CLETPUS</td>      <td>194913</td>      <td>135451.32</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>      <td>1949</td>      <td>13</td>    </tr>    <tr>      <th>1</th>      <td>CLETPUS</td>      <td>195013</td>      <td>154519.994</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>      <td>1950</td>      <td>13</td>    </tr>    <tr>      <th>2</th>      <td>CLETPUS</td>      <td>195113</td>      <td>185203.657</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>      <td>1951</td>      <td>13</td>    </tr>    <tr>      <th>3</th>      <td>CLETPUS</td>      <td>195213</td>      <td>195436.666</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>      <td>1952</td>      <td>13</td>    </tr>    <tr>      <th>4</th>      <td>CLETPUS</td>      <td>195313</td>      <td>218846.325</td>      <td>1</td>      <td>Electricity Net Generation From Coal, All Sectors</td>      <td>Million Kilowatthours</td>      <td>1953</td>      <td>13</td>    </tr>  </tbody></table>
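Because `YYYYMM` is stored as an integer, one straightforward way to split it is with integer division and the remainder. The snippet below is a sketch on a three-row toy frame, not the full dataset.

```python
import pandas as pd

# Three YYYYMM codes in the same integer format as the datasets.
df = pd.DataFrame({"YYYYMM": [194913, 197301, 201910]})

# Integer division keeps the first four digits; the remainder keeps the last two.
df["YYYY"] = df["YYYYMM"] // 100
df["MM"] = df["YYYYMM"] % 100

print(df)
```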

Now that we've split the column, we can easily see that the month column contains some `13` values, which is one more than the number of months in the Gregorian calendar. Let's try to figure out what is going on. Sometimes, datasets have a month 13 that is just the sum of all the actual months. To investigate, we will group the months according to whether they fall between 1 and 12 or not. This is the sum of `Value` for both categories ("Actual months" and "Month 13") in the first years present in the `energy_df` dataset:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th></th>      <th>Value</th>    </tr>    <tr>      <th>YYYY</th>      <th>Type of month</th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th>1949</th>      <th>Month 13</th>      <td>95903.492</td>    </tr>    <tr>      <th>1950</th>      <th>Month 13</th>      <td>103795.680</td>    </tr>    <tr>      <th>1951</th>      <th>Month 13</th>      <td>110860.796</td>    </tr>    <tr>      <th>1952</th>      <th>Month 13</th>      <td>110178.029</td>    </tr>    <tr>      <th>1953</th>      <th>Month 13</th>      <td>112921.833</td>    </tr>    <tr>      <th>1954</th>      <th>Month 13</th>      <td>109840.527</td>    </tr>    <tr>      <th>1955</th>      <th>Month 13</th>      <td>120534.823</td>    </tr>    <tr>      <th>1956</th>      <th>Month 13</th>      <td>125165.445</td>    </tr>    <tr>      <th>1957</th>      <th>Month 13</th>      <td>125261.853</td>    </tr>    <tr>      <th>1958</th>      <th>Month 13</th>      <td>124827.830</td>    </tr>    <tr>      <th>1959</th>      <th>Month 13</th>      <td>130268.958</td>    </tr>    <tr>      <th>1960</th>      <th>Month 13</th>      <td>135122.190</td>    </tr>    <tr>      <th>1961</th>      <th>Month 13</th>      <td>137072.448</td>    </tr>    <tr>      <th>1962</th>      <th>Month 13</th>      <td>143325.182</td>    </tr>    <tr>      <th>1963</th>      <th>Month 13</th>      <td>148765.772</td>    </tr>    <tr>      <th>1964</th>      <th>Month 13</th>      <td>155266.414</td>    </tr>    <tr>      <th>1965</th>      <th>Month 13</th>      <td>161859.542</td>    </tr>    <tr>      <th>1966</th>      <th>Month 13</th>      <td>170847.704</td>    </tr>    <tr>      <th>1967</th>      <th>Month 13</th>      <td>176684.794</td>    </tr>    <tr>      <th>1968</th>      <th>Month 13</th>      <td>187204.200</td>    </tr>    <tr>      <th>1969</th>      <th>Month 13</th>      <td>196789.036</td>    
</tr>    <tr>      <th>1970</th>      <th>Month 13</th>      <td>203450.360</td>    </tr>    <tr>      <th>1971</th>      <th>Month 13</th>      <td>207781.121</td>    </tr>    <tr>      <th>1972</th>      <th>Month 13</th>      <td>217980.183</td>    </tr>    <tr>      <th rowspan="2" valign="top">1973</th>      <th>Actual months</th>      <td>226948.398</td>    </tr>    <tr>      <th>Month 13</th>      <td>226949.505</td>    </tr>    <tr>      <th rowspan="2" valign="top">1974</th>      <th>Actual months</th>      <td>221780.458</td>    </tr>    <tr>      <th>Month 13</th>      <td>221780.862</td>    </tr>    <tr>      <th rowspan="2" valign="top">1975</th>      <th>Actual months</th>      <td>215793.641</td>    </tr>    <tr>      <th>Month 13</th>      <td>215792.392</td>    </tr>    <tr>      <th rowspan="2" valign="top">1976</th>      <th>Actual months</th>      <td>227807.957</td>    </tr>    <tr>      <th>Month 13</th>      <td>227809.473</td>    </tr>  </tbody></table>

As we can see, the first years did not report monthly information. Let's look at the most recent years:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th></th>      <th>Value</th>    </tr>    <tr>      <th>YYYY</th>      <th>Type of month</th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th>2014</th>      <th>Month 13</th>      <td>294821.081</td>    </tr>    <tr>      <th rowspan="2" valign="top">2015</th>      <th>Actual months</th>      <td>292131.486</td>    </tr>    <tr>      <th>Month 13</th>      <td>292131.480</td>    </tr>    <tr>      <th rowspan="2" valign="top">2016</th>      <th>Actual months</th>      <td>292006.341</td>    </tr>    <tr>      <th>Month 13</th>      <td>292006.328</td>    </tr>    <tr>      <th rowspan="2" valign="top">2017</th>      <th>Actual months</th>      <td>293122.535</td>    </tr>    <tr>      <th>Month 13</th>      <td>293122.524</td>    </tr>    <tr>      <th rowspan="2" valign="top">2018</th>      <th>Actual months</th>      <td>303585.860</td>    </tr>    <tr>      <th>Month 13</th>      <td>303585.839</td>    </tr>    <tr>      <th>2019</th>      <th>Actual months</th>      <td>248786.149</td>    </tr>  </tbody></table>
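Grouped sums like the ones in the tables above could be computed along the following lines. This is a sketch on a toy frame whose values are arbitrary and carry no meaning. The label column name `Type of month` follows the tables, while the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with arbitrary values: one year with only a month-13 row,
# and one year with three monthly rows plus a month-13 row.
toy = pd.DataFrame({
    "YYYY": [1972, 1973, 1973, 1973, 1973],
    "MM": [13, 1, 2, 3, 13],
    "Value": [10.0, 1.0, 2.0, 3.0, 7.0],
})

# Label each row according to whether MM is a calendar month (1 to 12).
toy["Type of month"] = np.where(toy["MM"].between(1, 12), "Actual months", "Month 13")

# Sum Value per year and per type of month.
sums = toy.groupby(["YYYY", "Type of month"])["Value"].sum()
print(sums)

# Once the check is done, the month-13 rows can be filtered out.
monthly_only = toy[toy["MM"].between(1, 12)]
```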

### Question

From what you see, is month 13 the sum of all the months in each year?

---

Now that we have processed our data (we removed the month 13 rows as well), we can begin visualizing it to see if we can uncover hidden patterns.

## Identifying the relationship between energy consumption and generation

Recall that our boss wants to determine how to optimally allocate the DoE's electricity generation resources given consumption patterns. It makes sense to look at how consumption patterns have generally varied across time and sectors in order to drive electricity generation strategy. Let's analyze this by doing some basic plotting. Given that we represent the DoE and are investigating nuclear energy, one thing that makes sense to look at is the relationship between energy consumption by each major sector and the net energy generation from nuclear electric power.

We will first use a 2D scatterplot to visualize the data. Scatterplots are versatile and are often the first type of plot one uses when visualizing a dataset. We'll start with the electric power sector; we'll build a scatterplot with ```PEC Electric Power Sector``` on the y-axis, and ```ENG Nuclear Electric Power``` on the x-axis. This will allow us to see how electric power energy consumption correlates with nuclear electric power generation:

![Electricity scatterplot](data/images/nuclear_electric_scatter.png)
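To draw a scatterplot like this one, the two series first have to be lined up month by month. The sketch below shows one way to do that with a merge on `YYYY` and `MM`. The toy values are made up, and the helper `select` and the column names `consumption` and `generation` are assumptions for illustration.

```python
import pandas as pd

# Toy long-format tables (made-up values) with the pre-processed columns.
energy_df = pd.DataFrame({
    "YYYY": [2000] * 6,
    "MM": [1, 2, 3, 1, 2, 3],
    "Description": ["PEC Electric Power Sector"] * 3 + ["PEC Commercial Sector"] * 3,
    "Value": [3000.0, 2800.0, 3100.0, 400.0, 390.0, 380.0],
})
electricity_df = pd.DataFrame({
    "YYYY": [2000] * 3,
    "MM": [1, 2, 3],
    "Description": ["ENG Nuclear Electric Power"] * 3,
    "Value": [60000.0, 55000.0, 64000.0],
})

def select(df, description, name):
    """Return one series as a (YYYY, MM, <name>) table."""
    rows = df[df["Description"] == description]
    return rows[["YYYY", "MM", "Value"]].rename(columns={"Value": name})

# Pair each month's consumption with the same month's generation.
pairs = select(energy_df, "PEC Electric Power Sector", "consumption").merge(
    select(electricity_df, "ENG Nuclear Electric Power", "generation"),
    on=["YYYY", "MM"],
)

# pairs.plot.scatter(x="generation", y="consumption") would draw the plot
# (this requires matplotlib). The Pearson correlation summarizes it in one number.
correlation = pairs["consumption"].corr(pairs["generation"])
print(pairs)
```

Swapping the first description for ```PEC Commercial Sector``` gives the pairs for the commercial sector.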

### Exercise 2

This is a scatterplot of energy consumption in the commercial sector and nuclear electric power net energy generation. Is the relationship between these variables stronger or weaker when compared to the electric power sector's result? What might this mean in terms of a potential DoE recommendation?

![Nuclear, commercial electricity](data/images/nuclear_commercial_scatter.png)


**Answer.**
From these plots we see that commercial sector consumption levels do not track nuclear electric power generation levels very well, while electric power sector levels do. This may mean that it is critical for the DoE to dedicate significant resources towards electricity generation for the electric power sector, as it seems to be a significant driver of marginal demand for nuclear power, whereas it is not so important for the DoE to dedicate resources towards electricity generation for the commercial sector.

-------

## Trends in energy consumption and generation over time

While a scatterplot helps us visualize the relationship between two variables, it does not allow us to look at something across time. For this, we will use a different tool: the **line plot**.

A line plot is excellent for viewing data that evolves over time (called **time series data**) and will help us determine trends and cyclical patterns across time for both electric power sector energy consumption and nuclear electric power energy generation.

Let's build a line plot for each of these series:

![Line plot of PEC Electric Power Sector](data/images/electric_line_plot.png)
![ENG Nuclear Electric Power](data/images/nuclear_line_plot.png)

Notice that both energy consumption and generation exhibit an upward trend over time, with a strong cyclical pattern (observe the oscillating nature of the time series). We can remove the trend and look only at the cycles by plotting the percentage change for each month across time instead of the actual levels:

![Percentage change](data/images/pct_change.png)

We see that the percentage changes from month to month are not growing significantly over time. This indicates that the percentage fluctuations in energy consumption remain relatively constant, even as the total amount of energy used across time has grown.
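As an illustration of the transformation, pandas provides `Series.pct_change`, which computes the relative change from one row to the next. The toy series below uses made-up levels. The first entry has no predecessor, so its change is `NaN`.

```python
import pandas as pd

# Made-up monthly levels, in chronological order.
levels = pd.Series([100.0, 110.0, 99.0, 108.9])

# Relative change from the previous month, expressed in percent.
pct = levels.pct_change() * 100
print(pct.round(2).tolist())
```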

In light of this, one useful statistic to better understand energy usage relative to electricity generation is the ratio of energy consumed to electricity generated. This may give us insight into how well supply meets demand and how nuclear power may expand and contract electricity generation in high or low demand periods.

## Analyzing the ratio of energy consumed to electricity generated

Let's calculate the ratio of energy consumed to energy generated for each month. Understanding the distribution of the ratio will allow us to see how energy consumption and electricity generation deviate relative to one another. We will continue to look at the ```PEC Electric Power Sector``` energy consumption and ```ENG Nuclear Electric Power``` energy generation for this:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Ratio</th>    </tr>  </thead>  <tbody>    <tr>      <th>count</th>      <td>562.000000</td>    </tr>    <tr>      <th>mean</th>      <td>0.067162</td>    </tr>    <tr>      <th>std</th>      <td>0.037070</td>    </tr>    <tr>      <th>min</th>      <td>0.042228</td>    </tr>    <tr>      <th>25%</th>      <td>0.048402</td>    </tr>    <tr>      <th>50%</th>      <td>0.052376</td>    </tr>    <tr>      <th>75%</th>      <td>0.074305</td>    </tr>    <tr>      <th>max</th>      <td>0.272376</td>    </tr>  </tbody></table>
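A summary table of this kind is what `Series.describe` returns. The sketch below computes a ratio column on made-up, already-aligned monthly values; the column names `consumption` and `generation` are assumptions. Note that in the source data consumption is reported in Trillion Btu and generation in Million Kilowatthours, so a ratio computed directly from the two `Value` columns mixes units.

```python
import pandas as pd

# Made-up monthly values, already aligned month by month.
pairs = pd.DataFrame({
    "consumption": [3000.0, 2800.0, 3100.0, 2900.0],   # Trillion Btu
    "generation": [60000.0, 70000.0, 62000.0, 50000.0],  # Million Kilowatthours
})

# Energy consumed divided by electricity generated, month by month.
pairs["Ratio"] = pairs["consumption"] / pairs["generation"]

# count, mean, std, min, quartiles and max in one call.
summary = pairs["Ratio"].describe()
print(summary)
```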

It's also useful to obtain a graphical representation of the distribution by constructing a histogram:

![Ratio histogram](data/images/ratio_hist.png)

Here, we see that the ```Ratio``` variable is largely clustered around 0.05, with some large values that extend upward to 0.25, though these higher values are not common (hence they have a bar with a smaller height in the histogram).  

Is there something that combines a visual for the distribution of the data with the summary statistics? There is! The **box plot** consists of an inner box and two "whiskers" on either side. The horizontal line inside the box corresponds to the median of the data that the box plot represents, while the upper and lower edges of the box represent the 75th percentile and 25th percentile of the data, respectively. The **interquartile range (IQR)** is the difference between the 75th percentile value and the 25th percentile value. Each whisker extends from the edge of the box up to 1.5 times the IQR, or only as far as the minimum or maximum value in the data when that lies closer to the median:

![Boxplot diagram](data/images/boxplot_diagram.png)
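The numbers behind a box plot can be computed by hand, as in the sketch below on ten made-up values. It follows the convention used by matplotlib, among others, where each whisker stops at the most extreme data point that still lies within 1.5 times the IQR of the box and anything beyond is drawn as an individual outlier. Other tools may draw the whiskers slightly differently.

```python
import pandas as pd

# Ten made-up observations, one of them far from the rest.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Edges of the box and the line inside it.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits beyond which points are treated as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr, whisker_low, whisker_high, outliers.tolist())
```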

### Exercise 3

To help you familiarize yourself with the relationship between histograms and box plots, we have created two applets for you to play with. Play around with the parameters and share your thoughts with the class.

In [None]:
import c1applet.boxcomboasymmetry as boxcombosymmetry
import c1applet.boxcombospread as boxcombospread

In [None]:
# Exploring the relationship between asymmetry and the shape of a boxplot
# If you need to stop the app, just restart your kernel
bapp = boxcombosymmetry.app
bapp.run_server(port='8050')

In [None]:
# Exploring the relationship between spread and the shape of a boxplot
# This new app will replace the previous one (you can't have both running simultaneously)
bspapp = boxcombospread.app
bspapp.run_server(port='8050')

Since we saw a cyclical pattern in the line plot analysis preceding this section, let's take a closer look at how the distribution of the ratio of consumed to generated energy evolves over the different months of the year. We will do so by creating a series of side-by-side box plots, one per month:

![Boxplots of ratios per month](data/images/ratio_month_boxplots.png)
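Each box in a figure like this summarizes the ratios observed in one calendar month across all years. A tabular counterpart can be obtained by grouping on the month column, as sketched below with arbitrary toy values. The column names `MM` and `Ratio` follow the earlier sections.

```python
import pandas as pd

# Arbitrary toy ratios for two calendar months observed in two different years.
toy = pd.DataFrame({
    "YYYY": [2000, 2001, 2000, 2001],
    "MM": [1, 1, 2, 2],
    "Ratio": [0.05, 0.07, 0.04, 0.06],
})

# One row of summary statistics per calendar month.
per_month = toy.groupby("MM")["Ratio"].describe()
print(per_month[["count", "25%", "50%", "75%"]])
```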

### Exercise 4

Recall the ratio is the energy consumed (PEC Electric Power Sector) divided by the energy generated (ENG Nuclear Electric Power). What patterns do you notice in this ratio from the plots above? What could be a possible reason for those patterns? What might you recommend to your boss based on this?

**Answer.**

-------

## Are peak consumption and generation months consistent across many years?

Let's now look at consumption and generation levels month by month over time to see if the peak cyclical patterns we see are stable across many decades of data. For this we will use **heat maps**, which allow us to nicely visualize the monthly energy consumed and electricity generated over time:

![Heatmap of consumption](data/images/heatmap_1.png)
![Heatmap of generation](data/images/heatmap_2.png)

The color bar on the right indicates the level of the variable under study. Each colored rectangle, therefore, conveys three numbers: The year (horizontal axis), the month (vertical axis), and the value (energy consumed in the first plot, energy generated in the second plot). The colors are mapped to a gradient scale so that the largest values are always red and the smallest values are always blue (that is, in heat maps, *the color a region takes depends on its value in the dataset*).
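A heat map with this layout needs the data as a grid with one row per month and one column per year. The sketch below builds such a grid from a long-format toy table with made-up values using `DataFrame.pivot`. The variable names are assumptions.

```python
import pandas as pd

# Long-format toy data (made-up values): one row per (year, month).
monthly = pd.DataFrame({
    "YYYY": [2000, 2000, 2001, 2001],
    "MM": [1, 2, 1, 2],
    "Value": [10.0, 20.0, 30.0, 40.0],
})

# Months become rows, years become columns, Value fills the cells.
grid = monthly.pivot(index="MM", columns="YYYY", values="Value")
print(grid)

# A plotting library can then color each cell by its value, for example
# seaborn.heatmap(grid, cmap="coolwarm") if seaborn is installed.
```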



### Exercise 5

Take a look at the heat maps. What patterns can you detect?

**Answer.**

-------

These heat maps are enormously useful for identifying big, sweeping trends over time. However, one downside is that it is tough to be more granular and see changes in growth rates from year to year. Let's build a box plot to better understand whether growth is stable across time, and whether it occurs in both the peak and non-peak months.

## Assessing growth stability and differences in peak month energy demands across time

One way to better understand the growth stability of energy consumption and electricity generation is to view distributional data over time. We can do so by looking at two box plots for each year, one for peak months and one for non-peak months, and assessing how stable growth has been in each category:

![Boxplot](data/images/ExampleImage.png)
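
The numbers behind such a plot can be sketched by labeling each month as peak or non-peak and computing quartiles per year and category. In this sketch, treating June through August as the peak months is an assumption made for illustration, the data is synthetic, and `consumed` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

PEAK_MONTHS = [6, 7, 8]  # assumption: summer months are treated as "peak"

# Synthetic monthly series with an upward trend and a summer peak (NOT the EIA values).
rng = np.random.default_rng(2)
dates = pd.date_range("1990-01-01", periods=240, freq="MS")
month = dates.month.to_numpy()
year = dates.year.to_numpy()
level = 2000 + 40 * (year - 1990)
seasonal = 1 + 0.2 * np.sin(2 * np.pi * (month - 4) / 12)

df = pd.DataFrame({
    "year": year,
    "month": month,
    "consumed": level * seasonal + rng.normal(0, 30, len(dates)),  # hypothetical column
})
df["period"] = np.where(df["month"].isin(PEAK_MONTHS), "peak", "non-peak")

# Quartiles per (year, period): these are the values each box would be drawn from.
summary = (
    df.groupby(["year", "period"])["consumed"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary["iqr"] = summary[0.75] - summary[0.25]

print(summary.head())
```

Comparing the medians of the two categories year by year gives a rough read on whether the gap between peak and non-peak months is widening or holding steady.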

### Exercise 6

Interpret this box plot. What interesting things do you notice?

**Answer.**

-------

We've looked closely at the changes in nuclear electric power generation and consumption over time. Let's now shift towards the second part of your boss's request by looking at consumption patterns across sectors. This has important implications for how the government should allocate resources for nuclear power.

## Which sectors consume the most energy?

Let's split the consumption data by sector and create a box plot of the values for each sector:

![Boxplot of energy consumption by sector](data/images/sector_boxplot.png)

Here we see that the PEC Electric Power Sector has the highest energy consumption across all sectors. We also see there are sizable differences in the variability of energy consumption across sectors (e.g. some box plots have much larger interquartile ranges than others).

### A more granular view of energy consumption by sector

Although box plots give you some insight into the distribution of the underlying data in each category, they are still relatively blunt instruments. For example, how is the data distributed within the interquartile range? Between the edges of the box and the whiskers? Since a box plot is created from only five values, it cannot answer these fine-grained questions. A **strip plot**, however, draws a 1D scatter plot for each category, so every individual observation is visible and you get an even more granular view of the data:

![Strip plot by sector](data/images/sector_stripplot.png)
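
A strip plot can be sketched with a plain scatter plot by placing each category at an integer position and adding a small horizontal jitter so points do not overlap. The sector names and values below are hypothetical and synthetic; "Sector B" is deliberately bimodal to show a shape that a box plot would hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values for three hypothetical sectors (NOT the EIA values).
rng = np.random.default_rng(3)
samples = {
    "Sector A": rng.normal(100, 5, 60),
    "Sector B": np.concatenate([rng.normal(80, 3, 30), rng.normal(120, 3, 30)]),
    "Sector C": rng.uniform(85, 115, 60),
}

fig, ax = plt.subplots(figsize=(6, 4))
for position, (name, values) in enumerate(samples.items()):
    # Jitter spreads the points sideways; it carries no information itself.
    x = position + rng.uniform(-0.15, 0.15, len(values))
    ax.scatter(x, values, s=10, alpha=0.6)

ax.set_xticks(range(len(samples)))
ax.set_xticklabels(list(samples))
ax.set_ylabel("Value (synthetic units)")
ax.set_title("Strip plot by category")
```

In the resulting figure, the gap in the middle of "Sector B" is visible at a glance, whereas its box plot would show only a wide box.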

### Exercise 7 (optional)

From the above graphs, which sector has the widest range of PEC values? Which sector has the smallest range of PEC values? How could this information be useful to energy production companies?

**Answer.**

-------

## Conclusions

We've done an extensive analysis of energy consumption and electricity generation trends over time and across sectors.

We discovered that there is a peak in energy consumption and generation in the summer months of the year, and as time has passed the gap between peak and non-peak consumption has widened. This may present a market opportunity for a power plant which has the ability to expand and contract capacity as needed.

Finally, we saw that different sectors have very different energy consumption profiles. In particular, the electric power sector seems to be a significant driver of marginal demand. This means that the DoE might want to focus its efforts on reducing consumption in this sector or providing additional generation.

## Takeaways

Data visualization is a powerful tool in the data professional's toolkit for effective data exploration and communication. It helps you find interesting trends in your data that you would otherwise not notice. It is also crucial when communicating your findings to non-technical audiences. As such, it is an important skill to keep practicing and developing.

## Attribution

"Monthly Energy Review", US Energy Information Administration, [Public Domain](https://www.eia.gov/about/copyrights_reuse.php), https://www.eia.gov/totalenergy/data/monthly/index.php.

"Boxplot vs PDF" (modified), Jhguch, Creative Commons Attribution-Share Alike 2.5 Generic license, https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg

"Dash styleguide", Chris P., [MIT License](https://blog.codepen.io/documentation/terms-of-service/), https://codepen.io/chriddyp/pen/bWLwgP