# HW9

### Author: Joseph Wong

## Import Packages and the Data Set

In [9]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
import seaborn as sns

# Datetime packages
from datetime import datetime

In [10]:
df = pd.read_csv('data/opsd_germany_daily.csv', parse_dates = True, index_col=0)
df

Date,Consumption,Wind,Solar,Wind+Solar
2006-01-01,1069.18400,,,
2006-01-02,1380.52100,,,
2006-01-03,1442.53300,,,
2006-01-04,1457.21700,,,
2006-01-05,1477.13100,,,
...,...,...,...,...
2017-12-27,1263.94091,394.507,16.530,411.037
2017-12-28,1299.86398,506.424,14.162,520.586
2017-12-29,1295.08753,584.277,29.854,614.131
2017-12-30,1215.44897,721.247,7.467,728.714


## Data Basics and Preparation


#### Variables

In [32]:
df.shape

(4383, 4)

There are 4383 observations and 4 variables.

In [33]:
df.index.name

'Date'

In [34]:
df.dtypes

Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object

**Index**
The data set is indexed by date (yyyy-mm-dd format).

**Variables**
Electricity production and consumption are reported as daily totals in gigawatt-hours (GWh). The columns of the data file are:
- Consumption (float): Electricity consumption in GWh
- Wind (float): Wind power production in GWh
- Solar (float): Solar power production in GWh
- Wind+Solar (float): Sum of wind and solar power production in GWh

In [35]:
df.describe()

,Consumption,Wind,Solar,Wind+Solar
count,4383.0,2920.0,2188.0,2187.0
mean,1338.675836,164.814173,89.258695,272.663481
std,165.77571,143.692732,58.550099,146.319884
min,842.395,5.757,1.968,21.478
25%,1217.859,62.35325,35.17925,172.1855
50%,1367.123,119.098,86.407,240.991
75%,1457.761,217.90025,135.0715,338.988
max,1709.568,826.278,241.58,851.556


The median consumption is 1367.123 GWh, while the median wind+solar production is only 240.991 GWh. This suggests that other electricity sources contribute to consumption but are not recorded as production columns in the data set. Although one might expect the statistics of wind+solar to be the sums of the corresponding wind and solar statistics, they are not (e.g. max wind+solar ≠ max wind + max solar: 851.556 vs 826.278 + 241.58 = 1067.858). Each wind+solar value is a daily total, so its min and max describe specific days, and wind and solar generally do not peak on the same day. Each column's statistics are also computed over its own non-missing rows (2920, 2188, and 2187 respectively), so even the means do not add up. In the data set, wind typically produces more per day than solar (median of 119.098 vs 86.407 GWh and mean of 164.814173 vs 89.258695 GWh, respectively).
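The point about the maximum can be illustrated with a minimal sketch. The three rows below are made-up toy values (not taken from the OPSD data set), chosen so that wind and solar peak on different days; the name `toy` is hypothetical.

```python
import pandas as pd

# Toy values (NOT from the OPSD data): wind peaks on day 2, solar on day 1
toy = pd.DataFrame(
    {"Wind": [10.0, 50.0, 20.0], "Solar": [40.0, 5.0, 30.0]},
    index=pd.date_range("2017-01-01", periods=3, freq="D"),
)
toy["Wind+Solar"] = toy["Wind"] + toy["Solar"]

# Maximum of the daily sums: the best single day for the combined output
max_of_sum = toy["Wind+Solar"].max()

# Sum of the column maxima: combines two different days
sum_of_max = toy["Wind"].max() + toy["Solar"].max()

print(max_of_sum, sum_of_max)
```

The maximum of the sum can never exceed the sum of the maxima, and the two are equal only when both columns peak on the same day.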

#### NaNs

In [36]:
df.isna().sum()

Consumption       0
Wind           1463
Solar          2195
Wind+Solar     2196
dtype: int64

Consumption has no NaNs, while wind has 1463, solar has 2195, and wind+solar has 2196. Any analysis should take this into account, because just over half of the 4383 observations are missing for solar and for wind+solar, and about a third are missing for wind. This could potentially affect the outcome of the analysis.
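Beyond the raw counts, it may help to look at the share of missing values per column and at the date on which each column first has data. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `df`); the same two expressions should apply to `df` directly.

```python
import numpy as np
import pandas as pd

# Toy frame (NOT the OPSD data) with leading NaNs in two columns
toy = pd.DataFrame(
    {
        "Consumption": [1.0, 2.0, 3.0, 4.0],
        "Wind": [np.nan, 5.0, 6.0, 7.0],
        "Solar": [np.nan, np.nan, np.nan, 8.0],
    },
    index=pd.date_range("2006-01-01", periods=4, freq="D"),
)

# Fraction of missing values per column (the mean of a boolean mask)
nan_share = toy.isna().mean()

# First date on which each column has a non-missing value
first_valid = {col: toy[col].first_valid_index() for col in toy.columns}

print(nan_share)
print(first_valid)
```

If the NaNs in `df` turn out to be concentrated at the start of the record, restricting comparisons to the dates where all columns are present would be one way to keep them fair.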

#### Date Range and Frequency

In [46]:
df.index.min()

Timestamp('2006-01-01 00:00:00')

In [47]:
df.index.max()

Timestamp('2017-12-31 00:00:00')

In [48]:
df.index.value_counts().value_counts()

count
1    4383
Name: count, dtype: int64

The data set contains dates from January 1, 2006 to December 31, 2017, and each date appears exactly once, so each observation represents one day. The 4383 observations cover 12 full years: 12 × 365 = 4380 days plus 3 leap days (in 2008, 2012, and 2016).
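The day count can be double-checked by building the expected daily calendar for the same range. This is a minimal sketch; `expected` is a hypothetical name, and the commented-out line assumes `df` is loaded as above.

```python
import pandas as pd

# Every calendar day from the first to the last date in the data set
expected = pd.date_range("2006-01-01", "2017-12-31", freq="D")

n_days = len(expected)
# Count the February 29ths in the range
leap_days = int(((expected.month == 2) & (expected.day == 29)).sum())

print(n_days, leap_days)

# With the real data loaded, this would list any dates missing from the index:
# expected.difference(df.index)
```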

#### Add Columns: Year, Month, and Weekday Name

In [68]:
df['Year'] = df.index.year
df['Month'] = df.index.month
df['Weekday Name'] = df.index.strftime("%a")
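One thing to watch for: `strftime("%a")` returns plain strings, which sort alphabetically (Fri, Mon, Sat, ...) in group-by results and plots. A possible remedy, sketched here on a one-week toy frame (`toy` and `order` are hypothetical names, and English day abbreviations are assumed), is to convert the column to an ordered categorical.

```python
import pandas as pd

# One week of toy data; 2006-01-01 is a Sunday
idx = pd.date_range("2006-01-01", periods=7, freq="D")
toy = pd.DataFrame({"Consumption": range(7)}, index=idx)
toy["Weekday Name"] = toy.index.strftime("%a")

# Assumed ordering: Monday first, English abbreviations
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
toy["Weekday Name"] = pd.Categorical(
    toy["Weekday Name"], categories=order, ordered=True
)

# Sorting now follows the calendar order rather than the alphabet
sorted_days = toy.sort_values("Weekday Name")["Weekday Name"].tolist()
print(sorted_days)
```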

## Data Exploration - Basic Visualization

Start to make plots and see if you can generate some questions about the data. Make sure that you make observations about each plot - say what you see and what it means in terms of the data.

- Plot the overall consumption over time.
- Plot the wind and solar production over time.

- Choose a focal year and redo the plots to look at variability over the year.
- Redo this for a focal month.
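Because the frame is indexed by date, a focal year or month can be selected with partial-string indexing on `.loc`. The sketch below uses a synthetic frame with constant values on the same daily index (`toy` is a hypothetical stand-in for `df`, and 2016 is an arbitrary choice of focal year).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: constant values on the same daily index
idx = pd.date_range("2006-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"Consumption": np.ones(len(idx))}, index=idx)

# Partial-string indexing selects every row in the given year or month
focal_year = toy.loc["2016"]
focal_month = toy.loc["2016-02"]

print(len(focal_year), len(focal_month))
```

Each slice keeps its `DatetimeIndex`, so it can be passed to the same plotting code as the full series.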

In [69]:
df

Date,Consumption,Wind,Solar,Wind+Solar,Year,Month,Weekday Name
2006-01-01,1069.18400,,,,2006,1,Sun
2006-01-02,1380.52100,,,,2006,1,Mon
2006-01-03,1442.53300,,,,2006,1,Tue
2006-01-04,1457.21700,,,,2006,1,Wed
2006-01-05,1477.13100,,,,2006,1,Thu
...,...,...,...,...,...,...,...
2017-12-27,1263.94091,394.507,16.530,411.037,2017,12,Wed
2017-12-28,1299.86398,506.424,14.162,520.586,2017,12,Thu
2017-12-29,1295.08753,584.277,29.854,614.131,2017,12,Fri
2017-12-30,1215.44897,721.247,7.467,728.714,2017,12,Sat


## Further Exploration

Now continue exploring the data to see what you can find out. Remember to explain what you are learning from each graph or calculation. Add guiding words in markdown to talk about what your code should be doing and why.

- How does seasonality affect the energy consumption? Consider the consumption grouped on a monthly basis. You could look at max, min, mean, etc. Make an interesting plot of this data (bar plot, box plot, etc). What do you learn?

- How does the day of the week change energy consumption?

- Using downsampling, plot on the same graph the daily (original data) and the average weekly (downsampled data) production for both solar and wind.

- Using downsampling, plot the yearly rolling average of both wind and solar production.

- See if you can come up with a really cool graph of your own!
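The grouping, downsampling, and rolling steps listed above could be computed along the following lines. This is a sketch on synthetic data, not the OPSD values: `toy` is a hypothetical stand-in for `df`, with consumption set to 100 on weekdays and 80 on weekends, and wind set to a simple ramp so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily frame (NOT the OPSD data)
idx = pd.date_range("2016-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame(index=idx)
toy["Consumption"] = np.where(idx.dayofweek < 5, 100.0, 80.0)
toy["Wind"] = np.arange(len(idx), dtype=float)
toy["Month"] = idx.month
toy["Weekday Name"] = idx.strftime("%a")

# Seasonality: mean consumption for each calendar month
monthly_mean = toy.groupby("Month")["Consumption"].mean()

# Day-of-week effect: mean consumption for each weekday
weekday_mean = toy.groupby("Weekday Name")["Consumption"].mean()

# Downsampling: weekly mean (bins end on Sunday by default)
weekly_wind = toy["Wind"].resample("W").mean()

# Yearly rolling average: centered 365-day window over the daily values
rolling_wind = toy["Wind"].rolling(window=365, center=True).mean()

print(weekday_mean)
```

The rolling mean is NaN wherever the full 365-day window is not available, so the resulting curve is shorter than the daily series at both ends.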

## Conclusion