<center>
    <font size=5>Home and Cabin 
        Power Consumption 2021</font>
</center>

##### Sources:

# General
Notebook to analyse electricity consumption in my parents house and in the cottage.

<u><font size=4>Motivation / Objective:</font></u>
* Investigate the dataset using interactive plotting tools in python
* Catch trends seasonal and cyclic patterns in data
* Forecast power consumption based on time-series data

<u><font size=4>Data:</font></u>
* **Data type:** Tabular Data
    Hourly Power Consumption Dataset:
    * Data source: Eesti Energia AS, Estonian main electricity prowider company
    * Data download date: 25.01.2021
    * Data range: 01.01.2021 00:00 - 01.01.2022 00:00
    * Data given: hourly consumption rate in **kwh** - kilotwatt-hours

    Monthly Power Consumption Dataset:
    * Data source: Eesti Energia AS, 
    * Data download date: 25.01.2021
    * Data range 2020-2022
    * Monthly consumption summary statistics for years 2020 & 2021:
        * Daily 
        * Nightly
        * Total
* **Problem Type:** Predict Power consumption Supervised Time-Series Regresion

# Imports

In [385]:
import pandas as pd
import numpy as np
import re

import pandas_bokeh
from bokeh.io import curdoc

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

In [386]:
pandas_bokeh.output_notebook()

# default plot theme for bokeh
curdoc().theme = 'dark_minimal'

# Data

## Classes, functions

In [254]:
## CLASSES ##
class DFDtypeMapper(BaseEstimator, TransformerMixin):
    """Remap pandas dataframe dtypes.
    Parameters
    ----------
    dtype_dict : dict, {'dtype':[col_name]}
        Dictionary of dtypes as keys and values as list of column names. 
    
    Returns
    -------
    DataFrame : pd.DataFrame"""
    def __init__(self, dtype_dict : dict):
        self.dtype_dict = dtype_dict
        self.transformed_column_names = None 
    
    def fit(self, X, y=None):
        self.all_columns_ = X.columns
        return self
    
    def get_feature_names_out(self, input_features=None) -> np.ndarray:
        check_is_fitted(self)
        return self.all_columns_
    
    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        # remove columns that are not in X
        _dtype_dict = {}
        for dtype, val in self.dtype_dict.items():
            if isinstance(val, str):
                if val in X_.columns: 
                    _dtype_dict[dtype] = val
            elif type(val) not in [tuple, list, np.ndarray]:
                raise ValueError(f'Wrong type for {self.dtype_dict} value.')
            else:
                _dtype_dict[dtype] = [col for col in val 
                                           if col in X_.columns]
        
        for dtype in _dtype_dict:
            X_[_dtype_dict[dtype]] = X_[_dtype_dict[dtype]].astype(dtype)
        
        return X_

class DFValueMapper(BaseEstimator, TransformerMixin):
    """Rename values in column based on dictionary.
    Parameters
    ----------
    map_dict : dict 
        Dictionary of old mappings to new.
    cat_only : bool, default True
        - If True: consider category dtype columns only
        - If False: apply to all columns. Computationally more expensve.
    
    Returns
    -------
    DataFrame : pd.DataFrame
        Remapped pandas DataFrame."""
    def __init__(self, map_dict : dict, cat_only=True):
        self.cat_only = cat_only
        self.map_dict = map_dict
    def fit(self, X, y=None):
        self.all_columns_ = X.columns
        return self
    def get_feature_names_out(self, input_features=None) -> np.ndarray:
        check_is_fitted(self)
        return self.all_columns_
    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        # categorical features
        if self.cat_only:
            cat_cols = X_.columns[(X_.dtypes == 'category').values]
            X_[cat_cols] = X_[cat_cols].apply(
                lambda x: x.cat.rename_categories(self.map_dict))
            return X_
        else:
            return X_.replace(self.map_dict)

## FUNCTIONS ##
def datetime_gaps(df : pd.DataFrame, column : str, freq='D'):
    """Display time series frequencies and gaps.
    
    Parameters
    ----------
    column : str, DataFrame column or index name.
    freq : str, default 'D'
        Predominant frequency of the datetime column/index."""
    
    df = df.reset_index()
    date_range = pd.date_range(df[column][0], df[column].iloc[-1], freq=freq)
    df[column] = df[column].astype(f"period[{freq}]")
    
    # find frequencies
    temp = df.groupby([column]).sum().reset_index()
    freqs = (temp.loc[:,column]# frequencies
        .diff()
        .value_counts(dropna=False)
        .to_frame())
    print(f"Frequencies")
    display(freqs)
    
    # find gaps
    gaps = date_range.difference(df[column])
    if len(gaps) == 0:
        print(f"No gaps in {column}.")
    else:
        print(f"{len(gaps)} gaps in datetime:")
        return gaps



## Hourly Usage

In [91]:
# load the data
hourly = pd.read_csv(
    'data/tarbimine_tund.csv', 
    header=4, sep=';',
    index_col=False,
    names=['start', 'end', 'cabin', 'home'],
    parse_dates=['start', 'end'],
    decimal=',')

hourly.head(3)

Unnamed: 0,start,end,cabin,home
0,2021-01-01 00:00:00,2021-01-01 01:00:00,0.16,0.86
1,2021-01-01 01:00:00,2021-01-01 02:00:00,0.12,0.737
2,2021-01-01 02:00:00,2021-01-01 03:00:00,0.12,1.377


In [92]:
hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   start   8760 non-null   datetime64[ns]
 1   end     8760 non-null   datetime64[ns]
 2   cabin   8759 non-null   float64       
 3   home    8759 non-null   float64       
dtypes: datetime64[ns](2), float64(2)
memory usage: 273.9 KB


## Monthly Usage
Data in summarized tabular form:
* 3 location chunks:
    * cabin
    * home
    * both summarized
* 3 year chuncks per location
    * 2020 fully
    * 2021 fully
    * 2022 partially
* Each year with monthly index and overall summary in the end.

In [4]:
# read in the data with location headings
monthly_use = pd.read_csv(
    'data/tarbimine.csv', 
    header=None, sep=';',
    skiprows=4,
    index_col=False,
    decimal=',',
    names=['month', 'day', 'night', 'total'])

monthly_use.head()

Unnamed: 0,month,day,night,total
0,Mõõtepunkti aadress: Sutu,,,
1,Tarbimine 2020. aastal,,,
2,,Päev (kWh),Öö (kWh),Kokku (kWh)
3,Jaanuar,39153,40772,79925
4,Veebruar,31930,38141,70071


### Clean & Reshape

In [16]:
temp = monthly_use.copy()

# references where to break the temp
locs = ['Sutu', 'Kuressaare', 'summeeritud']

# extract location
temp['locale'] = temp.month.str.extract(fr"({locs[0]}$|{locs[1]}$|{locs[2]})")
temp['locale'] = temp.locale.fillna(method='ffill')
temp['locale'] = temp.locale.rename({'Sutu':'cabin',
                                 'Kuressaare':'home',
                                 'summeeritud': 'total'})
# extract year
temp['year'] = temp.month.str.extract(r"^Tarbimine\s(\d{4})\.\saastal")
temp['year'] = temp.year.fillna(method='ffill')

# drop unneccessary rows
temp.dropna(inplace=True) # get rid of references

# convert comma decimals to points
temp.loc[:, 'day':'total'] = (
    temp.loc[:, 'day':'total']
    .apply(lambda x: x.str.replace(',', '.')))

# conversion dictionaries
dtype_dct = {'category':['month', 'locale', 'year'],
             'float':['day', 'night', 'total']}
map_dct = {  'Jaanuar':'Jan',
             'Veebruar': 'Feb',
             'Märts': 'Mar',
             'Aprill': 'Apr',
             'Mai': 'May',
             'Juuni': 'Jun',
             'Juuli': 'Jul',
             'August': 'Aug',
             'September': 'Sep',
             'Oktoober': 'Oct',
             'November': 'Nov',
             'Detsember': 'Dec',
             'Aasta kokku': 'Yearly',
             'Sutu': 'cabin',
             'Kuressaare': 'home',
             'summeeritud': 'total'}

# convert dtypes & remap values
temp = DFDtypeMapper(dtype_dct).fit_transform(temp)
temp = DFValueMapper(map_dct).fit_transform(temp)
monthly_stacked = temp.reset_index(drop=True)
monthly_stacked.head(3)

Unnamed: 0,month,day,night,total,locale,year
0,Jan,39.153,40.772,79.925,cabin,2020
1,Feb,31.93,38.141,70.071,cabin,2020
2,Mar,27.137,31.569,58.706,cabin,2020


In [17]:
# write cleaned df to csv
# monthly_stacked.to_csv('data/monthly_useage_clean.csv', index=False)

In [24]:
monthly = (
    monthly_stacked
    .query("year in ['2020','2021']")
    .set_index(['year', 'locale', 'month'])
    .sort_index())
monthly

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,day,night,total
year,locale,month,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020,home,Yearly,2598.340,2000.977,4599.317
2020,home,Apr,183.016,140.080,323.096
2020,home,Aug,207.633,164.666,372.299
2020,home,Dec,289.374,176.477,465.851
2020,home,Jan,211.679,162.732,374.411
...,...,...,...,...,...
2021,total,Mar,315.632,228.950,544.582
2021,total,Nov,310.882,187.765,498.647
2021,total,Oct,202.362,196.871,399.233
2021,total,Sep,200.281,163.210,363.491


## Holidays

In [319]:
holidays = pd.read_csv('data/holidays_estonia_2021.csv',
                       parse_dates=['date'])
holidays.head(3)

Unnamed: 0,date,description,type
0,2021-01-01,Uusaasta,"Riigipüha, puhkepäev"
1,2021-02-24,"Iseseisvuspäev, Eesti Vabariigi aastapäev","Rahvuspüha, puhkepäev"
2,2021-04-02,Suur reede,"Riigipüha, puhkepäev"


In [320]:
holidays.dtypes

date           datetime64[ns]
description            object
type                   object
dtype: object

# EDA

## Validate
##### Summary Stats
Validate summary statistics in <code>monthly</code> df.

In [93]:
hourly[['cabin', 'home']].sum()

cabin    1382.036
home     4663.473
dtype: float64

In [94]:
monthly.loc['2021',['cabin', 'home'],'Yearly'].total

year  locale  month 
2021  cabin   Yearly    1382.036
      home    Yearly    4663.473
Name: total, dtype: float64

## NaN-s

In [105]:
hourly_eda = hourly.copy()
hourly_eda[hourly_eda.isna().any(axis='columns')]

Unnamed: 0,start,end,cabin,home
2067,2021-03-28 03:00:00,2021-03-28 04:00:00,,


That NaN correspond to the switch from wintertime to daylight saving time in Estonia.

In [134]:
temp = hourly.copy()
temp = temp.set_index('end')[['cabin', 'home']]

# check duplicates
print(f"Has duplicates: {temp.index.has_duplicates}")

# potential hours missing in the data
print(f"n_rows: {temp.shape[0]}")
temp.index = temp.index.to_period(freq='H')
temp = temp.reset_index().groupby(['end']).cabin.sum().reset_index()
temp.end.diff().value_counts(dropna=False)

Has duplicates: False
n_rows: 8760


<Hour>    8759
NaT          1
Name: end, dtype: int64

No hours missing from the data. Check time series around when switching to wintertime at 4:00, last sunday in October. The the clock is turned back an hour.

In [147]:
temp = hourly.copy().set_index('start')
temp = temp[(temp.index.month==10) & # october
            (temp.index.weekday==6) & # sunday
            ((temp.index.hour>0) & (temp.index.hour<8))] # between 1am-8am
temp[temp.index.day == temp.index.day.max()].reset_index()

Unnamed: 0,start,end,cabin,home
0,2021-10-31 01:00:00,2021-10-31 02:00:00,0.607,0.171
1,2021-10-31 02:00:00,2021-10-31 03:00:00,0.095,0.138
2,2021-10-31 03:00:00,2021-10-31 04:00:00,0.138,0.958
3,2021-10-31 04:00:00,2021-10-31 05:00:00,0.153,0.174
4,2021-10-31 05:00:00,2021-10-31 06:00:00,0.062,0.137
5,2021-10-31 06:00:00,2021-10-31 07:00:00,0.056,0.153
6,2021-10-31 07:00:00,2021-10-31 08:00:00,0.106,0.864


Since there are no duplicated entries in the index we can assume that from 4am-5am holds summed data for 1 hour of sumemrtime and 1 hour of wintertime.

## Feature Engineering
In order to inspect the Time Series data we'are going to add some basic time-related information.

In [364]:
hourly_eda = hourly.copy()
hourly_eda = (
    hourly_eda
    .fillna(0)
    .rename({'end':'time'}, axis='columns')
    .set_index('time') # time as index
    .loc[:, 'cabin': 'home']) # drop start time

# add features
hourly_eda['month'] = hourly_eda.index.month
hourly_eda['day'] = hourly_eda.index.day
hourly_eda['hour'] = hourly_eda.index.hour
hourly_eda['day_of_week'] = hourly_eda.index.weekday
hourly_eda['is_weekend'] = hourly_eda.day_of_week > 4
hourly_eda['is_winter'] = (hourly_eda.month > 11) | (hourly_eda.month < 4)
hourly_eda['is_summer'] = (hourly_eda.month > 5) | (hourly_eda.month < 9)
hourly_eda.head(1)

Unnamed: 0_level_0,cabin,home,month,day,hour,day_of_week,is_weekend,is_winter,is_summer
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2021-01-01 01:00:00,0.16,0.86,1,1,1,4,False,True,True


##### Summer/Winter Time
* Transition to summertime (DST) : Last Sunday in March at 3:00
* Transition to wintertime : Last Sunday in October at 4:00

In [365]:
# last sunday in march at 3am
to_summer_time = (
    hourly_eda
    .query("month == 3 & day_of_week == 6 & hour == 4")
    .index.max())

# last sunday in october at 4am
to_winter_time = (
    hourly_eda
    .query("month == 10 & day_of_week == 6 & hour == 4")
    .index.max())

hourly_eda['is_dst'] = False
hourly_eda.loc[to_summer_time:to_winter_time, 'is_dst'] = True
hourly_eda.head(1)

Unnamed: 0_level_0,cabin,home,month,day,hour,day_of_week,is_weekend,is_winter,is_summer,is_dst
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2021-01-01 01:00:00,0.16,0.86,1,1,1,4,False,True,True,False


##### Day & Night Rate
**Day rate:** 
* 7-23 during wintertime
* 8-24 during daylight saving time (DST)

**Night rate:**
* 23-7 during wintertime
* 24-8 during summertime
* during national holidays if it does not land on weekday

In [366]:
temp = hourly_eda.copy()

temp['rate'] = 'day'
temp.loc[
    temp.query("(hour <= 7 | hour > 23) & (is_dst == False)").index,
    'rate'] = 'night'
temp.loc[temp.query("hour <= 8 & is_dst == True").index, 'rate'] = 'night'
temp.loc[temp.query("is_weekend == True").index, 'rate'] = 'night'

hourly_eda = temp
hourly_eda.head(1)

Unnamed: 0_level_0,cabin,home,month,day,hour,day_of_week,is_weekend,is_winter,is_summer,is_dst,rate
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2021-01-01 01:00:00,0.16,0.86,1,1,1,4,False,True,True,False,night


##### Holidays

In [377]:
holidays.head()

Unnamed: 0,date,description,type
0,2021-01-01,Uusaasta,"Riigipüha, puhkepäev"
1,2021-02-24,"Iseseisvuspäev, Eesti Vabariigi aastapäev","Rahvuspüha, puhkepäev"
2,2021-04-02,Suur reede,"Riigipüha, puhkepäev"
3,2021-04-04,Ülestõusmispühade 1. püha,"Riigipüha, puhkepäev"
4,2021-05-01,Kevadpüha,"Riigipüha, puhkepäev"


In [375]:
temp = hourly_eda.copy()
hol = holidays.copy()
hol = hol.set_index('date')

temp['dummy_index'] = pd.to_datetime(temp.index.date)
temp = temp.reset_index().set_index('dummy_index')

# merge holidays to hourly
temp = temp.merge(hol['description'], how='left', 
                  left_index=True, right_index=True)

temp['is_holiday'] = temp.description.notna()
temp['description'] = temp.description.fillna('normal')

hourly_eda = temp.set_index('time')
hourly_eda.head(1)

Unnamed: 0_level_0,cabin,home,month,day,hour,day_of_week,is_weekend,is_winter,is_summer,is_dst,rate,description,is_holiday
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2021-01-01 01:00:00,0.16,0.86,1,1,1,4,False,True,True,False,night,Uusaasta,True


## Plots

In [403]:
hourly_eda[['cabin', 'home']][hourly_eda[['cabin', 'home']] > 0].describe()

Unnamed: 0,cabin,home
count,8391.0,8759.0
mean,0.164705,0.532421
std,0.210122,0.5184
min,0.001,0.025
25%,0.037,0.171
50%,0.098,0.351
75%,0.182,0.737
max,2.167,5.138


In [419]:
temp = hourly_eda.copy()
temp[['cabin', 'home']] = temp[['cabin', 'home']] * 1000
temp.head()

Unnamed: 0_level_0,cabin,home,month,day,hour,day_of_week,is_weekend,is_winter,is_summer,is_dst,rate,description,is_holiday
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2021-01-01 01:00:00,160.0,860.0,1,1,1,4,False,True,True,False,night,Uusaasta,True
2021-01-01 02:00:00,120.0,737.0,1,1,2,4,False,True,True,False,night,Uusaasta,True
2021-01-01 03:00:00,120.0,1377.0,1,1,3,4,False,True,True,False,night,Uusaasta,True
2021-01-01 04:00:00,160.0,170.0,1,1,4,4,False,True,True,False,night,Uusaasta,True
2021-01-01 05:00:00,120.0,252.0,1,1,5,4,False,True,True,False,night,Uusaasta,True


In [422]:
np.arange(0, 5501, 500)

array([   0,  500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000,
       5500])

In [425]:
temp[['cabin', 'home']].plot_bokeh(
    kind='hist', 
    bins=np.arange(0, 5501, 500),
    histogram_type='sidebyside',
    hovertool=True,
    title="Consumption Distributions",
    line_color='white',
    ylabel='Counts',
    use_index=False,
    xlabel='Consumption [Wh]',
    colormap=['blue', 'green'])

## Observations:
**Feature Engineering:**
* is_winter : 4 months - Dec, Jan, Feb, Mar. Selected by cold weather rather than winter months per se.
* is_winter : Jun, Jul, Aug
* is_dst : Daylight Saving Time (Mar 28 3am - Oct 31 4am)
* rate : daily or nightly price rate
* description : type of day, if ordinary day == "normal"
* is_holiday : if national holiday that day or not
    