# Seasonal Regression

The electricity demand series shows daily, weekly, and annual oscillations.  At a short time scale, I aim to capture the first two of these.

The approaches I've seen suggest Fourier Series, linear regression, and Seasonal ARIMA.
(I got quite stuck on how to detrend the series in a global fashion.)
I will focus on building models on the last two weeks of data, with the goal of predicting electricity demand based on temperature, time of day, and day of week.

## Hyndman's Multiple Seasonal Exponential Smoothing

This follows Rob Hyndman's approach towards multi-seasonal exponential smoothing.  (This generalizes the apparently well-known Holt-Winters smoothing).
His analysis includes forecasts of electricity generation, based on utility data (from well over 10 years ago).

I chose to follow this model since initial attempts at ARIMA rely on removing the seasonality, and I had hoped to just follow best practice with existing libraries.  Initial naive methods gave very poor results: they failed to remove the seasonal pattern or, even worse, imposed one.  An initial attempt at Fourier filtering on over a year of data also left a 

Hyndman also seems to be a known author within the field of econometric time-series forecasting.  

The original model for a variable $y_t$, with seasonal pattern with period $m$ is
\begin{align}
  y_t &= l_{t-1}+b_{t-1} +s_{t-m} +\epsilon_t\\
  l_t &= l_{t-1} + b_{t-1} + \alpha\epsilon_t\\
  b_t &= b_{t-1} + \beta\epsilon_t\\
  s_{t} &= s_{t-m} + \gamma \epsilon_t
\end{align}
where $l_t$, $b_t$, $s_t$ are the level, trend and seasonal patterns respectively.
The noise is Gaussian and obeys
$E[\epsilon_t]=0, E[\epsilon_t\epsilon_s]=\delta_{ts}\sigma^2$, and $\alpha,\beta,\gamma$ are constants between zero and one.  (He notes that $m+2$ estimates must be made for the initial values of the level, trend and seasonal pattern).
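To make the recursion concrete, here is a minimal sketch of one pass of the single-season filter in plain Python. It assumes the standard additive Holt-Winters level update $l_t = l_{t-1}+b_{t-1}+\alpha\epsilon_t$ (the same form as the level equation in the multi-seasonal model), and the function names `holt_winters_step` and `run_filter` are my own, not Hyndman's.

```python
def holt_winters_step(y, level, trend, season, alpha, beta, gamma):
    """One step of the additive innovations recursion.

    `season` is the seasonal term from m steps ago, s_{t-m}.
    Returns (forecast, error, new_level, new_trend, new_season).
    """
    forecast = level + trend + season        # l_{t-1} + b_{t-1} + s_{t-m}
    err = y - forecast                       # epsilon_t
    new_level = level + trend + alpha * err  # l_t
    new_trend = trend + beta * err           # b_t
    new_season = season + gamma * err        # s_t
    return forecast, err, new_level, new_trend, new_season


def run_filter(ys, level, trend, seasons, alpha, beta, gamma):
    """Run the recursion over a series.

    `seasons` holds the m initial seasonal values (so m + 2 initial
    values are needed in total, as noted above).
    Returns the one-step errors and the final state.
    """
    seasons = list(seasons)  # work on a copy
    m = len(seasons)
    errors = []
    for t, y in enumerate(ys):
        _, err, level, trend, seasons[t % m] = holt_winters_step(
            y, level, trend, seasons[t % m], alpha, beta, gamma)
        errors.append(err)
    return errors, level, trend, seasons
```

If the initial state matches the data exactly, every one-step error is zero and the smoothing constants have no effect, which is a handy sanity check.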

Hyndman's model allows multiple seasons, and allows the sub-seasonal terms to be updated more quickly than once per large season.  In utility data, the short season is the daily oscillation, while the longer season comes from the weekly oscillation induced by the work week.  For hourly data, the daily cycle has length $m_1=24$, with the weekly cycle taking $m_2=168$.  The ratio between them is $k=m_2/m_1=7.$  The number of seasonal patterns is $r\le k$.  

(I'm going to change Hyndman's notation to use $\mathbf{I}$ to denote indicator/step functions).
\begin{align}
  y_t &= l_{t-1}+b_{t-1} +\sum_{i=1}^r \mathbf{I}_{t,i}s_{i,t-m_1} +\epsilon_t\\
  s_{i,t} &= s_{i,t-m_1} + \sum_{j=1}^r\left(\gamma_{ij}\mathbf{I}_{t,j}\right) \epsilon_t \quad (i=1,2,\ldots,r)\\
  l_t &= l_{t-1} + b_{t-1}+\alpha\epsilon_t\\
  b_t &= b_{t-1} + \beta\epsilon_t
\end{align}
Here the indicator functions $\mathbf{I}_{t,i}$ are unity if $t$ is in the seasonal pattern $i$, and zero otherwise.  For utility data, this will probably be weekday and holiday/weekend.  Here $\gamma_{ij}$ denotes how much one seasonal pattern is updated based on another---Hyndman proposes a number of restrictions on these parameters.
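As a rough sketch of how one update of the multi-seasonal model could look in code, assuming exactly one seasonal pattern is active at each time (so the sum over indicators picks out a single column of $\gamma_{ij}$). The function name `multi_seasonal_step` and the storage layout (one list of $m_1$ values per pattern) are my own choices.

```python
def multi_seasonal_step(y, t, active, level, trend, seasons,
                        alpha, beta, gamma, m1=24):
    """One update of the multi-seasonal recursion.

    y       - observed value at time t
    active  - index j of the pattern with I_{t,j} = 1 (e.g. 0 weekday, 1 weekend)
    seasons - list of r lists, each of length m1; updated IN PLACE
    gamma   - r x r nested list, gamma[i][j] = how much pattern i is
              updated when pattern j is the active one
    Returns (forecast, error, new_level, new_trend).
    """
    h = t % m1                                    # position within the short cycle
    forecast = level + trend + seasons[active][h]
    err = y - forecast                            # epsilon_t
    for i in range(len(seasons)):
        # s_{i,t} = s_{i,t-m1} + gamma_{i,active} * epsilon_t
        seasons[i][h] += gamma[i][active] * err
    new_level = level + trend + alpha * err
    new_trend = trend + beta * err
    return forecast, err, new_level, new_trend
```

For the weekday/weekend split, `active` could be taken as `int(timestamp.dayofweek >= 5)`. A diagonal `gamma` means each pattern only learns from its own days; off-diagonal entries let a surprise on a weekday also shift the weekend pattern.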

I will extend this to include external variables for the deviation above or below given temperatures, so that $y_t\rightarrow y_t+\tau_p\Theta(T_t-T_p)+\tau_{n}\Theta(T_n-T_t)$.
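A minimal sketch of the temperature term, reading $\Theta$ as the Heaviside step function exactly as the formula is written. That reading is an assumption on my part: if "deviation" is meant literally, a ramp such as `max(0.0, temp - t_p)` would replace the step. The thresholds and coefficients in the example are made-up numbers, not fitted values.

```python
def heaviside(x):
    """Step function: 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


def temperature_adjustment(temp, t_p, t_n, tau_p, tau_n):
    """Extra demand tau_p*Theta(T - T_p) + tau_n*Theta(T_n - T).

    t_p, t_n     - upper (hot) and lower (cold) temperature thresholds
    tau_p, tau_n - size of the demand shift above t_p and below t_n
    """
    return tau_p * heaviside(temp - t_p) + tau_n * heaviside(t_n - temp)


# Hypothetical thresholds: the adjustment is zero in the comfortable
# band between t_n and t_p, and switches on outside it.
mild = temperature_adjustment(15.0, t_p=25.0, t_n=5.0, tau_p=300.0, tau_n=200.0)
hot = temperature_adjustment(30.0, t_p=25.0, t_n=5.0, tau_p=300.0, tau_n=200.0)
cold = temperature_adjustment(0.0, t_p=25.0, t_n=5.0, tau_p=300.0, tau_n=200.0)
```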

He suggests using the first four weeks of data to estimate the parameters, by minimizing the squared error of the one-step ahead forecast.  Apparently maximum likelihood estimation was not recommended (10 years ago).

So how to fit the parameters?  A really simple approach would be gradient descent?  Intuitively, the level is the average value, the bias is the average gradient.  The seasonality is the average seasonal pattern.  (This is the dumb STL decomposition used earlier?)
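Before trying gradient descent, an even simpler option is a coarse grid search over the smoothing constants, scoring each combination by the sum of squared one-step-ahead errors. The sketch below does this for the single-season model only, takes the initial level, trend and seasonal values as given, and uses a grid I picked arbitrarily. It is a stand-in for a proper optimizer, not Hyndman's procedure.

```python
import itertools


def one_step_sse(ys, level, trend, seasons, alpha, beta, gamma):
    """Sum of squared one-step-ahead errors for the additive model."""
    seasons = list(seasons)  # work on a copy
    m = len(seasons)
    sse = 0.0
    for t, y in enumerate(ys):
        s = seasons[t % m]
        err = y - (level + trend + s)
        # update the state using the OLD trend in the level equation
        level, trend = level + trend + alpha * err, trend + beta * err
        seasons[t % m] = s + gamma * err
        sse += err * err
    return sse


def grid_search(ys, level, trend, seasons,
                grid=(0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9)):
    """Try every (alpha, beta, gamma) on the grid.

    Returns (best_sse, alpha, beta, gamma); ties keep the first hit.
    """
    best = None
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        sse = one_step_sse(ys, level, trend, seasons, alpha, beta, gamma)
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    return best
```

With seven grid points per parameter this is only 343 passes over the data, which is cheap for four weeks of hourly values. The grid minimum could then seed a gradient-based refinement.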

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from get_weather_data import convert_isd_to_df, convert_state_isd
from EBA_util import remove_na, avg_extremes

In [7]:
air_df = pd.read_csv('data/air_code_df.gz')

#Just get the weather station data for cities in Oregon.
df_weather=convert_state_isd(air_df,'OR')
#Select temperature for Portland, OR
msk1=np.array(df_weather['city']=='Portland')
msk2=np.array(df_weather['state']=='OR')

df_pdx_weath=df_weather.loc[msk1&msk2]

#get electricity data for Portland General Electric
df_eba=pd.read_csv('data/EBA_time.gz',index_col=0,parse_dates=True)
msk=df_eba.columns.str.contains('Portland')
df_pdx=df_eba.loc[:,msk]

dem=df_pdx.iloc[:,0]
#Make a combined Portland Dataframe for demand vs weather.
df_joint=pd.DataFrame(dem)
df_joint=df_joint.join(df_pdx_weath)
#Temperature can only be read off once the weather columns are joined in.
temp=df_joint['Temp']
df_joint['TempShift']=150+abs(temp-150)
df_joint=df_joint.rename(columns={df_joint.columns[0]:'Demand'})


done with Mahlon Sweet Field


done with Salem Municipal Airport/McNary Field


done with Portland International Airport


In [34]:
#clean up data, remove NA
dem = remove_na(dem)
dem = avg_extremes(dem)


NameError: name 'np' is not defined

> /home/jonathan/Data-Science/US-Electricity/EBA_util.py(156)remove_na()
    154     Replace all NA values with the mean value of the series.
    155     """
--> 156     na_msk=np.isnan(df.values)
    157     #first pass:replace them all with the mean value - if a whole day is missing.
    158     print( "Number of NA values {}".format(sum(na_msk)))


NameError: name 'np' is not defined

In [82]:
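t=np.arange(168)
#Plot the first four weeks: demand (blue), temperature (red), naive prediction (black).
#Note: pred comes from the predict_stl cell (In [80]), which was run before this one.
dem_sub=dem[0:4*24*7]
plt.plot(dem_sub.values,'b',temp[0:4*168].values,'r',pred,'k')
plt.show()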

<matplotlib.figure.Figure at 0x7fd873bd1748>

In [79]:
def fit_init_params(y):
    """fit_init_params(y)
    Fits initial parameters for Hyndman's multi-seasonal model to
    hourly electricity data.
    (My guess on how to do this, similar to the naive seasonal
    decomposition in statsmodels.)

    Finds level, bias (trend) and seasonal patterns based on first 4 weeks of data.
    Assumes y is hourly with a DatetimeIndex and starts at hour 0, so that
    each block of 24 values lines up with one day.

    Returns l (level), b (trend), s (2x24 array: row 0 weekdays, row 1 weekends).
    """
    m1 = 24
    k = 7
    nhours = 4*m1*k
    ysub = y[0:nhours]
    yval = ysub.values
    #average value
    l = np.mean(yval)
    #average shift
    b = np.mean(np.diff(yval))
    #remove mean pattern, subtracting off level and linear trend.
    ysub = ysub-l-b*np.arange(nhours)

    #mean seasonal pattern.
    #second seasonal pattern is for weekends, with days
    #Saturday/Sunday have dayofweek equal to 5 and 6.
    #make a mask to select out weekends.
    s2 = ysub.index.dayofweek >=5
    #select out weekends, and regular days.
    y_end = ysub[s2]
    y_week=ysub[~s2]
    #number of whole weekdays and weekend days in the sample
    n1 = int(len(y_week)/m1)
    n2 = int(len(y_end)/m1)
    s = np.zeros((2,m1))
    print(n1,n2)
    #average the detrended daily profiles: row 0 weekdays, row 1 weekends.
    for n in range(n1):
        s[0,:] = s[0,:]+y_week.values[n*m1:(n+1)*m1]/n1

    for n in range(n2):
        s[1,:] = s[1,:]+y_end.values[n*m1:(n+1)*m1]/n2

    return l, b, s

def predict_stl(l,b,s,timeIndex):
    """predict_stl(l,b,s,timeIndex)
    Predicts the series from the level l, trend b and seasonal patterns s
    (row 0 weekdays, row 1 weekends) at the hourly times in timeIndex.
    The trend is counted from the first entry of timeIndex.
    """
    trend=l+b*np.arange(len(timeIndex))
    #find weekends/weekdays.
    msk=timeIndex.dayofweek>=5
    #Use fact that first sub-season is weekdays in first row.
    #Use integer conversion of true/false to 0/1.
    #Then use fact that seasonal patterns are 24 hours long.
    pred=trend+s[msk.astype(int),timeIndex.hour.values]
    return pred

def update_params(param_vec,grad_vec,actual,predicted):
    """update_params
    Change parameters based on gradients, and difference
    between predicted and actual values.
    
    grad_vec - list of functions to evaluate at parameters

    WIP
    """
    param_new=param_vec.copy()
    nparam=len(param_vec)
    for i in range(nparam):
        param_new[i] += grad_vec[i](param_vec)*(actual-predicted)
    return param_new


In [80]:
l,b,s=fit_init_params(dem)

pred=predict_stl(l,b,s,dem_sub.index)


20 8


In [76]:
msk=dem_sub.index.dayofweek>=5

s[msk.astype(int),dem_sub.index.hour.values]

array([ 958.51319104,  996.49083635,  961.16848165,  859.94612696,
        739.37377227,  658.95141757,  432.02906288,  146.20670818,
        -83.86564651, -228.03800121, -318.6103559 , -358.7827106 ,
       -336.85506529, -229.27741998,  -52.64977468,  139.67787063,
        269.55551593,  380.48316124,  483.26080654,  580.63845185,
        669.91609715,  760.99374246,  839.62138777,  916.24903307,
        958.51319104,  996.49083635,  961.16848165,  859.94612696,
        739.37377227,  658.95141757,  432.02906288,  146.20670818,
        -83.86564651, -228.03800121, -318.6103559 , -358.7827106 ,
       -336.85506529, -229.27741998,  -52.64977468,  139.67787063,
        269.55551593,  380.48316124,  483.26080654,  580.63845185,
        669.91609715,  760.99374246,  839.62138777,  916.24903307,
        958.51319104,  996.49083635,  961.16848165,  859.94612696,
        739.37377227,  658.95141757,  432.02906288,  146.20670818,
        -83.86564651, -228.03800121, -318.6103559 , -358.78271

1

Rambling Time!

From a Kalman filter perspective, I think the error/innovation terms can be written as $\epsilon_t = y_t-\hat{y}_t$, where $y_t$ is the actual value, and $\hat{y}_t$ is the output of the model with no noise.  The innovation process then gives a rule for updating the set of parameters ($\alpha,\beta,\Gamma, l, b, s_{i,t}$): how to change them in the 

In [35]:
%pdb

Automatic pdb calling has been turned OFF
