In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

%matplotlib inline
%load_ext autoreload
%autoreload 2

from get_weather_data import convert_isd_to_df, convert_state_isd

pi=np.pi



In [2]:
air_df = pd.read_csv('data/air_code_df.gz')

#Just get the weather station data for cities in Oregon.
df_weather=convert_state_isd(air_df,'OR')

#Read all of the weather data in.
#df_weather=pd.read_csv('data/airport_weather.gz',index_col=0,parse_dates=True)


done with Mahlon Sweet Field


done with Salem Municipal Airport/McNary Field


done with Portland International Airport


In [3]:
#load electricity data
df_eba=pd.read_csv('data/EBA_time.gz',index_col=0,parse_dates=True)
#df_region_eba=pd.read_csv('data/EBA_region_time.gz',index_col=0,parse_dates=True)

In [4]:
#Select temperature for Portland, OR
msk1=np.array(df_weather['city']=='Portland')
msk2=np.array(df_weather['state']=='OR')

df_pdx_weath=df_weather.loc[msk1&msk2]


In [5]:
#get electricity data for Portland General Electric
msk=df_eba.columns.str.contains('Portland')
df_pdx=df_eba.loc[:,msk]


### Anomaly Detection

A quick look at the Portland data suggests that there are both real outliers and spurious ones caused by errors in the data processing (values around 100x the surrounding values).  

Two consistency tests should hold: total interchange = 0, and 
Demand = Net Generation - Net Interchange.
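
The balance test can be sketched on a few made-up numbers. This is only an illustration: the helper name `check_balance` and the 1% tolerance are my assumptions, not anything defined by the EBA data.

```python
import numpy as np

def check_balance(demand, net_gen, net_interchange, rel_tol=0.01):
    """Flag hours where Demand != Net Generation - Net Interchange.

    rel_tol is an illustrative tolerance (1% of demand), not an EBA rule.
    """
    demand = np.asarray(demand, dtype=float)
    residual = demand - (np.asarray(net_gen, dtype=float)
                         - np.asarray(net_interchange, dtype=float))
    bad = np.abs(residual) > rel_tol * np.abs(demand)
    return residual, bad

# Made-up hourly values (not real EBA data); the third hour is inconsistent.
demand = [3648.0, 3658.0, 3608.0]
net_gen = [3000.0, 3100.0, 3900.0]
net_interchange = [-648.0, -558.0, -500.0]
residual, bad = check_balance(demand, net_gen, net_interchange)
print(residual, bad)
```

Hours flagged in `bad` are candidates for the anomaly handling discussed later, rather than proof of an error.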

In [10]:
vnew=[735567.85,736564,0,10000]
fig=plt.figure(figsize=(15,6))
ax = fig.add_axes([0.1, 0.1, 0.6, 0.75])
ax.plot(df_pdx)
ax.legend(df_pdx.columns.values,loc='upper left',bbox_to_anchor=(1,1),prop={'size':9})

<matplotlib.figure.Figure at 0x7efbc24a8668>

<matplotlib.legend.Legend at 0x7efbc2340b70>

In [None]:
#Check that the energy is balanced for this small subset: Demand = Net Generation - Net Interchange.
#Seems to not be true.  

In [44]:
dem=df_pdx.iloc[:,0]
gen=df_pdx.iloc[:,2]
net=df_pdx.iloc[:,3]
plt.figure()
plt.plot(dem-(gen-net),'r')  #residual of Demand = Net Generation - Net Interchange; zero when balanced

[<matplotlib.lines.Line2D at 0x7efba1317400>]

<matplotlib.figure.Figure at 0x7efb9f648a58>

The data in late 2015 seem pretty crappy.  Looking at the EBA user notes, this seems to be a common complaint.
The other errors seem to involve some anomalous zero points in the temperature series.  For temperature series, where huge swings are unlikely,
it may be feasible to replace anomalous 0 values with the average of the neighbouring points.  In the case of actual zero values, this shouldn't be a large problem?

In [6]:
#Make a combined Portland Dataframe for demand vs weather.
dem=df_pdx.iloc[:,0]
df_joint=pd.DataFrame(dem)
df_joint=df_joint.join(df_pdx_weath)
df_joint.head()
#plt.figure()
x=df_joint.iloc[:,0]
y=df_joint.iloc[:,1]
df_joint['TempShift']=150+abs(df_joint['Temp']-150)
df_joint=df_joint.rename(columns={df_joint.columns[0]:'Demand'})

In [164]:
df_joint.head()

                     Demand  CloudCover  DewTemp  Precip-1hr  Precip-6hr  \
2015-07-01 00:00:00  3648.0         2.0    144.0         0.0         NaN   
2015-07-01 01:00:00  3658.0         0.0    150.0         0.0         NaN   
2015-07-01 02:00:00  3608.0         0.0    156.0         0.0         NaN   
2015-07-01 03:00:00  3493.0         0.0    156.0         0.0         NaN   
2015-07-01 04:00:00  3374.0         0.0    150.0         0.0         NaN   

                     Pressure   Temp  WindDir  WindSpeed      city  \
2015-07-01 00:00:00   10150.0  333.0    310.0       77.0  Portland   
2015-07-01 01:00:00   10146.0  322.0    310.0       93.0  Portland   
2015-07-01 02:00:00   10145.0  317.0    310.0       82.0  Portland   
2015-07-01 03:00:00   10146.0  300.0    320.0       62.0  Portland   
2015-07-01 04:00:00   10148.0  278.0    320.0       51.0  Portland   

                      city, state     region state  TempShift  
2015-07-01 00:00:00  Portland, OR  Northwest    OR      33

In [7]:
plt.figure()
plt.plot(df_joint['Temp'],df_joint.iloc[:,0],'rx')
plt.ylabel('Hourly Demand (kWh)')
plt.xlabel('Temperature (Celsius x10)')
plt.title('Energy Usage vs Temperature in Portland, OR')

<matplotlib.figure.Figure at 0x7ff7cfc3b438>

Text(0.5,1,'Energy Usage vs Temperature in Portland, OR')

In [8]:
plt.figure()
plt.plot(df_joint['WindSpeed'],df_joint.iloc[:,0],'rx')
plt.xlabel('Wind Speed (m/s x10)')
plt.ylabel('Hourly Demand (kWh)')
plt.title('Energy Usage vs Wind Speed in Portland, OR')

<matplotlib.figure.Figure at 0x7efbb9752ba8>

Text(0.5,1,'Energy Usage vs Wind Speed in Portland, OR')

In [11]:
plt.figure()
plt.plot(df_joint['Precip-1hr'],df_joint.iloc[:,0],'rx')
plt.ylabel('Demand (kWh)')
plt.xlabel('Precipitation (mm x 10)')
plt.title('Energy Usage vs Precipitation in Portland, OR')

<matplotlib.figure.Figure at 0x7efbb98c6f28>

Text(0.5,1,'Energy Usage vs Precipitation in Portland, OR')

So the scatterplot for temperature versus demand shows a clear (expected) trend as the temperature becomes excessively hot or cold.
It looks like two blobs with similar slopes for deviations from 15 Celsius.  You can also see anomalous values at zero,
and extremely high values.  I'm skeptical of the 9000kWh value?

Let's also plot the correlation matrix across the whole time series.  Evidently a temperature deviation from 15 Celsius shows the largest correlation, with wind speed being the next most important.
I know the coldest temperatures in some places emerge in inversions (with absolutely no air movement).
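
A minimal sketch of such a correlation matrix, on synthetic data rather than the Portland series. The numbers below are invented for illustration; on the real frame, `df_joint.corr()` should give the equivalent table. The point is that a symmetric response around 15 Celsius shows almost no correlation with raw temperature, but a strong one with the shifted temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
temp = rng.uniform(-50, 350, n)          # synthetic temperature, Celsius x10
wind = rng.uniform(0, 100, n)            # synthetic wind speed
temp_shift = 150 + np.abs(temp - 150)    # same transform as TempShift above
# Synthetic demand that responds to the deviation from 15 C, plus noise.
demand = 3000 + 4.0 * np.abs(temp - 150) + 50 * rng.standard_normal(n)

names = ['Demand', 'Temp', 'TempShift', 'WindSpeed']
corr = np.corrcoef(np.vstack([demand, temp, temp_shift, wind]))
for name, row in zip(names, corr):
    print(name.ljust(10), np.round(row, 2))
```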

In [None]:
My naive model for how energy usage would vary is a factor for deviation from some ideal temperature, as well as daily and yearly oscillations.
\begin{equation}
    \text{Demand}= A_0+A_T|T-T_0|+A_\text{day}\sin\left( \frac{2\pi t}{24}+\phi_{\text{day}}\right)+A_\text{year}\sin\left(\frac{2\pi d}{365}+\phi_{\text{year}}\right)
\end{equation}
where $t$ is the hour of the day in 24 hour time, and $d$ is the number of days since the start of the year.
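
Since $A\sin(\omega t+\phi)=a\sin\omega t+b\cos\omega t$, the model is linear in its coefficients once $T_0$ is fixed, so it can be fit by ordinary least squares. Below is a sketch on synthetic data; fixing $T_0=150$ (15 Celsius, in the x10 units used here), the helper name `design_matrix`, and all the "true" parameters are assumptions for illustration only.

```python
import numpy as np

def design_matrix(temp, hour, day, T0=150.0):
    """Columns: constant, |T-T0|, daily sin/cos, yearly sin/cos."""
    wd = 2 * np.pi * hour / 24
    wy = 2 * np.pi * day / 365
    return np.column_stack([np.ones_like(temp), np.abs(temp - T0),
                            np.sin(wd), np.cos(wd), np.sin(wy), np.cos(wy)])

# Synthetic check: build data from known parameters and recover them.
rng = np.random.default_rng(1)
n = 24 * 365
hours_total = np.arange(n)
hour = hours_total % 24          # t, hour of day
day = hours_total // 24          # d, days since start of year
temp = 150 + 100 * np.sin(2 * np.pi * day / 365) + 30 * rng.standard_normal(n)
true = np.array([3000.0, 4.0, 200.0, -100.0, 150.0, 50.0])
X = design_matrix(temp, hour, day)
demand = X @ true + 20 * rng.standard_normal(n)

coef, *_ = np.linalg.lstsq(X, demand, rcond=None)
A_day = np.hypot(coef[2], coef[3])              # amplitude of the daily term
phi_day = np.arctan2(coef[3], coef[2])          # phase of the daily term
print(np.round(coef, 1), round(A_day, 1), round(phi_day, 2))
```

Fitting $T_0$ itself would make the problem nonlinear; scanning a few candidate values and keeping the best fit is one simple option.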

To get a sense of those oscillations, let's look at the autocorrelation function for demand, as a function of time.  (Alternatively, the power spectrum?)

# Removing Extremes

Let's try to clean up some of this data.
My strategy is to find missing (or zero) values and excessive values, i.e. values more than 4 standard deviations above the mean (the threshold used in the code below).
Those extreme values are replaced with the mean of the two neighbouring points.
This is also carried out for points with zero. Under the assumption that the data are otherwise continuous, the smoothing should not be a large distortion.


In [8]:
def avg_extremes(df,window=2):
    """avg_extremes(df)
    Replace extreme outliers, or zero values with the average on either side.
    Suitable for occasional anomalous readings.
    """
    mu=df.mean()
    sd=df.std()
    #flag values more than 4 standard deviations above the mean
    msk1=(df-mu)>4*sd
    msk2 = df==0
    msk=(msk1|msk2).values
    print( "Number of extreme values {}. Number of zero values {}".format(sum(msk1),sum(msk2)))
    n=len(df)
    ind= np.arange(n)[msk]
    for i in ind:
        #average the points `window` steps before and after, clamped to the series ends
        lo=max(i-window,0)
        hi=min(i+window,n-1)
        df.iloc[i]=(df.iloc[lo]+df.iloc[hi])/2

    return df

def remove_na(df,window=2):
    """remove_na(df)
    Replace NA values: first fill every NA with the series mean, then
    overwrite each with the average of the points `window` steps on either side.
    """
    na_msk=np.isnan(df.values)
    #first pass:replace them all with the mean value - if a whole day is missing.
    df[na_msk]=df.mean()

    n=len(df)
    ind= np.arange(n)[na_msk]
    #for isolated values, replace by the average on either side.    
    for i in ind:
        #clamp to the series ends so the neighbours stay in range
        lo=max(i-window,0)
        hi=min(i+window,n-1)
        df.iloc[i]=(df.iloc[lo]+df.iloc[hi])/2
    return df




# Auto regressive modelling

A popular approach assumes that the current demand is probably the same as the previous demand, with some noise.
This is the basis of the auto-regressive integrated moving average (ARIMA) class of linear models, which are popular in econometric forecasting.
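
To make the auto-regressive part concrete, here is a sketch that fits a pure AR(p) model by least squares with plain numpy and iterates it forward. It is not the statsmodels ARIMA estimator (there is no differencing and no moving-average term), and the helper names `fit_ar` and `forecast_ar` are made up for this example.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of y[t] = c + sum_k phi[k] * y[t-k-1] + noise."""
    y = np.asarray(y, dtype=float)
    # Column k holds the series lagged by k+1 steps.
    lags = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(y) - p), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]

def forecast_ar(y, c, phi, steps):
    """Iterate the fitted recursion forward `steps` points."""
    hist = list(y[-len(phi):])
    out = []
    for _ in range(steps):
        nxt = c + sum(ph * hist[-k - 1] for k, ph in enumerate(phi))
        out.append(nxt)
        hist.append(nxt)
    return np.array(out)

# Synthetic AR(1) series: y[t] = 0.8*y[t-1] + noise.
rng = np.random.default_rng(2)
y = np.zeros(5000)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()
c, phi = fit_ar(y, 1)
print(round(phi[0], 2), forecast_ar(y, c, phi, 3))
```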

In [9]:
def make_seasonal_plots(dem,temp,per,nlags):
    """Make seasonal decomposition of temperature, and demand curves.
    Plots those decompositions, and their correlation/autocorrelation plots.
    dem- input demand series
    temp-input temperature series
    per - input date to index on for plotting, e.g. '2016-03'
    nlags - number of lags for correlation plots.
    """
    #Carry out the "demand" and "temperature" seasonal decompositions.
    dem_decomposition = seasonal_decompose(dem,two_sided=False)
    dem_mu=dem.mean()
    dem_trend = dem_decomposition.trend/dem_mu  #Find rolling average over most important period.
    dem_seasonal = dem_decomposition.seasonal/dem_mu  #Find the dominant frequency components
    dem_residual = dem_decomposition.resid/dem_mu  #Whatever is left.

    temp_decomposition = seasonal_decompose(temp,two_sided=False)
    temp_mu=temp.mean()
    temp_trend = temp_decomposition.trend/temp_mu  #Find rolling average over most important period.
    temp_seasonal = temp_decomposition.seasonal/temp_mu  #Find the dominant frequency components
    temp_residual = temp_decomposition.resid/temp_mu  #Whatever is left.

    #Plot out the decompositions
    plt.figure(figsize=(15,9))
    plt.title('Normalized Seasonal Decomposition')
    plt.subplot(411)
    plt.plot(dem_trend[per],'b',temp_trend[per],'k')
    plt.ylabel('Trend')
    plt.subplot(412)
    plt.plot(dem_seasonal[per],'b',temp_seasonal[per],'k')
    plt.ylabel('Seasonal Oscillation')
    plt.subplot(413)
    plt.plot(dem_residual[per],'b',temp_residual[per],'k')
    plt.ylabel('Residuals')
    plt.subplot(414)
    plt.plot(dem[per]/dem_mu,'b',temp[per]/temp_mu,'k')
    plt.ylabel('Data')
    plt.show()

    #Plot the auto-correlation plots.
    nlags=np.min([len(dem[per])-1,nlags,len(temp[per])-1])
    print('Nlags',nlags)
    plt.figure(figsize=(10,6))
    plot_acf(temp_residual[per],'b-x','Temp Residual',nl=nlags)
    plot_acf(dem_residual[per],'r-+','Demand Residual',nl=nlags)
    plt.legend()
    plt.show()

    plt.figure(figsize=(10,6))
    plot_acf(temp[per],'b-x','Temp',nl=nlags)
    plot_acf(dem[per],'r-+','Demand',nl=nlags)
    plt.legend()
    plt.show()

    return None

def plot_acf(ts,ls,line_label,nl=50):
    """plot_acf(ts,ls,nl)
    Plot the auto-correlation plots for a timeseries (ts) up to a given number of lags (nl)
    Give a specific linestyle (ls), and label.
    """
    #Compute the auto-correlation and partial auto-correlation of the series.
    lag_acf = acf(ts,nlags=nl)
    lag_pacf=pacf(ts,nlags=nl,method='ols')
    #95% confidence band for white noise (5% significance level).
    sd = 1.96/np.sqrt(len(ts))
    #Make some purty subplots.
    plt.subplot(121)
    plt.ylabel('Auto Correlation')
    plt.plot(lag_acf,ls,label=line_label)
    plt.axhline(y=sd,color='gray')
    plt.axhline(y=-sd,color='gray')
    plt.ylabel('Auto Correlation')
    plt.xlabel('Lag')
    plt.subplot(122)
    plt.ylabel('Partial Auto Correlation')
    plt.xlabel('Lag')
    plt.axhline(y=sd,color='gray')
    plt.axhline(y=-sd,color='gray')
    plt.plot(lag_pacf,ls,label=line_label)
    return None


In [88]:
dem=df_joint['Demand'].asfreq('H')
dem=avg_extremes(dem)
dem=remove_na(dem)

temp=df_joint['Temp'].asfreq('H')
temp=avg_extremes(temp)
temp=remove_na(temp)

make_seasonal_plots(dem,temp,'2016-01',50)


<matplotlib.figure.Figure at 0x7ff7b438be80>

<matplotlib.figure.Figure at 0x7ff7b43d7940>



Nlags 50


<matplotlib.figure.Figure at 0x7ff7cf9a2f98>

Number of extreme values 0. Number of zero values 148
(JBM) Freq is  24
(JBM) Freq is  24


Number of extreme values 1. Number of zero values 3


Evidently, this finds the day timescale.  I'm a bit skeptical of these plots, and this approach (trying simple seasonality reduction on the whole data set at once).  I think the seasonal component has not been completely removed.
The "seasonal_decompose" method works by estimating the frequency of the data.  The trend is found by a moving average over one period, and the seasonality is found by averaging the detrended values at the same point in the cycle over multiple periods.  The remainder once these are subtracted is the "noise" process.

An additional year-long oscillation is still buried in the trend.  Of course, this data set spans only two years.
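
The decomposition idea can be sketched in a few lines of numpy. This is my simplified version (a trailing moving average and an additive model), not the statsmodels implementation, and the function name `classical_decompose` is hypothetical.

```python
import numpy as np

def classical_decompose(y, period):
    """Additive decomposition: trailing moving-average trend, phase-averaged seasonal."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    trend = np.full(n, np.nan)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    # Trailing (one-sided) mean over the last `period` points.
    trend[period - 1:] = (csum[period:] - csum[:-period]) / period
    detrended = y - trend
    # Average the detrended values at each phase of the cycle.
    phase = np.arange(n) % period
    means = np.array([np.nanmean(detrended[phase == k]) for k in range(period)])
    means -= means.mean()
    seasonal = means[phase]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic hourly series: slow linear drift plus a daily cycle.
t = np.arange(24 * 60)
y = 100 + 0.01 * t + 10 * np.sin(2 * np.pi * t / 24)
trend, seasonal, resid = classical_decompose(y, 24)
print(np.nanmax(np.abs(resid)))
```

In this toy case the daily cycle is recovered, but the residual keeps a small constant offset because the trailing window lags the true trend by about half a period. Anything slower than the chosen period (such as an annual cycle) ends up in the trend rather than the seasonal term.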

In [42]:
#Do some tests for stationarity
ad_results=adfuller(dem['2016-11'],autolag='BIC')
names=["Test statistic","p-value","#Lags","Num observed","Critical Values"]

for i in range(0,5):
    print( names[i],ad_results[i])


Test statistic -1.28011239076
p-value 0.638221098471
#Lags 20
Num observed 699
Critical Values {'1%': -3.4397398095543279, '5%': -2.8656836898038098, '10%': -2.5689766074363334}


The above plot is the raw auto-correlation between the demand and temperature.  I think there is a substantive daily oscillation left by the naive seasonal approach.  This assumes a single oscillation, repeated for all cases.  In this data however, there is a clear daily signal, which it picks out.  However, this will vary over the course of the year.

Diebold's text "Elements of Forecasting" suggests putting in dummy variables for seasonality.  So hour of day, and day of year.  The resulting series.  
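
A sketch of the dummy-variable idea for hour of day, on synthetic data. One indicator column per hour means the least-squares coefficients are just the mean demand at each hour. The helper `hour_dummies` and the synthetic profile are assumptions for illustration; day-of-year dummies would work the same way but need several years of data to be meaningful.

```python
import numpy as np

def hour_dummies(hour):
    """One indicator column per hour of day (24 columns, no separate intercept)."""
    hour = np.asarray(hour)
    return (hour[:, None] == np.arange(24)[None, :]).astype(float)

rng = np.random.default_rng(3)
n = 24 * 200
hour = np.arange(n) % 24
profile = 3000 + 500 * np.sin(2 * np.pi * (hour - 6) / 24)   # synthetic daily shape
demand = profile + 30 * rng.standard_normal(n)

X = hour_dummies(hour)
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)   # coef[h] = mean demand at hour h
deseasonalized = demand - X @ coef
print(np.round(coef[:3]), round(deseasonalized.std(), 1))
```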

In [86]:
#Compare series at noon
msk=df_joint.index.hour==12

dem=df_joint[msk]['Demand'].asfreq('D')
dem=avg_extremes(dem)
dem=remove_na(dem)

temp=df_joint[msk]['Temp'].asfreq('D')
temp=avg_extremes(temp)
temp=remove_na(temp)
make_seasonal_plots(dem,temp,'2016-03',40)


<matplotlib.figure.Figure at 0x7ff7b4224978>

<matplotlib.figure.Figure at 0x7ff7a3ba7940>



Nlags 30


<matplotlib.figure.Figure at 0x7ff7a3b44a20>

Number of extreme values 7. Number of zero values 0
Number of extreme values 0. Number of zero values 5
(JBM) Freq is  7
(JBM) Freq is  7


In [None]:
So looking at just an hour of the day, the seasonal split manages to work fairly well at making the residual series a stationary one.
The "trend" is effectively picking out the anticipated annual shifts, and the "seasonality" is pulling out a small week long oscillation (the amplitude is much smaller than the trend).  The residuals also seem to be stationary now.  

The autocorrelation plots also show some oscillations (I think the seasonal reduction is pretty crap), but here they decay to within error after
6 days.  
The raw demand auto-correlations might be showing annual oscillations in temperature and electricity usage that would get stronger from 120-240 days.

Turns out the "seasonal" part 

If we look at the correlation plots for various hours there are a couple clear trends.  Looking at 6pm, shows a really clear weekly (7 day) signal.  This is not as obvious at other times of day (6am, 9am, 12pm).  Note that I have not selected out weekends, or holidays here.  Weekends might be strongly contributing to the weekly oscillation.  


## Fourier Plots

I'm curious about the power spectrum for this series.  I'm also unfamiliar with Python's FFT routine, so this is a good time to play around.

In [124]:
#clean up the data
dem_t=df_joint['Demand']['2015-07':'2016-06'].copy()
dem_t=avg_extremes(dem_t)
dem_t=remove_na(dem_t)
dem_tv=dem_t.values


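#set up FFT time/frequency scales
Nt = len(dem_tv)
#scale time to days.
Tmax = Nt/24
dt = 1/24
t = np.arange(Nt)*dt
df = 1/Tmax
fmax=0.5/dt
#centred frequency axis running from -fmax to fmax-df
f = np.fft.fftshift(np.fft.fftfreq(Nt,d=dt))

#carry out fft: ifftshift centres the input, fftshift centres the output
#(for an even number of samples the two shifts are identical)
dem_f=np.fft.ifftshift(dem_tv)
dem_f=np.fft.fft(dem_f)
dem_f=np.fft.fftshift(dem_f)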



Number of extreme values 0. Number of zero values 0


In [66]:
plt.figure(figsize=(15,10))
spec=abs(dem_f)**2
spec/=sum(spec)
plt.semilogy(f,spec)
fcut=1/7
plt.axis([-10*fcut,10*fcut,1E-10,1])
plt.xlabel('Frequency (1/day)')
plt.ylabel('Normalized Demand Power Spectrum')
plt.show()


<matplotlib.figure.Figure at 0x7ff7cdd870b8>

This is a normalized power spectrum for the demand data.  You can clearly see the peaks arising from daily and weekly oscillations.
There is a small peak at very low frequencies, which corresponds to the annual oscillation.  However, given we only have 2 years of data (and this transform uses just one of them), this
is almost exactly the frequency resolution 1/T, the lowest nonzero frequency that can be resolved (the Nyquist frequency is the opposite limit, the highest resolvable frequency).  Let's examine both the high (intra-day) and low (year-long) frequency scales.
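
To pin down the two frequency limits, here is a sketch using `np.fft.fftfreq` on a synthetic signal. The sample count assumes one leap year of hourly data, and the amplitudes are invented. The lowest nonzero frequency is 1/T and the highest is 1/(2 dt).

```python
import numpy as np

dt = 1 / 24                 # sample spacing in days (hourly data)
n = 24 * 366                # one year of hourly samples (assumed length)
f = np.fft.fftshift(np.fft.fftfreq(n, d=dt))   # centred frequency axis, cycles/day

f_resolution = 1 / (n * dt)   # smallest nonzero frequency: 1/366 per day
f_nyquist = 1 / (2 * dt)      # largest resolvable frequency: 12 per day
print(f_resolution, f_nyquist, f.min(), f.max())

# Synthetic signal with daily and weekly cycles; locate the strongest positive peak.
t = np.arange(n) * dt
y = 100 * np.sin(2 * np.pi * t) + 30 * np.sin(2 * np.pi * t / 7)
spec = np.abs(np.fft.fftshift(np.fft.fft(y))) ** 2
pos = f > 0
f_peak = f[pos][np.argmax(spec[pos])]
print(f_peak)
```

With only one year in the window, an annual cycle sits in the very first nonzero bin, so it cannot be cleanly separated from the mean and any slow drift.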

The top figure, shows the low frequency (year-long) data.  The lower plot shows nearly the whole frequency spectrum.  Note the peaks at 1,2,3,etc.  These are the daily frequency oscillations.  They also share correlations with other frequencies fo

In [81]:
def remove_square_peak(Y,f,center,width):
    """remove_square_peak
    Assumes there is a peak inside a square window around `center`
    (e.g. the yearly trend near zero frequency).
    Subtracts off everything inside that window.
    Replaces that with the average of the neighbouring points.
    
    inputs:
    Y - initial centered Fourier transform
    f - list of frequencies the Fourier transform is evaluated at
    center - frequency to center filter at, to remove        
    width - width of the filter.

    return:
    trend     -the subtracted portion.
    detrend   -transform after subtracting off this component.  
    """ 
    #find stuff within +/- 1 width
    trend_msk= abs(f-center)<width
    #find stuff within +/- 1.5 widths, and not inside 1 width
    mean_msk = abs(f-center)<1.5*width
    mean_msk = mean_msk & ~trend_msk

    replace_avg = Y[mean_msk].mean()
    trend=np.zeros(len(f))+0j
    trend[trend_msk] = Y[trend_msk]-replace_avg
    detrend = Y-trend
    return trend, detrend

def remove_sinc_peak(Y,f,center,width):
    """remove_sinc_peak
    Assumes there is a peak described by a sinc (from the truncated FFT)
    Sets the peak height from the mean value of the FFT within the peak window
    Subtracts off a sinc function with that amplitude. 
    
    inputs:
    Y - initial centered Fourier transform
    f - list of frequencies the Fourier transform is evaluated at
    center - frequency to center filter at, to remove        
    width - width of the filter.

    return:
    trend     -the subtracted portion.
    detrend   -transform after subtracting off this component.  
    """ 
    #find stuff within +/- 1 width
    trend_msk= abs(f-center)<width
    #use the mean inside that window as the sinc amplitude
    replace_avg = Y[trend_msk].mean()
    trend = replace_avg*sinc((f-center)/width)
    detrend = Y-trend
    return trend, detrend

def sinc(x):
    """sinc(x)
    Computes sin(x)/x, with care to take correct limit at x=0
    """
    msk=(abs(x)>1E-16)
    s=np.zeros(len(x))
    s[~msk]=1
    s[msk]=np.sin(x[msk])/x[msk]
    return s


In [125]:
def fft_detrend(F,f,width,remove_func):
    """fft_detrend(F,f,width,remove_func)
    
    Removes mean, annual, daily and weekly trends in data
    by filtering the FFT.

    inputs:
    F - Fourier transformed function
    f - frequency list (assumed to be scaled so 1 = 1/day)
    width - frequency width to apply on filter
    remove_func - functional form of the filter.

    return:
    F_trend_tot - total trend removed
    F_detrend   - detrended function.
    """ 

    #work on the input transform F, rather than the global dem_f
    F_trend_tot=np.zeros(len(F))+0j

    #remove the mean and annual trend around zero frequency
    F_trend,F_detrend=remove_func(F,f,0,width)
    F_trend_tot+=F_trend
    #remove daily oscillations
    for k in [1,2]:
        #positive peak
        F_trend,F_detrend=remove_func(F_detrend,f,k,width)
        F_trend_tot+=F_trend
        #negative peak
        F_trend,F_detrend=remove_func(F_detrend,f,-k,width)
        F_trend_tot+=F_trend

    # #remove weekly oscillations
    for i in range(1,6):
        f0=i/7
        F_trend,F_detrend=remove_func(F_detrend,f,f0,width)
        F_trend_tot+=F_trend
        F_trend,F_detrend=remove_func(F_detrend,f,-f0,width)
        F_trend_tot+=F_trend

    return F_trend_tot,F_detrend



In [122]:
def moving_avg(Y,width):
    """moving_avg(Y, width)
    Compute moving average by differencing the cumulative sum.
    """
    Ycum = np.cumsum(Y)
    Ysmooth=np.zeros(len(Y))+0j
    Ysmooth[width:-width]=(Ycum[2*width:]-Ycum[:-2*width])/(2*width)
    return Ysmooth    

width=3
n=np.arange(0,11)

print(len(n[0:-2*width]),len(n[width:-width]),len(n[2*width:]))
print((n[0:-2*width]),(n[width:-width]),(n[2*width:]))

dem_f_s=moving_avg(dem_f,2)


5 5 5
[0 1 2 3 4] [3 4 5 6 7] [ 6  7  8  9 10]


In [123]:
y = np.exp(-0.5*f*f)+ 0.15*np.random.randn(len(f))

plt.figure()
plt.plot(f,y,f,moving_avg(y,5))
plt.show()

<matplotlib.figure.Figure at 0x7ff7b4101198>



In [83]:
f_trend_tot,f_detrend = fft_detrend(dem_f,f,2/365,remove_square_peak)

In [107]:
plt.figure(figsize=(12,9))
plt.axis([-0.2,2,1E3,1E8])
plt.semilogy(f,abs(f_trend_tot),f,abs(f_detrend),f,abs(dem_f_s))
plt.show()


<matplotlib.figure.Figure at 0x7ff7a3ba2128>

In [21]:
#check out what this detrending looks like.
#
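def invert_fft(Y):
    #undo the fftshifts, invert fft, and take the real part
    #(ifftshift undoes the centring of the spectrum; the final fftshift
    # undoes the centring of the time series)
    y=np.fft.ifftshift(Y)
    y=np.fft.ifft(y)
    y=np.fft.fftshift(y)
    y=np.real(y)
    return y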

t_trend=invert_fft(f_trend_tot)
t_detrend=invert_fft(f_detrend)

# t_trend=pd.Series(t_trend,index=dem_t.index)
# t_detrend=pd.Series(t_detrend,index=dem_t.index)


In [18]:
plt.figure(figsize=(12,9))
plt.plot(t,dem_t,'b',t,t_trend,'r',t,t_detrend,'g')
#plt.axis([550,560,min(t_detrend),max(dem_t)])

[<matplotlib.lines.Line2D at 0x7efbb9708f60>,
 <matplotlib.lines.Line2D at 0x7efbb9708518>,
 <matplotlib.lines.Line2D at 0x7efbb9708940>]

<matplotlib.figure.Figure at 0x7efbb7caa780>

So that used just July/2015-June/2016 data to find the trend.  Let's now see how this does when applied to the next year's data.
The trend can be appended to itself.  

In [47]:
dem_t2=df_joint['Demand']['2015-07':'2017-06'].copy()
dem_t2=avg_extremes(dem_t2)
dem_t2=remove_na(dem_t2)

#need to ditch a day due to leap year in 2016 elongating the year
t_trend2=np.append(t_trend,t_trend[:-24])
t_trend2 = pd.Series(t_trend2,index=dem_t2.index)

Number of extreme values 1. Number of zero values 3


In [110]:
plt.figure(figsize=(12,9))
per='2016'
plt.plot(-t_trend2[per]+dem_t2[per],'b-')

[<matplotlib.lines.Line2D at 0x7efb994a5ef0>]

<matplotlib.figure.Figure at 0x7efb99285c18>

In [104]:
#dem_residual is local to make_seasonal_plots, so recompute the normalized residual here.
dem_h=remove_na(avg_extremes(df_joint['Demand'].asfreq('H')))
dem_residual=seasonal_decompose(dem_h,two_sided=False).resid/dem_h.mean()
plt.figure(figsize=(12,9))
plot_acf((t_detrend),'r-x','Manually Detrended',nl=30)
plot_acf(dem_residual['2016-01'],'b-x','Detrended',nl=30)
plt.legend()
plt.show()

<matplotlib.figure.Figure at 0x7efb99e6acf8>

In [None]:
So, that was a waste of time.  My manual detrend-everything-at-once approach seems to have failed.  Considering that both of the remaining series have long-lived correlations, but weak partial correlations, it might be better to take a difference.  That will amplify the noise. 
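
A sketch of what differencing does, on synthetic data with an invented cycle amplitude and noise level. A seasonal difference at lag 24 cancels a fixed daily cycle exactly, but the white noise comes out larger by a factor of about sqrt(2).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 24 * 200
t = np.arange(n)
signal = 500 * np.sin(2 * np.pi * t / 24)      # synthetic daily cycle
noise = 20 * rng.standard_normal(n)            # white measurement noise
y = signal + noise

d1 = np.diff(y)              # first difference: y[t] - y[t-1]
d24 = y[24:] - y[:-24]       # seasonal difference: y[t] - y[t-24]

# The daily cycle cancels in d24, leaving only differenced noise,
# whose standard deviation is about sqrt(2) times that of the original noise.
print(round(noise.std(), 1), round(d24.std(), 1), round(d1.std(), 1))
```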

In [106]:
ad_results=adfuller(t_detrend,autolag='BIC')
names=["Test statistic","p-value","#Lags","Num observed","Critical Values"]

for i in range(0,5):
    print( names[i],ad_results[i])



Test statistic -8.17261024485
p-value 8.54236198904e-13
#Lags 32
Num observed 8751
Critical Values {'1%': -3.4310974824840628, '5%': -2.8618703376017911, '10%': -2.5669458337323605}


In [53]:
?pd.date_range

## Goals

What is my goal here?  To develop a model for day-ahead electricity forecasts, that optimizes the mean square error.  I have been playing with trying to capture an entire year's data.  (I wanted to explore the seasonal patterns, and try fitting a basic model.)

However, trying to forecast a year's power (at daily resolution) is a fool's errand.  What is a smaller task I can play with?
I could try fitting day-ahead curves, using the last week's data.  Each day is then its own problem, with much more manageable requirements.
To finish the ARMA stuff, I can estimate the expected ARMA parameters from a bunch of separate two-week periods. Once the model parameters
are set, I can fit the model for each period, and forecast the next day's behaviour. Those parameters can then be used in the future, perhaps
with feedback based on how they worked in the past.
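
As a baseline for that rolling day-ahead setup, here is a sketch that forecasts each day as the hour-by-hour mean of the previous week and scores it by mean square error. The data are synthetic, and the helper `day_ahead_hourly_mean` is a hypothetical stand-in for whatever ARMA fit eventually replaces it.

```python
import numpy as np

def day_ahead_hourly_mean(y, day, n_days=7):
    """Forecast the 24 hours of `day` as the hour-by-hour mean of the previous n_days."""
    hist = y[(day - n_days) * 24: day * 24].reshape(n_days, 24)
    return hist.mean(axis=0)

# Synthetic hourly demand: daily cycle, slow drift, noise (illustrative only).
rng = np.random.default_rng(5)
n_total_days = 60
t = np.arange(24 * n_total_days)
y = 3000 + 400 * np.sin(2 * np.pi * t / 24) + 0.2 * t + 40 * rng.standard_normal(len(t))

errors = []
for day in range(7, n_total_days):
    forecast = day_ahead_hourly_mean(y, day)
    actual = y[day * 24:(day + 1) * 24]
    errors.append(np.mean((forecast - actual) ** 2))
mse = np.mean(errors)
print(round(mse, 1))
```

Any fitted model should beat this baseline on the same rolling windows before it is worth the extra complexity.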

I also want to fit a Long Short-Term Memory neural network to this data.  This will be done in TensorFlow, where I will try to build the network using the lower-level instructions, rather than any built-in operations.  This problem seems a good match for this technique, since there are clear correlations, and some scope for nonlinearities.  In this case we must select parameters for the size and depth of
the network.

# Appendices

I've accumulated things I was playing with here, such as the distinction between auto-correlation, and partial auto-correlation plots, and numpy's fft syntax.

## ACF vs PACF

The following example helped me understand the distinction between the ACF and PACF.  The PACF tries to remove the correlation due to the intermediate variables, to find how the innovation/noise at a step $k$ in the past affects the present.   The following model is a linear trend plus white noise, with a delayed copy of the noise subtracted.  You can see the peaks in the PACF at lags corresponding to the enforced lag.  As a rule of thumb, the lag where the PACF cuts off tells us the order of the auto-regression, and the lag where the ACF cuts off tells us the order of the moving average.  

In [114]:
Nx=10000
s=2
x = np.arange(0,Nx)
z= np.random.randn(Nx)
z1=np.zeros(Nx)

z1[s:Nx] = z[0:Nx-s]
y = 2*x +2*z - .5*z1

tindex = pd.date_range('2015-01-01',periods=Nx)
ts = pd.Series(y,index=tindex)
plt.figure()
plot_acf(ts,'r-+','T0',nl=10)
plt.show()

<matplotlib.figure.Figure at 0x7efb99c70358>
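
A trend-free companion to the example above, as a sketch in plain numpy. The helpers `sample_acf` and `sample_pacf` are my own simplified estimators (not the statsmodels functions), and the AR and MA coefficients are arbitrary. The standard rule of thumb is that an AR(p) process has a PACF that cuts off after lag p, while an MA(q) process has an ACF that cuts off after lag q.

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample auto-correlation at lags 0..nlags."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.dot(y, y)
    return np.array([np.dot(y[:len(y) - k], y[k:]) / denom for k in range(nlags + 1)])

def sample_pacf(y, nlags):
    """PACF at lag k = last coefficient of a least-squares AR(k) fit."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    out = [1.0]
    for k in range(1, nlags + 1):
        X = np.column_stack([y[k - j - 1:len(y) - j - 1] for j in range(k)])
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coef[-1])
    return np.array(out)

rng = np.random.default_rng(6)
e = rng.standard_normal(20000)
ar = np.zeros_like(e)                 # AR(2): depends on its own two previous values
for t in range(2, len(e)):
    ar[t] = 0.5 * ar[t - 1] + 0.3 * ar[t - 2] + e[t]
ma = e.copy()                         # MA(2): depends on the noise two steps back
ma[2:] += 0.8 * e[:-2]

print('AR(2) pacf', np.round(sample_pacf(ar, 4), 2))
print('MA(2) acf ', np.round(sample_acf(ma, 4), 2))
```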