# Deep Gaussian Processes using Spectral Kernels

We experiment with different numbers of spectral mixture components in the doubly stochastic variational inference framework for `deep Gaussian processes`.

We compare the following models:

1) deep GP with Gaussian kernel

2) deep GP with SM kernel [wilson’13]

3) deep GP with GSM kernel [remes’17]

We also compare to a standard, non-deep SVI-GP (gpflow/SVGP) as a baseline.

We already know that the Gaussian kernel can be reproduced by a 1-component SM kernel. Hence, the interesting research question is how the DGP behaves when we go from the Gaussian kernel (Q=1) to highly spectral kernels (e.g. Q=10): how do the runtime, accuracy, optimization and kernel behave, and do we get overfitting?
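The equivalence mentioned above can be checked numerically. The following is a minimal NumPy sketch for 1-D inputs; the helper names `sm_kernel` and `gaussian_kernel` and the test values are illustrative assumptions, not part of the project code. With a single component at mu=0, the SM kernel `w * exp(-2 pi^2 tau^2 v)` matches a Gaussian kernel with variance `w` and lengthscale `1 / (2 pi sqrt(v))`.

```python
import numpy as np

def sm_kernel(x1, x2, weights, means, variances):
    """1-D SM kernel: sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)."""
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k

def gaussian_kernel(x1, x2, variance, lengthscale):
    """Gaussian (RBF) kernel: variance * exp(-tau^2 / (2 lengthscale^2))."""
    tau = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * tau**2 / lengthscale**2)

x = np.linspace(-3.0, 3.0, 50)
v = 0.05  # spectral variance of the single component (arbitrary test value)
K_sm = sm_kernel(x, x, weights=[1.5], means=[0.0], variances=[v])
K_rbf = gaussian_kernel(x, x, variance=1.5,
                        lengthscale=1.0 / (2.0 * np.pi * np.sqrt(v)))
print(np.allclose(K_sm, K_rbf))
```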


`The core would then be 5x10=50 experiments with 1..5 layers, and 1..10 spectral components. These will all be run with the deep GP of Salimbeni.`

Also, you should run baselines:

- [5] deep GP with Gaussian kernel with 1..5 layers
- [1] SVGP with Gaussian kernel
- [10] SVGP with spectral kernel with 1..10 components

This gives a total of 50+5+1+10=66 experiments.
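The experiment grid above can be enumerated programmatically, which makes it easy to loop over runs and to double-check the count. This is only a sketch; the dictionary keys (`model`, `kernel`, `layers`, `Q`) are hypothetical names chosen for illustration.

```python
from itertools import product

LAYERS = range(1, 6)        # 1..5 layers
COMPONENTS = range(1, 11)   # 1..10 spectral components

experiments = []
# Core: deep GP with SM kernel, every (layers, Q) combination.
experiments += [{"model": "DGP", "kernel": "SM", "layers": l, "Q": q}
                for l, q in product(LAYERS, COMPONENTS)]
# Baseline: deep GP with Gaussian kernel, 1..5 layers.
experiments += [{"model": "DGP", "kernel": "Gaussian", "layers": l, "Q": None}
                for l in LAYERS]
# Baseline: SVGP with Gaussian kernel.
experiments += [{"model": "SVGP", "kernel": "Gaussian", "layers": 1, "Q": None}]
# Baseline: SVGP with SM kernel, 1..10 components.
experiments += [{"model": "SVGP", "kernel": "SM", "layers": 1, "Q": q}
                for q in COMPONENTS]

print(len(experiments))
```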

Use the datasets from Fig. 1 of the “doubly stochastic” paper for this; start from e.g. the “power” dataset, and follow the same experimental procedure as Salimbeni.

Initialise the inducing locations by k-means, and initialise the inducing values to N(0,0.1). For the spectral kernels you need to do random restarts with different initial spectral components, following the strategy of Wilson’13 at

https://people.orie.cornell.edu/andrew/code/

(see steps 7 and 8 there). However, the first spectral component should always be initialised at mu=0, so only do the `random restarts` for q=2..10.
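The initialisation described above could be sketched as follows. Several details here are assumptions rather than a transcription of the referenced steps: 0.1 in N(0,0.1) is read as a variance, the frequency bound is a crude Nyquist-style heuristic, the weights split the output variance evenly, and the function names are hypothetical. The demo data are synthetic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_inducing(X, num_inducing, rng):
    """k-means inducing locations; inducing values ~ N(0, 0.1) (0.1 read as a variance)."""
    Z, _ = kmeans2(X, num_inducing, minit="points")
    q_mu = rng.normal(0.0, np.sqrt(0.1), size=(num_inducing, 1))
    return Z, q_mu

def init_spectral_components(X, Y, Q, rng):
    """One random draw of (weights, means, variances) for a Q-component SM kernel."""
    N, D = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # input range per dimension
    max_freq = 0.5 * N / span                      # crude Nyquist-style bound (assumption)
    weights = np.full(Q, Y.var() / Q)              # split the output variance evenly
    means = rng.uniform(0.0, max_freq, size=(Q, D))
    means[0] = 0.0                                 # first component is pinned at mu = 0
    lengthscales = np.abs(rng.normal(0.0, span, size=(Q, D))) + 1e-6
    variances = 1.0 / (2.0 * np.pi * lengthscales) ** 2
    return weights, means, variances

def restart_inits(X, Y, Q, num_restarts, seed=0):
    """Restart initialisations; for Q=1 a single draw, since restarts are only used for Q >= 2."""
    rng = np.random.RandomState(seed)
    n = 1 if Q == 1 else num_restarts
    return [init_spectral_components(X, Y, Q, rng) for _ in range(n)]

# Synthetic stand-in for a dataset with 4 input dimensions.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
Y_demo = rng.randn(200, 1)
Z, q_mu = init_inducing(X_demo, 20, rng)
inits = restart_inits(X_demo, Y_demo, Q=3, num_restarts=5)
print(Z.shape, q_mu.shape, len(inits))
```

Each restart would then be optimised separately, keeping the run with the best objective value.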

Also, you need to try different step sizes in the Adam optimiser, while the mini-batch size can probably be fixed to some sensible value (maybe 100 throughout?). Record trace plots of both training and test performance over epochs. It would be convenient to have a single train/test split (e.g. 70/30) that is fixed at the very beginning.
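A fixed split and mini-batching along these lines might look like the sketch below. The function names, the seed and the candidate step sizes are hypothetical placeholders; only the 70/30 ratio and the batch size of 100 come from the text above.

```python
import numpy as np

def fixed_split(num_points, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) for a single split that is reproducible from the seed."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(num_points)
    n_train = int(round(train_fraction * num_points))
    return perm[:n_train], perm[n_train:]

def minibatch_indices(n_train, batch_size, rng):
    """Yield index arrays that cover one epoch in shuffled mini-batches."""
    perm = rng.permutation(n_train)
    for start in range(0, n_train, batch_size):
        yield perm[start:start + batch_size]

train_idx, test_idx = fixed_split(1000)
step_sizes = [1e-3, 1e-2, 1e-1]   # hypothetical Adam step sizes to sweep over
batches = list(minibatch_indices(len(train_idx), 100, np.random.RandomState(1)))
print(len(train_idx), len(test_idx), len(batches))
```

Computing the split once and reusing the same index arrays for every model keeps the trace plots comparable across experiments.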


In [3]:
import numpy as np
import tensorflow as tf
import time

import matplotlib.pyplot as plt
#matplotlib.use("Agg")
%matplotlib inline 

import sys
print(sys.path)

['', '/m/home/home1/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages/spyder/utils/ipython', '/l/gadichs1/gitrepos/aalto/Doubly-Stochastic-DGP', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python35.zip', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/plat-linux', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/lib-dynload', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages/IPython/extensions', '/m/home/home1/18/gadichs1/unix/.ipython']


In [4]:
from gpflow.likelihoods import Gaussian
from gpflow.kernels import RBF, White
from gpflow.mean_functions import Constant
from gpflow.models.sgpr import SGPR, GPRFITC
from gpflow.models.svgp import SVGP
from gpflow.models.gpr import GPR
from gpflow.training import AdamOptimizer, ScipyOptimizer
from gpflow.params import Parameter

from scipy.cluster.vq import kmeans2
from scipy.stats import norm
from scipy.special import logsumexp

from doubly_stochastic_dgp.dgp import DGP
from datasets import Datasets


### Using the `power` dataset

In [8]:
datasets = Datasets(data_path='data/')

data = datasets.all_datasets['power'].get_data()
X, Y, Xs, Ys, Y_std = [data[_] for _ in ['X', 'Y', 'Xs', 'Ys', 'Y_std']]
print('N: {}, D: {}, Ns: {}'.format(X.shape[0], X.shape[1], Xs.shape[0]))

N: 8611, D: 4, Ns: 957


## Spectral Mixture Kernel

The *spectral mixture kernel* generalizes stationary kernels (stationary covariance functions). It models the spectral density of a kernel as a scale-location mixture of Gaussians (Wilson, 2013). Since a stationary kernel and its spectral density are Fourier duals, we can recover the kernel by taking the inverse Fourier transform of the spectral density. The hyperparameters of the spectral mixture kernel can be tuned by optimizing the marginal likelihood, but the optimization is sensitive to initialization, so they must be initialised with care.
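For reference, the construction can be written out as follows. The notation is chosen here for illustration: $w_q$, $\mu_q$ and $v_q$ are the weight, mean and per-dimension variances of the $q$-th spectral component, $\tau = x - x'$, and each Gaussian component is assumed to have a diagonal covariance. The density is symmetrised so that the resulting kernel is real-valued.

$$
S(s) = \sum_{q=1}^{Q} \frac{w_q}{2} \left[ \mathcal{N}\big(s;\, \mu_q, \operatorname{diag}(v_q)\big) + \mathcal{N}\big(s;\, -\mu_q, \operatorname{diag}(v_q)\big) \right]
$$

$$
k(\tau) = \int S(s)\, e^{2\pi i\, s^\top \tau}\, ds = \sum_{q=1}^{Q} w_q \exp\!\Big(-2\pi^2 \sum_{p=1}^{P} \tau_p^2\, v_q^{(p)}\Big) \cos\!\big(2\pi\, \tau^\top \mu_q\big)
$$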

This kernel representation is statistically powerful, as it gives great flexibility for modelling spatio-temporal data. Applications include long-range crime prediction, time series, and image and video extrapolation (Wilson, 2013). The representation also helps us gain new intuitions about modelling problems.

In [None]:
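# Spectral mixture (SM) kernel (Wilson, 2013), written against the GPflow 1.x kernel API.
# Slicing: K() applies active_dims via self._slice unless the inputs are presliced.
# Relies on np, tf and Parameter imported in the cells above.
import gpflow
from gpflow import transforms
from gpflow.decors import params_as_tensors

class SpectralMixtureKernel(gpflow.kernels.Kernel):
    def __init__(self, num_mixtures=1, input_dim=1, mixture_weights=None,
                 mixture_scales=None, mixture_means=None,
                 active_dims=None, name=None):
        '''
        - num_mixtures is the number of mixtures; denoted as Q in
        Wilson 2013.
        - input_dim is the dimension of the input to the kernel.
        - mixture_weights is the array of Q positive mixture weights w_q.
        - mixture_scales is the Q x input_dim array of spectral variances v_q.
        - mixture_means is the Q x input_dim array of spectral means mu_q.
        - active_dims is the dimension of the X which needs to be used.
        There is no separate variance / lengthscales / ARD argument: the
        weights play the role of variances, and the per-dimension scales the
        role of inverse squared lengthscales.
        '''
        super().__init__(input_dim, active_dims, name=name)
        # With Q (num_mixtures) = 1 and mean 0, the SM kernel is the SE kernel.
        self.num_mixtures = num_mixtures  # not a parameter

        # Defaults (an assumption): equal weights, unit scales, zero means.
        if mixture_weights is None:
            mixture_weights = np.ones(num_mixtures) / num_mixtures
        if mixture_scales is None:
            mixture_scales = np.ones((num_mixtures, input_dim))
        if mixture_means is None:
            mixture_means = np.zeros((num_mixtures, input_dim))

        # TODO: need to put a bound of [-100,100]
        self.mixture_weights = Parameter(mixture_weights,
                                         transform=transforms.positive)
        self.mixture_scales = Parameter(mixture_scales,
                                        transform=transforms.positive)
        # TODO: need to put a bound of [-100,100], but the zeroth component
        # should have value '0'
        self.mixture_means = Parameter(mixture_means)

    @params_as_tensors
    def K(self, X, X2=None, presliced=False):
        if not presliced:
            X, X2 = self._slice(X, X2)
        if X2 is None:
            X2 = X

        # Pairwise differences tau[0, n, m, d] = X[n, d] - X2[m, d]
        tau = tf.expand_dims(tf.expand_dims(X, 1) - tf.expand_dims(X2, 0), 0)  # 1 x N1 x N2 x D
        v = self.mixture_scales[:, None, None, :]   # Q x 1 x 1 x D
        mu = self.mixture_means[:, None, None, :]   # Q x 1 x 1 x D
        w = self.mixture_weights[:, None, None]     # Q x 1 x 1

        # k(tau) = sum_q w_q exp(-2 pi^2 sum_d tau_d^2 v_qd) cos(2 pi tau^T mu_q)
        exp_term = tf.exp(-2.0 * np.pi ** 2 * tf.reduce_sum(v * tf.square(tau), -1))
        cos_term = tf.cos(2.0 * np.pi * tf.reduce_sum(mu * tau, -1))
        return tf.reduce_sum(w * exp_term * cos_term, 0)  # N1 x N2

    @params_as_tensors
    def Kdiag(self, X, presliced=False):
        # k(0) = sum_q w_q for every input point
        return tf.fill(tf.stack([tf.shape(X)[0]]),
                       tf.reduce_sum(self.mixture_weights))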

In [None]:
## TASK: Implement Spectral kernel  # check the links and code 
# https://people.orie.cornell.edu/andrew/code/
# gpflow kernel implementation http://gpflow.readthedocs.io/en/latest/notebooks/kernels.html

# gpytorch spectral mixture kernel
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/kernel.py
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/spectral_mixture_kernel.py



In [None]:
# Baseline: a single-layer (one-node) SVGP with a spectral kernel; it is common to
# all experiments.

In [14]:
# Experiments are defined by increasing the number of layers.
# For each deep architecture we have the baseline above, and we test an increasing number
# of spectral mixtures (i.e., from Q=1 to Q=10).
# Compare runtime / accuracy / optimization / kernel behaviour.


# Doubly Stochastic help: 
#https://github.com/ICL-SML/Doubly-Stochastic-DGP/blob/master/demos/demo_regression_UCI.ipynb