# Deep Gaussian Processes using Spectral Kernels

We will experiment with different numbers of spectral mixture components in the doubly stochastic variational inference framework for `deep Gaussian processes`.

The following models will be compared:

1) deep GP with Gaussian kernel

2) deep GP with SM kernel [wilson’13]

3) deep GP with GSM kernel [remes’17]

We also compare to the standard SVI-GP without deepness (gpflow/SVGP) as a baseline.

We already know that the Gaussian kernel can be reproduced by a 1-component SM kernel (with its spectral mean at zero). Hence, the interesting research question is how the DGP behaves when we go from the Gaussian kernel (Q=1) to kernels with many spectral components (e.g. Q=10). How do the runtime, accuracy, optimization and learned kernel behave, and do we get overfitting?
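The Q=1 equivalence can be checked numerically. The sketch below is a plain NumPy version of the one-dimensional SM kernel in the Wilson'13 parameterisation (weights `w`, spectral means `mu`, spectral variances `v`). The function names `sm_kernel_1d` and `rbf_kernel_1d` are illustrative only and are not part of `doubly_stochastic_dgp` or gpflow.

```python
import numpy as np

def sm_kernel_1d(tau, w, mu, v):
    """1-D SM kernel: sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)."""
    tau = np.asarray(tau, dtype=float)[..., None]   # trailing axis broadcasts over the Q components
    w, mu, v = (np.asarray(a, dtype=float) for a in (w, mu, v))
    envelope = np.exp(-2.0 * np.pi**2 * tau**2 * v)
    return np.sum(w * envelope * np.cos(2.0 * np.pi * tau * mu), axis=-1)

def rbf_kernel_1d(tau, variance, lengthscale):
    """Gaussian (RBF) kernel: variance * exp(-tau^2 / (2 lengthscale^2))."""
    tau = np.asarray(tau, dtype=float)
    return variance * np.exp(-0.5 * (tau / lengthscale) ** 2)

tau = np.linspace(-3.0, 3.0, 101)
v = 0.05                                            # spectral variance of the single component
lengthscale = 1.0 / (2.0 * np.pi * np.sqrt(v))      # matching RBF lengthscale
k_sm = sm_kernel_1d(tau, w=[1.5], mu=[0.0], v=[v])  # Q=1 with mu=0
k_rbf = rbf_kernel_1d(tau, variance=1.5, lengthscale=lengthscale)
max_abs_diff = float(np.max(np.abs(k_sm - k_rbf)))
```

With `mu=0` the cosine term is identically one, so the single component is a Gaussian kernel with signal variance `w` and lengthscale `1 / (2 * pi * sqrt(v))`.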


`The core would then be 5x10=50 experiments with 1..5 layers, and 1..10 spectral components. These will all be run with the deep GP of Salimbeni.`

Also, you should run baselines:

- [5] deep GP with Gaussian kernel with 1..5 layers
- [1] SVGP with Gaussian kernel
- [10] SVGP with spectral kernel with 1..10 components

This will in total give us 50+5+1+10=66 experiments.
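To keep the bookkeeping explicit, the experiment grid described above can be enumerated programmatically. This is only a sketch: the dictionary keys (`model`, `kernel`, `layers`, `Q`) and the helper name `build_experiment_grid` are made up for illustration and do not correspond to any existing configuration format.

```python
from itertools import product

def build_experiment_grid(max_layers=5, max_components=10):
    """Enumerate the core experiments and the baselines as plain dictionaries."""
    grid = []
    # Core: deep GP with SM kernel, 1..5 layers x 1..10 spectral components.
    for layers, q in product(range(1, max_layers + 1), range(1, max_components + 1)):
        grid.append({"model": "DGP", "kernel": "SM", "layers": layers, "Q": q})
    # Baseline: deep GP with Gaussian kernel, 1..5 layers.
    for layers in range(1, max_layers + 1):
        grid.append({"model": "DGP", "kernel": "Gaussian", "layers": layers, "Q": None})
    # Baseline: SVGP with Gaussian kernel.
    grid.append({"model": "SVGP", "kernel": "Gaussian", "layers": 1, "Q": None})
    # Baseline: SVGP with SM kernel, 1..10 spectral components.
    for q in range(1, max_components + 1):
        grid.append({"model": "SVGP", "kernel": "SM", "layers": 1, "Q": q})
    return grid

experiments = build_experiment_grid()
num_experiments = len(experiments)
```

Iterating over `experiments` and storing one result record per entry makes it easy to confirm that no configuration was skipped.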

You should use the datasets from Fig. 1 of the “doubly stochastic” paper for this; start from e.g. the “power” dataset, and follow the same experimental procedure as Salimbeni.

Initialise the inducing locations by k-means, and initialise the inducing values to N(0,0.1). For the spectral kernels you need to do random restarts with different initial spectral components, following the strategy of Wilson’13 at

https://people.orie.cornell.edu/andrew/code/

(see steps 7 and 8 there). However, the first spectral component should always be initialised at mu=0, so only do the `random restarts` for the components q=2..10.
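The inducing-point initialisation described above could look roughly as follows. This is a sketch under two assumptions: the 0.1 in N(0,0.1) is treated as a standard deviation (the text does not say whether it is a variance), and the helper name `init_inducing` is hypothetical. `kmeans2` is the SciPy routine already imported in this notebook.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_inducing(X, num_inducing, output_dim=1, std=0.1, seed=0):
    """Return k-means inducing locations Z and random initial inducing values q_mu."""
    rng = np.random.RandomState(seed)
    # minit='points' starts k-means from randomly chosen rows of X,
    # so the centroids depend on that random choice.
    Z, _ = kmeans2(X, num_inducing, minit='points')
    # Assumption: 0.1 is used as the standard deviation of the initial inducing values.
    q_mu = std * rng.randn(num_inducing, output_dim)
    return Z, q_mu

# Synthetic stand-in data with 4 input dimensions.
X_demo = np.random.RandomState(1).randn(500, 4)
Z_demo, q_mu_demo = init_inducing(X_demo, num_inducing=20)
```

For the deep models, the same `Z` is typically only a starting point for the first layer; how the deeper layers are initialised is left to the DGP implementation.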

Also, you need to try different step sizes in the Adam optimiser, while the mini-batch size can probably be fixed to some sensible value (maybe 100 throughout?). Record trace plots of both training and test performance over epochs. It would be convenient to have a single train/test split (e.g. 70/30) that is fixed at the very beginning.
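A fixed split and fixed-size mini-batches can be produced with a seeded permutation. The sketch below assumes the split is made by hand rather than taken from the dataset loader; the helper names `fixed_split` and `minibatches` are illustrative.

```python
import numpy as np

def fixed_split(X, Y, train_fraction=0.7, seed=0):
    """Split once with a fixed seed so every experiment sees the same folds."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(X.shape[0])
    n_train = int(round(train_fraction * X.shape[0]))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]

def minibatches(X, Y, batch_size=100, seed=0):
    """Yield one epoch of shuffled mini-batches (the last one may be smaller)."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], Y[idx]

# Synthetic stand-in data.
X_demo = np.arange(4000, dtype=float).reshape(1000, 4)
Y_demo = X_demo.sum(axis=1, keepdims=True)
X_train, Y_train, X_test, Y_test = fixed_split(X_demo, Y_demo)
batch_sizes = [xb.shape[0] for xb, _ in minibatches(X_train, Y_train)]
```

Evaluating the training and test metrics once per pass over `minibatches(...)` gives the per-epoch traces mentioned above.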

**Why go deep?**

(Deep Probabilistic Modeling - Neil Lawrence, NIPS 2017)

In a single-layer GP we would need a *generalized spectral mixture kernel* to learn input-dependent lengthscales, but with a hierarchical architecture we can achieve this with simpler kernels, which increases the interpretability of the models.


In [1]:
import numpy as np
import tensorflow as tf
import time
import math

import matplotlib.pyplot as plt
#matplotlib.use("Agg")
%matplotlib inline 

from gpflow.likelihoods import Gaussian
from gpflow.kernels import RBF, White, Kernel
from gpflow.mean_functions import Constant
from gpflow.models.sgpr import SGPR, GPRFITC
from gpflow.models.svgp import SVGP
from gpflow.models.gpr import GPR
from gpflow.training import AdamOptimizer, ScipyOptimizer
from gpflow.params import Parameter
from gpflow.decors import params_as_tensors, autoflow

from scipy.cluster.vq import kmeans2
from scipy.stats import norm
from scipy.special import logsumexp

from doubly_stochastic_dgp.dgp import DGP
from doubly_stochastic_dgp.spectralmixture import SpectralMixture
from datasets import Datasets

#import sys
#print(sys.path)

### Using the `power` dataset

In [2]:
datasets = Datasets(data_path='data/')

data = datasets.all_datasets['power'].get_data()
X, Y, Xs, Ys, Y_std = [data[_] for _ in ['X', 'Y', 'Xs', 'Ys', 'Y_std']]
#print('N: {}, D: {}, Ns: {}'.format(X.shape[0], X.shape[1], Xs.shape[0]))

## Spectral Mixture Kernel

The *spectral mixture (SM) kernel* is a generalization of stationary kernels (or stationary covariance functions). It models the spectral density of the kernel as a scale-location mixture of Gaussians (Wilson, 2013). Using the principles of Fourier transforms, the kernel itself is recovered by taking the inverse Fourier transform of that spectral density. The hyperparameters of the spectral mixture kernel can be tuned by optimizing the marginal likelihood, but this requires careful initialization.
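For reference, the one-dimensional construction from Wilson (2013) can be written out as follows, with $w_q$, $\mu_q$ and $v_q$ denoting the mixture weights, means and variances, and $\tau = x - x'$. The multi-dimensional kernel used in the code may be parameterised differently.

$$
S(s) = \sum_{q=1}^{Q} \frac{w_q}{2}\left[\mathcal{N}(s;\mu_q,v_q) + \mathcal{N}(s;-\mu_q,v_q)\right],
\qquad
k_{SM}(\tau) = \int S(s)\, e^{2\pi i s \tau}\, ds = \sum_{q=1}^{Q} w_q \exp\left(-2\pi^2\tau^2 v_q\right)\cos\left(2\pi\tau\mu_q\right).
$$

Setting $\tau = 0$ gives $k_{SM}(x,x) = \sum_q w_q$, and a single component with $\mu_1 = 0$ is a Gaussian kernel.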

This kernel representation is statistically powerful, as it gives great flexibility for modelling spatio-temporal data. Applications include long-range crime prediction, time series, and image and video extrapolation (Wilson, 2013). The representation also helps us gain new intuitions about modelling problems.

**A short note on initialization of the hyperparameters:**
There are three sets of hyperparameters: the mixture weights, the mixture variances and the mixture means. Their initialization is vital, as we might get stuck in a suboptimal solution due to the non-convexity of the problem. How best to initialize them is a topic of contention; below we discuss the intuition behind each parameter and the reasoning for its initialization.

1. Mixture weights: The weights represent the variance of the signal (target variable), analogous to $\sigma_{f}^2$ in the RBF kernel. This is evident from $k_{SM}(x,x)$, where $x$ is an input point, which reduces to the sum of the weights. If the weight of a particular mixture component is high, the corresponding frequency explains a large share of the variance in the data. All weights are initialized to the same value: the standard deviation of the target variable divided by the number of components.

2. Mixture variances: It is easier to interpret the variances as inverse length-scales. Whereas a mixture mean represents a particular frequency in the signal, the mixture variance represents the range over which that component persists before its frequency changes; in data space, it corresponds to the distance over which the function changes significantly. These parameters are usually initialized by sampling from a truncated Normal distribution whose standard deviation is the range of the data in each (data) dimension.

3. Mixture means: These represent the different frequencies in the data. It may be easier to view them as periods (1/frequency). Optimizing this parameter is plagued by the multimodality of the marginal likelihood. It is initialized using the inverse of the smallest distance between data points in each dimension. If the Nyquist frequency $(f_n)$ is available, we can sample from a uniform distribution on $[0,f_n]$.
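The three rules above can be sketched in NumPy as follows. Several details are assumptions rather than the method used by `SpectralMixture`: the truncated Normal is taken to be a half-normal (absolute value of a zero-mean draw), the draw is used directly rather than inverted, the Nyquist frequency is taken as 0.5 divided by the smallest spacing in each dimension, and the first component's mean is pinned to zero as required in the experiment plan. The helper name `init_sm_hyperparameters` is hypothetical.

```python
import numpy as np

def init_sm_hyperparameters(X, Y, num_mixtures, seed=0):
    """Return (weights, means, scales) with shapes (Q,), (Q, D), (Q, D)."""
    rng = np.random.RandomState(seed)
    D = X.shape[1]
    # 1. Weights: standard deviation of the targets, split equally over components.
    weights = np.full(num_mixtures, np.std(Y) / num_mixtures)
    # 2. Scales: half-normal draws whose std is the data range in each dimension (assumption).
    data_range = X.max(axis=0) - X.min(axis=0)
    scales = np.abs(rng.randn(num_mixtures, D) * data_range)
    # 3. Means: uniform on [0, f_n], with f_n = 0.5 / smallest spacing per dimension (assumption).
    #    Requires at least two distinct values in every input dimension.
    nyquist = np.array([0.5 / np.diff(np.unique(X[:, d])).min() for d in range(D)])
    means = rng.uniform(size=(num_mixtures, D)) * nyquist
    means[0, :] = 0.0   # first component always starts at mu = 0
    return weights, means, scales

# Synthetic stand-in data with 4 input dimensions.
demo_rng = np.random.RandomState(1)
X_demo = demo_rng.rand(200, 4)
Y_demo = demo_rng.randn(200, 1)
weights, means, scales = init_sm_hyperparameters(X_demo, Y_demo, num_mixtures=3)
```

A random restart then amounts to calling the function again with a different `seed` and keeping the run with the best objective; only the rows of `means` for q >= 2 change in a way that matters for the first-component constraint.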

**Experiments on the hyperparameters:**
- ~~*Mixture weights:* What happens to the other parameters if weights are thought to be drawn from a distribution? Can we use PSIS to smooth the weights? What effect does it have on the signal? How does this affect other parameters?~~

- *Mixture means:* Once we have decided on the number of mixtures, one way to initialize the means could be to compute the *fast Fourier transform (FFT)* of the signal and use the lowest two frequencies, jittered with some noise. The lowest frequencies are chosen because the smoothness assumption kicks in. ~~As we use spectral kernels for learning the non-stationarity in the data, this would be a bad idea.~~

- *Number of mixtures:* Why don't we select the number of mixtures based on the frequencies from the FFT of the signal?
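One possible reading of the FFT idea is sketched below for a regularly sampled one-dimensional signal. It picks the frequencies with the largest power rather than the lowest ones, and it applies no jitter; both are choices made for this illustration only. The function name `dominant_frequencies` is hypothetical.

```python
import numpy as np

def dominant_frequencies(y, sampling_rate, num_mixtures):
    """Return the num_mixtures frequencies with the largest periodogram power, sorted ascending."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                                   # remove the zero-frequency offset
    power = np.abs(np.fft.rfft(y)) ** 2                # periodogram (unnormalised)
    freqs = np.fft.rfftfreq(y.size, d=1.0 / sampling_rate)
    top = np.argsort(power)[::-1][:num_mixtures]       # indices of the strongest bins
    return np.sort(freqs[top])

# Synthetic signal: two sinusoids at 3 Hz and 7 Hz, sampled at 100 Hz for 10 seconds.
t = np.arange(1000) / 100.0
signal = np.sin(2 * np.pi * 3.0 * t) + 0.5 * np.sin(2 * np.pi * 7.0 * t)
init_means = dominant_frequencies(signal, sampling_rate=100.0, num_mixtures=2)
```

Counting how many periodogram peaks exceed some threshold would be one way to turn the same computation into a choice of the number of mixtures, though the threshold itself would need tuning.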


In [None]:
## TASK: Implement Spectral kernel  # check the links and code 
# https://people.orie.cornell.edu/andrew/code/
# gpflow kernel implementation http://gpflow.readthedocs.io/en/latest/notebooks/kernels.html

# gpytorch spectral mixture kernel
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/kernel.py
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/spectral_mixture_kernel.py

# plotting functions for the kernel
def plotkernelsample(k, ax, xmin=-3, xmax=3):
    xx = np.linspace(xmin, xmax, 100)[:,None]
    K = k.compute_K_symm(xx)
    ax.plot(xx, np.random.multivariate_normal(np.zeros(100), K, 3).T)
    ax.set_title(k.__class__.__name__)

def plotkernelfunction(k, ax, xmin=-3, xmax=3, other=0):
    # Plot k(x, other) as a function of x for a fixed second argument `other`.
    xx = np.linspace(xmin, xmax, 100)[:,None]
    ax.plot(xx, k.compute_K(xx, np.zeros((1,1)) + other))
    ax.set_title(k.__class__.__name__ + ' k(x, %f)'%other)


In [4]:
# Test the spectral mixture kernel
kern = SpectralMixture(num_mixtures=3,input_dim=4)

#sess=tf.InteractiveSession()

In [5]:
# Baseline: take SVGP (a single-layer, single-node GP) with the spectral kernel as the baseline;
# it is common to all experiments.
kern.as_pandas_table()


| | class | prior | transform | trainable | shape | fixed_shape | value |
|---|---|---|---|---|---|---|---|
| SpectralMixture/mixture_scales | Parameter | | +ve | True | (3, 4) | True | [[1.54514445931844, 3.011511010970395, 0.25073... |
| SpectralMixture/mixutre_means | Parameter | | (none) | True | (3, 4) | True | [[8.409922044506956, 5.707194625543655, 3.3423... |
| SpectralMixture/mixture_weights | Parameter | | (none) | True | (3,) | True | [0.33556560842250027, 0.33556560842250027, 0.3... |




In [None]:
# Experiments are defined as increasing number of layers.
# For each deep architecture we have #Baseline and we test for increasing number of spectral
# mixtures (i.e., from Q=1 to Q=10). 
# Compare runtime / accuracy / optimization / kernel behaviour.


# Doubly Stochastic help: 
#https://github.com/ICL-SML/Doubly-Stochastic-DGP/blob/master/demos/demo_regression_UCI.ipynb

In [None]:
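# NOTE: this cell uses the older, capitalised `GPflow` package API (`ekernels`, `param.AutoFlow`),
# unlike the `gpflow` imports at the top of the notebook.
# `Ee`, `z`, `xmu` and `xcov` are not defined in this notebook and must be supplied before running.
kern = GPflow.ekernels.RBF(Ee, ARD=True)

# Evaluate the kernel matrix K(input, input).
@GPflow.param.AutoFlow((tf.float64,))
def eval_K(kernel, input):
    return kernel.K(input)

k = eval_K(kern, z)

# Evaluate the kernel expectation exKxz for inducing inputs z under a Gaussian with mean x and covariance xx.
@GPflow.param.AutoFlow((tf.float64,),(tf.float64,),(tf.float64,))
def eval_exKxz(kernel, z, x, xx):
    return kernel.exKxz(z, x, xx)

exkxz = eval_exKxz(kern, z, xmu, xcov)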