# Deep Gaussian Processes using Spectral Kernels

We will experiment with different numbers of spectral mixture components in the doubly stochastic variational inference framework for `deep Gaussian processes`.

The following models will be compared:

1) deep GP with Gaussian kernel

2) deep GP with SM kernel [wilson’13]

3) deep GP with GSM kernel [remes’17]

We also compare to the standard SVI-GP without depth (gpflow/SVGP) as a baseline.

We already know that the Gaussian kernel can be reproduced by a 1-component SM kernel with its mean at zero. Hence, the interesting research question is how the DGP behaves when we go from the Gaussian kernel (Q=1) to highly spectral kernels (e.g. Q=10). How do the runtime, accuracy, optimization and kernel behave, and do we get overfitting?
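
The Q=1 equivalence can be checked numerically. The following is a minimal NumPy sketch, assuming the 1-D SM form $k(\tau) = \sum_q w_q \exp(-2\pi^2\tau^2\sigma_q^2)\cos(2\pi\tau\mu_q)$ with $\sigma_q$ a spectral standard deviation. The function names `sm_kernel_1d` and `rbf_kernel_1d` are illustrative and not part of gpflow.

```python
import numpy as np

def sm_kernel_1d(tau, weights, scales, means):
    """1-D spectral mixture kernel; `scales` are spectral standard deviations."""
    tau = np.asarray(tau, dtype=float)[..., None]            # (..., 1) against (Q,)
    exp_term = np.exp(-2.0 * np.pi**2 * tau**2 * scales**2)
    cos_term = np.cos(2.0 * np.pi * tau * means)
    return np.sum(weights * exp_term * cos_term, axis=-1)

def rbf_kernel_1d(tau, variance, lengthscale):
    """Gaussian (RBF) kernel as a function of the difference tau."""
    tau = np.asarray(tau, dtype=float)
    return variance * np.exp(-0.5 * tau**2 / lengthscale**2)

tau = np.linspace(-3.0, 3.0, 61)
sigma = 0.25                                   # arbitrary spectral std
lengthscale = 1.0 / (2.0 * np.pi * sigma)      # matching RBF length-scale

k_sm = sm_kernel_1d(tau, np.array([1.5]), np.array([sigma]), np.array([0.0]))
k_rbf = rbf_kernel_1d(tau, 1.5, lengthscale)
max_abs_diff = np.max(np.abs(k_sm - k_rbf))
print(max_abs_diff)
```

With the mean at zero the cosine term is 1, so the two kernels coincide when the length-scale equals $1/(2\pi\sigma)$.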


`The core would then be 5x10=50 experiments with 1..5 layers, and 1..10 spectral components. These will all be run with the deep GP of Salimbeni.`

Also, you should run baselines:

- [5] deep GP with Gaussian kernel with 1..5 layers
- [1] SVGP with Gaussian kernel
- [10] SVGP with spectral kernel with 1..10 components

This will in total give us 50+5+1+10=66 experiments.
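
The experiment grid can be enumerated programmatically, which also confirms the count. This is a sketch only: the dictionary keys and the `build_experiments` helper are hypothetical names chosen for illustration.

```python
from itertools import product

def build_experiments(max_layers=5, max_components=10):
    """List every (model, kernel, layers, Q) configuration described above."""
    experiments = []
    # Core: deep GP with SM kernel, 1..5 layers x 1..10 components.
    for layers, q in product(range(1, max_layers + 1),
                             range(1, max_components + 1)):
        experiments.append({"model": "DGP", "kernel": "SM",
                            "layers": layers, "Q": q})
    # Baseline: deep GP with Gaussian kernel, 1..5 layers.
    for layers in range(1, max_layers + 1):
        experiments.append({"model": "DGP", "kernel": "RBF",
                            "layers": layers, "Q": None})
    # Baseline: SVGP with Gaussian kernel.
    experiments.append({"model": "SVGP", "kernel": "RBF",
                        "layers": 1, "Q": None})
    # Baseline: SVGP with SM kernel, 1..10 components.
    for q in range(1, max_components + 1):
        experiments.append({"model": "SVGP", "kernel": "SM",
                            "layers": 1, "Q": q})
    return experiments

experiments = build_experiments()
print(len(experiments))
```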

You should use the datasets from Fig. 1 of the “doubly stochastic” paper for this. Start from e.g. the “power” dataset, and follow the same experimental procedure as Salimbeni.

Initialise the inducing locations by k-means, and initialise the inducing values from N(0,0.1). For the spectral kernels you need to do random restarts with different initial spectral components, following the strategy of Wilson’13 at

https://people.orie.cornell.edu/andrew/code/

(check steps 7+8 there). However, the first spectral component should always be initialised at mu=0. Thus only do the `random restarts` for q=2..10.
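
The initialisation described above could be sketched as follows. Several assumptions are made here: the 0.1 in N(0,0.1) is treated as a standard deviation, the restart means are drawn uniformly up to a user-chosen `max_freq` (this is not Wilson's exact procedure from the linked page), and the helper names `init_inducing` and `sample_restart_means` are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_inducing(X, num_inducing, output_dim=1, scale=0.1, seed=0):
    """k-means inducing locations and small random inducing values."""
    rng = np.random.RandomState(seed)
    Z, _ = kmeans2(X, num_inducing, minit="points")      # (M, D) centroids
    q_mu = scale * rng.randn(num_inducing, output_dim)   # assumed std = 0.1
    return Z, q_mu

def sample_restart_means(num_components, input_dim, max_freq, rng):
    """Random spectral means for one restart; component 0 is pinned at mu=0."""
    means = rng.uniform(0.0, max_freq, size=(input_dim, num_components))
    means[:, 0] = 0.0
    return means

X_demo = np.random.RandomState(1).randn(200, 4)          # stand-in for real inputs
Z, q_mu = init_inducing(X_demo, num_inducing=10)
restart_means = sample_restart_means(3, 4, max_freq=2.0,
                                     rng=np.random.RandomState(0))
```

For Q=1 no restart is needed, since the only component is fixed at mu=0.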

You also need to try different step sizes in the Adam optimiser, while the mini-batch size can probably be fixed to some sensible value (maybe 100 throughout?). Record trace plots of both training and test performance over epochs. It would be convenient to have only a single train/test split (e.g. 70/30) that is fixed at the very beginning.
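
A fixed split can be produced once with a seeded permutation and reused by every experiment. This is a minimal sketch: `fixed_split` is a hypothetical helper, and the step-size grid is a placeholder rather than a recommendation.

```python
import numpy as np

def fixed_split(n, train_frac=0.7, seed=0):
    """Return reproducible train/test index arrays for n data points."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return perm[:n_train], perm[n_train:]

train_idx, test_idx = fixed_split(1000)

# Placeholder values: the Adam step sizes to sweep, with a fixed mini-batch.
step_sizes = [1e-3, 5e-3, 1e-2]
minibatch_size = 100
```

Calling `fixed_split` again with the same seed returns the same indices, so all 66 runs see identical folds.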


In [44]:
import numpy as np
import tensorflow as tf
import time
import math

import matplotlib.pyplot as plt
#matplotlib.use("Agg")
%matplotlib inline 

import sys
print(sys.path)

['', '/m/home/home1/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages/spyder/utils/ipython', '/l/gadichs1/gitrepos/aalto/Doubly-Stochastic-DGP', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python35.zip', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/plat-linux', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/lib-dynload', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages', '/u/18/gadichs1/unix/.conda/envs/deepgps_2_2/lib/python3.5/site-packages/IPython/extensions', '/m/home/home1/18/gadichs1/unix/.ipython']


In [50]:
from gpflow.likelihoods import Gaussian
from gpflow.kernels import RBF, White, Kernel
from gpflow.mean_functions import Constant
from gpflow.models.sgpr import SGPR, GPRFITC
from gpflow.models.svgp import SVGP
from gpflow.models.gpr import GPR
from gpflow.training import AdamOptimizer, ScipyOptimizer
from gpflow.params import Parameter
from gpflow.decors import params_as_tensors, autoflow

from scipy.cluster.vq import kmeans2
from scipy.stats import norm
from scipy.special import logsumexp

from doubly_stochastic_dgp.dgp import DGP
from datasets import Datasets


### Using the `power` dataset

In [3]:
datasets = Datasets(data_path='data/')

data = datasets.all_datasets['power'].get_data()
X, Y, Xs, Ys, Y_std = [data[_] for _ in ['X', 'Y', 'Xs', 'Ys', 'Y_std']]
print('N: {}, D: {}, Ns: {}'.format(X.shape[0], X.shape[1], Xs.shape[0]))

N: 8611, D: 4, Ns: 957


## Spectral Mixture Kernel

The *spectral mixture (SM) kernel* generalizes stationary kernels (stationary covariance functions). It models the spectral density of the kernel as a scale-location mixture of Gaussians (Wilson, 2013). Since a stationary kernel and its spectral density are Fourier duals, the kernel is recovered by taking the inverse Fourier transform of the spectral density. The hyperparameters of the SM kernel can be tuned by optimizing the marginal likelihood, but this requires careful initialization.
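
For reference, the resulting kernel as given by Wilson (2013) can be written as follows, with $\tau = x - x'$, $Q$ mixture components and $D$ input dimensions. The symbols $w_q$ (weights), $v_q^{(d)}$ (variances) and $\mu_q^{(d)}$ (means) are notation assumed here for the three sets of hyperparameters:

$$k_{SM}(\tau) = \sum_{q=1}^{Q} w_q \prod_{d=1}^{D} \exp\left(-2\pi^2 \tau_d^2 v_q^{(d)}\right) \cos\left(2\pi \tau_d \mu_q^{(d)}\right)$$

Setting $\tau = 0$ gives $k_{SM}(0) = \sum_{q} w_q$, and a single component with $\mu_1 = 0$ reduces to the Gaussian kernel.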

This kernel representation is statistically powerful as it gives immense flexibility to model spatio-temporal data. Applications have been found in long-range crime prediction, time series, and image and video extrapolation (Wilson, 2013). This kernel representation also helps us gain novel intuitions about modelling problems.

**A short note on initialization of the hyperparameters:**
There are three sets of hyperparameters, namely the mixture weights, mixture variances and mixture means. Their initialization is vital, as we might get stuck in a non-optimal solution due to the non-convexity of the problem. How to initialize them is a topic of contention; next we discuss the intuition behind these parameters and the reasons for their initializations.

1. Mixture weights: The weights represent the variance of the signal (target variable), analogous to $\sigma_{f}^2$ in the RBF kernel. This is evident from $k_{SM}(x,x)$, where $x$ is an input point, which reduces to the sum of the weights. The mixture weights are initialized equally, using the standard deviation of the target variable divided by the number of components. If the weight of a particular mixture component is high, the corresponding frequency explains a large share of the variance in the data.

2. Mixture variances: It is easier to interpret the variances as inverse length-scales. The mixture means represent a particular frequency in the signal, whereas the mixture variance represents the range of the mixture before it changes its frequency; in the data space it represents a significant change in the function. These parameters are usually initialized by sampling from a truncated Normal distribution with mean as the range of the 

3. Mixture means: These represent the different frequencies in the data. It may be easier to view them as periods (1/frequency). Optimizing this parameter is plagued by multimodality of the marginal likelihood. This parameter is initialized using the inverse of the smallest distance between data points in each dimension. If the Nyquist frequency ($f_n$) is known, we can sample from a uniform distribution on $[0,f_n]$.

**Experiments on the hyperparameters:**
- *Mixture weights:* What happens to the other parameters if weights are thought to be drawn from a distribution? Can we use PSIS to smooth the weights? What effect does it have on the signal? How does this affect other parameters? 

- *Mixture means:* If we decide on the number of mixtures, then one way to initialize the means is to compute the *fast Fourier transform (FFT)* of the signal and use the lowest two frequencies, jittered with some noise. The lowest frequencies are chosen because the smoothness assumption kicks in. However, as we use spectral kernels for learning the non-stationarity in the data, this would be a bad idea.

- *Number of mixtures:* Why don't we select the number of mixtures based on the frequencies from the FFT of the signal?
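
As a starting point for these FFT-based ideas, candidate frequencies can be read off a periodogram. The sketch below assumes a 1-D signal sampled on an evenly spaced grid with spacing `dt`, and picks the frequencies with the largest power, which is one possible variant and not the lowest-frequency rule mentioned above. `dominant_frequencies` is a hypothetical helper name.

```python
import numpy as np

def dominant_frequencies(y, dt, num_components):
    """Frequencies of the num_components largest periodogram peaks."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                              # drop the zero-frequency offset
    power = np.abs(np.fft.rfft(y)) ** 2           # periodogram (unnormalised)
    freqs = np.fft.rfftfreq(len(y), d=dt)         # matching frequency grid
    top = np.argsort(power)[::-1][:num_components]
    return np.sort(freqs[top])

# Synthetic signal with components at 1 and 3 cycles per unit of t.
t = np.arange(1000) * 0.01
y = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 3.0 * t)
found = dominant_frequencies(y, dt=0.01, num_components=2)
print(found)
```

The number of clearly separated peaks in `power` could likewise serve as a rough guide for choosing Q.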


In [51]:
# Generating the spectral mixture kernel (written against the gpflow 1.x API)
# Slicing: K applies self._slice (as gpflow 1.x kernels do) so that
# active_dims is respected.
from gpflow import transforms

class SpectralMixtureKernel(Kernel):
    def __init__(self, num_mixtures=1, mixture_weights=None,
                 mixture_scales=None, mixture_means=None,
                 input_dim=1, active_dims=None, name=None):
        '''
        - num_mixtures is the number of mixtures; denoted as Q in
        Wilson 2013.
        - mixture_weights is the array of Q weights (signal variance
        carried by each mixture).
        - mixture_scales is the D x Q array of spectral standard
        deviations (inverse length-scales).
        - mixture_means is the D x Q array of means (frequencies) of
        the mixtures.
        - input_dim is the dimension of the input to the kernel (D).
        - active_dims is the dimension of the X which needs to be used.
        '''
        super().__init__(input_dim, active_dims, name=name)
        # Q(num_of_mixtures)=1 with zero mean: SM kernel is the SE kernel.
        if num_mixtures == 1:
            print("Using default mixture = 1")
        self.num_mixtures = num_mixtures  # not a parameter
        Q, D = num_mixtures, input_dim

        # Placeholder defaults so that parameter shapes are fixed up front;
        # call initialize(train_x, train_y) for data-driven values.
        if mixture_weights is None:
            mixture_weights = np.ones(Q) / Q
        if mixture_scales is None:
            mixture_scales = np.ones((D, Q))
        if mixture_means is None:
            mixture_means = np.zeros((D, Q))

        # TODO: need to put a bound of [-100,100] on weights and means.
        # Positive weights keep the kernel positive semi-definite.
        self.mixture_weights = Parameter(mixture_weights,
                                         transform=transforms.positive)
        self.mixture_scales = Parameter(mixture_scales,
                                        transform=transforms.positive)
        # the zeroth component should have mean '0'
        self.mixture_means = Parameter(mixture_means)

    def initialize(self, train_x, train_y):
        '''NumPy port of the gpytorch SM kernel initialisation.'''
        train_x = np.asarray(train_x, dtype=float)
        if train_x.ndim == 1:
            train_x = train_x[:, None]
        D, Q = train_x.shape[1], self.num_mixtures

        x_sort = np.sort(train_x, axis=0)
        max_dist = x_sort[-1] - x_sort[0]                         # (D,)
        gaps = np.diff(x_sort, axis=0)
        # smallest non-zero gap per dimension (duplicates are ignored)
        min_dist = np.where(gaps > 0, gaps, np.inf).min(axis=0)   # (D,)

        # Inverse of lengthscales should be drawn from truncated Gaussian | N(0, max_dist^2)
        scales = 1.0 / np.abs(max_dist[:, None] * np.random.randn(D, Q))
        # Draw means from Unif(0, 0.5 / minimum distance between two points)
        means = 0.5 * np.random.rand(D, Q) / min_dist[:, None]
        means[:, 0] = 0.0  # first component always starts at mu=0
        # Mixture weights should be roughly the stdv of the y values divided by
        # the number of mixtures
        weights = np.full(Q, np.std(train_y) / Q)

        self.mixture_scales.assign(scales)
        self.mixture_means.assign(means)
        self.mixture_weights.assign(weights)

    @params_as_tensors
    def K(self, X, X2=None, presliced=False):
        if not presliced:
            X, X2 = self._slice(X, X2)
        if X2 is None:
            X2 = X

        # pairwise differences: tau[n, m, d] = X[n, d] - X2[m, d]
        tau = tf.expand_dims(X, 1) - tf.expand_dims(X2, 0)   # N1 x N2 x D
        tau = tf.expand_dims(tau, -1)                         # N1 x N2 x D x 1

        # we will optimize the standard deviations.
        exp_term = -2. * math.pi**2 * tf.square(tau * self.mixture_scales)
        cos_term = 2. * math.pi * tau * self.mixture_means    # N1 x N2 x D x Q
        # product over input dimensions, then weighted sum over mixtures
        res = tf.reduce_prod(tf.exp(exp_term) * tf.cos(cos_term), axis=2)
        return tf.reduce_sum(res * self.mixture_weights, axis=-1)  # N1 x N2

    @params_as_tensors
    def Kdiag(self, X, presliced=False):
        # just the sum of weights. Weights represent the signal
        # variance.
        return tf.fill(tf.stack([tf.shape(X)[0]]),
                       tf.reduce_sum(self.mixture_weights))


In [52]:
## TASK: Implement Spectral kernel  # check the links and code 
# https://people.orie.cornell.edu/andrew/code/
# gpflow kernel implementation http://gpflow.readthedocs.io/en/latest/notebooks/kernels.html

# gpytorch spectral mixture kernel
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/kernel.py
# https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/spectral_mixture_kernel.py

# plotting functions for the kernel
def plotkernelsample(k, ax, xmin=-3, xmax=3):
    xx = np.linspace(xmin, xmax, 100)[:,None]
    K = k.compute_K_symm(xx)
    ax.plot(xx, np.random.multivariate_normal(np.zeros(100), K, 3).T)
    ax.set_title(k.__class__.__name__)

def plotkernelfunction(k, ax, xmin=-3, xmax=3, other=0):
    # plot k(x, other) as a function of x
    xx = np.linspace(xmin, xmax, 100)[:,None]
    ax.plot(xx, k.compute_K(xx, np.zeros((1,1)) + other))
    ax.set_title(k.__class__.__name__ + ' k(x, %f)'%other)


In [None]:
# Test the spectral mixture kernel


In [None]:
# Baseline: SVGP (a single-layer, single-node GP) with the spectral kernel.
# It is common to all experiments.

In [14]:
# Experiments are defined as increasing number of layers.
# For each deep architecture we have #Baseline and we test for increasing number of spectral
# mixtures (i.e., from Q=1 to Q=10). 
# Compare runtime / accuracy / optimization / kernel behaviour.


# Doubly Stochastic help: 
#https://github.com/ICL-SML/Doubly-Stochastic-DGP/blob/master/demos/demo_regression_UCI.ipynb