## Defining a KDE Function to Evaluate Six Dimensional Position-Velocity Space

**What is a KDE?**

In kernel density estimation (KDE), a kernel function is centered on each data point, and the kernels are summed to give a smooth probability density function over the input parameter(s). This is useful because a histogram-style density estimate is piecewise constant and therefore not differentiable, whereas a kernel density estimate built from a smooth kernel (such as the Gaussian) is.
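As a minimal sketch of this idea, the block below builds a one-dimensional Gaussian KDE by hand with NumPy. The function name `gaussian_kde_1d`, the toy data, and the bandwidth value are all hypothetical and chosen only for illustration; the notebook itself uses scikit-learn rather than this hand-rolled version.

```python
import numpy as np

def gaussian_kde_1d(data, bandwidth):
    """Return a function that evaluates a 1D Gaussian KDE built from `data`."""
    data = np.asarray(data, dtype=float)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # One Gaussian bump per data point, evaluated at every query point
        z = (x[:, None] - data[None, :]) / bandwidth
        kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
        # The density estimate is the average of the individual kernels
        return kernels.mean(axis=1)

    return density

toy_data = np.array([1.0, 1.5, 4.0])            # hypothetical data points
kde_1d = gaussian_kde_1d(toy_data, bandwidth=0.5)

# Riemann-sum check that the estimate integrates to roughly 1
grid = np.linspace(-5.0, 10.0, 3001)
area = np.sum(kde_1d(grid)) * (grid[1] - grid[0])
```

Because each kernel is a normalized Gaussian, the averaged estimate should integrate to approximately 1, which is what `area` checks numerically.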

**Using `sklearn.neighbors.KernelDensity`:**

Six kernels are currently available in this module (gaussian, tophat, epanechnikov, exponential, linear, cosine). The methods from this module that were used are as follows:

`fit` takes in an NxM matrix of N data points and M parameters and fits them to the specified kernel and bandwidth.

`score_samples` takes in an array of query points and evaluates the previously `fit` model at those points. The input is a QxM matrix of Q query points and M parameters, and the output is a one-dimensional array of length Q (referred to below as a 1xQ array) containing the logarithm of the probability density at each point.

For practical purposes, the function converts the logarithmic values to ordinary densities by taking the exponential. For the complete original documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity
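The shapes described above can be checked with a short sketch. The random training and query arrays, the seed, and the bandwidth of 0.5 are hypothetical values used only to illustrate the `fit` and `score_samples` calls.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 6))    # N = 200 points, M = 6 parameters (hypothetical)
queries = rng.normal(size=(5, 6))    # Q = 5 query points, same M

# Fit the training data with a chosen kernel and bandwidth
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(train)

# score_samples returns the log-density at each query point, shape (Q,)
log_dens = kde.score_samples(queries)

# Exponentiate to recover ordinary density values
dens = np.exp(log_dens)
```

The query array must have the same number of columns M as the training array, otherwise scikit-learn raises an error.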

**Optimizing the bandwidth and `scipy.stats.iqr`:**

The bandwidth is optimized in terms of Scott's Rule of Thumb, which follows the model: 

$bw = 1.059 \, A \, N^{-1/5}$, where $A = \min\left(\mathrm{std}(X), \frac{IQR}{1.34}\right)$

In this case X is the input data (an NxM matrix of N points and M parameters) and the IQR is the interquartile range, the difference between the 75th and 25th percentiles of the data. It is a measure of dispersion similar to the standard deviation or variance, but is much more robust against outliers.

To compute the IQR, we use `scipy.stats.iqr`, which takes the data array X (defined above) as input. Because it is called without an `axis` argument, it returns a single IQR computed over the flattened array, that is, over all values of X at once. `np.std` is called the same way, so the result is one scalar bandwidth from Scott's Rule of Thumb as defined above, with N taken as the number of rows of X.
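The bandwidth calculation described in this section can be sketched as a small helper. The function name `rule_of_thumb_bandwidth` and the demo array are hypothetical; the arithmetic follows the formula given above, with the standard deviation and IQR both taken over the flattened array.

```python
import numpy as np
from scipy.stats import iqr

def rule_of_thumb_bandwidth(X):
    """Scalar bandwidth: 1.059 * A * N**(-1/5), with A = min(std(X), IQR/1.34)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]                       # number of data points (rows)
    # Both statistics are computed over all values of X at once
    A = min(np.std(X), iqr(X) / 1.34)
    return 1.059 * A * N ** (-1.0 / 5.0)

# Hypothetical demo data: the values 1..20 arranged as 10 points with 2 parameters
X_demo = np.arange(1.0, 21.0).reshape(10, 2)
bw_demo = rule_of_thumb_bandwidth(X_demo)
```

Using a single scalar for all columns is simple, but it assumes the columns have comparable spreads, which is one reason the velocities are rescaled before the bandwidth is computed.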

**Regarding inputs and `v_scale`:**

All inputs have `v_scale` applied to their velocities before being passed to the KDE. This brings the velocities and positions to similar magnitudes in six-dimensional space. It is required because the velocities are generally much larger in magnitude and span a wider range than the positions, and the two are measured in different units.
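A minimal sketch of this scaling step is shown below. The helper name `scale_velocities`, the two phase-space points, and the value `v_scale=0.01` are hypothetical; the only assumption carried over from the notebook is that the first half of the columns holds positions and the second half holds velocities.

```python
import numpy as np

def scale_velocities(inputs, v_scale):
    """Multiply the velocity columns (second half) by v_scale, leave positions as-is."""
    inputs = np.asarray(inputs, dtype=float)
    # Split the columns into two equal halves: positions, then velocities
    positions, velocities = np.hsplit(inputs, 2)
    return np.hstack((positions, velocities * v_scale))

# Hypothetical points: small position values, much larger velocity values
phase_space = np.array([[8.0,  0.1, 0.02,  10.0, 220.0,  7.0],
                        [7.5, -0.3, 0.05, -20.0, 210.0, -5.0]])
scaled = scale_velocities(phase_space, v_scale=0.01)
```

The same `v_scale` has to be applied both to the data used for fitting and to any points at which the KDE is later evaluated, otherwise the two live in different coordinate systems.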

In [15]:
#Importing the required modules
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.stats import iqr

#Defining a KDE function to quickly compute probabilities for the data set
def generate_KDE(inputs, ker, v_scale):
    """
    NAME:
        generate_KDE
    
    PURPOSE:
        Given an NxM matrix for inputs, one of six available ker strings 
        and a float value for v_scale, returns a function `input_KDE` 
        that treats the density estimate as a black box function that 
        can be sampled.
    
    INPUT:
        inputs (ndarray) = An NxM matrix where N is the number of data 
                           points and M is the number of parameters.
        ker (string) = One of the 6 available kernel types (gaussian, 
                       tophat, epanechnikov, exponential, linear, cosine).
        v_scale (float) = A float value to scale velocities for the kde.
    
    OUTPUT:
        input_KDE (function) = A blackbox function for the density estimate
                               used for sampling data.
                               
    HISTORY:
        2018-06-14 - Updated - Ayush Pandhi
    """
    #Scaling velocities with v_scale
    positions, velocities = np.hsplit(inputs, 2)
    velocities_scaled = velocities*v_scale
    inputs = np.hstack((positions, velocities_scaled))
    
    #Optimizing bandwidth in terms of Scott's Rule of Thumb
    #N is the number of data points (rows of the input matrix)
    N = inputs.shape[0]
    #IQR and standard deviation are taken over the flattened array
    IQR = iqr(inputs)
    A = min(np.std(inputs), IQR/1.34)
    bw = 1.059 * A * N ** (-1/5.)
    
    #Fit data points to selected kernel and bandwidth
    kde = KernelDensity(kernel=ker, bandwidth=bw).fit(inputs)  

    def input_KDE(samples):
        """
        NAME:
            input_KDE
    
        PURPOSE:
            Given a QxM matrix for samples, evaluates the blackbox density
            estimate function at those points to output a 1xQ array of 
            density values.
    
        INPUT:
            samples (ndarray) = A QxM matrix where Q is the number of points 
                                at which the kde is being evaluated and M is 
                                the number of parameters.
                                
        OUTPUT:
            dens (ndarray) = A 1xQ array of density values for Q data points.
                               
        HISTORY:
            2018-06-14 - Updated - Ayush Pandhi
        """
        #Convert input from other functions into a 2D (QxM) float array;
        #a single point of length M becomes a 1xM array
        samples = np.atleast_2d(np.asarray(samples, dtype=float))
        
        #Scaling samples with v_scale
        samp_positions, samp_velocities = np.hsplit(samples, 2)
        samp_velocities_scaled = samp_velocities*v_scale
        samples = np.hstack((samp_positions, samp_velocities_scaled))
        
        #Get the log density for selected samples and apply exponential to get normal probabilities
        log_dens = kde.score_samples(samples)
        dens = np.exp(log_dens)
        
        #Return a 1xQ array of normal probabilities for the selected sample
        return dens
    
    #Return a black box function for sampling
    return input_KDE

The following section tests this function on some mock data. For more details about the setup for the mock data, see `sampling_R^6_to_R^6.ipynb` by Michael Poon. It serves as a good example of the shapes and types of inputs needed to avoid dimension errors with the function.

In [16]:
#Testing with mock data
import random

mock_data3 = []
for i in range(10):
    select_random = np.linspace(1.0, 10.0, 100)
    x4 = random.choice(select_random)
    x5 = random.choice(select_random)
    x6 = random.choice(select_random)
    x1 = 1
    x2 = 3 + x4 + 2*x5 + 3*x6 
    x3 = -3 - 2*x2 - 3*x2 - 4*x2
    mock_data3.append([x1, x2, x3, x4, x5, x6])
print(mock_data3)

[[1, 33.0, -300.0, 9.090909090909092, 2.0, 5.636363636363637], [1, 24.363636363636367, -222.2727272727273, 6.636363636363637, 5.454545454545455, 1.2727272727272727], [1, 46.727272727272734, -423.5454545454546, 7.818181818181818, 5.2727272727272725, 8.454545454545455], [1, 45.45454545454545, -412.0909090909091, 4.0, 4.909090909090909, 9.545454545454545], [1, 39.09090909090909, -354.81818181818187, 6.0, 8.09090909090909, 4.636363636363637], [1, 23.27272727272727, -212.45454545454544, 3.4545454545454546, 5.0, 2.2727272727272725], [1, 46.0, -417.0, 9.272727272727273, 2.0, 9.90909090909091], [1, 45.45454545454546, -412.0909090909091, 5.909090909090909, 4.090909090909091, 9.454545454545455], [1, 41.54545454545455, -376.90909090909093, 2.8181818181818183, 7.363636363636364, 7.0], [1, 56.54545454545455, -511.90909090909093, 5.363636363636363, 9.090909090909092, 10.0]]


In [17]:
a = np.array(mock_data3)
samples_random = [[1, 2, 3, 4, 5, 6]]
b = np.array(samples_random)
samples_all = [[1, 44.45454545454545, -403.0909090909091, 2.0, 6.363636363636364, 8.90909090909091], [1, 44.0, -399.0, 1.0, 7.181818181818182, 8.545454545454547], [1, 47.90909090909091, -434.18181818181813, 1.8181818181818183, 9.545454545454545, 8.0], [1, 26.0, -237.0, 5.0, 3.8181818181818183, 3.4545454545454546], [1, 37.09090909090909, -336.81818181818187, 1.5454545454545454, 2.909090909090909, 8.90909090909091], [1, 44.45454545454545, -403.0909090909091, 7.090909090909091, 9.0, 5.454545454545455], [1, 32.81818181818182, -298.3636363636364, 1.0909090909090908, 4.272727272727273, 6.7272727272727275], [1, 35.81818181818182, -325.3636363636364, 2.7272727272727275, 9.454545454545455, 3.7272727272727275], [1, 23.272727272727273, -212.45454545454544, 8.09090909090909, 3.6363636363636362, 1.6363636363636362], [1, 26.81818181818182, -244.3636363636364, 7.2727272727272725, 1.7272727272727273, 4.363636363636363]]
c = np.array(samples_all)
c

array([[   1.        ,   44.45454545, -403.09090909,    2.        ,
           6.36363636,    8.90909091],
       [   1.        ,   44.        , -399.        ,    1.        ,
           7.18181818,    8.54545455],
       [   1.        ,   47.90909091, -434.18181818,    1.81818182,
           9.54545455,    8.        ],
       [   1.        ,   26.        , -237.        ,    5.        ,
           3.81818182,    3.45454545],
       [   1.        ,   37.09090909, -336.81818182,    1.54545455,
           2.90909091,    8.90909091],
       [   1.        ,   44.45454545, -403.09090909,    7.09090909,
           9.        ,    5.45454545],
       [   1.        ,   32.81818182, -298.36363636,    1.09090909,
           4.27272727,    6.72727273],
       [   1.        ,   35.81818182, -325.36363636,    2.72727273,
           9.45454545,    3.72727273],
       [   1.        ,   23.27272727, -212.45454545,    8.09090909,
           3.63636364,    1.63636364],
       [   1.        ,   26.81818182, -244.36363636,    7.27272727,
           1.72727273,    4.36363636]])

In [18]:
#v_scale=1.0 leaves the velocities unscaled (an illustrative choice)
blackbox = generate_KDE(a, 'epanechnikov', 1.0)
blackbox(c)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
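One possible reading of the all-zero output above is that the Epanechnikov kernel has finite support: a query point that lies farther than one bandwidth from every training point receives a density of exactly zero. The sketch below illustrates this with hypothetical two-dimensional data and a bandwidth of 1.0, comparing against a Gaussian kernel, whose tails never reach zero exactly.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical training points clustered near the origin
train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
near = np.array([[0.05, 0.05]])     # well inside one bandwidth of the data
far = np.array([[10.0, 10.0]])      # many bandwidths away from the data

epan = KernelDensity(kernel='epanechnikov', bandwidth=1.0).fit(train)
gauss = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(train)

# Finite-support kernel: log-density is -inf far from the data, so exp gives 0
epan_near = np.exp(epan.score_samples(near))
epan_far = np.exp(epan.score_samples(far))

# Gaussian kernel: density is tiny but still positive far from the data
gauss_far = np.exp(gauss.score_samples(far))
```

If zeros like these are not wanted, a larger bandwidth or an infinite-support kernel such as `gaussian` or `exponential` would avoid them, at the cost of a smoother estimate.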