# Learning on sets using the Bhattacharyya kernel

In this experiment, we test the expressivity of a model based on the Bhattacharyya kernel.
Our plan is to create 1000 sets of numbers, each generated from a Normal distribution with random parameters.

The task is to predict whether the 60% quantile of each set is greater than a specified threshold.
To do that, we first fit a Normal distribution to each set and then compute the Bhattacharyya kernel matrix from the inferred parameters.
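
To see why the label depends on both the mean and the spread, recall that the $q$-quantile of a Normal distribution is $\mu + \sigma z_q$. The following is a minimal sketch using the standard-library `statistics.NormalDist`. The parameter values and the helper `population_quantile` are made up for illustration, and the experiment itself labels each set by its sample quantile rather than by this population quantile.

```python
from statistics import NormalDist

quantile = 0.6
threshold = 0.15  # illustrative; equals max_mu * 0.15 when max_mu = 1.0

# 60% quantile of the standard Normal distribution (z_q)
z_q = NormalDist().inv_cdf(quantile)

def population_quantile(mu, sigma, q=quantile):
    """Return the q-quantile of N(mu, sigma^2), i.e. mu + sigma * z_q."""
    return NormalDist(mu, sigma).inv_cdf(q)

# Two hypothetical distributions with the same mean but different spread
narrow = population_quantile(0.05, 0.1)
wide = population_quantile(0.05, 0.9)

# Same mean, different labels: only the wide one exceeds the threshold
print(narrow > threshold, wide > threshold)
```

A classifier that looks only at the mean cannot separate these two cases, which is why the kernel has to compare whole distributions.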

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
%matplotlib inline

In [2]:
n_sets = 1000

We randomly select the sample size, mean, and standard deviation for each set.

In [3]:
min_mu = -1.0
max_mu = 1.0
min_std = 0.1
max_std = 0.9
min_size = 5
max_size = 50
quantile = 0.6
quantile_comparison = max_mu * 0.15

In [4]:
np.random.seed(2128506)
data = pd.DataFrame({
    "mu": np.random.uniform(min_mu, max_mu, n_sets),
    "sigma": np.random.uniform(min_std, max_std, n_sets),
    "size": np.random.randint(min_size, max_size, n_sets)
})

Now we draw a sample from the distribution of each set.

In [5]:
samples = [np.random.normal(data.loc[i, "mu"], data.loc[i, "sigma"], data.loc[i, "size"]) for i in range(n_sets)]

In [6]:
# Compute the target field
quantile_values = np.array([np.quantile(sample, quantile) for sample in samples])
data['target'] = quantile_values - quantile_comparison > 0
data['target'] = data.target.astype("int")
# Fit the Normal distribution
data["x_bar"] = [np.mean(sample) for sample in samples]
data["s"] = [np.std(sample, ddof=1) for sample in samples]

In [7]:
data.head()

| | mu | sigma | size | target | x_bar | s |
|---|---|---|---|---|---|---|
| 0 | -0.141728 | 0.586581 | 14 | 1 | 0.154003 | 0.657748 |
| 1 | -0.310797 | 0.727996 | 31 | 0 | -0.178226 | 0.536234 |
| 2 | -0.098976 | 0.427965 | 10 | 0 | -0.223937 | 0.350883 |
| 3 | 0.594182 | 0.276093 | 19 | 1 | 0.499583 | 0.211205 |
| 4 | -0.027752 | 0.827064 | 47 | 1 | 0.118751 | 0.809009 |


The kernel function in this case can be computed analytically. It takes two distributions as arguments.
Let $m_1$ and $s_1$ be the parameters of the first distribution, and $m_2$ and $s_2$ the parameters of the second.

Then the kernel is computed as:

$$ K(m_1, s_1, m_2, s_2) = \left(\frac{1}{4} \left(\frac{s_1^2}{s_2^2} + \frac{s_2^2}{s_1^2} + 2\right)\right)^{-\frac{1}{4}} \exp\left(- \frac{1}{4} \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}\right) $$
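
As a sanity check, the closed form above can be compared with a direct numerical approximation of $\int \sqrt{p_1(x)\,p_2(x)}\,dx$. The sketch below uses only the standard library. The function names, the integration range, and the test parameters are illustrative choices, not part of the experiment.

```python
import math

def bhattacharyya_closed_form(m1, s1, m2, s2):
    """Closed-form kernel between N(m1, s1^2) and N(m2, s2^2)."""
    scale = (0.25 * (s1**2 / s2**2 + s2**2 / s1**2 + 2)) ** (-0.25)
    return scale * math.exp(-0.25 * (m1 - m2) ** 2 / (s1**2 + s2**2))

def normal_pdf(x, m, s):
    """Density of N(m, s^2) at x."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def bhattacharyya_numeric(m1, s1, m2, s2, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule approximation of the integral of sqrt(p1(x) * p2(x))."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += math.sqrt(normal_pdf(x, m1, s1) * normal_pdf(x, m2, s2))
    return total * dx

# Hypothetical parameters inside the ranges used in this experiment
exact = bhattacharyya_closed_form(-0.3, 0.5, 0.4, 0.8)
approx = bhattacharyya_numeric(-0.3, 0.5, 0.4, 0.8)
print(exact, approx)
```

The two values should agree to several decimal places, and the kernel of a distribution with itself is exactly 1.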

In [13]:
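# Define kernel function
def kernel_function(row1, row2):
    # row1 and row2 are DataFrames holding the fitted parameters x_bar and s
    m1 = row1.x_bar
    s1 = row1.s
    m2 = row2.x_bar
    s2 = row2.s
    # Column vectors for the first argument and row vectors for the second,
    # so that broadcasting produces the full pairwise kernel matrix
    m1 = m1.values.reshape(-1, 1)
    s1 = s1.values.reshape(-1, 1)
    m2 = m2.values.reshape(1, -1)
    s2 = s2.values.reshape(1, -1)
    return (1/4*(s1**2/s2**2 + s2**2/s1**2 + 2))**(-1/4)\
        * np.exp(-1/4*(m1 - m2)**2/(s1**2+s2**2))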
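
The broadcasting trick can be checked on a small example. The sketch below is a NumPy-only variant of the same computation (the name `bhattacharyya_gram` and the random parameters are hypothetical). It verifies properties one would expect of this kernel matrix: unit diagonal, symmetry, and no significantly negative eigenvalues.

```python
import numpy as np

def bhattacharyya_gram(m_a, s_a, m_b, s_b):
    """Pairwise kernel between two collections of fitted (mean, std) pairs."""
    # Column vectors for the first collection, row vectors for the second
    m_a = np.asarray(m_a).reshape(-1, 1)
    s_a = np.asarray(s_a).reshape(-1, 1)
    m_b = np.asarray(m_b).reshape(1, -1)
    s_b = np.asarray(s_b).reshape(1, -1)
    scale = (0.25 * (s_a**2 / s_b**2 + s_b**2 / s_a**2 + 2)) ** (-0.25)
    return scale * np.exp(-0.25 * (m_a - m_b) ** 2 / (s_a**2 + s_b**2))

# Six hypothetical fitted distributions
rng = np.random.default_rng(0)
means = rng.uniform(-1.0, 1.0, 6)
stds = rng.uniform(0.1, 0.9, 6)

gram = bhattacharyya_gram(means, stds, means, stds)
print(gram.shape)                      # one row and one column per distribution
print(np.allclose(np.diag(gram), 1))   # each distribution has kernel 1 with itself
print(np.allclose(gram, gram.T))       # the kernel is symmetric
print(np.linalg.eigvalsh(gram).min())  # smallest eigenvalue of the Gram matrix
```

When the two collections differ in size, as with training versus test rows, the result is rectangular, which is the shape `SVC` expects from a callable kernel.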

In [9]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(data, data["target"], test_size=0.2)

In [10]:
model = SVC(kernel=kernel_function)
model.fit(X_train, y_train);

Model accuracy on training and test sets

In [11]:
model.score(X_train, y_train)

0.9775

In [12]:
model.score(X_test, y_test)

0.975

The accuracy is high (about 97.5% on both sets), but it does not reach 100% even on the training set. Using the more general Probability Product Kernel and tuning its hyperparameter $\rho$ may improve accuracy further.
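
For univariate Normal distributions, the Probability Product Kernel $\int p_1(x)^\rho\, p_2(x)^\rho\, dx$ also has a closed form, and it reduces to the Bhattacharyya kernel at $\rho = 1/2$. The following is a minimal sketch of that formula, derived here by completing the square rather than taken from this notebook. The function names and test parameters are hypothetical, and whether tuning $\rho$ actually improves accuracy on this task is untested.

```python
import numpy as np

def probability_product_kernel(m1, s1, m2, s2, rho=0.5):
    """Integral of p1(x)^rho * p2(x)^rho for univariate Normals (broadcasts)."""
    var_sum = s1**2 + s2**2
    prefactor = (2 * np.pi) ** (0.5 - rho) / np.sqrt(rho)
    scale = (s1 * s2) ** (1 - rho) / np.sqrt(var_sum)
    return prefactor * scale * np.exp(-rho * (m1 - m2) ** 2 / (2 * var_sum))

def bhattacharyya(m1, s1, m2, s2):
    """Bhattacharyya kernel in the form used earlier in this notebook."""
    scale = (0.25 * (s1**2 / s2**2 + s2**2 / s1**2 + 2)) ** (-0.25)
    return scale * np.exp(-0.25 * (m1 - m2) ** 2 / (s1**2 + s2**2))

# Hypothetical parameters (m1, s1, m2, s2)
args = (-0.3, 0.5, 0.4, 0.8)

# At rho = 0.5 the two kernels coincide
print(np.isclose(probability_product_kernel(*args, rho=0.5), bhattacharyya(*args)))
```

Unlike the Bhattacharyya case, the kernel of a distribution with itself is generally not 1 when $\rho \neq 1/2$, so one may want to normalize the kernel matrix before comparing values of $\rho$.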