# Fitting: Computing an NLL

We will use CuPy to compute the negative log likelihood (NLL) for an unbinned fit. We only evaluate the NLL here; the fit itself is not performed. As before, let's set up the data and then try a solution with NumPy:

In [None]:
!nvidia-smi

## Dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math

np.random.seed(42)

dist = np.hstack([
    np.random.normal(loc=1, scale=2., size=500_000),
    np.random.normal(loc=1, scale=.5, size=500_000)
])

In [None]:
plt.hist(dist, bins='auto');

## NumPy
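
In symbols, the functions defined in the next cell implement a two-Gaussian mixture with a shared mean, and the NLL sums over all $N$ events $x_i$ (here $\sigma_1$ and $\sigma_2$ stand for the code's `sigma` and `sigma2`):

$$
G(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

$$
f(x) = f_0\, G(x; \mu, \sigma_1) + (1 - f_0)\, G(x; \mu, \sigma_2)
$$

$$
\mathrm{NLL} = -\sum_{i=1}^{N} \ln f(x_i)
$$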

In [None]:
def gaussian(x, μ, σ):
    return 1/np.sqrt(2*np.pi*σ**2) * np.exp(-(x-μ)**2/(2*σ**2))

def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

def nll(dist, f_0, mean, sigma, sigma2):
    return -np.sum(np.log(add(dist, f_0, mean, sigma, sigma2)))

In [None]:
%%timeit
nll(dist, *np.random.rand(4))

Because the parameters are set randomly, NumPy may emit a divide-by-zero warning (typically from taking the log of 0). That's okay for a timing test.
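
As a small illustration of where the divide-by-zero message can come from (a sketch in plain NumPy, using hand-picked values that are not from the notebook): when σ is small, `np.exp` underflows to exactly 0 for points far from the mean, and the log of 0 is `-inf`. `np.errstate` can silence the warning locally:

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Same formula as the notebook's gaussian, with ASCII parameter names
    return 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Hand-picked illustration values: one point at the mean, one very far away
x = np.array([0.0, 50.0])
pdf = gaussian(x, 0.0, 0.1)  # exp(-125000) underflows to 0.0 for the far point

# Silence only the divide-by-zero warning, and only inside this block
with np.errstate(divide="ignore"):
    bad_nll = -np.sum(np.log(pdf))

print(pdf[1], bad_nll)
```

A real fit would usually keep the minimizer away from such parameter values, for example by putting bounds on the widths.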

## CuPy: simple

In [None]:
import cupy as cp

In [None]:
d_dist = cp.asarray(dist)

In [None]:
%%timeit
nll(d_dist, *cp.random.rand(4))
cp.cuda.get_current_stream().synchronize()

Because CuPy arrays implement NumPy's ufunc dispatch protocol (`__array_ufunc__`, added in NumPy 1.13), we didn't even need to replace the `np.*` calls in the functions above!
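
To illustrate the mechanism (a toy sketch, not CuPy's actual implementation): any class that defines `__array_ufunc__` gets to decide what `np.exp`, `np.sqrt`, and other ufuncs do with it. The class `Tagged` below is hypothetical and only stands in for a device array:

```python
import numpy as np

class Tagged:
    """Hypothetical stand-in for a device array; not part of NumPy or CuPy."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap Tagged inputs, run the ufunc on the raw data, re-wrap the result.
        # (This toy version does not handle an `out=` argument.)
        raw = [i.data if isinstance(i, Tagged) else i for i in inputs]
        return Tagged(getattr(ufunc, method)(*raw, **kwargs))

# np.exp sees __array_ufunc__ on its argument and hands control to it
result = np.exp(Tagged([0.0, 1.0]))
print(type(result).__name__, result.data)
```

The call goes through `np.exp`, yet the result comes back as a `Tagged` object, which is the same reason the `np.*` calls above return CuPy arrays when given CuPy input.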

## CuPy: Fuse

We can do a *little* better still with `cp.fuse`, which merges the operations inside a function into a single kernel:

In [None]:
@cp.fuse()
def gaussian(x, μ, σ):
    return 1/cp.sqrt(2*cp.pi*σ**2) * cp.exp(-(x-μ)**2/(2*σ**2))

@cp.fuse()
def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

#@cp.fuse() # Actually slower; it seems to reorder the sum into a linear reduction
def nll(dist, f_0, mean, sigma, sigma2):
    return -cp.sum(cp.log(add(dist, f_0, mean, sigma, sigma2)))

In [None]:
%%timeit
nll(d_dist, *cp.random.rand(4))
cp.cuda.get_current_stream().synchronize()

## CuPy: Custom kernels

Let's try a custom reduction kernel:

In [None]:
device_fns = '''
#define POW2(x) ((x)*(x))
__device__
double gaussian(double x, double mu, double sigma) {
    return rsqrt(2*M_PI*POW2(sigma)) * exp(-POW2(x-mu)/(2*POW2(sigma)));
}

__device__ double add(double x, double f_0, double mean, double sigma, double sigma2) {
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2);
}
'''

In [None]:
nll_kernel = cp.ReductionKernel(
    in_params = 'T dist, T f_0, T mean, T sigma, T sigma2',
    out_params = 'T y',
    map_expr = 'log(add(dist, f_0, mean, sigma, sigma2))',
    reduce_expr = 'a + b',
    post_map_expr = 'y = -a',
    identity = '0',
    name = 'nll_kernel',
    preamble = device_fns
)

When we run it, the custom map expression gives a nice speedup, but it is combined with a large slowdown from the linear reduction:

In [None]:
%%timeit
nll_kernel(d_dist, *cp.random.rand(4))
cp.cuda.get_current_stream().synchronize()

## CuPy: Elementwise kernel + sum

This is the best we can do (without implementing a RawKernel with a smart reduction, anyway):

In [None]:
inside_nll = cp.ElementwiseKernel(
    in_params = 'T dist, T f_0, T mean, T sigma, T sigma2',
    out_params = 'T y',
    operation = 'y = log(add(dist, f_0, mean, sigma, sigma2))',
    name = 'inside_nll',
    preamble = device_fns
)

In [None]:
%%timeit
-cp.sum(inside_nll(d_dist, *cp.random.rand(4)))
cp.cuda.get_current_stream().synchronize()
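
For intuition about what a "smart" reduction means, here is a minimal CPU sketch in NumPy of a pairwise (tree) reduction, the pattern GPU reductions commonly use in place of one long sequential sum. The function `tree_sum` is hypothetical and only illustrates the access pattern; it says nothing about how CuPy or CUDA actually implement their sums:

```python
import numpy as np

def tree_sum(values):
    """Pairwise (tree) reduction: halve the array on every pass."""
    work = np.asarray(values, dtype=np.float64).copy()
    if work.size == 0:
        return 0.0
    while work.size > 1:
        if work.size % 2:
            # Odd length: pad with the additive identity so the halves match
            work = np.append(work, 0.0)
        half = work.size // 2
        # Each element of the first half absorbs its partner from the second
        work = work[:half] + work[half:]
    return work[0]

rng = np.random.default_rng(0)
data = rng.normal(size=1001)
total = tree_sum(data)
print(total, data.sum())
```

Each pass performs independent additions that could run in parallel, so the number of passes grows like the logarithm of the array length instead of linearly.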

## Exercise

Take one or more of the examples above and convert them to 32-bit floats. How does the performance compare? (Pay attention to which GPU you get when running the example; the `!nvidia-smi` cell at the top reports it.)

Be careful not to let 64-bit values sneak in when you do so. Check the dtype of the output and of the in-between steps regularly!
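
As a hint for that check, here is a small NumPy sketch (illustrative values only, not the notebook's dataset) of how a float64 array silently promotes a float32 computation. `np.random.rand` returns float64, so the parameters need converting as well as the data:

```python
import numpy as np

x32 = np.linspace(-1, 1, 8).astype(np.float32)  # stand-in for a float32 dataset
scale64 = np.random.rand(8)                      # float64 by default
scale32 = scale64.astype(np.float32)

mixed = x32 * scale64  # float32 array * float64 array -> float64
kept = x32 * scale32   # both operands float32 -> stays float32

print(mixed.dtype, kept.dtype)
```

CuPy arrays expose the same `.dtype` attribute. Promotion rules for scalars have changed between NumPy versions, so checking `.dtype` is safer than assuming.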
