# Integration and Sampling

In this notebook we first investigate how to sample random numbers from non-uniform distributions, and then use random number sampling to calculate integrals.

## Requirements

We need a random number generator. We could use one of the RNGs implemented in [`rng.ipynb`](rng.ipynb), but instead we will use the default `numpy` RNG. We also need the `math` module and `matplotlib`.

In [None]:
# Import the `numpy` and `math` modules.
import numpy as np
import math

# Import the `matplotlib` module.
import matplotlib.pyplot as plt

# Create an RNG, with a seed of 10.
rng = np.random.default_rng(10)

## Introduction

Typical events produced within the Large Hadron Collider (LHC) from colliding protons contain $\mathcal{O}(100)$ or more particles. When calculating a cross-section for a two-to-two process we typically only need to integrate over two variables, $\theta$ and $\phi$. A two-to-$n$ process requires integrating over $3n - 4$ variables, so a typical LHC event would require integrating over $\mathcal{O}(300)$ variables. This is numerically challenging at best, and with current technology is simply not possible. To calculate LHC events, we can instead factorise the problem into more manageable parts using probabilistic methods. Even then, calculating a perturbative cross-section for a $4$-body final state requires integrating over $8$ variables, which is a challenging numerical integration. The bottom line is that performing high-dimensional integrals quickly and efficiently is a core problem in particle physics, and is numerically very challenging.

However, before we tackle integration with Monte Carlo (MC), we first need to discuss how to sample distributions efficiently. In the [`rng.ipynb`](mc/rng.ipynb) notebook, we worked hard to make a good generator for uniformly distributed random variates. In practice, however, the probability distributions of interest are not uniform. Fortunately, uniform random variates can either be transformed into a different distribution or used as part of an accept/reject algorithm that converges to the desired probability distribution. Random variates, uniform or not, are also a primary part of the Monte Carlo integration method, so it is worthwhile to know how to transform uniform variates into more complicated distributions.

In this notebook, we only consider continuous distributions, but everything that we say can be applied, with some modification, to discrete distributions.

## Analytic Sampling

Analytic, or inverse cumulative distribution function (CDF), sampling allows us to transform a uniform distribution into our target distribution, $f(x)$. However, this is not possible for every $f(x)$. To sample $f(x)$, the following conditions must generally be fulfilled.

1. The sampling range of $f(x)$ is bounded, and over this range $f(x)$ is non-negative.

$$
f(x) \geq 0 \text{ for } x_\min < x < x_\max
$$

2. The integral of $f(x)$ can be calculated.

$$
F(x) = \int \text{d}x\, f(x)
$$

3. The integral of $f(x)$ can be inverted, which we label $F^{-1}(x)$.

With these three conditions met, we can sample from $f(x)$ as follows. First, consider integrating the distribution from $x_\min$ to $x$, as shown in the figure below.

![Schematic of analytic sampling.](https://github.com/mcgen-ct/tutorials/blob/main/.full/mc/figures/sample_analytic.png?raw=1)

We then draw a uniform random number $R$ which gives us the following relation.

$$
\int_{x_{\min}}^x \text{d}x'\, f(x') = R \int_{x_{\min}}^{x_{\max}} \text{d}{x'}\, f(x')
$$

We then perform the integration, where $F(x)$ is the indefinite integral of $f(x)$.

$$
F(x) - F(x_{\min}) = R(F(x_\max) - F(x_\min))
$$

We can then write $F(x_\max) - F(x_\min)$ as $A$, the area under $f(x)$ between $x_\min$ and $x_\max$.
$$
F(x) - F(x_{\min}) = R A
$$

We then solve for $x$.

$$
x = F^{-1}(F(x_{\min}) + R A)
$$

So, we can uniformly sample $R$ and then use the final relation to transform this into $x$, as sampled from $f(x)$.
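
The recipe above can be sketched for a single concrete case. The following is only a minimal illustration, assuming a target $f(x) = e^{-\lambda x}$ on a bounded range, which is not one of the distributions used later in this notebook; the names `sample_exponential`, `lam`, and `rng_example` are hypothetical and chosen for this example only.

```python
import math
import numpy as np


def sample_exponential(rng, lam, xmin, xmax):
    """Draw one x from f(x) = exp(-lam * x) restricted to [xmin, xmax]."""
    # Indefinite integral F(x) = -exp(-lam * x) / lam and its inverse.
    F = lambda x: -math.exp(-lam * x) / lam
    F_inv = lambda y: -math.log(-lam * y) / lam
    # Area under f(x) between the limits, A = F(xmax) - F(xmin).
    area = F(xmax) - F(xmin)
    # Apply x = F^-1(F(xmin) + R * A) with a uniform random number R.
    r = rng.uniform()
    return F_inv(F(xmin) + r * area)


# Illustrative RNG and parameters, assumed for this sketch only.
rng_example = np.random.default_rng(1)
samples = [sample_exponential(rng_example, 2.0, 0.0, 3.0) for _ in range(10000)]
print(min(samples), max(samples), sum(samples) / len(samples))
```

All sampled values should fall inside the requested range, and their mean should sit close to $1/\lambda$ because the truncation at $x_\max = 3$ removes only a small tail.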

### Exercise: generic sampler

Before we try to generate any specific distributions using this method, let us first set up a generic sampler class which uses the steps above.

In [None]:
### START_EXERCISE
class SampleAnalytic:
    """
    Base class to analytically sample a distribution from a random
    distribution.
    """

    def __init__(self, rng, xmin, xmax):
        """
        Initialize the sampler, given the limits on f(x).

        rng:  uniform random number generator, should have method `uniform()`.
        xmin: lower bound of the sampling region.
        xmax: upper bound of the sampling region.
        """
        self.rng = rng
        self.xmin = xmin
        self.xmax = xmax
        self.F_xmin = self.F(xmin)
        self.area = self.F(xmax) - self.F(xmin)

    def f(self, x):
        """
        Return the function being sampled, f(x). This method is not necessary,
        but very useful for importance sampling and checking the distribution.

        x: value to calculate f(x) for.
        """
        # Implement f(x) here.
        return 0.0

    def F(self, x):
        """
        Returns F(x), the indefinite integral for f(x).

        x: value to calculate the indefinite integral for f(x).
        """
        # Implement F(x) here.
        return 0.0

    def F_inv(self, f):
        """
        Returns the inverse of the F(x).

        f: the value of F(x) at which to calculate the inverse.
        """
        # Implement F^-1(x) here.
        return 0.0

    def __call__(self):
        """
        Return the sampled value.
        """
        # Define the function from above that transforms a uniformly sampled
        # random number to the desired distribution.
        return 0.0

In [None]:
### START_SOLUTION
class SampleAnalytic:
    """
    Base class to analytically sample a distribution from a random
    distribution.
    """

    def __init__(self, rng, xmin, xmax):
        """
        Initialize the sampler, given the limits on f(x).

        rng:  uniform random number generator, should have method `uniform()`.
        xmin: lower bound of the sampling region.
        xmax: upper bound of the sampling region.
        """
        self.rng = rng
        self.xmin = xmin
        self.xmax = xmax
        self.F_xmin = self.F(xmin)
        self.area = self.F(xmax) - self.F(xmin)

    def f(self, x):
        """
        Return the function being sampled, f(x). This method is not necessary,
        but very useful for importance sampling and checking the distribution.

        x: value to calculate f(x) for.
        """
        return 0.0

    def F(self, x):
        """
        Returns F(x), the indefinite integral for f(x).

        x: value to calculate the indefinite integral for f(x).
        """
        return 0.0

    def F_inv(self, f):
        """
        Returns the inverse of the F(x).

        f: the value of F(x) at which to calculate the inverse.
        """
        return 0.0

    def __call__(self):
        """
        Return the sampled value.
        """
        # Sample the uniform random number.
        r = self.rng.uniform()
        return self.F_inv(self.F_xmin + r * self.area)

In [None]:
class SampleLinear(SampleAnalytic):
    """
    Class to analytically sample a linear function.
    """

    def __init__(self, rng, xmin, xmax, m, b):
        """
        Initialize the sampler, given the limits on f(x) and the linear
        parameters.

        f(x) = mx + b

        rng:  uniform random number generator, should have method `uniform()`.
        xmin: lower bound of the sampling region.
        xmax: upper bound of the sampling region.
        m:    slope of the linear distribution.
        b:    intercept of the linear distribution.
        """
        # Set the linear parameters. This must be done before the base class
        # is initialized.
        self.m = m
        self.b = b
        # Initialize the base class.
        super().__init__(rng, xmin, xmax)

    def f(self, x):
        """
        Return the function being sampled, f(x).

        x: value to calculate f(x) for.
        """
        return self.m * x + self.b

    def F(self, x):
        """
        Returns F(x), the indefinite integral for f(x).

        x: value to calculate the indefinite integral for f(x).
        """
        return self.m * x**2 / 2 + self.b * x

    def F_inv(self, f):
        """
        Returns the inverse of the F(x).

        f: the value of F(x) at which to calculate the inverse.
        """
        # Handle the special case of no slope.
        if self.m == 0:
            return f / self.b
        else:
            return abs(((self.b**2 + 2 * self.m * f) ** 0.5 - self.b) / self.m)
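
The inverse used in `SampleLinear.F_inv` follows from solving a quadratic. As a worked check, setting $y = F(x)$ for $f(x) = mx + b$ gives the following.

$$
y = \frac{m}{2} x^2 + b x \quad\Rightarrow\quad x = \frac{-b \pm \sqrt{b^2 + 2 m y}}{m}
$$

For $m > 0$, $b \geq 0$, and $x \geq 0$ only the positive root is valid, which is the expression in the code; for $m = 0$ the relation reduces to $x = y / b$. As an illustrative case, with $m = 3$, $b = 2$, $x_\min = 0$, and $x_\max = 1$, the area is $A = 7/2$. Taking $R = 1/2$, a value chosen here purely for the example, gives $y = F(x_\min) + R A = 7/4$.

$$
x = \frac{\sqrt{4 + 6 \cdot \tfrac{7}{4}} - 2}{3} = \frac{\sqrt{14.5} - 2}{3} \approx 0.603
$$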

In [None]:
# Create the sampler.
sampler = SampleLinear(rng, 0, 1, 3, 2)

# Plot the comparison.
plot_sampler(sampler);

In [None]:
class SampleCauchy(SampleAnalytic):
    """
    Class to analytically sample a Cauchy function.
    """

    def __init__(self, rng, xmin, xmax, x0, gamma):
        """
        Initialize the sampler, given the limits on f(x) and the Cauchy
        parameters.

        f(x) = 1/pi * gamma / ((x - x0)^2 + gamma^2)

        rng:   uniform random number generator, should have method `uniform()`.
        xmin:  lower bound of the sampling region.
        xmax:  upper bound of the sampling region.
        x0:    location parameter.
        gamma: scale parameter.
        """
        # Set the parameters.
        self.x0 = x0
        self.gamma = gamma
        # Initialize the base class.
        super().__init__(rng, xmin, xmax)

    def f(self, x):
        """
        Return the function being sampled, f(x).

        x: value to calculate f(x) for.
        """
        return 1 / math.pi * self.gamma / ((x - self.x0) ** 2 + self.gamma**2)

    def F(self, x):
        """
        Returns F(x), the indefinite integral for f(x).

        x: value to calculate the indefinite integral for f(x).
        """
        return 1 / math.pi * math.atan((x - self.x0) / self.gamma) + 1 / 2

    def F_inv(self, f):
        """
        Returns the inverse of the F(x).

        f: the value of F(x) at which to calculate the inverse.
        """
        return self.x0 + self.gamma * math.tan(math.pi * (f - 1 / 2))

In [None]:
# Create the sampler.
sampler = SampleCauchy(rng, 0, 40, 20, 5)

# Plot the comparison.
plot_sampler(sampler);

In [None]:
class SampleGaussian:
    """
    Class to sample a Gaussian distribution.
    """

    def __init__(self, rng, xmin, xmax, mu, sigma):
        """
        Initialize the sampler. Note, the limits `xmin` and `xmax` here only
        define the limits when used for drawing with the `plot_sampler` method.
        Sampling is performed without any limits.

        rng:   uniform random number generator, should have method `uniform()`.
        xmin:  minimum x for plotting (not sampling).
        xmax:  maximum x for plotting (not sampling).
        mu:    mean of Gaussian.
        sigma: width of Gaussian.
        """
        # Set the parameters.
        self.rng = rng
        self.xmin = xmin
        self.xmax = xmax
        self.mu = mu
        self.sigma = sigma

        # Set the area being sampled. This distribution is normalized.
        self.area = 1

    def f(self, x):
        """
        Return the function being sampled, f(x).

        x: value to calculate f(x) for.
        """
        return (
            1
            / (2 * math.pi * self.sigma**2) ** 0.5
            * math.exp(-((x - self.mu) ** 2) / (2 * self.sigma**2))
        )

    def __call__(self):
        """
        Return the sampled value.
        """
        # Sample the two uniform random numbers. The first is mapped from
        # [0, 1) to (0, 1] so that the logarithm below is always finite.
        r1 = 1 - self.rng.uniform()
        r2 = self.rng.uniform()
        # Return only one of the two transformed values.
        return (
            self.sigma * (-2 * math.log(r1)) ** 0.5 * math.cos(2 * math.pi * r2)
            + self.mu
        )
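
The transformation in `SampleGaussian.__call__` has the form of the Box-Muller transform, which maps two uniform random numbers onto two independent Gaussian variates; the class above keeps only the cosine one. A minimal sketch that returns both is given below. The names `box_muller` and `rng_example` are hypothetical and used for this illustration only.

```python
import math
import numpy as np


def box_muller(rng, mu=0.0, sigma=1.0):
    """Return two independent Gaussian variates from two uniform numbers."""
    # Use 1 - r so the argument of the logarithm lies in (0, 1].
    r1 = 1.0 - rng.uniform()
    r2 = rng.uniform()
    # Radius and angle of a point in the plane of the two Gaussians.
    radius = sigma * math.sqrt(-2.0 * math.log(r1))
    angle = 2.0 * math.pi * r2
    return mu + radius * math.cos(angle), mu + radius * math.sin(angle)


# Illustrative RNG and parameters, assumed for this sketch only.
rng_example = np.random.default_rng(2)
pairs = [box_muller(rng_example, mu=1.0, sigma=2.0) for _ in range(5000)]
values = np.array(pairs).ravel()
print(values.mean(), values.std())
```

The printed mean and standard deviation should be close to the requested `mu` and `sigma`. Keeping both values halves the number of uniform random numbers needed per Gaussian variate.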

In [None]:
class Histogram:
    """
    Histogram for binned sampling.
    """

    def __init__(self, f, xmin, xmax, bins=50):
        """
        Initialize the histogram.

        f:     function to sample, should be callable, `f(x)`.
        xmin:  minimum x value.
        xmax:  maximum x value.
        bins:  number of bins.
        """
        # Store the x minimum and maximum. Not necessary, but useful to
        # keep track of.
        self.xmin = xmin
        self.xmax = xmax

        # Define the histogram edges, use `numpy.linspace`. There is one more
        # edge than there are bins.
        self.edges = np.linspace(xmin, xmax, bins + 1)

        # Define the PDF and CDF.
        self.pdf = []
        self.cdf = []
        pdf_sum = 0
        for i, xmax in enumerate(self.edges[1:]):
            # Calculate the PDF and append.
            xmin = self.edges[i]
            pdf = f((xmin + xmax) / 2)
            self.pdf += [pdf]

            # Calculate the CDF and append.
            pdf_sum += pdf
            self.cdf += [pdf_sum]

        # We store the normalization for the PDF. This is just `pdf_sum` times
        # bin width.
        self.norm = (self.edges[1] - self.edges[0]) * pdf_sum

        # The PDF is not yet a PDF, so we normalize it.
        self.pdf = [bin / self.norm for bin in self.pdf]

        # The CDF is also not yet a CDF, so we normalize it.
        self.cdf = [bin / pdf_sum for bin in self.cdf]

        # We now turn the CDF into edges for the inverse CDF, by prepending
        # the CDF by 0, since the CDF must start at 0.
        self.edges_icdf = [0] + self.cdf

    def bin(self, x, edges=None):
        """
        Return the bin for a given x.

        x:     value to find the bin for.
        edges: optionally, edges of the histogram. The default is to use the
               edges of the histogram. This allows us to use this method when
               finding the bin for the inverse CDF.
        """
        # If no edges are provided, default to `self.edges`.
        if edges is None:
            edges = self.edges

        # Loop over the edges, skipping the first one, which is the lower
        # bound of the first bin.
        for bin, edge in enumerate(edges[1:]):
            # Return if `x` is less than the upper edge of this bin. An
            # underflow lands in the first bin.
            if x < edge:
                return bin

        # For overflow, just return the last bin.
        return bin

    def bin_icdf(self, r):
        """
        Return the bin from the inverse CDF.
        """
        # Set the `edges` argument to `self.edges_icdf` and use the `bin`
        # method.
        return self.bin(r, self.edges_icdf)
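
The linear scan in `Histogram.bin` is easy to follow, but a binary search does the same lookup in fewer steps when there are many bins. The following is a sketch of one possible alternative using `numpy.searchsorted`, not the approach taken by the class above; the helper name `find_bin` is hypothetical.

```python
import numpy as np


def find_bin(x, edges):
    """Return the index of the bin containing x, clamped to valid bins."""
    edges = np.asarray(edges)
    # Count the edges that are <= x, then step back to get the bin index.
    i = np.searchsorted(edges, x, side="right") - 1
    # Clamp underflow to the first bin and overflow to the last bin.
    return int(np.clip(i, 0, len(edges) - 2))


# Four bins of width 0.25 on [0, 1], assumed for this sketch only.
example_edges = np.linspace(0.0, 1.0, 5)
example_bins = [find_bin(x, example_edges) for x in (-1.0, 0.1, 0.25, 0.6, 2.0)]
print(example_bins)
```

With `side="right"`, a value that sits exactly on an edge is assigned to the bin above it, matching the strict `x < edge` comparison in the loop-based method.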

In [None]:
# Create the linear sampler for its function and integral.
line = SampleLinear(rng, 0, 1, 3, 2)

# Create a histogram for the function of this sampler.
hist = Histogram(line.f, line.xmin, line.xmax)

# Create a figure.
fig, ax = plt.subplots()

# Plot the binned PDF.
xs = np.linspace(hist.xmin, hist.xmax, 1000)
bys = [hist.pdf[hist.bin(x)] for x in xs]
ax.plot(xs, bys, label="binned PDF")

# Plot the analytic PDF. We need to make sure to normalize the integral here.
ays = [line.f(x) / line.area for x in xs]
ax.plot(xs, ays, label="analytic PDF")

# Create the legend.
ax.legend();

In [None]:
# Create a figure.
fig, ax = plt.subplots()

# Plot the binned CDF.
xs = np.linspace(hist.xmin, hist.xmax, 1000)
bys = [hist.cdf[hist.bin(x)] for x in xs]
ax.plot(xs, bys, label="binned CDF")

# Plot the analytic CDF, measured from the lower limit and normalized.
ays = [(line.F(x) - line.F_xmin) / line.area for x in xs]
ax.plot(xs, ays, label="analytic CDF")

# Create the legend.
ax.legend();

In [None]:
class SampleBinned:
    """
    Sampler using binned sampling.
    """

    def __init__(self, rng, hist):
        """
        Initialize the sampler.

        rng:  uniform random number generator, should have method `uniform()`.
        hist: histogram to sample from.
        """
        # Store the RNG, histogram, xmin, and xmax.
        self.rng = rng
        self.hist = hist
        self.xmin = hist.xmin
        self.xmax = hist.xmax

        # Set the area for `plot_sampler`. Since the binned PDF is normalized,
        # this is just 1.
        self.area = 1

    def f(self, x):
        """
        Return the sampled function. This is the binned PDF being sampled.
        Needed for `plot_sampler`.

        x: value to calculate f(x) for.
        """
        # Return 0 if outside the range.
        if x < self.xmin or x > self.xmax:
            return 0.0
        # Return the binned PDF otherwise.
        return self.hist.pdf[self.hist.bin(x)]

    def __call__(self):
        """
        Return the sampled value.
        """
        # Sample a uniform random number.
        r = self.rng.uniform()

        # Get the bin from the inverted CDF. Use the histogram stored on the
        # sampler rather than a global variable.
        i = self.hist.bin_icdf(r)

        # Get the edges for this bin.
        xmin = self.hist.edges[i]
        xmax = self.hist.edges[i + 1]

        # Uniformly sample between these values and return.
        r = self.rng.uniform()
        return r * (xmax - xmin) + xmin
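
The two steps in `SampleBinned.__call__`, picking a bin from the binned CDF and then sampling uniformly inside that bin, can also be applied to many random numbers at once. The following is a rough vectorized sketch under the assumption that the bin contents are given as an array of weights; the names `sample_binned`, `example_edges`, and `example_weights` are hypothetical.

```python
import numpy as np


def sample_binned(rng, edges, weights, size):
    """Draw `size` values from a histogram with the given edges and weights."""
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Normalized binned CDF, evaluated at the upper edge of each bin.
    cdf = np.cumsum(weights) / np.sum(weights)
    # Pick a bin for each uniform number by inverting the binned CDF.
    i = np.searchsorted(cdf, rng.uniform(size=size), side="right")
    # Guard against rounding that leaves the last CDF entry just below 1.
    i = np.minimum(i, len(weights) - 1)
    # Sample uniformly between the edges of each chosen bin.
    lo, hi = edges[i], edges[i + 1]
    return lo + rng.uniform(size=size) * (hi - lo)


# Four bins on [0, 1] with f(x) = 3x + 2 evaluated at the bin centres.
rng_example = np.random.default_rng(3)
example_edges = np.linspace(0.0, 1.0, 5)
centres = (example_edges[:-1] + example_edges[1:]) / 2
example_weights = 3 * centres + 2
binned_samples = sample_binned(rng_example, example_edges, example_weights, 20000)
print(np.mean(binned_samples < 0.25), example_weights[0] / example_weights.sum())
```

The two printed numbers, the observed and expected fraction of samples in the first bin, should agree within statistical fluctuations.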

In [None]:
# Create the linear sampler for its function and integral.
line = SampleLinear(rng, 0, 1, 3, 2)

# Create the histogram.
hist = Histogram(line.f, line.xmin, line.xmax)

# Create the sampler.
sampler = SampleBinned(rng, hist)

# Plot the comparison.
plot_sampler(sampler);

In [None]:
# Create the linear sampler for its function and integral.
line = SampleLinear(rng, 0, 1, 3, 2)

# Create the histogram.
hist = Histogram(line.f, line.xmin, line.xmax, bins=5)

# Create the sampler.
sampler = SampleBinned(rng, hist)

# Plot the comparison.
plot_sampler(sampler);