# Classification with a control channel

_(working title)_

Gilles Louppe -- [@glouppe](https://twitter.com/glouppe) <br />
Tim Head -- [@betatim](https://twitter.com/betatim)

Last changes on July 30, 2015.

# Setting

## Notations

Let us assume a set of _cases_ or _objects_ taken from a universe $\Omega$. Let us further assume that each object is described by a set of _measurements_ and let us arrange these measurements in some pre-assigned order, i.e., take the input values to be $x_1, x_2, ..., x_p$, where $x_j \in {\cal X}_j$ (for $j=1, ..., p$) corresponds to the value of the input variable $X_j$. Together, the input values $(x_1, ..., x_p)$ form a $p$-dimensional input vector ${\bf x}$ taking its values in ${\cal X}_1 \times ... \times {\cal X}_p = {\cal X}$, where ${\cal X}$ is defined as the input space. Similarly, let us define as $y \in {\cal Y}$ the value of the output variable $Y$, where ${\cal Y}$ is defined as the output space. By definition, the input and output spaces are assumed to contain all possible input vectors and all possible output values, respectively.

_Note._ Input variables are also known as _features_ or _descriptors_, input vectors as _instances_ or _samples_ and the output variable as _target_ or _response_.

## Supervised learning

Let us assume a learning set ${\cal L}$ composed of $N$ pairs of input vectors and output values $({\bf x}_1, y_1), ..., ({\bf x}_N, y_N)$, where ${\bf x}_i \in {\cal X}$ and $y_i \in {\cal Y}$. In this framework, the supervised learning task can be stated as learning a function (or _model_) $\varphi : {\cal X} \mapsto {\cal Y}$ from ${\cal L}$. In particular, the objective is to find a model such that its predictions $\varphi({\bf x})$, also denoted by the variable $\hat{Y}$, are as good as possible.

In the statistical sense, input and output variables $X_1, ..., X_p$ and $Y$ are _random variables_ jointly taking their values from ${\cal X} \times {\cal Y}$ with respect to the joint probability distribution $P(X, Y)$, where $X$ denotes the random vector $(X_1, ..., X_p)$. That is, $P(X={\bf x}, Y=y)$ is the probability that random variables $X$ and $Y$ take values ${\bf x}$ and $y$ from ${\cal X}$ and ${\cal Y}$ when drawing an object uniformly at random from the universe $\Omega$.

Accordingly, given a loss function $L$ measuring the discrepancy between an output value and a prediction, learning from ${\cal L}$ a model $\varphi_{\cal L}$ whose predictions are as good as possible can be stated as finding a model which minimizes its expected prediction error, also known as the _generalization error_,

$$Err(\varphi_{\cal L}) = \mathbf{E}_{X,Y} L(Y,\varphi_{\cal L}(X)). $$
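Since $P(X, Y)$ is unknown in practice, this expectation is usually approximated by an average over samples not used for learning. The sketch below illustrates this on a hypothetical toy distribution (a fair-coin $Y$ with a one-dimensional Gaussian $X$), using the zero-one loss; the names `zero_one_loss`, `estimate_error` and `draw` are illustrative assumptions and not part of the original text.

```python
import random

def zero_one_loss(y, y_hat):
    """Zero-one loss L(y, y_hat): 1 for a wrong prediction, 0 otherwise."""
    return 0 if y == y_hat else 1

def estimate_error(model, samples, loss=zero_one_loss):
    """Average loss of `model` over (x, y) pairs drawn from P(X, Y)."""
    return sum(loss(y, model(x)) for x, y in samples) / len(samples)

def draw(n, rng):
    """Hypothetical toy P(X, Y): Y is a fair coin, X | Y ~ Normal(2Y - 1, 1)."""
    pairs = []
    for _ in range(n):
        y = rng.randint(0, 1)
        pairs.append(((rng.gauss(2 * y - 1, 1.0),), y))
    return pairs

rng = random.Random(0)
test_set = draw(20000, rng)              # independent samples, not used for learning
model = lambda x: 1 if x[0] > 0 else 0   # a fixed model phi: X -> Y
err = estimate_error(model, test_set)    # Monte Carlo estimate of Err(phi)
print(f"estimated generalization error: {err:.3f}")
```

The estimate converges to the true expectation as the number of independent samples grows; any other loss function with the same signature can be passed as `loss`.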

In practice, simplifying assumptions are made to solve supervised learning. In particular, one often assumes that the very best model, or at least a good approximation of it, lives in a family ${\cal H}$ of candidate models, also known as _hypotheses_, of restricted structure (e.g., the family of linear models or the family of decision trees). In this sense, learning amounts to constructing or finding a model in ${\cal H}$ for which the generalization error is (supposedly) as low as possible.
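As an illustration of searching within a restricted family, the sketch below takes ${\cal H}$ to be the set of one-dimensional threshold models and selects the member with the lowest error on the learning set. The family, the candidate grid and the toy data are assumptions made for this example only; the error measured on ${\cal L}$ is merely a proxy for the generalization error.

```python
import random

def make_stump(threshold):
    """A member of H: predict 1 when the single input exceeds `threshold`."""
    return lambda x: 1 if x[0] > threshold else 0

def empirical_error(model, learning_set):
    """Fraction of misclassified pairs (zero-one loss averaged over the set)."""
    return sum(model(x) != y for x, y in learning_set) / len(learning_set)

def fit_stump(learning_set, candidates):
    """Return the candidate threshold with the lowest empirical error."""
    return min(candidates,
               key=lambda t: empirical_error(make_stump(t), learning_set))

# Hypothetical learning set: Y is a fair coin, X | Y ~ Normal(2Y - 1, 1).
rng = random.Random(1)
learning_set = []
for _ in range(2000):
    y = rng.randint(0, 1)
    learning_set.append(((rng.gauss(2 * y - 1, 1.0),), y))

candidates = [i / 10 for i in range(-30, 31)]   # thresholds -3.0, -2.9, ..., 3.0
best_threshold = fit_stump(learning_set, candidates)
train_error = empirical_error(make_stump(best_threshold), learning_set)
print(best_threshold, train_error)
```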

# Classification of events in high energy physics

_[GL: Tim, You might proofcheck this introductory paragraph :)]_

In high energy physics, experimentalists aim at building detectors for the observation and discovery of a phenomenon predicted by some theoretical model (e.g., the discovery of the Higgs boson, as predicted by the Standard Model). To achieve this, classifiers are built on simulated data and then used to evaluate real data as observed and recorded through the detector. Provided a classifier trained on simulated data properly transfers to real data, the goal is then to assess whether the predicted phenomenon does actually exist, with high probability. 

In machine learning terms, let us assume a universe of objects, or _events_, each described by a vector of physical input values ${\bf x} = (x_1, ..., x_p)$. Let us further assume that some of these events correspond to _signal_ ($y=1$), i.e. the predicted phenomenon, while the others correspond to _background_ ($y=0$), i.e. known and verified physical processes. 

In this setting, given a learning set ${\cal L}$ of simulated events, supervised learning algorithms can be used to find a model $\varphi : {\cal X} \mapsto {\cal Y}$, where ${\cal Y} = \{0, 1\}$, capable of distinguishing signal from background events given physical input values.

## Control channel

Because the simulation itself might not be free of inaccuracies, care should be taken when learning a classifier $\varphi$ not to exploit simulation artefacts -- which do not exist in real data -- to separate signal from background events. Exploiting these discrepancies would indeed lead to a model whose performance on simulated data might significantly differ from its actual performance on real data, therefore making it far less reliable in an actual experiment pipeline.
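The effect can be illustrated with a minimal sketch on hypothetical toy events, where one feature separates signal from background in simulation only. All distributions, thresholds and names below are assumptions chosen for illustration; in particular, labels for real events are known here only because the data are synthetic.

```python
import random

def draw_events(n, rng, artefact_shift):
    """Hypothetical toy events with two features:
    x[0] is genuine physics, identical in simulation and real data;
    x[1] separates signal from background only if artefact_shift != 0."""
    events = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x0 = rng.gauss(0.5 * y, 1.0)
        x1 = rng.gauss(artefact_shift * y, 1.0)
        events.append(((x0, x1), y))
    return events

def accuracy(model, events):
    """Fraction of events whose label is predicted correctly."""
    return sum(model(x) == y for x, y in events) / len(events)

rng = random.Random(2)
simulated = draw_events(20000, rng, artefact_shift=3.0)  # artefact present
real = draw_events(20000, rng, artefact_shift=0.0)       # artefact absent

artefact_model = lambda x: 1 if x[1] > 1.5 else 0   # relies on the artefact
physics_model = lambda x: 1 if x[0] > 0.25 else 0   # relies on genuine physics

sim_acc = accuracy(artefact_model, simulated)
real_acc = accuracy(artefact_model, real)
phys_sim_acc = accuracy(physics_model, simulated)
phys_real_acc = accuracy(physics_model, real)
print(f"artefact model: simulated {sim_acc:.3f}, real {real_acc:.3f}")
print(f"physics model:  simulated {phys_sim_acc:.3f}, real {phys_real_acc:.3f}")
```

In this toy setting the model relying on the artefact looks far better than the physics-only model on simulated events, yet does no better than chance on real ones, while the physics-only model performs about the same on both.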

To control for this effect, ...

- Equivalent to restricting the space of models
- The question is then how to best explore this restricted space

# Negative result: passing the agreement while exploiting simulation artefacts

- From the physics point of view, the test may yield false positives: passing the test does not guarantee that the classifier does not exploit simulation artefacts.
- Repeatedly controlling the agreement might make things even worse.
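The first point above can be sketched as follows. The text does not specify the form of the agreement test, so this example assumes, purely for illustration, a two-sample Kolmogorov-Smirnov comparison of classifier scores between simulated and real events of a control channel, with a hypothetical acceptance threshold of 0.05. The score distributions are invented so that the mis-simulation affects the signal channel only.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples `a` and `b`."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:   # advance past all values <= v
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

rng = random.Random(3)
n = 5000
# Classifier scores on the control channel (assumed well simulated).
sim_control = [rng.gauss(0.0, 1.0) for _ in range(n)]
real_control = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Classifier scores on the signal channel (assumed mis-simulated: shifted).
# Real signal scores are available here only because the data are synthetic.
sim_signal = [rng.gauss(1.0, 1.0) for _ in range(n)]
real_signal = [rng.gauss(0.0, 1.0) for _ in range(n)]

THRESHOLD = 0.05  # hypothetical acceptance threshold for the agreement test
ks_control = ks_statistic(sim_control, real_control)
ks_signal = ks_statistic(sim_signal, real_signal)
print(f"control channel KS = {ks_control:.3f}, passes: {ks_control < THRESHOLD}")
print(f"signal channel KS  = {ks_signal:.3f}")
```

Under these assumptions the classifier passes the test on the control channel even though its scores disagree strongly between simulation and real data in the signal channel, which the test never sees.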

# Toy example