## Introduction

Key problems for deep learning in the natural sciences include:

* understanding model bias: we have very detailed, but imperfect, simulations for training - how does this impact a machine-learning-based approach to data analysis?
* how do we propagate uncertainties through a deep learning algorithm?
* how can we use data directly to train a deep net?

The idea of _weak supervision_ - not telling a network what any given event is but rather providing a probability the event is in a given category - helps deal with some of these issues. The authors stress that this approach also better reflects the quantum mechanical nature of reality, but that argument isn't necessary to make the paper useful.

The most surprising result of the paper is that weakly supervised nets are largely insensitive to the correctness of the probability labels assigned to training batches.

It is important to note that, in this context, weak supervision was first discussed by Dery, Nachman, Rubbo, and Schwartzman [(DNRS)](https://arxiv.org/abs/1702.00414) (perhaps we should have read that paper first, but here we are...).

The DNRS work was also discussed at DS@HEP2017 by [B. Nachman](https://indico.fnal.gov/contributionDisplay.py?contribId=16&confId=13497) and
[F. Rubbo](https://indico.fnal.gov/contributionDisplay.py?contribId=46&confId=13497).

## Loss Functions

The paper contains a reasonably pedagogic introduction to neural networks that we will skip in this discussion. However, the _loss functions_ (or, "training objective functions") are central to the paper.

For the **fully supervised network**, the loss function employed is the "classic" binary cross entropy:

\begin{equation}
\mathcal{l}_{\text{BCE}}(\{y_t\}, \{y_p\}) = \sum_{i \in \text{samples}} \left[ y_{t,i} \log \frac{1}{y_{p,i}} + (1 - y_{t,i}) \log \frac{1}{1 - y_{p,i}} \right]
\end{equation}

For more on the cross-entropy and why it is used as a loss function, see [this notebook](https://github.com/gnperdue/JupyterNotebooks/blob/master/loss_functions.ipynb). One nice feature of the function (discussed in the paper) is that _it penalizes incorrect guesses more when the network is more confident in them_.
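To make the "confident mistakes cost more" property concrete, here is a minimal sketch of the summed binary cross entropy written out term by term. The function name, the clipping constant `eps`, and the example scores are illustrative assumptions, not taken from the paper or its code.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Summed binary cross entropy over a list of events."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so the logarithms stay finite
        # (eps is an implementation convenience, not part of the formula).
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(1.0 / p) + (1.0 - t) * math.log(1.0 / (1.0 - p))
    return total

# One true signal event (label 1), scored two different ways.
mild_miss = binary_cross_entropy([1], [0.4])        # wrong, but hesitant
confident_miss = binary_cross_entropy([1], [0.01])  # wrong and confident

print(mild_miss, confident_miss)
```

The hesitant miss costs $\log(1/0.4)$ while the confident miss costs $\log(1/0.01)$, so the penalty grows without bound as the network becomes more certain of the wrong answer.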

For the **weakly supervised network**, instead of using the "true" label for each event, we use a batch average, where the numbers averaged are the population averages for the groups from which the event was drawn:

<img src="./cfo_fig2.png"></img>

Datasets A and B arise naturally in an HEP analysis (signal-rich regions and background-rich "sideband" regions), and by inspection nothing restricts us to exactly two. We could also "manufacture" multiple sets from one, so the requirement of two or more datasets is not stringent in any sense.

Here, the loss function is:

\begin{equation}
\mathcal{l}_{\text{weak}}(\{f_t\}, \{y_p\}) = \left| \langle f_{t,i} \rangle - \langle y_{p,i} \rangle \right|
\end{equation}

where $y_p$ is still the network prediction, and $f_t$ is the batch average pictured above.
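A minimal sketch of this loss for a single batch is below. The names `weak_loss`, `preds_sharp`, and `preds_flat` and the numbers are assumptions chosen for illustration; the authors' actual implementation lives in their repository.

```python
def weak_loss(f_true, y_pred):
    """|<f_t> - <y_p>| for one batch: compare the mean label to the mean prediction."""
    mean_f = sum(f_true) / len(f_true)
    mean_p = sum(y_pred) / len(y_pred)
    return abs(mean_f - mean_p)

# A batch of 10 events drawn from a dataset with (assumed) signal fraction 0.7.
f_true = [0.7] * 10

# A network that confidently separates 7 signal-like from 3 background-like events...
preds_sharp = [1.0] * 7 + [0.0] * 3
# ...and a network that outputs 0.7 for every event.
preds_flat = [0.7] * 10

print(weak_loss(f_true, preds_sharp), weak_loss(f_true, preds_flat))
```

Both sets of predictions give a loss of (numerically) zero, because the loss only constrains the batch mean. A single dataset therefore cannot distinguish a useful classifier from a constant one, which is one way to see why batches from datasets with different fractions are needed.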

## Toy Models

[This notebook](https://github.com/bostdiek/PublicWeaklySupervised/blob/master/WeakSupervisionDemo.ipynb) provides demo code for the toy model (we will go through it, slightly modified).

First, we have the _very surprising_ result that the performance, as characterized by the AUC for the ROC curve, is largely insensitive to the accuracy of the fraction label (although there are large differences in the region of very low false positive rate).

<img src="./cfo_fig5.png"></img>
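One property of the AUC that may help frame this result: it depends only on how the network orders events, not on the absolute values of its outputs. The sketch below computes the AUC as the probability that a random signal event outscores a random background event; the function name and the toy scores are assumptions, and this is not the authors' evaluation code.

```python
def roc_auc(signal_scores, background_scores):
    """AUC as the fraction of (signal, background) pairs ranked correctly."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(signal_scores) * len(background_scores))

signal = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2]

# Squash every output with the same monotonic map, as a shifted target might do.
squashed_signal = [0.5 * s + 0.1 for s in signal]
squashed_background = [0.5 * b + 0.1 for b in background]

auc = roc_auc(signal, background)
auc_squashed = roc_auc(squashed_signal, squashed_background)
print(auc, auc_squashed)
```

The two values agree, so any mislabeling that merely rescales the network output monotonically would leave the AUC untouched. Whether that is the full explanation for the insensitivity seen in the figure is a separate question.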

In figure 6 we collapse to a specific point on the ROC curve by using a 40% signal efficiency as a tight selection and 70% efficiency as a "medium" selection:

<img src="./cfo_fig6.png"></img>
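Collapsing to a point on the ROC curve amounts to choosing the score cut that keeps a fixed fraction of the signal and then counting how much background survives. A rough sketch follows, with hypothetical helper names and made-up scores:

```python
def threshold_at_signal_efficiency(signal_scores, efficiency):
    """Score cut such that roughly `efficiency` of signal events have score >= cut."""
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(efficiency * len(ranked)))
    return ranked[n_keep - 1]

def false_positive_rate(background_scores, cut):
    """Fraction of background events passing the cut."""
    passed = sum(1 for b in background_scores if b >= cut)
    return passed / len(background_scores)

signal = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.3, 0.1]
background = [0.85, 0.65, 0.55, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.02]

tight_cut = threshold_at_signal_efficiency(signal, 0.4)   # "tight" working point
medium_cut = threshold_at_signal_efficiency(signal, 0.7)  # "medium" working point

fpr_tight = false_positive_rate(background, tight_cut)
fpr_medium = false_positive_rate(background, medium_cut)
print(tight_cut, fpr_tight, medium_cut, fpr_medium)
```

The tight cut keeps less signal but also admits less background, which is the trade-off the two panels of the figure compare across fraction labels.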

In the right (tight) selection, we even find some mislabeled data sets outperforming the correctly labeled dataset (0.7). The authors attribute this to a feature of the (sigmoid) activation function keeping the network from making "confident" judgements except when the target is near 0 or 1.

Does this make sense?

<img src="./cfo_fig7.png"></img>

<img src="./cfo_fig8.png"></img>

## LHC Physics

The repository contains some LHC data to play with:

In [1]:
ls Data

Gluino_1stGen.txt               STOPS_1TeV.txt
Gluino_3rdGen.txt               TrueFalsePositiveRates/
KerasModelWeights/              Z_Jets_data.txt
KerasModelWeights.tgz           Z_Jets_data_with_weights_1.txt
README.md


In [4]:
!head Data/Z_Jets_data.txt

MET,PT(J1),PT(J2),PT(J3),PT(J4),PT(J5),PT(J6),PT(J7),PT(J8),PT(J9),PT(J10),weight
253.087783813,250.32081604,0,0,0,0,0,0,0,0,0,5.72718e-05
261.409820557,229.624542236,45.158996582,0,0,0,0,0,0,0,0,5.72718e-05
232.673736572,277.371551514,68.8504104614,0,0,0,0,0,0,0,0,5.72718e-05
217.952270508,221.36416626,0,0,0,0,0,0,0,0,0,5.72718e-05
271.402984619,247.162765503,97.5233764648,45.0991973877,0,0,0,0,0,0,0,5.72718e-05
250.773895264,253.875366211,0,0,0,0,0,0,0,0,0,5.72718e-05
248.885314941,244.906677246,0,0,0,0,0,0,0,0,0,5.72718e-05
218.724914551,304.211883545,116.179588318,0,0,0,0,0,0,0,0,5.72718e-05
260.246124268,242.647232056,0,0,0,0,0,0,0,0,0,5.72718e-05
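For anyone who wants to poke at these files, the sketch below parses the format shown above using only the standard library. It reads from an in-memory copy of the first two rows rather than the real file, and it assumes that a jet $p_T$ of 0 is padding for an absent jet; that interpretation is a guess from the printed rows, not something the repository states here.

```python
import csv
import io

# First rows of Data/Z_Jets_data.txt, copied from the output above.
sample = """MET,PT(J1),PT(J2),PT(J3),PT(J4),PT(J5),PT(J6),PT(J7),PT(J8),PT(J9),PT(J10),weight
253.087783813,250.32081604,0,0,0,0,0,0,0,0,0,5.72718e-05
261.409820557,229.624542236,45.158996582,0,0,0,0,0,0,0,0,5.72718e-05
"""

def read_events(handle):
    """Parse events into dicts with MET, the list of nonzero jet pTs, and the weight."""
    events = []
    for row in csv.DictReader(handle):
        jets = [float(row[f"PT(J{i})"]) for i in range(1, 11)]
        events.append({
            "met": float(row["MET"]),
            # Assumption: zero entries are padding, so only nonzero pTs are real jets.
            "jet_pts": [pt for pt in jets if pt > 0.0],
            "weight": float(row["weight"]),
        })
    return events

events = read_events(io.StringIO(sample))
for event in events:
    print(event["met"], len(event["jet_pts"]), event["weight"])
```

To read the real file, pass `open("Data/Z_Jets_data.txt", newline="")` in place of the `io.StringIO` object.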


We won't go through that here, but we will note two interesting results:

<img src="./cfo_fig9.png"></img>

The above figure was computed using a special selection of background events drawn from the tails of the generator-level distributions, and performance is still very strong even though we are now sampling events from outside the training phase space. This is from the Gluino vs Z + jets study. Are these events different from the training sample in truly "unexpected" ways? Or do they show the same trends as the background used for training? The performance is impressive.

Below they performed a pair of different mismodelings - random label swaps (left) and a "phase space swap" (right), where they change the label on the most signal-like background and the most background-like signal.

<img src="./cfo_fig10.png"></img>
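The two mismodelings can be sketched roughly as follows. The helper names, the `fraction` parameter, and the use of a generic per-event `scores` list to rank how signal-like an event looks are all assumptions for illustration; the paper's exact procedure for choosing which events to swap may differ.

```python
import random

def random_swap(labels, fraction, seed=0):
    """Flip the labels of a randomly chosen `fraction` of all events."""
    rng = random.Random(seed)
    n_swap = int(fraction * len(labels))
    swapped = list(labels)
    for i in rng.sample(range(len(labels)), n_swap):
        swapped[i] = 1 - swapped[i]
    return swapped

def phase_space_swap(labels, scores, fraction):
    """Flip the most signal-like background and the most background-like signal."""
    # Background events, most signal-like first.
    bkg = sorted((i for i, y in enumerate(labels) if y == 0),
                 key=lambda i: scores[i], reverse=True)
    # Signal events, most background-like first.
    sig = sorted((i for i, y in enumerate(labels) if y == 1),
                 key=lambda i: scores[i])
    swapped = list(labels)
    for group in (bkg, sig):
        for i in group[: int(fraction * len(group))]:
            swapped[i] = 1 - swapped[i]
    return swapped

labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.2, 0.7, 0.8, 0.9]  # hypothetical "signal-likeness"

randomly_swapped = random_swap(labels, 0.25)
phase_swapped = phase_space_swap(labels, scores, 0.25)
print(randomly_swapped, phase_swapped)
```

The random swap adds noise spread over the whole phase space, while the phase-space swap concentrates the wrong labels exactly where signal and background overlap.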

We find the random swap impacts the fully supervised classifier at low false positive rate, which seems plausible. It is not obvious _a priori_ why the degradation would be restricted to the low false positive region, but _a posteriori_ it seems reasonable that the swap would punish the network most when we try to move into a zero-background regime. In line with the previous sections of the paper, the weakly supervised classifier was not impacted by this swap.

The phase space swap has very little impact on the weakly supervised network, which is a bit surprising. It also appears to degrade the fully supervised model across all false positive rates - this is reasonable from the perspective that a phase-space swap like this would lead to a more systematic confusion in the network.

The authors also point out that the fully supervised and weakly supervised networks are complementary: they may reasonably be expected to rely on somewhat different features, so _combining them_ should be advantageous.