# Appendix: BCE loss

Here we introduce the **binary cross-entropy (BCE) loss**, which you might encounter [in the wild](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html). It turns out to be just cross-entropy, specialized to binary classification with scalar-valued models. Another goal of this section is to show that conceptually simple things in ML can be confusing due to implementation details.

For binary classification, since $p_0 + p_1 = 1$, it suffices to compute the probability $p_1$ of the positive class. Hence, we should be able to train a scalar-valued NN to compute the probabilities. In this case, the cross-entropy loss for a label $y \in \{0, 1\}$ can be calculated using $p_1$ alone:

$$
\begin{aligned}
\ell_{\text{CE}} 
= -(1 - y)\log (1 - p_1) - y \log p_1
\; = \begin{cases} 
    -\log \;(1 - p_1)  \quad &{y = 0} \\ 
    -\log \; p_1 \quad &{y = 1}.
\end{cases}
\end{aligned}
$$
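As a quick sanity check, the formula above can be written in a few lines of plain Python. This is only a minimal sketch: the function name `bce_from_prob` and the example value of `p1` are made up for illustration, and the code assumes $0 < p_1 < 1$ so that both logarithms are defined.

```python
import math

def bce_from_prob(p1, y):
    """Cross-entropy for a binary label y in {0, 1}, given p1 = P(class 1).

    Assumes 0 < p1 < 1; math.log raises ValueError at the endpoints.
    """
    return -(1 - y) * math.log(1 - p1) - y * math.log(p1)

p1 = 0.8  # illustrative probability of the positive class

loss_pos = bce_from_prob(p1, y=1)  # reduces to -log(p1)
loss_neg = bce_from_prob(p1, y=0)  # reduces to -log(1 - p1)
```

A confident prediction that matches the label (`loss_pos`) gives a smaller loss than the same prediction against the opposite label (`loss_neg`).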

To connect this with the usual two-class model, let $\boldsymbol{\mathsf{s}} = f(\boldsymbol{\mathsf{x}}) \in \mathbb{R}^2$ be the network output. Recall that
the softmax probabilities are given by:

$$
\begin{aligned}
\boldsymbol{p} = \text{Softmax}(\boldsymbol{\mathsf{s}})
&= \left(\frac{e^{s_0}}{e^{s_0} + e^{s_1}}, \frac{e^{s_1}}{e^{s_0} + e^{s_1}}\right) \\
&= \left(\frac{1}{1 + e^{(s_1 - s_0)}}, \frac{1}{1 + e^{-(s_1 - s_0)}}\right).
\end{aligned}
$$
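The second line of this derivation says that both probabilities depend only on the difference $s_1 - s_0$. The following sketch checks this numerically for one arbitrary pair of logits. The helper names `softmax2` and `softmax2_from_diff` and the example values are assumptions made for illustration.

```python
import math

def softmax2(s0, s1):
    """Two-class softmax, computed directly from the definition."""
    z = math.exp(s0) + math.exp(s1)
    return math.exp(s0) / z, math.exp(s1) / z

def softmax2_from_diff(s0, s1):
    """Same probabilities, written only in terms of d = s1 - s0."""
    d = s1 - s0
    return 1 / (1 + math.exp(d)), 1 / (1 + math.exp(-d))

s0, s1 = 0.3, 1.7  # arbitrary example logits

p = softmax2(s0, s1)
q = softmax2_from_diff(s0, s1)

# Adding the same constant to both logits leaves the difference,
# and hence the probabilities, unchanged.
p_shifted = softmax2(s0 + 5.0, s1 + 5.0)
```

Up to floating-point rounding, `p`, `q`, and `p_shifted` agree.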

Then, the probability of the positive class can be written as:

$$p_1 = \text{Sigmoid}(\Delta s)$$

where $\Delta s := s_1 - s_0.$ 
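This can now be used to calculate the cross-entropy directly from $\Delta s$ by using 

$$-\log\,\text{Sigmoid}(\Delta s) = \log\left(1 + e^{-\Delta s}\right)$$

which, when evaluated as a single fused operation, is more numerically stable than computing the sigmoid and then the logarithm sequentially. This covers the case $y = 1.$ For $y = 0,$ since $1 - \text{Sigmoid}(\Delta s) = \text{Sigmoid}(-\Delta s),$ the same identity gives $-\log\,(1 - p_1) = \log\left(1 + e^{\Delta s}\right).$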
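To make the stability point concrete, here is a minimal sketch in plain Python that covers both label values. It uses one common rearrangement of $\log(1 + e^{x})$ that never exponentiates a positive number. It is not claimed to be how any particular library implements its loss, and the names `softplus`, `bce_with_logit`, and `bce_naive` are chosen for illustration.

```python
import math

def softplus(x):
    """log(1 + exp(x)), rearranged so that exp is only applied to -|x| <= 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_with_logit(delta_s, y):
    """Binary cross-entropy computed directly from delta_s = s1 - s0."""
    # y = 1: -log Sigmoid(delta_s)       = log(1 + exp(-delta_s))
    # y = 0: -log (1 - Sigmoid(delta_s)) = log(1 + exp(+delta_s))
    return softplus(-delta_s) if y == 1 else softplus(delta_s)

def bce_naive(delta_s, y):
    """Sigmoid first, then log: the two operations done sequentially."""
    p1 = 1 / (1 + math.exp(-delta_s))
    return -(1 - y) * math.log(1 - p1) - y * math.log(p1)

# For moderate logits the two versions agree.
stable = bce_with_logit(1.4, 1)
naive = bce_naive(1.4, 1)

# For extreme logits the naive version breaks down, while the fused one does not.
try:
    bce_naive(-800.0, 1)      # exp(800) overflows
    naive_failed = False
except OverflowError:
    naive_failed = True

extreme = bce_with_logit(-800.0, 1)
```

In the extreme case the loss is essentially $-\Delta s$, which the fused version returns without ever forming a probability.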
Note that $\Delta s = (\boldsymbol{\theta}_1  - \boldsymbol{\theta}_0)^\top \boldsymbol{\mathsf{z}} + (b_1 - b_0)$ since the logits layer is linear.
Thus, we can train an equivalent scalar-valued model with these fused weights that models $\Delta s.$ This model predicts the positive class whenever $\Delta s \geq 0,$ i.e. $s_1 \geq s_0.$ The scalar-valued model can then be converted to the two-valued model by assigning zero weights and bias to the negative class.
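The fusing argument can be checked on a toy logits layer. In the sketch below all weights, biases, and the feature vector `z` are made-up numbers, and `dot` is a small helper defined here rather than a library function.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical two-class logits layer: s_k = theta_k . z + b_k
theta0, b0 = [0.5, -1.0, 2.0], 0.1
theta1, b1 = [1.5, 0.25, -0.5], -0.3

# Fused scalar-valued model for delta_s = s1 - s0
theta = [w1 - w0 for w0, w1 in zip(theta0, theta1)]
b = b1 - b0

z = [0.2, -0.7, 1.1]  # arbitrary feature vector

s0 = dot(theta0, z) + b0
s1 = dot(theta1, z) + b1
delta_s = dot(theta, z) + b

# Converting back: zero weights and bias for the negative class,
# fused weights and bias for the positive class.
s0_new = dot([0.0] * len(theta), z) + 0.0
s1_new = delta_s

p1_two_class = math.exp(s1) / (math.exp(s0) + math.exp(s1))
p1_scalar = 1 / (1 + math.exp(-delta_s))
p1_rebuilt = math.exp(s1_new) / (math.exp(s0_new) + math.exp(s1_new))
```

All three ways of computing $p_1$ agree up to rounding, and the sign of `delta_s` matches the comparison between `s1` and `s0`.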


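**Remark.** This is another nice property of using the exponential to convert scores to probabilities. Because the exponential converts sums to products, i.e. $e^{s_1} / e^{s_0} = e^{s_1 - s_0},$ the class probabilities depend only on the difference of the scores. This is what allows fusing the weights of the logits layer to get one separating hyperplane.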