- https://medium.com/towards-data-science/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
- https://www.youtube.com/results?search_query=binary+cross+entropy+loss+explained

- https://www.youtube.com/watch?v=Md4b67HvmRo&t=102s

- https://cs231n.github.io/neural-networks-2/#losses

chatgpt:
- https://chatgpt.com/share/67b8221f-31ac-8004-836b-44d8bbbc0991

### The Formula of Binary Cross Entropy (BCE) Loss:

The formula for binary cross-entropy (logistic regression loss) is:

$$
L_i = - y \log(\sigma(f)) - (1 - y) \log(1 - \sigma(f))
$$

Where:
- $ L_i $ is the loss for a single training example $ i $.
- $ y $ is the true label of the example (either 0 or 1).
- $ \sigma(f) $ is the predicted probability (output of the sigmoid function), where $ f = w^T x + b $ is the raw model output (the logit).
- $ \log $ is the natural logarithm throughout.
  
### Explanation:

- **For a positive example (y = 1):**  
  When the true label $ y = 1 $, the loss becomes:
  
  $$
  L_i = - \log(\sigma(f))
  $$
  
  This is because $ (1 - y) $ becomes zero, so the second term vanishes. The model wants to maximize the probability of the true class being 1. The closer $ \sigma(f) $ is to 1, the smaller the loss.
  
- **For a negative example (y = 0):**  
  When the true label $ y = 0 $, the loss becomes:
  
  $$
  L_i = - \log(1 - \sigma(f))
  $$
  
  This is because $ y $ becomes zero, so the first term vanishes. The model wants to maximize the probability of the true class being 0. The closer $ \sigma(f) $ is to 0, the smaller the loss.

### Example Walkthrough:

Let’s say we have the following:

- **True label** $ y = 1 $ (positive class).
- **Logit (raw model output)** $ f = 1.5 $ (this is the value $ w^T x + b $).
- **Predicted probability** $ \sigma(f) = \sigma(1.5) = \frac{1}{1 + e^{-1.5}} \approx 0.817 $.

We want to compute the loss for this example.

#### Step 1: Apply the sigmoid function

The sigmoid function is given by:

$$
\sigma(f) = \frac{1}{1 + e^{-f}}
$$

For $ f = 1.5 $:

$$
\sigma(1.5) = \frac{1}{1 + e^{-1.5}} \approx \frac{1}{1 + 0.223} \approx 0.817
$$

So, the predicted probability for class 1 is approximately **0.817**.

#### Step 2: Compute the loss for $ y = 1 $

Since $ y = 1 $, the formula simplifies to:

$$
L_i = - \log(\sigma(f)) = - \log(0.817)
$$

We calculate $ \log(0.817) $:

$$
\log(0.817) \approx -0.202
$$

Thus, the loss for this example is:

$$
L_i = -(-0.202) = 0.202
$$
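The walkthrough above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library, and the variable names are illustrative. At full precision $ \sigma(1.5) \approx 0.8176 $ and the loss is about 0.2014, so the third decimal differs slightly from the hand-rounded figures above.

```python
import math

f = 1.5  # logit from the walkthrough
y = 1    # true label

# Sigmoid turns the logit into a probability for class 1
p = 1.0 / (1.0 + math.exp(-f))

# Binary cross-entropy with the natural logarithm
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(f"sigmoid({f}) = {p:.4f}")
print(f"loss = {loss:.4f}")
```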

### Interpretation of the Loss:

- The loss $ L_i = 0.202 $ means that the model predicted a probability of 0.817 for the true class (class 1), which is fairly close to the correct value of 1.
- A lower loss would indicate a better prediction (closer to 1 for class 1), while a higher loss would indicate a poorer prediction.

### Why This Loss is Important:

- **Minimizing the loss**: The goal of training is to adjust the model parameters so that the predicted probability of class 1, $ \sigma(f) $, is as high as possible for positive examples ($ y = 1 $) and as low as possible for negative examples ($ y = 0 $).
- The **logarithmic nature** of the loss function means that large mistakes (i.e., predicting a very low probability for class 1 when $ y = 1 $) result in a much higher loss than smaller mistakes.

For example:
- If $ \sigma(f) $ were 0.1 instead of 0.817, the loss would be:

$$
L_i = - \log(0.1) \approx 2.303
$$

This would be a much higher loss, indicating a worse prediction.
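To see how quickly the penalty grows, the sketch below evaluates $ -\log(p) $ for a positive example at a handful of predicted probabilities. The helper name `positive_loss` and the chosen probabilities are assumptions made for illustration.

```python
import math

def positive_loss(p):
    """BCE loss for an example with true label y = 1 and predicted probability p."""
    return -math.log(p)

# The loss rises slowly near p = 1 and sharply as p approaches 0
for p in [0.99, 0.817, 0.5, 0.1, 0.01]:
    print(f"p = {p:<5}  loss = {positive_loss(p):.3f}")
```

Each tenfold drop in the predicted probability adds about 2.303 (that is, $ \log 10 $) to the loss, which is why confident wrong predictions dominate the total.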

### Binary Cross Entropy (BCE) Loss: Sample Calculation

The following scenarios compute the **cross-entropy loss** for logistic regression by hand. We will use the formula:

$$
L_i = - y \log(\sigma(f)) - (1 - y) \log(1 - \sigma(f))
$$

### Scenario 1: True Label $ y = 1 $ (positive class)

**Sample Data:**
- **True label $ y = 1 $**
- **Raw score (logit) $ f = 2.0 $**
- **We want to calculate the predicted probability $ \sigma(f) $ using the sigmoid function.**

#### Step 1: Apply the sigmoid function

The sigmoid function is:

$$
\sigma(f) = \frac{1}{1 + e^{-f}}
$$

For $ f = 2.0 $:

$$
\sigma(2.0) = \frac{1}{1 + e^{-2.0}} = \frac{1}{1 + 0.1353} \approx 0.881
$$

So, the predicted probability for class 1 is **0.881**.

#### Step 2: Calculate the loss

Since the true label $ y = 1 $, the formula simplifies to:

$$
L_i = - \log(\sigma(f)) = - \log(0.881)
$$

Now, we calculate:

$$
\log(0.881) \approx -0.127
$$

Thus, the loss for this example is:

$$
L_i = -(-0.127) = 0.127
$$

So, the loss for this example when the true label is **1** is **0.127**.

---

### Scenario 2: True Label $ y = 0 $ (negative class)

**Sample Data:**
- **True label $ y = 0 $**
- **Raw score (logit) $ f = -2.0 $**
- **We want to calculate the predicted probability $ \sigma(f) $ using the sigmoid function.**

#### Step 1: Apply the sigmoid function

For $ f = -2.0 $:

$$
\sigma(-2.0) = \frac{1}{1 + e^{2.0}} = \frac{1}{1 + 7.389} \approx 0.119
$$

So, the predicted probability for class 1 is **0.119**.

#### Step 2: Calculate the loss

Since the true label $ y = 0 $, the formula simplifies to:

$$
L_i = - \log(1 - \sigma(f)) = - \log(1 - 0.119)
$$

Now, we calculate:

$$
1 - 0.119 = 0.881
$$

$$
\log(0.881) \approx -0.127
$$

Thus, the loss for this example is:

$$
L_i = -(-0.127) = 0.127
$$

So, the loss for this example when the true label is **0** is **0.127**.

---

### Recap of Results:

- **For true label $ y = 1 $ and logit $ f = 2.0 $:**  
  The predicted probability is $ \sigma(2.0) \approx 0.881 $, and the loss is **0.127**.
  
- **For true label $ y = 0 $ and logit $ f = -2.0 $:**  
  The predicted probability is $ \sigma(-2.0) \approx 0.119 $, and the loss is **0.127**.

### Interpretation:
- The loss is the same in both cases, **0.127**, because the sigmoid is symmetric: $ \sigma(-f) = 1 - \sigma(f) $. In both scenarios the model assigns a probability of 0.881 to the correct class (for $ y = 1 $, $ \sigma(f) = 0.881 $; for $ y = 0 $, $ 1 - \sigma(f) = 1 - 0.119 = 0.881 $).
- A smaller value of the logit results in a smaller predicted probability for class 1, and a larger value of the logit results in a larger predicted probability for class 1.

This shows how the **binary cross-entropy loss** works for both the positive and the negative class: it penalizes a prediction according to how far the predicted probability is from the true label.
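The two scenarios give the same loss because a logit of $ +2 $ with $ y = 1 $ and a logit of $ -2 $ with $ y = 0 $ both put the same probability on the correct class. A minimal sketch of that check, with illustrative variable names:

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

f = 2.0
p_correct_pos = sigmoid(f)         # y = 1, logit +2: probability given to class 1
p_correct_neg = 1.0 - sigmoid(-f)  # y = 0, logit -2: probability given to class 0

print(f"{p_correct_pos:.3f} {p_correct_neg:.3f}")
print(f"{-math.log(p_correct_pos):.3f} {-math.log(p_correct_neg):.3f}")
```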

In [3]:
import numpy as np

# Sigmoid function
def sigmoid(f):
    return 1 / (1 + np.exp(-f))

# Binary cross-entropy loss function
def binary_cross_entropy_loss(y_true, f):
    prob = sigmoid(f)  # Compute sigmoid
    loss = - (y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob))
    return prob, loss

# Scenario 1: True label = 1 (Positive class)
f_pos = 2.0
y_true_pos = 1
prob_pos, loss_pos = binary_cross_entropy_loss(y_true_pos, f_pos)

# Scenario 2: True label = 0 (Negative class)
f_neg = -2.0
y_true_neg = 0
prob_neg, loss_neg = binary_cross_entropy_loss(y_true_neg, f_neg)

# Print results
print(f"Scenario 1: True label = {y_true_pos}, Logit f = {f_pos}")
print(f"Sigmoid output: {prob_pos:.3f}, Cross-entropy loss: {loss_pos:.3f}\n")

print(f"Scenario 2: True label = {y_true_neg}, Logit f = {f_neg}")
print(f"Sigmoid output: {prob_neg:.3f}, Cross-entropy loss: {loss_neg:.3f}")


Scenario 1: True label = 1, Logit f = 2.0
Sigmoid output: 0.881, Cross-entropy loss: 0.127

Scenario 2: True label = 0, Logit f = -2.0
Sigmoid output: 0.119, Cross-entropy loss: 0.127
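The cell above computes the sigmoid first and then takes its logarithm. For logits of very large magnitude the sigmoid rounds to exactly 0 or 1 in floating point, and the logarithm then returns `-inf`. A common workaround, sketched here under the assumption that the loss is computed directly from the logit, uses the algebraically equivalent form $ \max(f, 0) - f y + \log(1 + e^{-|f|}) $. The function name `bce_from_logits` is hypothetical.

```python
import numpy as np

def bce_from_logits(y_true, f):
    """Binary cross-entropy computed from logits without forming sigmoid(f)."""
    f = np.asarray(f, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # exp is only ever applied to a non-positive number, so it cannot overflow
    return np.maximum(f, 0.0) - f * y_true + np.log1p(np.exp(-np.abs(f)))

# Same scenarios as above: both losses are about 0.127
print(bce_from_logits([1, 0], [2.0, -2.0]))

# Confidently wrong predictions with extreme logits stay finite
print(bce_from_logits([0, 1], [800.0, -800.0]))
```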


### Gradient of binary cross entropy loss

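For the positive case ($ y = 1 $), we are given:

$$
p = \frac{1}{1 + e^{-x}}
$$

$$
L = \log p
$$

Here $ L $ is the log-likelihood, i.e. the negative of the BCE loss $ -\log p $ defined earlier, so the gradient of the loss itself is the negative of the result derived below.

What is the gradient of $ L $ with respect to $ x $ (usually called the logit)?

$$
\frac{dL}{dx}
$$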

---

### **Step 1: Compute $ \frac{dp}{dx} $**

We rewrite $ p $:

$$
p = (1 + e^{-x})^{-1}
$$

Differentiate both sides using the chain rule:

$$
\frac{dp}{dx} = - (1 + e^{-x})^{-2} \cdot \frac{d}{dx} (1 + e^{-x})
$$

Since:

$$
\frac{d}{dx} (1 + e^{-x}) = -e^{-x}
$$

we get:

$$
\frac{dp}{dx} = - (1 + e^{-x})^{-2} \cdot (-e^{-x})
$$

Simplifying:

$$
\frac{dp}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}
$$

Rewriting in terms of $ p $, note that:

$$
p = \frac{1}{1 + e^{-x}}, \quad 1 - p = \frac{e^{-x}}{1 + e^{-x}}
$$

so the derivative factors as:

$$
\frac{dp}{dx} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = p(1 - p)
$$
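The identity $ \frac{dp}{dx} = p(1 - p) $ can be checked numerically before it is used in the next step. The sketch below compares it against a central difference at a few arbitrary points; the helper name `sigmoid_grad` and the step size are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytical derivative of the sigmoid: p * (1 - p)."""
    p = sigmoid(x)
    return p * (1.0 - p)

h = 1e-6  # small step for the central difference
for x in [-3.0, 0.0, 1.0, 4.0]:
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(f"x = {x:+.1f}  numerical = {numeric:.6f}  analytical = {sigmoid_grad(x):.6f}")
```

The derivative peaks at 0.25 when $ x = 0 $ and shrinks toward 0 as $ |x| $ grows.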

---

### **Step 2: Compute $ \frac{dL}{dx} $**

Since $ L = \log p $, we differentiate:

$$
\frac{dL}{dx} = \frac{1}{p} \cdot \frac{dp}{dx}
$$

Substituting $ \frac{dp}{dx} = p(1 - p) $:

$$
\frac{dL}{dx} = \frac{p(1 - p)}{p}
$$

Cancel $ p $:

$$
\frac{dL}{dx} = 1 - p
$$

---

### **Final Answer**
$$
\frac{dL}{dx} = 1 - p
$$

In [3]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x):
    p = sigmoid(x)
    return np.log(p)

def analytical_derivative(x):
    p = sigmoid(x)
    return 1 - p

# Define x and a small step h
x = 1.0  # Example value
h = 1e-5

# Compute L at x and x + h
L_x = loss(x)
L_x_h = loss(x + h)

# Compute numerical derivative
numerical_derivative = (L_x_h - L_x) / h

# Compute analytical derivative
analytical_result = analytical_derivative(x)

# Print results
print(f"Numerical derivative: {numerical_derivative}")
print(f"Analytical derivative: {analytical_result}")

# Check if they are approximately equal
print(f"Difference: {abs(numerical_derivative - analytical_result)}")


Numerical derivative: 0.26894043831382497
Analytical derivative: 0.2689414213699951
Difference: 9.830561701340557e-07


For the negative case ($ y = 0 $), $ L $ is again the log-likelihood, the negative of the BCE loss $ -\log(1 - p) $. We are given:  

$$
p = \frac{1}{1 + e^{-x}}
$$

$$
L = \log(1 - p)
$$

We need to compute:

$$
\frac{dL}{dx}
$$

---

### **Step 1: Compute $ \frac{dp}{dx} $**  

From previous derivations, we know:

$$
\frac{dp}{dx} = p(1 - p)
$$

---

### **Step 2: Compute $ \frac{dL}{dx} $**  

Since $ L = \log(1 - p) $, we differentiate using the chain rule:

$$
\frac{dL}{dx} = \frac{1}{1 - p} \cdot \frac{d(1 - p)}{dx}
$$

Since:

$$
\frac{d(1 - p)}{dx} = -\frac{dp}{dx} = -p(1 - p)
$$

we substitute:

$$
\frac{dL}{dx} = \frac{1}{1 - p} \cdot (-p(1 - p))
$$

Cancel $ 1 - p $:

$$
\frac{dL}{dx} = -p
$$

---

### **Final Answer**
$$
\frac{dL}{dx} = -p
$$

In [1]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x):
    p = sigmoid(x)
    return np.log(1 - p)

def analytical_derivative(x):
    p = sigmoid(x)
    return -p

# Define x and a small step h
x = 1.0  # Example value
h = 1e-5

# Compute L at x and x + h
L_x = loss(x)
L_x_h = loss(x + h)

# Compute numerical derivative
numerical_derivative = (L_x_h - L_x) / h

# Compute analytical derivative
analytical_result = analytical_derivative(x)

# Print results
print(f"Numerical derivative: {numerical_derivative}")
print(f"Analytical derivative: {analytical_result}")

# Check if they are approximately equal
print(f"Difference: {abs(numerical_derivative - analytical_result)}")

Numerical derivative: -0.7310595617093795
Analytical derivative: -0.7310585786300049
Difference: 9.830793745724264e-07
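The two results above are the derivatives of $ \log p $ and $ \log(1 - p) $. The BCE loss is the negative of these, so its gradient with respect to the logit is $ p - 1 $ when $ y = 1 $ and $ p $ when $ y = 0 $, which combine into the single expression $ p - y $. The sketch below checks this against a central difference; the function names `bce_loss` and `bce_grad` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(y, x):
    """Binary cross-entropy loss for label y and logit x."""
    p = sigmoid(x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(y, x):
    """Gradient of the BCE loss with respect to the logit: p - y."""
    return sigmoid(x) - y

h = 1e-6
x = 1.0
for y in (0, 1):
    numeric = (bce_loss(y, x + h) - bce_loss(y, x - h)) / (2 * h)
    print(f"y = {y}: numerical = {numeric:.6f}, analytical = {bce_grad(y, x):.6f}")
```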


### Multi-class Cross-Entropy Loss Explained with Example

Here's an example with some sample data showing how cross-entropy loss is calculated.

### Example:
Suppose we have a classification problem with 3 classes, and for a particular sample, the predicted scores (logits) and true label are as follows:

- **Predicted logits (scores)**: $ f = [2.0, 1.0, 0.1] $
- **True label**: $ y_i = 0 $ (this means the true class is the first class)

### Step-by-step Calculation:

1. **Apply the softmax function** to the logits to get the predicted probabilities.

The softmax function is given by:

$$
\text{softmax}(f)_j = \frac{e^{f_j}}{\sum_{k} e^{f_k}}
$$

For each class $ j $, we compute $ e^{f_j} $, then normalize the result by dividing by the sum of all the exponentiated values.

- Exponentiating the logits:
  - $ e^{f_0} = e^{2.0} \approx 7.389 $
  - $ e^{f_1} = e^{1.0} \approx 2.718 $
  - $ e^{f_2} = e^{0.1} \approx 1.105 $

- Sum of the exponentiated logits:
  - $ \text{Sum} \approx 7.389 + 2.718 + 1.105 = 11.212 $

- Predicted probabilities (softmax values):
  - $ p_0 = \frac{7.389}{11.212} \approx 0.659 $
  - $ p_1 = \frac{2.718}{11.212} \approx 0.242 $
  - $ p_2 = \frac{1.105}{11.212} \approx 0.099 $

2. **Calculate the cross-entropy loss**.

The cross-entropy loss for this sample is computed as:

$$
L_i = - \log(p_{y_i})
$$

Since the true label is class 0 ($ y_i = 0 $), we use the predicted probability for class 0:

$$
L_i = - \log(0.659) \approx -(-0.417) = 0.417
$$

### Final Result:

So, the cross-entropy loss for this sample is **0.417**. 

This value is the penalty for assigning a probability of only 0.659 to the true class. The loss approaches 0 as $ p_{y_i} $ approaches 1, so a lower value means a better prediction.

In [None]:
import numpy as np

# Given logits
logits = np.array([2.0, 1.0, 0.1])

# Step 1: Compute softmax
exp_logits = np.exp(logits)  # Exponentiate each logit
softmax_probs = exp_logits / np.sum(exp_logits)  # Normalize

# Step 2: Compute cross-entropy loss
true_label = 0  # Given that the true class is 0
loss = -np.log(softmax_probs[true_label])

# Print results
print(f"Exponentiated logits: {exp_logits}")
print(f"Softmax probabilities: {softmax_probs}")
print(f"Cross-entropy loss: {loss:.4f}")
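The cell above exponentiates the raw logits, which overflows once a logit is in the high hundreds. A common remedy, sketched here as one possible variant rather than a definitive implementation, subtracts the largest logit before exponentiating. Softmax is unchanged by adding a constant to every logit, so the loss is the same. The function name `softmax_cross_entropy` is hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, true_label):
    """Cross-entropy loss for one sample, computed via shifted log-softmax."""
    logits = np.asarray(logits, dtype=float)
    # Shifting by the max keeps every exponent <= 0, so exp cannot overflow
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[true_label]

# Same sample as above: the loss is about 0.417
print(f"{softmax_cross_entropy([2.0, 1.0, 0.1], 0):.4f}")

# Adding 1000 to every logit leaves the loss unchanged and does not overflow
print(f"{softmax_cross_entropy([1002.0, 1001.0, 1000.1], 0):.4f}")
```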

