<h2 style="text-align: center;"><strong>Segment 2: Integrals</strong></h2>

* Integration in Machine Learning
* Numeric Integration with Python
* Binary Classification
* The Confusion Matrix
* The Receiver-Operating Characteristic (ROC) Curve 
* Area Under the ROC Curve

## **Integration in Machine Learning**

Integration measures the **total accumulation of a quantity** over a continuous domain. While derivatives measure instantaneous change, integrals measure **aggregate effects**, which makes them central to probabilistic reasoning, expected values, and continuous loss computation in machine learning.

### **Basic Integration Rules**

1. **Power Rule**  
$$
\int x^n \, dx = \frac{x^{n+1}}{n+1} + C, \quad n \neq -1
$$

2. **Constant Multiple Rule**  
$$
\int k \cdot f(x) \, dx = k \int f(x) \, dx
$$

3. **Sum Rule**  
$$
\int [f(x) + g(x)] \, dx = \int f(x) \, dx + \int g(x) \, dx
$$

4. **Definite Integral** (area under the curve from $a$ to $b$)  
$$
\int_a^b f(x) \, dx = F(b) - F(a)
$$
where $F(x)$ is any antiderivative of $f(x)$, that is, $F'(x) = f(x)$. Note that the **constant of integration $C$ cancels** in definite integrals, because it appears in both $F(b)$ and $F(a)$.

### **Applications of Integration in Machine Learning**

**1. Expected Value and Probability**

For a continuous random variable $X$ with probability density function $p(x)$:

$$
\mathbb{E}[f(X)] = \int_{-\infty}^{\infty} f(x) \, p(x) \, dx
$$

- Used in Bayesian inference, probabilistic neural networks, and expectation-maximization algorithms.  
- **Example:** If $X \sim \text{Uniform}[0,1]$, then $p(x) = 1$ on $[0,1]$ and

$$
\mathbb{E}[X] = \int_0^1 x \cdot 1 \, dx = \left[ \frac{x^2}{2} \right]_0^1 = \frac{1}{2}
$$
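The expectation integral can be checked numerically. The following is a minimal sketch using a midpoint Riemann sum; the helper names `expected_value` and `uniform_pdf` are illustrative and not part of any library.

```python
def expected_value(f, pdf, a, b, n=10_000):
    """Approximate E[f(X)] = integral of f(x) * pdf(x) over [a, b] (midpoint rule)."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width  # midpoint of the i-th subinterval
        total += f(x) * pdf(x) * width
    return total


def uniform_pdf(x):
    # Density of Uniform[0, 1]: 1 inside the interval, 0 outside
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


mean = expected_value(lambda x: x, uniform_pdf, 0.0, 1.0)
second_moment = expected_value(lambda x: x**2, uniform_pdf, 0.0, 1.0)
print(mean)           # close to 1/2
print(second_moment)  # close to 1/3
```

The same helper works for any $f$ and any density supported on a finite interval; only the two function arguments change.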

**2. Definite Integrals for Continuous Loss**

Some loss functions may be defined continuously over the input domain $\mathcal{X}$:

$$
L = \int_{\mathcal{X}} (f_\theta(x) - y(x))^2 \, dx
$$

- Measures **total error** across all inputs.  
- Often approximated numerically in high-dimensional spaces.
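As a rough illustration, the sketch below evaluates such a loss numerically. Everything specific here is an assumption made for the example: the domain is taken to be $[0, 1]$, the target is $y(x) = x^2$, and the model is the one-parameter line $f_\theta(x) = \theta x$. For this choice the integral has the closed form $L(\theta) = \theta^2/3 - \theta/2 + 1/5$, which gives something to compare against.

```python
def continuous_loss(theta, n=10_000):
    """Midpoint-rule estimate of the integral over [0, 1] of (theta*x - x**2)**2 dx."""
    width = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * width
        residual = theta * x - x**2  # f_theta(x) - y(x) for the assumed model and target
        total += residual**2 * width
    return total


def closed_form_loss(theta):
    # Analytic value of the same integral, used only as a reference
    return theta**2 / 3 - theta / 2 + 1 / 5


loss_at_1 = continuous_loss(1.0)
loss_at_best = continuous_loss(0.75)  # theta = 3/4 minimizes the closed form
print(loss_at_1, closed_form_loss(1.0))
print(loss_at_best, closed_form_loss(0.75))
```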

**3. Probability Distribution Normalization**

For a probability density function $p(x)$:

$$
\int_{-\infty}^{\infty} p(x) \, dx = 1
$$

- Ensures total probability integrates to 1, as required for Gaussian distributions, Bayesian priors, and generative models.

**4. Cumulative Distribution Function (CDF)**

The CDF is the integral of the PDF:

$$
F(x) = \int_{-\infty}^{x} p(t) \, dt
$$

- Represents the probability that a variable is less than or equal to $x$.  
- Useful in sampling and thresholding decisions.
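Both the normalization condition and the CDF can be checked numerically. The sketch below uses the standard normal density as an example and a hand-written trapezoidal rule; truncating the lower limit $-\infty$ at $-10$ is an assumption that is harmless here because the density is negligible beyond that point. The function names are illustrative.

```python
import math


def normal_pdf(t):
    # Standard normal density
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf_numeric(x, lower=-10.0, n=20_000):
    """Trapezoidal estimate of F(x), with -infinity truncated at `lower`."""
    width = (x - lower) / n
    total = 0.5 * (normal_pdf(lower) + normal_pdf(x))
    for i in range(1, n):
        total += normal_pdf(lower + i * width)
    return total * width


def normal_cdf_exact(x):
    # Closed form via the error function, used as a reference
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


print(normal_cdf_numeric(0.0))    # close to 0.5 by symmetry
print(normal_cdf_numeric(1.96), normal_cdf_exact(1.96))
print(normal_cdf_numeric(10.0))   # close to 1: the density is normalized
```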

**5. Continuous-Time Models**

Neural ODEs and other continuous-time models use integration to evolve states:

$$
h(t_1) = h(t_0) + \int_{t_0}^{t_1} f(h(t), t, \theta) \, dt
$$

- Accumulates infinitesimal changes in the hidden state over time.
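The state-evolution integral can be sketched with the simplest possible solver, forward Euler. This is only an illustration: practical Neural ODE implementations use a neural network for $f$ and more accurate (often adaptive) solvers. The dynamics `decay`, with $f(h, t, \theta) = -\theta h$, are a hypothetical choice made because the exact solution $h(t_1) = h(t_0)\, e^{-\theta (t_1 - t_0)}$ is known.

```python
import math


def euler_integrate(f, h0, t0, t1, theta, steps=10_000):
    """Approximate h(t1) = h(t0) + integral of f(h(t), t, theta) dt with forward Euler."""
    dt = (t1 - t0) / steps
    h, t = h0, t0
    for _ in range(steps):
        h += f(h, t, theta) * dt  # accumulate one small change in the state
        t += dt
    return h


def decay(h, t, theta):
    # Hypothetical dynamics: exponential decay at rate theta
    return -theta * h


h_end = euler_integrate(decay, h0=1.0, t0=0.0, t1=1.0, theta=2.0)
h_exact = math.exp(-2.0)
print(h_end, h_exact)
```

Increasing `steps` shrinks the gap between the two printed values, which is the numerical counterpart of taking the step size to zero.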

**6. Area Under the Curve (AUC)**

In classification, integration is used to compute **Area Under the ROC Curve (AUC)**:

$$
\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d\,\text{FPR}
$$

- Quantifies overall model performance across thresholds.

##### **Example: Area Under a Polynomial Curve**

Compute the area under:

$$
f(x) = 3x^2 + 2x + 1, \quad x \in [0,2]
$$

**Solution:**

1. Apply the **sum rule** and **constant multiple rule** with the **power rule**:

$$
\int (3x^2 + 2x + 1) dx = \int 3x^2 \, dx + \int 2x \, dx + \int 1 \, dx
$$

- $\int 3x^2 \, dx = 3 \cdot \frac{x^3}{3} + C_1 = x^3 + C_1$  
- $\int 2x \, dx = 2 \cdot \frac{x^2}{2} + C_2 = x^2 + C_2$  
- $\int 1 \, dx = x + C_3$

Combine, absorbing the three constants into a single $C = C_1 + C_2 + C_3$:

$$
\int (3x^2 + 2x + 1) dx = x^3 + x^2 + x + C
$$

2. Evaluate the **definite integral** from 0 to 2:

$$
\int_0^2 (3x^2 + 2x + 1) dx = \left[ x^3 + x^2 + x \right]_0^2 = (8 + 4 + 2) - (0 + 0 + 0) = 14
$$

**Interpretation:** The area under $f(x)$ between $x = 0$ and $x = 2$ is **14**.
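The result can be cross-checked in two ways: by evaluating the antiderivative at the endpoints, and by a numerical rule. The sketch below uses composite Simpson's rule, which is exact for polynomials up to degree three, so it should agree with 14 up to floating-point rounding. The helper name `simpson` is illustrative.

```python
def simpson(f, a, b, n=100):
    """Composite Simpson's rule with n subintervals (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd index) and 2 (even index)
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def f(x):
    return 3 * x**2 + 2 * x + 1


def F(x):
    # Antiderivative found with the power, sum, and constant multiple rules (C omitted)
    return x**3 + x**2 + x


area_exact = F(2.0) - F(0.0)
area_numeric = simpson(f, 0.0, 2.0)
print(area_exact, area_numeric)
```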

##### **Key Concepts**

- Integration **accumulates contributions** over a domain.  
- Essential in **expected value computation, probabilistic modeling, continuous loss, and continuous-time dynamics**.  
- Definite integrals correspond to **area under curves**, interpreted as total probability, total loss, or accumulated contribution.  
- **The constant of integration $C$** appears in indefinite integrals but cancels in definite integrals.  
- Numerical methods (trapezoidal, Monte Carlo, `scipy.integrate.quad`) are used when analytical solutions are not feasible.

##### **Summary**

> Integration in ML is a cornerstone for **probability, expectation, continuous loss, state evolution, and performance metrics**. Mastery of integration rules and applications bridges **calculus concepts** with practical ML tasks.

---

## **Numeric Integration with Python**

*Computing Definite Integrals Numerically with* `scipy.integrate.quad`

In [141]:
from scipy.integrate import quad

$$ \int_1^2 \frac{x}{2} dx = \frac{3}{4} $$

In [142]:
def g(x):
    return x/2

In [143]:
area, error = quad(g, 1, 2) 

In [144]:
area

0.75

`quad` *returns the integral estimate together with an estimate of the absolute error of the numerical approximation; for a smooth integrand like this one, the error is typically small enough to ignore.*


In [145]:
error

8.326672684688674e-15

---

## **Binary Classification**

**Binary classification** is a supervised learning task where the goal is to assign input data to **one of two classes**, typically labeled 0 and 1.

##### **Problem Setup**

Given a dataset of $n$ examples:

$$
\{(x^{(i)}, y^{(i)})\}_{i=1}^n, \quad y^{(i)} \in \{0, 1\}
$$

The model $f_\theta(x)$ predicts:

$$
\hat{y}^{(i)} = f_\theta(x^{(i)}) \in [0, 1]
$$

Here, $\hat{y}^{(i)}$ is interpreted as the **probability of belonging to class 1**. The final class can be assigned using a threshold, e.g., 0.5.

##### **Model Example: Logistic Regression**

A simple linear model with sigmoid activation:

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \theta^T x
$$

- Maps any real-valued input to a probability.  
- Suitable for modeling **linear decision boundaries**.
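A minimal sketch of the prediction step follows. The parameter vector `theta` and the inputs are hypothetical values chosen for illustration, with the first feature fixed at 1 so that `theta[0]` acts as a bias term. The sigmoid is written in its textbook form and is not hardened against overflow for very large negative `z`.

```python
import math


def sigmoid(z):
    # Textbook form; math.exp can overflow for very large negative z
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Probability of class 1: sigmoid of the dot product theta^T x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


def predict_label(theta, x, threshold=0.5):
    """Convert the probability to a class label with a threshold."""
    return 1 if predict_proba(theta, x) >= threshold else 0


theta = [-1.0, 2.0]  # hypothetical parameters: bias -1, weight 2

for x in ([1.0, 0.0], [1.0, 0.5], [1.0, 2.0]):
    print(x, predict_proba(theta, x), predict_label(theta, x))
```

The middle input lies exactly on the decision boundary ($z = 0$), so its predicted probability is 0.5.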

##### **Cost Function**

The **binary cross-entropy (log loss)** measures the difference between predicted probabilities and true labels:

$$
J(\theta) = - \frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]
$$

- Penalizes wrong predictions more heavily when they are confident.  
- Gradients of $J(\theta)$ are used to **update parameters via gradient descent**.
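The formula can be evaluated directly. In the sketch below, the labels and predicted probabilities are made-up values, and the clipping constant `eps` is an added safeguard against $\log 0$ that is not part of the formula above.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1, 0.9, 0.1])
mild_right = binary_cross_entropy(y_true, [0.6, 0.4, 0.6, 0.4])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9, 0.1, 0.9])
print(confident_right, mild_right, confident_wrong)
```

The three values increase from left to right, which illustrates the first bullet: a confident wrong prediction costs far more than a hesitant correct one.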

##### **Key Concepts**

1. **Decision Boundary:** Separates the two classes in feature space.  
2. **Probability Interpretation:** Outputs $\hat{y} \in [0,1]$; thresholding gives class labels.  
3. **Gradient-Based Training:** Gradients guide updates to reduce the cost function.  
4. **Evaluation Metrics:** Accuracy, precision, recall, F1-score, ROC-AUC.

##### **Conceptual Understanding**

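- The model outputs a **probability** of class 1 by passing a linear score through the sigmoid.  
- The **cost function** measures how far those probabilities are from the true labels and so guides learning.  
- A **threshold** converts predicted probabilities into binary decisions.  
- **Gradients** of the cost with respect to $\theta$ drive the parameter updates.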

---

## **The Confusion Matrix**

A **confusion matrix** is a table used to evaluate the performance of a classification model by comparing **predicted labels** against **true labels**.

For binary classification:

|                | Predicted 0 | Predicted 1 |
|----------------|-------------|-------------|
| **Actual 0**   | True Negative (TN) | False Positive (FP) |
| **Actual 1**   | False Negative (FN) | True Positive (TP) |

##### **Key Metrics Derived**

- **Accuracy:** Overall correctness  

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

- **Precision:** Correctness of positive predictions  

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

- **Recall / Sensitivity / True Positive Rate (TPR):** Ability to correctly identify positives  

$$
\text{TPR} = \text{Recall} = \frac{TP}{TP + FN}
$$

- **False Positive Rate (FPR):** Proportion of negatives misclassified as positive  

$$
\text{FPR} = \frac{FP}{FP + TN}
$$

- **F1-Score:** Harmonic mean of precision and recall  

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
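The metrics above can be computed from raw labels in a few lines. This is a minimal sketch on made-up labels; the function names are illustrative, and the code assumes every denominator is nonzero (at least one actual positive, one actual negative, and one positive prediction).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels in {0, 1}."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    # Assumes nonzero denominators; a production version would guard against them
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)   # (TP, FP, TN, FN)
print(metrics)
```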

##### **Conceptual Understanding**

- The confusion matrix shows **where the model makes errors**.  
- **TPR** indicates how well the model detects positive cases.  
- **FPR** indicates how often negative cases are incorrectly labeled as positive.  
- Metrics like precision, recall, and F1-score are particularly important for **imbalanced datasets**, where accuracy alone can be misleading.

---

## **The Receiver-Operating Characteristic (ROC) Curve**

The **ROC curve** is a graphical tool to evaluate the performance of a binary classifier across **all possible thresholds**. It plots the trade-off between **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

##### **Axes of the ROC Curve**

- **X-axis:** False Positive Rate (FPR)  

$$
\text{FPR} = \frac{FP}{FP + TN}
$$

- **Y-axis:** True Positive Rate (TPR = Recall)  

$$
\text{TPR} = \frac{TP}{TP + FN}
$$

Each point on the curve corresponds to a **different classification threshold**. As the threshold changes, the TPR and FPR change, producing the curve.

##### **Key Concepts**

1. **Threshold Variation:** By lowering the threshold, more examples are predicted as positive → increases TPR but also increases FPR.  
2. **Area Under the Curve (AUC):**  
   - $\text{AUC} = 1$ → perfect classifier  
   - $\text{AUC} = 0.5$ → random guessing  
   - Higher AUC indicates better overall performance.  
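3. **Comparison Tool:** ROC curves are useful for comparing classifiers. Because TPR and FPR are each computed within a single class, the curve does not depend on the class ratio; with heavily **imbalanced datasets**, precision and recall are often examined as well.  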
4. **Connection to Confusion Matrix:** Each point on the ROC curve can be derived from the **confusion matrix** at a specific threshold, using TPR and FPR.
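The threshold sweep can be sketched as follows. The labels and scores are hypothetical, chosen so that the resulting points match the coordinates used in the Area Under the ROC Curve section; `roc_points` is an illustrative helper, not a library function, and it assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing as positive: the point (0, 0)
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


y_true = [0, 0, 1, 1]            # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities

points = roc_points(y_true, scores)
print(points)
```

Each lowering of the threshold admits one more example as positive, moving the point up (a true positive) or right (a false positive).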

##### **Conceptual Understanding**

- ROC curves show the **trade-off between sensitivity and specificity**.  
- They help **select an optimal threshold** depending on the cost of false positives vs false negatives.  
- A classifier whose curve lies closer to the **top-left corner** of the ROC space (high TPR at low FPR) is generally better.

---

## **Area Under the ROC Curve**

When we have discrete $(x, y)$ coordinates instead of a continuous function, we can compute the area under the curve using scikit-learn's `auc()` function. This function uses the [trapezoidal rule](https://en.wikipedia.org/wiki/Trapezoidal_rule) to approximate the integral of the curve defined by the coordinates.

In [146]:
from sklearn.metrics import auc

The $(x, y)$ coordinates of the ROC curve are:

* $(0, 0)$
* $(0, 0.5)$
* $(0.5, 0.5)$
* $(0.5, 1)$
* $(1, 1)$

In [147]:
FPR = [0, 0, 0.5, 0.5, 1]
TPR = [0, 0.5, 0.5, 1, 1]

In [148]:
auc(FPR, TPR)

0.75
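To see where 0.75 comes from, the trapezoidal rule can be written out by hand. This is a sketch of the rule itself, not scikit-learn's implementation; `trapezoid_area` is an illustrative name.

```python
def trapezoid_area(xs, ys):
    """Sum the areas of the trapezoids between consecutive (x, y) points."""
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2
        area += width * mean_height
    return area


FPR = [0, 0, 0.5, 0.5, 1]
TPR = [0, 0.5, 0.5, 1, 1]

area = trapezoid_area(FPR, TPR)
print(area)  # 0.75
```

The two vertical segments have zero width and contribute nothing; the two horizontal segments contribute $0.5 \times 0.5 = 0.25$ and $0.5 \times 1 = 0.5$.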

---