# Chapter 3

## 3.1 PAC Learning

#### Definition 3.1 (PAC Learnability)
> A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_\mathcal{H}: (0,1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon, \delta \in (0,1)$, for every distribution $D$ over $\mathcal{X}$, and for every labeling function $f: \mathcal{X} \rightarrow \{0,1\}$, if the realizable assumption holds with respect to $\mathcal{H}, D, f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $D$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the examples), $L_{(D, f)}(h) \le \epsilon$.

The accuracy parameter $\epsilon$ determines how far the output classifier can be from the optimal one (*approximately* correct).

The confidence parameter $\delta$ indicates how likely the classifier is to meet that accuracy requirement (*probably*).

The function $m_\mathcal{H}$ determines the **sample complexity** of learning $\mathcal{H}$. It is the number of examples required to guarantee a PAC solution.

#### Corollary 3.2
> Every finite hypothesis class is PAC learnable with a bounded sample complexity:
$$
m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log(|\mathcal{H}|/\delta)}{\epsilon} \right\rceil
$$
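
As a quick numerical illustration of the bound in Corollary 3.2, the following minimal sketch evaluates it for given parameters. The function name `finite_class_sample_bound` and its argument names are illustrative choices, not notation from the text; the logarithm is assumed to be the natural logarithm, and the result is rounded up because a sample size must be an integer.

```python
import math


def finite_class_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Upper bound on m_H(epsilon, delta) for a finite class (realizable case).

    Assumption: natural logarithm, rounded up to the next integer.
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(hypothesis_count / delta) / epsilon)
```

For instance, with $|\mathcal{H}| = 1000$, $\epsilon = 0.1$ and $\delta = 0.05$ the bound evaluates to 100 examples. Halving $\epsilon$ roughly doubles it, while shrinking $\delta$ has only a logarithmic effect.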

## 3.2 A More General Learning Model
### Motivation
The model described can be generalised so that we can tackle a wider scope of learning tasks. We would like to:

- Remove the Realisability Assumption
- Learn problems beyond binary classification

### Agnostic PAC Learning
First, we remove the realisability assumption, which rarely holds in practice. It is also more realistic not to assume that the labels are fully determined by the measured features of the input: two examples with the same input features may have different labels.

We relax the realisability assumption by replacing the "target labeling function" with a more flexible data-labels generating distribution.

Formally
> Let $D$ now be a probability distribution over $\mathcal{X} \times \mathcal{Y}$, that is, a joint distribution over domain points and labels. Such a distribution can be viewed as two parts: a marginal distribution over domain points and, for each domain point, a conditional distribution over labels.

Our goal is to find some hypothesis $h$ that (probably approximately) minimises the true risk, $L_D(h) = \Pr_{(x, y) \sim D}[h(x) \ne y]$.

#### Bayes Optimal Predictor
If one knows the true distribution $D$ (with binary labels), one can construct the Bayes optimal predictor, which predicts 1 if $\Pr(y=1 \mid x) \ge 1/2$ and 0 otherwise. No classifier achieves a lower true error than the Bayes optimal predictor, so its error is a lower bound on the true error of every classifier.
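
The construction can be made concrete on a small finite domain. The sketch below assumes the conditional probabilities $\Pr(y=1 \mid x)$ are given as a table, which is never the case in a real learning problem. The three-point domain, its probabilities, and the helper names `bayes_optimal_predictor` and `true_error` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict


def bayes_optimal_predictor(prob_label_one: Dict[str, float]) -> Callable[[str], int]:
    """Build the predictor that outputs 1 iff Pr(y=1 | x) >= 1/2."""
    def predict(x: str) -> int:
        return 1 if prob_label_one[x] >= 0.5 else 0
    return predict


def true_error(predict: Callable[[str], int],
               marginal: Dict[str, float],
               prob_label_one: Dict[str, float]) -> float:
    """0-1 risk: sum over x of D(x) * Pr(y != predict(x) | x)."""
    total = 0.0
    for x, prob_x in marginal.items():
        eta = prob_label_one[x]
        # Predicting 1 is wrong with probability 1 - eta; predicting 0 with probability eta.
        total += prob_x * ((1.0 - eta) if predict(x) == 1 else eta)
    return total


# Hypothetical toy distribution (an assumption, not data from the text).
marginal = {"a": 0.5, "b": 0.3, "c": 0.2}        # distribution over domain points
prob_label_one = {"a": 0.9, "b": 0.2, "c": 0.5}  # Pr(y=1 | x) for each point

bayes = bayes_optimal_predictor(prob_label_one)
bayes_error = true_error(bayes, marginal, prob_label_one)
```

On this toy distribution the predictor outputs 1, 0, 1 on the three points and has true error $0.5 \cdot 0.1 + 0.3 \cdot 0.2 + 0.2 \cdot 0.5 = 0.21$. None of the other seven deterministic labelings of the domain does better.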

#### Definition 3.3 (Agnostic PAC Learnability)
> A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_\mathcal{H}: (0,1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon, \delta \in (0,1)$ and for every distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, when running the learning algorithm on $m \ge m_\mathcal{H}(\epsilon,\delta)$ i.i.d. examples generated by $D$, the algorithm returns a hypothesis $h$ such that, with probability at least $1-\delta$ (over the choice of the $m$ training examples),
$$
L_D(h) \le \min_{h' \in \mathcal{H}} L_D(h') + \epsilon
$$

If the realisability assumption holds, agnostic PAC learnability provides the same guarantee as PAC learning, so agnostic PAC learning is the more general notion. If the realisability assumption does not hold, no learner can guarantee an arbitrarily small error. PAC learning requires the learner to achieve a small error in absolute terms; under agnostic PAC learning, a learner is still considered successful if its error is not much larger than the best error achievable by a predictor from the same hypothesis class.
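
To spell out why the two guarantees coincide in the realisable case: realisability means some hypothesis in $\mathcal{H}$ has zero true risk, and since the risk is nonnegative, the minimum over the class is zero. The bound in Definition 3.3 then reduces to the one in Definition 3.1:

$$
\min_{h' \in \mathcal{H}} L_D(h') = 0 \quad \Longrightarrow \quad L_D(h) \le \min_{h' \in \mathcal{H}} L_D(h') + \epsilon = \epsilon
$$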

### Generalized Loss Functions
Given a hypothesis class $\mathcal{H}$ and some domain $Z$, let $\ell$ be any function from $\mathcal{H} \times Z$ to the set of nonnegative real numbers, $\ell: \mathcal{H} \times Z \rightarrow \mathbb{R}^+$. Such functions are called **loss functions**.

Risk function: the expected loss of a hypothesis $h \in \mathcal{H}$ with respect to a probability distribution $D$ over $Z$.

Empirical risk: the average loss over a given sample $S = (z_1, \ldots, z_m) \in Z^m$, that is, $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$.

Agnostic PAC learnability for general loss functions is defined exactly as in Definition 3.3, with $D$ a distribution over $Z$ and the risk given by $L_D(h) = \mathbb{E}_{z \sim D}[\ell(h, z)]$.
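
To make these definitions concrete, here is a minimal sketch that treats a loss as a function of a hypothesis and an example $z = (x, y)$ and averages it over a sample. The 0-1 loss and the squared loss are two standard instances; the function names, the threshold classifier, and the toy samples are assumptions introduced for illustration.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]            # z = (x, y)
Hypothesis = Callable[[float], float]    # h maps x to a prediction
Loss = Callable[[Hypothesis, Example], float]


def zero_one_loss(h: Hypothesis, z: Example) -> float:
    """0-1 loss: 0 if h predicts the label of z correctly, 1 otherwise."""
    x, y = z
    return 0.0 if h(x) == y else 1.0


def squared_loss(h: Hypothesis, z: Example) -> float:
    """Squared loss, commonly used for regression."""
    x, y = z
    return (h(x) - y) ** 2


def empirical_risk(h: Hypothesis, sample: Sequence[Example], loss: Loss) -> float:
    """Average loss of h over the sample."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(loss(h, z) for z in sample) / len(sample)


# Hypothetical toy data (assumptions, not taken from the text).
binary_sample = [(0.5, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
regression_sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]


def threshold(x: float) -> float:
    """Classifier that predicts 1 when x >= 1.5."""
    return 1.0 if x >= 1.5 else 0.0


def identity(x: float) -> float:
    """Regressor that predicts y = x."""
    return x


classification_risk = empirical_risk(threshold, binary_sample, zero_one_loss)
regression_risk = empirical_risk(identity, regression_sample, squared_loss)
```

The same `empirical_risk` function serves both tasks; only the loss changes, which is the point of the generalised formulation. In both toy cases exactly one of the four examples contributes a loss of 1, so each empirical risk is 0.25.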