# PAC Learning Framework

In this section, I will concisely cover the Probably Approximately Correct (PAC) learning framework. 

## Motivation
In a sentence, the PAC framework quantifies the requirements (number of training points, time complexity, and space complexity) to approximately learn patterns in the data. THIS IS HUGE! A simple example is learning linear versus nonlinear maps between the inputs and outputs: linear patterns intuitively require fewer points to learn. Concretely, we could look at predicting height from weight data. If we desired a linear pattern, then we only need to search over $\mathbb{R}_+$ to find a parameter $\alpha$ such that $\text{height}=\alpha\cdot \text{weight}$. If instead we desire a nonlinear pattern, then think about the expansion of the search space! We could fit polynomials, trigonometric functions, and the list goes on. I believe this provides sufficient motivation for why we care deeply about quantifying the number of samples we need depending on the relationships we deem important (linear, nonlinear, indicator functions, exponentials, etc.). Now, I turn to mathematics for precision and clarity.

## Some Notations

- Let $\mathcal{X}$ denote the input space (includes all data excluding the output variable)

- Let $\mathcal{Y}$ denote the set of labels/output/target variables (as in [1], I limit discussion to binary outputs right now)

- Examples $(x,y)\in\mathcal{X}\times\mathcal{Y}$ are assumed to be drawn independently and identically distributed (i.i.d.) according to some unknown distribution $D$


## Definitions

**Concept**: A concept $c:\mathcal{X}\rightarrow\mathcal{Y}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$.

**Concept Class**: A concept class $\mathcal{C}$ is a set of concepts we may wish to learn.  

## Examples

- Let us consider the features to be height, weight, age, ethnicity, and cancer (boolean), with the target being death (boolean). Then let us discuss concepts: for instance, all instances of Germans over 6' tall, weighing more than 300 lbs, with cancer, and over the age of 55 get mapped to 1, i.e. they're dead (this is made up). This concept is thus an indicator function. We could consider a concept class to be the set of unions of similar indicator functions. This type of concept class forms the basis for rule-based algorithms, which will be discussed in a separate notebook.

- In the book, they discuss out-of-the-ordinary examples, such as the concept class of triangles or circles or other geometric figures. While these are technically concept classes, they are rarely used in practice.
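To make the first example above concrete, here is a minimal sketch of that made-up concept as a Python predicate. The `Person` type, its field names, the units, and the `union_of` helper are all assumptions introduced for illustration; they are not from [1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical feature record; field names and units are illustrative."""
    height_in: float  # height in inches
    weight_lb: float  # weight in pounds
    age: int
    ethnicity: str
    cancer: bool


def concept(p: Person) -> int:
    """Indicator concept from the made-up example: 1 if every condition holds."""
    return int(
        p.ethnicity == "German"
        and p.height_in > 72   # over 6'
        and p.weight_lb > 300
        and p.cancer
        and p.age > 55
    )


def union_of(*concepts):
    """A concept built as a union (logical OR) of indicator concepts."""
    return lambda p: int(any(c(p) for c in concepts))


def elderly(p: Person) -> int:
    """A second, equally made-up indicator concept."""
    return int(p.age > 95)


either = union_of(concept, elderly)
```

A concept class in this picture is the set of all functions that `union_of` can produce from some fixed family of indicators.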

## Definitions

**Generalization Error**: Given a hypothesis $h\in\mathcal{H}$ (where the hypothesis set $\mathcal{H}$ is the set of candidate functions the learner may return), a target concept $c\in\mathcal{C}$, and an underlying distribution $D$, the generalization error or risk of $h$ is defined by 

$$R(h) = P_{x\sim D}\left[h(x)\neq c(x)\right] = \mathbb{E}_{x\sim D}\left[\mathbb{I}_{h(x)\neq c(x)}\right].$$

In words, it is the probability that the hypothesis does not align with the target concept. This is of course unknown as we do not know $D$; however, we can measure the empirical error of the hypothesis based on a labeled sample of the data.

**Empirical Error**: Given a hypothesis $h\in\mathcal{H}$, a target concept $c\in\mathcal{C}$, and a sample $S=(x_1,\dots,x_m)$, the empirical error or risk of $h$ is defined by 

$$\hat{R}(h) = \frac{1}{m}\sum\limits_{i=1}^m \mathbb{I}_{h(x_i)\neq c(x_i)}.$$

In words, it is the fraction of points in the sample $S$ on which the hypothesis disagrees with the target concept, i.e. the average error over the sample.
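As a small sketch of the two definitions above, the snippet below computes the empirical error of one hypothesis on a sample. The threshold functions `c` and `h` and the uniform distribution on $[0,1)$ are assumptions chosen so that the generalization error is known exactly ($R(h)=0.1$).

```python
import random


def empirical_error(h, c, sample):
    """Fraction of sample points on which hypothesis h disagrees with concept c."""
    return sum(h(x) != c(x) for x in sample) / len(sample)


# Hypothetical target concept and hypothesis on X = [0, 1): two thresholds.
def c(x):
    return int(x >= 0.5)


def h(x):
    return int(x >= 0.6)


rng = random.Random(0)
S = [rng.random() for _ in range(1000)]  # D is assumed uniform on [0, 1)

# h and c disagree exactly on [0.5, 0.6), so R(h) = 0.1 under this D.
err = empirical_error(h, c, S)
print(f"empirical error: {err:.3f} (generalization error is 0.1)")
```

With a different sample the empirical error changes, while the generalization error stays fixed; that gap is what the bounds below control.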

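**PAC-learning**: A concept class $\mathcal{C}$ is said to be PAC-learnable if there exists an algorithm $\mathcal{A}$ and a polynomial function $poly(\cdot,\cdot,\cdot,\cdot)$ such that for any $\epsilon>0$ and $\delta>0$, for all distributions $D$ on $\mathcal{X}$ and for any target concept $c\in\mathcal{C}$, the following holds for any sample size $m\geq poly(\epsilon^{-1},\delta^{-1},n,size(c))$, where $h_S$ denotes the hypothesis returned by $\mathcal{A}$ on the labeled sample $S$, $n$ bounds the cost of representing an element $x\in\mathcal{X}$, and $size(c)$ is the cost of representing $c$: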

$$\mathbb{P}_{S\sim D^m} \left[R(h_S)\leq \epsilon\right]\geq 1-\delta.$$ 

If $\mathcal{A}$ further runs in time $poly(\epsilon^{-1},\delta^{-1},n,size(c))$, then $\mathcal{C}$ is said to be efficiently PAC-learnable. When such an algorithm $\mathcal{A}$ exists, it is called a PAC-learning algorithm for $\mathcal{C}$.

*Note on PAC-learning*: This definition is CRAZY!! The samples can come from any distribution of my choosing (the exponential, Cauchy, uniform, a crazy convolution of all three, etc.) and the guarantee still holds. The definition states that after enough samples (whose exact number is bounded by some polynomial in $1/\epsilon$, $1/\delta$, the dimension of the data, and the size of the concept representation), the generalization error of the hypothesis returned by the algorithm is bounded above by $\epsilon$ with probability at least $1-\delta$. HOWEVER, the test data must follow the same distribution as the training set. In other words, we impose a stationarity condition that is all too common and all too unrealistic in practice. FURTHERMORE, we make a statement about the entire concept class, so we must cast quite a wide net to catch all the possible fish (target concepts). I wonder if this can be relaxed to just focus on a subset of a concept class...

## Theorems

**Learning bounds - finite H, consistent case**: Let $H$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that for any target concept $c\in H$ and i.i.d sample $S$ returns a consistent hypothesis $h_S$: $\hat{R}(h_S)=0$. Then, for any $\epsilon,\delta>0$, the inequality $P_{S\sim D^m}[R(h_S)\leq \epsilon]\geq 1-\delta$ holds if 

$$m\geq \frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right).$$
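The bound is easy to evaluate numerically. The helper below is a minimal sketch (the function name `sample_complexity` is my own) that returns the smallest integer $m$ satisfying the inequality, using the natural logarithm.

```python
import math


def sample_complexity(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (log|H| + log(1/delta)) / epsilon."""
    if h_size < 1:
        raise ValueError("h_size must be a positive integer")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


# |H| = 1000 hypotheses, 5% error, 99% confidence.
print(sample_complexity(1000, 0.05, 0.01))    # 231
# Squaring |H| only adds log|H| / epsilon more points.
print(sample_complexity(10**6, 0.05, 0.01))   # 369
```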

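**Proof**: Fix $\epsilon > 0$. We do not know which consistent hypothesis the algorithm returns, so we bound the probability that some hypothesis in $H$ is consistent ($\hat{R}(h)=0$) and yet has error greater than $\epsilon$: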

$$\mathbb{P}\left\{\exists h\in H: \hat{R}(h)=0\cap R(h)>\epsilon\right\}$$

$$=\mathbb{P}\left\{\bigcup\limits_{h\in H} \left\{\hat{R}(h)=0\cap R(h)>\epsilon\right\}\right\}$$

$$\leq^1 \sum\limits_{h\in H} \mathbb{P}\left\{\hat{R}(h)=0\cap R(h)>\epsilon\right\}$$

$$=^2 \sum\limits_{h\in H} \mathbb{P}\left\{\hat{R}(h)=0|R(h)>\epsilon\right\}\mathbb{P}\left\{R(h)>\epsilon\right\}$$

$$\leq^3 \sum\limits_{h\in H} \mathbb{P}\left\{\hat{R}(h)=0|R(h)>\epsilon\right\}$$

$$=^4 \sum\limits_{h\in H} \mathbb{P}\left\{\bigcap\limits_{i=1}^m \{h(x_i)= c(x_i)\}|R(h)>\epsilon\right\}$$

$$=^5 \sum\limits_{h\in H}\prod\limits_{i=1}^m \mathbb{P}\left\{h(x_i)=c(x_i)|R(h)>\epsilon\right\}$$

$$\leq^6 \sum\limits_{h\in H}\prod\limits_{i=1}^m (1-\epsilon)$$

$$=|H|\cdot (1-\epsilon)^m.$$

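where $\leq^1$ follows by subadditivity (the union bound), $=^2$ follows by definition of conditional probability, $\leq^3$ follows as probabilities are bounded above by one, $=^4$ follows by definition of the empirical error, $=^5$ follows by independence of the data points sampled, and $\leq^6$ follows as 

$$R(h)>\epsilon\Rightarrow P(h(x)\neq c(x))\geq \epsilon\Rightarrow P(h(x)=c(x))\leq 1-\epsilon.$$

To finish, use $1-\epsilon\leq e^{-\epsilon}$ to get $|H|\cdot(1-\epsilon)^m\leq |H|e^{-m\epsilon}$. Requiring the right-hand side to be at most $\delta$ and solving for $m$ gives $m\geq \frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right)$, which is the stated bound.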

*Intuition Check*: Let us check what we just said. The required number of points scales like $1/\epsilon$: halving the error we tolerate doubles the number of points we need. We made the statement that this holds for ANY distribution, so a price of this kind makes sense. Likewise, it makes sense that more hypotheses imply we need more training points: more candidate functions (a larger function space) means we need more points to rule out the bad ones. The dependence is only logarithmic in $|H|$, though. Finally, the number of points we need also grows as $\delta$ shrinks, but only logarithmically in $1/\delta$, so high confidence is cheap compared to high accuracy.
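A quick simulation can sanity-check the theorem. This is only a sketch under assumptions of my own choosing: $\mathcal{X}=[0,1)$ with $D$ uniform, $H$ the finite set of thresholds $h_k(x)=\mathbb{I}\{x\geq k/N\}$ for $k=0,\dots,N$, and a pessimistic learner that returns the consistent hypothesis with the largest true error. Under uniform $D$, $R(h_k)=|k-k^\ast|/N$ when the target is $h_{k^\ast}$.

```python
import math
import random


def worst_consistent_error(rng, n_grid, k_star, m):
    """Draw m uniform points labeled by threshold k_star / n_grid and return the
    largest true error among hypotheses in H with zero empirical error."""
    xs = [rng.random() for _ in range(m)]
    ys = [int(x >= k_star / n_grid) for x in xs]
    worst = 0.0
    for k in range(n_grid + 1):
        if all(int(x >= k / n_grid) == y for x, y in zip(xs, ys)):
            # Under uniform D, h_k and the target disagree on an interval
            # of length |k - k_star| / n_grid.
            worst = max(worst, abs(k - k_star) / n_grid)
    return worst


n_grid, k_star = 100, 37          # |H| = 101, arbitrary target threshold
epsilon, delta = 0.05, 0.05
m = math.ceil((math.log(n_grid + 1) + math.log(1 / delta)) / epsilon)

rng = random.Random(0)
trials = 500
failures = sum(
    worst_consistent_error(rng, n_grid, k_star, m) > epsilon for _ in range(trials)
)
failure_rate = failures / trials
print(f"m = {m}, observed failure rate = {failure_rate:.3f}, delta = {delta}")
```

The observed failure rate should not exceed $\delta$, and for this particular $H$ and $D$ it is typically far smaller, which reflects how loose the union bound can be.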

**Hoeffding's Lemma**: Let $X$ be a random variable with $\mathbb{E}[X]=0$ and $X\in [a,b]$. Then for any $t>0$, 

$$\mathbb{E}[e^{tX}]\leq e^{\frac{t^2(b-a)^2}{8}}.$$

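**Proof**: We first note that, since $X$ takes values in $[a,b]$, we have $\left|X-\frac{b+a}{2}\right|\leq\frac{b-a}{2}$, and therefore

$$\mathbb{V}[X]=\mathbb{V}\left[X-\frac{b+a}{2}\right]$$

$$\leq\mathbb{E}\left[\left(X-\frac{b+a}{2}\right)^2\right]$$

$$\leq \mathbb{E}\left[\left(b-\frac{b+a}{2}\right)^2\right]=\frac{(b-a)^2}{4}.$$

Note that this bound uses only that $X$ is supported on $[a,b]$, not that $\mathbb{E}[X]=0$.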

Then let $\psi_X(t)=\log\mathbb{E}[e^{tX}]$ denote the logarithmic MGF. We can then deduce that 

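$$\psi''_X(t)=e^{-\psi_X(t)}\mathbb{E}\left[X^2e^{tX}\right]-e^{-2\psi_X(t)}\left(\mathbb{E}\left[Xe^{tX}\right]\right)^2$$
$$=\mathbb{V}_t[X]\leq \frac{(b-a)^2}{4},$$

where $\mathbb{V}_t$ denotes the variance of $X$ under the tilted distribution with density $e^{tX-\psi_X(t)}$ with respect to the law of $X$. This distribution is still supported on $[a,b]$, so the variance bound above applies to it.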

Then ultimately as $\psi_X(0)=\psi'_X(0)=0$, we have by Taylor's theorem for some $\theta\in[0,t]$ that 

$$\psi_X(t)=\psi_X(0)+t\,\psi'_X(0)+\frac{t^2}{2}\psi_X''(\theta)=\frac{t^2}{2}\psi_X''(\theta)$$

$$\leq \frac{t^2(b-a)^2}{8}$$

$$\iff e^{\psi_X(t)}=\mathbb{E}[e^{tX}]\leq e^{\frac{t^2(b-a)^2}{8}}.$$

*Intuition Check/Recap*: This neat proof evades the dumb argument found in most textbooks. Here we take the logarithm of the left-hand side of the inequality in the lemma, Taylor expand, and find the only term left is a variance (of $X$ under an exponentially tilted distribution), which we can upper bound as $X$ is a bounded random variable. Very cool proof, credited to Boucheron, Lugosi, and Massart in "Concentration Inequalities".
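The lemma can be checked numerically on a case where the MGF is available in closed form. The sketch below assumes a two-point variable on $\{a,b\}$ with $a<0<b$, with the probabilities chosen so that $\mathbb{E}[X]=0$; the function names are my own.

```python
import math


def mgf_two_point(t, a, b):
    """Exact E[exp(tX)] for the mean-zero variable supported on {a, b}, a < 0 < b."""
    p = -a / (b - a)  # P(X = b), chosen so that p*b + (1-p)*a = 0
    return p * math.exp(t * b) + (1 - p) * math.exp(t * a)


def hoeffding_lemma_bound(t, a, b):
    """Right-hand side of Hoeffding's lemma: exp(t^2 (b - a)^2 / 8)."""
    return math.exp(t**2 * (b - a) ** 2 / 8)


a, b = -1.0, 3.0
ts = [0.1 * k for k in range(1, 31)]  # t in (0, 3]
holds = all(mgf_two_point(t, a, b) <= hoeffding_lemma_bound(t, a, b) for t in ts)
print("lemma holds on the grid:", holds)
```

For the symmetric case $a=-1$, $b=1$ the check reduces to the familiar inequality $\cosh(t)\leq e^{t^2/2}$.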

**Hoeffding's Inequality**: Let $X_1,\dots,X_m$ be independent random variables with $X_i$ taking values in $[a_i,b_i]$ for all $i\in[m]$. Then for any $\epsilon > 0$, the following inequalities hold for $S_m=\sum_{i=1}^m X_i$:

$$\begin{cases} \mathbb{P}\left[S_m-\mathbb{E}[S_m]\geq \epsilon\right] \leq e^{-2\epsilon^2/\sum_{i=1}^m (b_i-a_i)^2} \\ \mathbb{P}\left[S_m - \mathbb{E}[S_m]\leq -\epsilon\right]\leq e^{-2\epsilon^2/\sum_{i=1}^m (b_i-a_i)^2}\end{cases}$$
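To see the inequality in action, the following sketch compares the upper-tail bound with a simulated tail frequency. It assumes i.i.d. Bernoulli($p$) variables, so every $X_i$ takes values in $[0,1]$ and $\mathbb{E}[S_m]=mp$; the parameter values are arbitrary.

```python
import math
import random


def hoeffding_tail_bound(eps, ranges):
    """Right-hand side exp(-2 eps^2 / sum_i (b_i - a_i)^2) for ranges [(a_i, b_i)]."""
    return math.exp(-2 * eps**2 / sum((b - a) ** 2 for a, b in ranges))


m, p, eps = 100, 0.3, 10.0
bound = hoeffding_tail_bound(eps, [(0.0, 1.0)] * m)  # equals exp(-2) here

rng = random.Random(0)
trials = 5000
hits = 0
for _ in range(trials):
    s_m = sum(rng.random() < p for _ in range(m))  # one draw of S_m
    hits += s_m - m * p >= eps                     # upper-tail event
tail_freq = hits / trials
print(f"simulated tail: {tail_freq:.4f}, Hoeffding bound: {bound:.4f}")
```

The simulated frequency should sit below the bound. The bound is distribution-free, so for a specific distribution such as this one it is usually not tight.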

# Sources

[1] Mohri, M., Rostamizadeh, A., and Talwalkar, A. "Foundations of Machine Learning", MIT Press, 2012.