# Chapter 1 The Learning Problem

Learning from data is used in situations where we don't have an analytic solution, but we do have data that we can use to construct an empirical solution.

## 1.1 Problem Setup
Consider the problem of predicting how a movie viewer would rate the various movies out there. The power of learning from data is that we don't need to analyze movie content or viewer taste. The learning algorithm 'reverse-engineers' the factors (which can impact rating) based solely on previous ratings.

### 1.1.2 A Simple Learning Model

#### The Perceptron Classifier
* The output space: $\mathcal{Y}=\{+1,-1\}$
* The hypothesis space $\mathcal{H}$ is called the perceptron if its functions are of the form $h(x) = \mathrm{sign}\left(\left(\sum^{d}_{i=1}w_ix_i\right)+b\right)$, or $\mathrm{sign}(w^Tx)$ if we merge $b$ into $w$ by setting $w_0 = b$ and $x_0 = 1$
* The perceptron learning algorithm (PLA)
  * Update rule: at each iteration $t$, pick a currently misclassified point $\left(x(t),y(t)\right)$ and update $w(t+1) = w(t) + y(t)x(t)$
  * The update rule moves in the direction of classifying $x(t)$ correctly.
* If the data set is **linearly separable**, **PLA** is guaranteed to find some parameters $w,b$ that classify all the training examples correctly.
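
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `pla`, the `max_iters` cap, the random choice among misclassified points, the convention $\mathrm{sign}(0) = -1$, and the four toy data points are all assumptions made for this example.

```python
import random


def sign(z):
    # Assumed convention: sign(0) is treated as -1.
    return 1 if z > 0 else -1


def pla(points, labels, max_iters=10_000, seed=0):
    """Run the perceptron learning algorithm on a list of feature tuples."""
    rng = random.Random(seed)
    # Prepend x_0 = 1 to every point so that w[0] plays the role of b.
    xs = [(1.0, *p) for p in points]
    w = [0.0] * len(xs[0])
    for _ in range(max_iters):
        # Collect the indices of all currently misclassified points.
        misclassified = [
            i for i, (x, y) in enumerate(zip(xs, labels))
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y
        ]
        if not misclassified:
            return w  # every training example is classified correctly
        # Pick one misclassified point and apply w <- w + y * x.
        i = rng.choice(misclassified)
        w = [wi + labels[i] * xi for wi, xi in zip(w, xs[i])]
    return w  # reached the cap; the data may not be linearly separable


# Hypothetical, linearly separable toy data in two dimensions.
points = [(2.0, 3.0), (3.0, 3.5), (-1.0, -2.0), (-2.0, -1.5)]
labels = [1, 1, -1, -1]

w = pla(points, labels)
predictions = [sign(w[0] + w[1] * x1 + w[2] * x2) for x1, x2 in points]
```

The `max_iters` cap is only a safeguard: on data that is not linearly separable, the loop would otherwise never terminate.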

### 1.1.3 Learning versus Design

* The learning process constructs a model based on data
* The design process analytically constructs a physical model based on specifications and does not use data

## 1.2 Types of Learning
### 1.2.1 Supervised Learning

The form of training examples: (input, correct output)

#### Variations
* Active Learning: The data set is acquired through queries that we make. Thus, we get to choose a point $x$ in the input space, and the supervisor reports to us the target value for $x$. 

* Online Learning: The data set is given to the algorithm one example at a time.

### 1.2.2 Reinforcement Learning

* The training sample does not contain the target output, but instead contains some possible output together with a measure of how good that output is. Importantly, the example does not say how good other outputs would have been for this particular input. 

* The form of training examples: (input, some output, grade for this output)

### 1.2.3 Unsupervised Learning
* The form of training examples: (input)
* Unsupervised learning can be viewed as the task of spontaneously finding patterns and structure in input data.
* Unsupervised learning can also be viewed as a way to create a higher level representation of the data. 

### 1.2.4 Other Views of Learning
* Machine Learning
* Statistical Learning: It uses a set of observations to uncover an underlying process. The process is a probability distribution and the observations are samples from that distribution. It focuses on somewhat idealized models and analyzes them in great detail. 
* Data Mining: It's a practical field that focuses on finding patterns, correlations or anomalies in large relational databases. It emphasizes data analysis more than prediction.

## 1.3 Is Learning Feasible?
### 1.3.1 Outside the Data Set

* Does the data set $\mathcal{D}$ tell us anything outside of $\mathcal{D}$ that we didn't know before? If the answer is yes, then we have learned something. If the answer is no, we can conclude that learning is not feasible.
* Dilemma: As long as the true model $f$ is an unknown function, knowing the data set $\mathcal{D}$ cannot exclude any pattern of values for $f$ outside of $\mathcal{D}$. Therefore, the predictions (of the final hypothesis $g$) outside of $\mathcal{D}$ are meaningless.
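
The dilemma can be made concrete with a small enumeration. The sketch below assumes a Boolean target on 3-bit inputs and a hypothetical data set `D` that labels 5 of the 8 possible inputs; the specific labels are made up for illustration. It counts the target functions that agree with `D` and checks how they vote on the points outside `D`.

```python
from itertools import product

# All 8 possible inputs for a 3-bit Boolean input space.
inputs = list(product([0, 1], repeat=3))

# Hypothetical data set D: known labels for 5 of the 8 inputs.
D = {
    (0, 0, 0): -1,
    (0, 0, 1): +1,
    (0, 1, 0): +1,
    (0, 1, 1): -1,
    (1, 0, 0): +1,
}
outside = [x for x in inputs if x not in D]

# Enumerate all 2**8 candidate target functions and keep those that agree with D.
consistent = []
for values in product([-1, +1], repeat=len(inputs)):
    f = dict(zip(inputs, values))
    if all(f[x] == y for x, y in D.items()):
        consistent.append(f)

n_consistent = len(consistent)  # 2**3 = 8 functions remain

# For each point outside D, count how many consistent functions output +1.
plus_counts = {x: sum(f[x] == +1 for f in consistent) for x in outside}
```

Every point outside `D` is labeled $+1$ by exactly half of the consistent functions and $-1$ by the other half, so the data alone gives no reason to prefer either prediction there.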

### 1.3.2 Probability to the Rescue

#### Hoeffding Inequality
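Let $\mu$ be the population mean of a quantity bounded in $[0,1]$ (e.g., the fraction of red marbles in a bin), and let $\nu$ be the corresponding sample mean over $N$ independent draws. Then, for any sample size $N$, we have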

\begin{align}
P\left[|\nu - \mu| \gt \epsilon \right] \le 2 e^{-2\epsilon^2N}\;\;\;\text{for any}\;\epsilon \gt 0
\end{align}

It says that as the sample size $N$ grows, it becomes exponentially unlikely that $\nu$ will deviate from $\mu$ by more than our 'tolerance' $\epsilon$.
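
To get a feel for the numbers, the bound can be evaluated directly and compared against a simulation. The sketch below is illustrative only: the helper names `hoeffding_bound` and `deviation_frequency`, the choice of a Bernoulli population, and all parameter values ($\epsilon = 0.08$, $N = 1000$ for the bound; $\mu = 0.5$, $N = 100$, $\epsilon = 0.1$ for the simulation) are assumptions.

```python
import math
import random


def hoeffding_bound(eps, n):
    """Right-hand side of the Hoeffding inequality: 2 * exp(-2 * eps^2 * n)."""
    return 2.0 * math.exp(-2.0 * eps**2 * n)


def deviation_frequency(mu, n, eps, trials=2000, seed=0):
    """Empirical frequency of |nu - mu| > eps over repeated samples of size n."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Draw n Bernoulli(mu) values and compute the sample mean nu.
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials


# Bound for a tolerance of 0.08 and 1000 samples (about 5.52e-06).
bound = hoeffding_bound(0.08, 1000)

# Simulated deviation frequency versus the bound for a smaller sample.
freq = deviation_frequency(mu=0.5, n=100, eps=0.1)
bound_small = hoeffding_bound(0.1, 100)  # 2 * exp(-2), roughly 0.27
```

The simulated frequency should fall well below `bound_small`, which reflects that the inequality holds for every $\mu$ and is therefore often loose for any particular one.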
