# What does it mean to learn?


## Intuitively

### Generalization

Take, for example, a student who has to answer some questions on a test. Generalizing means that, having seen some specific questions together with their correct answers, the student can correctly handle new, unseen (but related) questions.

But what does this mean in terms of Machine Learning?

### Induction

Take a recommender system. This system has to predict how much (on a scale) a student likes a course. The induction framework does this:
1. It looks at previous years' **examples** (course, student pairs) taken from the so-called **training set** and **induces** a function f that maps new examples to a **predicted** rating
2. It **evaluates** the induced function against the **test set**
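
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the students, courses and ratings are made up, and the induced function is deliberately naive (it predicts the average rating a course received in the training set). The names `induce` and `evaluate` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

# Hypothetical toy data: (student, course, rating on a 1-5 scale).
training_set = [
    ("alice", "algebra", 5),
    ("bob", "algebra", 4),
    ("alice", "history", 2),
    ("carol", "history", 3),
]
test_set = [
    ("carol", "algebra", 4),
    ("bob", "history", 2),
]


def induce(examples):
    """Step 1: induce a function f from the training examples.

    The hypothesis is intentionally simple: predict the average
    rating that the course received in the training set.
    """
    ratings_by_course = defaultdict(list)
    for _student, course, rating in examples:
        ratings_by_course[course].append(rating)
    all_ratings = [rating for _, _, rating in examples]
    global_mean = sum(all_ratings) / len(all_ratings)
    course_mean = {c: sum(rs) / len(rs) for c, rs in ratings_by_course.items()}

    def f(student, course):
        # Fall back to the global mean for courses never seen in training.
        return course_mean.get(course, global_mean)

    return f


def evaluate(f, examples):
    """Step 2: mean absolute error of f on held-out examples."""
    errors = [abs(f(s, c) - r) for s, c, r in examples]
    return sum(errors) / len(errors)


f = induce(training_set)
test_error = evaluate(f, test_set)
print(f"mean absolute error on the test set: {test_error:.2f}")
```

Note that `evaluate` only ever sees pairs that were not used by `induce`.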

## Formalizing

- The sample set is X $ \rightarrow $ every sample x is drawn from this set
- The target (objective) function is f $ \rightarrow $ the goal to reach; it is unknown and observed only through the known data
- The learned function is $ \hat f $ $ \rightarrow $ has to emulate the unknown function f as well as possible
- The loss function is $ l(f,\hat f) $ $ \rightarrow $ quantifies the error of $ \hat f $ with respect to f
- The dataset D $ \rightarrow $ is sampled from the data distribution *D*
- The expected error $ \epsilon = \sum_{x \in X} D(x,f(x)) \cdot l(f(x),\hat f(x)) $ $ \rightarrow $ expected value of l over the distribution *D*
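
As a worked instance of the expected error, the sketch below evaluates the sum on a tiny finite sample set. Everything in it is an assumption made for illustration: the four samples, their probabilities, the target `f`, the learned `f_hat` and the zero-one loss. In a real problem neither `f` nor the distribution is known, so this computation is not available.

```python
# Hypothetical finite sample set X with a known distribution over x.
# Since y = f(x) is deterministic here, D(x, f(x)) reduces to P(x).
X = [0, 1, 2, 3]
D = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}


def f(x):
    """Target function (unknown in practice, known in this toy setup)."""
    return x % 2


def f_hat(x):
    """Learned function: it disagrees with f only on x = 3."""
    return 1 if x == 1 else 0


def loss(y, y_pred):
    """Zero-one loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if y == y_pred else 1


# Expected value of the loss over the distribution.
expected_error = sum(D[x] * loss(f(x), f_hat(x)) for x in X)
print(expected_error)
```

Only x = 3 contributes to the sum, so the expected error equals the probability mass of that single sample.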

![induction framework](images/b9b18ab2b16b29486abb53e8c5258f47be06603bd42e1794851a237e531a5efe.png)

Step 1 is contained in the red box, i.e. the learning algorithm, which is executed as a loop over all the training samples. In the figure below we zoom in. After having trained $ \hat f $, we test it, hoping that $ \hat f(x) \approx f(x) \; \forall x \in X \setminus D $. For that reason, the test set must be disjoint from the training set: only an evaluation on unseen samples can reveal overfitting.

![Learning algorithm](images/05b31c1af1208989ecae2cc06d88a187151eaa00f02caef5dc16af2b94e15b5b.png)
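
To see why testing has to happen on samples outside the training set, consider a minimal sketch with a learner that simply memorizes its training pairs. The target function, the split and the default answer for unseen inputs are all assumptions chosen for illustration.

```python
# Hypothetical target: f(x) = 1 if x is even, else 0.
def f(x):
    return 1 if x % 2 == 0 else 0


train_x = [0, 1, 2, 3, 4, 5]
test_x = [6, 7, 8, 9]  # disjoint from train_x by construction
assert not set(train_x) & set(test_x)

# A learner that only memorizes the training pairs.
memory = {x: f(x) for x in train_x}


def f_hat(x):
    # Returns the stored answer, or a default of 0 for unseen inputs.
    return memory.get(x, 0)


def error_rate(xs):
    """Fraction of inputs on which f_hat disagrees with f."""
    return sum(f(x) != f_hat(x) for x in xs) / len(xs)


train_error = error_rate(train_x)
test_error = error_rate(test_x)
print(f"training error: {train_error:.2f}, test error: {test_error:.2f}")
```

Measured on the training set alone, the memorizing learner looks perfect; only the held-out samples show that it has not generalized.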


### In one sentence
Machine learning is the process of computing a function that takes as input samples from an unknown distribution and outputs a value with low expected error (over that distribution) with respect to a loss function.

# Limits of learning

- Inductive bias
- Noise
- Overfitting and underfitting
- No one gives us the data distribution *D*, only a sample of it. If we had *D*, it would suffice to output, for each input, the output whose probability of being paired with that input is highest (Bayes optimal)
  - This also means that we can only approximate the true error with the sampling error, i.e. the average loss over the dataset: $ \hat \epsilon = \frac{1}{|D|} \sum_{x \in D} l(f(x),\hat f(x)) $
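
The last point can be made concrete with a small sketch. It assumes a made-up joint distribution over (input, output) pairs, which is exactly what is never available in practice, builds the Bayes optimal predictor from it, and then compares its true error with the error measured on a finite sample. The weather and transport labels, the probabilities and the sample size are all hypothetical.

```python
import random

# Hypothetical joint distribution D(x, y) over inputs x and outputs y.
D = {
    ("sunny", "walk"): 0.35,
    ("sunny", "bus"): 0.15,
    ("rainy", "walk"): 0.10,
    ("rainy", "bus"): 0.40,
}


def bayes_optimal(x):
    """Return the output y that maximizes D(x, y) for the given input x."""
    candidates = {y: p for (xi, y), p in D.items() if xi == x}
    return max(candidates, key=candidates.get)


# True error under zero-one loss: total mass of the pairs predicted wrongly.
true_error = sum(p for (x, y), p in D.items() if bayes_optimal(x) != y)

# In practice only a finite sample of D is available.
rng = random.Random(0)
pairs = list(D)
sample = rng.choices(pairs, weights=[D[p] for p in pairs], k=10_000)

# Sampling error: average zero-one loss over the sample.
sampling_error = sum(bayes_optimal(x) != y for x, y in sample) / len(sample)

print(f"true error:     {true_error:.3f}")
print(f"sampling error: {sampling_error:.3f}")
```

Even the Bayes optimal predictor has a non-zero error here, because the same input is paired with different outputs; the sampling error only estimates that value and gets closer as the sample grows.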

# Categories of models

We usually characterize every induction framework by three aspects
   1. How we represent the function (hypothesis space)
   2. How we evaluate the function (objective function)
   3. How we optimize the function (loss function)

We can distinguish between two big categories of machine learning models 
- Parametric models, which have a fixed number of parameters. With this approach, the optimization process carried out by the learning algorithm outputs the optimal parameters $ \omega ^* $, which we then use to predict on test samples. After training, the training data is no longer needed.
- Non-parametric models, whose number of parameters can grow with the size |D| of the training data. With this approach, the trained model still uses the available data when making predictions.
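
The difference between the two categories can be sketched with two very small predictors. Both choices are assumptions made for illustration: a least-squares line stands in for a parametric model, and 1-nearest-neighbour stands in for a non-parametric one. The data points are made up.

```python
# Hypothetical 1-D training data, generated from y = 2x + 1.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def fit_line(data):
    """Parametric: least-squares line, always exactly two parameters."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(params, x):
    # Only the learned parameters are needed; the data can be discarded.
    slope, intercept = params
    return slope * x + intercept


def predict_1nn(data, x):
    """Non-parametric: 1-nearest-neighbour needs the stored data to predict."""
    _nearest_x, nearest_y = min(data, key=lambda pair: abs(pair[0] - x))
    return nearest_y


omega_star = fit_line(train)             # plays the role of the optimal parameters
line_prediction = predict_line(omega_star, 1.4)
nn_prediction = predict_1nn(train, 1.4)  # the training set is passed in again
print(omega_star, line_prediction, nn_prediction)
```

`predict_line` takes only the two fitted numbers, whereas `predict_1nn` must receive the whole training set on every call, so what it stores grows with |D|.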