---
# Advanced Financial Econometrics
## PhD seminar reading group

#### Winter Semester 2019/2020

#### Seminar 3 (Oct 23 2019): Chapter 5 (5.6-5.11)

by Lucie Kraicova



---

## Machine Learning Basics: 5.6 - 5.11

* 5.6 Bayesian Statistics
* 5.7 Supervised Learning Algorithms
* 5.8 Unsupervised Learning Algorithms
* 5.9 Stochastic Gradient Descent
* 5.10 Building a Machine Learning Algorithm
* 5.11 Challenges Motivating Deep Learning





### 5.6 Bayesian Statistics

**Frequentist statistics**: We estimate a single value of $\theta$.

Under this approach we assume that the true $\theta$ is fixed but unknown, while the dataset is considered random. The estimate $\hat{\theta}$ is a function of the dataset and is therefore random as well.

**Bayesian statistics**: We consider all possible values of $\theta$ when making a prediction.

Here the dataset is not considered random, as it is directly observed. The true $\theta$ is unknown or uncertain and is thus treated as a random variable.

Before we observe the data, we represent our knowledge of $\theta$ using the **prior probability distribution** $p\left(\theta\right)$ (e.g. assuming _a priori_ that parameters lie in a finite range with a uniform distribution).
Then we observe data $\{x^{(1)},x^{(2)},\dots,x^{(m)}\}$, update our belief about $\theta$ using **Bayes' rule** and get the **posterior density**:

$$p\left(\theta|x^{(1)},x^{(2)},\dots,x^{(m)}\right)=\frac{p\left(x^{(1)},x^{(2)},\dots,x^{(m)}|\theta\right)p\left(\theta\right)}{p\left(x^{(1)},x^{(2)},\dots,x^{(m)}\right)}$$

When we extend our dataset by one sample, our predicted distribution over this new sample is:

$$p\left(x^{(m+1)}|x^{(1)},x^{(2)},\dots,x^{(m)}\right)=\int{p\left(x^{(m+1)}|\theta\right)p\left(\theta|x^{(1)},x^{(2)},\dots,x^{(m)}\right) d\theta}$$

After observing data, there is still uncertainty about the true parameter values, and this uncertainty is incorporated directly into our predictions. This may protect against overfitting.

Properties of the Bayesian approach:

* typically generalizes better when there is limited training data
* high computational costs on large training datasets
* human judgement enters through the selection of the _prior_

(Example p.137: Bayesian Linear Regression)
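
The referenced example can be sketched in a few lines. This is a minimal illustration, not a reproduction of the book's text: it assumes unit-variance Gaussian noise and a Gaussian prior $w\sim N(\mu_0,\Lambda_0)$, for which the posterior is again Gaussian. The function name `bayes_linreg_posterior`, the simulated data and the "true" weights are all made up for illustration.

```python
import numpy as np

def bayes_linreg_posterior(X, y, mu0, Lambda0):
    """Posterior N(mu_m, Lambda_m) over weights w for y = X w + noise,
    assuming noise ~ N(0, I) and prior w ~ N(mu0, Lambda0)."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)
    return mu_m, Lambda_m

# Simulated data (hypothetical true weights, unit-variance noise)
rng = np.random.default_rng(0)
w_true = np.array([1.5, -0.5])
X = rng.normal(size=(50, 2))
y = X @ w_true + rng.normal(size=50)

# Prior: zero mean, identity covariance
mu0 = np.zeros(2)
Lambda0 = np.eye(2)
mu_m, Lambda_m = bayes_linreg_posterior(X, y, mu0, Lambda0)

# Predictive distribution for a new input integrates over the posterior:
# mean x^T mu_m, variance = noise variance (1) + parameter uncertainty
x_new = np.array([1.0, 2.0])
pred_mean = x_new @ mu_m
pred_var = 1.0 + x_new @ Lambda_m @ x_new
```

The predictive variance exceeds the noise variance by the term $x^{T}\Lambda_m x$, which is how the remaining parameter uncertainty enters the prediction.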

**Maximum _A Posteriori_ (MAP) Estimation** provides us with a single estimate of $\theta$ while still allowing the _prior_ to influence it.
MAP chooses the point of maximal posterior probability (probability density in the continuous case).

$\theta_{\text{MAP}}=\underset{\theta}{\arg\max}\;p\left(\theta|x\right)=\underset{\theta}{\arg\max}\left[\log p\left(x|\theta\right)+\log p\left(\theta\right)\right]$

The decomposition shows how the MAP estimate relates to the ML estimate: the first term is the log-likelihood, the second term is the log-prior.

Using the prior information we can achieve lower variance at the cost of increased bias compared to the ML estimate.
Many regularization strategies can be interpreted as a MAP approximation to Bayesian inference.
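
As a concrete instance of the last point, the following sketch compares the ML and MAP estimates for linear regression. It assumes unit-variance Gaussian noise and a zero-mean Gaussian prior $w\sim N\left(0,\frac{1}{\lambda}I\right)$, in which case maximizing $\log p(y|X,w)+\log p(w)$ gives the weight-decay (ridge) solution. The data and the value of `lam` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ rng.normal(size=5) + rng.normal(size=20)

lam = 2.0  # hypothetical prior precision (weight-decay strength)

# ML estimate: maximizes log p(y | X, w), i.e. ordinary least squares
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP estimate: maximizes log p(y | X, w) + log p(w);
# the Gaussian prior adds lam * I to the normal equations
w_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient of the log-posterior at w_map (should vanish at the maximum)
grad_at_map = X.T @ (y - X @ w_map) - lam * w_map
```

The MAP weights are shrunk towards the prior mean (zero), which is the bias-for-variance trade mentioned above.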


### 5.7 Supervised Learning Algorithms
Learning algorithms that learn to associate some input with some output, given training data that include both the inputs $x$ and the outputs $y$.

In **Probabilistic supervised learning** we estimate $p(y|x)$.

**Support vector machines (SVM)** are one of the most influential approaches to supervised learning. The model uses the linear function $w^{T}x+b$ and predicts the positive class when this function is positive for a given example and the negative class when it is negative.

The **kernel trick** is an innovation associated with, but not limited to, SVMs. It is based on the observation that many machine learning algorithms (e.g. SVM) can be written exclusively in terms of dot products between examples.

$w^{T}x+b=b+\sum_{i=1}^{m}{\alpha_{i}x^{T}x^{(i)}}$

* $\alpha$ is a vector of coefficients
* $x^{(i)}$ is a training example

Then $x$ can be replaced by the output of a **feature function** $\phi(x)$ and the dot product by a **kernel** $k\left(x,x^{(i)}\right)=\phi(x)\cdot\phi(x^{(i)})$, so that the prediction can be made using the function:

$f(x)=b+\sum_{i}\alpha_{i}k\left(x,x^{(i)}\right)$.
* nonlinear in $x$, linear in $\phi(x)$ and $\alpha$
* $\phi(x)$ is considered fixed, only $\alpha$ is estimated
* equivalent to estimating a linear model on input data preprocessed by $\phi(x)$
* helps us to deal with models that are nonlinear in $x$ using convex optimization techniques
* the cost of evaluating $f(x)$ is linear in the number of training examples
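
The rewriting in terms of dot products can be checked numerically. The sketch below uses the plain dot product as the kernel (i.e. $\phi(x)=x$), so that $f(x)$ must coincide with $w^{T}x+b$ for $w=\sum_i \alpha_i x^{(i)}$. The helper names `kernel_predict` and `linear_kernel` and the random data are illustrative assumptions.

```python
import numpy as np

def kernel_predict(x, X_train, alpha, b, kernel):
    """f(x) = b + sum_i alpha_i * k(x, x^(i)); one kernel call per training example."""
    return b + sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))

def linear_kernel(u, v):
    """Kernel for the identity feature map phi(x) = x."""
    return u @ v

rng = np.random.default_rng(2)
X_train = rng.normal(size=(6, 3))   # six training examples x^(i)
alpha = rng.normal(size=6)          # coefficient vector
b = 0.5

# Weight vector implied by the coefficients: w = sum_i alpha_i x^(i)
w = X_train.T @ alpha

x = rng.normal(size=3)
f_kernel = kernel_predict(x, X_train, alpha, b, linear_kernel)
f_linear = w @ x + b
```

Swapping `linear_kernel` for any other valid kernel changes the feature space without changing `kernel_predict`; the loop over all training examples is where the linear evaluation cost comes from.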

The most commonly used kernel is the **Gaussian kernel** $k(u,v)=N\left(u-v;0,\sigma^{2}I\right)$, also known as the **radial basis function (RBF)** kernel. It can be described as performing **template matching**: a training example $x$ associated with training label $y$ becomes a template for class $y$. When a test point $x'$ is close to the training point $x$, the RBF has a large response and thus gives a large weight to $y$ in the estimation.
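
A minimal sketch of the Gaussian kernel as defined above, evaluated as a normal density in $u-v$. The function name, the value of `sigma` and the three points are chosen purely for illustration.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = N(u - v; 0, sigma^2 I), a Gaussian density evaluated at u - v."""
    d = u.shape[0]
    sq_dist = np.sum((u - v) ** 2)
    norm_const = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)
    return norm_const * np.exp(-sq_dist / (2.0 * sigma ** 2))

template = np.array([0.0, 0.0])   # a training point acting as a template
near = np.array([0.1, -0.1])      # test point close to the template
far = np.array([3.0, 3.0])        # test point far from the template

k_near = gaussian_kernel(near, template)
k_far = gaussian_kernel(far, template)
```

The response `k_near` is much larger than `k_far`, so the label of the template dominates the prediction for nearby test points only.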

The category of algorithms that employ the kernel trick is known as **kernel machines** or **kernel methods**.

**k-Nearest Neighbor Regression** 
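* interesting limiting case of an infinite training set (e.g. in classification with 0-1 loss, the 1-nearest-neighbor error converges to double the Bayes error)
* problematic when there are many features and only some of them are relevant for the output, as all features enter the distance equally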
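
A minimal sketch of k-nearest-neighbor regression, assuming Euclidean distance and an unweighted average of the neighbors' outputs. The toy data are made up; the second dataset adds an irrelevant feature on a large scale to illustrate the problem from the last bullet.

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """Predict the mean output of the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# One relevant feature: the output equals the feature value
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_regress(np.array([1.1]), X_train, y_train, k=3)
pred_1nn = knn_regress(np.array([1.1]), X_train, y_train, k=1)

# Same data with an added irrelevant feature that dominates the distance
X_noisy = np.array([[0.0, 0.0], [1.0, 100.0], [2.0, 0.0], [10.0, 0.0]])
pred_1nn_noisy = knn_regress(np.array([1.1, 0.0]), X_noisy, y_train, k=1)
```

With the irrelevant feature included, the nearest neighbor of the query changes from the point at 1.0 to the point at 2.0, although the relevant feature is unchanged.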

**Decision Tree** 
* the input space is broken into regions and subregions; usually each leaf node maps every point in its input region to the same output
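
The simplest case is a tree with a single split, i.e. two regions with one constant output each. The notes do not say how the regions are chosen; the greedy squared-error split below is an assumption made for illustration, and the names `fit_stump` and `predict_stump` are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree on a 1-D feature.
    Tries every midpoint between sorted unique values as a threshold and
    keeps the one with the smallest sum of squared errors."""
    best = None
    xs = np.unique(x)  # sorted unique feature values
    for t in (xs[:-1] + xs[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(x, stump):
    """Every point in a region (leaf) gets the same output."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

x = np.array([0.0, 1.0, 2.0, 7.0, 8.0, 9.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
stump = fit_stump(x, y)
preds = predict_stump(np.array([1.5, 8.5]), stump)
```

A deeper tree would apply the same split recursively inside each region, producing the subregions mentioned above.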

### 5.8 Unsupervised Learning Algorithms
Learning algorithms that work on training data that include only features $x$, but not the supervisory signal $y$. They extract information from the data and/or transform the data in an 'optimal' way, which makes it possible to represent the data differently.
Examples: density estimation, learning to draw samples from a distribution, denoising, finding a manifold that the data lies near, clustering data into groups of related examples.

* Lower dimensional representation: compress information

* Sparse representation: input entries should be mostly zero without discarding too much information

* Independent representation: disentangle sources of variation underlying the data


**Principal Components Analysis**
It provides a means of compressing and decorrelating data. It learns an orthogonal, linear transformation of the data that projects an input $x$ to a representation $z$.
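
A minimal sketch of PCA via the singular value decomposition of the centered data matrix. The function name `pca` and the simulated, strongly correlated two-dimensional data are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto the leading right singular vectors."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:n_components].T   # orthonormal columns: principal directions
    Z = X_centered @ W        # representation z for every input x
    return Z, W

# Two strongly correlated features driven by one latent factor
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent]) + 0.1 * rng.normal(size=(200, 2))

Z, W = pca(X, n_components=2)
cov_Z = np.cov(Z, rowvar=False)
```

The off-diagonal entries of `cov_Z` are numerically zero (decorrelation), and keeping only the first column of `Z` would give the compressed, lower-dimensional representation.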

**k-means Clustering**
It divides the training set into k different clusters of examples that are near each other. It provides a sparse representation, as we can think about the clustering in terms of k-dimensional one-hot code vectors $h$ representing the inputs. If $x$ belongs to cluster $i$, then $h_{i}=1$ and all other entries of $h$ are zero.

The algorithm works by initializing k different centroids $\{\mu^{(1)},\mu^{(2)},\dots,\mu^{(k)}\}$ to different values. Then, each training example is assigned to cluster $i$, where $i$ is the index of the nearest centroid $\mu^{(i)}$. Next, each centroid $\mu^{(i)}$ is updated to the mean of all training examples in cluster $i$. These two steps are alternated until convergence.
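
The two alternating steps can be sketched as follows. Initializing the centroids at randomly chosen training points and stopping when the centroids no longer move are assumptions of this sketch, as are the function name `kmeans` and the simulated data.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct training examples (an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each example
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:   # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    H = np.eye(k)[labels]          # one-hot code vectors h, one row per example
    return centroids, labels, H

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
               rng.normal(loc=5.0, scale=0.3, size=(30, 2))])
centroids, labels, H = kmeans(X, k=2)
```

Each row of `H` has exactly one nonzero entry, which is the sparse one-hot representation described above. The result depends on the initialization, so different seeds may yield different clusterings.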

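* problem of performance measurement: there is no single criterion for how well a clustering corresponds to the real world (example of red trucks, red cars, gray trucks, gray cars, which can be grouped by colour or by vehicle type)
* a distributed representation might be preferred to a one-hot representation, as it can hold more than one attribute per example (e.g. colour and vehicle type)
* having many attributes reduces the burden on the algorithm to guess which single attribute we care about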

### 5.9 Stochastic Gradient Descent (SGD)
The learning algorithm underlying 'nearly all deep learning'. It extends the gradient descent algorithm from Chapter 4.
It reduces computational costs for large training sets when the cost function can be decomposed into 'per-example loss functions' summed over all training examples, e.g. the negative conditional log-likelihood of the training data:
$$ J(\theta)=\mathbb{E}_{x,y\sim\hat{p}_{\text{data}}}L\left(x,y,\theta\right)=\frac{1}{m}\sum_{i=1}^{m}{L\left(x^{(i)},y^{(i)},\theta\right)} $$

$$ L\left(x,y,\theta\right)=-\log p\left(y|x;\theta\right) $$

$L$ is the per-example loss function and the gradient requires computing:
$$ \nabla_{\theta}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}{\nabla_{\theta}L\left(x^{(i)},y^{(i)},\theta\right)}$$

The computational cost of this operation is $O(m)$, so with large training data the time to take a single gradient step becomes prohibitively long.

SGD exploits the fact that the gradient is an expectation, which can be approximately estimated using a small set of samples. On each step of the algorithm we draw uniformly a subsample (minibatch) of the available training examples, $\mathbb{B}=\{x^{(1)},x^{(2)},\dots,x^{(m')}\}$, and estimate the gradient using this subsample:

$$ g=\frac{1}{m'}\nabla_{\theta}\sum_{i=1}^{m'}{L\left(x^{(i)},y^{(i)},\theta\right)}$$

The stochastic gradient descent algorithm follows the estimated gradient downhill:
$\theta\leftarrow\theta-\epsilon g$, where $\epsilon$ is the learning rate.
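
A minimal sketch of the update rule for linear regression. It assumes the per-example loss $L=\frac{1}{2}\left(x^{T}\theta-y\right)^{2}$, which is the negative log-likelihood under unit-variance Gaussian noise up to a constant. The function name, the learning rate, the minibatch size and the simulated data are illustrative choices, not values from the text.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=16, n_steps=2000, seed=0):
    """Minimize the average squared-error loss with minibatch SGD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_steps):
        # Draw a minibatch of m' = batch_size examples uniformly
        idx = rng.choice(m, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient estimate g: average per-example gradient over the minibatch
        g = Xb.T @ (Xb @ theta - yb) / batch_size
        # Follow the estimated gradient downhill
        theta = theta - lr * g
    return theta

rng = np.random.default_rng(5)
theta_true = np.array([2.0, -1.0, 0.5])   # hypothetical true parameters
X = rng.normal(size=(1000, 3))
y = X @ theta_true + 0.1 * rng.normal(size=1000)

theta_hat = sgd_linear_regression(X, y)
```

Each step touches only `batch_size` examples, so the cost per update does not grow with the training set size $m$.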

### 5.10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be decomposed into 4 basic parts:
* Dataset
* Cost function
* Optimization procedure
* Model

As we can choose and manage each of these parts mostly independently from the others, we can obtain a wide variety of algorithms.
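
The independence of the parts can be illustrated with a small sketch in which only the cost function is exchanged while the dataset, model and optimizer stay untouched. All names, the simulated data and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

# 1. Dataset (simulated)
rng = np.random.default_rng(6)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(100, 3))
y = X @ w_true + 0.1 * rng.normal(size=100)

# 2. Model: linear prediction
def model(X, w):
    return X @ w

# 3. Cost function, represented here by its gradient
def mse_grad(w, X, y):
    """Gradient of (1 / 2m) * sum of squared errors."""
    return X.T @ (model(X, w) - y) / len(y)

def mse_weight_decay_grad(w, X, y, lam=1.0):
    """Same cost plus a weight-decay term (lam / 2) * ||w||^2."""
    return mse_grad(w, X, y) + lam * w

# 4. Optimization procedure: plain gradient descent
def gradient_descent(grad_fn, w0, X, y, lr=0.1, n_steps=500):
    w = w0.copy()
    for _ in range(n_steps):
        w = w - lr * grad_fn(w, X, y)
    return w

# Swapping the cost function yields a different algorithm
w_plain = gradient_descent(mse_grad, np.zeros(3), X, y)
w_decay = gradient_descent(mse_weight_decay_grad, np.zeros(3), X, y)
```

Replacing `gradient_descent` with a minibatch optimizer, or `model` with a nonlinear function, would likewise leave the remaining parts unchanged.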

### 5.11 Challenges Motivating Deep Learning
**The curse of dimensionality**

**Local constancy and smoothness regularization**

**Manifold Learning**


