import numpy as np

# Deep Learning
(http://deeplearning.net/tutorial/)

## Introduction

### Formalization of Learning

Given:
* training examples $D=\{z_1, z_2, \ldots, z_n \}$ where each $z_i$ is sampled from an **unknown** process $P(Z)$
* loss function $L(f,Z)$ that
    * takes a decision function $f$ and an example $z$ as arguments
    * returns a scalar in $I\!R$
* Goal: **minimize** $I\!E [ L(f,Z) ]$ under unknown generating process $P(Z)$
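
Since $P(Z)$ is unknown, the expectation above cannot be evaluated directly. A common stand-in is the average loss over the training set $D$ (the empirical risk). The sketch below illustrates this under assumptions that are not part of the notes: the examples are hypothetical $(x, y)$ pairs, `f` is a fixed linear function, and `squared_loss` and `empirical_risk` are illustrative names.

```python
import numpy as np

def squared_loss(f, z):
    """Loss L(f, z) for one example z = (x, y): squared prediction error."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_risk(loss, f, examples):
    """Average of L(f, z_i) over the sample, used as an estimate of E[L(f, Z)]."""
    return float(np.mean([loss(f, z) for z in examples]))

# Hypothetical training set D and decision function f (assumptions for illustration).
D = [(1.0, 2.0), (2.0, 4.5), (3.0, 6.0)]
f = lambda x: 2.0 * x

risk = empirical_risk(squared_loss, f, D)
print(risk)
```

Minimizing this average over a family of candidate functions $f$ is one way to approach the goal stated above, although a low training average does not by itself guarantee a low expected loss.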

### Supervised Learning

* examples are (input, target) pairs: $Z=(X,Y)$, and $f$ takes $X$ as its argument
* Most common types:
    * **regression**: $Y\in I\!R$ scalar or vector
        * output of $f$ is in the same set of values as $Y$
        * often, **squared error** function is taken as loss function: $$L(f, (X,Y))=\vert\vert \ f(X)-Y \ \vert\vert^2$$
    * **classification**: $Y$ takes values in a finite subset of $I\!Z$ (class labels)
        * often-used loss function: **negative conditional log-likelihood**, where $f_i(X)$ estimates $\Pr(Y=i\ \vert \ X)$ subject to $f_i(X)\geq 0$ and $\sum_i f_i(X)=1$: $$L(f,(X,Y)) = -\log f_Y(X)$$
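
The two losses above can be written in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function names are illustrative, class labels are assumed to be 0-based indices, and the softmax is an assumption used here only to produce values that are non-negative and sum to one.

```python
import numpy as np

def squared_error(prediction, target):
    """Regression loss: squared Euclidean norm ||f(X) - Y||^2."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2))

def softmax(scores):
    """Map arbitrary scores to non-negative values that sum to 1 (assumed output layer)."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

def negative_log_likelihood(probs, label):
    """Classification loss: -log f_Y(X), where probs[i] estimates Pr(Y = i | X)."""
    return float(-np.log(np.asarray(probs, dtype=float)[label]))

reg_loss = squared_error([1.0, 2.0], [0.0, 4.0])    # (1 - 0)^2 + (2 - 4)^2 = 5
probs = softmax([2.0, 1.0, 0.1])                    # hypothetical class scores
clf_loss = negative_log_likelihood(probs, label=0)  # small when probs[0] is close to 1
```

The classification loss is zero only when the estimated probability of the true class is exactly 1, and it grows without bound as that probability approaches 0.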

### Unsupervised Learning

* learn function $f$ to characterize the unknown distribution $P(Z)$
* Clustering
    * Hard clustering (e.g. k-means)
    * Soft clustering (Gaussian mixture models)
* Construct a new representation for $Z$ (e.g. PCA)
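
The difference between hard clustering, soft clustering, and a learned representation can be seen on toy data. The sketch below makes several assumptions that are not in the notes: the data are two synthetic 2-D blobs, the cluster centers are fixed by hand rather than learned, the soft memberships are those of an equal-weight Gaussian mixture with unit isotropic covariance, and PCA is computed through an SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two hypothetical blobs in 2-D (20 points each).
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(3.0, 0.5, size=(20, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])  # assumed fixed, not learned here

# Squared distance from every point to every center, shape (40, 2).
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Hard clustering: each point gets exactly one cluster index
# (this is the assignment step of k-means).
hard = sq_dist.argmin(axis=1)

# Soft clustering: each point gets a probability for every cluster.
logits = -0.5 * sq_dist
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
soft = np.exp(logits)
soft /= soft.sum(axis=1, keepdims=True)

# PCA: a new representation of each point as its coordinate
# along the direction of largest variance.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z1 = Xc @ Vt[0]  # 1-D code for each 2-D point
```

Here `hard` holds one integer per point, `soft` holds one probability vector per point, and `Z1` holds one real-valued coordinate per point.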

### Local Generalization

* Assumption: if input $x_i$ is close to $x_j$, then the outputs $f(x_i)$ and $f(x_j)$ should also be close to each other
    * Limitations: **curse of dimensionality**
    
### Distributed vs. Local Representation and Non-Local Generalization

**Local representation**: integer $N$ (with $0 \le N < B$) is represented by a sequence of $B$ bits in which all bits are $0$ except the $N$-th one (one-hot encoding).

**Distributed representation**: the same integer $N$ is represented by a sequence of $\lceil\log_2 B\rceil$ bits using the usual binary encoding.

$\Rightarrow$ for the same range of integers, the distributed representation needs exponentially fewer bits than the local one (about $\log_2 B$ instead of $B$)
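
The two encodings can be compared directly. In this sketch the function names are illustrative, bits are indexed from 0, and the binary code is written most significant bit first; none of these conventions come from the notes.

```python
import numpy as np

def local_code(n, B):
    """Local (one-hot) code: B bits, all zero except bit n. Requires 0 <= n < B."""
    code = np.zeros(B, dtype=int)
    code[n] = 1
    return code

def distributed_code(n, B):
    """Binary code for n using just enough bits to cover 0..B-1."""
    width = max(1, (B - 1).bit_length())  # equals ceil(log2(B)) for B >= 2
    return np.array([(n >> i) & 1 for i in reversed(range(width))], dtype=int)

B = 8
print(local_code(5, B))        # [0 0 0 0 0 1 0 0]
print(distributed_code(5, B))  # [1 0 1]
```

With $B = 1024$ the local code needs 1024 bits while the binary code needs 10, which is the exponential gap described above.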

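#### Another example:
**Clustering** (local: each point is described by the single cluster it belongs to) vs **PCA** (distributed: each point is described by its coordinates along several components)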

## Deep Learning Algorithms

**Flow graph**: Graph representing a computation, in which each node represents an elementary computation and a value.

**Example**:

The expression $\sin(a^2+b/a)$ is represented by a graph with:
* two input nodes $a$ and $b$
* one node for the division $b/a$
* one node for the square (taking only $a$)
* one node for the addition ($a^2+b/a$)
* one output node computing sine

The **depth** of the graph is the length of the longest path from an input to an output.
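
The example graph can be written down and evaluated in plain Python. This is a sketch under stated assumptions: the dictionary layout and the node names (`div`, `sq`, `add`, `sin`) are illustrative, and path length is counted in edges, which the notes do not specify.

```python
import math

# Each node maps to (operation, parent node names). Input nodes have no operation.
graph = {
    "a":   (None, []),
    "b":   (None, []),
    "div": (lambda b, a: b / a, ["b", "a"]),     # b / a
    "sq":  (lambda a: a * a, ["a"]),             # a^2
    "add": (lambda s, d: s + d, ["sq", "div"]),  # a^2 + b/a
    "sin": (math.sin, ["add"]),                  # output node
}

def evaluate(node, inputs, cache=None):
    """Compute the value of a node, evaluating each parent at most once."""
    cache = {} if cache is None else cache
    if node not in cache:
        op, parents = graph[node]
        if op is None:
            cache[node] = inputs[node]
        else:
            cache[node] = op(*(evaluate(p, inputs, cache) for p in parents))
    return cache[node]

def depth(node):
    """Number of edges on the longest path from any input to this node."""
    _, parents = graph[node]
    return 0 if not parents else 1 + max(depth(p) for p in parents)

value = evaluate("sin", {"a": 2.0, "b": 4.0})  # sin(2^2 + 4/2) = sin(6)
graph_depth = depth("sin")                     # a -> sq -> add -> sin: 3 edges
```

Counting nodes instead of edges would shift the depth by a constant, so the comparison between shallow and deep graphs is unaffected by that choice.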

### Motivation

* In shallow architectures, the number of nodes and parameters needed to represent some functions can grow very large (e.g. circuits built from logic gates)
* The brain has a deep architecture