# Chapter 3: Understanding DL through kernel learning

## 0. Mathematical foundation

### Functional analysis

A function $f(x)$ can be viewed as an infinite-dimensional vector, in the sense that it can be written as a linear combination of infinitely many basis functions $\{h_l(x)\}_{l=1}^\infty$:

\begin{equation}
f(x) = \sum_{l=1}^\infty c_l h_l(x) \label{lincom}\tag{Eq 0.1}
\end{equation}

Such a representation is not unique, since one can choose another basis. Changing the basis amounts to performing a coordinate transformation on \ref{lincom}.

A kernel $K(x,x')$ can be viewed as an infinite-dimensional matrix. Again, one needs to choose a basis to write the kernel in components:

\begin{equation}
K(x,x') = \sum_{i=1}^\infty\sum_{j=1}^\infty k_{ij} h_i(x)h_j(x') \label{kernel_basis}\tag{Eq 0.2}
\end{equation}

One can express $K(x,x')$ in terms of its eigenfunctions $\phi_l(x)$, which will diagonalize the kernel in \ref{kernel_basis}:

\begin{equation}
K(x,x') = \sum_{l=1}^\infty \lambda_l \phi_l(x)\phi_l(x')
\end{equation}

Since $K(x,x')$ encodes information about the eigenfunctions, one can use the kernel to express $f(x)$ in terms of the eigenfunctions of $K(x,x')$

\begin{align}
f(x) &= \int dx' K(x,x')f(x') \label{lincom_kernel}\tag{Eq 0.3} \\
&= \int dx' \sum_{l=1}^\infty \lambda_l \phi_l(x)\phi_l(x')f(x') \\
&= \sum_{l=1}^\infty \Big[ \int dx' \phi_l(x')f(x')\Big] \lambda_l\phi_l(x) \\
&= \sum_{l=1}^\infty c_l \phi_l(x) \label{lincom_eigen}\tag{Eq 0.4}
\end{align}

where $c_l \equiv \lambda_l \int dx' \phi_l(x')f(x') $

For a $k^{th}$-order polynomial kernel, each $\phi_i(x)$ is a combination of monomials in the components of $x$ of degree up to $k$. The number of basis functions is then finite, $\binom{k+d}{d}$, where $d$ is the dimension of $x$. For a Gaussian RBF, $\phi_i$ is a combination of Gaussian functions and Hermite polynomials; see section 6.2 of [Ref 1](http://www.math.iit.edu/~fass/PDKernels.pdf). \ref{lincom_kernel} can be understood as expressing the value at point $x$ as a weighted sum of the values at all other points $x'$. For an RBF, the weight depends on the distance between $x$ and $x'$.
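
As a finite-sample illustration of the eigen-expansion of a kernel, the sketch below builds the Gram matrix of a Gaussian RBF kernel on a grid and checks that it is recovered from its eigenpairs. The grid, the length scale and the helper name `rbf_kernel` are assumptions chosen for illustration; they are not part of the notes above.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.5):
    """Gaussian RBF kernel matrix between two 1-D arrays of points."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

x = np.linspace(-1.0, 1.0, 50)
K = rbf_kernel(x, x)

# Symmetric eigendecomposition: the columns of `phi` play the role of the
# eigenfunctions phi_l(x) sampled on the grid, `lam` the role of lambda_l.
lam, phi = np.linalg.eigh(K)

# Reconstruction K = sum_l lam_l * phi_l phi_l^T using all modes.
K_rebuilt = (phi * lam) @ phi.T

# Truncated reconstruction that keeps only the 10 largest eigenvalues.
top = np.argsort(lam)[::-1][:10]
K_top10 = (phi[:, top] * lam[top]) @ phi[:, top].T
rel_err_top10 = np.linalg.norm(K - K_top10) / np.linalg.norm(K)
```

The truncated reconstruction gives a feel for how quickly the RBF spectrum decays: `rel_err_top10` measures how much of the kernel is missed when only ten modes are kept.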

## 1. Kernel method


### Motivation
The purpose of the kernel method is to express the target function $f(x)$ as a linear combination of a set of basis functions $[\phi_1(x), \phi_2(x),...]$. Once $f(x)$ is expressed as a linear combination of the eigenfunctions of the kernel, a simple machine learning algorithm (e.g. linear regression) can be used to model the target function.

### Supervised learning
In practice, given a dataset $\{x_i, y_i\}_{i=1}^N$, one can approximate $K(x,x')$ in \ref{kernel_basis} by using the data. The constructed kernel will no longer be of infinite dimension and will become an $N\times N$ matrix. The corresponding eigenvectors can be computed from the kernel matrix. The integral in \ref{lincom_kernel} now becomes a sum

\begin{equation}
f(x) = \sum_{i=1}^N K(x, x_i)f(x_i) \label{lincom_kernel_data}\tag{Eq 1}
\end{equation}

In supervised learning, we do not know the form of $f(x)$ a priori. We therefore approximate \ref{lincom_kernel_data} by

\begin{equation}
f(x) \approx \sum_{i=1}^N \alpha_i K(x,x_i) + \alpha_0 \label{svm_func}\tag{Eq 2}
\end{equation}

Note that $f(x)$ is linear in $K(\cdot, x_i)$. To solve for the $\alpha$'s, we define a loss function with a penalty functional $J[f]$, which we minimize with respect to the $\alpha$'s:

\begin{equation}
\min_{f\in \mathcal{H}}\Big[ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda J[f]\Big] \label{lossfunc1}\tag{Eq 3}
\end{equation}

Choosing $L^2$ regularization and expressing $f$ as a linear combination of the $K(x,x_i)$ (and therefore implicitly of the $\phi_l(x)$) that span the Hilbert space, \ref{lossfunc1} becomes

\begin{align}
& \min_{\alpha_0, \vec{\alpha}}\Big[ \sum_{i=1}^N L\big(y_i, \sum_{j=1}^N \alpha_j K(x_i,x_j) + \alpha_0 \big) + \lambda \|f\|^2_{\mathcal{H}_K}\Big] \\
&= \min_{c_l}\Big[ \sum_{i=1}^N L\big(y_i, \sum_{l=1}^\infty c_l\phi_l(x_i) \big) + \lambda \|f\|^2_{\mathcal{H}_K}\Big] \label{lossfunc2}\tag{Eq 4} 
\end{align}

where $\|f\|^2_{\mathcal{H}_K}$ is defined as 

\begin{equation}
\|f\|^2_{\mathcal{H}_K} \equiv \langle f | K |f \rangle = \sum_{l=1}^\infty \frac{c_l^2}{\gamma_l} \label{hnorm}\tag{Eq 5}
\end{equation}

Here $\gamma_l$ denotes the eigenvalues of $K$ (written $\lambda_l$ in Section 0; renamed to avoid confusion with the regularization parameter $\lambda$). \ref{hnorm} implies that when expressing $f$ as a linear combination of the $\phi_i$, basis functions with increasingly small eigenvalues $\gamma_i$ are penalized more. To draw an analogy with PCA, eigenvectors with large eigenvalues are smoother (they capture the general variance in the data), whereas eigenvectors with small eigenvalues are rougher (they capture finer variation or noise in the data). The corollary is that only a finite number of basis functions are needed to approximate the true function, since higher-order $\phi_i$ are suppressed by the $L^2$ term.
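
For the squared-error loss, the regularized problem has a closed-form solution in the coefficients $\alpha$. The sketch below is a minimal kernel ridge regression under stated assumptions: the intercept $\alpha_0$ is dropped, the penalty is written as $\lambda\,\alpha^T K \alpha$, and the data, length scale and helper names (`rbf_kernel`, `predict`) are made up for illustration.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.5):
    """Gaussian RBF kernel matrix between two 1-D arrays of points."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3.0, 3.0, size=30))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)   # toy targets

lam = 0.1                                   # regularization strength lambda
K = rbf_kernel(x_train, x_train)

# Minimizer of sum_i (y_i - f(x_i))^2 + lam * alpha^T K alpha (alpha_0 = 0):
# alpha = (K + lam * I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) at new points."""
    return rbf_kernel(x_new, x_train) @ alpha

rkhs_norm_sq = alpha @ K @ alpha            # squared RKHS norm of the fit
train_mse = np.mean((predict(x_train) - y_train) ** 2)
```

Increasing `lam` shrinks `rkhs_norm_sq` at the cost of a larger `train_mse`, which is the trade-off the penalty term encodes.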


### SVM
One can choose different loss functions in \ref{lossfunc2}, e.g. squared error (OLS) or hinge loss. The latter is the loss function of choice for SVM. The minimization problem \ref{lossfunc2} becomes

\begin{align}
& \min_{f\in \mathcal{H}_K}\Big[\sum_{i=1}^N[1-y_i f(x_i)]_+ + \frac{\lambda}{2} \langle f | K | f \rangle\Big] \\
= & \min_{\alpha_0, \mathbf{\alpha}}\Big[\sum_{i=1}^N[1-y_i f(x_i)]_+ + \frac{\lambda}{2} \mathbf{\alpha}^T K \mathbf{\alpha}\Big]
\label{svm_lossfunc}\tag{Eq 6}
\end{align}

where $f(x)$ is a function generated by the kernel $K$: $f(x) = \alpha_0 + \sum_{i=1}^N \alpha_i K(x, x_i)$. The hinge loss is zero for points that are correctly classified with margin $y_i f(x_i) \ge 1$, lies between 0 and 1 for correctly classified points inside the margin, and exceeds 1 for misclassified points. Only points with $y_i f(x_i) \le 1$ therefore influence the solution, and minimizing \ref{svm_lossfunc} sets the $\alpha_i$ of all other points to zero. The points $x_i$ with non-zero $\alpha_i$ are called *support vectors*.
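
The behaviour of the hinge loss $[1-y f(x)]_+$ in its different regimes can be checked on a few hand-picked values. The decision values `f` below are hypothetical numbers, not the output of a trained SVM.

```python
import numpy as np

def hinge_loss(y, f):
    """Elementwise hinge loss [1 - y f]_+ for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([1.0, 1.0, 1.0, -1.0])
f = np.array([2.0, 0.5, -1.0, -3.0])   # hypothetical decision values
losses = hinge_loss(y, f)
# margins y*f = [2.0, 0.5, -1.0, 3.0] give losses [0.0, 0.5, 2.0, 0.0]
```

The first and last points are classified correctly with margin at least 1 and contribute nothing, the second is correct but inside the margin, and the third is misclassified.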

### Gaussian Processes

Given a dataset $\{x_i, y_i\}_{i=1}^N$, a GP assumes the function values are drawn from the following multivariate Gaussian distribution:

\begin{equation}
f(x)\ |\ \{x_i, y_i\}_{i=1}^N \sim N(\vec{\mu}, \Sigma)
\end{equation}

where the average $\vec{\mu}=[\mu(x_1), \mu(x_2),...,\mu(x_N)]$ and the covariance matrix $\Sigma_{ij} = K(x_i, x_j)$, which measures the similarity between $x_i$ and $x_j$. A common choice of the kernel is again a Gaussian RBF

\begin{equation}
K(x_i,x_j) = \sigma^2 e^{-\frac{1}{2l^2}(x_i-x_j)^2}
\end{equation}

where the hyperparameters $\sigma$ and $l$ control the 'volatility' and correlation over distance of the function.
The intuition here is that points interpolated from the observed data should not be too far off from the neighboring observed data points. Such similarity is encoded by the kernel $K$. A stationary GP uses a $K$ that depends only on the distance between the two points. A non-stationary GP makes no such assumption.

$\sigma$ and $l$ can be estimated by maximizing the log-likelihood (written here up to an additive constant and an overall factor of $1/2$):

\begin{align}
LL(\sigma, l) &= -\ln |K| - (\vec{y}-\vec{\mu})^T K^{-1}(\vec{y}-\vec{\mu}) \\
&= -\ln |K| - \sum_{i,j=1}^N y_i [K^{-1}]_{ij}\, y_j \label{ll_max}\tag{Eq 7}
\end{align}

where we assume $\mu(x_i)$ is zero for all $i$ in the second line. Note that $[K^{-1}]_{ij}$ is an element of the matrix inverse, not the reciprocal of $K_{ij}$. To get an intuition for the optimization process, consider two points with the same target value $y_1=y_2=y$ (since $\vec{\mu}$ is zero, $y$ is centered at zero) and correlation $\rho = e^{-\frac{1}{2l^2}(x_1-x_2)^2}$. The second term in \ref{ll_max} is then $-\frac{2y^2}{\sigma^2(1+\rho)}$. If the points are far apart relative to $l$, $\rho\approx 0$ and this term is at its most negative. Increasing $l$ raises $\rho$ and makes the term less negative, hence improving the smoothness, which makes sense since we observe two far-apart points whose $y$ have the same sign.
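
The two-point intuition can be checked numerically. The sketch below evaluates $-\ln|K| - \vec{y}^T K^{-1}\vec{y}$ for a zero-mean GP with a Gaussian RBF kernel at a short and a long length scale. The point locations, target values and the helper name `gp_log_likelihood` are assumptions made for illustration.

```python
import numpy as np

def gp_log_likelihood(x, y, sigma, length_scale):
    """-ln|K| - y^T K^{-1} y for a zero-mean GP with a Gaussian RBF kernel."""
    diff = x[:, None] - x[None, :]
    K = sigma**2 * np.exp(-diff**2 / (2.0 * length_scale**2))
    _, logdet = np.linalg.slogdet(K)
    return -logdet - y @ np.linalg.solve(K, y)

x = np.array([0.0, 3.0])   # two points far apart
y = np.array([1.0, 1.0])   # targets with the same sign

ll_short = gp_log_likelihood(x, y, sigma=1.0, length_scale=0.5)
ll_long = gp_log_likelihood(x, y, sigma=1.0, length_scale=5.0)
```

With these values `ll_long` exceeds `ll_short`, so the likelihood prefers the longer length scale, in line with the argument above.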

A GP can be updated whenever additional data is provided. Assuming zero $\mu$, the old targets $y$ and the new targets $y'$ jointly follow the multivariate Gaussian distribution

\begin{equation}
\begin{bmatrix}
y\\
y'
\end{bmatrix}
\sim N\Big(
\begin{bmatrix}
0\\
0
\end{bmatrix},
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}\Big)
\end{equation}

Using Bayes' rule $P(y'|y)=P(y',y)/P(y)$, the distribution of the new targets conditioned on the observed ones is again Gaussian, with mean $\Sigma_{21}\Sigma_{11}^{-1}y$ and covariance

\begin{equation}
\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}
\end{equation}
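
A minimal sketch of conditioning new targets $y'$ on observed targets $y$ with the standard Gaussian conditioning formulas is given below. The inputs, the noise-free targets and the small `jitter` added to the diagonal for numerical stability are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length_scale=1.0):
    """Gaussian RBF kernel matrix between two 1-D arrays of points."""
    diff = xa[:, None] - xb[None, :]
    return sigma**2 * np.exp(-diff**2 / (2.0 * length_scale**2))

x_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # observed inputs
y_obs = np.sin(x_obs)                           # observed targets y
x_new = np.array([-1.0, 0.5, 1.0])              # inputs of the new targets y'

jitter = 1e-8  # small diagonal term for numerical stability (assumption)
S11 = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
S12 = rbf_kernel(x_obs, x_new)
S21 = S12.T
S22 = rbf_kernel(x_new, x_new)

# Gaussian conditioning of y' on y with zero prior mean.
post_mean = S21 @ np.linalg.solve(S11, y_obs)
post_cov = S22 - S21 @ np.linalg.solve(S11, S12)
```

At new inputs that coincide with observed ones (`-1.0` and `1.0`), the posterior mean reproduces the observed targets and the posterior variance collapses to nearly zero, while the unobserved input `0.5` keeps a non-zero variance.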


### Bayesian view of kernel regression 

Once the kernel $K$ is specified, we have also specified a prior over the class of functions for the regression problem. For example, the set of functions generated by the kernel $K$ can be thought of as a zero-mean stationary Gaussian process with prior covariance function $K(x,x')$. Typically, the prior is such that smoother $\phi$ have larger variance and rougher $\phi$ have smaller variance. Such a prior enforces smoothness of the function by penalizing $\phi$ with small variance.

If we expect shift-invariance in the data, a Gaussian RBF can be a good choice as it only depends on the distance between the two points $||\vec{x}-\vec{x}'||$.

## 2. Understanding overparametrization

### 2.1 Overparametrization in linear regression

In the overparametrized regime $p > N$ (with $X$ the $N\times p$ design matrix), the solution to the OLS linear regression problem

\begin{equation}
\hat{\beta} = \arg\min_{\beta}\|X\beta-y\|^2 \label{linreg_loss}\tag{Eq 2.1}
\end{equation}

is not unique, as there are not enough constraints to determine all the $\beta_k$. Using the Moore-Penrose inverse to express the solution to \ref{linreg_loss} as

\begin{equation}
\beta^* = (X^TX)^+ X^Ty \label{moore_penrose_sol}\tag{Eq 2.2}
\end{equation}

one can show this solution corresponds to the minimum norm solution, i.e.

\begin{equation}
\beta^* = \arg\min_{\beta}\big\{\|\beta\|^2 \;\big|\; f(x_i)=y_i\ \forall i\big\} \label{min_norm_sol}\tag{Eq 2.3}
\end{equation}

Note that this is not equivalent to $l^2$ regularization with a finite regularization parameter $\lambda$: in the limit $\lambda \to \infty$ the regularized solution is $\hat\beta=0$, whereas \ref{min_norm_sol} gives the smallest (in general non-zero) $\|\beta\|$ subject to the strict condition that all points satisfy $f(x_i)=y_i$. It coincides instead with the ridgeless limit $\lambda \to 0^+$. In a classification setting with perfectly separable data, the margin separating the two classes is inversely proportional to $\|\beta\|$. Therefore, the minimum norm solution also corresponds to maximizing the margin between the two classes.

Computing the Moore-Penrose inverse in \ref{moore_penrose_sol} is expensive, as it is usually done via an SVD, which costs $O(N^2p)$ for an $N\times p$ matrix with $p>N$. A cheaper way to reach the solution is gradient descent

\begin{equation}
\beta \leftarrow \beta - \alpha \nabla_\beta L(X,y,\beta)
\end{equation}

With a sufficiently small learning rate $\alpha$ (to avoid oscillation) and starting from $\beta^{(0)}=0$, it can be shown that gradient descent converges to the same minimum norm solution as \ref{moore_penrose_sol}. Initializing at $\beta^{(0)}=0$ therefore makes gradient descent act as an implicit regularizer.
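
The claim can be verified numerically on a small random problem. The sketch below compares the pseudo-inverse solution with gradient descent started from zero and from a random point. The problem sizes, the random data, the step size and the iteration count are assumptions chosen so that the iteration converges comfortably.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 20                      # overparametrized: p > N
X = rng.normal(size=(N, p))       # rows are samples
y = rng.normal(size=N)

# Minimum-norm interpolating solution via the Moore-Penrose pseudo-inverse.
beta_pinv = np.linalg.pinv(X) @ y

# Gradient descent on ||X beta - y||^2 starting from beta = 0.
step = 0.5 / np.linalg.norm(X, 2) ** 2   # below 1 / sigma_max(X)^2
beta_gd = np.zeros(p)
for _ in range(20000):
    beta_gd -= step * 2.0 * X.T @ (X @ beta_gd - y)

# Starting from a random point still interpolates the data, but the
# component of the initial point outside the row space of X is never removed.
beta_other = rng.normal(size=p)
for _ in range(20000):
    beta_other -= step * 2.0 * X.T @ (X @ beta_other - y)
```

Both runs fit the training data exactly, but only the zero-initialized run lands on `beta_pinv`; the randomly initialized one ends with a larger norm.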

Define $\gamma \equiv p/N$. If the true function is linear, in the underparametrized regime ($\gamma < 1$), the OOS error of a linear model has zero bias and only variance. In the overparametrized regime, the OOS error of a linear model will consist of both bias and variance. Bias will increase and variance will decrease as $\gamma$ increases.

Since the loss function is still convex, the only way it can have multiple solutions is that the set of minimizers is not a single point but an affine subspace (of dimension $p-N$ when $X$ has full row rank).

### 2.2 Overparametrization in kernel regression

To solve a kernel regression, one can use kernel gradient descent. For functions $f$ generated by the kernel $K$, applying the functional derivative $D \equiv \delta/\delta f(x')$ to the loss function gives

\begin{align}
L[f] &= \sum_{i=1}^N\|f(x_i)-y_i\|^2 + \lambda\|f\|^2_{\mathcal{H}_K} \\
D L[f] &= 2\sum_{i=1}^N(f(x_i)-y_i)D f + 2\lambda f
\end{align}

Using \ref{lincom_kernel}

\begin{align}
D f(x) &= \frac{\delta}{\delta f(x')} \int dx'' K(x,x'')f(x'') \\
&= \int dx'' K(x,x'') \frac{\delta f(x'')}{\delta f(x')} \\
&= \int dx'' K(x,x'')\,\delta(x''-x') \\
&= K(x,x')
\end{align}

So the gradient descent for kernel regression becomes

\begin{equation}
f(x) \leftarrow f(x) - 2\alpha \Big[\sum_{i=1}^N(f(x_i)-y_i)K(x_i,x) + \lambda f(x)\Big] \label{GD_kernel}\tag{Eq 2.4}
\end{equation}

This is equivalent to performing gradient descent in the feature space. If one expresses $f(x)=\sum_{i=1}^N \alpha_i K(x,x_i)$ (these coefficients $\alpha_i$ are not to be confused with the learning rate $\alpha$) and initializes the $\alpha_i$ at zero, i.e. $f^{(0)}(x)=0$, least-squares kernel gradient descent with $\lambda=0$ converges to the minimum-norm solution with respect to the RKHS norm.
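
In coefficient space the ridgeless ($\lambda=0$) kernel gradient descent update becomes a simple linear iteration. The sketch below assumes a Gaussian RBF kernel, toy data on well-separated points (so that the Gram matrix is well conditioned), and a step size `eta` tied to the largest eigenvalue of $K$; all of these are illustrative choices.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.5):
    """Gaussian RBF kernel matrix between two 1-D arrays of points."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 0.5, -1.0, 0.2])
K = rbf_kernel(x_train, x_train)

# With f = sum_i a_i K(., x_i), the functional update
#   f <- f - 2 * eta * sum_i (f(x_i) - y_i) K(x_i, .)
# is an update of the coefficient vector: a <- a - 2 * eta * (K a - y).
eta = 0.25 / np.linalg.eigvalsh(K).max()
a = np.zeros(len(x_train))            # f^(0) = 0
for _ in range(2000):
    residual = K @ a - y_train        # f(x_i) - y_i on the training set
    a -= 2.0 * eta * residual

a_exact = np.linalg.solve(K, y_train)  # coefficients of the kernel interpolant
```

Starting from `a = 0`, the iteration converges to the coefficients of the interpolant `a_exact`, so the fitted function passes through every training point.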

### 2.3 Overparametrization in neural networks

For a multi-layer, infinite-width NN with randomly initialized weights, the training problem can be cast as a ridgeless kernel regression problem with the Neural Tangent Kernel (NTK)

\begin{equation}
NTK(x,x') = \nabla_{\theta}f(x,\theta)\cdot\nabla_{\theta}f(x',\theta)
\end{equation}

where $f(x,\theta)$ is the scalar function learned by the NN.
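
For a finite network, the NTK can be evaluated directly from parameter gradients. The sketch below does this for a hypothetical two-layer network $f(x,\theta) = a\cdot\tanh(Wx)/\sqrt{m}$ with hand-written gradients; the architecture, the width `m`, the scaling and the function names are assumptions for illustration, and a finite-width kernel only approximates the infinite-width limit.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 512                       # input dimension, hidden width
W = rng.normal(size=(m, d))         # hidden-layer weights
a = rng.normal(size=m)              # output-layer weights

def param_gradient(x):
    """Gradient of f(x) = a . tanh(W x) / sqrt(m) w.r.t. all parameters."""
    h = np.tanh(W @ x)
    grad_a = h / np.sqrt(m)
    grad_W = ((a * (1.0 - h**2))[:, None] * x[None, :]) / np.sqrt(m)
    return np.concatenate([grad_a, grad_W.ravel()])

def empirical_ntk(xs):
    """Gram matrix NTK(x_i, x_j) = grad f(x_i) . grad f(x_j)."""
    J = np.stack([param_gradient(x) for x in xs])   # (n_points, n_params)
    return J @ J.T

xs = rng.normal(size=(4, d))        # a few hypothetical inputs
ntk = empirical_ntk(xs)
```

Because `ntk` is a Gram matrix of gradient vectors, it is symmetric and positive semi-definite by construction, which is what allows it to act as a kernel.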

It is suggested in [Ref 2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8131612/#MOESM1) that as the size of the training set grows, kernel regression fits successively higher spectral modes of the target function, where the spectrum is defined by solving an eigenfunction problem. Consequently, the theory can predict which kernels or neural architectures are well suited to a given task by studying the alignment of top kernel eigenfunctions with the target function for the task. Target functions that place most power in the top kernel eigenfunctions can be estimated accurately at small sample sizes, leading to good generalization.

#### Connection with kernel regression

It is shown that, during training, the evolution of an infinite-width NN $f(x,\theta)$ with random initialization is equivalent to performing gradient descent in ridgeless kernel regression with the NTK. At $t=0$, an infinite-width NN is a GP with zero mean and covariance $\Sigma(x,x')$, which can be derived from the data. During training, the mean and the covariance of the GP get updated. The mean converges to the estimator given by ridgeless kernel regression with the NTK, and the covariance is expressible in terms of the NTK and the initial GP covariance.

To illustrate, consider the loss function of an infinite width NN

\begin{equation}
L(\theta) = \frac{1}{N}\sum_{i=1}^N l(f(x_i,\theta),y_i) 
\end{equation}

The gradient of $L$, by the chain rule, becomes

\begin{equation}
\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^N \nabla_f l(f(x_i,\theta),y_i) \nabla_\theta f(x_i, \theta)
\end{equation}

The evolution of $f(x,\theta)$ can be described by the following differential equation

\begin{align}
\frac{df}{dt} &= \frac{df}{d\theta}\frac{d\theta}{dt} \\
&\approx -\frac{df}{d\theta}\nabla_\theta L(\theta) \\
&=-\frac{1}{N}\sum_{i=1}^N \big[\nabla_\theta f(x,\theta) \nabla_\theta f(x_i,\theta)\big]\nabla_f l(f(x_i,\theta),y_i) \\
&=-\frac{2}{N}\sum_{i=1}^N \big[\nabla_\theta f(x,\theta) \nabla_\theta f(x_i,\theta)\big](f(x_i,\theta)-y_i) \label{NN_DE_NTK}\tag{Eq 2.5}
\end{align}

where the second line assumes the weights $\theta$ are updated incrementally, so that their rate of change can be approximated by the negative gradient of the loss function (first order approximation), and the last line assumes the squared-error loss $l(f,y)=(f-y)^2$. This limit is valid as the width of the NN tends to infinity. In addition, the infinite-width limit leads to a stationary and deterministic NTK (constant over the training dynamics and independent of the random initialization) in \ref{NN_DE_NTK}. Comparing \ref{NN_DE_NTK} with \ref{GD_kernel}, one can see they are equivalent (up to a constant rescaling of the learning rate) when $\lambda = 0$ (ridgeless) and when the kernel is the NTK. This connects the training dynamics of an infinite-width NN with the evolution of a kernel-generated function under kernel gradient descent.
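
With a fixed kernel, the training dynamics restricted to the training points form a linear ODE with solution $f(t) = y + e^{-2\Theta t/N}(f(0)-y)$, where $\Theta$ is the kernel Gram matrix. The sketch below integrates the ODE with Euler steps and compares it with this closed form. A Gaussian RBF Gram matrix stands in for the NTK here, and the data, step size and zero initial output are likewise assumptions for illustration.

```python
import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 0.5, -1.0, 0.2])
N = len(x_train)

# Stand-in for the NTK Gram matrix on the training set (assumption).
diff = x_train[:, None] - x_train[None, :]
theta_mat = np.exp(-diff**2 / (2.0 * 0.5**2))

f0 = np.zeros(N)                 # network output at initialization (assumed zero)
dt, n_steps = 0.01, 2000         # Euler step size and number of steps
f_euler = f0.copy()
for _ in range(n_steps):
    # df/dt = -(2/N) * Theta @ (f - y)
    f_euler += dt * (-(2.0 / N) * theta_mat @ (f_euler - y_train))

# Closed form f(t) = y + exp(-2 Theta t / N) (f0 - y) via eigendecomposition.
t = dt * n_steps
evals, evecs = np.linalg.eigh(theta_mat)
decay = evecs @ np.diag(np.exp(-2.0 * evals * t / N)) @ evecs.T
f_closed = y_train + decay @ (f0 - y_train)
```

Each eigen-direction of the kernel decays at its own rate, so directions with large eigenvalues are fitted first and those with small eigenvalues last.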

The correspondence between \ref{NN_DE_NTK} and \ref{GD_kernel} implies that the NN function is linear in the RKHS induced by the NTK, i.e. it can be expressed as a linear combination of eigenfunctions of the neural tangent kernel. The loss function is thus convex in the coefficient space of the eigenfunctions, which guarantees that the optimization reaches the global minimum if solved by gradient descent with a small enough learning rate. Again, starting the GD from $f^{(0)}(x)=0$ guarantees reaching the minimum norm solution.

Note that casting the NN into the NTK Hilbert space makes the training/optimization a convex problem only when the width of the NN is infinite (so that the kernel converges to the fixed NTK).

It is shown in [Ref 3](https://proceedings.mlr.press/v139/yang21f/yang21f.pdf) that the deterministic and stationary properties of the NTK hold for any standard NN architecture (RNN, LSTM, GRU, ResNet, CNN, transformer). This property is said to be architecturally universal.


## 3. Manifold regularization

### Motivation

The geometry/distribution of the unlabelled data can encode information about the labels of the data. The kernel method can be used to build an RKHS that encodes information about the geometry of the data, thus introducing a bias into the supervised problem, which is trained on the limited labelled data.

### Formalism

The original RKHS $\mathcal{H}$ is deformed into $\tilde{ \mathcal{H}}$ in order to incorporate the geometric information of the data. The loss function with $L^2$ penalty is optimized over the $\tilde{ \mathcal{H}}$ space, and the $L^2$ norm is also computed in the $\tilde{ \mathcal{H}}$ space.

\begin{equation}
\min_{f\in \tilde{ \mathcal{H}}}\Big[ \sum_{i=1}^l L(y_i, f(x_i)) + \lambda \|f\|^2_{\tilde{ \mathcal{H}}}\Big] \label{lossfunc3}\tag{Eq 3.1}
\end{equation}

It can be shown that $\|\cdot\|_{\mathcal{H}}$ is related to $\|\cdot\|_{\tilde{\mathcal{H}}}$ via

\begin{equation}
\|f\|^2_{\tilde{\mathcal{H}}} = \|f\|^2_{\mathcal{H}} + \langle f |M|f \rangle \label{distortH1}\tag{Eq 3.2}
\end{equation}

where $M$ is a kernel that encodes the geometric information of the data. For example, given $N$ data points, $\langle f |M|f \rangle$ is written as

\begin{equation}
\langle f |M|f \rangle = \sum_{i,j=1}^N f(x_i)M_{ij}f(x_j) \label{distortH2}\tag{Eq 3.3}
\end{equation}

One of the choices for $M$ is the Laplacian $L=D-W$ of the graph built from the data, where $W$ is the matrix of edge weights (e.g. put a weight between two points inversely proportional to the distance between them) and $D$ is the diagonal degree matrix with $D_{ii}=\sum_j W_{ij}$. Using \ref{distortH1} and $M=L$, \ref{lossfunc3} becomes

\begin{align}
& \min_{f\in \mathcal{H}}\Big[ \frac{1}{l}\sum_{i=1}^l L(y_i, f(x_i)) + \lambda_A \|f\|^2_{\mathcal{H}} + \frac{\lambda_I}{N^2}\sum_{i,j=1}^N f(x_i)L_{ij}f(x_j)\Big] \\
=& \min_{f\in \mathcal{H}}\Big[ \frac{1}{l}\sum_{i=1}^l L(y_i, f(x_i)) + \lambda_A \|f\|^2_{\mathcal{H}} + \frac{\lambda_I}{2N^2}\sum_{i,j=1}^N W_{ij}(f(x_i)-f(x_j))^2\Big] \label{MR}\tag{Eq 3.4}
\end{align}

where $l$ is the number of labelled data points and $N$ is the total number of data points (labelled and unlabelled). $W_{ij}$ is the weight of the edge between nodes $i$ and $j$. The first term (the loss function) only involves labelled data, while the last term (which captures the geometric information of the data) involves all data. Intuitively, if two points $i$ and $j$ are close, $W_{ij}$ will be non-zero and any difference between the predictions $f(x_i)$ and $f(x_j)$ at the two points will be penalized. \ref{MR} is also known as **manifold regularization**.
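
The graph-Laplacian penalty can be checked numerically through the identity $\sum_{i,j} f_i L_{ij} f_j = \frac{1}{2}\sum_{i,j} W_{ij}(f_i-f_j)^2$, which holds for any symmetric weight matrix $W$. The sketch below uses random points, Gaussian edge weights and random predictions, all of which are assumptions for illustration (any symmetric non-negative weighting would do).

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))           # labelled + unlabelled inputs

# Symmetric edge weights from a Gaussian of the pairwise distance (assumption).
sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dist / 2.0)
np.fill_diagonal(W, 0.0)                   # no self-loops

D = np.diag(W.sum(axis=1))                 # degree matrix
L = D - W                                  # graph Laplacian

f = rng.normal(size=6)                     # hypothetical predictions f(x_i)
quad_form = f @ L @ f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

The two quantities agree, and the pairwise form makes it explicit that the penalty is non-negative and grows when strongly connected points receive different predictions.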

# NLP/DL Pillar foundational knowledge


## DL

### Theory
- Overparametrization
- Transfer learning
- Regularizations
- Inductive bias

### Practice
- PyTorch basics: backprop, graphs, detach, epoch etc
- Hugging Face

## NLP
- how a transformer works
- how to train a transformer
- applications of transformer