### Supervised learning

#### 5.1.1 Nearest neighbors
- When k is too small, the classifier behaves erratically, i.e., it overfits the training data. As k gets larger, the classifier underfits (over-smooths) the data, resulting in the shrinkage of the two smaller regions.

- The number of nearest neighbors to use, k, is a hyperparameter for this algorithm

- The [Fast Library for Approximate Nearest Neighbors (FLANN)](https://github.com/flann-lib/flann) collects a number of previously developed algorithms and is incorporated as part of OpenCV

- More recently (2021), the GPU-enabled [Faiss library](https://github.com/facebookresearch/faiss) was developed for scaling similarity search to billions of vectors
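
The effect of k described above can be illustrated with a brute-force classifier. This is only a minimal sketch in plain Python: the function name `knn_predict` and the toy data (including one deliberately mislabeled point) are made up for illustration, and libraries such as FLANN or Faiss replace the exhaustive distance sort used here with much faster approximate search.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query (exhaustive search).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: class "a" near the origin, class "b" near (5, 5).
# The last point sits inside the "a" region but carries a (noisy) "b" label.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (0.4, 0.4)]
train_y = ["a", "a", "a", "a", "b", "b", "b", "b"]

query = (0.5, 0.5)
pred_k1 = knn_predict(train_x, train_y, query, k=1)  # follows the noisy neighbor
pred_k5 = knn_predict(train_x, train_y, query, k=5)  # the noisy label is outvoted
```

With k = 1 the prediction copies the single mislabeled neighbor, while k = 5 smooths it away; pushing k much higher would eventually let the larger class dominate everywhere.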

#### 5.1.3 Logistic regression
- For the binary classification task, let ti ∈ {0, 1} be the class label for each training sample xi and pi = p(C0|xi) be the estimated likelihood predicted by the model for a given weight and bias (w, b). We can maximize the likelihood of the correct labels being predicted by minimizing the negative log likelihood, i.e., the cross-entropy loss or error function E(w, b) = −Σi [ti log pi + (1 − ti) log(1 − pi)]

- To determine the best set of weights and biases, we can use gradient descent

- Logistic regression does have some limitations, which is why it is often used for only the simplest classification tasks
    - If the classes in feature space are not linearly separable, using simple projections onto weight vectors may not produce adequate decision surfaces. In this case, kernel methods, which measure the distances between new (test) feature vectors and select training examples, can often provide good solutions.
    - Another problem with logistic regression is that if the classes actually are separable, there can be more than one separating plane
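
As a rough illustration of fitting (w, b) by gradient descent on the cross-entropy loss, the sketch below uses plain Python and full-batch updates. The function names, the learning rate, the step count, and the toy one-dimensional data are all assumptions chosen for the example; here t = 1 marks the class whose probability p the model predicts.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Estimated probability of the class labeled t = 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(xs, ts, w, b):
    """Mean negative log likelihood of the labels."""
    total = 0.0
    for x, t in zip(xs, ts):
        p = predict(x, w, b)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(xs)

def train_logistic(xs, ts, lr=0.1, steps=5000):
    """Fit weights and bias by batch gradient descent on the cross-entropy loss."""
    w, b, n = [0.0] * len(xs[0]), 0.0, len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(xs, ts):
            err = predict(x, w, b) - t  # derivative of the loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Overlapping classes in 1D, so the optimal weights stay finite.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ts = [0, 0, 1, 0, 1, 1]
loss_before = cross_entropy(xs, ts, [0.0], 0.0)
w, b = train_logistic(xs, ts)
loss_after = cross_entropy(xs, ts, w, b)
```

The decision surface here is a single threshold on x, which is exactly the kind of linear boundary that fails when the classes are not linearly separable.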

#### 5.1.4 Support vector machines
- How can we choose the best decision surface, keeping in mind that we only have a limited number of training examples? The answer to this problem is to use maximum margin classifiers. The maximum margin classifier provides our best bet for correctly classifying as many unseen examples as possible

- What happens if the two classes are not linearly separable and in fact require a complex curved surface to correctly classify samples? In this case, we can replace linear regression with kernel regression

- Support vector machines can also be applied to overlapping (mixed) class distributions
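
To make the maximum margin idea concrete, the sketch below scores a few candidate separating planes by their smallest distance to any training sample and keeps the one with the largest such margin. The candidate planes and the helper `min_margin` are hypothetical and chosen by hand; an actual support vector machine finds the plane by solving an optimization problem rather than by enumeration.

```python
import math

def min_margin(w, b, xs, ts):
    """Smallest signed distance from any sample to the plane w.x + b = 0.

    Labels are +1 / -1; the result is negative if any sample is misclassified.
    """
    norm = math.hypot(*w)
    return min(t * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, t in zip(xs, ts))

xs = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
ts = [-1, -1, -1, 1, 1, 1]

# Three hand-picked planes (w, b); every one of them separates the classes.
candidates = [((1, 1), -2.0), ((1, 0), -2.0), ((1, 1), -3.5)]
margins = [min_margin(w, b, xs, ts) for w, b in candidates]
best = candidates[margins.index(max(margins))]
```

All three planes classify the training set perfectly, which is the ambiguity noted for logistic regression; the margin criterion is what singles one of them out.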

#### 5.1.5 Decision trees and forests
- Decision trees perform a sequence of simpler operations, often just looking at individual feature elements before deciding which element to look at next

- A random forest is created by building a set of decision trees, each of which makes
slightly different decisions. At test time, a new sample is classified by each of
the trees in the random forest, and the class distributions at the final leaf nodes are averaged
to provide an answer that is more accurate than could be obtained with a single tree

- Random forests have several design parameters:
    - the depth of each tree D
    - the number of trees T 
    - the number of samples examined at node construction time ρ

### Unsupervised learning

- Examples of unsupervised learning in computer vision include image segmentation and face and body recognition and reconstruction

#### 5.2.1 Clustering
- In data clustering, algorithms can link clusters together based on the distance between their closest points (single-link clustering) or their farthest points (complete-link clustering)

- Mean-shift and mode finding techniques, such as k-means and mixtures of Gaussians, model the feature vectors associated with each pixel (e.g., color and position) as samples from an unknown probability density function and then try to find clusters (modes) in this distribution.

#### 5.2.2 K-means and Gaussian mixture models
- K-means
    - The algorithm is given the number of clusters k it is supposed to find and is initialized by randomly sampling k centers from the input feature vectors.
    - It then iteratively updates the cluster center location based on the samples that are closest to each center 

- Gaussian mixture models:
    - Each cluster center is augmented by a covariance matrix whose values are re-estimated from the corresponding samples
    - Instead of using nearest neighbors to associate input samples with cluster centers, a Mahalanobis distance is used
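
The two alternating k-means steps can be sketched in a few lines of plain Python. The function name, the `init_centers` argument (added so the example is reproducible), and the toy points are assumptions for illustration; by default the centers are sampled from the inputs as described above. A mixture of Gaussians would additionally re-estimate a covariance per cluster and replace the Euclidean distance with a Mahalanobis distance, which is not shown here.

```python
import math
import random

def kmeans(points, k, iters=20, init_centers=None, seed=0):
    """Plain k-means: alternate nearest-center assignment and mean updates."""
    # Initialize by sampling k input vectors unless centers are supplied.
    if init_centers:
        centers = list(init_centers)
    else:
        centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every sample to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned samples.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centers = kmeans(points, k=2, init_centers=[points[0], points[-1]])
```

The result depends on the initialization: with an unlucky random draw, even this tiny data set can settle into a poor local minimum, which is why k-means is commonly restarted several times.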

#### 5.2.3 Principal component analysis
- PCA was originally used in computer vision for modeling faces, i.e., eigenfaces, initially for gray-scale images, and then for 3D models and active appearance models

#### 5.2.4 Manifold learning
- In many cases, the data we are analyzing does not reside in a globally linear subspace, but
does live on a lower-dimensional manifold. In this case, non-linear dimensionality reduction
can be used 

- Since these systems extract lower-dimensional manifolds in a higher-dimensional space, they are also known as manifold learning techniques

- In addition to dimensionality reduction, which can be useful for regularizing data and
accelerating similarity search, manifold learning algorithms can be used for visualizing input data distributions or neural network layer activations

#### 5.2.5 Semi-supervised learning
- If only a small number of examples are labeled with the correct class, we can still imagine extending these labels (inductively) to nearby samples, thereby not only labeling all of the data but also constructing appropriate decision surfaces for future inputs. This area of study is called semi-supervised learning

- In general, it comes in two varieties
    - transductive learning: the goal is to classify all of the unlabeled inputs that are given as one batch at the same time as the labeled examples

    - inductive learning: we train a machine learning system that will classify all future inputs

- Semi-supervised learning is a subset of the larger class of weakly supervised learning problems
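
A deliberately naive way to picture the extension of labels to nearby samples in the transductive setting is to copy labels outward one point at a time. The scheme and the name `propagate_labels` are assumptions made for this sketch, not a method taken from the notes; it needs at least one labeled point and uses `None` for unlabeled entries.

```python
import math

def propagate_labels(points, labels):
    """Repeatedly copy a label to the unlabeled point closest to any labeled point."""
    labels = list(labels)
    while None in labels:
        best = None  # (distance, index of unlabeled point, label to copy)
        for i, label_i in enumerate(labels):
            if label_i is not None:
                continue
            for j, label_j in enumerate(labels):
                if label_j is None:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, label_j)
        labels[best[1]] = best[2]
    return labels

# Two groups on a line; only one point in each group carries a label.
points = [(0,), (1,), (2,), (3,), (10,), (11,), (12,)]
labels = ["a", None, None, None, None, None, "b"]
full_labels = propagate_labels(points, labels)
```

This labels the given batch only; an inductive method would go on to fit a classifier that can also handle inputs it has never seen.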

### Deep neural networks

#### 5.3.1 Weights and layers

- Deep neural networks (DNNs) are feedforward computation graphs composed of thousands of simple interconnected “neurons” (units), which, much like logistic regression, perform weighted sums of their inputs, followed by a non-linear activation function re-mapping

- A layer in which a full (dense) weight matrix is used for the linear combination is called
a fully connected (FC) layer

- A network that consists only of fully connected (and no convolutional) layers is now often called a multi-layer perceptron (MLP).

#### 5.3.2 Activation functions

- For the final layer in networks used for classification, the softmax function is normally used to convert from real-valued activations to class likelihoods
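
A minimal sketch of the softmax mapping from real-valued activations to class likelihoods follows. Subtracting the maximum activation before exponentiating is a common numerical safeguard added here as an assumption; it does not change the result.

```python
import math

def softmax(activations):
    """Convert real-valued activations into probabilities that sum to one."""
    m = max(activations)  # shift for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Larger activations receive larger probabilities, and the ordering of the inputs is preserved in the outputs.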

#### 5.3.3 Regularization and normalization
- Regularization and weight decay

    - As the weights are being optimized inside a neural network, penalty terms on the weights (e.g., quadratic or p-norm) make the weights smaller, so this kind of regularization is also known as weight decay

- Dataset augmentation: Another powerful technique to reduce over-fitting is to add more training samples by perturbing the inputs and/or outputs of the samples that have already been collected.

- Dropout
    - Dropout is a regularization technique where at each mini-batch during training, some percentage p (say 50%) of the units in each layer are clamped to zero
    - Randomly setting units to zero injects noise into the training process which can help reduce overfitting and improve generalization.

- Batch normalization
    - The idea behind batch normalization is to re-scale (and recenter) the activations at a given unit, using statistics accumulated over the samples in a mini-batch, so that they have zero mean and unit variance.

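    - Since the publication of the seminal paper by Ioffe and Szegedy (2015), a number of variants have been developed that accumulate the statistics over different subsets of activations instead of over the samples in a mini-batch. These subsets include: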

        ![](./images/i5.png)
        - all the activations in a layer, which is called layer normalization
        - all the activations in a given convolutional output channel (see Section 5.4), which is called instance normalization
        - different sub-groups of output channels, which is called group normalization
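
Dropout and batch normalization can both be sketched on plain lists of activations. Two simplifications are assumed here and are not taken from the notes: the dropout function rescales the surviving units by 1/(1 − p) (the common "inverted dropout" convention), and the normalization omits the learned scale and shift parameters as well as the running statistics used at test time.

```python
import random
import statistics

def dropout(activations, p=0.5, rng=random):
    """Zero each unit with probability p and rescale the survivors by 1 / (1 - p)."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

def batch_norm(batch, eps=1e-5):
    """Normalize one unit's activations across a mini-batch to zero mean, unit variance."""
    mean = statistics.fmean(batch)
    var = statistics.pvariance(batch, mean)
    return [(a - mean) / (var + eps) ** 0.5 for a in batch]

dropped = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

Applying the same normalization across all the units of a single sample, rather than across the samples of a mini-batch, gives the statistic used by layer normalization.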

#### 5.3.4 Loss functions

- While loss functions are traditionally applied to supervised learning tasks, where the correct label or target value tn is given for each input, it is also possible to use loss functions in an
unsupervised setting

- To train with a contrastive loss, you can run both inputs of each pair through the neural network, compute the loss, and then backpropagate the gradients through both instantiations (activations) of the network

- It is also possible to construct a triplet loss that takes as input a pair of matching samples and a third non-matching sample and ensures that the distance between non-matching samples is greater than the distance between matches plus some margin

- Weight initialization:
    - Before we can start optimizing the weights in our network, we must first initialize them.
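
The contrastive and triplet losses mentioned above can be written down for embedding vectors that have already been computed. The exact forms below (Euclidean distance, squared hinge for non-matching pairs, margin of 1) are one common choice and are assumptions of this sketch rather than the formulas from the notes.

```python
import math

def contrastive_loss(a, b, is_match, margin=1.0):
    """Pull matching pairs together; push non-matching pairs beyond the margin."""
    d = math.dist(a, b)
    return d ** 2 if is_match else max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the non-matching distance exceeds the matching distance plus the margin."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

satisfied = triplet_loss((0, 0), (0, 1), (0, 5))    # negative is far enough away
violated = triplet_loss((0, 0), (0, 1), (0, 1.5))   # negative is too close
```

In training, the embeddings would come from the network, and the gradient of the loss would flow back through every copy of the network that produced them.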

#### 5.3.5 Backpropagation

- Once we have set up our neural network by deciding on the number of layers, their widths and depths, adding some regularization terms, defining the loss function, and initializing the weights, we are ready to train the network with our sample data
    - To do this, we compute the derivatives (gradients) of the loss function En for training sample n with respect to the weights w using the chain rule

    - Backpropagation needs the activations computed during the forward pass, and modern neural networks may have millions of units and hence activations. The number of activations that need to be stored can be reduced by only storing them at certain layers and then re-computing the rest as needed, which goes under the name gradient checkpointing
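
The chain rule computation can be followed by hand on a toy network with one hidden unit, y = w2 · tanh(w1 · x), and a squared error loss. The network, the variable names, and the numbers are assumptions chosen for this sketch; the finite-difference values at the end are only there to check the analytic gradients.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation, needed again in the backward pass
    y = w2 * h             # network output
    return h, y

def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backward(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dE_dy = y - t                       # derivative of the squared error
    dE_dw2 = dE_dy * h                  # y = w2 * h
    dE_dh = dE_dy * w2                  # propagate back through the output layer
    dE_dw1 = dE_dh * (1 - h ** 2) * x   # tanh'(a) = 1 - tanh(a)^2 and a = w1 * x
    return dE_dw1, dE_dw2

x, t = 2.0, 1.0
w1, w2 = 0.5, -1.5
g1, g2 = backward(w1, w2, x, t)

# Central finite differences as an independent check of the analytic gradients.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
```

Note that the backward pass reuses the hidden activation `h`; this is the kind of stored value that gradient checkpointing trades for recomputation.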

#### 5.3.6 Training and optimization
- What we need at this point is some algorithm to turn these gradients into weight updates that will optimize the loss function and produce a network that generalizes well to new, unseen data.

- stochastic gradient descent (SGD):
    - In SGD, instead of evaluating the loss function by summing over all the training samples, we evaluate just a single training sample n and compute the derivatives of the associated loss
    - In practice, the directions obtained from just a single sample are incredibly noisy estimates of a good descent direction, so the losses and gradients are usually summed over a small subset of the training data, which is called a minibatch

- The step size parameter α is often called the learning rate and must be carefully adjusted
to ensure good progress while avoiding overshooting and exploding gradients.
    - In practice, it is common to start with a larger (but still small) learning rate αt and to decrease it over time so that the optimization settles into a good minimum 

- Regular gradient descent is prone to stalling when the current solution reaches a “flat spot” in the search space, and stochastic gradient descent only pays attention to the errors in the current minibatch. For these reasons, SGD algorithms may use momentum, an exponentially decaying (leaky) running average of the gradients

- A number of more sophisticated optimization techniques have been applied to deep network training:
    - Nesterov momentum, where the gradient is (effectively) computed at the state predicted from the velocity update

    - AdaGrad (Adaptive Gradient), where each component of the gradient is divided by the square root of the running sum of its squared gradients

    - RMSProp, where the running sum of squared gradients is replaced with a leaky (decaying) sum 

    - Adadelta

    - Adam, which combines elements of all the previous ideas into a single framework and also de-biases the initial leaky estimates

    - AdamW, which is Adam with decoupled weight decay

- Adam and AdamW are currently the most popular optimizers for deep networks
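
To see how gradients are turned into weight updates, the sketch below runs momentum SGD and Adam on a one-dimensional toy loss E(w) = (w − 3)². The toy loss, the hyperparameter values, and the optional `weight_decay` argument (which applies the decoupled decay used by AdamW) are assumptions chosen for illustration.

```python
import math

def grad(w):
    """Gradient of the toy loss E(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=200, lr=0.1, rho=0.9):
    v = 0.0
    for _ in range(steps):
        v = rho * v + grad(w)  # leaky running sum of gradients (the velocity)
        w -= lr * v
    return w

def adam(w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # leaky mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # leaky mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # de-bias the initial estimates
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        w -= lr * weight_decay * w           # decoupled weight decay (AdamW style)
    return w

w_sgd = sgd_momentum(0.0)
w_adam = adam(0.0)
```

Both runs start at w = 0 and end close to the minimum at w = 3; in a real network `grad` would be the minibatch gradient produced by backpropagation.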

### Convolutional neural networks

- It is also possible to use 1 × 1 convolutions, which do not actually perform convolutions but rather combine various channels on a per-pixel basis, often with the goal of reducing the dimensionality of the feature space.

- To fully determine the behavior of a convolutional layer, we still need to specify a few additional parameters. These include:
    - Padding
    - Stride: The default stride for convolution is 1 pixel, but it is also possible to only evaluate the convolution at every nth column and row
    - Dilation: Extra “space” (skipped rows and columns) can be inserted between pixel samples during convolution, which is known as dilated (or à trous) convolution
    - Grouping: By default, all input channels are used to produce each output channel, but we can also group the input and output layers into G separate groups, each of which is convolved separately. When G equals the number of input channels, each input channel is convolved independently of the others, which is known as depthwise or channel-separated convolution
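
The bookkeeping implied by these parameters can be sketched with three small helpers. The function names are made up for this example, and the output-size formula is the usual one for a convolution with symmetric zero padding (the same formula is used, for instance, in the PyTorch documentation).

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Number of output samples along one axis of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1  # dilation spreads the taps apart
    return (n + 2 * padding - effective_kernel) // stride + 1

def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Weights in a kernel x kernel convolution (biases excluded)."""
    return kernel * kernel * (c_in // groups) * c_out

def conv1x1(pixel, weights):
    """1 x 1 convolution at one pixel: each output channel mixes the input channels."""
    return [sum(w * c for w, c in zip(row, pixel)) for row in weights]

same_size = conv_output_size(32, kernel=3, padding=1)           # padding keeps the size
halved = conv_output_size(32, kernel=3, stride=2, padding=1)    # stride 2 halves it
regular = conv_weight_count(64, 64, kernel=3)                   # G = 1
depthwise = conv_weight_count(64, 64, kernel=3, groups=64)      # G = number of input channels
mixed = conv1x1([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]])  # 3 channels -> 2
```

The weight counts show why grouped and depthwise convolutions are attractive for small networks: with 64 channels, the depthwise layer uses 64 times fewer weights than the regular one.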

#### 5.4.1 Pooling and unpooling
![](./images/i6.png)
- While unpooling can be used to (approximately) reverse the effect of a max pooling operation, if we want to reverse a convolutional layer, we can look at learned variants of the interpolation operator, commonly known as transposed convolution

![](./images/i7.png)

- U-Nets and Feature Pyramid Networks
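
One way to picture why unpooling only approximately reverses max pooling is a one-dimensional sketch that records where each maximum came from. Unpooling from recorded maximum locations is just one of several variants and is assumed here for illustration; the function names and the signal are made up.

```python
def max_pool(signal, size=2):
    """Return the pooled values and the index at which each maximum was found."""
    pooled, switches = [], []
    for start in range(0, len(signal) - size + 1, size):
        window = signal[start:start + size]
        offset = max(range(size), key=window.__getitem__)
        pooled.append(window[offset])
        switches.append(start + offset)
    return pooled, switches

def max_unpool(pooled, switches, length):
    """Put each pooled value back at its recorded position; other entries stay zero."""
    out = [0.0] * length
    for value, index in zip(pooled, switches):
        out[index] = value
    return out

signal = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5]
pooled, switches = max_pool(signal)
restored = max_unpool(pooled, switches, len(signal))
```

The restored signal has the maxima in the right places but zeros everywhere else, since the non-maximal values were discarded by the pooling step.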

#### 5.4.3 Network architectures

- Mobile networks: MobileNet, MobileNetV2, ShuffleNet, ShuffleNetV2, ESPNet, ESPNetV2

#### 5.4.4 Model zoos

- [torchvision models zoo](https://github.com/pytorch/vision/tree/main/torchvision/models)
- [PyTorch Image Models library](https://github.com/rwightman/pytorch-image-models)

- Neural Architecture Search (NAS)
    - One of the most recent trends in neural network design is the use of Neural Architecture Search (NAS) algorithms to try different network topologies and parameterizations 
    - This process is more efficient (in terms of a researcher’s time) than the trial-and-error approach that characterized earlier network design
- [A useful guide for training neural networks, which may help avoid common issues](http://karpathy.github.io/2019/04/25/recipe)

#### 5.4.5 Visualizing weights and activations
- OpenAI also recently released a great interactive tool called [Microscope](https://microscope.openai.com/models) which allows people to visualize the significance of every neuron in many common neural networks.

#### 5.4.7 Self-supervised learning
- The idea of training on one task and then using the learning on another is called transfer
learning, while the process of modifying the final network to its intended application and
statistics is called domain adaptation.

- The central idea in self-supervised learning is to create a supervised pretext task where
the labels can be automatically derived from unlabeled images

- Contrastive losses are a useful tool in metric learning, since they encourage distances
in an embedding space to be small for semantically related inputs

- An alternative is to use deep clustering to similarly encourage related inputs to produce similar outputs

- One final variant on self-supervised learning is using a student-teacher model, where the
teacher network is used to provide training examples to a student network. These kinds
of architectures were originally called model compression and knowledge distillation

- VISSL provides open-source PyTorch implementations of many state-of-the-art [self-supervised learning models](https://github.com/facebookresearch/vissl) (with weights)
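
As an example of a pretext task whose labels come for free, the sketch below rotates an image by multiples of 90 degrees and uses the number of turns as the label. Rotation prediction is only one possible pretext task, picked here for illustration, and the tiny list-of-lists "image" and function names are assumptions of the sketch.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_pretext_samples(image):
    """Return (rotated_image, label) pairs; the label is the number of 90-degree turns."""
    samples = []
    current = image
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples

image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)
```

A network trained to predict the label from the rotated image never sees a human annotation, yet the features it learns can then be transferred to other tasks.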

### More complex models