🦅 Deep Learning awesome cheatsheet

Here are my personal deep learning notes. I've written this cheatsheet to keep track of my knowledge, but you can use it as a guide for learning deep learning as well.

  • 🗂 Dataset: Balance the data · Split in train and validation · Normalization · Data augmentation
  • 🧠 Model: Activation function · Weight initialization · Dropout · Batch normalization · Self-attention
  • 📉 Loss: Loss function · Weight penalty · Label tricks
  • 🔥 Train: Optimizer · Learning rate · Batch size · Num epochs
  • 🧐 Avoid overfitting (try in that order): 1. Get more data · 2. Data augmentation · 3. Regularization · 4. Reduce complexity · 5. Ensemble
  • 🕓 Train faster: Transfer learning · Batch normalization · Precomputation · Half precision · Multiple GPUs
  • 🤖 Applications (external repos): Vision · NLP · Audio · Tabular · Reinforcement learning
  • 🖥️ Computer: Hardware · Software · Jupyter · Kaggle

🗂 Dataset

Balance the data

  • Fix it in the dataloader with WeightedRandomSampler
  • Subsample the majority class. But you can lose important data.
  • Oversample the minority class. But you can overfit.
  • Weighted loss function: CrossEntropyLoss(weight=[…])
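
A minimal sketch (PyTorch) of the two dataloader/loss options above; the toy labels tensor is just for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

labels = torch.tensor([0, 0, 0, 0, 1, 0, 1, 0])            # toy imbalanced labels
class_counts = torch.bincount(labels).float()              # samples per class
class_weights = 1.0 / class_counts                         # rarer class -> bigger weight

# Option A: re-sample in the dataloader so each batch is roughly balanced
sample_weights = class_weights[labels]                     # one weight per sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(torch.randn(len(labels), 10), labels),
                    batch_size=4, sampler=sampler)

# Option B: keep the data as-is and weight the loss instead
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```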

Split in train and validation

  • Training set: used for learning the parameters of the model.
  • Validation set: used for evaluating the model while training. Don’t create a random validation set! Manually create one so that it matches the distribution of your data. Usually 10% or 20% of your train set.
    • N-fold cross-validation. Usually 10.
  • Test set: used to get a final estimate of how well the network works.
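
A minimal sketch of a stratified split, assuming scikit-learn is available; `X` and `y` here are toy arrays standing in for your real data. A stratified split is one simple way to keep the validation class distribution close to the train one.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.random.randn(100, 5)                  # toy features
y = np.random.randint(0, 2, size=100)        # toy binary labels

# 80/20 split that preserves the class proportions
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# N-fold cross-validation variant (here N=10)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    pass  # train on X[train_idx], evaluate on X[valid_idx]
```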

Normalization

Scale the inputs to have mean 0 and variance 1. Linear decorrelation/whitening/PCA also helps a lot. Normalization parameters are obtained from the train set only, and then applied to both train and validation sets.

  • Option 1: Standardization x = (x - x.mean()) / x.std() Most used (see the sketch after this list)
    1. Mean subtraction: Center the data around zero. x = x - x.mean() Fights vanishing and exploding gradients.
    2. Standardize: Put the data on the same scale. x = x / x.std() Improves convergence speed and accuracy.
  • Option 2: PCA Whitening
    1. Mean subtraction: Center the data around zero. x = x - x.mean()
    2. Decorrelation or PCA: Rotate the data until there is no correlation anymore.
    3. Whitening: Put the data on the same scale. whitened = decorrelated / np.sqrt(eigVals + 1e-5)
  • Option 3: ZCA whitening Zero component analysis (ZCA).
  • Other options not used:
    • (x-x.min()) / (x.max()-x.min()): Values from 0 to 1
    • 2*(x-x.min()) / (x.max()-x.min()) - 1: Values from -1 to 1
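
A minimal sketch of Option 1 (standardization), illustrating the point above: the mean/std are computed on the train set only and then reused on the validation set.

```python
import numpy as np

x_train = np.random.randn(1000, 3) * 5 + 10      # toy data
x_valid = np.random.randn(200, 3) * 5 + 10

mean = x_train.mean(axis=0)                      # statistics from the train set only
std = x_train.std(axis=0) + 1e-8                 # epsilon avoids division by zero

x_train_norm = (x_train - mean) / std            # mean 0, variance 1
x_valid_norm = (x_valid - mean) / std            # same parameters reused on validation
```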

Data augmentation

Todo

🧠 Model

Activation function

reference

  • Softmax: Single-label classification (last layer)
  • Sigmoid: Multi-label classification (last layer)
  • Hyperbolic tangent:
  • ReLU: Non-linearity component of the net (hidden layers) check this paper
  • ELU: Exponential Linear Unit. paper
  • SELU: Scaled Exponential Linear Unit. paper
  • PReLU or Leaky ReLU:
  • SERLU:
  • Smoother ReLUs. Differentiable. BEST (see the sketch after this list)
    • GeLU: Gaussian Error Linear Units. Used in transformers. paper. (2016)
    • Swish: x * sigmoid(x) paper (2017)
    • Elish: xxxx paper (2018)
    • Mish: x * tanh( ln(1 + e^x) ) paper (2019)
    • myActFunc 1 = 0.5 * x * ( tanh(x) + 1 )
    • myActFunc 2 = 0.5 * x * ( tanh (x+1) + 1)
    • myActFunc 3 = x * ((x+x+1)/(abs(x+1) + abs(x)) * 0.5 + 0.5)
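
A minimal sketch of two of the smooth ReLU variants above as PyTorch modules (recent PyTorch versions also ship nn.GELU, nn.SiLU (Swish) and nn.Mish directly).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)              # x * sigmoid(x)

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))     # x * tanh(ln(1 + e^x))

x = torch.linspace(-3, 3, 7)
print(Swish()(x), Mish()(x))
```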

Weight initialization

Depends on the model's architecture. Try to avoid vanishing or exploding outputs. blog1, blog2.

  • Constant value: Very bad
  • Random:
    • Uniform: From 0 to 1. Or from -1 to 1. Bad
    • Normal: Mean 0, std=1. Better
  • Xavier initialization: Good for MLPs with tanh activation func. paper
    • Uniform:
    • Normal:
  • Kaiming initialization: Good for MLPs with ReLU activation func. (a.k.a. He initialization) paper
    • Uniform
    • Normal
    • When you use Kaiming, you have to replace ReLU(x) with max(x, 0) - 0.5 to keep the mean at 0 (see the sketch after this list)
  • Delta-Orthogonal initialization: Good for vanilla CNNs (10000 layers). Read this paper
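
A minimal sketch of applying Kaiming (He) initialization to the linear and convolutional layers of a PyTorch model; zeroing the biases is just a common convention, not something prescribed above.

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) normal init for layers followed by ReLU
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
model.apply(init_weights)        # applies init_weights to every submodule
```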

Dropout

📉 Loss

Loss function

  • Regression
    • MBE: Mean Bias Error: mean(GT - pred) It can tell whether the model has a positive or negative bias.
    • MAE: Mean Absolute Error (L1 loss): mean(|GT - pred|) The simplest.
    • MSE: Mean Squared Error (L2 loss): mean((GT-pred)²) Penalizes large errors more than MAE. Most used
    • RMSE: Root Mean Squared Error: sqrt(MSE) Monotonic in MSE, with values on the same scale as MAE.
    • Percentage errors:
      • MAPE: Mean Absolute Percentage Error
      • MSPE: Mean Squared Percentage Error
      • RMSPE: Root Mean Squared Percentage Error
  • Classification
    • Cross Entropy: Single-label classification. Usually with softmax. nn.CrossEntropyLoss.
      • NLL: Negative Log Likelihood is the simplified version for one-hot encoded targets; see this nn.NLLLoss()
    • Binary Cross Entropy: Multi-label classification. Usually with sigmoid. nn.BCELoss
    • Hinge: Multi class SVM Loss nn.HingeEmbeddingLoss()
    • Focal loss: Similar to BCE but scaled down, so the network focuses more on incorrect and low confidence labels than on increasing its confidence in the already correct labels. -(1-p)^gamma * log(p) paper
  • Segmentation
    • Pixel-wise cross entropy
    • IoU (F0): (Pred ∩ GT)/(Pred ∪ GT) = TP / (TP + FP + FN)
    • Dice (F1): 2·(Pred ∩ GT)/(Pred + GT) = 2·TP / (2·TP + FP + FN) (see the sketch after this list)
      • Range from 0 (worst) to 1 (best)
      • In order to formulate a loss function which can be minimized, we'll simply use 1 − Dice
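
A minimal sketch of a soft Dice loss (1 − Dice) for binary segmentation, as described above; `pred` is assumed to be probabilities after a sigmoid and `target` a binary mask of the same shape.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    intersection = (pred * target).sum()
    dice = (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1 - dice                               # minimize 1 - Dice

pred = torch.sigmoid(torch.randn(2, 1, 8, 8))     # toy predicted probabilities
target = (torch.rand(2, 1, 8, 8) > 0.5).float()   # toy ground-truth mask
print(dice_loss(pred, target))
```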

Classification Metrics

Dataset with 5 disease images and 20 normal images: if the model predicts all images to be normal, its accuracy is 80%, and the F1-score of such a model is 0.88 (checked in the sketch after this list).

  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • F1 Score: 2 * (Prec*Rec)/(Prec+Rec)
    • Precision: TP / (TP + FP) = TP / predicted positives
    • Recall: TP / (TP + FN) = TP / actual positives
  • Dice Score: 2 * (Pred ∩ GT)/(Pred + GT)
  • ROC, AUC:
  • Log loss:
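
A quick check of the disease/normal example above: treating "normal" as the positive class reproduces the 80% accuracy and ≈0.88 F1.

```python
# Model predicts all 25 images as "normal"; "normal" is the positive class.
tp, fp, fn, tn = 20, 5, 0, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.80
precision = tp / (tp + fp)                          # 0.80
recall = tp / (tp + fn)                             # 1.00
f1 = 2 * precision * recall / (precision + recall)  # 0.888...
print(accuracy, precision, recall, f1)
```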

Label Tricks

  • Label Smoothing: Smooth the one-hot target label
  • Mixup: Combines pairs of examples and their labels.
    • Merge 2 samples in 1: x_mixed = λxᵢ + (1−λ)xⱼ and y_mixed = λyᵢ + (1−λ)yⱼ
    • Fast.ai doc
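
A minimal sketch of mixup on one batch, under the usual formulation where λ is sampled from a Beta(α, α) distribution and applied to both the inputs and the one-hot labels; α=0.4 is just an illustrative value.

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    lam = np.random.beta(alpha, alpha)       # mixing coefficient λ
    index = torch.randperm(x.size(0))        # random pairing inside the batch
    x_mixed = lam * x + (1 - lam) * x[index]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[index]
    return x_mixed, y_mixed

x = torch.randn(8, 3, 32, 32)                # toy images
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
x_mixed, y_mixed = mixup_batch(x, y)
```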

🔥 Train

Learning Rate

How big the steps are during training.

  • Max LR: Compute it with LR Finder (lr_find())
  • LR schedule:
    • Constant: Never use.
    • Reduce it gradually: By steps, by a decay factor, with LR annealing, etc.
      • Flat + Cosine annealing: Flat start, and then at 50%-75%, start dropping the lr based on a cosine anneal.
    • Warm restarts (SGDWR, AdamWR):
    • OneCycle: Use LRFinder to know your maximum lr. Good for Adam.
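
A minimal sketch of the OneCycle schedule using PyTorch's built-in OneCycleLR, assuming `max_lr` was picked with an LR finder; the scheduler is stepped after every batch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
steps_per_epoch, epochs = 100, 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, steps_per_epoch=steps_per_epoch, epochs=epochs)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        # forward / backward would go here
        optimizer.step()
        scheduler.step()          # OneCycle is stepped after every batch
```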

Batch size

Number of samples to learn simultaneously.

  • Batch size = 1: Train each sample individually. (Online gradient descent) ❌
  • Batch size = length(dataset): Train the whole dataset at once, as a batch. (Batch gradient descent) ❌
  • Batch size = number: Train disjoint groups of samples (Mini-batch gradient descent). ✅
    • Usually a power of 2. 32 or 64 are good values.
    • Too low (like 4): lots of updates. Very noisy, random updates in the net (bad).
    • Too high (like 512): few updates. Very general, averaged updates (bad).
      • Faster computation. Takes advantage of GPU memory. But sometimes it may not fit (CUDA Out Of Memory).

Some people are trying to make a batch size finder according to this paper.

Number of epochs

Times to learn the whole dataset.

  • Train until the model starts overfitting, i.e. the validation loss begins to increase (early stopping).
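
A minimal sketch of early stopping with a patience counter; the per-epoch validation losses here are fake numbers just to make the loop runnable.

```python
# Toy validation losses; in practice they come from evaluating on the validation set.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]
best_loss, patience, bad_epochs = float('inf'), 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0        # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # validation loss keeps increasing
            print(f'early stop at epoch {epoch}')
            break
```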

Optimizer

Gradient Descent methods. reference:

| Optimizer | Description | Paper | Score |
|---|---|---|---|
| SGD | Basic method. A bit slow to get to the optimum. | | |
| SGD with Momentum | Speeds it up with momentum, usually mom=0.9 | | |
| AdaGrad | Adaptive lr | 2011 | |
| RMSProp | Similar to momentum but with the gradient squared. | 2012 | |
| Adam | Combination of Momentum with RMSProp. | 2014 | |
| LARS | Layer-wise Adaptive Rate Scaling. | 2017 | |
| AMSGrad | Worse than Adam in practice. (AdamX: new version) | 2018 | |
| AdamW | | 2018 | |
| LAMB | LARS improvement. | 2019 | |
| NovoGrad | | 2019 | |
| Lookahead | Is like having a buddy system to explore the loss. | 2019 | |
| RAdam | Rectified Adam. Stabilizes training at the start. | 2019 | |
| Ranger | RAdam + Lookahead. | 2019 | ⭐⭐⭐ |
| RangerLars | RAdam + Lookahead + LARS. | 2019 | |
| Ralamb | RAdam + LARS. | 2019 | |
| Selective-Backprop | Faster training by focusing on the biggest losers. | 2019 | |
| DiffGrad | Solves Adam’s "overshoot" issue. | 2019 | |
| AdaMod | A new deep learning optimizer with memory. | 2019 | |
  • SGD: new_w = w - lr * gradient_w
  • SGD with Momentum: Usually mom=0.9.
    • mom=0.9 means 10% is the normal derivative and 90% is the same direction I went last time.
    • velocity = (0.9 * velocity) + (0.1 * gradient_w); new_w = w - lr * velocity
    • Other common values are 0.5, 0.7 and 0.99.
  • RMSProp (Adaptive lr) From 2012. Similar to momentum but with the gradient squared.
    • sq_avg = (0.9 * sq_avg) + (0.1 * gradient_w²); new_w = w - lr * gradient_w / (sqrt(sq_avg) + eps)
    • If the gradient is not so volatile, take greater steps. Otherwise, take smaller steps.
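
A minimal sketch (NumPy) of the update rules above on a toy 1-D problem f(w) = w², using the exponential-moving-average formulation (10% new gradient, 90% history).

```python
import numpy as np

lr, mom = 0.1, 0.9
w_sgd = w_mom = w_rms = 5.0
velocity, sq_avg = 0.0, 0.0
grad = lambda w: 2 * w                            # gradient of f(w) = w²

for step in range(50):
    # Plain SGD: step against the raw gradient
    w_sgd -= lr * grad(w_sgd)
    # Momentum: keep a running direction (90% previous direction, 10% new gradient)
    velocity = mom * velocity + (1 - mom) * grad(w_mom)
    w_mom -= lr * velocity
    # RMSProp: divide by the running RMS of the gradient
    # (stable gradients -> bigger steps, volatile gradients -> smaller steps)
    sq_avg = mom * sq_avg + (1 - mom) * grad(w_rms) ** 2
    w_rms -= lr * grad(w_rms) / (np.sqrt(sq_avg) + 1e-8)

print(w_sgd, w_mom, w_rms)                        # all approach the minimum at 0
```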

TODO: Read:

🧐 Improve generalization
and avoid overfitting

(try in that order)

  1. Get more data
    • Similar datasets: Get a similar dataset for your problem.
    • Create your own dataset
      • Segmentation annotation with Polygon-RNN++
    • Synthetic data: Virtual objects and scenes instead of real images. Infinite possibilities of lighting, colors, angles...
  2. Data augmentation: Augment your current data (e.g. albumentations for fast augmentation).
    • Test time augmentation (TTA): The same augmentations are also applied when we are predicting (inference). It can improve our results if we run inference multiple times for each sample and average out the predictions.
    • AutoAugment: RL for data augmentation. Transfer learning of NOT THE WEIGHTS but the policies of how to do data augmentation.
  3. Regularization (see the sketch after this list)
    • Dropout. Usually 0.5
    • Weight penalty: Regularization in the loss function (penalizes high weights). Usually 0.0005
      • L1 regularization: penalizes the sum of absolute weights.
      • L2 regularization: penalizes the sum of squared weights by a factor, usually 0.01 or 0.1.
      • Weight decay: wd * w. Sometimes mathematically identical to L2 reg.
  4. Reduce model complexity: Limit the number of hidden layers and the number of units per layer.
    • Generalizable architectures?: Add more batchnorm layers, more densenets...
  5. Ensembles: Gather a bunch of models to give a final prediction. kaggle ensembling guide
    • Combination methods:
      • Ensembling: Merge final output (average, weighted average, majority vote, weighted majority vote).
      • Meta ensembling: Same but use a new model to produce the final output. (also called stacking or blending)
    • Models generation techniques:
      • Stacking: Just use different classifier algorithms.
      • Bagging (Bootstrap aggregating): Each model trained with a subset of the training data. Used in random forests. Probability of a sample being selected: 0.632; probability of being out of bag: 0.368.
      • Boosting: The predictors are not made independently, but sequentially. Used in gradient boosting.
      • Snapshot Ensembling: Only for neural nets. M models for the cost of 1. Thanks to SGD with restarts you have several local minimum that you can average. paper.
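
A minimal sketch of the regularization options from point 3: dropout inside the model, decoupled weight decay in the optimizer, and an explicit L1 penalty added to the loss; the 1e-4 L1 factor is just an illustrative value.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(16, 20), torch.randint(0, 2, (16,))
loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())   # L1: sum of absolute weights
loss = loss + 1e-4 * l1_penalty
loss.backward()
optimizer.step()
```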

Other tricks:

  • Label Smoothing: Smooth the one-hot target label
  • Knowledge Distillation: A bigger trained net (teacher) helps the network paper

🕓 Train faster

  • Transfer learning: Use a pretrained model and retrain with your data (see the sketch after this list).
    1. Replace last layer
    2. Fine-tune new layers
    3. Fine-tune more layers (optional)
  • Batch Normalization: Add BatchNorm layers after your convolutions and linear layers to make things easier for your net and train faster.
  • Precomputation
    1. Freeze the layers you don’t want to modify
    2. Calculate the activations of the last frozen layer (for your entire dataset)
    3. Save those activations to disk
    4. Use those activations as the input of your trainable layers
  • Half precision (fp16)
  • Multiple GPUs
  • 2nd order optimization
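
A minimal sketch of steps 1-2 of transfer learning, assuming torchvision is available: load a pretrained ResNet-18, freeze it, and replace the last layer (older torchvision versions use pretrained=True instead of the weights argument).

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1')   # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                    # freeze everything

model.fc = nn.Linear(model.fc.in_features, 10)     # 1. replace last layer (new head is trainable)

# 2. fine-tune: give the optimizer only the parameters that require gradients
trainable = [p for p in model.parameters() if p.requires_grad]
```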

Normalization inside network:

  • Batch Normalization paper
  • Layer Normalization paper
  • Instance Normalization paper
  • Group Normalization paper

Trick: Knowledge Distillation

A teacher model teaches a student model.

  • Smaller student model → faster model.
    • Model compression: Less memory and computation
    • To generalize and avoid outliers.
    • Used in NLP transformers.
    • paper
  • Bigger student model → more accurate model.
    • Useful when you have extra unlabeled data (kaggle competitions)
    • 1. Train the teacher model with labeled dataset.
    • 2. With the extra unlabeled dataset, generate pseudo labels (soft or hard labels)
    • 3. Train a student model on both labeled and pseudo-labeled datasets.
    • 4. Student becomes teacher and repeat -> 2.
    • Paper: When Does Label Smoothing Help?
    • Paper: Noisy Student
    • Video: Noisy Student
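
A minimal sketch of a common distillation loss (not necessarily the exact one from the papers above): KL divergence between the teacher's and student's softened logits, mixed with the usual cross entropy on the hard labels; the temperature T and weight alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)   # T² keeps gradient magnitudes comparable
    # Hard targets: normal cross entropy on the real labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```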

Supervised DL

  • Structured
    • Tabular
    • Collaborative filtering: When you have users and items. Useful for recommendation systems.
    • Time series
      • Arimax
      • IoT sensors
    • Geospatial: Do Kaggle course
  • Unstructured
    • Vision: Image, Video. Check my vision repo
    • Audio: Sound, music, speech. Check my audio repo. Audio overview
    • NLP: Text, Genomics. Check my NLP repo
    • Knowledge Graph (KG): Graph Neural Networks (GNN)
    • Trees
      • math expressions
      • syntax
      • Models: Tree-LSTM, RNNGrammar (RNNG).
      • Tree2seq by Polish notation. Question: only for binary trees?

Autoencoder

  • Standard autoencoders: Made to reconstruct the input. No continuous latent space.
    • Simple Autoencoder: Same input and output net with a smaller middle hidden layer (bottleneck layer, latent vector).
    • Denoising Autoencoder (DAE): Adds noise to the input to learn how to remove noise.
    • They only have a reconstruction loss (pixel mean squared error, for example).
  • Variational Autoencoder (VAE): Initially trained as a reconstruction problem, but later we can play with the latent vector to generate new outputs. The latent space needs to be continuous.
    • Latent vector: Is modified by adding gaussian noise (normal distribution, mean and std vectors) during training.
    • Loss: loss = reconstruction loss + latent loss
      • Reconstruction loss: Keeps the output similar to the input (mean squared error)
      • Latent loss: Keeps the latent space continuous (KL divergence)
    • Disentangled Variational Autoencoder (β-VAE): Improved version. Each parameter of the latent vector is devoted to tweaking 1 characteristic. paper.
      • β too small: Overfitting. It learns to reconstruct your training data, but it won't generalize.
      • β too big: Loses high-definition details. Worse performance.
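
A minimal sketch of the VAE loss described above: a reconstruction term (MSE) plus the KL divergence that keeps the latent space close to a standard normal; a β-VAE would simply weight the KL term by β.

```python
import torch

def vae_loss(recon, x, mu, log_var):
    recon_loss = torch.nn.functional.mse_loss(recon, x, reduction='sum')
    # KL( N(mu, sigma) || N(0, 1) ), summed over the latent dimensions
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_loss                   # β-VAE: recon_loss + beta * kl_loss

x = recon = torch.rand(4, 784)                    # toy input and "reconstruction"
mu, log_var = torch.zeros(4, 16), torch.zeros(4, 16)
print(vae_loss(recon, x, mu, log_var))
```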

Graph Neural Networks

Semi-supervised DL

Check this kaggle discussion

Reinforcement Learning

Reinforcement learning reference


Resources


Antor TODO

Automatic feature engineering

How to start a competition/ML project

  1. Data exploration: how is the data that we are going to work with?
  2. Think about the input representation
    • Is it redundant?
    • Does it need to be converted to something else?
    • Keep as much entropy as possible, so that the raw data could be reconstructed from it.
  3. Look at the metric
    • Does it make sense?
    • Is it differentiable?
    • Can I build a good enough differentiable equivalent of the metric?
  4. Build a toy model and overfit it with 1 or a few samples
    • To make sure that nothing is really broken

JPEG: 2 levels of compression:

  • Entropy
  • Chroma

LIDAR

  • Projections: bad representation (complicated things with voxels)
  • Dense matrix (antor): it's a depth map, I think. Not a projection; the native output of the sensor, but condensed in a dense matrix.

Unordered set (point cloud, molecules)

  • Point net
  • transformer without positional encoding
    • AtomTransformer (by antor)
    • MoleculeTransformer (by antor)

TODO
