🦅 Deep Learning awesome cheatsheet

Here are my personal deep learning notes. I've written this cheatsheet to keep track of my knowledge, but you can use it as a guide for learning deep learning as well.

  • 🗂 Dataset: Balance the data · Split in train and validation · Normalization · Data augmentation
  • 🧠 Model: Activation function · Weight initialization · Dropout · Batch normalization · Self-attention
  • 📉 Loss: Loss function · Weight penalty · Label tricks
  • 🔥 Train: Optimizer · Learning rate · Batch size · Num epochs
  • 🧐 Avoid overfitting (try in that order): 1. Get more data · 2. Data augmentation · 3. Regularization · 4. Reduce complexity · 5. Ensemble
  • 🕓 Train faster: Transfer learning · Batch normalization · Precomputation · Half precision · Multiple GPUs
  • 🤖 Applications (external repos): Vision · NLP · Audio · Tabular · Reinforcement learning
  • 🖥️ Computer: Hardware · Software · Jupyter · Kaggle

🗂 Dataset

Balance the data

  • Fix it in the dataloader with WeightedRandomSampler
  • Subsample the majority class. But you can lose important data.
  • Oversample the minority class. But you can overfit.
  • Weighted loss function: CrossEntropyLoss(weight=[…])
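
A minimal sketch (PyTorch) of the two dataloader/loss options above; the toy labels tensor is just for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

labels = torch.tensor([0, 0, 0, 0, 1, 0, 1, 0])            # toy imbalanced labels
class_counts = torch.bincount(labels).float()              # samples per class
class_weights = 1.0 / class_counts                         # rarer class -> bigger weight

# Option A: re-sample in the dataloader so each batch is roughly balanced
sample_weights = class_weights[labels]                     # one weight per sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(torch.randn(len(labels), 10), labels),
                    batch_size=4, sampler=sampler)

# Option B: keep the data as-is and weight the loss instead
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```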

Split in train and validation

  • Training set: used for learning the parameters of the model.
  • Validation set: used for evaluating the model while training. Don’t create a random validation set! Manually create one so that it matches the distribution of your data. Usually 10% or 20% of your train set.
    • N-fold cross-validation. Usually 10.
  • Test set: used to get a final estimate of how well the network works.
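
A minimal sketch of a stratified split, assuming scikit-learn is available; `X` and `y` here are toy arrays standing in for your real data. A stratified split is one simple way to keep the validation class distribution close to the train one.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.random.randn(100, 5)                  # toy features
y = np.random.randint(0, 2, size=100)        # toy binary labels

# 80/20 split that preserves the class proportions
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# N-fold cross-validation variant (here N=10)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    pass  # train on X[train_idx], evaluate on X[valid_idx]
```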

Normalization

Scale the inputs to have mean 0 and variance 1. Linear decorrelation/whitening/PCA also helps a lot. Normalization parameters are obtained from the train set only, and then applied to both train and validation sets.

  • Option 1: Standardization x = (x - x.mean()) / x.std() Most used (see the sketch after this list)
    1. Mean subtraction: Center the data around zero. x = x - x.mean() Fights vanishing and exploding gradients.
    2. Standardize: Put the data on the same scale. x = x / x.std() Improves convergence speed and accuracy.
  • Option 2: PCA Whitening
    1. Mean subtraction: Center the data around zero. x = x - x.mean()
    2. Decorrelation or PCA: Rotate the data until there is no correlation anymore.
    3. Whitening: Put the data on the same scale. whitened = decorrelated / np.sqrt(eigVals + 1e-5)
  • Option 3: ZCA whitening Zero component analysis (ZCA).
  • Other options not used:
    • (x-x.min()) / (x.max()-x.min()): Values from 0 to 1
    • 2*(x-x.min()) / (x.max()-x.min()) - 1: Values from -1 to 1
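
A minimal sketch of Option 1 (standardization), illustrating the point above: the mean/std are computed on the train set only and then reused on the validation set.

```python
import numpy as np

x_train = np.random.randn(1000, 3) * 5 + 10      # toy data
x_valid = np.random.randn(200, 3) * 5 + 10

mean = x_train.mean(axis=0)                      # statistics from the train set only
std = x_train.std(axis=0) + 1e-8                 # epsilon avoids division by zero

x_train_norm = (x_train - mean) / std            # mean 0, variance 1
x_valid_norm = (x_valid - mean) / std            # same parameters reused on validation
```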

Data augmentation

Todo

🧠 Model

Activation function

reference

  • Softmax: Single-label classification (last layer)
  • Sigmoid: Multi-label classification (last layer)
  • Hyperbolic tangent:
  • ReLU: Non-linearity component of the net (hidden layers) check this paper
  • ELU: Exponential Linear Unit. paper
  • SELU: Scaled Exponential Linear Unit. paper
  • PReLU or Leaky ReLU:
  • SERLU:
  • Smoother ReLUs. Differentiable. BEST (see the sketch after this list)
    • GeLU: Gaussian Error Linear Units. Used in transformers. paper. (2016)
    • Swish: x * sigmoid(x) paper (2017)
    • Elish: xxxx paper (2018)
    • Mish: x * tanh( ln(1 + e^x) ) paper (2019)
    • myActFunc 1 = 0.5 * x * ( tanh(x) + 1 )
    • myActFunc 2 = 0.5 * x * ( tanh (x+1) + 1)
    • myActFunc 3 = x * ((x+x+1)/(abs(x+1) + abs(x)) * 0.5 + 0.5)
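
A minimal sketch of two of the smooth ReLU variants above as PyTorch modules (recent PyTorch versions also ship nn.GELU, nn.SiLU (Swish) and nn.Mish directly).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)              # x * sigmoid(x)

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))     # x * tanh(ln(1 + e^x))

x = torch.linspace(-3, 3, 7)
print(Swish()(x), Mish()(x))
```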

Weight initialization

Depends on the model's architecture. Try to avoid vanishing or exploding outputs. blog1, blog2.

  • Constant value: Very bad
  • Random:
    • Uniform: From 0 to 1. Or from -1 to 1. Bad
    • Normal: Mean 0, std=1. Better
  • Xavier initialization: Good for MLPs with tanh activation func. paper
    • Uniform:
    • Normal:
  • Kaiming initialization: Good for MLPs with ReLU activation func. (a.k.a. He initialization) paper
    • Uniform
    • Normal
    • When you use Kaiming, you have to replace ReLU(x) with max(x, 0) - 0.5 to keep the mean at 0 (see the sketch after this list)
  • Delta-Orthogonal initialization: Good for vanilla CNNs (10000 layers). Read this paper
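
A minimal sketch of applying Kaiming (He) initialization to the linear and convolutional layers of a PyTorch model; zeroing the biases is just a common convention, not something prescribed above.

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) normal init for layers followed by ReLU
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
model.apply(init_weights)        # applies init_weights to every submodule
```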

Dropout

📉 Loss

Loss function

  • Regression
    • MBE: Mean Bias Error: mean(GT - pred) It can tell whether the model has a positive or negative bias.
    • MAE: Mean Absolute Error (L1 loss): mean(|GT - pred|) The simplest.
    • MSE: Mean Squared Error (L2 loss): mean((GT-pred)²) Penalizes large errors more than MAE. Most used
    • RMSE: Root Mean Squared Error: sqrt(MSE) Monotonic in MSE, with values on the same scale as MAE.
    • Percentage errors:
      • MAPE: Mean Absolute Percentage Error
      • MSPE: Mean Squared Percentage Error
      • RMSPE: Root Mean Squared Percentage Error
  • Classification
    • Cross Entropy: Single-label classification. Usually with softmax. nn.CrossEntropyLoss.
      • NLL: Negative Log Likelihood is the simplified version for one-hot encoded targets; see this nn.NLLLoss()
    • Binary Cross Entropy: Multi-label classification. Usually with sigmoid. nn.BCELoss
    • Hinge: Multi class SVM Loss nn.HingeEmbeddingLoss()
    • Focal loss: Similar to BCE but scaled down, so the network focuses more on incorrect and low confidence labels than on increasing its confidence in the already correct labels. -(1-p)^gamma * log(p) paper
  • Segmentation
    • Pixel-wise cross entropy
    • IoU (F0): (Pred ∩ GT)/(Pred ∪ GT) = TP / (TP + FP + FN)
    • Dice (F1): 2·(Pred ∩ GT)/(Pred + GT) = 2·TP / (2·TP + FP + FN) (see the sketch after this list)
      • Range from 0 (worst) to 1 (best)
      • In order to formulate a loss function which can be minimized, we'll simply use 1 − Dice
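
A minimal sketch of a soft Dice loss (1 − Dice) for binary segmentation, as described above; `pred` is assumed to be probabilities after a sigmoid and `target` a binary mask of the same shape.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    intersection = (pred * target).sum()
    dice = (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1 - dice                               # minimize 1 - Dice

pred = torch.sigmoid(torch.randn(2, 1, 8, 8))     # toy predicted probabilities
target = (torch.rand(2, 1, 8, 8) > 0.5).float()   # toy ground-truth mask
print(dice_loss(pred, target))
```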

Classification Metrics

Dataset with 5 disease images and 20 normal images: if the model predicts all images to be normal, its accuracy is 80%, and the F1-score of such a model is 0.88 (checked in the sketch after this list).

  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • F1 Score: 2 * (Prec*Rec)/(Prec+Rec)
    • Precision: TP / (TP + FP) = TP / predicted positives
    • Recall: TP / (TP + FN) = TP / actual positives
  • Dice Score: 2 * (Pred ∩ GT)/(Pred + GT)
  • ROC, AUC:
  • Log loss:
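
A quick check of the disease/normal example above: treating "normal" as the positive class reproduces the 80% accuracy and ≈0.88 F1.

```python
# Model predicts all 25 images as "normal"; "normal" is the positive class.
tp, fp, fn, tn = 20, 5, 0, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.80
precision = tp / (tp + fp)                          # 0.80
recall = tp / (tp + fn)                             # 1.00
f1 = 2 * precision * recall / (precision + recall)  # 0.888...
print(accuracy, precision, recall, f1)
```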

Label Tricks

  • Label Smoothing: Smooth the one-hot target label
  • Mixup: Combines pairs of examples and their labels.
    • Merge 2 samples in 1: x_mixed = λxᵢ + (1−λ)xⱼ and y_mixed = λyᵢ + (1−λ)yⱼ
    • Fast.ai doc
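
A minimal sketch of mixup on one batch, under the usual formulation where λ is sampled from a Beta(α, α) distribution and applied to both the inputs and the one-hot labels; α=0.4 is just an illustrative value.

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    lam = np.random.beta(alpha, alpha)       # mixing coefficient λ
    index = torch.randperm(x.size(0))        # random pairing inside the batch
    x_mixed = lam * x + (1 - lam) * x[index]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[index]
    return x_mixed, y_mixed

x = torch.randn(8, 3, 32, 32)                # toy images
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
x_mixed, y_mixed = mixup_batch(x, y)
```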

🔥 Train

Learning Rate

How big the steps are during training.

  • Max LR: Compute it with LR Finder (lr_find())
  • LR schedule:
    • Constant: Never use.
    • Reduce it gradually: By steps, by a decay factor, with LR annealing, etc.
      • Flat + Cosine annealing: Flat start, and then at 50%-75%, start dropping the lr based on a cosine anneal.
    • Warm restarts (SGDWR, AdamWR):
    • OneCycle: Use LRFinder to know your maximum lr. Good for Adam.
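
A minimal sketch of the OneCycle schedule using PyTorch's built-in OneCycleLR, assuming `max_lr` was picked with an LR finder; the scheduler is stepped after every batch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
steps_per_epoch, epochs = 100, 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, steps_per_epoch=steps_per_epoch, epochs=epochs)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        # forward / backward would go here
        optimizer.step()
        scheduler.step()          # OneCycle is stepped after every batch
```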

Batch size

Number of samples to learn simultaneously.

  • Batch size = 1: Train each sample individually. (Online gradient descent) ❌
  • Batch size = length(dataset): Train the whole dataset at once, as a batch. (Batch gradient descent) ❌
  • Batch size = number: Train disjoint groups of samples (Mini-batch gradient descent). ✅
    • Usually a power of 2. 32 or 64 are good values.
    • Too low (like 4): lots of updates. Very noisy, random updates in the net (bad).
    • Too high (like 512): few updates. Very general, averaged updates (bad).
      • Faster computation. Takes advantage of GPU memory. But sometimes it may not fit (CUDA Out Of Memory).

Some people are trying to make a batch size finder according to this paper.

Number of epochs

Times to learn the whole dataset.

  • Train until the model starts overfitting, i.e. the validation loss begins to increase (early stopping).
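
A minimal sketch of early stopping with a patience counter; the per-epoch validation losses here are fake numbers just to make the loop runnable.

```python
# Toy validation losses; in practice they come from evaluating on the validation set.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]
best_loss, patience, bad_epochs = float('inf'), 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0        # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # validation loss keeps increasing
            print(f'early stop at epoch {epoch}')
            break
```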

Optimizer

Gradient Descent methods. reference:

| Optimizer | Description | Paper | Score |
|---|---|---|---|
| SGD | Basic method. A bit slow to get to the optimum. | | |
| SGD with Momentum | Speeds it up with momentum, usually mom=0.9 | | |
| AdaGrad | Adaptive lr | 2011 | |
| RMSProp | Similar to momentum but with the gradient squared. | 2012 | |
| Adam | Combination of Momentum with RMSProp. | 2014 | |
| LARS | Layer-wise Adaptive Rate Scaling. | 2017 | |
| AMSGrad | Worse than Adam in practice. (AdamX: new version) | 2018 | |
| AdamW | | 2018 | |
| LAMB | LARS improvement. | 2019 | |
| NovoGrad | | 2019 | |
| Lookahead | Is like having a buddy system to explore the loss. | 2019 | |
| RAdam | Rectified Adam. Stabilizes training at the start. | 2019 | |
| Ranger | RAdam + Lookahead. | 2019 | ⭐⭐⭐ |
| RangerLars | RAdam + Lookahead + LARS. | 2019 | |
| Ralamb | RAdam + LARS. | 2019 | |
| Selective-Backprop | Faster training by focusing on the biggest losers. | 2019 | |
| DiffGrad | Solves Adam’s "overshoot" issue. | 2019 | |
| AdaMod | A new deep learning optimizer with memory. | 2019 | |
  • SGD: new_w = w - lr * gradient_w
  • SGD with Momentum: Usually mom=0.9.
    • mom=0.9 means 10% is the normal derivative and 90% is the same direction I went last time.
    • velocity = (0.9 * velocity) + (0.1 * gradient_w); new_w = w - lr * velocity
    • Other common values are 0.5, 0.7 and 0.99.
  • RMSProp (Adaptive lr) From 2012. Similar to momentum but with the gradient squared.
    • sq_avg = (0.9 * sq_avg) + (0.1 * gradient_w²); new_w = w - lr * gradient_w / (sqrt(sq_avg) + eps)
    • If the gradient is not so volatile, take greater steps. Otherwise, take smaller steps.
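
A minimal sketch (NumPy) of the update rules above on a toy 1-D problem f(w) = w², using the exponential-moving-average formulation (10% new gradient, 90% history).

```python
import numpy as np

lr, mom = 0.1, 0.9
w_sgd = w_mom = w_rms = 5.0
velocity, sq_avg = 0.0, 0.0
grad = lambda w: 2 * w                            # gradient of f(w) = w²

for step in range(50):
    # Plain SGD: step against the raw gradient
    w_sgd -= lr * grad(w_sgd)
    # Momentum: keep a running direction (90% previous direction, 10% new gradient)
    velocity = mom * velocity + (1 - mom) * grad(w_mom)
    w_mom -= lr * velocity
    # RMSProp: divide by the running RMS of the gradient
    # (stable gradients -> bigger steps, volatile gradients -> smaller steps)
    sq_avg = mom * sq_avg + (1 - mom) * grad(w_rms) ** 2
    w_rms -= lr * grad(w_rms) / (np.sqrt(sq_avg) + 1e-8)

print(w_sgd, w_mom, w_rms)                        # all approach the minimum at 0
```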

TODO: Read:

🧐 Improve generalization
and avoid overfitting

(try in that order)

  1. Get more data
    • Similar datasets: Get a similar dataset for your problem.
    • Create your own dataset
      • Segmentation annotation with Polygon-RNN++
    • Synthetic data: Virtual objects and scenes instead of real images. Infinite possibilities of lighting, colors, angles...
  2. Data augmentation: Augment your current data (e.g. albumentations for fast augmentation).
    • Test time augmentation (TTA): The same augmentations are also applied when we are predicting (inference). It can improve our results if we run inference multiple times for each sample and average out the predictions.
    • AutoAugment: RL for data augmentation. Transfer learning of NOT THE WEIGHTS but the policies of how to do data augmentation.
  3. Regularization (see the sketch after this list)
    • Dropout. Usually 0.5
    • Weight penalty: Regularization in the loss function (penalizes high weights). Usually 0.0005
      • L1 regularization: penalizes the sum of absolute weights.
      • L2 regularization: penalizes the sum of squared weights by a factor, usually 0.01 or 0.1.
      • Weight decay: wd * w. Sometimes mathematically identical to L2 reg.
  4. Reduce model complexity: Limit the number of hidden layers and the number of units per layer.
    • Generalizable architectures?: Add more batchnorm layers, more densenets...
  5. Ensembles: Gather a bunch of models to give a final prediction. kaggle ensembling guide
    • Combination methods:
      • Ensembling: Merge final output (average, weighted average, majority vote, weighted majority vote).
      • Meta ensembling: Same but use a new model to produce the final output. (also called stacking or blending)
    • Models generation techniques:
      • Stacking: Just use different classifier algorithms.
      • Bagging (Bootstrap aggregating): Each model trained with a subset of the training data. Used in random forests. Probability of a sample being selected: 0.632; probability of being out of bag: 0.368.
      • Boosting: The predictors are not made independently, but sequentially. Used in gradient boosting.
      • Snapshot Ensembling: Only for neural nets. M models for the cost of 1. Thanks to SGD with restarts you have several local minimum that you can average. paper.
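
A minimal sketch of the regularization options from point 3: dropout inside the model, decoupled weight decay in the optimizer, and an explicit L1 penalty added to the loss; the 1e-4 L1 factor is just an illustrative value.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(16, 20), torch.randint(0, 2, (16,))
loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())   # L1: sum of absolute weights
loss = loss + 1e-4 * l1_penalty
loss.backward()
optimizer.step()
```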

Other tricks:

  • Label Smoothing: Smooth the one-hot target label
  • Knowledge Distillation: A bigger trained net (teacher) helps the network paper

🕓 Train faster

  • Transfer learning: Use a pretrained model and retrain with your data (see the sketch after this list).
    1. Replace last layer
    2. Fine-tune new layers
    3. Fine-tune more layers (optional)
  • Batch Normalization: Add BatchNorm layers after your convolutions and linear layers to make things easier for your net and train faster.
  • Precomputation
    1. Freeze the layers you don’t want to modify
    2. Calculate the activations of the last frozen layer (for your entire dataset)
    3. Save those activations to disk
    4. Use those activations as the input of your trainable layers
  • Half precision (fp16)
  • Multiple GPUs
  • 2nd order optimization
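
A minimal sketch of steps 1-2 of transfer learning, assuming torchvision is available: load a pretrained ResNet-18, freeze it, and replace the last layer (older torchvision versions use pretrained=True instead of the weights argument).

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1')   # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                    # freeze everything

model.fc = nn.Linear(model.fc.in_features, 10)     # 1. replace last layer (new head is trainable)

# 2. fine-tune: give the optimizer only the parameters that require gradients
trainable = [p for p in model.parameters() if p.requires_grad]
```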

Normalization inside network:

  • Batch Normalization paper
  • Layer Normalization paper
  • Instance Normalization paper
  • Group Normalization paper

Trick: Knowledge Distillation

A teacher model teaches a student model.

  • Smaller student model → faster model.
    • Model compression: Less memory and computation
    • To generalize and avoid outliers.
    • Used in NLP transformers.
    • paper
  • Bigger student model → more accurate model.
    • Useful when you have extra unlabeled data (kaggle competitions)
    • 1. Train the teacher model with labeled dataset.
    • 2. With the extra unlabeled dataset, generate pseudo labels (soft or hard labels)
    • 3. Train a student model on both labeled and pseudo-labeled datasets.
    • 4. Student becomes teacher and repeat -> 2.
    • Paper: When Does Label Smoothing Help?
    • Paper: Noisy Student
    • Video: Noisy Student
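
A minimal sketch of a common distillation loss (not necessarily the exact one from the papers above): KL divergence between the teacher's and student's softened logits, mixed with the usual cross entropy on the hard labels; the temperature T and weight alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)   # T² keeps gradient magnitudes comparable
    # Hard targets: normal cross entropy on the real labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```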

Supervised DL

  • Structured
    • Tabular
    • Collaborative filtering: When you have users and items. Useful for recommendation systems.
    • Time series
      • Arimax
      • IoT sensors
    • Geospatial: Do Kaggle course
  • Unstructured
    • Vision: Image, Video. Check my vision repo
    • Audio: Sound, music, speech. Check my audio repo. Audio overview
    • NLP: Text, Genomics. Check my NLP repo
    • Knowledge Graph (KG): Graph Neural Networks (GNN)
    • Trees
      • math expressions
      • syntax
      • Models: Tree-LSTM, RNNGrammar (RNNG).
      • Tree2seq by Polish notation. Question: only for binary trees?

Autoencoder

  • Standard autoencoders: Made to reconstruct the input. No continuous latent space.
    • Simple Autoencoder: Same input and output net with a smaller middle hidden layer (bottleneck layer, latent vector).
    • Denoising Autoencoder (DAE): Adds noise to the input to learn how to remove noise.
    • They only have a reconstruction loss (pixel mean squared error, for example).
  • Variational Autoencoder (VAE): Initially trained as a reconstruction problem, but later we can play with the latent vector to generate new outputs. The latent space needs to be continuous.
    • Latent vector: Is modified by adding gaussian noise (normal distribution, mean and std vectors) during training.
    • Loss: loss = reconstruction loss + latent loss
      • Reconstruction loss: Keeps the output similar to the input (mean squared error)
      • Latent loss: Keeps the latent space continuous (KL divergence)
    • Disentangled Variational Autoencoder (β-VAE): Improved version. Each parameter of the latent vector is devoted to tweaking 1 characteristic. paper.
      • β too small: Overfitting. It learns to reconstruct your training data, but it won't generalize.
      • β too big: Loses high-definition details. Worse performance.
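
A minimal sketch of the VAE loss described above: a reconstruction term (MSE) plus the KL divergence that keeps the latent space close to a standard normal; a β-VAE would simply weight the KL term by β.

```python
import torch

def vae_loss(recon, x, mu, log_var):
    recon_loss = torch.nn.functional.mse_loss(recon, x, reduction='sum')
    # KL( N(mu, sigma) || N(0, 1) ), summed over the latent dimensions
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_loss                   # β-VAE: recon_loss + beta * kl_loss

x = recon = torch.rand(4, 784)                    # toy input and "reconstruction"
mu, log_var = torch.zeros(4, 16), torch.zeros(4, 16)
print(vae_loss(recon, x, mu, log_var))
```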

Graph Neural Networks

Semi-supervised DL

Check this kaggle discussion

Reinforcement Learning

Reinforcement learning reference


Resources


Antor TODO

Automatic feature engineering

How to start a competition/ML project

  1. Data exploration: how is the data that we are going to work with?
  2. Think about the input representation
    • Is it redundant?
    • Does it need to be converted to something else?
    • Keep as much entropy as possible, so that the raw data could be reconstructed from it.
  3. Look at the metric
    • Does it make sense?
    • Is it differentiable?
    • Can I build a good enough differentiable equivalent of the metric?
  4. Build a toy model and overfit it with 1 or a few samples
    • To make sure that nothing is really broken

JPEG: 2 levels of compression:

  • Entropy
  • Chroma

LIDAR

  • Projections: bad representation (complicated things with voxels)
  • Dense matrix (antor): it's a depth map, I think. Not a projection; the native output of the sensor, but condensed in a dense matrix.

Unordered set (point cloud, molecules)

  • Point net
  • transformer without positional encoding
    • AtomTransformer (by antor)
    • MoleculeTransformer (by antor)

TODO
