Concept Hierarchy in Neural Networks for NLP
Below is a list of important concepts in neural networks for NLP. In the annotations/ directory in this repository, we have examples of papers annotated with these concepts that you can peruse.
Annotation Criteria: For a particular paper, a concept should be annotated if it is important to understanding the proposed method or the evaluation. For example, if a proposed self-attention model is compared to a baseline that uses an LSTM, and the difference between these two methods is important to understanding the experimental results, then the LSTM concept should also be annotated. Concepts do not need to be annotated if they are mentioned only in passing or only in the related work section.
Implication: Some tags are listed as "XXX (implies YYY)", which means you need to understand concept YYY in order to understand concept XXX. If XXX exists in a paper, you do not need to annotate YYY separately.
Non-neural Papers: This conceptual hierarchy is for tagging papers that are about neural network models for NLP. If a paper is not fundamentally about some application of neural networks to NLP, it should be tagged with not-neural, and no other tags need to be applied.
Optimization/Learning
Optimizers and Optimization Techniques
- Mini-batch SGD: optim-sgd (see the update-rule sketch after this list)
- Adam: optim-adam (implies optim-sgd)
- Adagrad: optim-adagrad (implies optim-sgd)
- Adadelta: optim-adadelta (implies optim-sgd)
- Adam with Specialized Transformer Learning Rate ("Noam" Schedule): optim-noam (implies optim-adam)
- SGD with Momentum: optim-momentum (implies optim-sgd)
- AMSGrad: optim-amsgrad (implies optim-sgd)
- Projection / Projected Gradient Descent: optim-projection (implies optim-sgd)
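As a quick reference for the most basic of the tags above, here is a minimal NumPy sketch of the plain mini-batch SGD and momentum update rules. The toy quadratic loss, learning rate, and momentum coefficient are illustrative assumptions, not values from any particular paper.

```python
import numpy as np

def sgd_update(params, grads, lr=0.1):
    # Plain mini-batch SGD: step against the gradient of the
    # loss averaged over the current mini-batch.
    return params - lr * grads

def momentum_update(params, grads, velocity, lr=0.1, mu=0.9):
    # SGD with momentum: keep an exponentially decayed running sum
    # of past gradient steps and move along it.
    velocity = mu * velocity - lr * grads
    return params + velocity, velocity

# Illustrative usage on a toy quadratic loss 0.5 * ||params||^2,
# whose gradient is simply `params` itself.
params = np.array([1.0, -2.0])
velocity = np.zeros_like(params)
for _ in range(100):
    grads = params                      # gradient of the toy loss
    params, velocity = momentum_update(params, grads, velocity)
print(params)                           # close to the optimum at the origin
```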
Initialization
- Glorot/Xavier Initialization: init-glorot (see the sketch after this list)
- He Initialization: init-he
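Both schemes draw weights from a distribution whose scale depends on the layer's fan-in and fan-out. A minimal NumPy sketch, with the layer sizes chosen arbitrarily for illustration:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Glorot/Xavier initialization: uniform on (-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), keeping activation variance
    # roughly constant across tanh/linear layers.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng=np.random.default_rng(1)):
    # He initialization: zero-mean normal with variance 2 / fan_in,
    # intended for ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W_tanh = glorot_uniform(512, 256)   # e.g. a tanh hidden layer
W_relu = he_normal(512, 256)        # e.g. a ReLU hidden layer
```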
Regularization
- Dropout: reg-dropout (see the sketch after this list)
- Word Dropout: reg-worddropout (implies reg-dropout)
- Norm (L1/L2) Regularization: reg-norm
- Early Stopping: reg-stopping
- Patience: reg-patience (implies reg-stopping)
- Weight Decay: reg-decay
- Label Smoothing: reg-labelsmooth
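To make the first tag concrete, here is a minimal NumPy sketch of (inverted) dropout; word dropout applies the same masking idea to whole word embeddings rather than individual units. The drop probability is an illustrative assumption.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random.default_rng(0)):
    # Inverted dropout: zero each unit with probability p at training
    # time and rescale by 1/(1-p) so that expected activations match
    # the test-time behaviour (where the layer is the identity).
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))        # roughly half the units zeroed, the rest doubled
print(dropout(h, train=False))  # unchanged at test time
```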
Normalization
- Layer Normalization: norm-layer (see the sketch after this list)
- Batch Normalization: norm-batch
- Gradient Clipping: norm-gradient
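A minimal NumPy sketch of the first and last tags above: layer normalization over the feature dimension, and L2-norm gradient clipping. The epsilon and clipping threshold are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Layer normalization: normalize each example across its feature
    # dimension, then apply a learned per-feature gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def clip_gradients(grads, max_norm=5.0):
    # Gradient clipping: rescale all gradients if their joint L2 norm
    # exceeds max_norm.
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

x = np.random.default_rng(0).normal(size=(2, 8))
y = layer_norm(x, gain=np.ones(8), bias=np.zeros(8))
```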
Loss Functions (other than cross-entropy)
- Canonical Correlation Analysis (CCA): loss-cca
- Singular Value Decomposition (SVD): loss-svd
- Margin-based Loss Functions: loss-margin (see the sketch after this list)
- Contrastive Loss: loss-cons
- Noise Contrastive Estimation (NCE): loss-nce (implies loss-cons)
- Triplet Loss: loss-triplet (implies loss-cons)
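As one example of the margin-based family, here is a minimal NumPy sketch of a triplet loss with squared Euclidean distances; the margin value and random vectors are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Margin-based triplet loss: push the anchor at least `margin`
    # closer to the positive example than to the negative one.
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(4, 16)) for _ in range(3))
print(triplet_loss(a, p, n))
```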
Training Paradigms
- Multi-task Learning (MTL): train-mtl
- Multi-lingual Learning (MLL): train-mll (implies train-mtl)
- Transfer Learning: train-transfer
- Active Learning: train-active
- Data Augmentation: train-augment
- Curriculum Learning: train-curriculum
- Parallel Training: train-parallel
Sequence Modeling Architectures
Activation Functions
- Hyperbolic Tangent (tanh): activ-tanh
- Rectified Linear Units (ReLU): activ-relu
Pooling Operations
Recurrent Architectures
- Recurrent Neural Network (RNN): arch-rnn (see the sketch after this list)
- Bi-directional Recurrent Neural Network (Bi-RNN): arch-birnn (implies arch-rnn)
- Long Short-term Memory (LSTM): arch-lstm (implies arch-rnn)
- Bi-directional Long Short-term Memory (BiLSTM): arch-bilstm (implies arch-birnn, arch-lstm)
- Gated Recurrent Units (GRU): arch-gru (implies arch-rnn)
- Bi-directional Gated Recurrent Units (BiGRU): arch-bigru (implies arch-birnn, arch-gru)
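A minimal NumPy sketch of the recurrence these tags build on: an Elman-style RNN run forward over a sequence, plus the bi-directional trick of concatenating a backward pass. The dimensions and random weights are illustrative assumptions; LSTMs and GRUs replace the tanh update with gated updates.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # Minimal Elman RNN: at each step, mix the current input with the
    # previous hidden state through a tanh non-linearity.
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                       # xs: sequence of input vectors
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)            # one hidden state per token

rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 16, 5
xs = rng.normal(size=(T, d_in))
h_fwd = rnn_forward(xs, rng.normal(size=(d_hid, d_in)) * 0.1,
                    rng.normal(size=(d_hid, d_hid)) * 0.1, np.zeros(d_hid))
# A bi-directional RNN runs a second cell over the reversed sequence
# and concatenates the two state sequences.
h_bwd = rnn_forward(xs[::-1], rng.normal(size=(d_hid, d_in)) * 0.1,
                    rng.normal(size=(d_hid, d_hid)) * 0.1, np.zeros(d_hid))[::-1]
h_bi = np.concatenate([h_fwd, h_bwd], axis=-1)
```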
Other Sequential/Structured Architectures
- Bag-of-words, Bag-of-embeddings, Continuous Bag-of-words (BOW): arch-bow
- Convolutional Neural Networks (CNN): arch-cnn
- Attention: arch-att (see the sketch after this list)
- Self Attention: arch-selfatt (implies arch-att)
- Recursive Neural Network (RecNN): arch-recnn
- Tree-structured Long Short-term Memory (TreeLSTM): arch-treelstm (implies arch-recnn)
- Graph Neural Network (GNN): arch-gnn
- Graph Convolutional Neural Network (GCNN): arch-gcnn (implies arch-gnn)
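A minimal NumPy sketch of scaled dot-product attention and its self-attention special case, in which queries, keys, and values are all projections of the same sequence. The dimensions and projections are illustrative assumptions rather than any specific published parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    # Attention: each query attends over all keys, and the resulting
    # weights form a convex combination of the values.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))                       # 6 tokens, 32-dim states
Wq, Wk, Wv = (rng.normal(size=(32, 32)) * 0.1 for _ in range(3))
# Self-attention: queries, keys, and values all come from the same sequence X.
H = scaled_dot_attention(X @ Wq, X @ Wk, X @ Wv)
```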
Architectural Techniques
- Residual Connections (ResNet): arch-residual
- Gating Connections, Highway Connections: arch-gating
- Memory: arch-memo
- Copy Mechanism: arch-copy
- Bilinear, Biaffine Models: arch-bilinear
- Coverage Vectors/Penalties: arch-coverage
- Subword Units: arch-subword
- Energy-based, Globally-normalized Models: arch-energy
Standard Composite Architectures
- Transformer: arch-transformer (implies arch-selfatt, arch-residual, norm-layer, optim-noam)
Model Combination
- Ensembling: comb-ensemble
Search Algorithms
- Greedy Search: search-greedy
- Beam Search: search-beam (see the sketch after this list)
- A* Search: search-astar
- Viterbi Algorithm: search-viterbi
- Ancestral Sampling: search-sampling
- Gumbel Max: search-gumbel (implies search-sampling)
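A minimal sketch of beam search over a left-to-right model. The step_log_probs interface (prefix of token ids -> next-token log-probabilities) and the toy model are illustrative assumptions; setting beam_size=1 recovers greedy search.

```python
import numpy as np

def beam_search(step_log_probs, beam_size=3, max_len=10, eos=0):
    # Generic beam search: keep the beam_size highest-scoring prefixes,
    # expanding each unfinished one with its top next tokens.
    beams = [((), 0.0)]                       # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_p = step_log_probs(prefix)
            for tok in np.argsort(log_p)[-beam_size:]:
                candidates.append((prefix + (int(tok),), score + float(log_p[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

def toy_model(prefix, vocab=5):
    # Toy "model": a fixed log-distribution over a 5-token vocabulary.
    logits = np.linspace(1.0, 0.0, vocab) - 0.1 * len(prefix)
    return logits - np.log(np.exp(logits).sum())

print(beam_search(toy_model))
```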
Prediction Tasks
- Text Classification (text -> label): task-textclass
- Text Pair Classification (two texts -> label): task-textpair
- Sequence Labeling (text -> one label per token): task-seqlab
- Extractive Summarization (text -> subset of text): task-extractive (implies task-seqlab)
- Span Labeling (text -> labels on spans): task-spanlab
- Language Modeling (predict probability of text): task-lm
- Conditioned Language Modeling (some input -> text): task-condlm (implies task-lm)
- Sequence-to-sequence Tasks (text -> text, including MT): task-seq2seq (implies task-condlm)
- Cloze-style Prediction, Masked Language Modeling (right and left context -> word): task-cloze
- Context Prediction, as in word2vec (word -> right and left context): task-context
- Relation Prediction (text -> graph of relations between words, including dependency parsing): task-relation
- Tree Prediction (text -> tree, including syntactic and some semantic parsing): task-tree
- Graph Prediction (text -> graph, not necessarily between words): task-graph
- Lexicon Induction/Embedding Alignment (text/embeddings -> bi- or multi-lingual lexicon): task-lexicon
- Word Alignment (parallel text -> alignment between words): task-alignment
Composite Pre-trained Embedding Techniques
- word2vec: pre-word2vec (implies arch-bow, task-cloze, task-context)
- fasttext: pre-fasttext (implies arch-bow, arch-subword, task-cloze, task-context)
- GloVe: pre-glove
- Paragraph Vector (ParaVec): pre-paravec
- Skip-thought: pre-skipthought (implies arch-lstm, task-seq2seq)
- ELMo: pre-elmo (implies arch-bilstm, task-lm)
- BERT: pre-bert (implies arch-transformer, task-cloze, task-textpair)
- Universal Sentence Encoder (USE): pre-use (implies arch-transformer, task-seq2seq)
Structured Models/Algorithms
- Hidden Markov Models (HMM): struct-hmm
- Conditional Random Fields (CRF): struct-crf
- Context-free Grammar (CFG): struct-cfg
- Combinatory Categorial Grammar (CCG): struct-ccg
Relaxation/Training Methods for Non-differentiable Functions
- Complete Enumeration: nondif-enum
- Straight-through Estimator: nondif-straightthrough
- Gumbel Softmax: nondif-gumbelsoftmax (see the sketch after this list)
- Minimum Risk Training: nondif-minrisk
- REINFORCE: nondif-reinforce
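A minimal NumPy sketch of a Gumbel-softmax sample and the straight-through trick of pairing it with a hard one-hot in the forward pass. NumPy has no autograd, so the gradient path is only described in comments; the temperature and logits are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=np.random.default_rng(0)):
    # Gumbel softmax: add Gumbel noise to the logits and take a softmax
    # with temperature tau, a differentiable relaxation of sampling a
    # one-hot categorical variable (tau -> 0 approaches a hard sample).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

logits = np.log(np.array([0.1, 0.6, 0.3]))
soft = gumbel_softmax_sample(logits, tau=0.5)
# Straight-through variant: use the hard one-hot sample in the forward
# pass, but back-propagate as if the soft sample had been used.
hard = np.eye(len(soft))[soft.argmax()]
```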
Adversarial Methods
- Generative Adversarial Networks (GAN): adv-gan
- Adversarial Feature Learning: adv-feat
- Adversarial Examples: adv-examp
- Adversarial Training: adv-train (implies adv-examp)
Latent Variable Models
- Variational Auto-encoder (VAE): latent-vae (see the sketch after this list)
- Topic Model: latent-topic
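A minimal NumPy sketch of the two VAE-specific ingredients: the reparameterization trick and the KL term of the ELBO against a standard-normal prior. The Gaussian posterior parameters are illustrative assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    # Reparameterization trick: sample z ~ N(mu, sigma^2) as
    # mu + sigma * eps so that gradients can flow through mu and sigma.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, 1)), the regularizer in the ELBO.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

mu, log_var = np.zeros(4), np.zeros(4)
z = reparameterize(mu, log_var)
print(kl_to_standard_normal(mu, log_var))  # 0 for a standard-normal posterior
```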
Meta Learning
- Meta-learning Initialization: meta-init
- Meta-learning Optimizers: meta-optim
- Meta-learning Loss Functions: meta-loss
- Neural Architecture Search: meta-arch