# Interesting concepts

## TOC:
* ### Learning
    * [Learning rate decay](#rate_decay)
    * [Dropout](#dropout)
    * [Batch normalization](#batch)
    * [ReLU](#relu)
* ### NLP
    * [Distributional Hypothesis](#distrib_hyp)
    * [CBOW](#cbow)
    * [Skip-Gram](#skip_gram)
    * [Perplexity](#perplexity)
    * [LSTM](#lstm)
    * [GRU](#gru)

# Learning

## Learning rate decay <a class="anchor" id="rate_decay"></a>

When training a model, it is often recommended to lower the learning rate as training progresses. If you are going too fast, you might end up jumping from one side of the valley to the other instead of settling at the bottom.
Using a small learning rate from the very start avoids this, but it also slows down learning, which is not desirable. Solution: __start fast and slowly decay__ your learning rate.
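
As an illustration, here is a minimal sketch of one possible schedule, exponential decay. The function name and the parameters `initial_lr`, `decay_rate` and `decay_steps` are assumptions made for this example; other schedules (step decay, cosine decay, etc.) follow the same "start fast, then slow down" idea.

```python
def exponential_decay(initial_lr, decay_rate, step, decay_steps):
    """Learning rate after `step` updates.

    The rate is multiplied by `decay_rate` once every `decay_steps` updates
    (smoothly, since the exponent is fractional in between).
    """
    return initial_lr * decay_rate ** (step / decay_steps)


# Hypothetical settings: start at 0.1 and halve the rate every 1000 steps.
for step in (0, 1000, 2000):
    print(step, exponential_decay(0.1, 0.5, step, 1000))
```

With these settings the rate starts large, so early progress is fast, and shrinks as training goes on, so later updates are less likely to overshoot.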

## Dropout <a class="anchor" id="dropout"></a>

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to randomly dropping out units (both hidden and visible) in a neural network during training.
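
The idea can be sketched in a few lines of plain Python. This is one common variant, usually called inverted dropout, in which the surviving activations are rescaled during training so that nothing needs to change at test time. The names `dropout`, `drop_prob` and `rng` are assumptions made for this example.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Zero each activation with probability `drop_prob` (training only).

    Kept activations are divided by the keep probability so that the
    expected value of each unit stays the same.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # no units are dropped at test time
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


hidden = [0.2, 1.5, 0.7, 3.0]
print(dropout(hidden, drop_prob=0.5))                  # random mask, differs per call
print(dropout(hidden, drop_prob=0.5, training=False))  # unchanged
```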

## Batch normalization <a class="anchor" id="batch"></a>

Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. It also allows you to use a higher learning rate, potentially providing another boost in speed.

Normalization (shifting inputs to zero mean and unit variance) is often used as a pre-processing step to make the data comparable across features. As the data flows through a deep network, the weights of each layer rescale and shift those values, sometimes making them too big or too small again. The authors of the batch normalization paper (Ioffe and Szegedy) refer to this problem as "internal covariate shift". By normalizing the data in each mini-batch, this problem is largely avoided.
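
A minimal sketch of the training-time computation for a single feature is shown below. The function name and the default values for `gamma`, `beta` and `eps` are assumptions made for this example; in a real layer `gamma` (scale) and `beta` (shift) are learned, and running statistics are kept for use at inference time, which this sketch leaves out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch (a list of floats)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    # eps keeps the division stable when the variance is close to zero
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Values on an arbitrary scale come out with roughly zero mean and unit variance.
print(batch_norm([100.0, 200.0, 300.0, 400.0]))
```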

## ReLU <a class="anchor" id="relu"></a>

The rectified linear unit (ReLU) is an activation function defined as the positive part of its argument: $f(x) = \max(0, x)$. It was introduced with strong biological motivations.
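
In code, the function and the gradient used during backpropagation can be sketched as follows. The function names are illustrative, and taking the gradient at exactly zero to be 0 is a convention assumed here (the function is not differentiable at that point).

```python
def relu(x):
    """Positive part of x."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


print([relu(x) for x in (-2.0, 0.0, 3.5)])
print([relu_grad(x) for x in (-2.0, 0.0, 3.5)])
```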

# NLP

## Distributional Hypothesis <a class="anchor" id="distrib_hyp"></a>

The distributional hypothesis states that words that appear in the same contexts tend to share semantic meaning. More generally, linguistic items with similar distributions have similar meanings.

## CBOW <a class="anchor" id="cbow"></a>

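Continuous Bag-of-Words (CBOW) is a word2vec architecture that predicts a target word from its surrounding context words, ignoring their order.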

## Skip-Gram <a class="anchor" id="skip_gram"></a>

A skip-gram is a generalization of $n$-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over. Skip-grams provide one way of overcoming the data sparsity problem found with conventional $n$-gram analysis.

Formally, an $n$-gram is a consecutive subsequence of length $n$ of some sequence of tokens $w_1 \dots w_N$. A $k$-skip-$n$-gram is a length-$n$ subsequence where the components occur at distance at most $k$ from each other.

For example, in the input text:

*the rain in Spain falls mainly on the plain*

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

*the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.*
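
The example above can be reproduced with a short sketch. The function name `skip_grams` is an assumption, and so is the reading of "distance at most $k$" as "at most $k$ tokens skipped between adjacent components"; some definitions bound the total number of skipped tokens instead, and the two agree for 2-grams. The brute-force enumeration is meant for short inputs only.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of `tokens` as tuples, in order of position."""
    results = []
    for idx in combinations(range(len(tokens)), n):
        # keep the index tuple only if every gap skips at most k tokens
        if all(b - a - 1 <= k for a, b in zip(idx, idx[1:])):
            results.append(tuple(tokens[i] for i in idx))
    return results


tokens = "the rain in Spain falls mainly on the plain".split()
for gram in skip_grams(tokens, n=2, k=1):
    print(" ".join(gram))
```

With `k=0` the function returns only the ordinary bigrams.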

## Perplexity <a class="anchor" id="perplexity"></a>

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. A low perplexity indicates the probability distribution is good at predicting the sample.
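
For a sequence of $N$ observed tokens, perplexity is commonly computed as the exponential of the average negative log-probability that the model assigned to those tokens. A minimal sketch is shown below; the function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity given the probability the model assigned to each observed token."""
    n = len(probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in probabilities) / n
    return math.exp(avg_neg_log_prob)


# A model that is more confident about the observed tokens scores lower.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

A model that assigns probability $1/V$ to every token has perplexity $V$, so the value can be read as the effective number of equally likely choices the model is hesitating between.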

## LSTM <a class="anchor" id="lstm"></a>

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.

## GRU <a class="anchor" id="gru"></a>

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks. Their performance on polyphonic music modeling and speech signal modeling was found to be similar to that of long short-term memory.

They have fewer parameters than LSTM, as they lack an output gate.
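
The difference in size can be made concrete with a rough count. The sketch below assumes the standard formulations with one weight matrix over the concatenated input and hidden state and one bias vector per block: four blocks for an LSTM (input, forget and output gates plus the candidate cell state) and three for a GRU (update and reset gates plus the candidate state). Function names are illustrative, and some libraries store two bias vectors per block, which changes the totals slightly.

```python
def block_params(input_size, hidden_size):
    """Parameters of one gate or candidate block: weights plus one bias vector."""
    return hidden_size * (input_size + hidden_size) + hidden_size


def lstm_param_count(input_size, hidden_size):
    return 4 * block_params(input_size, hidden_size)


def gru_param_count(input_size, hidden_size):
    return 3 * block_params(input_size, hidden_size)


print(lstm_param_count(10, 20), gru_param_count(10, 20))
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.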