# Deep Learning for NLP: Feed-forward Networks

## Feedforward Neural Networks Basics
Deep Learning: re-branded name for neural networks, a branch of machine learning

Deep: refers to many layers that are chained together in a model

**Neurons**: computation units


### Feed-forward NN
<font color=red>AKA multilayer perceptrons (MLPs)</font>

**Arrow**: carries a weight, reflecting the importance of that connection

Certain layers have non-linear activation functions (e.g. softmax)

<center>
 <img src="./figures/week3l1-1.png" width = "250" alt="figure" align=center />
</center>


### Neuron
Each neuron is a function: $h=\tanh(\sum_jw_jx_j+b)$

It scales each input (by its weight $w_j$) and adds an offset (the bias $b$)

**Parameters**: the weights $w$ and the bias $b$.

#### Matrix Vector Notation

$$h_i=\text{func}(\sum_jw_{ij}x_j+b_i)$$

$$\bar{h}=\text{func}(W\bar{x}+\bar{b})$$

where $W$ is a matrix comprising the weight vectors, and $\bar{b}$ is a vector of all bias terms

**Non-linear function applied element-wise**
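
As a minimal sketch of the matrix-vector form above, the following pure-Python function computes $\bar{h}=\tanh(W\bar{x}+\bar{b})$ for one layer. The name `dense_tanh` and the toy values are illustrative assumptions, not part of the lecture.

```python
import math

def dense_tanh(W, x, b):
    """Compute h = tanh(W x + b), with W given as a list of rows."""
    h = []
    for w_row, b_i in zip(W, b):
        # Weighted sum of the inputs plus the bias of neuron i.
        z = sum(w_ij * x_j for w_ij, x_j in zip(w_row, x)) + b_i
        # The non-linearity is applied element-wise.
        h.append(math.tanh(z))
    return h

# Hypothetical toy layer: 2 neurons, 3 inputs.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -2.0]
x = [2.0, 1.0, 0.0]
h = dense_tanh(W, x, b)   # one activation per neuron
```

Each row of `W` holds the incoming weights of one neuron, so the layer has as many outputs as `W` has rows.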


### Output Layer

1. Binary classification (+, -): 1 neuron with a sigmoid activation function
2. Multi-class classification: softmax ensures the outputs are $> 0$ and sum to $1$ (<font color=red>i.e. a discrete probability distribution</font>)
$$\left[\frac{\exp(v_1)}{\sum_i\exp(v_i)},\frac{\exp(v_2)}{\sum_i\exp(v_i)},\dots,\frac{\exp(v_m)}{\sum_i\exp(v_i)}\right]$$

3. Continuous probability?: <font color=red>MDN</font> (mixture density network)
4. Regression problems?
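
The first two output layers above can be sketched as follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick and an addition here, not something stated in the notes; it leaves the softmax result unchanged.

```python
import math

def sigmoid(z):
    """Binary case: squashes a single score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(v):
    """Multi-class case: positive outputs that sum to 1."""
    m = max(v)                                  # stability shift (assumption)
    exps = [math.exp(v_i - m) for v_i in v]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
```

The largest score always receives the largest probability, which is why prediction can simply take the argmax of the output.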


### Learning from Data (for classification)
Maximise the total probability (likelihood) $L$ of the training data:
$$L=\prod_{i=1}^mP(y_i|x_i)$$
This is equivalent to minimising $-\log L$ with respect to the parameters:
$$-\log L=-\sum_{i=1}^m\log P(y_i|x_i)$$

The network is trained using gradient descent.
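
To make the objective concrete, here is a small sketch that computes $-\log L$ for a single sigmoid neuron and reduces it with plain gradient descent. The data, learning rate, epoch count, and function names are all hypothetical choices for illustration; a real network would obtain the gradients by back-propagation through every layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(data, w, b):
    """-log L = -sum_i log P(y_i | x_i) for one sigmoid neuron."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)                  # P(y = 1 | x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total

def gradient_descent(data, lr=0.5, epochs=100):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = sigmoid(w * x + b)
            # For sigmoid + negative log-likelihood, d(-log L)/dz = p - y.
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Step against the gradient to reduce -log L.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # toy 1-D dataset
loss_before = neg_log_likelihood(data, 0.0, 0.0)
w, b = gradient_descent(data)
loss_after = neg_log_likelihood(data, w, b)         # lower than loss_before
```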

**<font color=red>How to compute the loss?</font>**


### Regularisation
NNs have many parameters and overfit easily; regularisation is used to avoid overfitting

Low bias (few built-in assumptions), high variance (overfits easily)

**Very important in NNs**

1. L1-norm: sum of absolute values of all params ($W,b$ etc.); encourages sparse parameters (many become exactly zero).
2. L2-norm: sum of squares of all params; discourages large weights.
3. Dropout: randomly zero out some neurons of a layer during training
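
The two norm penalties can be sketched as extra terms added to the training loss. The weighting factor `lam` and the function names are assumptions for illustration; the notes do not specify how the penalty is scaled.

```python
def l1_penalty(params):
    """Sum of absolute values of all parameters."""
    return sum(abs(p) for p in params)

def l2_penalty(params):
    """Sum of squares of all parameters."""
    return sum(p * p for p in params)

def regularised_loss(data_loss, params, lam=0.01, kind="l2"):
    # lam (hypothetical name) trades off fitting the data against small weights.
    penalty = l1_penalty(params) if kind == "l1" else l2_penalty(params)
    return data_loss + lam * penalty

params = [0.5, -2.0, 0.0, 1.5]   # all weights and biases, flattened
l1 = l1_penalty(params)
l2 = l2_penalty(params)
```

Because the penalty is part of the loss, gradient descent shrinks the parameters at the same time as it fits the data.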

#### Dropout
With a dropout rate of 0.1, a random 10% of the neurons in the layer are set to 0

Mostly applied to the <font color=red>hidden layers</font>, but can be applied to other layers too

e.g.
<center>
 <img src="./figures/week3l1-2.png" width = "250" alt="figure" align=center />
</center>

**Works because:**

It prevents the model from being over-reliant on certain neurons

Indirectly, it also penalises large parameter weights and introduces noise into the network
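
A minimal sketch of dropout on one layer's activations is shown below. It assumes the "inverted" variant, which rescales the surviving activations so their expected value is unchanged; the notes do not say which variant the lecture uses, and the function signature is hypothetical.

```python
import random

def dropout(h, rate, rng, training=True):
    """Zero each activation with probability `rate` (training only)."""
    if not training:
        return list(h)               # no neurons are dropped at prediction time
    keep = 1.0 - rate
    # Assumption: inverted dropout, survivors are scaled by 1/keep.
    return [0.0 if rng.random() < rate else h_i / keep for h_i in h]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
h = [1.0] * 1000                     # hypothetical hidden-layer activations
dropped = dropout(h, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)   # roughly the dropout rate
```

A fresh random mask is drawn on every call, so the network cannot rely on any single neuron being present.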


## Applications in NLP

### Topic Classification
Input: bag-of-words (document-word matrix)

e.g.
<center>
 <img src="./figures/week3l1-3.png" width = "400" alt="figure" align=center />
</center>

Architecture: last layer is softmax (for probability distribution)

Training: Loss is cross-entropy

Prediction: choose the argmax of the output distribution as the predicted class
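
The pipeline above (bag-of-words input, softmax output, cross-entropy loss, argmax prediction) can be sketched end to end. For brevity this sketch omits hidden layers and uses hand-picked weights; the vocabulary, classes, and weights are all hypothetical.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocab):
    """One count per vocabulary word (one row of the document-word matrix)."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]

def softmax(v):
    m = max(v)
    exps = [math.exp(v_i - m) for v_i in v]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(x, W, b):
    scores = [sum(w * x_j for w, x_j in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(scores)

def cross_entropy(probs, gold_index):
    """Training loss for one document: -log P(gold class)."""
    return -math.log(probs[gold_index])

vocab = ["goal", "match", "election", "vote"]
classes = ["sport", "politics"]
W = [[1.0, 1.0, -1.0, -1.0],      # hand-picked weights for "sport"
     [-1.0, -1.0, 1.0, 1.0]]      # hand-picked weights for "politics"
b = [0.0, 0.0]

x = bag_of_words("the match ended with a late goal".split(), vocab)
probs = class_probs(x, W, b)
predicted = classes[max(range(len(classes)), key=lambda k: probs[k])]
loss = cross_entropy(probs, classes.index("sport"))
```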

#### Improvements
1. bag of <font color=red>bigrams</font> as input
2. preprocess text to lemmatise words and remove stopwords?
3. weight words using <font color=red>TF-IDF</font> or a binary (<font color=red>one-hot</font> style) indicator vector instead of raw word counts.
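
Two of these improvements can be sketched in a few lines. The TF-IDF variant used here (raw count times $\log(N/\mathrm{df})$) is one common choice and an assumption; the notes do not say which weighting formula is intended.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Adjacent word pairs, usable as features alongside single words."""
    return list(zip(tokens, tokens[1:]))

def tf_idf(docs):
    """Weight each word by count * log(N / document frequency)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"]]
weights = tf_idf(docs)
pairs = bigrams(docs[0])
```

A word that occurs in every document, such as "the" here, gets weight 0, so it no longer dominates the input vector.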


### Language Model Revisited
Goal: assign a probability to a sequence of words

Typically done by sliding a window over the text and predicting each word from a finite context, e.g. $n=3$ (trigram):
$$P(w_1,w_2,\dots,w_m)=\prod_{i=1}^mP(w_i|w_{i-2},w_{i-1})$$

Training involves collecting frequency counts (rare events → smoothing)

**As Classifier**: LMs can be considered simple classifiers, e.g. a trigram model classifies the next word given the two previous words:
$$P(w_i|w_{i-2}=\text{salt},w_{i-1}=\text{and})$$
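
A count-based trigram model of this kind can be sketched as below. Add-k smoothing is used as one simple example of smoothing; the notes do not name a specific method, and the tiny corpus is made up.

```python
from collections import Counter, defaultdict

def train_trigram(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(tokens)):
            counts[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return counts

def prob(counts, w, context, vocab_size, k=1.0):
    """Add-k smoothed P(w | context); k=0 gives the raw relative frequency
    (only valid for contexts that were seen in training)."""
    c = counts.get(context, Counter())
    return (c[w] + k) / (sum(c.values()) + k * vocab_size)

corpus = [["salt", "and", "pepper"],
          ["salt", "and", "vinegar"],
          ["salt", "and", "pepper"]]
counts = train_trigram(corpus)
vocab = {w for sent in corpus for w in sent} | {"</s>"}

p_pepper = prob(counts, "pepper", ("salt", "and"), len(vocab))
p_unseen = prob(counts, "salt", ("salt", "and"), len(vocab))   # never observed
```

Without smoothing (`k=0`) the unseen continuation would get probability 0, which is the sparsity problem that smoothing addresses.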


### Feed-forward NN Language Model
Use a NN to model the conditional probability above.

Input features: the previous two words, each mapped to an **embedding** (a continuous vector)

Output classes: the next word (a large number of classes, one per vocabulary word)
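
The forward pass of such a model might look like the sketch below. It assumes the two context embeddings are concatenated before the hidden layer, and the layer sizes, random initialisation, and function names are all hypothetical; the weights are untrained.

```python
import math
import random

def matvec(W, x, b):
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_k
            for row, b_k in zip(W, b)]

def softmax(v):
    m = max(v)
    exps = [math.exp(v_i - m) for v_i in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_model(vocab, emb_dim=4, hidden=8, seed=0):
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    V = len(vocab)
    return {
        "index": {w: i for i, w in enumerate(vocab)},
        "E": rand(V, emb_dim),                     # one embedding per word
        "W1": rand(hidden, 2 * emb_dim), "b1": [0.0] * hidden,
        "W2": rand(V, hidden), "b2": [0.0] * V,    # one output per word
    }

def next_word_probs(model, w_prev2, w_prev1):
    idx = model["index"]
    # Assumption: the two context embeddings are concatenated.
    x = model["E"][idx[w_prev2]] + model["E"][idx[w_prev1]]
    h = [math.tanh(z) for z in matvec(model["W1"], x, model["b1"])]
    return softmax(matvec(model["W2"], h, model["b2"]))

vocab = ["salt", "and", "pepper", "vinegar"]
model = init_model(vocab)
probs = next_word_probs(model, "salt", "and")   # distribution over vocab
```

The output layer has one unit per vocabulary word, which is where most of the parameters (and most of the computation) end up.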


### Word Embeddings
Maps discrete word symbols to <font color=red>continuous vectors</font> in a __relatively low dimensional__ space

Helps to capture similarity between words (latent semantics?)
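
One common way to measure similarity between embeddings is cosine similarity, sketched here. The three-dimensional vectors are hand-picked for illustration, not learned, and cosine is an assumption about how similarity would be measured.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; real ones are learned and have more dimensions.
emb = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
sim_cat_dog = cosine(emb["cat"], emb["dog"])
sim_cat_car = cosine(emb["cat"], emb["car"])
```

With these toy vectors, "cat" is much closer to "dog" than to "car", which is the kind of structure learned embeddings are meant to capture.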

<center>
 <img src="./figures/week3l1-4.png" width = "500" alt="figure" align=center />
</center>

**Question**: Can the normalisation step be skipped, given that it consumes a lot of computational resources?


### Training a FFNN LM

<center>
 <img src="./figures/week3l1-5.png" width = "500" alt="figure" align=center />
</center>

<center>
 <img src="./figures/week3l1-6.png" width = "500" alt="figure" align=center />
</center>

**Question**: What does the circled plus sign (⊕) represent?

<center>
 <img src="./figures/week3l1-7.png" width = "500" alt="figure" align=center />
</center>
<br/>
<center>
 <img src="./figures/week3l1-8.png" width = "500" alt="figure" align=center />
</center>


### Advantages of FFNN LM
Count-based $N$-gram models:

1. Cheap to train
2. Problems with sparsity & novelty (scale poorly to larger contexts)
3. Don't adequately capture properties of words (e.g. grammatical and semantic similarity)

Pros of FFNN $N$-gram models:

1. Automatically capture word properties -> more robust estimates
2. No smoothing needed: the softmax output always gives every word some probability, even at random initialisation


## Convolutional Networks
popular in CV; <font color=red>Identify indicative local predictors</font>; Combine them to produce a <font color=red>fixed-size</font> representation

[A Beginner's Guide To Understanding Convolutional Neural Networks](https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/)


<center>
 <img src="./figures/week3l1-9.png" width = "500" alt="figure" align=center />
</center>
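
For text, the idea of local predictors combined into a fixed-size representation can be sketched as a 1-D convolution over word vectors followed by max-pooling. This sketch omits the bias and non-linearity for brevity, and the filters and word vectors are made-up values.

```python
def conv1d_max_pool(embeddings, filters, width):
    """Slide each filter over windows of `width` consecutive word vectors,
    then keep only the maximum response per filter."""
    pooled = []
    for f in filters:
        scores = []
        for i in range(len(embeddings) - width + 1):
            # Flatten the window of word vectors into one long vector.
            window = [x for vec in embeddings[i:i + width] for x in vec]
            scores.append(sum(w * x for w, x in zip(f, window)))
        pooled.append(max(scores))
    return pooled

filters = [[1.0, 1.0, 1.0, 1.0],
           [1.0, 0.0, -1.0, 0.0]]          # 2 filters over 2 words x 2 dims
short = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
longer = short + [[0.0, 0.0], [1.0, 0.0]]

pooled_short = conv1d_max_pool(short, filters, width=2)
pooled_long = conv1d_max_pool(longer, filters, width=2)
```

Both inputs yield one value per filter, so sentences of different lengths map to vectors of the same size.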


### Summary for DL in NLP (FFNN)
Pros:

1. Excellent performance
2. less hand-engineering of features
3. Flexible (customised architecture for different tasks)

Cons:

1. Much slower to train (typically needs GPUs)
2. Lots of params (due to vocabulary size; both input and output)
3. Data hungry (poor performance on small datasets -> use pre-trained models)

### Related reference
- Feed-forward network: G15, section 4; JM Ch. 7.3-7.5 
- Convolutional network: G15, section 9

# Recurrent Networks

## Recurrent Networks

### N-gram Language Models

pros:
1. Can be implemented using counts (with smoothing)
2. Can be implemented using feed-forward neural networks (with word embeddings)
3. Generates sentences like (trigram model): `I saw a table is round and about`

Problems: **limited context**


### Recurrent Neural Networks (RNN)
pros:
1. Allow representation of arbitrarily long inputs (not just a fixed context of $n-1$ words)

Idea: process the input sequence one element at a time, by applying a recurrence formula

**State Vector**: used to represent the context that has been processed so far

$$s_i=f(s_{i-1},x_i)$$
$$s_{i+1}=f(s_{i},x_{i+1})$$

$f$ is the recurrence function; the same function is applied at every step.

<center>
 <img src="./figures/week3l2-1.png" width = "500" alt="figure" align=center />
</center>
The recurrence function is just a non-linear layer:

$$s_i=\tanh(W_ss_{i-1}+W_xx_i+b)$$

If $x_i$ is a one-hot vector, then $W_x$ is just the word embedding matrix: multiplying by $x_i$ selects the column for that word
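
The recurrence $s_i=\tanh(W_ss_{i-1}+W_xx_i+b)$ can be sketched directly. Starting from an all-zero state is an assumption (the notes do not specify the initial state), and the small matrices are made-up values.

```python
import math

def rnn_step(s_prev, x, W_s, W_x, b):
    """One application of s_i = tanh(W_s s_{i-1} + W_x x_i + b)."""
    s = []
    for row_s, row_x, b_k in zip(W_s, W_x, b):
        z = (sum(w * v for w, v in zip(row_s, s_prev))
             + sum(w * v for w, v in zip(row_x, x))
             + b_k)
        s.append(math.tanh(z))
    return s

def run_rnn(xs, W_s, W_x, b):
    s = [0.0] * len(b)            # assumption: zero initial state
    states = []
    for x in xs:                  # the same parameters are reused every step
        s = rnn_step(s, x, W_s, W_x, b)
        states.append(s)
    return states

# Hypothetical sizes: state dimension 2, vocabulary of 3 words.
W_s = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # one-hot vectors for words 0 and 1
states = run_rnn(xs, W_s, W_x, b)
```

Because each `x` is one-hot, `W_x x` just picks out one column of `W_x`, which is why that matrix acts as the embedding table.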


### Simple RNN

$$s_i=\tanh(W_ss_{i-1}+W_xx_i+b)$$
$$y_i=\sigma(W_ys_i)$$

<center>
 <img src="./figures/week3l2-1.png" width = "500" alt="figure" align=center />
</center>

The same parameters $(W_s,W_x,b,W_y)$ are shared across all time steps, so the model does not have many parameters

#### Simple RNN Training


### (Simple) RNN for Language Model

#### RNN Language Model Training

<center>
 <img src="./figures/week3l2-1.png" width = "500" alt="figure" align=center />
</center>

Once the whole training sentence has been processed, sum the losses from all time steps and back-propagate.
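
The summed loss for one sentence can be sketched as follows. This only computes the loss; the back-propagation step is not shown. It assumes a softmax output layer with cross-entropy loss at each step, a zero initial state, and an embedding table `E` standing in for $W_xx_i$ with one-hot inputs. All names and sizes are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    exps = [math.exp(v_i - m) for v_i in v]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_loss(token_ids, E, W_s, W_y, b):
    """Sum over time steps of -log P(next word | words so far)."""
    s = [0.0] * len(b)                        # assumption: zero initial state
    total = 0.0
    for cur, nxt in zip(token_ids, token_ids[1:]):
        # E[cur] plays the role of W_x x_i for a one-hot x_i.
        z = [a + e + b_k for a, e, b_k in zip(matvec(W_s, s), E[cur], b)]
        s = [math.tanh(v) for v in z]
        probs = softmax(matvec(W_y, s))       # distribution over next word
        total += -math.log(probs[nxt])
    return total

V, H = 4, 3                                   # vocabulary and state sizes
rng = random.Random(0)
def rand(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E, W_s, W_y, b = rand(V, H), rand(H, H), rand(V, H), [0.0] * H
loss = sentence_loss([0, 1, 2, 3], E, W_s, W_y, b)
```

A sentence of $m$ tokens contributes $m-1$ loss terms here, one for each next-word prediction.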

#### RNN Language Model Generation

<center>
 <img src="./figures/week3l2-1.png" width = "500" alt="figure" align=center />
</center>

#### RNN Generation Problems
1. Mismatch between training and decoding (picking the highest-probability word at each step won't always give a good choice)
2. Errors in intermediate steps propagate, and the model never recovers from them
3. Tends to generate "bland" or "generic" language
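
The generation loop behind these problems can be sketched as below, with both greedy decoding and sampling. The lookup table `toy` is a made-up stand-in for a trained RNN: it conditions only on the last word, whereas a real RNN would condition on the whole history through its state vector.

```python
import random

def generate(next_word_probs, start="<s>", end="</s>", max_len=20,
             greedy=True, rng=None):
    """Repeatedly pick the next word and feed it back in as input."""
    rng = rng or random.Random(0)
    words = [start]
    while len(words) < max_len:
        dist = next_word_probs(words)         # dict: word -> probability
        if greedy:
            word = max(dist, key=dist.get)    # always the most probable word
        else:
            word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == end:
            break
        words.append(word)                    # an early mistake stays in the input
    return words[1:]

# Hypothetical next-word distributions standing in for a trained model.
toy = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cow": 0.7, "</s>": 0.3},
    "the": {"cow": 1.0},
    "cow": {"eats": 0.8, "</s>": 0.2},
    "eats": {"grass": 0.9, "</s>": 0.1},
    "grass": {"</s>": 1.0},
}
model = lambda words: toy[words[-1]]

greedy_out = generate(model)
sampled_out = generate(model, greedy=False, rng=random.Random(1))
```

Greedy decoding always returns the same sentence for a given model, which is one way to see why the output tends to be generic.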



## Long Short-term Memory Networks (LSTM)







## Applications