# Author: Zhengxiang (Jack) Wang 
# Date: 2021-10-06
# GitHub: https://github.com/jaaack-wang 
# About: Language Models and RNNs for Stanford CS224N - NLP with Deep Learning | Winter 2019

# Table of Contents
- [1. Language Modeling](#1)
    - [1.1 Overview](#1-1)
    - [1.2 n-gram Language Models](#1-2)
    - [1.3 Fixed-window Neural Language Model](#1-3)
    - [1.4 Evaluation: Perplexity](#1-4)
- [2. Recurrent Neural Networks](#2)
    - [2.1 Overview](#2-1)
        - [2.1.1 Basic architecture of RNN](#2-1-1)
        - [2.1.2 Applications](#2-1-2)
    - [2.2 RNN Language Model](#2-2)
        - [2.2.1 Example](#2-2-1)
        - [2.2.2 Pros and cons](#2-2-2)
        - [2.2.3 Training](#2-2-3)
- [3. Recap](#3)
- [4. References](#4)

<a name='1'></a>
# 1. Language Modeling


<a name='1-1'></a>
## 1.1 Overview

**Definition:**
Language modeling is the task of predicting what word comes next. More formally, given a sequence of words $x^{(1)}, x^{(2)},...,x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$:

$$P(x^{(t+1)}|x^{(t)},...,x^{(1)})$$

where $x^{(t+1)}$ can be any word in the vocabulary $V = \{w_1,...,w_{|V|}\}$.

It follows that we can think of a language model as a system that assigns a probability to a piece of text, say, $x^{(1)},...,x^{(T)}$. By the chain rule, this probability equals:


$$P(x^{(1)},...,x^{(T)}) = P(x^{(1)}) \times P(x^{(2)}|x^{(1)}) \times \cdots \times P(x^{(T)}|x^{(T-1)},...,x^{(1)}) = \prod_{t=1}^{T} P(x^{(t)}|x^{(t-1)},...,x^{(1)}) \tag{1-1}$$
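
As a minimal sketch of formula 1-1, the snippet below multiplies the per-step conditional probabilities (in log space, to avoid underflow on long texts). The function `sequence_log_prob`, the `cond_prob(word, history)` callable, and the uniform toy model are all hypothetical names introduced for illustration; they are not part of the lecture.

```python
import math

def sequence_log_prob(words, cond_prob):
    """Sum log P(x_t | x_1, ..., x_{t-1}) over the sequence (formula 1-1 in log space)."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])  # all words before position t
        total += math.log(cond_prob(word, history))
    return total

# Hypothetical toy model: every word is equally likely, whatever the history.
VOCAB = ["the", "cat", "sat", "down"]

def uniform_cond_prob(word, history):
    return 1.0 / len(VOCAB)

log_p = sequence_log_prob(["the", "cat", "sat"], uniform_cond_prob)
prob = math.exp(log_p)  # (1/4) ** 3 for this uniform toy model
```

Any real language model only has to supply a better `cond_prob`; the chain-rule bookkeeping stays the same.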


<a name='1-2'></a>
## 1.2 n-gram Language Models


- Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word (see formula 1-1 above).
- Assumption: $x^{(t+1)}$ depends only on the preceding $n-1$ words (this is a very simplistic view).
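
The idea and assumption above can be sketched as a count-based estimate: the probability of a word given its $n-1$ preceding words is the n-gram count divided by the count of its context. The toy corpus and the helper names `train_ngram_counts` and `ngram_prob` are assumptions made for this example only.

```python
from collections import Counter

def train_ngram_counts(tokens, n):
    """Count every n-gram and the (n-1)-word context it starts with."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def ngram_prob(word, context, ngram_counts, context_counts):
    """Estimate P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate without smoothing or backoff
    return ngram_counts[context + (word,)] / context_counts[context]

tokens = "the cat sat on the mat the cat ran".split()
ngram_counts, context_counts = train_ngram_counts(tokens, n=2)
p = ngram_prob("cat", ["the"], ngram_counts, context_counts)  # 2 of the 3 "the" contexts
```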

**Problems:**
- Sparsity problem: not much granularity in the probability distribution
    - Less dependent on context, especially complex context.
    - Unseen n-grams/events in the training set get zero probability (partial fixes: smoothing and backoff).
- Storage problem: as n increases, the number of n-grams to store grows drastically (see [my n-gram experiments with Chinese](https://github.com/jaaack-wang/ChineseNgrams)).
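
To make the two partial fixes for sparsity concrete, here is a rough sketch of add-k smoothing and a deliberately naive backoff for a bigram model. The toy corpus, the function names, and the choice `k=1.0` are assumptions for illustration; the backoff shown is a simplification, not a properly discounted scheme such as Katz backoff.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])   # how often each word appears as a context
unigram_counts = Counter(tokens)

def add_k_prob(word, prev, k=1.0):
    """Add-k smoothing: pretend every bigram was seen k extra times."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))

def backoff_prob(word, prev):
    """Naive backoff: use the bigram estimate if the context was seen, else the unigram."""
    if context_counts[prev] > 0:
        return bigram_counts[(prev, word)] / context_counts[prev]
    return unigram_counts[word] / len(tokens)

p_unseen = add_k_prob("sat", "the")       # "the sat" never occurs, yet gets mass
p_backoff = backoff_prob("cat", "dog")    # "dog" is an unseen context
```

Smoothing removes zero probabilities for unseen bigrams, while backoff handles unseen contexts; neither addresses the storage cost.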

**Example (text generation):**
<img src='../images/6-ngramTextG.png' width='600' height='300'>
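
Text generation with an n-gram model amounts to repeatedly sampling the next word from the conditional distribution of the current context. The sketch below does this for a bigram model; the toy corpus, `build_bigram_table`, and `generate` are hypothetical and only meant to illustrate the sampling loop.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, max_len, rng):
    """Sample words one at a time, each conditioned on the previous word."""
    out = [start]
    while len(out) < max_len:
        followers = table.get(out[-1])
        if not followers:          # dead end: this word never had a successor
            break
        words = list(followers.keys())
        weights = list(followers.values())
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out

tokens = "the cat sat on the mat the cat ran".split()
table = build_bigram_table(tokens)
sample = generate(table, "the", max_len=8, rng=random.Random(0))
print(" ".join(sample))
```

With a context of only one word, the output tends to be locally plausible but globally incoherent, which is the weakness the lecture example highlights.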


<a name='1-3'></a>
## 1.3 Fixed-window Neural Language Model

<img src='../images/6-neuralLM.png' width='600' height='300'>
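
The architecture in the figure can be sketched as a forward pass: look up the embeddings of the words in the window, concatenate them, apply one hidden layer, and take a softmax over the vocabulary. This is a pure-Python illustration with random, untrained weights; the sizes, the `tanh` nonlinearity, and all names (`E`, `W`, `U`, `fixed_window_lm`) are assumptions made for the example.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, d, window, hidden = 6, 4, 3, 5            # toy sizes (assumptions)
E = rand_matrix(V, d, rng)                   # embedding table, one row per word
W = rand_matrix(hidden, window * d, rng)     # hidden-layer weights
b1 = [0.0] * hidden
U = rand_matrix(V, hidden, rng)              # output-layer weights
b2 = [0.0] * V

def fixed_window_lm(word_ids):
    """Return a distribution over the next word given exactly `window` word ids."""
    assert len(word_ids) == window
    e = [v for i in word_ids for v in E[i]]                    # concatenated embeddings
    h = [math.tanh(a + b) for a, b in zip(matvec(W, e), b1)]   # hidden layer
    logits = [a + b for a, b in zip(matvec(U, h), b2)]
    return softmax(logits)

probs = fixed_window_lm([0, 3, 2])
```

Note that `W` has `window * d` columns, so enlarging the window enlarges `W`, and each window position is multiplied by its own slice of the weights.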

Improvements & Problems:

- In addition to the points in the figure below, the model can be very expensive to compute when the vocabulary is extremely large.

<img src='../images/6-fixNLM-ImprProblm.png' width='600' height='300'>


<a name='1-4'></a>
## 1.4 Evaluation: Perplexity

<img src='../images/6-perplexity.png' width='600' height='300'>
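
As a small worked example, perplexity can be computed as the exponential of the average negative log-probability that the model assigns to each actual next word. The function name `perplexity` and the input list are assumptions for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each actual next word."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is always uniform over 4 words is "4 ways confused" at each step.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity is better: a model that assigns probability 1 to every correct word reaches the minimum value of 1.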

<a name='2'></a>
# 2. Recurrent Neural Networks

<a name='2-1'></a>
## 2.1 Overview

<a name='2-1-1'></a>
### 2.1.1 Basic architecture of RNN

<img src='../images/6-RRNArch.png' width='600' height='300'>
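
The core of the architecture is a recurrence that applies the same weights at every time step, so the input sequence can have any length. Below is a minimal pure-Python sketch with random, untrained weights; the `tanh` nonlinearity, the sizes, and the names `rnn_step` and `rnn_forward` are assumptions for this example.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrence step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    pre = [a + c + bi for a, c, bi in zip(matvec(W_h, h_prev), matvec(W_x, x), b)]
    return [math.tanh(v) for v in pre]

def rnn_forward(inputs, W_h, W_x, b):
    """Run the recurrence over a sequence and return every hidden state."""
    h = [0.0] * len(b)              # initial hidden state h_0
    states = []
    for x in inputs:                # the same W_h, W_x, b are reused at each step
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states

rng = random.Random(0)
hidden, d = 3, 2                    # toy sizes (assumptions)
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
b = [0.0] * hidden

states = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]], W_h, W_x, b)
```

The number of parameters does not depend on the sequence length, in contrast to a fixed-window model whose weight matrix grows with the window.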

<a name='2-1-2'></a>
### 2.1.2 Applications

- Language Model
- Part-of-speech tagging
- Sentiment classification
- Question answering
- Speech recognition
- Machine Translation


<a name='2-2'></a>
## 2.2 RNN Language Model

<a name='2-2-1'></a>
### 2.2.1 Example
Model example:
<img src='../images/6-RRNLM.png' width='600' height='300'>

Application example:
<img src='../images/6-RRNexample.png' width='600' height='300'>

<a name='2-2-2'></a>
### 2.2.2 Pros and cons

<img src='../images/6-RRNProsCons.png' width='600' height='300'>

<a name='2-2-3'></a>
### 2.2.3 Training
<img src='../images/6-tranRRNLM.png' width='600' height='300'>
<img src='../images/6-tranRRNLM2.png' width='600' height='300'>
<img src='../images/6-RRNbackpro.png' width='600' height='300'>
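
The training objective can be sketched as follows: at each step the loss is the cross-entropy between the predicted distribution and the true next word, which reduces to the negative log-probability of that word, and the overall loss is the average over steps. The function `average_cross_entropy` and the hand-written distributions are hypothetical; in practice the distributions would come from the RNN and the gradients from backpropagation through time.

```python
import math

def average_cross_entropy(pred_dists, target_ids):
    """Mean over time steps of -log(predicted probability of the true next word)."""
    step_losses = [-math.log(dist[t]) for dist, t in zip(pred_dists, target_ids)]
    return sum(step_losses) / len(step_losses)

# Hypothetical predictions over a 4-word vocabulary for a 3-step sequence.
pred_dists = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
]
target_ids = [0, 1, 2]              # index of the true next word at each step

loss = average_cross_entropy(pred_dists, target_ids)
```

Exponentiating this average loss gives the perplexity of the model on the same sequence.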

<a name='3'></a>
# 3. Recap

<img src='../images/6-Recap.png' width='600' height='300'>

<a name='4'></a>
# 4. References

- [Course website](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/index.html)

- [Lecture video](https://youtu.be/iWea12EAu6U) 

- [Lecture slides](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture06-rnnlm.pdf)