# Author: Zhengxiang (Jack) Wang 
# Date: 2021-10-11
# GitHub: https://github.com/jaaack-wang 
# About: ConvNets for NLP for Stanford CS224N - NLP with Deep Learning | Winter 2019

# Table of Contents
- [1. Convolutional Neural Network (CNN)](#1)
    - [1.1 Overview](#1-1)
    - [1.2 2D example](#1-2)
    - [1.3 1D example](#1-3)
        - [1.3.1 With padding](#1-3-1)
        - [1.3.2 With multiple filters](#1-3-2)
        - [1.3.3 With Stride = 2](#1-3-3)
        - [1.3.4 k-max pooling](#1-3-4)
        - [1.3.5 Dilated CNN](#1-3-5)
        - [1.3.6 PyTorch implementation (example)](#1-3-6)
- [2. Yoon Kim (2014)](#2)
    - [2.1 Single layer CNN for sentence classification](#2-1)
    - [2.2 Hyperparameters](#2-2)
    - [2.3 Model Variants and results](#2-3)
- [3. Model comparisons, related techniques and applications](#3)
    - [3.1 Comparisons: Bag of Vectors, Window Model, CNNs, RNNs](#3-1)
    - [3.2 Techniques](#3-2)
        - [3.2.1 Batch Normalization](#3-2-1)
        - [3.2.2 1 x 1 Convolutions](#3-2-2)
    - [3.3 Application](#3-3)
        - [3.3.1 Translation](#3-3-1)
        - [3.3.2 POS tagging](#3-3-2)
        - [3.3.3 Character-Aware Neural Language Models](#3-3-3)
- [4. References](#4)

<a name='1'></a>
# 1. Convolutional Neural Network (CNN)

<a name='1-1'></a>
## 1.1 Overview

- A good introduction to CNNs can be found in [Andrew Ng's deep learning specialization](https://www.coursera.org/specializations/deep-learning?) on Coursera, specifically Course 4. 
- CNNs were originally designed to deal with images, so they are usually used to extract features from two-dimensional or three-dimensional inputs (depending on whether the image is grayscale or in color). 

<a name='1-2'></a>
## 1.2 2D example

- The 3D case is similar, but the filter has more layers (one per input channel)
- A filter converts a patch into a single value

<img src='../images/10-CNN2D.png' width='600' height='300'>
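
To make the "patch into a single value" idea concrete, here is a minimal pure-Python sketch of a 2D convolution without padding and with stride 1 (as in most deep learning libraries, this is technically a cross-correlation). The `image` and `kernel` values are made up for illustration and are not the numbers in the figure.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # One patch of the image is reduced to one number.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4 x 4 "image" and 2 x 2 filter (illustrative values only).
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, -1, -1], [-1, 1, 3], [2, 0, -2]]
```

A 4 x 4 input and a 2 x 2 filter give a 3 x 3 output, since the filter fits in 3 positions along each axis.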


<a name='1-3'></a>
## 1.3 1D example

<a name='1-3-1'></a>
### 1.3.1 With padding
- ∅ is optional padding (here = 1)
- \[∅, t, d\] is the first patch the filter covers: the padding ∅ plus the first two words. The following rows are read the same way.
- The filter spans the full width of the word vectors, so it only moves up and down along the sentence, never leftward or rightward.

<img src='../images/10-CNN1D.png' width='600' height='300'>
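
A rough sketch of the 1D case for text: the sentence is a matrix with one row per word, and the filter covers a window of consecutive rows across all embedding dimensions. The function name `conv1d_text`, the 2-dimensional "embeddings", and the filter weights are assumptions made for illustration; they are not the values in the figure.

```python
def conv1d_text(embeddings, kernel, padding=0):
    """Slide a (window x dim) kernel down a (length x dim) sentence matrix."""
    dim = len(embeddings[0])
    pad_row = [0] * dim
    # Zero rows play the role of the padding symbol at both ends.
    padded = [pad_row] * padding + embeddings + [pad_row] * padding
    window = len(kernel)
    outputs = []
    for start in range(len(padded) - window + 1):
        total = 0
        for offset in range(window):
            for d in range(dim):
                total += padded[start + offset][d] * kernel[offset][d]
        outputs.append(total)
    return outputs


# Hypothetical 5-word sentence with 2-dimensional word vectors.
sentence = [[1, 0], [0, 2], [3, 1], [1, 1], [2, 0]]
# Hypothetical filter covering 3 words at a time.
kernel = [[1, 0], [0, 1], [1, 1]]

no_pad = conv1d_text(sentence, kernel)
with_pad = conv1d_text(sentence, kernel, padding=1)
print(no_pad)    # [7, 3, 6]
print(with_pad)  # [2, 7, 3, 6, 1]
```

With a window of 3 and padding of 1, the output has as many rows as the sentence has words; without padding it shrinks by 2.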


<a name='1-3-2'></a>
### 1.3.2 With multiple filters

- Different filters are said to capture different aspects of meaning, although these are not always obvious from looking at the (visualized) results directly.
- Max pooling over the convolved values: 0.3, 1.6, 1.4 (one value per column, namely, per filter)
- Average pooling: −0.87, 0.26, 0.53

<img src='../images/10-CNN1D2.png' width='600' height='300'>
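
Pooling over time can be sketched as a column-wise reduction of the convolved table (rows are positions, columns are filters). The `feature_map` below uses made-up values, not the ones in the figure.

```python
# Hypothetical convolved table: 4 positions (rows) x 3 filters (columns).
feature_map = [
    [1, -2, 0],
    [3, 0, 1],
    [-1, 4, 2],
    [1, 2, 5],
]

# Transpose so that each entry holds all positions for one filter.
columns = list(zip(*feature_map))

max_pooled = [max(col) for col in columns]             # max pooling over time
avg_pooled = [sum(col) / len(col) for col in columns]  # average pooling over time

print(max_pooled)  # [3, 4, 5]
print(avg_pooled)  # [1.0, 1.0, 2.0]
```

Either way, a sentence of any length is reduced to one number per filter.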

<a name='1-3-3'></a>
### 1.3.3 With Stride = 2
- This is less useful in the 1D setting
- Stride is the number of positions the filter moves between one step and the next, so with stride = 2 the filter skips every other position
<img src='../images/10-CNN1D3.png' width='600' height='300'>

- local max pooling 
<img src='../images/10-CNN1D4.png' width='600' height='300'>
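
A small sketch of both ideas, assuming a window of 3 words with stride 2 for the convolution and a window of 2 with stride 2 for local max pooling. The token list and the activation values are illustrative; the helper names `strided_windows` and `local_max_pool` are made up for this example.

```python
def strided_windows(values, window, stride):
    """Return the consecutive windows a filter would cover with the given stride."""
    return [values[i:i + window]
            for i in range(0, len(values) - window + 1, stride)]


def local_max_pool(values, window=2, stride=2):
    """Keep the maximum of each local window (a trailing partial window is dropped)."""
    return [max(w) for w in strided_windows(values, window, stride)]


# Illustrative sentence with one padding token on each side.
tokens = ["<pad>", "tentative", "deal", "reached", "to",
          "keep", "government", "open", "<pad>"]
windows = strided_windows(tokens, window=3, stride=2)
for w in windows:
    print(w)

# Hypothetical activations of a single filter at 6 positions.
activations = [0.3, -1.0, 1.6, 0.2, 1.4, -0.5]
pooled = local_max_pool(activations)
print(pooled)  # [0.3, 1.6, 1.4]
```

With stride 2 the 9 padded tokens yield 4 windows instead of 7, and local max pooling halves the number of rows rather than collapsing them to a single value.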

<a name='1-3-4'></a>
### 1.3.4 k-max pooling

- I did not see this mentioned in Andrew Ng's course, and it does not seem to be used very often. 
<img src='../images/10-CNN1D5.png' width='600' height='300'>
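
As far as I understand it, k-max pooling keeps the k largest values of each filter while preserving the order in which they occur in the sentence. A minimal sketch for one filter, with made-up activation values and a hypothetical helper name `k_max_pool`:

```python
def k_max_pool(values, k):
    """Keep the k largest values, in their original order of occurrence."""
    # Indices of the k largest values.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    # Re-sort the indices so the output follows the sentence order.
    return [values[i] for i in sorted(top)]


# Hypothetical activations of a single filter at 6 positions.
activations = [0.3, -1.0, 1.6, 0.2, 1.4, -0.5]
print(k_max_pool(activations, k=2))  # [1.6, 1.4]
print(k_max_pool(activations, k=3))  # [0.3, 1.6, 1.4]
```

With k = 1 this reduces to ordinary max pooling over time.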

<a name='1-3-5'></a>
### 1.3.5 Dilated CNN

- I had never seen this before!
- This is basically a second convolution applied to the output of the first one, with gaps between the rows it covers: 1, 3, 5 refers to the 1st, 3rd and 5th rows of the first convolved table, and so on.

<img src='../images/10-CNN1D6.png' width='600' height='300'>
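
A simplified sketch of dilation, assuming the first convolved table has a single column so that each row is one number (in the figure each row has one value per filter). The values and the kernel are made up for illustration.

```python
def dilated_conv1d(values, kernel, dilation=1):
    """1D convolution whose kernel taps are `dilation` rows apart."""
    span = (len(kernel) - 1) * dilation + 1  # rows covered by one application
    return [
        sum(kernel[j] * values[start + j * dilation] for j in range(len(kernel)))
        for start in range(len(values) - span + 1)
    ]


# Hypothetical output of a first convolution (7 rows, one filter).
first_layer = [2, 0, 1, 5, 3, 1, 4]
kernel = [1, 2, 1]

dense = dilated_conv1d(first_layer, kernel, dilation=1)
dilated = dilated_conv1d(first_layer, kernel, dilation=2)
print(dense)    # [3, 7, 14, 12, 9]
print(dilated)  # [7, 11, 11]  -> rows (1, 3, 5), (2, 4, 6), (3, 5, 7)
```

With dilation 2 a kernel of size 3 sees a span of 5 rows, so a few parameters cover a wider stretch of the sentence.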


<a name='1-3-6'></a>
### 1.3.6 PyTorch implementation (example)

<img src='../images/10-pyTorchImplemntation.png' width='600' height='300'>
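
Assuming the implementation relies on `torch.nn.Conv1d`, the PyTorch documentation gives the output length as a function of padding, dilation, kernel size and stride. The helper below only reproduces that arithmetic in plain Python (it does not call PyTorch), and the sequence length of 7 is a hypothetical value.

```python
def conv1d_output_length(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1D convolution, following the formula in the Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 7  # hypothetical sentence length

print(conv1d_output_length(seq_len, kernel_size=3))                       # 5
print(conv1d_output_length(seq_len, kernel_size=3, padding=1))            # 7
print(conv1d_output_length(seq_len, kernel_size=3, padding=1, stride=2))  # 4
print(conv1d_output_length(seq_len, kernel_size=3, dilation=2))           # 3
```

These four calls correspond to the plain, padded, strided and dilated cases from the subsections above.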


<a name='2'></a>
# 2. Yoon Kim (2014)

- Reference:  Yoon Kim (2014): Convolutional Neural Networks for Sentence Classification. EMNLP 2014. https://arxiv.org/pdf/1408.5882.pdf
- This is a short and simple paper, but worth studying. 

<a name='2-1'></a>
## 2.1 Single layer CNN for sentence classification

- CNN used: A variant of convolutional NNs of [Collobert, Weston et al. (2011)](https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf)

<img src='../images/10-modelArc.png' width='600' height='300'>


- Goal: sentence classification
    - Mainly whether the sentiment of a sentence is positive or negative
- Other tasks like:
    - Whether a sentence is subjective or objective
    - Question classification: is the question about a person, location, number, ...

- A similar paper: [Zhang and Wallace (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1510.03820.pdf)

<img src='../images/10-modelArc2.png' width='600' height='300'>
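
As I understand the architecture, each filter is applied to every window of words, passed through a nonlinearity, and max-pooled over time; the pooled values of all filters (possibly with different window sizes) are concatenated into the sentence representation that feeds the classifier. The sketch below covers only this feature extraction step. ReLU, the 2-dimensional word vectors, and all weights are assumptions chosen for illustration, and the function names are hypothetical.

```python
def relu(x):
    return max(0, x)


def feature_for_filter(sentence, weights, bias):
    """Apply one filter to every window, then max-pool over time."""
    window = len(weights)
    activations = []
    for i in range(len(sentence) - window + 1):
        total = bias
        for offset in range(window):
            for d, w in enumerate(weights[offset]):
                total += w * sentence[i + offset][d]
        activations.append(relu(total))
    return max(activations)


def sentence_features(sentence, filters):
    """Concatenate the pooled feature of every filter."""
    return [feature_for_filter(sentence, w, b) for w, b in filters]


# Hypothetical 4-word sentence with 2-dimensional word vectors.
sentence = [[1, 0], [0, 1], [1, 1], [2, 0]]
# Hypothetical filters as (weights, bias) pairs with window sizes 2 and 3.
filters = [
    ([[1, 0], [0, 1]], 0),
    ([[1, 1], [-1, 0], [0, 1]], -1),
]
features = sentence_features(sentence, filters)
print(features)  # [2, 1]
```

The feature vector has one entry per filter regardless of sentence length, which is what lets a fixed-size classifier sit on top.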


<a name='2-2'></a>
## 2.2 Hyperparameters


<img src='../images/10-hyperparameters.png' width='600' height='300'>



<a name='2-3'></a>
## 2.3 Model Variants and results

<img src='../images/10-modelVariants.png' width='600' height='300'>

<img src='../images/10-results.png' width='600' height='300'>



<a name='3'></a>
# 3. Model comparisons, related techniques and applications


<a name='3-1'></a>
## 3.1 Comparisons: Bag of Vectors, Window Model, CNNs, RNNs

<img src='../images/10-models.png' width='600' height='300'>

<a name='3-2'></a>
## 3.2 Techniques


<a name='3-2-1'></a>
### 3.2.1 Batch Normalization

- Reference: [Ioffe and Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.](https://arxiv.org/pdf/1502.03167.pdf)
- Often used in CNNs
- Transform the convolution output of a batch by shifting and scaling the activations to have zero mean and unit variance (similar to the z-score standardization in statistics)
- Use of BatchNorm makes models much less sensitive to parameter initialization, since outputs are automatically rescaled
- PyTorch: nn.BatchNorm1d
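
A minimal sketch of the normalization step for a batch of feature vectors, in plain Python rather than PyTorch. It assumes training-mode statistics and leaves out the learnable scale and shift parameters and the running averages that `nn.BatchNorm1d` also maintains; the small `eps` term guards against division by zero. The batch values are made up.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance over the batch."""
    n = len(batch)
    n_features = len(batch[0])
    normalized = [[0.0] * n_features for _ in range(n)]
    for f in range(n_features):
        column = [row[f] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased variance
        for r in range(n):
            normalized[r][f] = (batch[r][f] - mean) / math.sqrt(var + eps)
    return normalized


# Hypothetical batch: the second feature is the first one scaled by 10.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch)
for row in normalized:
    print([round(x, 3) for x in row])
```

Both columns come out almost identical after normalization even though their raw scales differ by a factor of 10, which illustrates why the rescaling reduces sensitivity to the scale of the incoming activations.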


<a name='3-2-2'></a>
### 3.2.2 1 x 1 Convolutions

- [Lin, Chen, and Yan. 2013. Network in network. arXiv:1312.4400.](https://arxiv.org/pdf/1312.4400.pdf)
- 1 x 1 convolutions, a.k.a. Network-in-network (NiN) connections, are convolutional kernels with kernel_size=1
- A 1 x 1 convolution gives you a fully connected linear layer across channels!
- It can be used to map from many channels to fewer channels
- 1 x 1 convolutions add additional neural network layers with very few additional parameters
    - Unlike Fully Connected (FC) layers which add a lot of parameters
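
A sketch of why a 1 x 1 convolution is a linear layer across channels: the same small weight matrix is applied independently at every position. The layout (channels x length), the values, and the omission of a bias term are assumptions for illustration.

```python
def conv1x1(features, weight):
    """features: in_channels x length; weight: out_channels x in_channels."""
    length = len(features[0])
    return [
        [sum(w * features[c][t] for c, w in enumerate(row)) for t in range(length)]
        for row in weight
    ]


# Hypothetical input with 4 channels and 3 positions.
features = [
    [1, 2, 3],
    [0, 1, 0],
    [2, 0, 1],
    [1, 1, 1],
]
# Hypothetical weights mapping 4 channels down to 2.
weight = [
    [1, 0, 0, 1],
    [0, 2, -1, 0],
]
reduced = conv1x1(features, weight)
print(reduced)  # [[2, 3, 4], [-2, 2, -1]]

# Parameter counts (no bias): shared 1 x 1 weights vs. a fully connected
# layer between the flattened input (4 * 3) and flattened output (2 * 3).
conv1x1_params = len(weight) * len(weight[0])
fc_params = (4 * 3) * (2 * 3)
print(conv1x1_params, fc_params)  # 8 72
```

The parameter count of the 1 x 1 convolution does not depend on the sequence length, whereas the fully connected alternative grows with it.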


<a name='3-3'></a>
## 3.3 Application

<a name='3-3-1'></a>
### 3.3.1 Translation

- Reference: [Kalchbrenner and Blunsom (2013) “Recurrent Continuous Translation Models”](https://aclanthology.org/D13-1176.pdf)

<img src='../images/10-cnnApp.png' width='600' height='300'>


<a name='3-3-2'></a>
### 3.3.2 POS tagging

- Reference: [Dos Santos and Zadrozny (2014). Learning Character-level Representations for Part-of-Speech Tagging](http://proceedings.mlr.press/v32/santos14.pdf)

<img src='../images/10-cnnApp2.png' width='600' height='300'>


<a name='3-3-3'></a>
### 3.3.3 Character-Aware Neural Language Models

- Reference: [Kim, Jernite, Sontag, and Rush 2015. Character-Aware Neural Language Models](https://arxiv.org/pdf/1508.06615.pdf)
- Abstract: We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.


<font color="blue"> Other models are also introduced in the lecture, but in a way that is too broad as well as too complicated to capture precisely here. </font>

<a name='4'></a>
# 4. References

- [Course website](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/index.html)

- [Lecture video](https://youtu.be/EAJoRA0KX7I) 

- [Lecture slides](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture11-convnets.pdf)

- [Yoon Kim (2014): Convolutional Neural Networks for Sentence Classification. EMNLP 2014.](https://arxiv.org/pdf/1408.5882.pdf)