# CNN for Sentence Classification (Sentiment Analysis)
- CNNs are best known for grid-like data such as images
- However, CNNs are effective for NLP tasks as well
- For more information, refer to:
    - Kim 2014 (http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf)
    - Zhang et al. 2015 (https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)
    
<br>
- In this section, we perform sentence classification with CNNs (Kim 2014)
</br>
<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM.png" style="width: 800px"/>

<br>
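- Each row of the input matrix is the embedding vector of one word in the sentence, so embedding entries play the role that pixels play in an image
- Convolutions are performed at the word level: each filter spans the full embedding width and slides over consecutive words
- Each sentence is classified as positive (1) or negative (0)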
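
To make the "sentence as an image" idea concrete, here is a minimal NumPy sketch of one filter sliding over a sentence matrix, followed by max-over-time pooling. The toy vocabulary, the random embeddings and filter weights, and the choice of ReLU as the nonlinearity are all assumptions made for illustration; in a trained model these weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10          # hypothetical toy vocabulary size
embedding_dimension = 4  # toy value; the notebook uses 100
window = 3               # filter height: number of consecutive words covered

# Embedding table with one row per word index (random here, learned in practice).
embeddings = rng.normal(size=(vocab_size, embedding_dimension))

# A sentence is a sequence of word indices; looking them up gives the "image".
sentence = np.array([1, 5, 2, 7, 3, 9])
sentence_matrix = embeddings[sentence]          # shape: (6, 4)

# One filter spans `window` words and the full embedding width.
filt = rng.normal(size=(window, embedding_dimension))
bias = 0.0

# Slide the filter over every window of consecutive words.
n_positions = len(sentence) - window + 1
feature_map = np.array([
    np.sum(sentence_matrix[i:i + window] * filt) + bias
    for i in range(n_positions)
])
feature_map = np.maximum(feature_map, 0.0)      # ReLU (assumed nonlinearity)

# Max-over-time pooling keeps the strongest response of this filter.
pooled = feature_map.max()

print(sentence_matrix.shape, feature_map.shape)
```

With several filters of different heights, each filter contributes one pooled value, and the resulting vector is what the classifier sees.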

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png" style="width: 600px"/>

## Dataset (+preprocessing)
- IMDb movie review sentiment classification dataset
- Doc: https://keras.io/datasets/
- Parameter description
    - num_features: number of words to account for (i.e., only the n most frequent words are kept)
    - sequence_length: maximum number of words per sentence (shorter sentences are padded with zeros, longer ones are truncated)
    - embedding_dimension: dimensionality of embedding space (i.e., dimensionality of vector representation for each word)

In [10]:
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [11]:
num_features = 3000
sequence_length = 300
embedding_dimension = 100

In [12]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_features)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(25000,)
(25000,)
(25000,)
(25000,)


In [13]:
X_train = pad_sequences(X_train, maxlen=sequence_length, padding='pre')
X_test = pad_sequences(X_test, maxlen=sequence_length, padding='pre')

print(X_train.shape)
print(X_test.shape)

(25000, 300)
(25000, 300)
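
The effect of `pad_sequences` with `padding='pre'` can be sketched in plain NumPy. The helper `pad_pre` below is hypothetical and is not the Keras implementation; it assumes the Keras default `truncating='pre'`, so sequences longer than `maxlen` lose their first words.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # empty sequence stays all padding
        kept = seq[-maxlen:]              # keep only the last `maxlen` items
        out[row, -len(kept):] = kept      # right-align, zeros fill the left
    return out

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9, 10]]
padded = pad_pre(toy_reviews, maxlen=5)

print(padded.tolist())
# first row:  [0, 0, 1, 14, 22]   (padded on the left)
# second row: [6, 7, 8, 9, 10]    (first two items dropped)
```

This is why the padded reviews start with a run of zeros: the real words are right-aligned in each row.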


In [14]:
X_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458,    2,   66,    2,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172,    2, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920,    2,  469,    4,   22,   71,   87,   
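
The integers above can be read back as words. With the default `imdb.load_data` settings, index 0 is padding, 1 marks the start of a review, 2 stands for an out-of-vocabulary word, and real words are offset by 3 from their frequency rank. The sketch below uses a small hand-written `toy_word_index` as a stand-in for the mapping returned by `imdb.get_word_index()`; its ranks are illustrative, and `decode_review` is a hypothetical helper.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded word indices back to tokens (a sketch, assuming default offsets)."""
    reserved = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}
    # Shift each rank by `index_from` to match the encoding used by load_data.
    index_to_word = {rank + index_from: word for word, rank in word_index.items()}
    tokens = []
    for idx in encoded:
        idx = int(idx)
        if idx in reserved:
            tokens.append(reserved[idx])
        else:
            tokens.append(index_to_word.get(idx, "<UNK>"))
    return " ".join(tokens)

# Toy stand-in for imdb.get_word_index(); ranks here are illustrative only.
toy_word_index = {"the": 1, "this": 11, "was": 13, "film": 19}

decoded = decode_review([0, 0, 1, 14, 22, 16, 2, 4], toy_word_index)
print(decoded)
# <PAD> <PAD> <START> this film was <OOV> the
```

In the notebook itself, passing `imdb.get_word_index()` instead of the toy dictionary should decode `X_train[0]` in the same way.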

## 0. Basic CNN sentence classification model
- Basic CNN using 1D convolution and pooling along the word (sequence) axis
- 1D convolution over a sequence is also known as "temporal convolution"
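
Before building the model, it may help to see what a temporal convolution and pooling step computes. The following is a minimal NumPy sketch, not the Keras implementation: `conv1d_valid` and `max_pool1d` are hypothetical helpers, the weights are random, and "valid" padding with stride 1 and ReLU activation are assumptions.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """1D convolution over the time axis.

    x:       (steps, channels)               e.g. (sequence_length, embedding_dimension)
    kernels: (kernel_size, channels, filters)
    bias:    (filters,)
    returns: (steps - kernel_size + 1, filters)
    """
    kernel_size, _, n_filters = kernels.shape
    steps_out = x.shape[0] - kernel_size + 1
    out = np.empty((steps_out, n_filters))
    for t in range(steps_out):
        window = x[t:t + kernel_size]        # (kernel_size, channels)
        # Contract the window with every filter at once.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)              # ReLU

def max_pool1d(x, pool_size):
    """Non-overlapping max pooling over the time axis."""
    steps_out = x.shape[0] // pool_size
    trimmed = x[:steps_out * pool_size]      # drop leftover steps
    return trimmed.reshape(steps_out, pool_size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 100))              # one padded review after embedding
kernels = rng.normal(size=(3, 100, 8)) * 0.1 # 8 filters, each covering 3 words
bias = np.zeros(8)

features = conv1d_valid(x, kernels, bias)    # (298, 8)
pooled = max_pool1d(features, pool_size=2)   # (149, 8)
print(features.shape, pooled.shape)
```

The filters move only along the word axis, which is what distinguishes this from the 2D convolutions used for images.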