<a href="https://colab.research.google.com/github/katarinagresova/GLP/blob/main/experiments/CNN_for_DNA_Classification_with_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CNN for DNA Classification with PyTorch

## Introduction

This article is based on [A Complete Guide to CNN for Sentence Classification with PyTorch](https://chriskhanhtran.github.io/posts/cnn-sentence-classification/) and adapted to the task of DNA classification.

**Convolutional Neural Networks (CNNs)** were originally invented for computer vision (CV) and are now the building block of state-of-the-art CV models. One of the earliest applications of CNNs in Natural Language Processing (NLP) was introduced in the paper ***Convolutional Neural Networks for Sentence Classification*** (Kim, 2014). Following the same idea as in computer vision, the CNN model is used as a feature extractor that encodes semantic features of sentences before these features are fed to a classifier.

With only a simple one-layer CNN trained on top of pretrained word vectors and little hyperparameter tuning, the model achieves excellent results on multiple sentence-level classification tasks. CNN models are now widely used in other NLP tasks such as translation or question answering, as part of more complex architectures.

CNNs have also found their way into the field of DNA classification. We can look at DNA as a sequence of letters that are grouped into words and connected into sentences. So with just a little effort, we can adapt a model created for sentence classification to classify DNA.


This article covers:
- Tokenizing and building a vocabulary from DNA data
- Loading pretrained dna2vec vectors and creating an embedding layer for fine-tuning
- Building and training a CNN model with PyTorch
- Advice for practitioners
- Bonus: Using Skorch as a scikit-like wrapper for PyTorch's Deep Learning models

**Reference:**
-  [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) (Kim, 2014).
- [A Complete Guide to CNN for Sentence Classification with PyTorch](https://chriskhanhtran.github.io/posts/cnn-sentence-classification/) (Tran, 2020)
- [dna2vec: Consistent vector representations of variable-length k-mers](https://arxiv.org/abs/1701.06279) (Ng, 2017)



## 1. Set up

### 1.1. Install dependencies

In [1]:
!pip install genomic-benchmarks
!pip install gensim==4.1.2


You will be prompted to restart your environment in Google Colab after running the following cell. Just click the `RESTART RUNTIME` button at the bottom of the cell's output, confirm the prompt, and continue executing the remaining cells.

In [2]:
!wget https://raw.githubusercontent.com/xinrcornelia/dna2vec/master/requirements.txt
!pip install -r requirements.txt


### 1.2. Import Libraries

In [1]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import matplotlib.pyplot as plt
import torch

%matplotlib inline

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 1.3. Download Dataset

The dataset we will use is Human Non-TATA Promoters, a genomic dataset from [Genomic Benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks). It contains 19657 positive and 16474 negative sequences, each 251 characters long.

In [2]:
from genomic_benchmarks.loc2seq import download_dataset
download_dataset("human_nontata_promoters", version=0)

Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /root/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.


PosixPath('/root/.genomic_benchmarks/human_nontata_promoters')

In [6]:
from genomic_benchmarks.data_check import info
info("human_nontata_promoters", version=0)

Dataset `human_nontata_promoters` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 251.

Totally 36131 sequences have been found, 27097 for training and 9034 for testing.


|Class     |train  |test  |
|:---------|:-----:|:----:|
|negative  |12355  |4119  |
|positive  |14742  |4915  |


### 1.4. Load Training Data

Next, we load the training data into an `np.array`. The downloaded data stores each sequence in a separate `.txt` file. We also insert spaces into the sequences to create words of fixed length, so that we can use word tokenization later.

In [4]:
WORD_LENGTH = 3

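def insert_spaces(text, length):
    """Split text into space-separated words of fixed length."""

    # Drop the trailing characters that would form a shorter last word.
    # Slicing to an explicit end index also works when len(text) is a
    # multiple of length (text[:-0] would return an empty string).
    text = text[:len(text) - len(text) % length]

    # Join consecutive non-overlapping words with single spaces
    return ' '.join(text[i:i+length] for i in range(0, len(text), length))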

def load_text(data_dir):
    """Load text data, uppercase text and save to a list."""

    texts = []
    for filename in os.listdir(data_dir):
        with open(os.path.join(data_dir, filename), 'r') as f: # open in readonly mode
            for line in f:
                texts.append(insert_spaces(line.upper().strip(), WORD_LENGTH))

    return texts

# Load files
neg_text = load_text('/root/.genomic_benchmarks/human_nontata_promoters/train/negative/')
pos_text = load_text('/root/.genomic_benchmarks/human_nontata_promoters/train/positive/')

# Concatenate and label data
texts = np.array(neg_text + pos_text)
labels = np.array([0]*len(neg_text) + [1]*len(pos_text))

Let's look at one training sequence.

In [5]:
texts[0]

'CAT TTT ATG TAC TCA ATA AAT AGC AAC TAT AAT TCA TCT AAT GAT AAC AAT GCT CAT TTT GAA GCT TTA AAA ATA TGT AAA AGC ATA AAG GAG AAA GTA GAC ATA ATC TCT AGT CCT ACC ACT CAG AGG CAA CAA ATG TTA ATG TTT TAG CAT CTT TTC TTC ACC TTC CAT TCA CAC ACA TAT ATG TCC ACA CAG TTG ATT TTG CAG TGT GCA AAC AAC TTT GTA CGG TGT GTA TTA GCC AGG GTT TCA'
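
To make the word construction concrete, the sketch below splits a sequence into non-overlapping k-mers in the way `insert_spaces` is meant to, and checks how many 3-letter words a 251-character sequence yields. The helper name `split_kmers` and the toy sequence are illustrative assumptions, not part of the notebook.

```python
def split_kmers(sequence, k):
    """Split a sequence into non-overlapping k-mers, dropping a shorter tail."""
    usable = len(sequence) - len(sequence) % k  # length that divides evenly by k
    return [sequence[i:i + k] for i in range(0, usable, k)]

toy = "CATTTTATGTA"           # 11 characters: three full 3-mers plus a 2-letter tail
print(split_kmers(toy, 3))    # ['CAT', 'TTT', 'ATG']

# A 251-character sequence gives 251 // 3 = 83 words; the last 2 letters are dropped.
print(len(split_kmers("A" * 251, 3)))  # 83
```

With `WORD_LENGTH = 3`, every training sequence therefore becomes a "sentence" of 83 words.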

### 1.5. Download dna2vec vectors

Here we use pretrained vectors from dna2vec (*Consistent vector representations of variable-length k-mers*), which is word2vec adapted for DNA.



In [6]:
!wget https://github.com/pnpnpn/dna2vec/raw/master/pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v


This is an extracted part of the dna2vec implementation modified by [xinrcornelia](https://github.com/xinrcornelia). The original implementation used old and incompatible libraries.

In [10]:
# Thanks to https://github.com/xinrcornelia/dna2vec/blob/master/dna2vec/multi_k_model.py
import logging
import tempfile

import numpy as np
from gensim.models import KeyedVectors

class SingleKModel:
    def __init__(self, model):
        self.model = model
        self.vocab_lst = sorted(model.index_to_key)

class MultiKModel:
    def __init__(self, filepath):
        self.aggregate = KeyedVectors.load_word2vec_format(filepath, binary=False)
        # Standard-library logger, needed by the warning in `separate_out_model`
        self.logger = logging.getLogger(self.__class__.__name__)

        vocab_lens = [len(vocab) for vocab in self.aggregate.index_to_key]
        self.k_low = min(vocab_lens)
        self.k_high = max(vocab_lens)
        self.vec_dim = self.aggregate.vector_size

        self.data = {}
        for k in range(self.k_low, self.k_high + 1):
            self.data[k] = self.separate_out_model(k)

    def vector(self, vocab):
        return self.data[len(vocab)].model[vocab]

    def l2_norm(self, vocab):
        return np.linalg.norm(self.vector(vocab))

    def separate_out_model(self, k_len):
        vocabs = [vocab for vocab in self.aggregate.index_to_key if len(vocab) == k_len]
        if len(vocabs) != 4 ** k_len:
            self.logger.warning('Missing {}-mers: {} / {}'.format(k_len, len(vocabs), 4 ** k_len))

        header_str = '{} {}'.format(len(vocabs), self.vec_dim)
        with tempfile.NamedTemporaryFile(mode='w') as fptr:
            print(header_str, file=fptr)
            for vocab in vocabs:
                vec_str = ' '.join("%f" % val for val in self.aggregate[vocab])
                print('{} {}'.format(vocab, vec_str), file=fptr)
            fptr.flush()
            return SingleKModel(KeyedVectors.load_word2vec_format(fptr.name, binary=False))



In [11]:
filepath = 'dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

Each DNA word is represented by a 100-dimensional vector.

In [21]:
mk_model.vector('AGT')

array([-0.100437,  0.037371, -0.003308, -0.17211 ,  0.263888,  0.213367,
        0.139173, -0.228796, -0.43588 , -0.152078, -0.06351 ,  0.12935 ,
       -0.309992,  0.110419,  0.15974 ,  0.422234,  0.11112 , -0.152604,
       -0.194822, -0.08244 ,  0.146237,  0.162316, -0.056551,  0.226918,
       -0.566757, -0.160063, -0.476681,  0.246377,  0.402231,  0.49138 ,
        0.148335, -0.276922,  0.306808, -0.256018, -0.023064,  0.23046 ,
       -0.370285,  0.353037, -0.116914,  0.465472,  0.168274,  0.347329,
        0.321257, -0.034686,  0.03985 ,  0.123678, -0.148329, -0.216877,
       -0.061961, -0.132535, -0.090232, -0.301371, -0.264962, -0.030402,
        0.107969, -0.09688 ,  0.133854, -0.43233 ,  0.130724, -0.203363,
       -0.014014,  0.063082,  0.023929,  0.067139,  0.22355 ,  0.158326,
        0.10423 ,  0.02596 ,  0.114201, -0.139774, -0.349662, -0.12374 ,
       -0.328185,  0.079055, -0.22557 , -0.106094, -0.244369, -0.405145,
       -0.144079,  0.052439, -0.031427,  0.036935, 

### 1.6. Set up GPU for Training

Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network it's best to utilize these features.

A GPU can be added by going to the menu and selecting:

> Runtime -> Change runtime type -> Hardware accelerator: GPU

Then we need to run the following cell to specify the GPU as the device.

In [12]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla K80


## 2. Data Preparation

To prepare our text data for training, first we need to tokenize our sentences and build a vocabulary dictionary `word2idx`, which will later be used to convert our tokens into indexes and build an embedding layer.

***So, what is an embedding layer?***

An embedding layer serves as a look-up table which takes word indexes in the vocabulary as input and outputs word vectors. Hence, the embedding layer has shape $(N, d)$ where $N$ is the size of the vocabulary and $d$ is the embedding dimension. In order to fine-tune pretrained word vectors, we need to create an embedding layer in our `nn.Module` class. Our input to the model will then be `input_ids`, which are the tokens' indexes in the vocabulary.
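
The look-up behaviour can be illustrated with plain NumPy indexing. This is only a sketch: the toy vocabulary and the matrix values are made up, and a real `nn.Embedding` additionally tracks gradients so its rows can be trained.

```python
import numpy as np

# Toy vocabulary of N = 4 entries with d = 3 dimensional vectors (values are made up).
word2idx = {'<pad>': 0, '<unk>': 1, 'CAT': 2, 'TTT': 3}
embedding_matrix = np.arange(12, dtype=float).reshape(4, 3)  # shape (N, d)

# A "sentence" encoded as token indexes, padded to length 4.
input_ids = np.array([word2idx['CAT'], word2idx['TTT'],
                      word2idx['CAT'], word2idx['<pad>']])

# The embedding lookup is row indexing: one d-dimensional vector per token.
vectors = embedding_matrix[input_ids]
print(vectors.shape)  # (4, 3)
```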

### 2.1. Tokenize

The function `tokenize` will tokenize our sentences, build a vocabulary and find the maximum sentence length. The function `encode` will take in the outputs of `tokenize`, perform sentence padding and return `input_ids` as a numpy array.

In [13]:
from nltk.tokenize import word_tokenize
from collections import defaultdict

def tokenize(texts):
    """Tokenize texts, build vocabulary and find maximum sentence length.
    
    Args:
        texts (List[str]): List of text data
    
    Returns:
        tokenized_texts (List[List[str]]): List of list of tokens
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Maximum sentence length
    """

    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Building our vocab from the corpus starting from index 2
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add `tokenized_sent` to `tokenized_texts`
        tokenized_texts.append(tokenized_sent)

        # Add new token to `word2idx`
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update `max_len`
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):
    """Pad each sentence to the maximum sentence length and encode tokens to
    their index in the vocabulary.

    Returns:
        input_ids (np.array): Array of token indexes in the vocabulary with
            shape (N, max_len). It will be the input of our CNN model.
    """

    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad sentences to max_len (build a new list so the input is not mutated)
        padded_sent = tokenized_sent + ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens to input_ids, mapping unknown tokens to <unk>
        input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_sent]
        input_ids.append(input_id)

    return np.array(input_ids)
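
To see what these two steps produce, here is a small self-contained sketch on two toy sequences. It assumes words are separated by single spaces and uses `str.split` in place of `word_tokenize`; the helper names `build_vocab` and `encode_padded` are hypothetical and only mirror the functions above.

```python
def build_vocab(texts):
    """Tokenize on whitespace, assign indexes from 2, and track the longest text."""
    word2idx = {'<pad>': 0, '<unk>': 1}
    tokenized, max_len = [], 0
    for text in texts:
        tokens = text.split()
        tokenized.append(tokens)
        for token in tokens:
            # New tokens get the next free index; known tokens are left alone
            word2idx.setdefault(token, len(word2idx))
        max_len = max(max_len, len(tokens))
    return tokenized, word2idx, max_len

def encode_padded(tokenized, word2idx, max_len):
    """Map tokens to indexes and right-pad every row with the <pad> index."""
    pad, unk = word2idx['<pad>'], word2idx['<unk>']
    return [[word2idx.get(t, unk) for t in tokens] + [pad] * (max_len - len(tokens))
            for tokens in tokenized]

tokenized, word2idx, max_len = build_vocab(["CAT TTT ATG", "TTT CAT"])
print(word2idx)   # {'<pad>': 0, '<unk>': 1, 'CAT': 2, 'TTT': 3, 'ATG': 4}
print(encode_padded(tokenized, word2idx, max_len))  # [[2, 3, 4], [3, 2, 0]]
```

Because all sequences in this dataset have the same length, padding is rarely triggered here, but it keeps the code usable for variable-length inputs.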

### 2.2. Load Pretrained Vectors

We will load the pretrained vectors for each token in our vocabulary. For tokens with no pretrained vectors, we will initialize random word vectors with the same length and variance.

In [18]:
def load_pretrained_vectors(word2idx):
    """Load pretrained vectors and create the embedding matrix.

    Vectors are read from the global `mk_model` created above.

    Args:
        word2idx (Dict): Vocabulary built from the corpus

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d) where N is
            the size of word2idx and d is embedding dimension
    """
    d = len(mk_model.vector('AAA'))

    # Initialize random embeddings
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
    embeddings[word2idx['<pad>']] = np.zeros((d,))
    embeddings[word2idx['<unk>']] = np.ones((d,))

    # Load pretrained vectors
    for word in word2idx.keys():
        if word == '<pad>' or word == '<unk>':
            continue
        try:
            embeddings[word2idx[word]] = mk_model.vector(word)
        except KeyError:
            # No pretrained vector for this token: keep the random one
            pass

    return embeddings
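
The fill-in logic can be tried without downloading dna2vec. In the sketch below, the `pretrained` dictionary is a hypothetical stand-in for `mk_model`, the dimension is shrunk to 4, and all vector values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension (the dna2vec vectors above have 100)

# Hypothetical stand-in for the pretrained model: only two k-mers are known.
pretrained = {'CAT': np.full(d, 0.1), 'TTT': np.full(d, 0.2)}
word2idx = {'<pad>': 0, '<unk>': 1, 'CAT': 2, 'TTT': 3, 'NNN': 4}

# Start from small random vectors, then overwrite the rows we have vectors for.
embeddings = rng.uniform(-0.25, 0.25, (len(word2idx), d))
embeddings[word2idx['<pad>']] = 0.0
found = 0
for word, idx in word2idx.items():
    if word in pretrained:
        embeddings[idx] = pretrained[word]
        found += 1

print(f"{found} / {len(word2idx)} tokens have pretrained vectors")
```

Counting the covered tokens this way is a cheap sanity check that the vocabulary and the pretrained vectors use the same k-mer length.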

Now let's put the above steps together.

In [30]:
# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts, word2idx, max_len = tokenize(texts)
input_ids = encode(tokenized_texts, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx)
embeddings = torch.tensor(embeddings)

Tokenizing...



In [None]:
embeddings[word2idx['<pad>']]

In [None]:
embeddings[word2idx['ACG']]

### 2.3. Create PyTorch DataLoader

We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed. The batch_size used in the paper is 50.

In [33]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, val_inputs, train_labels, val_labels,
                batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels =\
    tuple(torch.tensor(data) for data in
          [train_inputs, val_inputs, train_labels, val_labels])

    # Batch size is taken from the `batch_size` argument

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

We will use 90% of the dataset for training and 10% for validation.

In [34]:
from sklearn.model_selection import train_test_split

# Train Test Split
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, labels, test_size=0.1, random_state=42)

# Load data to PyTorch DataLoader
train_dataloader, val_dataloader = \
data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)
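
As a quick check of what this split produces, the arithmetic below estimates the set sizes and the number of batches per epoch. It assumes the 27097 training sequences reported by `info` above, and that `train_test_split` rounds the validation share up and gives the rest to training.

```python
import math

n_samples = 27097      # training sequences reported by `info` above
test_size = 0.1
batch_size = 50

# Assumed rounding: validation share rounded up, remainder used for training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

# The last batch may be smaller, so the batch count is rounded up as well.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(n_train, n_val)              # 24387 2710
print(train_batches, val_batches)  # 488 55
```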

## 3. Model

**CNN Architecture**

The picture below is the illustration of the CNN architecture that we are going to build with three filter sizes: 2, 3, and 4, each of which has 2 filters.

![](https://github.com/chriskhanhtran/CNN-Sentence-Classification-PyTorch/blob/master/cnn-architecture.JPG?raw=true)

*CNN Architecture (Source: Zhang, 2015)*

```python
# Sample configuration:
filter_sizes = [2, 3, 4]
num_filters = [2, 2, 2]
```

Suppose that we are classifying the sentence "***I like this movie very much!***" ($N = 7$ tokens) and the dimensionality of word vectors is $d=5$. After applying the embedding layer on the input token ids, the sample sentence is represented as a 2D tensor with shape (7, 5), like an image.

$$\mathrm{x_{emb}} \quad \in \mathbb{R}^{7 \times 5}$$

We then use 1-dimensional convolution to extract features from the sentence. In this example, we have 6 filters in total, and each filter has shape $(f_i, d)$ where $f_i$ is the filter size for $i \in \{1,...,6\}$. Each filter then scans over $\mathrm{x_{emb}}$ and returns a feature map:

$$\mathrm{x_{conv_i} = Conv1D(x_{emb})} \quad \in \mathbb{R}^{N-f_i+1}$$

Next, we apply the ReLU activation to $\mathrm{x_{conv_{i}}}$ and use max-over-time-pooling to reduce each feature map to a single scalar. Then we concatenate these scalars into the final feature vector which will be fed to a fully connected layer to compute the final scores for our classes (logits).

$$\mathrm{x_{pool_i} = MaxPool(ReLU(x_{conv_i}))} \quad \in \mathbb{R}$$

$$\mathrm{x_{fc} = \texttt{concat}(x_{pool_i})} \quad \in \mathbb{R}^6$$

The idea here is that each filter will capture different semantic signals in the sentence (e.g. happiness, humor, politics, anger...) and max-pooling will record only the strongest signal over the sentence. This logic makes sense because humans also perceive the sentiment of a sentence based on its strongest word/signal.

Finally, we apply dropout to $\mathrm{x_{fc}}$ and use a fully connected layer with the weight matrix $\mathbf{W_{fc}} \in \mathbb{R}^{2 \times 6}$ to compute $\mathrm{logits}$, a vector of length 2 that keeps the scores for the 2 classes.

$$\mathrm{logits = \mathbf{W_{fc}}\,Dropout(x_{fc})}  \in \mathbb{R}^2$$

An in-depth explanation of CNN can be found in this [article](https://cs231n.github.io/convolutional-networks/) and this [video](https://www.youtube.com/watch?v=YRhxdVk_sIs).
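
The shapes in the toy example can be traced with a few lines of NumPy. This is a sketch under simplifying assumptions: the embedded sentence and the filter weights are random, biases are omitted, and the sliding window is written as an explicit loop rather than with `nn.Conv1d`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 7, 5                           # 7 tokens, 5-dimensional word vectors
x_emb = rng.normal(size=(N, d))       # stands in for the embedded sentence

filter_sizes = [2, 3, 4]
num_filters = [2, 2, 2]

pooled = []
for f, n in zip(filter_sizes, num_filters):
    for _ in range(n):
        w = rng.normal(size=(f, d))   # one filter of shape (f_i, d), random for illustration
        # Slide the filter over the sentence: one value per window of f tokens.
        x_conv = np.array([np.sum(w * x_emb[t:t + f]) for t in range(N - f + 1)])
        assert x_conv.shape == (N - f + 1,)
        # ReLU followed by max-over-time pooling leaves a single scalar.
        pooled.append(np.maximum(x_conv, 0.0).max())

x_fc = np.array(pooled)
print(x_fc.shape)  # (6,)
```

The feature maps have lengths 6, 5 and 4 for filter sizes 2, 3 and 4, and pooling reduces each of them to one entry of the 6-dimensional feature vector.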










### 3.1. Create CNN Model

For simplicity, the model above has very small configurations. The final model we'll use is much bigger but has the same architecture:

|Description         |Values           |
|:------------------:|:---------------:|
|input word vectors  |dna2vec         |
|embedding size      |100 (dna2vec), 300 (random init) |
|filter sizes        |(3, 4, 5)        |
|num filters         |(100, 100, 100)  |
|activation          |ReLU             |
|pooling             |1-max pooling    |
|dropout rate        |0.5              |
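
To get a feel for the size of this configuration, the sketch below counts the parameters of the convolutional and fully connected layers. It assumes 100-dimensional dna2vec embeddings and sentences of 83 words (251-character sequences cut into 3-letter words); both values are assumptions derived from the earlier sections, and the embedding layer itself is not counted.

```python
embed_dim = 100            # assumption: dna2vec vector size
max_len = 83               # assumption: 251 // 3 words per sequence
filter_sizes = [3, 4, 5]
num_filters = [100, 100, 100]
num_classes = 2

# Each filter has embed_dim * kernel_size weights plus one bias.
conv_params = [n * (embed_dim * k + 1) for k, n in zip(filter_sizes, num_filters)]
# Feature-map length before max pooling, per filter size.
feature_map_lengths = [max_len - k + 1 for k in filter_sizes]
# The fully connected layer maps the pooled features to the class scores.
fc_params = sum(num_filters) * num_classes + num_classes

print(conv_params)           # [30100, 40100, 50100]
print(feature_map_lengths)   # [81, 80, 79]
print(fc_params)             # 602
```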



In [35]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_NLP(nn.Module):
    """A 1D Convolutional Neural Network for DNA Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=2,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretrained
                vectors. Default: False
            vocab_size (int): Needs to be specified when pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Needs to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            num_classes (int): Number of classes. Default: 2
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()
        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                n_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits

### 3.2. Optimizer

To train Deep Learning models, we need to define a loss function and minimize this loss. We'll use back-propagation to compute gradients and use an optimization algorithm (e.g. gradient descent) to minimize the loss. The original paper used the Adadelta optimizer.

In [36]:
import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[3, 4, 5],
                    num_filters=[100, 100, 100],
                    num_classes=2,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert len(filter_sizes) == len(num_filters), \
        "filter_sizes and num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=num_classes,
                        dropout=dropout)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return cnn_model, optimizer
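
For intuition about what Adadelta does with `lr` and `rho`, here is a minimal scalar sketch of its update rule, written in plain Python and applied to $f(x) = x^2$. The function name `adadelta_minimize` and the toy objective are illustrative assumptions; `optim.Adadelta` applies the same kind of update to every parameter tensor.

```python
import math

def adadelta_minimize(grad_fn, x, steps, lr=1.0, rho=0.95, eps=1e-6):
    """Run scalar Adadelta updates and return the final parameter value."""
    avg_sq_grad = 0.0    # running average of squared gradients
    avg_sq_delta = 0.0   # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size adapts to the ratio of past update sizes to past gradient sizes
        delta = math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x -= lr * delta
    return x

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x_final = adadelta_minimize(lambda x: 2 * x, x=1.0, steps=50)
print(x_final)  # moves from 1.0 toward the minimum at 0
```

The early steps are tiny because the running average of updates starts at zero, which is one reason a larger `learning_rate` can help with this optimizer.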

### 3.3. Training Loop

For each epoch, the code below will perform a forward step to compute the *Cross Entropy* loss, a backward step to compute gradients and use the optimizer to update weights/parameters. At the end of each epoch, the loss on training data and the accuracy over the validation data will be printed to help us keep track of the model's performance. The code is heavily annotated with detailed explanations.

In [37]:
import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | "
          f"{'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Track the best accuracy
            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | "
                  f"{val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
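
The two quantities tracked above, cross-entropy loss and accuracy, can be computed by hand for a tiny batch. The sketch below uses plain Python and made-up logits; the helper name `cross_entropy_and_accuracy` is hypothetical and is meant to mirror what `nn.CrossEntropyLoss` and the `argmax` comparison compute.

```python
import math

def cross_entropy_and_accuracy(logits, labels):
    """Return mean cross-entropy loss and accuracy (%) for a batch of logits."""
    losses, correct = [], 0
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        # Negative log-probability of the true class
        losses.append(log_sum - row[label])
        correct += int(row.index(max(row)) == label)
    return sum(losses) / len(losses), 100.0 * correct / len(labels)

logits = [[2.0, 0.0], [0.0, 2.0]]   # made-up scores for two sequences
labels = [0, 0]
loss, acc = cross_entropy_and_accuracy(logits, labels)
print(round(loss, 4), acc)  # 1.1269 50.0
```

The confidently wrong second prediction contributes most of the loss, which is why validation loss can rise even while accuracy stays flat.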

## 4. Evaluation 

In the original paper, the author tried different variations of the model.
- **CNN-rand**: The baseline model where the embedding layer is randomly initialized and then updated during training.
- **CNN-static**: A model with pretrained vectors. However, the embedding layer is frozen during training.
- **CNN-non-static**: Same as above but the embedding layers are fine-tuned during training.

We will experiment with all 3 variations and compare their performance. Below is the report of our results and the results in the original paper.

|Model            |Kim's results  |Our results  |
|:----------------|:-------------:|:-----------:|
|CNN-rand         |76.1           |74.2         |
|CNN-static       |81.0           |82.7         |
|CNN-non-static   |81.5           |84.4         |

Randomness could explain part of the difference in the results. I think our results improved because we used fastText pretrained vectors, which are of higher quality than the word2vec vectors that the author used.


In [38]:
# CNN-rand: Word vectors are randomly initialized.
set_seed(42)
cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
train(cnn_rand, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.547790   |  0.468032  |   78.00   |   11.39  
   2    |   0.473107   |  0.430156  |   80.76   |   11.17  
   3    |   0.450396   |  0.418008  |   81.27   |   11.12  
   4    |   0.435977   |  0.412865  |   81.96   |   11.10  
   5    |   0.423769   |  0.405721  |   81.45   |   11.09  
   6    |   0.409464   |  0.394151  |   82.18   |   11.09  
   7    |   0.394133   |  0.386645  |   82.80   |   11.09  
   8    |   0.383522   |  0.397456  |   82.51   |   11.12  
   9    |   0.367888   |  0.374340  |   82.29   |   11.15  
  10    |   0.349695   |  0.382369  |   82.91   |   11.12  
  11    |   0.335532   |  0.358181  |   83.71   |   11.12  
  12    |   0.317736   |  0.351361  |   84.18   |   11.14  
  13    |   0.302791   |  0.369666  |   82.44   |   11.18  
  14    |   0.288211   |  0.339142  |   85.31   |   11.19  
  15    |   0.274176

In [39]:
# CNN-static: dna2vec pretrained word vectors are used and frozen during training.
set_seed(42)
cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                        freeze_embedding=True,
                                        learning_rate=0.25,
                                        dropout=0.5)
train(cnn_static, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.506759   |  0.427203  |   81.20   |   4.00   
   2    |   0.440566   |  0.460579  |   75.45   |   3.98   
   3    |   0.408395   |  0.377703  |   83.64   |   3.94   
   4    |   0.382690   |  0.402652  |   82.40   |   3.97   
   5    |   0.358497   |  0.352684  |   84.73   |   3.96   
   6    |   0.339975   |  0.348530  |   84.51   |   3.98   
   7    |   0.316618   |  0.336532  |   85.35   |   4.00   
   8    |   0.299876   |  0.328854  |   86.29   |   3.98   
   9    |   0.282721   |  0.326675  |   85.56   |   3.98   
  10    |   0.260602   |  0.307738  |   87.16   |   4.00   
  11    |   0.254129   |  0.311938  |   86.33   |   3.98   
  12    |   0.235081   |  0.417167  |   83.31   |   4.01   
  13    |   0.221338   |  0.297859  |   86.76   |   3.98   
  14    |   0.209746   |  0.440919  |   83.16   |   4.01   
  15    |   0.195728

In [40]:
# CNN-non-static: dna2vec pretrained word vectors are fine-tuned during training.
set_seed(42)
cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_non_static, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.502920   |  0.425191  |   81.35   |   6.38   
   2    |   0.438719   |  0.455764  |   75.85   |   6.35   
   3    |   0.406695   |  0.378622  |   83.27   |   6.32   
   4    |   0.379535   |  0.401353  |   82.69   |   6.33   
   5    |   0.355122   |  0.353867  |   84.87   |   6.37   
   6    |   0.334047   |  0.344414  |   85.13   |   6.42   
   7    |   0.309328   |  0.331308  |   85.75   |   6.36   
   8    |   0.288443   |  0.325796  |   86.58   |   6.35   
   9    |   0.269115   |  0.317186  |   86.76   |   6.38   
  10    |   0.246912   |  0.305986  |   86.98   |   6.47   
  11    |   0.233542   |  0.316299  |   86.58   |   6.39   
  12    |   0.214799   |  0.419305  |   83.45   |   6.37   
  13    |   0.199761   |  0.294926  |   87.13   |   6.33   
  14    |   0.189140   |  0.329985  |   86.80   |   6.33   
  15    |   0.173189

## 5. Test Model

Let's test our CNN-non-static model on some examples.

In [None]:
def predict(text, model=cnn_non_static.to("cpu"), max_len=62):
    """Predict probability that a review is positive."""

    # Tokenize, pad and encode text
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Convert to PyTorch tensors
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    # Compute logits
    logits = model.forward(input_id)

    #  Compute probability
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[1] * 100:.2f}% positive.")
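
The `predict` helper above tokenizes English text. For DNA input, the same pad-and-encode step might look like the sketch below. It assumes non-overlapping k-mer tokens and uses a toy vocabulary; the functions `kmer_tokenize` and `encode` and the dictionary `toy_word2idx` are hypothetical and may differ from the tokenization used earlier in this notebook.

```python
def kmer_tokenize(seq, k=3):
    """Split a DNA string into non-overlapping k-mers (an assumed scheme)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def encode(tokens, word2idx, max_len):
    """Truncate or pad the tokens to max_len, then map them to vocabulary ids."""
    tokens = tokens[:max_len]
    padded = tokens + ['<pad>'] * (max_len - len(tokens))
    return [word2idx.get(token, word2idx['<unk>']) for token in padded]

# Toy vocabulary (assumption); a real one is built from the training data.
toy_word2idx = {'<pad>': 0, '<unk>': 1, 'ACG': 2, 'TTA': 3}

ids = encode(kmer_tokenize("acgttaggc"), toy_word2idx, max_len=5)
print(ids)  # [2, 3, 1, 0, 0]
```

Unlike `predict`, this sketch also truncates inputs longer than `max_len`, so the encoded length is always the same.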

Our model can easily recognize reviews with strong negative signals. On samples that have mixed feelings but positive sentiment overall, our model also gets excellent results.

In [None]:
predict("All of friends slept while watching this movie. But I really enjoyed it.")
predict("I have waited so long for this movie. I am now so satisfied and happy.")
predict("This movie is long and boring.")
predict("I don't like the ending.")

This review is 61.22% positive.
This review is 94.68% positive.
This review is 0.01% positive.
This review is 4.03% positive.


## 6. Advice for Practitioners

In [***A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification***](https://arxiv.org/abs/1510.03820) (Zhang and Wallace, 2015), the authors conducted a sensitivity analysis of the above CNN architecture by running it with many different sets of hyperparameters. Based on the main empirical findings of that research, here is some advice for practitioners on choosing hyperparameters when applying this architecture to sentence classification tasks:
- **Input word vectors:** Using pretrained word vectors such as word2vec or GloVe (or fastText in our implementation) yields much better results than using one-hot vectors or randomly initialized vectors.
- **Filter region size** can have a large effect on performance and should be tuned. A reasonable range might be 1 to 10. For example, using `filter_size=[7]` and `num_filters=[400]` yields the best result on the MR dataset.
- **Number of feature maps:** Try values from 100 to 600 for each filter region size.
- **Activation functions:** ReLU and tanh are the best candidates.
- **Pooling:** Use 1-max pooling.
- **Regularization:** When increasing number of feature maps, try imposing stronger regularization, e.g. a dropout rate larger than 0.5.
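
To make these knobs concrete, here is a minimal sketch of a convolution block with ReLU, 1-max pooling, and dropout. The class name `TinyTextCNN` and all sizes are illustrative assumptions, not the model defined earlier in this notebook; the input is assumed to be already embedded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextCNN(nn.Module):
    """Illustrative only: one Conv1d per filter region size, then 1-max pooling."""

    def __init__(self, embed_dim=8, filter_sizes=(3, 5, 7),
                 num_filters=(16, 16, 16), num_classes=2, dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embed_dim, out_channels=n, kernel_size=k)
            for k, n in zip(filter_sizes, num_filters)
        ])
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(sum(num_filters), num_classes)

    def forward(self, x_embed):
        # (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len) for Conv1d
        x = x_embed.permute(0, 2, 1)
        # ReLU activation, then 1-max pooling over the sequence dimension
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))

model = TinyTextCNN()
logits = model(torch.randn(4, 20, 8))  # 4 sequences, 20 tokens, 8-dim embeddings
print(logits.shape)  # torch.Size([4, 2])
```

Changing `filter_sizes`, `num_filters`, or `dropout` in the constructor is then enough to try the ranges suggested in the list above.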





## Bonus: Skorch: A Scikit-like Library for PyTorch Modules 

You may find the training loop in PyTorch intimidating, with its many steps, and wonder why those steps aren't wrapped in functions like `model.fit()` and `model.predict()` as in the `scikit-learn` library. This is actually something I like about PyTorch. It lets me modify my code to add customizations during training, such as gradient clipping or learning-rate updates. And because I build my model and training loop block by block, it's easier to track down bugs when my model runs into errors. However, when I need to deploy a baseline model quickly, writing an entire training loop is a real burden. That is when I turn to `skorch`.

`skorch` is "a scikit-learn compatible neural network library that wraps PyTorch." There is no need to create a `DataLoader` or write a training/evaluation loop. All you need to do is define the model and optimizer as in the code below; then a simple `net.fit(X, y)` is enough.

`skorch` not only makes training your deep learning models neat and fast, it also provides more powerful support. You can pass `callbacks` to define early stopping and when to save your model. You can also combine a `skorch` model with `scikit-learn` methods to do cross-validation and hyperparameter tuning with grid search. Please check out the [documentation](https://skorch.readthedocs.io/en/stable/index.html#) to explore this powerful PyTorch library for deep learning.





In [41]:
!pip install skorch
from skorch import NeuralNetClassifier
from skorch.helper import predefined_split
from skorch.callbacks import EarlyStopping, Checkpoint, LoadInitState
from skorch.dataset import Dataset  # CVSplit was imported here but never used

# Specify validation set
val_dataset = Dataset(val_inputs, val_labels)

# Specify callbacks and checkpoints
cp = Checkpoint(monitor='valid_acc_best', dirname='exp1')
callbacks = [
    ('early_stop', EarlyStopping(monitor='valid_acc', patience=5, lower_is_better=False)),
    cp
]

net = NeuralNetClassifier(
    # Module
    module=CNN_NLP,
    module__pretrained_embedding=embeddings,
    module__freeze_embedding=False,
    module__dropout=0.5,
    # Optimizer
    criterion=nn.CrossEntropyLoss,
    optimizer=optim.Adadelta,
    optimizer__lr=0.25,
    optimizer__rho=0.95,
    # Others
    max_epochs=20,
    batch_size=50,
    train_split=predefined_split(val_dataset),
    iterator_train__shuffle=True,
    warm_start=False,
    callbacks=callbacks,
    device=device
)

Collecting skorch
  Downloading skorch-0.11.0-py3-none-any.whl (155 kB)

`skorch` also prints training results in a very nice table. My training loop in section 3 is inspired by this format. When model checkpoints are saved, you can see a `+` sign in the `cp` column.

In [42]:
set_seed(42)
_ = net.fit(np.array(train_inputs), train_labels)

valid_acc_best = np.max(net.history[:, 'valid_acc'])
print(f"Training complete! Best accuracy: {valid_acc_best * 100:.2f}%")

  epoch    train_loss    valid_acc    valid_loss    cp     dur
-------  ------------  -----------  ------------  ----  ------
      1        0.5031       0.8111        0.4267     +  6.3466
      2        0.4386       0.7649        0.4560        6.3238
      3        0.4070       0.8321        0.3807     +  6.3304
      4        0.3805       0.8251        0.4025        6.3272
      5        0.3556       0.8399        0.3551     +  6.3347
      6        0.3344       0.8446        0.3482     +  6.3666
      7        0.3110       0.8428        0.3419        6.3822
      8        0.2910       0.8601        0.3234     +  6.3817
      9        0.2695       0.8620        0.3203     +  6.3544
     10        0.2482       0.8683        0.3060     +  6.3494
     11

Because a deep learning model can overfit the training data, it's important to save the model at the point where it performs best on the validation data. After training, we can load the model from the last checkpoint to make predictions.

In [None]:
# Load parameters from checkpoint
net.load_params(checkpoint=cp)

predict("All of friends slept while watching this movie. But I really enjoyed it.", model=net)
predict("I have waited so long for this movie. I am now so satisfied and happy.", model=net)
predict("This movie is long and boring.", model=net)
predict("I don't like the ending.", model=net)

This review is 67.25% positive.
This review is 61.38% positive.
This review is 0.12% positive.
This review is 19.14% positive.
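
Outside `skorch`, the same keep-the-best idea can be sketched in plain PyTorch by copying the model's `state_dict` whenever validation accuracy improves. In the sketch below the model, the per-epoch accuracies, and the weight update are all made-up placeholders standing in for a real training loop.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                    # stand-in model (assumption)
val_accuracies = [0.81, 0.76, 0.83, 0.82]  # made-up per-epoch validation accuracies

best_acc, best_epoch, best_state = float("-inf"), None, None
for epoch, val_acc in enumerate(val_accuracies, start=1):
    # A real loop would train for one epoch and evaluate here;
    # we only nudge the weights so that each epoch has distinct parameters.
    with torch.no_grad():
        model.weight.add_(0.1)

    if val_acc > best_acc:
        best_acc, best_epoch = val_acc, epoch
        # deepcopy so later updates do not overwrite the saved weights
        best_state = copy.deepcopy(model.state_dict())

# Restore the parameters from the best epoch before making predictions
model.load_state_dict(best_state)
print(f"Restored weights from epoch {best_epoch} (val acc {best_acc:.2f})")
```

The `deepcopy` matters: `state_dict()` returns references to the live tensors, so without a copy the "best" weights would silently follow later updates.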


## Conclusion

Before the rise of huge and complicated models using the Transformer architecture, a simple CNN architecture with one convolutional layer could yield excellent performance on sentence classification tasks. The model can take advantage of unsupervised pre-training of word vectors to improve overall performance. This architecture can be improved by increasing the number of CNN layers or by using a sub-word model (a BPE tokenizer and fastText pretrained sub-word vectors). Because of its speed, the CNN model makes a strong baseline before trying more complicated models such as BERT.

Thank you for staying with me to this point. If interested, you can check out other articles in my NLP tutorial series:
- [Tutorial: Fine-tuning BERT for Sentiment Analysis](https://chriskhanhtran.github.io/posts/bert_for_sentiment_analysis/)