# Text classification with Transformer

Author: 고유경

---

[Original Keras Code](https://keras.io/examples/nlp/text_classification_with_transformer/)


**Main Task**: Sentiment analysis of IMDB movie reviews (positive/negative)

References

- https://tutorials.pytorch.kr/beginner/text_sentiment_ngrams_tutorial.html

- https://wikidocs.net/60691

- https://cpm0722.github.io/pytorch-implementation/transformer

---


## 0. Import

- `torch == 1.10.0`
- `torchtext == 0.11.0`

In [1]:
import re
import os
import math
from timeit import default_timer as timer
from typing import Iterable, List
from IPython.display import Image

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from torch import Tensor
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torchsummary import summary

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset
from torchtext.vocab import build_vocab_from_iterator

In [2]:
## GPU check
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.device_count())

True
NVIDIA GeForce RTX 2080 Ti
1


In [3]:
NUM_EPOCHS = 5
LEARNING_RATE = 0.01
BATCH_SIZE = 64
UNK_IDX, PAD_IDX = 0, 1
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 128

#NUM_ENCODER_LAYERS = 3

## 1. Loading the data

IMDB movie review data

- Consists of 50,000 reviews

- Built into torchtext (for text classification)

- `torchtext.datasets.IMDB` provides the data iterators directly

- To split the data, `torchtext.data.functional.to_map_style_dataset` converts the iterable-style dataset into a map-style dataset ([reference: official docs](https://pytorch.org/text/stable/data_functional.html))

- Split into train, validation, and test sets at a 7.5:1.5:1 ratio
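
As a quick check of the split arithmetic, the sketch below reproduces the three subset sizes for 50,000 reviews. The helper name `split_sizes` is hypothetical and not part of the notebook; it only mirrors the rule that the test set takes whatever remains after the train and validation sets.

```python
def split_sizes(n_total, train_frac=0.75, val_frac=0.15):
    """Return (train, val, test) sizes; the test set takes the remainder."""
    n_train = int(n_total * train_frac)
    n_val = int(n_total * val_frac)
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test

sizes = split_sizes(50000)
print(sizes)
```

Assigning the remainder to the test set guarantees that the three sizes always add up to the dataset length, which `random_split` requires.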

In [4]:
# Data load
train_iter, test_iter = torchtext.datasets.IMDB(root='.data', split=('train', 'test'))

# Convert iterable-style dataset to map-style dataset.
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
total_dataset = train_dataset + test_dataset

# Data split
num_train = int(len(total_dataset) * 0.75)
num_val = int(len(total_dataset) * 0.15)
num_test = len(total_dataset) - num_train - num_val

train_data, val_data, test_data = random_split(total_dataset, [num_train, num_val, num_test])

In [5]:
print(f"{len(train_data)} training set")
print(f"{len(val_data)} validation set")
print(f"{len(test_data)} test set")

37500 training set
7500 validation set
5000 test set


In [6]:
## smaller version (for experiments)

train_iter = torchtext.datasets.IMDB(root='.data', split=('train'))
train_dataset = to_map_style_dataset(train_iter)
dataset = train_dataset[12450:12590]  # use only 140 reviews
train_data, val_data, test_data = random_split(dataset, [100, 20, 20])

In [7]:
print(f"{len(train_data)} training set")
print(f"{len(val_data)} validation set")
print(f"{len(test_data)} test set")

100 training set
20 validation set
20 test set


In [8]:
## print example data
ex_list = [train_data[1], train_data[3]]
ex_list

[('pos',
  'I just recently watched this 1954 movie starring Vincent Price for the first time on Turner Classic Movies. Price portrays Don Gallico, a magician/inventor who is driven to murder when his boss steals several of his magical inventions (and also his wife, portrayed in a brief role by the lovely Eva Gabor). Even though Price is a murderer, I actually found myself rooting for him, he is a sympathetic character who is driven mad by the greedy people around him who keep taking advantage of him.<br /><br />Although this movie doesn\'t have the "horror" factor of some of his more famous roles (such as my favorite, "House of Wax") it nonetheless has enough going for it to keep the viewers interest. <br /><br />This is a must for Vincent Price fans.'),
 ('neg',
  "Exceedingly complicated and drab. I'm a bright guy, but this was just too much for a tired brain. It would really benefit from a few early clues as to who these people are and what they are doing. Probably better for the U

## 2. Text preprocessing

### 2.1. Removing HTML tags, special characters, and digits

In [9]:
def clean_web_text(text):
    # remove HTML tags from the text
    text = text.replace("<br />", " ")
    text = text.replace("&quot;", '"')
    text = text.replace("<p>", " ")
    text = text.replace("<a href=", " ")
    text = text.replace("</a>", "")
    
    # drop everything except letters and whitespace (special characters, digits)
    text = text.replace("\\n", " ")
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.replace("   ", "")  # deletes triple spaces outright, so the neighboring words merge
    text = text.replace("  ", " ")
    text = text.lower()
    
    return text

In [10]:
for t in ex_list:
    print(clean_web_text(t[1]))

i just recently watched this movie starring vincent price for the first time on turner classic movies price portrays don gallico a magicianinventor who is driven to murder when his boss steals several of his magical inventions and also his wife portrayed in a brief role by the lovely eva gabor even though price is a murderer i actually found myself rooting for him he is a sympathetic character who is driven mad by the greedy people around him who keep taking advantage of him although this movie doesnt have the horror factor of some of his more famous roles such as my favorite house of wax it nonetheless has enough going for it to keep the viewers interestthis is a must for vincent price fans
exceedingly complicated and drab im a bright guy but this was just too much for a tired brain it would really benefit from a few early clues as to who these people are and what they are doing probably better for the us market gc himself hinted that this alone did not supply his oscar and you can se
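
In the first cleaned review above, "interest" and "this" are fused into "interestthis" because the cleaner deletes a run of three spaces instead of shrinking it. A minimal sketch of an alternative follows, assuming it is acceptable to collapse any run of whitespace into a single space. The name `clean_web_text_ws` is hypothetical, and the sketch handles only a subset of the tags covered by `clean_web_text`.

```python
import re

def clean_web_text_ws(text):
    """Hypothetical variant of clean_web_text that never merges adjacent words."""
    # turn the HTML line-break tag and the quote entity into plain text
    text = text.replace("<br />", " ").replace("&quot;", '"')
    # keep letters and whitespace only
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # collapse every run of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

sample = "keep the viewers interest. <br /><br />This is a must."
cleaned = clean_web_text_ws(sample)
print(cleaned)
```

Switching to such a cleaner would change the tokens and therefore the vocabulary built in the next section, so the recorded outputs in this notebook would no longer match exactly.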

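### 2.2. Defining the tokenizer and building the vocabulary

- The spaCy tokenizer must be downloaded beforehand

- The generator **`yield_tokens`** cleans and tokenizes each training text in turn, so the vocabulary can be built by iterating over it

    - A generator is a function that uses the `yield` keyword to produce items one at a time, which keeps memory usage low when iterating over large data ([reference](https://dojang.io/mod/page/view.php?id=2412))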
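
To make the generator-plus-vocabulary idea concrete, here is a framework-free sketch. It is not how `build_vocab_from_iterator` is implemented; the names `yield_tokens_demo` and `build_vocab_demo` are hypothetical, whitespace splitting stands in for the spaCy tokenizer, and the tie-breaking order is an assumption of this sketch.

```python
from collections import Counter

def yield_tokens_demo(texts):
    """Lazily yield one token list per text (whitespace split stands in for spaCy)."""
    for text in texts:
        yield text.lower().split()

def build_vocab_demo(token_iter, min_freq=1, specials=("<unk>", "<pad>")):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter()
    for tokens in token_iter:
        counts.update(tokens)
    # sort by frequency (descending), then alphabetically for a deterministic order
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}

texts = ["the movie was good", "the movie was bad", "the plot"]
stoi = build_vocab_demo(yield_tokens_demo(texts), min_freq=2)
print(stoi)

# tokens below min_freq fall back to the <unk> index
ids = [stoi.get(tok, stoi["<unk>"]) for tok in "the plot was good".split()]
print(ids)
```

Because the generator yields one token list at a time, the full tokenized corpus never has to sit in memory while the counts are accumulated.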


In [None]:
# download the spaCy tokenizer
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

In [11]:
# load the spaCy tokenizer once instead of on every call
spacy_tokenize = get_tokenizer('spacy', language="en_core_web_sm")

def tokenizer(text):
    clean_text = clean_web_text(text)
    return spacy_tokenize(clean_text)

In [12]:
def yield_tokens(data_iter):    
    for label, text in data_iter:
        yield tokenizer(text)

In [13]:
vocab = build_vocab_from_iterator(yield_tokens(train_data), min_freq=5, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

In [14]:
vocabs = vocab.get_stoi()
sorted(vocabs.items(), key = lambda x: x[1])

[('<unk>', 0),
 ('<pad>', 1),
 ('the', 2),
 ('and', 3),
 ('a', 4),
 ('of', 5),
 ('to', 6),
 ('is', 7),
 ('in', 8),
 ('i', 9),
 ('it', 10),
 ('this', 11),
 ('that', 12),
 ('was', 13),
 ('as', 14),
 ('but', 15),
 ('with', 16),
 ('film', 17),
 ('movie', 18),
 ('for', 19),
 ('on', 20),
 ('he', 21),
 ('you', 22),
 ('nt', 23),
 ('his', 24),
 ('are', 25),
 ('have', 26),
 ('not', 27),
 ('one', 28),
 ('who', 29),
 ('like', 30),
 ('be', 31),
 ('they', 32),
 ('an', 33),
 ('its', 34),
 ('about', 35),
 ('all', 36),
 ('at', 37),
 ('or', 38),
 ('if', 39),
 ('from', 40),
 ('good', 41),
 ('just', 42),
 ('so', 43),
 ('when', 44),
 ('out', 45),
 ('more', 46),
 ('by', 47),
 ('s', 48),
 ('see', 49),
 ('what', 50),
 ('has', 51),
 ('do', 52),
 ('my', 53),
 ('there', 54),
 ('up', 55),
 ('other', 56),
 ('would', 57),
 ('some', 58),
 ('very', 59),
 ('even', 60),
 ('only', 61),
 ('time', 62),
 ('two', 63),
 ('me', 64),
 ('could', 65),
 ('films', 66),
 ('first', 67),
 ('had', 68),
 ('really', 69),
 ('which', 70),

In [15]:
VOCAB_SIZE = len(vocab)
print(f"Vocabulary size: {VOCAB_SIZE}")

Vocabulary size: 595


### 2.3. Converting data to tensors (Vectorization)

In [16]:
# Text -> Tensor of token indices looked up in the vocabulary
def text2tensor(text):
    tokens = tokenizer(text)
    token_ids = vocab(tokens)
    return torch.tensor(token_ids, dtype=torch.int64)

In [17]:
# Encode label -> pos=1, neg=0
def label2num(label):
    n_label = 1 if label == 'pos' else 0
    return n_label

### 2.4. Defining the DataLoader

- `collate_fn`: collates lists of samples into batches ([reference](https://pytorch.org/docs/stable/data.html#dataloader-collate-fn))

    - `custom_collate_fn` takes raw text as input and performs vectorization and padding in sequence, so the DataLoader yields ready-to-use batches
    - It is passed as an argument to `torch.utils.data.DataLoader`
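
The padding step inside a collate function can be illustrated without any framework. The sketch below is only a stand-in for what `pad_sequence` does to a batch of variable-length id lists; `pad_batch` is a hypothetical helper, and the id values are made up for the example.

```python
PAD = 1  # same value as PAD_IDX in this notebook

def pad_batch(sequences, pad_value=PAD):
    """Right-pad every id list to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[9, 42, 401], [58, 137], [9]]
padded = pad_batch(batch)
print(padded)
```

Padding is done per batch, so `max_len` differs from batch to batch and depends on the longest review that happens to be drawn.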


In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [19]:
# custom collate_fn for the DataLoader

def custom_collate_fn(batch):
    label_batch = []
    text_batch = []
    
    for label, text in batch:
        label_batch.append(label2num(label))
        text_batch.append(text2tensor(text))
        
    label_batch = torch.tensor(label_batch, dtype=torch.int64)
    
    # padding; batch_first=True gives (batch_size x max_len)
    text_batch = pad_sequence(text_batch, batch_first=True, padding_value=PAD_IDX)
    
    return label_batch.to(device), text_batch.to(device)

In [20]:
# create the train and validation dataloaders
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=custom_collate_fn)
val_dataloader = DataLoader(val_data, batch_size=BATCH_SIZE, collate_fn=custom_collate_fn)

In [21]:
train_dataloader = DataLoader(train_data, batch_size = 16, collate_fn = custom_collate_fn)

In [22]:
## print an example batch
batch_ex = next(iter(train_dataloader))
print(batch_ex)

(tensor([1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1], device='cuda:0'), tensor([[ 76, 191, 290,  ..., 215,  11,  18],
        [  9,  42, 401,  ...,   1,   1,   1],
        [ 58, 137,   0,  ...,   1,   1,   1],
        ...,
        [  9, 152,  40,  ...,   1,   1,   1],
        [  9,  26,   6,  ...,   1,   1,   1],
        [  9,  13,  55,  ...,   1,   1,   1]], device='cuda:0'))


In [23]:
batch_ex[1].shape

torch.Size([16, 405])

## 3. Modeling (Transformer)

![transformer](transformer.png)

- Because the task is classification, the model consists of an Encoder block only

### 1. Positional Embedding

- positional encoding + token embedding
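
The sinusoidal positional encoding assigns sin(pos / 10000^(2i/d)) to even dimensions and cos(pos / 10000^(2i/d)) to odd dimensions, where d is the embedding size. The following framework-free sketch computes a small table by hand so the pattern is visible; the function name `positional_encoding` and the tiny sizes are chosen for illustration only.

```python
import math

def positional_encoding(maxlen, emb_size):
    """Return a maxlen x emb_size table of sinusoidal position encodings."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for two_i in range(0, emb_size, 2):
            angle = pos / (10000 ** (two_i / emb_size))
            table[pos][two_i] = math.sin(angle)          # even index
            if two_i + 1 < emb_size:
                table[pos][two_i + 1] = math.cos(angle)  # odd index
    return table

pe = positional_encoding(maxlen=4, emb_size=6)
print(pe[0])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions rotate each sin/cos pair at a frequency that decreases with the dimension index.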

In [24]:
class PositionalEmbedding(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, emb_size=512, pad_idx=PAD_IDX, device=device, maxlen=1000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, emb_size, padding_idx=pad_idx).to(device)
        
        pos = torch.arange(0,maxlen).reshape(maxlen,1).to(device)
        two_i = torch.arange(0,emb_size,2).to(device)
        
        # fixed (non-trainable) sinusoidal table of shape (maxlen, emb_size)
        pos_encoding = torch.zeros((maxlen, emb_size), device=device)
        pos_encoding[:,0::2] = torch.sin(pos/ (10000**(two_i/emb_size))) # even indices
        pos_encoding[:,1::2] = torch.cos(pos/ (10000**(two_i/emb_size))) # odd indices
        
        # a buffer follows the module across devices but is not a trainable parameter
        self.register_buffer("pos_encoding", pos_encoding)
        
    def forward(self, x):
        # x holds token ids: (batch_size, seq_len), or (seq_len,) for a single sequence
        sen_length = x.shape[-1]

        return self.token_embedding(x) + self.pos_encoding[:sen_length,:]

### 2. Encoder

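The input and output dimensions are the same before and after both MSA and FFN


- **Multi-head Self-Attention** (learns relations between tokens and context)

- Residual Connection
- Layer Norm
    
- **Position-wise FFN** (mixes information across heads)

- Residual Connection
- Layer Norm

**Multi-head Self-Attention**

- The embedding output by pos_embedding is copied as query, key, and value, and each copy passes through its own linear layer to give w_q, w_k, and w_v of the same dimension

- Each of these is then split into `num_head` heads; every head applies dot product, scaling, masking, and softmax to obtain the attention weights (batch_size, num_head, max_len, max_len), which are multiplied with the values to give attn_score (batch_size, num_head, max_len, d_head)

- Finally, the per-head attn_score tensors are concatenated to produce the multi-head attention output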
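
The per-head computation can be traced on a tiny example. The sketch below is a single-head, framework-free version written only to show the five steps and the effect of the padding mask; the function names and the toy vectors are assumptions for illustration and not the notebook's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, key_mask):
    """Single-head scaled dot-product attention.

    q, k, v: lists of token vectors (seq_len x d_head).
    key_mask: one bool per key position, False marks padding.
    """
    d_head = len(q[0])
    weights = []
    for q_vec in q:
        # 1-2. dot product with every key, scaled by sqrt(d_head)
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d_head)
                  for k_vec in k]
        # 3. padding keys get a large negative score, like masked_fill(mask == 0, -1e9)
        scores = [s if keep else -1e9 for s, keep in zip(scores, key_mask)]
        # 4. softmax so that each row of weights sums to 1
        weights.append(softmax(scores))
    # 5. weighted sum of the value vectors
    output = [[sum(w * v_vec[j] for w, v_vec in zip(row, v))
               for j in range(len(v[0]))] for row in weights]
    return output, weights

x = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # third token is padding
out, attn = attention(x, x, x, key_mask=[True, True, False])
print(attn[0])
```

After masking, no query places any weight on the padding key, while each row of weights still sums to 1 over the real tokens.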

In [25]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_head=8, emb_size=512):
        super().__init__()
        
        self.num_head = num_head
        self.emb_size = emb_size
        self.linear_query = nn.Linear(self.emb_size, self.emb_size).to(device)
        self.linear_key = nn.Linear(self.emb_size, self.emb_size).to(device)
        self.linear_value = nn.Linear(self.emb_size, self.emb_size).to(device)
        
    
    def attention(self, q, k, v, mask=None):
        batch_size, num_head, max_len, d_head = q.size()
        
        # 1. dot product (q*k^T)
        k_T = k.transpose(2,3)
        attn = torch.matmul(q, k_T) ## -> attn: (batch_size, num_head, max_len, max_len)
        
        # 2. scaling (divide w/ sqrt(d_k))
        attn = attn / math.sqrt(d_head)
        
        # 3. masking (padding positions get a large negative score)
        if mask is not None:
            attn = attn.masked_fill(mask == 0, -1e9)
        
        # 4. softmax
        attn = nn.Softmax(dim=-1)(attn) ## (max_len, max_len); dim=-1 so that each row sums to 1
        
        # 5. multiply w/ V
        attn_score = torch.matmul(attn, v) ## -> attn_score: (batch_size, num_head, max_len, d_head)
        
        return attn_score, attn
        
        
    def forward(self, out_pos_emb, mask=None):
    
        batch_size, max_len, emb_size = out_pos_emb.size()
        d_head = emb_size // self.num_head
        
        # query, key, value: each is multiplied by its own weight matrix to give w_q, w_k, w_v
        w_q = self.linear_query(out_pos_emb)
        w_k = self.linear_key(out_pos_emb)
        w_v = self.linear_value(out_pos_emb)
        
        # split to heads (torch.view, transpose) -> (batch_size, num_head, max_len, d_head)
        qs = w_q.view(batch_size, max_len, self.num_head, d_head).transpose(1,2)
        ks = w_k.view(batch_size, max_len, self.num_head, d_head).transpose(1,2)
        vs = w_v.view(batch_size, max_len, self.num_head, d_head).transpose(1,2)
        
        # multi-head-attention # attn_score: (batch_size, num_head, max_len, d_head)
        attn_score, attention = self.attention(qs, ks, vs, mask)
        
        # concat -> out_mha: (batch_size, max_len, emb_size)
        out_mha = attn_score.transpose(1,2).contiguous().view(batch_size, max_len, emb_size)
        
        return out_mha, attention
        

**Feed Forward Network**

- Linear Layer 1 -> ReLU -> Linear Layer2

In [26]:
class FeedForwardNetwork(nn.Module):
    def __init__(self, emb_size=512, d_hid=128):
        super().__init__()
        self.layer1 = nn.Linear(emb_size, d_hid)
        self.layer2 = nn.Linear(d_hid, emb_size)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

**Transformer(Encoder) Block for Classification**

In [27]:
class Transformer_for_Classification(nn.Module):
    def __init__(self, num_class=2, emb_size = 512, num_head = 8,
                 layer_norm_eps=1e-05, dropout_rate = 0.2, pad_idx = PAD_IDX, device=device):
        super().__init__()
        self.emb_size = emb_size
        self.num_head = num_head
        self.pad_idx = pad_idx
        
        self.pos_embedding = PositionalEmbedding(vocab_size=VOCAB_SIZE, emb_size=emb_size, pad_idx=self.pad_idx, device=device, maxlen=1000)
        self.mh_attention = MultiHeadAttention(num_head=num_head, emb_size=emb_size)
        self.ffn = FeedForwardNetwork(emb_size=emb_size, d_hid=FFN_HID_DIM)
        
        self.classifier = nn.Linear(emb_size, num_class)
        
        self.layer_norm = nn.LayerNorm(self.emb_size, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout_rate)
        
        
    
    def forward(self, data_batch):
        
        # 0. build the padding mask
        batch_size, max_len = data_batch.size()
        batch_mask = (data_batch != self.pad_idx).view(batch_size, 1, 1, max_len) #boolean mask: False -> 0
        
        # 1. Positional Embedding
        out_pos_emb = self.pos_embedding(data_batch)
        
        
        # 2. Multi-Head Attention
        x, attn = self.mh_attention(out_pos_emb, batch_mask)
        
        ### Dropout
        x = self.dropout(x)
        ### Residual Connection
        x = x + out_pos_emb
        ### Layer Normalization
        out_mha = self.layer_norm(x)
        
        
        # 3. Feed Forward Network
        x = self.ffn(out_mha)
        
        ### Dropout
        x = self.dropout(x)
        ### Residual Connection
        x = x + out_mha
        ### Layer Normalization
        x = self.layer_norm(x) # -> (batch_size, max_len, emb_size)
        
        
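        # 4. Classifier
        max_len = x.size(1)
        # average over the token axis (padding positions included); with batch_size 1, squeeze() also drops the batch axis
        x = nn.AvgPool2d((max_len,1))(x).squeeze() # -> (batch_size, emb_size)
        # same average: x.mean(dim=1)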
        
        x = self.classifier(x)
        
        return x
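
The pooling step in the classifier averages over every position, so padding tokens contribute to the sentence vector. One possible alternative, sketched here under the assumption that padding should be excluded, is a masked mean. The helper `masked_mean` is hypothetical and is not used anywhere in this notebook.

```python
import torch

def masked_mean(x, token_ids, pad_idx=1):
    """Average token vectors over the sequence axis, skipping padding positions.

    x: (batch_size, max_len, emb_size), token_ids: (batch_size, max_len).
    """
    keep = (token_ids != pad_idx).unsqueeze(-1).to(x.dtype)  # (batch_size, max_len, 1)
    summed = (x * keep).sum(dim=1)                           # (batch_size, emb_size)
    counts = keep.sum(dim=1).clamp(min=1.0)                  # avoid division by zero
    return summed / counts

token_ids = torch.tensor([[5, 6, 1, 1], [7, 8, 9, 4]])
x = torch.ones(2, 4, 3)
x[0, 2:] = 100.0  # values at padding positions should not affect the result
pooled = masked_mean(x, token_ids)
print(pooled.shape)
```

Unlike `squeeze()`, this keeps the batch axis even for a single review, so the output shape stays (batch_size, emb_size).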

## 4. Training, Evaluation

In [28]:
torch.manual_seed(0) # fix the random seed

transformer = Transformer_for_Classification(num_class=2, emb_size = 512, num_head = 8,
                 layer_norm_eps=1e-05, dropout_rate = 0.2, pad_idx = PAD_IDX, device=device)
transformer = transformer.to(device)


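# CE Loss
# Labels are 0 (neg) / 1 (pos). ignore_index must not be set to PAD_IDX here:
# PAD_IDX == 1, so it would exclude every 'pos' sample from the loss
# (the losses logged in the training output were recorded with that setting).
criterion = torch.nn.CrossEntropyLoss()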

# Optimizer : SGD
optimizer = torch.optim.SGD(transformer.parameters(), lr=LEARNING_RATE)


In [29]:
def train_epoch(model, optimizer, dataloader):
    model.train()
    losses = 0
    
    for label_batch, text_batch in dataloader:
        
        optimizer.zero_grad()
        
        logits = model(text_batch)
        loss = criterion(logits, label_batch)
        loss.backward()
        
        optimizer.step()
        
        losses += loss.item()
        
    return losses / len(dataloader)

def evaluate(model, dataloader):
    model.eval()
    losses = 0
    
    with torch.no_grad():
        for label_batch, text_batch in dataloader:
            logits = model(text_batch)
        
            loss = criterion(logits, label_batch)
            losses += loss.item()
        
    return losses / len(dataloader)    
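
The loop above tracks only the loss. For a two-class task it can also help to look at accuracy, and a minimal sketch of a per-batch accuracy helper follows. The name `batch_accuracy` and the toy logits are assumptions for illustration; the helper is not part of the original notebook.

```python
import torch

def batch_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [0.3, 0.2], [0.0, 3.0]])
labels = torch.tensor([0, 1, 1, 1])
acc = batch_accuracy(logits, labels)
print(acc)
```

Inside `evaluate`, such a helper could be called on each `(logits, label_batch)` pair and averaged over batches in the same way as the loss.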

In [30]:
# Start training! (5 epochs)
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer, train_dataloader)
    end_time = timer()
    val_loss = evaluate(transformer, val_dataloader)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


Epoch: 1, Train loss: 0.126, Val loss: 0.018, Epoch time = 44.206s
Epoch: 2, Train loss: 0.021, Val loss: 0.009, Epoch time = 43.584s
Epoch: 3, Train loss: 0.013, Val loss: 0.006, Epoch time = 43.304s
Epoch: 4, Train loss: 0.010, Val loss: 0.005, Epoch time = 42.804s
Epoch: 5, Train loss: 0.008, Val loss: 0.004, Epoch time = 42.916s


In [31]:
# save model.pth

path = "model/"
if not os.path.exists(path):
    os.mkdir(path)
torch.save(transformer, path+'transformer_cls.pth')

## 5. Test (Predict)

In [32]:
### load saved model

model = torch.load(path+'transformer_cls.pth')

In [33]:
### define function for prediction(Test)

def predict(model, text_num, dataset):
    model.eval()  # disable dropout for inference
    label, raw_text = dataset[text_num]
    with torch.no_grad():
        text = text2tensor(raw_text).unsqueeze(0).to(device)
        output = model(text)
        # argmax over the class axis
        pred_label = 'pos' if output.argmax(-1).item() == 1 else 'neg'

    print( f'[Label] {label} \n[Pred_Label] {pred_label} \n [Text] {raw_text}')

**Example of an incorrect prediction**

In [37]:
predict(model, 2, test_data)

[Label] pos 
[Pred_Label] neg 
 [Text] `Europa' (or, as it is also known, `Zentropa') is one of the most visually stunning films I have ever seen. The blend of grayscale and colour photography is near seamless... a true feast for the eyes. The picture was a contender for a 1991's Golden Palm in Canners. The award went to Barton Fink (by Coen brothers); a film stylistically very similar to Zentropa. Here's an exercise in class: rent both films and be a judge for yourself.


**Example of a correct prediction**

In [36]:
predict(model, 1, test_data)

[Label] neg 
[Pred_Label] neg 
 [Text] Sondra Locke stinks in this film, but then she was an awful 'actress' anyway. Unfortunately, she drags everyone else (including then =real life boyfriend Clint Eastwood down the drain with her. But what was Clint Eastwood thinking when he agreed to star in this one? One read of the script should have told him that this one was going to be a real snorer. It's an exceptionally weak story, basically no story or plot at all. Add in bored, poor acting, even from the normally good Eastwood. There's absolutely no action except a couple arguments and as far as I was concerned, this film ranks up at the top of the heap of natural sleep enhancers. Wow! Could a film BE any more boring? I think watching paint dry or the grass grow might be more fun. A real stinker. Don't bother with this one.
