해당 페이지는 Zichao Yang1, Diyi Yang1, Chris Dyer1, Xiaodong He2, Alex Smola1, Eduard Hovy1 (2016), "Hierarchical Attention Networks for Document Classification" 논문에 관한 구현입니다.
http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

__________________________________

references : 
- https://github.com/pandeykartikey/Hierarchical-Attention-Network/blob/master/HAN%20yelp.ipynb
- https://github.com/vietnguyen91/Hierarchical-attention-networks-pytorch/blob/master/src/utils.py
- https://github.com/EdGENetworks/attention-networks-for-classification/blob/master/attention_model_validation_experiments.ipynb

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils import data

from nltk.tokenize import sent_tokenize,word_tokenize
from sklearn.model_selection import train_test_split

import string
import random
import re
import os
import time
import json
import pandas as pd
from collections import Counter,defaultdict
from imblearn.over_sampling import *
from bs4 import BeautifulSoup
import itertools

SEED = 1

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Loading

In [2]:
df = pd.read_csv('yelp.csv')
# https://www.kaggle.com/yelp-dataset/yelp-dataset#yelp_academic_dataset_review.json

# Preprocessing

In [3]:
def clean_str(string, max_seq_len):
    """
    adapted from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = BeautifulSoup(string, "lxml").text
    string = re.sub(r"[^A-Za-z0-9(),!?\"\`]", " ", string)
    string = re.sub(r"\"s", " \"s", string)
    string = re.sub(r"\"ve", " \"ve", string)
    string = re.sub(r"n\"t", " n\"t", string)
    string = re.sub(r"\"re", " \"re", string)
    string = re.sub(r"\"d", " \"d", string)
    string = re.sub(r"\"ll", " \"ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    s =string.strip().lower().split(" ")
    if len(s) > max_seq_len:
        return s[0:max_seq_len] 
    return s

In [4]:
X = df['text'].tolist()
Y = (df['stars'] - 1).tolist()
X_train, X_test, y_train, y_test = \
train_test_split(X,Y, test_size=0.33, random_state=123)

In [5]:
## creates a 3D list of format paragraph[sentence[word]]

def create3DList(data, max_sent_len,max_seq_len):
    x=[]
    for docs in data:
        x1=[]
        idx = 0
        for seq in sent_tokenize(docs) :
            x1.append(clean_str(seq,max_sent_len))
            if(idx>=max_seq_len-1):
                break
            idx= idx+1
        x.append(x1)
    return x

## Fix the maximum length of sentences in a paragraph and words in a sentence
max_sent_len = 12; max_seq_len = 25

## divides review in sentences and sentences into word creating a 3DList
x_train = create3DList(X_train, max_sent_len,max_seq_len)
x_test = create3DList(X_test, max_sent_len,max_seq_len)

print("x_train: {}".format(len(x_train)))
print("x_test: {}".format(len(x_test)))

  ' Beautiful Soup.' % markup)

3." looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup


x_train: 6700
x_test: 3300


  ' that document to Beautiful Soup.' % decoded_markup


In [6]:
word_to_idx_dict = {'<unk>':0,'<pad>':1}

for idx,doc in enumerate(x_train) : 
    if idx % 1000 == 0 : print("{}번째 문서 처리 중이며 word_to_idx_dict의 길이는 {}입니다."\
                               .format(idx,len(word_to_idx_dict)))
    for sent in doc : 
        for word in sent : 
            if word not in word_to_idx_dict.keys() :                 
                word_to_idx_dict[word] = len(word_to_idx_dict)

0번째 문서 처리 중이며 word_to_idx_dict의 길이는 2입니다.
1000번째 문서 처리 중이며 word_to_idx_dict의 길이는 7499입니다.
2000번째 문서 처리 중이며 word_to_idx_dict의 길이는 10698입니다.
3000번째 문서 처리 중이며 word_to_idx_dict의 길이는 13034입니다.
4000번째 문서 처리 중이며 word_to_idx_dict의 길이는 15058입니다.
5000번째 문서 처리 중이며 word_to_idx_dict의 길이는 16777입니다.
6000번째 문서 처리 중이며 word_to_idx_dict의 길이는 18434입니다.


In [7]:
word_to_freq_dict = defaultdict(int)

for idx,doc in enumerate(x_train) : 
    if idx % 1000 == 0 : print("{}번째 문서 처리 중이며 word_to_freq_dict의 길이는 {}입니다."\
                               .format(idx,len(word_to_freq_dict)))
    for sent in doc : 
        for word in sent : 
            word_to_freq_dict[word] += 1

0번째 문서 처리 중이며 word_to_freq_dict의 길이는 0입니다.
1000번째 문서 처리 중이며 word_to_freq_dict의 길이는 7497입니다.
2000번째 문서 처리 중이며 word_to_freq_dict의 길이는 10696입니다.
3000번째 문서 처리 중이며 word_to_freq_dict의 길이는 13032입니다.
4000번째 문서 처리 중이며 word_to_freq_dict의 길이는 15056입니다.
5000번째 문서 처리 중이며 word_to_freq_dict의 길이는 16775입니다.
6000번째 문서 처리 중이며 word_to_freq_dict의 길이는 18432입니다.


In [8]:
def word_to_idx(doc,min_freq=5) : 
    """
    doc : train or validation or test datasets which are composed with list within list
    """

    min_freq_ls = [[word for word in sent if word_to_freq_dict[word] > min_freq] for sent in doc]
    idx_dict = \
    [[word_to_idx_dict[word] if word in word_to_idx_dict.keys() else 0 for word in sent]\
     for sent in min_freq_ls]
    return idx_dict

In [9]:
train_X = [word_to_idx(batch) for batch in x_train]
test_X = [word_to_idx(batch) for batch in x_test]

- max_sent_len : 한 문서가 가지는 최대 문장 갯수이자, 최소 문장 갯수입니다.(패딩 적용)
- max_seq_len : 한 문장이 가지는 최대 단어 갯수이자, 최소 단어 갯수입니다. (패딩 적용)

In [10]:
## Padding the number of sentence
train_X = [doc + [[0]]*(max_sent_len - len(doc)) if len(doc) <= max_sent_len else doc[:max_sent_len]\
           for doc in train_X]
test_X = [doc + [[0]]*(max_sent_len - len(doc)) if len(doc) <= max_sent_len else doc[:max_sent_len]\
           for doc in test_X]

## Padding the number of word
train_X = [[sent + [0] * (max_seq_len - len(sent)) for sent in doc] for doc in train_X]
test_X = [[sent + [0] * (max_seq_len - len(sent)) for sent in doc] for doc in test_X]

## Make Datasets with iterators 

In [11]:
print("한 문장 안에 있는 단어의 길이 : ",set([len(sent) for doc in train_X for sent in doc]))
print("한 문서 안에 있는 문장의 길이 : ",set([len(doc) for doc in train_X]))

한 문장 안에 있는 단어의 길이 :  {25}
한 문서 안에 있는 문장의 길이 :  {12}


In [12]:
# 뒤에 나오는 DataLoader는 Cuda Tensor를 지원하지 않습니다.
train_X = torch.LongTensor(train_X)
train_y = torch.LongTensor(y_train)
test_X = torch.LongTensor(test_X)
test_y = torch.LongTensor(y_test)

In [13]:
train_X.shape, train_y.shape, test_X.shape, test_y.shape

(torch.Size([6700, 12, 25]),
 torch.Size([6700]),
 torch.Size([3300, 12, 25]),
 torch.Size([3300]))

In [14]:
class Dataset(data.Dataset):
    def __init__(self, X, y):
        'Initialization'
        self.y = y
        self.X = X

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.X)

    def __getitem__(self, index):
        # Load data and get label
        'Generates one sample of data'
        # Select sample
        X = self.X[index]
        y = self.y[index]

        return X, y


In [15]:
# Parameters
params = {'batch_size': 50,
          'shuffle': True,
          'num_workers': 6}

# Generators
training_set = Dataset(train_X,train_y)
train_iter = data.DataLoader(training_set, **params)

testing_set = Dataset(test_X,test_y)
test_iter = data.DataLoader(testing_set, **params)

In [16]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for local_batch, local_labels in train_iter:
    local_batch, local_labels = local_batch.to(device), local_labels.to(device)
    break

In [17]:
local_batch.size(),local_labels.size()
#[batch_size, sent_len, word_len]
# 25개의 단어를 가지고 12개의 문장을 가진 64개의 문서가 있는 것입니다.

(torch.Size([50, 12, 25]), torch.Size([50]))

# Modeling

In [18]:
class WordAttention(nn.Module) : 
    
    def __init__(self,batch_size,hidden_size) : 
        
        super(WordAttention,self).__init__() 
        self.batch_size = batch_size
        self.linear = nn.Linear(hidden_size*2,hidden_size*2).to(device)
        self.word_proj_params = nn.Parameter(torch.Tensor(hidden_size*2,1)).to(device)
        
    def forward(self,outputs) : 
        
        outputs = outputs.permute(1,0,2) #[batch_size, sent_len, hidden_dim*2]

        u = torch.tanh(self.linear(outputs)) #[batch_size, sent_len, hidden_dim*2]        
        word_proj_params = self.word_proj_params.expand(self.batch_size,-1,-1) #[batch_size,hidden_dim*2,1]
        atten = torch.bmm(u,word_proj_params) #[batch_size,sent_len,1]
        a = torch.softmax(atten,dim=1) #[batch_size,sent_len,1]
        s = torch.sum(torch.mul(a,outputs),dim=1) #[batch_size,hidden_dim*2]
        
        return s,a

In [93]:
hidden = Variable(torch.randn(2,50,100).cuda())
outputs = WordRNN(50,len(word_to_idx_dict),128,100,1,12)(local_batch,hidden)

In [106]:
WordAttention(50,100)(outputs)[1].squeeze()[0]

tensor([0.0374, 0.0321, 0.0264, 0.0292, 0.0293, 0.0334, 0.0330, 0.0370, 0.0439,
        0.0442, 0.0438, 0.0434, 0.0432, 0.0431, 0.0430, 0.0430, 0.0430, 0.0430,
        0.0431, 0.0433, 0.0435, 0.0439, 0.0444, 0.0450, 0.0453],
       device='cuda:0', grad_fn=<SelectBackward>)

In [107]:
class WordRNN(nn.Module) : 
    
    def __init__(self,batch_size,vocab_size,embed_size,hidden_size,num_layer,max_sent_len) : 
        
        super(WordRNN,self).__init__()
        self.batch_size = batch_size
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.gru_hidden_size = hidden_size
        self.num_layer = num_layer
        self.max_sent_len = max_sent_len
        
        self.embeddings = nn.Embedding(vocab_size,embed_size,padding_idx = 1).to(device)
        self.gru = nn.GRU(embed_size,hidden_size,num_layer,bidirectional=True).to(device)
        
        self.word_atten = WordAttention(batch_size,hidden_size).to(device)
    def forward(self,input_,hidden) : 
        
        sent_vec_ls = []; word_attention_ls = []
        
        for i in range(self.max_sent_len) : 
            x = input_[:,i,:]  # x : [batch_size, T :(word length per sentence)]
            embeds = self.embeddings(x).permute(1,0,2) # [T, batch_size, embed_dim] 
            outputs, hidden = self.gru(embeds,hidden)
            sent_vec,word_attention = self.word_atten(outputs)
        
            sent_vec_ls.append(sent_vec.unsqueeze(1))
            word_attention_ls.append(word_attention.permute(0,2,1))
        
        sent_vec = torch.cat(sent_vec_ls,dim=1)
        word_attention = torch.cat(word_attention_ls,dim=1)
                
        return sent_vec,word_attention,hidden
    # [batch_size,sent_len,hidden_size]
    # [batch_size,sent_len,word_len]
    # [num_layer*bidirectional(2), batch_size, hidden_size]

In [20]:
class SentAttention(nn.Module) : 
    
    def __init__(self,batch_size,hidden_size) : 
        
        super(SentAttention,self).__init__() 
        self.batch_size = batch_size
        self.linear = nn.Linear(hidden_size*2,hidden_size*2).to(device)
        self.sent_proj_params = nn.Parameter(torch.Tensor(hidden_size*2,1)).to(device)
    
    def forward(self,outputs) : 
        
        outputs = outputs.permute(1,0,2) #[batch_size, doc_len, hidden_dim*2]
        u = torch.tanh(self.linear(outputs)) #[batch_size, doc_len, hidden_dim*2]
        sent_proj_params = self.sent_proj_params.expand(self.batch_size,-1,-1) #[batch_size,hidden_dim*2,1]
        atten = torch.bmm(u,sent_proj_params) #[batch_size,doc_len,1]
        a = torch.softmax(atten,dim=1) #[batch_size,doc_len,1]
        v = torch.sum(a * outputs,dim=1) #[batch_size,hidden_dim*2]
        return v,a

In [21]:
class SentRNN(nn.Module) : 
    
    def __init__(self,batch_size,vocab_size,embed_size,hidden_size,num_layer) : 
        
        super(SentRNN,self).__init__()
        self.batch_size = batch_size
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.gru_hidden_size = hidden_size
        self.num_layer = num_layer
        
        self.gru = nn.GRU(hidden_size*2,hidden_size,num_layer,bidirectional=True).to(device)
        
        self.word_atten = SentAttention(batch_size,hidden_size)
    def forward(self,x,hidden) : 
        
        x = x.permute(1,0,2) #x : [doc_len,batch_size, hidden*2]

        outputs, hidden = self.gru(x,hidden)
    
        doc_vec,sent_attention = self.word_atten(outputs)
        
        return doc_vec,sent_attention,hidden
    
    #[batch_size,hidden_dim*2]
    #[batch_size,doc_len,1]
    #[num_layer*2,batch_size,hidden_dim]

In [22]:
class HAN(nn.Module) : 
    
    def __init__(self,batch_size,vocab_size,embed_size,hidden_size,num_layer,max_sent_len,num_class) : 
        
        super(HAN,self).__init__()
        self.batch_size = batch_size
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layer
        self.max_sent_len = max_sent_len
        self.num_class = num_class
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
        self.word_encoder =\
        WordRNN(batch_size,vocab_size,embed_size,hidden_size,num_layer,max_sent_len).to(self.device)
        
        self.sent_encoder =\
        SentRNN(batch_size,vocab_size,embed_size,hidden_size,num_layer).to(self.device)
        
        self.proj_layer = nn.Linear(hidden_size*2,num_class).to(self.device)
        
    def init_hidden(self,batch_size):
        hidden = \
        Variable(torch.randn(self.num_layers*2, batch_size, self.hidden_size, device=self.device))
            
        return hidden
    
    def forward(self,input_) : 
        
        (batch_size,sent_len,doc_len) = input_.size()
        
        word_encoder_hidden = self.init_hidden(batch_size)
        sent_vec,word_attention,hidden = self.word_encoder(input_,word_encoder_hidden)
        
        sent_encoder_hidden = self.init_hidden(batch_size)
        doc_vec,sent_attention,hidden = self.sent_encoder(sent_vec,sent_encoder_hidden)
        
        logit = self.proj_layer(doc_vec)
        log_softmax = torch.log_softmax(logit,dim=1)
        
        return log_softmax, word_attention, sent_attention

In [30]:
params = {'batch_size' : 50,
'vocab_size' : len(word_to_idx_dict),
'embed_size' : 128,
'hidden_size' : 100,
'num_layer' : 1,
'max_sent_len' : 12,       
'num_class' : 5
}

model = HAN(**params).to(device)
model

HAN(
  (word_encoder): WordRNN(
    (embeddings): Embedding(19371, 128, padding_idx=1)
    (gru): GRU(128, 100, bidirectional=True)
    (word_atten): WordAttention(
      (linear): Linear(in_features=200, out_features=200, bias=True)
    )
  )
  (sent_encoder): SentRNN(
    (gru): GRU(200, 100, bidirectional=True)
    (word_atten): SentAttention(
      (linear): Linear(in_features=200, out_features=200, bias=True)
    )
  )
  (proj_layer): Linear(in_features=200, out_features=5, bias=True)
)

In [31]:
log_softmax, word_attention, sent_attention = model(local_batch)

In [32]:
log_softmax.size(), word_attention.size(), sent_attention.size()

(torch.Size([50, 5]), torch.Size([50, 12, 25]), torch.Size([50, 12, 1]))

# Trainig and Testing

In [34]:
def adjust_learning_rate(optimizer, epoch, init_lr=0.1, decay = 0.1 ,per_epoch=10):
    """Decay learning rate by a factor of 0.1 every lr_decay_epoch epochs."""
    for param_group in optimizer.param_groups:
        param_group['lr'] *= 1/(1 + decay)

    return optimizer , float(param_group['lr'])

In [35]:
def train(model,train_loader , test_loader , epochs = 10, lr = 0.01, batch_size = 50) :
    
    optimizer = torch.optim.Adam(model.parameters(),lr)
    criterion = nn.NLLLoss().to(device)

    for epoch in range(1,epochs+1) :
        optimizer , lr_int = \
        adjust_learning_rate(optimizer, epoch, init_lr=lr, decay = 0.1 ,per_epoch=10)
        model.train()        
        n_correct = 0
        
        for local_batch, local_labels in train_loader:
            
            local_batch,local_labels = local_batch.to(device),local_labels.to(device)
        
            train_softmax, word_attention, sent_attention = model(local_batch)
            train_predict = train_softmax.argmax(dim=1)
            
            n_correct += (train_predict == local_labels).sum().item()            
            loss = criterion(train_softmax,local_labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
        acc = n_correct / (len(train_loader) * batch_size)  
        print('Train epoch : %s,  loss : %s,  accuracy :%.3f, learning rate :%.3f'%(epoch, loss.item(), acc,lr_int))
        print('=================================================================================================')
        
        if (epoch) % 2 == 0:
            model.eval()
            n_correct = 0  # accuracy 계산을 위해 맞은 갯수 카운트
            val_loss = 0

            for local_batch, local_labels in test_loader:
                local_batch,local_labels = local_batch.to(device),local_labels.to(device)
                
                test_softmax, word_attention, sent_attention = model(local_batch)
                test_predict = test_softmax.argmax(dim = 1)

                val_loss = criterion(test_softmax, local_labels)
                
                n_correct += (test_predict == local_labels).sum().item() #맞은 갯수                

            val_acc = n_correct / (len(test_loader) * batch_size)

            print('*************************************************************************************************')
            print('*************************************************************************************************')
            print('Val Epoch : %s, Val Loss : %.03f , Val Accuracy : %.03f'%(epoch, val_loss, val_acc))
            print('*************************************************************************************************')
            print('*************************************************************************************************')



In [36]:
train(model, train_iter, test_iter, epochs=30)

Train epoch : 1,  loss : 1.303987741470337,  accuracy :0.369, learning rate :0.009
Train epoch : 2,  loss : 1.100485920906067,  accuracy :0.479, learning rate :0.008
*************************************************************************************************
*************************************************************************************************
Val Epoch : 2, Val Loss : 0.987 , Val Accuracy : 0.488
*************************************************************************************************
*************************************************************************************************
Train epoch : 3,  loss : 1.028915286064148,  accuracy :0.603, learning rate :0.008
Train epoch : 4,  loss : 0.8593549132347107,  accuracy :0.701, learning rate :0.007
*************************************************************************************************
*************************************************************************************************
Val Epoch : 4, Val Loss 

Train epoch : 22,  loss : 0.0011363792000338435,  accuracy :1.000, learning rate :0.001
*************************************************************************************************
*************************************************************************************************
Val Epoch : 22, Val Loss : 4.237 , Val Accuracy : 0.470
*************************************************************************************************
*************************************************************************************************
Train epoch : 23,  loss : 0.0007162761758081615,  accuracy :1.000, learning rate :0.001
Train epoch : 24,  loss : 0.0007084274548105896,  accuracy :1.000, learning rate :0.001
*************************************************************************************************
*************************************************************************************************
Val Epoch : 24, Val Loss : 3.103 , Val Accuracy : 0.472
************************************

In [77]:
batch = next(iter(test_iter))

In [78]:
predict = model(batch[0].cuda())[0].argmax(1)
correct_answer = batch[1].cuda()

In [79]:
sum(predict == correct_answer).item() / 50#batch_size

0.48

In [80]:
model(batch[0].cuda())[1][0][0]

tensor([0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400,
        0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400,
        0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400, 0.0400],
       device='cuda:0', grad_fn=<SelectBackward>)

어탠션의 분포가 uniform하게 형성되어 있다. 의도는 이것이 아니였기 때문에, 굉장히 당황스럽지만 데이터의 갯수가 적기 때문에(training : 6700) 이런 결과가 나왔다고 예상해본다.