<a href="https://colab.research.google.com/github/hjkim909/BERT_Practice/blob/master/BERT_for_Korean_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Practice applying BERT to Korean text classification, following this [blog post](https://zzaebok.github.io/deep_learning/nlp/Bert-for-classification/).

In [1]:
!pip install pytorch-transformers


Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)

In [2]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_transformers import BertTokenizer, BertForSequenceClassification, BertConfig
from torch.optim import Adam
import torch.nn.functional as F

In [3]:
# Load the data (Naver Sentiment Movie Corpus)
!git clone https://github.com/e9t/nsmc.git

Cloning into 'nsmc'...
remote: Enumerating objects: 14763, done.
remote: Total 14763 (delta 0), reused 0 (delta 0), pack-reused 14763
Receiving objects: 100% (14763/14763), 56.19 MiB | 7.41 MiB/s, done.
Resolving deltas: 100% (1749/1749), done.
Checking out files: 100% (14737/14737), done.


In [4]:
train_df = pd.read_csv('./nsmc/ratings_train.txt', sep='\t')
test_df = pd.read_csv('./nsmc/ratings_test.txt', sep='\t')

In [5]:
train_df.shape, test_df.shape

((150000, 3), (50000, 3))

In [6]:
# Drop rows with missing values, then keep a 30% sample to speed up training
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

train_df = train_df.sample(frac=0.3, random_state=999)
test_df = test_df.sample(frac=0.3, random_state=999)

In [7]:
# Column 1 (document) is the review text, column 2 is the label
train_df.head()

Unnamed: 0,id,document,label
103096,1269448,이 영화(제작과정포함)를 접한 후 결론 → 샤론스톤은 쓰레기다.,0
4038,9867479,넘 멋져열. 몇번씩 보게되는~~,1
135345,9924513,호러 공포영화 정말 좋아해서 거의 다 봤어요. 최근 본 공포물중 최고입니다. 쫄깃쫄...,1
119472,7074014,웃으면서 잘 봤네요,1
64192,8359338,어음...이거 다 보고 딱 생각난대사가 ...뭐지...? 왜 이렇게 갑자기 끝나? ...,0


In [8]:
# Custom Dataset that returns (text, label) pairs
class NsmcDataset(Dataset):
    ''' Naver Sentiment Movie Corpus Dataset '''
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = self.df.iloc[idx, 1]
        label = self.df.iloc[idx, 2]
        return text, label

In [9]:
nsmc_train_dataset = NsmcDataset(train_df)
train_loader = DataLoader(nsmc_train_dataset, batch_size=2, shuffle=True, num_workers=2)
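The training loop further down unpacks each batch as `text, label`, which works because the loader groups the per-sample `(text, label)` pairs into one collection of texts and one collection of labels. The plain-Python sketch below only mimics that grouping for illustration; `collate_pairs` and `iter_batches` are hypothetical helpers, not part of PyTorch, and the real `DataLoader` additionally converts the labels into a tensor and can shuffle the samples.

```python
def collate_pairs(batch):
    """Turn [(text, label), ...] into (texts, labels), one entry per sample."""
    texts, labels = zip(*batch)
    return list(texts), list(labels)


def iter_batches(samples, batch_size):
    """Yield consecutive collated batches from an indexable dataset (no shuffling)."""
    for start in range(0, len(samples), batch_size):
        stop = min(start + batch_size, len(samples))
        yield collate_pairs([samples[i] for i in range(start, stop)])


# Three rows taken from the train_df.head() output above.
toy_samples = [
    ("웃으면서 잘 봤네요", 1),
    ("넘 멋져열. 몇번씩 보게되는~~", 1),
    ("이 영화(제작과정포함)를 접한 후 결론 → 샤론스톤은 쓰레기다.", 0),
]

batches = list(iter_batches(toy_samples, batch_size=2))
for texts, labels in batches:
    print(len(texts), labels)
```

With `batch_size=2` and three samples, the last batch holds a single sample, which is also how a loader behaves when the dataset size is not a multiple of the batch size.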

In [None]:
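# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
# bert-base-multilingual-cased also covers Korean
# It tokenizes text with WordPiece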
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')
model.to(device)
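The comments above mention that the tokenizer uses WordPiece. As a rough illustration of the idea, the sketch below splits a single word with greedy longest-match-first lookup, marking continuation pieces with a `##` prefix. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and written for this example only; the real `BertTokenizer` also performs basic text cleanup and uses the full multilingual vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real multilingual BERT vocabulary.
toy_vocab = {"영화", "##가", "재미", "##있", "##다"}

print(wordpiece_tokenize("영화가", toy_vocab))
print(wordpiece_tokenize("재미있다", toy_vocab))
print(wordpiece_tokenize("최고", toy_vocab))
```

A word that cannot be covered by vocabulary pieces collapses to a single unknown token, which is why vocabulary coverage of Korean matters for a multilingual model.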

In [17]:
optimizer = Adam(model.parameters(), lr=1e-6)

itr = 1
p_itr = 1000
epochs = 1
total_loss = 0
total_len = 0
total_correct = 0

model.train()
for epoch in range(epochs):
    
    for text, label in train_loader:
        optimizer.zero_grad()
        
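        # Encode each review and zero-pad it to BERT's maximum length of 512 tokens.
        # Note: no attention mask is passed, so padding positions are attended to as well.
        encoded_list = [tokenizer.encode(t, add_special_tokens=True) for t in text]
        padded_list = [e + [0] * (512 - len(e)) for e in encoded_list]

        sample = torch.tensor(padded_list).to(device)
        labels = label.to(device)  # label is already a tensor, so it is not re-wrapped
        outputs = model(sample, labels=labels)
        loss, logits = outputs

        # softmax is monotonic, so argmax over the raw logits gives the same prediction
        pred = torch.argmax(logits, dim=1)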
        correct = pred.eq(labels)
        total_correct += correct.sum().item()
        total_len += len(labels)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        
        if itr % p_itr == 0:
            print('[Epoch {}/{}] Iteration {} -> Train Loss: {:.4f}, Accuracy: {:.3f}'.format(epoch+1, epochs, itr, total_loss/p_itr, total_correct/total_len))
            total_loss = 0
            total_len = 0
            total_correct = 0

        itr+=1



[Epoch 1/1] Iteration 1000 -> Train Loss: 0.4443, Accuracy: 0.802
[Epoch 1/1] Iteration 2000 -> Train Loss: 0.4299, Accuracy: 0.800
[Epoch 1/1] Iteration 3000 -> Train Loss: 0.4361, Accuracy: 0.788
[Epoch 1/1] Iteration 4000 -> Train Loss: 0.4247, Accuracy: 0.798
[Epoch 1/1] Iteration 5000 -> Train Loss: 0.4483, Accuracy: 0.791
[Epoch 1/1] Iteration 6000 -> Train Loss: 0.4316, Accuracy: 0.798
[Epoch 1/1] Iteration 7000 -> Train Loss: 0.4105, Accuracy: 0.811
[Epoch 1/1] Iteration 8000 -> Train Loss: 0.4111, Accuracy: 0.817
[Epoch 1/1] Iteration 9000 -> Train Loss: 0.4353, Accuracy: 0.802
[Epoch 1/1] Iteration 10000 -> Train Loss: 0.4223, Accuracy: 0.807
[Epoch 1/1] Iteration 11000 -> Train Loss: 0.4088, Accuracy: 0.811
[Epoch 1/1] Iteration 12000 -> Train Loss: 0.4054, Accuracy: 0.819
[Epoch 1/1] Iteration 13000 -> Train Loss: 0.4180, Accuracy: 0.808
[Epoch 1/1] Iteration 14000 -> Train Loss: 0.3874, Accuracy: 0.819
[Epoch 1/1] Iteration 15000 -> Train Loss: 0.4213, Accuracy: 0.803
[Epo
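The training cell pads every review to 512 tokens with zeros and does not tell the model which positions are padding. A common alternative is to pad only to the longest sequence in the batch and to build an attention mask alongside the ids. The sketch below shows that bookkeeping in plain Python; `pad_and_mask` is a hypothetical helper, and the token ids are made-up stand-ins for `tokenizer.encode(...)` output.

```python
def pad_and_mask(encoded_list, max_len=512, pad_id=0):
    """Pad (or truncate) token-id lists and build matching attention masks."""
    # Pad only to the longest sequence in the batch, capped at max_len.
    target = min(max(len(e) for e in encoded_list), max_len)
    input_ids, attention_mask = [], []
    for e in encoded_list:
        e = e[:target]  # naive truncation; it would cut off a trailing special token
        n_pad = target - len(e)
        input_ids.append(e + [pad_id] * n_pad)
        attention_mask.append([1] * len(e) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids standing in for tokenizer.encode(...) output.
ids, mask = pad_and_mask([[101, 7, 8, 102], [101, 9, 102]])
print(ids)
print(mask)
```

Both lists could then be converted with `torch.tensor` and passed to the model, assuming the installed version's `forward` accepts an `attention_mask` keyword argument. Doing so would change the training behaviour, so the numbers logged above would not carry over.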

In [18]:
# evaluation
model.eval()

nsmc_eval_dataset = NsmcDataset(test_df)
eval_loader = DataLoader(nsmc_eval_dataset, batch_size=2, shuffle=False, num_workers=2)

total_loss = 0
total_len = 0
total_correct = 0

with torch.no_grad():  # gradients are not needed during evaluation
    for text, label in eval_loader:
        encoded_list = [tokenizer.encode(t, add_special_tokens=True) for t in text]
        padded_list = [e + [0] * (512 - len(e)) for e in encoded_list]
        sample = torch.tensor(padded_list).to(device)
        labels = label.to(device)  # label is already a tensor, so it is not re-wrapped
        outputs = model(sample, labels=labels)
        _, logits = outputs

        # argmax over the raw logits equals argmax over their softmax
        pred = torch.argmax(logits, dim=1)
        correct = pred.eq(labels)
        total_correct += correct.sum().item()
        total_len += len(labels)

print('Test accuracy: ', total_correct / total_len)


Test accuracy:  0.8233882258817254
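The accuracy above is the fraction of reviews whose highest-scoring class matches the label. The plain-Python sketch below reproduces that computation on made-up logits and also checks that applying softmax first does not change which class wins, since softmax preserves the ordering of the scores. The helper names and the sample values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(batch_logits, labels):
    """Fraction of rows whose argmax matches the label."""
    preds = [argmax(row) for row in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical logits for three reviews, two classes each.
logits = [[1.2, -0.3], [-0.5, 0.9], [0.1, 0.4]]
labels = [0, 1, 0]

# The predicted class is identical with and without softmax.
assert all(argmax(row) == argmax(softmax(row)) for row in logits)
print(accuracy(logits, labels))
```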
