# 🌊 Deep Dive Into K-UDA
> **Author: 오수지 (2022020660)**

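Hello, I'm 오수지, a master's student in the DSBA Lab at the School of Industrial and Management Engineering, Korea University.  
In this notebook tutorial I run experiments on a Semi-Supervised Learning method, specifically **`UDA`**. (🔎 For the basics of Semi-Supervised Learning, see the [YouTube tutorial video](https://youtu.be/_uE2_Bl9wTE).)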

UDA, short for **Unsupervised Data Augmentation**, is a Semi-Supervised Learning method from the Consistency Regularization family. It trains a model with two terms: the `Supervised Loss` normally used for classification, and a `Consistency Loss` that pulls the model's output on an augmented example toward its output on the original example. The method is also known for showing experimentally that training with only about 10% of the labeled data can outperform training on all of it.
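
Written as a formula, the objective described above looks roughly as follows. The notation is introduced here for illustration only: $L$ and $U$ are the labeled and unlabeled sets, $q(\hat{x} \mid x)$ is the augmentation (Back Translation in this tutorial), $\tilde{\theta}$ is a fixed copy of the parameters through which no gradient flows, and $\lambda$ weights the consistency term. The training code in this notebook simply adds the two losses, which corresponds to $\lambda = 1$.

$$
\mathcal{L}(\theta) =
\mathbb{E}_{(x, y) \in L}\big[-\log p_{\theta}(y \mid x)\big]
+ \lambda \, \mathbb{E}_{x \in U}\, \mathbb{E}_{\hat{x} \sim q(\hat{x} \mid x)}
\Big[\mathrm{KL}\big(p_{\tilde{\theta}}(\cdot \mid x) \,\|\, p_{\theta}(\cdot \mid \hat{x})\big)\Big]
$$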

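Most of the methods covered in class focus on Computer Vision datasets, so what impressed me about UDA is that it applies just as easily to NLP datasets. While searching for UDA material, however, I found that most code is built on the English IMDB dataset used in the paper, and I could not find a single experiment on a Korean dataset. 🤯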

So in this tutorial I evaluate UDA on NSMC, a movie-review sentiment classification dataset built from Naver movie reviews, which can be seen as the Korean counterpart of IMDB. (The first **`K-UDA`**! 😎) Can UDA also reach good performance on NSMC with only a few labeled examples?

<img src="images/uda.png" width="700">

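**📌 Goals of this tutorial**
>1. Measure Fully Supervised performance on the NSMC dataset
>2. Measure UDA performance on the NSMC dataset for different numbers of labeled examples
>3. Measure UDA performance on the NSMC dataset under UDA's additional settings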

## 🛠 Environment Setup

In [1]:
#!pip install easydict
#!pip install datasets

In [1]:
import pickle
import random
import warnings

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import torch.optim as optim
from torch import nn
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm, trange
from datasets import load_dataset
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
from easydict import EasyDict

warnings.filterwarnings(action='ignore')

def get_pickle(pickle_path):
    # Load a pickled object (here, the back-translated DataFrame) from disk
    with open(pickle_path, "rb") as f:
        return pickle.load(f)

def fix_seed(seed: int) -> None:
    # Seed every random number generator used in this notebook
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True # this can slow down speed
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    
def calc_accuracy(X, Y):
    # X: logits of shape (batch, num_classes), Y: gold labels of shape (batch,)
    max_indices = torch.argmax(X, dim=1)
    return (max_indices == Y).sum().item() / max_indices.size(0)

fix_seed(42)

In [2]:
args = EasyDict({'model_name': 'klue/bert-base', 
                 'lr': 1e-5, 
                 'batch_size': 16, 
                 'uda_mode': True,
                 'epochs': 10})

## 0️⃣ Overview

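- UDA trains a model with Supervised Loss + Consistency Loss, and computing the Consistency Loss requires a perturbed version of each original example.
- In NLP, perturbations range from very simple ones such as Random Deletion and synonym replacement, to replacing words with low TF-IDF scores, to Back Translation, which translates the original sentence into another language and back again. The paper reports that Back Translation gave the best performance, so this tutorial also uses Back Translation to prepare the data for the Consistency Loss.
- To prepare the back-translated data I tried both the Papago and Google translators. Papago produced higher-quality back-translations, but its API took too long, so I ended up choosing Google Translate.
- To save further time, I used only the reviews whose length falls in the 25%~50% range. Building the back-translated data still took **24 hours** in total, so below I load the result from a pickle file saved in advance.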
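
The preparation steps above can be sketched as follows. This is a minimal illustration and not the script that produced `unsup_data.pkl`: the helper names are hypothetical, the translators are passed in as plain callables (`to_pivot`, `from_pivot`) because the pivot language and the translation client are not specified in the text, and the exact way the 25%~50% length range was computed is an assumption.

```python
from statistics import quantiles
from typing import Callable, List, Tuple


def select_by_length_quartile(reviews: List[str]) -> List[str]:
    """Keep reviews whose character length lies between the 25th and 50th percentile."""
    lengths = [len(r) for r in reviews]
    q1, q2, _ = quantiles(lengths, n=4, method="inclusive")
    return [r for r in reviews if q1 <= len(r) <= q2]


def back_translate(review: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate into a pivot language and back, using caller-supplied translators."""
    return from_pivot(to_pivot(review))


def build_unsup_pairs(reviews: List[str],
                      to_pivot: Callable[[str], str],
                      from_pivot: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (original, back-translated) pairs for the selected reviews."""
    selected = select_by_length_quartile(reviews)
    return [(r, back_translate(r, to_pivot, from_pivot)) for r in selected]


# Toy usage with stand-in "translators" (upper/lower case instead of real translation)
toy_reviews = ["a" * k for k in range(1, 9)]
toy_pairs = build_unsup_pairs(toy_reviews, str.upper, str.lower)
```

Passing the translators as arguments keeps the slow, rate-limited API calls separate from the selection logic, so the filter can be checked on toy data before spending hours on translation.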

In [3]:
# Load the back-translated data and renumber the rows from 0
unsup_data = get_pickle('./data/unsup_data.pkl').reset_index(drop=True)

In [4]:
len(unsup_data)

40260

- Looking at just 10 randomly chosen examples, some back-translations are fine, but others lose the meaning of the original sentence entirely. I think the poor back-translation quality of Google Translate is one of the reasons UDA ended up scoring below the Fully-Supervised setting.

In [5]:
unsup_data.head(10)

||document|bt_document|label|
|---:|:---|:---|:---:|
|0|아 더빙.. 진짜 짜증나네요 목소리|오, 더빙 ... 정말 성가시다.|0|
|1|너무재밓었다그래서보는것을추천한다|내가 보는 것이 너무 좋았습니다|0|
|2|걍인피니트가짱이다.진짜짱이다♥|짐 무한이 굉장합니다.|1|
|3|ㄱ냥 매번 긴장되고 재밋음ㅠㅠ|매번 님프와 재미 ㅠㅠ|1|
|4|보면서 웃지 않는 건 불가능하다|시청하는 동안 웃지 않는 것은 불가능합니다|1|
|5|주제는 좋은데 중반부터 지루하다|주제는 좋지만 중간에서 지루합니다|0|
|6|kl2g 고추를 털어버려야 할텐데|KL2G 후추는 닦아야합니다|1|
|7|재밌는데 별점이 왜이리 낮은고|재미 있지만 왜 별 등급이 너무 낮습니까?|1|
|8|아직도 이 드라마는 내인생의 최고!|이 드라마는 여전히 내 인생에서 최고입니다!|1|
|9|패션에 대한 열정! 안나 윈투어!|패션에 대한 열정! Anna Wintour!|1|


> 🔎 Let's take a quick look at how the losses are computed, using the first example in the data shown above!

In [6]:
tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
doc = tokenizer(unsup_data['document'][0], max_length=16, padding='max_length', return_tensors="pt")
bt_doc = tokenizer(unsup_data['bt_document'][0], max_length=16, padding='max_length', return_tensors="pt")
label = unsup_data['label'][0]

In [7]:
model = AutoModelForSequenceClassification.from_pretrained('klue/bert-base')

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

### 🔥 Supervised Loss

In [8]:
labeled_output = model(**doc)

In [9]:
labeled_output.logits

tensor([[-0.0056,  0.1694]], grad_fn=<AddmmBackward>)

In [10]:
sup_loss = nn.CrossEntropyLoss()

In [11]:
sup_loss(labeled_output.logits, torch.tensor([label]))

tensor(0.7844, grad_fn=<NllLossBackward>)

### 🔥 Unsupervised Loss

In [12]:
unlabeled_output = model(**bt_doc)

In [13]:
unlabeled_output.logits

tensor([[0.3888, 0.1439]], grad_fn=<AddmmBackward>)

In [14]:
unsup_loss = nn.KLDivLoss()

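> 💡 The model's predicted distribution for a noised unlabeled example usually differs only slightly from its distribution for the clean example, so the Consistency Loss is typically much smaller than the Supervised Loss. To compensate, the UDA paper proposes Sharpening: for confidently predicted unlabeled data, the target logits are divided by a temperature below 1, which makes the target distribution more peaked and widens its gap to the prediction on the noised example.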

In [15]:
# Without sharpening
unsup_loss(F.log_softmax(unlabeled_output.logits, dim=1), F.softmax(labeled_output.logits, dim=1))

tensor(0.0110, grad_fn=<KlDivBackward>)

In [18]:
# With sharpening (temperature 0.85): the loss is larger than without it
unsup_loss(F.log_softmax(unlabeled_output.logits, dim=1), F.softmax(labeled_output.logits/0.85, dim=1))

tensor(0.0126, grad_fn=<KlDivBackward>)
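
The two values above can be checked by hand. The sketch below recomputes them in plain Python from the logits printed in cells `In [9]` and `In [13]`; the helper names are illustrative. `nn.KLDivLoss()` with its default reduction averages over every element (here 1 example × 2 classes), so the summed KL is divided by 2 before comparing. Because the printed logits are rounded to four decimals, the results should land close to 0.0110 and 0.0126 without necessarily matching the last digit.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax with an optional temperature; temperature < 1 sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: List[float], pred: List[float]) -> float:
    """KL(target || pred), summed over classes."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


original_logits = [-0.0056, 0.1694]   # original sentence, rounded as printed
augmented_logits = [0.3888, 0.1439]   # back-translated sentence, rounded as printed

pred = softmax(augmented_logits)
kl_plain = kl_divergence(softmax(original_logits), pred)
kl_sharp = kl_divergence(softmax(original_logits, temperature=0.85), pred)

# Divide by the number of elements to mimic KLDivLoss's default 'mean' reduction
num_elements = len(original_logits)
print(f"plain: {kl_plain / num_elements:.5f}, sharpened: {kl_sharp / num_elements:.5f}")
```

Sharpening increases the loss here because the original and back-translated predictions favor different classes, so a more peaked target moves further away from the prediction.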

## 1️⃣ Prepare Dataset & DataLoader for NSMC Dataset

In [3]:
class NSMCDataset(Dataset):
    def __init__(self, args, 
                 data_type='train', 
                 uda_mode=True, 
                 is_unlabeled=True,
                 n=20):
        self.uda_mode = uda_mode
        self.is_unlabeled = is_unlabeled
        # Fully-Supervised Setting (the test split always takes this branch)
        if not uda_mode or data_type == 'test':
            self.dataset = load_dataset("nsmc")[data_type]
            temp_df = pd.DataFrame({'document': self.dataset['document'],
                                     'label': self.dataset['label'],
                                     'len' : [len(doc) for doc in self.dataset['document']]})
            self.df = temp_df[(temp_df['len']>=16) & (temp_df['len']<=27)].reset_index()
            self.docs = list(self.df['document'])
            self.labels = list(self.df['label'])
        # UDA Setting
        else:
            self.dataset = load_dataset("nsmc")[data_type]
            temp_df = pd.DataFrame({'document': self.dataset['document'],
                                     'label': self.dataset['label'],
                                     'len' : [len(doc) for doc in self.dataset['document']]})
            self.df = temp_df[(temp_df['len']>=15) & (temp_df['len']<=28)].reset_index()
            self.unsup_data = get_pickle('./data/unsup_data.pkl')
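            # Labeled candidates: reviews that are not in the unlabeled pool.
            # sorted() gives random.sample a stable order, since the iteration order of a set of strings can change between Python processes
            sup_docs = sorted(set(self.df['document']) - set(self.unsup_data['document']))
            sup_docs = random.sample(sup_docs, n)
            # isin() keeps every row whose text matches, so duplicated reviews can yield slightly more than n rows
            self.sup_data = self.df[self.df['document'].isin(sup_docs)].reset_index()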
            if is_unlabeled:
                self.docs = list(self.unsup_data['document'])
                self.bt_docs = list(self.unsup_data['bt_document'])
                self.labels = list(self.unsup_data['label'])
            else:
                self.docs = list(self.sup_data['document'])
                self.labels = list(self.sup_data['label'])
            
        self.tokenizer = AutoTokenizer.from_pretrained(args.model_name)

    def __len__(self):
        return len(self.docs)

    def __getitem__(self, idx):
        doc, label = self.docs[idx], self.labels[idx]

        inputs = self.tokenizer(
            doc,
            return_tensors='pt',
            truncation=True,
            max_length=32,
            padding='max_length',
            add_special_tokens=True
        )
        
        if not self.uda_mode or not self.is_unlabeled:
            return inputs, label
        
        bt_doc = self.bt_docs[idx]
        bt_inputs = self.tokenizer(
            bt_doc,
            return_tensors='pt',
            truncation=True,
            max_length=32,
            padding='max_length',
            add_special_tokens=True
        )
        return inputs, bt_inputs

## 2️⃣ Train K-UDA
||Fully-Supervised|N=20|N=100|N=1000|
|:---|:---:|:---:|:---:|:---:|
|Accuracy|**0.88**|0.74|0.77|0.84|

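- In the UDA setting, accuracy moves closer to that of training on the full set of about 40,000 labeled examples as the number of labeled examples grows. However, unlike the UDA paper, which reports strong results with only 20 labeled examples, UDA here could not beat the Fully-Supervised setting even with 1,000 labeled examples.

- As mentioned above, I think a major factor is the low quality of the unlabeled data caused by Google Translate's poor back-translations.
- In addition, performance in the UDA setting varies considerably depending on which examples are selected for the Supervised Loss, but due to time constraints I could not measure the variation across seeds. Even with only 1,000 labeled examples, the gap to using all labeled data is just 0.04 in accuracy (0.84 vs. 0.88), so I think a different seed could come closer to, or even exceed, the Fully-Supervised result.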

In [4]:
def train(args):
    # Prepare Dataset & DataLoader
    if not args.uda_mode:
        train_dataset = NSMCDataset(args, 'train', uda_mode=args.uda_mode)
        test_dataset = NSMCDataset(args, 'test', uda_mode=args.uda_mode)
        train_loader = DataLoader(train_dataset, batch_size=256, num_workers=0, shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=256, num_workers=0, shuffle=False)
        print(f"The number of labeled dataset : {len(train_dataset)}")
        print(f"The number of test dataset : {len(test_dataset)}")
    else:
        train_dataset = NSMCDataset(args, 'train', is_unlabeled=False, n=args.n)
        unsup_train_dataset = NSMCDataset(args, 'train', is_unlabeled=True)
        test_dataset = NSMCDataset(args, 'test', is_unlabeled=False)
        print(f"The number of labeled dataset : {len(train_dataset)}")
        print(f"The number of unlabeled dataset : {len(unsup_train_dataset)}")
        print(f"The number of test dataset : {len(test_dataset)}")

        train_loader = DataLoader(train_dataset, batch_size=4, num_workers=0, shuffle=True)
        unsup_train_loader = DataLoader(unsup_train_dataset, batch_size=256, num_workers=0, shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=256, num_workers=0, shuffle=False)
        
        unsup_train_iter = iter(unsup_train_loader)
        
    # Define Model, Optimizer, Loss
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=args.lr)
    sup_loss = nn.CrossEntropyLoss(reduction='mean')
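    # 'batchmean' sums over classes and averages over the batch (the mathematical KL); 'mean' would also divide by the number of classes
    unsup_loss = nn.KLDivLoss(reduction='batchmean')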
    best_acc = 0.0
    
    for epoch in range(args.epochs):
        # Train
        train_acc = 0.0
        model.train()
        for batch_id, (inputs, labels) in enumerate(tqdm(train_loader)):
            optimizer.zero_grad()
            # squeeze(1) drops only the extra tokenizer dimension, so a batch of size 1 keeps its batch axis
            inputs = {key: val.squeeze(1).to(device) for key, val in inputs.items()}
            labels = labels.to(device)
            preds = model(**inputs)[0]
            loss = sup_loss(preds, labels)
            if args.uda_mode:
                try:
                    original_inputs, bt_inputs = next(unsup_train_iter)
                except StopIteration:
                    unsup_train_iter = iter(unsup_train_loader)
                    original_inputs, bt_inputs = next(unsup_train_iter)
                bt_inputs = {key: val.squeeze(1).to(device) for key, val in bt_inputs.items()}
                bt_preds = model(**bt_inputs)[0]
                # Predictions on the original text act as a fixed target (no dropout, no gradient)
                model.eval()
                with torch.no_grad():
                    original_inputs = {key: val.squeeze(1).to(device) for key, val in original_inputs.items()}
                    original_preds = model(**original_inputs)[0]
                model.train()
                original_probs = F.softmax(original_preds, dim=1)  # Target
                bt_probs = F.log_softmax(bt_preds, dim=1)  # Input
                # unsup_loss already reduces to a scalar, so no further sum is needed
                consistency_loss = unsup_loss(bt_probs, original_probs)
                loss += consistency_loss
            loss.backward()
            optimizer.step()
            train_acc += calc_accuracy(preds, labels)

        train_acc = round(train_acc / (batch_id+1), 3)
        
        # Test (gradients are not needed for evaluation)
        test_correct, test_total = 0, 0
        model.eval()
        with torch.no_grad():
            for batch_id, (inputs, labels) in enumerate(test_loader):
                inputs = {key: val.squeeze(1).to(device) for key, val in inputs.items()}
                labels = labels.to(device)
                preds = model(**inputs)[0]
                _, predicted = torch.max(preds, 1)
                test_correct += int((predicted == labels).sum())
                test_total += len(labels)
        test_acc = round(test_correct / test_total, 3)
        if test_acc > best_acc:
            best_acc = test_acc
        
        print("="*100)
        print(f"Epoch : {epoch+1} ||| Train Acc : {train_acc} ||| Test Acc : {test_acc}")
    
    print("="*100)
    print(f"Final Best Acc : {best_acc}")
    print("="*100)
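
The `try` / `except StopIteration` pattern in `train` pairs each labeled batch with the next unlabeled batch and restarts the unlabeled loader whenever it runs out. The same idea can be written as a small generator. This is a sketch with an illustrative name, shown on plain lists rather than `DataLoader` objects.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def cycle_batches(loader: Iterable[T]) -> Iterator[T]:
    """Yield batches forever, re-iterating the loader each time it is exhausted."""
    while True:
        yielded = False
        for batch in loader:
            yielded = True
            yield batch
        if not yielded:
            # Guard against spinning forever on an empty loader
            raise ValueError("loader is empty")


# Toy usage: three labeled batches, two unlabeled batches
labeled_batches = [["l1"], ["l2"], ["l3"]]
unlabeled_stream = cycle_batches([["u1"], ["u2"]])
pairs = [(lb, next(unlabeled_stream)) for lb in labeled_batches]
```

Unlike `itertools.cycle`, this re-iterates the underlying object instead of replaying cached items, so a loader created with `shuffle=True` would produce a fresh order on every pass.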

### Fully Supervised Setting

In [6]:
args.uda_mode = False
args.epochs = 5

train(args)

The number of labeled dataset : 40360
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.821 ||| Test Acc : 0.855
Epoch : 2 ||| Train Acc : 0.879 ||| Test Acc : 0.869
Epoch : 3 ||| Train Acc : 0.901 ||| Test Acc : 0.879
Epoch : 4 ||| Train Acc : 0.916 ||| Test Acc : 0.879
Epoch : 5 ||| Train Acc : 0.932 ||| Test Acc : 0.881
Final Best Acc : 0.881


### UDA Setting

In [5]:
args.uda_mode = True
args.epochs = 10

args.n = 20
train(args)
args.n = 100
train(args)
args.n = 1000
train(args)

The number of labeled dataset : 20
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.45 ||| Test Acc : 0.576
Epoch : 2 ||| Train Acc : 0.75 ||| Test Acc : 0.619
Epoch : 3 ||| Train Acc : 0.9 ||| Test Acc : 0.654
Epoch : 4 ||| Train Acc : 1.0 ||| Test Acc : 0.675
Epoch : 5 ||| Train Acc : 1.0 ||| Test Acc : 0.688
Epoch : 6 ||| Train Acc : 1.0 ||| Test Acc : 0.696
Epoch : 7 ||| Train Acc : 1.0 ||| Test Acc : 0.704
Epoch : 8 ||| Train Acc : 1.0 ||| Test Acc : 0.721
Epoch : 9 ||| Train Acc : 1.0 ||| Test Acc : 0.731
Epoch : 10 ||| Train Acc : 1.0 ||| Test Acc : 0.737
Final Best Acc : 0.737


The number of labeled dataset : 100
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.55 ||| Test Acc : 0.663
Epoch : 2 ||| Train Acc : 0.79 ||| Test Acc : 0.723
Epoch : 3 ||| Train Acc : 0.9 ||| Test Acc : 0.713
Epoch : 4 ||| Train Acc : 0.94 ||| Test Acc : 0.741
Epoch : 5 ||| Train Acc : 1.0 ||| Test Acc : 0.758
Epoch : 6 ||| Train Acc : 1.0 ||| Test Acc : 0.758
Epoch : 7 ||| Train Acc : 1.0 ||| Test Acc : 0.768
Epoch : 8 ||| Train Acc : 1.0 ||| Test Acc : 0.765
Epoch : 9 ||| Train Acc : 1.0 ||| Test Acc : 0.766
Epoch : 10 ||| Train Acc : 1.0 ||| Test Acc : 0.768
Final Best Acc : 0.768


The number of labeled dataset : 1007
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.761 ||| Test Acc : 0.819
Epoch : 2 ||| Train Acc : 0.901 ||| Test Acc : 0.837
Epoch : 3 ||| Train Acc : 0.97 ||| Test Acc : 0.833
Epoch : 4 ||| Train Acc : 0.994 ||| Test Acc : 0.839
Epoch : 5 ||| Train Acc : 0.995 ||| Test Acc : 0.831
Epoch : 6 ||| Train Acc : 0.998 ||| Test Acc : 0.84
Epoch : 7 ||| Train Acc : 0.995 ||| Test Acc : 0.827
Epoch : 8 ||| Train Acc : 0.997 ||| Test Acc : 0.837
Epoch : 9 ||| Train Acc : 1.0 ||| Test Acc : 0.84
Epoch : 10 ||| Train Acc : 0.999 ||| Test Acc : 0.834
Final Best Acc : 0.84


## 3️⃣ Train K-UDA with Additional Settings
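> Following [this blog post](https://joungheekim.github.io/2020/12/13/code-review/), I measured UDA performance under three additional settings described in the paper.   
> ① Training Signal Annealing (TSA) : keeps the model from overfitting too quickly to the relatively small amount of labeled data  
> ② Confidence-based Masking : uses only the unlabeled examples the model is confident about, based on its predicted probabilities  
> ③ Sharpening Prediction : reshapes the predicted distribution on unlabeled data in order to increase the Consistency Loss  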
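
To make ① more concrete, here is a small sketch of a linear TSA schedule in plain Python. It mirrors the expression `(global_step / total_steps) * 0.5 + 0.5` used in the training code below for two classes; the function names and the example probabilities are illustrative. A labeled example only contributes to the loss while the probability the model assigns to its gold label is at or below the current threshold.

```python
import math
from typing import List


def tsa_threshold(step: int, total_steps: int, num_classes: int = 2) -> float:
    """Linear schedule: rises from 1/num_classes at step 0 to 1.0 at the final step."""
    start = 1.0 / num_classes
    return start + (step / total_steps) * (1.0 - start)


def tsa_masked_loss(gold_probs: List[float], step: int, total_steps: int) -> float:
    """Mean cross-entropy in which already-confident examples contribute zero."""
    threshold = tsa_threshold(step, total_steps)
    losses = [-math.log(p) if p <= threshold else 0.0 for p in gold_probs]
    return sum(losses) / len(losses)


# Halfway through training the threshold is 0.75: the 0.9 example is masked out,
# while the 0.6 example still contributes its cross-entropy
halfway_loss = tsa_masked_loss([0.9, 0.6], step=50, total_steps=100)
```

Early in training the threshold is low, so examples the model already fits well are ignored; by the end it reaches 1.0 and every labeled example is used.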

||Accuracy|
|:---|:---:|
|Base UDA|0.838|
|+ Training Signal Annealing|0.842|
|+ Confidence-based Masking|**0.847**|
|+ Sharpening Prediction|0.841|
|+ All|0.845|

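- Compared with Base UDA, which uses none of the extra settings, every configuration improved accuracy (by 0.003 to 0.009), and adding only Confidence-based Masking gave the largest gain of about 0.01. Note, however, that with two classes the largest predicted probability is always at least 0.5, so the 0.5 threshold used in the code below never masks any example; this gain therefore most likely reflects run-to-run variation rather than the masking itself.
- Enabling all settings together scored 0.002 lower than adding Confidence-based Masking alone. I think this is well within what a different seed could change, and the more important observation is that every additional setting scored above Base UDA in this run.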
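
The effect of the confidence threshold can be illustrated in plain Python. The probabilities and the 0.8 threshold below are made up for illustration and are not taken from the experiments; the function name is also illustrative. With two classes the largest probability can never fall below 0.5, so a threshold of 0.5 keeps every example, while a higher threshold actually filters.

```python
from typing import List


def confidence_mask(probs: List[List[float]], threshold: float) -> List[bool]:
    """True where the most likely class has probability >= threshold."""
    return [max(p) >= threshold for p in probs]


# Illustrative predictions for three unlabeled examples (binary classification)
batch_probs = [[0.52, 0.48], [0.10, 0.90], [0.65, 0.35]]

keep_at_050 = confidence_mask(batch_probs, threshold=0.5)  # keeps all three
keep_at_080 = confidence_mask(batch_probs, threshold=0.8)  # keeps only the second
```

A threshold meant to select confident predictions therefore needs to sit above `1 / num_classes` by some margin.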

In [7]:
def advanced_train(args, tsa=False, conf_mask=False, sharpening=False):
    # Prepare Dataset & DataLoader
    if not args.uda_mode:
        train_dataset = NSMCDataset(args, 'train', uda_mode=args.uda_mode)
        test_dataset = NSMCDataset(args, 'test', uda_mode=args.uda_mode)
        train_loader = DataLoader(train_dataset, batch_size=256, num_workers=0, shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=256, num_workers=0, shuffle=False)
        print(f"The number of labeled dataset : {len(train_dataset)}")
        print(f"The number of test dataset : {len(test_dataset)}")
    else:
        train_dataset = NSMCDataset(args, 'train', is_unlabeled=False, n=args.n)
        unsup_train_dataset = NSMCDataset(args, 'train', is_unlabeled=True)
        test_dataset = NSMCDataset(args, 'test', is_unlabeled=False)
        print(f"The number of labeled dataset : {len(train_dataset)}")
        print(f"The number of unlabeled dataset : {len(unsup_train_dataset)}")
        print(f"The number of test dataset : {len(test_dataset)}")

        train_loader = DataLoader(train_dataset, batch_size=4, num_workers=0, shuffle=True)
        unsup_train_loader = DataLoader(unsup_train_dataset, batch_size=256, num_workers=0, shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=256, num_workers=0, shuffle=False)
        
        unsup_train_iter = iter(unsup_train_loader)
        
    # Define Model, Optimizer, Loss
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=args.lr)
    sup_loss = nn.CrossEntropyLoss(reduction='none')
    unsup_loss = nn.KLDivLoss(reduction='none')
    best_acc = 0.0
    global_step = 0
    total_steps = len(train_loader) * args.epochs
    
    for epoch in range(args.epochs):
        # Train
        train_acc = 0.0
        model.train()
        for batch_id, (inputs, labels) in enumerate(tqdm(train_loader)):
            global_step += 1
            optimizer.zero_grad()
            # squeeze(1) drops only the extra tokenizer dimension, so a batch of size 1 keeps its batch axis
            inputs = {key: val.squeeze(1).to(device) for key, val in inputs.items()}
            labels = labels.to(device)
            preds = model(**inputs)[0]
            loss = sup_loss(preds, labels)
            
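            # Training Signal Annealing (TSA)
            if tsa:
                # Linear schedule: the threshold rises from about 0.5 (1 / number of classes) to 1.0
                tsa_threshold = (global_step / total_steps) * 0.5 + 0.5
                # exp(-cross_entropy) is the probability assigned to the gold label
                preds_prob = torch.exp(-loss)
                # Keep only the examples the model is not yet confident about
                tsa_mask = preds_prob.le(tsa_threshold)
                loss = (loss * tsa_mask).mean()
            else:
                loss = loss.mean()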
                
            if args.uda_mode:
                try:
                    original_inputs, bt_inputs = next(unsup_train_iter)
                except StopIteration:
                    unsup_train_iter = iter(unsup_train_loader)
                    original_inputs, bt_inputs = next(unsup_train_iter)
                bt_inputs = {key: val.squeeze(1).to(device) for key, val in bt_inputs.items()}
                bt_preds = model(**bt_inputs)[0]
                # Predictions on the original text act as a fixed target (no dropout, no gradient)
                model.eval()
                with torch.no_grad():
                    original_inputs = {key: val.squeeze(1).to(device) for key, val in original_inputs.items()}
                    original_preds = model(**original_inputs)[0]
                model.train()
                
                # Sharpening Prediction
                if sharpening:
                    original_probs = F.softmax(original_preds/0.85, dim=1) 
                else:
                    original_probs = F.softmax(original_preds, dim=1) 
                bt_probs = F.log_softmax(bt_preds, dim=1)
                consistency_loss = unsup_loss(bt_probs, original_probs).sum(dim=-1)
                
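                # Confidence-based Masking
                if conf_mask:
                    # NOTE: with two classes the largest probability is always >= 0.5,
                    # so this threshold keeps every example; raise it for the mask to take effect
                    unsup_loss_mask = torch.max(F.softmax(original_preds, dim=1), dim=1)[0]
                    unsup_loss_mask = unsup_loss_mask.ge(0.5)
                    consistency_loss = (consistency_loss * unsup_loss_mask).mean()
                else:
                    consistency_loss = consistency_loss.mean()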
                loss += consistency_loss
            loss.backward()
            optimizer.step()
            train_acc += calc_accuracy(preds, labels)

        train_acc = round(train_acc / (batch_id+1), 3)
        
        # Test (gradients are not needed for evaluation)
        test_correct, test_total = 0, 0
        model.eval()
        with torch.no_grad():
            for batch_id, (inputs, labels) in enumerate(test_loader):
                inputs = {key: val.squeeze(1).to(device) for key, val in inputs.items()}
                labels = labels.to(device)
                preds = model(**inputs)[0]
                _, predicted = torch.max(preds, 1)
                test_correct += int((predicted == labels).sum())
                test_total += len(labels)
        test_acc = round(test_correct / test_total, 3)
        if test_acc > best_acc:
            best_acc = test_acc
        
        print("="*100)
        print(f"Epoch : {epoch+1} ||| Train Acc : {train_acc} ||| Test Acc : {test_acc}")
    
    print("="*100)
    print(f"Final Best Acc : {best_acc}")
    print("="*100)

In [8]:
args.uda_mode = True
args.epochs = 10
args.n = 1000
advanced_train(args, tsa=False, conf_mask=False, sharpening=False)
advanced_train(args, tsa=True, conf_mask=False, sharpening=False)
advanced_train(args, tsa=False, conf_mask=True, sharpening=False)
advanced_train(args, tsa=False, conf_mask=False, sharpening=True)
advanced_train(args, tsa=True, conf_mask=True, sharpening=True)

Using custom data configuration default
Reusing dataset nsmc (/root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


  0%|          | 0/2 [00:00<?, ?it/s]

Using custom data configuration default
Reusing dataset nsmc (/root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


  0%|          | 0/2 [00:00<?, ?it/s]

Using custom data configuration default
Reusing dataset nsmc (/root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


  0%|          | 0/2 [00:00<?, ?it/s]

The number of labeled dataset : 1002
The number of unlabeled dataset : 40260
The number of test dataset : 13667


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

Epoch : 1 ||| Train Acc : 0.764 ||| Test Acc : 0.811
Epoch : 2 ||| Train Acc : 0.895 ||| Test Acc : 0.815
Epoch : 3 ||| Train Acc : 0.955 ||| Test Acc : 0.831
Epoch : 4 ||| Train Acc : 0.988 ||| Test Acc : 0.831
Epoch : 5 ||| Train Acc : 0.998 ||| Test Acc : 0.837
Epoch : 6 ||| Train Acc : 0.999 ||| Test Acc : 0.838
Epoch : 7 ||| Train Acc : 0.998 ||| Test Acc : 0.834
Epoch : 8 ||| Train Acc : 0.997 ||| Test Acc : 0.837
Epoch : 9 ||| Train Acc : 0.999 ||| Test Acc : 0.837
Epoch : 10 ||| Train Acc : 1.0 ||| Test Acc : 0.834
Final Best Acc : 0.838


The number of labeled dataset : 1016
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.681 ||| Test Acc : 0.732
Epoch : 2 ||| Train Acc : 0.815 ||| Test Acc : 0.826
Epoch : 3 ||| Train Acc : 0.911 ||| Test Acc : 0.828
Epoch : 4 ||| Train Acc : 0.969 ||| Test Acc : 0.827
Epoch : 5 ||| Train Acc : 0.99 ||| Test Acc : 0.835
Epoch : 6 ||| Train Acc : 0.993 ||| Test Acc : 0.813
Epoch : 7 ||| Train Acc : 0.996 ||| Test Acc : 0.834
Epoch : 8 ||| Train Acc : 0.994 ||| Test Acc : 0.837
Epoch : 9 ||| Train Acc : 0.998 ||| Test Acc : 0.841
Epoch : 10 ||| Train Acc : 0.999 ||| Test Acc : 0.842
Final Best Acc : 0.842


The number of labeled dataset : 1004
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.775 ||| Test Acc : 0.828
Epoch : 2 ||| Train Acc : 0.9 ||| Test Acc : 0.837
Epoch : 3 ||| Train Acc : 0.961 ||| Test Acc : 0.837
Epoch : 4 ||| Train Acc : 0.991 ||| Test Acc : 0.84
Epoch : 5 ||| Train Acc : 0.996 ||| Test Acc : 0.838
Epoch : 6 ||| Train Acc : 0.999 ||| Test Acc : 0.846
Epoch : 7 ||| Train Acc : 0.999 ||| Test Acc : 0.846
Epoch : 8 ||| Train Acc : 0.999 ||| Test Acc : 0.847
Epoch : 9 ||| Train Acc : 1.0 ||| Test Acc : 0.846
Epoch : 10 ||| Train Acc : 0.998 ||| Test Acc : 0.846
Final Best Acc : 0.847


The number of labeled dataset : 1010
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.783 ||| Test Acc : 0.818
Epoch : 2 ||| Train Acc : 0.911 ||| Test Acc : 0.835
Epoch : 3 ||| Train Acc : 0.966 ||| Test Acc : 0.828
Epoch : 4 ||| Train Acc : 0.99 ||| Test Acc : 0.835
Epoch : 5 ||| Train Acc : 0.994 ||| Test Acc : 0.841
Epoch : 6 ||| Train Acc : 0.999 ||| Test Acc : 0.841
Epoch : 7 ||| Train Acc : 0.999 ||| Test Acc : 0.838
Epoch : 8 ||| Train Acc : 0.999 ||| Test Acc : 0.841
Epoch : 9 ||| Train Acc : 0.999 ||| Test Acc : 0.839
Epoch : 10 ||| Train Acc : 0.998 ||| Test Acc : 0.837
Final Best Acc : 0.841


The number of labeled dataset : 1012
The number of unlabeled dataset : 40260
The number of test dataset : 13667

Epoch : 1 ||| Train Acc : 0.705 ||| Test Acc : 0.817
Epoch : 2 ||| Train Acc : 0.849 ||| Test Acc : 0.817
Epoch : 3 ||| Train Acc : 0.914 ||| Test Acc : 0.827
Epoch : 4 ||| Train Acc : 0.962 ||| Test Acc : 0.832
Epoch : 5 ||| Train Acc : 0.979 ||| Test Acc : 0.837
Epoch : 6 ||| Train Acc : 0.988 ||| Test Acc : 0.841
Epoch : 7 ||| Train Acc : 0.994 ||| Test Acc : 0.845
Epoch : 8 ||| Train Acc : 0.997 ||| Test Acc : 0.844
Epoch : 9 ||| Train Acc : 0.996 ||| Test Acc : 0.843
Epoch : 10 ||| Train Acc : 0.999 ||| Test Acc : 0.843
Final Best Acc : 0.845
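
To make the five runs easier to compare, the short sketch below collects the `Final Best Acc` values printed above into one table. The run labels follow the order of the `advanced_train` calls in the cell; the `results` list and the `format_table` helper are names introduced here for illustration only. Each setting was run once, and the labeled subset differs slightly between runs (1,002 to 1,016 examples), so gaps of a few tenths of a percentage point should be read with caution.

```python
# Final best test accuracies copied from the training logs above,
# listed in the order the advanced_train calls were made.
# Columns: label, tsa, conf_mask, sharpening, best test accuracy.
results = [
    ("UDA baseline", False, False, False, 0.838),
    ("+ TSA", True, False, False, 0.842),
    ("+ confidence masking", False, True, False, 0.847),
    ("+ sharpening", False, False, True, 0.841),
    ("+ all three", True, True, True, 0.845),
]

def format_table(rows):
    """Render the runs as fixed-width text, with the gain over the first row."""
    baseline = rows[0][-1]
    header = f"{'setting':<22}{'tsa':<7}{'mask':<7}{'sharp':<7}{'best acc':<10}{'delta':<8}"
    lines = [header]
    for name, tsa, mask, sharp, acc in rows:
        delta = acc - baseline
        lines.append(
            f"{name:<22}{str(tsa):<7}{str(mask):<7}{str(sharp):<7}{acc:<10.3f}{delta:+.3f}"
        )
    return "\n".join(lines)

print(format_table(results))

# Pick the setting with the highest best-epoch test accuracy.
best = max(results, key=lambda row: row[-1])
print(f"Best setting: {best[0]} ({best[-1]:.3f})")
```

In these runs every additional setting scores at or above the plain UDA baseline, with confidence masking alone giving the highest best-epoch test accuracy.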


> Implementing the UDA training code for a Korean dataset myself and checking its performance was a rewarding experience. That concludes this final tutorial!

<img src="images/9_turtle.png" width="300">