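Starting from the baseline code above, we changed only the training data, swapping in an oversampled version to address class imbalance, and retrained the model.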
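The notebook does not show how `oversampled_sep_train.csv` was produced. As a minimal sketch, assuming plain random oversampling (duplicating minority-class rows with replacement until every class matches the largest one), the idea looks roughly like this. The function name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(rows, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size.

    rows: list of (text, label) pairs. Hypothetical helper, not the notebook's code.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data: class 0 dominates, classes 2 and 7 are rare.
toy = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("e", 2), ("f", 7), ("g", 7)]
balanced = random_oversample(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling is normally applied to the training split only, which matches this notebook: the test file `sep_test.csv` is loaded unchanged.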

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Install HuggingFace transformers
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import torch
from torch.nn import functional as F
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, ElectraModel, AdamW
from transformers import Trainer, TrainingArguments
from tqdm.notebook import tqdm
from torchsummary import summary

MODEL_NAME = "monologg/koelectra-base-v3-discriminator"

In [4]:
# Make CUDA errors surface synchronously so they are easier to debug
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [5]:
# Use the GPU if one is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device('cpu')

# 1. Loading the data

In [6]:
data_dir = '/content/drive/MyDrive/KMWP/code/data'

train = pd.read_csv(data_dir + '/oversampled_sep_train.csv')
train = train.rename(columns={'class':'label'})
train.head(10)

Unnamed: 0,problem,label
0,한 변의 길이가 24cm인 정육각형과 둘레가 같은 정팔각형이 있습니다. 이 정팔각형...,7
1,윤아는 부추전을 똑같이 8조각으로 나누어 한 조각을 먹었습니다. 윤미는 같은 크기의...,0
2,"화단 주변에 한 변이 12m인 정팔각형 모양의 울타리를 두른다면, 울타리는 모두 몇...",7
3,"6장의 숫자 카드 0, 9, 8, 7, 2, 1가 있습니다. 이를, 한 번씩 사용하...",2
4,0.26 x 0.8을 계산해 주세요.,0
5,"현수네 집에서 미용실, 병원, 백화점까지의 거리는 각각 4/5km, 1/3km, 0...",6
6,어떤 수를 3으로 나누어야 하는 것을 잘못하여 9로 나누었더니 몫이 9가 되었습니다...,5
7,나은이네 모둠과 도진이네 모둠이 전철를 나누어 탔습니다. 나은이네 모둠 8명은 15...,0
8,"수정이가 가지고 있는 색연필의 길이는 6.1센티미터 , 볼펜의 길이는 6.5센티미터...",6
9,두발자전거 12대의 바퀴는 모두 몇 개인지 찾아보시오.,0


In [7]:
test = pd.read_csv(data_dir + '/sep_test.csv')
test.head(10)

Unnamed: 0,problem,label
0,진호가 가진 줄자의 길이는 6m 15 cm 이고 두은이가 가진 줄자의 길이는 0.2...,0
1,농구공 18개를 보관함 3개에 똑같이 나누어 담으려고 합니다. 보관함 1개에 농구공...,0
2,승준이는 우유를 매일 0.7L씩 마십니다. 승준이가 5일 동안 마실 우유를 준비하려...,0
3,원 안에 마름모의 넓이가 90cm²일 때 원의 반지름은 몇 cm일까요?,7
4,"가로가 45센티미터, 세로가 50센티미터인 직사각형 모양의 종이를 크기가 같은 정사...",7
5,주전자의 물을 2리터 200밀리리터 마시고 봤더니 5리터 100밀리리터가 남았습니다...,0
6,"숫자 카드 2, 4, 6, 5, 7를 한 번 사용하여 다섯 자리 수를 만들려고 합니...",2
7,철민이는 과자를 8개 먹었고 정윤이는 철민이가 먹은 과자의 7배를 먹었습니다. 정윤...,0
8,10 분에 14 km 를 가는 A 자동차와 15 분에 15 km를 가는 B 자동차가...,6
9,전체 타수에 대한 안타 수의 비율을 타율이라고 합니다. 어느 야구 선수는 210타수...,0


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7096 entries, 0 to 7095
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   problem  7096 non-null   object
 1   label    7096 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 111.0+ KB


In [9]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   problem  282 non-null    object
 1   label    282 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 4.5+ KB


# 2. Writing the dataset class

In [10]:
class mpDataset(Dataset):

  def __init__(self, dataset):
    # Drop rows that contain NaN values, if any
    self.dataset = dataset.dropna().reset_index(drop=True)
    self.num_labels = 8
    self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    print(self.dataset.describe())

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    # Column 0 is the problem text, column 1 is the class label
    row = self.dataset.iloc[idx, 0:2].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=256,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        add_special_tokens=True
        )

    # Drop the batch dimension added by return_tensors='pt'
    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y

In [11]:
train_dataset = mpDataset(train)
test_dataset = mpDataset(test)

             label
count  7096.000000
mean      3.500000
std       2.291449
min       0.000000
25%       1.750000
50%       3.500000
75%       5.250000
max       7.000000
            label
count  282.000000
mean     3.177305
std      3.000080
min      0.000000
25%      0.000000
50%      3.000000
75%      7.000000
max      7.000000


In [12]:
# Check that input_ids, attention_mask, and y come out as expected
train_dataset[0]



(tensor([    2,  3757, 10084,  7918,  4070,  6592,  4051,  4036,  4139, 26112,
         22720,  4047, 16740,  4070,  2024,  4112,  3286,  4264, 22720,  4007,
          3249,  4576,  6216,    18,  3240,  3286,  4264, 22720,  4234,  3757,
         10084,  7918,  4034,  2676,  9612, 13830,  8597,  8213,    18,     3,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  
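The output above starts with token id 2, ends the real text with id 3 (presumably the `[CLS]` and `[SEP]` special tokens), and is then filled with 0 up to `max_length=256`. As a sketch of what fixed-length padding does, the plain-Python function below pads a list of ids and builds the matching attention mask. The helper `pad_ids` is hypothetical, and treating 0 as the padding id is an assumption based on `padding_idx=0` in the model printout in section 3; the real work happens inside the tokenizer.

```python
def pad_ids(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids to max_length and build the attention mask.

    Simplification: a real tokenizer keeps the closing special token when it
    truncates; this sketch just cuts the list.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 = real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad              # 0 = padding, ignored by attention
    return ids, mask

# A shortened, illustrative id sequence in the same shape as the output above.
example = [2, 3757, 10084, 7918, 18, 3]
input_ids, attention_mask = pad_ids(example, max_length=10)
print(input_ids)
print(attention_mask)
```

Padding every sample to the same length is what lets `DataLoader` stack samples into a batch tensor without a custom collate function.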

# 3. Building the model

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,num_labels=8).to(device)

# Run a single forward pass as a sanity check
text, attention_mask, y = train_dataset[0]
model(text.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: 

SequenceClassifierOutput([('logits',
                           tensor([[ 0.0100, -0.0005,  0.0030, -0.0117, -0.1123,  0.0760,  0.0241,  0.0550]],
                                  device='cuda:0', grad_fn=<AddmmBackward0>))])
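The model returns raw logits, one per class. As a worked example, the sketch below turns the logits printed above into probabilities with a softmax and picks the most likely class. Because the classification head is freshly initialized at this point, the probabilities should sit close to uniform (1/8 = 0.125 per class).

```python
import math

# Logits copied from the output above (8 classes).
logits = [0.0100, -0.0005, 0.0030, -0.0117, -0.1123, 0.0760, 0.0241, 0.0550]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])
print("predicted class:", predicted)
```

`torch.max(y_pred, 1)` in the training loop takes the same argmax directly on the logits, which is valid because softmax preserves their ordering.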

In [14]:
torch.save(model.state_dict(),"model.pt")

In [15]:
model.load_state_dict(torch.load("model.pt"))

<All keys matched successfully>

In [16]:
# Inspect the model layers
model.eval()

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm

# 4. Training

In [17]:
epochs = 5
batch_size = 16

In [18]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # shuffling is not needed for evaluation

In [19]:
train_loader

<torch.utils.data.dataloader.DataLoader at 0x7fc9f0269750>
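With `batch_size = 16`, the number of batches per epoch is the dataset size divided by the batch size, rounded up, because `DataLoader` keeps the final partial batch by default (`drop_last=False`). A quick check, using the row counts reported by `info()` above:

```python
import math

batch_size = 16
n_train, n_test = 7096, 282   # row counts from train.info() and test.info()

train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
print(train_batches, test_batches)
```

These values match the progress-bar totals (444 and 18) shown in the training and evaluation output.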

In [20]:
losses = []
accuracies = []

# Optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-6)
criterion = nn.CrossEntropyLoss()

for i in range(epochs):
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.to(device)
  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()

    # CrossEntropyLoss expects int64 class indices
    y_batch = y_batch.long().to(device)

    # The first element of the model output is the logits tensor
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]

    loss = criterion(y_pred, y_batch)

    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      # total_loss is the running sum over the epoch so far, not a single-batch loss
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)

  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)



  0%|          | 0/444 [00:00<?, ?it/s]



Batch Loss: 200.57118344306946 Accuracy: tensor(0.3300, device='cuda:0')
Batch Loss: 379.8831835985184 Accuracy: tensor(0.4319, device='cuda:0')
Batch Loss: 530.2019748687744 Accuracy: tensor(0.5346, device='cuda:0')
Batch Loss: 652.1252608299255 Accuracy: tensor(0.6061, device='cuda:0')
Train Loss: 696.5486731529236 Accuracy: tensor(0.6328, device='cuda:0')


  0%|          | 0/444 [00:00<?, ?it/s]

Batch Loss: 86.91671848297119 Accuracy: tensor(0.8900, device='cuda:0')
Batch Loss: 155.64399698376656 Accuracy: tensor(0.8966, device='cuda:0')
Batch Loss: 210.79958713054657 Accuracy: tensor(0.9027, device='cuda:0')
Batch Loss: 255.0053346902132 Accuracy: tensor(0.9091, device='cuda:0')
Train Loss: 272.70807003974915 Accuracy: tensor(0.9111, device='cuda:0')


  0%|          | 0/444 [00:00<?, ?it/s]

Batch Loss: 33.391193345189095 Accuracy: tensor(0.9531, device='cuda:0')
Batch Loss: 59.96920017898083 Accuracy: tensor(0.9566, device='cuda:0')
Batch Loss: 86.36202728003263 Accuracy: tensor(0.9546, device='cuda:0')
Batch Loss: 108.03558336943388 Accuracy: tensor(0.9548, device='cuda:0')
Train Loss: 116.28616075962782 Accuracy: tensor(0.9556, device='cuda:0')


  0%|          | 0/444 [00:00<?, ?it/s]

Batch Loss: 17.536885544657707 Accuracy: tensor(0.9688, device='cuda:0')
Batch Loss: 31.170550785958767 Accuracy: tensor(0.9741, device='cuda:0')
Batch Loss: 43.288915142416954 Accuracy: tensor(0.9760, device='cuda:0')
Batch Loss: 54.82663235813379 Accuracy: tensor(0.9766, device='cuda:0')
Train Loss: 60.41922960802913 Accuracy: tensor(0.9765, device='cuda:0')


  0%|          | 0/444 [00:00<?, ?it/s]

Batch Loss: 10.624962389469147 Accuracy: tensor(0.9794, device='cuda:0')
Batch Loss: 18.740868531167507 Accuracy: tensor(0.9834, device='cuda:0')
Batch Loss: 28.405055329203606 Accuracy: tensor(0.9819, device='cuda:0')
Batch Loss: 35.564127430319786 Accuracy: tensor(0.9831, device='cuda:0')
Train Loss: 38.60968289338052 Accuracy: tensor(0.9835, device='cuda:0')


In [21]:
losses, accuracies

([696.5486731529236,
  272.70807003974915,
  116.28616075962782,
  60.41922960802913,
  38.60968289338052],
 [tensor(0.6328, device='cuda:0'),
  tensor(0.9111, device='cuda:0'),
  tensor(0.9556, device='cuda:0'),
  tensor(0.9765, device='cuda:0'),
  tensor(0.9835, device='cuda:0')])
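The loss values above are sums over all batches of an epoch, so they are easier to interpret after dividing by the batch count. The sketch below does that and compares the result against ln(8), the cross-entropy of a uniform guess over 8 classes. The batch count of 444 is taken from the progress bars in the training output.

```python
import math

# Per-epoch summed losses copied from the output above.
epoch_loss_sums = [
    696.5486731529236,
    272.70807003974915,
    116.28616075962782,
    60.41922960802913,
    38.60968289338052,
]
num_batches = 444

mean_losses = [s / num_batches for s in epoch_loss_sums]
uniform_baseline = math.log(8)  # loss of predicting 1/8 for every class

for epoch, loss in enumerate(mean_losses, start=1):
    print(f"epoch {epoch}: mean batch loss = {loss:.4f}")
print(f"uniform baseline = {uniform_baseline:.4f}")
```

Appending `total_loss / batches` instead of `total_loss` inside the training loop would record these per-batch means directly.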

# 5. Evaluating the model

In [22]:
model.eval()

test_correct = 0
test_total = 0

# Gradients are not needed for evaluation, so disable them to save memory
with torch.no_grad():
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)

  0%|          | 0/18 [00:00<?, ?it/s]



Accuracy: tensor(0.9326, device='cuda:0')
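Overall accuracy can hide weak classes when the test set is imbalanced, and the `describe()` output for the test set (quartiles 0, 3, 7) suggests that it is. As a sketch, per-class recall and its macro average can be computed from the collected labels and predictions. The lists `y_true` and `y_pred` below are made-up toy values, not results from this notebook; in practice they would be gathered inside the evaluation loop.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Toy example (hypothetical values): class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 2, 2, 7, 7]
y_pred = [0, 0, 0, 0, 0, 0, 2, 0, 7, 0]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = sum(recall.values()) / len(recall)
print(recall, accuracy, macro_recall)
```

In this toy case the overall accuracy is 0.8 while the two rare classes are only half right, which is the kind of gap the macro average exposes.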


In [23]:
# Save the model
save_model_path = data_dir + '/_weights'
os.makedirs(save_model_path, exist_ok=True)
torch.save(model.state_dict(), os.path.join(save_model_path, "koelectra-base-finetuned-mathProblem(oversampling)_Final.bin"))