
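Fine-tuning KoBERT on this dataset to classify reviews as positive or negative involves roughly the following steps:

- 1) Dataset preparation: read the uploaded Excel file and convert it into a pandas DataFrame.
- 2) Data preprocessing: clean the data so the model can process it, for example by dropping unneeded columns, tokenizing the text, and encoding the labels.
- 3) Loading the KoBERT model: load SKTBrain's KoBERT. The Hugging Face Transformers library makes this straightforward.
- 4) Fine-tuning setup: split the dataset into training and validation sets, and set up the model, optimizer, and loss function.
- 5) Model training: fine-tune the model on the dataset.
- 6) Performance evaluation: evaluate the model on the validation dataset.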

# **Loading the data**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

file_path = '/content/drive/MyDrive/02)2024-1학기/2_자연어처리/00_개인 프로젝트 논문 (웹툰굿즈)/데이터 및 코드/★라벨링_통합본.xlsx'
nya_df = pd.read_excel(file_path)

# Inspect the DataFrame
nya_df.loc[302:310,:]

Unnamed: 0,brand,product_name,review,rating
302,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,무케 진짜 세상 귀엽고 내가 왜진작 이걸 안지르고 있었나 그꽛 만오처넌 있어도 없고...,1
303,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,무케 너무 귀여워요..,1
304,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,한번 떨어져서 다시 붙였는데 그 뒤로 안딸어져여 꽤 튼튼한듯 무케 귀요미 최고,1
305,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,완전 귀여워요!! 만족스,1
306,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,그냥 호랑이 형님 팬이라면 무조건 가야하는거 아닙니까!!!\n와이프에게 선물 했더니...,1
307,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,넘귀여워요 다른 무케 제품도 나왔으면 좋겠어요,1
308,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,정말 귀여워요 튼튼해요 커서 좋네요,1
309,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,한달지나도 안질려요. 너무 예뻐요 ㅠㅠ 대만족 👍,1
310,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,제꺼 사고 너무 귀여워서 선물용으로 한개 더구매요,1


In [3]:
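# Replace newline characters with spaces so they do not cause delimiter problems in the tab-separated (TSV) files written later
nya_df['review'] = nya_df['review'].apply(lambda x: x.replace('\n', ' '))

# Check that the newlines were removed
nya_df.loc[302:310,:]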

Unnamed: 0,brand,product_name,review,rating
302,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,무케 진짜 세상 귀엽고 내가 왜진작 이걸 안지르고 있었나 그꽛 만오처넌 있어도 없고...,1
303,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,무케 너무 귀여워요..,1
304,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,한번 떨어져서 다시 붙였는데 그 뒤로 안딸어져여 꽤 튼튼한듯 무케 귀요미 최고,1
305,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,완전 귀여워요!! 만족스,1
306,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,그냥 호랑이 형님 팬이라면 무조건 가야하는거 아닙니까!!! 와이프에게 선물 했더니 ...,1
307,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,넘귀여워요 다른 무케 제품도 나왔으면 좋겠어요,1
308,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,정말 귀여워요 튼튼해요 커서 좋네요,1
309,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,한달지나도 안질려요. 너무 예뻐요 ㅠㅠ 대만족 👍,1
310,호랑이형님,[GripTok] 호랑이형님(특수형) 3종,제꺼 사고 너무 귀여워서 선물용으로 한개 더구매요,1
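
The cell above only replaces `\n`. Because the reviews are later written to a tab-separated file, a stray tab or carriage return inside a review would shift columns in the same way. The following is a minimal sketch of a slightly broader cleaner; `clean_review` is a hypothetical helper name, and mapping missing values to an empty string is an assumption rather than something the notebook does.

```python
import re

def clean_review(text):
    """Return `text` with tabs, newlines and repeated spaces collapsed to single spaces."""
    # Missing values (None, float NaN) are not strings; map them to an empty string.
    if not isinstance(text, str):
        return ""
    # \s matches tabs, newlines, carriage returns and ordinary spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("무조건 가야하는거 아닙니까!!!\n와이프에게\t선물"))
```

In the notebook it could be applied as `nya_df['review'] = nya_df['review'].apply(clean_review)`.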


# **Splitting into train/test datasets**

In [4]:
from sklearn.model_selection import train_test_split

train_nya, test_nya = train_test_split(nya_df, test_size=0.2, random_state=42)
print("Train Reviews : ", len(train_nya))
print("Test_Reviews : ", len(test_nya))

Train Reviews :  14827
Test_Reviews :  3707
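
The split above is purely random, and the labels are imbalanced (the final classification report lists 432 negative and 3275 positive test reviews). A stratified split keeps the label ratio the same in both parts. The sketch below shows the idea with the standard library only; `stratified_split` is a hypothetical helper and the toy labels are made up. In practice, scikit-learn's `train_test_split` accepts a `stratify=` argument (for example `stratify=nya_df['rating']`) that does this directly.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same label ratio in both parts."""
    # Group sample indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 10 negative (0) and 90 positive (1) reviews.
toy_labels = [0] * 10 + [1] * 90
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))                 # 80 20
print(sum(toy_labels[i] == 0 for i in test_idx))     # 2
```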


# **Getting the SKTBrain/KoBERT model**


*   KoBERT reference code
*   https://complexoftaste.tistory.com/2



In [5]:
# Colab environment setup
!pip install gluonnlp pandas tqdm
!pip install mxnet
!pip install sentencepiece
!pip install transformers
!pip install torch
!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'
# Installs the kobert_tokenizer package from https://github.com/SKTBrain/KoBERT into Colab


In [6]:
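import numpy as np
np.bool = np.bool_  # compatibility shim: recent NumPy releases removed the np.bool alias that older mxnet/gluonnlp code still uses
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook  # same name as before, without the tqdm_notebook deprecation warning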

In [7]:
# ★ Import the model and tokenizer via Hugging Face
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertModel

from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
from sklearn.metrics import classification_report

In [8]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [9]:
# ★ Load the KoBERT tokenizer and model
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel = BertModel.from_pretrained('skt/kobert-base-v1', return_dict=False)
vocab = nlp.vocab.BERTVocab.from_sentencepiece(tokenizer.vocab_file, padding_token='[PAD]')




The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.



In [10]:
# Save the train data as a .tsv file, then load it back in the dataset format GluonNLP expects
train_nya = train_nya.iloc[:, 2:]
train_nya.to_csv("/content/drive/MyDrive/03)DSL/4_모델링_프로젝트/지원/train_review.tsv", sep='\t', index=False)
dataset_train = nlp.data.TSVDataset("/content/drive/MyDrive/03)DSL/4_모델링_프로젝트/지원/train_review.tsv",  num_discard_samples=1)

In [11]:
# Save the test data as a .tsv file, then load it back in the dataset format GluonNLP expects
test_nya = test_nya.iloc[:, 2:]
test_nya.to_csv("/content/drive/MyDrive/03)DSL/4_모델링_프로젝트/지원/test_review.tsv", sep='\t', index=False)
dataset_test = nlp.data.TSVDataset("/content/drive/MyDrive/03)DSL/4_모델링_프로젝트/지원/test_review.tsv",  num_discard_samples=1)

In [12]:
train_nya.loc[302:,'review':'rating']

Unnamed: 0,review,rating
302,무케 진짜 세상 귀엽고 내가 왜진작 이걸 안지르고 있었나 그꽛 만오처넌 있어도 없고...,1
17660,예뻐요.. 기대 이상입니다,1
14178,예쁘고 귀엽습니다ㅋㅋ,1
13568,완전 귀여워요 ㅠ 재구매의사 넘 많구여 모바일링 진짜 귀엽고 제 폰이랑 넘 잘어울려요^^,1
4432,사과향을 워낙에 좋아하는데 제가 좋아하는 웹툰에서 사과향 비슷하게 나는 배쓰밤을 만...,1
...,...,...
11284,마루 귀여워요 인간마루 최고,1
11964,제발 더 팔아주세요.... 너무긔야워요,1
5390,달력이 조금 작은거 빼고는 아크릴이 생각보다 크고 시소도 움직일 수 있고 만족 합니다!‘,0
860,맘에 들어요 다음에 또 주문하겠습니다,1


In [13]:
# ★
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, vocab, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, vocab=vocab, pad=pad, pair=pair)

        # Each transformed sentence is a tuple (token_ids, valid_length, segment_ids)
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        # Labels are read from the TSV as strings, so cast them to integers
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
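
Each item of `BERTDataset` is a `(token_ids, valid_length, segment_ids, label)` tuple. The sketch below mimics, in plain Python, roughly what the sentence transform is understood to produce for a single sentence: `[CLS]` and `[SEP]` are added, long inputs are truncated, and the rest is padded up to `max_len`. The function name `to_bert_inputs` and all token ids are hypothetical and chosen only for illustration; the real ids come from the KoBERT vocabulary.

```python
def to_bert_inputs(token_ids, max_len, cls_id, sep_id, pad_id):
    """Mimic the (input_ids, valid_length, segment_ids) triple for a single sentence."""
    # Leave room for [CLS] and [SEP], truncating long sentences.
    body = token_ids[: max_len - 2]
    ids = [cls_id] + body + [sep_id]
    # Number of real (non-padding) tokens.
    valid_length = len(ids)
    # Pad on the right up to max_len.
    ids = ids + [pad_id] * (max_len - valid_length)
    # A single sentence uses segment 0 everywhere.
    segment_ids = [0] * max_len
    return ids, valid_length, segment_ids

# Hypothetical token ids and special-token ids, for illustration only.
ids, valid_length, segment_ids = to_bert_inputs(
    [11, 12, 13], max_len=8, cls_id=2, sep_id=3, pad_id=1)
print(ids)            # [2, 11, 12, 13, 3, 1, 1, 1]
print(valid_length)   # 5
print(segment_ids)    # [0, 0, 0, 0, 0, 0, 0, 0]
```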

In [14]:
# Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 10
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5

In [15]:
# ★
tok = tokenizer.tokenize
data_train = BERTDataset(dataset_train, 0, 1, tok, vocab, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, vocab, max_len, True, False)

In [16]:
train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)



In [17]:
class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=2,
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device))
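        # Apply dropout only when a rate was given; otherwise use the pooled output as is
        # (previously `out` was undefined when dr_rate was None)
        out = self.dropout(pooler) if self.dr_rate else pooler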
        return self.classifier(out)
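
`gen_attention_mask` marks the first `valid_length` positions of every row with 1 and leaves the padding positions at 0, so that BERT ignores the padding. The following plain-Python illustration shows the resulting pattern; `attention_mask_rows` is a hypothetical name, and the function is meant to explain the mask, not to replace the tensor code in the class.

```python
def attention_mask_rows(valid_lengths, max_len):
    """Return one 0/1 row per sentence: 1 for real tokens, 0 for padding."""
    return [[1 if pos < length else 0 for pos in range(max_len)]
            for length in valid_lengths]

mask = attention_mask_rows([5, 3], max_len=8)
for row in mask:
    print(row)
# [1, 1, 1, 1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0, 0, 0, 0]
```

In PyTorch the same pattern can be built without a Python loop by comparing `torch.arange(max_len)` against the lengths with broadcasting, provided both tensors live on the same device.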

In [18]:
# Model
model = BERTClassifier(bertmodel,  dr_rate=0.5).to(device)

In [19]:
# Prepare the optimizer parameter groups: no weight decay for biases and LayerNorm weights
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

In [20]:
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)
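
To make the schedule concrete, the sketch below recomputes the step counts from the notebook's settings and evaluates a cosine-with-warmup multiplier at a few steps. `cosine_with_warmup` is a hypothetical re-implementation written for illustration, assuming the default half-cosine cycle: linear warmup from 0 to 1, then a cosine decay to 0. It is not the library function itself.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Values taken from this notebook.
n_train, batch_size, num_epochs = 14827, 64, 10
warmup_ratio, learning_rate = 0.1, 5e-5

steps_per_epoch = math.ceil(n_train / batch_size)   # batches per epoch (last one is partial)
t_total = steps_per_epoch * num_epochs
warmup_step = int(t_total * warmup_ratio)
print(steps_per_epoch, t_total, warmup_step)         # 232 2320 232

# Learning rate at the start, mid-warmup, the peak, the midpoint, and the end.
for step in (0, warmup_step // 2, warmup_step, t_total // 2, t_total):
    print(step, learning_rate * cosine_with_warmup(step, warmup_step, t_total))
```

With these settings the learning rate rises to its peak of 5e-5 by the end of the first epoch and then decays over the remaining nine.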



In [21]:
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
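
`calc_accuracy` returns the accuracy of a single batch, and the training loop averages these per-batch values. When the last batch is smaller than the others, that average differs slightly from the fraction of all samples classified correctly. The sketch below illustrates the difference with made-up counts; both function names are hypothetical.

```python
def batch_mean_accuracy(correct_per_batch, size_per_batch):
    """Average of per-batch accuracies, as accumulated in the training loop."""
    accs = [c / n for c, n in zip(correct_per_batch, size_per_batch)]
    return sum(accs) / len(accs)

def overall_accuracy(correct_per_batch, size_per_batch):
    """Fraction of all samples that were classified correctly."""
    return sum(correct_per_batch) / sum(size_per_batch)

# Hypothetical counts: two full batches of 64 and a short last batch of 8.
correct = [60, 62, 4]
sizes = [64, 64, 8]
print(batch_mean_accuracy(correct, sizes))   # the short batch counts as much as a full one
print(overall_accuracy(correct, sizes))      # every sample counts equally
```

With realistic batch counts the gap is usually small, but the sample-weighted figure is the one that matches `accuracy_score`.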

In [22]:
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        # valid_length stays on the CPU; it is only used to build the attention mask
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))

    # Switch the model to evaluation mode
    model.eval()

    # List collecting the predicted class probabilities
    all_probabilities = []

    # Process the test data batch by batch through the DataLoader
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        label = label.long().to(device)

        with torch.no_grad():
            # Compute the logits with the model
            logits = model(token_ids, valid_length, segment_ids)

            # Apply softmax to convert the logits into probabilities
            probabilities = F.softmax(logits, dim=-1)
            all_probabilities.extend(probabilities.cpu().numpy())

        # Reuse the logits for accuracy instead of running a second forward pass with gradients enabled
        test_acc += calc_accuracy(logits, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))

epoch 1 batch id 1 loss 0.6016029119491577 train acc 0.71875
epoch 1 batch id 201 loss 0.383847713470459 train acc 0.8812966417910447
epoch 1 train acc 0.8828829816559743
epoch 1 test acc 0.918843147282291
epoch 2 batch id 1 loss 0.22012002766132355 train acc 0.921875
epoch 2 batch id 201 loss 0.12058918923139572 train acc 0.9243625621890548
epoch 2 train acc 0.9261555859061749
epoch 2 test acc 0.9215599430157803
epoch 3 batch id 1 loss 0.19621963798999786 train acc 0.921875
epoch 3 batch id 201 loss 0.17623691260814667 train acc 0.9451181592039801
epoch 3 train acc 0.9466610490176424
epoch 3 test acc 0.9291030464640561
epoch 4 batch id 1 loss 0.1018403172492981 train acc 0.953125
epoch 4 batch id 201 loss 0.11180564016103745 train acc 0.9641635572139303
epoch 4 train acc 0.9656190482157178
epoch 4 test acc 0.9199207334891876
epoch 5 batch id 1 loss 0.19088003039360046 train acc 0.96875
epoch 5 batch id 201 loss 0.10673915594816208 train acc 0.9770677860696517
epoch 5 train acc 0.9779094827586207
epoch 5 test acc 0.901332371420222
epoch 6 batch id 1 loss 0.1488846391439438 train acc 0.953125
epoch 6 batch id 201 loss 0.26481541991233826 train acc 0.9860074626865671
epoch 6 train acc 0.9852849964915799
epoch 6 test acc 0.9320664085330216
epoch 7 batch id 1 loss 0.12843286991119385 train acc 0.953125
epoch 7 batch id 201 loss 0.1203262135386467 train acc 0.9923818407960199
epoch 7 train acc 0.9924568965517241
epoch 7 test acc 0.9307194257744009
epoch 8 batch id 1 loss 0.19039934873580933 train acc 0.96875
epoch 8 batch id 201 loss 0.13434205949306488 train acc 0.9951803482587065
epoch 8 train acc 0.9953529094827587
epoch 8 test acc 0.9323358050847458
epoch 9 batch id 1 loss 0.14245665073394775 train acc 0.96875
epoch 9 batch id 201 loss 0.09536200761795044 train acc 0.9975901741293532
epoch 9 train acc 0.9975080818965517
epoch 9 test acc 0.932312974868498
epoch 10 batch id 1 loss 0.12376844882965088 train acc 0.96875
epoch 10 batch id 201 loss 0.08409203588962555 train acc 0.9974347014925373
epoch 10 train acc 0.9973733836206896
epoch 10 test acc 0.9320435783167738


In [23]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(model, dataloader, device):
    model.eval()  # Put the model in evaluation mode
    predictions, true_labels = [], []

    with torch.no_grad():  # No gradients are needed during evaluation
        for batch in dataloader:
            # Move the batch to the target device
            token_ids, valid_length, segment_ids, label = batch
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)

            # Forward pass
            outputs = model(token_ids, valid_length, segment_ids)

            # Use the index with the highest logit as the predicted class
            logits = outputs.detach().cpu().numpy()
            label_ids = label.cpu().numpy()

            # Store the predictions and the true labels
            predictions.extend(logits.argmax(axis=-1))
            true_labels.extend(label_ids)

    # Compute the metrics; average='binary' reports the positive class (label 1) only
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary')
    accuracy = accuracy_score(true_labels, predictions)
    cls_report = classification_report(true_labels, predictions)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "classification_report": cls_report
    }

# Evaluate the model
evaluation_metrics = evaluate(model, test_dataloader, device)
print(f"Accuracy: {evaluation_metrics['accuracy']:.4f}")
print(f"Precision: {evaluation_metrics['precision']:.4f}")
print(f"Recall: {evaluation_metrics['recall']:.4f}")
print(f"F1 Score: {evaluation_metrics['f1']:.4f}")
print("\nClassification Report:\n", evaluation_metrics['classification_report'])


Accuracy: 0.9320
Precision: 0.9627
Recall: 0.9603
F1 Score: 0.9615

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.72      0.71       432
           1       0.96      0.96      0.96      3275

    accuracy                           0.93      3707
   macro avg       0.83      0.84      0.84      3707
weighted avg       0.93      0.93      0.93      3707
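
The headline precision, recall and F1 above are computed with `average='binary'`, which scores only the positive class (label 1). That is why they match the row for class 1, while the row for class 0 is much lower. The sketch below shows how per-class values follow from true and predicted labels; `per_class_metrics` is a hypothetical helper and the labels are made up, not taken from the notebook's data.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 8 positive and 2 negative reviews, with 2 mistakes.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

for cls in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, positive=cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# class 0: precision=0.50 recall=0.50 f1=0.50
# class 1: precision=0.88 recall=0.88 f1=0.88
```

With imbalanced labels like these, the minority-class row and the macro average say more about the classifier than accuracy alone.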

