## Project Goals
---

1. Loaded the model and data successfully and confirmed that they work.
 * Fine-tuned klue/bert-base on the NSMC dataset and confirmed that the model runs correctly.
2. Improved preprocessing and raised model performance through fine-tuning.
 * Raised validation accuracy to 90% or higher.
3. Applied bucketing to model training and compared the results.
 * Trained with bucketing to check whether a trade-off arises between computation speed and model performance during fine-tuning, and presented the analysis.

### Library imports

In [19]:
import os
import tensorflow as tf
import numpy as np
import transformers
import datasets
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
print(tf.config.list_physical_devices('GPU'))
print(tf.test.gpu_device_name())
os.environ["TOKENIZERS_PARALLELISM"] = "true"

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/device:GPU:0


2025-03-19 04:41:49.647582: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-03-19 04:41:49.647644: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


### STEP 1. Analyzing the NSMC data and building the Huggingface dataset
---
The Huggingface nsmc dataset is organized as shown below.

The dataset dictionary contains a train dataset and a test dataset, and each Dataset has the columns 'id', 'document', and 'label'.

In [20]:
huggingface_nsmc_dataset = load_dataset('nsmc')
print(huggingface_nsmc_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})


In [21]:
train = huggingface_nsmc_dataset['train']
cols = train.column_names
cols

['id', 'document', 'label']

In [22]:
for i in range(5):
    for col in cols:
        print(col, ":", train[col][i])
    print('\n')

id : 9976970
document : 아 더빙.. 진짜 짜증나네요 목소리
label : 0


id : 3819312
document : 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
label : 1


id : 10265843
document : 너무재밓었다그래서보는것을추천한다
label : 0


id : 9045019
document : 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
label : 0


id : 6483659
document : 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
label : 1




In [23]:
huggingface_nsmc_dataset = huggingface_nsmc_dataset.remove_columns(["id"])
huggingface_nsmc_dataset = huggingface_nsmc_dataset.rename_column("label", "labels")

In [24]:
train = huggingface_nsmc_dataset['train']
cols = train.column_names
cols

['document', 'labels']

### STEP 2. Loading the klue/bert-base model and tokenizer
---
* output_loading_info=True: To show detailed information about loaded and skipped weights.
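
To make the fields of the loading info concrete, here is a minimal sketch of how the missing and unexpected key lists relate to each other. It uses plain Python sets rather than the transformers loader, and the key names are a shortened, illustrative subset, so treat it as a model of the idea and not as the library's actual loading logic.

```python
# Illustrative sketch (not the transformers implementation): compare the
# parameter names a model expects with the names stored in a checkpoint.

def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, sorted for stable output."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in model, not in checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # in checkpoint, not in model
    return missing, unexpected

# Hypothetical, shortened key lists: a classification model loaded from a
# checkpoint that was saved with pre-training heads.
model_keys = [
    "bert.embeddings.word_embeddings.weight",
    "classifier.weight",
    "classifier.bias",
]
checkpoint_keys = [
    "bert.embeddings.word_embeddings.weight",
    "cls.predictions.bias",
    "cls.seq_relationship.weight",
]

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

Under this reading, the missing keys are the ones that get newly initialized (here the classifier head), while the unexpected keys are checkpoint weights the classification model has no use for.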

In [25]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, loading_info = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2 , output_loading_info=True)
# tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
# # Load the model with loading info
# model, loading_info = AutoModelForMaskedLM.from_pretrained("klue/bert-base", output_loading_info=True)

# Inspect missing weights (expected by the model but absent from the checkpoint, so newly initialized)
print("Missing weights:", loading_info["missing_keys"])

# Inspect unused weights (weights in the checkpoint not used by the model)
print("Unused weights:", loading_info["unexpected_keys"])

# Inspect any mismatched weights
print("Mismatched weights:", loading_info["mismatched_keys"])

# Inspect any error messages during loading
print("Error messages:", loading_info["error_msgs"])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Missing weights: ['classifier.bias', 'classifier.weight']
Unused weights: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
Mismatched weights: []
Error messages: []


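### STEP 3. Preprocessing the dataset with the tokenizer loaded above and training the model
---
* MAX_LENGTH: 40
* return_token_type_ids controls whether the tokenizer returns the segment ids that distinguish sentences when the input contains more than one sentence. This task uses single sentences, so the field is dropped.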
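
As a rough illustration of what truncation and fixed-length padding do to a single example, the sketch below works on already-tokenized ids in plain Python. The helper `pad_or_truncate` is hypothetical and only mimics the shape of the tokenizer output; the pad id of 0 is taken from the zeros that appear in the tokenized output of this notebook, and the truncation is simplified (a real tokenizer keeps the final special token).

```python
# Sketch of truncation plus fixed-length padding on already-tokenized ids.
# PAD_ID = 0 is an assumption based on the zeros in this notebook's input_ids.
MAX_LENGTH = 40
PAD_ID = 0

def pad_or_truncate(ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut `ids` to max_length, then right-pad and build the attention mask."""
    ids = ids[:max_length]            # simplified truncation
    n_pad = max_length - len(ids)     # number of pad positions to append
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# Token ids of the first training example, copied from the notebook output.
first_example = [2, 1376, 831, 2604, 18, 18, 4229, 9801, 2075, 2203, 2182, 4243, 3]
encoded = pad_or_truncate(first_example)
print(len(encoded["input_ids"]), sum(encoded["attention_mask"]))
```

The attention mask marks the 13 real tokens with 1 and the 27 pad positions with 0, which is the pattern visible in the printed `attention_mask` of the first example.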

In [26]:
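MAX_LENGTH = 40
def tokenize_function(data):
    # padding="max_length" pads every example to exactly MAX_LENGTH tokens;
    # padding=True would only pad to the longest example in each map batch.
    return tokenizer(data["document"], truncation=True, padding="max_length", max_length=MAX_LENGTH,
        return_token_type_ids=False)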

In [27]:
tokenized_datasets = huggingface_nsmc_dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_datasets['train']
test_dataset = tokenized_datasets['test']

In [28]:
print(train_dataset.shape)

(150000, 4)


In [29]:
print(train_dataset[:3])

{'document': ['아 더빙.. 진짜 짜증나네요 목소리', '흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나', '너무재밓었다그래서보는것을추천한다'], 'labels': [0, 1, 0], 'input_ids': [[2, 1376, 831, 2604, 18, 18, 4229, 9801, 2075, 2203, 2182, 4243, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1963, 18, 18, 18, 11811, 2178, 2088, 28883, 16516, 2776, 18, 18, 18, 18, 10737, 2156, 2015, 2446, 2232, 6758, 2118, 1380, 6074, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}


**Training with Trainer**

---
To use Trainer, the training settings must be specified in advance through TrainingArguments.


In [30]:
# Split into Training and Validation
split_datasets = train_dataset.train_test_split(test_size=0.2, seed=42)

# Access the split datasets
train_split = split_datasets["train"]
eval_split = split_datasets["test"]

In [66]:
print(train_split.shape)
print(eval_split.shape)

(120000, 4)
(30000, 4)


In [31]:
output_dir = os.getenv('HOME')+'/aiffel/transformers'

# Original arguments from the lecture, kept for reference
# training_args = TrainingArguments(
#     output_dir,                        # directory where outputs are saved
#     evaluation_strategy="epoch",       # how often to evaluate
#     learning_rate = 2e-5,              # learning rate
#     per_device_train_batch_size = 8,   # training batch size per device
#     per_device_eval_batch_size = 8,    # evaluation batch size per device
#     num_train_epochs = 3,              # total number of training epochs
#     weight_decay = 0.01,               # weight decay
# )

training_args = TrainingArguments(
    output_dir,                     # directory where outputs are saved
    evaluation_strategy="epoch",    # how often to evaluate
    learning_rate=2e-5,             # learning rate
    per_device_train_batch_size=16, # training batch size per device
    per_device_eval_batch_size=64,  # evaluation batch size per device
    num_train_epochs=3,             # total number of training epochs
    warmup_steps=500,               # warmup steps for the learning rate scheduler
    do_train=True,                  # whether to run training
    do_eval=True,                   # whether to run evaluation
    eval_steps=1000,                # only used when evaluation_strategy="steps"
    group_by_length=False,          # no bucketing in this baseline run
    weight_decay=0.01,              # weight decay
)

In [32]:
from evaluate import load

metric = load('glue', 'mrpc')

def compute_metrics(eval_pred):    
    predictions,labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    return metric.compute(predictions=predictions, references = labels)
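
For readers who want to see what the two reported numbers mean, here is a hand-rolled sketch of accuracy and binary F1 (positive class = 1) computed from logits. It is not the code of the `evaluate` metric; the function names and the logits are made up for illustration, and the assumption is that the metric loaded above reports these two quantities, as the Accuracy and F1 columns of the training log suggest.

```python
# Hand-rolled accuracy and binary F1, as a sketch of the reported metrics.

def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels):
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)   # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)   # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)   # false negatives
    accuracy = sum(p == y for p, y in pairs) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up logits for four reviews; each row is [negative score, positive score].
logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-0.5, 1.5]]
labels = [0, 1, 1, 1]
scores = accuracy_and_f1(logits, labels)
print(scores)
```

Accuracy counts every correct prediction equally, while F1 only looks at the positive class, so the two can diverge when the classes are imbalanced. NSMC is close to balanced, which is consistent with the two columns staying near each other in the log.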

In [33]:
trainer = Trainer(
    model=model,                  # model to train
    args=training_args,           # arguments configured through TrainingArguments
    train_dataset=train_split,    # training dataset
    eval_dataset=eval_split,      # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2627,0.257416,0.896133,0.896244
2,0.1946,0.301868,0.899133,0.899388
3,0.1264,0.397808,0.900933,0.901144


TrainOutput(global_step=22500, training_loss=0.21029578145345051, metrics={'train_runtime': 2085.4675, 'train_samples_per_second': 172.623, 'train_steps_per_second': 10.789, 'total_flos': 7399998432000000.0, 'train_loss': 0.21029578145345051, 'epoch': 3.0})
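
The `global_step=22500` in the output above can be reproduced from the split size and the training arguments. The sketch below does that arithmetic and also evaluates a linear warmup followed by linear decay, which is assumed here to be the scheduler in effect (no scheduler type was set in the arguments). It further assumes a single device and no gradient accumulation; `linear_warmup_lr` is an illustrative helper, not a library function.

```python
import math

# Values taken from this notebook's split and TrainingArguments.
num_train_samples = 120_000
batch_size = 16
num_epochs = 3
warmup_steps = 500
base_lr = 2e-5

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_warmup_lr(step):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(total_steps)
for step in (250, 500, total_steps):
    print(step, linear_warmup_lr(step))
```

With 120,000 training samples and a batch size of 16, each epoch takes 7,500 optimizer steps, so three epochs give the 22,500 steps reported by the Trainer.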

In [34]:
trainer.save_model(output_dir + "/manual_saved_model")

In [35]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.39560624957084656,
 'eval_accuracy': 0.90034,
 'eval_f1': 0.9019306843006436,
 'eval_runtime': 30.6661,
 'eval_samples_per_second': 1630.465,
 'eval_steps_per_second': 25.5,
 'epoch': 3.0}

In [59]:
del model

### STEP 4. Improving model performance (accuracy) through fine-tuning
---

### STEP 5. Training with bucketing and comparing against the STEP 4 results
---
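
Bucketing here means batching sequences of similar length together so that dynamic padding adds fewer pad tokens per batch. The following sketch counts how many token positions a model would process under three schemes: fixed padding to 40, dynamic padding with arbitrary batch order, and dynamic padding after sorting by length. The lengths are synthetic (not the real NSMC distribution), and the full sort is a simplification: the Trainer's length grouping is assumed to be less strict than a global sort.

```python
import random

def padded_tokens(lengths, batch_size, fixed_length=None):
    """Total token positions when each batch is padded to `fixed_length`,
    or to the longest sequence in the batch when `fixed_length` is None."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total

random.seed(0)
# Synthetic sequence lengths between 3 and 40 tokens.
lengths = [random.randint(3, 40) for _ in range(1600)]

fixed = padded_tokens(lengths, 16, fixed_length=40)   # pad everything to 40
dynamic = padded_tokens(lengths, 16)                  # dynamic padding, arbitrary order
bucketed = padded_tokens(sorted(lengths), 16)         # dynamic padding, similar lengths together
real = sum(lengths)                                   # tokens that are not padding

print("fixed:", fixed, "dynamic:", dynamic, "bucketed:", bucketed, "real:", real)
```

The gap between `bucketed` and `real` is the padding that remains even with length grouping, so the potential saving depends heavily on how spread out the sequence lengths are relative to MAX_LENGTH.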

In [60]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, loading_info = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2 , output_loading_info=True)
# tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
# # Load the model with loading info
# model, loading_info = AutoModelForMaskedLM.from_pretrained("klue/bert-base", output_loading_info=True)

# Inspect missing weights (expected by the model but absent from the checkpoint, so newly initialized)
print("Missing weights:", loading_info["missing_keys"])

# Inspect unused weights (weights in the checkpoint not used by the model)
print("Unused weights:", loading_info["unexpected_keys"])

# Inspect any mismatched weights
print("Mismatched weights:", loading_info["mismatched_keys"])

# Inspect any error messages during loading
print("Error messages:", loading_info["error_msgs"])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Missing weights: ['classifier.bias', 'classifier.weight']
Unused weights: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
Mismatched weights: []
Error messages: []


In [61]:
MAX_LENGTH = 40
def tokenize_bucket(data):
    return tokenizer(data["document"], truncation=True, padding=False, max_length=MAX_LENGTH, 
        return_token_type_ids = False)
    
tokenized_bucket_datasets = huggingface_nsmc_dataset.map(tokenize_bucket, batched=True)
train_bucket_dataset = tokenized_bucket_datasets['train']
test_bucket_dataset = tokenized_bucket_datasets['test']

split_bucket_datasets = train_bucket_dataset.train_test_split(test_size=0.2, seed=42)

# Access the split datasets
train_bucket_split = split_bucket_datasets["train"]
eval_bucket_split = split_bucket_datasets["test"]

In [62]:
training_bucket_args = TrainingArguments(
    output_dir,                     # directory where outputs are saved
    evaluation_strategy="epoch",    # how often to evaluate
    learning_rate=2e-5,             # learning rate
    per_device_train_batch_size=16, # training batch size per device
    per_device_eval_batch_size=64,  # evaluation batch size per device
    num_train_epochs=3,             # total number of training epochs
    group_by_length=True,           # batch samples of similar length together (bucketing)
    weight_decay=0.01,              # weight decay
)

In [63]:
from transformers import DataCollatorWithPadding

# Initialize data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Pass the collator to the Trainer
bucket_trainer = Trainer(
    model=model,
    args=training_bucket_args,
    train_dataset=train_bucket_split,
    eval_dataset=eval_bucket_split,
    data_collator=data_collator,  # Dynamic padding applied here
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

bucket_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2584,0.263224,0.8955,0.896237
2,0.2015,0.30587,0.898767,0.900109
3,0.1214,0.400213,0.8995,0.899904


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=22500, training_loss=0.20572020467122396, metrics={'train_runtime': 2169.6068, 'train_samples_per_second': 165.929, 'train_steps_per_second': 10.371, 'total_flos': 3740584096289280.0, 'train_loss': 0.20572020467122396, 'epoch': 3.0})

In [64]:
bucket_trainer.save_model(output_dir + "/bucket_saved_model")

In [42]:
bucket_trainer.evaluate(test_dataset)

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


{'eval_loss': 0.3911927044391632,
 'eval_accuracy': 0.89982,
 'eval_f1': 0.9012518482010843,
 'eval_runtime': 29.9782,
 'eval_samples_per_second': 1667.878,
 'eval_steps_per_second': 26.086,
 'epoch': 3.0}

## Test Results

---
|Metric|NB (no bucketing)|B (bucketing)|sst2 (metric changed)|
|:---|---:|---:|---:|
|Train dataset size|120,000|120,000|120,000|
|Eval dataset size|30,000|30,000|30,000|
|Test dataset size|50,000|50,000|50,000|
|Global step|22,500|22,500|22,500|
|Training time (mm:ss)|34:45|36:29|34:21|
|Training loss|0.210|0.210|0.082|
|Validation accuracy|0.901|0.900|0.897|
|Validation F1 score|0.901|0.900|X|
|Training runtime (s)|2,085|2,190|2,058|
|Training samples per second|172.6|164.3|174.8|
|Training steps per second|10.78|10.27|10.93|
|Test time (mm:ss)|00:30|00:29|00:29|
|Test loss|0.395|0.391|0.648|
|Test accuracy|0.900|0.900|0.896|
|Test F1 score|0.902|0.901|X|
|Padding|Fixed at 40|Dynamic length|Fixed at 40|

1. ***First run: raising the argument values from the lecture brought accuracy and F1 to 90% or higher***
   * Arguments changed: per_device_train_batch_size = 8 --> 16, per_device_eval_batch_size = 8 --> 64
   * tokenizer's MAX_LENGTH = 40 ==> chosen with reference to the length distribution of the Naver movie sentiment analysis data used in a previous project.
  ![mrpc_result](./mrpc_result.png)

2. ***Bucketing results***
    * Training was expected to be faster but took slightly longer (36:29 with bucketing versus 34:45 without).
    * Accuracy and F1 score were also slightly lower.
![mrpc_bucketing_result](./mrpc_bucketing_result.png)

3. ***Same conditions with the metric changed (mrpc --> sst2)***
    * Accuracy dropped slightly.
![sst2_first_result](./sst2_first_result.png)
![sst2_test_first_result](./sst2_test_first_result.png)

4. ***Information confirmed from the tokenizer config file after the run***
    * vocab size:32000, tensor type:pt
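
One way to look at the bucketing result is to compare the compute estimate and the wall-clock time that the two printed `TrainOutput` lines report. The sketch below only does that arithmetic on figures copied from this notebook; `total_flos` is understood to be an estimate produced by the Trainer, so the ratio is indicative at best.

```python
# Figures copied from the two TrainOutput lines printed in this notebook.
no_bucket = {"total_flos": 7399998432000000.0, "train_runtime": 2085.4675}
bucket = {"total_flos": 3740584096289280.0, "train_runtime": 2169.6068}

# Ratios below 1 mean the bucketing run needed less of that resource.
flos_ratio = bucket["total_flos"] / no_bucket["total_flos"]
runtime_ratio = bucket["train_runtime"] / no_bucket["train_runtime"]

print(f"FLOs ratio (bucketing / fixed padding): {flos_ratio:.3f}")
print(f"Runtime ratio (bucketing / fixed padding): {runtime_ratio:.3f}")
```

The estimated compute drops to roughly half with dynamic padding while the runtime does not drop at all, which suggests that padding was in fact reduced and that the bottleneck on this setup lies elsewhere. A plausible but unverified explanation is fixed per-step overhead dominating at sequence lengths of 40 or less.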


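## Retrospective
 * Accuracy came out higher with the sentence-pair similarity metric (mrpc) than with the sentiment analysis metric (sst2). My guess is that this is because the model class is AutoModelForSequenceClassification and the binary classification parameter (num_labels = 2) was set, but this needs further study.
 * When other parameters were adjusted without deleting the model, the train loss kept decreasing. For a fair comparison, the model should be deleted and re-created before running again with a changed metric.
 * Judging from the bucketing results, bucketing does not seem to have been applied properly, so no effect appeared. Related examples and the API need further analysis.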