<a href="https://colab.research.google.com/github/hanghae-plus-AI/AI-1-jhyeon-kim/blob/main/Chapter3_%EC%8B%AC%ED%99%94%EA%B3%BC%EC%A0%9C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

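# Advanced assignment
This assignment is to train a model with HuggingFace that solves MNLI, one of the standard natural language tasks. MNLI can be summarized as follows.

- **Input**: Two sentences are given: a premise and a hypothesis.
- **Output:** This is a classification problem; given the two sentences, predict one of the following three labels.
    - **Entailment:** The hypothesis follows logically from the premise.
    - **Neutral:** The premise neither supports nor contradicts the hypothesis.
    - **Contradiction:** The hypothesis logically contradicts the premise.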
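
As a small illustration of the three output classes, the sketch below maps integer label ids to class names. It assumes the GLUE MNLI encoding 0 = entailment, 1 = neutral, 2 = contradiction; `ID2LABEL`, `LABEL2ID`, and `label_name` are hypothetical helper names, and the ordering is worth confirming after loading via `mnli_ds["train"].features["label"].names`.

```python
# Assumed GLUE MNLI label encoding (confirm against the loaded dataset's features).
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def label_name(label_id: int) -> str:
    """Return the class name for an integer label id."""
    return ID2LABEL[label_id]

print([label_name(i) for i in range(3)])
```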

Produce a Colab notebook that meets the following requirements:

- [x]  Load the dataset with `load_dataset("nyu-mll/glue", "mnli")`.
    - Use only the `train` split for training. The other splits may not be used.
    - If you need validation data, take it from the `train` split.
- [x]  The training log produced by `trainer.train()` must remain in the notebook.
- [x]  Print the performance on the dataset's `validation_matched` split; accuracy must exceed 50%.

As in the previous assignment, you are free to change whether validation data is used, the model architecture, hyperparameters, and so on, as long as the conditions above are met.

# Dataset preparation

In [1]:
!pip install transformers datasets evaluate accelerate scikit-learn

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [2]:
import random
import evaluate
import numpy as np

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
from datasets import load_dataset

# Load the MNLI dataset
mnli_ds = load_dataset("nyu-mll/glue", "mnli")
mnli_ds

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

In [4]:
# Take the train split used for training
mnli_train = mnli_ds["train"]

# Hold out a validation split from the train split
mnli_split = mnli_train.train_test_split(test_size=0.1)
mnli_train, mnli_val = mnli_split['train'], mnli_split['test']

# Prepare validation_matched (evaluation data from the same genres as the training data)
mnli_validation_matched = mnli_ds["validation_matched"]

print(len(mnli_train), len(mnli_val), len(mnli_validation_matched))

353431 39271 9815


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def preprocess_function(data):
    return tokenizer(data["premise"], data["hypothesis"], truncation=True)

mnli_train_tokenized = mnli_train.map(preprocess_function, batched=True)
mnli_val_tokenized = mnli_val.map(preprocess_function, batched=True)
mnli_validation_matched_tokenized = mnli_validation_matched.map(preprocess_function, batched=True)



In [6]:
mnli_train_tokenized[0].keys()

dict_keys(['premise', 'hypothesis', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])

# Attempt 1. Using from_pretrained (a long struggle)

In [7]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

training_args = TrainingArguments(
    output_dir='mnli_transformer',  # directory for checkpoints and logs
    num_train_epochs=10,  # number of epochs
    per_device_train_batch_size=32,  # training batch size
    per_device_eval_batch_size=32,  # evaluation batch size
    logging_strategy="epoch",
    evaluation_strategy="epoch",  # evaluate after every epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [9]:
for param in model.bert.parameters():
    param.requires_grad = False

In [10]:
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(pred):
    logits, labels = pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


In [11]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

In [12]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0878,1.076591,0.395534
2,1.0813,1.072777,0.403122
3,1.0771,1.069807,0.409233


KeyboardInterrupt: 

The accuracy gain per epoch looks far too small!
At this rate accuracy will not reach 0.5 within 10 epochs, so raise the lr and continue training!
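
One way to read the log above: for a 3-class problem, a model that predicts a uniform distribution has cross-entropy ln 3 ≈ 1.0986 and, on roughly balanced classes, about 33% accuracy. A training loss near 1.08 is therefore only slightly better than guessing. The sketch below computes that baseline; `uniform_cross_entropy` is a hypothetical helper, and the balanced-class assumption is only approximate for MNLI.

```python
import math

def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of predicting every class with equal probability."""
    return math.log(num_classes)

baseline_loss = uniform_cross_entropy(3)
# Training losses logged for epochs 1-3 above.
logged_losses = [1.0878, 1.0813, 1.0771]

for epoch, loss in enumerate(logged_losses, start=1):
    # A small gap means the model is barely better than uniform guessing.
    print(f"epoch {epoch}: loss={loss:.4f}, gap to uniform={baseline_loss - loss:.4f}")
```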

In [13]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',  # directory for checkpoints and logs
    num_train_epochs=10,  # total number of epochs
    per_device_train_batch_size=32,  # training batch size
    per_device_eval_batch_size=32,  # evaluation batch size
    logging_strategy="epoch",  # logging strategy
    evaluation_strategy="epoch",  # evaluation strategy
    save_strategy="epoch",  # checkpoint saving strategy
    learning_rate=5e-5,  # new, larger learning rate
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    resume_from_checkpoint=True  # intended to resume from the latest checkpoint
)



In [14]:
trainer = Trainer(
    model=model,  # the model trained so far
    args=training_args,  # updated training arguments
    train_dataset=mnli_train_tokenized,  # training dataset
    eval_dataset=mnli_val_tokenized,  # validation dataset
    compute_metrics=compute_metrics,  # metric function
    tokenizer=tokenizer
)

In [15]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
4,1.0748,1.069618,0.409997


KeyboardInterrupt: 

- Progress (wall-clock time) is too slow, so increase the batch size and keep training!
- The learning speed (accuracy gain) still feels too cautious, so raise the lr further as well

In [16]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',  # directory for checkpoints and logs
    num_train_epochs=10,  # total number of epochs
    per_device_train_batch_size=128,  # training batch size
    per_device_eval_batch_size=128,  # evaluation batch size
    logging_strategy="epoch",  # logging strategy
    evaluation_strategy="epoch",  # evaluation strategy
    save_strategy="epoch",  # checkpoint saving strategy
    learning_rate=1e-3,  # new, larger learning rate
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    resume_from_checkpoint=True  # intended to resume from the latest checkpoint
)



In [17]:
trainer = Trainer(
    model=model,  # the model trained so far
    args=training_args,  # updated training arguments
    train_dataset=mnli_train_tokenized,  # training dataset
    eval_dataset=mnli_val_tokenized,  # validation dataset
    compute_metrics=compute_metrics,  # metric function
    tokenizer=tokenizer
)

In [18]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
	per_device_train_batch_size: 128 (from args) != 32 (from trainer_state.json)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
5,1.0738,1.067191,0.416109


KeyboardInterrupt: 

Increase the batch size further so training goes a bit faster..!

In [19]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',  # directory for checkpoints and logs
    num_train_epochs=10,  # total number of epochs
    per_device_train_batch_size=256,  # training batch size
    per_device_eval_batch_size=256,  # evaluation batch size
    logging_strategy="epoch",  # logging strategy
    evaluation_strategy="epoch",  # evaluation strategy
    save_strategy="epoch",  # checkpoint saving strategy
    learning_rate=1e-3,  # keep lr at about this level
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    resume_from_checkpoint=True  # intended to resume from the latest checkpoint
)



In [20]:
trainer = Trainer(
    model=model,  # the model trained so far
    args=training_args,  # updated training arguments
    train_dataset=mnli_train_tokenized,  # training dataset
    eval_dataset=mnli_val_tokenized,  # validation dataset
    compute_metrics=compute_metrics,  # metric function
    tokenizer=tokenizer
)

In [21]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
	per_device_train_batch_size: 256 (from args) != 32 (from trainer_state.json)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
6,1.0725,1.067981,0.411067


KeyboardInterrupt: 

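Hmm.. validation loss actually went up a little and accuracy dropped too..

- Lower the lr again
- It still takes a long time, and looking back I never called model.to('cuda')..! Move the model to the GPU and rerun (Trainer normally places the model on the available device by itself, so this may not be the real bottleneck)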

In [22]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=10,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    load_best_model_at_end=True,
    resume_from_checkpoint=True  # intended to resume from the latest checkpoint
)



In [23]:
model = model.to('cuda')

In [24]:
trainer = Trainer(
    model=model,  # model moved to the GPU
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [25]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
7,1.0723,1.067642,0.410506


KeyboardInterrupt: 

- It still takes a long time, so try to speed things up!
  - Add dataloader_num_workers
- More epochs will probably be needed, so increase them

In [27]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=4,  # enable parallel data loading
)



In [28]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [29]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
8,1.0713,1.067854,0.413562
9,1.071,1.065827,0.416898
10,1.0703,1.065005,0.417331


KeyboardInterrupt: 

- At this pace 0.5 is ages away.. so raise the lr again
- Raise the batch size to 512 to go faster

In [30]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=4,  # enable parallel data loading
)



In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [32]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
	per_device_train_batch_size: 512 (from args) != 32 (from trainer_state.json)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
11,1.0695,1.065116,0.420947


KeyboardInterrupt: 

Feeling a bit uneasy, so save the model + copy it to Google Drive too

In [33]:
trainer.save_model()

In [34]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [38]:
!cp -r /content/mnli_transformer /content/drive/MyDrive/saved_models/

- Try more dataloader_num_workers
- Also raise lr to 1e-3 (waiting for each epoch is too frustrating)

In [39]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-3,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
)



In [40]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [41]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
12,1.0693,1.064857,0.416618


KeyboardInterrupt: 

- Lower the lr again
- Running on an A100, so the GPU seems to have headroom..? Try batch size 1024

In [42]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
)



In [43]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [44]:
trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
	per_device_train_batch_size: 1024 (from args) != 32 (from trainer_state.json)
  checkpoint_rng_state = torch.load(rng_file)


Epoch,Training Loss,Validation Loss,Accuracy
13,1.0691,1.06425,0.4211


KeyboardInterrupt: 

### Something odd
The output and the speed suggest the batch size is not being applied at all...!

[143659/165675 09:16 < 18:21, 19.98 it/s, Epoch 13.01/15]

It looks like resume_from_checkpoint should be set to False.
The warning below probably also means the new settings were never applied.
```
Warning: The following arguments do not match the ones in the `trainer_state.json` within the checkpoint directory:
	per_device_train_batch_size: 1024 (from args) != 32 (from trainer_state.json)
```
If so, start again from increasing the batch size to 128!
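
The progress bar can be checked against the dataset size. Assuming a single GPU and no gradient accumulation, the total number of optimizer steps is ceil(train examples / batch size) × epochs. The sketch below (with a hypothetical helper `total_steps`) suggests that a total of 165675 steps corresponds to batch size 32 rather than 1024, which is consistent with the warning.

```python
import math

def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

NUM_TRAIN = 353431  # size of the training portion after the 90/10 split
EPOCHS = 15

for batch_size in (32, 128, 1024):
    print(batch_size, total_steps(NUM_TRAIN, batch_size, EPOCHS))
```

If the batch size of 1024 had really been applied, the progress bar would show a total in the low thousands instead of 165675.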


In [45]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=4,  # enable parallel data loading
)



In [46]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [47]:
trainer.train(resume_from_checkpoint=False) # try False here (the in-memory model keeps its trained weights, so this is safe)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0703,1.064456,0.423111


KeyboardInterrupt: 

Feeling uneasy, so save midway..

In [48]:
trainer.save_model()

In [49]:
!cp -r /content/mnli_transformer /content/drive/MyDrive/saved_models/

In [50]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
)



In [51]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [52]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0684,1.063285,0.425403


KeyboardInterrupt: 

In [53]:
trainer.save_model()

In [54]:
trainer.evaluate(mnli_validation_matched_tokenized)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0684,1.062971,0.42863


{'eval_loss': 1.062970757484436, 'eval_accuracy': 0.42862964849719815}

Still a long way to go.. how can training go faster?

In [55]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-3, # raise it again..
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    fp16=True, # enable fp16
)



In [56]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [57]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0824,1.063037,0.422169
2,1.0826,1.060087,0.425861
3,1.0814,1.066181,0.418884
4,1.0794,1.06769,0.41481


KeyboardInterrupt: 

In [58]:
trainer.evaluate(mnli_validation_matched_tokenized)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0824,1.063037,0.422169
2,1.0826,1.060087,0.425861
3,1.0814,1.066181,0.418884
4,1.0794,1.058301,0.433826


{'eval_loss': 1.0583009719848633, 'eval_accuracy': 0.43382577687213447}

With lr at 1e-3 training feels unstable... lower the lr again.

But do increase the batch size!


In [60]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4, # lower it again..
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    fp16=True, # enable fp16
)



In [61]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [62]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0677,1.057827,0.43592
2,1.0676,1.057961,0.435258
3,1.067,1.057779,0.435283


KeyboardInterrupt: 

In [63]:
trainer.save_model()

In [64]:
trainer.evaluate(mnli_validation_matched_tokenized)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0677,1.057827,0.43592
2,1.0676,1.057961,0.435258
3,1.067,1.058862,0.431686


{'eval_loss': 1.0588622093200684, 'eval_accuracy': 0.4316861946001019}

Couldn't hold back.. raise the lr again + larger batch size

In [65]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-3,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    fp16=True, # enable fp16
)



In [66]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [67]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0739,1.05941,0.432304
2,1.0726,1.058116,0.433475
3,1.0719,1.06827,0.393904


KeyboardInterrupt: 

In [68]:
trainer.evaluate(mnli_validation_matched_tokenized)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0739,1.05941,0.432304
2,1.0726,1.058116,0.433475
3,1.0719,1.063773,0.42812


{'eval_loss': 1.0637727975845337, 'eval_accuracy': 0.4281202241467142}


### Starting to doubt whether cranking up the batch size just to fill GPU memory is the right criterion..

[Link I looked up](https://www.linkedin.com/advice/0/what-best-batch-size-optimizing-deep-learning-kw4if)

smaller batch sizes (like 32 or 64) offer more robust learning with noisier gradients, while larger batch sizes (like 128 or 256) provide faster, but potentially less stable, training.

According to this.. maybe the batch size should be reduced.


In [69]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=15,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-3,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
)



In [70]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [71]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0916,1.06323,0.419724
2,1.0902,1.090477,0.346388
3,1.0892,1.063227,0.414912
4,1.0865,1.06772,0.390135
5,1.0858,1.066102,0.422755


KeyboardInterrupt: 

Hmm... validation loss sometimes rises midway, so this looks somewhat like overfitting.

And accuracy is far too unstable..

For now, add weight decay to counter overfitting,
and lower the lr again for stability..

(Run an interim evaluation first)

In [72]:
trainer.evaluate(mnli_validation_matched_tokenized)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0916,1.06323,0.419724
2,1.0902,1.090477,0.346388
3,1.0892,1.063227,0.414912
4,1.0865,1.06772,0.390135
5,1.0858,1.082511,0.36271


{'eval_loss': 1.0825108289718628, 'eval_accuracy': 0.36271013754457465}

Ouch... that is really bad; fix it quickly

In [75]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    weight_decay=0.01,
)



In [76]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [77]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0695,1.058596,0.434035
2,1.0679,1.058402,0.429808
3,1.0665,1.058068,0.432609


TrainOutput(global_step=16569, training_loss=1.0679749739724786, metrics={'train_runtime': 423.5672, 'train_samples_per_second': 2503.246, 'train_steps_per_second': 39.118, 'total_flos': 5.996678183133964e+16, 'train_loss': 1.0679749739724786, 'epoch': 3.0})

In [78]:
trainer.evaluate(mnli_validation_matched_tokenized)

{'eval_loss': 1.0580675601959229,
 'eval_accuracy': 0.4376974019358125,
 'eval_runtime': 3.1442,
 'eval_samples_per_second': 3121.636,
 'eval_steps_per_second': 48.979,
 'epoch': 3.0}

In [80]:
training_args = TrainingArguments(
    output_dir='mnli_transformer',
    num_train_epochs=5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    weight_decay=0.01,
)



In [81]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [83]:
trainer.train(resume_from_checkpoint=False)

Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.0706,1.063245,0.0036,0.424384
2,1.069,1.061872,0.0036,0.423468


KeyboardInterrupt: 

# Attempt 2. Trying from_config (quick success)

Sigh.... no idea how long it would take to pass 0.5 with from_pretrained, so try from_config

In [84]:
from transformers import BertConfig

config = BertConfig()

config.hidden_size = 64  # hidden dimension of each BERT layer
config.intermediate_size = 64  # inner dimension of the FFN layer
config.num_hidden_layers = 2  # number of BERT layers
config.num_attention_heads = 4  # number of heads in multi-head attention
config.num_labels = 3  # number of classes to predict

model = AutoModelForSequenceClassification.from_config(config)

Apply the same TrainingArguments as before

In [85]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='mnli_transformer_2',
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    load_best_model_at_end=True,
    resume_from_checkpoint=True,
    dataloader_num_workers=8,  # enable parallel data loading
    weight_decay=0.01,
 )



In [86]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [87]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0159,0.958258,0.533829
2,0.9381,0.913613,0.56436
3,0.8867,0.89658,0.577093
4,0.8553,0.897849,0.57969
5,0.8343,0.904045,0.577627
6,0.8158,0.908601,0.57643


TrainOutput(global_step=33138, training_loss=0.8910333597852164, metrics={'train_runtime': 519.6502, 'train_samples_per_second': 6801.325, 'train_steps_per_second': 106.283, 'total_flos': 76833119391840.0, 'train_loss': 0.8910333597852164, 'epoch': 6.0})

Whoa.. with from_config accuracy starts above 0.5 right away and keeps climbing..


In [88]:
trainer.save_model()

In [89]:
trainer.evaluate(mnli_validation_matched_tokenized)

{'eval_loss': 0.8867551684379578,
 'eval_accuracy': 0.5913397860417728,
 'eval_runtime': 1.3629,
 'eval_samples_per_second': 7201.584,
 'eval_steps_per_second': 112.995,
 'epoch': 6.0}

### Puzzling
Something is off. Why did accuracy stall so badly with from_pretrained, while from_config trains this smoothly..

# Attempt 3. Trying from_pretrained again

Could.. freezing the base model's parameters have been the problem...?

Unfreeze them so the whole model can be trained, then retry.
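
To gauge how much the freeze matters, the sketch below counts parameters by hand. It assumes the standard BERT-base sizes: hidden size 768, 12 layers, and vocabulary 28996 as shown in the model printout, plus an FFN intermediate size of 3072, which is not visible in the truncated printout and is an assumption here. With `model.bert` frozen, only the classifier head is trained. On a live model the same figure can be obtained by summing `p.numel()` over the parameters whose `requires_grad` is true.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

HIDDEN, LAYERS, VOCAB, MAX_POS, TYPES = 768, 12, 28996, 512, 2
INTERMEDIATE = 3072  # assumption: standard BERT-base FFN size
NUM_LABELS = 3
LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + LAYER_NORM
per_layer = (
    4 * linear_params(HIDDEN, HIDDEN)      # query, key, value, attention output
    + linear_params(HIDDEN, INTERMEDIATE)  # FFN up-projection
    + linear_params(INTERMEDIATE, HIDDEN)  # FFN down-projection
    + 2 * LAYER_NORM                       # one LayerNorm after attention, one after FFN
)
pooler = linear_params(HIDDEN, HIDDEN)
encoder_total = embeddings + LAYERS * per_layer + pooler  # frozen in attempt 1

head = linear_params(HIDDEN, NUM_LABELS)  # the only trainable part in attempt 1

print(f"frozen encoder parameters: {encoder_total:,}")
print(f"trainable head parameters: {head:,}")
print(f"trainable fraction: {head / (encoder_total + head):.6%}")
```

Under these assumptions the frozen setup trains only a single linear layer on top of fixed features, which may explain why accuracy stalled in the low 40% range.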

In [90]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

training_args = TrainingArguments(
    output_dir='mnli_transformer',  # directory for checkpoints and logs
    num_train_epochs=10,  # number of epochs
    per_device_train_batch_size=32,  # training batch size
    per_device_eval_batch_size=32,  # evaluation batch size
    logging_strategy="epoch",
    evaluation_strategy="epoch",  # evaluate after every epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [91]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [94]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mnli_train_tokenized,
    eval_dataset=mnli_val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [95]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4989 | 0.44432 | 0.82389 |
| 2 | 0.3564 | 0.443451 | 0.83512 |
| 3 | 0.2432 | 0.503209 | 0.829518 |
| 4 | 0.1678 | 0.603354 | 0.829034 |
| 5 | 0.1233 | 0.712487 | 0.82855 |


TrainOutput(global_step=55225, training_loss=0.27794848710177117, metrics={'train_runtime': 3114.3873, 'train_samples_per_second': 1134.833, 'train_steps_per_second': 35.464, 'total_flos': 8.678007771291683e+16, 'train_loss': 0.27794848710177117, 'epoch': 5.0})
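
Training stopped after epoch 5 even though `num_train_epochs=10`. Here is a minimal sketch of why, replaying the logged validation losses through a patience counter. It assumes the monitored metric is the validation loss (as far as I know, that is the default when `load_best_model_at_end=True` and `metric_for_best_model` is not set) and a zero improvement threshold. `replay_early_stopping` is a hypothetical helper written for this note, not part of `transformers`.

```python
# Validation losses copied from the training log above (epochs 1-5).
val_losses = [0.44432, 0.443451, 0.503209, 0.603354, 0.712487]


def replay_early_stopping(losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience counter never reaches `patience`.
    """
    best = None
    best_epoch = None
    bad_evals = 0  # consecutive evaluations without improvement
    for epoch, loss in enumerate(losses, start=1):
        if best is None or loss < best:
            best, best_epoch, bad_evals = loss, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return None, best_epoch


stop_epoch, best_epoch = replay_early_stopping(val_losses, patience=3)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

Under these assumptions the best checkpoint is epoch 2 and the three non-improving epochs that follow (3, 4, 5) trigger the stop, which matches `'epoch': 5.0` in the `TrainOutput`.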

In [96]:
trainer.evaluate(mnli_validation_matched_tokenized)

{'eval_loss': 0.44881507754325867,
 'eval_accuracy': 0.834538970962812,
 'eval_runtime': 5.8593,
 'eval_samples_per_second': 1675.114,
 'eval_steps_per_second': 52.395,
 'epoch': 5.0}

In [97]:
trainer.save_model()

In [98]:
!cp -r /content/mnli_transformer /content/drive/MyDrive/saved_models/

## Trying out predictions with the trained model

In [99]:
import torch

# `model` and `tokenizer` come from the training cells above.

# Five example sentence pairs (trailing comments are the labels I expect)
sentence_pairs = [
    ("A man is eating food.", "A man is having a meal."),  # entailment
    ("A woman is playing a guitar.", "A woman is baking a cake."),  # contradiction
    ("A child is playing outside.", "A child is running in the park."),  # neutral
    ("Two cars are in a race.", "Vehicles are competing in a contest."),  # entailment
    ("The boy is swimming in the pool.", "The boy is standing by the pool."),  # contradiction
]

# MNLI label ids as used in the GLUE dataset
label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}

device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval()  # switch the model to evaluation mode (disables dropout)
model.to(device)

for premise, hypothesis in sentence_pairs:
    # Tokenize the pair with the same tokenizer that was used for training
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    probabilities = torch.softmax(outputs.logits, dim=1)
    predicted_label = probabilities.argmax(dim=1).item()
    predicted_label_str = label_map[predicted_label]

    # Print the result
    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print(f"Prediction: {predicted_label_str} (with probabilities {probabilities.cpu().numpy()})\n")


Premise: A man is eating food.
Hypothesis: A man is having a meal.
Prediction: entailment (with probabilities [[0.9881435  0.00954655 0.00230997]])

Premise: A woman is playing a guitar.
Hypothesis: A woman is baking a cake.
Prediction: contradiction (with probabilities [[4.1793723e-04 4.3592607e-03 9.9522275e-01]])

Premise: A child is playing outside.
Hypothesis: A child is running in the park.
Prediction: contradiction (with probabilities [[0.0041269  0.08000788 0.9158653 ]])

Premise: Two cars are in a race.
Hypothesis: Vehicles are competing in a contest.
Prediction: entailment (with probabilities [[0.9590854  0.03452769 0.00638698]])

Premise: The boy is swimming in the pool.
Hypothesis: The boy is standing by the pool.
Prediction: contradiction (with probabilities [[0.02728368 0.05855465 0.9141617 ]])



Huh... it really does get 4 out of 5 right.
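
A quick way to double-check that count is to compare the labels I expected (the comments in the cell above) with the printed predictions. This is just a sketch over values copied by hand from the output; the expected labels are my own guesses, not gold labels from the dataset.

```python
# Expected labels come from the comments next to each sentence pair;
# predicted labels are copied from the printed output above.
expected = ["entailment", "contradiction", "neutral", "entailment", "contradiction"]
predicted = ["entailment", "contradiction", "contradiction", "entailment", "contradiction"]

matches = [e == p for e, p in zip(expected, predicted)]
n_correct = sum(matches)
misses = [i for i, ok in enumerate(matches) if not ok]

print(f"{n_correct}/{len(expected)} correct; missed pair indices: {misses}")
```

The only miss is the third pair ("A child is playing outside." / "A child is running in the park."), which I labeled neutral and the model called contradiction.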

## ⁉️ Impressions / Questions
Shocking...😳
To think that removing just the code below makes this much difference in accuracy..


```
for param in model.bert.parameters():
    param.requires_grad = False
```

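Hmm... fascinating..
Is it because judging the logical relationship between two sentences is quite different in nature from the task BERT was originally trained on?
Could the two tasks be considered fairly dissimilar?

And when the upstream and downstream tasks differ like this, does that mean the pretrained model should not be used purely as a feature extractor, but should itself be trained further as well?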
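
One thing that helps me think about this is how little is actually left to train once `model.bert` is frozen. Below is a rough parameter count done by hand. The sizes 768, 28996, 512, 2 and the 12 layers come from the printed model summary; the feed-forward width of 3072 and the presence of a pooler inside `model.bert` are assumptions (standard BERT-base values), since the printout is truncated before those parts.

```python
# Sizes read from the printed model summary above.
hidden, vocab, max_pos, type_vocab, n_layers, n_labels = 768, 28996, 512, 2, 12, 3
# ASSUMPTION: standard BERT-base feed-forward width (not visible in the truncated printout).
intermediate = 3072


def linear(n_in, n_out):
    """Parameter count of a Linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward up-projection
    + linear(intermediate, hidden)    # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)  # ASSUMPTION: model.bert includes the usual pooler

bert_params = embeddings + n_layers * encoder_layer + pooler
classifier_params = linear(hidden, n_labels)  # the only part left trainable when bert is frozen
total = bert_params + classifier_params

print(f"frozen encoder: {classifier_params:,} of {total:,} parameters trainable "
      f"({classifier_params / total:.4%})")
```

With the encoder frozen, only the 768-to-3 classifier (2,307 parameters) can change, so the model has to separate the three classes with a single linear layer on top of fixed features. That alone may explain a good part of the gap, whatever the answer about task similarity turns out to be.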

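## 📕 What the book says

- Source: 파이토치와 트랜스포머를 활용한 자연어 처리와 컴퓨터비전 심층학습 (pp. 226-228)

- Transfer learning falls broadly into two categories:
  - Feature extraction
  - Fine-tuning

- The difference between the two is **whether the pretrained model's weights are frozen**. If they are frozen completely, the pretrained model is being used for feature extraction; if only some are frozen, or none, and training continues on the target domain, it counts as fine-tuning.

- Of course, fine-tuning allows for different degrees of freezing, and one consideration the book introduces here is the **similarity between the "source domain" and the "target domain"**.
  - As I understood it, the recommendation is to freeze less when the similarity is low and to freeze more when it is high, which makes intuitive sense to me.
  - In this assignment, training with `from_pretrained` was difficult when I froze the entire base model. I suspect that is related to this point.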
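
The "degree of freezing" idea could be sketched as a rule on parameter names: freeze the embeddings and the first few encoder layers, and train the rest. `should_freeze` is a hypothetical helper I made up for illustration, and the choice of 8 frozen layers is arbitrary; I have not tried this setting in the notebook. The names follow the `bert.encoder.layer.<i>.` pattern that `model.named_parameters()` produces for this model.

```python
def should_freeze(param_name, n_frozen_layers, freeze_embeddings=True):
    """Decide from its name alone whether a BERT parameter stays frozen."""
    if param_name.startswith("bert.embeddings."):
        return freeze_embeddings
    prefix = "bert.encoder.layer."
    if param_name.startswith(prefix):
        # The layer index is the first component after the prefix.
        layer_idx = int(param_name[len(prefix):].split(".")[0])
        return layer_idx < n_frozen_layers
    return False  # pooler and classifier head are always trained


# Example names in the style of model.named_parameters()
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.bias",
    "bert.pooler.dense.weight",
    "classifier.weight",
]
frozen = [n for n in names if should_freeze(n, n_frozen_layers=8)]
print(frozen)
```

Applied to the real model, this would look like `for name, param in model.named_parameters(): param.requires_grad = not should_freeze(name, 8)`. Setting `n_frozen_layers=12` reproduces something close to the fully frozen run (except that the pooler stays trainable), and `n_frozen_layers=0` with `freeze_embeddings=False` is the full fine-tuning used in Attempt 3.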
