<a href="https://colab.research.google.com/github/mjiii25/posco-academy/blob/main/19th_NLP_Test_baseline_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KLUE-TC Classification Task

This notebook is example code for the KLUE-TC (YNAT) task. The benchmark on the Kaggle leaderboard is the result of running this code as-is.

Use this code, the lab code, and code from earlier courses as references to build your own AI.

You can join the Kaggle competition for this assignment through the link below: 
https://www.kaggle.com/t/3e0bdda08ecd645d9b2fef4a90b283c3 

In [None]:
! pip install datasets transformers evaluate



In [None]:
import torch
import torch.nn as nn
import evaluate
import numpy as np

## Define the output-export function
Just run the code block below.

In [None]:
# Runs the given model on the validation set and saves the predictions to result.csv.
def export_result(model, tokenizer=None):
  # Imported here so the function does not depend on a later cell having run.
  from datasets import load_dataset
  from pandas import DataFrame
  test_dataset = load_dataset("klue", "ynat")["validation"]
  guid_list = []
  pred_label_list = []
  true_label_list = []
  is_warned = False
  with torch.no_grad():
    for idx, datum in enumerate(test_dataset):
      if idx % 100 == 0:
        print("test {}th data".format(idx))
      guid_list.append(datum["guid"])

      if tokenizer is not None:
        # The tokenizer returns CPU tensors, so the model is expected to be on the CPU here.
        tokenized = tokenizer(datum["title"], return_tensors="pt")
        predicted_label = model(**tokenized)[0]
      else:
        # If only the model is passed, it must return a single number or a Tensor.
        # Alternatively, change the code below so that one number at a time
        # is appended to pred_label_list.
        predicted_label = model(datum["title"])

      # Convert logits (or a list whose first element is logits) to a class index.
      if isinstance(predicted_label, list):
        predicted_label = torch.argmax(predicted_label[0], dim=1).item()
      elif isinstance(predicted_label, torch.Tensor):
        predicted_label = torch.argmax(predicted_label, dim=1).item()

      true_value = datum["label"]
      if predicted_label > 6 and not is_warned:
        print("A predicted_label value is greater than 6. Please check the model output.")
        is_warned = True
      pred_label_list.append(predicted_label)
      true_label_list.append(true_value)
  df = DataFrame({"guid": guid_list,
                  "Category": pred_label_list})
  # Save df to result.csv
  df.to_csv("./result.csv", index=False)
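
Once `export_result` has written `result.csv`, it may be worth sanity-checking the file before submitting it. The sketch below uses only the standard library. `check_submission` is a hypothetical helper that is not part of this notebook; it assumes the two-column layout (`guid`, `Category`) and the seven labels 0 to 6 used above.

```python
import csv


def check_submission(path="./result.csv", num_labels=7):
    """Hypothetical helper: validate the CSV layout written by export_result."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # The header must match the columns that export_result writes.
        if reader.fieldnames != ["guid", "Category"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = list(reader)

    # Every guid should appear exactly once.
    guids = [row["guid"] for row in rows]
    if len(set(guids)) != len(guids):
        raise ValueError("duplicate guid values found")

    # Every prediction should be an integer class index in [0, num_labels).
    for row in rows:
        label = int(row["Category"])
        if not 0 <= label < num_labels:
            raise ValueError(f"label {label} out of range for guid {row['guid']}")
    return len(rows)
```

Calling `check_submission()` after `export_result(...)` returns the number of rows in the file, or raises a `ValueError` describing the first problem it finds.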

## Load the dataset

#### Caution: This assignment is graded on validation-set performance, so training a model on the validation set (for example, via load_dataset("klue","ynat")['validation']) results in a score of 0.
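
One way to guard against accidentally training on validation data is to compare `guid` values before training. This is a minimal sketch, not a grading rule. `find_leaked_guids` is a hypothetical helper, and the toy rows stand in for the real splits.

```python
def find_leaked_guids(train_rows, validation_rows):
    """Hypothetical helper: return validation guids that also appear in the training rows."""
    validation_guids = {row["guid"] for row in validation_rows}
    train_guids = {row["guid"] for row in train_rows}
    return sorted(validation_guids & train_guids)


# Toy rows standing in for the training data and the validation split.
train_rows = [{"guid": "train-0"}, {"guid": "train-1"}]
validation_rows = [{"guid": "dev-0"}, {"guid": "dev-1"}]

leaked = find_leaked_guids(train_rows, validation_rows)
print(leaked)  # → []
```

With the real data, the same function could be called on whatever rows you train on and on `load_dataset("klue", "ynat")["validation"]`, since iterating over a split yields dictionaries with a `guid` key. An empty list means no validation example was found in the training data.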


In [None]:
from datasets import load_dataset

# This assignment is graded on validation-set performance, so training
# a model on the validation set (for example, via
# load_dataset("klue","ynat")['validation']) results in a score of 0.
dataset = load_dataset("klue",'ynat')




In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['guid', 'title', 'label', 'url', 'date'],
        num_rows: 45678
    })
    validation: Dataset({
        features: ['guid', 'title', 'label', 'url', 'date'],
        num_rows: 9107
    })
})

## Model design

#### Build an AI model for the KLUE-TC task based on what you learned in the NLP course or earlier lab sessions.


In [None]:
# Define Your AI
# The code block below is an example AI. You may use it or ignore it.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_checkpoint = "bert-base-cased"
some_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,num_labels=7).to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

## Designing the training procedure

#### Based on what you learned in the NLP course or earlier lab sessions, write code that can train the AI defined above. You may use additional data besides the KLUE-TC data.

In [None]:
# Train your AI.
# The three code blocks below are taken directly from the lab session. You may use them or ignore them.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
metric = evaluate.load("accuracy")
def preprocess_function(examples):
    return some_tokenizer(examples['title'], truncation=True)

# Encode the entire dataset.
encoded_dataset = dataset.map(preprocess_function, batched=True)
print(encoded_dataset["train"])



Dataset({
    features: ['guid', 'title', 'label', 'url', 'date', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 45678
})


In [None]:
# These settings are taken directly from the lab session.
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]
batch_size=16
args = TrainingArguments(
    "checkpoint",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    do_train=True,
    do_eval=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
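
What `compute_metrics` calculates can be illustrated without any libraries. The sketch below is a plain-Python stand-in for `np.argmax` followed by the accuracy metric. `accuracy_from_logits` is a hypothetical name, and the toy logits use three classes instead of seven to keep the example short.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [
    [0.1, 2.0, -1.0],  # highest score at index 1
    [1.5, 0.2, 0.3],   # highest score at index 0
    [0.0, 0.1, 0.9],   # highest score at index 2
    [0.4, 0.3, 0.2],   # highest score at index 0
]
toy_labels = [1, 0, 1, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # → {'accuracy': 0.75}
```

Three of the four toy predictions match their labels, hence 0.75. The `Trainer` calls `compute_metrics` with the full validation logits and labels in the same way at each evaluation.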

In [None]:
# Passing encoded_dataset['validation'] as the Trainer's eval_dataset parameter is not cheating (it is used for evaluation, not for training).
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=some_tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: date, url, title, guid. If date, url, title, guid are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 14275
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,1.5289,1.735726,0.301087
2,1.4913,1.738183,0.303942
3,1.4537,1.758257,0.287581
4,1.4387,1.732808,0.270012
5,1.4269,1.742444,0.26551


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: date, url, title, guid. If date, url, title, guid are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 16
Saving model checkpoint to checkpoint/checkpoint-2855
Configuration saved in checkpoint/checkpoint-2855/config.json
Model weights saved in checkpoint/checkpoint-2855/pytorch_model.bin
tokenizer config file saved in checkpoint/checkpoint-2855/tokenizer_config.json
Special tokens file saved in checkpoint/checkpoint-2855/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: date, url, title, guid. If date, url, title, guid are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this messag

TrainOutput(global_step=14275, training_loss=1.4787368523218585, metrics={'train_runtime': 1382.454, 'train_samples_per_second': 165.206, 'train_steps_per_second': 10.326, 'total_flos': 1668246257938200.0, 'train_loss': 1.4787368523218585, 'epoch': 5.0})
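
The step counts in the log above follow from the dataset size and batch size, and the per-epoch accuracy table shows which epoch scored best. The sketch below reproduces both. It assumes a single device, no gradient accumulation, and one checkpoint saved per epoch, as the settings above request.

```python
import math

num_examples = 45678  # "Num examples" in the training log
batch_size = 16       # per-device batch size on a single device
num_epochs = 5

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # → 2855 14275

# Validation accuracy per epoch, copied from the table in the log above.
accuracy_by_epoch = {1: 0.301087, 2: 0.303942, 3: 0.287581, 4: 0.270012, 5: 0.26551}
best_epoch = max(accuracy_by_epoch, key=accuracy_by_epoch.get)
print(best_epoch, f"checkpoint-{best_epoch * steps_per_epoch}")  # → 2 checkpoint-5710
```

With `load_best_model_at_end=True` and accuracy as the selection metric, the `trainer` should end up holding the epoch-2 weights. `checkpoint-14275` corresponds to the final epoch, which had the lowest validation accuracy in this run, so it may be worth checking which checkpoint you load before exporting results.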

## Generate the AI output

#### Pass your trained AI to the export_result function as an argument to generate the output file.

In [None]:
# export_result(//Your AI object//)

# The code block below is example code. You may use it or ignore it.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("./checkpoint/checkpoint-14275")
test_dataset = load_dataset("klue",'ynat')['validation']
export_result(model,some_tokenizer)

loading configuration file ./IMDB-finetuned-BERT/checkpoint-14275/config.json
Model config BertConfig {
  "_name_or_path": "./IMDB-finetuned-BERT/checkpoint-14275",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_


Accuracy: {'accuracy': nan}


