
# KLUE/RoBERTa-base Baseline Model

In [25]:
!pip install sentence-transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [26]:
import math
import logging
from datetime import datetime

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

In [27]:
logging.basicConfig(
    format="%(asctime)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
    handlers=[LoggingHandler()],
)

In [28]:
model_name = "klue/roberta-base"

In [29]:
train_batch_size = 32
num_epochs = 4
model_save_path = "output/training_klue_sts_" + model_name.replace("/", "-") + "-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [30]:
embedding_model = models.Transformer(model_name)

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for

In [31]:
pooler = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
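
The pooling layer above is configured for mean pooling, which averages the token embeddings of a sentence into one fixed-size vector. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper name `mean_pool` and the toy vectors are hypothetical, and it assumes padding positions are marked with 0 in an attention mask.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an input that is all padding
    return [total / count for total in summed]


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Masking matters because padded positions would otherwise pull the average toward whatever the padding embeddings happen to be.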

In [32]:
model = SentenceTransformer(modules=[embedding_model, pooler])

2022-05-30 12:54:45 - Use pytorch device: cuda


In [33]:
datasets = load_dataset("klue", "sts")

2022-05-30 12:54:45 - Reusing dataset klue (/root/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)



In [34]:
datasets.keys()

dict_keys(['train', 'validation'])

In [35]:
datasets["train"][0]

{'guid': 'klue-sts-v1_train_00000',
 'labels': {'binary-label': 1, 'label': 3.7, 'real-label': 3.714285714285714},
 'sentence1': '숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.',
 'sentence2': '숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.',
 'source': 'airbnb-rtt'}

In [36]:
testsets = load_dataset("kor_nlu", "sts")

2022-05-30 12:54:46 - Reusing dataset kor_nlu (/root/.cache/huggingface/datasets/kor_nlu/sts/1.0.0/4facbba77df60b0658056ced2052633e681a50187b9428bd5752ebd59d332ba8)



In [37]:
testsets.keys()

dict_keys(['train', 'validation', 'test'])

In [38]:
testsets["test"][0]

{'filename': 2,
 'genre': 1,
 'id': 24,
 'score': 2.5,
 'sentence1': '한 소녀가 머리를 스타일링하고 있다.',
 'sentence2': '한 소녀가 머리를 빗고 있다.',
 'year': 6}

In [39]:
train_samples = []
dev_samples = []
test_samples = []

for phase in ["train", "validation"]:
    examples = datasets[phase]

    for example in examples:
        # Scale the 0-5 gold score to [0, 1] so it is comparable to cosine similarity.
        score = float(example["labels"]["label"]) / 5.0

        inp_example = InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=score,
        )

        if phase == "validation":
            dev_samples.append(inp_example)
        else:
            train_samples.append(inp_example)

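for example in testsets["test"]:
    # Skip rows with a missing sentence. Appending inside the check avoids
    # re-adding the previous example when a row is incomplete.
    if example["sentence1"] and example["sentence2"]:
        score = float(example["score"]) / 5.0
        inp_example = InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=score,
        )
        test_samples.append(inp_example)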

In [40]:
train_samples[0].texts, train_samples[0].label

(['숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.',
  '숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.'],
 0.74)

In [41]:
test_samples[0].texts, test_samples[0].label

(['한 소녀가 머리를 스타일링하고 있다.', '한 소녀가 머리를 빗고 있다.'], 0.5)

In [42]:
train_dataloader = DataLoader(
    train_samples,
    shuffle=True,
    batch_size=train_batch_size,
)
train_loss = losses.CosineSimilarityLoss(model=model)
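
`CosineSimilarityLoss` compares the cosine similarity of the two sentence embeddings with the normalized gold label. The sketch below shows the per-pair computation in pure Python, assuming the default squared-error objective; the function names and the toy vectors are hypothetical and stand in for real model embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_error(u: Sequence[float], v: Sequence[float], label: float) -> float:
    """Squared difference between predicted similarity and the gold label."""
    return (cosine_similarity(u, v) - label) ** 2


# Toy embeddings at a 45-degree angle, scored against a gold label of 0.74.
u = [1.0, 0.0]
v = [1.0, 1.0]
sim = cosine_similarity(u, v)
loss = squared_error(u, v, 0.74)
print(round(sim, 4), round(loss, 6))
```

This is why the labels are divided by 5.0 beforehand: the target has to live on the same scale as the similarity it is compared against.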

In [43]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_samples,
    name="sts-dev",
)
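
The evaluator reports Pearson and Spearman correlations between predicted similarities and gold scores. As a rough illustration of the difference between the two, here is a pure-Python sketch; the helper names and the toy scores are hypothetical, and the rank function assumes there are no tied values.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of the values."""
    return pearson(ranks(x), ranks(y))


gold = [0.1, 0.4, 0.74, 0.9]
predicted = [0.2, 0.3, 0.5, 0.99]
print(round(pearson(gold, predicted), 4), round(spearman(gold, predicted), 4))
```

Spearman only looks at ordering, so predictions that rank the pairs correctly score 1.0 even when they are not a linear function of the gold values.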

In [44]:
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.3)  # warm up over 30% of the total training steps
logging.info(f"Warmup-steps: {warmup_steps}")

2022-05-30 12:54:48 - Warmup-steps: 438
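
The logged value of 438 can be reproduced by hand from the 365 iterations per epoch shown in the training progress output. The sketch below also illustrates the shape of a linear warm-up followed by linear decay, assuming `model.fit` is left on its default `WarmupLinear` scheduler; `lr_scale` is a hypothetical helper, not a library function.

```python
import math

steps_per_epoch = 365   # taken from the "Iteration: 0/365" progress output
num_epochs = 4
warmup_fraction = 0.3

total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_fraction)


def lr_scale(step: int) -> float:
    """Multiplier on the base learning rate: linear ramp up, then linear decay."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(total_steps, warmup_steps)  # 1460 438
print(lr_scale(0), lr_scale(219), lr_scale(438), lr_scale(1460))  # 0.0 0.5 1.0 0.0
```

With these settings the learning rate peaks a little into the second epoch and then decays for the rest of training.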


In [45]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
    output_path=model_save_path,
)



Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/365 [00:00<?, ?it/s]

2022-05-30 12:57:43 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 0:
2022-05-30 12:57:45 - Cosine-Similarity :	Pearson: 0.8619	Spearman: 0.8579
2022-05-30 12:57:45 - Manhattan-Distance:	Pearson: 0.8588	Spearman: 0.8541
2022-05-30 12:57:45 - Euclidean-Distance:	Pearson: 0.8593	Spearman: 0.8543
2022-05-30 12:57:45 - Dot-Product-Similarity:	Pearson: 0.8525	Spearman: 0.8458
2022-05-30 12:57:45 - Save model to output/training_klue_sts_klue-roberta-base-2022-05-30_12-54-42


Iteration:   0%|          | 0/365 [00:00<?, ?it/s]

2022-05-30 13:00:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 1:
2022-05-30 13:00:49 - Cosine-Similarity :	Pearson: 0.8784	Spearman: 0.8781
2022-05-30 13:00:49 - Manhattan-Distance:	Pearson: 0.8802	Spearman: 0.8752
2022-05-30 13:00:49 - Euclidean-Distance:	Pearson: 0.8809	Spearman: 0.8760
2022-05-30 13:00:49 - Dot-Product-Similarity:	Pearson: 0.8683	Spearman: 0.8647
2022-05-30 13:00:49 - Save model to output/training_klue_sts_klue-roberta-base-2022-05-30_12-54-42


Iteration:   0%|          | 0/365 [00:00<?, ?it/s]

2022-05-30 13:03:53 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 2:
2022-05-30 13:03:55 - Cosine-Similarity :	Pearson: 0.8870	Spearman: 0.8856
2022-05-30 13:03:55 - Manhattan-Distance:	Pearson: 0.8854	Spearman: 0.8802
2022-05-30 13:03:55 - Euclidean-Distance:	Pearson: 0.8864	Spearman: 0.8815
2022-05-30 13:03:55 - Dot-Product-Similarity:	Pearson: 0.8762	Spearman: 0.8700
2022-05-30 13:03:55 - Save model to output/training_klue_sts_klue-roberta-base-2022-05-30_12-54-42


Iteration:   0%|          | 0/365 [00:00<?, ?it/s]

2022-05-30 13:06:57 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 3:
2022-05-30 13:06:58 - Cosine-Similarity :	Pearson: 0.8871	Spearman: 0.8882
2022-05-30 13:06:58 - Manhattan-Distance:	Pearson: 0.8870	Spearman: 0.8830
2022-05-30 13:06:58 - Euclidean-Distance:	Pearson: 0.8877	Spearman: 0.8835
2022-05-30 13:06:58 - Dot-Product-Similarity:	Pearson: 0.8769	Spearman: 0.8729
2022-05-30 13:06:58 - Save model to output/training_klue_sts_klue-roberta-base-2022-05-30_12-54-42


In [46]:
model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')

2022-05-30 13:07:00 - Load pretrained SentenceTransformer: output/training_klue_sts_klue-roberta-base-2022-05-30_12-54-42
2022-05-30 13:07:01 - Use pytorch device: cuda


In [47]:
test_evaluator(model, output_path=model_save_path)

2022-05-30 13:07:01 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-05-30 13:07:05 - Cosine-Similarity :	Pearson: 0.7695	Spearman: 0.7609
2022-05-30 13:07:05 - Manhattan-Distance:	Pearson: 0.7655	Spearman: 0.7647
2022-05-30 13:07:05 - Euclidean-Distance:	Pearson: 0.7647	Spearman: 0.7637
2022-05-30 13:07:05 - Dot-Product-Similarity:	Pearson: 0.7416	Spearman: 0.7311


0.7646656116983064
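
The returned value, about 0.7647, matches the Manhattan-distance Spearman score in the log above. That is consistent with the evaluator returning the highest Spearman correlation across the four similarity measures when no main similarity is specified, which is an assumption about the installed library version rather than something the notebook states. The selection can be sketched with the rounded values from the log:

```python
# Spearman scores copied from the sts-test log lines (rounded to 4 decimals).
spearman_by_metric = {
    "cosine": 0.7609,
    "manhattan": 0.7647,
    "euclidean": 0.7637,
    "dot": 0.7311,
}

# Pick the metric with the highest Spearman correlation.
best_metric = max(spearman_by_metric, key=spearman_by_metric.get)
print(best_metric, spearman_by_metric[best_metric])  # manhattan 0.7647
```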