## Installation

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install wandb -qU

In [4]:
import torch
import torch.optim as optim

import numpy as np
from tqdm import tqdm

from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    default_data_collator,
    EarlyStoppingCallback,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup,
    get_constant_schedule,
    AdamW
)

In [5]:
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mmjchoi[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Read CSV

In [6]:
import pandas as pd
# Load the six training CSV shards and stack them into one DataFrame
base_url = 'https://raw.githubusercontent.com/cmj-dev/groomProject/main/data'
data = pd.concat([pd.read_csv(f'{base_url}/gen_train_data_{i}.csv') for i in range(1, 7)])

In [7]:
data

Unnamed: 0,masked,original
0,쉴드 아니라 국가가해준거야.,쉴드가 아니라 국가가 면제해준거야.
1,짐승 들어ㅋㅋ,짐승입장도 들어봐야지ㅋㅋ
2,깐라고기까지,깐부라고 부르기까지 하다니
3,어머니 간 해 주기 싫어서 도망쳤다고...?,어머니 간 이식 해 주기 싫어서 도망쳤다고...?
4,##한 기념 내가 사 줄게,고생한 기념으로 내가 저녁 사 줄게
...,...,...
12334,일을 해서 숙련도가 좋겠지?,그동안 일을 꾸준히 해서 숙련도가 좋겠지?
12335,당구장에서 알바하는데 아저씨들이 나중에 밥한끼사준다면서알려달라고 그래 ; ;,당구장에서 알바하는데 아저씨들이 나중에 밥한끼사준다면서 번호알려달라고 그래;;
12336,ㅠㅠ눈빛도 완전 그윽하게 쳐다보는데 그만둘 그냥..,ㅠㅠ눈빛도 완전 그윽하게 쳐다보는데 그만둘까봐 그냥..
12337,근데니까 다른인 친구로 속이더라,근데 누구냐니까 다른 남자인 친구로 속이더라


In [8]:
data = data.sample(frac=1,random_state=43)

In [9]:
split_idx = int(len(data)*0.8)  # the first 80% of the shuffled rows go to train

In [10]:
train_df = data.iloc[:split_idx,:]
dev_df = data.iloc[split_idx:,:]
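
The shuffle-then-slice split above keeps the original row labels, which is why `Dataset.from_pandas` later reports an `__index_level_0__` column. A minimal sketch of the same 80/20 split with the index reset; the helper name `shuffle_split` and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd

def shuffle_split(df, train_frac=0.8, seed=43):
    """Shuffle the rows, then slice into train/dev parts with fresh 0..n-1 indices."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split_idx = int(len(shuffled) * train_frac)
    train = shuffled.iloc[:split_idx]
    dev = shuffled.iloc[split_idx:].reset_index(drop=True)
    return train, dev

# Toy data standing in for the masked/original sentence pairs.
toy = pd.DataFrame({"masked": list("abcdefghij"), "original": list("ABCDEFGHIJ")})
toy_train, toy_dev = shuffle_split(toy)
print(len(toy_train), len(toy_dev))
```

Passing `preserve_index=False` to `Dataset.from_pandas` is another way to keep the index out of the dataset.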

In [11]:
train_df

Unnamed: 0,masked,original
12012,##면 여자 다고?,바이면 남자 여자 다 좋아한다고?
7675,사람이었어?,[UNK] 남편은 뭐하는 사람이었어?
5326,ㅋㅋ 게야,ㅋㅋ 그래도 잘생긴 게 최고야
7621,,당연히 결혼해야지
8923,전쟁도 없고 질병도 어느 정도는되고 말이야,전쟁도 없고 질병도 어느 정도는 해결되고 말이야
...,...,...
11899,새이나 다름 없어!,거의 새상품이나 다름 없댔어!
2430,,네가 이해해라
11265,다리에 들어가,다리에 힘이 많이 들어가네
7676,,정권교체만이 답이다


In [12]:
dev_df

Unnamed: 0,masked,original
1399,수를 저 정권을 보낼 수,무슨 수를 써야 저 정권을 보낼 수 있는거야
11322,김치는 내가 먹을 모르는데 옆에서 냄새 풍기면 진짜 별로임,"김치는 내가 먹을 땐 모르는데, 옆에서 냄새 풍기면 진짜 별로임"
3281,,너 미쳤어?
11309,사람 배구 선수 아냐,저 사람 그 배구 선수 아냐?
5105,조선족은 우리나라 투표 수 있나?,조선족은 우리나라 대통령 투표할 수 있나?
...,...,...
2064,애가? 울린 애가 거지,운 애가 잘못이냐? 울린 애가 잘못한 거지
10517,에이 늙 않았,에이 그정도로 늙지는 않았어
7985,##서 [UNK] 도 보이고 [UNK] 도 보이네,과거 사진에서 [UNK]도 보이고 [UNK]도 보이네
2303,##만큼는 것도,노력한만큼 돌아오는 것도 없어


In [13]:
from datasets import Dataset

train_data = Dataset.from_pandas(train_df)
dev_data = Dataset.from_pandas(dev_df)

In [14]:
train_data

Dataset({
    features: ['masked', 'original', '__index_level_0__'],
    num_rows: 9871
})

## Tokenize

In [15]:
model_name = 'skt/kogpt2-base-v2'
max_length = 256

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          bos_token='</s>', eos_token='</s>', unk_token='<unk>',
                                          pad_token='<pad>', mask_token='<mask>')

In [17]:
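def tokenizeWithLabel(data):
    # Prompt layout: <unused0> <unused1> {masked} <unused2> {original} <unused3>
    tokenized_datas = tokenizer(
        f"<unused0> <unused1> {data['masked']} <unused2> {data['original']} <unused3>",
        max_length=max_length,
        truncation=True,  # explicitly cut inputs longer than max_length
        padding="max_length"
    )
    # Causal LM objective: labels are a copy of the input ids (the model shifts them internally)
    tokenized_datas['labels'] = tokenized_datas["input_ids"].copy()
    return tokenized_datas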
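
Because `labels` is the full padded `input_ids`, the loss is also computed over the `<pad>` positions. One common alternative, sketched here as an assumption rather than what this notebook does, is to set the labels of padded positions to -100, the value that PyTorch's cross-entropy loss ignores by default. The helper name and the toy token ids are hypothetical:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by two padding positions (ids are made up).
example_ids = [9, 41, 7, 3, 3]
example_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(example_ids, example_mask)
print(labels)  # [9, 41, 7, -100, -100]
```

Inside `tokenizeWithLabel` this would be applied to `tokenized_datas["input_ids"]` and `tokenized_datas["attention_mask"]`. Note that it changes the reported loss values, since padding no longer contributes to the average.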

In [18]:
train_tokenized_datasets = train_data.map(tokenizeWithLabel, remove_columns=train_data.column_names)
dev_tokenized_datasets = dev_data.map(tokenizeWithLabel, remove_columns=dev_data.column_names)

  0%|          | 0/9871 [00:00<?, ?ex/s]

  0%|          | 0/2468 [00:00<?, ?ex/s]

## Train

In [20]:
sweep_configuration = {
    'method': 'grid',
    'name': 'sweep',
    'metric': {'goal': 'minimize', 'name': 'eval/loss'},
    'parameters': 
    {
        'batch_size': {'values': [256]},
        'epochs': {'values': [10]},
        'lr': {'values': [1e-5]},
        'scheduler': {'values': ['linear', 'cosine', 'constant']}
     }
}
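
With `method: grid`, the sweep tries every combination of the listed values. A plain-Python illustration of that expansion (not W&B's own implementation) shows that this configuration yields exactly three runs, one per scheduler:

```python
from itertools import product

# Same values as the 'parameters' section of the sweep configuration.
parameters = {
    "batch_size": [256],
    "epochs": [10],
    "lr": [1e-5],
    "scheduler": ["linear", "cosine", "constant"],
}

def expand_grid(params):
    """Return one dict per combination of the listed values."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*(params[n] for n in names))]

grid = expand_grid(parameters)
for cfg in grid:
    print(cfg)
print(len(grid))  # 3
```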

In [21]:
max_batch_size = 32
def train():
    torch.cuda.empty_cache()
    model = AutoModelForCausalLM.from_pretrained(model_name)
    run = wandb.init(config=sweep_configuration, entity="groom2team")
    # Cap the per-device batch size and reach the configured batch size via gradient accumulation
    batch_size = min(wandb.config.batch_size, max_batch_size)
    gradient_accumulation_steps = max(1, wandb.config.batch_size // max_batch_size)  # never 0
    epochs = wandb.config.epochs
    total_steps = int(len(train_tokenized_datasets)/wandb.config.batch_size*epochs)
    learning_rate = wandb.config.lr
    data_collator = default_data_collator
    grouped_params = model.parameters()
    optimizer=AdamW(grouped_params, lr=learning_rate)
    scheduler_type = wandb.config.scheduler
    if scheduler_type == 'linear':
        scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
                                                    num_warmup_steps=0,
                                                    num_training_steps=total_steps)
    elif scheduler_type == 'cosine':
        scheduler=get_cosine_schedule_with_warmup(optimizer=optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
    elif scheduler_type == 'constant':
        scheduler=get_constant_schedule(optimizer=optimizer)
    optimizers = optimizer, scheduler
    args = TrainingArguments(
        f"{model_name}-finetuned",
        evaluation_strategy = "steps",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=epochs,
        gradient_accumulation_steps = gradient_accumulation_steps,
        report_to="wandb",
        run_name="utopia",
        logging_steps = total_steps//200,
        eval_steps = total_steps//100,
        save_steps = total_steps//100,
        weight_decay=0.0,
        save_total_limit = 2,
        load_best_model_at_end=True
    )
    trainer = Trainer(
        model,
        args,
        train_dataset=train_tokenized_datasets,
        eval_dataset=dev_tokenized_datasets,
        data_collator=data_collator,
        tokenizer=tokenizer,
        callbacks = [EarlyStoppingCallback(early_stopping_patience=10)],
        optimizers=optimizers
    )
    trainer.train()  # run training
    trainer.save_model(output_dir= 'pytorch_finetuned') # save the model held by the trainer
    artifact = wandb.Artifact(name='pytorch_finetuned', type='model') # version the saved model as a wandb artifact
    artifact.add_dir('pytorch_finetuned', name='best_model_at_end')
    run.log_artifact(artifact)
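
The three scheduler choices differ only in how they scale the base learning rate over the course of training. The sketch below is a plain-Python approximation of those shapes (linear decay, half-cosine decay, constant) with optional warmup. It is not the `transformers` implementation, and the function names are hypothetical:

```python
import math

def linear_multiplier(step, total_steps, warmup_steps=0):
    """Learning-rate multiplier: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

def cosine_multiplier(step, total_steps, warmup_steps=0):
    """Learning-rate multiplier: linear warmup, then half a cosine wave down to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

def constant_multiplier(step, total_steps, warmup_steps=0):
    """Learning-rate multiplier that never changes."""
    return 1.0

base_lr = 1e-5
total_steps = 385  # int(9871 / 256 * 10), the value computed in train()
schedules = [
    ("linear", linear_multiplier),
    ("cosine", cosine_multiplier),
    ("constant", constant_multiplier),
]
for name, fn in schedules:
    # Learning rate at the start, the midpoint, and the end of the schedule.
    lrs = [base_lr * fn(step, total_steps) for step in (0, total_steps // 2, total_steps)]
    print(name, lrs)
```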

In [22]:
sweep_id = wandb.sweep(sweep=sweep_configuration, project='pj3_generater_gpt2', entity='groom2team')
count = 3

Create sweep with ID: 7duagugh
Sweep URL: https://wandb.ai/groom2team/pj3_gen_gpt2/sweeps/7duagugh


In [None]:
wandb.agent(sweep_id, function=train, count=count)

[34m[1mwandb[0m: Agent Starting Run: e2wdqyei with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	lr: 1e-05
[34m[1mwandb[0m: 	scheduler: linear
ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mmjchoi[0m ([33mgroom2team[0m). Use [1m`wandb login --relogin`[0m to force relogin


***** Running training *****
  Num examples = 9871
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 8
  Total optimization steps = 380
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
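
The numbers in this log follow from the batch settings. A short sketch of the arithmetic, assuming a single device and assuming the Trainer floors the number of batches per epoch by the accumulation steps (which is consistent with the 380 reported here); the function name is hypothetical:

```python
import math

def optimizer_steps(num_examples, target_batch, max_device_batch, epochs):
    """Return (per-device batch size, accumulation steps, total optimizer steps)."""
    per_device = min(target_batch, max_device_batch)
    accumulation = max(1, target_batch // max_device_batch)
    batches_per_epoch = math.ceil(num_examples / per_device)
    updates_per_epoch = max(1, batches_per_epoch // accumulation)
    return per_device, accumulation, updates_per_epoch * epochs

print(optimizer_steps(9871, 256, 32, 10))  # (32, 8, 380)
```

The `total_steps` value computed in `train()` is `int(9871 / 256 * 10) = 385`, so the linear and cosine schedulers are sized for a few more steps than the 380 that actually run.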


Step,Training Loss,Validation Loss
3,12.2581,10.326203
6,6.5073,3.138832
9,1.6811,0.75909
12,0.7176,0.675427
15,0.6033,0.558822
18,0.5383,0.493798
21,0.4856,0.451489
24,0.4496,0.408962
27,0.4031,0.36791
30,0.3542,0.331047


***** Running Evaluation *****
  Num examples = 2468
  Batch size = 32
Saving model checkpoint to skt/kogpt2-base-v2-finetuned/checkpoint-3
Configuration saved in skt/kogpt2-base-v2-finetuned/checkpoint-3/config.json
Model weights saved in skt/kogpt2-base-v2-finetuned/checkpoint-3/pytorch_model.bin
tokenizer config file saved in skt/kogpt2-base-v2-finetuned/checkpoint-3/tokenizer_config.json
Special tokens file saved in skt/kogpt2-base-v2-finetuned/checkpoint-3/special_tokens_map.json
Deleting older checkpoint [skt/kogpt2-base-v2-finetuned/checkpoint-9] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2468
  Batch size = 32
Saving model checkpoint to skt/kogpt2-base-v2-finetuned/checkpoint-6
Configuration saved in skt/kogpt2-base-v2-finetuned/checkpoint-6/config.json
Model weights saved in skt/kogpt2-base-v2-finetuned/checkpoint-6/pytorch_model.bin
tokenizer config file saved in skt/kogpt2-base-v2-finetuned/checkpoint-6/tokenizer_config.json
Special tokens f

In [None]:
wandb.finish() # end the wandb run