This tutorial uses the Hugging Face [Transformers](https://github.com/huggingface/transformers) and [Datasets](https://github.com/huggingface/datasets) libraries.

In [None]:
!pip install transformers



In [None]:
!pip install datasets==1.17.0

# 1. Loading the dataset

We load the data through Hugging Face Datasets.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('smilegate-ai/kor_unsmile')

In [None]:
dataset["train"][0]

{'문장': '일안하는 시간은 쉬고싶어서 그런게 아닐까',
 '여성/가족': 0,
 '남성': 0,
 '성소수자': 0,
 '인종/국적': 0,
 '연령': 0,
 '지역': 0,
 '종교': 0,
 '기타 혐오': 0,
 '악플/욕설': 0,
 'clean': 1,
 '개인지칭': 0,
 'labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}

In [None]:
unsmile_labels = ["여성/가족","남성","성소수자","인종/국적","연령","지역","종교","기타 혐오","악플/욕설","clean"]
# '개인지칭' (targeting an individual) is auxiliary information, so it is excluded from the classification targets.
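
The `labels` field is a multi-hot vector whose positions follow the order of `unsmile_labels`. As a minimal sketch, the hypothetical helper `decode_labels` below (not part of the original notebook) maps such a vector back to label names:

```python
# Label names in the same order as the `labels` vector (copied from the cell above).
unsmile_labels = ["여성/가족", "남성", "성소수자", "인종/국적", "연령",
                  "지역", "종교", "기타 혐오", "악플/욕설", "clean"]


def decode_labels(multi_hot):
    """Return the names of all labels whose entry in the multi-hot vector is 1."""
    # `decode_labels` is an illustrative helper, not a Datasets or Transformers API.
    return [name for name, flag in zip(unsmile_labels, multi_hot) if flag == 1]


# The `labels` field of the first training example shown earlier.
print(decode_labels([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))  # ['clean']
```

Because several positions can be 1 at once, a sentence may carry more than one label.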

# 2. Loading the model

We use a pretrained language model (PLM) for training.

In [None]:
from transformers import BertForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
import torch
import numpy as np

In [None]:
model_name = 'beomi/kcbert-base'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenize the sentences so that the training data can be passed to the BERT model.

In [None]:
def preprocess_function(examples):
    tokenized_examples = tokenizer(str(examples["문장"]))
    tokenized_examples['labels'] = torch.tensor(examples["labels"], dtype=torch.float)
    # For multi-label classification training, the labels must be converted to float.
    # At the time of writing, the latest Hugging Face Datasets release had a bug in `map`
    # that prevented this conversion from being applied correctly.

    return tokenized_examples
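
The float requirement comes from the loss function: when the model is configured with `problem_type="multi_label_classification"`, Transformers uses a binary cross-entropy loss on the logits (`BCEWithLogitsLoss` in PyTorch), which expects float targets. The following is a NumPy re-implementation for illustration only, not the library's code; the function name `bce_with_logits` and the sample values are assumptions:

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (numerically stable form)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)  # targets are floats in [0, 1]
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] for each label.
    per_label = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return per_label.mean()


# Illustrative values: three labels, only the first one is active.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]
print(bce_with_logits(logits, targets))
```

Each label contributes its own independent binary term, which is what allows several labels to be active for the same sentence.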

In [None]:
tokenized_dataset = dataset.map(preprocess_function)
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask', 'token_type_ids'])



In [None]:
tokenized_dataset['train'][0]

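  (isinstance(x, np.ndarray) and (x.dtype == np.object or x.shape != array[0].shape))

AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Note: this cell fails because Datasets 1.17.0 still refers to `np.object`, which NumPy removed in version 1.24. Running the notebook with an older NumPy (for example `numpy<1.24`) or with a Datasets release that no longer uses the alias avoids the error.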
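
One possible workaround for the `np.object` error, if changing package versions is not an option, is to restore the alias before the failing cell runs. This is a sketch under the assumption that the alias is the only incompatibility; it has not been verified against the rest of the notebook, and pinning compatible versions is the cleaner fix.

```python
import numpy as np

# `np.object` was only ever an alias for the builtin `object`.
# Restore it when the installed NumPy no longer provides it (NumPy 1.24 and later).
if not hasattr(np, "object"):
    np.object = object
```

Run this once, before any cell that reads rows from the formatted dataset.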

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
num_labels = len(unsmile_labels)  # number of labels

model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="multi_label_classification"
)
# Store the mapping between class indices and label names in the model config.
model.config.id2label = {i: label for i, label in enumerate(unsmile_labels)}
model.config.label2id = {label: i for i, label in enumerate(unsmile_labels)}

In [None]:
model.config.label2id

# 3. Model training

In [None]:
from sklearn.metrics import label_ranking_average_precision_score

In [None]:
def compute_metrics(x):
    return {
        'lrap': label_ranking_average_precision_score(x.label_ids, x.predictions),
    }
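
Label ranking average precision (LRAP) asks, for each true label of a sample, what fraction of the labels ranked at or above it are also true, and then averages the result. It depends only on the ordering of the scores, which is why `compute_metrics` can pass raw model outputs without applying a sigmoid first. The plain-Python function `lrap` below is an illustrative sketch of that definition, not scikit-learn's implementation:

```python
def lrap(y_true, y_score):
    """Label ranking average precision for multi-hot targets (illustrative sketch)."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [i for i, t in enumerate(truth) if t == 1]
        # Samples with no true label, or with every label true, count as a perfect score.
        if not relevant or len(relevant) == len(truth):
            total += 1.0
            continue
        sample = 0.0
        for i in relevant:
            # Rank of label i: how many labels score at least as high as it does.
            rank = sum(1 for s in scores if s >= scores[i])
            # How many of those labels are themselves true labels.
            hits = sum(1 for j in relevant if scores[j] >= scores[i])
            sample += hits / rank
        total += sample / len(relevant)
    return total / len(y_true)


# Two samples with three labels each (made-up values).
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(lrap(y_true, y_score))  # (1/2 + 1/3) / 2, about 0.4167
```

A value of 1.0 means every true label is ranked above every false label in all samples.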

In [None]:
batch_size = 16

In [None]:
args = TrainingArguments(
    output_dir="model_output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='lrap',
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
trainer.train()

In [None]:
trainer.save_model()

# 4. Model test

To use the model you trained yourself, run the code below.

In [None]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=0,     # cpu: -1, gpu: gpu number
    return_all_scores=True,
    function_to_apply='sigmoid'
)

To use the already-trained model instead, run the code below.

In [None]:
from transformers import TextClassificationPipeline, BertForSequenceClassification, AutoTokenizer


model_name = 'smilegate-ai/kor_unsmile'


model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = TextClassificationPipeline(
     model=model,
     tokenizer=tokenizer,
     device=0,     # cpu: -1, gpu: gpu number
     return_all_scores=True,
     function_to_apply='sigmoid'
     )

In [None]:
for result in pipe("이래서 여자는 게임을 하면 안된다")[0]:
    print(result)

{'label': '여성/가족', 'score': 0.8253052234649658}
{'label': '남성', 'score': 0.0397251695394516}
{'label': '성소수자', 'score': 0.012144332751631737}
{'label': '인종/국적', 'score': 0.023181892931461334}
{'label': '연령', 'score': 0.010315303690731525}
{'label': '지역', 'score': 0.018454886972904205}
{'label': '종교', 'score': 0.011270025745034218}
{'label': '기타 혐오', 'score': 0.020734025165438652}
{'label': '악플/욕설', 'score': 0.057331427931785583}
{'label': 'clean', 'score': 0.14010529220104218}
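
Because the pipeline applies a sigmoid to each label independently, the scores are per-label probabilities and need not sum to 1. A minimal sketch of turning the output above into predicted label names follows; the 0.5 threshold is an assumption chosen for illustration:

```python
# Scores copied from the pipeline output above.
scores = {
    "여성/가족": 0.8253052234649658,
    "남성": 0.0397251695394516,
    "성소수자": 0.012144332751631737,
    "인종/국적": 0.023181892931461334,
    "연령": 0.010315303690731525,
    "지역": 0.018454886972904205,
    "종교": 0.011270025745034218,
    "기타 혐오": 0.020734025165438652,
    "악플/욕설": 0.057331427931785583,
    "clean": 0.14010529220104218,
}

threshold = 0.5  # assumed cut-off; tune it on validation data for real use

# Keep every label whose independent probability exceeds the threshold.
predicted = [label for label, score in scores.items() if score > threshold]
print(predicted)  # ['여성/가족']

# Unlike softmax outputs, the sigmoid scores here add up to more than 1.
print(sum(scores.values()) > 1.0)  # True
```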


In [None]:
from huggingface_hub import InferenceApi

In [None]:
import os

# Read the access token from the environment; never hard-code a real token in a notebook.
api_key = os.environ.get("HF_TOKEN")
inference = InferenceApi(repo_id="distilbert-base-uncased", token=api_key)

# 6. Serving the model as an API

In [None]:
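# app.py
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline

# Load the pre-trained model once at startup so that app.py also runs
# outside the notebook session, where `pipe` would otherwise be undefined.
model_name = 'smilegate-ai/kor_unsmile'
pipe = TextClassificationPipeline(
    model=BertForSequenceClassification.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    device=0 if torch.cuda.is_available() else -1,  # cpu: -1, gpu: gpu number
    return_all_scores=True,
    function_to_apply='sigmoid',
)

# Initialize the FastAPI application
app = FastAPI()

# Data model for the request body
class TextInput(BaseModel):
    text: str

# API endpoint. A plain `def` lets FastAPI run the blocking model call in a worker thread.
@app.post("/predict_unsmile/")
def predict_unsmile(payload: TextInput):
    try:
        # Run the model prediction
        results = pipe(payload.text)
    except Exception as e:
        # Report failures as HTTP 500 instead of a 200 response with an error field
        raise HTTPException(status_code=500, detail=str(e))
    return {"input_text": payload.text, "predictions": results}

# How to run:
# 1. In a terminal: `pip install fastapi uvicorn transformers torch`
# 2. Run `uvicorn app:app --host 0.0.0.0 --port 8000`
# 3. Open `http://127.0.0.1:8000/docs` in a web browser to view the API docs and test the endpoint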
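
Besides the interactive docs page, the endpoint can be called from any HTTP client. The sketch below uses only the Python standard library; the address assumes the server was started locally with the `uvicorn` command above, and `build_request` and `predict` are hypothetical helper names:

```python
import json
import urllib.request

# Assumed address: the local host and port used in the uvicorn command.
API_URL = "http://127.0.0.1:8000/predict_unsmile/"


def build_request(text):
    """Build a POST request whose JSON body matches the `TextInput` model."""
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text):
    """Send the text to the running server and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text)) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the server to be running):
# print(predict("이래서 여자는 게임을 하면 안된다"))
```

The request body has a single `text` field because that is the only field declared on `TextInput`.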

# 5. Model evaluation

In [None]:
def get_predicted_label(output_labels, min_score):
    """Convert per-label scores to a multi-hot list: 1 if score > min_score, else 0."""
    return [1 if label['score'] > min_score else 0 for label in output_labels]

In [None]:
import tqdm
from transformers.pipelines.base import KeyDataset

predicted_labels = []

# Run the pipeline over every validation sentence and threshold the scores at 0.5.
for out in tqdm.tqdm(pipe(KeyDataset(dataset['valid'], '문장'))):
    predicted_labels.append(get_predicted_label(out, 0.5))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(dataset['valid']['labels'], predicted_labels))