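### Document classification model
Document classification is the task of assigning a given document to a category.  
A typical example is deciding whether a movie review is positive or negative.  
This exercise classifies reviews from the Naver movie review corpus (NSMC).  
Classifying the polarity of a sentence in this way is called sentiment analysis.  
The sentence is tokenized, and the special tokens [CLS] and [SEP], which mark its start and end, are attached to the front and back.  
The result is fed into a BERT model to obtain a sentence-level vector, and a small additional module on top of that vector classifies the sentence.  
Training a pretrained model together with such an added module is called fine-tuning.  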
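
The input format described above can be sketched in plain Python. This is only an illustration: `build_inputs` is a hypothetical helper, not part of ratsnlp or transformers, and the special-token IDs (2 for [CLS], 3 for [SEP], 0 for [PAD]) are read off the feature printout further down in this notebook.

```python
# Hypothetical helper (not a ratsnlp/transformers API): wrap token IDs with
# [CLS]/[SEP], then truncate and pad to a fixed length.
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0  # IDs as they appear in the feature printout

def build_inputs(token_ids, max_seq_length):
    # Reserve two positions for [CLS] and [SEP], truncating the sentence if needed.
    body = token_ids[: max_seq_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    pad = max_seq_length - len(input_ids)
    input_ids += [PAD_ID] * pad             # pad up to max_seq_length
    attention_mask += [0] * pad             # 0 = padding position
    return input_ids, attention_mask

ids, mask = build_inputs([2170, 832, 5045], max_seq_length=8)
print(ids)   # [2, 2170, 832, 5045, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```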


In [1]:
!pip install ratsnlp
from google.colab import drive
drive.mount('/gdrive',force_remount = True)

Mounted at /gdrive


In [27]:
# In this exercise, the kcbert-base model is fine-tuned on the NSMC data.
import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments
args = ClassificationTrainArguments(
    pretrained_model_name = "beomi/kcbert-base", # pretrained model
    downstream_corpus_name = "nsmc", # name of the downstream dataset
    # Directory where the fine-tuned checkpoint is saved. Drive is mounted at '/gdrive' above,
    # so this relative path (no leading slash) resolves under the working directory, not on Drive.
    downstream_model_dir = "gdrive/My Drive/nlpbook/checkpoint-doccls",
    batch_size = 32 if torch.cuda.is_available() else 4,
    learning_rate = 5e-5,
    max_seq_length = 128, # maximum input length, in tokens
    epochs = 1,
    seed = 42
)

In [28]:
from ratsnlp import nlpbook
nlpbook.set_seed(args)

set seed: 42
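
Fixing the seed makes random choices such as shuffling repeatable across runs. The sketch below shows the idea with Python's standard `random` module only. It does not reproduce what `nlpbook.set_seed` does internally, which is assumed (not verified here) to seed the other random number generators in use as well.

```python
import random

def shuffled_indices(n, seed):
    # A dedicated generator seeded with a fixed value always yields the same order.
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)  # True: same seed, same shuffle
```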


In [4]:
# Set up the logger that prints the various log messages
nlpbook.set_logger(args)

INFO:ratsnlp:Training/evaluation parameters ClassificationTrainArguments(pretrained_model_name='beomi/kcbert-base', downstream_task_name='document-classification', downstream_corpus_name='nsmc', downstream_corpus_root_dir='/content/Korpora', downstream_model_dir='gdrive/My Drive/nlpbook/checkpoint-doccls', max_seq_length=128, save_top_k=1, monitor='min val_loss', seed=42, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=3, batch_size=32, cpu_workers=2, fp16=False, tpu_cores=0)

In [5]:
from Korpora import Korpora
Korpora.fetch(
    corpus_name = args.downstream_corpus_name,
    root_dir = args.downstream_corpus_root_dir,
    force_download = True
)# download the dataset

[nsmc] download ratings_train.txt: 14.6MB [00:00, 41.5MB/s]                            
[nsmc] download ratings_test.txt: 4.90MB [00:00, 33.3MB/s]                            


In [29]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,
    do_lower_case = False
)# tokenizer that matches the pretrained model

In [7]:
# Build the training dataset that the data loader will feed to the model in batches
from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset
corpus = NsmcCorpus() # reads the NSMC data as sentences and labels
train_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'train'
)

INFO:ratsnlp:Creating features from dataset file at /content/Korpora/nsmc
INFO:ratsnlp:loading train data... LOOKING AT /content/Korpora/nsmc/ratings_train.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 71.607 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 아 더빙.. 진짜 짜증나네요 목소리
INFO:ratsnlp:tokens: [CLS] 아 더 ##빙 . . 진짜 짜증나네 ##요 목소리 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

In [17]:
print(train_dataset.features[0])

ClassificationFeatures(input_ids=[2, 2170, 832, 5045, 17, 17, 7992, 29734, 4040, 10720, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

[How to preprocess your own dataset for a document classification model](ratsgo.github.io/docs/doc_cls/detail)

In [18]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
train_dataloader = DataLoader(
    train_dataset,
    batch_size = args.batch_size,
    sampler = RandomSampler(train_dataset,replacement=False), # draw instances in random order when building batches
    collate_fn = nlpbook.data_collator, # gathers the sampled instances into a batch and converts them to tensors
    drop_last = False,
    num_workers = args.cpu_workers
)

val_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'test'
)
val_dataloader = DataLoader(
    val_dataset,
    batch_size = args.batch_size,
    sampler = SequentialSampler(val_dataset),
    collate_fn = nlpbook.data_collator,
    drop_last = False,
    num_workers = args.cpu_workers
)

INFO:ratsnlp:Creating features from dataset file at /content/Korpora/nsmc
INFO:ratsnlp:loading test data... LOOKING AT /content/Korpora/nsmc/ratings_test.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 19.131 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 굳 ㅋ
INFO:ratsnlp:tokens: [CLS] 굳 ㅋ [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
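
What the two data loaders do can be approximated in plain Python: a random sampler visits every index once in shuffled order, a sequential sampler visits them in order, and a collate function merges the sampled instances into one batch. The helpers `iter_batches` and `collate` below are hypothetical stand-ins, not the PyTorch or ratsnlp implementations, and the batch here is a dict of lists rather than tensors.

```python
import random

def collate(instances):
    # Merge a list of per-example dicts into one dict of lists (a "batch").
    return {key: [inst[key] for inst in instances] for key in instances[0]}

def iter_batches(dataset, batch_size, shuffle, seed=0):
    indices = list(range(len(dataset)))
    if shuffle:  # random sampling without replacement
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # As with drop_last=False, the final, smaller batch is kept.
        yield collate([dataset[i] for i in indices[start:start + batch_size]])

toy_dataset = [{"input_ids": [2, 10 + i, 3], "label": i % 2} for i in range(5)]
batches = list(iter_batches(toy_dataset, batch_size=2, shuffle=False))
print(len(batches))         # 3 (batch sizes 2, 2, 1)
print(batches[0]["label"])  # [0, 1]
```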

In [30]:
# Load the pretrained model
from transformers import BertConfig, BertForSequenceClassification
pretrained_model_config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels = corpus.num_labels,
)
model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config = pretrained_model_config
)

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at beomi/kcbert-base and are newly i
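
The warning about newly initialized weights concerns the classification layer that `BertForSequenceClassification` places on top of the pretrained encoder: a linear layer mapping the sentence-level vector to one score (logit) per label. A toy version in plain Python might look like the following, with a 4-dimensional vector in place of the 768-dimensional one and made-up weights throughout.

```python
def linear_head(sentence_vector, weights, bias):
    # One logit per label: logit_k = w_k . x + b_k
    return [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, bias)
    ]

# Made-up numbers for illustration only.
sentence_vector = [0.5, -1.0, 0.25, 2.0]   # stands in for the 768-dim sentence vector
weights = [[0.1, 0.2, 0.0, -0.3],          # row for label 0 (negative)
           [-0.2, 0.1, 0.4, 0.5]]          # row for label 1 (positive)
bias = [0.0, 0.1]

logits = linear_head(sentence_vector, weights, bias)
predicted_label = max(range(len(logits)), key=logits.__getitem__)
print(predicted_label)  # 1
```

Before fine-tuning these weights are random, which is why the model has to be trained on NSMC before its predictions mean anything.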

In [38]:
# Fine-tune the model (training)
# The task is defined with the LightningModule class provided by PyTorch Lightning.
from ratsnlp.nlpbook.classification import ClassificationTask
task = ClassificationTask(model,args) # pass in the model and the configured arguments


In [49]:

trainer = nlpbook.get_trainer(args)
trainer.fit(
    task,
    train_dataloaders = train_dataloader,
    val_dataloaders  = val_dataloader
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True, used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  rank_zero_warn(
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                          | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M 
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.680   Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

  self.pid = os.fork()


Validation: 0it [00:00, ?it/s]

  self.pid = os.fork()
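
During fine-tuning the quantity being minimized is a classification loss. For single-label classification this is typically the cross-entropy between the model's logits and the gold label. The following is a hand-rolled sketch for a single example, not the code that ratsnlp or PyTorch actually runs.

```python
import math

def cross_entropy(logits, label):
    # -log softmax(logits)[label], using the log-sum-exp trick for stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# The loss is small when the gold label has the largest logit, large otherwise.
print(cross_entropy([2.0, -1.0], label=0) < cross_entropy([2.0, -1.0], label=1))  # True
```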


In [53]:
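# Putting the trained model to use
# Inference means using a trained model to perform the actual task.
# In this exercise, the model trained above is used to build a web service.
# Roughly, the service takes a sentence and answers whether it is positive or negative.
# The sentence is tokenized and turned into model inputs, and the model's output is used to produce the answer.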
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=0)
      (position_embeddings): Embedding(300, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [75]:
def inference_fn(sentence):
  # Tokenize a single sentence, padding/truncating to max_seq_length
  inputs = tokenizer(
      [sentence],
      max_length = args.max_seq_length,
      padding = "max_length",
      truncation = True
  )
  with torch.no_grad():
    outputs = model(**{k : torch.tensor(v) for k,v in inputs.items()})
    # softmax over the label dimension; prob has shape (1, num_labels)
    # (the original line ended with a stray comma, which wrapped prob in a tuple)
    prob = outputs.logits.softmax(dim=1)
    positive_prob = round(prob[0][1].item(),4)  # index 1 = positive
    negative_prob = round(prob[0][0].item(),4)  # index 0 = negative
    # 긍정 = positive, 부정 = negative
    pred = "긍정 (positive)" if torch.argmax(prob) == 1 else "부정 (negative)"
  return {
      'sentence':sentence,
      'prediction':pred,
      'positive_data' : f"긍정 {positive_prob}",
      'negative_data' : f"부정 {negative_prob}",
      'positive_width' : f"긍정 {positive_prob*100}%",
      'negative_width' : f"부정 {negative_prob*100}%",
  }



In [77]:
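# Running this in a Flask web app raised an error.
# Apparently the newer Flask version does not work on Colab, so just check the model's output instead.

result = inference_fn("싫어")  # "싫어" roughly means "I don't like it"
print(result)
# It works as expected.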

tensor([0.9961, 0.0039])
{'sentence': '싫어', 'prediction': '부정 (negative)', 'positive_data': '긍정 0.0039', 'negative_data': '부정 0.9961', 'positive_width': '긍정 0.38999999999999996%', 'negative_width': '부정 99.61%'}
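
The probabilities in this output come from applying softmax to the model's two logits, and the long `0.38999999999999996%` is ordinary floating-point rounding from `positive_prob*100`. A small standalone sketch of both steps follows. The logits are made up, and the `%` format specifier is just one possible way to tidy the string, not what `inference_fn` does.

```python
import math

def softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.0, -2.5])                 # made-up logits: [negative, positive]
label = "positive" if probs[1] > probs[0] else "negative"
print(label)                                 # negative

positive_prob = 0.0039                       # value taken from the output above
print(f"{positive_prob * 100}%")             # 0.38999999999999996%
print(f"{positive_prob:.2%}")                # 0.39%
```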


In [63]:
from ratsnlp.nlpbook.classification import get_web_service_app
from flask import Flask, request, jsonify, render_template
def get_web_service_app(inference_fn, is_colab=True):

    app = Flask(__name__, template_folder='')
    if is_colab:
        from flask_ngrok import run_with_ngrok
        run_with_ngrok(app)
    else:
        from flask_cors import CORS
        CORS(app)

    @app.route('/')
    def index():
        return render_template('index.html')

    @app.route('/api', methods=['POST'])
    def api():
        query_sentence = request.json
        output_data = inference_fn(query_sentence)
        response = jsonify(output_data)
        return response

    return app
app = get_web_service_app(inference_fn)
app.run()

 * Serving Flask app 'ratsnlp.nlpbook.classification.deploy'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
Exception in thread Thread-18:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3
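
Whatever the cause of the connection error above, the request/response contract of the `/api` route can still be exercised without starting a web server. The sketch below uses only the standard library. `handle_api` and `fake_inference_fn` are hypothetical names, and the stub stands in for the real `inference_fn`, which needs the fine-tuned model.

```python
import json

def fake_inference_fn(sentence):
    # Stub standing in for the real inference_fn (which requires the model).
    return {"sentence": sentence, "prediction": "부정 (negative)"}

def handle_api(request_body, inference_fn):
    # Mirrors the /api route: JSON-decoded body in, JSON-encoded result out.
    query_sentence = json.loads(request_body)
    output_data = inference_fn(query_sentence)
    return json.dumps(output_data, ensure_ascii=False)

response = handle_api(json.dumps("싫어"), fake_inference_fn)
print(json.loads(response)["sentence"])  # 싫어
```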

In [51]:
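# Next is document pair classification.
# Given two documents, the task is to classify the relation between them as true, false, or neutral.
# It takes the form: "I went to work." / "I am unemployed." -> false.
# I am not very interested in this part, so I will cover other topics first and return to it later.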
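
For reference, BERT-style models usually receive a sentence pair as a single sequence, `[CLS] first [SEP] second [SEP]`, with `token_type_ids` marking which segment each token belongs to. Below is a minimal sketch with string tokens. `pack_pair` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer.

```python
def pack_pair(premise, hypothesis):
    # Whitespace splitting is a stand-in for a real subword tokenizer.
    first = ["[CLS]"] + premise.split() + ["[SEP]"]
    second = hypothesis.split() + ["[SEP]"]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)  # 0 = first segment, 1 = second
    return tokens, token_type_ids

tokens, token_type_ids = pack_pair("I went to work", "I am unemployed")
print(tokens)
# ['[CLS]', 'I', 'went', 'to', 'work', '[SEP]', 'I', 'am', 'unemployed', '[SEP]']
print(token_type_ids)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```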
