# GPT (Generative Pre-trained Transformer) 2

* Reference: https://github.com/NLP-kr/tensorflow-ml-nlp-tf2

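* The GPT model was proposed by OpenAI
* It is pre-trained with unsupervised learning on a very large natural language corpus, then fine-tuned starting from the learned weights
* Like BERT, GPT is a Transformer model, but BERT uses only the Transformer encoder, while GPT uses only the Transformer decoder (unidirectional, left-to-right attention)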
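
The left-to-right attention mentioned above can be illustrated with a small NumPy sketch. This is a simplified single-head attention with no learned projections, not the actual GPT implementation; the function name `causal_self_attention` and the toy input are assumptions made for illustration.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (no learned projections)."""
    seq_len, dim = x.shape
    # Scaled dot-product scores between every pair of positions
    scores = x @ x.T / np.sqrt(dim)
    # Mask out future positions: position i may only attend to positions <= i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    # Numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # toy sequence: 4 tokens, 8 features each
out, weights = causal_self_attention(x)
print(np.round(weights, 3))  # the upper triangle is zero
```

Because every weight above the diagonal is zero, changing a later token leaves the outputs at earlier positions unchanged, which is what allows the decoder to be trained for next-token prediction.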

* GPT2 improves on GPT1 by moving layer normalization to the input of each sub-block and adding one more layer normalization after the final self-attention block
* GPT2 is also a much larger model than GPT1
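
The change in where layer normalization sits can be sketched as follows. This is a minimal NumPy illustration without the learned gain and bias; `post_ln_block`, `pre_ln_block`, and `toy_sublayer` are hypothetical names, and the sub-layer is a stand-in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # GPT1 style: normalize after the residual connection
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT2 style: normalize the sub-block input, keep the residual path untouched
    return x + sublayer(layer_norm(x))

def toy_sublayer(h):
    # Stand-in for attention or the feed-forward network
    return 2.0 * h

x = np.array([[1.0, 2.0, 3.0, 4.0]])
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)
print(post)
print(pre)
```

In the pre-normalization form the residual path carries `x` through unchanged, so the block output is not itself normalized; this is why a final layer normalization is applied after the last block.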

## Libraries

In [None]:
!pip install transformers==2.11.0
!pip install tensorflow==2.2.0
!pip install sentencepiece==0.1.85
!pip install gluonnlp==0.9.1
!pip install mxnet==1.6.0

Collecting transformers==2.11.0
  Using cached transformers-2.11.0-py3-none-any.whl (674 kB)
Collecting tokenizers==0.7.0 (from transformers==2.11.0)
  Using cached tokenizers-0.7.0.tar.gz (81 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sacremoses (from transformers==2.11.0)
  Using cached sacremoses-0.1.1-py3-none-any.whl (897 kB)
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error

  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for t

## Data Download

* https://raw.githubusercontent.com/NLP-kr/tensorflow-ml-nlp-tf2/master/7.PRETRAIN_METHOD/data_in/KOR/finetune_data.txt

In [None]:
!mkdir -p gpt2
!wget https://raw.githubusercontent.com/NLP-kr/tensorflow-ml-nlp-tf2/master/7.PRETRAIN_METHOD/data_in/KOR/finetune_data.txt \
      -O gpt2/finetune_data.txt


--2024-04-01 09:48:04--  https://raw.githubusercontent.com/NLP-kr/tensorflow-ml-nlp-tf2/master/7.PRETRAIN_METHOD/data_in/KOR/finetune_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24570 (24K) [text/plain]
Saving to: ‘gpt2/finetune_data.txt’


2024-04-01 09:48:04 (29.6 MB/s) - ‘gpt2/finetune_data.txt’ saved [24570/24570]



In [None]:
!pip3 install mxnet-mkl==1.6.0 numpy==1.23.1



In [None]:
import os
import numpy as np

import gluonnlp as nlp
from gluonnlp.data import SentencepieceTokenizer
from nltk.tokenize import sent_tokenize

import tensorflow as tf
from keras.utils import pad_sequences

from transformers import TFGPT2LMHeadModel

## Pre-trained Model

* https://www.dropbox.com/s/nzfa9xpzm4edp6o/gpt_ckpt.zip

In [None]:
!wget https://www.dropbox.com/s/nzfa9xpzm4edp6o/gpt_ckpt.zip -O gpt_ckpt.zip
!unzip -o gpt_ckpt.zip

--2024-04-01 09:48:39--  https://www.dropbox.com/s/nzfa9xpzm4edp6o/gpt_ckpt.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.72.18, 2620:100:6017:18::a27d:212
Connecting to www.dropbox.com (www.dropbox.com)|162.125.72.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/nzfa9xpzm4edp6o/gpt_ckpt.zip [following]
--2024-04-01 09:48:40--  https://www.dropbox.com/s/raw/nzfa9xpzm4edp6o/gpt_ckpt.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucbc5bc2502ef0abc4c659e014c7.dl.dropboxusercontent.com/cd/0/inline/CQNZyBJpXOvdqrs9PgZNmRwWbdEFM5bEaYrUhh8N1HTZnODKu3xh8WlTI9XS6YPt6AhbP6ZGC5vthMEOHlw3VYhEXB8w17fqrTBky4knyQWaqlE2MnkOTT4LSrEYbfRnG3fXlv6To-FtD6K62D99V9s6/file# [following]
--2024-04-01 09:48:40--  https://ucbc5bc2502ef0abc4c659e014c7.dl.dropboxusercontent.com/cd/0/inline/CQNZyBJpXOvdqrs9PgZNmRwWbdEFM5bEaYrUhh8N1HTZnODKu3xh8WlTI9XS6YPt6AhbP6ZGC5vthMEOHlw3VYhEXB8w17fqrTBk

In [None]:
class GPT2Model(tf.keras.Model):
  def __init__(self, dir_path):
    super(GPT2Model, self).__init__()
    self.gpt2=TFGPT2LMHeadModel.from_pretrained(dir_path)

  def call(self, inputs):
    return self.gpt2(inputs)[0]

In [None]:
BASE_MODEL_PATH='./gpt_ckpt'
gpt_model=GPT2Model(BASE_MODEL_PATH)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ./gpt_ckpt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [None]:
from gluonnlp.vocab import BERTVocab

BATCH_SIZE=16
NUM_EPOCHS=10
MAX_LEN=30
TOKENIZER_PATH='./gpt_ckpt/gpt2_kor_tokenizer.spiece'

tokenizer=SentencepieceTokenizer(TOKENIZER_PATH)
vocab=nlp.vocab.BERTVocab.from_sentencepiece(TOKENIZER_PATH,
                                       mask_token=None,
                                       sep_token=None,
                                       cls_token=None,
                                       unknown_token='<unk>',
                                       padding_token='<pad>',
                                       bos_token='<s>',
                                       eos_token='</s>'
                                       )

In [None]:
def tf_top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-99999):
  # logits: 1-D tensor of next-token scores; filtered entries get a very negative value
  _logits=logits.numpy()
  top_k=min(top_k, logits.shape[-1])
  if top_k>0:
    # Remove every token scoring below the k-th largest logit
    indices_to_remove=(logits<tf.math.top_k(logits, top_k)[0][...,-1,None]).numpy()
    _logits[indices_to_remove]=filter_value

  if top_p>0.0:
    sorted_logits=tf.sort(logits, direction='DESCENDING')
    sorted_indices=tf.argsort(logits, direction='DESCENDING')
    cumulative_probs=tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)

    # Remove tokens once the cumulative probability exceeds top_p,
    # shifted right by one so the first token crossing the threshold is kept
    sorted_indices_to_remove=cumulative_probs>top_p
    sorted_indices_to_remove=tf.concat([[False], sorted_indices_to_remove[...,:-1]], axis=0)
    indices_to_remove=tf.boolean_mask(sorted_indices, sorted_indices_to_remove).numpy().tolist()

    _logits[indices_to_remove]=filter_value
  return tf.constant([_logits])

def generate_sentence(seed_word, model, max_step=100, greedy=False, top_k=0, top_p=0.):
  sentence=seed_word
  toked=tokenizer(sentence)

  for _ in range(max_step):
    # Prepend the BOS token id and add a batch dimension: shape (1, seq_len)
    input_ids=tf.constant([vocab[vocab.bos_token],]+vocab[toked])[None,:]
    # Keep only the logits for the next token: shape (1, vocab_size)
    outputs=model(input_ids)[:,-1,:]
    if greedy:
      gen=vocab.to_tokens(tf.argmax(outputs, axis=-1).numpy().tolist()[0])
    else:
      output_logit=tf_top_k_top_p_filtering(outputs[0],top_k=top_k,top_p=top_p)
      gen=vocab.to_tokens(tf.random.categorical(output_logit, 1).numpy().tolist()[0])[0]

    # Stop as soon as the end-of-sentence token is generated
    if gen=='</s>':
      break
    # SentencePiece marks word boundaries with '▁' (U+2581); turn it into a space
    sentence+=gen.replace('▁',' ')
    toked=tokenizer(sentence)

  return sentence
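
The filtering step used for sampling can be checked in isolation on a toy logit vector. The following is a NumPy-only sketch, not the notebook's TensorFlow function: the name `top_k_top_p_filter` is hypothetical, it uses `-inf` as the filter value, and it applies top-p to the logits that survive top-k, which is one common variant.

```python
import numpy as np

def top_k_top_p_filter(logits, top_k=0, top_p=0.0, filter_value=-np.inf):
    """Return a copy of 1-D logits with low-ranked entries set to filter_value."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k > 0:
        k = min(top_k, logits.shape[-1])
        kth_largest = np.sort(logits)[-k]
        logits[logits < kth_largest] = filter_value
    if top_p > 0.0:
        order = np.argsort(-logits)  # indices from highest to lowest logit
        sorted_logits = logits[order]
        probs = np.exp(sorted_logits - sorted_logits[0])
        probs /= probs.sum()
        cumulative = np.cumsum(probs)
        remove = cumulative > top_p
        # Shift right so the first token that crosses top_p is still kept
        remove[1:] = remove[:-1].copy()
        remove[0] = False
        logits[order[remove]] = filter_value
    return logits

toy_logits = [2.0, 1.0, 0.5, -1.0]
kept_by_k = np.isfinite(top_k_top_p_filter(toy_logits, top_k=2))
kept_by_p = np.isfinite(top_k_top_p_filter(toy_logits, top_p=0.9))
print(kept_by_k, kept_by_p)
```

With these toy logits, `top_k=2` keeps the two highest-scoring tokens and `top_p=0.9` keeps the three tokens needed to reach a cumulative probability of 0.9.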


In [None]:
generate_sentence('방금', gpt_model, greedy=True)

RuntimeError: When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", and "alpha". "nbest_size" is enabled only on unigram mode ignored in BPE-dropout. when "nbest_size = -1" , this method samples from all candidates on the lattice instead of nbest segmentations.

## Data Preparation
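
A minimal sketch of how input/target pairs for language-model fine-tuning could be built from one tokenized sentence, assuming the input is the token sequence prefixed with the BOS id and the target is the same sequence followed by the EOS id, both padded to `MAX_LEN`. The ids in `PAD_ID`, `BOS_ID`, and `EOS_ID` and the helper `build_lm_pair` are hypothetical; in the notebook the ids would come from `vocab`.

```python
import numpy as np

MAX_LEN = 30  # same value as in the configuration cell
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # hypothetical ids, for illustration only

def build_lm_pair(token_ids, max_len=MAX_LEN):
    """Build (inputs, targets) where targets are the inputs shifted left by one."""
    token_ids = list(token_ids)[:max_len - 1]  # leave room for BOS / EOS
    inputs = [BOS_ID] + token_ids
    targets = token_ids + [EOS_ID]
    pad = [PAD_ID] * (max_len - len(inputs))
    return np.array(inputs + pad), np.array(targets + pad)

inputs, targets = build_lm_pair([5, 6, 7], max_len=6)
print(inputs)
print(targets)
```

At each position the target is the token that follows the corresponding input token, which matches the next-token objective used during pre-training.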

## Model Training

# GPT2 Naver Movie Review Classification

## Data Download

## Data Preparation

* https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
* https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt


## Model Training

## Model Evaluation