AI Programming - SW Lee

# Lab 06: GPT2 Model for Language Understanding
## Exercise: Building a Korean Chatbot
This exercise is taken from the GitHub repository for "What is Natural Language Processing?" by Wonjoon Yu.<br>
https://github.com/ukairia777/tensorflow-nlp-tutorial

In [1]:
# import the required libraries
RunningInCOLAB = 'google.colab' in str(get_ipython()) #check execution environment

if RunningInCOLAB: # if you run this code in colab
    from tqdm.notebook import tqdm # use tqdm.notebook
else:              # else
    from tqdm import tqdm # use tqdm

import os
os.environ["KERAS_BACKEND"] = "tensorflow" # set the keras backend to tensorflow

import tensorflow as tf
import keras
from transformers import AutoTokenizer # tokenizer
from transformers import TFGPT2LMHeadModel #model

The GPT2 Model transformer for TensorFlow with a language modeling head on top (linear layer with weights tied to the input embeddings).

TensorFlow models in `transformers` accept their inputs either as keyword arguments or gathered in the first positional argument. With the second option, there are three ways to pass the input tensors:

- a single tensor with `input_ids` only and nothing else: `model(input_ids)`
- a list of varying length with one or several input tensors, in the order given in the docstring: `model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
- a dictionary with one or several input tensors keyed by the input names given in the docstring: `model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`

https://huggingface.co/transformers/v3.0.2/index.html

In [2]:
### START CODE HERE ###
# define tokenizer and model
# find & assign tokenizer and model; 'skt/kogpt2-base-v2'

tokenizer = AutoTokenizer.from_pretrained(
    'skt/kogpt2-base-v2',
    bos_token='<s>', eos_token='</s>', unk_token='<unk>',
    pad_token='<pad>', mask_token='<mask>',
    clean_up_tokenization_spaces=True)  # set the special tokens while loading the tokenizer
model = TFGPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2', from_pt=True)  # load KoGPT2 from its PyTorch checkpoint
### END CODE HERE ###

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.83M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/513M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.2.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'lm_head.weight', 'transformer.h.8.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

In [3]:
#summary of created model
model.summary()

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  125164032 
 er)                                                             
                                                                 
Total params: 125164032 (477.46 MB)
Trainable params: 125164032 (477.46 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [4]:
# configuration of created model
model.config

GPT2Config {
  "_name_or_path": "skt/kogpt2-base-v2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "author": "Heewon Jeon(madjakarta@gmail.com)",
  "bos_token_id": 0,
  "created_date": "2021-04-28",
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "license": "CC-BY-NC-SA 4.0",
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 3,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
  

In [5]:
# check special token setting
print(tokenizer.bos_token_id) # begin of sentence is <s>
print(tokenizer.eos_token_id) # end of sentence is </s>
print(tokenizer.pad_token_id) # padding is <pad>
print(tokenizer.unk_token_id) # unknown is <unk>

print('-' * 10)

for i in range(10):
    print(i, tokenizer.decode(i)) # list of token decoding
print(tokenizer.decode(51200)) # Check for undefined token

0
1
3
5
----------
0 <s>
1 </s>
2 <usr>
3 <pad>
4 <sys>
5 <unk>
6 <mask>
7 <d>
8 </d>
9 <unused0>



In [6]:
# import libraries to load datasets from csv file
import pandas as pd
import urllib.request

Load a Korean chatbot dataset created by songys: <br>
https://github.com/songys/Chatbot_data <br>
For more Korean datasets, see: <br>
https://github.com/ko-nlp/Korpora

In [7]:
#Download the dataset from the provided URL and save it as ChatBotData.csv
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv",
                           filename="ChatBotData.csv")
train_data = pd.read_csv('ChatBotData.csv') #Load the downloaded CSV file into a pandas DataFrame

In [8]:
#show loaded datasets
display(train_data)

| | Q | A | label |
|---|---|---|---|
| 0 | 12시 땡! | 하루가 또 가네요. | 0 |
| 1 | 1지망 학교 떨어졌어 | 위로해 드립니다. | 0 |
| 2 | 3박4일 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 3 | 3박4일 정도 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 4 | PPL 심하네 | 눈살이 찌푸려지죠. | 0 |
| ... | ... | ... | ... |
| 11818 | 훔쳐보는 것도 눈치 보임. | 티가 나니까 눈치가 보이는 거죠! | 2 |
| 11819 | 훔쳐보는 것도 눈치 보임. | 훔쳐보는 거 티나나봐요. | 2 |
| 11820 | 흑기사 해주는 짝남. | 설렜겠어요. | 2 |
| 11821 | 힘든 연애 좋은 연애라는게 무슨 차이일까? | 잘 헤어질 수 있는 사이 여부인 거 같아요. | 2 |


In [9]:
# build the tokenized training conversations
def get_chat_data():

    bos_token = tokenizer.bos_token_id          # get token_id for begin of sentence token
    eos_token = tokenizer.eos_token_id          # get token_id for end of sentence token
    unk_token = tokenizer.unk_token_id          # get token_id for unknown word token
    max_token_value = model.config.vocab_size   # range of words that can be processed

    conversations = [] # empty list to append sentences
    for question, answer in zip(train_data.Q.to_list(), train_data.A.to_list()):

        ### START CODE HERE ###

        qna_line = tokenizer.encode('<usr>' + question + '<sys>' + answer)  # encode q & a dialog line

        dialog = [bos_token]        # replace overshooting tokens with unk and enclose with bos and eos
        for token in qna_line:
            if token<max_token_value: # if token is less than max_token_value
                dialog.append(token) # append token
            else:
                dialog.append(unk_token) # append unknown token
        dialog.append(eos_token) # append end of sentence token

        ### END CODE HERE ###

        conversations.append(dialog) # finally insert created sentence
    return conversations # return all sentences
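
The wrapping and clamping step inside `get_chat_data` can be tried in isolation with plain lists. This is a minimal sketch, not the notebook's code: `build_dialog` is a hypothetical helper, the ids 0, 1 and 5 mirror the BOS, EOS and UNK ids printed earlier, and the vocabulary size of 51200 is an illustrative assumption.

```python
def build_dialog(token_ids, bos_id, eos_id, unk_id, vocab_size):
    """Wrap token ids in BOS/EOS and map out-of-range ids to UNK."""
    # Any id the embedding table cannot index is replaced by the UNK id.
    clamped = [tok if tok < vocab_size else unk_id for tok in token_ids]
    return [bos_id] + clamped + [eos_id]


# Toy ids; 60000 stands in for a token outside the assumed vocabulary.
example = build_dialog([2, 9718, 60000, 4, 376],
                       bos_id=0, eos_id=1, unk_id=5, vocab_size=51200)
print(example)  # [0, 2, 9718, 5, 4, 376, 1]
```

Clamping before training avoids an out-of-range lookup in the embedding layer if the tokenizer ever produces an id the model does not know.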


In [10]:
# pad every conversation to the same length with the pad token
chat_data = keras.utils.pad_sequences(get_chat_data(), padding='post', value=tokenizer.pad_token_id)
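
What `padding='post'` does can be sketched without Keras. The helper below is a hypothetical pure-Python stand-in that only covers the case used here (no `maxlen`, no truncation); the pad value 3 mirrors the `<pad>` id printed earlier.

```python
def pad_post(sequences, value):
    """Right-pad every sequence with `value` to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


padded = pad_post([[0, 2, 7, 1], [0, 2, 1]], value=3)
print(padded)  # [[0, 2, 7, 1], [0, 2, 1, 3]]
```

Note that `keras.utils.pad_sequences` pads at the front by default (`padding='pre'`), which is why the notebook sets `padding='post'` explicitly.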

In [11]:
# data shuffling and setting size of batch
buffer = 500
batch_size = 32
#create tensorflow dataset
dataset = tf.data.Dataset.from_tensor_slices(chat_data)
dataset = dataset.shuffle(buffer).batch(batch_size,drop_remainder=True)
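
The effect of `drop_remainder=True` on the number of batches can be checked with a small pure-Python sketch. `make_batches` is a hypothetical helper, not part of `tf.data`, and shuffling is left out on purpose.

```python
def make_batches(items, batch_size, drop_remainder):
    """Split `items` into consecutive batches, optionally dropping a short last batch."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches


rows = list(range(10))
print(len(make_batches(rows, 3, drop_remainder=True)))   # 3
print(len(make_batches(rows, 3, drop_remainder=False)))  # 4
```

With `drop_remainder=True` every batch has the same shape, at the cost of skipping a few samples per epoch. Separately, `shuffle(buffer)` with a buffer smaller than the dataset only shuffles approximately, since each element is drawn from a sliding window of `buffer` items.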

In [12]:
# take one batch and inspect it
for batch in dataset.take(1):
    print(batch.shape) # print size
    print(batch[0])    # print one sample

(32, 47)
tf.Tensor(
[    0     2  9718  7182  7601 10648  8006     4 41664  8102  8084   376
     1     3     3     3     3     3     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3], shape=(47,), dtype=int32)


In [13]:
# show one decoded sentence
str = tokenizer.decode(batch[0])
print(str)

<s><usr> 꽃다발 받았어<sys> 부러워요!</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [14]:
# show one encoded sentence
print(tokenizer.encode(str))

[0, 2, 9718, 7182, 7601, 10648, 8006, 4, 41664, 8102, 8084, 376, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


In [15]:
# set optimizer
adam = keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
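# steps per epoch, used for the progress bar and the loss average
# (with drop_remainder=True the dataset yields len(train_data) // batch_size full batches, so this overcounts by one)
steps = len(train_data) // batch_size + 1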
print(steps)

370


In standard text-generation fine-tuning, the model predicts the next token given the text seen so far, so the labels are the tokenized input shifted by one position. GPT's CLM (causal language model) uses a look-ahead mask so that each position cannot see later tokens, and the Hugging Face GPT-2 language-modeling head shifts the labels inside the model when `labels` is passed. Therefore, we can simply set `labels=input_ids`.
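
The one-position shift between inputs and labels can be illustrated with plain lists. This is only a sketch of the alignment, assuming toy token ids; `next_token_pairs` is a hypothetical helper and not the library's loss code.

```python
def next_token_pairs(token_ids):
    """Pair each position's token with the token the model should predict next."""
    inputs = token_ids[:-1]   # positions that make a prediction
    targets = token_ids[1:]   # the token that follows each position
    return list(zip(inputs, targets))


ids = [0, 2, 9718, 4, 376, 1]
print(next_token_pairs(ids))
# [(0, 2), (2, 9718), (9718, 4), (4, 376), (376, 1)]
```

The last token has nothing to predict and the first token is never a target, so a sequence of length n contributes n - 1 prediction terms to the loss.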

In [16]:
# number of training epochs
EPOCHS = 3

for epoch in range(EPOCHS):
    epoch_loss = 0

    for batch in tqdm(dataset, total=steps):
        with tf.GradientTape() as tape:
            #forward
            ### START CODE HERE ###
            result = model(batch, labels=batch, training=True) # forward pass; the inputs double as labels
            loss = result[0] # the first output is the language-modeling loss
            batch_loss = tf.reduce_mean(loss) # average the loss over the batch

            ### END CODE HERE ###
        #backpropagation
        grads = tape.gradient(batch_loss, model.trainable_variables) # gradient
        adam.apply_gradients(zip(grads, model.trainable_variables)) # optimizer step
        epoch_loss += batch_loss / steps # loss per epoch

    print('[Epoch: {:>4}] cost = {:>.9}'.format(epoch + 1, epoch_loss)) # print current state

  0%|          | 0/370 [00:00<?, ?it/s]

[Epoch:    1] cost = 1.25304067


  0%|          | 0/370 [00:00<?, ?it/s]

[Epoch:    2] cost = 1.00575387


  0%|          | 0/370 [00:00<?, ?it/s]

[Epoch:    3] cost = 0.889541388


In [17]:
# make question example with tokens
text = '오늘도 좋은 하루!'
sent = '<usr>' + text + '<sys>'

In [18]:
# concatenate begin of sentence and encoded question
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent)
input_ids = tf.convert_to_tensor([input_ids]) # convert to type of tensor

In [19]:
# generate the model's response by sampling until </s> or max_length
output = model.generate(input_ids, max_length=50, do_sample=True, eos_token_id=tokenizer.eos_token_id)

In [20]:
# decode output
# split based on '<sys>' and remove end of sentence token
decoded_sentence = tokenizer.decode(output[0].numpy().tolist())
decoded_sentence.split('<sys> ')[1].replace('</s>', '')

'좋은 하루만 있을 거예요.'
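
Splitting on `'<sys> '` and indexing `[1]` raises an `IndexError` if the marker is missing from the decoded text. A slightly more defensive variant might look like the sketch below; `extract_answer` is a hypothetical helper, and the marker strings are the ones used in this notebook.

```python
def extract_answer(decoded, sys_token="<sys>", eos_token="</s>"):
    """Return the text after the first <sys> marker, up to the first </s>."""
    _, found, answer = decoded.partition(sys_token)
    if not found:
        return ""  # no system marker, so there is no answer to extract
    return answer.split(eos_token)[0].strip()


print(extract_answer('<s><usr> 오늘도 좋은 하루!<sys> 축하해요!</s>'))  # 축하해요!
print(repr(extract_answer('<s><usr> 오늘도 좋은 하루!')))  # ''
```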

In [21]:
# top-k sampling: draw each new token from the 10 most likely candidates
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=10)
tokenizer.decode(output[0].numpy().tolist())

'<s><usr> 오늘도 좋은 하루!<sys> 축하해요!</s>'
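
The idea behind `top_k` can be sketched in a few lines of standard-library Python. This is an illustration under simplifying assumptions (a single step, a toy list of logits), not how `generate` is implemented; `top_k_sample` is a hypothetical helper.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.1]
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(100)}
print(samples <= {0, 3})  # True: only the two most likely tokens can be drawn
```

A small `k` keeps the output close to the most likely continuation, while a larger `k` allows more varied but riskier answers.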

In [22]:
# wrap the steps above in a function
def return_answer_by_chatbot(user_text):
    sent = '<usr>' + user_text + '<sys>' # build the prompt
    input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent) # prepend the begin-of-sentence token
    input_ids = tf.convert_to_tensor([input_ids]) # convert to tensor
    output = model.generate(input_ids, max_length=50, do_sample=True, top_k=20) # sample from the 20 most likely tokens at each step
    sentence = tokenizer.decode(output[0].numpy().tolist()) # decode the model's output
    chatbot_response = sentence.split('<sys> ')[1].replace('</s>', '') # keep only the answer and strip the end-of-sentence token
    return chatbot_response # return the answer

In [23]:
return_answer_by_chatbot('안녕! 반가워~') # example

'안녕이네요.'

In [24]:
return_answer_by_chatbot('너는 누구야?')# example

'저는 이 사람입니다.'

In [25]:
return_answer_by_chatbot('나랑 영화보자')# example

'같이 보자고 해보세요.'

In [26]:
return_answer_by_chatbot('너무 심심한데 나랑 놀자')# example

'같이 놀자고 말해보세요.'

In [27]:
return_answer_by_chatbot('영화 해리포터 재밌어?')# example

'영화는 영화처럼 여러명이 함께해요.'

In [28]:
return_answer_by_chatbot('너 딥 러닝 잘해?')# example

'딥 러닝이 가능한지 생각해보세요.'

In [29]:
return_answer_by_chatbot('커피 한 잔 할까?')# example

'제가 마시기 좋은 음료 추천 좀 해주세요.'

(c) 2024 SW Lee