# GPT-2 Fine-Tuning with a New Dataset: GPT-2News

![Breaking News](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fstatic.dnaindia.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Ffull%2Fpublic%2F2017%2F06%2F02%2F580712-breaking-news.jpg&f=1&nofb=1&ipt=98058e9d5b86187d7f576d67f85f612bdd58c3b5ca30bb60bca7750b4846be90)

# Overview
- This project fine-tunes GPT-2 with the LoRA (Low-Rank Adaptation) method.
- I used the [fancyzhx/ag_news](https://huggingface.co/datasets/fancyzhx/ag_news) dataset to train the model to mimic news-style writing.


In [1]:
# datasets: Hugging Face library for loading open datasets
from datasets import load_dataset

# Load the AG News dataset from the Hugging Face Hub.
# This returns a `DatasetDict` object.
dataset = load_dataset("fancyzhx/ag_news")


In [2]:
dataset # check dataset information

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [3]:
from datasets import DatasetDict # DatasetDict class import

# We don't need the whole dataset for fine-tuning; about 2,000 to 3,000 examples are enough.
# So we shuffle (for more variety) and select a subset.
train_small = dataset["train"].shuffle(seed=42).select(range(2000))
valid_small = dataset["test"].shuffle(seed=42).select(range(500))

# Build a new DatasetDict from the two subsets
small_dataset = DatasetDict({
  "train": train_small,
  "validation": valid_small
})

In [4]:
print(small_dataset) # check new dataset information

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})


In [5]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer from the pre-trained GPT-2 checkpoint
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2, a causal LM, ships without a pad token, so the EOS token is reused for padding.
tokenizer.pad_token = tokenizer.eos_token

# Tokenization function to apply to the DatasetDict
def tokenize_function(examples):
  # `truncation=True`: sequences longer than `max_length` are cut down to 128 tokens.
  # `padding="max_length"`: shorter sequences are filled with the pad token up to 128 tokens.
  # `max_length=128`: maximum sequence length in tokens.
  # To keep fine-tuning fast, I cap sequences at 128 tokens.
  result = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

  # Add a `labels` column so the model can compute the loss and update its weights.
  result["labels"] = result["input_ids"].copy()

  return result
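
One caveat with copying `input_ids` straight into `labels`: because the pad token is the EOS token, padded positions also count toward the loss. A common convention in Hugging Face causal-LM training is to set those label positions to `-100`, which the cross-entropy loss ignores. The sketch below shows the idea on plain Python lists. `mask_pad_labels` is a hypothetical helper that is not part of this notebook, and the losses reported later were produced without it.

```python
def mask_pad_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch with arbitrary token ids; 50256 is GPT-2's EOS id, reused as the pad token.
batch = {
    "input_ids": [[15496, 995, 50256, 50256], [40, 588, 1705, 13]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch["labels"] = mask_pad_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])
```

Inside `tokenize_function`, the same helper could be applied to `result["input_ids"]` and `result["attention_mask"]` in place of the plain `.copy()`, at the cost of loss values that are no longer comparable with the ones in this notebook.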


In [6]:
# Build a new DatasetDict tokenized by `tokenize_function` (using the GPT-2 tokenizer).
# `DatasetDict.map(fn)` applies `fn` to every split in the `DatasetDict`.
# `batched=True` means `fn` receives a batch of rows at a time instead of a single row.
# `remove_columns=["text", "label"]` drops the original columns from the result.
tokenized_datasets = small_dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])

# The new dataset has three columns: `input_ids`, `attention_mask`, and `labels`.
# `input_ids`: vocabulary index of each token (the row of the embedding matrix to look up).
# `attention_mask`: 0/1 vector marking real tokens (1) and padding (0) for attention.
# `labels`: copy of `input_ids`, used as the prediction targets.
# -------- column example ----------
# | input_ids | attention_mask | labels |
# | ---- | ---- | ---- |
# | [101, 12, 18, ...] | [1, 1, 1, 1, ...] | [101, 12, 18, ...] |
print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
})


In [7]:
from peft import LoraConfig, get_peft_model # LoRA classes from the PEFT (Parameter-Efficient Fine-Tuning) library
from transformers import GPT2LMHeadModel

# Load the pre-trained GPT-2 causal LM
model = GPT2LMHeadModel.from_pretrained("gpt2")

# LoraConfig holds the LoRA (Low-Rank Adaptation) hyperparameters
lora_config = LoraConfig(
  r=8, # rank of the LoRA update matrices; kept low for efficiency
  lora_alpha=16, # scaling numerator: the adapter update is multiplied by (lora_alpha / r), here 2
  target_modules=["c_attn"], # c_attn is GPT-2's fused query/key/value projection in self-attention (not cross-attention). Table 5 of 'LoRA: Low-Rank Adaptation of Large Language Models' (2021) compares which attention weights to adapt.
  lora_dropout=0.05, # dropout probability on the LoRA path (5%)
  bias="none", # do not train bias terms; keep the original biases and update only the LoRA weights
  task_type="CAUSAL_LM" # causal language modeling, matching GPT-2
)

# Wrap GPT-2 into a PEFT model using the LoRA config
model = get_peft_model(model, lora_config)
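
To get a feel for how small the adapter is, the trainable parameter count can be estimated by hand. The sketch below is only arithmetic and assumes the GPT-2 small architecture (hidden size 768, 12 transformer blocks, `c_attn` mapping 768 to 3 x 768) and a base parameter count of 124,439,808; `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(in_features, out_features, r, num_layers):
    """Parameters added by LoRA: A is (r x in) and B is (out x r), per adapted layer."""
    return num_layers * r * (in_features + out_features)

# Assumed GPT-2 small dimensions (not read from the model in this sketch).
hidden = 768
num_layers = 12
r, lora_alpha = 8, 16

trainable = lora_param_count(hidden, 3 * hidden, r, num_layers)
base_params = 124_439_808  # assumed size of the base "gpt2" checkpoint
scaling = lora_alpha / r   # factor applied to the low-rank update

print(f"trainable LoRA params: {trainable:,}")
print(f"share of all params:   {100 * trainable / (base_params + trainable):.4f}%")
print(f"scaling (alpha / r):   {scaling}")
```

On the real model, PEFT's `model.print_trainable_parameters()` is the way to check these numbers rather than relying on the estimate.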




In [8]:
from transformers import TrainingArguments, Trainer # classes for training transformer models

# TrainingArguments holds the settings used during training
training_args = TrainingArguments(
  output_dir="./results", # directory where training outputs are written
  # When to run evaluation: "epoch" evaluates at the end of every epoch.
  # `eval_strategy` accepts "no", "steps", or "epoch".
  eval_strategy="epoch",
  learning_rate=2e-4, # learning rate for the LoRA adapter weights; set higher than is typical for full fine-tuning
  per_device_train_batch_size=4, # training batch size per device
  per_device_eval_batch_size=4, # evaluation batch size per device
  num_train_epochs=5, # train for 5 epochs
  weight_decay=0.01, # weight decay coefficient of 0.01, to reduce overfitting
  logging_dir="./logs", # directory where training logs (such as loss values) are written
  save_strategy="no", # do not save intermediate checkpoints
  report_to="none" # do not report to third-party trackers
)

# Trainer runs the training loop for a model with the given training arguments
trainer = Trainer(
  model=model, # model to train
  args=training_args, # training arguments
  train_dataset=tokenized_datasets["train"], # training set
  eval_dataset=tokenized_datasets["validation"], # validation set
)

In [9]:
trainer.train() # train model

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
| ---- | ---- | ---- |
| 1 | 2.0447 | 1.585969 |
| 2 | 1.5957 | 1.530692 |
| 3 | 1.5501 | 1.501412 |
| 4 | 1.5275 | 1.490116 |
| 5 | 1.5158 | 1.487668 |
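
Validation loss is easier to read as perplexity, which is the exponential of the mean per-token cross-entropy. The sketch below converts the validation losses from the training log above. Note that the labels in this notebook include padded positions, so these values are probably lower than a padding-free evaluation would give and should not be compared with published GPT-2 perplexities.

```python
import math

# Validation losses copied from the training log above (epochs 1 to 5).
val_losses = [1.585969, 1.530692, 1.501412, 1.490116, 1.487668]

# Perplexity is exp(mean cross-entropy loss).
perplexities = [math.exp(loss) for loss in val_losses]

for epoch, (loss, ppl) in enumerate(zip(val_losses, perplexities), start=1):
    print(f"epoch {epoch}: val_loss={loss:.4f}  perplexity={ppl:.2f}")
```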


TrainOutput(global_step=2500, training_loss=1.646754296875, metrics={'train_runtime': 318.2826, 'train_samples_per_second': 31.419, 'train_steps_per_second': 7.855, 'total_flos': 655495004160000.0, 'train_loss': 1.646754296875, 'epoch': 5.0})
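
The `global_step=2500` in the output follows directly from the dataset size and the training arguments. A quick check, assuming a single device and no gradient accumulation (neither is set explicitly in the notebook):

```python
import math

train_examples = 2000   # rows in the training split
batch_size = 4          # per_device_train_batch_size
epochs = 5              # num_train_epochs
num_devices = 1         # assumption: one GPU, no gradient accumulation

steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 500 2500
```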

In [11]:
merged_model = model.merge_and_unload() # merge the LoRA adapter weights into the base model and return a plain model

merged_model.save_pretrained("./gpt2-news-merged") # save the merged model to disk
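
Conceptually, merging folds the low-rank update into the frozen weight, `W_merged = W + (lora_alpha / r) * B @ A`, so inference needs a single matrix multiply per layer. The toy sketch below checks this identity with tiny hand-made matrices in plain Python. It illustrates the math only, ignores dropout (inactive at inference), and does not reproduce PEFT's internals or GPT-2's weight layout; all matrices and helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]

r, lora_alpha = 1, 2
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
B = [[0.5], [1.0]]             # LoRA B matrix (2 x r)
A = [[2.0, -1.0]]              # LoRA A matrix (r x 2)
x = [[3.0], [4.0]]             # input column vector

scale = lora_alpha / r
# Adapter path: base output plus the scaled low-rank update.
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path: fold the update into the weight once, then use a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output, which is why the merged checkpoint can be loaded later without the `peft` library.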

In [12]:
from transformers import AutoModelForCausalLM, GPT2Tokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("./gpt2-news-merged") # load the merged model through the Auto class
tokenizer = GPT2Tokenizer.from_pretrained("gpt2") # the tokenizer is unchanged from GPT-2

# Build a generator with the pipeline API
generator = pipeline("text-generation", # task: text generation
                     model=model, # model: gpt2-news-merged
                     tokenizer=tokenizer) # GPT-2 tokenizer: encodes the prompt and decodes generated `input_ids` back into text

Device set to use cuda:0


In [13]:
prompt = "Breaking News" # the model continues generating from this text

outputs = generator(prompt, # prompt
                    max_length=128, # intended maximum length in tokens; the log below shows the default `max_new_tokens` taking precedence
                    num_return_sequences=3, # number of sequences to generate
                    do_sample=True, # sample instead of greedy decoding
                    top_k=50) # sample only from the 50 most probable tokens at each step

# Print the generated outputs
for i, out in enumerate(outputs):
    print(f"--- Generated #{i+1} ---")
    print(out["generated_text"])
    print()
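
The `top_k=50` setting restricts sampling at each step to the 50 highest-scoring tokens. The sketch below shows that idea in plain Python on a made-up 5-token vocabulary with `k=3`. `top_k_sample` is a hypothetical helper written for illustration and is not how `transformers` implements it internally.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one index according to the renormalized probabilities.
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))

rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # made-up scores for a 5-token vocabulary
token_id, kept = top_k_sample(toy_logits, k=3, rng=rng)
print(token_id, kept)
```

Tokens outside the top `k` get zero probability, which trims the unlikely tail while still leaving room for variety among the three returned sequences.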

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


--- Generated #1 ---
Breaking News: New York Times Is In The Business Of Selling Paper for Profit

--- Generated #2 ---
Breaking News: A New York Times report from the United States indicates that President Barack Obama will be returning to his post as Secretary of State in his first week in office.

--- Generated #3 ---
Breaking News & Rumors A report from The Daily Mail has claimed that former Conservative MP David Cameron has been suspended from the party after a controversial speech in which he accused the party of being anti-Islamic.



# GPT-2 vs GPT-2**News**

In [16]:
from transformers import GPT2LMHeadModel, AutoModelForCausalLM, GPT2Tokenizer, pipeline

# ---------- GPT2 ------------
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# ---------- GPT2News ----------
model = AutoModelForCausalLM.from_pretrained("./gpt2-news-merged")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2news_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# --------- Test --------------
prompt = "Seoul, South Korea"

gpt2_outputs = gpt2_generator(prompt, max_length=128, num_return_sequences=3, do_sample=True, top_k=50)
gpt2news_outputs = gpt2news_generator(prompt, max_length=128, num_return_sequences=3, do_sample=True, top_k=50)

# 1. Print the generation results of the original GPT-2
print("=============================================")
print("          🤖 Original GPT-2 Results          ")
print("=============================================")
print(f"PROMPT: '{prompt}'\n")

for i, output in enumerate(gpt2_outputs):
    print(f"--- Generated #{i+1} ---")
    # Pipeline results are dictionaries, so access the text with the 'generated_text' key.
    print(output['generated_text'])
    print() # blank line

# 2. Print the generation results of the fine-tuned GPT-2 News model
print("\n=============================================")
print("      ✨ Fine-tuned 'GPT-2 News' Results      ")
print("=============================================")
print(f"PROMPT: '{prompt}'\n")

for i, output in enumerate(gpt2news_outputs):
    print(f"--- Generated #{i+1} ---")
    print(output['generated_text'])
    print() # blank line

Device set to use cuda:0
Device set to use cuda:0

          🤖 Original GPT-2 Results          
PROMPT: 'Seoul, South Korea'

--- Generated #1 ---
Seoul, South Korea, July 31, 2017. REUTERS/Kim Hong-Ji

The new law comes as Beijing and Seoul have increasingly focused on the so-called "comfort women" - women who are forced to perform sex acts under pressure from their husbands.

The South Korean government said the legislation had been drafted over the past two years and aimed not only at the most vulnerable, but also those with pre-existing mental health problems.

"The law should be used to punish those who have such disabilities. We are also concerned about the psychological impact on their families," a ministry statement said.

"We will use it as a deterrent against those who dare to call themselves 'comfort women,'" it said.

Many men have been forced to perform sex acts under pressure from their wives, who are not protected under the law, according to the rights group.

In January, North Korea's ministry of women's affairs publish