# GPT-2 Fine-Tuning Model with New Dataset: GPT-2News

![Breaking News](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fstatic.dnaindia.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Ffull%2Fpublic%2F2017%2F06%2F02%2F580712-breaking-news.jpg&f=1&nofb=1&ipt=98058e9d5b86187d7f576d67f85f612bdd58c3b5ca30bb60bca7750b4846be90)

# Overview
- The goal of this project is to fine-tune GPT-2 with the LoRA method.
- I used the [fancyzhx/ag_news](https://huggingface.co/datasets/fancyzhx/ag_news) dataset to train the model so that it mimics news-style writing.


In [17]:
# datasets: Hugging Face library for open datasets
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
# This returns a `DatasetDict` object
dataset = load_dataset("fancyzhx/ag_news")

In [18]:
dataset # check dataset information

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [19]:
from datasets import DatasetDict # import the DatasetDict class

# We don't need the whole dataset for fine-tuning; about 2,000 to 3,000 examples are enough.
# So we shuffle (for more variety) and select a subset.
train_small = dataset["train"].shuffle(seed=42).select(range(2000))
valid_small = dataset["test"].shuffle(seed=42).select(range(500))

# Build a new DatasetDict
small_dataset = DatasetDict({
  "train": train_small,
  "validation": valid_small
})

In [20]:
print(small_dataset) # check new dataset information

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})


In [21]:
from transformers import GPT2Tokenizer

# Load the GPT2Tokenizer from pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Use the EOS token for padding. GPT-2 has no pad token because it is a causal LM.
tokenizer.pad_token = tokenizer.eos_token

# Tokenization function to apply to the DatasetDict
def tokenize_function(examples):
  # `truncation=True`: if an example has more tokens than max_length, the tokenizer cuts it down to 128 tokens.
  # `padding="max_length"`: if an example has fewer tokens than max_length, the tokenizer fills the rest with the pad token.
  # `max_length=128`: maximum number of tokens per example.
  # To keep fine-tuning fast, I limit each example to 128 tokens.
  result = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

  # Add a `labels` column so the model can compute the loss and update its weights
  result["labels"] = result["input_ids"].copy()

  return result
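
One caveat with copying `input_ids` into `labels`: because the pad token is the EOS token, padded positions also count toward the loss. A common alternative is to set the label to `-100` at padded positions, since PyTorch's cross-entropy loss ignores that index by default. The block below is a minimal pure-Python sketch of that idea, not what this notebook ran. The helper `pad_and_mask` and the toy token IDs are hypothetical; only the pad ID 50256 comes from the generation logs further down.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad one sequence and build labels that ignore padding."""
    ids = token_ids[:max_length]          # truncation
    n_pad = max_length - len(ids)         # number of padding positions needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # real tokens keep their ID as the target; padding is excluded from the loss
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


# Toy example: three "real" tokens padded to length 6 with GPT-2's EOS ID
example = pad_and_mask([15496, 995, 13], max_length=6, pad_id=50256)
print(example)
```

With this variant the reported loss would cover only real tokens, so its values would not be directly comparable to the losses logged in this notebook.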

In [22]:
# Build a new DatasetDict tokenized by tokenize_function (using the GPT-2 tokenizer).
# `DatasetDict.map(fn)` applies the function `fn` to every row of each `Dataset` in the `DatasetDict`.
# `batched=True` means `fn` receives a batch of rows per call.
# `remove_columns=["text", "label"]` drops those columns from the dataset.
tokenized_datasets = small_dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])

# The new datasets have three columns: `input_ids`, `attention_mask`, and `labels`.
# `input_ids`: token IDs, i.e. row indices into the model's embedding table.
# `attention_mask`: vector of 0s and 1s marking which positions attention should use (1) or ignore (0, padding)
# `labels`: copy of `input_ids`, used as the targets for the loss
# -------- column example ----------
# | input_ids | attention_mask | labels |
# | ---- | ---- | ---- |
# | [101, 12, 18, ...] | [1, 1, 1, 1, ...] | [101, 12, 18, ...] |
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
})


In [23]:
from peft import LoraConfig, get_peft_model # LoRA classes from the PEFT (Parameter-Efficient Fine-Tuning) library
from transformers import GPT2LMHeadModel

# Load the pre-trained GPT-2 causal LM
model = GPT2LMHeadModel.from_pretrained("gpt2")

# To use LoRA (Low-Rank Adaptation), set its hyperparameters with the LoraConfig class
lora_config = LoraConfig(
  r=8, # LoRA rank. I keep it low for efficiency
  lora_alpha=16, # scaling for the adapter; the update is multiplied by (lora_alpha / r), which controls how strongly the LoRA adapter influences the output
  target_modules=["c_attn"], # c_attn is GPT-2's fused query/key/value projection in self-attention (not cross-attention). Table 5 of 'LoRA: Low-Rank Adaptation of Large Language Models' (2021) compares which attention weights to adapt.
  lora_dropout=0.05, # dropout probability on the LoRA path. I set 5%, which seemed suitable.
  bias="none", # don't train bias terms; keep the original biases. Updating the weights is sufficient for our purpose.
  task_type="CAUSAL_LM" # set to `CAUSAL_LM` to match GPT-2's task
)

# Wrap GPT-2 in a PEFT model using the LoraConfig
model = get_peft_model(model, lora_config)
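
To get a feel for how small the adapter is, the arithmetic below counts the parameters LoRA adds for this configuration. It is a rough sketch that assumes the standard GPT-2 small dimensions (hidden size 768, 12 layers, `c_attn` mapping 768 to 3 × 768 features) and that LoRA adds one `r × in` matrix and one `out × r` matrix per targeted layer. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


n_embd, n_layer = 768, 12                        # GPT-2 small (assumed dimensions)
in_features, out_features = n_embd, 3 * n_embd   # c_attn: hidden -> fused q, k, v
r, lora_alpha = 8, 16                            # values from the LoraConfig above

per_layer = lora_param_count(in_features, out_features, r)
total_trainable = per_layer * n_layer
frozen_c_attn = in_features * out_features       # frozen c_attn weight per layer
scaling = lora_alpha / r                         # multiplier applied to the adapter output

print(f"LoRA params per layer: {per_layer}")
print(f"LoRA params in total:  {total_trainable}")
print(f"Frozen c_attn weights per layer: {frozen_c_attn}")
print(f"Adapter scaling (lora_alpha / r): {scaling}")
```

On the real model, PEFT's `model.print_trainable_parameters()` can be used to check the trainable count directly.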



In [24]:
from transformers import TrainingArguments, Trainer # classes for training transformer models

# TrainingArguments holds the arguments used during training
training_args = TrainingArguments(
  output_dir="./results", # directory where training outputs are stored.
  # When to run evaluation. "epoch" means the model is evaluated at the end of every epoch.
  # Valid values for `eval_strategy` are "no", "steps", and "epoch"
  eval_strategy="epoch",
  learning_rate=2e-4, # learning rate for updating the LoRA adapter weights. It is set higher than is typical for full fine-tuning
  per_device_train_batch_size=4, # training batch size
  per_device_eval_batch_size=4, # evaluation batch size
  num_train_epochs=5, # train for 5 epochs
  weight_decay=0.01, # weight decay coefficient of 0.01, to reduce overfitting.
  logging_dir="./logs", # directory where training logs (e.g. loss over time) are stored.
  save_strategy="no", # don't save checkpoints
  report_to="none" # don't report to third-party loss trackers
)

# Trainer trains the given model with the training arguments
trainer = Trainer(
  model=model, # the model to train
  args=training_args, # training arguments
  train_dataset=tokenized_datasets["train"], # training dataset
  eval_dataset=tokenized_datasets["validation"], # validation dataset
)

In [25]:
trainer.train() # train model

Epoch,Training Loss,Validation Loss
1,2.0372,1.584247
2,1.5952,1.531527
3,1.5507,1.503057
4,1.5283,1.491394
5,1.5162,1.488445


TrainOutput(global_step=2500, training_loss=1.64554443359375, metrics={'train_runtime': 325.4463, 'train_samples_per_second': 30.727, 'train_steps_per_second': 7.682, 'total_flos': 655495004160000.0, 'train_loss': 1.64554443359375, 'epoch': 5.0})
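
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and converts the final validation loss into perplexity as `exp(loss)`. Because the labels in this notebook include padded positions, this perplexity is only a rough indicator and is not comparable to published GPT-2 figures.

```python
import math

# Values taken from the cells above
train_rows, batch_size, epochs = 2000, 4, 5
final_eval_loss = 1.488445  # validation loss after epoch 5

# Optimizer steps: one per batch, assuming one device and no gradient accumulation
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(final_eval_loss)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")   # matches global_step=2500
print(f"perplexity:      {perplexity:.2f}")
```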

In [26]:
merged_model = model.merge_and_unload() # merge the LoRA adapter weights into the base weights and return a plain GPT-2 model

merged_model.save_pretrained("./gpt2-news-merged") # save model in storage
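
Conceptually, merging folds the low-rank update into the frozen weight, `W_merged = W + (lora_alpha / r) * B @ A`, so the merged model gives the same output without a separate adapter path. The toy example below illustrates this in pure Python. All matrices and helper names are hypothetical, and it uses the `out × in` weight layout from the LoRA paper rather than GPT-2's internal layout.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy shapes: in_features=2, out_features=3, rank r=1
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # frozen base weight (out x in)
A = [[0.5, -1.0]]                          # LoRA A (r x in)
B = [[1.0], [0.0], [2.0]]                  # LoRA B (out x r)
r, lora_alpha = 1, 2
scaling = lora_alpha / r
x = [[1.0], [2.0]]                         # input as a column vector

# Adapter path: base output plus scaled low-rank correction
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the correction into the weight once, then one matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out)
print(merged_out)
```

Both paths print the same column vector, which is why the merged checkpoint can later be loaded with `AutoModelForCausalLM` without PEFT.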

In [27]:
from transformers import AutoModelForCausalLM, GPT2Tokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("./gpt2-news-merged") # load the merged model with the AutoModelForCausalLM class
tokenizer = GPT2Tokenizer.from_pretrained("gpt2") # the tokenizer is the same as GPT-2's

# Build a generator with pipeline
generator = pipeline("text-generation", # task: text-generation
                     model=model, # model = gpt2-news-merged
                     tokenizer=tokenizer) # the GPT-2 tokenizer encodes the prompt and decodes generated token IDs back into text

Device set to use cuda:0


In [28]:
prompt = "Breaking News" # prompt; the model continues generating from this text

outputs = generator(prompt, # prompt
                    max_length=128, # max total length in tokens (the warning below shows `max_new_tokens`=256 took precedence in this run)
                    num_return_sequences=3, # number of sequences to generate
                    do_sample=True, # sample instead of using greedy decoding
                    top_k=50) # sample only from the 50 most likely next tokens

# print generated outputs
for i, out in enumerate(outputs):
    print(f"--- Generated #{i+1} ---")
    print(out["generated_text"])
    print()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


--- Generated #1 ---
Breaking News: New York Times Is In The Business Of Selling Paper for Profit

--- Generated #2 ---
Breaking News: A New York Times Reporter Calls for President to 'Make America Great Again' AP A New York Times reporter called for President Donald Trump to "make America great again" on Monday, tweeting the slogan # #AmericaFirst.

--- Generated #3 ---
Breaking News & Rumors A report from The Daily Mail has claimed that former Conservative MP David Cameron has been suspended from the party after a controversial speech in which he accused the party of being anti-Islamic.
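
The `do_sample=True, top_k=50` settings mean that at each step only the 50 highest-scoring tokens are kept and the next token is drawn from a softmax over them. The following is a minimal pure-Python sketch of that sampling step under those assumptions; the function `top_k_sample` and the toy logits are hypothetical and this is not the library's implementation.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest logits, weighted by softmax."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for numerical stability)
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)                 # seeded for reproducibility
logits = [2.0, 0.5, -1.0, 3.0, 0.0]    # toy scores for a 5-token vocabulary

# With k=2 only the two best tokens (indices 3 and 0) can ever be drawn
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Smaller `k` makes the output more conservative, while larger `k` allows more varied (and more error-prone) continuations.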



# GPT-2 vs GPT-2**News**

In [29]:
from transformers import GPT2LMHeadModel, AutoModelForCausalLM, GPT2Tokenizer, pipeline

# ---------- GPT2 ------------
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# ---------- GPT2News ----------
model = AutoModelForCausalLM.from_pretrained("./gpt2-news-merged")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2news_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# --------- Test --------------
prompt = "Seoul, South Korea"

gpt2_outputs = gpt2_generator(prompt, max_length=128, num_return_sequences=3, do_sample=True, top_k=50)
gpt2news_outputs = gpt2news_generator(prompt, max_length=128, num_return_sequences=3, do_sample=True, top_k=50)

# 1. Print the original GPT-2 generations
print("=============================================")
print("          🤖 Original GPT-2 Results          ")
print("=============================================")
print(f"PROMPT: '{prompt}'\n")

for i, output in enumerate(gpt2_outputs):
    print(f"--- Generated #{i+1} ---")
    # The pipeline returns dictionaries, so access the text with the 'generated_text' key.
    print(output['generated_text'])
    print() # blank line

# 2. Print the fine-tuned GPT-2 News generations
print("\n=============================================")
print("      ✨ Fine-tuned 'GPT-2 News' Results      ")
print("=============================================")
print(f"PROMPT: '{prompt}'\n")

for i, output in enumerate(gpt2news_outputs):
    print(f"--- Generated #{i+1} ---")
    print(output['generated_text'])
    print() # blank line

Device set to use cuda:0
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

          🤖 Original GPT-2 Results          
PROMPT: 'Seoul, South Korea'

--- Generated #1 ---
Seoul, South Korea, and South Korea are considered the most important oil reserves in the world. However, many of the countries that rely on oil for their livelihood are also rich in natural resources, including oil from the Bakken, which is a huge, natural resource.

The United States is the world's largest producer of oil, with a combined total of about 2.3 billion barrels of oil. The United States produces about 25 percent of the world's oil, and has a significant share of the global population. However, the United States does not have the largest reserves of oil, and only about 20 percent of its oil is produced in the country that produces it.

In 2014, the U.S. reported a total of $1.7 trillion in assets abroad, or nearly $1.5 trillion in the United States. The United States relies on the Bakken for its oil production, which is expected to reach $100 billion by 2020.

The United States 