# **Finetune GPT-2 using Domain-Adapting Technique: A Comprehensive Guide**

This notebook provides a detailed guide on how to fine-tune GPT-2 for generating domain-specific text. Designed for practitioners and enthusiasts of natural language processing (NLP), this notebook covers the end-to-end process of adapting a pre-trained GPT-2 model to suit specific text generation tasks.



## **What is Domain Adapting Fine-Tuning?**

**Domain Adapting Fine-Tuning** refers to the practice of further training a pre-trained model on domain-specific data to improve its performance and relevance for particular tasks or content areas

## **Why is Domain Adapting Fine-Tuning Important?**

**Relevance and Accuracy**

* `Contextual Understanding`: Domain-specific fine-tuning helps the model understand and generate text that is relevant to the specific domain, improving its accuracy and usefulness for specialized tasks.

* `Terminology and Jargon`: The model becomes familiar with domain-specific terminology and jargon, leading to better interpretation and generation of domain-related content.

**Performance Improvement**

* `Task-Specific Tuning`: Fine-tuning allows the model to perform better on tasks that require domain-specific knowledge, such as classifying legal documents, summarizing medical reports, or generating financial news.

* `Reduced Generalization Gap`: Fine-tuning narrows the gap between the model’s general capabilities and its performance on specific, domain-related tasks.

**Efficiency**

* `Less Training Data Needed`: By starting with a pre-trained model, domain adapting fine-tuning typically requires less domain-specific data compared to training a model from scratch.

* `Time and Resource Savings`: Fine-tuning is often faster and more resource-efficient than training a new model, as it builds on existing knowledge.

## **About datset**

The News Articles Dataset is a comprehensive collection of news articles scraped from thenews.com.pk, covering a span from 2015 to the present. This dataset includes articles related to business and sports, offering a rich source of information for various types of text analysis and research.

[dataset download link](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles)

## **Install transformers**

In [2]:
!pip install transformers -q

# **Importing libraries**

In [3]:
import pandas as pd
import numpy as np
import re
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer


## **Pre-Processing**

In [4]:
def cleaning(s):
    s = str(s)
    s = re.sub('\s\W',' ',s)
    s = re.sub('\W,\s',' ',s)
    s = re.sub("\d+", "", s)
    s = re.sub('\s+',' ',s)
    s = re.sub('[!@#$_]', '', s)
    s = s.replace("co","")
    s = s.replace("https","")
    s = s.replace("[\w*"," ")
    return s

In [7]:
df = pd.read_csv("Articles.csv", encoding="ISO-8859-1")
df = df.dropna()
text_data = open('Articles.txt', 'w')
for idx, item in df.iterrows():
  article = cleaning(item["Article"])
  text_data.write(article)
text_data.close()

## **Function for load_datset**

**Block Size**:

The block_size determines how many tokens are included in each block or chunk of text. If block_size is set to 128, the text will be split into sequences of 128 tokens each.

In [8]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

## **Function for load data collator**

In [9]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator

## **function for train the model**

In [10]:

def train(train_file_path, model_name, output_dir, overwrite_output_dir,
          per_device_train_batch_size, num_train_epochs, save_steps):

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
            output_dir=output_dir,
            overwrite_output_dir=overwrite_output_dir,
            per_device_train_batch_size=per_device_train_batch_size,
            num_train_epochs=num_train_epochs,
        )

    trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()



## **Parameters for train the model**

In [11]:
# you need to set parameters
train_file_path = "/content/Articles.txt"
model_name = 'gpt2'
output_dir = '/content/result'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 3.0
save_steps = 500

## **Train the model**

In [12]:
# It takes about 30 minutes to train in colab.
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Step,Training Loss
500,3.6909
1000,3.4251
1500,3.1911
2000,3.1443
2500,3.0392
3000,3.0155


## **Function for load the model**

In [13]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model

## **Function for load the tokenizer**

In [14]:
def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

## **Function for Generate text**

In [27]:
def generate_text(sequence, max_length):

    model_path = "/content/result/checkpoint-3000"
    tokenizer_path = "/content/result"

    model = load_model(model_path)
    tokenizer = load_tokenizer(tokenizer_path)

    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    attention_mask = ids.ne(tokenizer.pad_token_id).long()

    # Generate the text
    final_outputs = model.generate(
        ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_length=max_length,
        pad_token_id=tokenizer.pad_token_id,
        top_k=50,
        top_p=0.95,
    )

    # Decode and print the output
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

## **Generate Text**

In [29]:
sequence = input() # What is the sentiment of the oil price
max_len = int(input())
generate_text(sequence, max_len)

What is the sentiment of the oil price
200
What is the sentiment of the oil price? Is OPEC going to change it?" Imey said. It´s not a good day for markets to have no direction."Oil prices also slumped last week after a disappointing OPEC meeting earlier this week ended in a -day low of almost US. per barrel.But traders also said more uncertainty over the world´s main enomy has led to a weaker dollar, with the International Monetary Fund expected to raise interest rates this week, signalling a slowing outlook for growth.A weak enomy makes the dollar weaker for long-term investors with investors expecting that further growth in the world´s biggest enomy will be a major reason for the dollar to move lower.The Bank of England raised rates to keep pace with growing demand but said it was now also planning to increase its interest rate.A lower yen would ease the risk of interest rate increases, in turn leading to lower oil prices. However, traders are betting that an oil-driven U.S
