# Fine-Tuning DistilGPT for Story Generation

Text generation is a broad area of natural language processing in which a machine learning model produces new text from a given prompt. GitHub Copilot, which generates code, is a well-known example. However, the applications of text generation extend far beyond this:

> - **Story Generation:** Utilizing prompts like "Once upon a time," models such as DistilGPT can weave engaging and creative narratives.
>
> - **Poetry Creation:** These models are adept at crafting poetry, requiring nuanced attention to style and theme.
>
> - **Paragraph Completion:** They efficiently complete unfinished paragraphs, ensuring smooth continuity and contextual accuracy.
>
> - **Article Summarization:** They excel at condensing extensive articles into concise, essential summaries.
>
> - **Question Answering:** Provided with relevant context, these models can tackle a wide range of questions.

DistilGPT (the `distilgpt2` checkpoint used below) is a distilled, smaller version of GPT-2. It is suited to scenarios where compute and memory are limited but good-quality text generation is still needed.

In this notebook, we focus on fine-tuning DistilGPT. Unlike models that attend to both preceding and subsequent context (such as BERT), DistilGPT is a causal language model: it predicts the next token from the preceding tokens only. This left-to-right setup is what lets the model generate text one token at a time.
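To make the difference between causal and bidirectional context concrete, here is a minimal pure-Python sketch of the two attention patterns. The helper names `causal_mask` and `bidirectional_mask` are illustrative and do not come from any library; real models build equivalent masks internally as tensors.

```python
def causal_mask(n):
    """Row i marks the positions token i may attend to (1 = visible).

    In a causal model, token i sees only positions 0..i.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """In a BERT-style model, every token sees every position."""
    return [[1] * n for _ in range(n)]


tokens = ["Once", "upon", "a", "time"]
for tok, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{tok:>5}: {row}")
```

Each row of the causal mask is lower-triangular: the last token sees the whole prefix, while the first token sees only itself.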

**Our Approach Includes:**

> - **Dataset Loading:** We begin by importing the data from CSV files, preparing it for the model.
>
> - **Text Tokenization and Preprocessing:** The raw text data is then tokenized and preprocessed, making it suitable for the model's consumption.
>
> - **Batch Creation:** We organize the processed data into batches, an essential step for efficient model training.
>
> - **Fine-Tuning with Pretrained Weights:** Leveraging the pre-trained weights of DistilGPT, we fine-tune the model to align closely with our specific text generation objectives.
>
> - **Model Evaluation:** After training, we assess the model with automated metrics (perplexity and BERTScore) to check the quality of the generated text.


## Installing Required Libraries

In [1]:
!pip install datasets 
!pip install transformers
!pip install evaluate
!pip install bert_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Parameters for notebook execution:

We are going to execute this notebook from the terminal with Papermill, so it is better to store all the parameters needed for a successful execution in one place. This makes them easy to manage and override.

In [78]:
# parameters
# MODEL CHECKPOINT
MODEL_CHECKPOINT='distilgpt2'


# SAMPLE
TRAIN_ROWS=50000
TEST_ROWS=5000


# PATH OF CSV FILES
DIR_PATH = "./Downloads"
TRAIN_PATH = DIR_PATH + "/data/train_df.csv"
VALID_PATH = DIR_PATH + "/data/valid_df.csv"
TEST_PATH = DIR_PATH + "/data/test_df.csv"

# DATA PROCESSING
CONTEXT_LEN=256

# HYPERPARAMETERS
TRAIN_BS = 64
TEST_BS = 64
EPOCHS = 5
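
As a rough sketch of how these parameters might be overridden from a script, Papermill's Python API accepts a `parameters` dictionary and injects it after the notebook cell tagged `parameters`. The override values and the notebook file names below are placeholders, not the settings used for the run shown in this notebook.

```python
# Hypothetical overrides for a quick smoke-test run (values are illustrative).
PARAMS = {"TRAIN_ROWS": 1000, "TEST_ROWS": 100, "EPOCHS": 1}


def run_notebook(input_path, output_path, parameters=PARAMS):
    """Execute a notebook with Papermill, overriding the parameters cell."""
    # Imported lazily so this sketch can be loaded without papermill installed.
    import papermill as pm

    return pm.execute_notebook(input_path, output_path, parameters=parameters)


# Example call (file names are placeholders):
# run_notebook("story_generation.ipynb", "story_generation_out.ipynb")
```

Any parameter not listed in the dictionary keeps the default defined in the cell above.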

## Load the data:

The initial step involves importing data from `CSV` files. Following this, the data is transformed into a Hugging Face `Dataset` object for further processing and utilization.

In [79]:
# import libraries
from datasets import load_dataset

In [80]:
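dataset = load_dataset("csv",
                       data_files={"train": TRAIN_PATH, "test": VALID_PATH})

# select a sample of each split (the "test" split is loaded from the validation CSV)
dataset['train'] = dataset['train'].select(range(TRAIN_ROWS))
# sample the held-out split from itself, not from the training rows, to avoid leakage
dataset['test'] = dataset['test'].select(range(TEST_ROWS))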

dataset



  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'stories', 'prompts'],
        num_rows: 50000
    })
    test: Dataset({
        features: ['Unnamed: 0', 'stories', 'prompts'],
        num_rows: 5000
    })
})

In [10]:
# remove unnamed column
dataset = dataset.remove_columns(['Unnamed: 0'])

dataset

DatasetDict({
    train: Dataset({
        features: ['stories', 'prompts'],
        num_rows: 50000
    })
    test: Dataset({
        features: ['stories', 'prompts'],
        num_rows: 5000
    })
})

Let's see an example from the dataset

In [11]:
for key in dataset["train"][0]:
    print(f"{key.upper()}: {dataset['train'][0][key][:500]} \n ")

STORIES: So many times have I walked on ruins , the remainings of places that I loved and got used to.. At first I was scared , each time I could feel my city , my current generation collapse , break into the black hole that thrives within it , I could feel humanity , the way I 'm able to feel my body.. After a few hundred years , the pattern became obvious , no longer the war and damage that would devastate me over and over again in the far past was effecting me so dominantly . <newline> It 's funny , b 
 
PROMPTS:  You 've finally managed to discover the secret to immortality . Suddenly , Death appears before you , hands you a business card , and says , `` When you realize living forever sucks , call this number , I 've got a job offer for you . '' 
 


### Assumption

Our dataset consists of two distinct columns:

1. `stories`
2. `prompts`

The objective is to train a causal language model focused on story generation. Training therefore uses only the `stories` column; the `prompts` column is dropped during tokenization. At generation time, the model continues a story from a short opening segment.


## Data Processing: Tokenization Phase

1. **Context Window Size**: Because generation starts from short prompts, we opt for a smaller context window of `CONTEXT_LEN = 256` tokens, well below the 1,024 positions the model supports. This approach has two key advantages:
   - **Faster Training**: A smaller context window allows for quicker model training.
   - **Reduced Memory Requirements**: It significantly lessens the memory needed for processing.
   


2. **Handling Larger Documents**: In instances where documents exceed the set context window size, the following steps are taken:
   - **Chunking**: These documents are divided into multiple chunks, each aligning with the size of the context window.
   - **Discarding Small Chunks**: Any chunk shorter than the context window (in practice, the final chunk of a document) is discarded so that every input has the same length.
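
The chunking and discarding steps above can be sketched in pure Python on lists of token ids. This is only an illustration of the idea: the helpers `chunk_documents` and `keep_full_chunks` are hypothetical names, and in the notebook the tokenizer itself produces the chunks.

```python
def chunk_documents(docs, context_len):
    """Split each token list into consecutive windows of at most `context_len` tokens.

    Returns the chunks, their lengths, and the index of the source document
    for each chunk (similar in spirit to an overflow-to-sample mapping).
    """
    chunks, lengths, mapping = [], [], []
    for doc_idx, tokens in enumerate(docs):
        for start in range(0, len(tokens), context_len):
            piece = tokens[start:start + context_len]
            chunks.append(piece)
            lengths.append(len(piece))
            mapping.append(doc_idx)
    return chunks, lengths, mapping


def keep_full_chunks(chunks, context_len):
    """Drop every chunk that does not fill the whole context window."""
    return [c for c in chunks if len(c) == context_len]


docs = [list(range(10)), list(range(7))]  # two toy "documents" of token ids
chunks, lengths, mapping = chunk_documents(docs, context_len=4)
print(lengths)   # [4, 4, 2, 4, 3]
print(mapping)   # [0, 0, 0, 1, 1]
print(len(keep_full_chunks(chunks, 4)))  # 3
```

Discarding short chunks wastes a little data per document, but it avoids padding and keeps every training example the same length.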


In [12]:
from transformers import AutoTokenizer

In [81]:
# loading a pretrained tokenizer for the selected model 

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

Downloading (…)lve/main/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Let's see what the tokenizer does.

In [82]:
outputs = tokenizer(
    dataset["train"][:2]["stories"],
    truncation=True,
    max_length=CONTEXT_LEN,
    return_overflowing_tokens=True,
    return_length=True,
)

In [83]:
print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 7
Input chunk lengths: [256, 256, 240, 256, 256, 256, 129]
Chunk mapping: [0, 0, 0, 1, 1, 1, 1]


Now, let's create a tokenize function

In [84]:
def tokenize(element):
    
    outputs = tokenizer(
        element["stories"],
        truncation=True,
        max_length=CONTEXT_LEN,
        return_overflowing_tokens=True,
        return_length=True,
    )
    
    input_batch = []
    
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == CONTEXT_LEN:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

In [85]:
tokenized_datasets = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)
tokenized_datasets

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 114239
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 11363
    })
})

That was straightforward: tokenization is complete. Before training the model, let's check that a chunk decodes back to the original text.

In [86]:
# we can get our tokenized text back using the decode function 
tokenizer.decode(tokenized_datasets['train'][0]['input_ids'])

"So many times have I walked on ruins, the remainings of places that I loved and got used to.. At first I was scared, each time I could feel my city, my current generation collapse, break into the black hole that thrives within it, I could feel humanity, the way I'm able to feel my body.. After a few hundred years, the pattern became obvious, no longer the war and damage that would devastate me over and over again in the far past was effecting me so dominantly. <newline> It's funny, but I felt as if after gaining what I desired so long, what I have lived for my entire life, only then, when I achieved immortality I started truly aging. <newline> <newline> 5 world wars have passed, and now they feel like a simple sickeness that would pass by every so often, I could no longer evaluate the individual human as a being of its own, the importance of mortals is merely the same as the importance of my skin cells ; They are a part of a mechanism so much more advanced, a mechanism that is so dear

## Training the Model with DistilGPT

When it comes to transformer architectures for text generation, there is a variety of options, each with its unique characteristics and strengths:

1. **GPT-2:** Known for its effectiveness in generating coherent and contextually relevant text, GPT-2 is a go-to choice for many natural language processing tasks.
2. **BART:** An encoder-decoder architecture, BART excels in tasks that require understanding and rephrasing input text, like summarization and translation.
3. **BERT:** While primarily used for understanding the context of a word in a sentence, BERT's architecture is less suited for text generation but excellent for tasks like classification and question-answering.

For our current purpose, we're choosing **DistilGPT** (`distilgpt2`), a distilled variant of GPT-2. It retains much of GPT-2's capability with fewer parameters, which means faster training and a smaller model, making it a good fit for environments where computational resources are a consideration.

In this section, our focus is to train a causal language model utilizing the DistilGPT architecture. We will:

- Begin by setting up and training DistilGPT on our dataset, harnessing its proficiency in generating coherent and engaging text.
- Progress to fine-tuning various hyperparameters to optimize the model's performance for our specific text generation requirements.
- Additionally, explore other architectures like BART and T5, especially those with an encoder-decoder framework, to compare and understand their effectiveness in different aspects of text generation.


In [87]:
from transformers import AutoModelForCausalLM, AutoConfig

Let's initialize the model

In [88]:
config = AutoConfig.from_pretrained(
    MODEL_CHECKPOINT,
    vocab_size=len(tokenizer),
    n_ctx=CONTEXT_LEN,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [92]:
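# initialize the model from the pretrained checkpoint
# (the `config` object created above is not passed here, so the checkpoint's own config is used)
model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
model_size = sum(t.numel() for t in model.parameters())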

# how many parameters?
print(f"Distilgpt2 size: {model_size/1000**2:.1f}M parameters")

Distilgpt2 size: 81.9M parameters


### Creating batches:

Hugging Face provides the `DataCollatorForLanguageModeling` collator, which is designed specifically for language modeling. Besides stacking and padding batches, it also creates the language model labels. In causal language modeling the inputs serve as labels too: with `mlm=False`, the collator copies `input_ids` into `labels` on the fly (padding positions are set to `-100` so the loss ignores them), and the model shifts them by one position internally when computing the loss, so we don't need to duplicate the `input_ids` ourselves.
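
To illustrate how inputs double as labels, here is a minimal pure-Python sketch. It only mimics the idea; `collate_causal` and `next_token_pairs` are hypothetical helpers, not the collator's or the model's actual code.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal(input_ids, pad_id):
    """Labels are a copy of the inputs, with padding positions masked out."""
    return {
        "input_ids": list(input_ids),
        "labels": [IGNORE_INDEX if t == pad_id else t for t in input_ids],
    }


def next_token_pairs(batch):
    """Pair the token at position i with the label at position i + 1.

    This mirrors the one-step shift applied when the loss is computed:
    the prediction made at position i is scored against the next token.
    """
    inputs, labels = batch["input_ids"], batch["labels"]
    return [
        (inputs[i], labels[i + 1])
        for i in range(len(inputs) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


batch = collate_causal([11, 22, 33, 44, 0, 0], pad_id=0)
print(batch["labels"])          # [11, 22, 33, 44, -100, -100]
print(next_token_pairs(batch))  # [(11, 22), (22, 33), (33, 44)]
```

Because every chunk in this notebook is exactly `CONTEXT_LEN` tokens long, no padding is actually added during training. The masking matters only when batches have uneven lengths.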

In [93]:
from transformers import DataCollatorForLanguageModeling

In [94]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Inspect the data collator with an example

In [95]:
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])

print("Collating data:")
for key in out:
    print(f"{key} shape: {out[key].shape}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Collating data:
input_ids shape: torch.Size([5, 256])
attention_mask shape: torch.Size([5, 256])
labels shape: torch.Size([5, 256])


### Set training parameters:

In [96]:
from transformers import Trainer, TrainingArguments
import math 

Let's set the hyperparameters for training.

In [97]:
args = TrainingArguments(
    output_dir="ai-story_gen",
    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=TEST_BS,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    gradient_accumulation_steps=8,
    num_train_epochs=EPOCHS,
    weight_decay=0.1,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    learning_rate=5e-3,
    save_steps=500,
    fp16=True,
    push_to_hub=False,
)
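
As a quick sanity check on these settings, the number of optimizer steps can be estimated from the dataset size. The sketch below assumes a single device and assumes that leftover batches at the end of an epoch are floored away under gradient accumulation; the exact formula may differ between `Trainer` versions.

```python
import math

num_chunks = 114_239   # training chunks produced by the tokenization step
train_bs = 64          # per-device batch size (TRAIN_BS)
grad_accum = 8         # gradient_accumulation_steps
epochs = 5             # EPOCHS

# Sequences consumed per optimizer update on one device.
effective_batch = train_bs * grad_accum

# Batches per epoch, then optimizer updates per epoch (assumed to be floored).
batches_per_epoch = math.ceil(num_chunks / train_bs)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(effective_batch)    # 512
print(batches_per_epoch)  # 1785
print(total_updates)      # 1115
```

With an effective batch of 512 sequences, `eval_steps=100` means the model is evaluated roughly every 51,200 training chunks.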

Initialize the `Trainer` API instance. 

In [98]:
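def compute_metrics(pred):
    # Perplexity is exp(mean cross-entropy loss); note the positive exponent.
    # `pred` is expected to be a metrics dict containing 'eval_loss'
    # (e.g. the output of trainer.evaluate()), not an EvalPrediction,
    # so this function is not passed to the Trainer.
    perplexity = math.exp(pred['eval_loss'])
    return {"perplexity": perplexity}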

In [99]:
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    # compute_metrics=compute_metrics
)

In [100]:
trainer.train()



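| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 3.4002        | 3.222552        |
| 200  | 3.3223        | 3.09147         |
| 300  | 3.1939        | 3.019011        |
| 400  | 3.176         | 2.952123        |
| 500  | 3.0766        | 2.887738        |
| 600  | 3.0352        | 2.81865         |
| 700  | 2.9767        | 2.744796        |
| 800  | 2.8767        | 2.687469        |
| 900  | 2.8597        | 2.633229        |
| 1000 | 2.7395        | 2.606083        |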


TrainOutput(global_step=1115, training_loss=3.0318873486711304, metrics={'train_runtime': 1961.3507, 'train_samples_per_second': 291.225, 'train_steps_per_second': 0.568, 'total_flos': 3.729201094773965e+16, 'train_loss': 3.0318873486711304, 'epoch': 5.0})

In [101]:
trainer.save_model("./story_distilgpt2_finetune")

## Story generator pipeline:

In [102]:
import torch
from transformers import pipeline

In [103]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipe = pipeline(
    "text-generation", model=model,tokenizer=tokenizer, device=device
)

Let's see an example of the text generated by our trained model

In [104]:
txt = "Once upon a time"
print(pipe(txt, num_return_sequences=1,num_beams=3,
           do_sample=True, max_new_tokens=100,
          pad_token_id=tokenizer.eos_token_id)[0]["generated_text"])



Once upon a time, there was a little girl. <newline> <newline> She was the most beautiful thing in the world. <newline> <newline> She was a beautiful girl. <newline> <newline> She had a lot of freckles on her face. <newline> <newline> She had a lot of freckles. <newline> <newline> She had a lot of freckles. <newline> <newline
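
The sample above falls into a repetitive loop, which is common with small models. One thing that might help is changing the decoding settings. The sketch below wraps an existing text-generation pipeline with alternative keyword arguments (`top_k`, `top_p`, `no_repeat_ngram_size`) that are forwarded to generation. The specific values are illustrative assumptions and have not been tuned or tested on this model; `generate_story` is a hypothetical helper.

```python
# Hypothetical alternative decoding settings; values are illustrative, not tuned.
SAMPLING_KWARGS = {
    "do_sample": True,
    "top_k": 50,                # sample only from the 50 most likely tokens
    "top_p": 0.95,              # nucleus sampling threshold
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-gram
    "max_new_tokens": 100,
}


def generate_story(pipe, prompt, **overrides):
    """Call a text-generation pipeline with the settings above.

    `pipe` is any callable that behaves like the pipeline created earlier:
    it takes a prompt plus keyword arguments and returns a list of dicts
    with a 'generated_text' key.
    """
    kwargs = {**SAMPLING_KWARGS, **overrides}
    return pipe(prompt, **kwargs)[0]["generated_text"]


# Example call with the pipeline from this notebook:
# print(generate_story(pipe, "Once upon a time", pad_token_id=tokenizer.eos_token_id))
```

Whether these settings improve the stories would need to be checked by reading samples, since less repetition does not guarantee better text.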


## Evaluating the Language Model

Evaluating text generation, particularly for a language model, can be both subjective and dependent on specific task goals. However, several common methods are widely used for assessing the quality of generated text:

1. **Human Evaluation**: 
   - This method involves human evaluators who read and rate the generated text on various aspects like coherence, fluency, and relevance to the prompt or topic.
   - It offers a subjective but insightful evaluation, reflecting the extent to which the generated text resembles human writing.

2. **Automated Metrics**:
   - Automated metrics provide quantitative evaluations of the generated text.
   - Common metrics include:
     - **Perplexity**: The exponential of the average negative log-likelihood per token. It measures how well the probability distribution predicted by the model aligns with the actual text (lower is better).
     - **BLEU Score**: Assesses the similarity of the generated text to a set of reference texts.
     - **ROUGE Score**: Often used in summarization tasks to compare the overlap of content between generated and reference texts.

3. **Domain-Specific Metrics**:
   - For tasks focused on specific domains (like medical or legal), domain-specific metrics evaluate the accuracy and completeness of the text within that domain.

4. **User Testing**:
   - This involves actual users interacting with the generated text and providing feedback on its quality, usefulness, and relevance.

In this notebook, our primary focus is on evaluating the model using automated metrics, particularly:

- **Perplexity**: We will calculate and analyze the perplexity of the model to understand how well it predicts the test data.
- **BERTScore**: This metric will be used to evaluate the semantic similarity between the generated text and reference texts.
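
As a small worked example of the perplexity definition, a mean cross-entropy loss can be converted to perplexity by exponentiating it. The sketch below applies this to a few validation losses from the training log; the `perplexity` helper is an illustrative name.

```python
import math


def perplexity(mean_nll):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll)


# Validation losses taken from the training log (step -> loss).
val_losses = {100: 3.222552, 500: 2.887738, 1000: 2.606083}

for step, loss in val_losses.items():
    print(f"step {step:>4}: loss {loss:.3f} -> perplexity {perplexity(loss):.2f}")
```

The final validation loss corresponds to a perplexity of roughly 13.5 on 256-token chunks. This figure is not directly comparable to a sliding-window estimate over concatenated text with a different window length, so the two numbers should be read separately.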


In [105]:
import torch
from tqdm import tqdm

from evaluate import load

## __Perplexity__

In [106]:
encodings = tokenizer("\n\n".join(dataset['test']['stories']),return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (3576967 > 1024). Running this sequence through the model will result in indexing errors


In [107]:
encodings['input_ids'].shape

torch.Size([1, 3576967])

In [108]:
max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0

for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

100%|█████████▉| 6985/6987 [01:50<00:00, 63.43it/s]


In [109]:
ppl

tensor(125.9746, device='cuda:0')
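
To see what the sliding-window loop above is doing, the window boundaries can be reproduced without a model. This pure-Python sketch follows the same arithmetic; `window_spans` is a hypothetical helper introduced only for illustration.

```python
def window_spans(seq_len, max_length, stride):
    """Return (begin, end, trg_len) for each evaluation window.

    Each window covers up to `max_length` tokens, but only the last
    `trg_len` tokens (those not scored by an earlier window) count
    toward the loss; the rest serve as context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end
        spans.append((begin, end, trg_len))
        prev_end = end
        if end == seq_len:
            break
    return spans


spans = window_spans(seq_len=10, max_length=4, stride=2)
print(spans)  # [(0, 4, 4), (2, 6, 2), (4, 8, 2), (6, 10, 2)]
print(sum(trg for _, _, trg in spans))  # 10: every token is scored exactly once
```

With a stride smaller than the window, later windows re-read earlier tokens as context, which usually gives a better perplexity estimate than disjoint windows at the cost of more forward passes.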

In [111]:
def data(n=10):
    for i in range(n):
        
        # pass first 10 words
        yield " ".join(dataset['test'][i]['stories'].split(" ")[:10]) 

In [112]:
predictions = []
references = []

# generate a continuation for each 10-word opening, then compare words 10-100
# of the generated text with the same span of the original story
for i, out in enumerate(pipe(data(), num_return_sequences=1, num_beams=3,
                             do_sample=True, max_new_tokens=100,
                             pad_token_id=tokenizer.eos_token_id)):
    references.append(" ".join(dataset['test'][i]['stories'].split(" ")[10:100]))
    predictions.append(" ".join(out[0]['generated_text'].split(" ")[10:100]))


## __BERT score__

In [114]:
bertscore = load("bertscore")
results = bertscore.compute(predictions=predictions, references=references, lang="en")

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.43G [00:00<?, ?B/s]

In [115]:
results['f1']

[0.7954513430595398,
 0.7691985368728638,
 0.8038576245307922,
 0.788618266582489,
 0.8339035511016846,
 0.7960880994796753,
 0.8031459450721741,
 0.8546534180641174,
 0.8405726552009583,
 0.8452156782150269]
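
Per-example scores are easier to read once they are summarized. The sketch below averages the ten F1 values listed above using only the standard library; the variable names are illustrative.

```python
from statistics import mean

# BERTScore F1 values copied from the output above.
f1_scores = [
    0.7954513430595398, 0.7691985368728638, 0.8038576245307922,
    0.788618266582489, 0.8339035511016846, 0.7960880994796753,
    0.8031459450721741, 0.8546534180641174, 0.8405726552009583,
    0.8452156782150269,
]

print(f"mean F1: {mean(f1_scores):.3f}")
print(f"min F1:  {min(f1_scores):.3f}")
print(f"max F1:  {max(f1_scores):.3f}")
```

The mean comes out at about 0.81. With only ten examples this is a rough indication rather than a reliable estimate, and a larger sample would be needed to draw firmer conclusions.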