<a href="https://colab.research.google.com/github/preethimaran/Poem-Generator-using-Transformer/blob/main/Poetry_generator_using_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First we will install and import the required libraries.

In [None]:
!pip install pandas numpy torch matplotlib lightning




In [None]:
import pandas as pd
import numpy as np


import torch
import torch.nn as nn
import torch.nn.functional as F # to access the softmax function we will use while calculating attention


from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader

import lightning as L

import matplotlib.pyplot as plt

Next we set the device on which computations are performed. We use CUDA when a GPU is available and fall back to the CPU otherwise.

In [None]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
DEVICE

device(type='cuda')

Let us now download and use the poetry dataset

In [None]:
!pip install kaggle




We will use the Poetry Foundation dataset available on Kaggle. To download it, we first need to upload the kaggle.json API credentials file.

In [None]:
from google.colab import files
files.upload()  # This will prompt you to select kaggle.json

Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"preethimaran","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


Downloading and unzipping the dataset

In [None]:
!kaggle datasets download -d tgdivy/poetry-foundation-poems
!unzip poetry-foundation-poems.zip -d poetry_dataset


Dataset URL: https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems
License(s): GNU Affero General Public License 3.0
poetry-foundation-poems.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  poetry-foundation-poems.zip
replace poetry_dataset/PoetryFoundationData.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: poetry_dataset/PoetryFoundationData.csv  


Alternatively, download the zip file directly from https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems

Unzip it, then create the dataset directory next to the ipynb file by running the following:

  *import os*

  *os.makedirs("poetry_dataset", exist_ok=True)*

Finally, place the CSV file in that directory.
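For the manual route, a small check like the one below can confirm that the CSV ended up where the later cells expect it. This is only a sketch: the helper name `dataset_ready` is hypothetical, while the directory and file names are the ones used elsewhere in this notebook.

```python
import os

DATA_DIR = "poetry_dataset"
CSV_NAME = "PoetryFoundationData.csv"  # file name inside the Kaggle archive


def dataset_ready(data_dir=DATA_DIR, csv_name=CSV_NAME):
    """Create the directory if needed and report whether the CSV is in place.

    Hypothetical helper, not part of the original notebook.
    """
    os.makedirs(data_dir, exist_ok=True)
    return os.path.isfile(os.path.join(data_dir, csv_name))


if not dataset_ready():
    print(f"Place {CSV_NAME} inside ./{DATA_DIR} before running the next cell.")
```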

Reading the file as a pandas dataframe

In [None]:
import pandas as pd

df = pd.read_csv("poetry_dataset/PoetryFoundationData.csv")
print(df.head(5))



   Unnamed: 0                                              Title  \
0           0  \r\r\n                    Objects Used to Prop...   
1           1  \r\r\n                    The New Church\r\r\n...   
2           2  \r\r\n                    Look for Me\r\r\n   ...   
3           3  \r\r\n                    Wild Life\r\r\n     ...   
4           4  \r\r\n                    Umbrella\r\r\n      ...   

                                                Poem              Poet Tags  
0  \r\r\nDog bone, stapler,\r\r\ncribbage board, ...  Michelle Menting  NaN  
1  \r\r\nThe old cupola glinted above the clouds,...     Lucia Cherciu  NaN  
2  \r\r\nLook for me under the hood\r\r\nof that ...        Ted Kooser  NaN  
3  \r\r\nBehind the silo, the Mother Rabbit\r\r\n...   Grace Cavalieri  NaN  
4  \r\r\nWhen I push your button\r\r\nyou fly off...      Connie Wanek  NaN  


In [None]:
df.columns

Index(['Unnamed: 0', 'Title', 'Poem', 'Poet', 'Tags'], dtype='object')

Looking at one of the poems

In [None]:
print(df['Poem'].iloc[0])


Dog bone, stapler,
cribbage board, garlic press
     because this window is loose—lacks
suction, lacks grip.
Bungee cord, bootstrap,
dog leash, leather belt
     because this window had sash cords.
They frayed. They broke.
Feather duster, thatch of straw, empty
bottle of Elmer's glue
     because this window is loud—its hinges clack
open, clack shut.
Stuffed bear, baby blanket,
single crib newel
     because this window is split. It's dividing
in two.
Velvet moss, sagebrush,
willow branch, robin's wing
     because this window, it's pane-less. It's only
a frame of air.



In [None]:
# Selecting poems that are ≤250 words and are not empty. This is because we will be fine-tuning a GPT-2 model,
# which can only handle sequences of up to 1,024 tokens
df_filtered = df[(df['Poem'].apply(lambda x: len(x.split()) <= 250)) & (df['Poem'].str.strip() != "")]
df_filtered = df_filtered.copy()

# Adding a word count column
df_filtered['word_count'] = df_filtered['Poem'].apply(lambda x: len(x.split()))

# Stats
print("Max words in Poem:", df_filtered['word_count'].max())
print("Min words in Poem:", df_filtered['word_count'].min())
print("Average words in Poem:", df_filtered['word_count'].mean())



Max words in Poem: 250
Min words in Poem: 1
Average words in Poem: 125.73514481934906
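The filter above counts whitespace-separated words, whereas GPT-2's limit is measured in tokens, so the 250-word cutoff is a rough proxy. The sketch below runs the same kind of filter on a toy frame. The helper `filter_by_word_count` and the toy data are made up for illustration, and the `fillna` guard is an extra precaution in case the column contains missing values (an assumption; the original cell does not need it if there are none).

```python
import pandas as pd

# Toy stand-in for the 'Poem' column: a short poem, a blank one, a missing one, a longer one
toy = pd.DataFrame({"Poem": ["one two three", "   ", None, "a b c d e f"]})


def filter_by_word_count(frame, max_words=250, column="Poem"):
    """Keep non-empty rows with at most max_words whitespace-separated words."""
    text = frame[column].fillna("").astype(str)
    word_count = text.str.split().str.len()
    kept = frame[(word_count > 0) & (word_count <= max_words)].copy()
    kept["word_count"] = word_count[kept.index]
    return kept


print(filter_by_word_count(toy, max_words=5))
```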


Checking the number of poems

In [None]:
len(df_filtered)

10047

In [None]:
df_filtered['Poem']

Unnamed: 0,Poem
0,"\r\r\nDog bone, stapler,\r\r\ncribbage board, ..."
1,"\r\r\nThe old cupola glinted above the clouds,..."
2,\r\r\nLook for me under the hood\r\r\nof that ...
3,"\r\r\nBehind the silo, the Mother Rabbit\r\r\n..."
4,\r\r\nWhen I push your button\r\r\nyou fly off...
...,...
13835,"\r\r\nDear Writers, I’m compiling the first in..."
13848,\r\r\nThe Wise Men will unlearn your name.\r\r...
13849,\r\r\nWe'd like to talk with you about ...
13852,\r\r\n Philosophic\r\r\nin its comple...


Installing and importing the Transformers library

In [None]:
!pip install transformers



In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# GPT2LMHeadModel adds the language-modelling head needed for text generation
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Defining the tokenizer to tokenize our input
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


# GPT-2 has no padding token by default, so we register one. 'pad_token' is the key for the [PAD] token.
# Note: Any other custom tokens must be passed as a list under 'additional_special_tokens'
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# tokenizer.add_special_tokens({'additional_special_tokens': ['[POEM]','[TITLE]','[TOPIC]']})

# Every token id indexes a row of the embedding matrix. Adding [PAD] grows the vocabulary, so the embedding must be resized.
model.resize_token_embeddings(len(tokenizer))

# Converting the input data into PyTorch tensors. Calling the tokenizer returns a dictionary-like object:
# 'input_ids' holds the token ids and 'attention_mask' marks real tokens with 1 and padding with 0.
# padding=True pads every poem to the longest one in the batch; truncation=True cuts poems at the model's 1,024-token limit.
# Note: this tokenizes every poem in df (13,854 rows), not the shorter subset in df_filtered.
input = tokenizer(list(df['Poem']), return_tensors='pt', padding=True, truncation=True)


In [None]:
input['input_ids'].shape

torch.Size([13854, 1024])
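To make the effect of `padding=True` and `truncation=True` concrete, here is a plain-Python sketch of right-padding a batch to its longest sequence and building the matching attention mask. It is a simplified illustration, not the tokenizer's implementation: the helper `pad_batch` and the short id lists are made up, and 50257 is the id the added `[PAD]` token received in this notebook's tokenizer output.

```python
PAD_ID = 50257  # id of the added [PAD] token in this notebook


def pad_batch(sequences, pad_id=PAD_ID, max_length=1024):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[201, 201, 198, 35], [201, 198]])
print(batch["input_ids"])
print(batch["attention_mask"])
```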

In [None]:
input.items()

ItemsView({'input_ids': tensor([[  201,   201,   198,  ..., 50257, 50257, 50257],
        [  201,   201,   198,  ..., 50257, 50257, 50257],
        [  201,   201,   198,  ..., 50257, 50257, 50257],
        ...,
        [  201,   201,   198,  ..., 50257, 50257, 50257],
        [  201,   201,   198,  ..., 50257, 50257, 50257],
        [  201,   201,   198,  ..., 50257, 50257, 50257]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])})

Next we need to create a custom Dataset for the input data. A custom Dataset must implement three methods:


1.   __init__ : runs once when the Dataset object is instantiated.
2.   __len__ : returns the number of samples in our dataset.
3.   __getitem__ : loads and returns the sample (in tensor format) at the given index idx.


The Dataset retrieves the dataset's features one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

DataLoader is an iterable that abstracts this complexity for us in an easy API.

Refer: https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html
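As a minimal sketch of the three methods and of what the DataLoader adds on top, the toy example below wraps ten integers in a Dataset and iterates over it in shuffled minibatches. `ToyDataset` is a hypothetical class used only for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset holding the integers 0..n-1, one per sample."""

    def __init__(self, n):
        self.data = torch.arange(n).unsqueeze(1)  # shape (n, 1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


loader = DataLoader(ToyDataset(10), batch_size=4, shuffle=True)

for epoch in range(2):
    # The sample order is reshuffled every epoch; the batch sizes stay the same
    sizes = [batch.shape[0] for batch in loader]
    print(f"epoch {epoch}: batch sizes {sizes}")
```

With ten samples and a batch size of 4, the last batch is smaller than the others unless `drop_last=True` is passed to the DataLoader.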






In [None]:
from torch.utils.data import Dataset, DataLoader

class CustomPoemDataset(Dataset):
  def __init__(self, input):
    # input is the tokenizer output: a mapping with 'input_ids' and 'attention_mask'
    self.input = input

  def __len__(self):
    return len(self.input['input_ids'])

  def __getitem__(self, idx):
    item = {key: val[idx] for key, val in self.input.items()}
    # For causal language modelling the labels are the input ids; the model shifts them internally
    item['labels'] = item['input_ids']
    return item


custom_dataset = CustomPoemDataset(input)
custom_loader = DataLoader(custom_dataset, batch_size=4, shuffle=True)
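Because the labels are a straight copy of `input_ids`, the padded positions also contribute to the loss. A common alternative, shown here only as a sketch and not what this notebook ran, is to set the labels at padded positions to -100, the value the Hugging Face causal-LM loss ignores. The helper name `mask_pad_labels` and the toy tensors are made up for illustration.

```python
import torch


def mask_pad_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids and replace padded positions so the loss skips them."""
    labels = input_ids.clone()  # clone so the inputs themselves stay untouched
    labels[attention_mask == 0] = ignore_index
    return labels


# Toy batch: three real tokens followed by two [PAD] tokens (id 50257)
input_ids = torch.tensor([[201, 198, 35, 50257, 50257]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

labels = mask_pad_labels(input_ids, attention_mask)
print(labels)
```

Using this in `__getitem__` would change the reported loss values, since they would then average over real tokens only.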


[Sanity Check] Before training, we pass one batch from the DataLoader through the pre-trained model and print the loss to check that everything works.

In [None]:
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(DEVICE)

for batch in custom_loader:
  input_ids = batch['input_ids'].to(DEVICE)
  attention_mask = batch['attention_mask'].to(DEVICE)
  labels = batch['labels'].to(DEVICE)
  # print(input_ids)
  # print(attention_mask)
  outputs = model(input_ids, attention_mask=attention_mask,labels=labels)
  print(outputs.loss)
  break

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


tensor(11.5523, device='cuda:0', grad_fn=<NllLossBackward0>)
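To put the loss value in context, the sketch below compares it with the cross-entropy of a uniform guess over the vocabulary, ln(V), and converts losses to perplexity, exp(loss). The vocabulary size assumes GPT-2's 50,257 tokens plus the added `[PAD]`. One plausible reading, which is an assumption and not verified here, is that the untrained `[PAD]` embedding and the padded positions in the labels push the initial loss above the uniform baseline.

```python
import math

VOCAB_SIZE = 50258  # 50,257 GPT-2 tokens plus the added [PAD]


def uniform_baseline(vocab_size):
    """Cross-entropy (in nats) of guessing every token with equal probability."""
    return math.log(vocab_size)


def perplexity(loss):
    """Perplexity corresponding to an average cross-entropy loss in nats."""
    return math.exp(loss)


print(f"uniform-guess loss: {uniform_baseline(VOCAB_SIZE):.2f}")
print(f"perplexity at loss 11.5523: {perplexity(11.5523):.0f}")
print(f"perplexity at loss 1.1643: {perplexity(1.1643):.2f}")
```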


The next step is to fine-tune the GPT2LMHeadModel on our poetry dataset, using the AdamW optimizer with a learning rate of 5e-6.

In [None]:
import torch
from tqdm import tqdm
import os
from torch.optim import AdamW
import time

optimizer = AdamW(model.parameters(),lr = 5e-6)


# Directory to save checkpoints
CHECKPOINT_DIR = "./checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)
    print(f"Checkpoint saved at epoch {epoch+1} to the path: {path}")

def load_checkpoint(model, optimizer, path, DEVICE):
    checkpoint = torch.load(path, map_location=DEVICE)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1
    print(f"Checkpoint loaded. Resuming from epoch {start_epoch + 1}")
    return start_epoch

# Load the latest checkpoint if present; otherwise start from the first epoch
start_epoch = 0
num_epochs = 3
checkpoint_path = os.path.join(CHECKPOINT_DIR, "latest.pt")
if os.path.exists(checkpoint_path):
    start_epoch = load_checkpoint(model, optimizer, checkpoint_path, DEVICE)

for epoch in range(start_epoch, num_epochs):
    total_loss = 0
    number_of_batches = 0
    start_time = time.time()

    print(f"\nTraining epoch {epoch+1}")
    progress_bar = tqdm(custom_loader, desc=f"Epoch {epoch+1}")

    for j,batch in enumerate(progress_bar):
        inputs = batch["input_ids"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)

        optimizer.zero_grad()


        outputs = model(inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

        # print(f"    Batch {j+1} loss: {loss.item()}")

        total_loss += loss.item()
        number_of_batches += 1

        # Show average loss so far
        progress_bar.set_postfix(avg_loss=total_loss/number_of_batches)

    avg_loss = total_loss / number_of_batches
    print(f"Epoch {epoch+1} loss: {avg_loss:.4f}")
    print(f"The time taken is {time.time()-start_time}")

    # Save checkpoint after each epoch
    save_checkpoint(model, optimizer, epoch, checkpoint_path)



Training epoch 1


Epoch 1: 100%|██████████| 3464/3464 [1:16:40<00:00,  1.33s/it, avg_loss=2.12]


Epoch 1 loss: 2.1231
The time taken is 4600.78006029129
Checkpoint saved at epoch 1 to the path: ./checkpoints/latest.pt

Training epoch 2


Epoch 2: 100%|██████████| 3464/3464 [1:16:48<00:00,  1.33s/it, avg_loss=1.18]


Epoch 2 loss: 1.1784
The time taken is 4608.224022626877
Checkpoint saved at epoch 2 to the path: ./checkpoints/latest.pt

Training epoch 3


Epoch 3: 100%|██████████| 3464/3464 [1:16:44<00:00,  1.33s/it, avg_loss=1.16]


Epoch 3 loss: 1.1643
The time taken is 4604.979864835739
Checkpoint saved at epoch 3 to the path: ./checkpoints/latest.pt
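The save-and-resume logic can be checked in isolation on a tiny model. The sketch below stores the same three keys in a checkpoint, restores them into a fresh model and optimizer, and derives the epoch to resume from. The `nn.Linear` stand-in and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.optim import AdamW

# Tiny stand-in for the fine-tuned model, trained for a single step
model_a = nn.Linear(4, 2)
opt_a = AdamW(model_a.parameters(), lr=5e-6)
model_a(torch.randn(3, 4)).sum().backward()
opt_a.step()

# Save a checkpoint after (0-indexed) epoch 0
path = os.path.join(tempfile.mkdtemp(), "latest.pt")
torch.save({
    "epoch": 0,
    "model_state_dict": model_a.state_dict(),
    "optimizer_state_dict": opt_a.state_dict(),
}, path)

# Restore into freshly created objects, as a new session would
model_b = nn.Linear(4, 2)
opt_b = AdamW(model_b.parameters(), lr=5e-6)
checkpoint = torch.load(path, map_location="cpu")
model_b.load_state_dict(checkpoint["model_state_dict"])
opt_b.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue with the epoch after the saved one

print("weights match:", torch.equal(model_a.weight, model_b.weight))
print("resume at epoch index:", start_epoch)
```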


In [None]:
final_model_path = os.path.join(CHECKPOINT_DIR, "final_model.pt")
final_optimizer_path = os.path.join(CHECKPOINT_DIR, "final_optimizer.pt")

torch.save(model.state_dict(), final_model_path)
torch.save(optimizer.state_dict(), final_optimizer_path)

print(f"\nFinal model saved at: {final_model_path}")
print(f"Final optimizer saved at: {final_optimizer_path}")


Final model saved at: ./checkpoints/final_model.pt
Final optimizer saved at: ./checkpoints/final_optimizer.pt


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# The code below reloads the fine-tuned GPT-2 weights in a fresh session.
# model = GPT2LMHeadModel.from_pretrained("gpt2")
# model.resize_token_embeddings(len(tokenizer))  # match the vocabulary size used during fine-tuning
# state_dict = torch.load(final_model_path, map_location=torch.device("cpu"))
# model.load_state_dict(state_dict)
# model.eval()

# if torch.cuda.is_available():
#     model.to("cuda")

def generate_text(prompt, temperature=0.8, top_p=0.9, top_k=50):
    """Generate text continuation given a prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    output_ids = model.generate(
        **inputs,
        eos_token_id=tokenizer.eos_token_id,  # Stop once the model emits the EOS token
        max_length=120,  # Upper bound on prompt plus generated tokens
        temperature=temperature,  # Controls the randomness of the generated text
        top_p=top_p,  # Sample from the smallest set of tokens whose cumulative probability is ≥ top_p
        top_k=top_k,  # At each step the model only considers the k most likely tokens
        do_sample=True,
        repetition_penalty=1.2,  # Penalises tokens that were already generated, to reduce repetition
        no_repeat_ngram_size=3,  # Prevents any trigram from being generated twice
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# top_p and top_k both cut off unlikely tokens, which keeps the model from generating nonsense
# temperature controls the creativity
# repetition_penalty and no_repeat_ngram_size prevent repetitions
# With do_sample=False the model always picks the most probable token. To get some variation
# we set do_sample=True and control the variation with top_p and top_k

# We need to give a prompt in order to generate a poem
prompt = "The myths"
print(generate_text(prompt))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The myths that were told about my life, and the stories I heard from other people as well. The last time you saw me was in a bar with friends at 6am when we walked down to an abandoned house for lunch; it had been ten years since your birthday—but now one of us has died (I have no memory), so there is some talk among our neighbors on how many more will be left behind if this leaves them alone again after twenty-five or thirty generations: perhaps they are all dead right? And what should happen next would not help either! Maybe even kill each others
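The interplay of `temperature`, `top_k` and `top_p` can be illustrated on a toy logits vector. The plain-Python sketch below is a simplified illustration, not the library's implementation: the helper `filter_and_sample` is hypothetical, and it applies the three steps in the order temperature, top-k, top-p.

```python
import math
import random


def filter_and_sample(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """Return (sampled token id, ids that survived top-k and top-p filtering)."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token ids
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the peak for numerical stability)
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for idx, p in zip(ranked, probs):
        kept_ids.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample one id in proportion to the remaining probabilities
    choice = rng.choices(kept_ids, weights=kept_probs, k=1)[0]
    return choice, kept_ids


toy_logits = [2.0, 1.0, 0.0, -1.0, -2.0]
token, survivors = filter_and_sample(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print("surviving token ids:", survivors)
print("sampled token id:", token)
```

Lowering `top_p` or `top_k` shrinks the set of candidate tokens, which trades variety for coherence.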


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# The code below reloads the fine-tuned GPT-2 weights in a fresh session.
# model = GPT2LMHeadModel.from_pretrained("gpt2")
# model.resize_token_embeddings(len(tokenizer))  # match the vocabulary size used during fine-tuning
# state_dict = torch.load(final_model_path, map_location=torch.device("cpu"))
# model.load_state_dict(state_dict)
# model.eval()

# if torch.cuda.is_available():
#     model.to("cuda")

def generate_text(prompt, temperature=0.8, top_p=0.9, top_k=50):
    """Generate text continuation given a prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    output_ids = model.generate(
        **inputs,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # Set explicitly to silence the pad_token_id warning
        max_length=120,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        do_sample=True,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# We need to give a prompt in order to generate a poem
prompt = "The misty forest"
print(generate_text(prompt))

The misty forest of the sun-flowered moon, a dark and deep one;—where I am not yet but in my youth. This night is all this restful for me: there are no more hours that lie before them so long as they last? No longer must sleep make us weary by itself or with it be left alone to spend these two nights at ease under some tree above their house where you have stayed many years without any other than your own!
"Oh let him come back from his solitude into our room again!" said we on each side till he came out laughing


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# The code below reloads the fine-tuned GPT-2 weights in a fresh session.
# model = GPT2LMHeadModel.from_pretrained("gpt2")
# model.resize_token_embeddings(len(tokenizer))  # match the vocabulary size used during fine-tuning
# state_dict = torch.load(final_model_path, map_location=torch.device("cpu"))
# model.load_state_dict(state_dict)
# model.eval()

# if torch.cuda.is_available():
#     model.to("cuda")

def generate_text(prompt, temperature=0.8, top_p=0.9, top_k=50):
    """Generate text continuation given a prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    output_ids = model.generate(
        **inputs,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_length=120,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        do_sample=True,
        repetition_penalty=1.2,
        no_repeat_ngram_size=4,  # Here no 4-gram may be generated twice
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# We need to give a prompt in order to generate a poem
prompt = "The misty mountains"
print(generate_text(prompt))

The misty mountains, the white of morning-gloom: this is it. The fog that surrounds us—the light so clear in all but our eyes; we must not see what will happen when you go home and have your food or drink for breakfast on Sunday night afternoons? But if I could fly into another world where people are always talking about things like race because they know there's no one else to talk with over here at lunchtime just now a few minutes away from me would be good news! And yet how can anything make my life better than living under such an umbrella


Observations:

Initially I fed the title, poem and tags separated by [TITLE], [POEM] and [TAGS] special tokens, but the model struggled to produce any meaningful poems. After feeding only the poems, the performance was much better. This could be because the model couldn't correctly learn the meaning of the special tokens. Longer poems were also fed in initially, but the final decision was to cut out very long poems, as the GPT-2 model handles shorter sequences better.