# Deep Learning Assignment 2 - Part 2
## GPT-2 Fine-Tuning for English Lyrics Generation by v jitesh kumar 160122737199


### Summary:
- Fine-tuned `GPT2-Medium` (345M parameters), not GPT2-Small.
- Dataset manually curated using publicly available English song lyrics (50 Cent sample).
- Training performed for 5 epochs with monitoring of training loss.
- After fine-tuning, sample lyrics were generated successfully using different prompts.




## Fine-tuning GPT-2 for English Lyrics Generation



# Part 2b of assignment: GPT-2 Fine-Tuning on English Lyrics Dataset
## Step 1: Install Required Libraries



In [None]:
# Install the Hugging Face Transformers and Datasets libraries
!pip install transformers datasets

# Also install Accelerate, which recent Transformers releases require for the PyTorch Trainer
!pip install accelerate


Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.

## Step 2: Download and Load English Lyrics Dataset


In [None]:
# Step 2: Load the lyrics CSV

import pandas as pd

# `filename` must hold the path to the lyrics CSV (it is not defined in this cell).
# The file has no header row, so read it with header=None.
lyrics_df = pd.read_csv(filename, header=None)

# Now, the third column (index 2) contains lyrics
lyrics_list = lyrics_df[2].dropna().tolist()

# Display a sample lyric
print("\nSample lyric:")
print(lyrics_list[0])



Sample lyric:
I change places, to prevent catchin' the cases
Races, in the faces, hall at you laces
This is a hit, let's see if homicide trace this

The only thing hotter than my flow is the block (inhale and exhale)
That's why I left this snow biz, and got into show biz
Let's get this clear, it ain't on till I say it's on, (pause), it's on
I'm eatin', y'all niggas fastin' like it's Rimadon
Bowlish way in Lebanon, know 50 the bomb
I be at the edge of the bar, sippin' a Don
I keep the bottle just in case, you never know when it's on
This worries bump, I can't go wrong, my team's too strong
You want war? I take you to war, now that my money long
Why you broke? cat's buy the by lines and fantasize
The way I'm spittin', put TV's in everything I'm sittin'
While I'm hot to death, I'm gonna say this to all you playa haters
Y'all should hate the game, not the playas (c'mon)

(Chorus: repeat 2X)
I change places, to prevent catchin' the cases
Races, in the faces, hall at you laces
This is a hit
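The loading step above reads a header-less CSV and keeps the non-empty values of the third column. A minimal sketch of that pattern follows, using the standard-library `csv` module instead of pandas so that it runs without extra dependencies. The in-memory sample rows, the artist/title/lyrics column layout, and the helper name `load_lyrics` are assumptions made for illustration, not details of the actual dataset.

```python
import csv
import io

# Hypothetical stand-in for the lyrics file: no header row, lyrics in column index 2.
SAMPLE_CSV = (
    'Artist A,Song One,"First line\nSecond line"\n'
    "Artist B,Song Two,\n"
    'Artist C,Song Three,"Only line"\n'
)


def load_lyrics(handle, column=2):
    """Return the non-empty values from one column of a header-less CSV."""
    lyrics = []
    for row in csv.reader(handle):
        # Skip short rows and blank cells, roughly what dropna() does in the pandas version.
        if len(row) > column and row[column].strip():
            lyrics.append(row[column])
    return lyrics


lyrics_list = load_lyrics(io.StringIO(SAMPLE_CSV))
print(len(lyrics_list))
print(lyrics_list[0])
```

Quoted fields may span several lines, which matters for lyrics because each song is stored as one multi-line cell.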

## Step 3: Load GPT-2 Tokenizer and Prepare Dataset


In [None]:
# Step 3: Load GPT-2 tokenizer and prepare dataset

from transformers import GPT2Tokenizer

# Load GPT-2 Medium tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')

# GPT-2 has no padding token by default; add one so that batches can be padded
special_tokens_dict = {'pad_token': '[PAD]'}
tokenizer.add_special_tokens(special_tokens_dict)

# Tokenize the lyrics
encodings = tokenizer(lyrics_list, truncation=True, padding=True, max_length=256, return_tensors="pt")

# Display tokenized sample
print("\nTokenized input sample (IDs):")
print(encodings['input_ids'][0])



Tokenized input sample (IDs):
tensor([   40,  1487,  4113,    11,   284,  2948,  4929,   259,     6,   262,
         2663,   198,    49,  2114,    11,   287,   262,  6698,    11,  6899,
          379,   345,   300,  2114,   198,  1212,   318,   257,  2277,    11,
         1309,   338,   766,   611, 19625, 12854,   428,   198,   198,   464,
          691,  1517, 37546,   621,   616,  5202,   318,   262,  2512,   357,
          259,    71,  1000,   290, 21847,  1000,     8,   198,  2504,   338,
         1521,   314,  1364,   428,  6729,   275,   528,    11,   290,  1392,
          656,   905,   275,   528,   198,  5756,   338,   651,   428,  1598,
           11,   340, 18959,   470,   319, 10597,   314,   910,   340,   338,
          319,    11,   357, 32125,   828,   340,   338,   319,   198,    40,
         1101,  4483,   259,  3256,   331,     6,   439,   299,  6950,   292,
         3049,   259,     6,   588,   340,   338, 29542,   324,   261,   198,
        39961,  1836,   835,   28
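The tokenizer call above uses `truncation=True`, `padding=True`, and `max_length=256`: each lyric is cut to at most 256 tokens, then the shorter sequences are padded to the length of the longest one in the batch. The sketch below imitates that behavior on plain Python lists so the effect on `input_ids` and `attention_mask` is easy to see. The helper `pad_and_truncate` and the toy token IDs are hypothetical; the real tokenizer does this internally.

```python
def pad_and_truncate(sequences, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy IDs. GPT-2's base vocabulary has 50257 entries (IDs 0 to 50256),
# so a single added token such as [PAD] would receive ID 50257.
batch = pad_and_truncate(
    [[40, 1487, 4113], [464, 691], [1, 2, 3, 4, 5, 6]],
    max_length=4,
    pad_id=50257,
)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Because one new token is added to the vocabulary, the model's embedding matrix has to grow by one row, which is why the fine-tuning step calls `model.resize_token_embeddings(len(tokenizer))`.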

## Step 4: Load GPT-2 Model and Fine-Tune on Lyrics


In [None]:
# Step 4: Load GPT-2 model and fine-tune (a validation split is passed to the Trainer, but no evaluation schedule is set)

from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset, random_split

# Load GPT-2 medium model
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

# Adjust token embeddings for added special tokens
model.resize_token_embeddings(len(tokenizer))

# Define custom dataset
class LyricsDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings.input_ids.shape[0]

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = item['input_ids']
        return item

# Prepare dataset
full_dataset = LyricsDataset(encodings)

# Split dataset into train and validation (90% train, 10% validation)
train_size = int(0.9 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    learning_rate=5e-5,
    warmup_steps=200,
    prediction_loss_only=False,
    dataloader_drop_last=True,
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

# Start fine-tuning
trainer.train()


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 100  | 4.9155 |
| 200  | 3.6207 |
| 300  | 2.4445 |
| 400  | 2.3049 |
| 500  | 2.2762 |
| 600  | 2.2486 |
| 700  | 2.2188 |
| 800  | 2.2285 |
| 900  | 2.3222 |
| 1000 | 2.3159 |
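The loss falls fastest over the first few hundred steps, which overlaps with the 200-step warmup set in `TrainingArguments`. As a rough sketch, assuming the Trainer's default linear scheduler applies (no `lr_scheduler_type` is set in the arguments) and taking the total of 27,000 steps from the run's `TrainOutput`, the learning rate at a given step would look like this. The function name `linear_warmup_decay` is made up for the example.

```python
def linear_warmup_decay(step, base_lr=5e-5, warmup_steps=200, total_steps=27000):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 100, 200, 1000, 27000):
    print(step, linear_warmup_decay(step))
```

Under these assumptions the rate climbs from zero to `5e-5` at step 200 and then shrinks steadily until the final step.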


TrainOutput(global_step=27000, training_loss=1.8302190602620443, metrics={'train_runtime': 3692.5456, 'train_samples_per_second': 14.624, 'train_steps_per_second': 7.312, 'total_flos': 2.5074918752256e+16, 'train_loss': 1.8302190602620443, 'epoch': 5.0})
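In `LyricsDataset.__getitem__`, `labels` is set to `input_ids`, so the `[PAD]` positions added by the tokenizer also count toward the loss. A common variation, not used in the run above, is to replace the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; the helper name `mask_padding_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# With tensors, the equivalent would be:
#   labels = input_ids.clone()
#   labels[attention_mask == 0] = -100
labels = mask_padding_labels(
    [[40, 1487, 50257, 50257], [464, 691, 1517, 37546]],
    [[1, 1, 0, 0], [1, 1, 1, 1]],
)
print(labels)
```

Padding tokens are easy to predict, so including them probably lowers the reported loss. With masking, the loss values would likely differ from those shown above.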

In [None]:
# Save the fine-tuned model manually
trainer.save_model("./results")

# Save the tokenizer manually
tokenizer.save_pretrained("./results")


('./results/tokenizer_config.json',
 './results/special_tokens_map.json',
 './results/vocab.json',
 './results/merges.txt',
 './results/added_tokens.json')

## Step 5: Generate Sample Lyrics Using Fine-Tuned GPT-2


In [None]:
# Step 5: Generate sample lyrics using the fine-tuned GPT-2 model

import torch
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

# Load the fine-tuned model
model_path = "./results"
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

# Create a text generation pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Provide a starting prompt
prompt = "Love is"

# Generate lyrics
generated = text_generator(
    prompt,
    max_length=150,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2
)

# Print the generated lyrics
print("\nGenerated Lyrics:")
print(generated[0]['generated_text'])


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



Generated Lyrics:
Love is a funny thing
When love is so strong
Love can make you forget
You've got to let go of your pride and be free.


I know I'm gonna cry (I know I'll cry) Oh why won't you forgive me
For loving someone like that?(For loving someone like this)... Kentucky! A B C D E F G H R O U N T S L Y X - Love is such an easy game, baby yeah
Yeah

It's not what it seems
'Cause love is all about trust
And the only time that I'm gonna let you go
Is when you're hurting too much or being untrue
Don't put me on that high
Just leave
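The generation call combines `temperature=0.7`, `top_k=50`, and `top_p=0.9`. The sketch below is a simplified, pure-Python illustration of how these three settings can narrow the choice of the next token; it is not the library's implementation, and the function name `sample_next_token` and the toy logits are assumptions.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Sample one token index after temperature scaling, top-k, then top-p filtering."""
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Keep the top_k highest-scoring candidates, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus step: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    indices, kept_probs = zip(*kept)
    # random.choices renormalizes the weights of the remaining candidates.
    return rng.choices(indices, weights=kept_probs, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same tokens
toy_logits = [2.0, 1.0, 0.1, -1.0]
print([sample_next_token(toy_logits, rng=rng) for _ in range(10)])
```

Lower temperature and smaller `top_k` or `top_p` make the output more predictable, while higher values allow more unusual word choices, which may be part of why stray fragments appear in the sample above.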


## Step 6.1: Generate Multiple Lyrics Samples


In [None]:
# Generate 3 different lyrics samples using different prompts

prompts = [
    "Broken dreams",
    "Summer nights",
    "Dancing under the rain"
]

for prompt in prompts:
    generated = text_generator(
        prompt,
        max_length=150,
        num_return_sequences=1,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        top_k=50,
        repetition_penalty=1.2
    )
    print(f"\nPrompt: {prompt}")
    print("Generated Lyrics:")
    print(generated[0]['generated_text'])
    print("-" * 100)



Prompt: Broken dreams
Generated Lyrics:
Broken dreams, broken hearts
Can never heal the scars of a lifetime torn apart
I'm alone and I don't know why
I feel like crying all the time
And every night it's raining tears on my pillow Banks B - O'Hara (B-O) 

Broke dreams, broke heartaches
Can never mend the hurt that is left in our love
I'm alone but I don t know why
I feel like dying each day
And every night it rains tears on my bed... Broke Dreams! Broken Heartache. etc..etc

(x2) The sky is black and gray
My life has just begun to fade away now so lonely inside yeah
There is nothing else
----------------------------------------------------------------------------------------------------

Prompt: Summer nights
Generated Lyrics:
Summer nights are calling, and the leaves begin to fall
I close my eyes as I dream of you
And then I'm alone again
As if in a tunnel deep inside me
My thoughts go on forever
They travel through time

My love is all around me, my heart's with her every step. 
When
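Both generation calls also set `repetition_penalty=1.2`. The usual formulation lowers the score of every token that has already been generated: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below applies that rule to a plain list of logits; the function name `apply_repetition_penalty` and the toy values are assumptions for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated token IDs made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [2.4, -1.0, 0.5]
print(apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 1]))
```

The penalty discourages exact repeats but does not forbid them: the "Broken dreams" sample still restates its first stanza in slightly altered words.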



## Conclusion

In this project, we fine-tuned the GPT-2 Medium model on an English lyrics dataset. The model was trained for 5 epochs (27,000 optimizer steps), and the training loss averaged approximately 1.83 over the run.
After fine-tuning, the model generated lyric-like text for several prompts, with recognizable line breaks, refrains, and themes matching each prompt, although some samples contain stray fragments unrelated to the prompt.
The project shows that a pretrained transformer language model can be adapted to a domain-specific text generation task with a short fine-tuning run (about an hour here).
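The figures in the `TrainOutput` shown earlier can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so each optimizer step processes `per_device_train_batch_size = 2` examples. It also converts the average training loss into a perplexity, which is a training-set figure averaged over the run and not a held-out score.

```python
import math

# Values copied from the TrainOutput of the fine-tuning run.
global_step = 27000
train_runtime_s = 3692.5456
train_loss = 1.8302190602620443
num_epochs = 5
batch_size = 2  # per_device_train_batch_size; one device, no gradient accumulation assumed

samples_seen = global_step * batch_size
samples_per_epoch = samples_seen // num_epochs
throughput = samples_seen / train_runtime_s
perplexity = math.exp(train_loss)  # exp of the mean cross-entropy loss

print(f"samples per epoch: {samples_per_epoch}")
print(f"samples per second: {throughput:.3f}")
print(f"training perplexity: {perplexity:.2f}")
```

Under these assumptions each epoch covers 10,800 training examples, and the computed throughput agrees with the reported `train_samples_per_second` of 14.624. The perplexity comes out a little above 6.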
