<a href="https://colab.research.google.com/github/merb404/poemgenerator_nlp/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:

!pip install transformers torch pandas datasets accelerate -q

print("Mounting Google Drive...")
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import torch
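from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
# Note: TextDataset is deprecated in recent transformers releases, which
# recommend the datasets library for preprocessing instead
from transformers import TextDataset, DataCollatorForLanguageModeling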
import os



Mounting Google Drive...
Mounted at /content/drive


In [3]:
csv_file = "/content/PoetryFoundationData.csv"
df = pd.read_csv(csv_file)
print(f"✓ Loaded {len(df)} poems")

# Extract and clean poems, skipping rows with a missing 'Poem' value
poems = []
for poem in df['Poem'].dropna():
    # Normalize line endings: drop carriage returns so only '\n' remains
    poem = str(poem).replace('\r', '')

    # Add structural tokens. Stanza breaks ('\n\n') must be replaced before
    # single line breaks, otherwise each one would turn into two <BR> tokens
    poem = poem.replace('\n\n', ' <STANZA> ')
    poem = poem.replace('\n', ' <BR> ')

    poems.append(poem)

print(f"✓ Processed {len(poems)} valid poems")

# Combine into corpus
corpus_text = '\n\n'.join(poems)
print(f"✓ Total characters: {len(corpus_text):,}")

# Save corpus
corpus_path = '/content/poetry_corpus.txt'
with open(corpus_path, 'w', encoding='utf-8') as f:
    f.write(corpus_text)

print(f"✓ Corpus saved to: {corpus_path}")

✓ Loaded 13854 poems
✓ Processed 13854 valid poems
✓ Total characters: 21,951,033
✓ Corpus saved to: /content/poetry_corpus.txt
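The `<BR>` and `<STANZA>` markers inserted above flatten each poem onto one line, so text sampled from the fine-tuned model will contain the same markers. A minimal sketch of the inverse mapping follows; the helper name `restore_line_breaks` is hypothetical and not part of the notebook.

```python
import re

def restore_line_breaks(text: str) -> str:
    """Invert the <STANZA>/<BR> encoding used when building the corpus.

    Hypothetical helper, not part of the original notebook.
    """
    # Stanza markers become blank lines; only the padding spaces are swallowed
    text = re.sub(r" *<STANZA> *", "\n\n", text)
    # Line-break markers become single newlines
    text = re.sub(r" *<BR> *", "\n", text)
    return text.strip()

encoded = "Roses are red, <BR> violets are blue. <STANZA> Sugar is sweet."
print(restore_line_breaks(encoded))
```

The patterns match spaces only, not arbitrary whitespace, so a newline produced by the first substitution is not consumed by the second. Spaces that originally sat next to a line break are not recovered, so the round trip is exact only for text without such padding.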


In [4]:
model_name = 'gpt2'

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Add custom tokens
special_tokens = {'additional_special_tokens': ['<BR>', '<STANZA>']}
tokenizer.add_special_tokens(special_tokens)
tokenizer.pad_token = tokenizer.eos_token

print(f" Tokenizer loaded. Vocab size: {len(tokenizer)}")

# Load model
model = GPT2LMHeadModel.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

print(f" Model loaded. Parameters: {model.num_parameters():,}")

 Tokenizer loaded. Vocab size: 50259


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


 Model loaded. Parameters: 124,441,344


In [6]:
block_size = 64

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=corpus_path,
    block_size=block_size
)

print(f" Dataset created. Total blocks: {len(train_dataset)}")

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

print("Data collator ready")

 Dataset created. Total blocks: 98947
Data collator ready
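To make the block count above easier to interpret, here is a rough sketch of fixed-length chunking in plain Python. It assumes the dataset splits the tokenized corpus into consecutive, non-overlapping blocks of `block_size` tokens and drops the short tail; that is my reading of `TextDataset`, not something the notebook states. The function name `chunk_into_blocks` is hypothetical.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat token sequence into full blocks, dropping the short tail.

    Hypothetical helper illustrating the assumed chunking behaviour.
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy example: 10 token ids with block_size=4 give two full blocks
toy_ids = list(range(10))
blocks = chunk_into_blocks(toy_ids, block_size=4)
print(blocks)

# Figures reported by the cell above
total_blocks = 98_947
block_size = 64
tokens_covered = total_blocks * block_size
print(f"Tokens covered by the dataset: {tokens_covered:,}")
```

Under this assumption, blocks ignore poem boundaries: a 64-token window can start in the middle of one poem and end in the next.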


In [7]:
output_dir = '/content/drive/MyDrive/poetry_model'

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=1000,
    save_total_limit=1,
    logging_steps=50,
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    fp16=True,
    logging_dir='/content/logs',
)

print("Training Configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch Size: {training_args.per_device_train_batch_size}")
print(f"  Learning Rate: {training_args.learning_rate}")
print(f"  Mixed Precision: {training_args.fp16}")
print(f"  Output: {output_dir}")

Training Configuration:
  Epochs: 1
  Batch Size: 8
  Learning Rate: 5e-05
  Mixed Precision: True
  Output: /content/drive/MyDrive/poetry_model
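The configuration sets `warmup_steps=50` but no scheduler, so the `Trainer` default (a linear schedule) applies. The sketch below approximates that schedule in plain Python and estimates the steps per epoch. It assumes a single device, no gradient accumulation, and a kept final partial batch; `linear_warmup_decay` is a hypothetical helper, not a library function.

```python
import math

def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Approximate learning rate at `step`: linear warmup, then linear decay to 0.

    Hypothetical helper mirroring the assumed default linear schedule.
    """
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))

total_blocks = 98_947  # block count printed by the dataset cell
batch_size = 8         # per_device_train_batch_size, single device assumed
steps_per_epoch = math.ceil(total_blocks / batch_size)
print(f"Estimated optimizer steps for one epoch: {steps_per_epoch}")

for step in (0, 25, 50, 1000, steps_per_epoch):
    lr = linear_warmup_decay(step, 5e-5, 50, steps_per_epoch)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With 50 warmup steps out of an estimated twelve thousand or so, the learning rate sits near its peak almost immediately and then decays slowly for the rest of the epoch.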


In [8]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Train
trainer.train()

print("\n" + "="*70)
print(" TRAINING COMPLETE!")
print("="*70)

# Save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f" Model saved to: {output_dir}")


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step    Training Loss
  50    5.0370
 100    4.6106
 150    4.1647
 200    4.0590
 250    3.9326
 300    3.9694
 350    3.8473
 400    4.0082
 450    3.8121
 500    3.8582



 TRAINING COMPLETE!
 Model saved to: /content/drive/MyDrive/poetry_model
