<a href="https://colab.research.google.com/github/jgarnicaa/GenAI_Poetry/blob/main/gpt2_finetuning_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning GPT-2 for poem generation

In this notebook we fine-tune GPT-2 small on our custom poetry dataset (English and Spanish).

Why use this architecture?
- It has multilingual support
- It is a small LLM, so it is easy to deploy
- It is easy to train

We use MLflow to compare different models.

Model taken from: https://huggingface.co/openai-community/gpt2

To use MLflow, launch the UI:

> mlflow ui --backend-store-uri file:///tmp/mlflow --default-artifact-root s3://genaipoetry-bucket/mlflow/artifacts

or start a tracking server backed by SQLite:

> mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root s3://genaipoetry-bucket/mlflow/artifacts --host 0.0.0.0 --port 5000

In [1]:
!pip install mlflow boto3 transformers datasets "dvc[s3]"

Collecting mlflow
  Downloading mlflow-2.20.1-py3-none-any.whl.metadata (30 kB)
Collecting boto3
  Downloading boto3-1.36.18-py3-none-any.whl.metadata (6.7 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dvc
  Downloading dvc-3.59.0-py3-none-any.whl.metadata (18 kB)
Collecting mlflow-skinny==2.20.1 (from mlflow)
  Downloading mlflow_skinny-2.20.1-py3-none-any.whl.metadata (31 kB)
Collecting alembic!=1.10.0,<2 (from mlflow)
  Downloading alembic-1.14.1-py3-none-any.whl.metadata (7.4 kB)
Collecting docker<8,>=4.0.0 (from mlflow)
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting graphene<4 (from mlflow)
  Downloading graphene-3.4.3-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting gunicorn<24 (from mlflow)
  Downloading gunicorn-23.0.0-py3-none-any.whl.metadata (4.4 kB)
Collecting databricks-sdk<1,>=0.20.0 (from mlflow-skinny==2.20.1->mlflow)
  Downloading databricks_sdk-0.44.0-py3-none-any.whl.metadata (38 kB)
Colle

In [1]:
# Libraries

import mlflow
import mlflow.pytorch
import os
import boto3
import shutil
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict, concatenate_datasets
from google.colab import drive

In [2]:
# AWS environment config so MLflow can save experiments to the S3 bucket.
# Fill in your own credentials before running this cell.
os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://s3.amazonaws.com"
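
The cell above leaves both keys as empty strings, which would only fail later when something tries to reach S3. One way to avoid hard-coding keys is to prompt for them and reject empty values early. This is a minimal sketch, not part of the original notebook: `set_aws_credentials` and `prompt_for_aws_credentials` are hypothetical helper names.

```python
import os
from getpass import getpass


def set_aws_credentials(key_id, secret, env=os.environ):
    """Store AWS credentials in `env`, rejecting empty values early."""
    if not key_id or not secret:
        raise ValueError("AWS credentials must not be empty")
    env["AWS_ACCESS_KEY_ID"] = key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret


def prompt_for_aws_credentials(env=os.environ):
    """Ask for the keys interactively so they never appear in the saved notebook."""
    set_aws_credentials(
        getpass("AWS access key ID: "),
        getpass("AWS secret access key: "),
        env=env,
    )
```

In a notebook cell, calling `prompt_for_aws_credentials()` once before any MLflow or DVC command would set both variables for the session.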

In [3]:
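# Config: S3 artifact location and SQLite tracking store for MLflow.
# NOTE: the experiment created below reports a local artifact_location (/content/mlruns/1),
# so MLFLOW_ARTIFACT_URI on its own does not appear to redirect artifacts to S3.
os.environ["MLFLOW_ARTIFACT_URI"] = "s3://genaipoetry-bucket/mlflow-artifacts/"
os.environ["MLFLOW_TRACKING_URI"] = "sqlite:///mlflow.db"
mlflow.set_tracking_uri("sqlite:///mlflow.db")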

In [4]:
drive.mount('/content/drive')
os.makedirs("results", exist_ok=True)

Mounted at /content/drive


In [5]:
!git clone https://github.com/jgarnicaa/GenAI_Poetry.git  # Clone the repo to recover the dataset with DVC

Cloning into 'GenAI_Poetry'...
remote: Enumerating objects: 109, done.
remote: Counting objects: 100% (109/109), done.
remote: Compressing objects: 100% (78/78), done.
remote: Total 109 (delta 30), reused 87 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (109/109), 239.37 KiB | 4.60 MiB/s, done.
Resolving deltas: 100% (30/30), done.


In [6]:
!cd GenAI_Poetry/ && dvc pull  # Recover the dataset


In [7]:
# Load dataset

df_es = load_dataset("csv", data_files={"train": "/content/GenAI_Poetry/data/ES_corpus/ES_poetry_cleaned.csv"})
df_en = load_dataset("csv", data_files={"train": "/content/GenAI_Poetry/data/EN_corpus/EN_poetry_cleaned.csv"})

# Merge the ES and EN datasets into a single train split
dataset = DatasetDict({"train": concatenate_datasets([df_es["train"], df_en["train"]])})

# Keep only the Poem column; drop the rest of the metadata
dataset = dataset.remove_columns(["Unnamed: 0", "Title", "Author", "Unnamed: 0.1", "Poet", "Tags"])


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [8]:
# Load the pretrained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [9]:
def tokenize_function(examples):
    # Truncate or pad every poem to the tokenizer's maximum length
    tokenized = tokenizer(examples['Poem'], truncation=True, padding='max_length')

    # Labels are a copy of input_ids; the model shifts them internally for next-token prediction
    tokenized["labels"] = tokenized["input_ids"].copy()

    return tokenized
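
Because every poem is padded to the maximum length and the labels are a plain copy of `input_ids`, the loss also covers the padded positions. A common alternative is to set the labels of padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper that the notebook does not use, and the token ids are toy values.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Return labels equal to input_ids, with padded positions replaced by ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50257 stands in for the id of the added [PAD] token (illustrative value).
batch_ids = [[10, 11, 12, 50257, 50257], [20, 21, 22, 23, 24]]
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
labels = mask_padding_labels(batch_ids, batch_mask)
print(labels)
```

Inside `tokenize_function`, the same helper could be applied as `mask_padding_labels(tokenized["input_ids"], tokenized["attention_mask"])`. This would change what the model is trained on, so the results would differ from those shown in this notebook.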

In [10]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [11]:

tokenized_dataset = dataset.map(tokenize_function, batched=True)  # Tokenize the dataset

print(tokenized_dataset["train"][0])  # Print the first example

Map:   0%|          | 0/25563 [00:00<?, ? examples/s]

{'Poem': '\n\n\r\nEntre nosotros crece la ropa en las mañanas\n\r\nse atraviesan mil veces los oficios\n\r\nnos mueven los deberes\n\r\nel futuro\n\r\nlas cosas.\n\n\n\r\nPor si no fuera mucho alguien propone la medida\n\r\npara que no te vayas\n\r\n-dicen-\n\r\nes necesario el regateo.\n\r\nPero tus manos son mi tiempo\n\r\ny no quiero jugar a detener la boca y los abrazos.\n\r\nTe irás más tarde\n\r\n-dicen-\n\r\nsi encuentro la mesura\n\r\npero deseo tu cuerpo y este día\n\r\neste preciso cielo\n\r\nla película de hoy\n\r\nla cama próxima\n\r\ntu sudor y tu piel ahora en la tarde.\n\n\n\r\nNo voy a retener mis frases ni mi aliento\n\r\nno me quiero tragar ni un poco de silencio\n\r\nni uno solo de los consentimientos.\n\n\n\r\n¿Por qué la luz a medias?\n\r\n¿Para que no te vayas cuando te irás?\n\r\nNunca se mete el sol antes de tiempo\n\r\ny se pone lo mismo en días nublados.\n\r\nYo quiero tu cobija hasta que quieras\n\r\nte doy mientras\n\r\nmis ansias, mis costumbres,\n\r\nmis r

In [12]:
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/Colab Notebooks/GenAIPoetry/checkpoints',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=6,
    save_steps=500,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=100,
    warmup_steps=500,
    weight_decay=0.01,
    report_to="mlflow"
)
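
To get a feel for what these arguments imply, the number of optimizer steps can be estimated from the dataset size reported by the `Map` progress bar above (25563 examples). This is a rough sketch that assumes a single device and no gradient accumulation. `training_schedule` is a hypothetical helper, not a `transformers` API.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps, warmup_steps):
    """Summarize optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Number of times the save_steps interval is reached during training.
        "checkpoints_triggered": total_steps // save_steps,
        # Share of training spent in learning-rate warmup.
        "warmup_fraction": warmup_steps / total_steps,
    }


schedule = training_schedule(
    num_examples=25563, batch_size=6, epochs=3, save_steps=500, warmup_steps=500
)
print(schedule)
```

With `save_total_limit=2`, only the two most recent checkpoints stay on Drive regardless of how many are triggered.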

In [13]:
mlflow.set_experiment("FineTuning-GPT2-Poetry")

2025/02/12 09:46:19 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/02/12 09:46:20 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

<Experiment: artifact_location='/content/mlruns/1', creation_time=1739353581282, experiment_id='1', last_update_time=1739353581282, lifecycle_stage='active', name='FineTuning-GPT2-Poetry', tags={}>

In [14]:
## Training with MLflow logging
with mlflow.start_run():
    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=tokenized_dataset["train"])

    # Resume from the latest checkpoint in output_dir (raises an error if none exists yet)
    trainer.train(resume_from_checkpoint=True)

    mlflow.log_param("epochs", training_args.num_train_epochs)
    mlflow.log_param("batch size", training_args.per_device_train_batch_size)

    model.save_pretrained('./fine-tuned-gpt2-poetry')

    # Log the saved model files as run artifacts
    mlflow.log_artifacts('./fine-tuned-gpt2-poetry', artifact_path="models")

    # Log the model in MLflow's PyTorch flavor and register it
    mlflow.pytorch.log_model(model, artifact_path="models", registered_model_name="GPT2-Poetry")


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)


Step,Training Loss


Successfully registered model 'GPT2-Poetry'.
Created version '1' of model 'GPT2-Poetry'.


In [15]:
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=["1"])
for run in runs:
    print(f"Run ID: {run.info.run_id}")

Run ID: 544a11a246a244bb8dd6d0839b874d17


In [16]:
logged_model = "runs:/544a11a246a244bb8dd6d0839b874d17/models"
# Load the model
model = mlflow.pytorch.load_model(logged_model)


In [41]:
import torch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

model.to(device)
model.eval()  # disable dropout for generation

def generate_poem(prompt, max_length=100, temperature=1.2, top_p=0.95):
    # Tokenize with an attention mask so generate() does not have to infer it
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # the value generate() otherwise falls back to
        max_length=max_length,
        num_return_sequences=1,
        temperature=temperature,  # e.g. 1.2 or 0.7
        top_p=top_p,  # e.g. 0.95 or 0.9
        do_sample=True
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

print("English poem: ", generate_poem("Under the moonlight,"))

print("Spanish poem: ", generate_poem("Bajo el sol"))


Device: cpu


English poem:  Under the moonlight, I walk through the field, my arms and legs trembling, like little children at the start of a song.






As for us, each was created to sing, who have the right to be free of  any pain? The song of song and of love is my song,  that I sing them in silence so as to not be understood. But they know me. I know  they love
Spanish poem:  Bajo el sol que amarte mi oye

y entra mía, y vienes amantes;

y la tarde que desvanecen.

De un poco aún, que cual me entretienes,

y de mi amor no me deseo alguna vez.

¿Entonces, Dios,

en este sol desde sus man
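
The `temperature` and `top_p` arguments control how adventurous the sampled poems are. As a rough illustration of the idea (not the exact `transformers` implementation), the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy list of scores. `sample_top_p` and the logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=1.2, top_p=0.95, rng=random):
    """Sample one index from logits using temperature scaling and nucleus (top-p) filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
print(sample_top_p(toy_logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

A lower temperature sharpens the distribution and a lower `top_p` discards more of the unlikely tokens, so both push generation toward safer, more repetitive text. Higher values do the opposite.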


In [45]:
from google.colab import files
files.download("/content/mlflow.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>
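
The cell that created `/content/mlflow.zip` is not shown in this notebook. One plausible way to produce such an archive is `shutil.make_archive`. The sketch below is an assumption about how it could be done, not the step that was actually run. `zip_directory` is a hypothetical helper, and the paths in the usage comment are assumptions too.

```python
import os
import shutil


def zip_directory(source_dir, archive_base):
    """Zip the contents of source_dir into archive_base + '.zip' and return the archive path."""
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(f"{source_dir} is not a directory")
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


# Hypothetical usage in Colab (paths are assumptions):
# zip_directory("/content/mlruns", "/content/mlflow")  # would write /content/mlflow.zip
```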

In [47]:
!zip -r /content/drive/MyDrive/model.zip /content/fine-tuned-gpt2-poetry

  adding: content/fine-tuned-gpt2-poetry/ (stored 0%)
  adding: content/fine-tuned-gpt2-poetry/config.json (deflated 52%)
  adding: content/fine-tuned-gpt2-poetry/generation_config.json (deflated 24%)
  adding: content/fine-tuned-gpt2-poetry/model.safetensors (deflated 7%)


In [46]:
files.download("/content/drive/MyDrive/model.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>