# Fine-tuning GPT-2 for poem generation

In this notebook we fine-tune GPT-2 small on our custom poetry dataset (English and Spanish).

Why use this architecture?
- Its byte-level tokenizer can encode both English and Spanish text
- It is a small language model, so it is cheap to deploy
- It is easy to train

We use MLflow to track and compare the different models.

Model taken from: https://huggingface.co/openai-community/gpt2

To use MLflow:
    mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns
    mlflow ui


In [1]:
#Libraries

import mlflow
import mlflow.pytorch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict, concatenate_datasets



In [2]:
#Load dataset

df_es=load_dataset("csv", data_files={"train": "../../data/ES_corpus/ES_poetry_cleaned.csv"} )
df_en=load_dataset("csv", data_files={"train": "../../data/EN_corpus/EN_poetry_cleaned.csv"} )

dataset = DatasetDict({"train": concatenate_datasets([df_es["train"], df_en["train"]])})  # Merge the EN and ES datasets

# Keep only the Poem column. remove_columns returns a new object, so reassign it.
dataset = dataset.remove_columns(["Unnamed: 0", "Title", "Author", "Unnamed: 0.1", "Poet", "Tags"])
dataset


DatasetDict({
    train: Dataset({
        features: ['Poem'],
        num_rows: 25563
    })
})

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # Define the tokenizer and model from GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [4]:
def tokenize_function(examples):
    return tokenizer(examples["Poem"], padding="max_length", truncation=True)
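
What `padding="max_length"` and `truncation=True` do to a single poem can be sketched in plain Python. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the tokenizer API. It assumes right-side padding. When no explicit `max_length` is passed, the tokenizer should fall back to the model maximum (1024 tokens for GPT-2, as far as we know), so every poem is stored at full length.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)           # how many pad tokens are still needed
    input_ids = kept + [pad_id] * n_pad      # padding: fill up on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
```

Passing a smaller `max_length` to the tokenizer is one way to reduce memory use if most poems are short.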

In [5]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
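
The idea behind this message can be illustrated with a simplified sketch: the row added for `[PAD]` starts near the average of the existing embedding rows instead of at a random point. The version below uses only the mean and skips the sampling and covariance that the message describes. `mean_row` and `resize_with_mean` are hypothetical helpers, not library functions.

```python
def mean_row(embeddings):
    """Column-wise mean of a list of equally sized embedding rows."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(row[d] for row in embeddings) / n for d in range(dim)]


def resize_with_mean(embeddings, new_size):
    """Append copies of the mean row until the table has new_size rows."""
    new_row = mean_row(embeddings)
    extra = new_size - len(embeddings)       # no rows are added if this is <= 0
    return embeddings + [list(new_row) for _ in range(extra)]


old = [[1.0, 2.0], [3.0, 4.0]]               # toy 2-token, 2-dimensional table
resized = resize_with_mean(old, new_size=3)  # one extra row, e.g. for [PAD]
```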


In [6]:

tokenized_dataset= dataset.map(tokenize_function, batched=True) ## Tokenize dataset
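
The tokenized dataset holds `input_ids` and `attention_mask`, but a causal language model only returns a loss when it also receives `labels`. A common convention is to copy `input_ids` into `labels` and set padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The shift by one position happens inside the model. The sketch below shows that convention in plain Python. `build_labels` is a hypothetical helper, and `DataCollatorForLanguageModeling(mlm=False)` does something along these lines.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Three real tokens followed by two pad tokens
labels = build_labels([10, 11, 12, 0, 0], [1, 1, 1, 0, 0])
```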

In [7]:
training_args=TrainingArguments( ## Training arguments for huggingface training
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    logging_dir="./logs",
    report_to="none"  # Disable the Trainer's built-in reporting integrations
)
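
As a rough check on training length, the number of optimizer steps implied by these arguments can be estimated. This assumes that all 25563 rows are used for training, on a single device, with the default gradient accumulation of 1. None of these assumptions is stated in the notebook.

```python
import math

num_rows = 25563   # training rows reported in the dataset summary
batch_size = 4     # per_device_train_batch_size
epochs = 3         # num_train_epochs
num_devices = 1    # assumption: one GPU or CPU
grad_accum = 1     # assumption: default gradient_accumulation_steps

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_rows / (batch_size * num_devices * grad_accum))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```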

In [None]:
mlflow.set_experiment("FineTuning-GPT2-Poetry")

<Experiment: artifact_location='file:///home/linux/Documents/Proyectos%20personales%20IA/GenAI_Poetry/src/models/mlruns/752725571680088763', creation_time=1739013767145, experiment_id='752725571680088763', last_update_time=1739013767145, lifecycle_stage='active', name='FineTuning-GPT2-Poetry', tags={}>


In [None]:
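## Training with MLflow logging
from transformers import DataCollatorForLanguageModeling

# Hold out 10% of the poems (an arbitrary choice) so that trainer.evaluate() has data to score
splits = tokenized_dataset["train"].train_test_split(test_size=0.1, seed=42)

# With mlm=False the collator copies input_ids into labels and masks the padding,
# which the model needs in order to return a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

with mlflow.start_run():
    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=splits["train"],
                      eval_dataset=splits["test"],
                      data_collator=data_collator)

    trainer.train()

    mlflow.log_param("epochs", training_args.num_train_epochs)
    mlflow.log_param("batch size", training_args.per_device_train_batch_size)
    mlflow.log_artifacts("results")  # log_artifacts uploads a whole directory

    loss = trainer.evaluate()["eval_loss"]
    mlflow.log_metric("eval_loss", loss)

    # Save the tokenizer as well, since a pad token may have been added to it
    model.save_pretrained('./fine-tuned-gpt2-poetry')
    tokenizer.save_pretrained('./fine-tuned-gpt2-poetry')

    mlflow.pytorch.log_model(model, "fine-tuned-gpt2-poetry")

print("Finished training and logged in MLflow")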
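
The logged `eval_loss` is a mean cross-entropy in nats, so it can be converted to perplexity, which is often easier to compare across runs. The snippet below is a minimal sketch. The loss value in it is a made-up placeholder, not a result from this notebook.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


example_loss = 3.0                      # hypothetical value for illustration only
example_ppl = perplexity(example_loss)  # roughly 20.09
```

Inside the training run, the same value could be logged next to the loss with `mlflow.log_metric("eval_perplexity", perplexity(loss))`. The metric name here is only a suggestion.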
