<a href="https://colab.research.google.com/github/lucarinelli/conditional_text_generation/blob/main/notebooks/experiments/gpt2_separators_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#GPT-2 experiments on Conditional Text Generation

Add a description here...

#Setup

SSH access for development and debugging. It is useful for quickly exploring the files in the repository and making small additions or fixes.

Do not enable it if you are only running the experiment.

In [None]:
# Install colab_ssh on Google Colab
!pip install colab_ssh --upgrade

from colab_ssh import launch_ssh_cloudflared
# Comment out the next line to disable SSH; leave it uncommented to enable it
launch_ssh_cloudflared(password="2N6.ufRjL,Zp:GfcJuh?TQ")

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Check which GPU was allocated (hopefully something better than a K80)

In [None]:
!nvidia-smi

Install the required Python packages

In [None]:
!pip install --quiet transformers datasets tokenizers sacrebleu wandb

Clone our repository to get our utilities and overrides

In [None]:
!git clone https://github.com/lucarinelli/conditional_text_generation.git /content/conditional_text_generation

Add our repository's `src` folder to the Python path

In [None]:
import sys
import os

module_path = os.path.abspath("/content/conditional_text_generation/src")
if module_path not in sys.path:
    sys.path.append(module_path)

Set up and connect to Weights & Biases to store logs and results

**REMEMBER TO SET WANDB_PROJECT TO THE CORRECT VALUE**

In [None]:
import wandb

%env WANDB_PROJECT=ctrl_dry_runs
%env WANDB_ENTITY=polito_aiml2021_textgen
%env WANDB_LOG_MODEL=true
%env WANDB_WATCH=all
%env WANDB_SILENT=true

wandb.login()

Set training arguments and other experiment parameters

In [None]:
from experiment_parameters import ExperimentParameters
from transformers import TrainingArguments

project_name = "gpt2-separators"
training_args = TrainingArguments(
    output_dir=f"/content/drive/MyDrive/conditional_text_generation/runs/data/results/{project_name}",  # output directory
    save_total_limit=3,  # keep at most 3 checkpoints on disk
    num_train_epochs=3,  # total number of training epochs
    per_device_train_batch_size=64,  # batch size per device during training
    per_device_eval_batch_size=1,  # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,
    logging_dir=f"/content/drive/MyDrive/conditional_text_generation/runs/data/logs/{project_name}",  # directory for storing logs
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",  # must match the evaluation strategy when load_best_model_at_end=True
    report_to="wandb",
    load_best_model_at_end=True,
    remove_unused_columns=False,
)

# TODO: the run name in the experiment parameters is not actually used
experiment_parameters = ExperimentParameters(
    training_args=training_args,
    force_dataset_update=True,
    control_codes_type="separators",
)
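
To make the effect of `warmup_steps=500` more concrete: with the `Trainer`'s default linear scheduler, the learning rate rises linearly from 0 during the first 500 optimizer steps and then decays linearly to 0 by the end of training. The sketch below reproduces that shape. It assumes the default peak learning rate of `5e-5` (no `learning_rate` is set above) and a hypothetical total of 10,000 steps; the real total depends on the dataset size and batch size.

```python
def linear_warmup_lr(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale by the fraction of post-warmup steps still remaining
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 10_000  # hypothetical value, chosen only for illustration
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_warmup_lr(step, total_steps))
```

The rate peaks exactly at step 500, so very short runs (fewer than 500 steps in total) would never reach the peak value.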

#Dataset
We download and load the COCO captions dataset.

Each dataset item joins the caption of an image with the categories and/or supercategories of the objects present in that image.
Depending on the experiment settings, categories, supercategories, or both are used as control codes.

The dataset is post-processed so that the model is trained on different combinations of control codes for each caption, depending on the experiment parameters. The post-processing output is saved to .json files, which are then loaded and further handled by the `Dataset` class from HuggingFace `datasets` (used for its performance and caching abilities).

In [None]:
from captions_dataset import *

dataset_train, dataset_val, control_codes, references = get_dataset(experiment_parameters, data_path="/content/data")
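
As a rough illustration of the post-processing idea described above, the sketch below pairs a single caption with several combinations of control codes. The separator string, the function name `build_examples`, and the text layout are assumptions made for this example only; the actual format is defined in `captions_dataset`.

```python
from itertools import combinations

# Hypothetical separator token; the repository's actual token may differ.
SEPARATOR = "<|sep|>"


def build_examples(caption, control_codes, max_codes=2):
    """Return one training string per non-empty subset of control codes (up to max_codes)."""
    codes = sorted(set(control_codes))
    examples = []
    for size in range(1, min(max_codes, len(codes)) + 1):
        for subset in combinations(codes, size):
            # Control codes come first, then the separator, then the caption
            examples.append(", ".join(subset) + SEPARATOR + caption)
    return examples


examples = build_examples("a dog chasing a frisbee in a park", ["dog", "frisbee"])
for text in examples:
    print(text)
```

Training on several subsets per caption is meant to let the model see the same caption conditioned on one code or on several, rather than only on the full set.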

#Tokenization

In [None]:
tokenizer = get_tokenizer(experiment_parameters, control_codes)

dataset_train_encoded = encode_and_format_dataset(dataset_train, DatasetType.TRAIN, tokenizer)
dataset_val_encoded = encode_and_format_dataset(dataset_val, DatasetType.EVAL, tokenizer)

#Model

In [None]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(experiment_parameters.model, pad_token_id=tokenizer.eos_token_id)
model.resize_token_embeddings(len(tokenizer))

#Metrics

#Training

In [None]:
import random
import torch
import numpy as np

seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
from our_trainer import *

training_args.references = references

trainer = OurTrainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    train_dataset=dataset_train_encoded,  # training dataset
    eval_dataset=dataset_val_encoded,     # evaluation dataset
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

In [None]:
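# True is passed as resume_from_checkpoint: training resumes from the latest
# checkpoint in output_dir, and fails if no checkpoint exists there yet
trainer.train(True)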

config = wandb.config
config.update(experiment_parameters)
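
Because `trainer.train(True)` asks the trainer to resume from the latest checkpoint, a first run with an empty output directory has nothing to resume from. One possible guard is sketched below, assuming checkpoints follow the `Trainer`'s usual `checkpoint-<step>` folder naming; the helper `find_last_checkpoint` is a hypothetical name introduced here, not part of the repository.

```python
import os
import re


def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered `checkpoint-<step>` folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        # Only count real directories whose name matches checkpoint-<step>
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Possible usage (assumes OurTrainer keeps the standard Trainer.train signature):
# last_checkpoint = find_last_checkpoint(training_args.output_dir)
# trainer.train(resume_from_checkpoint=last_checkpoint)  # None starts from scratch
```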

In [None]:
trainer.save_model(training_args.output_dir)
wandb.finish()