## Set up the environment

Let's first install the required libraries:
* HuggingFace Transformers (for the CodeT5 model)
* HuggingFace Datasets (for loading the dataset + preprocessing it)
* PyTorch Lightning (for training)
* Weights and Biases (for logging training metrics).

In [None]:
!pip install --upgrade pip

In [None]:
!pip install -q huggingface_hub transformers datasets pytorch-lightning wandb

In [None]:
from huggingface_hub import login

login(token='SECRET')

In [None]:
import wandb

wandb.login(key='SECRET')

In [None]:
base_model_name = "simon-arc-lab-model681"
result_model_name = "simon-arc-lab-model682"
dataset_path = "neoneye/simon-arc-combine-v161"
base_model_path = f"neoneye/{base_model_name}"

max_input_length = 1024
max_target_length = 128
my_learning_rate = 1e-6

## Preprocess data


In [None]:
from datasets import load_dataset

dataset = load_dataset(dataset_path)
print(dataset)

The printed summary lists the available splits and their columns. Let's look at one particular example from the `train` split:

In [None]:
example = dataset['train'][0]

print("instruction:", example["instruction"])
print("input:", example["input"])
print("output:", example["output"])

The goal for the model is to generate the `output` text from the `instruction` and `input` fields.

Let's now prepare the examples (i.e. prompt-output pairs) for the model. As you might know, Transformer models like BERT, BART, T5 etc. don't expect text as direct input, but rather integers which are called `input_ids` in HuggingFace Transformers. These represent tokens of a certain vocabulary. The model learns a contextual embedding vector for each token.

In other words, we need to turn the concatenated `instruction` and `input` from above into `input_ids`, and similarly, we need to turn the `output` from above into `input_ids`, which will serve as the `labels` for the model.

In addition, as these models are trained on batches of examples rather than one example at a time, we'll need to pad/truncate both the inputs and labels, such that they are all of the same length. That's why we will also add an `attention_mask` input to the model, such that it knows not to take padding tokens into account when computing attention scores.

To summarize:
* input: the `instruction` and `input` fields joined by a newline, which are turned into `input_ids` + `attention_mask`
* output: the `output` field, which is turned into `labels` (which are the `input_ids` of the output text).
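The padding, truncation and label masking described above can be sketched in plain Python. This is only an illustration of the idea, not what the tokenizer does internally: the helper names `pad_or_truncate` and `mask_label_padding` are hypothetical, and the padding id of 0 is an assumption made for the example. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded label positions are set to it.

```python
PAD_ID = 0           # assumed padding id, for illustration only
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncate if too long
    n_pad = max_length - len(kept)         # number of padding positions
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


def mask_label_padding(label_ids, attention_mask):
    """Replace padded label positions by IGNORE_INDEX so the loss skips them."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(label_ids, attention_mask)]


# A short sequence is padded, a long one is truncated.
input_ids, attention_mask = pad_or_truncate([5, 17, 42], max_length=5)
print(input_ids)       # [5, 17, 42, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0]

# Labels get the same treatment, then padding becomes -100.
label_ids, label_mask = pad_or_truncate([7, 8], max_length=4)
print(mask_label_padding(label_ids, label_mask))  # [7, 8, -100, -100]
```

Masking by the attention mask rather than by token value avoids accidentally masking a real token that happens to share the padding id.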

Below, we define a `preprocess_examples` function, which we can apply on the entire dataset.

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained(base_model_path)

def preprocess_examples(examples):
  # concatenate "instruction" and "input" with a newline
  instructions = examples['instruction']
  inputs = examples['input']

  concatenated_inputs = [f"{instruction}\n{input_data}" for instruction, input_data in zip(instructions, inputs)]
  model_inputs = tokenizer(concatenated_inputs, max_length=max_input_length, padding="max_length", truncation=True)

  # encode the outputs
  outputs = examples['output']
  labels = tokenizer(outputs, max_length=max_target_length, padding="max_length", truncation=True).input_ids

  # replace the index of the padding tokens by -100
  labels_with_ignore_index = []
  for labels_example in labels:
    labels_example = [label if label != tokenizer.pad_token_id else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)

  model_inputs["labels"] = labels_with_ignore_index

  return model_inputs

In [None]:
from datasets import DatasetDict

# Split the dataset into train and test (80% train, 20% test)
train_testvalid = dataset['train'].train_test_split(test_size=0.2)
train_test = DatasetDict({
    'train': train_testvalid['train'],
    'test': train_testvalid['test']
})

# Split the training set again to create a validation set
# (10% of the remaining 80%, i.e. 8% of all rows; final split is 72/8/20)
train_valid = train_test['train'].train_test_split(test_size=0.1)

# Combine to create a final dataset dictionary
final_datasets = DatasetDict({
    'train': train_valid['train'],
    'validation': train_valid['test'],
    'test': train_test['test']
})

# Print the dataset splits
print(final_datasets)

# Apply the preprocessing function to all splits
final_datasets = final_datasets.map(preprocess_examples, batched=True)

# Set format for PyTorch DataLoader
final_datasets.set_format(type="torch", columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(final_datasets['train'], shuffle=True, batch_size=8)
valid_dataloader = DataLoader(final_datasets['validation'], batch_size=4)
test_dataloader = DataLoader(final_datasets['test'], batch_size=4)

print("DataLoaders created successfully.")

The cell above calls `.map()` on the HuggingFace Dataset object, which applies `preprocess_examples` in batches (by default a batch size of 1,000 is used), hence it is fast. It then sets the format to "torch" and creates the PyTorch dataloaders. Below we inspect one batch to check the result.
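To make the batched calling convention concrete, here is a toy sketch in plain Python. It is not how HuggingFace Datasets is implemented; `map_batched` and `join_prompt` are hypothetical names used only for this illustration. The point it shows is that with `batched=True` the function receives a dict mapping each column name to a list of values, and returns a dict of lists of the same length.

```python
def map_batched(rows, fn, batch_size=1000):
    """Toy stand-in for a batched map: call fn on column-wise batches."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Turn the returned dict of lists back into one dict per row.
        for i in range(len(chunk)):
            out_rows.append({key: values[i] for key, values in result.items()})
    return out_rows


def join_prompt(batch):
    """Concatenate instruction and input with a newline, one entry per row."""
    return {"prompt": [f"{instruction}\n{input_data}"
                       for instruction, input_data
                       in zip(batch["instruction"], batch["input"])]}


rows = [
    {"instruction": "Flipx", "input": "a1"},
    {"instruction": "Rotate", "input": "b2"},
    {"instruction": "Histogram", "input": "c3"},
]
mapped = map_batched(rows, join_prompt, batch_size=2)
print(len(mapped))          # 3
print(mapped[0]["prompt"])  # prints "Flipx" and "a1" on two lines
```

Unlike this toy, the real `.map()` also keeps the existing columns next to the newly returned ones.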

In [None]:
batch = next(iter(train_dataloader))
print("batch.keys:\n", batch.keys())

print("\ninput_ids:\n", tokenizer.decode(batch['input_ids'][0]))

labels = batch['labels'][0]
decoded = tokenizer.decode([label for label in labels if label != -100])
print("\ndecoded\n", decoded)

## Fine-tune using PyTorch Lightning

As we will train the model using PyTorch Lightning, we first need to define a `LightningModule`, which is an `nn.Module` with some additional functionalities. We just need to define the `forward` pass, `training_step` (and optionally `validation_step` and `test_step`), and the corresponding dataloaders. PyTorch Lightning will then automate the training for us, handling device placement (i.e. we don't need to type `.to(device)` anywhere), etc. It also comes with support for loggers (such as Tensorboard, Weights and Biases) and callbacks.

Of course, you could also train the model in other ways:
* using regular PyTorch
* using the HuggingFace Trainer (in this case, the Seq2SeqTrainer)
* using HuggingFace Accelerate
* etc.

In [None]:
from torch.optim import AdamW  # transformers.AdamW is deprecated and missing from recent releases
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup
import pytorch_lightning as pl

class CodeT5(pl.LightningModule):
    def __init__(self, lr=1e-5, num_train_epochs=5, warmup_steps=2000):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(base_model_path)
        self.save_hyperparameters()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs

    def common_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss

        return loss

    def training_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)
        # logs the loss at every training step; pass on_epoch=True
        # to also log the average across the epoch
        self.log("training_loss", loss)

        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)
        self.log("validation_loss", loss, on_epoch=True)

        return loss

    def test_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)

        return loss

    def configure_optimizers(self):
        # create optimizer
        optimizer = AdamW(self.parameters(), lr=self.hparams.lr)
        # create learning rate scheduler
        num_train_optimization_steps = self.hparams.num_train_epochs * len(train_dataloader)
        lr_scheduler = {'scheduler': get_linear_schedule_with_warmup(optimizer,
                                                    num_warmup_steps=self.hparams.warmup_steps,
                                                    num_training_steps=num_train_optimization_steps),
                        'name': 'learning_rate',
                        'interval':'step',
                        'frequency': 1}

        return {"optimizer": optimizer, "lr_scheduler": lr_scheduler}

    def train_dataloader(self):
        return train_dataloader

    def val_dataloader(self):
        return valid_dataloader

    def test_dataloader(self):
        return test_dataloader
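
The learning rate schedule configured in `configure_optimizers` can be sketched as a plain function. This mirrors, as far as I understand it, the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate; the function name `linear_warmup_multiplier` is hypothetical and the step counts below are made-up numbers for illustration.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, never below 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-6
for step in (0, 1000, 2000, 6000, 10000):
    factor = linear_warmup_multiplier(step, warmup_steps=2000, total_steps=10000)
    print(step, factor, base_lr * factor)
```

One consequence worth checking for small datasets: if `num_train_epochs * len(train_dataloader)` is smaller than `warmup_steps`, the multiplier never reaches 1, so the model never trains at the full learning rate.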

We already logged in to Weights and Biases in the set-up section, so next we initialize the model.

In [None]:
model = CodeT5(lr=my_learning_rate)

In [None]:
def interpolate(value, from_start, from_end, to_start, to_end):
    return to_start + ((to_end - to_start) / (from_end - from_start)) * (value - from_start)

number_of_rows_in_dataset_train = dataset['train'].num_rows
interval_10k_rows = 0.2  # when the dataset jsonl file has 10k rows.
#interval_100k_rows = 0.05  # when the dataset jsonl file has 100k rows.
interval_300k_rows = 0.01  # when the dataset jsonl file has 300k rows.
validation_check_interval = interpolate(number_of_rows_in_dataset_train, 10000, 300000, interval_10k_rows, interval_300k_rows)
if validation_check_interval < 0.01:
  validation_check_interval = 0.01
if validation_check_interval > 0.25:
  validation_check_interval = 0.25
print("validation check interval: ", validation_check_interval)
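
To make the arithmetic of the cell above concrete, here is the same interpolation as a standalone sketch with a few row counts plugged in. The helper names `clamp` and `validation_interval` are hypothetical; the constants are the ones from the cell above.

```python
def interpolate(value, from_start, from_end, to_start, to_end):
    """Linearly map value from [from_start, from_end] to [to_start, to_end]."""
    return to_start + ((to_end - to_start) / (from_end - from_start)) * (value - from_start)


def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def validation_interval(num_rows):
    """Fraction of an epoch between validation runs for a given dataset size."""
    raw = interpolate(num_rows, 10_000, 300_000, 0.2, 0.01)
    return clamp(raw, 0.01, 0.25)


for num_rows in (10_000, 155_000, 300_000, 1_000_000):
    print(num_rows, round(validation_interval(num_rows), 4))
```

Halfway between 10k and 300k rows (155k) the interval is halfway between 0.2 and 0.01, i.e. about 0.105, and anything beyond 300k rows is clamped to 0.01.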

We can now simply start training on Colab's GPU.

In [None]:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

wandb_logger = WandbLogger(name=result_model_name, project='CodeT5')
# for early stopping, see https://pytorch-lightning.readthedocs.io/en/1.0.0/early_stopping.html?highlight=early%20stopping
early_stop_callback = EarlyStopping(
    monitor='validation_loss',
    patience=3,
    strict=False,
    verbose=False,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='step')


trainer = Trainer(
    accelerator='gpu',
    devices=1,
    default_root_dir="/content/drive/MyDrive/CodeT5/Notebooks/Checkpoints",
    logger=wandb_logger,
    callbacks=[early_stop_callback, lr_monitor],
    val_check_interval=validation_check_interval, # val_check_interval is a fraction of an epoch, e.g., 0.1 means every 10% of the epoch
)

trainer.fit(model)

Once we're done training, we can also save the HuggingFace model as follows:

In [None]:
save_directory = "mymodel" # save in the current working directory, you can change this of course
model.model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

This allows us to easily load the trained model again using the `from_pretrained()` method, as shown in the Inference section.

## Upload trained model to the hub

Cool! We can also share our model with the world by uploading it to [hf.co](https://hf.co). Each model on the hub is a git repository, but the cell below uploads through the `huggingface_hub` Python API, so we don't need to set up Git-LFS ourselves.

We reuse the login from the set-up section (you can sign up on [hf.co](https://hf.co) if you haven't already!).

In [None]:
from huggingface_hub import HfApi, upload_folder

# Set your repository details
model_path = save_directory

# Create a private repository under your account if it doesn't exist yet
api = HfApi()
username = api.whoami()['name']
repo_id = f"{username}/{result_model_name}"
api.create_repo(repo_id=repo_id, exist_ok=True, private=True)

# Upload files to the repository
upload_folder(
    folder_path=model_path,
    repo_id=repo_id,
    commit_message="Initial model upload"
)

## Inference

Now that we've trained a model, let's test it on some examples from the test set.

In [None]:
#from datasets import load_dataset
#
#dataset = load_dataset("neoneye/simon-arc-rle-task-v1")
#print(dataset['train'])

In [None]:
#test_example = dataset['train'][2]
#print("Instruction:", test_example['instruction'])

We can load our trained model as follows:

In [None]:
#from transformers import T5ForConditionalGeneration

#model = T5ForConditionalGeneration.from_pretrained(save_directory)

We can prepare the example using `RobertaTokenizer`, and generate using the `.generate()` method. Note that there are several ways of doing generation (greedy decoding/beam search/top k sampling/etc.), for that I refer to Patrick's blog post which you can find [here](https://huggingface.co/blog/how-to-generate). Here we will just use the default settings (i.e. greedy decoding).
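
As a rough illustration of what greedy decoding means, here is a toy sketch in plain Python. It is not the `.generate()` implementation: the score table `TOY_SCORES` is made up, and unlike a real model it conditions only on the last token rather than on the whole prefix. At each step it simply picks the highest-scoring next token until the end-of-sequence token wins or the length limit is hit.

```python
EOS = "</s>"

# Hypothetical next-token scores, keyed by the last generated token.
TOY_SCORES = {
    "<s>": {"7": 0.6, "9": 0.3, EOS: 0.1},
    "7": {"9": 0.5, "a": 0.4, EOS: 0.1},
    "9": {EOS: 0.7, "7": 0.3},
    "a": {EOS: 0.9, "9": 0.1},
}


def greedy_decode(scores, start="<s>", max_new_tokens=10):
    """Repeatedly pick the single best next token (no search, no sampling)."""
    tokens = []
    last = start
    for _ in range(max_new_tokens):
        candidates = scores[last]
        best = max(candidates, key=candidates.get)  # argmax over next tokens
        if best == EOS:
            break
        tokens.append(best)
        last = best
    return tokens


print(greedy_decode(TOY_SCORES))                    # ['7', '9']
print(greedy_decode(TOY_SCORES, max_new_tokens=1))  # ['7']
```

Beam search and sampling differ only in how the next token is chosen at each step; the length limit applies in all cases, so long outputs may need a larger limit than the default.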

In [None]:
#test_example = "transform SIMONARCRLEROW to symbols\na4y6q7y3o5"
#test_example = "Histogram after deserializing SimonsRLERow\nc5"
#test_example = "Convert string to Simon-ARC-RLE-Image\n494411344,244423242,444334224,803294472,480442407"
#test_example = "SIMONARCRLEIMAGE from Json\n[[6,9],[3,9],[6,9],[9,3],[9,3],[4,3],[4,4],[0,4],[3,4]]"
#test_example = "SimonsRLEImage, 3x3 area, how many neighbors have the same color as center\n9 9 917a918a1,c9681a9,c981b9,a918d9,978791a79,929a1a907,9291b907,b5b9240,b9a52a90"
#test_example = "Histogram of SimonsRLEImage\n7 9 a671717,1701015,b7a107,a51a715,0a15a15,a101b0,b10515,5b0a17,51751a5"
#test_example = "Flipx SimonsRLEImage\n7 9 a671717,1701015,b7a107,a51a715,0a15a15,a101b0,b10515,5b0a17,51751a5"
#test_example = "Rotate CCW Simon-ARC-RLE-Image\n6 2 18c0,8d0"
#test_example = "Transform pixels to Simon-ARC-RLE-Image\n6 2 18c0,8d0"
#test_example = "histogram after deserializing SIMONARCRLEIMAGE\n3 3 a75,7a2,a92"
#test_example = "This is simon-arc-rle-task data. Extract 'Input 3 Example'\nInput 0 Example\n5 8 c80,08501,08a18,81a08,0a850,b858,80b8,c80\nOutput 0 Example\n8 10 6256a2a5,c1c7,6b762a6,6765b61,275b6a5,27165252,6a952561,56162a56,652a5651,a65626a5\nInput 1 Example\n8 7 b2a5b2,a2172a52,b27c2,a2a7c2,a27d2,,\nOutput 1 Example\n10 5 08a9c6a8,1b681c8,861b89b8,068a19a818,a1c81a81\nInput 2 Example\n7 9 0804a04,0870a40,078a0a4,0780a20,d1a0,7808a04,70b804,7b0840,7b0484\nOutput 2 Example\n5 8 45454,42a74,a7214,75414,74214,70514,1a614,5a410\nInput 3 Example\n9 5 96936c9,b93a6a90,b9396a96,a693c96,b606a909\nOutput 3 Example\n8 8 292d9,92e4,a2a94b9,94a9a429,e949,a92c92,4a24c9,4a9492a9\nInput 4 Example\n5 6 46141,46b4,16b4,46b4,4,14141\nOutput 4 Example\n8 8 19139191,a194c6,1b64191,16a14b1,16b9141,96139a14,96b1491,9a19a1a9\nInput 5 Test\n6 7 6a25a6,62c6,,2d6,2b656,,c656\nOutput 5 Test\nNone"

# prepare for the model
#input_ids = tokenizer(test_example, return_tensors='pt').input_ids
# generate
#outputs = model.generate(input_ids)
#print("response:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Let's compare this to the ground-truth output (this only applies when `test_example` is a dataset row rather than a plain string):

In [None]:
#print("Ground truth:", test_example['output'])