# Fine tuning T5 with Layer

[![Open in Layer](https://development.layer.co/assets/badge.svg)](https://development.layer.co/layer/t5-fine-tuning-with-layer) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/layerai/examples/blob/main/translation/T5_Fine_tuning_with_Layer.ipynb) [![Layer Examples Github](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com/layerai/examples/tree/main/translation)

T5 is an encoder-decoder model that casts every NLP problem, such as language translation, summarization, text generation, and question answering, as a text-to-text task.
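
As a rough illustration of the text-to-text framing, each task can be written as a task prefix followed by the input text, and the expected answer is also plain text. The sketch below is only a toy: the helper name `to_text_to_text` and the dictionary are hypothetical, and the English-to-SQL prefix is the one this notebook uses later in its data loader.

```python
# Toy illustration of the text-to-text framing (hypothetical helper, not a library API).
# Every task becomes "prefix + input text" and the model answers with text.
TASK_PREFIXES = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "english_to_sql": "translate English to SQL: ",  # prefix used later in this notebook
}


def to_text_to_text(task, text):
    """Return the single input string a text-to-text model would receive."""
    return TASK_PREFIXES[task] + text.strip()


print(to_text_to_text("english_to_sql", "price of wines"))
```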

We are going to fine-tune a pretrained T5 model from Hugging Face 🤗 to translate English to SQL.

# Install Requirements

In [None]:
!pip install layer --upgrade -q
!pip install sentencepiece -q
!pip install transformers -q

In [None]:
from layer.decorators import dataset, model, resources, pip_requirements, fabric
import layer
import torch
import random
import re
import pandas as pd
import numpy as np
import os

# Getting Started with Layer

Layer is an MLOps platform that runs ML pipelines on remote infrastructure and tracks the datasets, models, and metrics they produce.

## Login to Layer

Let's log in to Layer first.

In [None]:
layer.login()

## Initialize Layer Project
Now we are ready to initialize our project. A Layer Project is an ML repository hosted on Layer where you can store your datasets, models, and metrics.

In [None]:
layer.init("t5-fine-tuning-with-layer")

Your project is ready. Find your project here:

https://app.layer.ai

# Dataset Generation
Unlike language-to-language translation datasets, English-to-SQL translation pairs can be built programmatically from a handful of templates.

In [None]:
templates = [
              ["[prop1] of [nns]","SELECT [prop1] FROM [nns]"],
              ["[agg] [prop1] for each [breakdown]","SELECT [agg]([prop1]) , [breakdown] FROM [prop1] GROUP BY [breakdown]"],
              ["[prop1] of [nns] by [breakdown]","SELECT [prop1] , [breakdown] FROM [nns] GROUP BY [breakdown]"],
              ["[prop1] of [nns] in [location] by [breakdown]","SELECT [prop1] , [breakdown] FROM [nns] WHERE location = '[location]' GROUP BY [breakdown]"],
              ["[nns] having [prop1] between [number1] and [number2]","SELECT name FROM [nns] WHERE [prop1] > [number1] AND [prop1] < [number2]"],
              ["[prop] by [breakdown]","SELECT name , [breakdown] FROM [prop] GROUP BY [breakdown]"],
              ["[agg] of [prop1] of [nn]","SELECT [agg]([prop1]) FROM [nn]"],
              ["[prop1] of [nns] before [year]","SELECT [prop1] FROM [nns] WHERE date < [year]"],
              ["[prop1] of [nns] after [year] in [location]","SELECT [prop1] FROM [nns] WHERE date > [year] AND location='[location]'"],
              ["[nns] [verb] after [year] in [location]","SELECT name FROM [nns] WHERE location = '[location]' AND date > [year]"],
              ["[nns] having [prop1] between [number1] and [number2] by [breakdown]","SELECT name , [breakdown] FROM [nns] WHERE [prop1] > [number1] AND [prop1] < [number2] GROUP BY [breakdown]"],
              ["[nns] with a [prop1] of maximum [number1] by their [breakdown]","SELECT name , [breakdown] FROM [nns] WHERE [prop1] <= [number1] GROUP BY [breakdown]"],
              ["[prop1] and [prop2] of [nns] since [year]","SELECT [prop1] , [prop2] FROM [nns] WHERE date > [year]"],
              ["[nns] which have both [prop1] and [prop2]","SELECT name FROM [nns] WHERE [prop1] IS true AND [prop2] IS true"],
              ["Top [number1] [nns] by [prop1]","SELECT name FROM [nns] ORDER BY [prop1] DESC LIMIT [number1]"]
]
template = random.choice(templates)

In [None]:
objects = ["countries","wines","wineries","tasters", "provinces","grapes","cities","bottles","deliveries"]
object_single = ["country","wine","winery","taster", "province","grape","city","bottle", "delivery"]
properties = ["points","price","taste","title","texture","age","duration","acidity","flavor","level"]
aggs = [["average","avg"], ["total","sum"],["count","count"], ["minimum","min"], ["maximum","max"]]
breakdowns = ["quality","price","province","country","point", "variety","flavor","age"]
locations = ["Italy","US","Portugal","Spain","Chile","Turkey","Canada"]
verbs = ["produced","bottled"]

regex = r"\[([a-z0-9]*)\]"
number_of_samples = 2500

@dataset("english_sql_translations")
def build_dataset():
    rows = []
    for index in range(0,number_of_samples):
        template = random.choice(templates)
        nl = template[0]
        sql = template[1]

        matches = re.finditer(regex, nl, re.MULTILINE)

        for matchNum, match in enumerate(matches, start=1):
            key = match.group()
            prop = None
            prop_sql = None
            if key.startswith("[prop"):
                prop = random.choice(properties)
                prop_sql = prop.replace(" ","_").lower()
            if key in ["[nns]"]:
                prop = random.choice(objects)
                prop_sql = prop
            if key in ["[nn]"]:
                prop = random.choice(object_single)
                prop_sql = prop.replace(" ","_").lower()
            if key == "[breakdown]":
                prop = random.choice(breakdowns)
                prop_sql = prop.replace(" ","_").lower()
            if key == "[verb]":
                prop = random.choice(verbs)
                prop_sql = prop.replace(" ","_").lower()
            if key == "[agg]":
                aggregation = random.choice(aggs)
                prop = aggregation[0]
                prop_sql = aggregation[1]
            if key == "[location]":
                prop = random.choice(locations)
                prop_sql = prop
            if key.startswith("[number"):
                prop = str(random.randint(1,1000))
                prop_sql = prop
            if key.startswith("[year"):
                prop = str(random.randint(1950,2022))
                prop_sql = prop
            

            if prop is not None:
                nl = nl.replace(key,prop)
                sql = sql.replace(key,prop_sql)
        
        prefix = random.randint(1,20)
        if prefix == 1:
            nl = "Show me "+nl
        elif prefix == 2:
            nl = "List "+nl
        elif prefix == 3:
            nl = "List of "+nl
        elif prefix == 4:
            nl = "Find "+nl
        rows.append([nl,sql])

    df = pd.DataFrame(rows, columns=["query", "sql"])
    return df
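
To see what a single generated pair looks like without running the whole function, the placeholder substitution can be sketched on its own. This is a simplified stand-in for the loop above: the helper `fill_template` is hypothetical, it takes fixed values instead of random ones, and it uses the same value for the English and SQL sides (the real code uses different values for aggregations such as "average" and `avg`).

```python
import re

# Same placeholder pattern as the notebook: [prop1], [nns], [year], ...
PLACEHOLDER = re.compile(r"\[([a-z0-9]*)\]")


def fill_template(nl, sql, values):
    """Replace each [placeholder] found in the English text in both strings.

    Hypothetical helper for illustration; `values` maps placeholder names
    (without brackets) to the text to substitute.
    """
    for match in PLACEHOLDER.finditer(nl):
        key = match.group()    # e.g. "[prop1]"
        name = match.group(1)  # e.g. "prop1"
        if name in values:
            nl = nl.replace(key, values[name])
            sql = sql.replace(key, values[name])
    return nl, sql


nl, sql = fill_template(
    "[prop1] of [nns] before [year]",
    "SELECT [prop1] FROM [nns] WHERE date < [year]",
    {"prop1": "price", "nns": "wines", "year": "1990"},
)
print(nl)   # price of wines before 1990
print(sql)  # SELECT price FROM wines WHERE date < 1990
```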

## Register dataset to Layer

In the cell above, we used a special decorator called `@dataset`, which tells Layer that our function creates a dataset. Now we are going to pass this function to Layer so that it runs on Layer infrastructure and the resulting dataset is registered under our project.

In [None]:
layer.run([build_dataset])

## Create Data Loader

In [None]:
from torch.utils.data import Dataset
class EnglishToSQLDataSet(Dataset):

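  def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
    self.tokenizer = tokenizer
    # Work on a copy so the caller's dataframe is not modified
    self.data = dataframe.copy()
    self.source_len = source_len
    self.target_len = target_len

    # Add the task prefix and the start/end tokens before capturing the columns;
    # otherwise self.source_text and self.target_text keep the unmodified strings
    self.data[source_text] = "translate English to SQL: " + self.data[source_text]
    self.data[target_text] = "<pad>" + self.data[target_text] + "</s>"
    self.target_text = self.data[target_text]
    self.source_text = self.data[source_text]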

  def __len__(self):
    return len(self.target_text)

  def __getitem__(self, index):
    source_text = str(self.source_text[index])
    target_text = str(self.target_text[index])

    source_text = ' '.join(source_text.split())
    target_text = ' '.join(target_text.split())

    # padding="max_length" already pads to max_length; the deprecated pad_to_max_length flag is not needed
    source = self.tokenizer.batch_encode_plus([source_text], max_length=self.source_len, truncation=True, padding="max_length", return_tensors='pt')
    target = self.tokenizer.batch_encode_plus([target_text], max_length=self.target_len, truncation=True, padding="max_length", return_tensors='pt')

    source_ids = source['input_ids'].squeeze()
    source_mask = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_mask = target['attention_mask'].squeeze()

    return {
        'source_ids': source_ids.to(dtype=torch.long),
        'source_mask': source_mask.to(dtype=torch.long),
        'target_ids': target_ids.to(dtype=torch.long),
        'target_ids_y': target_ids.to(dtype=torch.long)
    }
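
The data loader relies on the tokenizer to return fixed-length `input_ids` together with an `attention_mask`. A minimal sketch of that idea on plain Python lists is shown below. It is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids are made up, and it ignores special-token handling. The only facts assumed are that T5 uses `0` as its pad token id and `1` as its end-of-sequence id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only.
    """
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # how many pad positions are needed
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([71, 12, 9, 1], max_length=6)
print(ids)   # [71, 12, 9, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every example has the same length, the default `DataLoader` collation can stack them into a batch without a custom `collate_fn`.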

# Fine Tune T5

Our dataset is ready and registered to Layer. Now we are going to write the fine-tuning logic, decorate the function with `@model`, and pass it to Layer so that it runs on Layer infrastructure and the trained model is registered under our project.

In [None]:
def train(epoch, tokenizer, model, device, loader, optimizer):
    import torch

    model.train()
    for batch_idx, data in enumerate(loader):
        y = data['target_ids'].to(device, dtype=torch.long)
        # Teacher forcing: the decoder reads tokens [0..n-1] and is trained to predict tokens [1..n]
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Set padding positions to -100 so the loss ignores them
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype=torch.long)
        mask = data['source_mask'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]

        # Log the loss against a global step that keeps increasing across epochs
        step = (epoch * len(loader)) + batch_idx
        layer.log({"loss": float(loss)}, step)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
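
The slicing in `train` can be easier to follow on a single toy sequence. The sketch below mirrors it with plain Python lists; the token ids `5` and `6` are made up, and the helper name is hypothetical. It assumes T5's pad id `0` (which also serves as the decoder start token) and end-of-sequence id `1`, and the `-100` value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0     # T5 pad token id, also used as the decoder start token
IGNORE = -100  # label value skipped by the loss


def shift_for_teacher_forcing(target_ids):
    """Split one padded target sequence into decoder inputs and labels.

    Hypothetical helper mirroring y[:, :-1] and y[:, 1:] in train().
    """
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels


# "<pad> tok tok </s> <pad> <pad>" as ids
dec_in, labels = shift_for_teacher_forcing([0, 5, 6, 1, 0, 0])
print(dec_in)  # [0, 5, 6, 1, 0]
print(labels)  # [5, 6, 1, -100, -100]
```

At each position the decoder input is the token that precedes the label, so the model learns to predict the next token, and the padded tail contributes nothing to the loss.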

Here we use three separate Layer decorators:
- [`@model`](https://docs.app.layer.ai/docs/sdk-library/model-decorator): Tells Layer that this function trains an ML model.
- [`@fabric`](https://docs.app.layer.ai/docs/sdk-library/fabric-decorator): Tells Layer which computation resources (CPU, GPU, etc.) are needed to train the model. Here is a list of the [available fabrics](https://docs.app.layer.ai/docs/reference/fabrics) you can use.
- [`@pip_requirements`](https://docs.app.layer.ai/docs/sdk-library/pip-requirements-decorator): Tells Layer which PyPI packages are needed to train the model.
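
For readers new to stacked decorators, the general Python mechanism can be sketched with toy stand-ins that only attach metadata to a function. This is not how Layer implements its decorators; `toy_fabric` and `toy_pip_requirements` are hypothetical names used purely to show how several decorators can annotate one function.

```python
def toy_fabric(name):
    """Hypothetical stand-in: records the requested compute tier on the function."""
    def decorator(func):
        func.fabric = name
        return func
    return decorator


def toy_pip_requirements(packages):
    """Hypothetical stand-in: records the packages the function needs."""
    def decorator(func):
        func.packages = list(packages)
        return func
    return decorator


# Decorators are applied bottom-up; each one returns the same function with extra metadata.
@toy_fabric("f-medium")
@toy_pip_requirements(packages=["torch", "transformers"])
def build_something():
    return "trained"


print(build_something.fabric, build_something.packages)
```

A runner that receives `build_something` could read these attributes before deciding where and how to execute it.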

In [None]:
@model("t5-tokenizer")
@fabric("f-medium")
@pip_requirements(packages=["torch","transformers","sentencepiece"])
def build_tokenizer():
    from transformers import T5Tokenizer
    # Load tokenizer from Hugging Face
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    return tokenizer

@model("t5-english-to-sql")
@fabric("f-gpu-small")
@pip_requirements(packages=["torch","transformers","sentencepiece"])
def build_model():
    from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    import torch.nn.functional as F
    from torch import cuda
    import torch
    
    parameters={
        "BATCH_SIZE":8,          
        "EPOCHS":3,              
        "LEARNING_RATE":2e-05,          
        "MAX_SOURCE_TEXT_LENGTH":75,   
        "MAX_TARGET_TEXT_LENGTH":75,
        "SEED": 42
    }

    # Log parameters to Layer
    layer.log(parameters)
    
    # Set seeds for reproducibility
    torch.manual_seed(parameters["SEED"])
    np.random.seed(parameters["SEED"])
    torch.backends.cudnn.deterministic = True

    # Load tokenizer from Layer
    tokenizer = layer.get_model("t5-tokenizer").get_train()

    # Load pretrained model from Hugging Face
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    device = 'cuda' if cuda.is_available() else 'cpu'
    model.to(device)

    dataframe = layer.get_dataset("english_sql_translations").to_pandas()
    source_text = "query"
    target_text = "sql"

    dataframe = dataframe[[source_text,target_text]]

    train_dataset = dataframe.sample(frac=0.8,random_state = parameters["SEED"])
    train_dataset = train_dataset.reset_index(drop=True)

    layer.log({"FULL Dataset": str(dataframe.shape),
             "TRAIN Dataset": str(train_dataset.shape)
             })

    training_set = EnglishToSQLDataSet(train_dataset, tokenizer, parameters["MAX_SOURCE_TEXT_LENGTH"], parameters["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)

    dataloader_parameters = {
      'batch_size': parameters["BATCH_SIZE"],
      'shuffle': True,
      'num_workers': 0
      }

    training_loader = DataLoader(training_set, **dataloader_parameters)
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=parameters["LEARNING_RATE"])

    for epoch in range(parameters["EPOCHS"]):
        train(epoch, tokenizer, model, device, training_loader, optimizer)

    return model

In [None]:
# You can train your model locally by calling the functions directly, which is useful for debugging.
# build_tokenizer()
# build_model()

# Once you are ready, push your training functions to Layer to be run remotely.
layer.run([build_tokenizer, build_model], debug=True)
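
Once training has finished, the model can be tried on a new question. The following is a minimal sketch, assuming the model was trained with the "translate English to SQL: " prefix that the data loader is meant to add; the helper names `build_prompt` and `translate` are hypothetical. It relies only on the standard Hugging Face `generate` and `decode` calls and on the `layer.get_model(...).get_train()` pattern already used in `build_model`.

```python
PREFIX = "translate English to SQL: "


def build_prompt(question):
    """Normalize whitespace and add the task prefix (hypothetical helper)."""
    return PREFIX + " ".join(question.split())


def translate(question, model, tokenizer, max_length=75):
    """Generate SQL for one English question with a fine-tuned T5 model."""
    encoded = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=max_length,  # same limit as MAX_TARGET_TEXT_LENGTH in training
    )
    # Drop <pad> and </s> so only the SQL text remains
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, assuming the Layer run above has completed:
# model = layer.get_model("t5-english-to-sql").get_train()
# tokenizer = layer.get_model("t5-tokenizer").get_train()
# print(translate("Show me price of wines before 1990", model, tokenizer))
```

The generated SQL depends on the training run, so no expected output is shown here.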

## Where to go from here?

Now that you have created your first Layer Project, you can:

- Join our [Slack Community](https://bit.ly/layercommunityslack)
- Visit [Layer Examples Repo](https://github.com/layerai/examples) for more examples
- Browse [Trending Layer Projects](https://layer.ai) on our main page
- Check out [Layer Documentation](https://docs.app.layer.ai) to learn more