<a href="https://colab.research.google.com/github/maktaurus/ML-Work/blob/main/Torch_Notebooks/Language_Translation_Fine_tunning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a pre-trained language model for language translation

We will use the pretrained Hugging Face T5 model (`google-t5/t5-small`) for English-to-French translation. The model has already been pretrained on this task, but we will fine-tune it further on a Kaggle dataset of English-French sentence pairs.
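The released T5 checkpoints select a task through a text prefix on the input, for example `translate English to French: `. The cells in this notebook pass the raw English sentence to the tokenizer, so the following is only a minimal sketch of how such a prefix could be prepended. The helper name `add_task_prefix` is hypothetical and not part of any library.

```python
# Prefix string the original T5 checkpoints use for English-to-French translation.
TASK_PREFIX = "translate English to French: "

def add_task_prefix(sentences, prefix=TASK_PREFIX):
    """Hypothetical helper: prepend the task prefix to every source sentence."""
    return [prefix + s for s in sentences]

print(add_task_prefix(["Hi.", "Run!"]))
# ['translate English to French: Hi.', 'translate English to French: Run!']
```

If a prefix is used during fine-tuning, the same prefix should also be applied to the text passed to `generate` at inference time.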

In [1]:
pip install -q pytorch_lightning


In [2]:
import torch
import pytorch_lightning as pl
import pandas as pd
import numpy as np
from transformers import T5ForConditionalGeneration, T5TokenizerFast

Download the English-French sentence-pair dataset from Kaggle.

In [3]:
!kaggle datasets download devicharith/language-translation-englishfrench

Dataset URL: https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench
License(s): CC0-1.0
Downloading language-translation-englishfrench.zip to /content
  0% 0.00/3.51M [00:00<?, ?B/s]
100% 3.51M/3.51M [00:00<00:00, 158MB/s]


In [4]:
!unzip /content/language-translation-englishfrench.zip

Archive:  /content/language-translation-englishfrench.zip
  inflating: eng_-french.csv         


In [5]:
data = pd.read_csv('/content/eng_-french.csv')
data.head()

Unnamed: 0,English words/sentences,French words/sentences
0,Hi.,Salut!
1,Run!,Cours !
2,Run!,Courez !
3,Who?,Qui ?
4,Wow!,Ça alors !


In [6]:
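class MyDataSet(torch.utils.data.Dataset):
  """Map-style dataset that yields (English, French) sentence pairs."""
  def __init__(self,data):
    self.data = data

  def __len__(self):
    return len(self.data)

  def __getitem__(self,idx):
    # Column 0 holds the English sentence, column 1 the French translation.
    en = self.data.iloc[idx,0]
    fr = self.data.iloc[idx,1]
    return en,fr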

In [7]:
df = MyDataSet(data)

In [8]:
for x in df:
  print(x)
  break

('Hi.', 'Salut!')


In [9]:
tokenizer = T5TokenizerFast.from_pretrained("google-t5/t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [19]:
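def collate_fn(batch):
  # Split the (English, French) pairs into source and target lists.
  en = [x[0] for x in batch]
  fr = [x[1] for x in batch]
  # Tokenize sources and targets together; both are padded or truncated to 64 tokens.
  tokens = tokenizer(en,text_target=fr,padding="max_length",max_length=64,truncation=True,return_tensors="pt")
  labels = tokens["labels"]
  # Replace padding ids with -100 so the loss ignores padded target positions.
  labels[labels == tokenizer.pad_token_id] = -100
  return {"input_ids":tokens["input_ids"],"attention_mask":tokens["attention_mask"],"labels":labels}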
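To make the padding and label handling in `collate_fn` concrete, here is a small pure-Python sketch of the two ideas: padding or truncating a token list to a fixed length with a matching attention mask, and replacing padding ids in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The helper names are hypothetical, the token ids are made up, and `PAD_ID = 0` assumes T5's pad token id. The real tokenizer also keeps the end-of-sequence token when truncating, which this sketch does not model.

```python
PAD_ID = 0           # assumption: pad token id of the T5 tokenizer
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return (ids, attention_mask), each exactly max_length long."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask

def mask_pad_labels(labels, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX so the loss skips them."""
    return [IGNORE_INDEX if t == pad_id else t for t in labels]

ids, mask = pad_or_truncate([71, 12, 1], max_length=5)
print(ids, mask)             # [71, 12, 1, 0, 0] [1, 1, 1, 0, 0]
print(mask_pad_labels(ids))  # [71, 12, 1, -100, -100]
```

Without the `-100` replacement, the model would also be trained to predict padding tokens, which would dominate the loss for short sentences padded to 64 tokens.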

In [20]:
train,val = torch.utils.data.random_split(df,[int(len(df)*0.8),len(df)-int(len(df)*0.8)])
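The split sizes above are computed as `int(len(df)*0.8)` and the remainder, so the two parts always add up to the full dataset. The sketch below mirrors that arithmetic in plain Python with a seeded shuffle. It illustrates the idea only and is not how PyTorch implements `random_split`; the function name `split_indices` is hypothetical.

```python
import random

def split_indices(n, train_frac=0.8, seed=0):
    """Shuffle range(n) and cut it into train and validation index lists."""
    n_train = int(n * train_frac)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```

For a repeatable split with PyTorch itself, `torch.utils.data.random_split` accepts a `generator` argument, for example `generator=torch.Generator().manual_seed(0)`.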

In [21]:
# Build the loaders from the train/validation splits rather than the full dataset.
train_df = torch.utils.data.DataLoader(train,batch_size=64,shuffle=True,collate_fn=collate_fn)
val_df = torch.utils.data.DataLoader(val,batch_size=64,collate_fn=collate_fn)

In [35]:
for x in train_df:
  print(x["input_ids"].dtype)
  break

torch.int64


In [23]:
device = ("cuda" if torch.cuda.is_available() else "cpu")

In [24]:
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

Create a Lightning module that wraps the T5 model and optimizes it with Adam at a low learning rate of 0.0001 (`1e-4`).

In [25]:
class MyModel(pl.LightningModule):
  def __init__(self):
    super().__init__()
    self.model = model

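  def forward(self,input_ids,attention_mask,labels=None):
    # T5 needs labels (or decoder inputs) for a forward pass; return the output.
    return self.model(input_ids=input_ids,attention_mask=attention_mask,labels=labels)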

  def training_step(self,batch,batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    output = self.model(input_ids=input_ids,attention_mask=attention_mask,labels=labels)
    loss = output.loss
    self.log("train_loss",loss,prog_bar=True,logger=True)
    return loss

  def validation_step(self,batch,batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    output = self.model(input_ids=input_ids,attention_mask=attention_mask,labels=labels)
    loss = output.loss
    self.log("val_loss",loss,prog_bar=True,logger=True)
    return loss

  def configure_optimizers(self):
    return torch.optim.Adam(self.parameters(),lr=1e-4)

In [26]:
pl_model = MyModel()

In [27]:
trainer = pl.Trainer(max_epochs=1)
trainer.fit(pl_model,train_df,val_df)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params | Mode
------------------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M | eval
------------------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
242.026   Total estimated model params size (MB)
0         Modules in train mode
277       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1` reached.


In [68]:
tt = "Although not all browsers recognize this attribute, it is respected by automatic translation systems such as Google Translate, and may also be respected by tools used by human translators. As such it's important that web authors use this attribute to mark content that should not be translated."
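# Tokenize the text and move the tensors to the same device as the model.
tok = tokenizer(tt,return_tensors="pt").to(pl_model.device)
out = pl_model.model.generate(tok["input_ids"],attention_mask=tok["attention_mask"],max_new_tokens=100)
print(out[0].dtype)
# Drop special tokens such as <pad> and </s> from the decoded translation.
tokenizer.decode(out[0],skip_special_tokens=True)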