<a href="https://colab.research.google.com/github/mark1702/mark/blob/master/Translation_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install transformers datasets evaluate sacrebleu

In [2]:
from datasets import load_dataset
books = load_dataset("opus_books", "en-fr")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [3]:
books = books['train'].select(range(100)).train_test_split(test_size=0.2)

In [4]:
from transformers import AutoTokenizer

In [5]:
#!pip install sentencepiece

In [6]:
checkpoint = "facebook/hf-seamless-m4t-medium"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang='eng', tgt_lang='fra')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

In [8]:
max_input_len=128
max_target_len=128
def tokenizer_fn(batch):
    inputs = [x[source_lang] for x in batch['translation']]
    targets = [x[target_lang] for x in batch['translation']]

    tokenized_inputs = tokenizer(inputs, max_length=max_input_len, truncation=True)
    tokenized_targets = tokenizer(text_target=targets, max_length=max_target_len, truncation=True)

    tokenized_inputs['labels'] = tokenized_targets['input_ids']
    return tokenized_inputs

In [9]:
tokenized_books = books.map(tokenizer_fn, batched=True, remove_columns=books["train"].column_names)

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [10]:
#!pip install transformers --upgrade

In [11]:
from transformers import SeamlessM4Tv2ForTextToText, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = SeamlessM4Tv2ForTextToText.from_pretrained(checkpoint)

You are using a model of type seamless_m4t to instantiate a model of type seamless_m4t_v2. This is not supported for all configurations of models and can yield errors.


In [12]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [13]:
import evaluate
metric = evaluate.load("sacrebleu")

In [14]:
import numpy as np
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [15]:
# !pip install accelerate -U

In [16]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    eval_accumulation_steps=1,
    gradient_checkpointing=True,
)

In [17]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model('checkpoint/my_saved_model')

You're using a SeamlessM4TTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,5.618


