# Translation (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs


You will need to set up Git. Replace the email and name in the following cell with your own.

In [37]:
!git config --global user.email "keonju2@naver.com"
!git config --global user.name "keonju"

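You will also need to be logged in to the Hugging Face Hub before pushing the model. The `notebook_login()` cell that handles this appears further down, just before the training arguments are created. First, load the Korean-English TED talks dataset.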

In [4]:
from datasets import load_dataset

dataset = load_dataset("msarmi9/korean-english-multitarget-ted-talks-task")

Downloading and preparing dataset json/msarmi9--korean-english-multitarget-ted-talks-task to /root/.cache/huggingface/datasets/msarmi9___json/msarmi9--korean-english-multitarget-ted-talks-task-f75be56c53babca9/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/msarmi9___json/msarmi9--korean-english-multitarget-ted-talks-task-f75be56c53babca9/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['korean', 'english'],
        num_rows: 166215
    })
    test: Dataset({
        features: ['korean', 'english'],
        num_rows: 1982
    })
    validation: Dataset({
        features: ['korean', 'english'],
        num_rows: 1958
    })
})

In [6]:
print(dataset['train'][1]["korean"])
print(dataset['train'][1]["english"])

우리는 여러분에게 바닷속 이야기를 영상과 함께 들려주고자 합니다.
And we're going to tell you some stories from the sea here in video.


In [7]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-ko-en"
translator = pipeline("translation", model=model_checkpoint)
translator("나는 밥을 먹었다")

[{'translation_text': 'I ate.'}]

In [8]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-ko-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [9]:
en_sentence = dataset['train'][1]["english"]
kr_sentence = dataset['train'][1]["korean"]

inputs = tokenizer(kr_sentence, text_target=en_sentence)
inputs

{'input_ids': [337, 10184, 9, 39219, 717, 4312, 11345, 162, 703, 9, 32498, 6621, 171, 2, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [70, 37, 12, 101, 220, 5, 469, 18, 150, 4946, 65, 4, 1674, 169, 13, 2791, 2, 0]}

In [10]:
wrong_targets = tokenizer(en_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

['▁And', '▁we', "'", 're', '▁go', 'ing', '▁to', '▁t', 'ell', '▁you', '▁some', '▁st', 'or', 'ies', '▁from', '▁the', '▁', 'se', 'a', '▁here', '▁in', '▁v', 'ide', 'o', '.', '</s>']
['▁And', '▁we', "'", 're', '▁going', '▁to', '▁tell', '▁you', '▁some', '▁stories', '▁from', '▁the', '▁sea', '▁here', '▁in', '▁video', '.', '</s>']


In [11]:
type(dataset['train']["korean"])

list

In [12]:
max_length = 128


def preprocess_function(examples):
    inputs = examples["korean"]
    targets = examples["english"]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [13]:
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=['korean','english']
)


In [14]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 166215
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1982
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1958
    })
})

In [15]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [16]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [17]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [18]:
batch["labels"]

tensor([[   70,    37,    12,   101,   220,     5,   469,    18,   150,  4946,
            65,     4,  1674,   169,    13,  2791,     2,     0,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100],
        [  113,    12,   250,   288,   150,     6,     4,   320, 10219,  2791,
             6, 49712,    15,    12,    10,   589,   145,   841,     3,     8,
            37,    12,   101,    34,   220,     5,   569,    18,   267,     6,
            24,     2,     0]])

In [19]:
batch["decoder_input_ids"]

tensor([[65000,    70,    37,    12,   101,   220,     5,   469,    18,   150,
          4946,    65,     4,  1674,   169,    13,  2791,     2,     0, 65000,
         65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000],
        [65000,   113,    12,   250,   288,   150,     6,     4,   320, 10219,
          2791,     6, 49712,    15,    12,    10,   589,   145,   841,     3,
             8,    37,    12,   101,    34,   220,     5,   569,    18,   267,
             6,    24,     2]])
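
`decoder_input_ids` is a right-shifted copy of `labels`: a start token is prepended, the last position is dropped, and every `-100` is swapped for the padding token. Below is a rough pure-Python sketch of that shift. It assumes, based on the output above, that both the decoder start token and the padding token have ID 65000 for this checkpoint. `shift_right` is an illustrative helper, not the library function.

```python
def shift_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from one (possibly -100-padded) label sequence."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # -100 is only meaningful to the loss; the decoder needs a real token ID.
    return [pad_token_id if token == -100 else token for token in shifted]


PAD_ID = 65000    # assumption: read off the decoder_input_ids output above
START_ID = 65000  # assumption: same ID is used as the decoder start token

# First row of batch["labels"]: 18 real tokens followed by 15 padding positions.
labels_row = [70, 37, 12, 101, 220, 5, 469, 18, 150, 4946, 65, 4,
              1674, 169, 13, 2791, 2, 0] + [-100] * 15

decoder_row = shift_right(labels_row, PAD_ID, START_ID)
print(decoder_row[:4])            # [65000, 70, 37, 12]
print(decoder_row.count(PAD_ID))  # 14
```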

In [20]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[70, 37, 12, 101, 220, 5, 469, 18, 150, 4946, 65, 4, 1674, 169, 13, 2791, 2, 0]
[113, 12, 250, 288, 150, 6, 4, 320, 10219, 2791, 6, 49712, 15, 12, 10, 589, 145, 841, 3, 8, 37, 12, 101, 34, 220, 5, 569, 18, 267, 6, 24, 2, 0]


In [21]:
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting portalocker
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.6.0 sacrebleu-2.3.1


In [22]:
import evaluate

metric = evaluate.load("sacrebleu")

In [23]:
predictions = [
    "먹었다. 나는 밥을"
]
references = [
    [
        "나는 밥을 먹었다."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 45.18010018049227,
 'counts': [4, 2, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [100.0, 66.66666666666667, 25.0, 25.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}

In [24]:
predictions = [
    "나는 밥을 먹었다."
]
references = [
    [
        "나는 밥을 먹었다."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 100.00000000000004,
 'counts': [4, 3, 2, 1],
 'totals': [4, 3, 2, 1],
 'precisions': [100.0, 100.0, 100.0, 100.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
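
To see where the numbers for the scrambled prediction come from, here is a simplified single-sentence BLEU sketch. It is not the sacreBLEU implementation. It assumes one reference, at least four tokens per sentence, tokens split by hand with the final period as its own token, and an exponential smoothing rule inferred from the `precisions` shown above. `bleu_sketch` is an illustrative name.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(pred_tokens, ref_tokens, max_n=4):
    counts, totals = [], []
    for n in range(1, max_n + 1):
        pred, ref = ngrams(pred_tokens, n), ngrams(ref_tokens, n)
        # Clipped matches: a predicted n-gram counts at most as often as it occurs in the reference.
        counts.append(sum((pred & ref).values()))
        totals.append(sum(pred.values()))

    precisions, smooth = [], 1.0
    for matched, total in zip(counts, totals):
        if matched == 0:
            # Assumed smoothing: halve the credit for each successive zero count.
            smooth *= 2
            precisions.append(100.0 / (smooth * total))
        else:
            precisions.append(100.0 * matched / total)

    # Brevity penalty: only predictions shorter than the reference are penalized.
    sys_len, ref_len = len(pred_tokens), len(ref_tokens)
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return {"score": score, "counts": counts, "totals": totals, "precisions": precisions}


prediction = ["먹었다", ".", "나는", "밥을"]
reference = ["나는", "밥을", "먹었다", "."]
result = bleu_sketch(prediction, reference)
print(result["counts"], result["totals"], round(result["score"], 2))
```

All four unigrams match, but only two of the three bigrams and none of the longer n-grams do, which is why word order alone pulls the score well below 100.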

In [25]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

In [34]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [35]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "Helsinki-NLP/opus-mt-ko-en",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [39]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/keonju/opus-mt-ko-en into local empty directory.
Using cuda_amp half precision backend


In [40]:
trainer.evaluate(max_length=max_length)

***** Running Evaluation *****
  Num examples = 1958
  Batch size = 64


{'eval_loss': 1.7998191118240356,
 'eval_bleu': 20.373712770046897,
 'eval_runtime': 107.8914,
 'eval_samples_per_second': 18.148,
 'eval_steps_per_second': 0.287}

In [None]:
trainer.train()

***** Running training *****
  Num examples = 166215
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 15585
  Number of trainable parameters = 77419008


Step,Training Loss
500,1.8313
1000,1.8302
1500,1.8148
2000,1.8232
2500,1.8085
3000,1.8162
3500,1.8081
4000,1.7987
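
The step counts in the logs follow directly from the dataset sizes and batch sizes. Here is a small arithmetic check. It assumes a single device and no gradient accumulation, both consistent with the training log above, and that the last partial batch is kept.

```python
import math

# Training: 166215 examples, batch size 32, 3 epochs.
num_train_examples = 166215
train_batch_size = 32
num_epochs = 3

steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 5195 15585

# Evaluation: 1958 validation examples, batch size 64.
num_eval_examples = 1958
eval_batch_size = 64
eval_batches = math.ceil(num_eval_examples / eval_batch_size)
print(eval_batches)  # 31
```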


In [None]:
trainer.evaluate(max_length=max_length)

In [None]:
trainer.push_to_hub(tags="translation", commit_message="Training complete")

In [None]:
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
# AdamW from transformers is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
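
With `num_warmup_steps=0`, the `"linear"` schedule decays the learning rate from its initial value to zero over `num_training_steps`. The function below is a hand-written sketch of that rule as commonly described, not the library's code. `linear_lr` and the 1000-step total are illustrative assumptions.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: ramp linearly from base_lr down to 0, never below 0.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total = 1000  # hypothetical number of training steps, for illustration only
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, base_lr, total))
```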

In [None]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=max_length,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

In [None]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)