### Translation

This code is adapted from the [Hugging Face translation guide](https://huggingface.co/docs/transformers/v4.40.1/tasks/translation). It shows how to fine-tune a pre-trained model on a dataset. Here we fine-tune google-t5/t5-small to translate from English to Hungarian.

In [2]:
import evaluate 
import numpy as np

from datasets import load_dataset  ### load_metric is deprecated; the metric is loaded with evaluate below
from transformers import pipeline, DataCollatorForSeq2Seq
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer 
import torch 

We load the [Opus Books](https://huggingface.co/datasets/Helsinki-NLP/opus_books) dataset, a collection of copyright-free books in several languages. We use the English (en) to Hungarian (hu) pair.

In [3]:
books = load_dataset("opus_books", "en-hu")
books

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 137151
    })
})

In [4]:
books = books["train"].train_test_split(test_size=0.2)
books

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 109720
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 27431
    })
})
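The row counts above follow from the 20% test fraction. Here is a minimal sketch of that arithmetic, assuming the test size is rounded up and the train size rounded down, which is what the printed counts suggest. `split_sizes` is a hypothetical helper for illustration, not part of `datasets`.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size (illustrative helper)."""
    n_test = math.ceil(test_size * n_rows)          # test rows, rounded up
    n_train = math.floor((1 - test_size) * n_rows)  # train rows, rounded down
    return n_train, n_test

print(split_sizes(137151, 0.2))  # (109720, 27431)
```

The split is shuffled, so the sample rows printed in the next cell change between runs. Passing a `seed` to `train_test_split` makes the split repeatable.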

In [5]:
for i in range(3):
    datum = books['train'][i]
    print('=' * 20)
    print(f"ID: {datum['id']}")
    print(f"En: {datum['translation']['en']}")
    print(f"Hu: {datum['translation']['hu']}")


ID: 129829
En: "Silence!" shouted the constable.
Hu: Csend! - süvöltötte a teremőr.
ID: 132179
En: "Yes, it is Latin," my uncle went on; "but it is Latin confused and in disorder; "_pertubata seu inordinata,_" as Euclid has it."
Hu: Igen, latin - folytatta nagybátyám -, de elrontott latin.
ID: 55048
En: "And this carte blanche," said d’Artagnan, "this carte blanche, does it remain in her hands?"
Hu: Vajon most is a kezében van ez a felhatalmazás? - kérdezte D'Artagnan.


In [6]:
metric = evaluate.load('sacrebleu')

In [18]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}

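Here `score` is the BLEU score, `counts` is the number of matching n-grams for n = 1 to 4, `totals` is the number of n-grams in the predictions, and `precisions` is the percentage of predicted n-grams that match. `bp` is the brevity penalty, and `sys_len` and `ref_len` are the total token counts of the predictions and references. The score is 0.0 despite the exact matches because each sentence has only two tokens, so there are no 3-grams or 4-grams and those precisions are 0.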
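The `counts`, `totals` and `precisions` above can be reproduced by counting n-grams by hand. This is a simplified sketch, not sacreBLEU's implementation: it assumes whitespace tokenization and one reference per prediction, and `ngram_stats` is a hypothetical helper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_stats(predictions, references, max_order=4):
    """Corpus-level matching n-gram counts, totals and precisions (in percent)."""
    counts = [0] * max_order
    totals = [0] * max_order
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        for n in range(1, max_order + 1):
            pred_ngrams = ngrams(pred_tokens, n)
            ref_ngrams = ngrams(ref_tokens, n)
            # Counter intersection clips each match to the reference count
            counts[n - 1] += sum((pred_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(pred_ngrams.values())
    precisions = [100.0 * c / t if t else 0.0 for c, t in zip(counts, totals)]
    return counts, totals, precisions

counts, totals, precisions = ngram_stats(
    ["hello there", "general kenobi"],
    ["hello there", "general kenobi"],
)
print(counts)      # [4, 2, 0, 0]
print(totals)      # [4, 2, 0, 0]
print(precisions)  # [100.0, 100.0, 0.0, 0.0]
```

Two-word sentences contain no 3-grams or 4-grams, so the last two precisions are zero, which matches the output above.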

In [7]:
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [31]:
sentences = [
    "Wow, this is pretty cool",
    "I don't know what else to say",
    "이 단어는 어휘에 포함되어 있지 않습니다"
    ]

for sent in sentences:
    print('=' * 20)
    print(f'Sentence: {sent}')
    print(f"Tokens: {tokenizer(sent)['input_ids']}")

Sentence: Wow, this is pretty cool
Tokens: [9758, 6, 48, 19, 1134, 1633, 1]
Sentence: I don't know what else to say
Tokens: [27, 278, 31, 17, 214, 125, 1307, 12, 497, 1]
Sentence: 이 단어는 어휘에 포함되어 있지 않습니다
Tokens: [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 1]
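In the Korean example every word collapses to the pair `3, 2`. Assuming id 2 is this tokenizer's unknown token and id 1 its end-of-sequence token (worth confirming with `tokenizer.unk_token_id` and `tokenizer.eos_token_id`), the sketch below measures how much of a sequence is unknown. `unk_fraction` is a hypothetical helper.

```python
UNK_ID = 2  # assumption: id of the <unk> token for this checkpoint
EOS_ID = 1  # assumption: id of the end-of-sequence token

def unk_fraction(token_ids, unk_id=UNK_ID, eos_id=EOS_ID):
    """Fraction of non-EOS tokens that are the unknown token."""
    content = [t for t in token_ids if t != eos_id]
    if not content:
        return 0.0
    return sum(t == unk_id for t in content) / len(content)

english = [9758, 6, 48, 19, 1134, 1633, 1]
korean = [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 1]

print(unk_fraction(english))  # 0.0
print(unk_fraction(korean))   # 0.5
```

The same check is useful on the Hungarian targets, since any character missing from the vocabulary is lost when the labels are encoded.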


In [8]:
source_lang = "en"                          ### src language (English)
target_lang = "hu"                          ### tgt language (Hungarian)
prefix = "Translate English to Hungarian: " ### starting "prompt" to model

In [9]:
max_input_length = 128  ### max input tokens (longer inputs are truncated)
max_target_length = 128 ### max target tokens (longer targets are truncated)

def preprocess(examples, verbose=False):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    ### tokenize the targets; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    if verbose:
        print('Inputs:')
        for text in inputs:
            print(text)

        print('=' * 20)
        print("Outputs:")
        for target in targets:
            print(target)

        print('=' * 20)
        print('Model Inputs')
        for model_input in model_inputs['input_ids']:
            print(model_input)

        print('=' * 20)
        print("Labels")
        for label in labels['input_ids']:
            print(label)
            
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
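To make the effect of `max_length` with `truncation=True` concrete, here is a toy sketch. It assumes the tokenizer reserves one position for the end-of-sequence token (id 1 in the outputs in this notebook), so a truncated sequence still ends with it. `truncate_with_eos` is a hypothetical stand-in, not the tokenizer's own code.

```python
def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Keep at most max_length - 1 content tokens, then append EOS.

    token_ids holds the content tokens only (no EOS yet).
    """
    return token_ids[: max_length - 1] + [eos_id]

short = truncate_with_eos([27, 278, 31, 17], max_length=128)
long = truncate_with_eos(list(range(10, 310)), max_length=128)

print(short)                # [27, 278, 31, 17, 1]
print(len(long), long[-1])  # 128 1
```

Under this assumption, any sentence longer than 127 content tokens loses its tail, for inputs and targets alike.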

In [56]:
examples  = books['train'][:3]
ex_inputs = preprocess(examples, verbose=True)

Inputs:
Translate English to Hungarian: Source: Project GutenbergAudiobook available here
Translate English to Hungarian: Pride and Prejudice
Translate English to Hungarian: Jane Austen
Outputs:
Source: mek.oszk.huTranslation: Szenczi MiklósAudiobook available here
Büszkeség és balítélet
Jane Austen
Model Inputs
[30355, 15, 1566, 12, 454, 425, 6855, 10, 9149, 10, 2786, 7756, 11063, 188, 5291, 32, 2567, 347, 270, 1]
[30355, 15, 1566, 12, 454, 425, 6855, 10, 24252, 11, 1266, 14312, 867, 1]
[30355, 15, 1566, 12, 454, 425, 6855, 10, 8158, 1392, 324, 1]
Labels
[9149, 10, 140, 157, 5, 32, 7, 172, 157, 5, 107, 76, 18474, 6105, 10, 180, 1847, 75, 702, 21475, 40, 4922, 7, 188, 5291, 32, 2567, 347, 270, 1]
[21162, 7, 172, 7735, 154, 122, 3, 899, 6561, 2, 2229, 1655, 1]
[8158, 1392, 324, 1]


In [67]:
### tokenize the whole dataset
tokenized_books = books.map(preprocess, batched=True)

In [59]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
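The collator pads each batch to the length of its longest example instead of to a fixed length. The sketch below imitates that on plain lists, assuming right padding, a pad id of 0 for the inputs, and -100 for the labels so that padded positions are ignored by the loss. `pad_batch` is a hypothetical helper, not the collator itself.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids, attention_mask and labels to the batch maximum."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

# Token ids taken from the preprocess output shown earlier
features = [
    {"input_ids": [30355, 15, 1566, 12, 454, 425, 6855, 10, 24252, 11, 1266, 14312, 867, 1],
     "labels": [21162, 7, 172, 7735, 154, 122, 3, 899, 6561, 2, 2229, 1655, 1]},
    {"input_ids": [30355, 15, 1566, 12, 454, 425, 6855, 10, 8158, 1392, 324, 1],
     "labels": [8158, 1392, 324, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"][1][-3:])  # [1, 0, 0]
print(batch["labels"][1])          # [8158, 1392, 324, 1, -100, -100, -100, -100, -100, -100, -100, -100, -100]
```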

In [60]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
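Two steps in `compute_metrics` are easy to check on small arrays: swapping the -100 label padding back to the pad id before decoding, and counting non-pad tokens for `gen_len`. The sketch assumes a pad id of 0 and that each generated sequence starts with a pad token as the decoder start token. Both are assumptions to verify against `tokenizer.pad_token_id` and the model config.

```python
import numpy as np

PAD_ID = 0  # assumption: the tokenizer's pad id

# Labels as they arrive from the Trainer: -100 marks padded positions
labels = np.array([[8158, 1392, 324, 1, -100, -100],
                   [21162, 7, 172, 1, -100, -100]])
# -100 is not a valid token id, so replace it before calling batch_decode
restored = np.where(labels != -100, labels, PAD_ID)

# Generated ids: assumed to start with the pad token and to be padded at the end
preds = np.array([[0, 8158, 1392, 324, 1, 0],
                  [0, 21162, 7, 1, 0, 0]])
gen_lens = [int(np.count_nonzero(pred != PAD_ID)) for pred in preds]

print(restored.tolist())  # [[8158, 1392, 324, 1, 0, 0], [21162, 7, 172, 1, 0, 0]]
print(gen_lens)           # [4, 3]
print(float(np.mean(gen_lens)))  # 3.5
```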

In [10]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [15]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [17]:
encoder_q_weights = model.encoder.block[0].layer[0].SelfAttention.q.weight.data
encoder_k_weights = model.encoder.block[0].layer[0].SelfAttention.k.weight.data
encoder_v_weights = model.encoder.block[0].layer[0].SelfAttention.v.weight.data
encoder_o_weights = model.encoder.block[0].layer[0].SelfAttention.o.weight.data

# Print the shapes of these weights as an example
print("Encoder 'q' weights shape:", encoder_q_weights.shape)
print("Encoder 'k' weights shape:", encoder_k_weights.shape)
print("Encoder 'v' weights shape:", encoder_v_weights.shape)
print("Encoder 'o' weights shape:", encoder_o_weights.shape)

Encoder 'q' weights shape: torch.Size([512, 512])
Encoder 'k' weights shape: torch.Size([512, 512])
Encoder 'v' weights shape: torch.Size([512, 512])
Encoder 'o' weights shape: torch.Size([512, 512])
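The shapes above, together with the printed model, give a rough parameter count for the first encoder block. The sketch counts only the layers whose sizes appear in the printout. The layer norm weights are left out because the printout does not show their size, and all projections are bias-free (`bias=False`).

```python
d_model = 512  # from Linear(in_features=512, out_features=512)
d_ff = 2048    # from wi: Linear(in_features=512, out_features=2048)

attention_params = 4 * d_model * d_model     # q, k, v, o projections
rel_bias_params = 32 * 8                     # relative_attention_bias: Embedding(32, 8)
ff_params = d_model * d_ff + d_ff * d_model  # wi and wo

block_params = attention_params + rel_bias_params + ff_params

print(attention_params)  # 1048576
print(ff_params)         # 2097152
print(block_params)      # 3145984
```

On the instantiated model the exact figure can be read with `sum(p.numel() for p in model.encoder.block[0].parameters())`, which also includes the layer norm weights.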


In [64]:
tokenized_books

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 137151
    })
})

In [68]:
training_args = Seq2SeqTrainingArguments(
    output_dir="base_finetuned_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
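The run above was interrupted before any evaluation row was logged. For a sense of how long a full run is, here is a rough step count, assuming a single device, no gradient accumulation, and that the last partial batch is kept. `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps) for one device without accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch counts
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimizer_steps(n_examples=109720, batch_size=16, epochs=2)
print(per_epoch)  # 6858
print(total)      # 13716
```

With `evaluation_strategy="epoch"`, the first validation loss and BLEU score appear only after the first of these epochs completes.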

In [None]:
print("Final Results:")
trainer.evaluate(max_length=128)