# Machine Translation Example

This notebook walks through preprocessing, training, and inference for an English-to-French translation model. It is based on the following tutorial: https://huggingface.co/docs/transformers/tasks/translation.

In [1]:
# Uncomment the line below to install the required libraries
# !pip install transformers datasets evaluate sacrebleu tensorflow tf-keras



## Sign in to your Hugging Face account

This will enable you to upload and share the model.

### Steps to get the `Access Token` from Hugging Face:

 - **Sign In or Sign Up:** If you don't have a Hugging Face account yet, you'll need to sign up. If you already have an account, sign in.

 - **Access Your Profile:** Once you're signed in, navigate to your profile settings. You can do this by clicking on your profile icon or username, usually located in the top-right corner of the Hugging Face website.
 
- **Navigate to Access Token Settings:** Within your profile settings, look for an option related to Access tokens. This is where you can manage and generate tokens.

- **Generate a New Token:** If you haven't generated a token before, you'll see a `New token` button. Click it, and make sure the token has `write` access.

- **Name Your Token (Optional):** You may be prompted to give your token a name or description. This step is optional but can be helpful if you plan to generate multiple tokens for different purposes.

- **Copy Your Token:** Once your token is generated, you'll typically see it displayed on the screen. Copy it and paste it in place of `<ADD ACCESS TOKEN>` in the `login` call below.

# Translation

In [2]:
from huggingface_hub import login

login(token="<ADD ACCESS TOKEN>")



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/gitpod/.cache/huggingface/token
Login successful
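
Pasting a token directly into a notebook makes it easy to commit it by accident. As a minimal sketch, assuming the token has been exported in an environment variable, the `login` call can read it at runtime instead. The variable name `HF_TOKEN` and the helper names `read_hf_token` and `login_from_env` are choices made for this example:

```python
import os


def read_hf_token(env=None, var="HF_TOKEN"):
    """Return the access token stored in `var`, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var} environment variable to your access token.")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub with the token read from the environment."""
    # Imported lazily so read_hf_token has no third-party dependencies
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` then replaces the hard-coded `login(token=...)` line.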


### Config

In [22]:
model_dir="../tf_saved_model"
repo_id="MelioAI/machine-translation"

### Load OPUS Books dataset

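- Start by loading the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset from the 🤗 Datasets library.

- The dataset ships with a single `train` split. The tutorial splits it with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method; to keep training fast, this notebook instead selects a small train subset (500 examples) and a small test subset (50 examples).

- Each example has an `id` and a `translation` field, which holds the English and French versions of the text.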

In [4]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

books = load_dataset("opus_books", "en-fr")
print(books.shape)

# Use small, non-overlapping slices of the original train split to keep the run fast
subset_train = books["train"].select(range(500))
subset_test = books["train"].select(range(500, 550))
books = DatasetDict({
    "train": subset_train,
    "test": subset_test
})

print(books.shape)
books["train"][0]

{'train': (127085, 2)}
{'train': (500, 2), 'test': (50, 2)}


{'id': '0', 'translation': {'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'}}
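
The cell above picks fixed index ranges from the `train` split. As an alternative sketch, the `train_test_split` method linked earlier shuffles the data and returns disjoint `train` and `test` splits. The helper names `split_books` and `overlapping_ids`, and the `test_size` and `seed` values, are assumptions for illustration:

```python
def split_books(dataset, test_size=0.1, seed=42):
    """Shuffle `dataset` (a datasets.Dataset) and return a DatasetDict with 'train' and 'test'."""
    return dataset.train_test_split(test_size=test_size, seed=seed)


def overlapping_ids(train_ids, test_ids):
    """Return the ids that occur in both splits; an empty list means no leakage."""
    return sorted(set(train_ids) & set(test_ids))


# Usage inside the notebook (not executed here):
# small = load_dataset("opus_books", "en-fr")["train"].select(range(550))
# books = split_books(small)
# assert overlapping_ids(books["train"]["id"], books["test"]["id"]) == []
```

Checking for shared ids is a cheap way to confirm that evaluation examples were not also seen during training.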

### Preprocess

- The next step is to load a T5 tokenizer to process the English-French language pairs:

- The preprocessing function you want to create needs to:

    1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.

    2. Tokenize the input (English) and target (French) separately because you can't tokenize French text with a tokenizer pretrained on an English vocabulary.
    
    3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [5]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
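
The cell that defines `preprocess_function` is not shown above, although the next cell calls it. The following is a minimal sketch that follows the three steps listed and the linked tutorial. The value `max_length=128`, the helper `build_pairs`, and the optional `tok` argument (which falls back to the notebook's `tokenizer`) are assumptions of this sketch, not necessarily what the original cell used:

```python
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "  # task prompt expected by T5


def build_pairs(examples):
    """Return (prefixed English inputs, French targets) for a batch of examples."""
    inputs = [prefix + pair[source_lang] for pair in examples["translation"]]
    targets = [pair[target_lang] for pair in examples["translation"]]
    return inputs, targets


def preprocess_function(examples, tok=None):
    """Tokenize inputs and targets, truncating both to max_length tokens."""
    tok = tokenizer if tok is None else tok  # `tokenizer` is the global from the cell above
    inputs, targets = build_pairs(examples)
    # text_target tokenizes the French side and stores it under "labels"
    return tok(inputs, text_target=targets, max_length=128, truncation=True)
```

Because `tok` defaults to `None`, the function can still be passed to `map` with a single batch argument.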



- To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [7]:
tokenized_books = books.map(preprocess_function, batched=True)

- Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [8]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

2024-05-09 13:42:16.815680: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-09 13:42:16.932253: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-09 13:42:17.225043: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
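
To make dynamic padding concrete, here is a toy NumPy sketch, not the collator's actual implementation. It pads each batch only to its own longest sequence. The helper name `pad_batch` and the pad ids used (`0` for inputs, `-100` for labels so that padded positions are ignored by the loss) are assumptions for illustration:

```python
import numpy as np


def pad_batch(sequences, pad_id):
    """Right-pad variable-length id lists to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), longest), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, : len(seq)] = seq
    return batch


input_ids = pad_batch([[13, 7, 1], [42, 1]], pad_id=0)
labels = pad_batch([[5, 1], [9, 8, 6, 1]], pad_id=-100)
print(input_ids.shape, labels.shape)  # (2, 3) (2, 4)
```

Each array is only as wide as its own longest row, which is what saves work compared with padding every example to a global maximum length.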


### Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [9]:
import evaluate

metric = evaluate.load("sacrebleu")

- Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the SacreBLEU score:

In [10]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

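- Your `compute_metrics` function is now ready. Note that the training cell in this notebook registers only `PushToHubCallback`, so the metric is not computed during `fit` unless you also wrap `compute_metrics` in a `KerasMetricCallback` and add it to the callbacks.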
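
Two details of `compute_metrics` are easy to miss: label positions padded with `-100` must be swapped for a real token id before decoding, and SacreBLEU expects a list of references for every prediction. A small sketch with made-up ids (`pad_token_id = 0` is an assumption here) shows both steps, plus the `gen_len` calculation:

```python
import numpy as np

pad_token_id = 0  # assumed pad id for this toy example

# -100 marks padded label positions that the loss ignores
labels = np.array([[71, 5, 1, -100, -100],
                   [12, 9, 4, 3, 1]])
decodable = np.where(labels != -100, labels, pad_token_id)
print(decodable.tolist())  # [[71, 5, 1, 0, 0], [12, 9, 4, 3, 1]]

# Each prediction is paired with a list of one or more reference strings
decoded_preds = [" Le grand Meaulnes ", "Bonjour"]
decoded_labels = ["Le grand Meaulnes", " Bonjour "]
preds = [pred.strip() for pred in decoded_preds]
references = [[label.strip()] for label in decoded_labels]
print(preds)       # ['Le grand Meaulnes', 'Bonjour']
print(references)  # [['Le grand Meaulnes'], ['Bonjour']]

# gen_len counts the non-padding tokens in each row
gen_len = np.mean([np.count_nonzero(row != pad_token_id) for row in decodable])
print(gen_len)  # 4.0
```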

### Train

<Tip>

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-a-tensorflow-model-with-keras)!

</Tip>
To finetune a model in TensorFlow, start by setting up an optimizer. This notebook uses `AdamWeightDecay` with a fixed learning rate and weight decay rate:

In [11]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

- Then you can load T5 with [TFAutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.TFAutoModelForSeq2SeqLM):

In [12]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


- Convert your datasets to the `tf.data.Dataset` format with [prepare_tf_dataset()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset):

In [13]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_books["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_books["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

- Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

In [14]:
import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

## Pushing to HuggingFace

- Specify where to push your model and tokenizer in the [PushToHubCallback](https://huggingface.co/docs/transformers/main/en/main_classes/keras_callbacks#transformers.PushToHubCallback):

- **NB:** If the Hub repo already contains files, the cell below clones the repo, including those files, into `output_dir`, so you might see a `tf_saved_model` folder appear locally.

In [28]:
from transformers.keras_callbacks import PushToHubCallback

# Callback that will save and push the model to the Hub 
push_to_hub_callback = PushToHubCallback(
    output_dir=model_dir,
    tokenizer=tokenizer,
    hub_model_id=repo_id
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/workspace/examples/machine-translation/notebooks/../tf_saved_model is already a clone of https://huggingface.co/MelioAI/machine-translation. Make sure you pull the latest changes with `repo.git_pull()`.


- Then bundle your callbacks together:

In [29]:
callbacks = [push_to_hub_callback]

- Finally, you're ready to start training your model! Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

In [30]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=1, callbacks=callbacks)



<tf_keras.src.callbacks.History at 0x7f4045554860>

- Once training is complete, your model is automatically uploaded to the Hugging Face Hub (or updated there if it already exists). You can open the repo set by `repo_id` on the Hub to check the model files.

<Tip>

For a more in-depth example of how to finetune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).

</Tip>

### HuggingFace - Download and Upload Files (Optional)

- **NB:** The download and upload functionality is not necessary, but if you wish to download the model files to your local system or upload a folder/files to Hugging Face, you can utilize these features.

- Only run the code blocks below for `upload` and `download` if needed!

### a)  Download Model

In [31]:
from huggingface_hub import snapshot_download

# Download model to local dir
snapshot_download(
    repo_id=repo_id,
    local_dir=model_dir
)

Fetching 8 files: 100%|██████████| 8/8 [00:02<00:00,  3.54it/s]


'/workspace/examples/machine-translation/tf_saved_model'

### b) Upload a folder

 - **NB**: The folder will not be uploaded again if it already exists in the repo and none of its files have changed.

 - Reference: https://huggingface.co/docs/huggingface_hub/en/guides/upload

In [None]:
# Upload all the content from the local folder to your remote model repo.
# By default, files are uploaded at the root of the repo

from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=model_dir,
    repo_id=repo_id,
    repo_type="model",
    #multi_commits=True,
    multi_commits_verbose=True,
)

### c) Upload single files

In [None]:
from huggingface_hub import HfApi
api = HfApi()

api.upload_file(
    path_or_fileobj="/workspace/examples/machine-translation/tf_saved_model/README.md",
    path_in_repo="README.md",
    repo_id=repo_id,
    repo_type="model",
)

### Inference

- Great, now that you've finetuned a model, you can use it for inference!

- Come up with some text you'd like to translate to another language. For T5, you need to prefix your input depending on the task you're working on. For translation from English to French, you should prefix your input as shown below:

In [32]:
# example text
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

- The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for translation with your model, and pass your text to it:

In [33]:
from transformers import pipeline

translator = pipeline("translation", model=repo_id)
translator(text)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at MelioAI/machine-translation.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
I0000 00:00:1715264331.097958   18699 service.cc:145] XLA service 0x55a1327da490 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1715264331.098008   18699 service.cc:153]   StreamExecutor device (0): Host, Default Version
2024-05-09 14:18:51.198361: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1715264331.217740   18699 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


[{'translation_text': 'Legumes partagent les ressources avec les bactéries fixatrices de azote.'}]

# Manual Inference

Once the model is trained, you can also test it without going through the Hugging Face Hub:

1. Save the tokenizer and model locally.
2. Either load them with the `pipeline` function, *or*
3. Run the pipeline steps manually:
   - Tokenize the text and return the `input_ids` as tensors.
   - Use the `generate()` method to create the translation.
   - Decode the generated token ids back into text.

In [35]:
tokenizer.save_pretrained("machine-translation/tf_saved_model")
model.save_pretrained("machine-translation/tf_saved_model")

In [36]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

In [37]:
from transformers import pipeline

## Do everything in one step using pipeline
translator = pipeline("translation", model="machine-translation/tf_saved_model")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at machine-translation/tf_saved_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [38]:
from transformers import AutoTokenizer
from transformers import TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("machine-translation/tf_saved_model")
loaded_model = TFAutoModelForSeq2SeqLM.from_pretrained("machine-translation/tf_saved_model/")

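# Tokenize the prompt and keep only the input ids as a TensorFlow tensor
inputs = tokenizer(text, return_tensors="tf").input_ids
# do_sample=True samples from the top-k/top-p filtered distribution, so the output can vary between runs
outputs = loaded_model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
# Decode the generated ids back into text, dropping special tokens
tokenizer.decode(outputs[0], skip_special_tokens=True)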

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at machine-translation/tf_saved_model/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


"Les lignées partagent les ressources avec les bactéries fixatrices d'azote."

- You can navigate to the `deployment` folder and follow the instructions provided in the `README` file to locally test the deployed model.

- **NB:** Ensure your model folder is named `tf_saved_model` and is saved within the `machine-translation` directory, i.e.:

```sh
    machine-translation
        - deployment
        - notebooks
        - training
        - tf_saved_model
        - README.md
```
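
Before running the deployment steps, the expected layout can be verified with a few lines of standard-library Python. This is a sketch: the helper name `missing_entries`, and the assumption that it is given the path to the `machine-translation` directory, are choices made for this example:

```python
from pathlib import Path

EXPECTED = ["deployment", "notebooks", "training", "tf_saved_model", "README.md"]


def missing_entries(project_root):
    """Return the expected files or folders that are absent from project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED if not (root / name).exists()]


# Example: run from the notebooks folder, where ".." is the machine-translation directory
print(missing_entries(".."))
```

An empty list means every expected entry, including the `tf_saved_model` folder, is in place.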