# Finetune T5 locally for machine translation on COVID-19 Health Service Announcements with Hugging Face

[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws/studio-lab-examples/blob/main/natural-language-processing/NLP_Disaster_Recovery_Translation.ipynb)

This notebook is designed to run in SageMaker Studio Lab on a `g4dn.xlarge` GPU instance. If you are not on a GPU runtime right now, please restart your session and select `GPU`, as this will let you train your model in tens of minutes rather than hours.

If you are ready for training a large-scale machine translation model, then please check out using Hugging Face on Amazon SageMaker! 

Otherwise, please enjoy this notebook.

### Step 0. Install all necessary packages

In [1]:
%conda env list


# conda environments:
#
base                     /ai/anaconda3
hug                      /ai/anaconda3/envs/hug
sagemaker-distribution     /ai/anaconda3/envs/sagemaker-distribution
training              *  /ai/anaconda3/envs/training


Note: you may need to restart the kernel to use updated packages.


In [2]:
%%writefile requirements.txt

ipywidgets
git+https://github.com/huggingface/transformers
datasets
sacrebleu
torch
sentencepiece
evaluate

Writing requirements.txt


In [3]:
%pip install -r requirements.txt

Collecting git+https://github.com/huggingface/transformers (from -r requirements.txt (line 3))
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-2hsmmdbw
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-2hsmmdbw
  Resolved https://github.com/huggingface/transformers to commit a22ff36e0e347d3d0095cccd931cbbd12b14e86a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ipywidgets (from -r requirements.txt (line 2))
  Using cached ipywidgets-8.1.3-py3-none-any.whl.metadata (2.4 kB)
Collecting datasets (from -r requirements.txt (line 4))
  Using cached datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting sacrebleu (from -r requirements.txt (line 5))
  Downloading sacrebleu-2.4.2-py3-none-any.whl.metadata (58 kB)
Collecting torch (from -r requirements.txt (line 

In [4]:
import IPython
# make sure to restart your kernel to use the newly installed packages
# IPython.Application.instance().kernel.do_shutdown(True) 

### Step 1. Explore the available datasets on Translators without Borders
Then, download a pair you would like to use for training a language translation model. The steps below download the translation pairs for English to Spanish, but you are welcome to modify these and use a different pair if you prefer.

Overall site page: https://tico-19.github.io/

Page with all language pairs: https://tico-19.github.io/memories.html 

Scroll through all supported language pairs and pick your favorite. We'll demonstrate English to Spanish, `en-to-es`.

Copy the link to that pair. For `en-to-es` it looks like this:
- https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip 

In [5]:
path_to_my_data = 'https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip'

In [6]:
!wget {path_to_my_data}

--2024-08-13 21:40:58--  https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip
Resolving tico-19.github.io (tico-19.github.io)... 185.199.108.153, 185.199.111.153, 185.199.110.153, ...
Connecting to tico-19.github.io (tico-19.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 381511 (373K) [application/zip]
Saving to: ‘all.en-es-LA.tmx.zip’


2024-08-13 21:40:59 (8,81 MB/s) - ‘all.en-es-LA.tmx.zip’ saved [381511/381511]



In [7]:
local_file = path_to_my_data.split('/')[-1]
print(local_file)
filename = local_file.split('.zip')[0]
print(filename)

all.en-es-LA.tmx.zip
all.en-es-LA.tmx


In [8]:
!unzip {local_file}

Archive:  all.en-es-LA.tmx.zip
  inflating: all.en-es-LA.tmx        
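
If `wget` or `unzip` are not available in your environment, the extraction step can also be done in Python. This is a minimal sketch using only the standard library; `unzip_to` is a hypothetical helper name, not part of the original notebook.

```python
import zipfile


def unzip_to(archive_path, dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Calling `unzip_to(local_file)` in place of the `!unzip` cell should leave the same `.tmx` file in the working directory.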


### Step 2: Extract data from `.tmx` file type 
Next, use the function below to extract the sentence pairs from the `.tmx` file and format them for local training with Hugging Face.

In [9]:
# set your source and target language codes here
source_code_1 = 'en'
target_code_2 =  'es'

In [10]:
def parse_tmx(filename, source_code_1, target_code_2):
    '''
    Takes a local TMX filename and codes for the source and target languages.
    Walks through the file line by line, looking for TMX-specific tags.
    When a line matches, strips the tags and adds the text to a dictionary
    for downstream pandas formatting.
    '''
    
    data = {source_code_1:[], target_code_2:[]}

    # read as UTF-8 so accented characters survive on any platform
    with open(filename, encoding='utf-8') as f:

        for row in f:

            if not row.endswith('</seg></tuv>\n'):
                continue

            if row.startswith('<seg>'):

                st_1 = row.strip()

                st_1 = st_1.replace('<seg>', '')
                st_1 = st_1.replace('</seg></tuv>', '')

                data[source_code_1].append(st_1)

            # when you use your own target code, remove the -LA string 
            if row.startswith('<tuv xml:lang="{}-LA"><seg>'.format(target_code_2)):

                st_2 = row.strip()
                # when you use your own target code, remove the -LA string 
                st_2 = st_2.replace('<tuv xml:lang="{}-LA"><seg>'.format(target_code_2), '')
                st_2 = st_2.replace('</seg></tuv>', '')

                data[target_code_2].append(st_2)
                
        return data

data = parse_tmx(filename, source_code_1, target_code_2)

In [11]:
# this makes sure you got actual pairs
assert len(data[source_code_1]) == len(data[target_code_2])
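
The function above depends on the exact line layout of the file. Since TMX is an XML format, a more layout-tolerant alternative is to parse it with the standard library's `xml.etree.ElementTree`. The sketch below is an assumption about how that could look, not the notebook's method: `parse_tmx_text` is a hypothetical helper, and the sample string only mimics the usual `tu` / `tuv` / `seg` nesting with one sentence pair from this dataset.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx_text(tmx_text, source_code, target_code):
    """Collect aligned segments from every translation unit (<tu>)."""
    root = ET.fromstring(tmx_text)
    pairs = {source_code: [], target_code: []}
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # 'es-LA' -> 'es', so regional variants match the plain code
            lang = tuv.get(XML_LANG, "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # keep only units that contain both languages, so lists stay aligned
        if source_code in segments and target_code in segments:
            pairs[source_code].append(segments[source_code])
            pairs[target_code].append(segments[target_code])
    return pairs


sample = (
    '<tmx version="1.4"><body><tu>'
    '<tuv xml:lang="en"><seg>and along with a fever</seg></tuv>'
    '<tuv xml:lang="es-LA"><seg>y también fiebre</seg></tuv>'
    '</tu></body></tmx>'
)
print(parse_tmx_text(sample, "en", "es"))
```

Because a pair is only appended when both languages are present in the same unit, the two lists cannot drift out of alignment.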

In [12]:
import pandas as pd

df = pd.DataFrame.from_dict(data, orient = 'columns')

df.head()

Unnamed: 0,en,es
0,about how long have these symptoms been going on?,¿cuánto hace más o menos que tiene estos sínto...
1,and all chest pain should be treated this way ...,y siempre el dolor de pecho debe tratarse de e...
2,and along with a fever,y también fiebre
3,and also needs to be checked your cholesterol ...,y también debe controlarse su colesterol y pre...
4,and are you having a fever now?,¿y tiene fiebre ahora?


In [13]:
# write to disk in case you need to restart your kernel later
df.to_csv('language_pairs.csv', index=False, header=True)

### Step 3. Format extracted data for machine translation with Hugging Face
Core examples available right here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation 

Guidance on formatting for Hugging Face datasets here:
https://huggingface.co/docs/datasets/loading_datasets.html#json-files 

In [14]:
import pandas as pd

df = pd.read_csv('language_pairs.csv')
df.head()

Unnamed: 0,en,es
0,about how long have these symptoms been going on?,¿cuánto hace más o menos que tiene estos sínto...
1,and all chest pain should be treated this way ...,y siempre el dolor de pecho debe tratarse de e...
2,and along with a fever,y también fiebre
3,and also needs to be checked your cholesterol ...,y también debe controlarse su colesterol y pre...
4,and are you having a fever now?,¿y tiene fiebre ahora?


The translation task supports only custom JSON Lines files. Each line is a dictionary with the key "translation", whose value is another dictionary keyed by the language codes of the pair. For example:

`{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alții așteaptă implozia." } }`


In [16]:
objs = []

for idx, row in df.iterrows():
    
    obj = {"translation": {source_code_1: row[source_code_1], target_code_2: row[target_code_2]}} 
    objs.append(obj)

In [17]:
objs[:5]

[{'translation': {'en': 'about how long have these symptoms been going on?',
   'es': '¿cuánto hace más o menos que tiene estos síntomas?'}},
 {'translation': {'en': 'and all chest pain should be treated this way especially with your age',
   'es': 'y siempre el dolor de pecho debe tratarse de esta manera, en especial a su edad'}},
 {'translation': {'en': 'and along with a fever', 'es': 'y también fiebre'}},
 {'translation': {'en': 'and also needs to be checked your cholesterol blood pressure',
   'es': 'y también debe controlarse su colesterol y presión arterial'}},
 {'translation': {'en': 'and are you having a fever now?',
   'es': '¿y tiene fiebre ahora?'}}]

In [18]:
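import json 
!mkdir -p data
# ensure_ascii=False keeps accented characters readable, so write the file as UTF-8
with open('data/train.json', 'w', encoding='utf-8') as f:
    for row in objs:
        j = json.dumps(row, ensure_ascii = False)
        f.write(j)
        f.write('\n')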
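
Before training, it can be worth confirming that every line really has the shape described above. The following is a small sketch, assuming the two language codes used in this notebook; `check_translation_lines` is a hypothetical helper. It flags lines that are not valid JSON, that lack the "translation" key, or whose values are not strings (for example an empty cell that pandas read back as `NaN`).

```python
import json


def check_translation_lines(lines, source_code, target_code):
    """Return the 1-based numbers of lines that do not match the expected format."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            pair = json.loads(line)["translation"]
            ok = (isinstance(pair.get(source_code), str)
                  and isinstance(pair.get(target_code), str))
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            ok = False
        if not ok:
            bad.append(number)
    return bad


example_lines = [
    '{"translation": {"en": "and along with a fever", "es": "y también fiebre"}}',
    '{"translation": {"en": "and i have a cough too"}}',  # target missing
    'not json at all',
]
print(check_translation_lines(example_lines, "en", "es"))
```

To check the real file, pass an open file handle, for instance `check_translation_lines(open('data/train.json', encoding='utf-8'), 'en', 'es')`. An empty list means every line passed.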

### Step 4 - Finetune a machine translation model locally
To do this, let's first download the raw Python file we need from Hugging Face to finetune our model.

In [19]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/translation/run_translation.py

--2024-08-13 21:43:49--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/translation/run_translation.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30439 (30K) [text/plain]
Saving to: ‘run_translation.py’


2024-08-13 21:43:49 (6,37 MB/s) - ‘run_translation.py’ saved [30439/30439]



If the training script fails with an `ImportError`, the `Trainer` is asking for the `accelerate` package at version 0.21.0 or higher. The error is raised when the script sets up the device for training and does not find a suitable version of `accelerate`.

To resolve this, install the required package:

```bash
pip install 'accelerate>=0.21.0'
```

Alternatively, install `transformers` together with its PyTorch dependencies:

```bash
pip install 'transformers[torch]'
```

Either command should resolve the `ImportError`, after which the translation training script should run.

In [None]:
%pip install 'accelerate>=0.21.0'
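
To see whether the install above is needed at all, you can check the installed version first. This is a rough sketch using the standard library's `importlib.metadata`; `version_tuple` and `meets_minimum` are hypothetical helpers, and the comparison deliberately ignores pre-release ordering (the `packaging` library handles that more rigorously).

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.21.0' or '0.33.0.dev0' into a tuple of its leading integers."""
    numbers = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(package, minimum):
    """Return None if the package is missing, else whether it is new enough."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


print("accelerate >= 0.21.0:", meets_minimum("accelerate", "0.21.0"))
```

A result of `None` or `False` suggests running the install cell and restarting the kernel before training.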

In [13]:
# full Hugging Face Trainer API args available here
# https://github.com/huggingface/transformers/blob/de635af3f1ef740aa32f53a91473269c6435e19e/src/transformers/training_args.py
# T5 training args available here
# https://huggingface.co/transformers/model_doc/t5.html#t5config
!python run_translation.py \
    --model_name_or_path t5-small \
    --do_train \
    --source_lang en \
    --target_lang es \
    --source_prefix "translate English to Spanish: " \
    --train_file data/train.json \
    --output_dir output/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --save_strategy epoch \
    --num_train_epochs 3
#     --do_eval \
#     --validation_file path_to_jsonlines_file \
#     --dataset_name cov-19 \
#     --dataset_config_name en-es \


08/13/2024 22:56:49 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
ev
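
The run above only trains. The commented-out `--do_eval` and `--validation_file` flags need a held-out file, which this notebook does not create. One way to produce it is sketched below, under the assumption that a random 10% split is acceptable for your data. `split_pairs` and `write_jsonl` are hypothetical helpers and the three records are placeholders.

```python
import json
import random


def split_pairs(records, validation_fraction=0.1, seed=42):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping accented characters as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# placeholder records in the same shape as the `objs` list built in Step 3
example = [{"translation": {"en": f"sentence {i}", "es": f"frase {i}"}} for i in range(3)]
train_records, validation_records = split_pairs(example)
print(len(train_records), len(validation_records))
```

With the real `objs` list, you could write the two parts to `data/train.json` and, say, `data/validation.json`, then add `--do_eval --validation_file data/validation.json` to the training command.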

In [2]:
!ls output/tst-translation

all_results.json  generation_config.json   tokenizer_config.json
checkpoint-1536   model.safetensors	   tokenizer.json
checkpoint-2304   README.md		   trainer_state.json
checkpoint-768	  special_tokens_map.json  training_args.bin
config.json	  spiece.model		   train_results.json


### Step 5. Test your newly fine-tuned translation model

The `AutoModelWithLMHead` class is deprecated and will be removed in a future version of the `transformers` library, so loading the model with it prints a warning. The replacement is the class that matches your model type. T5 is an encoder-decoder model, so the cell below loads it with `AutoModelForSeq2SeqLM`.

In [5]:
# AutoModelWithLMHead is deprecated; T5 is an encoder-decoder model,
# so load it with AutoModelForSeq2SeqLM instead
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Load the model from the trained output directory
model = AutoModelForSeq2SeqLM.from_pretrained('output/tst-translation')



In [6]:
# switch the model to inference mode (this disables dropout)
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

Next, let's test it! Remember that with the default setting of only 3 epochs, your translation is probably not going to be state of the art (SOTA). To get there, we recommend migrating to Amazon SageMaker to scale up and out. Scaling up means moving your code to a more advanced compute type, such as a p4 series instance or even Trainium. Scaling out means adding more compute, going from one instance to many. On AWS you can train for much longer on much larger datasets, which can translate directly into a more accurate model.

The translations printed by the test cell are rough: several phrases are incorrect or incomplete. A few likely reasons:

1. **Model quality**: `t5-small` is the smallest variant of the T5 family and may not translate every input accurately, especially when fine-tuned on a small dataset.

2. **Training dataset**: If the training dataset (`data/train.json`) is small or not well aligned with the task, the model may not generalize well to unseen data.

3. **Generation settings**: If `generate` warns about `max_length`, the default maximum length for generated sequences may be too short and truncate some translations. Setting a higher value for `max_new_tokens` can yield more complete translations.

To improve the results, you can try the following:

1. **Adjust `max_new_tokens`** to allow the model to generate longer sequences:

   ```python
   outputs = model.generate(input_ids, max_new_tokens=50)  # Adjust this value as needed
   ```

2. **Fine-tune the model further**, if possible with more data relevant to your specific task.

3. **Use a larger model**, such as `t5-base` or `t5-large`, which may provide better translations.

4. **Post-process the output** to fix common errors such as missing or incorrect words, using rules or another model to refine the translations.

In [8]:
input_sequences = ['about how long have these symptoms been going on?',	
'and all chest pain should be treated this way especially with your age	',
'and along with a fever	',
'and also needs to be checked your cholesterol blood pressure',	
'and are you having a fever now?	',
'and are you having any of the following symptoms with your chest pain',	
'and are you having a runny nose?',	
'and are you having this chest pain now?',
'and besides do you have difficulty breathing',
'and can you tell me what other symptoms are you having along with this?',
'and does this pain move from your chest?',
'and drink lots of fluids',
'and how high has your fever been',
'and i have a cough too',
'and i have a little cold and a cough',
'''and i'm really having some bad chest pain today''']

task_prefix = "translate English to Spanish: "

for i in input_sequences:
    input_ids = tokenizer('''{} {}'''.format(task_prefix, i), return_tensors='pt').input_ids
#   outputs = model.generate(input_ids)
    outputs = model.generate(input_ids, max_new_tokens=50)  # Adjust this value as needed

    print(i, tokenizer.decode(outputs[0], skip_special_tokens=True))


about how long have these symptoms been going on? en el trabajo de los sntomas?
and all chest pain should be treated this way especially with your age	 y todos los dolores de la población del trabajo es tratar el felo, en specialmente en el âge
and along with a fever	 y a las fiebas
and also needs to be checked your cholesterol blood pressure y es necesario a verificar la presión arterial del cholesterol
and are you having a fever now?	 y tiene una fiere?
and are you having any of the following symptoms with your chest pain y tiene el sntomas más más en el dolor de la población
and are you having a runny nose? y tiene un nase agua?
and are you having this chest pain now? y tiene el dolor de la población?
and besides do you have difficulty breathing y ahora ahora a las dificultas respirar
and can you tell me what other symptoms are you having along with this? y tu pueden me dire quels sntomas tienes en el trabajo?
and does this pain move from your chest? y se movió el dolor de ta pobla?
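
Reading the outputs by eye shows they are rough, but a number makes it easier to compare runs. The sketch below computes a simple token-overlap F1 between a model output and its reference translation. It is a crude stand-in chosen because it needs no extra packages, not an established translation metric; for reported results, a standard metric such as BLEU from the `sacrebleu` package in `requirements.txt` would be the usual choice. `token_f1` is a hypothetical helper.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Harmonic mean of token precision and recall, case-insensitive."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# model output for 'and along with a fever' vs. the reference from the dataset
print(round(token_f1("y a las fiebas", "y también fiebre"), 3))  # 0.286
```

Only one token (`y`) is shared here, which matches the impression that the model has not yet learned the word for fever.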

In [9]:
model.save_pretrained('my-tf-en-to-sp')

In [11]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

!tar -czf my_model.tar.gz my-tf-en-to-sp

The `tokenizers` library used by Hugging Face prints a warning when the process is forked after parallelism has already been used. Forking in that state can lead to deadlocks, so the library disables parallelism as a precaution.

### Solutions to Avoid the Warning

1. **Set the Environment Variable**:
   Set `TOKENIZERS_PARALLELISM` explicitly to `true` or `false`, depending on whether you want parallelism enabled.

   To disable parallelism and suppress the warning, run:
   ```bash
   export TOKENIZERS_PARALLELISM=false
   ```

   Inside a Python script or Jupyter notebook, set the environment variable in code instead, as the cell above does:
   ```python
   import os
   os.environ["TOKENIZERS_PARALLELISM"] = "false"
   ```

2. **Avoid Using `tokenizers` Before Forking**:
   If possible, structure your code so that the `tokenizers` library is not used before the process forks. This might involve reordering code or separating certain tasks.

3. **Ignore the Warning**:
   If the warning doesn't affect your workflow, you can ignore it. The library handles the situation by disabling parallelism automatically.

### Using `tar` to Compress the Model

The warning says nothing about the `tar` command itself. The command:

```bash
tar -czf my_model.tar.gz my-tf-en-to-sp
```

is correct and creates a `my_model.tar.gz` file containing the `my-tf-en-to-sp` directory. If the warning appears right after this command, it most likely comes from earlier cells that used the `tokenizers` library.
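
The archive can also be built in Python with the standard library's `tarfile` module. Because this runs inside the notebook process instead of launching a shell command, it may also sidestep the fork that the `tokenizers` warning is about; treat that as an assumption to verify in your environment. `archive_directory` is a hypothetical helper.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive and return its path."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name inside the archive
        tar.add(str(source), arcname=source.name)
    return archive_path
```

Calling `archive_directory('my-tf-en-to-sp', 'my_model.tar.gz')` should produce an archive with the same layout as the `tar -czf` command.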