# Starter notebook

This notebook builds a baseline model that translates Dyula (dyu, the source language) into French (fr, the target language). We train a Transformer model from scratch using JoeyNMT.

NB: With the resources (GPU, RAM) defined below, this notebook runs in less than **1h**.

For more details about JoeyNMT, see its [GitHub page](https://github.com/joeynmt).

## Environmental setup

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [1]:
#!nvidia-smi
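A quick check that a GPU runtime is active can also be done from Python. The sketch below only tests whether the `nvidia-smi` executable is on the `PATH`, which is an assumption about how the runtime exposes the driver; the helper name `gpu_tool_available` is hypothetical and is not part of JoeyNMT or PyTorch.

```python
import shutil


def gpu_tool_available(tool: str = "nvidia-smi") -> bool:
    """Return True if the given GPU utility is found on the PATH."""
    return shutil.which(tool) is not None


if gpu_tool_available():
    print("nvidia-smi found: a GPU runtime is probably active.")
else:
    print("nvidia-smi not found: training will likely run on CPU.")
```

Finding the tool does not guarantee that PyTorch can use the GPU, so treat the result as a hint only.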

In [2]:
import torch
torch.__version__

'2.2.1+cu121'

Mount your Google Drive


In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# %%capture
# !pip install joeynmt==2.3.0
# !pip install datasets==2.18.0

## Data Preparation

### Download

We download the train, validation, and test subsets of the corpus from the Hugging Face Hub.

In [5]:
from datasets import load_dataset, DatasetDict, Translation
repo_name = "data354/Koumankan_mt_dyu_fr"
dataset = load_dataset(repo_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})

In [6]:
import re

## Data preprocessing

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
# Optional: remove punctuation.

src_lang = 'dyu'
trg_lang = "fr"
# Raw string avoids invalid-escape warnings; the set of removed characters is unchanged.
chars_to_remove_regex = r'[!"&(),\-./:;=?+\n\[\]]'
def remove_special_characters(text):
    text = re.sub(chars_to_remove_regex, ' ', text.lower())
    return text.strip()

def clean_text(batch):
    # process source text
    batch['translation'][src_lang] = remove_special_characters(batch['translation'][src_lang])
    # process target text
    batch['translation'][trg_lang] = remove_special_characters(batch['translation'][trg_lang])

    return batch


dataset = dataset.map(clean_text)
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})
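To see what the cleaning step does to a single sentence, the sketch below applies the same character class to a made-up French sample. Because each removed character is replaced by a space, punctuation inside a sentence can leave double spaces behind; `collapse_spaces` is a hypothetical extra step, not part of the original pipeline, that squeezes them.

```python
import re

# Same character class as the cell above, written as a raw string.
chars_to_remove_regex = r'[!"&(),\-./:;=?+\n\[\]]'


def remove_special_characters(text: str) -> str:
    """Lowercase the text and replace each listed character with a space."""
    return re.sub(chars_to_remove_regex, " ", text.lower()).strip()


def collapse_spaces(text: str) -> str:
    """Hypothetical extra step: squeeze runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Bonjour, le monde !"  # made-up sample, not from the corpus
cleaned = remove_special_characters(sample)
print(repr(cleaned))                   # note the double space left by the comma
print(repr(collapse_spaces(cleaned)))
```

The typographic apostrophe `’` is not in the character class, so forms such as `d’avance` are kept intact.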

Let's inspect the sentences.

In [7]:
dataset["validation"]["translation"][:3]

[{'dyu': 'i tɔgɔ bi cogodɔ', 'fr': 'tu portes un nom de fantaisie'},
 {'dyu': 'puɛn saba fɔlɔ', 'fr': 'trois points d’avance'},
 {'dyu': 'tile bena', 'fr': 'le soleil s’est couché'}]
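Each record is a dictionary keyed by language code. As a minimal sketch, the three examples printed above can be split into parallel source and target lists, and whitespace token counts give a rough idea of sentence length; the helper name `split_pairs` is hypothetical.

```python
# The three validation examples printed above.
records = [
    {"dyu": "i tɔgɔ bi cogodɔ", "fr": "tu portes un nom de fantaisie"},
    {"dyu": "puɛn saba fɔlɔ", "fr": "trois points d’avance"},
    {"dyu": "tile bena", "fr": "le soleil s’est couché"},
]


def split_pairs(records, src_lang="dyu", trg_lang="fr"):
    """Return parallel lists of source and target sentences."""
    sources = [r[src_lang] for r in records]
    targets = [r[trg_lang] for r in records]
    return sources, targets


sources, targets = split_pairs(records)
# Whitespace token counts, a rough proxy for sentence length.
src_lengths = [len(s.split()) for s in sources]
trg_lengths = [len(t.split()) for t in targets]
print(src_lengths, trg_lengths)
```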

Save the train, validation, and test subsets to disk.

In [8]:
data_dir = "../data/dyu_fr"
dataset.save_to_disk(data_dir)

Saving the dataset (1/1 shards): 100%|██████████| 8065/8065 [00:00<00:00, 1072003.22 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1471/1471 [00:00<00:00, 382933.29 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1393/1393 [00:00<00:00, 411339.44 examples/s]


### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords according to their frequency in the training corpus. The config labels this level `bpe`, although the training log further down shows SentencePiece running with `model_type=unigram`.

The `build_vocab.py` script trains the subword model and creates a joint vocabulary. It takes the same config file as JoeyNMT.

In [9]:
from pathlib import Path

# model dir
model_dir = "../saved_model/dyu_fr"

# Create the config
config = """
name: "dyu_fr_transformer-sp"
joeynmt_version: "2.3.0"
model_dir: "{model_dir}"
use_cuda: False # False for CPU training
fp16: False

data:
    train: "{data_dir}"
    dev: "{data_dir}"
    test: "{data_dir}"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "dyu-fr"
    sample_dev_subset: 1460
    src:
        lang: "dyu"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 4000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    trg:
        lang: "fr"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 4000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    special_symbols:
        unk_token: "<unk>"
        unk_id: 0
        pad_token: "<pad>"
        pad_id: 1
        bos_token: "<s>"
        bos_id: 2
        eos_token: "</s>"
        eos_id: 3

""".format(data_dir=data_dir, model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)

Call the `build_vocab.py` script with the `--joint` flag to build the vocabulary.

In [10]:
!wget https://raw.githubusercontent.com/joeynmt/joeynmt/v2.3/scripts/build_vocab.py
!chmod +x build_vocab.py

--2024-04-02 13:17:29--  https://raw.githubusercontent.com/joeynmt/joeynmt/v2.3/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13170 (13K) [text/plain]
Saving to: ‘build_vocab.py.1’


2024-04-02 13:17:29 (96.1 MB/s) - ‘build_vocab.py.1’ saved [13170/13170]



In [11]:
!python build_vocab.py {data_dir}/config.yaml --joint

### Vocab file '../data/dyu_fr/vocab.txt' will be overwritten.
Model file '../data/dyu_fr/sp.model' will be overwritten.
### Training sentencepiece...
sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=/tmp/sentencepiece_v7l6ptxi.txt --model_prefix=../data/dyu_fr/sp --model_type=unigram --vocab_size=4000 --character_coverage=1.0 --accept_language=dyu,fr --unk_piece=<unk> --bos_piece=<s> --eos_piece=</s> --pad_piece=<pad> --unk_id=0 --bos_id=2 --eos_id=3 --pad_id=1 --vocabulary_output_piece_score=false
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/sentencepiece_v7l6ptxi.txt
  input_format: 
  model_prefix: ../data/dyu_fr/sp
  model_type: UNIGRAM
  vocab_size: 4000
  accept_language: dyu
  accept_language: fr
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_ite

The generated vocabulary looks like this:

In [12]:
!head -10 {data_dir}/vocab.txt

<unk>
<pad>
<s>
</s>
▁a
s
▁ka
'
a
▁la
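In SentencePiece vocabularies the `▁` character (U+2581) marks the start of a word, and pieces without it continue the previous word. The sketch below rejoins a list of pieces into text in plain Python. The piece sequence is a made-up illustration built from entries in the listing above, not a segmentation produced by the trained model, and `pieces_to_text` is a hypothetical helper.

```python
WORD_BOUNDARY = "\u2581"  # the '▁' marker used by SentencePiece


def pieces_to_text(pieces):
    """Join subword pieces and turn word-boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


# Hypothetical segmentation built from pieces shown in the vocabulary.
pieces = ["▁a", "▁ka", "▁la", "s"]
print(pieces_to_text(pieces))
```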


## Model Training

### Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We're generating this now to configure paths in the appropriate places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the subword vocabulary created above.
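Multi-head attention splits the hidden size evenly across heads, so `hidden_size` has to be divisible by `num_heads`. As a small sanity check, the sketch below computes the per-head dimension and the `ff_size / hidden_size` ratio for the encoder and decoder values used in the configuration in the next cell; the helper `head_dim` is hypothetical.

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head dimension; hidden_size must divide evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


# Values copied from the model section of the config.
encoder = {"hidden_size": 256, "num_heads": 4, "ff_size": 1024}
decoder = {"hidden_size": 256, "num_heads": 8, "ff_size": 1024}

for name, cfg in (("encoder", encoder), ("decoder", decoder)):
    dim = head_dim(cfg["hidden_size"], cfg["num_heads"])
    ratio = cfg["ff_size"] // cfg["hidden_size"]
    print(f"{name}: head_dim={dim}, ff_size/hidden_size={ratio}")
```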

In [13]:
config += """
testing:
    #load_model: "{model_dir}/best.ckpt"
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    #load_model: "{model_dir}/latest.ckpt"
    #reset_best_ckpt: False
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adamw"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 100
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 6
    updates: 550
    validation_freq: 30
    logging_freq: 5
    overwrite: True
    shuffle: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier_uniform"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.2
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)
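To get a feel for the `warmupinversesquareroot` schedule, the sketch below uses a common formulation: linear warmup to the peak learning rate, then decay proportional to the inverse square root of the step. JoeyNMT's own scheduler may differ in details, so `warmup_inv_sqrt_lr` is a hypothetical illustration that only reuses the `learning_rate`, `learning_rate_warmup`, and `learning_rate_min` values from the config. The last lines compute the tokens per update, assuming `batch_multiplier` accumulates gradients over that many batches.

```python
import math


def warmup_inv_sqrt_lr(step, peak_lr=0.0003, warmup=100, min_lr=1e-8):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup:
        rate = peak_lr * step / warmup
    else:
        rate = peak_lr * math.sqrt(warmup / step)
    return max(rate, min_lr)


for step in (1, 50, 100, 400, 550):
    print(step, f"{warmup_inv_sqrt_lr(step):.6f}")

# batch_size (tokens) x batch_multiplier from the training section.
tokens_per_update = 512 * 4
print("tokens per update:", tokens_per_update)
```

With only 550 updates and a warmup of 100 steps, most of the run happens in the decay phase.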

### Run training
⏳ The log reports the training progress. Look out for the example translations and the BLEU scores to get an impression of the current quality.

In [14]:
%%time
!python -m joeynmt train {data_dir}/config.yaml --skip-test

2024-04-02 13:17:35,507 - INFO - root - Hello! This is Joey-NMT (version 2.3.0).
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                           cfg.name : dyu_fr_transformer-sp
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                cfg.joeynmt_version : 2.3.0
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                      cfg.model_dir : /workspace/highwind-examples/translate-dyu-fr/saved_model/dyu_fr
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                       cfg.use_cuda : False
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                           cfg.fp16 : False
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                     cfg.data.train : ../data/dyu_fr
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                       cfg.data.dev : ../data/dyu_fr
2024-04-02 13:17:35,507 - INFO - joeynmt.config -                      cfg.data.test : ../data/dyu_fr
2024-04-02 13:17:35,507 - INFO - joeynmt.config -              cfg.data.d

In [15]:
# Point the saved config at the best checkpoint and at the files in model_dir
with (Path(model_dir) / "config.yaml").open('r') as f:
    config = f.read()
resume_config = config\
  .replace(f'#load_model: "{model_dir}/best.ckpt"',
           f'load_model: "{model_dir}/best.ckpt"')

resume_config = resume_config\
  .replace(f'model_file: "{data_dir}/sp.model"',
           f'model_file: "{model_dir}/sp.model"')

resume_config = resume_config\
  .replace(f'voc_file: "{data_dir}/vocab.txt"',
           f'voc_file: "{model_dir}/vocab.txt"')

with (Path(model_dir) / "config.yaml").open('w') as f:
    f.write(resume_config)
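The effect of these replacements can be checked on a small stand-in string. The sketch below is an illustration only: the `config` text is a shortened, made-up excerpt with the same paths, not the full file that JoeyNMT writes to `model_dir`.

```python
model_dir = "../saved_model/dyu_fr"
data_dir = "../data/dyu_fr"

# A shortened stand-in for the saved config, for illustration only.
config = f'''
model_file: "{data_dir}/sp.model"
voc_file: "{data_dir}/vocab.txt"
#load_model: "{model_dir}/best.ckpt"
#load_model: "{model_dir}/latest.ckpt"
'''

# old text -> new text, mirroring the three replace calls above
replacements = {
    f'#load_model: "{model_dir}/best.ckpt"': f'load_model: "{model_dir}/best.ckpt"',
    f'model_file: "{data_dir}/sp.model"': f'model_file: "{model_dir}/sp.model"',
    f'voc_file: "{data_dir}/vocab.txt"': f'voc_file: "{model_dir}/vocab.txt"',
}

resume_config = config
for old, new in replacements.items():
    resume_config = resume_config.replace(old, new)

print(resume_config)
```

Only the `best.ckpt` line is uncommented; the `latest.ckpt` line stays commented out because its text does not match any key.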

In [16]:
!cp {data_dir}/vocab.txt  {model_dir}
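# Requires Google Drive to be mounted (see the commented-out mount cell above); otherwise the copy fails.
!cp -R {model_dir} /content/drive/MyDrive/mt-dyu-fr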

cp: cannot create directory '/content/drive/MyDrive/mt-dyu-fr': No such file or directory


## (Optional) Upload trained model to Hugging Face

While this notebook runs, it saves several model checkpoints, vocabulary files, tokenizer models, and other artifacts. Not all of them are needed to use the model. Copy only the essential files to a folder called `lean_model`, which will be uploaded to Hugging Face. The essential files are:

- Best model checkpoint (e.g. `510.ckpt` -> rename to `best.ckpt`)
- JoeyNMT config file (`config.yaml`)
- Tokenizer (`sp.model`)
- Vocabulary file (`vocab.txt`)

Copy over these trained model files to a folder `lean_model` at `../saved_model/lean_model` before continuing.
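The copy step can be sketched with the standard library as follows. `build_lean_model` is a hypothetical helper, not part of JoeyNMT, and the checkpoint name passed in (for example `510.ckpt`) depends on your own training run.

```python
import shutil
from pathlib import Path


def build_lean_model(model_dir, lean_dir, best_ckpt_name):
    """Copy the files listed above into lean_dir and return the copied paths.

    best_ckpt_name is the checkpoint to keep (for example "510.ckpt");
    it is stored as "best.ckpt" in the lean folder.
    """
    model_dir, lean_dir = Path(model_dir), Path(lean_dir)
    lean_dir.mkdir(parents=True, exist_ok=True)
    # source file name -> name inside lean_dir
    wanted = {
        best_ckpt_name: "best.ckpt",
        "config.yaml": "config.yaml",
        "sp.model": "sp.model",
        "vocab.txt": "vocab.txt",
    }
    copied = []
    for src_name, dst_name in wanted.items():
        src = model_dir / src_name
        if not src.is_file():
            raise FileNotFoundError(f"missing required file: {src}")
        copied.append(Path(shutil.copy2(src, lean_dir / dst_name)))
    return copied


# Example call (paths and checkpoint name follow the text above):
# build_lean_model("../saved_model/dyu_fr", "../saved_model/lean_model", "510.ckpt")
```

The helper raises an error if a required file is missing, so an incomplete `lean_model` folder is noticed before the upload.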

> 💡 Remember to change the paths in the `config.yaml` file to point to paths in the serving container (e.g. `/app/saved_model`, check the Dockerfile).

In [9]:
# Remember to run `huggingface-cli login` before you run the code below
import os
from pathlib import Path

import joeynmt
import torch
from huggingface_hub import HfApi

api = HfApi()

In [23]:
HF_REPO_NAME = "MelioAI/dyu-fr-joeynmt"
lean_model_dir = "../saved_model/lean_model"

In [38]:
# Optionally add a model card (written to README.md in lean_model)
model_card = f"""---
language:
- dyu
- fr
- multilingual
tags:
- translation
- pytorch
model-index:
- name: MelioAI/dyu-fr-joeynmt
  results: []
---

# MelioAI/dyu-fr-joeynmt

An example of a machine translation model that translates Dyula to French using the [JoeyNMT framework](https://github.com/joeynmt/joeynmt).

The following example is based on [this GitHub repo](https://github.com/data354/koumakanMT-challenge) that was kindly created by [data354](https://data354.com/en/).

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Usage

### Load and use for inference

```python
import torch
from joeynmt.config import load_config, parse_global_args
from joeynmt.prediction import predict, prepare
from huggingface_hub import snapshot_download

# Download model
snapshot_download(
    repo_id="{HF_REPO_NAME}",
    local_dir="/path/to/save/locally"
)

# Define model interface
class JoeyNMTModel:
    '''
    JoeyNMTModel loads a JoeyNMT model for inference.

    :param config_path: Path to YAML config file
    :param n_best: return this many hypotheses, <= beam (currently only 1)
    '''
    def __init__(self, config_path: str, n_best: int = 1):
        seed = 42
        torch.manual_seed(seed)
        cfg = load_config(config_path)
        args = parse_global_args(cfg, rank=0, mode="translate")
        self.args = args._replace(test=args.test._replace(n_best=n_best))
        # build model
        self.model, _, _, self.test_data = prepare(self.args, rank=0, mode="translate")

    def _translate_data(self):
        _, _, hypotheses, trg_tokens, trg_scores, _ = predict(
            model=self.model,
            data=self.test_data,
            compute_loss=False,
            device=self.args.device,
            rank=0,
            n_gpu=self.args.n_gpu,
            normalization="none",
            num_workers=self.args.num_workers,
            args=self.args.test,
            autocast=self.args.autocast,
        )
        return hypotheses, trg_tokens, trg_scores

    def translate(self, sentence) -> list:
        '''
        Translate the given sentence.

        :param sentence: Sentence to be translated
        :return:
        - translations: (list of str) possible translations of the sentence.
        '''
        self.test_data.set_item(sentence.strip())
        translations, _, _ = self._translate_data()
        assert len(translations) == len(self.test_data) * self.args.test.n_best
        self.test_data.reset_cache()
        return translations

# Load model
config_path = "/path/to/lean_model/config_local.yaml" # Change this to the path to your model config file
model = JoeyNMTModel(config_path=config_path, n_best=1)

# Translate
model.translate(sentence="i tɔgɔ bi cogodɔ")
```

## Training procedure

### Training hyperparameters

More information needed

### Training results

More information needed

### Framework versions

- JoeyNMT {joeynmt.__version__}
- Torch {torch.__version__}

"""
with (Path(lean_model_dir) / "README.md").open('w') as f:
    f.write(model_card)

In [39]:
# List files in the model directory (lean_model)
files = []
for filename in os.listdir(lean_model_dir):
    filepath = os.path.join(lean_model_dir, filename)
    if os.path.isfile(filepath):
        files.append(Path(filepath))

files

[PosixPath('../saved_model/lean_model/config_local.yaml'),
 PosixPath('../saved_model/lean_model/config.yaml'),
 PosixPath('../saved_model/lean_model/README.md'),
 PosixPath('../saved_model/lean_model/sp.model'),
 PosixPath('../saved_model/lean_model/vocab.txt'),
 PosixPath('../saved_model/lean_model/best.ckpt')]

In [40]:
for file_path in files:
    print(file_path.name)
    print(str(file_path))
    api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=file_path.name,
        repo_id=HF_REPO_NAME,
    )

config_local.yaml
../saved_model/lean_model/config_local.yaml
config.yaml
../saved_model/lean_model/config.yaml
README.md
../saved_model/lean_model/README.md
sp.model
../saved_model/lean_model/sp.model
vocab.txt
../saved_model/lean_model/vocab.txt
best.ckpt
../saved_model/lean_model/best.ckpt


Author: [Data354](https://data354.com/en/)

Thanks to Julia & Co for making [JoeyNMT transformers](https://github.com/joeynmt/joeynmt) so simple to use.
