# Pymarian Quick Start

Last updated: 2024-10-18

>This notebook accompanies the Pymarian demo paper at EMNLP 2024 (Demo track).
>* OpenReview link: https://openreview.net/forum?id=3BKsyqIieh
>* arXiv: https://arxiv.org/abs/2408.11853
>* Benchmarking Scripts: https://github.com/thammegowda/017-pymarian


This notebook demonstrates how to work with three Pymarian APIs:
* Evaluator
* Translator
* Trainer

---
## Setup

In [None]:
# Install pymarian
!pip install pymarian==1.12.31

Collecting pymarian==1.12.31
  Downloading pymarian-1.12.31-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.7 kB)
Collecting portalocker (from pymarian==1.12.31)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Downloading pymarian-1.12.31-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (602.7 MB)
Downloading portalocker-2.10.1-py3-none-any.whl (18 kB)
Installing collected packages: portalocker, pymarian
Successfully installed portalocker-2.10.1 pymarian-1.12.31


In [None]:
!pymarian-eval --version

pymarian-eval 1.12.31


In [None]:
import sys
import urllib.request  # the submodule must be imported explicitly for urlopen
import tarfile
from pathlib import Path
from huggingface_hub import hf_hub_download as hf_get
import pymarian
print(f'Python {sys.version}; pymarian {pymarian.__version__}')

Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]; pymarian 1.12.31


---
## Evaluator


NOTE: Run `huggingface-cli login` first to access gated models such as cometkiwi22 or newer.

In [None]:
from pymarian import Evaluator

model = hf_get("marian-nmt/chrfoid-wmt23",
    filename="checkpoints/marian.model.bin")
vocab = hf_get("marian-nmt/chrfoid-wmt23",
    filename="vocab.spm")
evaluator = Evaluator.new(
    model_file=Path(model), vocab_file=Path(vocab),
    like='comet-qe', quiet=True, fp16=False,
    cpu_threads=4)

srcs = ['Hello', 'Howdy']
mts = ['Howdy', 'Hello']
# One tab-separated "source<TAB>translation" pair per line
lines = [f'{s}\t{t}' for s, t in zip(srcs, mts)]
scores = evaluator.evaluate(lines)
for score in scores:
    print(f'{score:.4f}')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


marian.model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

vocab.spm:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

0.0165
0.0000
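
The evaluator cell above feeds one tab-separated source/translation pair per line. A minimal sketch of a helper that builds those lines a bit more defensively is shown below. `make_tsv_lines` is a hypothetical name, not part of pymarian, and collapsing embedded whitespace is an assumption about what is acceptable for your data.

```python
from typing import Iterable, List


def make_tsv_lines(srcs: Iterable[str], mts: Iterable[str]) -> List[str]:
    """Join each source/translation pair with a tab, one pair per line.

    Hypothetical helper (not a pymarian API).
    """
    srcs, mts = list(srcs), list(mts)
    if len(srcs) != len(mts):
        raise ValueError(f'{len(srcs)} sources vs {len(mts)} translations')
    lines = []
    for s, t in zip(srcs, mts):
        # A tab or newline inside a field would break the one-pair-per-line
        # format, so collapse every whitespace run to a single space.
        s = ' '.join(s.split())
        t = ' '.join(t.split())
        lines.append(f'{s}\t{t}')
    return lines
```

With this in place, `lines = make_tsv_lines(srcs, mts)` could replace the list comprehension in the cell above, and `zip(lines, scores)` pairs each input with its score.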


## Translator


In [None]:
from pymarian import Translator

model_url = "http://data.statmt.org/romang/marian-regression-tests/models/wngt19.tar.gz"
model_dir = Path.home() / 'tmp' / 'marian-models'
model_file = str(model_dir / 'wngt19' / 'model.base.npz')
vocab_file = str(model_dir / 'wngt19' / 'en-de.spm')

if not Path(model_file).exists():
    print(f"Downloading {model_url} and extracting to {model_dir}")
    # Use a context manager so the HTTP response is closed after extraction
    with urllib.request.urlopen(model_url) as response:
        with tarfile.open(fileobj=response, mode="r|gz") as tar:
            tar.extractall(path=model_dir)
    print("Downloaded and extracted model files")

translator = Translator(models=model_file, vocabs=[vocab_file, vocab_file], quiet=True)
hyp = translator.translate("Hello. Good morning.")
print(hyp)

Downloading http://data.statmt.org/romang/marian-regression-tests/models/wngt19.tar.gz and extracting to /root/tmp/marian-models
Downloaded and extracted model files
Hallo , Guten Morgen .
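
The Translator and Trainer cells in this notebook both stream a `.tar.gz` archive from a URL and unpack it only if an expected file is missing. That pattern can be sketched as one reusable function. `download_and_extract` and its parameter names are hypothetical, introduced here for illustration only.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, dest_dir: Path, marker: Path) -> bool:
    """Stream a .tar.gz from `url` into `dest_dir` unless `marker` exists.

    Hypothetical helper. Returns True if a download happened,
    False if it was skipped because `marker` was already present.
    """
    if marker.exists():
        return False
    dest_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        # "r|gz" treats the response as a forward-only gzip stream, so the
        # archive is unpacked while downloading instead of being saved first.
        with tarfile.open(fileobj=response, mode="r|gz") as tar:
            tar.extractall(path=dest_dir)
    return True
```

For the cell above, the call would look like `download_and_extract(model_url, model_dir, Path(model_file))`. Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.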


## Trainer

In [None]:
data_url = "https://textmt.blob.core.windows.net/www/data/marian-tests-data.tgz"
data_dir = Path.home() / 'tmp' / 'marian-tests-data'
data_dir.mkdir(parents=True, exist_ok=True)
vocab_file = data_dir / 'deu-eng/vocab.8k.spm'
train_src = data_dir / 'deu-eng/sample.5k.deu'
train_tgt = train_src.with_suffix('.eng')

if not train_tgt.exists():
    print(f"Downloading data package... to {data_dir}")
    with urllib.request.urlopen(data_url) as response:
        with tarfile.open(fileobj=response, mode="r|gz") as tar:
            tar.extractall(path=data_dir.parent)
    print("Downloaded the data package")

!head -n4 {train_src} {train_tgt}

vocab_file = str(vocab_file)
train_src = str(train_src)
train_tgt = str(train_tgt)

==> /root/tmp/marian-tests-data/deu-eng/sample.5k.deu <==
Steigt Gold auf 10.000 Dollar?
SAN FRANCISCO – Es war noch nie leicht, ein rationales Gespräch über den Wert von Gold zu führen.
In letzter Zeit allerdings ist dies schwieriger denn je, ist doch der Goldpreis im letzten Jahrzehnt um über 300 Prozent angestiegen.
Erst letzten Dezember verfassten meine Kollegen Martin Feldstein und Nouriel Roubini Kommentare, in denen sie mutig die vorherrschende optimistische Marktstimmung hinterfragten und sehr überlegt auf die Risiken des Goldes  hinwiesen.

==> /root/tmp/marian-tests-data/deu-eng/sample.5k.eng <==
$10,000 Gold?
SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
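
The `head` output above suggests the two files are line-aligned. Before training, it can be worth confirming that the source and target files really have the same number of lines. The following is a minimal sketch; `count_lines` and `check_parallel` are hypothetical helper names, not pymarian functions.

```python
def count_lines(path) -> int:
    """Count lines without loading the whole file into memory."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path) -> int:
    """Return the shared line count, or raise if the two files differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f'{src_path} has {n_src} lines but {tgt_path} has {n_tgt}')
    return n_src
```

Calling `check_parallel(train_src, train_tgt)` after the download step would fail early on a truncated or mismatched corpus rather than partway through training.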


In [None]:
from pymarian import Trainer
args = {
    'type': 'transformer',
    'dim_emb': 512,
    'enc_depth': 6,
    'dec_depth': 6,
    'tied_embeddings_all': True,
    'transformer_heads': 8,
    'transformer_dim_ffn': 2048,
    'transformer_ffn_activation': 'relu',
    'transformer_dropout': 0.1,
    'cost_type': 'ce-mean-words',
    'max_length': 80,
    'mini_batch_fit': False,
    'maxi_batch': 256,
    'optimizer_params': [0.9, 0.98, 1e-09],
    'sync_sgd': True,
    'learn_rate': 0.0003,
    'lr_decay_inv_sqrt': [16000],
    'lr_warmup': 16000,
    'label_smoothing': 0.1,
    'clip_norm': 0,
    'exponential_smoothing': 0.0001,
    'early_stopping': 8,
    'keep_best': True,
    'beam_size': 2,
    'normalize': 1,
    'valid_metrics': ['ce-mean-words', 'bleu', 'perplexity'],
    'valid_mini_batch': 16,
    'mini_batch': '1Mt',
    'after': '100e',  # stop after 100 epochs
    'valid_freq': '100Mt',  # validate every 100M target tokens
    'disp_freq': '50kt',
    'disp_first': 10,
    'save_freq': '100Mt',
    'vocabs': [vocab_file, vocab_file],
    'train_sets': [train_src, train_tgt],
    'quiet': False,
}

args['model'] = f'{data_dir.parent}/model.npz'

trainer = Trainer(**args)
trainer.train()

# Careful: notebooks keep objects alive in memory.
# You can't create a second trainer while the first one is still holding GPU RAM.
del trainer
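
Values such as `'100e'`, `'100Mt'`, and `'50kt'` in the arguments above combine a number, an optional multiplier, and a unit. The sketch below spells them out, assuming the suffix convention of Marian's command-line scheduling options (`e` = epochs, `u` = updates, `t` = target labels; `k`, `M`, `G` as multipliers). `describe_schedule` is a hypothetical helper and does not reproduce Marian's own parser; it only handles values that carry an explicit unit.

```python
import re

UNITS = {'e': 'epochs', 'u': 'updates', 't': 'target labels'}
MULTIPLIERS = {'': 1, 'k': 10**3, 'm': 10**6, 'g': 10**9}


def describe_schedule(value: str) -> str:
    """Spell out a value like '100Mt' as '100,000,000 target labels'.

    Illustrative only: assumes number + optional k/M/G + unit (e, u or t).
    """
    m = re.fullmatch(r'(\d+(?:\.\d+)?)([kKmMgG]?)([eut])', value)
    if m is None:
        raise ValueError(f'unrecognized schedule value: {value!r}')
    number, mult, unit = m.groups()
    amount = float(number) * MULTIPLIERS[mult.lower()]
    return f'{amount:,.0f} {UNITS[unit]}'


for key in ('100e', '100Mt', '50kt'):
    print(key, '->', describe_schedule(key))
```

Under that assumption, `'after': '100e'` stops training after 100 epochs, and `'disp_freq': '50kt'` reports progress every 50,000 target labels.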

In [None]:
!ls {data_dir.parent}

marian-models	   model.iter200.npz.decoder.yml  model.npz		   model.npz.progress.yml
marian-tests-data  model.iter400.npz		  model.npz.decoder.yml    model.npz.yml
model.iter200.npz  model.iter400.npz.decoder.yml  model.npz.optimizer.npz
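
The listing above shows intermediate checkpoints named `model.iterN.npz` next to the final `model.npz`. A minimal sketch for picking the most recent intermediate checkpoint follows. `latest_checkpoint` is a hypothetical helper, and the file naming pattern is taken from this listing rather than from a documented guarantee.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(model_dir: Path, stem: str = 'model') -> Optional[Path]:
    """Return the '<stem>.iterN.npz' file with the largest N, or None."""
    pattern = re.compile(rf'{re.escape(stem)}\.iter(\d+)\.npz')
    best_iter, best_path = -1, None
    for path in model_dir.iterdir():
        # fullmatch skips companions such as 'model.iter200.npz.decoder.yml'
        m = pattern.fullmatch(path.name)
        if m and int(m.group(1)) > best_iter:
            best_iter, best_path = int(m.group(1)), path
    return best_path
```

For the directory listed above, `latest_checkpoint(data_dir.parent)` would return the path ending in `model.iter400.npz`.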
