# NMT-keras

Neural Machine Translation with Keras.

Library documentation: [nmt-keras.readthedocs.io](http://nmt-keras.readthedocs.io)

Tutorial notebook (Colab): https://colab.research.google.com/github/lvapeab/nmt-keras/blob/master/examples/tutorial.ipynb

In [2]:
!pip install --upgrade pip
# Clone over HTTPS so that no SSH key is required (for example on Colab).
!git clone https://github.com/lvapeab/nmt-keras.git
import os
os.chdir('nmt-keras')
# The package is fragile: it needs exactly matching versions of keras and numpy.
# Uncomment the next two lines to remove conflicting versions first.
# !pip uninstall -y keras
# !pip uninstall -y numpy
!pip install -e .

Cloning into 'nmt-keras'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 4746 (delta 5), reused 9 (delta 4), pack-reused 4730
Receiving objects: 100% (4746/4746), 5.70 MiB | 570.00 KiB/s, done.
Resolving deltas: 100% (3221/3221), done.
Obtaining file:///Users/tianqing/Downloads/course/COMP4901-2020/tutorial%209/nmt-keras/nmt-keras


Installing collected packages: nmt-keras
  Attempting uninstall: nmt-keras
    Found existing installation: nmt-keras 0.6
    Uninstalling nmt-keras-0.6:
      Successfully uninstalled nmt-keras-0.6
  Running setup.py develop for nmt-keras
Successfully installed nmt-keras


### 1. Building a Dataset model
First, we create a [Dataset](https://github.com/MarcBS/multimodal_keras_wrapper/blob/master/keras_wrapper/dataset.py) object (from the [Multimodal Keras Wrapper](https://github.com/MarcBS/multimodal_keras_wrapper) library). It holds the training and validation text together with the vocabularies built from it.

In [4]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)

Using TensorFlow backend.


In [31]:
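# Note the execution counts: this cell was run after the setOutput cell shown next.
# build_vocabulary='target_text' reuses the vocabulary created there, so run that cell first.
ds.setInput('examples/EuTrans/training.en',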
            'train',
            type='text',
            id='state_below',
            required=False,
            tokenization='tokenize_none',
            pad_on_batch=True,
            build_vocabulary='target_text',
            offset=1,
            fill='end',
            max_text_len=30,
            max_words=30000)
ds.setInput(None,
            'val',
            type='ghost',
            id='state_below',
            required=False)

[18/11/2020 10:52:35] 	Applying tokenization function: "tokenize_none".
[18/11/2020 10:52:35] 	Reusing vocabulary named "target_text" for data with data_id "state_below".
[18/11/2020 10:52:35] Loaded "train" set inputs of data_type "text" with data_id "state_below" and length 9900.
[18/11/2020 10:52:35] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 100.


In [5]:
ds.setOutput('examples/EuTrans/training.en',
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True,
             pad_on_batch=True,
             sample_weights=True,
             max_text_len=30,
             max_words=30000,
             min_occ=0)

ds.setOutput('examples/EuTrans/dev.en',
             'val',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)

[18/11/2020 10:42:41] 	Applying tokenization function: "tokenize_none".
[18/11/2020 10:42:41] Creating vocabulary for data with data_id 'target_text'.
[18/11/2020 10:42:41] 	 Total: 513 unique words in 9900 sentences with a total of 98304 words.
[18/11/2020 10:42:41] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[18/11/2020 10:42:41] Loaded "train" set outputs of data_type "text" with data_id "target_text" and length 9900.
[18/11/2020 10:42:41] 	Applying tokenization function: "tokenize_none".
[18/11/2020 10:42:41] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 100.


In [21]:
ds.setInput('examples/EuTrans/training.es',
            'train',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            build_vocabulary=True,
            fill='end',
            max_text_len=30,
            max_words=30000,
            min_occ=0)
ds.setInput('examples/EuTrans/dev.es',
            'val',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=30,
            min_occ=0)

[18/11/2020 10:49:50] 	Applying tokenization function: "tokenize_none".
[18/11/2020 10:49:50] Creating vocabulary for data with data_id 'source_text'.
[18/11/2020 10:49:50] 	 Total: 686 unique words in 9900 sentences with a total of 96172 words.
[18/11/2020 10:49:50] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[18/11/2020 10:49:50] Loaded "train" set inputs of data_type "text" with data_id "source_text" and length 9900.
[18/11/2020 10:49:50] 	Applying tokenization function: "tokenize_none".
[18/11/2020 10:49:50] Loaded "val" set inputs of data_type "text" with data_id "source_text" and length 100.
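
The `pad_on_batch`, `fill` and `max_text_len` arguments used above control how sentences of different lengths are turned into a rectangular batch. The following is a minimal sketch of that idea, not the wrapper's actual code. The helper `pad_batch` is hypothetical, and it assumes that `fill='end'` places the padding after the words and that longer sequences are cut at `max_text_len`.

```python
def pad_batch(batch, pad_id=0, max_text_len=30, fill='end'):
    """Cut each sequence at max_text_len, then pad to the longest one in the batch.

    pad_id, and the reading of fill='end' as "padding goes after the words",
    are assumptions made for this illustration.
    """
    truncated = [seq[:max_text_len] for seq in batch]
    width = max(len(seq) for seq in truncated)
    padded = []
    for seq in truncated:
        padding = [pad_id] * (width - len(seq))
        padded.append(seq + padding if fill == 'end' else padding + seq)
    return padded


# Three sentences already mapped to word indices (made-up values).
batch = [[5, 8, 3], [7, 2], [9, 4, 6, 1]]
padded = pad_batch(batch)
print(padded)  # [[5, 8, 3, 0], [7, 2, 0, 0], [9, 4, 6, 1]]
```

Padding per batch, rather than always to 30 tokens, keeps batches of short sentences small.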


In [19]:
len(ds.vocabulary['target_text']["words2idx"])

516
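
The log above reports 513 unique target words, while the vocabulary holds 516 entries. The difference of three is consistent with a few indices being reserved for special symbols. The sketch below shows how such a vocabulary could be built. The function `build_vocabulary` and the three symbol names are assumptions for illustration, not the wrapper's confirmed internals.

```python
from collections import Counter

# Assumed reserved symbols; the names and indices are illustrative only.
SPECIAL_TOKENS = ['<pad>', '<unk>', '<null>']


def build_vocabulary(sentences, max_words=0, min_occ=0):
    """Map words to integer indices, reserving the first indices for special tokens."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Keep words seen at least min_occ times, most frequent first.
    kept = [word for word, count in counts.most_common() if count >= min_occ]
    if max_words > 0:
        kept = kept[:max_words]
    words2idx = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    for word in kept:
        words2idx[word] = len(words2idx)
    return words2idx


toy_corpus = ['the room , please', 'the keys to the room']
vocab = build_vocabulary(toy_corpus, max_words=30000)
print(len(vocab))  # 6 distinct words + 3 special tokens = 9
```

With `max_words=30000` and only a few hundred distinct words, nothing is cut off, which matches the "covering 100.0% of the text" message in the log.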

In [22]:
for split, input_text_filename in zip(['train', 'val'],
                                      ['examples/EuTrans/training.es', 'examples/EuTrans/dev.es']):
    ds.setRawInput(input_text_filename,
                   split,
                   type='file-name',
                   id='raw_source_text',
                   overwrite_split=True)

[18/11/2020 10:50:26] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[18/11/2020 10:50:26] Loaded "val" set inputs of type "file-name" with id "raw_source_text".


In [34]:
ds.X_train['source_text'][0], ds.X_train['state_below'][0]

('¿ le importaría darnos las llaves de la habitación , por favor ?',
 'would you mind giving us the keys to the room , please ?')
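
The `state_below` input is the target sentence itself, registered with `offset=1`. During training the decoder receives the previous target word at each step (teacher forcing), so its input is the target delayed by one position. Here is a minimal sketch of that shift, assuming the vacated first position is filled with a placeholder index. Both `shift_right` and `null_id` are hypothetical names.

```python
def shift_right(target_ids, offset=1, null_id=2):
    """Return the decoder input: the target sequence delayed by `offset` positions.

    null_id is an assumed placeholder index for the positions that have no
    previous word; the real library may use a different symbol.
    """
    if offset <= 0:
        return list(target_ids)
    return [null_id] * offset + list(target_ids[:-offset])


target = [11, 12, 13, 14]           # word indices of one target sentence (made up)
state_below = shift_right(target)   # what the decoder sees at each step
print(state_below)  # [2, 11, 12, 13]
```

At step `t` the model reads `state_below[t]` and is trained to predict `target[t]`, so it never sees the word it has to produce.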

### 2. Model training 

Model training is covered in the original tutorial notebook: https://colab.research.google.com/github/lvapeab/nmt-keras/blob/master/examples/tutorial.ipynb

# OpenNMT-py(tf)

https://opennmt.net/

https://github.com/OpenNMT/OpenNMT-py

In [38]:
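# Assumes OpenNMT-py is already cloned into the current working directory.
# If it is not, move to the folder that should contain it and clone it first:
# os.chdir('/Users/tianqing/Downloads/course/COMP4901-2020/tutorial 9/')
# !pwd
# !git clone https://github.com/OpenNMT/OpenNMT-py.git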
os.chdir("OpenNMT-py/")
!python setup.py install

running install
running bdist_egg
running egg_info
creating OpenNMT_py.egg-info
writing OpenNMT_py.egg-info/PKG-INFO
writing dependency_links to OpenNMT_py.egg-info/dependency_links.txt
writing entry points to OpenNMT_py.egg-info/entry_points.txt
writing requirements to OpenNMT_py.egg-info/requires.txt
writing top-level names to OpenNMT_py.egg-info/top_level.txt
writing manifest file 'OpenNMT_py.egg-info/SOURCES.txt'
reading manifest file 'OpenNMT_py.egg-info/SOURCES.txt'
writing manifest file 'OpenNMT_py.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.9-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/onmt
copying onmt/train_single.py -> build/lib/onmt
copying onmt/model_builder.py -> build/lib/onmt
copying onmt/constants.py -> build/lib/onmt
copying onmt/__init__.py -> build/lib/onmt
copying onmt/opts.py -> build/lib/onmt
copying onmt/trainer.py -> build/lib/onmt
creating build/lib/onmt/bin
copying onmt/bin/tr

byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/bin/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/bin/average_models.py to average_models.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/bin/train.py to train.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/bin/release_model.py to release_model.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/translate/penalties.py to penalties.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/translate/translation_server.py to translation_server.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/translate/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/translate/beam_search.py to beam_search.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/onmt/translate/translation.py to translation.cpython-36.pyc
byte-compiling build/bdist.macosx-10.9

onmt.transforms.__pycache__.__init__.cpython-36: module references __file__
creating dist
creating 'dist/OpenNMT_py-2.0.0rc2-py3.6.egg' and adding 'build/bdist.macosx-10.9-x86_64/egg' to it
removing 'build/bdist.macosx-10.9-x86_64/egg' (and everything under it)
Processing OpenNMT_py-2.0.0rc2-py3.6.egg
creating /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages/OpenNMT_py-2.0.0rc2-py3.6.egg
Extracting OpenNMT_py-2.0.0rc2-py3.6.egg to /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages
Adding OpenNMT-py 2.0.0rc2 to easy-install.pth file
Installing onmt_average_models script to /Users/tianqing/anaconda3/envs/nmt-keras-py36/bin
Installing onmt_build_vocab script to /Users/tianqing/anaconda3/envs/nmt-keras-py36/bin
Installing onmt_release_model script to /Users/tianqing/anaconda3/envs/nmt-keras-py36/bin
Installing onmt_server script to /Users/tianqing/anaconda3/envs/nmt-keras-py36/bin
Installing onmt_train script to /Users/tianqing/anaconda3/envs

Best match: urllib3 1.26.2
Processing urllib3-1.26.2-py2.py3-none-any.whl
Installing urllib3-1.26.2-py2.py3-none-any.whl to /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages
Adding urllib3 1.26.2 to easy-install.pth file

Installed /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages/urllib3-1.26.2-py3.6.egg
Searching for idna<3,>=2.5
Reading https://pypi.org/simple/idna/
Downloading https://files.pythonhosted.org/packages/a2/38/928ddce2273eaa564f6f50de919327bf3a00f091b5baba8dfa9460f3a8a8/idna-2.10-py2.py3-none-any.whl#sha256=b97d804b1e9b523befed77c48dacec60e6dcb0b5391d57af6a65a312a90648c0
Best match: idna 2.10
Processing idna-2.10-py2.py3-none-any.whl
Installing idna-2.10-py2.py3-none-any.whl to /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages
Adding idna 2.10 to easy-install.pth file

Installed /Users/tianqing/anaconda3/envs/nmt-keras-py36/lib/python3.6/site-packages/idna-2.10-py3.6.egg
Searching for chardet<4,>=3.

In [48]:
!head -n 1 data/src-train.txt
!head -n 1 data/tgt-train.txt

It is not acceptable that , with the help of the national bureaucracies , Parliament &apos;s legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Es geht nicht an , dass über Ausführungsbestimmungen , deren Inhalt , Zweck und Ausmaß vorher nicht bestimmt ist , zusammen mit den nationalen Bürokratien das Gesetzgebungsrecht des Europäischen Parlaments ausgehebelt wird .


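In [76]:
# Readable layout of the config written by the one-line command below.
# It is kept inside a string so that it is not executed.
"""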
!echo -e "save_data: toy-ende/run/example\n
src_vocab: toy-ende/run/example.vocab.src\n
tgt_vocab: toy-ende/run/example.vocab.tgt\n
overwrite: False\n
data:\n
    corpus_1:\n
        path_src: toy-ende/src-train.txt\n
        path_tgt: toy-ende/tgt-train.txt\n
    valid:\n
        path_src: toy-ende/src-val.txt\n
        path_tgt: toy-ende/tgt-val.txt" > toy-ende/vocab.yml
"""
!echo -e "save_data: toy-ende/run/example\nsrc_vocab: toy-ende/run/example.vocab.src\ntgt_vocab: toy-ende/run/example.vocab.tgt\noverwrite: False\ndata:\n    corpus_1:\n        path_src: toy-ende/src-train.txt\n        path_tgt: toy-ende/tgt-train.txt\n    valid:\n        path_src: toy-ende/src-val.txt\n        path_tgt: toy-ende/tgt-val.txt" > toy-ende/vocab.yml
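
Writing YAML through `echo -e` works, but the escaped one-liner is hard to read and easy to break, because YAML nesting depends on exact indentation. As an alternative sketch, the same file can be written from Python using only the standard library. The variable names `vocab_config` and `config_path` are illustrative.

```python
from pathlib import Path
from textwrap import dedent

# Same keys and values as the echo command above; dedent() strips the common
# leading indentation so the nesting in the written file stays intact.
vocab_config = dedent("""\
    save_data: toy-ende/run/example
    src_vocab: toy-ende/run/example.vocab.src
    tgt_vocab: toy-ende/run/example.vocab.tgt
    overwrite: False
    data:
        corpus_1:
            path_src: toy-ende/src-train.txt
            path_tgt: toy-ende/tgt-train.txt
        valid:
            path_src: toy-ende/src-val.txt
            path_tgt: toy-ende/tgt-val.txt
    """)

config_path = Path('toy-ende') / 'vocab.yml'
config_path.parent.mkdir(parents=True, exist_ok=True)  # create toy-ende/ if missing
config_path.write_text(vocab_config)
print(config_path.read_text())
```

Printing the file back is a quick way to confirm the indentation before passing it to `onmt_build_vocab`.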

In [77]:
!onmt_build_vocab -config toy-ende/vocab.yml -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2020-11-18 12:24:44,142 INFO] Counter vocab from 10000 samples.
[2020-11-18 12:24:44,142 INFO] Build vocab on 10000 transformed examples/corpus.
[2020-11-18 12:24:44,154 INFO] corpus_1's transforms: TransformPipe()
[2020-11-18 12:24:44,155 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2020-11-18 12:24:44,467 INFO] Counters src:24995
[2020-11-18 12:24:44,468 INFO] Counters tgt:35816


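In [94]:
# Readable layout of the training options written below (kept in a string, not executed).
# The actual command writes them to train_tmp.yml and then appends them to vocab.yml.
"""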
!echo -e "src_vocab: toy-ende/run/example.vocab.src\n\
tgt_vocab: toy-ende/run/example.vocab.tgt\n\
save_model: toy-ende/run/model\n\
save_checkpoint_steps: 500\n\
train_steps: 1000\n\
valid_steps: 500" > toy-ende/train.yml
"""
# !cat toy-ende/vocab.yml

!echo -e "src_vocab: toy-ende/run/example.vocab.src\ntgt_vocab: toy-ende/run/example.vocab.tgt\nsave_model: toy-ende/run/model\nsave_checkpoint_steps: 500\ntrain_steps: 1000\nvalid_steps: 500" > toy-ende/train_tmp.yml

!cat toy-ende/vocab.yml toy-ende/train_tmp.yml > toy-ende/train.yml

In [95]:
!onmt_train -config toy-ende/train.yml

[2020-11-18 12:39:08,132 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2020-11-18 12:39:08,133 INFO] Missing transforms field for valid data, set to default: [].
[2020-11-18 12:39:08,133 INFO] Parsed 2 corpora from -data.
[2020-11-18 12:39:08,133 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2020-11-18 12:39:08,133 INFO] Loading vocab from text file...
[2020-11-18 12:39:08,133 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2020-11-18 12:39:08,180 INFO] Loaded src vocab has 24995 tokens.
[2020-11-18 12:39:08,191 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2020-11-18 12:39:08,284 INFO] Loaded tgt vocab has 35816 tokens.
[2020-11-18 12:39:08,301 INFO] Building fields with vocab in counters...
[2020-11-18 12:39:08,353 INFO]  * tgt vocab size: 35820.
[2020-11-18 12:39:08,386 INFO]  * src vocab size: 24997.
[2020-11-18 12:39:08,387 INFO]  * src vocab size = 24997
[2020-11-18 12:39:08,387 INFO]  * tgt

# HuggingFace Transformers

https://huggingface.co/transformers/master/model_doc/bert.html

In [None]:
!pip install transformers
!pip install torch

In [None]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size); hidden_size is 768 for bert-base-uncased.
last_hidden_states = outputs.last_hidden_state