byt5 unicode implementation (NVIDIA#2365)
* Audio Norm (NVIDIA#2285)

* add jenkins test, refactoring

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix new test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add serial to the default normalizer, add tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* manifest test added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* expose more params, new test cases

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins, serial clean, exclude range from cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* addressed review comments

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix decimal in measure

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* move serial in cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* clean up

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update for SH zero -> oh

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* change n_tagger default

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bumping version to 1.0.1

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add check for numba regardless of device

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* upper bound for webdataset

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Correct Dockerfile

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update readmes

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update README (NVIDIA#2332)

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* ddp translate GPU allocation fix (NVIDIA#2312)

* fixed branch in IR tutorial

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* ddp translate GPU allocation fix

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* map_location instead of set_device

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Shallow fusion (NVIDIA#2315)

* fixed branch in IR tutorial

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* shallow fusion init commit

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* debug info removed

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* [BUGFIX] Add upper bound to hydra for 1.0.x (NVIDIA#2337)

* upper bound hydra

Signed-off-by: ericharper <complex451@gmail.com>

* upper bound hydra

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update version number

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update package version

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* sparrowhawk tests + punctuation post processing for pynini TN (NVIDIA#2320)

* add jenkins test, refactoring

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix new test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add serial to the default normalizer, add tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* manifest test added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* expose more params, new test cases

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins, serial clean, exclude range from cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* addressed review comments

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix decimal in measure

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* move serial in cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* sh tests init

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* sparrowhawk container tests support added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add post process to normalize.py, update tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* remove duplication

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update notebooks to 1.0.2 release (NVIDIA#2338)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update ranges for omegaconf and hydra (NVIDIA#2336)

* Update ranges

Signed-off-by: smajumdar <titu1994@gmail.com>

* Updates for Hydra and OmegaConf updates

Signed-off-by: smajumdar <titu1994@gmail.com>

* Style fixes

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct tests and revert patch for model utils

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct docstring

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert unnecessary change

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert unnecessary change

Signed-off-by: smajumdar <titu1994@gmail.com>

* Guard scheduler for None

Signed-off-by: smajumdar <titu1994@gmail.com>

* default to 0.0 if bpe_dropout is None

Signed-off-by: ericharper <complex451@gmail.com>

* Correctly log class that was restored

Signed-off-by: smajumdar <titu1994@gmail.com>

* Root patch *bpe_dropout

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update FastPitch Export (NVIDIA#2355)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* byt5 unicode implementation, first cut

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* add bytelevel tokenizer

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update out_dir to not collide (NVIDIA#2358)

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update container version to 21.05 (NVIDIA#2309)

* Update container version

Signed-off-by: smajumdar <titu1994@gmail.com>

* Temporarily change export format of waveglow

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add conda update for numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update numba compat via global flag for strictness level `--relax_numba_compat`, remove pytorchlightning.metrics, refactor out numba utils to core, update tests

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct order of numba minimum verion, remove wrong flag from test

Signed-off-by: smajumdar <titu1994@gmail.com>

* Double test of cuda numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Double test of cuda numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Enable RNNT tests

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Text Normalization Update (NVIDIA#2356)

* upper cased date support

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update whitelist, change roman weights

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* docstrings, space fix, init file

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* lgtm

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fraction with measure class

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* address comment

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add ASR CTC tutorial on fine-tuning on another language (NVIDIA#2346)

* Add ASR CTC Language finetuning notebook

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add to documentation

Signed-off-by: smajumdar <titu1994@gmail.com>

* Improve documentation

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct name of the dataset

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Correct colab link to notebook (NVIDIA#2366)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* sgdqa update data directories for testing (NVIDIA#2323)

* sgdqa update data directories for testing

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix syntax

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* check if data dir exists

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding pretrained model

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Added documentation for export() (NVIDIA#2330)

* Added export document

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Addressed review comments

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update Citrinet model card info (NVIDIA#2369)

* Update model card info

Signed-off-by: smajumdar <titu1994@gmail.com>

* Cleanup Docs

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* [NMT] Model Parallel Megatron Encoders (NVIDIA#2238)

* add megatron encoder

Signed-off-by: ericharper <complex451@gmail.com>

* added megatron to get_nmt_tokenizer

Signed-off-by: ericharper <complex451@gmail.com>

* add vocab_size and hidden_size to megatron bert

Signed-off-by: ericharper <complex451@gmail.com>

* add megatron encoder module

Signed-off-by: ericharper <complex451@gmail.com>

* fixed horrible typo

Signed-off-by: ericharper <complex451@gmail.com>

* fix typo and add default

Signed-off-by: ericharper <complex451@gmail.com>

* updating nlp overrides for mp nmt

Signed-off-by: ericharper <complex451@gmail.com>

* move some logic back to nlpmodel from overrides

Signed-off-by: ericharper <complex451@gmail.com>

* add checkpoint_file property

Signed-off-by: ericharper <complex451@gmail.com>

* fix property

Signed-off-by: ericharper <complex451@gmail.com>

* num_tokentypes=0

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* find_unused_parameters=True

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* get instead of pop

Signed-off-by: ericharper <complex451@gmail.com>

* remove token type ids from megatron input example

Signed-off-by: ericharper <complex451@gmail.com>

* pop vocab_size

Signed-off-by: ericharper <complex451@gmail.com>

* fix checkpointing for model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* fix bug in non model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* convert cfg.trainer to dict

Signed-off-by: ericharper <complex451@gmail.com>

* make num_tokentypes configurable for nmt

Signed-off-by: ericharper <complex451@gmail.com>

* update checkpoint_file when using named megatron model in nemo

Signed-off-by: ericharper <complex451@gmail.com>

* make vocab_file configurable

Signed-off-by: ericharper <complex451@gmail.com>

* dataclass can't have mutable default

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* unused imports

Signed-off-by: ericharper <complex451@gmail.com>

* revert input example

Signed-off-by: ericharper <complex451@gmail.com>

* check that checkpoint version is not None

Signed-off-by: ericharper <complex451@gmail.com>

* add mp jenkins test

Signed-off-by: ericharper <complex451@gmail.com>

* update docstring

Signed-off-by: ericharper <complex451@gmail.com>

* add docs for pretrained encoders with nemo nmt

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add notebook with recommendations for 8 kHz speech (NVIDIA#2326)

* Added a notebook with best practices for telephony speech

* Added datasets detaiils

* Added training recommendations

* Emptied out cells with results

* Added tutorial to docs

Signed-off-by: jbalam <jbalam@nvidia.com>

* Addressed review comments

Signed-off-by: jbalam <jbalam@nvidia.com>

* Added a line to note original sampling rate of an4

Signed-off-by: jbalam <jbalam@nvidia.com>

* Made changes suggested in review

Signed-off-by: jbalam <jbalam@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add FastEmit support for RNNT Losses (NVIDIA#2374)

* Temp commit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial code for fastemit forward pass

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct return reg value

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial cpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct few impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update fastemit scaling

Signed-off-by: smajumdar <titu1994@gmail.com>

* Cleanup fastemit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Finalize FastEmit regularization PR

Signed-off-by: smajumdar <titu1994@gmail.com>

* Refactor code to support fastemit regularization

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* byt5 unicode implementation, first cut

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* add bytelevel tokenizer

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update styling

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* avoid circular import

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* fix bugs in hifigan code (NVIDIA#2392)

Signed-off-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update setup.py (NVIDIA#2394)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update bytelevel_tokenizer.py

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update bytelevel_tokenizer.py

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* typo

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* missed one

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bug fixes

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bytelevelprocessor is now generic.

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update checkpointing (NVIDIA#2396)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* woops, didnt merge jenkinsfile the right way

* add newline

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* undo changes to enja processor

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* processor selection decision fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* newline fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: mchrzanowski <mchrzanowski@nvidia.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com>
Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Co-authored-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Co-authored-by: root <root@dgx0026.nsv.rno1.nvmetal.net>
Co-authored-by: root <root@dgx0079.nsv.rno1.nvmetal.net>
17 people authored and mousebaiker committed Jul 8, 2021
1 parent 8986c31 commit c836de1
Showing 4 changed files with 111 additions and 19 deletions.
1 change: 1 addition & 0 deletions nemo/collections/common/tokenizers/__init__.py
@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelTokenizer
 from nemo.collections.common.tokenizers.char_tokenizer import CharTokenizer
 from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
 from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
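
One practical effect of this re-export, sketched below (an illustration, not part of the commit): the new class resolves from the tokenizers package root just like the other tokenizers.

from nemo.collections.common.tokenizers import ByteLevelTokenizer
import nemo.collections.common.tokenizers.bytelevel_tokenizers as blt

# Both paths name the same class thanks to the __init__.py re-export above.
assert ByteLevelTokenizer is blt.ByteLevelTokenizer
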
79 changes: 79 additions & 0 deletions nemo/collections/common/tokenizers/bytelevel_tokenizers.py
@@ -0,0 +1,79 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from pathlib import Path
from typing import List
from nemo.collections.common.tokenizers.tokenizer_spec import TokenizerSpec

__all__ = ['ByteLevelProcessor', 'ByteLevelTokenizer']


class ByteLevelProcessor:
    """
    A very basic tokenization and detokenization class for use with byte-level
    tokenization.
    """

    def detokenize(self, tokens: List[str]) -> str:
        return ' '.join(tokens)

    def tokenize(self, text) -> str:
        return text

    def normalize(self, text) -> str:
        return text


class ByteLevelTokenizer(TokenizerSpec):
    def __init__(self):
        self.vocab_size = 259
        self.special_tokens = [self.bos_id, self.eos_id, self.pad_id]

    # no distinction between tokens and ids.
    def text_to_tokens(self, text):
        return self.text_to_ids(text)

    def tokens_to_text(self, tokens):
        return self.ids_to_text(tokens)

    def text_to_ids(self, text):
        return list(text.encode('utf-8'))

    def ids_to_text(self, ids):
        # remove special tokens.
        ids = [x for x in ids if x < 256]
        return bytes(ids).decode('utf-8', errors='ignore').rstrip()

    def tokens_to_ids(self, tokens):
        return tokens

    def ids_to_tokens(self, ids):
        return ids

    @property
    def pad_id(self):
        return 256

    @property
    def bos_id(self):
        return 257

    @property
    def eos_id(self):
        return 258

    @property
    def unk_id(self):
        return 259  # unused
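
A quick sanity check of the scheme (an illustrative sketch, not part of the commit): IDs 0-255 are raw UTF-8 byte values, and pad/bos/eos occupy 256-258, which is where vocab_size = 259 comes from.

from nemo.collections.common.tokenizers import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

# Non-ASCII text round-trips through UTF-8 bytes ('é' encodes to two bytes, 0xC3 0xA9).
ids = tokenizer.text_to_ids('héllo')
assert ids == [104, 195, 169, 108, 108, 111]

# Special tokens sit above the byte range and are stripped by ids_to_text.
framed = [tokenizer.bos_id] + ids + [tokenizer.eos_id, tokenizer.pad_id]
assert tokenizer.ids_to_text(framed) == 'héllo'
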
42 changes: 25 additions & 17 deletions nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py
@@ -32,6 +32,7 @@
 from nemo.collections.common.losses import NLLLoss, SmoothedCrossEntropyLoss
 from nemo.collections.common.metrics import GlobalAverageLossMetric
 from nemo.collections.common.parts import transformer_weights_init
+from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelProcessor
 from nemo.collections.common.tokenizers.chinese_tokenizers import ChineseProcessor
 from nemo.collections.common.tokenizers.en_ja_tokenizers import EnJaProcessor
 from nemo.collections.common.tokenizers.moses_tokenizers import MosesProcessor
@@ -70,17 +71,20 @@ def __init__(self, cfg: MTEncDecModelConfig, trainer: Trainer = None):
         self.multilingual = cfg.get("multilingual", False)
         self.multilingual_ids = []
 
+        self.encoder_tokenizer_library = cfg.encoder_tokenizer.get('library', 'yttm')
+        self.decoder_tokenizer_library = cfg.decoder_tokenizer.get('library', 'yttm')
+
         # Instantiates tokenizers and register to be saved with NeMo Model archive
         # After this call, ther will be self.encoder_tokenizer and self.decoder_tokenizer
         # Which can convert between tokens and token_ids for SRC and TGT languages correspondingly.
         self.setup_enc_dec_tokenizers(
-            encoder_tokenizer_library=cfg.encoder_tokenizer.get('library', 'yttm'),
+            encoder_tokenizer_library=self.encoder_tokenizer_library,
             encoder_tokenizer_model=cfg.encoder_tokenizer.get('tokenizer_model'),
             encoder_bpe_dropout=cfg.encoder_tokenizer.get('bpe_dropout', 0.0)
             if cfg.encoder_tokenizer.get('bpe_dropout', 0.0) is not None
             else 0.0,
             encoder_model_name=cfg.encoder.get('model_name') if hasattr(cfg.encoder, 'model_name') else None,
-            decoder_tokenizer_library=cfg.decoder_tokenizer.get('library', 'yttm'),
+            decoder_tokenizer_library=self.decoder_tokenizer_library,
             decoder_tokenizer_model=cfg.decoder_tokenizer.tokenizer_model,
             decoder_bpe_dropout=cfg.decoder_tokenizer.get('bpe_dropout', 0.0)
             if cfg.decoder_tokenizer.get('bpe_dropout', 0.0) is not None
@@ -112,15 +116,13 @@ def __init__(self, cfg: MTEncDecModelConfig, trainer: Trainer = None):
            self.source_processor_list = []
            self.target_processor_list = []
            for src_lng, tgt_lng in zip(self.src_language, self.tgt_language):
-               src_prcsr, tgt_prscr = self.setup_pre_and_post_processing_utils(
-                   source_lang=src_lng, target_lang=tgt_lng
-               )
+               src_prcsr, tgt_prscr = self.setup_pre_and_post_processing_utils(src_lng, tgt_lng)
                self.source_processor_list.append(src_prcsr)
                self.target_processor_list.append(tgt_prscr)
 
        else:
            # After this call, the model will have self.source_processor and self.target_processor objects
-           self.setup_pre_and_post_processing_utils(source_lang=self.src_language, target_lang=self.tgt_language)
+           self.setup_pre_and_post_processing_utils(self.src_language, self.tgt_language)
            self.multilingual_ids = [None]
 
        # TODO: Why is this base constructor call so late in the game?
@@ -385,7 +387,7 @@ def setup_enc_dec_tokenizers(
         decoder_model_name=None,
     ):
 
-        supported_tokenizers = ['yttm', 'huggingface', 'sentencepiece', 'megatron']
+        supported_tokenizers = ['yttm', 'huggingface', 'sentencepiece', 'megatron', 'byte-level']
         if (
             encoder_tokenizer_library not in supported_tokenizers
             or decoder_tokenizer_library not in supported_tokenizers
@@ -661,18 +663,24 @@ def setup_pre_and_post_processing_utils(self, source_lang, target_lang):
         Creates source and target processor objects for input and output pre/post-processing.
         """
         self.source_processor, self.target_processor = None, None
-        if (source_lang == 'en' and target_lang == 'ja') or (source_lang == 'ja' and target_lang == 'en'):
+
+        if self.encoder_tokenizer_library == 'byte-level':
+            self.source_processor = ByteLevelProcessor()
+        elif (source_lang == 'en' and target_lang == 'ja') or (source_lang == 'ja' and target_lang == 'en'):
             self.source_processor = EnJaProcessor(source_lang)
+        elif source_lang == 'zh':
+            self.source_processor = ChineseProcessor()
+        elif source_lang is not None and source_lang not in ['ja', 'zh']:
+            self.source_processor = MosesProcessor(source_lang)
+
+        if self.decoder_tokenizer_library == 'byte-level':
+            self.target_processor = ByteLevelProcessor()
+        elif (source_lang == 'en' and target_lang == 'ja') or (source_lang == 'ja' and target_lang == 'en'):
             self.target_processor = EnJaProcessor(target_lang)
-        else:
-            if source_lang == 'zh':
-                self.source_processor = ChineseProcessor()
-            if target_lang == 'zh':
-                self.target_processor = ChineseProcessor()
-            if source_lang is not None and source_lang not in ['ja', 'zh']:
-                self.source_processor = MosesProcessor(source_lang)
-            if target_lang is not None and target_lang not in ['ja', 'zh']:
-                self.target_processor = MosesProcessor(target_lang)
+        elif target_lang == 'zh':
+            self.target_processor = ChineseProcessor()
+        elif target_lang is not None and target_lang not in ['ja', 'zh']:
+            self.target_processor = MosesProcessor(target_lang)
 
         return self.source_processor, self.target_processor

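The byte-level branches added above hand both sides a ByteLevelProcessor, which (per the new module earlier in this diff) is deliberately a pass-through: byte-level models consume raw text, so no Moses-style normalization should run. A small illustration (not part of the commit):

from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelProcessor

proc = ByteLevelProcessor()

# Pre-processing is the identity: raw text flows straight to the tokenizer.
assert proc.normalize('Grüße aus Köln') == 'Grüße aus Köln'
assert proc.tokenize('Grüße aus Köln') == 'Grüße aus Köln'

# Detokenization just rejoins whitespace-split tokens.
assert proc.detokenize(['Grüße', 'aus', 'Köln']) == 'Grüße aus Köln'
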
8 changes: 6 additions & 2 deletions nemo/collections/nlp/modules/common/tokenizer_utils.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,6 +18,7 @@
 from typing import Dict, List, Optional
 
 import nemo
+from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelTokenizer
 from nemo.collections.common.tokenizers.char_tokenizer import CharTokenizer
 from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
 from nemo.collections.common.tokenizers.word_tokenizer import WordTokenizer
@@ -138,12 +139,15 @@ def get_nmt_tokenizer(
         return nemo.collections.common.tokenizers.sentencepiece_tokenizer.SentencePieceTokenizer(
             model_path=tokenizer_model, special_tokens=special_tokens_dict
         )
+    elif library == 'byte-level':
+        logging.info(f'Using byte-level tokenization')
+        return ByteLevelTokenizer()
     elif library == 'megatron':
         logging.info(
             f'Getting Megatron tokenizer with pretrained model name: {model_name} and custom vocab file: {vocab_file}'
         )
         return get_tokenizer(tokenizer_name=model_name, vocab_file=vocab_file)
     else:
         raise NotImplementedError(
-            'Currently we only support "yttm", "huggingface", "megatron", and "sentencepiece" tokenizer library.'
+            'Currently we only support "yttm", "huggingface", "sentencepiece", "megatron", and "byte-level" tokenizer libraries.'
         )
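
With this dispatch in place, callers reach the byte-level tokenizer through the same factory as every other library. A minimal sketch (the hunk does not show get_nmt_tokenizer's full signature, so treating its remaining parameters as optional is an assumption):

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

# Assumption: tokenizer_model / model_name / vocab_file default to None and
# are not consulted on the 'byte-level' path shown above.
tokenizer = get_nmt_tokenizer(library='byte-level')
assert tokenizer.vocab_size == 259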
