
Tokenization with Factorized Subword Encoding

David Samuel and Lilja Øvrelid

University of Oslo
Language Technology Group

Installation

git clone https://github.com/ltgoslo/factorizer.git
cd factorizer
python3 setup.py install

Pretrained factorizer models

Language	Model file (size)
Arabic (`ar`)	`arabic.dawg` (191 MB)
Chinese (`zh`)	`chinese.dawg` (180 MB)
Czech (`cs`)	`czech.dawg` (158 MB)
English (`en`)	`english.dawg` (129 MB)
Norwegian (`no`)	`norwegian.dawg` (186 MB)
Scottish Gaelic (`gd`)	`gaelic.dawg` (187 MB)
Turkish (`tr`)	`turkish.dawg` (206 MB)

Usage

from factorizer import Factorizer


tokenizer = Factorizer("english.dawg")
sentence = "The echo of a distant time comes willowing across the sand, and everything is green and submarine."

encoding = tokenizer(sentence)

print(f"INPUT:    {sentence}")
print(f"SUBWORDS: {' '.join(encoding.tokens)}")
print(f"INDICES:  {' '.join(str(index) for index in encoding.ids)}")
print(f"DECODED:  {tokenizer.decode(encoding.ids)}")

This should output:

INPUT:    The echo of a distant time comes willowing across the sand, and everything is green and submarine.
SUBWORDS: ⸥The⸤ ⸥echo⸤ ⸥of⸤ ⸥a⸤ ⸥distant⸤ ⸥time⸤ ⸥comes⸤ ⸥wil lowing⸤ ⸥across⸤ ⸥the⸤ ⸥sand ,⸤ ⸥and⸤ ⸥everything⸤ ⸥is⸤ ⸥green⸤ ⸥and⸤ ⸥submarine .⸤
INDICES:  (52, 74, 62) (221, 21, 77) (135, 64, 137) (181, 45, 79) (248, 77, 122) (88, 92, 159) (124, 92, 64) (49, 151, 114) (79, 180, 104) (129, 186, 151) (52, 74, 219) (49, 127, 34) (35, 174, 39) (76, 101, 35) (32, 176, 191) (135, 209, 205) (44, 28, 242) (76, 101, 35) (13, 171, 144) (211, 41, 131)
DECODED:  The echo of a distant time comes willowing across the sand, and everything is green and submarine.
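
In the sample output, `⸥` and `⸤` appear to mark the start and end of each whitespace-delimited word. This reading is inferred from the example, not stated in the documentation. Under that assumption, a minimal sketch for regrouping subwords into words could look as follows (the helper name `group_into_words` is hypothetical and not part of the package):

```python
from typing import List

WORD_START, WORD_END = "⸥", "⸤"  # assumed meaning, inferred from the sample output


def group_into_words(tokens: List[str]) -> List[List[str]]:
    """Group subword tokens into words, stripping the boundary markers."""
    words, current = [], []
    for token in tokens:
        current.append(token.strip(WORD_START + WORD_END))
        if token.endswith(WORD_END):  # last subword of the current word
            words.append(current)
            current = []
    if current:  # trailing subwords without a closing marker
        words.append(current)
    return words


# Subwords copied from the sample output above.
tokens = (
    "⸥The⸤ ⸥echo⸤ ⸥of⸤ ⸥a⸤ ⸥distant⸤ ⸥time⸤ ⸥comes⸤ ⸥wil lowing⸤ ⸥across⸤ "
    "⸥the⸤ ⸥sand ,⸤ ⸥and⸤ ⸥everything⸤ ⸥is⸤ ⸥green⸤ ⸥and⸤ ⸥submarine .⸤"
).split()

words = group_into_words(tokens)
print(["".join(pieces) for pieces in words])
```

With the sample tokens, `willowing` is recovered from the two subwords `wil` and `lowing`, and punctuation stays attached to the word it follows.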

Documentation

class Encoding:

A named tuple containing:

ids (List[Tuple[int, int, int]])
tokens (List[str])
perplexities (List[float])
offsets (List[Tuple[int, int]])
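
For orientation, the fields listed above can be mirrored with a stand-in named tuple. This is a sketch of the documented shape only, not the library's source: the field order is assumed to follow the list above, and the per-field comments marked "assumed" are interpretations, not documented facts.

```python
from typing import List, NamedTuple, Tuple


class Encoding(NamedTuple):
    """Stand-in mirroring the documented fields of the package's Encoding."""
    ids: List[Tuple[int, int, int]]  # one index triplet per subword
    tokens: List[str]                # subword strings
    perplexities: List[float]        # assumed: one value per subword
    offsets: List[Tuple[int, int]]   # assumed: one span per subword


example = Encoding(
    ids=[(52, 74, 62), (221, 21, 77)],
    tokens=["⸥The⸤", "⸥echo⸤"],
    perplexities=[0.0, 0.0],   # placeholder values, not real model output
    offsets=[(0, 3), (4, 8)],  # placeholder spans, semantics assumed
)

# Fields are available by name and by positional unpacking.
ids, tokens, perplexities, offsets = example
print(example.tokens, example.ids)
```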

`Factorizer.__init__`

Argument	Description
`tokenizer_path` (str)	path to a DAWG file containing the pretrained vocabulary
`alpha` (float)	the alpha_split hyperparameter controlling the granularity of subword splits (default: 0.1)
`sigma` (float)	the sigma_sample hyperparameter controlling the randomness (temperature) of sampling (default: 0.0, i.e. no sampling)
`merge_unks` (bool)	set this argument to True if you want to merge consecutive UNK tokens (default: True)
`allow_decoding` (bool)	set this argument to True if you want to precompute the inverse vocabulary for decoding (default: False)
`sample` (bool)	set this argument to True if you want to sample from the subword distribution; set to False if you want to always do the optimal tokenization (default: False)
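
One plausible way to use these arguments is to keep two configurations, one that samples tokenizations and one that is deterministic and supports decoding. The sketch below only builds the keyword dictionaries; it assumes the argument names in the table are accepted as keyword arguments, and the helper name `factorizer_kwargs` and the `sigma` value of 0.1 are illustrative, not recommendations from the authors.

```python
from typing import Any, Dict


def factorizer_kwargs(training: bool, sigma: float = 0.1) -> Dict[str, Any]:
    """Build keyword arguments for Factorizer(...) from the documented names."""
    if training:
        # Sample from the subword distribution; the sigma value is illustrative.
        return {"sample": True, "sigma": sigma}
    # Always use the optimal tokenization and precompute the inverse
    # vocabulary so that decode() can be used.
    return {"sample": False, "allow_decoding": True}


print(factorizer_kwargs(training=True))
print(factorizer_kwargs(training=False))

# Intended use (requires the package and a downloaded model file):
#   tokenizer = Factorizer("english.dawg", **factorizer_kwargs(training=False))
```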

`Factorizer.__call__`

Factorizes the input string (or list of strings)

Argument	Description
`text` (Union[str, List[str]])	the input string (or list of strings)

Returns: Union[Encoding, List[Encoding]]
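
Because the return type depends on the input type, downstream code may want a uniform shape. The following sketch wraps any tokenizer-like callable so that it always yields a list; `encode_as_list` and `fake_tokenizer` are hypothetical names, and the stand-in tokenizer only mimics the documented input and output shapes so the example runs without the package or a model file.

```python
from typing import Any, Callable, List, Union


def encode_as_list(tokenizer: Callable[[Any], Any], text: Union[str, List[str]]) -> List[Any]:
    """Call the tokenizer and always return a list of results."""
    result = tokenizer(text)
    # A single string yields a single result; wrap it for a uniform shape.
    return [result] if isinstance(text, str) else list(result)


def fake_tokenizer(text):
    """Stand-in: str -> one result, list of str -> list of results."""
    if isinstance(text, str):
        return text.split()
    return [fake_tokenizer(item) for item in text]


print(encode_as_list(fake_tokenizer, "a b"))         # [['a', 'b']]
print(encode_as_list(fake_tokenizer, ["a b", "c"]))  # [['a', 'b'], ['c']]
```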

`Factorizer.encode`

Identical to `Factorizer.__call__`.

`Factorizer.decode`

Takes factorized indices and decodes them back into a string (also accepts batched input).

Argument	Description
`indices` (Union[List[Tuple[int, int, int]], List[List[Tuple[int, int, int]]]])	the factorized indices

Returns: Union[str, List[str]]
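
A simple sanity check is to encode a batch, decode it again, and list any texts that changed. The sketch below relies only on the documented call and decode shapes; `roundtrip_failures` and `StubTokenizer` are hypothetical names, and the stub exists only so the example runs without the package. With the real `Factorizer`, decoding requires `allow_decoding=True` according to the constructor documentation.

```python
from typing import List


def roundtrip_failures(tokenizer, texts: List[str]) -> List[str]:
    """Return the texts that do not survive encode -> decode unchanged."""
    encodings = tokenizer(texts)                                   # batched call
    decoded = tokenizer.decode([enc.ids for enc in encodings])     # batched decode
    return [text for text, out in zip(texts, decoded) if text != out]


class StubTokenizer:
    """Stand-in with the documented call/decode shapes; not the real Factorizer."""

    class _Encoding:
        def __init__(self, ids):
            self.ids = ids

    def __call__(self, texts):
        # One fake triplet per character, just to have something to decode.
        return [self._Encoding([(ord(ch), 0, 0) for ch in text]) for text in texts]

    def decode(self, batch):
        return ["".join(chr(triplet[0]) for triplet in ids) for ids in batch]


print(roundtrip_failures(StubTokenizer(), ["abc", "de"]))
```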

Please cite the following publication

@inproceedings{samuel-ovrelid-2023-tokenization,
    title = "Tokenization with Factorized Subword Encoding",
    author = "Samuel, David  and
      {\O}vrelid, Lilja",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.890",
    doi = "10.18653/v1/2023.findings-acl.890",
    pages = "14143--14161",
    abstract = "In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.",
}
