This repository contains the code for running the marginalization over tokenizations algorithm presented in our ACL'23 paper:
```bibtex
@inproceedings{marginalization,
  title={Should you marginalize over possible tokenizations?},
  author={Chirkova, Nadezhda and Kruszewski, Germ{\'a}n and Rozen, Jos and Dymetman, Marc},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
  year={2023},
}
```
The code uses the following libraries, which you can install in a fresh virtual environment:

```bash
python -m venv marginalization
source marginalization/bin/activate
pip3 install numpy pandas torch transformers datasets
```
By default, the code supports two model families (`gpt2` and `bigscience/bloom-1b7` / `bigscience/bloom-560m`) and three datasets (`wikitext`, `twitter`, and `flores:<lang>`; see language codes here). The model and data will be downloaded automatically using HuggingFace utilities. You can run evaluation as follows:
```bash
python3 main.py --dataset <dataset> --model <model> --log_file <logfile>
```
The script will evaluate bits-per-character (BPC) according to common practice (using the default tokenization) and then according to the proposed marginalization paradigm (using sampled tokenizations). The detailed log will be saved to `<logfile>`, and a summary of the results will be printed.
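For intuition, here is a minimal sketch (not code from this repository) of computing per-string BPC under a single default tokenization with HuggingFace `transformers`; `main.py` additionally estimates the marginal probability by averaging over K sampled tokenizations:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def bpc_default_tokenization(text: str) -> float:
    """BPC under the default tokenization: -log2 P(tokens) / num_chars."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF causal LMs return the mean cross-entropy
        # (in nats) over the ids.shape[1] - 1 predicted tokens; the first
        # token's probability is not scored in this simplified sketch.
        loss = model(input_ids=ids, labels=ids).loss
    total_nats = loss.item() * (ids.shape[1] - 1)
    return total_nats / (math.log(2) * len(text))

print(bpc_default_tokenization("Should you marginalize over possible tokenizations?"))
```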
You can also print the summary of the results for a given log, as follows:

```bash
python3 parse_logs.py <logfile>
```
For example, you can use this command to check the progress of an ongoing evaluation run.
Our logs are available here.
You can specify algorithm / data hyperparameters by passing arguments to the `main.py` script (an example invocation is shown after the list):

- `--max_length`: max length (in tokens) of concatenated text strings; -1 disables concatenation. Default: 800
- `--num_texts`: number of (concatenated) strings to evaluate on. Default: 100
- `--num_toks_per_seq`: number of tokenizations to sample for each string (K in eq. (2) in the paper). Default: 30
- `--max_block_len`: max. length of the blocks that strings are split into (L in the paper). Default: 19
- `--batch_size`: number of tokenizations of a block to score with the LM in a single batch. Default: 16
- `--max_batches_per_block`: max. number of batches to score per block. The number of scored tokenizations per block (M in the paper) is `batch_size * max_batches_per_block`. Default: 8
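For example, the following command (hyperparameter values are illustrative) samples 50 tokenizations per string and caps the block length at 15:

```bash
python3 main.py --dataset wikitext --model gpt2 --log_file wikitext_gpt2.log \
    --num_toks_per_seq 50 --max_block_len 15
```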
We recommend setting `--max_block_len` to the maximum token length in the default tokenization of the data, when possible. You can check the token length distribution by running the following script (add the `--max_length` and `--num_texts` flags if you use non-default values for them):

```bash
python3 check_token_lengths_distribution.py --model <model> --dataset <dataset>
```
If you wish to add your own model or dataset, please include them in the `aux.py` file (a sketch is given after the list):

- For a dataset, include it in the `load_data` function and return a list of strings.
- For a model, include it in the `load_model_and_tokenizer` function and return:
  - the model;
  - the tokenizer;
  - the `model_max_length` value (the maximum number of tokens in a sequence supported by the model);
  - the `is_new_word` function, which determines whether a given token begins a new word (needed for splitting a string into blocks).
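As a rough illustration only (the actual structure of `aux.py` may differ; the dataset name and file path below are hypothetical), the two extension points could look like:

```python
# Hypothetical sketch; adapt to the actual structure of aux.py.

def load_data(dataset_name):
    if dataset_name == "my_dataset":       # hypothetical dataset name
        with open("my_dataset.txt") as f:  # hypothetical local file
            # load_data must return a list of strings
            return [line.strip() for line in f if line.strip()]
    raise ValueError(f"Unknown dataset: {dataset_name}")

# For GPT-2-style BPE vocabularies, tokens that begin a new word are
# prefixed with "Ġ" (an encoded leading space), so is_new_word can be:
def is_new_word(token):
    return token.startswith("Ġ")
```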
The code is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. See LICENSE for more information.