Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the order of additional special tokens deterministic #5704

Merged

Conversation

gonglinyuan
Copy link
Contributor

In SpecialTokensMixin.all_special_tokens_extended, deduplication is performed by all_toks = list(set(all_toks)). However, this will change the ordering of additional special tokens, and the order depends on the hash seed of set data structure. This will result in non-deterministic id of additional special tokens added to AutoTokenizer.from_pretrained method. Therefore, I changed this problematic line to all_toks = list(OrderedDict.fromkeys(all_toks)). This line will deduplicate all_toks while still keeping the original ordering.

@codecov
Copy link

codecov bot commented Jul 13, 2020

Codecov Report

Merging #5704 into master will decrease coverage by 0.94%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5704      +/-   ##
==========================================
- Coverage   78.26%   77.32%   -0.95%     
==========================================
  Files         146      146              
  Lines       25998    25998              
==========================================
- Hits        20348    20102     -246     
- Misses       5650     5896     +246     
Impacted Files Coverage Δ
src/transformers/tokenization_utils_base.py 93.60% <100.00%> (ø)
src/transformers/modeling_tf_mobilebert.py 23.38% <0.00%> (-73.39%) ⬇️
src/transformers/modeling_tf_electra.py 95.53% <0.00%> (+69.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0befb51...c706cc9. Read the comment docs.

Copy link
Contributor

@JetRunner JetRunner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. can't think of a side effect.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pinging @n1t0 for verification

@LysandreJik LysandreJik requested a review from n1t0 July 27, 2020 08:53
Copy link
Member

@n1t0 n1t0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@LysandreJik LysandreJik merged commit b390a56 into huggingface:master Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants