Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add strip_accents to basic BertTokenizer. #6280

Merged
merged 7 commits into from Aug 6, 2020

Conversation

PhilipMay
Copy link
Contributor

@PhilipMay PhilipMay commented Aug 6, 2020

The BertTokenizerFast can turn off strip_accents with strip_accents=False. This PR also adds this option to the basic BertTokenizer.

Also see #6186

@PhilipMay
Copy link
Contributor Author

Strange CI problem with checksum of Torch:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    torch from https://files.pythonhosted.org/packages/38/53/914885a93a44b96c0dd1c36f36ff10afe341f091230aad68f7228d61db1e/torch-1.6.0-cp36-cp36m-manylinux1_x86_64.whl#sha256=7669f4d923b5758e28b521ea749c795ed67ff24b45ba20296bc8cff706d08df8 (from transformers==3.0.2):
        Expected sha256 7669f4d923b5758e28b521ea749c795ed67ff24b45ba20296bc8cff706d08df8
             Got        8188c461d0b762aa4ae6e72105b3c3a01f7bb2863b46a392ff8537a0854e8967

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM. Agree with @JetRunner's suggestion.

@codecov
Copy link

codecov bot commented Aug 6, 2020

Codecov Report

Merging #6280 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #6280   +/-   ##
=======================================
  Coverage   79.64%   79.64%           
=======================================
  Files         147      147           
  Lines       27120    27125    +5     
=======================================
+ Hits        21600    21605    +5     
  Misses       5520     5520           
Impacted Files Coverage Δ
src/transformers/tokenization_bert.py 91.51% <100.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31da35c...547df96. Read the comment docs.

@LysandreJik
Copy link
Member

You can ignore the HASH errors, we're working on solving these but they're unrelated to your PR.

@PhilipMay
Copy link
Contributor Author

PhilipMay commented Aug 6, 2020

This is ready to be merged IMO. 😁

@JetRunner JetRunner merged commit d5bc32c into huggingface:master Aug 6, 2020
@PhilipMay
Copy link
Contributor Author

Wow - this was really fast from first commit to merge. Many thanks to the contributors. This makes open source development twice as much fun.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants