
⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ #24565

Merged (12 commits) on Jun 30, 2023

Conversation

ArthurZucker (Collaborator) commented Jun 29, 2023

What does this PR do?

Fixes the T5Tokenizer (not the fast one yet). (At the same time, addresses part of #11531.)
When converting UMT5 I created a reproduction snippet for any t5x model from the original repo. I realized that a very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process <extra_id_xx>.

Example:

# t5-base tokenizer
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 3, 5, 8774] # ['<extra_id_0>', ' ▁', '.', '▁Hello']
# seqio.SentencePieceVocabulary(vocab_path, extra_ids = 300)
>>> processor.encode("<extra_id_0>. Hello")
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

#after fix: 
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

The reason is that t5x wraps around sentencepiece and adds the extra ids to the vocab, but they are not saved that way.
We don't add them to the vocab, so when we tokenize, we split on special tokens; the sentencepiece model therefore only sees:

>>> tokenizer.sp_model.encode(". Hello")
[273, 274, 9] 

The original model, however, never sees a . (or many other characters) alone, and thus we end up adding an extra space...
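
To make the mechanism concrete, here is a minimal repro sketch (assuming the slow t5-base tokenizer is available locally; outputs are printed rather than asserted since they depend on the checkpoint):

# Minimal sketch: the slow tokenizer splits on special tokens before calling
# SentencePiece, so the fragment after "<extra_id_0>" is encoded in isolation and
# SentencePiece's dummy-prefix option inserts a leading '▁' that t5x never produces.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# The fragment the underlying sp_model actually sees after the special-token split:
print(tokenizer.sp_model.encode(". Hello", out_type=str))

# The full string the user passed (before the fix, a stray '▁' shows up here):
print(tokenizer.tokenize("<extra_id_0>. Hello"))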

This is a bug fix with regard to training; it is breaking in the sense that it removes the extra space.

TODO:

  • Extra checks should be added to make sure this does not strip anything else. Stripping a leading space, for example, would break tokenizer.encode(". Hello"), as it would remove the prefix space that is normally added (see the sketch below).
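
A hypothetical shape for such a check (test name and structure are illustrative, not from the PR; only structural properties are asserted since the exact pieces depend on the spm model):

# Hypothetical regression check for the prefix-space edge case described above.
def test_prefix_space_preserved(tokenizer):
    # A string with no leading special token must keep SentencePiece's prefix space.
    plain = tokenizer.tokenize(". Hello")
    assert plain[0].startswith("▁")

    # Text right after a sentinel must not gain a lone '▁' piece.
    after_sentinel = tokenizer.tokenize("<extra_id_0>. Hello")
    assert "▁" not in after_sentinel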

HuggingFaceDocBuilderDev commented Jun 29, 2023

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker marked this pull request as ready for review on June 29, 2023, 03:28
ArthurZucker (Collaborator, Author) commented Jun 29, 2023

Actually, the switch and t5 tests have to be updated!
This means I have to check whether the models were trained with this extra token (i.e. whether they used the HF tokenizer) or not.

  • tests.models.instructblip.test_modeling_instructblip.InstructBlipModelIntegrationTest (testMethod=test_inference_flant5_xl) is failing on main too, so not related.
  • tests.models.mt5.test_modeling_flax_mt5.MT5IntegrationTest also fails on main...

  • tests/models/t5/test_tokenization_t5.py: the issue comes from the convert_slow modification. Need to investigate:
    - [ ] tests/models/t5/test_tokenization_t5.py:399 T5TokenizationTest.test_get_sentinel_token_ids_for_fasttokenizer
    - [ ] tests/test_tokenization_common.py:3425 T5TokenizationTest.test_save_pretrained
    - [ ] tests/models/t5/test_tokenization_t5.py:271 T5TokenizationTest.test_special_tokens_initialization

ArthurZucker (Collaborator, Author) commented

This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.

sgugger (Collaborator) left a comment


Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.

ArthurZucker and others added 3 commits June 30, 2023 03:54
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
ArthurZucker merged commit b52a03c into huggingface:main on Jun 30, 2023
19 checks passed
dtiarks mentioned this pull request on Jun 30, 2023
ArthurZucker (Collaborator, Author) commented Jul 2, 2023

Edit: just to make sure, I did more testing and unfortunately there is one bug:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']

instead of

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']

This is because we have to prepend a ▁ instead of a space (text = SPIECE_UNDERLINE + text). Not a single test caught this when running pytest tests -k t5, which is interesting.
Fixing ASAP and adding tests. This is becoming very complex 😓
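
A rough sketch of the direction only (hypothetical helper, not the PR's actual diff): after prepending SPIECE_UNDERLINE, drop the stray standalone '▁' piece it can leave behind:

# Hypothetical post-processing step illustrating the idea described above:
# ['▁', '▁Hello', '<extra_id_0>']  ->  ['▁Hello', '<extra_id_0>']
SPIECE_UNDERLINE = "▁"

def strip_stray_underline(pieces):
    if len(pieces) > 1 and pieces[0] == SPIECE_UNDERLINE:
        return pieces[1:]
    return pieces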

pointonjoel commented

I'm getting this legacy behaviour warning when simply loading a T5 tokenizer; it appears even before using the tokenizer. Is there an updated way to load the tokenizer? The warning appears when running the following lines of code:

from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained("google/mt5-small")

The warning is:
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at #24565
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(

ArthurZucker (Collaborator, Author) commented

Yep, just set legacy=False. The goal of the warning is for you to decide whether or not the legacy behaviour is alright for your use case.
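
For example (assuming a transformers version where the legacy kwarg is forwarded to the T5 tokenizer):

# Opt into the fixed behaviour explicitly; this also silences the legacy warning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", legacy=False)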
