Tweak ESM tokenizer for Nucleotide Transformer #22770

Rocketknight1 · 2023-04-14T13:48:07Z

Nucleotide Transformer is a model that takes DNA inputs. It uses the same model architecture as the protein model ESM, but in addition to using a different vocabulary, it tokenizes inputs without a <sep> or <eos> token at the end. This PR makes a small tweak to the tokenization code for ESM, so that it doesn't try to add self.eos_token_id to sequences when the tokenizer does not have an eos_token set. With this change, we can fully support Nucleotide Transformer as an ESM checkpoint.
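The behaviour described above can be sketched as a standalone function. This is a minimal illustration, not the code in tokenization_esm.py: the function name `build_inputs`, its parameters, and the token ids are hypothetical. The layout used for sequence pairs and the error raised for pairs without an EOS token are assumptions, since the PR text does not describe that case.

```python
from typing import List, Optional


def build_inputs(
    token_ids_0: List[int],
    cls_token_id: int,
    eos_token_id: Optional[int] = None,
    token_ids_1: Optional[List[int]] = None,
) -> List[int]:
    """Hypothetical helper: add special tokens, skipping EOS when it is unset."""
    cls = [cls_token_id]

    if eos_token_id is None:
        # No EOS configured (the Nucleotide Transformer case): only prepend CLS.
        if token_ids_1 is not None:
            # Assumption: without an EOS token there is no separator for pairs.
            raise ValueError("Cannot join two sequences when no EOS token is set")
        return cls + token_ids_0

    sep = [eos_token_id]
    if token_ids_1 is None:
        # EOS configured (the protein ESM case): CLS + tokens + EOS.
        return cls + token_ids_0 + sep
    # Assumed pair layout: CLS + first + EOS + second + EOS.
    return cls + token_ids_0 + sep + token_ids_1 + sep


# Usage with made-up ids: 0 stands in for CLS, 2 for EOS.
with_eos = build_inputs([5, 6, 7], cls_token_id=0, eos_token_id=2)
without_eos = build_inputs([5, 6, 7], cls_token_id=0)
```

Checking `eos_token_id` directly, rather than a list built from it, is the point the review below raises.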

amyeroberts · 2023-04-14T13:54:46Z

src/transformers/models/esm/tokenization_esm.py

+                return cls + token_ids_0
+            else:
+                return cls + token_ids_0 + sep
+        elif sep is None:


Should this be:

-        elif sep is None:
+        elif self.eos_token_id is None:

as sep is always a list?

Rocketknight1

I spotted that one just before you got here, and yes, you're right!
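The reason the original check could never fire can be shown in isolation. The snippet below is a small sketch with hypothetical function names. It assumes, as the review comment states, that `sep` is a list wrapping the EOS id. A list is never `None`, even when its only element is.

```python
from typing import Optional


def check_via_list(eos_token_id: Optional[int]) -> bool:
    """Mirrors the original check: test the wrapping list against None."""
    sep = [eos_token_id]  # always a list object, even if it holds None
    return sep is None


def check_via_id(eos_token_id: Optional[int]) -> bool:
    """Mirrors the suggested check: test the id itself against None."""
    return eos_token_id is None


missing_detected_by_list = check_via_list(None)
missing_detected_by_id = check_via_id(None)
print(missing_detected_by_list, missing_detected_by_id)
```

With the list-based check, the branch for a missing EOS token would be skipped silently, which is why the condition was changed to test the id.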

HuggingFaceDocBuilderDev · 2023-04-14T14:11:02Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

LGTM!

If EOS is None, don't add it to sequences

8e6cfa9

Rocketknight1 requested review from amyeroberts and sgugger April 14, 2023 13:48

If EOS is None, don't add it to sequences

469d5c3

amyeroberts reviewed Apr 14, 2023

sgugger approved these changes Apr 14, 2023

Rocketknight1 merged commit 06e737f into main Apr 14, 2023

Rocketknight1 deleted the nucleotide_transformer_tokenizer branch April 14, 2023 14:18

Rocketknight1 mentioned this pull request Apr 19, 2023

Fix to removing ESM special tokens #22870

Merged

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

Tweak ESM tokenizer for Nucleotide Transformer (huggingface#22770)

cd23c60

* If EOS is None, don't add it to sequences
* If EOS is None, don't add it to sequences
