Add the ability to serialize custom Python components #581

Closed
n1t0 opened this issue Jan 6, 2021 · 6 comments
Labels
enhancement New feature or request python Issue related to the python binding

Comments

@n1t0
Member

n1t0 commented Jan 6, 2021

It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.

I haven't really dug into this, so I don't know exactly what the constraints/requirements would be, but it is something we should explore at some point.

@n1t0 n1t0 added enhancement New feature or request python Issue related to the python binding labels Jan 6, 2021
@ibraheem-moosa

ibraheem-moosa commented Feb 13, 2022

This is a useful feature. We could probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?

@Narsil
Collaborator

Narsil commented Feb 14, 2022

The end result has to be saved as JSON, so I don't think it's doable.
Also, pickle is highly unsafe and not portable (despite being widely used).

Currently the workaround is to override the component before saving, and override it again after loading:

# The tokenizer has a custom (unserializable) pre-tokenizer
tokenizer.pre_tokenizer = Custom()

# Swap in a serializable built-in before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

## Load later, then restore the custom component
tokenizer = Tokenizer.from_file("tok.json")
tokenizer.pre_tokenizer = Custom()

It is a bit inconvenient but at least it's safe and portable.
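For illustration, the swap pattern above can be exercised end-to-end. This is a minimal sketch with a made-up `WordLevel` vocab standing in for a real tokenizer; the custom component itself is elided, since re-attaching it is a single assignment after loading:

```python
import os
import tempfile

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Hypothetical minimal tokenizer standing in for one that had a custom component
tokenizer = Tokenizer(
    WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]")
)

# Swap in a serializable built-in before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
path = os.path.join(tempfile.mkdtemp(), "tok.json")
tokenizer.save(path)

# Load later; the custom component would be re-attached here
loaded = Tokenizer.from_file(path)
print(loaded.encode("hello world").tokens)  # ['hello', 'world']
```

The saved `tok.json` records the `Whitespace` placeholder, so the file itself stays portable; only the Python process that loads it needs access to the custom component's code.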

@cceyda

cceyda commented Apr 6, 2023

You also can't load it as a PreTrainedTokenizerFast if you have a custom component.

from transformers import PreTrainedTokenizerFast

# This fails if `tokenizer` contains a custom Python component
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

As a workaround I do:

from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import PreTokenizer  # needed for PreTokenizer.custom

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())

but overriding via the private _tokenizer attribute may be unpredictably problematic.
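For context, here is a minimal runnable sketch of what such a custom pre-tokenizer looks like. The class name and split rule are made up for illustration; attaching it via `PreTokenizer.custom` works at runtime, but the resulting tokenizer cannot be serialized:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import PreTokenizer

class DashPreTokenizer:
    """Hypothetical custom component: splits input on '-' characters."""

    def dash_split(self, i, normalized_string):
        # NormalizedString.split returns a list of NormalizedString slices
        return normalized_string.split("-", "removed")

    def pre_tokenize(self, pretok):
        # pretok is a PreTokenizedString; apply our split to every piece
        pretok.split(self.dash_split)

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "a": 1, "b": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = PreTokenizer.custom(DashPreTokenizer())
print(tokenizer.encode("a-b").tokens)  # works at runtime: ['a', 'b']
# tokenizer.save("tok.json") would raise here, since the component is Python code
```

This is exactly the situation the thread describes: encoding works fine in-process, but `save` has no JSON representation for the Python object.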

@Narsil
Collaborator

Narsil commented Apr 7, 2023

Totally understandable.

What kind of pre-tokenizer are you saving?
If some building blocks are missing we could add them to make the thing more composable/portable/shareable.

@luvwinnie

Is it now possible to save a custom pre-tokenizer?

@Narsil
Collaborator

Narsil commented Aug 28, 2023

No. A custom component is Python code; it's not serializable by nature.

5 participants