Extract the word-counts to each Trainer #524

Closed
n1t0 opened this issue Nov 13, 2020 · 1 comment · Fixed by #544
Labels
enhancement New feature or request

Comments

@n1t0
Member

n1t0 commented Nov 13, 2020

Current state

Training a Model currently starts in the Tokenizer, which computes the word counts from the training corpus. These word counts are then passed to the relevant Trainer to start the training. This has several limitations:

  • Word counts are not always the best starting point for training a Model.
  • It prevents streaming the corpus directly into the Trainer and forces us to build a first representation in memory, which is limiting for big datasets. Some Trainers could build a better representation directly, reducing the memory footprint.
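
To make the current flow concrete, here is a minimal sketch of it. All names (`word_counts`, `CountsTrainer`, `min_frequency`) are hypothetical and chosen for illustration; this is not the library's actual code, and the whitespace split stands in for real pre-tokenization.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the word-count step done by the Tokenizer.
/// It splits on whitespace only and holds every distinct word in memory.
fn word_counts<'a, I>(lines: I) -> HashMap<String, u32>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<String, u32> = HashMap::new();
    for line in lines {
        for word in line.split_whitespace() {
            *counts.entry(word.to_owned()).or_insert(0) += 1;
        }
    }
    counts
}

/// Hypothetical trainer that can only start from precomputed word counts.
struct CountsTrainer {
    min_frequency: u32,
}

impl CountsTrainer {
    /// Keeps the words seen at least `min_frequency` times, sorted alphabetically.
    fn train(&self, counts: HashMap<String, u32>) -> Vec<String> {
        let mut vocab: Vec<String> = counts
            .into_iter()
            .filter(|(_, count)| *count >= self.min_frequency)
            .map(|(word, _)| word)
            .collect();
        vocab.sort();
        vocab
    }
}
```

In this shape the trainer never sees the corpus itself, so it cannot choose a more compact representation than the full word-count map.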

Goal

Change the Trainer API to:

  • Feed it with &str directly
  • Leave it the responsibility of building its own representation
  • Have train take only the Model to train
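
The three goals above could be sketched as a trait like the following. This is only an illustration under stated assumptions: the trait shape, `feed`, `WordLevelModel`, and `WordLevelTrainer` are hypothetical names, not the API that was eventually merged.

```rust
use std::collections::HashMap;

/// Hypothetical model: a word-level vocabulary mapping words to ids.
#[derive(Default)]
struct WordLevelModel {
    vocab: HashMap<String, u32>,
}

/// Hypothetical trait shape mirroring the goals listed above.
trait Trainer {
    type Model;

    /// Receives raw text directly; the trainer decides what to keep.
    fn feed(&mut self, sequence: &str);

    /// Trains the given model from whatever the trainer accumulated.
    fn train(&self, model: &mut Self::Model);
}

/// Hypothetical trainer whose own representation happens to be word counts.
/// Another trainer could keep something entirely different.
#[derive(Default)]
struct WordLevelTrainer {
    counts: HashMap<String, u32>,
}

impl Trainer for WordLevelTrainer {
    type Model = WordLevelModel;

    fn feed(&mut self, sequence: &str) {
        for word in sequence.split_whitespace() {
            *self.counts.entry(word.to_owned()).or_insert(0) += 1;
        }
    }

    fn train(&self, model: &mut WordLevelModel) {
        // Sort by descending count, then alphabetically so ids are deterministic.
        let mut words: Vec<(&String, &u32)> = self.counts.iter().collect();
        words.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
        model.vocab = words
            .into_iter()
            .enumerate()
            .map(|(id, (word, _))| (word.clone(), id as u32))
            .collect();
    }
}
```

Because `feed` is called once per sequence, the caller can stream lines from disk without materializing the corpus, and each trainer is free to store only what it needs.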
@n1t0 n1t0 added the enhancement New feature or request label Nov 13, 2020
@n1t0 n1t0 mentioned this issue Nov 13, 2020
@n1t0 n1t0 closed this as completed in #544 Nov 28, 2020
@pietrolesci

Hi @n1t0,

Possibly related to this: is there a way, from the user's side, to get the word counts after training a tokenizer?

I preferred asking here rather than opening a new issue.
Context: I am on version '0.10.0rc1', using the WordLevel tokenizer.
