Extract the word-counts to each Trainer #524

n1t0 · 2020-11-13T17:52:25Z

Current state

Training a Model currently starts in the Tokenizer, which computes the word counts from the training corpus. These word counts are then passed to the relevant Trainer to start the training. This has several limitations:

Computing the word counts is not always the best starting point to train a Model
It prevents streaming the corpus directly into the Trainer and forces us to build a first representation in memory, which is limiting for big datasets. Some Trainers could build a better representation directly, reducing the memory footprint.
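To make the current flow concrete, here is a minimal sketch, not the library's actual code. The names `count_words` and `CountsOnlyTrainer` are hypothetical, and whitespace splitting stands in for whatever pre-tokenization is really applied. It shows the counts being built in full before the trainer sees anything:

```rust
use std::collections::HashMap;

/// Hypothetical helper: count whitespace-separated words across the corpus.
/// In the flow described above, this step happens on the Tokenizer side.
fn count_words<'a, I>(lines: I) -> HashMap<String, u32>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts = HashMap::new();
    for line in lines {
        for word in line.split_whitespace() {
            *counts.entry(word.to_owned()).or_insert(0) += 1;
        }
    }
    counts
}

/// Hypothetical trainer that only accepts finished word counts.
struct CountsOnlyTrainer {
    min_frequency: u32,
}

impl CountsOnlyTrainer {
    /// Keep every word seen at least `min_frequency` times, sorted alphabetically.
    fn train(&self, word_counts: HashMap<String, u32>) -> Vec<String> {
        let mut vocab: Vec<String> = word_counts
            .into_iter()
            .filter(|(_, count)| *count >= self.min_frequency)
            .map(|(word, _)| word)
            .collect();
        vocab.sort();
        vocab
    }
}
```

In this shape the whole count table must exist in memory before `train` is called, even for a trainer that would prefer a different intermediate structure.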

Goal

Change the Trainer API to:

Feed it &str directly
Make it responsible for building its own representation
train should take only the Model to train.
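The three goals above could be sketched roughly as follows. This is an illustration under assumptions, not the API that was eventually merged: the `Trainer` trait shape, `WordLevelModel`, and `WordLevelTrainer` here are hypothetical, and whitespace splitting replaces real pre-tokenization.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a model: a plain word-to-id vocabulary.
#[derive(Default)]
struct WordLevelModel {
    vocab: HashMap<String, u32>,
}

/// Hypothetical trainer interface following the three goals.
trait Trainer {
    type Model;

    /// Consume raw text one sequence at a time; the trainer decides what to keep.
    fn feed<'a, I>(&mut self, sequences: I)
    where
        I: Iterator<Item = &'a str>;

    /// Write the training result into the model; nothing else is passed in.
    fn train(&self, model: &mut Self::Model);
}

/// A word-level trainer happens to want word counts as its representation;
/// another trainer could keep something entirely different.
#[derive(Default)]
struct WordLevelTrainer {
    counts: HashMap<String, u32>,
}

impl Trainer for WordLevelTrainer {
    type Model = WordLevelModel;

    fn feed<'a, I>(&mut self, sequences: I)
    where
        I: Iterator<Item = &'a str>,
    {
        for sequence in sequences {
            for word in sequence.split_whitespace() {
                *self.counts.entry(word.to_owned()).or_insert(0) += 1;
            }
        }
    }

    fn train(&self, model: &mut WordLevelModel) {
        // Most frequent first; ties broken alphabetically so ids are deterministic.
        let mut words: Vec<(&String, &u32)> = self.counts.iter().collect();
        words.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
        model.vocab = words
            .into_iter()
            .enumerate()
            .map(|(id, (word, _))| (word.clone(), id as u32))
            .collect();
    }
}
```

Because `feed` takes an iterator, the caller can stream lines from a file or from memory without materializing the corpus, and each trainer owns whatever state it accumulates.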


pietrolesci · 2020-12-08T21:42:58Z

Hi @n1t0,

Possibly related to this: is there a way, from the user's side, to get the word counts after training a tokenizer?

I preferred to ask here rather than open a separate issue.
Context: I am on '0.10.0rc1', using the WordLevel tokenizer.

n1t0 added the enhancement New feature or request label Nov 13, 2020

n1t0 mentioned this issue Nov 13, 2020: Training improvements #528 (Closed, 6 tasks)

n1t0 mentioned this issue Nov 28, 2020: Ability to train from memory #544 (Merged, 1 task)

n1t0 closed this as completed in #544 Nov 28, 2020
