Encoding 1.5GB corpus file runs out of RAM #213

Open

enchantinggg4 opened this issue Feb 27, 2023 · 2 comments

@enchantinggg4

I have a basic setup: I train a tokenizer on a subset of the corpus, build a TokenDataset from the full corpus, and then train.
I run out of RAM while creating the TokenDataset (I have 30 GB of RAM on Kaggle). This seems strange, given that the original corpus is far smaller than 30 GB.

Is there a way to encode the corpus into another file (in smaller batches) and then load it lazily for training?
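
For reference, the pattern being asked about here (encode in small batches to a file on disk, then load it lazily) can be sketched roughly as below. This is only an illustrative sketch, not aitextgen's API: the file paths, `lines_per_batch`, the block size, and the uint16 assumption (vocabulary under 65,536 tokens) are all hypothetical.

```python
# Sketch only: assumes a Hugging Face `tokenizers` tokenizer saved as tokenizer.json
# and a plain-text corpus with one document per line.
import numpy as np
import torch
from tokenizers import Tokenizer
from torch.utils.data import Dataset


def encode_to_disk(corpus_path, tokenizer_path, out_path, lines_per_batch=10_000):
    """Tokenize the corpus in small batches, appending token IDs to a raw binary file."""
    tok = Tokenizer.from_file(tokenizer_path)
    with open(corpus_path, encoding="utf-8") as src, open(out_path, "wb") as dst:
        batch = []
        for line in src:
            batch.append(line)
            if len(batch) == lines_per_batch:
                for enc in tok.encode_batch(batch):
                    np.asarray(enc.ids, dtype=np.uint16).tofile(dst)
                batch = []
        if batch:  # flush the final partial batch
            for enc in tok.encode_batch(batch):
                np.asarray(enc.ids, dtype=np.uint16).tofile(dst)


class LazyTokenDataset(Dataset):
    """Serves fixed-size blocks from the encoded file via a memory map,
    so the full token stream is never held in RAM at once."""

    def __init__(self, token_file, block_size=256):
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) // self.block_size

    def __getitem__(self, i):
        start = i * self.block_size
        chunk = self.tokens[start : start + self.block_size]
        return torch.from_numpy(chunk.astype(np.int64))
```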

@Vectorrent
Contributor

I have run into an issue where specific files, for whatever reason, can crash the tokenizer. I've had a 15 KB XML file swallow 30 GB of RAM. I'm not really sure why some files cause this, but perhaps that's the issue you're running into?

@breadbrowser commented Apr 1, 2023

Kaggle gives you 16 GB of RAM, and you would need something like 100 GB or more of RAM to encode it. (Edit: this is super old; it should use the CPU, not RAM.)
