sentence_dedup.py: "struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)" #112

Closed
griff4692 opened this issue Mar 1, 2024 · 2 comments

Comments

@griff4692

  File "/weka/home-griffin/datatrove/src/datatrove/pipeline/dedup/sentence_dedup.py", line 69, in save_hashes
    f.write(struct.pack("<H", hs.sent_id))
    │ │     │      │          │  └ 312748
    │ │     │      │          └ HashSig(hash_value=6388858470426, doc_id=5179808, file_id=None, sent_id=312748)
    │ │     │      └ <built-in function pack>
    │ │     └ <module 'struct' from '/usr/lib/python3.10/struct.py'>
    │ └ <function LocalFileOpener.write at 0x7f020cfcd990>
    └ <fsspec.implementations.local.LocalFileOpener object at 0x7ec05d247d00>

struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)

I'm processing a large dataset (30B tokens). Is this a numerical overflow issue?

Is there a quick fix? Thank you!

@guipenedo
Collaborator

guipenedo commented Mar 1, 2024

Indeed it is: you have an overflow on the sentence id. The sentence id is written as an unsigned 16-bit integer (the "<H" format), so it cannot exceed 65535, and your log shows sent_id=312748. We do not currently support a config option to change the datatype for the sentence id, so the quickest solution is to find the document with at least 312748 sentences and split it into smaller documents (it is document number 5179808 in this specific input file, based on your log).
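To make the limit concrete, here is a minimal sketch that reproduces the error outside of datatrove using only the standard `struct` module. The helper name `fits_ushort` is hypothetical, and the `"<I"` line only illustrates that a wider format would hold the value; it is not a supported datatrove setting.

```python
import struct

# "<H" is a little-endian unsigned 16-bit integer: 2 bytes, range 0..65535.
USHORT_MAX = 0xFFFF


def fits_ushort(value: int) -> bool:
    """Return True if `value` can be packed with the "<H" format."""
    try:
        struct.pack("<H", value)
        return True
    except struct.error:
        return False


print(struct.calcsize("<H"))       # size in bytes of the "<H" format
print(fits_ushort(USHORT_MAX))     # largest value that still fits
print(fits_ushort(312748))         # the sent_id from the traceback

# Illustration only: an unsigned 32-bit format would hold the same value.
wide = struct.pack("<I", 312748)
print(len(wide))
```

Any sentence id above 65535 triggers the same `struct.error` as in the traceback, regardless of how many documents the dataset contains.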

Note that the issue here is not the number of documents, but rather that one individual document has a very large number of sentences.
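As a rough sketch of the suggested workaround, an oversized document could be split into chunks before it reaches the dedup step. Everything here is an assumption for illustration: `naive_sentences` is a simple regex stand-in rather than the sentence splitter datatrove uses, and `split_document` and its `max_sentences` default are hypothetical names and values. Since the real tokenizer may count sentences differently, it seems safer to keep the limit well below 65535.

```python
import re


def naive_sentences(text: str) -> list[str]:
    """Very rough sentence splitter (assumption: NOT datatrove's tokenizer).

    Splits after '.', '!' or '?' followed by whitespace, and on newlines.
    """
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p for p in parts if p.strip()]


def split_document(text: str, max_sentences: int = 50_000) -> list[str]:
    """Split `text` into chunks of at most `max_sentences` sentences each."""
    if max_sentences <= 0:
        raise ValueError("max_sentences must be positive")
    sentences = naive_sentences(text)
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


# Small demonstration with a tiny limit so the chunking is visible.
chunks = split_document("A. B. C. D. E.", max_sentences=2)
print(chunks)
```

The same sentence count could instead be used to filter out the oversized document entirely, which is what the author ended up doing.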

@griff4692
Author

Thanks. Yes, I just saw that it was on sent_id. I have applied the filter and am re-running. Thank you!
