
Fix token sampling underflow for short token datasets#17

Merged
maderix merged 1 commit into maderix:main from TastyHeadphones:tastyheadphones/short-dataset-underflow-fix on Mar 4, 2026

Conversation

@TastyHeadphones (Contributor)

Summary

  • add a dataset length guard after mmap in train_large.m
  • prevent size_t underflow in max_pos = n_tokens - SEQ - 1
  • return early with a clear error when the token file is too short for one training window

Why

max_pos is computed in unsigned size_t arithmetic. If n_tokens <= SEQ + 1, the subtraction wraps around to a huge positive value, which can produce invalid random sampling offsets and out-of-bounds reads from the mapped token file.

Validation

  • built successfully with make train_large in training/

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…size_t wraparound on short datasets in both train_large variants
maderix merged commit 3efa27d into maderix:main on Mar 4, 2026
@maderix (Owner) commented Mar 4, 2026

Thanks for catching this — clean fix for a real underflow bug. Merged!

