
Fix token sampling underflow for short token datasets#17

Merged
maderix merged 1 commit into maderix:main from TastyHeadphones:tastyheadphones/short-dataset-underflow-fix on Mar 4, 2026

Conversation

@TastyHeadphones (Contributor)

Summary

  • add a dataset length guard after mmap in train_large.m
  • prevent size_t underflow in max_pos = n_tokens - SEQ - 1
  • return early with a clear error when the token file is too short for one training window

Why

max_pos is computed in unsigned size_t arithmetic. If n_tokens <= SEQ + 1, the subtraction wraps around to a huge positive value, which can produce invalid random sampling offsets and out-of-bounds reads from the mapped token file.

Validation

  • built successfully with make train_large in training/

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…size_t wraparound on short datasets in both train_large variants
maderix merged commit 3efa27d into maderix:main on Mar 4, 2026
@maderix (Owner) commented Mar 4, 2026

Thanks for catching this — clean fix for a real underflow bug. Merged!

