Skip to content

fix: caseless match for arm64 and proper check for win32#97

Merged
MasterJH5574 merged 2 commits into
mlc-ai:mainfrom
vinovo:fix/paul/cmake
May 19, 2026
Merged

fix: caseless match for arm64 and proper check for win32#97
MasterJH5574 merged 2 commits into
mlc-ai:mainfrom
vinovo:fix/paul/cmake

Conversation

@vinovo
Copy link
Copy Markdown
Contributor

@vinovo vinovo commented May 16, 2026

Summary

Two small CMake fixes that make tokenizers-cpp build robustly on Windows
across different compilers, generators, and toolchain files. The previous
logic happened to work for the common MSVC + Visual Studio generator case
but was not robust to other valid Windows configurations.

Changes

1. Pick .lib vs .a based on the OS, not the host compiler

rustc on Windows always emits tokenizers_c.lib regardless of which
C/C++ compiler CMake is configured with — the file name is chosen by Rust,
not by the host compiler. The previous gate if(MSVC) is only true for
cl.exe, so any other Windows compiler (where MSVC is FALSE) was
expecting libtokenizers_c.a. The custom command's OUTPUT was never
produced and the post-build copy step failed.

Switching the gate to if(WIN32) matches what Rust actually produces and
makes the rule depend on the target OS rather than on which host compiler
happens to be in use.

2. Case-insensitive CMAKE_SYSTEM_PROCESSOR match on Windows

The Windows branch only accepted the exact spellings ARM64 (uppercase)
or aarch64 (lowercase). The uppercase form is filled in by some
generators (e.g. Visual Studio invoked with -A ARM64); other valid
configurations such as Ninja + a toolchain file may use mixed/lowercase
arm64, which silently fell through to the else branch and selected
x86_64-pc-windows-msvc. Rust then cross-built the wrong architecture
and the resulting .lib could not link against the rest of the arm64
build.

Normalizing CMAKE_SYSTEM_PROCESSOR with string(TOUPPER ...) before
comparing makes the detection robust to whichever spelling the generator
or toolchain file happens to use.

Bonus: sentencepiece submodule bump

This PR also bumps the sentencepiece submodule from 11051e3 to a899e9a, picking up upstream's migration from the bundled abseil-compatibility shim to the official Abseil library (LTS 20260107.1, fetched via FetchContent at configure time). The old third_party/absl/* shim has been removed upstream; the build now pulls real Abseil and creates a sentencepiece/third_party/absl -> abseil-cpp/absl symlink (file(CREATE_LINK ... SYMBOLIC)).

The public SentencePieceProcessor API surface that src/sentencepiece_tokenizer.cc depends on (LoadFromSerializedProto, Encode, Decode, GetPieceSize, IdToPiece, PieceToId) is unchanged, so no source changes are required in tokenizers-cpp. The default SPM_ABSL_PROVIDER is now "module" (previously "internal"); this is left at the upstream default.

Verification

Verified end-to-end on Windows arm64 (MSVC 19.50, VS 2026 generator, with the two CMake fixes above):

  • Configure: FetchContent clones abseil-cpp at tag 20260107.1; symlink creation under sentencepiece/third_party/absl succeeds.

  • Build: Abseil + sentencepiece + Rust tokenizers-c + tokenizers_cpp.lib all compile and link cleanly.

  • Runtime: ran example/example.exe against the four tokenizer fixtures used by build_and_run.sh:

    Tokenizer Asset Result
    SentencePiece tokenizer.model (Vicuna 7B) IDs [1724, 338, 278, 29871, 7483, 310, 7400, 29973], decode round-trip ✅, vocab 32000
    Huggingface JSON tokenizer.json (RedPajama-3B) round-trip ✅, vocab 50277
    Huggingface BPE vocab.json + merges.txt (Qwen2.5-3B) round-trip ✅, vocab 151643
    RWKV World tokenizer_model round-trip ✅, vocab 65529

    The SentencePiece IDs match the expected LLaMA SP tokenization (including the 29871 whitespace marker for the doubled space), IdToPiece/PieceToId round-trip on the sampled IDs, and the decode assertion in TestTokenizer passes — confirming the abseil migration is behavior-preserving for the API used by this project.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The changes in CMakeLists.txt improve the robustness of the Windows build configuration. Specifically, the processor check for ARM64 is now case-insensitive by converting the processor string to uppercase, and the library extension logic has been updated to use WIN32 instead of MSVC to ensure correct naming across different Windows compilers. I have no feedback to provide.

@vinovo
Copy link
Copy Markdown
Contributor Author

vinovo commented May 18, 2026

@MasterJH5574 Mind giving this a review? We have windows build failing with non-MSVC compiler (e.g. clang). Thank you!

@MasterJH5574
Copy link
Copy Markdown
Member

@vinovo Thanks for pinging me. The PR looks good, thank you!

@MasterJH5574 MasterJH5574 merged commit f135a71 into mlc-ai:main May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants