Conversation

@pcuenca (Member) commented Sep 11, 2025

There was a case I had not considered 😢

Fixes #96

Copilot AI left a comment

Pull Request Overview

This PR fixes Llama tokenizer behavior by properly handling prefix tokens when isLegacy is false. The fix addresses issue #96, which reports that tokenizer output differs from Python transformers.

  • Adds test case to verify legacy Llama tokenizer behavior
  • Modifies property visibility in PreTrainedTokenizer for subclass access
  • Implements tokenize override in LlamaPreTrainedTokenizer to handle prefix tokens correctly

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Tests/TokenizersTests/TokenizerTests.swift: adds a test case to verify legacy Llama tokenization behavior.
Sources/Tokenizers/Tokenizer.swift: changes property visibility and adds a tokenize override for proper prefix token handling.
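
For context around the excerpt quoted in the review comment below, here is a minimal sketch of how the non-legacy prefix handling could fit together. It is not the shipped diff: the class name LlamaTokenizerSketch is invented, it assumes isLegacy, sentencePieceUnderline, and specialTokens are reachable from the subclass after the visibility change noted above, and dropping the lone underline token is an inference from the condition shown in the excerpt.

import Tokenizers

class LlamaTokenizerSketch: PreTrainedTokenizer {
    override func tokenize(text: String) -> [String] {
        // Legacy tokenizers keep the previous behavior unchanged.
        guard !isLegacy else { return super.tokenize(text: text) }

        // Non-legacy: prepend the SentencePiece underline after turning any
        // literal underline markers in the input back into spaces.
        let prepared = sentencePieceUnderline + text.replacingOccurrences(of: sentencePieceUnderline, with: " ")
        let tokens = super.tokenize(text: prepared)

        // If the input began with a special token, the added prefix appears as a
        // lone underline token; drop it so the special token stays first.
        if tokens.first == sentencePieceUnderline,
           let second = tokens.dropFirst().first,
           specialTokens[second] != nil {
            return Array(tokens.dropFirst())
        }
        return tokens
    }
}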

let tokens = super.tokenize(text: sentencePieceUnderline + text.replacingOccurrences(of: sentencePieceUnderline, with: " "))

let second = tokens.dropFirst().first
if tokens.first == sentencePieceUnderline, second != nil, specialTokens[second!] != nil {

Copilot AI Sep 12, 2025

Using a force unwrap (second!) is risky. Consider using optional binding or a guard statement, for example: if tokens.first == sentencePieceUnderline, let secondToken = second, specialTokens[secondToken] != nil {

Suggested change
- if tokens.first == sentencePieceUnderline, second != nil, specialTokens[second!] != nil {
+ if tokens.first == sentencePieceUnderline, let secondToken = second, specialTokens[secondToken] != nil {

@pcuenca merged commit ec0ae14 into main Sep 12, 2025
2 checks passed
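
With the fix merged, a quick way to spot-check the behavior from a client is sketched below. This is a hedged example: the repo id is a placeholder, and AutoTokenizer.from(pretrained:) and tokenize(text:) are assumed to be the relevant loading and tokenization entry points in the Tokenizers module.

import Tokenizers

// Run inside an async context. Placeholder repo id; any Llama-style tokenizer
// configured with isLegacy == false applies.
let tokenizer = try await AutoTokenizer.from(pretrained: "<org>/<llama-tokenizer-repo>")

// Text starting with a special token is the kind of case the prefix handling covers;
// after the fix the output is expected to line up with Python transformers.
let tokens = tokenizer.tokenize(text: "<s>Hello")
print(tokens)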