fix(tokenizer): fix incorrect tokenize offset calculation with multiple spaces by lovit · Pull Request #287 · lovit/soynlp

lovit · 2026-03-10T17:07:42Z

Summary

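Fixes a bug where combining sentence.split() with offset += len(token) + 1 produced incorrect Token.begin/end values when the input contained consecutive spaces.
Replaces that logic with re.finditer(r'\S+', sentence), which reads each eojeol's actual position directly from the match.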
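
The difference between the two approaches can be reproduced outside soynlp. The following is a minimal sketch, not the library's actual code: the function names and the plain (text, begin, end) tuples are hypothetical stand-ins for the real tokenize function and Token type.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for soynlp's Token: (text, begin, end).
Span = Tuple[str, int, int]


def spans_with_split(sentence: str) -> List[Span]:
    """Old approach: assumes exactly one space separates eojeols."""
    spans = []
    offset = 0
    for token in sentence.split():
        spans.append((token, offset, offset + len(token)))
        # Advances by one separator only, so extra spaces are never counted.
        offset += len(token) + 1
    return spans


def spans_with_finditer(sentence: str) -> List[Span]:
    """New approach: take positions directly from each regex match."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


sentence = "hello  world"  # two spaces between the words
print(spans_with_split(sentence))     # 'world' is reported at begin=6
print(spans_with_finditer(sentence))  # 'world' is reported at begin=7
```

Because m.start() and m.end() come from the match itself, the result does not depend on how many whitespace characters separate the eojeols.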

Before/After

# Before
tokenize('hello  world', return_words=False)
# Token(world, begin=6, ...)  ← actual position is 7  ❌

# After
tokenize('hello  world', return_words=False)
# Token(world, begin=7, ...)  ✓

Related issue

Closes #280

🤖 Generated with Claude Code

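The sentence.split() + offset += len + 1 approach produced incorrect Token positions when the input contained consecutive spaces. Replaced it with re.finditer(r'\S+', sentence), which reads each eojeol's actual start position directly. Added unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>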
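
The commit message mentions unit tests but this page does not show them. As a rough sketch of the kind of invariant such a test could check (the helper eojeol_spans and the sample inputs are hypothetical, not taken from soynlp's test_tokenizers.py), each reported span should slice back to its own text:

```python
import re


def eojeol_spans(sentence):
    # Hypothetical helper mirroring the re.finditer-based approach.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


def check_spans(sentence):
    spans = eojeol_spans(sentence)
    # Every span must slice back to exactly the text it reports.
    for text, begin, end in spans:
        assert sentence[begin:end] == text
    # The token texts themselves should match a plain whitespace split.
    assert [text for text, _, _ in spans] == sentence.split()
    return spans


# Sample inputs chosen for illustration: single, double, leading,
# and trailing whitespace, plus the empty string.
for sample in ["hello world", "hello  world", "  leading", "tab\tseparated  text ", ""]:
    check_spans(sample)
```

Leading whitespace is worth including because the split-based offset logic starts counting at 0 and would misplace the first token as well.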

Removes the conflict marker (<<<<<<< HEAD) that was mistakenly included when PR #287 was merged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lovit force-pushed the feature/280 branch from e4f51e5 to f5db4e2 Compare March 10, 2026 17:42

lovit merged commit 013f568 into refactor-2026 Mar 10, 2026
0 of 2 checks passed

lovit added a commit that referenced this pull request Mar 10, 2026

fix(tokenizer): remove conflict markers from test_tokenizers.py

8665ddb

