
fix(doc): align word statistics compound tokens#1706

Merged
fangshuyu-768 merged 1 commit into main from feat/doc-word-stat-compound-token-counting
Jul 2, 2026


@fangshuyu-768 fangshuyu-768 commented Jul 1, 2026

Collaborator

Summary

  • Treat URLs and path-like/code ASCII compounds as one semantic word while preserving per-character counting.
  • Count single standalone ASCII symbol runs and visible Han-adjacent slash separators consistently with Lark GUI behavior.
  • Update the lark-doc word statistics reference to describe the script-aligned counting behavior.
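
A minimal sketch of the first bullet, assuming that a matched compound contributes one word while each of its characters still counts toward the character total. The PR names URL_TOKEN_RE and ASCII_COMPOUND_TOKEN_RE but does not show their definitions, so the patterns below and the helper compound_stats are hypothetical stand-ins, not the script's actual code.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns: the real definitions in doc_word_stat.py are not
# shown in this PR page, so these are illustrative guesses only.
URL_TOKEN_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]+")
ASCII_COMPOUND_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[._/:\-][A-Za-z0-9]+)+")


def compound_stats(token: str) -> Optional[Tuple[int, int]]:
    """Return (word_count, char_count) if token is a compound, else None."""
    for pattern in (URL_TOKEN_RE, ASCII_COMPOUND_TOKEN_RE):
        if pattern.fullmatch(token):
            # One semantic word, but every character is still counted.
            return 1, len(token)
    return None


print(compound_stats("https://example.com/a/b"))
print(compound_stats("doc_word_stat.py"))
print(compound_stats("hello"))
```

A plain word such as "hello" contains no separator, so it falls through to whatever the ordinary per-character path does.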

Verification

  • python3 -m py_compile skills/lark-doc/scripts/doc_word_stat.py
  • Target cached XML JSON fixture: script output 2172 / 4219, matching GUI 2172 / 4219
  • git diff --check

Summary by CodeRabbit

  • Bug Fixes
    • Improved word and character counting for document statistics, including URL-like text, mixed ASCII tokens, and slash-separated Chinese text.
    • Refined symbol counting so single visible symbols are handled more consistently.
  • Documentation
    • Clarified the rules for reading word and character statistics, with more precise counting scope and exceptions.
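
The symbol refinement can be illustrated with a toy counter that treats a run of ASCII symbols as a word once it reaches a minimum length. The function count_symbol_run_words and its min_run parameter are hypothetical; the sketch counts every symbol run, which is a simplification of the "standalone" condition described in the summary.

```python
import string


def count_symbol_run_words(text: str, min_run: int) -> int:
    """Count runs of ASCII punctuation whose length is at least min_run."""
    words = 0
    run = 0
    # The trailing space is a sentinel that flushes a run at the end of text.
    for ch in text + " ":
        if ch in string.punctuation:
            run += 1
        else:
            if run >= min_run:
                words += 1
            run = 0
    return words


sample = "a + b => c"
print(count_symbol_run_words(sample, min_run=2))  # only "=>" qualifies
print(count_symbol_run_words(sample, min_run=1))  # "+" and "=>" both qualify
```

With a threshold of 2 the lone "+" is ignored; lowering it to 1 makes single visible symbols count as well.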

@github-actions github-actions Bot added the domain/ccm (PR touches the ccm domain) and size/M (Single-domain feat or fix with limited business impact) labels Jul 1, 2026
@coderabbitai

coderabbitai Bot commented Jul 1, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f99b502-8868-4436-ba54-a6364b185cd5

📥 Commits

Reviewing files that changed from the base of the PR and between a6797ac and bfb2e7f.

📒 Files selected for processing (2)
  • skills/lark-doc/references/lark-doc-word-stat.md
  • skills/lark-doc/scripts/doc_word_stat.py

📝 Walkthrough

Modifies the word/character counting script so that each ASCII compound token (a URL or an alphanumeric-separator pattern) counts as a single word while its characters are still counted individually, and so that a slash between two Han characters is counted as a word of its own. It also lowers the symbol-run word threshold from 2 to 1 and updates the related documentation describing the counting rules.

Changes

Compound Token Counting Logic

| Layer / File(s) | Summary |
| --- | --- |
| Token detection constants (skills/lark-doc/scripts/doc_word_stat.py) | New compiled regex constants URL_TOKEN_RE and ASCII_COMPOUND_TOKEN_RE added to detect URL-like and compound alphanumeric-separator tokens. |
| Index-based write loop and token handling (skills/lark-doc/scripts/doc_word_stat.py) | Counter.write() reworked into an index-based loop that attempts compound token and Han slash separator matches before falling back to _write_char. New helper methods _write_visible_ascii_separator, _write_ascii_compound_token, and _match_ascii_compound_token classify matched characters and update word/char stats. The _end_symbol_run threshold for counting a symbol run as a word is lowered from length >= 2 to >= 1. |
| Counting rule documentation (skills/lark-doc/references/lark-doc-word-stat.md) | Documentation for word_count and char_count updated to describe refined counting rules, including exceptions for English punctuation attached to words and visible symbol handling. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant CounterWrite as Counter.write
  participant TokenMatcher as _match_ascii_compound_token
  participant SeparatorHandler as _write_visible_ascii_separator
  participant CharWriter as _write_char

  Caller->>CounterWrite: write(text)
  loop for each index in text
    CounterWrite->>TokenMatcher: try match ASCII compound token
    alt token matched
      TokenMatcher->>CounterWrite: consume token, classify chars
    else Han slash separator
      CounterWrite->>SeparatorHandler: detect Han/Han "/" separator
      SeparatorHandler->>CounterWrite: end unit, count separator as word
    else
      CounterWrite->>CharWriter: _write_char(char)
    end
  end
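
The dispatch order in the sequence diagram can be sketched as an index-based loop that tries a compound match first, then a Han/Han slash, and only then falls back to a single character. COMPOUND_RE, is_han, and dispatch are hypothetical names for illustration; the actual Counter.write() and its helpers are not shown on this page, and this version only labels spans rather than updating statistics.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for the PR's URL and ASCII compound regexes.
COMPOUND_RE = re.compile(r"(?:https?://)?[A-Za-z0-9]+(?:[._/:\-]+[A-Za-z0-9]+)+")


def is_han(ch: str) -> bool:
    """Rough Han test covering only the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def dispatch(text: str) -> List[Tuple[str, str]]:
    """Walk text by index and label each consumed span with the branch taken."""
    events: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        m = COMPOUND_RE.match(text, i)
        if m:
            # Branch 1: consume the whole compound token in one step.
            events.append(("compound", m.group()))
            i = m.end()
        elif (text[i] == "/" and 0 < i < len(text) - 1
              and is_han(text[i - 1]) and is_han(text[i + 1])):
            # Branch 2: a visible slash separating two Han characters.
            events.append(("han_slash", "/"))
            i += 1
        else:
            # Branch 3: fall back to per-character handling.
            events.append(("char", text[i]))
            i += 1
    return events


for event in dispatch("中/文 a.b"):
    print(event)
```

An index-based loop, rather than a plain character iterator, is what lets the first branch skip ahead by the length of the matched token.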
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR has Summary and verification, but it omits the required Changes section and Related Issues, and the Test Plan is not in the requested format. | Add a Changes section, rewrite Test Plan using the template’s checkbox format, and include a Related Issues entry. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: updating doc word statistics for compound tokens. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.51%. Comparing base (a6797ac) to head (bfb2e7f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1706   +/-   ##
=======================================
  Coverage   74.51%   74.51%           
=======================================
  Files         850      850           
  Lines       87070    87070           
=======================================
  Hits        64879    64879           
  Misses      17223    17223           
  Partials     4968     4968           
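
As a quick sanity check on the table above, the reported percentage is consistent with hits divided by total lines, where total lines is the sum of hits, misses, and partials. The variable names are illustrative only.

```python
# Figures copied from the coverage diff above.
hits, misses, partials = 64879, 17223, 4968

lines = hits + misses + partials
coverage = round(100 * hits / lines, 2)

print(lines)
print(coverage)
```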


@github-actions

github-actions Bot commented Jul 1, 2026


🚀 PR Preview Install Guide

🧰 CLI update

npm i -g https://pkg.pr.new/larksuite/cli/@larksuite/cli@bfb2e7feb30d9776a49ec685becc52311f5c9ece

🧩 Skill update

npx skills add larksuite/cli#feat/doc-word-stat-compound-token-counting -y -g

@fangshuyu-768 fangshuyu-768 merged commit 3788405 into main Jul 2, 2026
38 checks passed
@fangshuyu-768 fangshuyu-768 deleted the feat/doc-word-stat-compound-token-counting branch July 2, 2026 03:43