Contributor

@jeanchristophe13v jeanchristophe13v commented Nov 17, 2025

Issue & Why it matters

When retrieving help pages with btw_tool_docs_help_page(), the Arguments section contains verbose HTML table markup from tools::Rd2HTML(). This bloats token usage unnecessarily. For example, in the purrr::map help page the Arguments section alone consumes ~630 tokens, much of it presentational markup such as <table role="presentation">, <colgroup>, <tr>, and <td>.

In the spirit of context engineering, we should fix this so that tokens are spent on content rather than markup. The change is minor and the tokens saved may be negligible, but it still reflects the core idea of context engineering and makes the help docs more readable for both humans and AI.

Solution

Added a helper function simplify_help_tables() that extracts semantic content from argument tables and converts them to simple paragraph format before pandoc conversion. It:

  • Parses HTML with xml2::read_html()
  • Finds argument tables (table[role="presentation"])
  • Extracts parameter names and descriptions
  • Converts to: `param`: description

Result: ~27-30% token reduction in Arguments sections, while other sections (Description, Usage, Value, Examples) remain unchanged.
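
The steps listed under Solution can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `arguments_to_flat_markdown()` is hypothetical, and it returns the `` `param`: description `` lines as a character vector instead of rewriting the HTML in place before the pandoc step.

```r
library(xml2)

# Hypothetical sketch of the flat-format idea; not the PR's actual helper.
arguments_to_flat_markdown <- function(html) {
  doc <- read_html(html)

  # Rows of every table marked role="presentation" (the selector named above)
  rows <- xml_find_all(doc, "//table[@role='presentation']//tr")

  out <- character(0)
  for (i in seq_along(rows)) {
    cells <- xml_find_all(rows[[i]], "./td")
    if (length(cells) < 2) next  # skip rows that are not name/description pairs

    name <- trimws(xml_text(cells[[1]]))
    # xml_text() keeps only the text, so nested markup is dropped here
    desc <- gsub("\\s+", " ", trimws(xml_text(cells[[2]])))

    out <- c(out, sprintf("`%s`: %s", name, desc))
  }
  out
}

html <- paste0(
  '<h3>Arguments</h3>',
  '<table role="presentation">',
  '<tr><td><code id=".x">.x</code></td>',
  '<td><p>A list or atomic vector.</p></td></tr>',
  '</table>'
)
arguments_to_flat_markdown(html)
```

Because the description is reduced to plain text, any lists or multiple paragraphs inside a cell are collapsed into one string.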

Testing

Tested with 15 functions from popular packages (ggplot2, dplyr, tidyr, purrr, readr, base, stats, utils). A temporary test script is included in inst/examples/demo_token_savings.R. Run it to see the difference directly, then just delete it :)

Results:

  • Average reduction: 27%
  • Total tokens saved: ~5,100 across test cases
  • Best cases: purrr::map (46%), dplyr::mutate (42%)

Before:

#### Arguments

<table role="presentation">
<tr>
<td><code id=".x">.x</code></td>
<td><p>A list or atomic vector.</p></td>
</tr>
...

After:

#### Arguments

`.x`: A list or atomic vector.

`.f`: A function, specified in one of the following ways: ...

@jeanchristophe13v jeanchristophe13v force-pushed the main branch 3 times, most recently from 683a5e4 to fd563e6 Compare November 17, 2025 04:41
Collaborator

@gadenbuie gadenbuie left a comment

Thanks @jeanchristophe13v, this is definitely a good idea and the extra table formatting elements are an unfortunate accident!

As you can see from the snapshot that changed, the intention was to have arguments presented as markdown tables, but the problem is that when arguments have descriptions that include more than one paragraph, pandoc can't convert the table into a simple markdown table and instead uses raw HTML.

My preference would be to re-format the arguments table to use headings for each argument, as that will naturally support argument descriptions regardless of content.

The final result should look something like this:

#### Arguments

##### `.x`

A list or atomic vector.

##### `.f`

A function, specified in one of the following ways: ...

(Note that btw_tool_docs_help_page() shifts heading levels up by one, so these are h3 and h4 headings in the HTML source.)

Also, I'd prefer that simplify_help_tables be named something like simplify_help_page_arguments() and live in R/tool-docs.R. In that vein, we should also be careful that we're only finding the arguments table in the arguments sub-section and we shouldn't modify any other tables.

Contributor Author

@jeanchristophe13v jeanchristophe13v left a comment

Thanks for the feedback! I tested my original flat format with purrr::map's .f argument and found it was merging all the bullet points into one paragraph. I read some papers and confirmed presenting options as bullet points generally outperforms using plain descriptions[1].

I compared three approaches:

  1. Current flat format (~339 tokens) - but loses list structure
  2. Improved flat format (~341 tokens) - preserves paragraphs but still flattens lists
  3. Your heading format (~415 tokens) - preserves everything

The heading format uses about 22% more tokens, but I think it's worth it to keep those bullet points intact.

Changes made:

  • Renamed simplify_help_tables() to simplify_help_page_arguments() and moved it into R/tool-docs.R
  • Now only targets the Arguments section (uses //h3[normalize-space(text())='Arguments'] since R help HTML doesn't set id attributes)
  • Uses <h3> tags that become #### after the shift
  • Preserves full HTML structure including lists, multiple paragraphs, code blocks, etc.
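
A rough sketch of the heading-based conversion described in this list is shown below. It is an illustration under stated assumptions, not the merged implementation: `arguments_to_headings()` and `escape_html()` are hypothetical names, the per-argument `<h4>` level follows the review comment above, and taking the first following `table[role="presentation"]` sibling of the Arguments heading is a simplification. Only the XPath for the heading itself is taken from this thread.

```r
library(xml2)

# Hypothetical helper: escape text before placing it inside markup
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Hypothetical sketch; returns heading-style HTML for the Arguments table only
arguments_to_headings <- function(html) {
  doc <- read_html(html)

  # Heading XPath quoted from the PR; the sibling-table step is an assumption
  tbl <- xml_find_first(
    doc,
    paste0(
      "//h3[normalize-space(text())='Arguments']",
      "/following-sibling::table[@role='presentation'][1]"
    )
  )
  if (inherits(tbl, "xml_missing")) return(character(0))

  rows <- xml_find_all(tbl, ".//tr")
  parts <- character(0)
  for (i in seq_along(rows)) {
    cells <- xml_find_all(rows[[i]], "./td")
    if (length(cells) < 2) next

    name <- trimws(xml_text(cells[[1]]))

    # Serialize the description's child nodes so lists, paragraphs, and
    # code blocks keep their markup
    contents <- xml_contents(cells[[2]])
    body <- vapply(
      seq_along(contents),
      function(j) as.character(contents[[j]]),
      character(1)
    )

    parts <- c(
      parts,
      sprintf("<h4><code>%s</code></h4>", escape_html(name)),
      body
    )
  }
  paste(parts, collapse = "\n")
}
```

Tables under other headings are never selected, which matches the request to leave non-argument tables alone. The trade-off is the extra heading markup per argument, in exchange for keeping each description's structure intact.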

All tests pass now. Let me know if there's anything else that needs adjusting.

[1] Han, Y., Wu, Y., & Willard, J. (2025). Effect of Selection Format on LLM Performance. arXiv preprint. https://doi.org/10.48550/arXiv.2503.06926

Collaborator

@gadenbuie gadenbuie left a comment

Thanks again @jeanchristophe13v!

Two notes for future reference:

  1. I appreciate the pull requests! But it's still useful to start with an issue so we can talk through the approach. I don't mind that you started with a PR here, as long as you're okay with me potentially asking for some larger changes or recommending an entirely different approach. Just something to keep in mind.

  2. It's best not to create pull requests from the main branch of your fork. usethis has some excellent helper functions for managing pull requests, and I highly recommend the usethis workflow. Using feature branches keeps your main branch clean and makes it easier to stay up to date, especially if the PR is squash-merged (squashed into a single commit when merged).

These are both minor things; I appreciate your contributions!

@gadenbuie gadenbuie merged commit bea5b96 into posit-dev:main Nov 17, 2025
7 of 11 checks passed
@jeanchristophe13v
Contributor Author

@gadenbuie I really appreciate your guidance on best practices. I’ll take time to learn more about contributing to R open-source projects and proper GitHub workflows to avoid these issues in the future.

Thanks for your contributions and for being so understanding!
