New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some cpp comment stripping parsing #119
Conversation
The fix for cross-references didn't work. I had to resort to using an explicit role for the cross-refs to actually get resolved. Otherwise they all just show as italics (using a I'm not sure what is going on here, but I don't like having to explicitly use a role name for simple cross-refs (especially when there aren't duplicates and no warnings). |
4c24ed0
to
be91a02
Compare
Turns out the issue is that the C++ apigen extension inserts I'm working on a fix for this and will send out a PR. Separately, though, the |
Thanks! How about adding some unit tests of the new comment parsing, and other syntax improvements? |
Regarding |
I'm working on this now... I think I'll use several tests: 1 for each supported Doxygen cmd and another to test various comment syntaxes. |
I added a couple unit tests. These can be refined as we go, but I just wanted to test the additions in this PR. BTW, I had to alter array.h (see ebfe2ba) because the subsequent line's indentation was getting misinterpreted within the The /** A short description. */
// This is not a docstring, but might get lumped into `raw_comment`'s value
auto func();
// This non-docstring comment will cause the regex check for comment prefix to
// skip the following docstr (when using `raw_comment`)
/*! This is a docstring. */
auto func(int); It would be easy to adjust the new |
tests/cpp_api_parser_test.py
Outdated
) | ||
|
||
output = api_parser.generate_output(config) | ||
doc_strings = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to test each of these individually, so that there is just a single entity and you can just exactly check it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I hadn't done it this way, then I never would've guessed that the re.sub(brief|details, "", txt)
call needed the re.MULTILINE
flag.
Are you ok with testing each regex pattern per test, instead of testing each supported command per test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why this particular test structure helped spot that issue. But I just meant that it would be clearer to have the test just generate one entity at a time, so that each test case is more isolated, at least for these comment syntax tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that the re.sub() call was only replacing the first \\brief
on the first line but not \\details
on the third line of the same comment... The multiline flag fixed that.
I could split up this test into per comment syntax form. Initially, I thought this comment was for the other test_function_fields()
test.
Looking at the regex patterns, I don't know why you're looking for some fields like checks
, dchecks
, schecks
(not really sure what they are supposed to represent). I get the pre
, post
, and invariant
commands because those actually exist for Doxygen. The error
(which isn't treated like an admonition) and requires
(which seems equivalent to @concept
) commands seem to be something you added (they don't exist for Doxygen).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, checks, error, and requires are "custom" commands I added for tensorstore, along with corresponding custom sphinx docfields that I left out. I agree they shouldn't be provided by default --- it would be good instead to have a mechanism for custom doxygen commands though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can look into implementing something similar to Doxygen's ALIASES
setting. That is how breathe supports raw RST in Doxygen's XML output. Theoretically it can be used to add custom commands like the ones you're adding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I split up the comment stripping tests into a class
of tests. With using raw_comment
, the test for trailing docstring comment is passing. 🎉
I could split up the test about function fields, but I think that can be done after the _normalize_doc_comment()
function matures more (it is currently pretty limited).
for line in body | ||
] | ||
body[-1] = body[-1].rstrip("*/").rstrip() | ||
body = dedent("\n".join(body)).splitlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why join with "\n" and then split again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see there is a dedent in there. But I think we should only dedent multiline comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is designed to handle both single line comments and multi-lined comment blocks (provided by raw_comment
).
Stripping the leading whitespace from each line after the comment syntax is stripped away isn't great for code blocks and MD style blockquotes.
/**
* @code{.cpp}
* if (debugLevel) {
* printf("some output"); // the leading whitespace needs to be preserved here
* }
* @endcode
*/
/// This is a blockquote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think we should only dedent multiline comments.
I added a condition to skip dedenting single line comments, but the doc builds fail because the leading whitespace is now being interpreted as content to the ext's generated .. highlight:: cpp
directive.
<cpp_apigen_rst_prolog>:6: ERROR: Error in "highlight" directive: no content permitted. .. highlight:: cpp Returns the data order.
Kinda glad I did the rebase on main now (this would've gone unnoticed until much later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, A single line comment (with respect to raw_comment
) is a block that has 1 line
/** A single line comment */
/// Not a single line comment because
/// it has multiple lines in sequence
So, I'm not sure what the advantage is for not dedenting a single line comment. Maybe we could delegate this to regex with re.sub(r"^(\s{0, 3})\S", "", text)
(notice there's no multiline flag). Any other concerns I can think of all require multiple lines (eg. lists)
Thanks
Is the change to
Does doxygen allow non-doc comments to be interspersed with doc comments? I guess it could be reasonable to allow, not sure how common it would be.
It sounds like the current approach, of processing each comment token individually, works fine. What do you see as the advantage to using I suppose one issue with the current approach is how to handle stripping of a single leading space on single-line doc comments: ///Is this a doc comment?
/// And is this indented relative to it? |
It mostly because Doxygen does not allow the continuation of a class DummyClass
{
/**
* Some docstring.
* @param arg This is 1 sentence of a paragraph.
* This is the second sentence of the same paragraph.
* @note This is a new paragraph because Doxygen doesn't allow
* nesting certain commands that expect a single paragraph
* @par
* This is a second paragraph in the note because we explicitly use the `par` command
* @warning This is a new paragraph
* @parblock
* This is a second paragraph because we use the `parblock` command.
*
* This is a third paragraph in the warning.
* @endparblock
* This is a stray paragraph not in the warning.
*/
void fun(bool arg);
} The """/**
* Some docstring.
* @param arg This is 1 sentence of a paragraph.
* This is the second sentence of the same paragraph.
* @note This is a new paragraph because Doxygen doesn't allow
* nesting certain commands that expect a single paragraph
* @par
* This is a second paragraph in the note because we explicitly use the `par` command
* @warning This is a new paragraph
* @parblock
* This is a second paragraph because we use the `parblock` command.
*
* This is a third paragraph in the warning.
* @endparblock
* This is a stray paragraph not in the warning.
*/""" with the line endings obviously.
Doxygen does allow this, and I agree that it isn't all that common. I'm just trying to think of ways that might break our approach. To compensate for this wouldn't be hard (using regex). It just something on the radar, not a really high priority.
This is why I join the stripped comment, dedent it, and split it back up again. Generally, Doxygen doesn't have to worry if the whitespace after the comment syntax isn't uniform because Markdown indents are at least 3 spaces or more. I should note that Doxygen does mandate that comments using the ///
///This is a doc comment
/// This is another doc comment within the same paragraph.
///
We need to improve the traversal better. Currently, it only returns a comment token if the comment is a single line. It is much more common to use struct dummy
{
int var; ///< This is the docstring.
} However, I just think its over complicated to ignore clang's functionality and parse the source in python (even though clang has already done it for us). |
I was attempting to support both rST syntax and a subset of doxygen syntax at the same time. That also would make it relatively easy to migrate incrementally. Do you think that is still feasible, or do we have to have a configuration option to interpret as either doxygen syntax or rST syntax, which would make it significantly more difficult to migrate? Maybe we can just convert the continuation lines after these doxygen commands appropriately.
Okay, I guess in any case when using
I would prefer to relax this blank line restriction --- unless it serves some important purpose.
If |
We can provide a config option; that was my original intention. However, people using breathe may already have their custom doxygen aliases setup to inject actual RST in their doc strings. This often looks like @rst
.. note::
This is an admonition.
This is a second paragraph in the note.
@endrst Where |
There's a huge difference between Doxygen syntax (which is mostly markdown) and RST syntax. I don't think expecting to parse a mix of the 2 is sustainable. |
What would the migration path be, then? My own opinion is that rST syntax is better than doxygen syntax, and I imagine that may naturally be a common opinion among those using Sphinx for documentation --- when using doxygen rST syntax is not an option, but here it is. But if it has to be toggled via a global option, then it would be difficult for users to gradually make use of more rST functionality. Instead they would have to convert all of their comments at once. I suppose the user could have two entries in Alternatively we could have a separate regular expression to indicate which symbols are documented using doxygen syntax. |
If they're migrating from Doxygen, then there will inevitably need to be changes made to the comments. If they're migrating from breathe, then it should be less painless because they're likely already writing their comments in rst (albeit encapsulated in Am I understanding your use of "migrating" correctly? I agree in that RST is the superior documenting syntax. I often look at motivations for using mkdocs or MyST parser as seriously flawed opinions. My grievances against Doxygen are endless... |
Yes, perhaps we should support them out of the box for compatibility with breathe. Does breathe somehow define these automatically, or is it up to the user to define these as aliases in their doxygen config, and these are just the recommended names? If they are just the recommended names, perhaps we should just allow the user to specify them like any other alias.
I'm thinking of the case where someone has a large existing codebase that they have been using doxygen or doxygen+breathe on, and they wish to switch to this extension, and also as they write new code avoid having to litter every doc comment with If there is just a global option, then they would have to convert all of their existing doc comments all at once. |
Are you going to change this to use |
Breathe recommends these names but does not automatically define them. Some people use exhale which automatically define these commands (based on the breathe recommendations). FYI, exhale is just a wrapper for breathe that aims to be more automated and a quicker onboarding of breathe into projects; it also includes some patches for breathe...
yes, I've already got a stash for this. Currently, it will default to the line-by-line search if |
The indentation removed here (on a subsequent line) is interpretted as a blockquote while still considered within the :param: field. This change also satisfies what the the Doxygen parser expects; meaning unexpected indentation is prohibited amidst single a multi-lined paragraph.
722bdad
to
81ffc73
Compare
adjust tests about leading whitespace for single line comments
7ca7c5c
to
808efa8
Compare
I think this currently doesn't handle the case where you have multiple consecutive comments of different types, e.g. a |
Correct. I think that's a rare scenario but I could do some extra handling in |
- reverted xrefs in Config docs (now that roles are fixed) - removed old algorithm for parsing docstring line-by-line - added regex to remove non-docstring comments from a raw_comment - revised docstring parsing test with pytest.mark.paramtrize - added 2 tests to check for proper removal of non-docstring comments
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
also added IDs to the growing parametrized test_comment_styles()
Thanks! This looks ready to merge! |
Sorry about the messy git history, I was kinda hoping this would get squashed. Also sorry for the delay, that other project that needed my attention has turned into a "whack-a-mole" carnival game. Ironically, it also uses the clang project... I'm going to need a vacation from clang when we get this feature finished. 😂 |
* fix cross-refs in docs * enable cpp.apigen verbosity and show number of found decls * adjust clang imports for quicker edits * fix windows path separator compensation * support `\` and `@` as cmd prefixes - support for param direction - support for retval cmd - add new strip_comment() to strip all forms of C++ comment syntax from comment tokens' text - use new strip_comment() instead of text.lstrip() This will also preserve consistent indentation that is needed for code-blocks (which isn't supported yet) - modify index_interva.h to test some of the new features * use explicit role for cross-refs * admonition importance of Linux path separators * change admonished text (per request) * change erroneous admonition text * allow blank docstr lines to get normalized - add multiline flag for re.sub(brief/details) call - ran black on api_parser.py * add some unit tests * add a blank line to array.h docstr * change array.h until we support multline comments The indentation removed here (on a subsequent line) is interpretted as a blockquote while still considered within the :param: field. This change also satisfies what the the Doxygen parser expects; meaning unexpected indentation is prohibited amidst single a multi-lined paragraph. * requested changes * try to get raw_comment first * update tests about comment stripping * only dedent multiline comments during stripping adjust tests about leading whitespace for single line comments * [no ci] remove outdated conditional statement * latest review requests - reverted xrefs in Config docs (now that roles are fixed) - removed old algorithm for parsing docstring line-by-line - added regex to remove non-docstring comments from a raw_comment - revised docstring parsing test with pytest.mark.paramtrize - added 2 tests to check for proper removal of non-docstring comments * return None when no raw_comment exists * replace non-doc comments with blank lines also added IDs to the growing parametrized test_comment_styles() * satisfy review request about demo src (`\returns`)
* fix cross-refs in docs * enable cpp.apigen verbosity and show number of found decls * adjust clang imports for quicker edits * fix windows path separator compensation * support `\` and `@` as cmd prefixes - support for param direction - support for retval cmd - add new strip_comment() to strip all forms of C++ comment syntax from comment tokens' text - use new strip_comment() instead of text.lstrip() This will also preserve consistent indentation that is needed for code-blocks (which isn't supported yet) - modify index_interva.h to test some of the new features * use explicit role for cross-refs * admonition importance of Linux path separators * change admonished text (per request) * change erroneous admonition text * allow blank docstr lines to get normalized - add multiline flag for re.sub(brief/details) call - ran black on api_parser.py * add some unit tests * add a blank line to array.h docstr * change array.h until we support multline comments The indentation removed here (on a subsequent line) is interpretted as a blockquote while still considered within the :param: field. This change also satisfies what the the Doxygen parser expects; meaning unexpected indentation is prohibited amidst single a multi-lined paragraph. * requested changes * try to get raw_comment first * update tests about comment stripping * only dedent multiline comments during stripping adjust tests about leading whitespace for single line comments * [no ci] remove outdated conditional statement * latest review requests - reverted xrefs in Config docs (now that roles are fixed) - removed old algorithm for parsing docstring line-by-line - added regex to remove non-docstring comments from a raw_comment - revised docstring parsing test with pytest.mark.paramtrize - added 2 tests to check for proper removal of non-docstring comments * return None when no raw_comment exists * replace non-doc comments with blank lines also added IDs to the growing parametrized test_comment_styles() * satisfy review request about demo src (`\returns`)
* Add default_literal_role * Add C++ apigen extension * Add default-literal-role and highlight-{push,pop} directives Also adds facilities for saving/restoring default role state. * Add {python_apigen,cpp_apigen,json_schema}_rst_{prolog,epilog} config options This also ensures that the default roles and highlight language are restored after generated JSON/C++/Python object descriptions are inserted. * Apply suggestions from code review Co-authored-by: Brendan <2bndy5@gmail.com> * Some cpp comment stripping parsing (#119) * fix cross-refs in docs * enable cpp.apigen verbosity and show number of found decls * adjust clang imports for quicker edits * fix windows path separator compensation * support `\` and `@` as cmd prefixes - support for param direction - support for retval cmd - add new strip_comment() to strip all forms of C++ comment syntax from comment tokens' text - use new strip_comment() instead of text.lstrip() This will also preserve consistent indentation that is needed for code-blocks (which isn't supported yet) - modify index_interva.h to test some of the new features * use explicit role for cross-refs * admonition importance of Linux path separators * change admonished text (per request) * change erroneous admonition text * allow blank docstr lines to get normalized - add multiline flag for re.sub(brief/details) call - ran black on api_parser.py * add some unit tests * add a blank line to array.h docstr * change array.h until we support multline comments The indentation removed here (on a subsequent line) is interpretted as a blockquote while still considered within the :param: field. This change also satisfies what the the Doxygen parser expects; meaning unexpected indentation is prohibited amidst single a multi-lined paragraph. * requested changes * try to get raw_comment first * update tests about comment stripping * only dedent multiline comments during stripping adjust tests about leading whitespace for single line comments * [no ci] remove outdated conditional statement * latest review requests - reverted xrefs in Config docs (now that roles are fixed) - removed old algorithm for parsing docstring line-by-line - added regex to remove non-docstring comments from a raw_comment - revised docstring parsing test with pytest.mark.paramtrize - added 2 tests to check for proper removal of non-docstring comments * return None when no raw_comment exists * replace non-doc comments with blank lines also added IDs to the growing parametrized test_comment_styles() * satisfy review request about demo src (`\returns`) * Add warning that C++ apigen is experimental * Add missing types-clang dependency * Add pydantic.mypy plugin * Change "libclang" extra -> "cpp" Co-authored-by: Brendan <2bndy5@gmail.com>
Includes
\
and@
prefixed Doxygen commands@retval
commandstrip_comment()
to accommodate for all forms of C++ comment syntaxI also enabled verbosity in the cpp.apigen demo and added a line that shows the number of declarations that will be parsed.