Doccursor #217

BrunoMSantos · 2023-11-16T19:04:27Z

Based on an old attempt by me and on #206.

I think this is worthwhile merging already, minus any review concern, so I'm submitting it as such. Given the amount of changes I wouldn't be surprised if it needs some iterations, but I'm quite happy with it already.

I did not go as far as the original in simplifying docstring. But that is independent work which shouldn't block this I think.

Anyway, I took a couple of early commits from #206 with maybe some minor changes. The rest kind of took a different approach, but with plenty of inspiration drawn from the original, which forced me to find ways around issues I wouldn't have tackled otherwise. So thanks for that @jnikula :)

-->8--

On to real concerns:

One possible concern is with 88bc569 and f46d74f, which are simple, but a bit chunkier than I'd like. I can look into breaking those down if need be. When I bundled those changes together, they were simpler, but then tiny little things had to be bolted on as I revisited earlier commits...

The biggest architectural difference is with the single cursor object, which I named DocCursor (goes with Docstring I guess).

Originally, I didn't consider merging both ideas and had already invested too much on the current implementation, but I did realize I could have something like this to better group related code:

class _Base:
   ...
class _Struct(_Base):
   ...
...

# This would be the only type in the API.
class DocString:
    def __init__(self, cc, ...):
        if cc.kind == STRUCT:
            self._backend = _Struct(...)
        ...

I tried going that way, but it would take too much work rebasing though, so I'd rather attempt that later instead of rewriting everything.

One thing I'm quite certain though: I love how the parser turned out with the single cursor type. It did require a tiny trick with the _COMMENTS variable and an init function, but I think it's well worth it. I also had to address the domain change on extern "C" {} differently to make it look nice. I'm quite stoked with the end result though. Those bits of magic are all in e9c1eee.

Another difference is in how I handled the multiple calls to certain big methods. The original initialized things on construction, I implemented a caching mechanism which I find pretty elegant and should do the job even better on account of being lazier :)

Finally, another thing I'm unsure about is returning None from cursor properties that don't apply to a given cursor or throw a meaningful message. Not a big deal to let it go for now I think.

At the risk of duplication, add more dedicated _get_header_line() to make it easier to understand what is needed where.

We want to wrap these in an object, but we'll try to do so incrementally. By moving this code into our own `doccursor` module, it makes sense that we import `CursorKind` and `TokenKind` from it instead of from Clang's library despite them being the same by design.

jnikula · 2023-11-16T20:28:27Z

Okay, I'm not very far in reviewing, but I'm super uneasy about DocCursor._COMMENTS. How does that work if you have multiple documents being parsed in parallel?

BrunoMSantos · 2023-11-16T20:34:55Z

Okay, I'm not very far in reviewing, but I'm super uneasy about DocCursor._COMMENTS. How does that work if you have multiple documents being parsed in parallel?

That may well be one of those things that didn't cross my sleep deprived mind 😅

Can be fixed by passing it to the instance's scope though, all the copies should still point to the same dict, no waste either way I believe. I literally did it that way because I was thinking it made it clearer there was only one.

jnikula · 2023-11-16T20:39:22Z

Can be fixed by passing it to the instance's scope though, all the copies should still point to the same dict, no waste either way I believe. I literally did it that way because I was thinking it made it clearer there was only one.

Yeah, I looked through the code, and I think it's a very small change to pass the dict around.

jnikula · 2023-11-16T21:17:14Z

I haven't looked into all the details yet, but from a high level, compared to my approach:

This is a nicer interface to work with in parser.py.
It's about the same from docstring.py.
I like the properties and the way all the fixups are methods.
I very much like how e.g. the get_tokens() fixup is much more natural, and prevents accidental use of the clang cursor's method.
I kind of miss the clarity of having multiple classes for cursors. Having to check for the cursor type all over the place instead of overloading is a bit meh. Maybe there could be a factory method to create a doccursor, hiding the multiple cursors from parser.py. Anyway, this can clearly be a future improvement too.

All in all, I like this better than my approach (and with the future addition of multiple classes, much better) and obviously much better than what we currently have.

Please fix the DocCursor._COMMENTS thing when you have a moment. I'll continue with detailed review over the weekend, independent of that.

Nice work! 👍

jnikula · 2023-11-16T21:23:21Z

Oh, when I first saw the caching I immediately thought "Premature optimization is the root of all evil" 😁

As a 1st step, we make a _very_ shallow wrapper around the Clang cursor. There is of course no point to it yet, but with this simple trick we are now using `DocCursor`s everywhere past `parse()`. The only effect of this commit is to worsen performance very slightly due to needless invocation of path specific code on all cursors. Running the full test suit is now ~9% slower on my machine. We shall revisit this later when it actually matters.

Another baby step, which is again mostly cosmetic. In doing this, we can remove plenty of the attributes we defined just to get cursor compatibility. We may want to expose some of this again later, but likely through a better API. For now we simplify the code as much as possible. After we should start thinking of how this will play out with docstring and improving these methods / attributes to provide a nice API for the new cursor.

We want to start improving the API in ways that will no longer mimic the Clang's cursor one. We need to prevent (e.g.) our cursor `type` property turns out to mean something different than Clang's own namesake (spoiler: it will!). There might be a few cases where the result might be the same, but that will not be obvious most of the times and it may still pose a performance issue anyway. As a rule, we stick to using the Clang cursor everywhere we can.

We want to do better than this later on, but for now we need to simplify the parser so that we focus instead on API alone with docstring as the biggest consumer. In order to do that the easy way, we will start by providing an API that reflects the current docstring API, then pass the cursor to docstring directly simplifying its own API and finally isolating the parser from docstring entirely. The idea is _not_ to mimic Clang's cursor any more! Any similarity in attributes and property names is coincidental from now on. By this point we've also made up the lost performance from the stub commit!

BrunoMSantos · 2023-11-16T23:31:42Z

Thanks for the quick review. Pushed some fixes, including the _COMMENTS thing.

Oh, when I first saw the caching I immediately thought "Premature optimization is the root of all evil" 😁

Well, the thought did cross my mind. That's why I put more effort in the commit message trying to find out exactly why we should care than actually implementing it xD

That said, you did much the same thing in your branch if I read correctly. Though instead of lazy caching you just called the methods ahead of time on __init__. I like this approach and I think it's well justified and that it will pay off. But we can drop it for now if you don't like it.

src/hawkmoth/doccursor.py

jnikula · 2023-11-17T23:48:25Z

I read through this commit by commit, and didn't spot any issues. I'm getting pretty tired though, so I'll get back to it still with fresh eyes.

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache. I'm not insisting on switching to a decorator now but I'm also undecided whether to merge or drop the caching at this point. I'll sleep on it.

BrunoMSantos · 2023-11-18T00:24:00Z

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.

It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

jnikula · 2023-11-18T10:27:46Z

This on top fixes #201. I can send this separately, along with a test, but FYI.

diff --git a/src/hawkmoth/doccursor.py b/src/hawkmoth/doccursor.py
index 77e54058226c..109abe4334f0 100644
--- a/src/hawkmoth/doccursor.py
+++ b/src/hawkmoth/doccursor.py
@@ -70,7 +70,7 @@ class DocCursor:
 
     @property
     def name(self):
-        return self._cc.spelling
+        return self._cc.spelling if self._cc.spelling else self.decl_name
 
     @property
     def decl_name(self):

Edit: It occurs to me this could lead to infinite recursion, but I'm not sure yet if it's just theoretical.

jnikula · 2023-11-18T10:31:50Z

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.

It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

Looks like there are pitfalls in using this on methods, need to investigate: https://www.youtube.com/watch?v=sVjtp6tGo0g

I'm thinking let's leave that for future improvement.

BrunoMSantos · 2023-11-18T10:46:36Z

This on top fixes #201. I can send this separately, along with a test, but FYI.
diff --git a/src/hawkmoth/doccursor.py b/src/hawkmoth/doccursor.py
index 77e54058226c..109abe4334f0 100644
--- a/src/hawkmoth/doccursor.py
+++ b/src/hawkmoth/doccursor.py
@@ -70,7 +70,7 @@ class DocCursor:
 
     @property
     def name(self):
-        return self._cc.spelling
+        return self._cc.spelling if self._cc.spelling else self.decl_name
 
     @property
     def decl_name(self):
Edit: It occurs to me this could lead to infinite recursion, but I'm not sure yet if it's just theoretical.

This doesn't need to be part of this PR, but I can add it if you prefer it that way.
If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

jnikula · 2023-11-18T10:52:54Z

This doesn't need to be part of this PR, but I can add it if you prefer it that way. If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

I can handle it afterwards. It was just FYI, in case you see any problems with it.

BrunoMSantos · 2023-11-18T10:53:53Z

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.
It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

Looks like there are pitfalls in using this on methods, need to investigate: https://www.youtube.com/watch?v=sVjtp6tGo0g

I'm thinking let's leave that for future improvement.

Had a quick look. The pitfall is a bit unfortunate, but thinking about it with a fresh mind it's obvious it had to be that way and not the way I was positing it would work. Still, he volunteers a solution that may still be worthwhile versus the current implementation.

Let's leave that commit out for now then since it's not an issue yet, not even potentially (cursor not yet exposed to events and we know exactly how many times we call these things). I'll take the opportunity to think if there is a way to get the global cache working the way we would want, allowing multiple passes over cursors. Maybe we can recover the cursor instance from the cache instead of generating a new one on get_children, and then the pitfall is the feature. I actually have a plan forming in my head as I write.

BrunoMSantos · 2023-11-18T10:56:31Z

This doesn't need to be part of this PR, but I can add it if you prefer it that way. If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

I can handle it afterwards. It was just FYI, in case you see any problems with it.

I didn't, until you mentioned infinite recursions xD

That said, this all should be reworked as docstring can now be simplified and it's possible that name and decl_name don't make sense as an API then. Honestly I don't like those, they were hacks to weave docstring into the parser we had. Not that we shouldn't fix that in the current implementation in the meantime.

BrunoMSantos · 2023-11-18T10:58:03Z

By the way, I had never looked into how decorators worked and with that short video it all makes sense now, so thanks for that. It's a useful thing to know, I wish I had looked into it sooner ;)

jnikula · 2023-11-18T13:39:58Z

That said, this all should be reworked as docstring can now be simplified and it's possible that name and decl_name don't make sense as an API then. Honestly I don't like those, they were hacks to weave docstring into the parser we had. Not that we shouldn't fix that in the current implementation in the meantime.

Oh, I completely agree. The whole decl_name thing was always a hack, or a workaround at best. I'm hoping we could nuke it in the future.

jnikula · 2023-11-18T15:29:41Z

@BrunoMSantos Okay. I'm prepared to merge this as-is, or with doccursor: cache calls to certain internal methods removed. Your call.

BrunoMSantos · 2023-11-18T15:33:55Z

Great! But leave the cache commit out then. I'll have a better look at it and push another PR later, no need to rush it.

jnikula · 2023-11-18T15:41:17Z

Great! But leave the cache commit out then. I'll have a better look at it and push another PR later, no need to rush it.

Umm, would you mind doing that and force pushing the branch, please? I mean I can of course do it myself and push a separate branch and merge a separate pull request, but I'd rather it was your pull request that I merge.

Finally, we get to make the parser completely oblivious to the rendering layer! Hopefully docstring and doccursor can now evolve to something better. Closes jnikula#180. Though a lot of the potential is still to be realized.

BrunoMSantos · 2023-11-18T15:45:59Z

I forget this is not like email workflow :P

jnikula · 2023-11-19T10:11:15Z

@BrunoMSantos One thing I noticed that bugs me a little. TextDocstring is initialized with text and meta parameters directly, instead of creating DocCursor for them too.

I think DocCursor should deal with cases where either the clang cursor or the comment is None. The former would be for generic comment sections ("top level coments") and the latter to support :undoc-members: in the future. https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-option-automodule-undoc-members

BrunoMSantos · 2023-11-19T11:52:55Z

@BrunoMSantos One thing I noticed that bugs me a little. TextDocstring is initialized with text and meta parameters directly, instead of creating DocCursor for them too.

I think DocCursor should deal with cases where either the clang cursor or the comment is None. The former would be for generic comment sections ("top level coments") and the latter to support :undoc-members: in the future. https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-option-automodule-undoc-members

Ah yes, I intended to mention it in the PR for us to discuss it, but apparently forgot. I'm still expecting to address it, but I felt what I had was enough of an improvement already and I'm not sure I want DocCursor to represent things that are not cursors. I think I'd rather have a specialized Docstring class for those and call it in the parser appropriately. Actually, I forgot to remove the default value for the cursor, as I did not intend for it and the code will throw if it's attempted as we dereference the cursor in the constructor right away.

A cursor having no comment is already fine I think. Both by comments being None or by having no entry for the cursor hash. What's left is for the parser to iterate over those as well.

jnikula · 2023-11-19T12:05:02Z

I'm not sure I want DocCursor to represent things that are not cursors.

I guess it needs special casing somewhere, and the debate is where to have it.

Perhaps the answer depends on what's easiest for implementing #166.

A cursor having no comment is already fine I think. Both by comments being None or by having no entry for the cursor hash. What's left is for the parser to iterate over those as well.

I've got a branch implementing some of this, but it's stale now and needs rebasing. It also stalled, because currently we only recurse for things that have a comment, but recursing for everything opens up the struct vs. member can of worms again for things like:

struct foo {
    struct bar {
        int baz;
    } m;
}

We get the cursor for both struct bar and m and it gets a bit nasty. I'm considering implementing an intermediate version of :undoc-members: for enums only, to let you include the undocumented enumerators.

Anyway, getting a bit off-topic here...

jnikula added 2 commits November 16, 2023 18:29

docstring: add more dedicated _get_header_lines() methods

1984b99

At the risk of duplication, add more dedicated _get_header_line() to make it easier to understand what is needed where.

BrunoMSantos requested a review from jnikula November 16, 2023 19:04

jnikula mentioned this pull request Nov 16, 2023

draft: add "commented cursor" classes and use them #206

Closed

BrunoMSantos force-pushed the doccursor branch from 462fe31 to 055100f Compare November 16, 2023 23:28

BrunoMSantos added 4 commits November 16, 2023 23:29

BrunoMSantos force-pushed the doccursor branch from 055100f to 5aee698 Compare November 16, 2023 23:30

BrunoMSantos commented Nov 16, 2023

View reviewed changes

src/hawkmoth/doccursor.py Outdated Show resolved Hide resolved

BrunoMSantos force-pushed the doccursor branch 2 times, most recently from 7e44f65 to 59cc22b Compare November 17, 2023 00:18

BrunoMSantos force-pushed the doccursor branch from 59cc22b to 2ad9dcd Compare November 18, 2023 15:43

docstring: use cursor as input

0cd1f59

Finally, we get to make the parser completely oblivious to the rendering layer! Hopefully docstring and doccursor can now evolve to something better. Closes jnikula#180. Though a lot of the potential is still to be realized.

BrunoMSantos force-pushed the doccursor branch from 2ad9dcd to 0cd1f59 Compare November 18, 2023 15:44

jnikula approved these changes Nov 18, 2023

View reviewed changes

jnikula merged commit 14f47ee into jnikula:master Nov 18, 2023
5 checks passed

BrunoMSantos deleted the doccursor branch November 18, 2023 17:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doccursor #217

Doccursor #217

BrunoMSantos commented Nov 16, 2023 •

edited

jnikula commented Nov 16, 2023

BrunoMSantos commented Nov 16, 2023

jnikula commented Nov 16, 2023

jnikula commented Nov 16, 2023

jnikula commented Nov 16, 2023

BrunoMSantos commented Nov 16, 2023

jnikula commented Nov 17, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023 •

edited

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 19, 2023

BrunoMSantos commented Nov 19, 2023

jnikula commented Nov 19, 2023

Doccursor #217

Doccursor #217

Conversation

BrunoMSantos commented Nov 16, 2023 • edited

jnikula commented Nov 16, 2023

BrunoMSantos commented Nov 16, 2023

jnikula commented Nov 16, 2023

jnikula commented Nov 16, 2023

jnikula commented Nov 16, 2023

BrunoMSantos commented Nov 16, 2023

jnikula commented Nov 17, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023 • edited

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 18, 2023

BrunoMSantos commented Nov 18, 2023

jnikula commented Nov 19, 2023

BrunoMSantos commented Nov 19, 2023

jnikula commented Nov 19, 2023

BrunoMSantos commented Nov 16, 2023 •

edited

jnikula commented Nov 18, 2023 •

edited