Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doccursor #217

Merged
merged 7 commits into from
Nov 18, 2023
Merged

Doccursor #217

merged 7 commits into from
Nov 18, 2023

Conversation

BrunoMSantos
Copy link
Collaborator

@BrunoMSantos BrunoMSantos commented Nov 16, 2023

Based on an old attempt by me and on #206.

I think this is worthwhile merging already, minus any review concern, so I'm submitting it as such. Given the amount of changes I wouldn't be surprised if it needs some iterations, but I'm quite happy with it already.

I did not go as far as the original in simplifying docstring. But that is independent work which shouldn't block this I think.

Anyway, I took a couple of early commits from #206 with maybe some minor changes. The rest kind of took a different approach, but with plenty of inspiration drawn from the original, which forced me to find ways around issues I wouldn't have tackled otherwise. So thanks for that @jnikula :)

-->8--

On to real concerns:

One possible concern is with 88bc569 and f46d74f, which are simple, but a bit chunkier than I'd like. I can look into breaking those down if need be. When I bundled those changes together, they were simpler, but then tiny little things had to be bolted on as I revisited earlier commits...

The biggest architectural difference is with the single cursor object, which I named DocCursor (goes with Docstring I guess).

Originally, I didn't consider merging both ideas and had already invested too much on the current implementation, but I did realize I could have something like this to better group related code:

class _Base:
   ...
class _Struct(_Base):
   ...
...

# This would be the only type in the API.
class DocString:
    def __init__(self, cc, ...):
        if cc.kind == STRUCT:
            self._backend = _Struct(...)
        ...

I tried going that way, but it would take too much work rebasing though, so I'd rather attempt that later instead of rewriting everything.

One thing I'm quite certain though: I love how the parser turned out with the single cursor type. It did require a tiny trick with the _COMMENTS variable and an init function, but I think it's well worth it. I also had to address the domain change on extern "C" {} differently to make it look nice. I'm quite stoked with the end result though. Those bits of magic are all in e9c1eee.

Another difference is in how I handled the multiple calls to certain big methods. The original initialized things on construction, I implemented a caching mechanism which I find pretty elegant and should do the job even better on account of being lazier :)

Finally, another thing I'm unsure about is returning None from cursor properties that don't apply to a given cursor or throw a meaningful message. Not a big deal to let it go for now I think.

At the risk of duplication, add more dedicated _get_header_line() to
make it easier to understand what is needed where.
We want to wrap these in an object, but we'll try to do so
incrementally.

By moving this code into our own `doccursor` module, it makes sense that
we import `CursorKind` and `TokenKind` from it instead of from Clang's
library despite them being the same by design.
@jnikula
Copy link
Owner

jnikula commented Nov 16, 2023

Okay, I'm not very far in reviewing, but I'm super uneasy about DocCursor._COMMENTS. How does that work if you have multiple documents being parsed in parallel?

@BrunoMSantos
Copy link
Collaborator Author

Okay, I'm not very far in reviewing, but I'm super uneasy about DocCursor._COMMENTS. How does that work if you have multiple documents being parsed in parallel?

That may well be one of those things that didn't cross my sleep deprived mind 😅

Can be fixed by passing it to the instance's scope though, all the copies should still point to the same dict, no waste either way I believe. I literally did it that way because I was thinking it made it clearer there was only one.

@jnikula
Copy link
Owner

jnikula commented Nov 16, 2023

Can be fixed by passing it to the instance's scope though, all the copies should still point to the same dict, no waste either way I believe. I literally did it that way because I was thinking it made it clearer there was only one.

Yeah, I looked through the code, and I think it's a very small change to pass the dict around.

@jnikula
Copy link
Owner

jnikula commented Nov 16, 2023

I haven't looked into all the details yet, but from a high level, compared to my approach:

  • This is a nicer interface to work with in parser.py.
  • It's about the same from docstring.py.
  • I like the properties and the way all the fixups are methods.
  • I very much like how e.g. the get_tokens() fixup is much more natural, and prevents accidental use of the clang cursor's method.
  • I kind of miss the clarity of having multiple classes for cursors. Having to check for the cursor type all over the place instead of overloading is a bit meh. Maybe there could be a factory method to create a doccursor, hiding the multiple cursors from parser.py. Anyway, this can clearly be a future improvement too.

All in all, I like this better than my approach (and with the future addition of multiple classes, much better) and obviously much better than what we currently have.

Please fix the DocCursor._COMMENTS thing when you have a moment. I'll continue with detailed review over the weekend, independent of that.

Nice work! 👍

@jnikula
Copy link
Owner

jnikula commented Nov 16, 2023

Oh, when I first saw the caching I immediately thought "Premature optimization is the root of all evil" 😁

As a 1st step, we make a _very_ shallow wrapper around the Clang cursor.
There is of course no point to it yet, but with this simple trick we are
now using `DocCursor`s everywhere past `parse()`.

The only effect of this commit is to worsen performance very slightly
due to needless invocation of path specific code on all cursors. Running
the full test suit is now ~9% slower on my machine. We shall revisit
this later when it actually matters.
Another baby step, which is again mostly cosmetic.

In doing this, we can remove plenty of the attributes we defined just to
get cursor compatibility. We may want to expose some of this again
later, but likely through a better API. For now we simplify the code as
much as possible.

After we should start thinking of how this will play out with docstring
and improving these methods / attributes to provide a nice API for the
new cursor.
We want to start improving the API in ways that will no longer mimic the
Clang's cursor one. We need to prevent (e.g.) our cursor `type` property
turns out to mean something different than Clang's own namesake
(spoiler: it will!).

There might be a few cases where the result might be the same, but that
will not be obvious most of the times and it may still pose a
performance issue anyway. As a rule, we stick to using the Clang cursor
everywhere we can.
We want to do better than this later on, but for now we need to simplify
the parser so that we focus instead on API alone with docstring as the
biggest consumer. In order to do that the easy way, we will start by
providing an API that reflects the current docstring API, then pass the
cursor to docstring directly simplifying its own API and finally
isolating the parser from docstring entirely.

The idea is _not_ to mimic Clang's cursor any more! Any similarity in
attributes and property names is coincidental from now on.

By this point we've also made up the lost performance from the stub
commit!
@BrunoMSantos
Copy link
Collaborator Author

Thanks for the quick review. Pushed some fixes, including the _COMMENTS thing.

Oh, when I first saw the caching I immediately thought "Premature optimization is the root of all evil" 😁

Well, the thought did cross my mind. That's why I put more effort in the commit message trying to find out exactly why we should care than actually implementing it xD

That said, you did much the same thing in your branch if I read correctly. Though instead of lazy caching you just called the methods ahead of time on __init__. I like this approach and I think it's well justified and that it will pay off. But we can drop it for now if you don't like it.

@BrunoMSantos BrunoMSantos force-pushed the doccursor branch 2 times, most recently from 7e44f65 to 59cc22b Compare November 17, 2023 00:18
@jnikula
Copy link
Owner

jnikula commented Nov 17, 2023

I read through this commit by commit, and didn't spot any issues. I'm getting pretty tired though, so I'll get back to it still with fresh eyes.

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache. I'm not insisting on switching to a decorator now but I'm also undecided whether to merge or drop the caching at this point. I'll sleep on it.

@BrunoMSantos
Copy link
Collaborator Author

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.

It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

This on top fixes #201. I can send this separately, along with a test, but FYI.

diff --git a/src/hawkmoth/doccursor.py b/src/hawkmoth/doccursor.py
index 77e54058226c..109abe4334f0 100644
--- a/src/hawkmoth/doccursor.py
+++ b/src/hawkmoth/doccursor.py
@@ -70,7 +70,7 @@ class DocCursor:
 
     @property
     def name(self):
-        return self._cc.spelling
+        return self._cc.spelling if self._cc.spelling else self.decl_name
 
     @property
     def decl_name(self):

Edit: It occurs to me this could lead to infinite recursion, but I'm not sure yet if it's just theoretical.

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.

It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

Looks like there are pitfalls in using this on methods, need to investigate: https://www.youtube.com/watch?v=sVjtp6tGo0g

I'm thinking let's leave that for future improvement.

@BrunoMSantos
Copy link
Collaborator Author

This on top fixes #201. I can send this separately, along with a test, but FYI.

diff --git a/src/hawkmoth/doccursor.py b/src/hawkmoth/doccursor.py
index 77e54058226c..109abe4334f0 100644
--- a/src/hawkmoth/doccursor.py
+++ b/src/hawkmoth/doccursor.py
@@ -70,7 +70,7 @@ class DocCursor:
 
     @property
     def name(self):
-        return self._cc.spelling
+        return self._cc.spelling if self._cc.spelling else self.decl_name
 
     @property
     def decl_name(self):

Edit: It occurs to me this could lead to infinite recursion, but I'm not sure yet if it's just theoretical.

This doesn't need to be part of this PR, but I can add it if you prefer it that way.
If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

This doesn't need to be part of this PR, but I can add it if you prefer it that way. If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

I can handle it afterwards. It was just FYI, in case you see any problems with it.

@BrunoMSantos
Copy link
Collaborator Author

It occurred to me the caching might be possible to implement using a decorator, and googling found https://docs.python.org/dev/library/functools.html#functools.lru_cache.

That looks pretty cool! I'm tired myself, but I'll look into it. It would simplify the code for sure, even if it's not that complicated to begin with.
It should also have the added benefit of not being tied to a doccursor instance if I understand it right. So, as long as it's the same base cursor (same hash), the cache would still work. E.g. iterating over the children multiple times currently creates new doccursors with an empty cache every time, but this should still work as the cursors would be identical every time.

Looks like there are pitfalls in using this on methods, need to investigate: https://www.youtube.com/watch?v=sVjtp6tGo0g

I'm thinking let's leave that for future improvement.

Had a quick look. The pitfall is a bit unfortunate, but thinking about it with a fresh mind it's obvious it had to be that way and not the way I was positing it would work. Still, he volunteers a solution that may still be worthwhile versus the current implementation.

Let's leave that commit out for now then since it's not an issue yet, not even potentially (cursor not yet exposed to events and we know exactly how many times we call these things). I'll take the opportunity to think if there is a way to get the global cache working the way we would want, allowing multiple passes over cursors. Maybe we can recover the cursor instance from the cache instead of generating a new one on get_children, and then the pitfall is the feature. I actually have a plan forming in my head as I write.

@BrunoMSantos
Copy link
Collaborator Author

This doesn't need to be part of this PR, but I can add it if you prefer it that way. If not, and if you don't mind, leave a comment explaining the possibility of infinite recursion in the issue if you don't mind. I'm not really seeing it right now, but I also haven't thought about it.

I can handle it afterwards. It was just FYI, in case you see any problems with it.

I didn't, until you mentioned infinite recursions xD

That said, this all should be reworked as docstring can now be simplified and it's possible that name and decl_name don't make sense as an API then. Honestly I don't like those, they were hacks to weave docstring into the parser we had. Not that we shouldn't fix that in the current implementation in the meantime.

@BrunoMSantos
Copy link
Collaborator Author

By the way, I had never looked into how decorators worked and with that short video it all makes sense now, so thanks for that. It's a useful thing to know, I wish I had looked into it sooner ;)

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

That said, this all should be reworked as docstring can now be simplified and it's possible that name and decl_name don't make sense as an API then. Honestly I don't like those, they were hacks to weave docstring into the parser we had. Not that we shouldn't fix that in the current implementation in the meantime.

Oh, I completely agree. The whole decl_name thing was always a hack, or a workaround at best. I'm hoping we could nuke it in the future.

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

@BrunoMSantos Okay. I'm prepared to merge this as-is, or with doccursor: cache calls to certain internal methods removed. Your call.

@BrunoMSantos
Copy link
Collaborator Author

Great! But leave the cache commit out then. I'll have a better look at it and push another PR later, no need to rush it.

@jnikula
Copy link
Owner

jnikula commented Nov 18, 2023

Great! But leave the cache commit out then. I'll have a better look at it and push another PR later, no need to rush it.

Umm, would you mind doing that and force pushing the branch, please? I mean I can of course do it myself and push a separate branch and merge a separate pull request, but I'd rather it was your pull request that I merge.

Finally, we get to make the parser completely oblivious to the rendering
layer! Hopefully docstring and doccursor can now evolve to something
better.

Closes jnikula#180. Though a lot of the potential is still to be realized.
@BrunoMSantos
Copy link
Collaborator Author

I forget this is not like email workflow :P

@jnikula jnikula merged commit 14f47ee into jnikula:master Nov 18, 2023
5 checks passed
@BrunoMSantos BrunoMSantos deleted the doccursor branch November 18, 2023 17:05
@jnikula
Copy link
Owner

jnikula commented Nov 19, 2023

@BrunoMSantos One thing I noticed that bugs me a little. TextDocstring is initialized with text and meta parameters directly, instead of creating DocCursor for them too.

I think DocCursor should deal with cases where either the clang cursor or the comment is None. The former would be for generic comment sections ("top level coments") and the latter to support :undoc-members: in the future. https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-option-automodule-undoc-members

@BrunoMSantos
Copy link
Collaborator Author

@BrunoMSantos One thing I noticed that bugs me a little. TextDocstring is initialized with text and meta parameters directly, instead of creating DocCursor for them too.

I think DocCursor should deal with cases where either the clang cursor or the comment is None. The former would be for generic comment sections ("top level coments") and the latter to support :undoc-members: in the future. https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-option-automodule-undoc-members

Ah yes, I intended to mention it in the PR for us to discuss it, but apparently forgot. I'm still expecting to address it, but I felt what I had was enough of an improvement already and I'm not sure I want DocCursor to represent things that are not cursors. I think I'd rather have a specialized Docstring class for those and call it in the parser appropriately. Actually, I forgot to remove the default value for the cursor, as I did not intend for it and the code will throw if it's attempted as we dereference the cursor in the constructor right away.

A cursor having no comment is already fine I think. Both by comments being None or by having no entry for the cursor hash. What's left is for the parser to iterate over those as well.

@jnikula
Copy link
Owner

jnikula commented Nov 19, 2023

I'm not sure I want DocCursor to represent things that are not cursors.

I guess it needs special casing somewhere, and the debate is where to have it.

Perhaps the answer depends on what's easiest for implementing #166.

A cursor having no comment is already fine I think. Both by comments being None or by having no entry for the cursor hash. What's left is for the parser to iterate over those as well.

I've got a branch implementing some of this, but it's stale now and needs rebasing. It also stalled, because currently we only recurse for things that have a comment, but recursing for everything opens up the struct vs. member can of worms again for things like:

struct foo {
    struct bar {
        int baz;
    } m;
}

We get the cursor for both struct bar and m and it gets a bit nasty. I'm considering implementing an intermediate version of :undoc-members: for enums only, to let you include the undocumented enumerators.

Anyway, getting a bit off-topic here...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants