
Memory Consumption of doxyindex & doxygenclass #315

Open
ax3l opened this issue Feb 17, 2017 · 44 comments
Labels: code (Source code) · enhancement (Improvements, additions (also cosmetics)) · upstream (Issue in external software)

Comments

@ax3l
Contributor

ax3l commented Feb 17, 2017

I am trying to set up automated build of our docs on rtd, which is limited to 1 GB mem usage (and 15min build time).

The sphinx step

reading sources... [ 25%] doxyindex

is at this point a rather large bottleneck. Is it possible to somehow limit the memory consumption of it? (Is this even a breathe issue or a sphinx issue?)

@michaeljones
Collaborator

Hi, you might want to try changing the cache that Breathe uses. By default it just stuffs all the results of parsing files into a dictionary which is created here:

https://github.com/michaeljones/breathe/blob/master/breathe/parser/__init__.py#L99

The cache was introduced to speed things up but realistically it might get very big. If you replace that dictionary with an object that has a dictionary interface but never stores any data then you might have luck reducing the memory usage. It is possible that some kind of LRU cache might represent the best of both worlds but I've no idea what the cache hit/miss pattern would be so it is hard to say.
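A minimal sketch of that "dictionary interface that never stores any data" idea, for anyone who wants to try it (the class name is illustrative, not part of Breathe):

```python
class NullCache(dict):
    """A dict-compatible cache that silently discards everything.

    Dropping this in where Breathe builds its parse cache trades
    repeated re-parsing (CPU time) for a flat memory profile.
    """

    def __setitem__(self, key, value):
        # Pretend to store the parse result, actually discard it.
        pass

    def get(self, key, default=None):
        # Every lookup is a cache miss.
        return default
```

Lookups then always fall through to re-parsing, which is exactly the memory/speed trade-off described above.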

@michaeljones
Collaborator

Also, thanks for using Breathe and for raising this issue! :)

@michaeljones
Collaborator

Do you have any results from this at all? Happy to introduce a basic cache/no-cache flag to the main configuration options to help people deal with this.

@ax3l
Contributor Author

ax3l commented Apr 13, 2017

I did not try to modify the cache yet, but I am now at a point where even selecting a few dozen classes (out of many hundred; the number of doxygen XML files is 2791) in a page via doxygenclass:: becomes so memory-hungry that RTD's 1 GB virtual memory limit gets broken... :(

Example: https://github.com/ComputationalRadiationPhysics/picongpu/blob/342b90c15328d5b4ec821cd1bfec452e3a0c940d/docs/source/dev/pmacc.rst

update: I tried skipping the cache entirely, which makes the build slow, as expected...

@ax3l ax3l changed the title Memory Consumption of doxyindex Memory Consumption of doxyindex & doxygenclass Apr 13, 2017
@ax3l
Contributor Author

ax3l commented Apr 13, 2017

Digging a bit deeper, the underlying problem seems to be that single entries in the cache can easily reach several MB in size.

The main consumers seem to be compound.(member)RefTypeSub / compound.listofallmembersTypeSub from a first glimpse.

@ax3l
Contributor Author

ax3l commented Apr 14, 2017

The interesting thing is that the .xml files that doxygen creates, with all their well-known formatting overhead, only sum up to 84 MB for the project I am reporting on. I wonder where all the overhead is coming from?

@ax3l
Contributor Author

ax3l commented Apr 14, 2017

I think the main issue is the memory footprint of the XML parser minidom, whose parsed result object is cached.

I think we have two ways to speed things up while keeping a cache.

A) keep everything in place but cache the raw file and parse on every access in-RAM
B) replace the xml parser with something that does not introduce a memory overhead of 20x while reading a text file

I will hack around a bit and report back.
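Option A) could be sketched roughly like this (a hypothetical helper, not Breathe's actual code): cache only the raw bytes and rebuild the DOM on each access, so the expensive minidom tree never stays alive between uses.

```python
import xml.dom.minidom as minidom

class RawFileCache:
    """Option A: cache the raw XML bytes, re-parse on every access.

    The parsed minidom tree (the memory-hungry part) is built on
    demand and becomes garbage-collectable as soon as the caller
    drops its reference.
    """

    def __init__(self):
        self._raw = {}

    def parse(self, path):
        if path not in self._raw:
            with open(path, "rb") as f:
                self._raw[path] = f.read()
        # Re-parse from the cached bytes: avoids re-reading from disk,
        # but never keeps the expanded DOM alive between accesses.
        return minidom.parseString(self._raw[path])
```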

@ax3l
Contributor Author

ax3l commented Apr 14, 2017

Since all the reports on minidom that I found on Stack Overflow are discouraging (huge overheads in general, supposedly also quite a few memory leaks), I started investigating option B).

I started replacing minidom with ElementTree, which can potentially also be drop-in replaced with lxml(2) later for speed.

I did not fully succeed yet, but here is the branch so far (progress logged in the commit messages):
https://github.com/ax3l/breathe/tree/topic-xmlElementTree

@ax3l
Contributor Author

ax3l commented Jul 31, 2017

@michaeljones although I don't have the time to work on this more deeply (sorry!), I think this issue should become a priority for the project, since it needs coordination and is quite limiting for deployments on e.g. readthedocs.

I think the overall project should migrate away from the memory-leaking and slow minidom (see comment+links above) to ElementTree/lxml as soon as anyone can work on that :)

@ax3l
Contributor Author

ax3l commented Sep 11, 2017

@svenevs since you are working on exhale, did you see the same performance issues with large projects with breathe? :)

My attempts to get rid of minidom in breathe are quite stalled, but I documented the progress above, and it's quite an issue on read-the-docs with its limited resources. Just wanted to ping you.

@svenevs

svenevs commented Sep 12, 2017

Hehe. LOL well Exhale is reliably slower for different reasons.

Here's a braindump. It is probably more information than you want...

I'm finishing up some touches to upload it to pip (finally; the initial upload had a small bug, and apparently I did get templates working).

Worth Keeping in Mind

The "legacy" exhale was done in one file, and (through dubious __dict__ crawling of breathe) explicitly crawled the breathe API to infer relationships. In the rewrite, I switched to BeautifulSoup with an lxml backend. The reason I bring this up is that I'm pretty sure the only reason "legacy" exhale even worked is because breathe loads the ENTIRE XML DOM into memory. Don't quote me on that because I really have no idea. I was never able to figure out how to extract the actual documentation through breathe compounds, so this could also indicate that the entire XML DOM is not loaded into memory.

Notes on BeautifulSoup and lxml

  1. BeautifulSoup for parsing, because I'm lazy and it's crazy convenient. It allows me to modify the parsed XML in place, which was really useful for extracting file-level documentation. I can't use the breathe file directive because it will generate hundreds of duplicate ids.
    • This should not apply to breathe, because the generated API was constructed to match the Doxygen hierarchy. AKA the parsing shouldn't need to change.
    • Trying to switch to BeautifulSoup for breathe is an exercise in futility, as the generated super stuff is already in play.
      • In the Exhale rewrite, I literally pored over XML document after XML document and basically just had to decide which portions of each document I needed to care about. breathe already has the full XML specs represented.
  2. An undocumented but reliable feature of RTD is that lxml is already installed. It does add some potential incompatibility (for Windows in particular) when doing pip install breathe, but you can also fallback on minidom if not present.
    • I don't know why it's installed, but I think it is safe to assume it will always be there.
    • For breathe, setup.py basically would need to try - except importing lxml before specifying the requirements for installation. I don't think there is a way to try installing lxml and then fall back, though, which leaves people vulnerable to having to first pip install lxml and then pip install breathe. Presumably most people will overlook that, even if it's documented.
    • Correspondingly, the parser would do the same thing but in the except just import the existing minidom or whatever.

Parsing the API and Memory

In Exhale, I was very cautious to not build out the entire graph while loading in all XML documents. I know for a fact that I can get away with this, but I'm not sure that breathe can. The general workflow was

  1. Parse all of index.xml and start creating nodes to represent the graph.
  2. After index.xml is closed, individual (e.g., class or namespace or file) XML documents are opened, parsed, and closed.
    • I do have a future TODO because for files I do save the entire XML programlisting. That needs to go away at some point.

So basically, the question for breathe really is: does the ENTIRE XML DOM get loaded into memory at once? The above approach works because index.xml enumerates every possible type in the whole framework. Then for things where <refid>.xml exists, you can use that to get more information such as base or derived classes, documentation of specific refid's etc.
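The two-phase workflow described above (parse index.xml first, then open/parse/close one compound file at a time) can be sketched with the stdlib ElementTree. The `<compound refid="...">` layout of Doxygen's index.xml is assumed; the helper name is illustrative:

```python
import os
import xml.etree.ElementTree as ET

def walk_doxygen_xml(xml_dir):
    """Parse index.xml first, then open/parse/close each compound file.

    Only one compound document is in memory at a time; the full DOM
    of the whole project is never built at once.
    """
    index = ET.parse(os.path.join(xml_dir, "index.xml")).getroot()
    for compound in index.findall("compound"):
        refid = compound.get("refid")
        path = os.path.join(xml_dir, refid + ".xml")
        if not os.path.exists(path):
            continue
        tree = ET.parse(path)        # opened and parsed ...
        yield refid, tree.getroot()  # ... handed to the caller ...
        del tree                     # ... and dropped before the next file
```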

In the end

I think I'm ready to re-upload to pip tonight, I'll post back when I have because Exhale prints out the times it takes to parse and generate the rst docs, so it may be useful for you to compare that to breathe times. To test the breathe timing, though, you'd need to be working with a Development Install and inject some of your own benchmarks.

Note: I don't actually have any memory benchmarks for Exhale, only timing. I only know how to do memory benchmarks in C++...

@ax3l
Contributor Author

ax3l commented Sep 12, 2017

Thanks for the detailed answer!

does the ENTIRE XML DOM get loaded into memory at once?

The main problem is not that the entire XML DOM gets loaded at once. That is fine and efficient for access, and even large projects seldom generate more than 100 MB of plain XML (which could be smaller in RAM when parsed efficiently). The main problem is that minidom is used for parsing, which is cluttered with memory leaks and blows it up more than ten-fold in RAM.

@svenevs

svenevs commented Sep 12, 2017

Ahhh Ok I see. So you have two problems:

  1. minidom being minidom
  2. doxygenindex specifically being the culprit directive.

If the API is stable enough, did you try using breathe-apidoc to generate the structure?

TBH though, I tried giving exhale a crack at it and I definitely don't think it'll work for you. (Note: I did something really dumb, which is why the requirements.txt in that is using the GitHub tag.) This was a QUICK test on my part; I hit one error and quit... your project is giant.

  1. The parsing of the XML tree still skyrockets; I think it got up to 4GB. So I guess I wasn't very clever after all. I probably need to del stuff explicitly.

    • Or else I really need to stop explicitly storing the XML programlisting........
  2. Exhale finishes its part, but the build itself fails afterwards. Could be a me-bug, but I'm pretty sure it's breathe and templates. The build failure for me was

    Exception occurred:
      File "/usr/local/lib/python2.7/site-packages/docutils/nodes.py", line 582, in __setitem__
        self.children[key] = item
    IndexError: list assignment index out of range
    

Anywho, I guess the point is both seem to consume a LOT of memory when parsing the Doxygen XML for your project. For Exhale, I guess I have to lay blame on BeautifulSoup (slash me not explicitly deleting things), and being generally wasteful about what I'm storing. For small and medium sized projects it seems ok, but yours is way too complex 😢

@ax3l
Contributor Author

ax3l commented Sep 12, 2017

Thx for giving it a try! Yes, I think the underlying problem is actually just 1., since each and every minidom XML read is memory-hungry/leaky. doxygenindex just happens to access all XML data, but using many of the other directives leads to the same RAM expense (I already managed to cross the 1 GB RTD limit with just documenting a few hundred classes & files of ours).

My minidom replacement work above, toward ElementTree, already reduced memory footprints significantly in my tests. I am just not familiar enough with breathe's architecture to finish it so that all functionality works again :'(

@ax3l
Contributor Author

ax3l commented Jan 22, 2018

@svenevs @michaeljones @SylvainCorlay did by coincidence any of you try to reduce the memory footprint of breathe? On RTD, I get regularly killed in builds due to the leaks in minidom :'(

I tried above to get rid of minidom but I don't understand the project deep enough to get it refactored, unfortunately... any form of help / PRs to lighten the burden would be deeply appreciated :)

@svenevs

svenevs commented Jan 22, 2018

So if we're convinced that minidom is the problem, switching to lxml is probably the best course of action. Doing so will not be easy, though. The original code base was generated using generateDS; how it got set up originally is described here, though I'm not sure what files were used to generate it in the first place. Based on the current parser setup, I believe you only need index.xsd and compound.xsd. You can obtain those by just generating Doxygen XML documentation once; they'll show up next to the other XML documents (e.g. right next to index.xml).

I think you may be able to regenerate using lxml, but then you'd need to crawl through the diffs and figure out what changed. The generateDS docs say it requires lxml, that may not have been the case when breathe was first created. It's not clear to me what the answer is, but I think there is a way to possibly have both available in the same code base. I'm looking here. But I also think that requires that breathe had been generated using export, which I don't think is true.

Myself, I would vote for (if it actually works), replacing minidom with lxml entirely. If I remember correctly, minidom is pure python, whereas lxml has a C backend (that is a belief, not a fact).

So to do it, I think the steps you would need to take are

  1. Regenerate the super mod stuff with generateDS, making sure that the lxml stuff actually gets created. There's likely more to it than just setting export to etree; you'd have to read the docs there more carefully.
  2. Assuming you get something very similar to the existing parser package, you would then need to make sure the changes get spliced back in.

Then I think all that changes is this method to use lxml instead?

https://github.com/michaeljones/breathe/blob/08577a4b7f9a9173517d7ddf91bd1fa50eac1252/breathe/parser/index.py#L47-L60

I don't know if the parser package can just be replaced though, there's likely a lot more to it :/

@ax3l
Contributor Author

ax3l commented Jan 22, 2018

thanks for the information, that is super useful!

I gave it a little try, and indeed recent versions of generateDS depend on lxml with a fallback to ElementTree :)

@michaeljones do you recall, or did you save, the original call to generateDS for the parser creation? Especially for the super-class generation, I don't understand how the files got created in the first place. Did you use any specific Doxygen project for the generation? (Mine only contains a compound.xsd and an index.xsd...)

I tried so far:

generateDS -f -o compound.py -s compoundsuper.py --super="compoundsuper" ~/src/picongpu/docs/xml/compound.xsd
generateDS -f -o index.py -s indexsuper.py --super="indexsuper" ~/src/picongpu/docs/xml/index.xsd

which at least regenerates the four files in breathe/parser/, but not the __init__.py etc.

But installing and trying it leads to errors of the form

File ".local/lib/python2.7/site-packages/breathe-4.7.3-py2.7.egg/breathe/finder/core.py", line 29, in create_finder
    return self.finders[data_object.node_type](self.project_info, data_object, self)
AttributeError: 'DoxygenType' object has no attribute 'node_type'

@michaeljones
Collaborator

I'm afraid I'm not up to date on this thread yet, though I will try to catch up soon. I don't believe I have the original generation command saved, as the generated files have long since been edited to cover parts that seemed to be missing. I think there are levels of indirection in the XML that weren't handled well, so manual adjustments seemed necessary; at the same time the doxygen XML seemed stable enough that regenerating from a new schema was not going to be a win.

I believe that the generated files used to have the command at the top, or perhaps just a comment indicating that they were generated. I removed it when it became apparent that I wasn't going to regenerate them. It might still be available in the source history though.

@ax3l
Contributor Author

ax3l commented Jan 22, 2018

Thanks for catching up! :)

Unfortunately, the old commands only have a Generated Mon Feb 9 19:08:05 2009 by generateDS.py. on top. Newer versions of generateDS add the full command line that was used for creation, including version information.

@svenevs

svenevs commented Jan 22, 2018

I gave it a little try and indeed in recent versions generateDS depends on lxml with a fallback to ElementTree :)

Sweet. I believe what this means is: if this is all successful, breathe would not need to depend on lxml (likely more Windows-friendly). But assuming it tries to use lxml first, it will definitely use that on RTD! If this all works out we can add a note to the installation guide: try to pip install lxml first, but even if you cannot get it working, breathe will still work. I can help test that down the road.

File ".local/lib/python2.7/site-packages/breathe-4.7.3-py2.7.egg/breathe/finder/core.py", line 29, in create_finder
    return self.finders[data_object.node_type](self.project_info, data_object, self)
AttributeError: 'DoxygenType' object has no attribute 'node_type'

That is part of "step 2", sorry if that wasn't clear. You'll have to manually splice that all back in to the super / sub classes 😱 E.g.

https://github.com/michaeljones/breathe/blob/08577a4b7f9a9173517d7ddf91bd1fa50eac1252/breathe/parser/compound.py#L100-L102

That guy is the only thing that actually tells the rest of breathe what this is (either for directives, signatures, or rendering things like #include <path>).

Also, I'm not sure if you are aware, but you can do "editable" installs with pip which will make this much easier for you to develop than continually reinstalling.

# In directory
# /some/path/breathe
#     - setup.py
#     breathe/
#        parser/ ...
$ pip install -e .

The egg link is generated in the local directory, and you can keep updating the files here without needing to re-install.

@ax3l
Contributor Author

ax3l commented Jan 23, 2018

That is part of "step 2",

I think the main work now is, after replacing the old generated files with newer ones, to go through manually written files like the finder. There we have to make sure they do not use old "minidom"-type members such as node_type, but the member attributes exposed by the new (lxml/ElementTree) parser. Side note: node_type is just the tag name, as far as I can see. That was too late at night; reading your comment again it makes sense, it's just another attribute in the generated files. Making small progress...

@ax3l
Contributor Author

ax3l commented Feb 13, 2018

Btw, I found the changelog of generateDS (probably something like 1.18 in 2009 to 2.29 today).

@ax3l
Contributor Author

ax3l commented Feb 13, 2018

I have added all required node_type attributes and am now sweeping through the side effects.

@ax3l
Contributor Author

ax3l commented Feb 14, 2018

It does not fully build yet, but from a first test the memory consumption is still really bad...
https://github.com/ax3l/breathe/tree/topic-elemTree

@svenevs

svenevs commented Feb 15, 2018

Are you able to quantify "really bad"? Is it still greater than 4GB?

@svenevs

svenevs commented Feb 17, 2018

Also, what version of Python are you using? In compound.py, docTitleType has 270 parameters. Support for more than 255 arguments was only introduced in Python 3.7.

@ax3l
Contributor Author

ax3l commented Feb 21, 2018

Thanks for the hint.
Yes, the memory consumption is still similarly high: at least ~1.9 GB for parsing my ~130 MB of doxygen XML input (but it does not yet build fully with my changes). I think I never saw up to 4 GB on my machine; ~2.3 GB is what I see with master.

I am testing with Python 2.7 and 3.5 - interesting issue with the compound.py param count!

I am not sure if I posted it already, but this is how one can reproduce it with our data:

git clone https://github.com/ComputationalRadiationPhysics/picongpu.git
cd picongpu
git checkout e3cc934c31020ef19a387a9987fac1f5b7595742
cd docs

doxygen

du -hs xml/
# 132 MB

make html
# njom njom, RAM be gone

@svenevs

svenevs commented Feb 21, 2018

Are you working with an installed copy of your breathe/topic-elemTree branch? That's what I installed before running your instructions above, and the 255-argument issue arises:

sven:~/Desktop/picongpu/docs> make html
Running Sphinx v1.6.7
making output directory...

Exception occurred:
  File "/usr/local/lib/python3.6/site-packages/breathe/parser/__init__.py", line 3, in <module>
    from . import compound
  File "/usr/local/lib/python3.6/site-packages/breathe/parser/compound.py", line 6809
    , emsp=None, thinsp=None, zwnj=None, zwj=None, lrm=None, rlm=None, ndash=None, mdash=None, lsquo=None, rsquo=None, sbquo=None, ldquo=None, rdquo=None, bdquo=None, dagger=None, Dagger=None, permil=None, lsaquo=None, rsaquo=None, euro=None, tm=None, valueOf_=None, mixedclass_=None, content_=None):
                    ^
SyntaxError: more than 255 arguments
The full traceback has been saved in /var/folders/wx/j7610hj10r3_h1xs4xr41bgw0000gn/T/sphinx-err-jai8fnd_.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [html] Error 1

Note: I'm not really sure that going through and fixing this will remedy the memory consumption errors. Did you ever look into the cache stuff? I feel like that may be where the memory overhead is coming from. I'm not familiar with the implementation, but presumably sphinx is trying to "pickle" everything that is cached. I would not be surprised if that is the real problem here. However, with "regular" breathe installed it fails to build at a different point, and du -hs build/doctrees only reports 1.9M. So that may not be the issue either 😢

I was playing around with generateDS and hit that and got really sad. It is solvable, but has to be done manually. Interestingly, though, the original parser does not have this problem:

https://github.com/michaeljones/breathe/blob/08577a4b7f9a9173517d7ddf91bd1fa50eac1252/breathe/parser/compound.py#L1042-L1048

Maybe you can just ignore it (and keep the original classes)? I think maybe breathe parses the documentation manually rather than relying on this to be represented by the XML? Having these >255 arguments means later in the pipeline somewhere you have to check them all and turn them into markup (I think).

I lost motivation to pursue, but this is how you would solve it. I think there are only two or three classes this applies to, and it's mostly just because it's representing all of the possible markup options from Doxygen. The above error comes from something like this

class docTitleType(GeneratedsSuper):
    node_type = "doctitle"

    subclass = None
    superclass = None
    def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
        self.original_tagname_ = None
        if ulink is None:
            self.ulink = []
        else:
            self.ulink = ulink
        if bold is None:
            self.bold = []
        else:
            self.bold = bold
        if emphasis is None:
            self.emphasis = []
        else:
            self.emphasis = emphasis

For both compound.py and compoundsuper.py, you have to manually change this to use **kwargs instead of default args, but then I think the rest of it will be OK. By switching to **kwargs, though, you now have to worry about whether or not an argument was specified. Python already has a solution for this: dict.get(key, defaultIfNotFound)

    def __init__(self, **kwargs):
        self.original_tagname_ = None
        # I'm 95% sure you can still do kwargs.get(...)
        self.ulink = kwargs.get("ulink", [])
        self.bold = kwargs.get("bold", [])
        self.emphasis = kwargs.get("emphasis", [])

In the sub-class, instead of

class docTitleTypeSub(supermod.docTitleType):

    node_type = "doctitle"

    def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
        arglist_ = (ulink, bold, emphasis, MANY_OTHERS)
        super(docTitleTypeSub, self).__init__(*arglist_)

you just forward kwargs:

    def __init__(self, **kwargs):
        super(docTitleTypeSub, self).__init__(**kwargs)

@svenevs

svenevs commented Apr 27, 2018

@ax3l (and anyone else interested) I gave a stab at fixing generateDS to use a **kwargs approach, initial work can be found here. The author has been very supportive so far, but he and I are both worried about making a change like this as it could have unexpected consequences.

@michaeljones do you happen to recall whether you ran into issues with >254 arguments when you ran generateDS way back in the day? If so, did you just delete those classes? Or has doxygen updated / added things so that the 254 problem exists now, but did not back then?

@nikhilkalige

Would using the doxygen-generated sqlite database instead of parsing XML help speed up the generation?

@vermeeren
Collaborator

It would probably be quicker than the current method. A proper, efficient XML parser would be at least (around) as fast as using a sqlite database, though.
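For anyone who wants to experiment with the sqlite route: since the schema of Doxygen's sqlite3 output isn't discussed in this thread, a cautious first step is just to enumerate the tables rather than assume their names (the helper below is a sketch, not part of Breathe):

```python
import sqlite3

def inspect_doxygen_db(db_path):
    """List the tables in a Doxygen-generated sqlite3 database.

    The schema isn't documented here, so rather than assuming table
    names, enumerate them from sqlite_master first.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        con.close()
```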

@nikhilkalige

nikhilkalige commented Mar 15, 2019

@melvinvermeeren Our codebase has around 5000 files; as of now this takes around 6-7 hours to complete, so I am looking for ideas to speed up the process.

Note: https://github.com/mosra/m.css finishes in minutes, I should take a look to see if I can get some ideas from there.

@cheshirekow

Sorry to revive a very old thread, but I'm trying to debug some breathe performance issues. My sphinx build goes from 20sec to 3min when I add breathe (and one page referencing a few dozen .. doxygenstruct:: ). I didn't fully follow this thread but here is my understanding:

  1. The root cause might be due to the use of minidom to parse the xml. It is perhaps leaky and uses significant memory to store the internal representation of the DOM
  2. The way this presents itself is likely via caching of the xml themselves.

There's quite a bit of discussion on replacing minidom, something about some generated code (which I'm not really following), and there was one comment suggesting that disabling the cache actually might help. @ax3l , @svenevs, @michaeljones and anyone else who might be following this thread: do you think 1-2 weeks of dedicated engineering time would be sufficient to solve this problem? Or is it even bigger than that?

I have some person-hours I can donate to the project, but not for too long. Where would be the best place for them to start? (parser/__init__.py line 99 doesn't exist anymore. I didn't see anything obviously doing caching in there.)

@vermeeren are you the current maintainer? I saw that you commented on #544. Maybe you know?

@svenevs

svenevs commented Jul 8, 2022

Brief response from phone: the snag is that breathe's internal hierarchy representation was built with something that takes the doxygen XML schema and turns it into Python classes (generateDS).

For as close as I got, I don't think it's a great path forward. I think there are enough test cases in place to know if things fully break, meaning changing something like the backend parsing could be achievable. But I'm not sure how much of the rest of breathe needs to change if the generateDS graph goes away (all the compound super stuff).

My vote at this point would be using lxml to parse, or BeautifulSoup + lxml, but BeautifulSoup may add noticeable overhead. FWIW, I heard a tale about lxml and opening too many files at once leading to slowdowns, so if implemented you'd want to be sure to close things as soon as you're done with them.

Changing the xml parsing would be hugely valuable to this project, but I'm not sure how substantive of an effort this will be (current maintainers could give a better idea). If it gets started but not finished I'm open to trying to slowly pick up the slack, this is something we've all desired for a long time but it's a big cookie.

@vermeeren
Collaborator

@cheshirekow Although I am indeed the current (primary) maintainer since 2018-ish there are many technical parts of Breathe that I haven't worked with in-depth, which includes the XML parser. Generally speaking I have an idea of what and where but there are many specifics that I don't yet know much of. (Often this is enough to help with issues, contributions and questions, but for plenty of complex topics it isn't.)

So I'm afraid I likely know less than the other participants in this thread and have no idea how much time it would take to implement the changes desired or what you could run into. @michaeljones probably knows most, but considering the years passed since he wrote the current implementation it might be tricky to recall the specifics.

@michaeljones
Collaborator

Yes, it has been a long time and the details are quite blurry to me. I've just re-read the thread but don't have anything specific to add.

I realise it is all long overdue. I would be happy enough to attempt it if properly funded. Though my preferences are not really for Python any more. When I think of tackling this, I dream of re-writing some of the parsing & data handling in Rust and expose just enough to the Python layer for it to fill out the structures that Sphinx needs. That would be silly if there was a more drop in alternative for minidom that would solve the issue more cleanly but I'm not active enough in the Python ecosystem to understand that route.

A step forward perhaps, would be for people to provide example code bases to work with so there is something to profile and inspect and then anyone that attempts it to report data and findings back here. I don't really know what the memory tracing and performance tooling is like in Python though.

@jakobandersen
Collaborator

Some time ago I looked into regenerating the parser with generateDS, both because the Doxygen XML has changed with newer versions and because generateDS seems to use lxml now. So I think that may be the easiest first step.
However, the parser has been manually tweaked since the auto-generation (82 commits have changed the parser/ folder), and many of them add functionality which may or may not be auto-generated nowadays. So my investigation back then recursed into making tests for these things, but we don't really have a proper testing framework for the things in the examples/specific/ folder.

So, assuming we would have development resources, my suggestion for a "roadmap" would be:

  1. Make a testing setup (see Investigate snapshot tests for output #733) where for each test we can specify
  • Input source code (C, C++, etc.).
  • Doxygen config.
  • Sphinx input (rst).
  • Sphinx config.
  • Expected doctree (i.e., the XML builder output of Sphinx), per combination of Sphinx and Doxygen.
    Right now we have a bunch of files and folders, but maybe we should have all this written inline in the Python test files. At least we should make it as convenient as possible to add a new test.
    Probably one can use more advanced features of pytest to make this very easy, a la the tests in Sphinx that use fixtures to set up a complete build.
  2. Port all the existing 'tests' to the new setup.
  3. Add tests for all the things changed manually in the parser folder.
  4. Set up a script to regenerate the parser using generateDS.
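Step 1 of the roadmap, with inputs written inline in the Python test files, could be sketched like this (helper and constant names are hypothetical, not Breathe's actual test API):

```python
from pathlib import Path

# Inline test inputs, as suggested in the roadmap above: everything
# the test case needs lives in the Python file itself.
CPP_SOURCE = "/// A documented function.\nint answer() { return 42; }\n"
RST_INPUT = ".. doxygenfunction:: answer\n"

def write_test_case(case_dir: Path) -> dict:
    """Materialize one inline test case into a scratch directory."""
    case_dir.mkdir(parents=True, exist_ok=True)
    files = {
        "input.cpp": CPP_SOURCE,
        "index.rst": RST_INPUT,
    }
    for name, body in files.items():
        (case_dir / name).write_text(body)
    # A real test would now run doxygen plus the Sphinx XML builder
    # here and diff the resulting doctree against a stored snapshot.
    return files
```

With pytest, the scratch directory would come from the built-in `tmp_path` fixture, one fixture per doxygen/sphinx build step.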

@DuTogira

I took the liberty of profiling breathe, to the best of my ability, to help isolate what's causing such a significant slowdown.

I profiled a private project which takes about 30 seconds to generate doxygen documentation for, but to which the introduction of breathe adds an additional ~4 minutes of build time. Due to its nature, I can't share the project itself, but I can share the methodology by which I went about profiling.
After downloading both sphinx's and breathe's source code into my project's main folder, I introduced cProfile as a wrapper around sphinx's sphinx/application.py Sphinx.build. Any actions which happen outside of that call stack would not have been captured by my profiling, but the profiler did run in the neighborhood of 4 minutes (which lines up with breathe's added overhead).
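The wrapper described above looks roughly like the following sketch; the `build` function here is a stand-in for `Sphinx.build`, since wiring up a real Sphinx application needs project-specific arguments:

```python
import cProfile
import io
import pstats

def build():
    # Stand-in for sphinx.application.Sphinx.build; wrap your real
    # entry point the same way to capture its whole call stack.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
build()
profiler.disable()

# Sort by cumulative time to see which call trees dominate overall;
# use sort_stats("tottime") instead to see time spent inside each
# function itself, excluding sub-calls.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

The `tottime` and `cumtime` columns in the printed table are the two figures referenced below.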

tottime represents time spent directly in a function, while cumtime is tottime plus any time spent in calls to sub-functions.
[Screenshot: redacted cProfile output (BreatheProfileRedacted)]

Based on my profiling, I'm not certain that caching is the culprit where runtime performance is concerned. It seems that minidom's parsing itself is underperforming. This supports the conclusion that replacing minidom may be the best path forward for breathe, though it does suggest that the issue goes deeper than memory mismanagement.

@michaeljones
Copy link
Collaborator

Thanks for sharing those results @DuTogira - definitely interesting to see.

@DuTogira
Copy link

DuTogira commented Jul 25, 2022

@michaeljones / @svenevs based on this, what's the best path forward? It seems we have multiple avenues, including:

  1. Switching to lxml as the default parser when available
  2. Moving away from the generateDS-generated code toward a different parsing backend

Is either of these a feasible avenue forward that we could hope for in the near future?

@michaeljones
Copy link
Collaborator

I'm not sure anyone is in a good position to answer that. It would take some time to assess the current state of the code and figure out the best way forward. To my knowledge, no one active on the project has a full understanding of the code base due to either having not worked on all of it or having forgotten what they once knew.

There are a few of us who could be funded to work on it but otherwise it is for others to pick up. Speaking for myself, I'm afraid that even getting into a position where I could advise on the best way forward would take time that I don't currently have the motivation for.

I'm sorry that these statements are a bit of a bummer. I'm keen to see the project properly funded by the companies that benefit from its existence but I'm not sure how to go about making that a reality so I'm neither moving forward with funding nor with active work in the absence of funding. We do have an open collective if you (or your company should there be one) are in a position to offer funding.

@cheshirekow
Copy link

We do have an open collective if you (or your company should there be one) are in a position to offer funding.

Can you tell me more about the funding model via this collective? Can we "buy a feature", or is this a general "support us if you use us" kind of thing? I can certainly request my employer to fund the performance improvements we are looking for, and I'm happy to push the buttons to get that started.

We can also fund the development with (some) eng hours to do the work, though that would go best with some partnership to guide the developer on what needs to be done where.

@michaeljones
Copy link
Collaborator

Thank you for asking. We don't have a well-established model yet, but at least two of our maintainers are able to work under a "buy a feature" approach, I think. I'm not experienced with how those discussions go or how to price the work, but it would definitely be something that we'd be happy to explore.

We'd also be happy to have "pay it if you use it" funding and put those funds towards specific feature work but I can understand if that seems like less of a direct return for some business interests.

@AnthonyDiGirolamo
Copy link

If anyone is still interested in lxml, I started a PR for it: #891. It's not complete, but unit tests are passing.
