add adoptExternalDocument to public header #240
Allows the creation of LxmlDocument structs used throughout lxml.etree from a raw libxml xmlDoc pointer in a C/Cython extension.
cimport lxml.includes.etreepublic as cetree
from lxml.includes.etreepublic cimport tree

# import the lxml.etree module in Python
cdef object etree
from lxml import etree

# initialize the access to the C-API of lxml.etree
cetree.import_lxml__etree()

from lxml.includes.etreepublic cimport _Document, documentFactory
from my_extension cimport some_c_function

_html_parser = etree.HTMLParser()

cdef _Document parse_html(html):
    cdef _Document doc
    cdef tree.xmlDoc* c_doc
    c_doc = some_c_function(html)
    doc = documentFactory(c_doc, _html_parser)
    return doc
My current use case is the one mentioned in #239 (closed in favor of this PR, since it is simpler to move that extension to a separate project): I would like to use an external C HTML parser (gumbo-parser), build a libxml tree from its output (gumbo-libxml), and then run XPath queries, the cleaner, etc. on that tree using lxml.
More generally, this could open the door to using other C parsers with lxml.
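To illustrate what "running XPaths on said tree" looks like from the Python side once a document is in lxml's hands, here is a minimal sketch. It assumes lxml is installed and uses lxml's built-in `etree.HTMLParser` as a stand-in for the external parser, since the gumbo-based extension is not part of this PR; the function name `paragraph_texts` is hypothetical.

```python
from lxml import etree


def paragraph_texts(html_bytes):
    """Return the text of every <p> element in an HTML document.

    Assumption: lxml's own HTMLParser builds the tree here. With the
    change proposed in this PR, the tree would instead come from a
    libxml document created by an external C parser.
    """
    parser = etree.HTMLParser()
    root = etree.fromstring(html_bytes, parser)
    # XPath evaluation works the same way on any lxml-wrapped tree.
    return root.xpath("//p/text()")


sample = b"<html><body><p class='a'>one</p><p>two</p></body></html>"
texts = paragraph_texts(sample)
```

The point of the proposal is that the last step stays identical no matter which parser produced the underlying libxml document.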
referenced this pull request
Apr 19, 2017
Would it be acceptable for you to have a combined C-API function instead that wraps both
Proposal in search of a better name:
That function would then also check that the tree doesn't contain any non-NULL
Speaking of which, please take a look at the newly added
See this ticket for why it was added:
The author refers to the Gumbo parser as well, but I can't say whether he ended up using it. Sorry, I had completely forgotten about that part.
@scoder ah, hadn't seen that PR. I think I'd prefer to access it directly from C/Cython rather than rope in PyCapsules. Would something like this work on your end?
cdef public api _ElementTree adoptExternalDocument(xmlDoc* c_doc, parser, bint is_owned):
    if c_doc is NULL:
        raise TypeError
    doc = _adoptForeignDoc(c_doc, parser, is_owned)
    return _elementTreeFactory(doc, None)