Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add adoptExternalDocument to public header #240

Merged
merged 3 commits into from Apr 22, 2017

Conversation

albarrentine
Copy link
Contributor

Allows the creation of LxmlDocument structs used throughout lxml.etree from a raw libxml xmlDoc pointer in a C/Cython extension.

Example usage:

cimport lxml.includes.etreepublic as cetree
from lxml.includes.etreepublic cimport tree

# import the lxml.etree module in Python
cdef object etree
from lxml import etree

# initialize the access to the C-API of lxml.etree
cetree.import_lxml__etree()

from lxml.includes.etreepublic cimport _Document, documentFactory

from my_extension cimport some_c_function

_html_parser = etree.HTMLParser()

cdef _Document parse_html(html):
    cdef _Document doc
    cdef tree.xmlDoc* c_doc
    c_doc = some_c_function(html)
    doc = documentFactory(c_doc, _html_parser)
    return doc

My current use case is the one mentioned in #239 (closed in favor of this, simpler to move that extension to a separate project), where I would like to use an external C HTML parser (gumbo-parser), build a libxml tree from its output (gumbo-libxml), and have the ability to run XPaths, cleaner, etc. on said tree using lxml.

This could generally open the door for using other C parsers for lxml.

@scoder
Copy link
Member

scoder commented Apr 22, 2017

Would it be acceptable for you to have a combined C-API function instead that wraps both _documentFactory() and _elementFactory()? I don't see a use case for the documentFactory() all by itself, so I'd rather avoid exposing that low-level function.

Proposal in search of a better name: wrapExternalElement(xmlNode* c_node, parser)

That function would then also check that the tree doesn't contain any non-NULL _private proxy object pointers yet, and raise an error if it does.

@scoder
Copy link
Member

scoder commented Apr 22, 2017

Speaking of which, please take a look at the newly added adopt_external_document() function in lxml.etree.pyx, which was implemented for the same purpose, but is accessible from the Python level using a PyCapsule interface. I'd be happy to add a C-level API as well, unless this already suits you. It does have the advantage of not requiring any lxml headers etc.

See this ticket for why it was added:
https://bugs.launchpad.net/lxml/+bug/1595781

The author refers to the Gumbo parser as well, but I can't say whether he ended up using it. Sorry, I had completely forgotten about that part.

@albarrentine
Copy link
Contributor Author

albarrentine commented Apr 22, 2017

@scoder ah, hadn't seen that PR. I think I'd prefer to access it directly from C/Cython rather than rope in PyCapsules. Would something like this work on your end?

cdef public api _ElementTree adoptExternalDocument(xmlDoc* c_doc, parser, bint is_owned):
    if c_doc is NULL:
        raise TypeError
    doc = _adoptForeignDoc(c_doc, parser, is_owned)
    return _elementTreeFactory(doc, None)

@albarrentine albarrentine changed the title add documentFactory to public header add adoptExternalDocument to public header Apr 22, 2017
@scoder
Copy link
Member

scoder commented Apr 22, 2017

Yes, perfect. Thanks!

@scoder scoder merged commit 536790d into lxml:master Apr 22, 2017
@albarrentine albarrentine deleted the document_factory_public_api branch April 22, 2017 18:21
@albarrentine
Copy link
Contributor Author

albarrentine commented Apr 22, 2017

That's great! Confirmed that it's working in https://github.com/thatdatabaseguy/gumbo_lxml. Would you be able to cut a new version of lxml that I can require?

@scoder
Copy link
Member

scoder commented Apr 22, 2017

Haven't got a release date yet, but I was planning to release 3.8.0 pretty soon, yes.

@albarrentine
Copy link
Contributor Author

Sounds good, will keep an eye out for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants