New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add adoptExternalDocument to public header #240

Merged
merged 3 commits into from Apr 22, 2017

Conversation

Projects
None yet
2 participants
@albarrentine
Contributor

albarrentine commented Apr 14, 2017

Allows the creation of LxmlDocument structs used throughout lxml.etree from a raw libxml xmlDoc pointer in a C/Cython extension.

Example usage:

cimport lxml.includes.etreepublic as cetree
from lxml.includes.etreepublic cimport tree

# import the lxml.etree module in Python
cdef object etree
from lxml import etree

# initialize the access to the C-API of lxml.etree
cetree.import_lxml__etree()

from lxml.includes.etreepublic cimport _Document, documentFactory

from my_extension cimport some_c_function

_html_parser = etree.HTMLParser()

cdef _Document parse_html(html):
    cdef _Document doc
    cdef tree.xmlDoc* c_doc
    c_doc = some_c_function(html)
    doc = documentFactory(c_doc, _html_parser)
    return doc

My current use case is the one mentioned in #239 (closed in favor of this, simpler to move that extension to a separate project), where I would like to use an external C HTML parser (gumbo-parser), build a libxml tree from its output (gumbo-libxml), and have the ability to run XPaths, cleaner, etc. on said tree using lxml.

This could generally open the door for using other C parsers for lxml.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Apr 22, 2017

Member

Would it be acceptable for you to have a combined C-API function instead that wraps both _documentFactory() and _elementFactory()? I don't see a use case for the documentFactory() all by itself, so I'd rather avoid exposing that low-level function.

Proposal in search of a better name: wrapExternalElement(xmlNode* c_node, parser)

That function would then also check that the tree doesn't contain any non-NULL _private proxy object pointers yet, and raise an error if it does.

Member

scoder commented Apr 22, 2017

Would it be acceptable for you to have a combined C-API function instead that wraps both _documentFactory() and _elementFactory()? I don't see a use case for the documentFactory() all by itself, so I'd rather avoid exposing that low-level function.

Proposal in search of a better name: wrapExternalElement(xmlNode* c_node, parser)

That function would then also check that the tree doesn't contain any non-NULL _private proxy object pointers yet, and raise an error if it does.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Apr 22, 2017

Member

Speaking of which, please take a look at the newly added adopt_external_document() function in lxml.etree.pyx, which was implemented for the same purpose, but is accessible from the Python level using a PyCapsule interface. I'd be happy to add a C-level API as well, unless this already suits you. It does have the advantage of not requiring any lxml headers etc.

See this ticket for why it was added:
https://bugs.launchpad.net/lxml/+bug/1595781

The author refers to the Gumbo parser as well, but I can't say whether he ended up using it. Sorry, I had completely forgotten about that part.

Member

scoder commented Apr 22, 2017

Speaking of which, please take a look at the newly added adopt_external_document() function in lxml.etree.pyx, which was implemented for the same purpose, but is accessible from the Python level using a PyCapsule interface. I'd be happy to add a C-level API as well, unless this already suits you. It does have the advantage of not requiring any lxml headers etc.

See this ticket for why it was added:
https://bugs.launchpad.net/lxml/+bug/1595781

The author refers to the Gumbo parser as well, but I can't say whether he ended up using it. Sorry, I had completely forgotten about that part.

@albarrentine

This comment has been minimized.

Show comment
Hide comment
@albarrentine

albarrentine Apr 22, 2017

Contributor

@scoder ah, hadn't seen that PR. I think I'd prefer to access it directly from C/Cython rather than rope in PyCapsules. Would something like this work on your end?

cdef public api _ElementTree adoptExternalDocument(xmlDoc* c_doc, parser, bint is_owned):
    if c_doc is NULL:
        raise TypeError
    doc = _adoptForeignDoc(c_doc, parser, is_owned)
    return _elementTreeFactory(doc, None)
Contributor

albarrentine commented Apr 22, 2017

@scoder ah, hadn't seen that PR. I think I'd prefer to access it directly from C/Cython rather than rope in PyCapsules. Would something like this work on your end?

cdef public api _ElementTree adoptExternalDocument(xmlDoc* c_doc, parser, bint is_owned):
    if c_doc is NULL:
        raise TypeError
    doc = _adoptForeignDoc(c_doc, parser, is_owned)
    return _elementTreeFactory(doc, None)

@albarrentine albarrentine changed the title from add documentFactory to public header to add adoptExternalDocument to public header Apr 22, 2017

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Apr 22, 2017

Member

Yes, perfect. Thanks!

Member

scoder commented Apr 22, 2017

Yes, perfect. Thanks!

@scoder scoder merged commit 536790d into lxml:master Apr 22, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@albarrentine albarrentine deleted the albarrentine:document_factory_public_api branch Apr 22, 2017

@albarrentine

This comment has been minimized.

Show comment
Hide comment
@albarrentine

albarrentine Apr 22, 2017

Contributor

That's great! Confirmed that it's working in https://github.com/thatdatabaseguy/gumbo_lxml. Would you be able to cut a new version of lxml that I can require?

Contributor

albarrentine commented Apr 22, 2017

That's great! Confirmed that it's working in https://github.com/thatdatabaseguy/gumbo_lxml. Would you be able to cut a new version of lxml that I can require?

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Apr 22, 2017

Member

Haven't got a release date yet, but I was planning to release 3.8.0 pretty soon, yes.

Member

scoder commented Apr 22, 2017

Haven't got a release date yet, but I was planning to release 3.8.0 pretty soon, yes.

@albarrentine

This comment has been minimized.

Show comment
Hide comment
@albarrentine

albarrentine Apr 22, 2017

Contributor

Sounds good, will keep an eye out for it.

Contributor

albarrentine commented Apr 22, 2017

Sounds good, will keep an eye out for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment