Skip to content
This repository has been archived by the owner on Jan 15, 2020. It is now read-only.

New C extension backend #58

Closed
wants to merge 13 commits into from
Closed

New C extension backend #58

wants to merge 13 commits into from

Conversation

rtobar
Copy link
Contributor

@rtobar rtobar commented Oct 19, 2016

This pull request is to add a new backend based on a C extension. The C extension exposes three generators: basic_parse, parse and items, which work the same way their counterparts in the other backends and in the common module work. Reference counting seems correct, creating neither memory corruptions nor memory leaks when parsing big amounts of data.

The main advantage of this new backend is its speed. Local tests indicate a gain of ~10x for the items generator, and even more for the parse and basic_parse generators, when compared with the cffi backend. This makes this backend comparable to the json module in terms of speed, while still giving users an iterative way of consuming a JSON stream, and therefore keeping memory consumption low.

I've re-organized the commits from their original form to make the pull request more compact and manageable. Commits include not only the new backend, but also their integration to the package installation procedure (optional, only if the compiler/libraries are found) and to the tests suite. I've also modified the travis configuration to use YAJL2 for testing, effectively testing more backends now (4 instead of 2).

This new C extension backend implements three objects (basic_parse, parse and
items) which work like an iterator (i.e., they implement the iter and iternext
methods). They do the same work their counterparts in the other backends do;
that is, basic_parse generates tuples of (event, value), parse adds a path to
that tuple, and items generates objects out of those. The fact that they are
written in C, plus a few other minor goodies, means that their performance is
substantially better than the other backends.

The code tries to take all error cases into account and return the correct
values when needed. Reference counting seems correct when things work
correctly; that is no memory corruption or memory leak occurs when parsing bigs
amount of data. Reference counting might be incorrect in some error situations,
but that's work future work.

The C extension itself is called _yajl2. Its functionality is in turn exposed
via the yajl2_c python module which is what users should finally import.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
The new backend requires a C compiler and the YAJL development headers and
library to be present on the system. If the conditions are not right we don't
include extension to the setup process.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Tests run but were not reporting which backends were not being tested. The
find_yajl_ctypes function also failed sometimes with OSError instead of
YAJLImportError.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Using YAJL2 means that we test more backends (i.e., python, yajl2, yajl2_cffi
and yajl2_c) than we currently do with libyajl1 (i.e., python and yajl only).
One can see the difference in the number of tests run (32 against 58).

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
This is purely for aeasthetic reasons. There was also a bit a duplicated code
in the macro definitions which has now been unified.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@rtobar
Copy link
Contributor Author

rtobar commented Nov 3, 2016

Hi Ivan,

Any chance to get this work reviewed and integrated?

gen->path always contains at least 1 element, and when a map_start event is
found a new None element is added to the list. Thus the logic to find whether
this map key is the first element of the prefix returned by this generator was
wrong and generated ".key" for a JSON like '{"key": 1}'

I've added a unit test to make sure this problem doesn't arise again.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@isagalaev
Copy link
Owner

Any chance to get this work reviewed and integrated?

Yes! Looks like great stuff. I'll look at it once I get around ijson again. Being ve-e-ery busy at the moment :-)

This new way allow for environments to be passed down (CC, CFLAGS et al.) and
also leaves less temporary files behind.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
* ICRAR - International Centre for Radio Astronomy Research
* (c) UWA - The University of Western Australia, 2016
* Copyright by UWA (in the framework of the ICRAR)
* All rights reserved
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's compatible with the BSD license of ijson. The copyright statement itself is fine of course, since you're the author, but "All rights reserved" specifically contradicts the explicit grant of the BSD license to do pretty much anything. Would UWA agree to license this work under BSD, do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm, good point, I haven't given much thought to that particular phrase really, I simply added it from the template I've seen in other codes of ours without realizing it conflicted with the license. I will have to bring this up and come back with an answer.

from ijson import common
from ._yajl2 import basic_parse as _basic_parse # @UnresolvedImport
from ._yajl2 import parse as _parse # @UnresolvedImport
from ._yajl2 import items as _items # @UnresolvedImport
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: I'd prefer _yajl2 renamed to pyyajl2 and importing the module itself instead of the individual objects, and refer to them as pyyajl2.basic_parse, pyyajl2.parse and pyyajl2.items from code. This would also remove the need to start their names with _.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing the C module itself instead of its individual parts sounds fine, I will change that.

Regarding the name, what is the motivation for having pyyajl2 instead of _yajl2? In the python library python wrappers for C-written seem to be usually named likewise (e.g. ssl imports _ssl, socket imports _socket, struct imports _struct). Following that I named the C module _yajl2.c, without the final "c" in the name as it is already present in the extension. I think also that the name pyyajl2 somehow would imply that the extension is written in pure python, which in turn is confusing again since there is already a python backend.

Let me know if you want to reconsider the naming; otherwise I will go ahead with the names you suggest.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyyajl2 was just a subjective preference… But I see your point on consistency with other libraries, so let's leave the naming as is. Thanks for pointing it out!

tests.py Outdated
@@ -205,6 +211,7 @@ def test_api(self):
{'backend': import_module('ijson.backends.%s' % name)},
)
except ImportError:
print("Skipping backend", name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would add unnecessary noise to tests output, let's not add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no worries, I'll take that out.

@isagalaev
Copy link
Owner

Hi Rodrigo! I have finally found some time to review this pull request. It is a great addition and I'm looking forward to committing it after we iron out a few things I mentioned above. Thanks a lot!

@rtobar
Copy link
Contributor Author

rtobar commented Jan 8, 2017

Hi Ivan,

Thanks to you for taking the time to review these changes. I'm currently on holiday, and won't be back to the office until the last week of January, so I might or not be able to do something before then.

I responded to your observations above, which all seem very reasonable. I'm expecting small feedback only on one small point, the rest I'll implement as requested.

Thanks again!

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@isagalaev
Copy link
Owner

Thanks for a quick reply! I'll try to be more responsive as well :-) Take your holiday time!

@KenjiTakahashi
Copy link

KenjiTakahashi commented Jan 12, 2017

Hi, I have a need to iteratively parse very large JSON objects and I've decided to try this branch out for more speed.
It works fine with Py2.7, but with Py3.6, I get a segfault.
Here's a backtrace

(gdb) bt
#0  0x00007fbaf1320635 in ?? () from /usr/lib/libpython3.6m.so.1.0
#1  0x00007fbaf13b579e in PyArg_ParseTupleAndKeywords () from /usr/lib/libpython3.6m.so.1.0
#2  0x00007fbaeff881b3 in basicparse_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=<optimized out>) at ijson/backends/_yajl2.c:233
#3  0x00007fbaeff87cdf in parsegen_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=<optimized out>) at ijson/backends/_yajl2.c:447
#4  0x00007fbaeff87c09 in itemsgen_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=0x0) at ijson/backends/_yajl2.c:760
#5  0x00007fbaf14432ce in ?? () from /usr/lib/libpython3.6m.so.1.0
#6  0x00007fbaf145d85c in _PyObject_FastCallDict () from /usr/lib/libpython3.6m.so.1.0
#7  0x00007fbaf13e4396 in ?? () from /usr/lib/libpython3.6m.so.1.0
#8  0x00007fbaf13a5067 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.6m.so.1.0
#9  0x00007fbaf13e3d4a in ?? () from /usr/lib/libpython3.6m.so.1.0
#10 0x00007fbaf13e4303 in ?? () from /usr/lib/libpython3.6m.so.1.0
#11 0x00007fbaf13a5067 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.6m.so.1.0
#12 0x00007fbaf13e4757 in PyEval_EvalCodeEx () from /usr/lib/libpython3.6m.so.1.0
#13 0x00007fbaf13a4d4b in PyEval_EvalCode () from /usr/lib/libpython3.6m.so.1.0
#14 0x00007fbaf1486112 in ?? () from /usr/lib/libpython3.6m.so.1.0
#15 0x00007fbaf148897d in PyRun_FileExFlags () from /usr/lib/libpython3.6m.so.1.0
#16 0x00007fbaf1488b67 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.6m.so.1.0
#17 0x00007fbaf147ca91 in Py_Main () from /usr/lib/libpython3.6m.so.1.0
#18 0x0000000000400a5d in main ()
(gdb)

and a simple code that replicates it

from io import StringIO

import ijson.backends.yajl2_c as ijson

test = StringIO('[{"events": 1}]')

items = ijson.items(test, 'item.events')

Thanks for working on this, cheers!

@rtobar
Copy link
Contributor Author

rtobar commented Jan 13, 2017

@KenjiTakahashi Python 3.6 is fairly new, it wasn't around at the time of submitting these changes, and thus it hadn't been tested until now. I'll definitely have a look at this new issue once I'm back at work in ~1.5 weeks; until then you can use Python 3.5 which should work fine.

This was due to the missing final NULL in he keyword list. This should have
always have been there, but until CPython 3.6 the error wasn't evident (and
even with 3.6 it's not always reproducible -- for instance, the unit tests all
nicely run).

I'm also extending the python versions used by Travis to include 3.5 and 3.6
for better coverage.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
This phrase is a formality that no longer has legal effects, but caused
confusion with the permissive grants of the 3-clause BSD license under which
ijson is distributed.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@rtobar
Copy link
Contributor Author

rtobar commented Jan 24, 2017

Hi @isagalaev,

I'm finally back at work!

After some reading I learned that the "All rights reserved" phrase has no legal effect at all nowadays, and thus I simply removed it from the two files I added in this pull request. The importing of the C module has been changed as agreed, and the additional printing from the tests has been removed as well.

I've also included a fix for the bug reported by @KenjiTakahashi, and extended the python versions used by travis to include 3.5 and 3.6.

@KenjiTakahashi
Copy link

@isagalaev any plans for getting this in?

@rtobar
Copy link
Contributor Author

rtobar commented May 24, 2017

pinging again

@isagalaev
Copy link
Owner

Sorry, I'm trying to find time to merge it and release a new version. I thought I'd be able to do it during PyCon past weekend but didn't get around…

@rtobar
Copy link
Contributor Author

rtobar commented Oct 24, 2017

Just the usual ping...

@KenjiTakahashi
Copy link

It's been some time again, hasn't it :-)?

setup.py Outdated
libraries = ['yajl'])
setupArgs['ext_modules'] = [yajl_ext]

setup(**setupArgs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no new line :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a commit for this, was it an actual problem though or just aesthetic?

I haven't heard this is an actual issue that blocks compilation/usage,
but on the other hand it doesn't hurt.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@wichert
Copy link

wichert commented Jun 6, 2019

@isagalaev Yearly ping - is there a chance you can take a look at this again and hopefully merge it?

@isagalaev
Copy link
Owner

Hey everyone. I lost track of things happening here a while ago and don't remember why I didn't merge it in the first place. It's also unlikely I'll find time to get back to it any time soon. As far as I understand, @rtobar fork is alive and well, so I recommend everyone just use it.

(I'll close the ticket to avoid confusion, but feel free to continue conversation if needed.)

@isagalaev isagalaev closed this Jun 6, 2019
@wichert
Copy link

wichert commented Jun 7, 2019

@isagalaev Assuming @rtobar is interested, would you be willing to give him PyPi privileges to make new releases?

@rtobar
Copy link
Contributor Author

rtobar commented Jun 7, 2019

Sure, I wouldn't mind on my side. My user in PyPI is rtobar too.

@isagalaev
Copy link
Owner

@rtobar hey! Added you on PyPI as an owner. I can't seem to remove myself, so feel free to either leave me there for precaution or take it over completely.

@rtobar
Copy link
Contributor Author

rtobar commented Jun 11, 2019

Thanks Ivan! I already exercised this and uploaded a new 2.4 version of ijson to PyPI, including binary wheels for the main x64 linux/macos.

@isagalaev
Copy link
Owner

Yay! Glad not to be a bottleneck anymore! And thanks to @wichert for kicking me in the right direction.

arthurzam added a commit to arthurzam/gentoo that referenced this pull request Oct 24, 2019
Maintenance over this package was passed to new developer, based on
previous maintainer comment:
isagalaev/ijson#58 (comment)

Package-Manager: Portage-2.3.78, Repoman-2.3.17
Signed-off-by: Zamarin Arthur <arthurzam@gmail.com>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this pull request Nov 9, 2019
Maintenance over this package was passed to new developer, based on
previous maintainer comment:
isagalaev/ijson#58 (comment)

Package-Manager: Portage-2.3.78, Repoman-2.3.17
Signed-off-by: Zamarin Arthur <arthurzam@gmail.com>
Signed-off-by: Michał Górny <mgorny@gentoo.org>
@rtobar rtobar deleted the cext branch February 11, 2020 02:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants