Numeric decimals being converted into Decimal() objects #13

Closed
mittenchops opened this Issue Sep 7, 2013 · 16 comments

mittenchops commented Sep 7, 2013

I'm reading from a geographic shapefile that has, among other data types, a list of coordinates, such as...

{
  "type": "FeatureCollection",
  "features": [
{
  "geometry": {
    "type": "Polygon", 
    "coordinates": [
      [
        [
          -123.09098499999993, 
          45.77992400000005
        ], 

The coordinates are getting converted in ijson into:

[Decimal('-123.09098499999993'), Decimal('45.77992400000005')]

I'm not sure whether this is a bug or a feature, but it's making my database choke. =)

My ijson loader looks like this:

def loader(shapefile, collection):
    jf = open(shapefile)
    data = ijson.items(jf,'features')
    for d in data:
        try:
            db[collection].insert(d)
        except:
            pass

My vanilla loader, without ijson, looked like this (it worked, but not on the large files I need ijson for):

def loader(shapefile, collection):
    jf = open(shapefile)
    data = json.load(jf)
    for d in data['features']:
        db[collection].insert(d)

My database is complaining about the Decimal() objects. Where would I add something to convert only the Decimal() objects into their numeric values?

Owner

isagalaev commented Sep 7, 2013

This is indeed a feature, I chose Decimals as it's the only way to represent numbers of arbitrary precision. You can work around it in two ways:

  • Your database driver might have a way to provide a converter for data types where you can float(...) Decimal values.
  • Otherwise you can convert those values on the lower level of parsing by manually combining parsed events from a backend with the items helper:
from ijson import common, yajl2
from itertools import imap

def floaten(event):
    if event[1] == 'number':
        return (event[0], event[1], float(event[2]))
    else:
        return event

jf = open(shapefile)
events = imap(floaten, yajl2.parse(jf))
data = common.items(events, 'features')

Some notes:

  • If the yajl2 backend is not available on your machine, try using yajl and then python (this is what ijson does under the hood when you just import ijson)
  • In Python 3 there's no need to import imap from itertools, you can just use regular built-in map. In Python 2 map would load everything in memory, defeating the purpose.
  • I didn't test the code, expect typos :-)
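
A minimal sketch of the first workaround, converting values just before they reach the database. The helper name `decimals_to_floats` and the sample feature are hypothetical and not part of ijson or any database driver; the idea is only that each parsed item can be walked once and every Decimal swapped for a float.

```python
from decimal import Decimal


def decimals_to_floats(obj):
    """Return a copy of obj with every Decimal replaced by a float.

    Hypothetical helper for illustration; it handles the dict/list
    structures that a parsed JSON document is made of.
    """
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, dict):
        return {key: decimals_to_floats(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [decimals_to_floats(value) for value in obj]
    return obj


# Made-up feature shaped like the GeoJSON in the question.
feature = {
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[Decimal("-123.5"), Decimal("45.25")]]],
    }
}
converted = decimals_to_floats(feature)
print(converted["geometry"]["coordinates"][0][0])  # [-123.5, 45.25]
```

In a loader, this would presumably be called on each item right before the insert. It costs a second pass over every item, which the event-level approach avoids.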

Author

mittenchops commented Sep 7, 2013

Ah, sorry, I submitted the pull request before I saw this.

Owner

isagalaev commented Sep 7, 2013

You're very fast :-).

Anyway, since using Decimal is a deliberate decision, the more straightforward way to "fix" it would be to just replace Decimals with floats in the backends, not introduce a backward conversion after the fact :-)

Author

mittenchops commented Sep 7, 2013

You're quick, too, thanks! Yeah, it looked deliberate and my hack clearly felt like a hack, so that makes sense. =)

I'm trying to work this in, and I guess I'm having trouble with some facet of yajl2, as you predicted.

>>> from ijson import common, yajl2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name yajl2

So, I tried

>>> from ijson.backends import yajl2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/machine/local/lib/python2.7/site-packages/ijson/backends/yajl2.py", line 14, in <module>
    yajl = backends.find_yajl(2)
  File "/home/machine/local/lib/python2.7/site-packages/ijson/backends/__init__.py", line 14, in find_yajl
    raise YAJLImportError('YAJL version %s.x required, found %s.%s.%s' % (required, major, minor, micro))
>>> from ijson.backends import yajl
>>> from ijson.backends import python
>>> from itertools import imap
>>> from ijson.backends import yajl2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/machine/local/lib/python2.7/site-packages/ijson/backends/yajl2.py", line 14, in <module>
    yajl = backends.find_yajl(2)
  File "/home/machine/local/lib/python2.7/site-packages/ijson/backends/__init__.py", line 14, in find_yajl
    raise YAJLImportError('YAJL version %s.x required, found %s.%s.%s' % (required, major, minor, micro))
ijson.backends.YAJLImportError: YAJL version 2.x required, found 1.0.12

Ruh roh.

$ pip install yajl

was fine, but

$ pip install yajl2

doesn't seem to exist. What's the pip package name that would update this?

Owner

isagalaev commented Sep 7, 2013

It seems that you can import yajl just fine, so go ahead and use it. This is exactly why ijson has backends for both 1.x and 2.x versions.

Owner

isagalaev commented Sep 7, 2013

btw, yajl is a C library, having nothing to do with Python, so it shouldn't be installable by pip at all. If pip install yajl works this may be some other Python wrapper, not the library itself.

Author

mittenchops commented Sep 9, 2013

Cool, that's super helpful. So, I've tried to implement your correct solution above (again for the particular problem of shapefiles for geomap applications---it's geojson format), and after I got all the dependencies straight, I still seem to be doing something wrong, maybe missing a flag somewhere. It does not seem to be iterating---I'm pretty sure it's rather dumping everything at once.

import json, pymongo, shapefile, sys, ijson
from ijson import common
from ijson.backends import yajl

def floaten(event):
    if event[1] == 'number':
        return (event[0], event[1], float(event[2]))
    else:
        return event

# This does not actually work for some reason.
def loader(shapefile, collection):
    with open(shapefile) as jf:
        events = imap(floaten, yajl.parse(jf))
        i = 0
        for d in common.items(events, 'features'):
            print('Loading record {}'.format(str(i)))
            db[collection].insert(d)
            time.sleep(1)
            i = i + 1
            print("Upload of shapefile: {} into collection: {} successful".format(shapefile, collection))

I'm getting the right number of records, so I know it's working, but it's loading the entire set at once, and then only sleeping after. (I added the sleep timer just to verify whether it was iterating or loading all records at once.)

Is there something obviously silly I've misunderstood? Sorry to bother again.

Owner

isagalaev commented Sep 9, 2013

The code looks right from the point of view of ijson. I don't see you importing imap, but I think you just left it out of the snippet. My first hypothesis would be that the Mongo driver itself may buffer values on .insert(d) in memory instead of committing them to the database. The simplest way to check this is to comment out that line (and the sleep(1) too) and see if it still does that long pause at the end.

Author

mittenchops commented Sep 9, 2013

Yeah, weird. I found it worked when I added an additional level of loop and changed....

    with open(shapefile) as jf:
        events = imap(floaten, yajl.parse(jf))
        for d in common.items(events, 'features'):
            db[collection].insert(d)

to

    with open(shapefile) as jf:
        events = imap(floaten, yajl.parse(jf))
        for data in common.items(events, 'features'):
            for d in data:
                db[collection].insert(d)

It still seems to take some time to presumably do the imap() operation before it gets to the insert---I notice a delay before the data hits the DB. But this way I do see it iterating one at a time into the DB, and it no longer times out for large files. Apparently, the first way was actually creating an array in my DB, which was being implicitly inserted in a batch mode.
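
One likely reason the extra loop helped: the prefix 'features' names the array itself, while each element of that array lives under the prefix 'features.item'. The sketch below uses a hand-written approximation of the (prefix, event, value) tuples an ijson parser emits for a small made-up document. It does not import ijson, and the sample data is an assumption for illustration.

```python
from collections import Counter

# Hand-written events for {"features": [{"id": 1}, {"id": 2}]},
# approximating what an ijson parser would produce.
events = [
    ("", "start_map", None),
    ("", "map_key", "features"),
    ("features", "start_array", None),
    ("features.item", "start_map", None),
    ("features.item", "map_key", "id"),
    ("features.item.id", "number", 1),
    ("features.item", "end_map", None),
    ("features.item", "start_map", None),
    ("features.item", "map_key", "id"),
    ("features.item.id", "number", 2),
    ("features.item", "end_map", None),
    ("features", "end_array", None),
    ("", "end_map", None),
]

# Count how many containers open at each prefix.
starts = Counter(
    prefix for prefix, event, _ in events
    if event in ("start_map", "start_array")
)
print(starts["features"])       # 1 -> the whole array is a single item
print(starts["features.item"])  # 2 -> one item per feature
```

So asking for items at 'features' builds the entire array in memory before yielding it once, whereas asking for 'features.item' should yield one feature at a time without the inner loop.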

Owner

isagalaev commented Sep 10, 2013

It can't be imap. The whole point of it (as opposed to map) is that it doesn't iterate over the source itself; instead, it creates another iterator that, when iterated, takes source values one by one and applies a mapping function (floaten here) on each iteration.
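
The lazy behaviour described here can be checked with a small sketch. It uses Python 3's built-in map, which behaves like Python 2's itertools.imap; the `source` and `double` functions are made up for the demonstration.

```python
pulled = []  # records the order in which things happen


def source():
    """Stand-in for a parser: note each value as it is read."""
    for n in range(3):
        pulled.append("read %d" % n)
        yield n


def double(n):
    return n * 2


mapped = map(double, source())  # nothing has been read from source yet
print(pulled)  # []

for value in mapped:
    pulled.append("used %d" % value)

print(pulled)
# ['read 0', 'used 0', 'read 1', 'used 2', 'read 2', 'used 4']
```

Reads and uses alternate, so the mapping step never holds more than one value at a time.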

In your case the inner loop might make things better because it inserts more frequently and it may cause the Mongo driver to commit its buffer also more frequently. This is just a guess, though.

(I believe I can safely close the ticket as it is not something that should be fixed in ijson.)

isagalaev closed this Sep 10, 2013

kimballfrank commented Jan 10, 2014

I have a problem with this issue being closed. We cannot always control the backend API we are communicating with. In some cases, we can't "fix" this issue at the backend layer. I also think that the proposed solution here using imap just seems counterintuitive.

Would you consider making the Decimal swapping feature of ijson optional?

Owner

isagalaev commented Jan 10, 2014

This is not some separate swapping feature: I just had to choose one of the two standard Python ways of representing numbers. I chose Decimal because it works for all numbers, while IEEE float has well-known problems with precision. So, no, I don't like the idea of switching it to float by default. Neither do I like the idea of making it configurable, as it would complicate the code for the sole purpose of making it work in corner cases where the consumer doesn't understand decimals (which is, frankly, a bug, as decimal has been in the standard library since Python 2.4).
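
A short illustration of the precision point, using only the standard library. The sample values are arbitrary and chosen for the demonstration.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1, 0.2 or 0.3 exactly.
print(0.1 + 0.2 == 0.3)  # False

# Decimal keeps the digits exactly as they were written.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# A number with more digits than a float can hold is silently rounded.
precise = Decimal("1.00000000000000000001")
print(float(precise) == 1.0)    # True: the trailing digits are lost
print(precise == Decimal("1"))  # False: Decimal still tells them apart
```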

I wasn't proposing my workaround code as some sort of a ready-made solution required to use ijson. I was merely demonstrating the idea that when you have two libraries with incompatible data format you could write a glue layer that would marry them. You can use any code style that you deem appropriate.

kimballfrank commented Jan 10, 2014

Fair enough. :) I just forked the code and swapped the Decimal for Float. For my use case, this is working flawlessly. Thanks for a good library.

Owner

isagalaev commented Jan 12, 2014

That's certainly another way to do it :-)

asavpatel92 commented Apr 1, 2015

@mittenchops I'm currently having the same problem. I am new to ijson; can you guide me on where I need to make changes? My db only supports int values, so I will probably have to convert all values to strings and then insert them into the db.

SalomonSmeke commented Sep 24, 2018

I wanted to add what landed on the codebase I'm working on so that this isn't an arduous read for anyone in the future:

from decimal import Decimal

def floaten(iter_ijson_parser):
    """Ensure numeric events are float and not Decimal by filtering an ijson event stream.

    Args:
        iter_ijson_parser: ijson event stream.

    Returns:
        generator: ijson event stream.
    """
    # Pass every event through unchanged, except Decimal numbers.
    for prefix, event, value in iter_ijson_parser:
        if event == 'number' and isinstance(value, Decimal):
            value = float(value)
        yield prefix, event, value

Usage:

    for prefix, event, value in floaten(ijson.parse(<str>)):
        # do stuff
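
The same filtering idea can be tried without ijson installed by feeding it hand-written events. The sketch below is a variation, not code from the thread: the name `convert_numbers` and its `convert` parameter are hypothetical, added so that the target type (float, str, and so on) can be chosen by the caller. The sample events approximate what a parser would emit for the document `[1.5, 7]`.

```python
from decimal import Decimal


def convert_numbers(events, convert=float):
    """Yield (prefix, event, value) tuples, passing Decimal numbers through convert.

    Hypothetical helper; integers and all non-number events are left untouched.
    """
    for prefix, event, value in events:
        if event == "number" and isinstance(value, Decimal):
            value = convert(value)
        yield prefix, event, value


# Hand-written approximation of the events for the JSON document [1.5, 7].
sample = [
    ("", "start_array", None),
    ("item", "number", Decimal("1.5")),
    ("item", "number", 7),
    ("", "end_array", None),
]

as_floats = [v for _, e, v in convert_numbers(sample) if e == "number"]
as_strings = [v for _, e, v in convert_numbers(sample, convert=str) if e == "number"]
print(as_floats)   # [1.5, 7]
print(as_strings)  # ['1.5', 7]
```

Because it is a generator, the filter keeps the streaming property: each event is converted only when the consumer asks for it.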