This repository has been archived by the owner on Jan 15, 2020. It is now read-only.

Iterating over a collection!?!? #62

Closed
ghost opened this issue Mar 26, 2017 · 24 comments

Comments

@ghost

ghost commented Mar 26, 2017

Hello, I have a 12 GB JSON file that I am trying to iterate over.

The format is in a collection style as follows:
{
"key1": {...},
"key2": {...},
"key3": {...},
...
}

I cannot for the life of me figure it out. Essentially, I want to take the root-level keys one at a time and handle them individually.

Any help would greatly be appreciated!

Thanks!

Connor

@rtobar
Contributor

rtobar commented Mar 27, 2017

Is that your top-level structure (and thus you have only one big object in your 12 GB JSON file)? ijson.items could only be used at the top level here, since only one prefix can be given, and it would have to be ''. In the end you would get the same big, top-level object you would get via json.load.

What you will probably need to do is iterate over ijson.parse to manually receive the individual parsing events (map_key, start_array, etc.) and react to them. For instance, the following document:

{
        "key1": [0],
        "key2": [1],
        "key3": [2]
}

produces this series of events:

('', 'start_map', None)
('', 'map_key', 'key1')
('key1', 'start_array', None)
('key1.item', 'number', 0)
('key1', 'end_array', None)
('', 'map_key', 'key2')
('key2', 'start_array', None)
('key2.item', 'number', 1)
('key2', 'end_array', None)
('', 'map_key', 'key3')
('key3', 'start_array', None)
('key3.item', 'number', 2)
('key3', 'end_array', None)
('', 'end_map', None)
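One way to react to that stream is to group events by root-level key. This is a minimal sketch, not part of ijson itself: group_by_root_key is a hypothetical helper that consumes any iterable of (prefix, event, value) triples, such as the one ijson.parse yields; here it is fed the hand-written event list from above.

```python
def group_by_root_key(events):
    """Group (prefix, event, value) triples by root-level map key.

    `events` is any iterable of parse events, e.g. the output of
    ijson.parse(f). Yields (key, events_for_that_key) pairs.
    Note: startswith can over-match keys that share a prefix
    (e.g. 'key1' vs 'key10'), so real keys may need a stricter check.
    """
    key, chunk = None, []
    for prefix, event, value in events:
        if prefix == '' and event == 'map_key':
            key, chunk = value, []
        elif key is not None and prefix.startswith(key):
            chunk.append((prefix, event, value))
            if prefix == key and event in ('end_array', 'end_map'):
                yield key, chunk

# Replaying the hand-written event stream from above:
events = [
    ('', 'start_map', None),
    ('', 'map_key', 'key1'),
    ('key1', 'start_array', None),
    ('key1.item', 'number', 0),
    ('key1', 'end_array', None),
    ('', 'map_key', 'key2'),
    ('key2', 'start_array', None),
    ('key2.item', 'number', 1),
    ('key2', 'end_array', None),
    ('', 'map_key', 'key3'),
    ('key3', 'start_array', None),
    ('key3.item', 'number', 2),
    ('key3', 'end_array', None),
    ('', 'end_map', None),
]
for key, chunk in group_by_root_key(events):
    print(key, len(chunk))  # each key carries 3 events here
```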

@ghost
Author

ghost commented Mar 27, 2017

So with my experimentation, I have this small chunk of code.

    match_keys = open("match_data_keys", "r")
    match_data = open("match_data", "r")

    for match in match_keys:
        json_obj = ijson.items(match_data, match.strip())
        print(json_obj)

Each iteration of this for loop returns a generator object for each sub-JSON object.

Below I am providing a better idea of my structure.

{
  "111": {
    "name": "John",
    "gender": "M"
  },
  "222": {
    "name": "Alex",
    "gender": "F"
  },
  "333": {
    "name": "Nick",
    "gender": "M"
  }
}

The block of code I mentioned above seems to return a generator for each key in the JSON object.

A generator for "111", "222", and "333"

Is this of any use?

Thanks so much for getting back to me so quickly!
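For reference, following the event pattern rtobar showed above, the structure in this comment would produce a parse stream like the hand-derived list below (worth double-checking against ijson.parse on the real file):

```python
# Hand-derived (prefix, event, value) stream for the {"111": {...}, ...}
# document above; ijson.parse on that file should emit this sequence.
events = [
    ('', 'start_map', None),
    ('', 'map_key', '111'),
    ('111', 'start_map', None),
    ('111', 'map_key', 'name'),
    ('111.name', 'string', 'John'),
    ('111', 'map_key', 'gender'),
    ('111.gender', 'string', 'M'),
    ('111', 'end_map', None),
    ('', 'map_key', '222'),
    ('222', 'start_map', None),
    ('222', 'map_key', 'name'),
    ('222.name', 'string', 'Alex'),
    ('222', 'map_key', 'gender'),
    ('222.gender', 'string', 'F'),
    ('222', 'end_map', None),
    ('', 'map_key', '333'),
    ('333', 'start_map', None),
    ('333', 'map_key', 'name'),
    ('333.name', 'string', 'Nick'),
    ('333', 'map_key', 'gender'),
    ('333.gender', 'string', 'M'),
    ('333', 'end_map', None),
    ('', 'end_map', None),
]

# The root-level keys are exactly the map_key events with an empty prefix:
root_keys = [v for p, e, v in events if p == '' and e == 'map_key']
print(root_keys)  # ['111', '222', '333']
```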

@rtobar
Contributor

rtobar commented Mar 28, 2017

In your experimental code you are iterating over match_data for every key in match_keys, which is not optimal. I still think the best way is to go with ijson.parse instead of ijson.items; that way you iterate only once over your data.

I imagine it similar to this:

with open('keys', 'r') as k:
    keys = set(line.strip() for line in k)

with open('data', 'rb') as data:
    for prefix, event, value in ijson.parse(data):
        if prefix in keys and event == 'start_map':
            pass  # start constructing your object
        elif prefix in keys and event == 'end_map':
            pass  # finish constructing your object, etc.

Now, if you could actually change the data itself, it would simply be easier to organize it as an array instead of an object with many attributes, making it look like this instead:

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]

In that case you can simply:

with open('data', 'rb') as data:
    for obj in ijson.items(data, 'item'):
        print("%s is %s" % (obj['name'], obj['gender']))

@isagalaev
Owner

isagalaev commented Mar 29, 2017

import ijson
from ijson.common import ObjectBuilder


def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value


with open('example.json', 'rb') as f:
    for key, value in objects(f):
        print(key, value)

@lin1000

lin1000 commented Sep 26, 2017

Hi @isagalaev, thank you for your sample, it's quite helpful. Here is an enhanced version of it that avoids duplicate object creation for nested data objects in the JSON.

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if prefix == key+".item" and event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value  

@nicholastulach

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]

In that case you can simply:

with open('data', 'rb') as data:
    for obj in ijson.items(data, 'item'):
        print("%s is %s" % (obj['name'], obj['gender']))

I have this exact case, but what would the second argument to items be? It makes no sense that it is 'item', since that isn't defined anywhere. How do you use ijson to loop through a JSON file where the top-level element is an array like in the sample quoted above?

@isagalaev
Owner

isagalaev commented Sep 29, 2017

@nicholastulach item is a special "keyword" denoting an element of an array. So yes, for a top-level array the query in the second argument would be simply 'item'.

@lin1000

lin1000 commented Sep 30, 2017

@nicholastulach it will be easier to understand if you print out the values of prefix, event, and value. The "item" will appear in the prefix part.

with open(filename, 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        print((prefix, event, value))

ijson.items(data, 'item') is just a wrapper around ijson.parse to help you loop through the JSON file. So the "item" in the second argument is just that prefix.
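To illustrate that wrapper relationship, here is a toy re-implementation (not ijson's actual code) of items-style assembly for a top-level array of flat objects, driven by a hand-written event stream:

```python
def items_from_events(events, prefix='item'):
    """Toy version of what ijson.items does on top of the parse events:
    assemble each flat object found under `prefix` into a dict."""
    obj, key = None, None
    for pfx, event, value in events:
        if pfx == prefix and event == 'start_map':
            obj = {}
        elif pfx == prefix and event == 'end_map':
            yield obj
            obj = None
        elif obj is not None and event == 'map_key':
            key = value
        elif obj is not None and pfx == prefix + '.' + key:
            obj[key] = value

# Event stream for [{"name": "John", "gender": "M"}], hand-written:
events = [
    ('', 'start_array', None),
    ('item', 'start_map', None),
    ('item', 'map_key', 'name'),
    ('item.name', 'string', 'John'),
    ('item', 'map_key', 'gender'),
    ('item.gender', 'string', 'M'),
    ('item', 'end_map', None),
    ('', 'end_array', None),
]
print(list(items_from_events(events)))  # [{'name': 'John', 'gender': 'M'}]
```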

@prathameshpatel

I am working on JSON files gigabytes in size and was comparing ijson to the regular json module. What is the performance difference between:

  1. getting objects through ijson.items() iteration
  2. getting individual events through ijson.parse() iteration

BTW, my dataset is primarily a list with an empty top-level prefix, and every datapoint has almost the same map keys.
In my case the runtime for object extraction into a big list is (using timeit):

  1. ijson = 343.9375227150062 sec
  2. ijson yajl2 backend = 224.40192600000046 sec
  3. ijson yajl2_cffi backend = 206.8481009999996 sec

My primary question is about how to improve time efficiency.

@rtobar
Contributor

rtobar commented Apr 16, 2018

Hi @prathameshpatel,

The performance of the different methods should be fairly similar, since one only adds some abstractions on top of the other. I would recommend going with ijson.items unless you have a good reason to analyze the underlying events via ijson.parse or ijson.basic_parse.

A more important difference, as you have found out, is the backend. The fastest backend in ijson at the moment is cffi. However, there is a pull request to add a C backend, which should improve times up to ~10x compared to the cffi backend (or until reading data from the disk becomes a bottleneck). I think in your case it would definitely be worthwhile to try out this branch. The code should eventually be merged into the master branch of this repository, but for the time being, until the pull request is merged, you will need to take the code from the cext branch on my fork of the repository.

@prathameshpatel

Hi @rtobar ,

Thanks for the reply, it helps clear some things up. Can you help me use your additional C backend files in my original ijson installation?
I have tracked down the location of the ijson installation on my system. Should I just add the two additional backend files, _yajl2.c and yajl2_c.py, to installation directory/ijson/backends/?
Or is there some additional process I need to follow? And what about the usage?

P.S.: Sorry for these persistent inquiries. I am new to Python and data analysis, and I can't find answers to these questions anywhere. Any help would be great.
Thanks.

@rtobar
Contributor

rtobar commented Apr 17, 2018

Hi @prathameshpatel,

You should first get a copy of the code in my repository's cext branch (e.g., git clone -b cext https://github.com/rtobar/ijson.git). Then you need to install it via the normal installation procedure (e.g., python setup.py install or pip install .). The setup.py script should try to find your yajl2 installation and compile the extension. If yajl2 is not found, try using the CFLAGS and LDFLAGS environment variables to pass down the appropriate compilation and linking flags.

The new backend is called yajl2_c and it is imported like the rest of the backends:

import ijson.backends.yajl2_c as ijson

Then use the ijson module just as you would with the other backends.

@soyf

soyf commented Sep 6, 2019

Hey guys, is it possible to read into memory in chunks using ijson?

For example I am using:
for json_item in ijson.items(json_file, 'item'):

And while it does work and avoids the 'out of memory' error, it's really slow. I am converting a 100+ GB JSON file to CSV. My JSON is a list with an empty top-level prefix, similar to the example below. So rather than loading one element at a time into memory, is there an option to load perhaps 1000 at a time?

Thank you!

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]
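Grouping an item iterator into fixed-size batches is straightforward to do on top of any iterator, ijson's included. A generic sketch (batched and write_rows are hypothetical names, and batching alone does not make the underlying parsing any faster):

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# e.g. to process 1000 objects at a time:
# for batch in batched(ijson.items(json_file, 'item'), 1000):
#     write_rows(batch)   # hypothetical CSV-writing helper
print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```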

@rtobar
Contributor

rtobar commented Sep 6, 2019

Did you try the C backend? I'm assuming you are using the default backend that is written in pure Python.

@soyf

soyf commented Sep 6, 2019

Hi @rtobar,

I haven't tried it yet. Is there any installation documentation you can refer me to? This would be installed on Red Hat Fedora.

@rtobar
Contributor

rtobar commented Sep 8, 2019

The latest version on PyPI includes it, so it's just a matter of installing that. Instructions on how to select it are at https://pypi.org/project/ijson/

@isagalaev
Owner

So rather than loading one element at a time into memory, is there an option to load perhaps 1000 at a time?

This is not how the parser works, though. It loads a fixed number of bytes (16K) into a buffer and parses out however many JSON items you're requesting from it, one at a time. That's not why it's slow, though, since the data is already in memory. It's slow because it's pure Python; as @rtobar says, you should simply import the C backend explicitly, which has the exact same API but is way, way faster.
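The fixed-size buffering described here can be pictured generically like this (a sketch of reading a file in 16K chunks, not ijson's actual internals; the 16K figure comes from the comment above):

```python
import io

def chunks(f, size=16 * 1024):
    """Yield successive fixed-size byte chunks from a file-like object."""
    while True:
        buf = f.read(size)
        if not buf:
            return
        yield buf

# 40000 bytes come out as two full 16K buffers plus a 7232-byte tail:
sizes = [len(c) for c in chunks(io.BytesIO(b'x' * 40000))]
print(sizes)  # [16384, 16384, 7232]
```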

@soyf

soyf commented Sep 9, 2019

Thanks so much @rtobar and @isagalaev for your help with this! We installed the C backend but are now facing an error.

I changed my import to be:
import ijson.backends.yajl2_c as ijson

And now getting this error:
for json_item in ijson.items(json_file, 'item'):
TypeError: expected bytes, str found

I thought maybe it was as simple as casting to bytes, and tried the following, but now I'm getting another error:
for json_item in str.encode(ijson.items(json_file, 'item')):
TypeError: descriptor 'encode' requires a 'str' object but received a '_yajl2.items'

Thanks again guys. I feel like we are getting close! (I hope)

@rtobar
Contributor

rtobar commented Sep 10, 2019

@soyf you'll need to open your file in 'rb' mode, not only 'r', which seems to be what's happening.

@soyf

soyf commented Sep 10, 2019

Thank you @rtobar!! Got it working!

It is quite a bit faster. Before, it took 10 min to write 10k records. Now it took only 3 minutes to do the same.

Many thanks!!

@soyf

soyf commented Oct 4, 2019


Hi @rtobar and @isagalaev. I wanted to give you another update. Before, when I said it took 10 min to write 10k records, I was using a Pandas DataFrame in my process. I now realize the DataFrame was a lot of overhead. After removing it, this process is now flying! It converted roughly 9 million rows to CSV in 35 minutes.

I have one additional question, is there a way to stream from and write directly to S3 bucket?

@rtobar
Contributor

rtobar commented Oct 4, 2019

Great to hear the good results @soyf, that's very encouraging :)

Regarding your other question: reading from and writing to S3 is something you'd have to do yourself. ijson only requires a file-like object; where that object reads its data from, and what you do with the parsed results, is up to you.

For future reference, please also be aware that ijson is now maintained under ICRAR/ijson, so posts here might not get too much attention.

@isagalaev
Owner

For future reference, please also be aware that ijson is now maintained under ICRAR/ijson, so posts here might not get too much attention.

Good point. I'm closing the issue here. Also, @soyf, I'd say ask on StackOverflow for streaming data in/out of S3. I'm pretty sure it should be possible to fish an open socket out of boto3 internals, but I don't have an answer right away.

@soyf

soyf commented Oct 4, 2019

Thanks guys
