This repository has been archived by the owner on Jan 15, 2020. It is now read-only.

Iterating over a collection!?!? #62

Closed
ghost opened this issue Mar 26, 2017 · 24 comments

Comments

@ghost

ghost commented Mar 26, 2017

Hello, I have a 12 GB JSON file that I am trying to iterate over.

The format is in a collection style as follows:
{
"key1": {...},
"key2": {...},
"key3": {...},
...
}

I cannot for the life of me figure it out. Essentially, I want to take the root-level keys one at a time and handle them individually.

Any help would greatly be appreciated!

Thanks!

Connor

@rtobar
Contributor

rtobar commented Mar 27, 2017

Is that your top-level structure (and thus you have only one big object in your 12 GB JSON file)? ijson.items could only be used at the top level here, since only one prefix can be given, and it would have to be ''. In the end you would get the same big, top-level object you would get via json.load.

What you will probably need to do is iterate over ijson.parse to manually receive the individual parsing events (map_key, start_array, etc.) and react to them. For instance, the following document:

{
        "key1": [0],
        "key2": [1],
        "key3": [2]
}

produces this series of events:

('', 'start_map', None)
('', 'map_key', 'key1')
('key1', 'start_array', None)
('key1.item', 'number', 0)
('key1', 'end_array', None)
('', 'map_key', 'key2')
('key2', 'start_array', None)
('key2.item', 'number', 1)
('key2', 'end_array', None)
('', 'map_key', 'key3')
('key3', 'start_array', None)
('key3.item', 'number', 2)
('key3', 'end_array', None)
('', 'end_map', None)
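One way to react to that stream is to group events by root-level key. This is a minimal sketch, not part of ijson itself: group_by_root_key is a hypothetical helper that consumes any iterable of (prefix, event, value) triples, such as the one ijson.parse yields; here it is fed the hand-written event list from above.

```python
def group_by_root_key(events):
    """Group (prefix, event, value) triples by root-level map key.

    `events` is any iterable of parse events, e.g. the output of
    ijson.parse(f). Yields (key, events_for_that_key) pairs.
    Note: startswith can over-match keys that share a prefix
    (e.g. 'key1' vs 'key10'), so real keys may need a stricter check.
    """
    key, chunk = None, []
    for prefix, event, value in events:
        if prefix == '' and event == 'map_key':
            key, chunk = value, []
        elif key is not None and prefix.startswith(key):
            chunk.append((prefix, event, value))
            if prefix == key and event in ('end_array', 'end_map'):
                yield key, chunk

# Replaying the hand-written event stream from above:
events = [
    ('', 'start_map', None),
    ('', 'map_key', 'key1'),
    ('key1', 'start_array', None),
    ('key1.item', 'number', 0),
    ('key1', 'end_array', None),
    ('', 'map_key', 'key2'),
    ('key2', 'start_array', None),
    ('key2.item', 'number', 1),
    ('key2', 'end_array', None),
    ('', 'map_key', 'key3'),
    ('key3', 'start_array', None),
    ('key3.item', 'number', 2),
    ('key3', 'end_array', None),
    ('', 'end_map', None),
]
for key, chunk in group_by_root_key(events):
    print(key, len(chunk))  # each key carries 3 events here
```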

@ghost
Author

ghost commented Mar 27, 2017

So with my experimentation, I have this small chunk of code.

    match_keys = open("match_data_keys", "r")
    match_data = open("match_data", "r")

    for match in match_keys:
        json_obj = ijson.items(match_data, match.strip())
        print(json_obj)

Each iteration of this for loop returns a generator object for each sub-JSON object.

Below I am providing a better idea of my structure.

{
  "111": {
    "name": "John",
    "gender": "M"
  },
  "222": {
    "name": "Alex",
    "gender": "F"
  },
  "333": {
    "name": "Nick",
    "gender": "M"
  }
}

The block of code I mentioned above seems to return a generator for each key in the JSON object.

A generator for "111", "222", and "333"

Is this of any use?

Thanks so much for getting back to me so quickly!
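For reference, following the event pattern rtobar showed above, the structure in this comment would produce a parse stream like the hand-derived list below (worth double-checking against ijson.parse on the real file):

```python
# Hand-derived (prefix, event, value) stream for the {"111": {...}, ...}
# document above; ijson.parse on that file should emit this sequence.
events = [
    ('', 'start_map', None),
    ('', 'map_key', '111'),
    ('111', 'start_map', None),
    ('111', 'map_key', 'name'),
    ('111.name', 'string', 'John'),
    ('111', 'map_key', 'gender'),
    ('111.gender', 'string', 'M'),
    ('111', 'end_map', None),
    ('', 'map_key', '222'),
    ('222', 'start_map', None),
    ('222', 'map_key', 'name'),
    ('222.name', 'string', 'Alex'),
    ('222', 'map_key', 'gender'),
    ('222.gender', 'string', 'F'),
    ('222', 'end_map', None),
    ('', 'map_key', '333'),
    ('333', 'start_map', None),
    ('333', 'map_key', 'name'),
    ('333.name', 'string', 'Nick'),
    ('333', 'map_key', 'gender'),
    ('333.gender', 'string', 'M'),
    ('333', 'end_map', None),
    ('', 'end_map', None),
]

# The root-level keys are exactly the map_key events with an empty prefix:
root_keys = [v for p, e, v in events if p == '' and e == 'map_key']
print(root_keys)  # ['111', '222', '333']
```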

@rtobar
Contributor

rtobar commented Mar 28, 2017

In your experimental code you are iterating over match_data for every key in match_keys, which is not optimal. I still think the best way is to go with ijson.parse instead of ijson.items; that way you iterate only once over your data.

I imagine it similar to this:

with open('keys', 'r') as k:
    keys = set(line.strip() for line in k)

with open('data', 'rb') as data:
    for prefix, event, value in ijson.parse(data):
        if prefix in keys and event == 'start_map':
            pass  # start constructing your object
        elif prefix in keys and event == 'end_map':
            pass  # finish constructing your object, etc.

Now, if you could actually change the data itself, it would simply be easier to organize it as an array instead of an object with many attributes, making it look like this instead:

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]

In that case you can simply:

with open('data', 'rb') as data:
    for obj in ijson.items(data, 'item'):
        print("%s is %s" % (obj['name'], obj['gender']))

@isagalaev
Owner

isagalaev commented Mar 29, 2017

import ijson
from ijson.common import ObjectBuilder


def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value


with open('example.json', 'rb') as f:
    for key, value in objects(f):
        print(key, value)

@lin1000

lin1000 commented Sep 26, 2017

Hi @isagalaev, thank you for your sample, it's quite helpful. Here is an enhanced version of it that avoids duplicate object creation for nested data objects in the JSON.

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if prefix == key+".item" and event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value  

@nicholastulach

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]

In that case you can simply:

with open('data', 'rb') as data:
    for obj in ijson.items(data, 'item'):
        print("%s is %s" % (obj['name'], obj['gender']))

I have this exact case, but what would the second argument to items be? It makes no sense that it is 'item', since that isn't defined anywhere. How do you use ijson to loop through a JSON file where the top-level element is an array like in the sample quoted above?

@isagalaev
Owner

isagalaev commented Sep 29, 2017

@nicholastulach item is a special "keyword" denoting an element of an array. So yes, for a top-level array the query in the second argument would be simply 'item'.

@lin1000

lin1000 commented Sep 30, 2017

@nicholastulach it will be easier to understand if you print out the values of prefix, event, and value. The "item" will appear in the prefix part.

with open(filename, 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        print((prefix, event, value))

ijson.items(data, 'item') is just a wrapper around ijson.parse to help you loop through the JSON file. So the "item" in the second argument is just that prefix.
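To illustrate that wrapper relationship, here is a toy re-implementation (not ijson's actual code) of items-style assembly for a top-level array of flat objects, driven by a hand-written event stream:

```python
def items_from_events(events, prefix='item'):
    """Toy version of what ijson.items does on top of the parse events:
    assemble each flat object found under `prefix` into a dict."""
    obj, key = None, None
    for pfx, event, value in events:
        if pfx == prefix and event == 'start_map':
            obj = {}
        elif pfx == prefix and event == 'end_map':
            yield obj
            obj = None
        elif obj is not None and event == 'map_key':
            key = value
        elif obj is not None and pfx == prefix + '.' + key:
            obj[key] = value

# Event stream for [{"name": "John", "gender": "M"}], hand-written:
events = [
    ('', 'start_array', None),
    ('item', 'start_map', None),
    ('item', 'map_key', 'name'),
    ('item.name', 'string', 'John'),
    ('item', 'map_key', 'gender'),
    ('item.gender', 'string', 'M'),
    ('item', 'end_map', None),
    ('', 'end_array', None),
]
print(list(items_from_events(events)))  # [{'name': 'John', 'gender': 'M'}]
```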

@prathameshpatel

I am working on JSON files gigabytes in size and was comparing ijson to the regular json module. What is the performance difference between:

  1. getting objects through ijson.items() iteration
  2. getting individual events through ijson.parse() iteration

BTW, my dataset is primarily a list with an empty top-level prefix, and every datapoint has almost the same map keys.
In my case the runtime for object extraction into a big list is (using timeit):

  1. ijson = 343.9375227150062 sec
  2. ijson yajl2 backend = 224.40192600000046 sec
  3. ijson yajl2_cffi backend = 206.8481009999996 sec

My primary question is about how to improve time efficiency.

@rtobar
Contributor

rtobar commented Apr 16, 2018

Hi @prathameshpatel,

The performance of the different methods should be fairly similar, since one only adds some abstractions on top of the other. I would recommend going with ijson.items unless you have a good reason to analyze the underlying events via ijson.parse or ijson.basic_parse.

A more important difference, as you have found out, is the backend. The fastest backend in ijson at the moment is cffi. However, there is a pull request to add a C backend, which should improve times up to ~10x compared to the cffi backend (or until reading data from the disk becomes a bottleneck). I think in your case it would definitely be worthwhile to try out this branch. The code should eventually be merged into the master branch of this repository, but for the time being, until the pull request is merged, you will need to take the code from the cext branch on my fork of the repository.

@prathameshpatel

Hi @rtobar ,

Thanks for the reply, it helps clear some things up. Can you help me use your additional C backend files in my original ijson installation?
I have tracked down the location of the ijson installation on my system. Should I just add the two additional backend files, _yajl2.c and yajl2_c.py, to installation directory/ijson/backends/?
Or is there some additional process I need to follow? And what about the usage?

P.S.: Sorry for these persistent inquiries. I am new to Python and data analysis, and I can't find answers to these questions anywhere. Any help would be great.
Thanks.

@rtobar
Contributor

rtobar commented Apr 17, 2018

Hi @prathameshpatel,

You should first get a copy of the code in my repository's cext branch (e.g., git clone -b cext https://github.com/rtobar/ijson.git). Then you need to install it via the normal installation procedure (e.g., python setup.py install or pip install .). The setup.py script should try to find your yajl2 installation and compile the extension. If yajl2 is not found, try using the CFLAGS and LDFLAGS environment variables to pass down the appropriate compilation and linking flags.

The new backend is called yajl2_c and it is imported like the rest of the backends:

import ijson.backends.yajl2_c as ijson

Then use the ijson module just as you would with the other backends.

@soyf

soyf commented Sep 6, 2019

Hey guys, is it possible to read into memory in chunks using ijson?

For example I am using:
for json_item in ijson.items(json_file, 'item'):

And while it does work and avoids the 'out of memory' error, it's really slow. I am converting a 100+ GB JSON file to CSV. My JSON is a list with an empty top-level prefix, similar to the example below. So rather than loading one element at a time into memory, is there an option to load perhaps 1000 at a time?

Thank you!

[
  {
    "name": "John",
    "gender": "M"
  },
  {
    "name": "Alex",
    "gender": "F"
  },
  {
    "name": "Nick",
    "gender": "M"
  }
]
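Grouping an item iterator into fixed-size batches is straightforward to do on top of any iterator, ijson's included. A generic sketch (batched and write_rows are hypothetical names, and batching alone does not make the underlying parsing any faster):

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# e.g. to process 1000 objects at a time:
# for batch in batched(ijson.items(json_file, 'item'), 1000):
#     write_rows(batch)   # hypothetical CSV-writing helper
print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```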

@rtobar
Contributor

rtobar commented Sep 6, 2019

Did you try the C backend? I'm assuming you are using the default backend that is written in pure Python.

@soyf

soyf commented Sep 6, 2019

Hi @rtobar,

I haven't tried it yet. Is there any installation documentation you can refer me to? This would be installed on Red Hat Fedora.

@rtobar
Contributor

rtobar commented Sep 8, 2019

The latest version on PyPI includes it, so it's just a matter of installing that. Instructions on how to select it are at https://pypi.org/project/ijson/

@isagalaev
Owner

So rather than loading one element at a time into memory, is there an option to load perhaps 1000 at a time?

This is not how the parser works, though. It loads a fixed number of bytes (16K) into a buffer and parses out however many JSON items you're requesting from it, one at a time. That's not why it's slow, though, since the data is already in memory. It's slow because it's pure Python; as @rtobar says, you should simply import the C backend explicitly, which has the exact same API but is way, way faster.
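The fixed-size buffering described here can be pictured generically like this (a sketch of reading a file in 16K chunks, not ijson's actual internals; the 16K figure comes from the comment above):

```python
import io

def chunks(f, size=16 * 1024):
    """Yield successive fixed-size byte chunks from a file-like object."""
    while True:
        buf = f.read(size)
        if not buf:
            return
        yield buf

# 40000 bytes come out as two full 16K buffers plus a 7232-byte tail:
sizes = [len(c) for c in chunks(io.BytesIO(b'x' * 40000))]
print(sizes)  # [16384, 16384, 7232]
```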

@soyf

soyf commented Sep 9, 2019

Thanks so much @rtobar and @isagalaev for your help with this! We installed the C backend but are now facing an error.

I changed my import to be:
import ijson.backends.yajl2_c as ijson

And now getting this error:
for json_item in ijson.items(json_file, 'item'):
TypeError: expected bytes, str found

I thought maybe it was as simple as casting to bytes, and tried the following, but now I'm getting another error:
for json_item in str.encode(ijson.items(json_file, 'item')):
TypeError: descriptor 'encode' requires a 'str' object but received a '_yajl2.items'

Thanks again guys. I feel like we are getting close! (I hope)

@rtobar
Contributor

rtobar commented Sep 10, 2019

@soyf you'll need to open your file in 'rb' mode, not only 'r', which seems to be what's happening.

@soyf

soyf commented Sep 10, 2019

Thank you @rtobar!! Got it working!

It is quite a bit faster. Before, it took 10 min to write 10k records. Now it took only 3 minutes to do the same.

Many thanks!!

@soyf

soyf commented Oct 4, 2019


Hi @rtobar and @isagalaev. I wanted to give you another update. Before, when I said it took 10 min to write 10k records, I was using a Pandas DataFrame in my process. I now realize the DataFrame was a lot of overhead. After removing it, this process is now flying! It converted roughly 9 million rows to CSV in 35 minutes.

I have one additional question, is there a way to stream from and write directly to S3 bucket?

@rtobar
Contributor

rtobar commented Oct 4, 2019

Great to hear the good results @soyf, that's very encouraging :)

Regarding your other question: reading from and writing to S3 is something you'd have to do yourself. ijson only requires a file-like object; where that object reads its data from, and what you do with the parsed results, is up to you.

For future reference, please also be aware that ijson is now maintained under ICRAR/ijson, so posts here might not get too much attention.

@isagalaev
Owner

For future reference, please also be aware that ijson is now maintained under ICRAR/ijson, so posts here might not get too much attention.

Good point. I'm closing the issue here. Also, @soyf, I'd say ask on StackOverflow for streaming data in/out of S3. I'm pretty sure it should be possible to fish an open socket out of boto3 internals, but I don't have an answer right away.

@soyf

soyf commented Oct 4, 2019

Thanks guys
