Iterating over a collection!?!? #62

Hello, I have a 12 GB JSON file that I am trying to iterate over. The format is a collection, as follows:

{
    "key1": {...},
    "key2": {...},
    "key3": {...},
    ...
}

I cannot for the life of me figure it out. Essentially, I want to process the root-level keys one at a time and handle each of them individually. Any help would be greatly appreciated!

Thanks!
Connor

Comments
Is that your highest structure level (and thus you have only one big object in your 12 GB JSON file)? What you will probably need to do is iterate over the low-level parsing events. For example, parsing this JSON:

{
    "key1": [0],
    "key2": [1],
    "key3": [2]
}

produces this series of events:

('', 'start_map', None)
('', 'map_key', 'key1')
('key1', 'start_array', None)
('key1.item', 'number', 0)
('key1', 'end_array', None)
('', 'map_key', 'key2')
('key2', 'start_array', None)
('key2.item', 'number', 1)
('key2', 'end_array', None)
('', 'map_key', 'key3')
('key3', 'start_array', None)
('key3.item', 'number', 2)
('key3', 'end_array', None)
('', 'end_map', None)
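For concreteness, a minimal sketch (assuming the JSON above is saved as example.json) that prints exactly this event series with ijson.parse:

import ijson

# Print every (prefix, event, value) tuple ijson emits for the document;
# the output matches the series of events listed above.
with open('example.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        print((prefix, event, value))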
So with my experimentation, I have this small chunk of code. Each iteration of the for loop returns a generator over one sub-JSON object. Below I am providing a better idea of my structure.
The block of code I mentioned above seems to return a generator for each key in the JSON object: a generator for "111", "222", and "333". Is this of any use? Thanks so much for getting back to me so quickly!
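The chunk of code itself is not shown in the thread; a hypothetical reconstruction of the pattern being described (the names match_keys and match_data are assumptions taken from the reply below) might look like this:

import ijson

# Hypothetical reconstruction of the experimental loop described above;
# it re-scans the whole data file once per key, which is why it is slow.
with open('keys') as k:
    match_keys = [line.strip() for line in k]

with open('data', 'rb') as match_data:
    for key in match_keys:
        match_data.seek(0)  # rewind the file for every key
        for obj in ijson.items(match_data, key):  # objects under this root key
            print(key, obj)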
In your experimental code you are iterating over match_data for every key in match_keys, which is not optimal. I still think the best way is to go with a single pass over the parse events. I imagine it similar to this:

import ijson

with open('keys') as k:
    keys = {line.strip() for line in k}  # one prefix per line

with open('data', 'rb') as data:
    for prefix, event, value in ijson.parse(data):
        if prefix in keys and event == 'start_map':
            pass  # start constructing your object
        elif prefix in keys and event == 'end_map':
            pass  # finish constructing your object, etc.

Now, if you could actually change the data itself, it would simply be easier if you organized it as an array instead of an object with many attributes, and made it look like this instead:

[
    {
        "name": "John",
        "gender": "M"
    },
    {
        "name": "Alex",
        "gender": "F"
    },
    {
        "name": "Nick",
        "gender": "M"
    }
]

In that case you can simply:

with open('data', 'rb') as data:
    for obj in ijson.items(data, 'item'):
        print("%s is %s" % (obj['name'], obj['gender']))
import ijson
from ijson.common import ObjectBuilder

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value

for key, value in objects(open('example.json', 'rb')):
    print(key, value)
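A quick self-contained check of the generator above (the sample data is an assumption, shaped like the structure from the original question):

import io

# Feed a tiny in-memory document through objects() and print the results.
sample = io.BytesIO(b'{"key1": {"a": 1}, "key2": {"b": 2}}')
for key, value in objects(sample):
    print(key, value)
# key1 {'a': 1}
# key2 {'b': 2}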
Hi @isagalaev, thank you for your sample; it's quite helpful. Here is an enhanced version of it that avoids duplicate object creation for nested objects in the JSON, by only yielding when an array item under the current key ends rather than on every nested end_map:

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if prefix == key + '.item' and event == 'end_map':  # end of an array item at the current key, yield
                yield key, builder.value
I have this exact case, but what would the second argument to ijson.items be in that situation?
@nicholastulach it will be easier to understand if you try to print out the values of prefix, event, and value: the "item" will appear in the prefix part.
ijson.items(data, 'item') is just a wrapper around ijson.parse to help you loop through the JSON file, so the "item" argument is the prefix of the events you want collected into objects.
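For instance (a small assumed example), every element of a top-level JSON array produces events whose prefix is 'item', which is why 'item' is the prefix to pass:

import io
import ijson

# Print the raw events for a top-level array; note the 'item' prefixes.
data = io.BytesIO(b'[{"name": "John"}, {"name": "Alex"}]')
for prefix, event, value in ijson.parse(data):
    print((prefix, event, value))
# ('', 'start_array', None)
# ('item', 'start_map', None)
# ('item', 'map_key', 'name')
# ('item.name', 'string', 'John')
# ('item', 'end_map', None)
# ... the same again for the second element, then ('', 'end_array', None)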
I am working on JSON files with sizes in the gigabytes and was researching ijson as compared to the regular json module. What is the performance difference between the different ijson methods (e.g., ijson.items versus building objects from ijson.parse events)?
BTW, my dataset is primarily a list with an empty top-level prefix, and every datapoint has almost the same map keys.
My primary question is in regard to my attempts to improve time efficiency.
Hi @prathameshpatel, the performance difference between the different methods should be fairly similar, with one only adding some abstractions on top of the other; I would recommend going for whichever makes your code clearest. A more important difference, as you have found out, is the backend. The fastest backend in ijson is the C-based one (yajl2_c).
Hi @rtobar, thanks for the reply, it helps clear some things up. Can you help me use your additional C backend files in my original ijson installation?
P.S.: Sorry for these persistent inquiries. I am new to Python and data analysis, and I can't find answers to these questions anywhere. Any help would be great.
Hi @prathameshpatel, you should first get a copy of the code from my repository and install ijson from there. The new backend is called yajl2_c.
Then use that backend by importing it explicitly instead of the plain ijson package.
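A sketch of that explicit import (the module path is the one ijson uses for its C backend; the try/except fallback is an added convenience, not part of the original instructions):

# Explicitly select the C backend; fall back to the default backend
# if the C extension is not available on this machine.
try:
    import ijson.backends.yajl2_c as ijson
except ImportError:
    import ijson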
Hey guys, is it possible to read into memory in chunks using ijson? For example, I am currently pulling one element at a time with ijson.items. While that works and avoids the out-of-memory error, it's really slow. I am converting a 100+ GB JSON file to CSV; my JSON is a list with an empty top-level prefix, i.e. a top-level array of objects. So rather than loading one element at a time into memory, is there an option to load perhaps 1000 at a time? Thank you!
Did you try the C backend? I'm assuming you are using the default backend that is written in pure Python. |
Hi @rtobar, I haven't tried it yet. Is there any installation documentation you can refer me to? This would be installed on Red Hat Fedora.
The latest version on PyPI includes it, so it's just a matter of installing that. Instructions on how to select it are at https://pypi.org/project/ijson/
This is not how the parser works, though. It loads a fixed amount of bytes (16K) into a buffer and parses out however many JSON items you request from it, one at a time. It's not slow because of that, as all the data is already in memory; it is slow because it's pure Python. As @rtobar says, you should simply import the C backend explicitly, which has the exact same API but is way, way faster.
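As a side note, recent ijson releases document a buf_size option that tunes the size of that read buffer (treat its availability in older versions as an assumption):

import ijson

# buf_size controls how many bytes are read from the file per chunk;
# items are still yielded one at a time regardless of the buffer size.
with open('data.json', 'rb') as f:
    for obj in ijson.items(f, 'item', buf_size=1024 * 1024):  # 1 MiB reads
        pass  # process obj here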
Thanks so much @rtobar and @isagalaev for your help with this! We installed the C backend but are now facing an error. I changed my import to use the C backend, and now I'm getting an error. I thought maybe it was as simple as having to cast to bytes, and tried that, but now I'm getting another error. Thanks again guys. I feel like we are getting close! (I hope)
@soyf you'll need to open your file in 'rb' mode, not only 'r', which seems to be what's happening. |
Thank you @rtobar!! Got it working! It is quite a bit faster: before, it took 10 minutes to write 10k records; now it takes only 3 minutes to do the same. Many thanks!!
Hi @rtobar and @isagalaev, I wanted to give you another update. When I said earlier that it took 10 minutes to write 10k records, I was using a Pandas DataFrame in my process. I now realize the DataFrame was a lot of overhead. After removing it, this process is now flying: it converted roughly 9 million rows to CSV in 35 minutes. I have one additional question: is there a way to stream from, and write directly to, an S3 bucket?
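For reference, a minimal sketch of the DataFrame-free pattern described above (file names and column names are assumptions for illustration):

import csv
import ijson

fields = ['name', 'gender']  # assumed columns

with open('data.json', 'rb') as src, open('out.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(fields)  # header row
    for obj in ijson.items(src, 'item'):  # one object at a time, constant memory
        writer.writerow([obj.get(f) for f in fields])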
Great to hear the good results @soyf, that's very encouraging :) Regarding your other question, reading from and writing to S3 would be something you'd have to do yourself. ijson only requires a file-like object; where that object reads its data from, and what you do with the resulting parsed results, is up to you. For future reference, please also be aware that ijson is now maintained under ICRAR/ijson, so posts here might not get too much attention.
Good point. I'm closing the issue here. Also, @soyf, I'd suggest asking on StackOverflow about streaming data in and out of S3. I'm pretty sure it should be possible to fish an open socket out of boto3's internals, but I don't have an answer right away.
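For what it's worth, a hedged sketch of the reading side (bucket and key names are assumptions): boto3's get_object returns a streaming, file-like body that ijson can consume directly, without downloading the whole file first.

import boto3
import ijson

s3 = boto3.client('s3')
# The returned Body is a file-like streaming object, so ijson reads
# from it incrementally.
body = s3.get_object(Bucket='my-bucket', Key='data.json')['Body']

for obj in ijson.items(body, 'item'):
    ...  # process each element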
Thanks guys |