
How to read json records in chunks using ijson? #89

Closed
Rstar1998 opened this issue Mar 15, 2023 · 4 comments
Labels
question Further information is requested

Comments


Rstar1998 commented Mar 15, 2023

I need to read a huge JSON file and insert its records into MongoDB. I want to read the records in chunks of 1 million (or any number). How do I achieve this using ijson?

I have a 2GB JSON file that I need to load into a MongoDB database using Python.
I used the following piece of code:

  from pymongo import MongoClient
  import ijson
  import time
  
  client = MongoClient("mongodb://localhost:27017/")
  database = client['dfg']
  collection = database['xcv']
  
  
  start = time.time()
  
  with open("huge_json_data.json", "rb") as f:
      collection.insert_many([record for record in ijson.items(f, "item")])
  
  end = time.time()
  print(end - start)
  
  client.close()

The problem is that this process takes a huge amount of time and memory, since the entire 2GB file is read into a list before being passed to insert_many. Is it possible to load the file in chunks of 10000 records and insert each chunk?
Like

    with open("huge_json_data.json", "rb") as f:
        for chunk in ijson.items(f, "item", chunk_size=10000):
            collection.insert_many(chunk)

Feel free to correct me if I am following the wrong approach, or to suggest any other solution that would solve my issue.

data sample

    [
        {
            "item_id": 1,
            "temp": 2,
            "time": "2023-03-14 00:00:00",
            "item_list": [
                {
                    "i_id": 0,
                    "i_name": "",
                    "i_brand": "",
                    "i_l_category": "",
                    "i_stock_qnty": 10,
                    "s": 0.90
                },
                {
                    "i_id": 1,
                    "i_name": "",
                    "i_brand": "",
                    "i_l_category": "",
                    "i_stock_qnty": 10,
                    "score": 0.90
                }
            ]
        },
        ......... 100000 such records
    ]

@Rstar1998 Rstar1998 added the question Further information is requested label Mar 15, 2023

rtobar commented Mar 15, 2023

@Rstar1998 please follow the advice given in the template: share what you've tried, ask more precise questions, hopefully also some example data, etc. With such a broad description there's little help you can get.

@Rstar1998

@rtobar I have updated my description. Let me know if any more info is needed.


rtobar commented Mar 15, 2023

Thanks @Rstar1998, that's much clearer now :-)

The problem is that you are creating a single list with all the results, then feeding it to MongoDB. That is what's causing the problem, not the ijson iteration itself. What you need is to indeed chunk the results from the ijson iteration and feed those chunks to MongoDB.

To answer your direct question: no, ijson doesn't offer chunking itself. The good news is that it doesn't really need to, as this is a simple and common task. You could for example use itertools.islice for that, which doesn't require much work. Something like (taken from https://docs.python.org/3/library/itertools.html#itertools-recipes, see the "batched" recipe):

    from itertools import islice

    items = ijson.items(f, "item")
    while (batch := tuple(islice(items, n))):  # n = desired batch size
        # insert batch into MongoDB, e.g. collection.insert_many(batch)
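For anyone landing here later, the recipe above can be sketched as a standalone snippet. This is a minimal sketch under stated assumptions: an in-memory generator stands in for `ijson.items(f, "item")` (so the snippet runs without a 2GB file), and the MongoDB insert is shown only as a comment since it needs a live server. The `batched` helper name follows the itertools recipe; Python 3.12+ ships it built in as `itertools.batched`.

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive tuples of up to n items from iterable."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

# Stand-in for ijson.items(f, "item"); in the real script, iterate
# the ijson stream here and call collection.insert_many(batch)
# inside the loop instead of printing.
records = ({"item_id": i} for i in range(25))
for batch in batched(records, 10):
    print(len(batch))  # prints 10, 10, 5
```

Because `islice` only pulls `n` items at a time from the lazy ijson iterator, peak memory stays proportional to the batch size rather than the file size.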

@Rstar1998

@rtobar Thank you very much.
