This is very very slow on my computer #24

Closed
vongohren opened this issue Mar 6, 2020 · 16 comments
Labels
question Further information is requested

Comments

@vongohren

So I have a JSON file, 330-ish MB.
The content looks like this:

{
  "locations" : [ {
    "timestampMs" : "1231313131313",
    "latitudeE7" : 111111111,
    "longitudeE7" : 123123131,
    "accuracy" : 36,
    "activity" : [ {
      "timestampMs" : "1211211121121",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 75
      }, {
        "type" : "ON_FOOT",
        "confidence" : 10
      }, {
        "type" : "IN_VEHICLE",
        "confidence" : 5
      }, {
        "type" : "ON_BICYCLE",
        "confidence" : 5
      }, {
        "type" : "UNKNOWN",
        "confidence" : 5
      }, {
        "type" : "WALKING",
        "confidence" : 5
      }, {
        "type" : "RUNNING",
        "confidence" : 5
      } ]
    } ]
  }, {........

Meaning an array of locations.
If I run this through json_load, then iterate over the result and pull out the two map_keys I want, it takes about 20 seconds. That is doable.

But I cannot load the whole thing into memory anymore; it is too big for my infrastructure, so I found this lib. But when I run, for example,

    locations = ijson.kvitems(json_file, 'locations.item')
    timestampMsObjects = (v for k, v in locations if k == 'timestampMs')
    timestampMs = list(timestampMsObjects)

It takes many, many minutes. I don't know how long actually, because I quit it every time it goes too far.

Why is this? I'm just trying to get the length of that list,
to see how many points I'm working with.

Afterwards I want to pull out 3 map_keys and combine them into a smaller object. But I just need to make sure this software is fast enough.

Anyone with some insight on this?

@rtobar

rtobar commented Mar 7, 2020

What backend are you using? See what happens when you try to import the other backends explicitly (see the README file for instructions). I suspect you are using the python backend, which is most likely the cause of the trouble.

Also, what platform are you on, and how did you get ijson? There are binary wheels in PyPI for most Linux/Mac combinations in which the yajl2_c backend should work correctly.
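
One quick way to answer the backend question is to try importing each backend module and see which ones load. The sketch below assumes the backend module names mentioned in this thread (yajl2_c, yajl2_cffi, yajl2, python) live under ijson.backends; the helper name available_backends is made up for illustration and is not part of ijson.

```python
import importlib

# Backend names as they appear in this thread; which ones load depends on
# how ijson was installed and whether the yajl library is present.
CANDIDATES = ["yajl2_c", "yajl2_cffi", "yajl2", "python"]


def available_backends(names=CANDIDATES):
    """Return the backend names that can actually be imported here."""
    found = []
    for name in names:
        try:
            importlib.import_module(f"ijson.backends.{name}")
        except Exception:
            # Missing ijson, missing C extension, missing yajl library, etc.
            continue
        found.append(name)
    return found


print(available_backends())
```

If yajl2_c is missing from the printed list, an explicit `import ijson.backends.yajl2_c as ijson` would be expected to fail with an import error rather than silently use another backend.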

@rtobar

rtobar commented Mar 7, 2020

Separately, do you need to iterate over the whole file a first time? If you are extracting/filtering only some information out of it you could break sooner and not read all of it -- unless of course the filtering itself depends on the length of the locations array
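
A rough sketch of the "break sooner" idea: because ijson iterators are lazy, returning from the loop early means the rest of the file is never parsed. The helper names and the accuracy threshold below are hypothetical and only serve as an illustration; the ijson import is done lazily so the generic helper works without it.

```python
def first_matching(values, predicate):
    """Return the first value satisfying predicate, or None.

    Stops pulling from `values` as soon as a match is found, so a lazy
    source (such as an ijson iterator) is not read to the end.
    """
    for value in values:
        if predicate(value):
            return value
    return None


def first_accurate_location(path, max_accuracy=20):
    """Hypothetical use: first location whose accuracy is good enough."""
    # Assumption: ijson is installed. Imported here so that first_matching
    # can be used without it.
    import ijson

    with open(path, "rb") as f:
        locations = ijson.items(f, "locations.item")
        return first_matching(
            locations, lambda loc: loc.get("accuracy", 10**9) <= max_accuracy
        )
```

This only helps when the answer does not depend on the whole array; counting all entries still requires a full pass.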

@vongohren
Author

@rtobar I think I might have jumped in a bit early on how to use this. I did not look at any backend stuff because I thought it was for special cases.
I just import ijson into my current code and run it, with no backend handling, but I guess I need to look at it.

Right now I'm running it on a Mac, and I'm aiming to run it in a Docker container, where the VM has a memory restriction of 2 GB.

I will come back with some questions after I read up on your suggestion

@vongohren
Author

@rtobar I can't really get the backends to load in any different way.

import ijson.backends.yajl2_cffi as ijson

but with yajl2_c.

But I was not able to load this; it failed.

Is there anything else one should do to load the backend properly?

@rtobar

rtobar commented Mar 8, 2020

@vongohren sorry, but I couldn't understand what exactly worked and what didn't. Did both yajl2_cffi and yajl2_c fail? Also, please let me know how you installed ijson -- whether you installed it manually from the repository (works, but if you don't have the yajl2 library available in your system it won't build the yajl2_c backend) or using pip (preferred method, the yajl2_c backend should work out of the box).

A different test you can try out is running the https://github.com/ICRAR/ijson/blob/master/benchmark.py tool. Download that file, and run benchmark.py -l to get a list of the backends you have available. You can also try benchmark.py -i your_file.json -M kvitems to see how long it takes to parse via kvitems with the different backends (and you can use the -B flag to select a particular backend, if available). See all help with benchmark.py --help.

@vongohren
Author

vongohren commented Mar 8, 2020

@rtobar thanks for the patience 😁 And sorry for the sparse communication. I'm very new to the Python environment, so I still need to learn how all the different things stick together 🤓 So I'm not sure how to add yajl2 to my Mac, or to the Docker container this is eventually going to run in.

That benchmarking tool did give some insights!

Backends:
 - python
Benchmarks:
 - long_list
 - big_int_object
 - big_decimal_object
 - big_null_object
 - big_bool_object
 - big_str_object
 - big_longstr_object
 - object_with_10_keys
 - empty_lists
 - empty_objects

I guess I have very few backends available.
But I did install via pip, so as you say, the backend should work out of the box?
To test this I just cloned this repo and ran it on my bare mac. Meaning that pip was not used for this benchmark test.

Is there a way to run the benchmark inside the venv where I did pip install the ijson?

The test did finish, but the time is not optimal. Do you think it can be faster?

#mbytes,method,test_case,backend,time,mb_per_sec
321.567, kvitems, locations.json, python, 147.779, 2.176

@vongohren
Author

vongohren commented Mar 8, 2020

OK, so a bit silly, but I just ran brew install yajl and the cloned benchmark now lists yajl2 as a possible backend. But is it supposed to show yajl2_c as a possible backend as well, if I can use it? Because as I understand it, yajl2 is slower than the two other versions?

#mbytes,method,test_case,backend,time,mb_per_sec
321.567, kvitems, location.json, python, 168.581, 1.907
321.567, kvitems, location.json, yajl2, 104.954, 3.064

But it is still quite time-consuming. Should it be this high? Or is there any other tweaking I can do?
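
For reading the benchmark output: the mb_per_sec column is the file size divided by the elapsed time. The sketch below re-derives it from the two rows posted above and computes the relative speedup; the helper name parse_benchmark is made up for illustration and is not part of benchmark.py.

```python
import csv
import io

# The two result rows posted in this thread, copied verbatim.
BENCHMARK_OUTPUT = """\
#mbytes,method,test_case,backend,time,mb_per_sec
321.567, kvitems, location.json, python, 168.581, 1.907
321.567, kvitems, location.json, yajl2, 104.954, 3.064
"""


def parse_benchmark(text):
    """Turn the CSV-like benchmark output into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


rows = parse_benchmark(BENCHMARK_OUTPUT)
for row in rows:
    # Throughput is megabytes parsed per second of wall-clock time.
    throughput = float(row["mbytes"]) / float(row["time"])
    print(f"{row['backend']}: {throughput:.3f} MB/s")

# Ratio of elapsed times: how much faster yajl2 was than python here.
speedup = float(rows[0]["time"]) / float(rows[1]["time"])
print(f"yajl2 vs python speedup on this file: {speedup:.2f}x")
```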

@rtobar

rtobar commented Mar 8, 2020

@vongohren thanks for all the details, now things are becoming clear. Indeed you were using the python backend, which was my initial suspicion. If you pip install cffi, that will also give you access to the yajl2_cffi backend. But on the other hand, you still don't have the yajl2_c backend.

What version of python (and MacOS) are you running? If it's 3.8 that might explain it, as I think (from memory) I had to skip generating binary wheels for that version. This is not the case for Linux wheels, which are generated for all python versions correctly.

Now that you have yajl installed, you could try to compile the package yourself, hoping that you will end up with a usable yajl2_c backend for your tests (again, when building your container this shouldn't be a problem, as the package installed with pip should have it). The yajl2_c backend is usually ~10x faster than yajl2 and yajl2_cffi, so you should be down to reasonable times.

@vongohren
Author

I'm running:
MacOS: 10.15.3
Python: 3.6.7

My code runs when I add import ijson.backends.yajl2_c as ijson
I presume that it is then using the right lib for the best speed?
Or can it fall back to some default mode?

I'm running this simple code

    locations = ijson.kvitems(json_file, 'locations.item')
    timestampMsObjects = (v for k, v in locations if k == 'timestampMs')
    print(timestampMsObjects)
    timestampMs = list(timestampMsObjects)
    print(len(timestampMs))

It reports "Parsing the file in 241.8568 seconds", which is not that great. What might the reason be?
This is the benchmarking I've got on the same file. Should I see yajl2_c in that list?

#mbytes,method,test_case,backend,time,mb_per_sec
321.567, kvitems, location.json, python, 168.581, 1.907
321.567, kvitems, location.json, yajl2, 104.954, 3.064

I've also found this library: https://pypi.org/project/jsonslicer/#description, which was able to get through the file, and I could handle all entries in about 98.12 s, without any special configuration.

I would love to understand whether I'm not able to run yajl2_c, and that is why it's not showing its true speed, or whether this is a limit.
Or maybe my code approach is bad?

I'm basically trying to map the JSON file down to just a couple of the map_keys.

@vongohren
Author

I also tried this code on this many entries: 1062126
It never finished; I had to quit it.

        parser = ijson.parse(json_file)
        f_out.write("{\"locations\":[")
        for prefix, event, value in parser:
            if(event == "end_map"):
                f_out.write("}")

        f_out.write("]}")

So maybe I'm taking some wrong approach to your lib?

@vongohren
Author

I might have found the culprit. Suddenly my function was blazingly fast.
I removed a memory profiler 😰
jsonslicer did this in 6.6 seconds.

ijson got the running time down to 10.8 seconds.

But still, I got it faster with jsonslicer.

@rtobar

rtobar commented Mar 8, 2020

> This is the benchmarking I've got on the same file. Should I see yajl2_c in that list?

Yes. I'm still puzzled: you said that in your code you do import ijson.backends.yajl2_c as ijson, but you don't see that backend on the benchmark list, meaning that the benchmark can't import it. Maybe try running benchmark.py from a different directory, not directly from the top-level directory of the ijson repo (e.g., put it under /tmp and execute it there), python might be getting confused and loading ijson from the repo instead of the version you have installed.

> So maybe I'm taking some wrong approach to your lib?

It seems you can do better than what you are doing. You mentioned a couple of times that you just want to take some map keys out of the JSON stream, and in that case kvitems is a killer, because you don't really need ijson to create objects for you. You can probably do what you need with ijson.parse, which will be faster than kvitems as it doesn't create any objects.
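
A minimal sketch of the ijson.parse approach, assuming the file layout shown at the top of this thread. The function consumes (prefix, event, value) tuples as produced by ijson.parse and keeps only a few scalar keys per location; the function names and the choice of keys are illustrative, not part of ijson.

```python
# Event names that carry a plain value rather than opening or closing a container.
SCALAR_EVENTS = {"string", "integer", "double", "number", "boolean", "null"}


def slim_locations(events, keys=("timestampMs", "latitudeE7", "longitudeE7")):
    """Yield one small dict per location from ijson.parse-style events.

    Only scalar values directly under `locations.item` are kept; nested
    structures such as `activity` are skipped without being built.
    """
    wanted = {f"locations.item.{key}": key for key in keys}
    current = None
    for prefix, event, value in events:
        if prefix == "locations.item":
            if event == "start_map":
                current = {}  # a new location object begins
            elif event == "end_map":
                yield current  # the location object is complete
                current = None
        elif current is not None and prefix in wanted and event in SCALAR_EVENTS:
            current[wanted[prefix]] = value


def iter_slim_locations(path):
    """Stream slimmed-down locations from a file on disk."""
    # Assumption: ijson is installed. Imported here so that slim_locations
    # can be used without it.
    import ijson

    with open(path, "rb") as f:
        yield from slim_locations(ijson.parse(f))
```

Because both functions are generators, the caller can write each small dict out as it arrives instead of collecting everything in a list.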

Or also try using `ijson.items(f, 'locations.item.timestampMs')`. That should return only those values and nothing else, rather than building loads of objects that you end up discarding anyway.
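
Since the original goal was only the number of points, the values do not need to be stored at all. A small sketch of counting with ijson.items follows; the helper names are made up for illustration, and ijson is assumed to be installed.

```python
def count_values(stream, prefix="locations.item.timestampMs"):
    """Count the values found at `prefix` without keeping them in memory."""
    # Assumption: ijson is installed. Imported here so the module still
    # loads on systems without it.
    import ijson

    return sum(1 for _ in ijson.items(stream, prefix))


def count_values_in_file(path, prefix="locations.item.timestampMs"):
    """Open `path` in binary mode and count the values at `prefix`."""
    with open(path, "rb") as f:
        return count_values(f, prefix)
```

The prefix is matched exactly, so the nested timestampMs entries under activity are not counted.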

@jpmckinney

jpmckinney commented Mar 9, 2020

If you’re just looking for the best performance, try pip install ijson==3.0rc2, which is faster than JsonSlicer on their own benchmark.

@vongohren
Author

Thanks @jpmckinney, that is interesting, I will look at it!

@vongohren
Author

@rtobar cool, thanks, I will check it out. It takes some time because this is a hobby project, but I appreciate the feedback. The jsonslicer maintainers have not provided feedback yet, so I would prefer to use this repo, which does answer :)

@vongohren
Author

@rtobar @jpmckinney thanks for the follow-up. I will close this as I'm moving onwards with a satisfactory result. The feedback and assistance are much appreciated.

@rtobar added the question label (Further information is requested) Mar 15, 2020