read_json with lines=True not using buff/cache memory #17048

Closed
louispotok opened this Issue Jul 21, 2017 · 23 comments

louispotok (Contributor) commented Jul 21, 2017

I have a 3.2 GB json file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12GB of available memory. This is Pandas version 0.20.2.

I'm on Ubuntu, and the free command shows >12GB of "Available" memory, most of which is "buff/cache".

I'm able to read the file into a dataframe by iterating over the file like so:

import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:
    while True:
        # Pull the next 1000 lines; islice returns an empty list at EOF.
        lines = list(itertools.islice(f, 1000))

        if lines:
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break

df = pd.concat(dfs)

You'll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.

It seems that pd.read_json with lines=True doesn't use the available memory, which looks to me like a bug.

gfyoung added the IO JSON label Jul 21, 2017

gfyoung (Member) commented Jul 21, 2017

@louispotok: that behavior does sound buggy to me, but before I label it as such, could you provide a minimal reproducible example for us?

louispotok (Contributor) commented Jul 21, 2017

Happy to, but what exactly would constitute an example here? I can provide an example JSON file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

gfyoung (Member) commented Jul 21, 2017

> I can provide an example JSON file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

Just provide the smallest possible JSON file that causes this MemoryError to occur.

jreback (Contributor) commented Jul 21, 2017

The lines=True implementation is currently not designed this way. If you substitute your solution into the current implementation, does it pass the test suite?

jreback added the Performance label Jul 21, 2017

louispotok (Contributor) commented Jul 24, 2017

@gfyoung I'm still not sure exactly what would be most helpful for you here.

I tried doing head -n 10 path/to/file | testing.py, where testing.py contains df = pd.read_json(sys.stdin, lines=True), and then varied how many lines to pass.

Results: every million lines is about 0.8 GB, according to head -n 1000000 path/to/file | wc -c. I ran each of these a few times in varying orders, always with the same results.

  • 1M lines: success
  • 1.3M lines: success
  • 2M lines: got "Killed"; it also killed a watch in another terminal window with the message "unable to fork process: Cannot allocate memory"
  • 3M lines: got "MemoryError" (I had a watch running here too, no problems at all)
  • Full file: got "MemoryError"
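
For reference, a fleshed-out version of that test script might look like the following minimal sketch; the print of the shape is only there to confirm the load finished, and the explicit python invocation in the pipe is an assumption about how the script is run:

# testing.py -- read line-delimited JSON from stdin into a DataFrame
import sys

import pandas as pd

df = pd.read_json(sys.stdin, lines=True)
print(df.shape)

Invoked as, for example, head -n 1000000 path/to/file | python testing.py.
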
louispotok (Contributor) commented Jul 24, 2017

@jreback I think your question was for me, but I don't know how to do what you described. Are there instructions you could point me to?

gfyoung (Member) commented Jul 24, 2017

> 1.3M lines: success

Yikes! That's a pretty massive file. That certainly helps us understand what we would need to do to reproduce this issue.

gfyoung (Member) commented Jul 24, 2017

> I think your question was for me, but I don't know how to do what you described.

Here is the documentation for making contributions to the repository. Essentially, @jreback is asking whether you could incorporate the workaround from your issue description into the implementation of read_json, which you can find in pandas/io/json/json.py.

A quick glance there indicates what might be the issue: we're putting ALL of the lines into a list in memory! Your workaround might be able to address that.
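
As an illustration of that idea, here is a minimal sketch of the workaround folded into a helper function; the function name, the default chunk size, and the final pd.concat are choices made for the sketch, not the actual pandas internals:

from io import StringIO
from itertools import islice

import pandas as pd

def read_json_lines_chunked(file_obj, chunksize=1000):
    """Parse line-delimited JSON from an open file object chunk by chunk,
    so the full file is never held in memory as one giant list of lines."""
    frames = []
    while True:
        lines = list(islice(file_obj, chunksize))
        if not lines:
            break
        frames.append(pd.read_json(StringIO("".join(lines)), lines=True))
    return pd.concat(frames)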

louispotok (Contributor) commented Jul 24, 2017

Thanks! I added it for one of the possible input types. You can see it here. It passes all the existing tests, and I'm now able to use it to load that file.

I think this is much slower than the previous implementation, and I don't know whether it can be extended to other input types. We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

gfyoung (Member) commented Jul 24, 2017

> We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

I think it would make sense to add such a parameter. We have it for read_csv. Try adding that and let us know how it works! This looks pretty good so far.
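
For comparison, this is how the existing parameter behaves in read_csv: passing chunksize makes the reader yield DataFrames piece by piece instead of returning a single frame (the file name and chunk size below are placeholders):

import pandas as pd

# With chunksize, read_csv returns an iterator of DataFrames (a TextFileReader),
# so only one chunk needs to be in memory at a time.
pieces = []
for chunk in pd.read_csv("big_file.csv", chunksize=100000):
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)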

louispotok (Contributor) commented Jul 25, 2017

Using the chunksize param in read_csv returns a TextFileReader, though, right? Won't that be confusing?

gfyoung (Member) commented Jul 25, 2017

@louispotok: IMO, it would not, because there's more confusion when people try to pass the same parameters to one read_* function that they're used to passing to another, only to find out they don't work or don't exist. Thus, you would be doing all read_json users a service by adding a parameter similar to the one in read_csv. 😄

louispotok (Contributor) commented Jul 27, 2017

@gfyoung Makes sense. Here's the latest with the chunksize param.

I still don't know how to make it work on any of the other filepath_or_buffer branches, or really what input types would trigger those. I would need an explanation of what's happening there to extend this.

gfyoung (Member) commented Jul 27, 2017

> I would need an explanation of what's happening there to extend this.

Certainly. We accept three types of inputs for read_json:

  • file path (this option, by the way, is not clearly documented, so a PR to make this clearer is welcome!)
  • file object
  • valid JSON string

Your contribution would address the first two options. You have at this point addressed the first one. The second comes in the conditional that checks whether filepath_or_buffer has a read method. Thus, you should also add your logic under that check (we'll handle refactoring later).
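
A rough sketch of the three input forms as they worked in the pandas of this era (the inline records are invented, and the file-path call is commented out because it assumes a file on disk); the second form is the one gated by the check for a read method:

from io import StringIO

import pandas as pd

payload = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

# 1. A file path:
# df = pd.read_json("records.jsonl", lines=True)

# 2. A file-like object (anything with a .read method):
df_from_buffer = pd.read_json(StringIO(payload), lines=True)

# 3. A valid JSON string passed directly (newer pandas versions warn about this form):
df_from_string = pd.read_json(payload, lines=True)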

louispotok (Contributor) commented Aug 3, 2017

Okay @gfyoung, thanks for your help. I added it to the conditional you mentioned as well. Latest here. Passes the tests.

I also changed the behavior so that if chunksize is not explicitly passed, we try to read it all at once. My thinking is that using chunksize changes the performance drastically, and it's better to let people make this tradeoff explicitly without changing the default behavior.

From here, what are the next steps? There's probably a bit of cleanup you'd like me to do -- let me know. Thanks again!

gfyoung (Member) commented Aug 3, 2017

@louispotok: Sure thing. Just submit a PR, and we'll be happy to review!

louispotok (Contributor) commented Aug 3, 2017

Here goes: #17168.

alessandrobenedetti commented Jul 14, 2018

Hi,
I am experimenting with JSON files of various sizes. I am using pandas read_json with lines=True and noticing very high memory usage in the parsing phase, even when using a chunksize of 10,000. For example:

  • Input JSON: 280 MB; memory usage: up to 2.6 GB; resulting DataFrame: 400 MB (because of dtypes, not much I can do about this)
  • Input JSON: 4 GB; memory usage: up to 28 GB; resulting DataFrame: 6 GB

It seems the memory needed to parse the JSON is far too much (and I'm not sure whether there are better ways to read big JSON files in pandas). Furthermore, this memory seems to remain allocated to the Python process. I am a Python newbie, so this may be perfectly fine; the memory may just be held by Python as a buffer to be reused when needed (it doesn't grow once the DataFrame starts getting processed), but it looks suspicious.
Let me know if you have noticed the same and found any tips or tricks for it!
Thanks in advance

louispotok (Contributor) commented Jul 15, 2018

@alessandrobenedetti

I've definitely experienced some of what you're describing.

First, the read_json function probably uses more memory overall than it needs to. I don't fully know why that is or how to improve it; that probably belongs in a separate issue if it's important to what you're doing.

Second, when lines=True, I think you're right that all that memory isn't actually being used; it's just not being released back to the OS, so the reported usage is a bit misleading.

Third, if you read with lines=True and a small chunksize, you should be fine either way.
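
For anyone landing here later, that chunked read looks roughly like this (the path and chunk size are placeholders):

import pandas as pd

# With lines=True and a chunksize, read_json yields DataFrames chunk by chunk,
# keeping peak memory close to the size of a single chunk.
chunks = pd.read_json("path/to/file.jsonl", lines=True, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)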

alessandrobenedetti commented Jul 16, 2018

Hi @louispotok, thank you for the kind answer.
I just noticed that even a simpler approach such as:

import pandas as pd

with open(interactions_input_file) as json_file:
    data_lan = []
    for line in json_file:
        data_lan.append(pd.io.json.loads(line))

all_columns = data_lan[0].keys()
print("Size " + str(len(data_lan)))
interactions = pd.DataFrame(columns=all_columns, data=data_lan)

gives me similar memory usage.
I will stop the conversation here as it's slightly off topic.
Should I assume that parsing JSON lines in Python is just that expensive? We are talking about 5-7 times more RAM than the initial file...

rosswait commented Jul 23, 2018

I'm having a similar experience with this function as well, @alessandrobenedetti. I ended up regenerating my data to use read_csv instead, which uses a dramatically smaller amount of RAM.

alessandrobenedetti commented Jul 24, 2018

Thanks @rosswait, I have a small update in case it helps.

My file was heavily string- and list-based (each line was a JSON object with a lot of strings and lists of strings). As a matter of fact, those strings were actually integer IDs, so once I realized that I switched the strings to ints and the lists of strings to lists of ints. This brought the size of the JSON down from 4.5 GB to 3 GB and the memory usage down from 30 GB to 10 GB.
If I end up with stricter memory requirements I will definitely take a look at the csv option.
Thanks!
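
As a rough illustration of that kind of dtype cleanup after loading (the column names and values here are invented):

import pandas as pd

# Hypothetical frame where integer IDs arrived as strings.
df = pd.DataFrame({"user_id": ["101", "102", "103"],
                   "item_ids": [["7", "9"], ["3"], ["4", "8"]]})

# Scalar ID columns can be downcast to a compact integer dtype.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")

# List-valued columns stay object dtype either way; converting their contents
# mainly shrinks the source JSON rather than the in-memory DataFrame.
df["item_ids"] = df["item_ids"].apply(lambda ids: [int(i) for i in ids])

print(df.dtypes)
print(df.memory_usage(deep=True))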
