Skip to content

Conversation

jdeschenes
Copy link
Contributor

closes #11786

Fixed an issue with thread safety when calling read_csv with a StringIO object.

The issue was caused by a misplaced PyGilSate_Ensure()

@jreback
Copy link
Contributor

jreback commented Dec 7, 2015

@jdeschenes gr8! can you add in the example from the issue as a smoke test. (e.g. just have it run), then read in with a single trhead and compare.

and pls add a release note when you are satisified.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Dec 7, 2015
@jreback jreback added this to the 0.18.0 milestone Dec 7, 2015
@jdeschenes
Copy link
Contributor Author

Alright, I did this quickly as I don't have time to work on this right now. How long until next release?

@jreback
Copy link
Contributor

jreback commented Dec 7, 2015

@jdeschenes oh have a while.....when you have a chance...thanks!

@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

@jdeschenes if you can have a look at this again would be great.

@jdeschenes
Copy link
Contributor Author

ping! @jreback

files = [BytesIO(b) for b in bytes]

# Read all files in many threads
pool = ThreadPool(8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that the read in values match a single threaded reader. (e.g. compare frames)

@mrocklin
Copy link
Contributor

mrocklin commented Jan 5, 2016

Thank you both for keeping up on this.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2016

@jdeschenes IIRC this issue is repro with actual files. Is that not the case? is it only StringIO/BytesIO. are they not thread-safe?

@jreback
Copy link
Contributor

jreback commented Jan 11, 2016

@jdeschenes can you respond to my comments.

@jdeschenes
Copy link
Contributor Author

Hi @jreback,

the issue is solely reproducible with StringIO. The root cause of this bug is in function buffer_rd_bytes in
pandas/src/parser/io.c. This function is only used when a StringIO/BytesIO is passed to the read_csv function.

The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call.

I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this.

I think that the release notes should be kept as is.

Let me know what you think.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

ok, can you add a test that validates the issue that reading from a disk with multiple threads is ok (so we don't regress).

- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`)
- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`)

- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around StringIO

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

pls run git diff master | flake8 --diff as much PEP checking has been one on these files.

@jdeschenes jdeschenes force-pushed the pandas-11786 branch 2 times, most recently from 1e257fe to a9a2513 Compare January 19, 2016 02:36
@mrocklin
Copy link
Contributor

FWIW using BytesIO has actual use cases in distributed computing, it isn't just a test case.

Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use pd.read_csv is to hand it a BytesIO object.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :)

…tringIO object., pandas-dev#11786

The issue was caused by a misplaced PyGilSate_Ensure()
@jdeschenes
Copy link
Contributor Author

It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time into the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked.

I have a suspicion that it doesn't help at all(It might be even a net loss).

I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required.

@jreback jreback closed this in 567bc5c Jan 19, 2016
@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

@jdeschenes thanks!

certainly would take addtl benchmarks / fixes!

jreback added a commit that referenced this pull request Jan 19, 2016
@kayvonr
Copy link

kayvonr commented Jan 27, 2016

Hey all - any estimate of when this will be go out in a production release? Encountering this bug very very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon

Thanks

@jreback
Copy link
Contributor

jreback commented Jan 27, 2016

planning on a RC in about 2 weeks, so release should be roughly mid-feb or so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv in multiple theads causes segmentation fault
4 participants