BUG: GH11786 Thread safety issue with read_csv #11790

jdeschenes · 2015-12-07T18:42:04Z

Fixed an issue with thread safety when calling read_csv with a StringIO object.

The issue was caused by a misplaced PyGILState_Ensure()

jreback · 2015-12-07T18:47:05Z

@jdeschenes gr8! can you add in the example from the issue as a smoke test (e.g. just have it run), then read in with a single thread and compare.

and pls add a release note when you are satisfied.

jdeschenes · 2015-12-07T18:48:00Z

Alright, I did this quickly as I don't have time to work on this right now. How long until next release?

jreback · 2015-12-07T18:58:11Z

@jdeschenes oh, we have a while... when you have a chance. thanks!

jreback · 2015-12-26T00:46:51Z

@jdeschenes if you can have a look at this again, that would be great.

jdeschenes · 2016-01-04T15:25:53Z

ping! @jreback

jreback · 2016-01-05T00:43:02Z

pandas/tests/test_frame.py

+        files = [BytesIO(b) for b in bytes]
+
+        # Read all files in many threads
+        pool = ThreadPool(8)


assert that the values read in match those from a single-threaded reader (e.g. compare frames).
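
A minimal sketch of the comparison being asked for here, not the test that was merged. The helper names (`make_csv_bytes`, `read_single`, `read_threaded`) and the data are hypothetical. It parses the same in-memory CSV payloads once through a `ThreadPool` and once sequentially, then checks that the two results are equal.

```python
from io import BytesIO
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def make_csv_bytes(n_files=8, n_rows=500, seed=0):
    """Build small CSV payloads in memory (hypothetical test data)."""
    rng = np.random.RandomState(seed)
    payloads = []
    for _ in range(n_files):
        df = pd.DataFrame(rng.randint(0, 1000, size=(n_rows, 3)),
                          columns=list("abc"))
        payloads.append(df.to_csv(index=False).encode("utf-8"))
    return payloads


def read_single(payloads):
    """Reference result: parse every payload in the calling thread."""
    frames = [pd.read_csv(BytesIO(p)) for p in payloads]
    return pd.concat(frames, ignore_index=True)


def read_threaded(payloads, n_threads=8):
    """Parse every payload in a worker thread; map() preserves input order."""
    pool = ThreadPool(n_threads)
    try:
        frames = pool.map(lambda p: pd.read_csv(BytesIO(p)), payloads)
    finally:
        pool.close()
        pool.join()
    return pd.concat(frames, ignore_index=True)


payloads = make_csv_bytes()
expected = read_single(payloads)
result = read_threaded(payloads)

# The threaded read should produce exactly the same frame as the serial one.
pd.testing.assert_frame_equal(result, expected)
```

Each worker gets its own `BytesIO`, so no file-like object is shared between threads; only the parser itself is exercised concurrently.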

mrocklin · 2016-01-05T18:45:52Z

Thank you both for keeping up on this.

jreback · 2016-01-06T13:42:42Z

@jdeschenes IIRC this issue is reproducible with actual files. Is that not the case? Is it only StringIO/BytesIO? Are they not thread-safe?

jreback · 2016-01-11T13:47:27Z

@jdeschenes can you respond to my comments?

jdeschenes · 2016-01-19T02:22:04Z

Hi @jreback,

The issue is solely reproducible with StringIO. The root cause of this bug is in the function buffer_rd_bytes in pandas/src/parser/io.c. This function is only used when a StringIO/BytesIO is passed to the read_csv function.

The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before, since the GIL was always held throughout the read_csv function call.

I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this.

I think that the release notes should be kept as is.

Let me know what you think.

jreback · 2016-01-19T02:23:39Z

ok, can you add a test that validates that reading from disk with multiple threads is ok (so we don't regress).
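
A rough sketch of what such a regression check could look like, under the assumption that a handful of small temporary files is enough to exercise the code path. The helpers `write_csv_files` and `read_paths_threaded` and the file names are made up for illustration and are not taken from the pull request.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

import pandas as pd


def write_csv_files(directory, n_files=4, n_rows=200):
    """Write a few small CSV files to disk and return their paths (made-up data)."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, "part_{}.csv".format(i))
        df = pd.DataFrame({"file": [i] * n_rows, "row": list(range(n_rows))})
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


def read_paths_threaded(paths, n_threads=4):
    """Read each path with pd.read_csv in a worker thread, keeping input order."""
    pool = ThreadPool(n_threads)
    try:
        return pool.map(pd.read_csv, paths)
    finally:
        pool.close()
        pool.join()


with tempfile.TemporaryDirectory() as tmp:
    paths = write_csv_files(tmp)
    sequential = [pd.read_csv(p) for p in paths]
    threaded = read_paths_threaded(paths)

# Every file read in a worker thread should match its single-threaded read.
for left, right in zip(threaded, sequential):
    pd.testing.assert_frame_equal(left, right)
```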

jreback · 2016-01-19T02:23:54Z

doc/source/whatsnew/v0.18.0.txt

- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`)
- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`)
-
+- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`)


use double backticks around StringIO

jreback · 2016-01-19T02:25:56Z

pls run git diff master | flake8 --diff as much PEP checking has been done on these files.

mrocklin · 2016-01-19T02:44:05Z

FWIW, using BytesIO has actual use cases in distributed computing; it isn't just a test case.

Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use pd.read_csv is to hand it a BytesIO object.
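
For illustration, the pattern described above can be as small as the following. The `payload` contents are made up and stand in for whatever bytes a storage system would deliver.

```python
from io import BytesIO

import pandas as pd

# Stand-in for bytes delivered by a storage system (made-up content).
payload = b"name,value\nalpha,1\nbeta,2\n"

# Wrap the raw bytes in a file-like object so read_csv can consume them.
df = pd.read_csv(BytesIO(payload))
print(df)
```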

jreback · 2016-01-19T02:48:55Z

@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :)

jdeschenes · 2016-01-19T03:29:07Z

It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time in the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked.

I have a suspicion that it doesn't help at all (it might even be a net loss).
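
One way such a benchmark could be sketched is shown here, assuming a synthetic workload and a hypothetical helper `time_reads`. No timings are claimed; the result depends on the pandas version, the data, and the machine, and the threaded figure includes the cost of starting the pool.

```python
import time
from io import BytesIO
from multiprocessing.pool import ThreadPool

import pandas as pd


def time_reads(payloads, n_threads):
    """Return (elapsed_seconds, frames) for parsing payloads with n_threads workers."""
    def read(p):
        return pd.read_csv(BytesIO(p))

    start = time.perf_counter()
    if n_threads <= 1:
        # Baseline: parse everything in the calling thread.
        frames = [read(p) for p in payloads]
    else:
        pool = ThreadPool(n_threads)
        try:
            frames = pool.map(read, payloads)
        finally:
            pool.close()
            pool.join()
    return time.perf_counter() - start, frames


# Made-up workload: 16 identical payloads of 20,000 rows each.
rows = "\n".join("{},{},{}".format(i, i * 2, i * 3) for i in range(20000))
payload = ("a,b,c\n" + rows + "\n").encode("utf-8")
payloads = [payload] * 16

serial_time, serial_frames = time_reads(payloads, n_threads=1)
threaded_time, threaded_frames = time_reads(payloads, n_threads=8)
print("serial:   {:.3f}s".format(serial_time))
print("threaded: {:.3f}s".format(threaded_time))
```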

I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required.

jreback · 2016-01-19T20:02:38Z

@jdeschenes thanks!

certainly would take addtl benchmarks / fixes!

kayvonr · 2016-01-27T04:42:35Z

Hey all - any estimate of when this will go out in a production release? Encountering this bug very very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon.

Thanks

jreback · 2016-01-27T13:56:55Z

planning on an RC in about 2 weeks, so release should be roughly mid-Feb or so

jreback added Bug IO CSV read_csv, to_csv labels Dec 7, 2015

jreback added this to the 0.18.0 milestone Dec 7, 2015

jreback mentioned this pull request Dec 7, 2015

Fatal Python error: GC object already tracked dask/dask#860

Closed

mrocklin mentioned this pull request Dec 23, 2015

Unstable dask._Frame.map_partitions behavior dask/dask#888

Closed

jdeschenes force-pushed the pandas-11786 branch from ce82e4c to 2c4df56 Compare January 3, 2016 17:08

jreback reviewed Jan 5, 2016

mrocklin mentioned this pull request Jan 12, 2016

Fatal error when running read_csv dask/dask#841

Closed

jdeschenes force-pushed the pandas-11786 branch from 2c4df56 to 6252eeb Compare January 19, 2016 02:21

jreback reviewed Jan 19, 2016

jdeschenes force-pushed the pandas-11786 branch 2 times, most recently from 1e257fe to a9a2513 Compare January 19, 2016 02:36

BUG: Fixed an issue with thread safety when calling read_csv with a S…

505f6a6

…tringIO object., pandas-dev#11786 The issue was caused by a misplaced PyGilSate_Ensure()

jdeschenes force-pushed the pandas-11786 branch from a9a2513 to 505f6a6 Compare January 19, 2016 03:20

jreback closed this in 567bc5c Jan 19, 2016

jreback added a commit that referenced this pull request Jan 19, 2016

TST: win32 testing fix, xref #11790

e8fbabd

TomAugspurger mentioned this pull request Feb 4, 2016

Fatal error with read_csv for large file on OS X on both dask/dask#954

Closed

dragoljub mentioned this pull request Nov 9, 2018

read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2 #23516

Closed
