BUG: GH11786 Thread safety issue with read_csv #11790
Conversation
|
@jdeschenes great! Can you add the example from the issue as a smoke test (e.g. just have it run), then read it in with a single thread and compare? And please add a release note when you are satisfied. |
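A minimal sketch of what such a smoke test could look like, assuming a modern Python 3 and a pandas version that ships `pandas.testing`. The helper names (`make_payloads`, `read_payload`, `read_threaded`) and the sizes are hypothetical and are not taken from the PR; the idea is only to read the same in-memory CSV in several threads and compare each result with a single-threaded read.

```python
from io import BytesIO
from multiprocessing.pool import ThreadPool

import pandas as pd
import pandas.testing as pdt


def make_payloads(num_files=8, num_rows=1000):
    """Build identical in-memory CSV payloads (sizes are arbitrary)."""
    body = "\n".join("%d,%d,%d" % (i, i, i) for i in range(num_rows)).encode()
    return [body for _ in range(num_files)]


def read_payload(payload):
    # Each call wraps the bytes in its own BytesIO, so no buffer object
    # is shared between threads.
    return pd.read_csv(BytesIO(payload), header=None)


def read_threaded(payloads, num_threads=4):
    """Parse every payload in a thread pool and return the frames in order."""
    pool = ThreadPool(num_threads)
    try:
        return pool.map(read_payload, payloads)
    finally:
        pool.close()
        pool.join()


payloads = make_payloads()
expected = read_payload(payloads[0])  # single-threaded reference read
threaded = read_threaded(payloads)    # same data, read in many threads

# Every threaded result should match the single-threaded reference.
for frame in threaded:
    pdt.assert_frame_equal(expected, frame)
```

On an affected build the threaded read may crash the interpreter rather than raise, so simply finishing the run is already part of the check.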
jreback
added Bug CSV
labels
Dec 7, 2015
jreback
added this to the
0.18.0
milestone
Dec 7, 2015
|
Alright, I did this quickly as I don't have time to work on this right now. How long until next release? |
|
@jdeschenes oh, we have a while ... when you have a chance, thanks! |
This was referenced Dec 7, 2015
|
@jdeschenes if you can have a look at this again, that would be great. |
|
ping! @jreback |
jreback
commented on an outdated diff
Jan 5, 2016
| @@ -16064,6 +16064,18 @@ def bar(self): | ||
| with tm.assertRaisesRegexp(AttributeError, '.*i_dont_exist.*'): | ||
| A().bar | ||
| + def test_multithread_stringio_read_csv(self): | ||
| + from io import BytesIO | ||
| + from multiprocessing.pool import ThreadPool | ||
| + | ||
| + bytes = ['\n'.join(['%d,%d,%d' % (i, i, i) for i in range(10000)]).encode() | ||
| + for j in range(100)] | ||
| + files = [BytesIO(b) for b in bytes] | ||
| + | ||
| + # Read all files in many threads | ||
| + pool = ThreadPool(8) |
jreback
Contributor
|
jreback
and 1 other
commented on an outdated diff
Jan 5, 2016
| @@ -16064,6 +16064,18 @@ def bar(self): | ||
| with tm.assertRaisesRegexp(AttributeError, '.*i_dont_exist.*'): | ||
| A().bar | ||
| + def test_multithread_stringio_read_csv(self): | ||
| + from io import BytesIO | ||
| + from multiprocessing.pool import ThreadPool | ||
| + | ||
| + bytes = ['\n'.join(['%d,%d,%d' % (i, i, i) for i in range(10000)]).encode() | ||
| + for j in range(100)] | ||
| + files = [BytesIO(b) for b in bytes] | ||
| + | ||
| + # Read all files in many threads | ||
| + pool = ThreadPool(8) | ||
| + pool.map(pd.read_csv, files) |
jreback
Contributor
|
jreback
commented on an outdated diff
Jan 5, 2016
jreback
and 1 other
commented on an outdated diff
Jan 5, 2016
| @@ -355,7 +355,7 @@ Bug Fixes | ||
| - Regression in ``.clip`` with tz-aware datetimes (:issue:`11838`) | ||
| - Bug in ``date_range`` when the boundaries fell on the frequency (:issue:`11804`) | ||
| - Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`) | ||
| - | ||
| +- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`) |
jreback
Contributor
|
mrocklin
commented
Jan 5, 2016
|
Thank you both for keeping up on this. |
|
@jdeschenes IIRC this issue is reproducible with actual files. Is that not the case? is it only |
|
@jdeschenes can you respond to my comments. |
mrocklin
referenced
this pull request
in dask/dask
Jan 12, 2016
Closed
Fatal error when running read_csv #841
|
Hi @jreback, the issue is solely reproducible with StringIO. The root cause of this bug is in function buffer_rd_bytes in The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call. I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this. I think that the release notes should be kept as is. Let me know what you think. |
|
ok, can you add a test that validates the issue that reading from a disk with multiple threads is ok (so we don't regress). |
jreback
commented on an outdated diff
Jan 19, 2016
| @@ -476,11 +476,7 @@ Bug Fixes | ||
| - Regression in ``.clip`` with tz-aware datetimes (:issue:`11838`) | ||
| - Bug in ``date_range`` when the boundaries fell on the frequency (:issue:`11804`) | ||
| - Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`) | ||
| -- Accept unicode in ``Timedelta`` constructor (:issue:`11995`) | ||
| -- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`) | ||
| -- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`) | ||
| -- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`) | ||
| - | ||
| +- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`) |
|
|
jreback
commented on an outdated diff
Jan 19, 2016
| + results = pool.map(pd.read_csv, files) | ||
| + first_result = results[0] | ||
| + | ||
| + for result in results: | ||
| + tm.assert_frame_equal(first_result, result) | ||
| + | ||
| + def test_multithread_path_read_csv(self): | ||
| + df = DataFrame(np.random.rand(5000, 20), columns=map(str, xrange(20))) | ||
| + num_files = 20 | ||
| + with tm.ensure_clean('__passing_str_as_dtype__.csv') as path: | ||
| + df.to_csv(path, index=False) | ||
| + pool = ThreadPool(4) | ||
| + read_dataframes = pool.map(pd.read_csv, [path]*num_files) | ||
| + | ||
| + for single_dataframe in read_dataframes: | ||
| + tm.assert_frame_equal(df, single_dataframe, check_names=False) |
|
|
jreback
and 1 other
commented on an outdated diff
Jan 19, 2016
| + | ||
| + for result in results: | ||
| + tm.assert_frame_equal(first_result, result) | ||
| + | ||
| + def test_multithread_path_read_csv(self): | ||
| + df = DataFrame(np.random.rand(5000, 20), columns=map(str, xrange(20))) | ||
| + num_files = 20 | ||
| + with tm.ensure_clean('__passing_str_as_dtype__.csv') as path: | ||
| + df.to_csv(path, index=False) | ||
| + pool = ThreadPool(4) | ||
| + read_dataframes = pool.map(pd.read_csv, [path]*num_files) | ||
| + | ||
| + for single_dataframe in read_dataframes: | ||
| + tm.assert_frame_equal(df, single_dataframe, check_names=False) | ||
| + | ||
| + | ||
| class TestMiscellaneous(tm.TestCase): |
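For readers on Python 3, where `xrange` is gone, the on-disk regression test in the diff above could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: it swaps `tm.ensure_clean` for the standard library's `tempfile`, uses `pandas.testing`, and the function name `roundtrip_in_threads` and the sizes are made up.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd
import pandas.testing as pdt


def roundtrip_in_threads(df, num_reads=8, num_threads=4):
    """Write df to a temporary CSV, then read that one file from many threads."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "multithread_read.csv")  # hypothetical name
        df.to_csv(path, index=False)
        pool = ThreadPool(num_threads)
        try:
            # Every worker opens the same path independently.
            return pool.map(pd.read_csv, [path] * num_reads)
        finally:
            pool.close()
            pool.join()


# Column labels are strings so they match what read_csv gives back.
source = pd.DataFrame(np.random.rand(500, 5),
                      columns=[str(i) for i in range(5)])
frames = roundtrip_in_threads(source)

# Default tolerances absorb any float round-trip noise from the CSV text.
for frame in frames:
    pdt.assert_frame_equal(source, frame)
```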
jreback
Contributor
|
jreback
commented on an outdated diff
Jan 19, 2016
|
pls run |
mrocklin
commented
Jan 19, 2016
|
FWIW using Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use |
|
@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :) |
|
It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time in the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked. I have a suspicion that it doesn't help at all (it might even be a net loss). I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top: grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required. |
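A rough way to run that benchmark, sketched under assumptions: the payload size, thread count and function names (`build_payload`, `parse`, `time_serial`, `time_threaded`) are all hypothetical, and wall-clock timings like these are noisy, so it only gives a first impression of whether threads help when parsing from BytesIO.

```python
import time
from io import BytesIO
from multiprocessing.pool import ThreadPool

import pandas as pd


def build_payload(num_rows=2000):
    """Return one in-memory CSV as bytes (size chosen arbitrarily)."""
    return "\n".join("%d,%d,%d" % (i, i, i) for i in range(num_rows)).encode()


def parse(payload):
    # A fresh BytesIO per call keeps buffers private to each read.
    return pd.read_csv(BytesIO(payload), header=None)


def time_serial(payloads):
    """Parse the payloads one after another; return (seconds, frames)."""
    start = time.perf_counter()
    frames = [parse(p) for p in payloads]
    return time.perf_counter() - start, frames


def time_threaded(payloads, num_threads=4):
    """Parse the payloads in a thread pool; return (seconds, frames)."""
    pool = ThreadPool(num_threads)
    try:
        start = time.perf_counter()
        frames = pool.map(parse, payloads)
        elapsed = time.perf_counter() - start
    finally:
        pool.close()
        pool.join()
    return elapsed, frames


payloads = [build_payload() for _ in range(16)]
serial_seconds, serial_frames = time_serial(payloads)
threaded_seconds, threaded_frames = time_threaded(payloads)
print("serial:   %.4f s" % serial_seconds)
print("threaded: %.4f s" % threaded_seconds)
```

If the reads spend most of their time holding the GIL, the two numbers should come out close, or the threaded one slightly worse because of pool overhead.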
jreback
closed this
in 567bc5c
Jan 19, 2016
|
@jdeschenes thanks! certainly would take additional benchmarks / fixes! |
jreback
added a commit
that referenced
this pull request
Jan 19, 2016
|
|
jreback |
e8fbabd
|
kayvonr
commented
Jan 27, 2016
|
Hey all - any estimate of when this will go out in a production release? Encountering this bug very, very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon. Thanks |
|
planning on a RC in about 2 weeks, so release should be roughly mid-feb or so |
jdeschenes commented Dec 7, 2015
closes #11786
Fixed an issue with thread safety when calling read_csv with a StringIO object.
The issue was caused by a misplaced PyGILState_Ensure()