PERF: Removed the GIL from parts of the TextReader class #11272

Merged
merged 1 commit into from Nov 4, 2015

Conversation

Projects
None yet
3 participants
Contributor

jdeschenes commented Oct 9, 2015

The GIL was removed around the tokenizer functions and the conversion function(_string_convert excluded).

Benchmark:

Data Generation:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000000,10))
df.to_csv('test.csv')

Benchmark Code:

import pandas as pd
from pandas.util.testing import test_parallel

def f():
    for i in range(4):
        pd.read_csv('test.csv', index_col=0)

@test_parallel(4)
def g():
    pd.read_csv('test.csv', index_col=0)

Before:

In [4]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.3 s per loop

In [7]: %timeit f()
1 loops, best of 3: 9.15 s per loop

In [8]: %timeit g()
1 loops, best of 3: 9.25 s per loop

After:

In [6]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.35 s per loop

In [9]: %timeit f()
1 loops, best of 3: 9.55 s per loop

In [10]: %timeit g()
1 loops, best of 3: 4.38 s per loop

jreback added this to the 0.17.1 milestone Oct 9, 2015

@jreback jreback commented on an outdated diff Oct 9, 2015

asv_bench/benchmarks/gil.py
@@ -320,3 +325,21 @@ def time_nogil_kth_smallest(self):
def run(arr):
algos.kth_smallest(arr, self.k)
run()
+
+
+class nogil_read_csv(object):
+ number = 1
+ repeat = 5
+
+ def setup(self):
+ if (not have_real_test_parallel):
+ raise NotImplementedError
+ # Using the values
+ self.data = '0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n 0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n 0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n 0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n 0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n '
@jreback

jreback Oct 9, 2015

Contributor

I would just do use random here. e.g. DataFrame(np.random.randn(10000,10)) (or whatever makes this a reasonable time

also can you add a tests that uses object and another that has a datetime parse in the index_col

@jreback jreback commented on an outdated diff Oct 9, 2015

pandas/parser.pyx
cdef int status
status = tokenize_nrows(self.parser, nrows)
if self.parser.warn_msg != NULL:
- print >> sys.stderr, self.parser.warn_msg
+ # Need to rewrite this without the GIL
@jreback

jreback Oct 9, 2015

Contributor

this is ok, as doesn't happen that often

Contributor

jreback commented Oct 9, 2015

some windows cythoning errors.

odd they don't show up for you, what platform are you testing on?

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git checkout -b jdeschenes-nogil_csv master
Switched to a new branch 'jdeschenes-nogil_csv'

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git pull https://github.com/jdeschenes/pandas.git nogil_csv
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
From https://github.com/jdeschenes/pandas
 * branch            nogil_csv  -> FETCH_HEAD
Updating 26db172..2da353e
Fast-forward
 asv_bench/benchmarks/gil.py |  23 ++++
 pandas/parser.pyx           | 269 +++++++++++++++++++++++++++++++-------------
 2 files changed, 213 insertions(+), 79 deletions(-)

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>make
python setup.py build_ext --inplace
make: python: Command not found
Makefile:2: recipe for target `tseries' failed
make: *** [tseries] Error 127

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
cythoning pandas\parser.pyx to pandas\parser.c

Error compiling Cython file:
------------------------------------------------------------
...
            # in the hash table
            if k != na_hashset.n_buckets:
                na_count[0] += 1
                data[0] = NA
            else:
                data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                         ^
------------------------------------------------------------

pandas\parser.pyx:1538:42: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                        data[0] = NA
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)
            data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                     ^
------------------------------------------------------------

pandas\parser.pyx:1557:38: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                na_count[0] += 1
                data[0] = NA
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1680:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1688:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1758:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1778:30: Calling gil-requiring function not allowed without gil
building 'pandas.parser' extension
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ipandas/src/klib -Ipandas/src -IC:\Miniconda\envs\pandas3.5\lib\site-packages\numpy\core\in
clude -IC:\Miniconda\envs\pandas3.5\include -IC:\Miniconda\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 1
4.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\in
clude\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas\parser.c /Fobuild\temp.win
-amd64-3.5\Release\pandas\parser.obj
parser.c
pandas\parser.c(1): fatal error C1189: #error:  Do not use this file, it is the result of a failed Cython compilation.
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>
Contributor

jdeschenes commented Oct 9, 2015

I fixed a few issues with the build

Contributor

jreback commented Oct 9, 2015

builds ,couple of errors on windows

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>nosetests -A "not network" pandas\io\tests\test_parsers.py
...........S.........................................................................F......................................S....................S..........................S...........................
................................F........S..................................................................................................................................S...........................
..............................................
======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserHighMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

----------------------------------------------------------------------
Ran 446 tests in 12.620s

FAILED (SKIP=6, failures=2)
Contributor

jdeschenes commented Oct 9, 2015

This was an issue with python3 and not limited to windows.

@jreback jreback and 1 other commented on an outdated diff Oct 9, 2015

asv_bench/benchmarks/gil.py
+ self.df.to_csv('__test__.csv')
+
+ self.rng = date_range('1/1/2000', periods=10000)
+ self.df_date_time = DataFrame(np.random.randn(10000, 10), index=self.rng)
+ self.df_date_time.to_csv('__test_datetime__.csv')
+
+ self.df_object = DataFrame('foo', index=self.df.index, columns=self.create_cols('object'))
+ self.df_object.to_csv('__test_object__.csv')
+
+ def create_cols(self, name):
+ return [('%s%03d' % (name, i)) for i in range(5)]
+
+ def time_nogil_read_csv(self):
+ @test_parallel(num_threads=2)
+ def run(arr):
+ read_csv('__test__.csv', sep=',', header=None, float_precision=None)
@jreback

jreback Oct 9, 2015

Contributor

is their a cleanup function on these? (e.g. to remove the created csvs)?

@qwhelan ?

@jdeschenes

jdeschenes Oct 27, 2015

Contributor

There isn't. I had a look at a few tests and they don't seem to cleanup either... see test.h5

@jreback jreback and 1 other commented on an outdated diff Oct 9, 2015

pandas/parser.pyx
@@ -1452,6 +1450,20 @@ cdef _to_fw_string(parser_t *parser, int col, int line_start,
result = np.empty(line_end - line_start, dtype='|S%d' % width)
data = <char*> result.data
+ with nogil:
+ error = _to_fw_string_internal(parser, col, line_start, line_end, width, data)
+ if error != 0:
+ raise CParserError("Unknown error raised")
@jreback

jreback Oct 9, 2015

Contributor

are these ever hit?

@jdeschenes

jdeschenes Oct 13, 2015

Contributor

No.

_to_fw_string_internal function is essentially a wrapper around strncpy. It can never fail.

I wanted to cover the bases if the internal function implementation is changed. I can remove it.

@jreback

jreback Oct 25, 2015

Contributor

ok

Contributor

jreback commented Oct 13, 2015

@jdeschenes

  • couple of questions for you above.
  • can you post the asv results here.
  • pls add a note in the performance section (use this PR number)
  • squash
Contributor

jdeschenes commented Oct 13, 2015

@jreback

Where do I need to add information in the performance section? Is it in the what's new file in the documentation?

Contributor

jreback commented Oct 13, 2015

whatsnew/v0.17.1 (Performance section)

Contributor

jreback commented Oct 25, 2015

@jdeschenes looks good. can you

  • add a note in whatsnew (performance section)
  • rebase
  • squash
Contributor

jreback commented Nov 2, 2015

can you update according to comments

mrocklin commented Nov 3, 2015

What's the status on this @jdeschenes? I'd like to include this work in a talk happening tomorrow. It'd be awesome to be able to say that this was in master rather than in a branch.

Contributor

jreback commented Nov 3, 2015

you can say slated for 0.17.1 :)

Contributor

jdeschenes commented Nov 3, 2015

I will get the final changes tomorrow.

Contributor

jdeschenes commented Nov 4, 2015

@jreback, The changes have been implemented. Let me know if there is anything else that needs to be done.

@jreback jreback commented on an outdated diff Nov 4, 2015

pandas/parser.pyx
@@ -1452,6 +1449,18 @@ cdef _to_fw_string(parser_t *parser, int col, int line_start,
result = np.empty(line_end - line_start, dtype='|S%d' % width)
data = <char*> result.data
+ with nogil:
+ _to_fw_string_internal(parser, col, line_start, line_end, width, data)
@jreback

jreback Nov 4, 2015

Contributor

call this: _to_fw_string_nogil instead

@jreback jreback commented on an outdated diff Nov 4, 2015

doc/source/whatsnew/v0.17.1.txt
@@ -60,6 +60,8 @@ Performance Improvements
- Release the GIL on most datetime field operations (e.g. ``DatetimeIndex.year``, ``Series.dt.year``), normalization, and conversion to and from ``Period``, ``DatetimeIndex.to_period`` and ``PeriodIndex.to_timestamp`` (:issue:`11263`)
- Release the GIL on some srolling algos (``rolling_median``, ``rolling_mean``, ``rolling_max``, ``rolling_min``, ``rolling_var``, ``rolling_kurt``, `rolling_skew`` (:issue:`11450`)
+- Release the GIL when reading a file using the TextReader class(``read_csv``, ``read_table``).
+ The GIL is now released around the tokenizer functions and the conversion functions(_string_convert excluded).
@jreback

jreback Nov 4, 2015

Contributor

just say: Release the GIL when reading and parsing text files inpd.read_csv, (:issue:12272)

@jreback jreback commented on an outdated diff Nov 4, 2015

pandas/parser.pyx
lines = line_end - line_start
result = np.empty(lines, dtype=np.float64)
data = <double *> result.data
+ na_fset = kset_float64_from_list(na_flist)
+ with nogil:
+ error = _try_double_internal(parser, col, line_start, line_end,
+ na_filter, na_hashset, use_na_flist, na_fset, NA, data, &na_count)
@jreback

jreback Nov 4, 2015

Contributor

similarly name this

Contributor

jreback commented Nov 4, 2015

@jdeschenes thanks, just some small comments. ping when pushed. pls also post a short benchmark in the top of the PR (you can just run before/after in ipython via timeit if you want), mainly for posterity.

@jdeschenes jdeschenes PERF: Released the GIL from parts of the TextReader class
The GIL was released around the tokenizer functions and the conversion function(_string_convert excluded).
091df3e
Contributor

jdeschenes commented Nov 4, 2015

@jreback: Added the benchmarks.

Contributor

jreback commented Nov 4, 2015

looks good

ping when green

mrocklin commented Nov 4, 2015

Ping

@jreback jreback added a commit that referenced this pull request Nov 4, 2015

@jreback jreback Merge pull request #11272 from jdeschenes/nogil_csv
PERF: Removed the GIL from parts of the TextReader class
774411c

@jreback jreback merged commit 774411c into pandas-dev:master Nov 4, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor

jreback commented Nov 4, 2015

thanks @jdeschenes

and @mrocklin for the pings!

jdeschenes deleted the jdeschenes:nogil_csv branch Nov 4, 2015

@khs26 khs26 added a commit to khs26/pandas that referenced this pull request Nov 6, 2015

@khs26 khs26 Merge pull request #1 from pydata/master
Merge pull request #11272 from jdeschenes/nogil_csv
fd4fc5f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment