intermittent segfault after take error #2892

Closed
dsm054 opened this Issue Feb 18, 2013 · 40 comments

Contributor

dsm054 commented Feb 18, 2013

Was playing around and managed to get a segfault, in several versions. It's a little intermittent, and it may have to do with how things are left after an error is encountered. The same code without the faulty take line doesn't crash for me.

>>> import pandas as pd, sys
>>> sys.version
'2.7.3 (default, Aug  1 2012, 05:16:07) \n[GCC 4.6.3]'
>>> pd.__version__
'0.11.0.dev-14a04dd'
>>> df = pd.DataFrame({'A': {1.0: 0.0, 2.0: 6.0, 3.0: 12.0}, 'C': {1.0: 2.0, 2.0: 8.0, 3.0: 14.0}, 'B': {1.0: 1.0, 2.0: 7.0, 3.0: 13.0}})
>>> df
    A   B   C
1   0   1   2
2   6   7   8
3  12  13  14
>>> 
>>> df.take(df.unstack())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/frame.py", line 2951, in take
    new_index = self.index.take(indices)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/index.py", line 416, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index out of range for array
>>> df.unstack()
A  2     0
   2     6
   3    12
B  2     1
   2     7
   3    13
C  2     2
   2     8
   3    14
Dtype: float64
>>> df.take(df.unstack())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/frame.py", line 2951, in take
    new_index = self.index.take(indices)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/index.py", line 416, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index out of range for array
>>> df.unstack()
Segmentation fault (core dumped)

Contributor

stephenwlin commented Feb 18, 2013

Not 100% sure, but it looks like a numpy issue:

In [1]: import pandas as pd, sys

In [2]: import numpy as np

In [3]: df=pd.DataFrame({'A': {1.0: 0.0, 2.0: 6.0, 3.0: 12.0},
   ...:                  'C': {1.0: 2.0, 2.0: 8.0, 3.0: 14.0},
   ...:                  'B': {1.0: 1.0, 2.0: 7.0, 3.0: 13.0}})

In [4]: i = df.index.view(np.ndarray)

In [5]: u = df.unstack().astype(np.int_).view(np.ndarray)

In [6]: i
Out[6]: array([1.0, 2.0, 3.0], dtype=object)

In [7]: i.take(u)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-8293b97d15cc> in <module>()
----> 1 i.take(u)

IndexError: index out of range for array

In [8]: i
Segmentation fault (core dumped)

Line 7 only involves numpy types, although the underlying buffers have been touched by pandas (so it's possible it's happening because pandas violated a contract somewhere). i.take doesn't seem to have any reason to affect the value of i though, since it returns a new object instead of modifying itself: I'm guessing numpy's implementation is modifying i for some optimization purpose but it wasn't coded in an exception-safe manner...leaving i in some invalid state.

Will dig in a bit more.

Contributor

dsm054 commented Feb 18, 2013

Note that the unstack results after the first take are already off (the index shouldn't be 2, 2, 3). On a different system I can get

>>> df.unstack()
A  1358910081     0
   2              6
   3             12
B  1358910081     1
   2              7
   3             13
C  1358910081     2
   2              8
   3             14

instead.

Contributor

stephenwlin commented Feb 18, 2013

Changing line 5 to u = np.asarray(df.unstack().astype(np.int_).copy()).copy() still results in a segfault, but changing line 4 to either i = np.asarray(df.index.copy()) or i = np.asarray(df.index).copy() seems to make it go away. So it seems either Index is violating some contract in subclassing ndarray, ndarray.take is buggy, or both...

EDIT: actually it seems like changing line 4 to either option above still segfaults (at least for me), but it requires multiple calls to ndarray.take to do so.

Contributor

stephenwlin commented Feb 18, 2013

@dsm054 yes, it looks like the index contents are being affected by ndarray.take, which seems like a numpy optimization gone awry (since ndarray.take doesn't semantically mutate its self argument)

Contributor

stephenwlin commented Feb 18, 2013

definitely a numpy bug (here's with dev numpy):

In [1]: import numpy as np

In [2]: u = np.asarray([ 0, 6, 12, 1, 7, 13, 2, 8, 14])

In [3]: i = np.asarray([1.0, 2.0, 3.0]).astype(object)

In [4]: i
Out[4]: array([1.0, 2.0, 3.0], dtype=object)

In [5]: i.take(u)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-8293b97d15cc> in <module>()
----> 1 i.take(u)

IndexError: index 6 is out of bounds for axis 0 with size 3

In [6]: i
Segmentation fault (core dumped)

Contributor

dsm054 commented Feb 18, 2013

Should we close this, then, and bump it upstream?

Contributor

stephenwlin commented Feb 18, 2013

well, maybe pandas should work around it by checking index bounds itself first for object arrays (at some minor efficiency cost...)? in which case we might want to leave the issue open for that.

it might be important since Index arrays are usually object-based (unless specialized via a subclass)
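
For illustration, a minimal sketch of the kind of bounds-checking wrapper being discussed (checked_take is a hypothetical name, not pandas code): validate user-supplied indices up front so numpy's buggy code path is never reached.

import numpy as np

def checked_take(arr, indexer):
    # hypothetical helper: reject out-of-range indices before numpy sees them
    indexer = np.asarray(indexer, dtype=np.int_)
    n = len(arr)
    if len(indexer) and (indexer.max() >= n or indexer.min() < -n):
        raise IndexError("index out of bounds for axis 0 with size %d" % n)
    return arr.take(indexer)

idx = np.array([1.0, 2.0, 3.0], dtype=object)
checked_take(idx, [0, 2, 1])     # fine
# checked_take(idx, [0, 6, 12])  # raises IndexError; ndarray.take is never called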

Member

wesm commented Feb 19, 2013

why don't you replace the usage of ndarray.take in Index.take with take_nd (which should not corrupt memory)?

Contributor

stephenwlin commented Feb 19, 2013

yeah, that works too...thanks! (it might be marginally faster too because ndarray.take is coded for the general nd-case but we have cython specializations for most 1-d cases)

Member

wesm commented Mar 12, 2013

What's the status here?

Contributor

stephenwlin commented Mar 12, 2013

sorry for the lack of update, this turned out to be somewhat of a rabbit hole; basically, we're exposed to this same numpy bug everywhere we call ndarray.take with user-supplied indices, which turns out to be a lot of places. it doesn't make much sense to just convert one call arbitrarily, so I started doing a blanket conversion...but that turns out to be the wrong thing because ndarray.take has different semantics with negative indices than our cythonic takes (i.e. the normal Python ones, rather than -1 being a dummy value for fill behavior)...lots of tests were broken in subtle ways before I realized the issue, and regardless it's unclear whether anyone is depending on the normal semantics already (since there's no documentation either way that this is officially supported, afaict).

i think the right thing to do is probably the following:

  1. audit the code and try to disambiguate uses of user-supplied indices and algorithm-generated indices throughout
  2. convert the former to either a bug-free cythonic take that does normal negative indices, or a bounds-checking wrapper around ndarray.take
  3. (optional) if new cythonic take is used, then get rid of the whole back and forth "platform int <-> int64" conversion mess, which is basically only there to support ndarray.take.

but that's a pretty big change...i don't have free cycles for it right now but if someone else wants to take a look they're free to
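
To make the semantic difference concrete, here is a small pure-numpy illustration (take_with_fill is a hypothetical stand-in for the fill-style cythonic take, not pandas' actual take_nd):

import numpy as np

arr = np.array([10.0, 20.0, 30.0])
indexer = np.array([0, 2, -1])

# ndarray.take: -1 wraps around and means "the last element"
arr.take(indexer)                 # array([10., 30., 30.])

def take_with_fill(values, indexer, fill_value=np.nan):
    # hypothetical sketch of the -1-as-missing convention
    out = values.take(np.where(indexer == -1, 0, indexer)).astype(float)
    out[indexer == -1] = fill_value
    return out

take_with_fill(arr, indexer)      # array([10., 30., nan])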

Contributor

stephenwlin commented Mar 12, 2013

(btw, in case someone picks this up...there's no bounds checking in take_nd in most cases, which might be ok if we can ensure that no user-supplied indices are ever passed to it, but maybe it should be addressed to be safe; it's probably too slow to turn on cython's built-in bounds check though, since that checks on every access; a better thing to do would be to add the minimal checks manually for a row/column of calls at a time)

Contributor

jreback commented Mar 12, 2013

this is the change I made in _ixs in frame.py (this was the old icol), as negative user-supplied indices were incorrect (the change was all wrapped up in the .iloc changes). I am sure, as @stephenwlin points out, there are more user-facing changes like this (prob not too many, and most prob hit this code-block anyhow)

can anyone think of a place where there are user-facing possibilities of negative indices except:

  • _ixs/icol
  • frame/take

other uses of negative indices denote labels (or hit _ixs) AFAICT

Contributor

y-p commented Mar 12, 2013

no bounds checking occasionally bites with a kindly segfault, #3011, #2775/#2778
are all those cython functions really in the inner loop?

Contributor

stephenwlin commented Mar 12, 2013

well, if all the indices are generated internally and the algorithms to do so are not buggy, then it's technically superfluous to check--there's just no guarantee that that's the case

Contributor

y-p commented Mar 12, 2013

obviously that wasn't the case in those examples.

Contributor

jreback commented Mar 12, 2013

so why don't we do negative index conversion in frame/take, and make that the only user-facing entry point (which _ixs can call); then @stephenwlin can fix the take bugs independently? could also have a flag (for safety's sake?)
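
A rough sketch of what that conversion could look like (convert_negative_indices is a hypothetical helper, not existing pandas code): map Python-style negative positions to non-negative ones once, at the user-facing boundary, before anything reaches the fill-style takes.

import numpy as np

def convert_negative_indices(indexer, length):
    # hypothetical helper: normalize user-supplied negative indices up front
    indexer = np.asarray(indexer, dtype=np.int_)
    converted = np.where(indexer < 0, indexer + length, indexer)
    if len(converted) and (converted.min() < 0 or converted.max() >= length):
        raise IndexError("indices are out of bounds for axis of length %d" % length)
    return converted

convert_negative_indices([0, 1, -1], length=8)   # array([0, 1, 7])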

Contributor

stephenwlin commented Mar 12, 2013

@y-p of course :) just saying there's a case for why it wasn't there to begin with (i didn't take it out, it was turned off before I started working on it)

Contributor

y-p commented Mar 12, 2013

generate_code.py uses @cython.boundscheck(False) across the board. maybe this can be revamped with the conversion to fused types.

Contributor

stephenwlin commented Mar 12, 2013

@jreback maybe, i don't know how many user-facing functions there are to look at actually, it's a bit hard to tell what's user-facing and what isn't in python sometimes...it might be less of a deal than i think it is

Contributor

stephenwlin commented Mar 12, 2013

@y-p i don't think we should pay for per-access bounds checking though, that'll be O(prod(shape)) overhead, unless cython is smart enough to factor out the checks

Contributor

y-p commented Mar 12, 2013

Wouldn't it be a function of access count rather than strictly the shape of the array? probably most cython code does do significant work, so point taken in any case.

I'm just thinking of reducing bug surface area, perhaps integrating a boundscheck preamble into the cython code would be better.

Contributor

jreback commented Mar 12, 2013

So user-facing take is broken as well for negative indices (because take_nd doesn't like negative indices), and the 2 cases below have to do with whether the internals take method or the indexers take is called.

In [5]: df = pd.DataFrame(np.random.rand(8,3))

In [6]: df.take([0,1,-1])
Out[6]: 
          0         1         2
0  0.610429  0.185237  0.174880
1  0.059846  0.290267  0.012082
7       NaN       NaN       NaN

In [7]: df
Out[7]: 
          0         1         2
0  0.610429  0.185237  0.174880
1  0.059846  0.290267  0.012082
2  0.719515  0.322478  0.953638
3  0.450850  0.374526  0.781923
4  0.588323  0.951343  0.995791
5  0.385074  0.276861  0.720896
6  0.149348  0.553080  0.652363
7  0.554311  0.635818  0.148375

In [8]: df.take([0,1,-1],axis=1)
Out[8]: 
          0         1   2
0  0.610429  0.185237 NaN
1  0.059846  0.290267 NaN
2  0.719515  0.322478 NaN
3  0.450850  0.374526 NaN
4  0.588323  0.951343 NaN
5  0.385074  0.276861 NaN
6  0.149348  0.553080 NaN
7  0.554311  0.635818 NaN

but works for mixed

In [9]: df['test'] = 'foo'

In [10]: df.take([0,1,-1],axis=1)
Out[10]: 
          0         1 test
0  0.610429  0.185237  foo
1  0.059846  0.290267  foo
2  0.719515  0.322478  foo
3  0.450850  0.374526  foo
4  0.588323  0.951343  foo
5  0.385074  0.276861  foo
6  0.149348  0.553080  foo
7  0.554311  0.635818  foo

but not in mixed with axis=0

In [3]: df = pd.DataFrame(dict(A = np.random.rand(8), B = 'foo'), index=range(8))

In [4]: df.take([-1],axis=0)
Exception: Indices must be nonzero and less than the axis length

Contributor

stephenwlin commented Mar 12, 2013

so should negative indices be allowed everywhere where the user can supply indices, or just through .ix and friends?

anyway, regardless of what we allow or don't allow, we need to make sure any user-supplied indices are sanitized before being passed into functions that assume properly bounded indices, so disambiguating the two is important.

@y-p maybe just add an option to generate.py to add bounds checking as a debug tool? also, i'm pretty sure it's only the take_* functions that are a potential problem, there's almost certainly no user-supplied indices going into pad_*, backfill_*, etc.

Contributor

y-p commented Mar 12, 2013

Sure, but that's useful as a way to track down segfaults after users report them.

sanitizing is good, but probably less effort to "head them off at the pass" in the cython function preamble than tracking down all the code paths that lead there.

#3011 was indeed groupby_int64 delegating to take_1d in numpy_helper.

Contributor

stephenwlin commented Mar 12, 2013

ok, i guess that's ok, but if it's just take_* etc., we can at least minimize the performance hit by writing the bounds checks manually so they don't repeat checks unnecessarily, rather than relying on the cython ones.

Contributor

y-p commented Mar 12, 2013

+1

Contributor

stephenwlin commented Mar 12, 2013

oh actually, i guess that's what you meant by doing it in the preamble...yeah that's fine; actually, you can be a bit more efficient than that (for the normal case) by doing it as you iterate, instead of iterating twice...i'll do that for now until we resolve the semantics/sanitization issue

Contributor

y-p commented Mar 12, 2013

In #3011 take_1d was in the inner loop, and it's really the caller that should have done the check in this case. Though it was another cython function.

Probably need to do case-by-case unfortunately.

Contributor

stephenwlin commented Mar 12, 2013

@y-p yeah, in take_1d there's only one loop so it needs to check every index anyway, but take_2d_* etc. can do it column-by-column or row-by-row during the outer loop
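
Roughly the idea, as a pure-Python sketch (take_2d_axis1_checked is illustrative only, not the generated cython code): each column index is validated once in the outer loop, so the inner element copies need no per-access checks.

import numpy as np

def take_2d_axis1_checked(values, indexer, fill_value=np.nan):
    n_rows, n_cols = values.shape
    out = np.empty((n_rows, len(indexer)), dtype=float)
    for j, idx in enumerate(indexer):       # one bounds check per column
        if idx == -1:
            out[:, j] = fill_value          # -1 is the fill sentinel
        elif 0 <= idx < n_cols:
            out[:, j] = values[:, idx]      # inner copy, no further checks
        else:
            raise IndexError("column index %d out of bounds" % idx)
    return out

take_2d_axis1_checked(np.arange(6.0).reshape(2, 3), [0, 2, -1])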

Contributor

stephenwlin commented Mar 12, 2013

anyway, so it's still an open question here where we want to allow negative indexing. it'll probably be simpler to only allow it through .ix and friends, to reduce the number of cases we need to support and test...and with bounds checks in the cython take implementations we can probably get away with not sanitizing elsewhere (although the ideal would be to sanitize on any user-facing function and avoid the bounds check internally; not sure how obtainable that would be at this point without a thorough review)

Member

wesm commented Mar 12, 2013

The -1 business was done as a matter of convenience. That whole mess should be replaced with a proper INT64_MIN or INT32_MIN depending on the platform / byte width. Sigh
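
A tiny sketch of that sentinel idea (illustrative only, not pandas code): reserve the most negative value of the indexer's integer type for "missing", so ordinary negative indices stay available for end-relative indexing.

import numpy as np

indexer = np.array([0, 2, -1], dtype=np.intp)
SENTINEL = np.iinfo(indexer.dtype).min   # INT64_MIN or INT32_MIN, depending on platform
missing = indexer == SENTINEL            # nothing is marked missing here; -1 stays a real index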

Contributor

jreback commented Mar 12, 2013

it is tested/works with .ix and friends, and take (though I would be open to ONLY allowing it with .iloc, but that's a different issue)

Contributor

stephenwlin commented Mar 12, 2013

so are we willing to break code that uses negative indices with Index.take, DataFrame.take, etc? guessing there's not a whole lot of it, and it's broken already for some cases (as shown by @jreback's example)

Contributor

jreback commented Mar 12, 2013

@stephenwlin DataFrame.take is fixed (for neg indices), as well as neg indices for .ix (which worked before), see #3027, I don't think Index.take is broken?

I was just musing if we should restrict .ix to non-negative (but maybe there is no good reason)

Contributor

stephenwlin commented Mar 12, 2013

yeah, DataFrame.take is broken because it mixes ndarray.take and take_nd calls, Index.take presumably isn't (other than possibly triggering a numpy segfault)...but it would probably make things easier to just not support it except at a few defined entry points rather than supporting it inconsistently?

Contributor

jreback commented Mar 12, 2013

@stephenwlin i just merged in #3027 to master. I cannot reproduce this bug (but I didn't do anything to 'fix' it), maybe the bounds checking fixed this somehow? weird

correction: it hits the exception before it can segfault....because the 3 from the unstack was out of bounds...but didn't you discover some other bug here?

Contributor

stephenwlin commented Mar 12, 2013

no that's pretty much it, that should take care of it; the only problem is if we haven't covered all the cases where we might be passing out-of-range indices into ndarray.take

Contributor

jreback commented Mar 12, 2013

ok great

it's funny that the take didn't blow up with an illegal access on the first action

should close this issue then?
(as u moved everything else to a new issue)

Contributor

stephenwlin commented Mar 13, 2013

sure, i split off another issue in case we want to do a bigger overhaul

jreback closed this Mar 13, 2013
