Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Comparisons with non-comparable objects in DataFrame return True #4537

Closed
jtratner opened this issue Aug 11, 2013 · 17 comments
Closed

BUG: Comparisons with non-comparable objects in DataFrame return True #4537

jtratner opened this issue Aug 11, 2013 · 17 comments
Milestone

Comments

@jtratner
Copy link
Contributor

I can't tell if this is expected or not (taken from part of test_where_datetime in tests/test_frame). The change in #3311 (which the test case came from) didn't specify what should happen with non-datetime comparisons. Right now they return True, which doesn't really make sense. They should probably return False instead.

In [16]: df = DataFrame(dict(A = date_range('20130102',periods=5),
                            B = date_range('20130104',periods=5),
                            C = np.random.randn(5)))

In [17]: stamp = datetime(2013,1,3)

In [18]: df
Out[18]:
                    A                   B         C
0 2013-01-02 00:00:00 2013-01-04 00:00:00 -0.321163
1 2013-01-03 00:00:00 2013-01-05 00:00:00  1.219840
2 2013-01-04 00:00:00 2013-01-06 00:00:00 -1.048629
3 2013-01-05 00:00:00 2013-01-07 00:00:00 -0.861459
4 2013-01-06 00:00:00 2013-01-08 00:00:00  1.480088

In [19]: df > stamp
Out[19]:
       A     B     C
0  False  True  True
1  False  True  True
2   True  True  True
3   True  True  True
4   True  True  True

In [20]: df < stamp
Out[20]:
       A      B     C
0   True  False  True
1  False  False  True
2  False  False  True
3  False  False  True
4  False  False  True

In [21]: df == stamp
Out[21]:
       A      B      C
0  False  False  False
1   True  False  False
2  False  False  False
3  False  False  False
4  False  False  False

In [22]: df != stamp
Out[22]:
       A     B     C
0   True  True  True
1  False  True  True
2   True  True  True
3   True  True  True
4   True  True  True

In [23: pd.__version__
Out[23]: '0.12.0-154-g7fd6b20'

Column C probably should be False for every case.

@jreback
Copy link
Contributor

jreback commented Aug 11, 2013

what do we so for the reverse case?
compare a float with a datetime / or an object (string)?
I would actually like it to be nan (but then you would always have to fill it)
I agree False if you have an incomparable across dtypes is prob appropriate

@jtratner
Copy link
Contributor Author

@jreback this is easily fixed actually, fi you agree with this, I have a PR brewing.

@jtratner
Copy link
Contributor Author

@jreback it actually fails pretty spectacularly with integers:

>>> df > 2
Traceback
   ...
AttributeError: 'int' object has no attribute 'view'

@jtratner
Copy link
Contributor Author

more weird behavior:

In [3]: df = DataFrame(dict(A = date_range('20130102',periods=5),
                            B = date_range('20130104',periods=5),
                            C = np.random.randn(5)))

In [4]: df
Out[4]:
                    A                   B         C
0 2013-01-02 00:00:00 2013-01-04 00:00:00  1.029470
1 2013-01-03 00:00:00 2013-01-05 00:00:00 -0.589214
2 2013-01-04 00:00:00 2013-01-06 00:00:00 -1.979968
3 2013-01-05 00:00:00 2013-01-07 00:00:00  0.196116
4 2013-01-06 00:00:00 2013-01-08 00:00:00 -0.190880

In [5]: df > np.array([2])
Out[5]:
      A     B      C
0  True  True  False
1  True  True  False
2  True  True  False
3  True  True  False
4  True  True  False

In [6]: df == np.array([2])
Out[6]:
       A      B      C
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False

In [7]: df != np.array([2])
Out[7]:
      A     B     C
0  True  True  True
1  True  True  True
2  True  True  True
3  True  True  True
4  True  True  True

In [8]: df < np.array([2])
Out[8]:
       A      B     C
0  False  False  True
1  False  False  True
2  False  False  True
3  False  False  True
4  False  False  True

@jtratner
Copy link
Contributor Author

apparently this is not in the language spec and changes behavior in Python 3 as well - http://stackoverflow.com/questions/3270680/how-does-python-compare-string-and-int

@cpcloud
Copy link
Member

cpcloud commented Aug 11, 2013

in your example though they are ultimately comparing integers since that's how datetime64 is represented

@jtratner
Copy link
Contributor Author

@cpcloud ah, good to know.

@jtratner
Copy link
Contributor Author

@cpcloud but example mostly stays the same for things like string dataframes compared to integers, etc... it's kind of an edge case that shouldn't be relied upon.

@cpcloud
Copy link
Member

cpcloud commented Aug 11, 2013

i think we should leave comparison (with datetimes) as is...though i agree that's it's weird to compare a datetime64 to an integer (and if someone is writing code mixing abstractions like this then they are probably doing something wrong, better to keep abstractions on the same level)

what were you thinking of doing instead of just letting python's/numpy's comparison behavior do its thing? if two types are not equal then their values are not equal? maybe just for object arrays? seems hard since pandas is really just sort of "numeric", "object", and "datetimes" but lets you retain dtypes if you want. so comparing int to float should be valid (like it is now), but comparing object to not object should never return True? what about in the case of integer object arrays....seems like a mountain of complexity to deal with this edge case, maybe i'm overthinking it a bit though and you have a really elegant solution in mind

@jtratner
Copy link
Contributor Author

@cpcloud honestly, I think we should just leave it, but have its fill value in comparisons be False for <, >, <=, >=, ==, and True for !=. So this would be a very minor change. Comparisons against nan might need to be fixed, but it's somewhat separate.

@jtratner
Copy link
Contributor Author

@cpcloud or maybe no change at all. Just weird that the fill value is True in _comp_method.

Here's what it is now:

def _comp_method(func, name, str_rep):
    @Appender('Wrapper for comparison method %s' % name)
    def f(self, other):
        if isinstance(other, DataFrame):    # Another DataFrame
            return self._compare_frame(other, func, str_rep)
        elif isinstance(other, Series):
            return self._combine_series_infer(other, func)
        else:

            # straight boolean comparisions we want to allow all columns
            # (regardless of dtype to pass thru)
            return self._combine_const(other, func, raise_on_error = False).fillna(True).astype(bool)

    f.__name__ = name

    return f

I think it should be:

def _comp_method(func, name, str_rep):
    @Appender('Wrapper for comparison method %s' % name)
    def f(self, other):
        if isinstance(other, DataFrame):    # Another DataFrame
            return self._compare_frame(other, func, str_rep)
        elif isinstance(other, Series):
            return self._combine_series_infer(other, func)
        else:

            if func == operator.ne:
                masker = True
            else:
                masker = False
            # straight boolean comparisions we want to allow all columns
            # (regardless of dtype to pass thru)
            return self._combine_const(other, func, raise_on_error = False).fillna(masker).astype(bool)

    f.__name__ = name

    return f

Similar to how @jreback set it up in the Series refactor

@jreback
Copy link
Contributor

jreback commented Aug 11, 2013

here's an idea

generally numeric operations in a frame are done using the numeric columns only (using _get_numeric_data)

I think for comparisons you could do something similar, only allow a compatible dtype comparator on the rhs (eg say its a datetime then exclude non-datetimes on the lhs), this prevents the integer/datetime comparison issues; booleans prob need some special consideration here (there is a _get_boolean_data); easy to create other of these

so for frames this makes perfect sense
but series prob needs to raise on invalid comparisons by default

@jtratner
Copy link
Contributor Author

@jreback would this still work:

df = DataFrame({"A": [0, 1.2, np.nan], "B": [datetime.now(), datetime.now(), datetime.now()]})
df[ df > 0]

@jreback
Copy link
Contributor

jreback commented Aug 11, 2013

this is why the comparisons return True by default (when the comparisons fail, eg fillna(True)), so it passes thru on the selection

so will pass thru except where the A value = 0;

I don't think this should change

@jtratner
Copy link
Contributor Author

okay, then I guess this particularly thing should be left as-is? (since that's already the behavior that's happening).

A little weird for this to happen

In [1]: from pandas import *

In [2]: %paste
row = [datetime(1, 1, 1), datetime(2, 2, 2), datetime(3, 3, 3)]
df = DataFrame({"A": row, "B": row})
df > 3

## -- End pasted text --
Out[2]:
      A     B
0  True  True
1  True  True
2  True  True

@jtratner
Copy link
Contributor Author

but I guess you could chain the comparisons...which probably makes sense, like:

(df > 3) & (df < datetime(2, 2, 3))

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

@jtratner I think we are not changing this, correct, close?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants