Dropping non-finite entries #7314

Closed
amelio-vazquez-reina opened this issue Jun 2, 2014 · 19 comments · Fixed by #7315
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations

@amelio-vazquez-reina
Contributor

I have been looking for a solution for this for a long time. I tried the ideas in the following threads (with the latest Pandas):

  1. Keep finite entries only in Pandas
  2. Row filtering so that we only keep finite entries

but none of them work. See thread 2 above, and the comments in its only answer to see why.

What is a good way to drop indices (either rows or columns) that meet a specific criterion, such as "they contain entries that are not finite"?

@cpcloud
Member

cpcloud commented Jun 2, 2014

You can use the option mode.use_inf_as_null to do this:

In [14]: df = DataFrame({'a': randint(3,size=10)})

In [15]: df['b'] = tm.choice([2,3,nan,inf,-inf], size=len(df))

In [16]: df
Out[16]:
   a       b
0  1     inf
1  2    -inf
2  0  3.0000
3  1    -inf
4  2     NaN
5  1  3.0000
6  1     inf
7  0  2.0000
8  2    -inf
9  2     inf

In [17]: with pd.option_context('mode.use_inf_as_null', True):
   ....:     res = df.dropna()
   ....:

In [18]: res
Out[18]:
   a  b
2  0  3
5  1  3
7  0  2
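
For readers on a pandas version where this option is unavailable or spelled differently (the option name has changed across releases, as far as I know), the same result can be sketched without any option by replacing the infinities first. The frame below is made-up illustrative data, not the random frame from the session above.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the frame from the thread).
df = pd.DataFrame({
    "a": [1, 2, 0, 1, 2],
    "b": [np.inf, -np.inf, 3.0, np.nan, 2.0],
})

# Turn +inf and -inf into NaN, then drop every row that contains a NaN.
res = df.replace([np.inf, -np.inf], np.nan).dropna()
print(res)
```

Note that this drops rows with NaN as well as rows with inf, just like the option-based version.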

@jreback
Contributor

jreback commented Jun 2, 2014

Well, using the example from thread 2:

In [81]: x = pandas.DataFrame([
   ....:     [1, 2, np.inf],
   ....:     [4, np.inf, 5],
   ....:     [6, 7, 8]
   ....: ])

In [82]: x
Out[82]: 
   0         1         2
0  1  2.000000       inf
1  4       inf  5.000000
2  6  7.000000  8.000000

In [84]: np.isinf(x)
Out[84]: 
       0      1      2
0  False  False   True
1  False   True  False
2  False  False  False

In [85]: x[np.isinf(x)] = np.nan

In [86]: x.dropna()
Out[86]: 
   0  1  2
2  6  7  8

In [87]: x
Out[87]: 
   0   1   2
0  1   2 NaN
1  4 NaN   5
2  6   7   8

isn't this what you want?

(it's only slightly trickier to NOT convert the existing NaNs, if you have any)
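
One way to leave existing NaNs alone is to filter rows with a boolean mask instead of overwriting values. This is only a sketch for an all-numeric frame; the data is the example above with one NaN added for illustration, and the names `has_inf` and `kept` are made up here.

```python
import numpy as np
import pandas as pd

# The thread's example, with one NaN added (an assumption for illustration).
x = pd.DataFrame([
    [1, 2, np.inf],
    [4, np.inf, 5],
    [6, np.nan, 8],
])

# True for rows holding at least one +inf or -inf; NaN does not count as infinite.
has_inf = np.isinf(x).any(axis=1)

# Keep the remaining rows; x itself is not modified and NaNs stay in place.
kept = x[~has_inf]
print(kept)
```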

@jreback
Contributor

jreback commented Jun 2, 2014

ahhh yes...forgot about the use_inf_as_null option... +1 on that!

@cpcloud
Member

cpcloud commented Jun 2, 2014

curious that inf makes the numbers in the Series do %.2f repr instead of a %.2g-style repr, is that intentional?

@amelio-vazquez-reina
Contributor Author

Thanks @cpcloud and @jreback . Any way to just drop Inf (and non-Inf) entries when working with dfs with mixed types?

@cpcloud
Member

cpcloud commented Jun 2, 2014

what do you mean inf and non-inf? isn't that everything?

@cpcloud
Member

cpcloud commented Jun 2, 2014

oh i see ... because isfinite doesn't work on object dtypes

@cpcloud
Member

cpcloud commented Jun 2, 2014

seems like a bug, dropna doesn't work on inf when dtypes are mixed and mode.use_inf_as_null is True

@amelio-vazquez-reina
Contributor Author

Thanks @cpcloud. Yes, I have run into the problem you just mentioned before. Also, sometimes I just want to drop Inf and -Inf values (keeping NaNs untouched).

@cpcloud
Member

cpcloud commented Jun 2, 2014

that's a bit of a strange use case. i would suggest replacing nan with a placeholder string such as nan_str, then dropping inf/-inf with isnull (once I fix this), then replacing nan_str back with nan

@cpcloud
Member

cpcloud commented Jun 2, 2014

or you could use replace

@cpcloud
Member

cpcloud commented Jun 2, 2014

@ribonoous i put up the fix if you want to check it out

@cpcloud cpcloud added this to the 0.14.1 milestone Jun 2, 2014
@cpcloud cpcloud self-assigned this Jun 2, 2014
@hayd
Contributor

hayd commented Jun 2, 2014

"or you could use replace"

Is the answer to everything.

should isnull/dropna do infs by default?? It seems like they may sometimes have special meaning (different from NaN).

@cpcloud
Member

cpcloud commented Jun 2, 2014

:) replace is that person who raises their hand to answer every question whether or not they know the answer. Judging by the way things are named I would guess that this used to be the default but for some reason was changed.

@jreback
Contributor

jreback commented Jun 2, 2014

changed here: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#v0-10-0-december-17-2012 (look down a bit); not treating inf as nan makes sense as the default, as it's pretty simple to convert them if needed, and treating them as nan is semantically wrong (they are actual values)

@cpcloud
Member

cpcloud commented Jun 2, 2014

we could have an isinf that handles object dtype, but i'm not really sure how widely used inf is ... personally I almost never use it and when I do, it'll eventually be replaced by 0 or NaN or something else that's easy(ier) to work with.

@hayd
Contributor

hayd commented Jun 2, 2014

You can do applymap(np.isinf) or df.where(df.applymap(np.isinf)...

If perf is the issue convert to float!
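
A rough sketch of the element-wise idea for frames with object columns follows. Calling `np.isinf` directly on a string raises a `TypeError`, so this uses a small hypothetical helper, `is_inf`, that only tests real floats. It goes through `Series.map` column by column rather than `applymap`, an assumption made to keep it independent of the pandas version; the data is made up.

```python
import math

import numpy as np
import pandas as pd

# Made-up mixed-dtype data for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1.0, float("inf"), float("nan")],
    "mixed": ["x", float("-inf"), 3],
})

def is_inf(v):
    # Hypothetical helper: only float values can be infinite;
    # strings, ints, and other objects are reported as not infinite.
    return isinstance(v, (float, np.floating)) and math.isinf(v)

# Apply the check to every element, one column at a time.
mask = df.apply(lambda col: col.map(is_inf))

# Drop rows containing +inf or -inf anywhere; NaN rows are kept.
kept = df[~mask.any(axis=1)]
print(kept)
```

This is slow on large frames because it calls a Python function per element, which is where the advice to convert to float comes in.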

@jreback
Contributor

jreback commented Jun 2, 2014

easy enough to df._get_numeric_data()

fyi, maybe we should make a method (needs a better name maybe)

df.get_for_dtypes(list_of_dtypes), where list_of_dtypes could be actual dtypes and/or numeric/datetime
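
For comparison, pandas has a public `DataFrame.select_dtypes` method that covers much of what is proposed here. The sketch below assumes that method and uses made-up data; it picks out numeric (and optionally datetime) columns, then uses the numeric part to drop rows containing inf from a mixed-dtype frame.

```python
import numpy as np
import pandas as pd

# Made-up mixed-dtype data for illustration.
df = pd.DataFrame({
    "n": [1.5, np.inf, 3.0],
    "when": pd.to_datetime(["2014-06-02", "2014-06-03", "2014-06-04"]),
    "label": ["a", "b", "c"],
})

# Column subsets chosen by dtype.
numeric = df.select_dtypes(include=[np.number])
numeric_or_datetime = df.select_dtypes(include=[np.number, "datetime"])

# np.isinf is safe on the numeric subset, so object columns never reach it.
has_inf = np.isinf(numeric).any(axis=1)
kept = df[~has_inf]
print(kept)
```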

@TomAugspurger
Contributor

@jreback I could use something like that in #7308 for [numeric, datetime]
