ENH: Allow dropna to accept floats [0, 1] as thresh values #40676

Open
JoshuaC3 opened this issue Mar 29, 2021 · 5 comments
Labels
Enhancement, Missing-data, Needs Discussion

Comments

@JoshuaC3

Is your feature request related to a problem?

I wish I could use pandas to drop NaN values using a threshold that is a fraction of the total column/row, not an absolute number. See detailed example below.

Describe the solution you'd like

thresh : int or float, optional
If int, require that many non-NA values. If float, require that fraction of non-NA values.

API breaking implications

thresh only needs to be extended to accept floats as well as ints.

Describe alternatives you've considered

None. IMHO this solution is too simple and effective to consider other options.

Additional context

Example
import numpy as np
import pandas as pd
import missingno as msno


X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)

y = 3 * np.sin(X.linear) + X.ones + X.rand
y = y.rename('sin_target')

X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones', 'rand']] = np.nan
y.loc[23:25] = np.nan

Xy = pd.concat([X, y], axis=1)

msno.matrix(Xy)

image

Xy_t = Xy.dropna(thresh=0.7 * len(Xy), axis=1)
# Proposed: Xy_t = Xy.dropna(thresh=0.7, axis=1)
msno.matrix(Xy_t)

image

# With axis=1 the non-NA values are counted down each column, so the fraction is taken of the number of rows.
Xy_t = Xy.dropna(thresh=0.4 * Xy.shape[0], axis=1)
# Proposed: Xy_t = Xy.dropna(thresh=0.4, axis=1)
msno.matrix(Xy_t)

image

This should obviously be very useful where the size of the df axis is liable to change, and when using piped functionality (it saves the extra line of code calculating thresh_int = 0.4 * Xy.shape[0]).
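
For reference, the requested behaviour can be approximated today with a small wrapper. This is only a sketch: `dropna_frac` is a hypothetical helper name, not a pandas function, and rounding the threshold up with `math.ceil` is an assumption about how a fractional threshold would be interpreted.

```python
import math

import numpy as np
import pandas as pd


def dropna_frac(df, frac, axis=0):
    """Drop rows (axis=0) or columns (axis=1) whose share of non-NA values is below frac.

    Hypothetical helper, not part of pandas. Only integer axis values are handled.
    """
    if not 0 <= frac <= 1:
        raise ValueError("frac must be between 0 and 1")
    # dropna counts non-NA values along the *other* axis: dropping columns
    # (axis=1) counts down the rows, so the denominator is df.shape[0].
    n = df.shape[1 - axis]
    # Assumption: round up, so frac=0.5 of 3 values requires 2 non-NA values.
    return df.dropna(thresh=math.ceil(frac * n), axis=axis)


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],            # 4 of 4 non-NA
        "b": [1.0, np.nan, np.nan, 4.0],      # 2 of 4 non-NA
        "c": [np.nan, np.nan, np.nan, 4.0],   # 1 of 4 non-NA
    }
)

kept = dropna_frac(df, 0.5, axis=1)
print(list(kept.columns))  # ['a', 'b']
```

Because the helper takes the frame as its first argument, it also fits the piped style mentioned above, for example `df.pipe(dropna_frac, 0.5, axis=1)`.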

@JoshuaC3 added the Enhancement and Needs Triage labels Mar 29, 2021
@attack68
Contributor

This seems like a reasonably sensible and simple extension, although it would need careful coding around the int 1 versus the float 1.00 (the int 0 and the float 0.00 mean the same thing).
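
To make the 1 versus 1.00 concern concrete, here is a small illustration using only the current API. The reading of a float 1.0 as "100% non-NA" is an assumption about the proposal, and the frame and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "full": [1.0, 2.0, 3.0, 4.0],
        "sparse": [1.0, np.nan, np.nan, np.nan],
    }
)

# Current behaviour: thresh=1 keeps any column with at least one non-NA value.
at_least_one = df.dropna(thresh=1, axis=1)

# Under the proposal, thresh=1.0 would presumably mean "100% non-NA",
# which today is spelled thresh=len(df) when dropping columns.
all_present = df.dropna(thresh=len(df), axis=1)

print(list(at_least_one.columns))  # ['full', 'sparse']
print(list(all_present.columns))   # ['full']

# Python treats 1 and 1.0 as equal, so an implementation would have to
# branch on the type of thresh rather than on its value.
print(1 == 1.0, isinstance(1, float), isinstance(1.0, float))  # True False True
```

The int 0 and the float 0.0 do not have this problem, since "at least zero values" and "at least 0% of values" both keep everything.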

I don't necessarily recommend this, but while I'm looking at it, there may be a case, for more advanced missing-data pipelining, for thresh to actually be a callable based on the row or column that is being sampled.

@attack68 added the Missing-data and good first issue labels and removed the Needs Triage label Mar 29, 2021
@jreback
Contributor

jreback commented Mar 29, 2021

this is a duplicate issue - pls search (prior one is closed as this is not a good api)

@jreback
Contributor

jreback commented Mar 29, 2021

duplicate of #35299

if you want to propose a new api go ahead, though am loath to add any additional keywords.

@rhshadrach
Member

I don't see why the existing alternative mentioned here is not sufficient.

@attack68

> I don't necessarily recommend this, but while I'm looking at it, there may be a case, for more advanced missing-data pipelining, for thresh to actually be a callable based on the row or column that is being sampled.

That shouldn't go in the dropna method though, right? That sounds more like a general filter (there was a recent issue on this).
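
One way to read "a general filter" is plain boolean-mask selection, which already covers both the fractional threshold and a callable rule without touching dropna. This is a sketch of that reading, not necessarily the alternative linked above; the frame and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [1.0, np.nan, np.nan, 4.0],
        "c": [np.nan, np.nan, np.nan, 4.0],
    }
)

# Fraction of non-NA values per column, then keep columns at or above 50%.
col_fill = df.notna().mean()
kept_cols = df.loc[:, col_fill >= 0.5]

# Same idea per row: keep rows where at least half of the values are present.
row_fill = df.notna().mean(axis=1)
kept_rows = df.loc[row_fill >= 0.5]

# A callable rule: .loc accepts a function of the frame that returns a mask,
# so the selection can be written inline in a method chain.
kept_custom = df.loc[:, lambda d: d.notna().sum() >= 2]

print(list(kept_cols.columns))    # ['a', 'b']
print(list(kept_rows.index))      # [0, 3]
print(list(kept_custom.columns))  # ['a', 'b']
```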

@attack68
Contributor

@rhshadrach yes, probably right. I am not up to date with discussions on this; it was more a high-level view that more technical missing-data routines are becoming more necessary (particularly in my field) and pandas might get requests for flexibility in this regard. As for where this should go, I can agree dropna might serve a more basic purpose and it is best to keep it basic.

@mroeschke added the Needs Discussion label Aug 19, 2021