ENH: Allow dropna to accept floats [0, 1] as thresh values #40676

JoshuaC3 · 2021-03-29T10:10:41Z

Is your feature request related to a problem?

I wish I could use pandas to drop NaN values using a threshold that is a fraction of the total column/row, not an absolute number. See detailed example below.

Describe the solution you'd like

thresh : int or float, optional
If int, require that many non-NA values; if float, require that fraction of non-NA values.

API breaking implications

This only needs thresh to be extended to accept floats as well as ints.

Describe alternatives you've considered

None. IMHO this solution is too simple and effective to consider other options.

Additional context

Example

import numpy as np
import pandas as pd
import missingno as msno


X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)

y = 3 * np.sin(X.linear) + X.ones + X.rand
y = y.rename('sin_target')

X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones', 'rand']] = np.nan
y.loc[23:25] = np.nan

Xy = pd.concat([X, y], axis=1)

msno.matrix(Xy)

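# Keep only the columns with at least 70% non-NA values.
Xy_t = Xy.dropna(thresh=0.7 * len(Xy), axis=1)
# Proposed: Xy_t = Xy.dropna(thresh=0.7, axis=1)
msno.matrix(Xy_t)

# Keep only the rows with at least 40% non-NA values.
# (Rows are measured against the column count, hence axis=0 with Xy.shape[1].)
Xy_t = Xy.dropna(thresh=0.4 * Xy.shape[1], axis=0)
# Proposed: Xy_t = Xy.dropna(thresh=0.4, axis=0)
msno.matrix(Xy_t)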

It should be obvious that this is very useful in cases where the df axis size is liable to change, and when chaining calls with pipe (it saves the extra line of code calculating thresh_int = 0.4 * Xy.shape[1]).
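For anyone who wants the fractional behaviour today, here is a minimal sketch of a pipe-friendly wrapper. The name dropna_frac and its signature are hypothetical, not part of pandas; it only converts the fraction to an absolute count and forwards it to the existing thresh argument.

```python
import math

import numpy as np
import pandas as pd


def dropna_frac(df: pd.DataFrame, frac: float, axis: int = 0) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): drop rows (axis=0) or
    columns (axis=1) whose share of non-NA values is below `frac`."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must lie in [0, 1]")
    # Rows are judged against the number of columns,
    # columns against the number of rows.
    n = df.shape[1] if axis == 0 else df.shape[0]
    # Round up so that "at least frac of the values" is required.
    return df.dropna(thresh=math.ceil(frac * n), axis=axis)


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [1.0, np.nan, np.nan, np.nan],
        "c": [1.0, 2.0, np.nan, 4.0],
    }
)

# Columns need at least ceil(0.7 * 4) = 3 non-NA values.
kept_cols = df.pipe(dropna_frac, 0.7, axis=1)
# Rows need at least ceil(0.5 * 3) = 2 non-NA values.
kept_rows = df.pipe(dropna_frac, 0.5, axis=0)
```

Rounding up is one possible choice; floating-point error in frac * n can occasionally push the ceiling one higher than expected, so a real implementation would need to decide how to handle that.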

attack68 · 2021-03-29T11:59:32Z

This seems like a reasonably sensible and simple extension, although it might need careful coding around 1 (int) versus 1.00 (float); 0 (int) and 0.00 (float) mean the same thing.
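To make the int/float ambiguity concrete, here is a rough sketch of how such a dispatch might look. resolve_thresh is a hypothetical function for illustration, not pandas internals, and it assumes the float branch rounds up.

```python
import math
from numbers import Integral, Real


def resolve_thresh(thresh, n: int) -> int:
    """Hypothetical resolver: turn an int-or-float `thresh` into an
    absolute count of required non-NA values out of `n`."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(thresh, bool):
        raise TypeError("thresh must be an int or a float, not a bool")
    # Integers keep the current meaning: an absolute count.
    if isinstance(thresh, Integral):
        return int(thresh)
    # Other real numbers are read as a fraction of n.
    if isinstance(thresh, Real):
        if not 0.0 <= thresh <= 1.0:
            raise ValueError("a float thresh must lie in [0, 1]")
        return math.ceil(thresh * n)
    raise TypeError("thresh must be an int or a float")


# With 50 values along the axis, 1 and 1.0 mean very different things,
# while 0 and 0.0 coincide.
one_int = resolve_thresh(1, 50)
one_float = resolve_thresh(1.0, 50)
zero_int = resolve_thresh(0, 50)
zero_float = resolve_thresh(0.0, 50)
```

The type of the argument alone decides the interpretation here, which is exactly the part that would need care: a value computed as a float upstream silently changes meaning.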

I don't necessarily recommend this but while I'm looking at it there may be a case for more advanced missing data pipelining for thresh to actually be a callable, based on the row, column that is being sampled.

jreback · 2021-03-29T12:19:54Z

this is a duplicate issue - pls search (prior one is closed as this is not a good api)

jreback · 2021-03-29T12:49:08Z

duplicate of #35299

if you want to propose a new api go ahead, though am loath to add any additional keywords.

rhshadrach · 2021-03-30T01:07:03Z

I don't see why the existing alternative mentioned here is not sufficient.

@attack68

"I don't necessarily recommend this but while I'm looking at it there may be a case for more advanced missing data pipelining for thresh to actually be a callable, based on the row, column that is being sampled."

That shouldn't go in the dropna method though, right? That sounds more like a general filter (there was a recent issue on this).
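As an illustration of the "general filter" route, a per-column or per-row rule can already be expressed with a boolean mask and .loc, without touching dropna. The rule keep_column below is a hypothetical example, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [1.0, np.nan, np.nan, np.nan],
        "c": [1.0, 2.0, np.nan, 4.0],
    }
)


def keep_column(col: pd.Series) -> bool:
    """Hypothetical rule: keep a column if at least 70% of it is non-NA."""
    return col.notna().mean() >= 0.7


# Apply the rule to every column and select the columns that pass.
cols_kept = df.loc[:, df.apply(keep_column)]

# The same idea for rows, written as a callable so it chains cleanly:
# keep rows in which at least half of the values are non-NA.
rows_kept = df.loc[lambda d: d.notna().mean(axis=1) >= 0.5]
```

Any callable that returns a boolean per row or column fits this pattern, which is why it may belong in a filter rather than in dropna itself.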

attack68 · 2021-03-30T12:42:14Z

@rhshadrach yes, probably right. I am not up to date with discussions on this; it was more a high-level view that more technical missing-data routines are becoming more necessary (particularly in my field) and pandas might get requests for flexibility in this regard. As for where this should go, I can agree dropna might serve a more basic purpose and it is best to keep it basic.

JoshuaC3 added the Enhancement and Needs Triage labels Mar 29, 2021

attack68 added the Missing-data and good first issue labels and removed the Needs Triage label Mar 29, 2021

JoshuaC3 mentioned this issue Mar 29, 2021

ENH: add percentage threshold to DataFrame.dropna #35299

Closed

attack68 removed the good first issue label Mar 29, 2021

mroeschke added the Needs Discussion label Aug 19, 2021
