Slowness for DataFrame.where/mask when dtype=bool #3733

@dalejung

import numpy as np
import pandas as pd
import pandas.util.testing as tm  # moved to pandas.testing in later releases

data = np.random.randn(10000, 500)
df = pd.DataFrame(data)
df = df.where(df > 0) # create nans
bools = df > 0

reg_mask = bools.mask(pd.isnull(df)) # walltime 10s
float_mask = bools.astype(float).mask(pd.isnull(df)) # walltime 200ms
tm.assert_frame_equal(reg_mask, float_mask) # success

I found that when masking a boolean DataFrame, converting it to float first speeds up the operation considerably (roughly 10s vs. 200ms in the example above). The outputs are equivalent as long as at least one row is masked in each column.
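For anyone who wants to reproduce the comparison, here is a rough timing sketch. The helper names `time_call` and `compare_mask_paths` are made up for illustration, and the frame is smaller than in the report so it finishes quickly. The absolute numbers, and whether the gap shows up at all, will depend on the pandas version. The bool result is cast to float before comparing because the dtype it comes back with also varies between versions.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_mask_paths(n_rows=2000, n_cols=50, seed=0):
    """Time masking a bool frame directly vs. after casting to float."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)))
    df = df.where(df > 0)  # non-positive values become NaN
    bools = df > 0
    nulls = df.isnull()

    # Path 1: mask the bool frame as-is.
    reg_mask, t_bool = time_call(lambda: bools.mask(nulls))
    # Path 2: cast to float first, then mask.
    float_mask, t_float = time_call(lambda: bools.astype(float).mask(nulls))
    return reg_mask, float_mask, t_bool, t_float


reg_mask, float_mask, t_bool, t_float = compare_mask_paths()
print(f"bool mask:  {t_bool:.4f}s")
print(f"float mask: {t_float:.4f}s")

# Compare values only; normalise the bool-path result to float first.
pd.testing.assert_frame_equal(reg_mask.astype(float), float_mask)
```

A single call per path is a coarse measurement; `timeit` with several repeats would give steadier numbers.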

Note that the mask is slow even when only one column is masked:

data = np.random.randn(10000, 500)
df = pd.DataFrame(data)
df.iloc[:, 0] = np.nan # .ix is deprecated; set the whole first column to NaN
bools = df > 0

reg_mask = bools.mask(pd.isnull(df)) # walltime 10s
# dtypes: bool(499), float64(1)

Labels: Performance (memory or execution speed performance)