ENH: add errors='coerce' to DataFrame.astype #48781

@joooeey

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could quickly convert a DataFrame containing some invalid data to a numeric dtype, coercing the invalid entries. I thought pd.DataFrame.astype could do that, but it has no option to coerce invalid data to NaN (or NaT).

In my particular case I have a DataFrame of sensor readings with mostly NaNs (indicating no value received), many integers (the values I care about), and some strings (indicating specific errors). I wanted a quick histogram to get an overview of that data, but pd.DataFrame.hist requires numeric data, and getting there takes a few lines of code. This is exploratory code I write in my console, so it would be sweet if this could be done with a single method call.

Toy Example

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])

df.astype(float, errors="coerce")
# ValueError: Expected value of kwarg 'errors' to be one of ['raise', 'ignore'].
# Supplied value is 'coerce'

import matplotlib.pyplot as plt
plt.hist(df.values.flatten(), bins=[0, 1, 2])

Expected result:

In [30]: df
Out[30]: 
     0    1    2    3
0  NaN  0.1  1.1  1.6
1  NaN  0.2  1.2  1.7
2  0.3  NaN  1.3  1.8
3  0.4  1.4  NaN  1.9

Feature Description

Two options:

  • Allow multidimensional input (e.g. DataFrames, NumPy arrays) as the arg of pd.to_numeric. For mixed-type columns (e.g. integers and floats), we'd have to decide and document whether that operates column by column or casts the whole data structure to one dtype. I'd expect column by column for DataFrames. Another open question is how to deal with multidimensional lists and tuples.

OR/AND

  • Add the option "coerce" to the errors kwarg in pd.DataFrame.astype (the current options are "raise" and "ignore"). We'd have to decide how to deal with incompatibilities between errors="coerce" and dtype, e.g. what to do if someone tries to coerce to string. I would expect an error.

To me it looks like the potential for confusing the user is a lot lower with the second option because it has fewer edge cases.
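
To make the second option more concrete, here is a rough sketch of the semantics I have in mind. The helper name astype_coerce is hypothetical and not part of pandas, and restricting it to float dtypes is my own simplification (NaN does not fit into a plain integer dtype). It is only meant to illustrate the "error on incompatible dtype" behaviour, not to suggest an implementation.

```python
import numpy as np
import pandas as pd


def astype_coerce(df: pd.DataFrame, dtype) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): cast to a float dtype,
    replacing entries that cannot be parsed with NaN."""
    target = np.dtype(dtype)
    if not np.issubdtype(target, np.floating):
        # Assumed behaviour: coercing to e.g. str is rejected with an error.
        raise TypeError(f"cannot coerce to non-float dtype {target}")
    # pd.to_numeric only accepts 1-D input, so apply it column by column.
    return df.apply(pd.to_numeric, errors="coerce").astype(target)


df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])

clean = astype_coerce(df, float)
print(clean)
```

With this sketch, astype_coerce(df, str) raises a TypeError instead of silently doing something surprising, which is the edge-case handling I would expect from errors="coerce".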

Alternative Solutions

# pd.to_numeric only accepts 1-D input, so convert column by column
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")
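
A more compact variant of the same workaround, as far as I can tell, is to pass pd.to_numeric to DataFrame.apply, which forwards the errors keyword to each column. The sketch below uses the toy data from above and np.histogram in place of plt.hist so it runs without matplotlib. Dropping the NaNs before binning is my own choice here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])

# Coerce every column; unparseable entries become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Flatten to 1-D and drop NaNs before computing the histogram.
values = numeric.to_numpy().ravel()
counts, edges = np.histogram(values[~np.isnan(values)], bins=[0, 1, 2])
print(counts)  # [4 8]
```

This still takes a couple of lines, which is why a single call on the DataFrame would be nicer for console work.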

Additional Context

No response
