BUG: Parameter converters when using the read function. #59026

Thanaraklee · 2024-06-17T06:55:10Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

def a_cleaning(value: object) -> object:
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value

data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)

# converters
df = pd.read_csv('data.csv',
                  converters={
                      'A': a_cleaning
                  })
print('converters:')
display(df)

# apply
print('apply:')
df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)

Issue Description

I'm wondering why passing my function through converters turns NaN values into empty strings (''), while running the same function through apply after reading leaves them as NaN.

My function:

def a_cleaning(value: object) -> object:
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value

My Dataframe:

data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)

My code when using converters:

df = pd.read_csv('data.csv',
                  converters={
                      'A': a_cleaning
                  })
display(df)

Result: (screenshot not preserved; the missing values appear as '')

My code when using apply:

df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)

Result: (screenshot not preserved; the missing values remain NaN)

Why are the results different?
I'm not sure if this issue will occur with other read functions. I've only tested it with read_csv so far.

Expected Behavior

It should produce the same result as using apply.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.2.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.7.5
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.5.0
xlrd : None
zstandard : 0.19.0
tzdata : 2024.1
qtpy : None
pyqt5 : None


pongpatapee · 2024-06-25T19:43:41Z

Hi,

This looks interesting to me. I'd like to have a go at this.
This will be my first time contributing, please let me know if something should be done a different way.

Thank you :)

pongpatapee · 2024-06-25T19:43:45Z

take

Thanaraklee · 2024-10-06T06:03:12Z

take

vamsi-verma-s · 2024-10-07T02:53:20Z

The function receives different inputs in the two cases: with converters it gets the raw string data from the CSV, while with apply it gets the elements of an already-parsed DataFrame. The conversion in read_csv happens before the DataFrame is created, so a missing field reaches the converter as an empty string rather than as NaN.

In [12]: def a_cleaning(value: object) -> object:
    ...:     print(value)
    ...:     print(f'type: {type(value)}')
    ...:     if isinstance(value, str):
    ...:         return value.replace(',','')
    ...:     else:
    ...:         return value
    ...: 

In [13]: data = pd.DataFrame({
    ...:     'A':['1,200',np.nan,'400','200',np.nan]
    ...: })
    ...: csv = data.to_csv(index=False)

In [14]: df = pd.read_csv(StringIO(csv),
    ...:                   converters={
    ...:                       'A': a_cleaning
    ...:                   })
    ...: display(df)
1,200
type: <class 'str'>

type: <class 'str'>
400
type: <class 'str'>
200
type: <class 'str'>

type: <class 'str'>
      A
0  1200
1      
2   400
3   200
4      

In [15]: df = pd.read_csv(StringIO(csv))
    ...: df['A'] = df['A'].apply(a_cleaning)
    ...: display(df)
1,200
type: <class 'str'>
nan
type: <class 'float'>
400
type: <class 'str'>
200
type: <class 'str'>
nan
type: <class 'float'>
      A
0  1200
1   NaN
2   400
3   200
4   NaN
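
Building on the output above, here is a minimal sketch of a converter that handles the empty string itself, so that the converters path ends up with the same missing values as the apply path. The name a_cleaning_nan is made up for illustration, and treating only '' as missing is an assumption; other NA markers in a real file (such as 'NA' or 'null') would need their own handling.

```python
from io import StringIO

import numpy as np
import pandas as pd


def a_cleaning_nan(value: str) -> object:
    # Hypothetical variant of a_cleaning: converters receive the raw cell
    # text, so a missing field shows up here as '' and not as NaN.
    if value == "":
        return np.nan
    return value.replace(",", "")


# Same data as in the report, kept in memory instead of written to disk.
csv_text = pd.DataFrame(
    {"A": ["1,200", np.nan, "400", "200", np.nan]}
).to_csv(index=False)

# Path 1: clean while parsing, mapping '' back to NaN inside the converter.
via_converters = pd.read_csv(StringIO(csv_text), converters={"A": a_cleaning_nan})

# Path 2: let read_csv detect the missing values, then clean with apply.
via_apply = pd.read_csv(StringIO(csv_text))
via_apply["A"] = via_apply["A"].apply(
    lambda v: v.replace(",", "") if isinstance(v, str) else v
)

# Both columns should now flag the same rows as missing.
print(via_converters["A"].isna().tolist())
print(via_apply["A"].isna().tolist())
```

This only works around the difference in user code; it does not change how read_csv passes values to converters.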

Thanaraklee added the Bug and Needs Triage labels Jun 17, 2024

rhshadrach added the IO CSV label Jun 19, 2024

github-actions bot assigned pongpatapee Jun 25, 2024

github-actions bot assigned Thanaraklee Oct 6, 2024
