BUG: Parameter converters when using the read function. #59026

Open
3 tasks done
Thanaraklee opened this issue Jun 17, 2024 · 4 comments
Assignees
Labels
Bug IO CSV read_csv, to_csv Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@Thanaraklee

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

def a_cleaning(value: object) -> object:
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value

data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)

# converters
df = pd.read_csv('data.csv',
                  converters={
                      'A': a_cleaning
                  })
print('converters:')
display(df)

# apply
print('apply:')
df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)

Issue Description

When I pass my function to read_csv through converters, the NaN values come back as '' (empty strings). When I instead read the file first and run the same function with apply, the NaN values stay NaN.

My function:

def a_cleaning(value: object) -> object:
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value

My Dataframe:

data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)

My code when using converters:

df = pd.read_csv('data.csv',
                  converters={
                      'A': a_cleaning
                  })
display(df)

Result:
image

My code when using apply:

df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)

Result:
image

Why are the results different?
I'm not sure whether this also occurs with the other read functions; I've only tested read_csv so far.

Expected Behavior

Using converters should produce the same result as using apply.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.2.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.7.5
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.5.0
xlrd : None
zstandard : 0.19.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@Thanaraklee Thanaraklee added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 17, 2024
@rhshadrach rhshadrach added the IO CSV read_csv, to_csv label Jun 19, 2024
@pongpatapee

Hi,

This looks interesting to me. I'd like to have a go at this.
This will be my first time contributing, please let me know if something should be done a different way.

Thank you :)

@pongpatapee

take

@Thanaraklee
Author

take

@vamsi-verma-s
Contributor

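The function gets different inputs in the two cases. With converters it receives the raw string data from the CSV, so a missing value arrives as an empty string. With apply it receives the elements of an already-parsed DataFrame, where missing values have been turned into NaN (a float). The conversion in read_csv happens before the DataFrame is created.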

In [12]: def a_cleaning(value: object) -> object:
    ...:     print(value)
    ...:     print(f'type: {type(value)}')
    ...:     if isinstance(value, str):
    ...:         return value.replace(',','')
    ...:     else:
    ...:         return value
    ...: 

In [13]: data = pd.DataFrame({
    ...:     'A':['1,200',np.nan,'400','200',np.nan]
    ...: })
    ...: csv = data.to_csv(index=False)

In [14]: df = pd.read_csv(StringIO(csv),
    ...:                   converters={
    ...:                       'A': a_cleaning
    ...:                   })
    ...: display(df)
1,200
type: <class 'str'>

type: <class 'str'>
400
type: <class 'str'>
200
type: <class 'str'>

type: <class 'str'>
      A
0  1200
1      
2   400
3   200
4      

In [15]: df = pd.read_csv(StringIO(csv))
    ...: df['A'] = df['A'].apply(a_cleaning)
    ...: display(df)
1,200
type: <class 'str'>
nan
type: <class 'float'>
400
type: <class 'str'>
200
type: <class 'str'>
nan
type: <class 'float'>
      A
0  1200
1   NaN
2   400
3   200
4   NaN
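
Building on the output above, one possible workaround (a minimal sketch, not something proposed in this thread) is to have the converter itself turn the empty string back into NaN. The name a_cleaning_keep_nan is hypothetical, and treating a whitespace-only field as missing is an assumption of this sketch rather than pandas' own rule.

```python
from io import StringIO

import numpy as np
import pandas as pd


def a_cleaning_keep_nan(value: object) -> object:
    """Hypothetical variant of a_cleaning meant for use as a read_csv converter."""
    if isinstance(value, str):
        # A converter sees the raw field text, so a missing value arrives as "".
        # Assumption of this sketch: empty or whitespace-only text means missing.
        if value.strip() == "":
            return np.nan
        return value.replace(",", "")
    return value


data = pd.DataFrame({"A": ["1,200", np.nan, "400", "200", np.nan]})
csv_text = data.to_csv(index=False)

# Same call as in the report, but with the NaN-aware converter.
df = pd.read_csv(StringIO(csv_text), converters={"A": a_cleaning_keep_nan})
print(df)
```

With this variant the missing rows come back as NaN and the other rows are still comma-free strings, which is what the apply version returns.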

4 participants