BUG: read_csv with engine pyarrow doesn't handle missing values #47950

Open
3 tasks done
MarcoGorelli opened this issue Aug 3, 2022 · 3 comments
Labels
Arrow (pyarrow functionality), Bug, IO CSV (read_csv, to_csv)

Comments

@MarcoGorelli
Member

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

data = """idx,date
2,2000-01-01
,
"""

print('pyarrow engine', pd.read_csv(io.StringIO(data), parse_dates=['date'], engine='pyarrow'))
print('default engine', pd.read_csv(io.StringIO(data), parse_dates=['date']))

Issue Description

With the pyarrow engine, the missing date comes back as None instead of NaT:

pyarrow engine    idx        date
0  2.0  2000-01-01
1  NaN        None
default engine    idx       date
0  2.0 2000-01-01
1  NaN        NaT

Expected Behavior

I'd expect:

pyarrow engine    idx        date
0  2.0  2000-01-01
1  NaN        NaT
default engine    idx       date
0  2.0 2000-01-01
1  NaN        NaT
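
Until the pyarrow engine returns NaT here, one possible workaround is to convert the affected column after reading. The sketch below is an assumption, not a fix proposed in this thread: it builds by hand an object column shaped like the pyarrow-engine output above (so it runs without pyarrow installed) and passes it through pd.to_datetime, which treats None as missing.

```python
import pandas as pd

# Hand-built stand-in for the pyarrow-engine result shown above:
# the date column is object dtype and holds None for the missing row.
df = pd.DataFrame(
    {
        "idx": [2.0, float("nan")],
        "date": pd.Series(["2000-01-01", None], dtype=object),
    }
)

# pd.to_datetime treats None as a missing value and yields NaT for it.
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
print(df)
```

After the conversion, the date column is datetime-typed and the second row is NaT, which matches the default engine's result.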

Installed Versions

/home/megorelli/pandas-dev/.venv/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : 61464f8
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.114-16025-ge75506b9d98e
Version : #1 SMP PREEMPT Sat Jul 16 18:52:19 PDT 2022
machine : aarch64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+1230.g61464f8cad.dirty
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.1.0
pip : 22.0.4
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.47.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@MarcoGorelli added the Bug and Needs Triage labels on Aug 3, 2022
@MarcoGorelli
Member Author

I think the issue is in concat_date_cols, where None is passed to convert_to_unicode and converted to the string 'None'.
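
The suspected mechanism can be illustrated without touching pandas internals. The function below is a hypothetical pure-Python stand-in for a "stringify anything that is not already a string" step. It is not the actual Cython code of convert_to_unicode, and its name is made up for this illustration. It shows how None stops being a missing-value marker once it is turned into text.

```python
def to_unicode_naive(item):
    """Hypothetical stand-in: stringify anything that is not already a str."""
    if not isinstance(item, str):
        item = str(item)
    return item


converted = to_unicode_naive(None)
print(repr(converted))    # 'None'
print(converted is None)  # False
```

Once the value is the four-character string 'None', later steps have no way to tell it apart from real text.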

@MarcoGorelli added the Arrow and IO CSV labels and removed the Needs Triage label on Aug 3, 2022
@MarcoGorelli
Member Author

Furthermore:

import numpy as np
import pandas as pd

dates = np.array(['3/31/2019', None], dtype=object)
times = np.array(['11:20', '10:45'], dtype=object)
# concat_date_cols is a private helper; the second row has a missing date
result = pd._libs.tslibs.parsing.concat_date_cols((dates, times))
print(result)

This outputs:

['3/31/2019 11:20' 'None 10:45']

I think the second element should be None instead of 'None 10:45'.
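
A minimal sketch of that expected behaviour, written in plain Python rather than against the real Cython helper: concat_date_cols_sketch is a hypothetical name, and it assumes that a row with any None component should become None as a whole. Other missing markers such as NaN are not handled here.

```python
def concat_date_cols_sketch(columns):
    """Join each row's parts with a space; yield None if any part is None."""
    result = []
    for parts in zip(*columns):
        if any(part is None for part in parts):
            # Propagate the missing value instead of stringifying it.
            result.append(None)
        else:
            result.append(" ".join(str(part) for part in parts))
    return result


dates = ["3/31/2019", None]
times = ["11:20", "10:45"]
print(concat_date_cols_sketch((dates, times)))
# ['3/31/2019 11:20', None]
```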

@mattharrison

I was going to file this bug. Here's my example:

import io

import pandas as pd

billing_data = '''cancel_date,period_start,start_date,end_date,rev,sum_payments
12/1/2019,1/1/2020,12/15/2019,5/15/2020,999,50
,1/1/2020,12/15/2019,5/15/2020,999,50
,1/1/2020,12/15/2019,5/15/2020,999,1950
1/20/2020,1/1/2020,12/15/2019,5/15/2020,499,0
,1/1/2020,12/24/2019,5/24/2020,699,100
,1/1/2020,11/29/2019,4/29/2020,799,250
,1/1/2020,1/15/2020,4/29/2020,799,250'''

df = pd.read_csv(io.StringIO(billing_data), dtype_backend='pyarrow',
                 parse_dates=['cancel_date', 'period_start', 'start_date', 'end_date'])
df.dtypes  # cancel_date is object
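
Because the column falls back to object silently, it can help to check which requested date columns were actually parsed. The helper below is a hypothetical sketch, not part of pandas. The frame is built by hand to resemble the result described above (cancel_date left as object, and period_start assumed to have parsed), so it runs without pyarrow. The explicit format string is likewise an assumption based on the sample data.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def unparsed_date_columns(df, expected):
    """Return the columns in `expected` that are not datetime-typed."""
    return [col for col in expected if not is_datetime64_any_dtype(df[col])]


# Hand-built stand-in: cancel_date stayed object, period_start was parsed.
frame = pd.DataFrame(
    {
        "cancel_date": pd.Series(["12/1/2019", None], dtype=object),
        "period_start": pd.to_datetime(["1/1/2020", "1/1/2020"], format="%m/%d/%Y"),
    }
)

leftover = unparsed_date_columns(frame, ["cancel_date", "period_start"])
print(leftover)

# Convert the leftovers explicitly; None becomes NaT.
for col in leftover:
    frame[col] = pd.to_datetime(frame[col], format="%m/%d/%Y")

print(frame.dtypes)
```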
