BUG: read_csv with engine pyarrow doesn't handle missing values #47950

MarcoGorelli · 2022-08-03T20:50:08Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

data = """idx,date
2,2000-01-01
,
"""

print('pyarrow engine', pd.read_csv(io.StringIO(data), parse_dates=['date'], engine='pyarrow'))
print('default engine', pd.read_csv(io.StringIO(data), parse_dates=['date']))

Issue Description

Running this, the pyarrow engine returns None for the missing date, whereas the default engine returns NaT:

pyarrow engine    idx        date
0  2.0  2000-01-01
1  NaN        None
default engine    idx       date
0  2.0 2000-01-01
1  NaN        NaT

Expected Behavior

I'd expect:

pyarrow engine    idx        date
0  2.0  2000-01-01
1  NaN        NaT
default engine    idx       date
0  2.0 2000-01-01
1  NaN        NaT

Installed Versions

/home/megorelli/pandas-dev/.venv/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : 61464f8
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.114-16025-ge75506b9d98e
Version : #1 SMP PREEMPT Sat Jul 16 18:52:19 PDT 2022
machine : aarch64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+1230.g61464f8cad.dirty
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.1.0
pip : 22.0.4
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.47.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


MarcoGorelli · 2022-08-03T21:09:11Z

I think the issue is in concat_date_cols, where None is passed to convert_to_unicode and converted to the string 'None'.
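To illustrate the suspected mechanism (this is plain Python, not the actual Cython code in pandas, and it assumes the conversion behaves like `str`): once a missing value is stringified, it is no longer recognisable as missing.

```python
import numpy as np
import pandas as pd

# Object array with one real date string and one missing value.
values = np.array(["2000-01-01", None], dtype=object)

# Naive string conversion, standing in for what convert_to_unicode
# is suspected to do with None (an assumption, not the real code).
as_text = [str(v) for v in values]

# Missingness is visible before the conversion but lost after it.
missing_before = [bool(pd.isna(v)) for v in values]
missing_after = [bool(pd.isna(v)) for v in as_text]

print(as_text)
print(missing_before, missing_after)
```

After the conversion, the second element is the five-character string 'None', so a later NaT conversion has nothing to detect.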

MarcoGorelli · 2022-08-04T10:51:01Z

Furthermore:

import numpy as np
import pandas as pd
import io

dates=np.array(['3/31/2019', None], dtype=object)
times=np.array(['11:20', '10:45'], dtype=object)
result = pd._libs.tslibs.parsing.concat_date_cols((dates, times))
print(result)

outputs:

['3/31/2019 11:20' 'None 10:45']

I think the second element should be None instead of 'None 10:45'.
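A rough sketch of the behaviour I have in mind, written in pure Python rather than Cython. The helper name `concat_date_cols_null_aware` is hypothetical and is not part of pandas; it only shows the idea of propagating a missing value instead of stringifying it.

```python
import numpy as np
import pandas as pd


def concat_date_cols_null_aware(date_cols):
    """Hypothetical helper: join columns row-wise with a space,
    returning None for any row that has a missing component."""
    n = len(date_cols[0])
    result = np.empty(n, dtype=object)
    for i, parts in enumerate(zip(*date_cols)):
        if any(pd.isna(p) for p in parts):
            # Keep the row missing rather than producing 'None 10:45'.
            result[i] = None
        else:
            result[i] = " ".join(str(p) for p in parts)
    return result


dates = np.array(["3/31/2019", None], dtype=object)
times = np.array(["11:20", "10:45"], dtype=object)
result = concat_date_cols_null_aware((dates, times))
print(result)
```

With this, the first row is still '3/31/2019 11:20', and the second row stays None so it can later become NaT.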

mattharrison · 2023-11-30T23:43:35Z

I was going to file this bug. Here's my example:

import pandas as pd

import io

billing_data = \
 '''cancel_date,period_start,start_date,end_date,rev,sum_payments
12/1/2019,1/1/2020,12/15/2019,5/15/2020,999,50
,1/1/2020,12/15/2019,5/15/2020,999,50
,1/1/2020,12/15/2019,5/15/2020,999,1950
1/20/2020,1/1/2020,12/15/2019,5/15/2020,499,0
,1/1/2020,12/24/2019,5/24/2020,699,100
,1/1/2020,11/29/2019,4/29/2020,799,250
,1/1/2020,1/15/2020,4/29/2020,799,250'''

df = pd.read_csv(io.StringIO(billing_data), dtype_backend='pyarrow',
                 parse_dates=['cancel_date', 'period_start', 'start_date', 'end_date'])
df.dtypes  # cancel_date is object
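A possible workaround in the meantime, as a sketch only: it sidesteps the problem rather than fixing it, uses the default engine and NumPy-backed dtypes (so the pyarrow dtype backend is not kept), and assumes every date follows the month/day/year layout. The data below is a shortened version of the example above with only two of the date columns.

```python
import io

import pandas as pd

# Abbreviated version of the billing data (assumption: same date layout).
billing_data = """cancel_date,start_date
12/1/2019,12/15/2019
,12/15/2019
1/20/2020,12/15/2019
"""

date_cols = ["cancel_date", "start_date"]

# Read the date columns as plain strings first, without parse_dates...
df = pd.read_csv(io.StringIO(billing_data))

# ...then convert them explicitly; missing entries become NaT.
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

print(df.dtypes)
print(df)
```

Here cancel_date ends up with a datetime dtype, and the empty field is NaT.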

MarcoGorelli added the Bug and Needs Triage labels on Aug 3, 2022

MarcoGorelli added the Arrow (pyarrow functionality) and IO CSV (read_csv, to_csv) labels and removed the Needs Triage label on Aug 3, 2022

MarcoGorelli mentioned this issue on Aug 4, 2022 in #47962 (closed): BUG: read_csv with engine pyarrow doesn't handle missing values
