BUG: Ordering of `columns` is not preserved in `pd.read_orc` #47944

galipremsagar · 2022-08-03T15:28:28Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [5]: import cudf

In [6]: df = cudf.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

In [7]: df.to_orc("abc")

In [8]: df.to_parquet("abc.parquet")

In [9]: import pandas as pd

In [10]: pd.read_parquet("abc.parquet", columns=['b', 'a'])
Out[10]: 
   b  a
0  a  1
1  b  2
2  c  3

In [11]: pd.read_orc("abc", columns=['b', 'a'])
Out[11]: 
   a  b
0  1  a
1  2  b
2  3  c

In [12]: pd.__version__
Out[12]: '1.4.3'

Issue Description

`pd.read_parquet` returns the columns in the order given by the `columns` argument, but `pd.read_orc` returns them in the file's order instead.

Expected Behavior

The expected behavior is that the columns in the output DataFrame appear in the same order as the list passed to `columns`.

Installed Versions

INSTALLED VERSIONS

commit : e8093ba
python : 3.9.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.3.0
pip : 22.2.1
Cython : 0.29.32
pytest : 7.1.2
hypothesis : 6.47.1
sphinx : 5.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2022.7.1
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : 0.55.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.7.1
scipy : 1.9.0
snappy :
sqlalchemy : 1.4.39
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None


mroeschke · 2022-08-09T18:52:06Z

I think this may be an issue with PyArrow. I'll raise an issue there to see if this is expected behavior.

In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas 1.5 main branch
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2,3]]
b: [["a","b","c"]]

mroeschke · 2022-08-09T18:59:28Z

xref https://issues.apache.org/jira/browse/ARROW-17360

mroeschke · 2022-11-09T19:21:28Z

Looks like the resolution of the above issue was to add a note in the docs that order is not guaranteed to be preserved when passing columns: apache/arrow#14528

We can mirror this note in pd.read_orc
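
Until the order is guaranteed, a caller who needs a specific order can reorder after reading. The snippet below is a minimal sketch of that workaround, not something proposed in this thread: `reorder_columns` and `read_orc_ordered` are hypothetical helper names, and it assumes every name in `columns` is a top-level column that `pd.read_orc` actually returns.

```python
import pandas as pd


def reorder_columns(df, columns):
    """Hypothetical helper: return df with its columns in the requested order."""
    if columns is None:
        # No explicit selection was made, so keep the file's own order.
        return df
    # Indexing with a list of labels selects the columns in list order.
    return df[list(columns)]


def read_orc_ordered(path, columns=None, **kwargs):
    """Hypothetical wrapper: read an ORC file, then restore the requested order.

    Assumes pyarrow is installed, since pd.read_orc depends on it.
    """
    df = pd.read_orc(path, columns=columns, **kwargs)
    return reorder_columns(df, columns)
```

With the file from the example above, `read_orc_ordered("abc", columns=["b", "a"])` should give `b` before `a`, since the reordering happens in pandas after the ORC reader has returned.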

grtcoder · 2022-11-10T14:30:03Z

Hey, I would like to take this up

markopacak · 2022-11-14T19:29:37Z

@mroeschke is it just about adding a note like

Output always follows the ordering of the file and not the columns list.

to read_orc?

I see @grtcoder is working on multiple issues ATM. Perhaps I could take this? It would be my first contribution to Pandas.

mroeschke · 2022-11-14T19:43:04Z

Correct. Yeah anyone is welcome to open up a pull request to tackle this issue

markopacak · 2022-11-14T19:48:56Z

take

galipremsagar added the Bug and Needs Triage labels on Aug 3, 2022

mroeschke added the Upstream issue and Arrow labels and removed the Needs Triage label on Aug 9, 2022

mroeschke added the Docs and good first issue labels and removed the Bug and Upstream issue labels on Nov 9, 2022

github-actions bot assigned markopacak Nov 14, 2022

markopacak mentioned this issue Nov 15, 2022

DOC: document read_orc columns order behaviour #49709

Merged


phofl closed this as completed in #49709 Nov 21, 2022

asfimport mentioned this issue Nov 9, 2022

[Python] Order of columns in pyarrow.feather.read_table apache/arrow#32634

Closed
