FIX-#3080: Fix case when there is duplicated columns for read_csv on hdk #5519

anmyachev · 2023-01-04T15:27:26Z

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #3080 (read_csv with OmniSci backend doesn't handle duplicated columns)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

ienkovich · 2023-01-04T20:19:19Z

modin/experimental/core/execution/native/implementations/hdk_on_native/io/io.py

@@ -299,6 +299,9 @@ def read_csv(
                parse_options=po,
                convert_options=co,
            )
+            if len(set(at.schema.names)) < len(at.schema.names):


IIUC, defaulting to pandas here means we read the CSV file with the original pandas API, convert the result to an Arrow table, and pass that table to the same from_arrow used below. So we pay the full file read plus conversion cost just to reuse pandas' column-name mangling.

We can obtain the mangled names by running the original pandas read_csv with nrows=0 and assigning them to the Arrow table using pyarrow.Table.rename_columns. We can also mangle the names ourselves, because the mangling scheme is documented. If mangling is disabled, then duplicated columns are simply dropped.
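A minimal sketch of the "mangle the names ourselves" option, assuming the pandas-documented scheme in which duplicated columns become 'X', 'X.1', ..., 'X.N'. The helper name `mangle_dupe_names` is hypothetical; it is not part of Modin, pandas, or the change that eventually landed in #5639.

```python
from collections import defaultdict


def mangle_dupe_names(names):
    """Hypothetical helper: rename duplicates as 'X', 'X.1', 'X.2', ...

    Assumption: this mirrors the documented pandas naming scheme; it is
    an illustration, not the code used by pandas or Modin.
    """
    counts = defaultdict(int)  # how many times each name has been emitted
    result = []
    for col in names:
        cur_count = counts[col]
        # Keep appending a numeric suffix until the candidate name is unused.
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        result.append(col)
        counts[col] = cur_count + 1
    return result


header = ["a", "b", "a", "a"]
mangled = mangle_dupe_names(header)
print(mangled)
```

The resulting list has one unique name per column, so it could be passed to pyarrow.Table.rename_columns on the table read by Arrow, avoiding a second full read of the file through pandas.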

Implemented column name mangling in #5639.

anmyachev · 2023-02-09T19:42:21Z

Closed in favour of #5639

FIX-modin-project#3080: Fix case when there is duplicated columns for…

f2317e2

… read_csv on hdk Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev force-pushed the issue3080 branch from 1695a23 to f2317e2 January 4, 2023 17:37

anmyachev marked this pull request as ready for review January 4, 2023 18:32

anmyachev requested review from a team as code owners January 4, 2023 18:32

anmyachev added the Ready for review label Jan 4, 2023

ienkovich reviewed Jan 4, 2023

anmyachev closed this Feb 9, 2023

AndreyPavlenko Feb 8, 2023
