FIX-#3080: Fix case when there is duplicated columns for read_csv on hdk #5519

Closed

@anmyachev anmyachev commented Jan 4, 2023

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #3080 ("read_csv with OmniSci backend doesn't handle duplicated columns")
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

… read_csv on hdk

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev marked this pull request as ready for review January 4, 2023 18:32
@anmyachev anmyachev requested review from a team as code owners January 4, 2023 18:32
@@ -299,6 +299,9 @@ def read_csv(
parse_options=po,
convert_options=co,
)
if len(set(at.schema.names)) < len(at.schema.names):

IIUC, defaulting to pandas here means we read the CSV file with the original pandas API, convert the result to an Arrow table, and pass that table to the same from_arrow used below. So we pay the full price of a pandas file read plus a conversion just to get pandas' column-name mangling.

We could instead obtain the mangled names by running the original pandas read_csv with nrows=0 and assigning them to the Arrow table with pyarrow.Table.rename_columns. We could also mangle the names ourselves, since the mangling scheme is documented. If mangling is disabled, duplicated columns are simply dropped.

Implemented column name mangling in #5639.

@anmyachev
Collaborator Author

Closed in favour of #5639

@anmyachev anmyachev closed this Feb 9, 2023