FIX-#3080: Fix case when there is duplicated columns for read_csv on hdk #5519

Closed · wants to merge 1 commit
@@ -299,6 +299,9 @@ def read_csv(
            parse_options=po,
            convert_options=co,
        )
        if len(set(at.schema.names)) < len(at.schema.names):
Collaborator:
IIUC, defaulting to pandas here means we read the CSV file using the original pandas API, convert the result to an Arrow table, and pass that table to the same `from_arrow` used below. So we pay the full file-read plus conversion price just to use pandas' column-mangling functionality.

We could obtain the mangled names by running the original pandas `read_csv` with `nrows=0` and assigning them to the Arrow table via `pyarrow.Table.rename_columns`. We could also mangle the names ourselves, since the mangling scheme is documented. If mangling is disabled, duplicated columns are simply dropped.
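The do-it-yourself variant suggested above can be sketched in pure Python. This is a sketch of the documented mangling scheme (repeated names become `x`, `x.1`, `x.2`, ...), not the exact pandas internals, and the function name `mangle_dupe_cols` is hypothetical:

```python
def mangle_dupe_cols(names):
    """Rename duplicated column names the way pandas documents it:
    the second occurrence of "x" becomes "x.1", the third "x.2", etc."""
    seen = {}  # every name emitted so far -> duplicate counter
    result = []
    for name in names:
        if name in seen:
            seen[name] += 1
            new_name = f"{name}.{seen[name]}"
            # Guard against colliding with a name that already exists.
            while new_name in seen:
                seen[name] += 1
                new_name = f"{name}.{seen[name]}"
            result.append(new_name)
            seen[new_name] = 0
        else:
            seen[name] = 0
            result.append(name)
    return result
```

The resulting list could then be handed to `pyarrow.Table.rename_columns` to relabel the table in place, avoiding the second full file read.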

Collaborator:

Implemented column name mangling in #5639.

            ErrorMessage.default_to_pandas("`read_csv`")
            return super().read_csv(**mykwargs)

        return cls.from_arrow(at)
    except (
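The guard added in the hunk above relies on a set collapsing repeats: if the set of schema names is shorter than the name list, at least one column name is duplicated. A standalone illustration of that condition, with a plain list standing in for `at.schema.names` and a hypothetical helper that also reports which names repeat:

```python
from collections import Counter

def duplicated_column_names(names):
    # Same condition as the patch: a set drops repeats, so a shorter
    # set means at least one duplicate exists.
    if len(set(names)) < len(names):
        # Counter preserves insertion order, so duplicates are
        # reported in the order they first appear.
        return [name for name, count in Counter(names).items() if count > 1]
    return []
```

In the patch itself the condition alone is enough: any duplicate triggers the fallback to the default pandas implementation.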
22 changes: 0 additions & 22 deletions modin/pandas/test/test_io.py
@@ -378,17 +378,6 @@ def test_read_csv_parsing_2(
        names,
        encoding,
    ):
        xfail_case = (
            StorageFormat.get() == "Hdk"
            and header is not None
            and isinstance(skiprows, int)
            and names is None
            and nrows is None
        )
        if xfail_case:
            pytest.xfail(
                "read_csv fails because of duplicated columns names - issue #3080"
            )
        if request.config.getoption(
            "--simulate-cloud"
        ).lower() != "off" and is_list_like(skiprows):
@@ -495,10 +484,6 @@ def test_read_csv_squeeze(self, request, test_case):
        )

    def test_read_csv_mangle_dupe_cols(self):
        if StorageFormat.get() == "Hdk":
            pytest.xfail(
                "processing of duplicated columns in HDK storage format is not supported yet - issue #3080"
            )
        with ensure_clean() as unique_filename, pytest.warns(
            FutureWarning, match="'mangle_dupe_cols' keyword is deprecated"
        ):
Expand Down Expand Up @@ -1001,13 +986,6 @@ def test_read_csv_s3_issue4658(self):
@pytest.mark.parametrize("names", [list("XYZ"), None])
@pytest.mark.parametrize("skiprows", [1, 2, 3, 4, None])
def test_read_csv_skiprows_names(self, names, skiprows):
if StorageFormat.get() == "Hdk" and names is None and skiprows in [1, None]:
# If these conditions are satisfied, columns names will be inferred
# from the first row, that will contain duplicated values, that is
# not supported by `HDK` storage format yet.
pytest.xfail(
"processing of duplicated columns in HDK storage format is not supported yet - issue #3080"
)
eval_io(
fn_name="read_csv",
# read_csv kwargs