Create Drop Duplicates Method #10207

Closed
exalate-issue-sync bot opened this issue May 12, 2023 · 7 comments
Comments

@exalate-issue-sync

Create a method similar to pandas drop_duplicates() for H2OFrame.

@exalate-issue-sync
Author

Michal Malohlava commented: There is a workaround (CC: [~accountid:557058:89297402-cb5a-4710-9511-20f42b25451a])

@exalate-issue-sync
Author

Navdeep commented: [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] Oh, I'm intrigued.

@exalate-issue-sync
Author

Michal Malohlava commented: OK, so drop_duplicates() removes duplicate rows; in the end it provides the same functionality as sort | uniq in the shell.
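
For readers less familiar with the shell idiom, the sort | uniq analogy can be sketched in plain Python: sort the rows so identical ones become adjacent, then keep one row per run. This is only an illustration on a list of tuples, not H2O code, and `sort_uniq` is a hypothetical helper name.

```python
from itertools import groupby


def sort_uniq(rows):
    """Return the distinct rows in sorted order, like `sort | uniq`."""
    # After sorting, equal rows sit next to each other, so groupby
    # yields exactly one key per distinct row.
    return [row for row, _ in groupby(sorted(rows))]


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "a"), (3, "c")]
print(sort_uniq(rows))
```

Note that the result comes back in sorted order; the original row order is lost, which is one way this differs from pandas drop_duplicates().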

@exalate-issue-sync
Author

Lauren DiPerna commented: 1) One workaround is to sort the frame; duplicated rows are then adjacent and easy to check for (there is no support for that right now).

2) Another workaround is to create a new column holding a row hash and drop duplicates using that column (however, it will still need the sort above).
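
The row-hash workaround can be sketched as follows, again in plain Python rather than against the H2O API. The names `row_hash` and `drop_duplicates_by_hash` are hypothetical, and SHA-256 is an assumed choice of hash; the thread does not say which hash would be used.

```python
import hashlib


def row_hash(row):
    """Hash all values of a row into one hex digest (hypothetical helper)."""
    # repr() keeps 1 and "1" distinct; the unit separator keeps
    # neighbouring values from running together.
    payload = "\x1f".join(repr(v) for v in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_duplicates_by_hash(rows):
    # Step 1: add the hash as an extra "column".
    hashed = [(row_hash(r), r) for r in rows]
    # Step 2: sort on the hash so equal rows become adjacent
    # (the sort is stable, so the earliest duplicate comes first).
    hashed.sort(key=lambda pair: pair[0])
    # Step 3: keep a row only when its hash differs from the previous one.
    kept, previous = [], None
    for digest, row in hashed:
        if digest != previous:
            kept.append(row)
        previous = digest
    return kept


rows = [(1, "a"), (2, "b"), (1, "a"), ("1", "a")]
print(sorted(drop_duplicates_by_hash(rows), key=repr))
```

The output order follows the hash values, not the input order, and a hash collision would wrongly merge two different rows, so this is a sketch of the idea rather than a safe general implementation.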

@exalate-issue-sync
Author

Neema Mashayekhi commented: As with pandas' drop_duplicates() or data.table's unique(), the customer would like the ability to:

1) keep either the first occurrence or the last one (fromLast = FALSE or fromLast = TRUE in data.table);

2) handle duplicates across multiple keys/columns; h2o.unique() only handles one key.

https://rdrr.io/rforge/data.table/man/duplicated.html
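
The two requested behaviours can be illustrated with a small pure-Python sketch. The function name `drop_duplicates` and the `from_last` parameter here are hypothetical, chosen to mirror data.table's fromLast; this is not the signature of any H2O method.

```python
def drop_duplicates(rows, keys, from_last=False):
    """Keep one row per distinct combination of the `keys` columns.

    rows: list of dicts; keys: list of column names.
    from_last=False keeps the first occurrence, True keeps the last.
    """
    ordered = list(reversed(rows)) if from_last else list(rows)
    seen = set()
    kept = []
    for row in ordered:
        # A tuple of the key columns identifies a duplicate group.
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original relative order when scanning from the end.
    return list(reversed(kept)) if from_last else kept


rows = [
    {"id": 1, "grp": "A", "val": 10},
    {"id": 1, "grp": "A", "val": 20},
    {"id": 1, "grp": "B", "val": 30},
]
print(drop_duplicates(rows, ["id", "grp"]))
print(drop_duplicates(rows, ["id", "grp"], from_last=True))
```

With keys id and grp, the first call keeps the rows with val 10 and 30, and the second keeps the rows with val 20 and 30.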

@exalate-issue-sync
Author

Neema Mashayekhi commented: Consider concatenating multiple columns into a single column: [1], ['A'] → ['1_A']
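
A minimal sketch of the concatenation idea, assuming rows are plain dicts and "_" is the separator as in the example above. `concat_key` is a hypothetical helper, not an H2O function. The second half shows a caveat worth keeping in mind: if the separator can occur inside a value, two different rows can produce the same key.

```python
def concat_key(row, keys, sep="_"):
    """Join the selected columns into a single string key."""
    return sep.join(str(row[k]) for k in keys)


# The example from the comment: [1], ['A'] -> '1_A'
print(concat_key({"x": 1, "y": "A"}, ["x", "y"]))

# Caveat: values containing the separator make distinct rows collide.
a = concat_key({"x": "1_A", "y": "B"}, ["x", "y"])
b = concat_key({"x": "1", "y": "A_B"}, ["x", "y"])
print(a == b)
```

Choosing a separator that cannot appear in the data, or deduplicating on a tuple of the columns instead of a joined string, avoids that collision.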

@h2o-ops
Collaborator

h2o-ops commented May 15, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-3292
Assignee: Pavel Pscheidl
Reporter: Lauren DiPerna
State: Resolved
Fix Version: 3.30.0.4
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#4581
