Use pandas duplicated() to check for duplicate index values #826

victorlin · 2022-01-06T22:35:15Z

Description of proposed changes

This PR uses the built-in pandas.Series.duplicated() to check for duplicate index values, based on @huddlej's original suggestion. It also simplifies the check logic for better performance.
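For readers unfamiliar with the call, here is a minimal sketch of a duplicate check built on pandas.Series.duplicated(). The helper name find_duplicate_keys, the "strain" column, and the sample data are assumptions made for illustration; this is not the code in augur/util_support/metadata_file.py.

```python
import pandas as pd


def find_duplicate_keys(metadata: pd.DataFrame, key: str) -> list:
    """Return the values that appear more than once in metadata[key].

    Hypothetical helper for illustration only.
    """
    # duplicated() flags every occurrence after the first as True.
    duplicate_rows = metadata[key].duplicated()
    # unique() collapses values that repeat more than twice,
    # preserving order of first appearance among the flagged rows.
    return metadata.loc[duplicate_rows, key].unique().tolist()


# Hypothetical sample data: "A" appears three times, "B" twice.
metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "A", "C", "B", "A"],
        "country": ["US", "UK", "US", "FR", "UK", "US"],
    }
)

duplicates = find_duplicate_keys(metadata, "strain")
if duplicates:
    print(f"Duplicated values in 'strain': {duplicates}")
```

A caller could raise an error instead of printing when the returned list is non-empty.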

Related issue(s)

Fixes #825

Testing

Added a test for the duplicate check.

codecov · 2022-01-06T22:37:54Z

Codecov Report

Merging #826 (73ccb0e) into master (73a07cf) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   33.78%   33.79%   +0.01%     
==========================================
  Files          41       41              
  Lines        5902     5903       +1     
  Branches     1465     1465              
==========================================
+ Hits         1994     1995       +1     
  Misses       3825     3825              
  Partials       83       83

Impacted Files	Coverage Δ
augur/util_support/metadata_file.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

victorlin · 2022-01-06T23:03:28Z

[side note] While working on this, I opened a performance bug report against pandas (pandas-dev/pandas#45236) related to this line:

augur/augur/util_support/metadata_file.py

Line 59 in 73ccb0e

duplicate_rows = self.metadata[self.key_type].duplicated()

The docs suggest using self.metadata.duplicated(self.key_type), but that is actually slower than calling duplicated() on the column itself.
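The two spellings can be compared with a sketch like the following. The DataFrame contents and the "strain" key are made up for illustration, and the timings depend on the pandas version and the data, so no numbers are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for self.metadata and self.key_type.
metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "A", "C"],
        "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
    }
)
key = "strain"

# Series form (used in this PR): select the column, then call duplicated().
series_mask = metadata[key].duplicated()

# DataFrame form: pass the column label as the subset argument.
frame_mask = metadata.duplicated(key)

# Both forms flag the same rows; compare plain values to ignore Series names.
same_result = series_mask.tolist() == frame_mask.tolist()

# Rough timing of each form; rerun on realistic data before drawing conclusions.
series_time = timeit.timeit(lambda: metadata[key].duplicated(), number=100)
frame_time = timeit.timeit(lambda: metadata.duplicated(key), number=100)
print(f"same result: {same_result}")
print(f"Series.duplicated: {series_time:.4f}s, DataFrame.duplicated: {frame_time:.4f}s")
```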

huddlej

Thanks, @victorlin! I confirmed that this works with the ncov scripts that call read_metadata when I have numexpr installed.

victorlin added 2 commits January 6, 2022 14:15

Use pandas duplicated() to check for duplicate index values

f7308dd

add test for duplicate check

73ccb0e

victorlin requested a review from huddlej January 6, 2022 22:35

huddlej approved these changes Jan 7, 2022

victorlin merged commit a004d3f into master Jan 7, 2022

victorlin deleted the victorlin/fix-utils-duplicate-check branch January 7, 2022 21:25

huddlej added this to the Patch release 13.1.1 milestone Jan 10, 2022
