Poor performance of StaticCluster#remove  #917

@YegorKozlov

Description

NonHierarchicalDistanceBasedAlgorithm calls StaticCluster#remove when re-clustering candidates.

StaticCluster internally stores its items in an ArrayList, and ArrayList#remove(Object) is O(N): every time we remove an item, the list scans its elements until it finds the item and then shifts the remaining elements down. This results in poor performance on large data sets (100K+ points).
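
To make the cost concrete without relying on timings, here is a minimal sketch that counts equals() calls during a single removal. CountingItem and RemoveCostDemo are hypothetical names made up for this illustration; they are not part of the library.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

// Hypothetical stand-in for a cluster item; not the library's type.
final class CountingItem {
    static long equalsCalls = 0; // global counter, reset before each measurement
    final int id;

    CountingItem(int id) { this.id = id; }

    @Override public boolean equals(Object o) {
        equalsCalls++;
        return o instanceof CountingItem && ((CountingItem) o).id == id;
    }

    @Override public int hashCode() { return Integer.hashCode(id); }
}

final class RemoveCostDemo {
    // Adds n distinct items, removes the last one added, and returns
    // how many equals() calls that single removal needed.
    static long equalsCallsToRemoveLast(Collection<CountingItem> items, int n) {
        CountingItem last = null;
        for (int i = 0; i < n; i++) {
            last = new CountingItem(i);
            items.add(last);
        }
        CountingItem.equalsCalls = 0;
        items.remove(last);
        return CountingItem.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("ArrayList: " + equalsCallsToRemoveLast(new ArrayList<>(), n));
        System.out.println("HashSet:   " + equalsCallsToRemoveLast(new HashSet<>(), n));
    }
}
```

With the ArrayList, removing the last of N items compares against every element before it, so the count grows linearly with N. With the HashSet, the lookup goes straight to the item's bucket, so the count stays constant regardless of N.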

A low-hanging fix is to change the container from ArrayList to HashSet (or LinkedHashSet if ordering matters), which makes removal O(1) on average.
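
A rough sketch of what the suggested change could look like, assuming the cluster only needs add, remove, size, and a read-only view of its items. SetBackedCluster and its method names are hypothetical and are not the actual StaticCluster code.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; the class and method names are illustrative only.
final class SetBackedCluster<T> {
    // LinkedHashSet keeps insertion order while giving O(1) average-case remove.
    private final Set<T> items = new LinkedHashSet<>();

    boolean add(T item) { return items.add(item); }

    boolean remove(T item) { return items.remove(item); }

    // Read-only view so callers cannot modify the backing set directly.
    Collection<T> getItems() { return Collections.unmodifiableSet(items); }

    int getSize() { return items.size(); }
}
```

Two behavioural differences are worth checking before making such a swap: a set silently rejects duplicate items, whereas a list keeps them, and lookups depend on the items having consistent equals() and hashCode() implementations.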


Labels: released, triage me, type: bug
