
Match merging has performance issues with submission sets of larger programs #1430

Open
tsaglam opened this issue Dec 11, 2023 · 2 comments
Labels
bug Issue/PR that involves a bug minor Minor issue/feature/contribution/change

Comments

@tsaglam
Member

tsaglam commented Dec 11, 2023

When submissions exceed 1000 LoC, performance degrades noticeably. I have not yet investigated whether this is due to CPU load, memory/paging pressure, or a general algorithmic flaw.

@tsaglam tsaglam added bug Issue/PR that involves a bug minor Minor issue/feature/contribution/change labels Dec 11, 2023
@tsaglam tsaglam added this to the v5.0.0 milestone Dec 11, 2023
@tsaglam
Member Author

tsaglam commented Dec 12, 2023

I profiled match merging for a large dataset (>400 submissions, ~1500 LoC each, with basecode enabled):

Method profiling: [profiler screenshot]

Memory profiling: [profiler screenshot]

My current theory: due to the nature of match merging, smaller matches below the MTM (minimum token match) are generated during greedy string tiling. This is required, as neighbors with lengths below the MTM can be merged into eligible matches again. However, it effectively forces greedy string tiling to run down to a subsequence length of 2 (the default neighbor length), producing exponentially more subsequences and thus slowing the matching. Most of the overhead is therefore algorithmic: for my dataset, enabling match merging incurs a ~4000% runtime overhead, and most samples come from GreedyStringTiling.compareInternal().
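To make the effect concrete, here is a minimal, self-contained sketch of a greedy-string-tiling-style pass (not JPlag's actual implementation; all names are illustrative). It repeatedly takes the longest unmarked common token run until none of at least `minLength` remains; lowering the threshold toward 2 admits far more short, often spurious tiles, which is the mechanism described above.

```java
import java.util.Random;

// Illustrative only: a simplified greedy-string-tiling-style pass that counts
// how many tiles (maximal common token runs of length >= minLength) survive.
// Lowering minLength toward 2 lets many more short tiles through.
public class MatchCountDemo {
    static int countTiles(int[] left, int[] right, int minLength) {
        boolean[] markedLeft = new boolean[left.length];
        boolean[] markedRight = new boolean[right.length];
        int tiles = 0;
        while (true) {
            int bestLen = 0, bestI = -1, bestJ = -1;
            // Find the longest common run of unmarked tokens.
            for (int i = 0; i < left.length; i++) {
                for (int j = 0; j < right.length; j++) {
                    int len = 0;
                    while (i + len < left.length && j + len < right.length
                            && !markedLeft[i + len] && !markedRight[j + len]
                            && left[i + len] == right[j + len]) {
                        len++;
                    }
                    if (len > bestLen) {
                        bestLen = len;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestLen < minLength) {
                break; // nothing eligible left
            }
            // Mark the matched run so it cannot be reused.
            for (int k = 0; k < bestLen; k++) {
                markedLeft[bestI + k] = true;
                markedRight[bestJ + k] = true;
            }
            tiles++;
        }
        return tiles;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] a = random.ints(300, 0, 8).toArray();
        int[] b = random.ints(300, 0, 8).toArray();
        // With a small token alphabet, short coincidental matches abound,
        // so the tile count grows sharply as the threshold drops.
        System.out.println("tiles at minLength 9: " + countTiles(a, b, 9));
        System.out.println("tiles at minLength 2: " + countTiles(a, b, 2));
    }
}
```

Since the greedy pass is deterministic, the tile count at threshold 2 is always at least the count at any higher threshold, and in practice far larger, which matches the observed slow-down.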

Besides that issue, I see two minor performance issues in the MatchMerging class:

  • The method computeNeighbors() is inefficient because of its sorting step.
  • The method removeTooShortMatches() is inefficient because of how it deletes elements (a simple stream statement is more efficient here, based on rudimentary tests).
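The second bullet can be sketched as follows (a minimal example with a hypothetical `Match` record, not JPlag's actual API): removing elements one at a time via `List.remove(Object)` is O(n²) on an `ArrayList` because each removal shifts the tail, while a single filtering pass is O(n).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the suggested fix: replace element-by-element deletion with one
// filtering pass. The Match record here is a stand-in, not JPlag's class.
public class RemoveShortMatches {
    record Match(int start, int length) {}

    // O(n^2): each remove(Object) shifts all subsequent list elements.
    static void removeIteratively(List<Match> matches, int minLength) {
        for (Match match : new ArrayList<>(matches)) {
            if (match.length() < minLength) {
                matches.remove(match);
            }
        }
    }

    // O(n): a single stream pass, as suggested in the comment above.
    static List<Match> removeWithStream(List<Match> matches, int minLength) {
        return matches.stream()
                .filter(match -> match.length() >= minLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<>(
                List.of(new Match(0, 1), new Match(4, 6), new Match(12, 2)));
        System.out.println(removeWithStream(matches, 3)); // keeps only the length-6 match
    }
}
```

`Collection.removeIf(match -> match.length() < minLength)` would be an equally good in-place alternative, since `ArrayList` implements it as a single compacting pass.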

Hotfix: setting --neighbor-length to 5 reduces the overhead to ~190%.

@tsaglam
Copy link
Member Author

tsaglam commented Dec 15, 2023

One idea to work around this problem: enable match merging by default, but increase the neighbor length dynamically based on the average number of tokens per submission. We would need measurements for different datasets at different neighbor lengths to see when the slow-down factor becomes excessive; something like 2x to 4x is acceptable, but 40x is not. Then we would have a CLI parameter for match merging: off, auto, manual.
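The proposed off/auto/manual switch could look roughly like this (all names and the breakpoint numbers are placeholder assumptions for illustration; the issue explicitly calls for measuring real datasets before fixing any thresholds):

```java
// Hypothetical sketch of the proposed "off / auto / manual" merging switch.
// The class, enum, and threshold values are assumptions, not JPlag code.
public class MergingConfig {
    enum MergingMode { OFF, AUTO, MANUAL }

    // AUTO derives the neighbor length from the average token count per
    // submission, trading merge granularity for runtime on large inputs.
    static int neighborLength(MergingMode mode, int manualLength, double averageTokens) {
        return switch (mode) {
            case MANUAL -> manualLength;
            case AUTO -> averageTokens < 500 ? 2     // small submissions: fine-grained merging
                    : averageTokens < 1500 ? 3       // medium: moderate threshold
                    : 5;                             // large: the hotfix value from above
            case OFF -> 0; // unused when merging is disabled
        };
    }

    public static void main(String[] args) {
        System.out.println(neighborLength(MergingMode.AUTO, 2, 2000.0)); // prints 5
        System.out.println(neighborLength(MergingMode.MANUAL, 7, 2000.0)); // prints 7
    }
}
```

The AUTO branch is where the measurements mentioned above would feed in: each breakpoint should be chosen so the slow-down factor stays in the acceptable 2x-4x band.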
