Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Refactor CPD integration #4323

Closed
adangel opened this issue Jan 12, 2023 · 3 comments · Fixed by #4397
Closed

[core] Refactor CPD integration #4323

adangel opened this issue Jan 12, 2023 · 3 comments · Fixed by #4397
Assignees
Labels
in:cpd Affects the copy-paste detector
Milestone

Comments

@adangel
Copy link
Member

adangel commented Jan 12, 2023

Notes originally from #3809:

  • reuse common infrastructure like text documents
  • CPD programmatic API - goal is to reuse the same abstractions as PMD like FileCollector. No compatibility story planned for master. Also see [core] Provide a CpdAnalysis class as a programmatic entry point into CPD #4204
  • Related to both items, maybe the cleanest approach would be to make CPD a PMD rule, and we could remove the programmatic API of CPD (keeping the CLI for convenience)
@adangel adangel added the in:cpd Affects the copy-paste detector label Jan 12, 2023
@adangel adangel added this to the 7.0.0 milestone Jan 12, 2023
@adangel adangel mentioned this issue Jan 12, 2023
55 tasks
@jsotuyod
Copy link
Member

I don't think implementing CPD as a rule is viable in the short term (if at all).

  • it requires statefullness (clashing with [core] Rule API refactoring #4321) or multifile analysis.
  • it requires the rule to either be able to cache arbitrary data or automatically invalidates the cache, requiring all files to be analyzed at least for this rule.

@oowekyala
Copy link
Member

While it cannot be implemented as a rule yet, the hurdles you mention are the topic of #3920.

@oowekyala
Copy link
Member

Also I want to eventually implement tree-based copy paste detection which would work by comparing trees instead of tokens. The most reasonable way to implement this is with #3920, because it still requires parsing files. I have an old prototype which gives much higher-quality results than CPD (https://github.com/oowekyala/pmd/tree/tree-clone-detection). This has the same challenges you mention about incremental analysis, but allowing one rule to process all files is not IMO a problem, as long as those rules that don't need to rerun use the violation cache.

@adangel adangel modified the milestones: 7.0.0, 7.x Jan 23, 2023
@oowekyala oowekyala self-assigned this Feb 13, 2023
@oowekyala oowekyala mentioned this issue Feb 16, 2023
5 tasks
@oowekyala oowekyala linked a pull request Feb 18, 2023 that will close this issue
5 tasks
@adangel adangel modified the milestones: 7.x, 7.0.0 Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in:cpd Affects the copy-paste detector
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants