Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[doc] Clarify how CPD --ignore-literals and --ignore-identifiers work #4676

Closed
Scrates1 opened this issue Sep 14, 2023 · 8 comments · Fixed by #4685
Closed

[doc] Clarify how CPD --ignore-literals and --ignore-identifiers work #4676

Scrates1 opened this issue Sep 14, 2023 · 8 comments · Fixed by #4685
Labels
in:cli Affects the PMD Command Line Interface in:cpd Affects the copy-paste detector in:documentation Affects the documentation
Milestone

Comments

@Scrates1
Copy link

Scrates1 commented Sep 14, 2023

Affects PMD Version:

pmd-bin-7.0.0-rc3

Description:

I use the cpd function, and the command is as follows:

pmd-bin-7.0.0-rc3\bin>.\pmd.bat cpd --language java --format xml --ignore-annotations false --ignore-literals false --ignore-identifiers false --minimum-tokens 45 --dir "...Enum.java"

the result is as follows:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/C:/.../pmd/lib/groovy-2.4.21.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file path="C:\...Enum.java"
         totalNumberOfTokens="137"/>
   <duplication lines="15" tokens="49">
      <file begintoken="4"
            column="5"
            endcolumn="17"
            endline="16"
            endtoken="52"
            line="2"
            path="C:\...Enum.java"/>
      <file begintoken="59"
            column="5"
            endcolumn="23"
            endline="30"
            endtoken="107"
            line="20"
            path="C:\...Enum.java"/>
      <codefragment><![CDATA[    SUB_LIST("subList"), /* List.subList(int start, int end) */
    GET_BYTES1("getBytes"), /* String.getBytes(...) */

    GET("get"), /* List.get(int index) / FileItem.get() / Properties.get() / Map.get() */
    SET("set"), /* set() / Collection.set(int index, E element) */
    GET_BYTES("getBytes"), /* String.getBytes(...) */

    GET_NAME("getName"), /* ZipEntry.getName(), Class.getName() */
    GET_SIZE("getSize"), /* ZipEntry.getSize() */

    GET_INSTANCE("getInstance"), /* Calendar.getInstance() */

    GET_AND_INCREMENT("getAndIncrement"), /* AtomicInteger.getAndIncrement */

    LOCK("lock"), /* Lock.lock() */]]></codefragment>
   </duplication>
</pmd-cpd>

The content from row 2 to 16 is different from row 20 to 30. Args --ignore-literals and --ignore-identifiers have been set to false. The expected result is no duplicates be found which contradicts the actual result.

  • Is there something wrong with my args? What should I do if I want to reach my expectations?

  • And is there any way to know the similarity rate of the duplicates?

Thanks so much!

Exception Stacktrace:

# Copy-paste the stack trace here

Code Sample demonstrating the issue:

public enum Enum {
    SUB_LIST("subList"), /* List.subList(int start, int end) */
    GET_BYTES1("getBytes"), /* String.getBytes(...) */

    GET("get"), /* List.get(int index) / FileItem.get() / Properties.get() / Map.get() */
    SET("set"), /* set() / Collection.set(int index, E element) */
    GET_BYTES("getBytes"), /* String.getBytes(...) */

    GET_NAME("getName"), /* ZipEntry.getName(), Class.getName() */
    GET_SIZE("getSize"), /* ZipEntry.getSize() */

    GET_INSTANCE("getInstance"), /* Calendar.getInstance() */

    GET_AND_INCREMENT("getAndIncrement"), /* AtomicInteger.getAndIncrement */

    LOCK("lock"), /* Lock.lock() */
    TRY_LOCK("tryLock"), /* Lock.tryLock() */


    VALIDATE("validate"), /* ValidatorForm.validate */

    PARSE_INT("parseInt"), /* Integer.parseInt */
    PARSE_BYTE("parseByte"), /* Byte.parseByte */
    PARSE_SHORT1("parseShort"), /* Short.parseShort */
    PARSE_LONG("parseLong"), /* Long.parseLong */
    PARSE_FLOAT("parseFloat"), /* Float.parseFloat */
    PARSE_DOUBLE1("parseDouble"), /* Double.parseDouble */
    PARSE_BOOLEAN("parseBoolean"), /* Boolean.parseBoolean */
    SUBMIT("submit"), /* ThreadPoolExecutor.submit(), ExecutorService.submit() */
    EXECUTE("execute"); /* ThreadPoolExecutor.execute() */

    private final String name;

    private CommonMethodEnum(String name) {
        this.name = name;
    }

    public String simpleName() {
        return this.name;
    }
}

Steps to reproduce:

Please provide detailed steps for how we can reproduce the bug.

  1. ... (e.g. if you're using maven: mvn clean verify)
  2. ...

Running PMD through: [CLI | Ant | Maven | Gradle | Designer | Other]

@Scrates1 Scrates1 added the a:bug PMD crashes or fails to analyse a file. label Sep 14, 2023
@adangel adangel changed the title [java]CPD, It seems that --ignore-literals --ignore-identifier does not work [java] CPD --ignore-literals and --ignore-identifier do not work Sep 21, 2023
@adangel adangel added in:cpd Affects the copy-paste detector in:cli Affects the PMD Command Line Interface in:documentation Affects the documentation labels Sep 21, 2023
@adangel
Copy link
Member

adangel commented Sep 21, 2023

Thanks for reporting!
There is indeed something wrong... I'll have to dig further.

Note, that the CLI syntax has changed with PMD 7 - and the documentation on https://docs.pmd-code.org/latest/pmd_userdocs_cpd.html should be updated to make it clear, how to enable/disable various flags. Probably provide some examples and also mention this in the migration guide.

By default, identifiers are supposed to be not ignored. If you want to have identifiers ignored, you just add --ignore-identifiers as an additional argument. If you don't want to have the identifiers ignored (the default behavior), then don't specify the argument at all.

--ignore-identifiers false is NOT setting ignore-identifiers to false, rather it sets it to true and requests the a file named "false" should be analyzed... the argument does not take any parameters (the documentation is misleading).

Using your test file, I couldn't see any difference when adding --ignore-identifiers or removing it - thus indicating that something really doesn't work. I'll have to dig deeper.

@adangel adangel changed the title [java] CPD --ignore-literals and --ignore-identifier do not work [java] CPD --ignore-literals and --ignore-identifiers do not work Sep 21, 2023
@Scrates1
Copy link
Author

I am very glad to receive your reply!
Thank you very much for point out my wrong use of parameters, and I look forward to these parameters being effective in the future.
In addition, I have another question:

Is there any way to know the similarity rate of the duplicates?

Thanks!

@adangel
Copy link
Member

adangel commented Sep 22, 2023

Is there any way to know the similarity rate of the duplicates?

Not sure, if we can calculate this. We use the Rabin-Karp algorithm, which is implemented in MatchAlgorithm.

I think, when we find a match, then it is a perfect match (similarity rate 100%). If you use the flags --ignore-identifiers, then we manipulate the source code before we put it into the match algorithm (replacing each identifier with the same string). The match algorithm would still report a similarity rate of 100%.

We would need to compare at the end the two code snippets with the original identifiers. Then we probably could calculate something like a similarity rate.

@adangel
Copy link
Member

adangel commented Sep 22, 2023

After calling PMD with the correct parameters and using a proper test file, I can confirm, that everything still works as it should.

Extract this ZIP into a directory src: CommonMethodEnum.zip
It contains the following two files:

  • CommonMethodEnum.java
  • CommonMethodEnum2.java

They have almost the same content, but each enum value is suffixed with 2 in the second file.

When running CPD with the defaults:

$ pmd-bin-7.0.0-rc3/bin/pmd cpd --language java --format xml --minimum-tokens 45 -d src
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum.java"
         totalNumberOfTokens="137"/>
   <file path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum2.java"
         totalNumberOfTokens="137"/>
</pmd-cpd>

No duplications are detected, because the enum values differ.

But if called with "ignoreIdentifiers":

$ pmd-bin-7.0.0-rc3/bin/pmd cpd --language java --format xml --ignore-identifiers --minimum-tokens 45 -d src
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum.java"
         totalNumberOfTokens="137"/>
   <file path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum2.java"
         totalNumberOfTokens="137"/>
   <duplication lines="34" tokens="113">
      <file begintoken="0"
            column="1"
            endcolumn="12"
            endline="34"
            endtoken="112"
            line="1"
            path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum.java"/>
      <file begintoken="138"
            column="1"
            endcolumn="12"
            endline="34"
            endtoken="250"
            line="1"
            path="/home/andreas/PMD/source/pmd-it/issue-4676-cpd-cli/src/CommonMethodEnum2.java"/>
      <codefragment><![CDATA[public enum CommonMethodEnum {
    SUB_LIST("subList"), /* List.subList(int start, int end) */
    GET_BYTES1("getBytes"), /* String.getBytes(...) */

    GET("get"), /* List.get(int index) / FileItem.get() / Properties.get() / Map.get() */
    SET("set"), /* set() / Collection.set(int index, E element) */
    GET_BYTES("getBytes"), /* String.getBytes(...) */

    GET_NAME("getName"), /* ZipEntry.getName(), Class.getName() */
    GET_SIZE("getSize"), /* ZipEntry.getSize() */

    GET_INSTANCE("getInstance"), /* Calendar.getInstance() */

    GET_AND_INCREMENT("getAndIncrement"), /* AtomicInteger.getAndIncrement */

    LOCK("lock"), /* Lock.lock() */
    TRY_LOCK("tryLock"), /* Lock.tryLock() */


    VALIDATE("validate"), /* ValidatorForm.validate */

    PARSE_INT("parseInt"), /* Integer.parseInt */
    PARSE_BYTE("parseByte"), /* Byte.parseByte */
    PARSE_SHORT1("parseShort"), /* Short.parseShort */
    PARSE_LONG("parseLong"), /* Long.parseLong */
    PARSE_FLOAT("parseFloat"), /* Float.parseFloat */
    PARSE_DOUBLE1("parseDouble"), /* Double.parseDouble */
    PARSE_BOOLEAN("parseBoolean"), /* Boolean.parseBoolean */
    SUBMIT("submit"), /* ThreadPoolExecutor.submit(), ExecutorService.submit() */
    EXECUTE("execute"); /* ThreadPoolExecutor.execute() */

    private final String name;

    private CommonMethodEnum(String name) {]]></codefragment>
   </duplication>
</pmd-cpd>

A duplication is detected as expected.

So, I think this is not a code bug, but just documentation.

By the way, you could calculate how much of a file is duplicated: E.g. CommonMethodEnum.java has in total 137 tokens, and it contains a duplication with length 113 tokens, so 82% of the file content is duplication.

Let me know, if it still doesn't work for you (e.g. I ran into the error, that I used -language java instead of --language java - which lead to no files being analyzed).

@adangel adangel changed the title [java] CPD --ignore-literals and --ignore-identifiers do not work [doc] Clarify how CPD --ignore-literals and --ignore-identifiers work Sep 22, 2023
@adangel adangel removed the a:bug PMD crashes or fails to analyse a file. label Sep 22, 2023
@adangel adangel added this to the 7.0.0 milestone Sep 22, 2023
@adangel
Copy link
Member

adangel commented Sep 22, 2023

If tested a bit further, and with the current SNAPSHOT version of PMD 7, the CLI interface is even better unified and behaves more correct (thanks to #4397). E.g. "-language java" throws an error, that the language "anguage" doesn't exist ("-l" is the short option of "--language"). So, you can't call CPD wrongly in that case.

I've also verified behavior of PMD 6.55.0, and:

$ $ ~/PMD/binaries/pmd-bin-6.55.0/bin/run.sh cpd --language java --format xml --minimum-tokens 45 --ignore-identifiers false -d src 
Was passed main parameter 'false' but no main parameter was defined
Run with --help for command line help.

So, --ignore-identifiers false was never supported - besides being indicated somehow in the docs. So we don't need to mention this in the migration guide.

The PMD 7 SNAPSHOT version supports such a "main" parameter and interprets it as part of the file list to be analyzed. Since "false" does not exist, it will log an error:

[main] ERROR net.sourceforge.pmd.cli - No such file false

In summary, this is really just a documentation issue.
For your question, I've created a separate enhancement issue - see #4684 .

@Scrates1
Copy link
Author

Scrates1 commented Sep 23, 2023

Hi, I have called PMD with the correct parameters(pmd-bin-7.0.0-rc3 and pmd-bin-6.55.0), and it's excited that I got the expected result! It's really great!

I tried to reflect on why I did not find it was caused by my wrong use of parameters in time. I did not add any redundant parameters at the beginning, but I still seemed to encounter this problem. And then I found that one reason was that I was misleading by the output.

The first time I use CPD, the command is:

 \pmd-bin-7.0.0-rc3\bin> .\pmd.bat cpd  --minimum-tokens 45 --dir  "...\example"

the test file is like:

public class Example{
    public void theFirstMethod(List<String> strList) {
        for (String str : strList) {
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
        }
    }

    public void theSecondMethod(List<String> strList) {
        for (String str : strList) {
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
        }
    }
}

and the result is:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/C:/Users.../pmd-bin-7.0.0-rc3/lib/groovy-2.4.21.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
Found a 9 line (58 tokens) duplication in the following files:
Starting at line 2 of ...\example\Example.java
Starting at line 12 of...\example\Example.java

    public void theFirstMethod(List<String> strList) {
        for (String str : strList) {
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
        }
    }

As we can see, the source codes on line 2(method name is theFirstMethod) and line 12(method name is theSecondMethod) are different, and the output code public void theFirstMethod should not appear. So I begin to try args like --ignore-identifiers false--ignore-identifiers true, and of cource it does not work.

Now I realize it's because the duplicates starts in one of the columns of the starting line. If I use--format xml, it will be show:

...\pmd-bin-7.0.0-rc3\bin> .\pmd.bat cpd  --minimum-tokens 45 --dir  "...\example" --format xml
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:.../pmd-bin-7.0.0-rc3/lib/groovy-2.4.21.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file path="...\example\Example.java"
         totalNumberOfTokens="127"/>
   <duplication lines="9" tokens="58">
      <file begintoken="7"
            column="31"
            endcolumn="6"
            endline="10"
            endtoken="64"
            line="2"
            path="...\example\Example.java"/>
      <file begintoken="68"
            column="32"
            endcolumn="6"
            endline="20"
            endtoken="125"
            line="12"
            path="...\example\Example.java"/>
      <codefragment><![CDATA[    public void theFirstMethod(List<String> strList) {
        for (String str : strList) {
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
        }
    }]]></codefragment>
   </duplication>
</pmd-cpd>

the column has been shown. However, the codefragment still have public void theFirstMethod which belongs to different content.

How about when the output is accurate to line(with no column), if the repeating code starts at the beginning of the line, the beginning line counts as this line, and when the repeating code starts in the middle of the line, the beginning line counts as the next line?

For example, in this case, when the command is

 \pmd-bin-7.0.0-rc3\bin> .\pmd.bat cpd  --minimum-tokens 45 --dir  "...\example"

the result is:

Starting at line 3 of ...\example\Example.java
Starting at line 13 of...\example\Example.java
        for (String str : strList) {
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
            System.out.println(str);
        }
    }

Best wishes!

@Scrates1
Copy link
Author

Scrates1 commented Sep 23, 2023

And in

If tested a bit further, and with the current SNAPSHOT version of PMD 7

Could you please paste the url here of current SNAPSHOT version of PMD 7, I can't find this version and it seems that it is not pmd-bin-7.0.0-rc3, because I didn't see that the language "anguage" doesn't exist when use -language java

\pmd-bin-7.0.0-rc3\bin> .\pmd.bat cpd -language java --format xml --minimum-tokens 45 -d "...\CommonMethodEnum"
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:.../pmd-bin-7.0.0-rc3/lib/groovy-2.4.21.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[main] INFO net.sourceforge.pmd.cli.commands.internal.AbstractPmdSubcommand - Log level is at INFO
<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd/>

Thanks a lot!

@jsotuyod
Copy link
Member

I started a separate issue #4725 to track the issue with the text renderer and start / end columns to keep things easier to track. I'm closing this as the original issue was addressed with better documentation already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in:cli Affects the PMD Command Line Interface in:cpd Affects the copy-paste detector in:documentation Affects the documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants