Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[plsql,tsql] Fix CPD being case sensitive in PLSQL and TSQL #4943

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
44f29c3
Fix #4396 - Fix PLSQL CPD being case-sensitive
oowekyala Apr 7, 2024
72408ca
Normalize image of PLSQL tokens to uppercase, reuse strings
oowekyala Apr 8, 2024
1c23df7
Fix some weird things in PLSQL tokens
oowekyala Apr 8, 2024
0cb2e37
Update reference files
oowekyala Apr 8, 2024
ab80b24
Also add this ability for Antlr lexers, adapt TSQL
oowekyala Apr 8, 2024
41c0135
Fix things
oowekyala Apr 9, 2024
f484c75
Add test for PLSQL ignore literals
oowekyala Apr 9, 2024
8f1e6b0
Fix exclusive end index in antlr token
oowekyala Apr 9, 2024
835abc8
Add back ctor for compatibility
oowekyala Apr 9, 2024
10dfb45
Replace numbers with names
oowekyala Apr 9, 2024
c482666
Merge branch 'master' into issue4396-cpd-case-sensitive
oowekyala Apr 21, 2024
06eb7ea
review comments
oowekyala Apr 21, 2024
d45aef8
Apply suggestions from code review
oowekyala Apr 21, 2024
b56925f
Merge remote-tracking branch 'origin/issue4396-cpd-case-sensitive' in…
oowekyala Apr 21, 2024
f4e7541
Treat unquotable identifiers as unquoted in PLSQL
oowekyala Apr 21, 2024
b931c2f
Fix name of String literal token
oowekyala Apr 21, 2024
95721ef
Trick javacc into giving string literal a non-literal image
oowekyala Apr 21, 2024
838df27
Normalize token images also in PMD parser
oowekyala Apr 21, 2024
75e50df
Make @Image have old behavior, remove KEYWORD_UNRESERVED from tree
oowekyala Apr 21, 2024
7484186
Merge branch 'master' into issue4396-cpd-case-sensitive
oowekyala May 11, 2024
2e9aa06
Fix API compatibility
oowekyala May 11, 2024
ffc71e8
Don"t add publicly supported API
oowekyala May 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@

import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrLexerBehavior;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrToken;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrTokenManager;
import net.sourceforge.pmd.lang.document.TextDocument;
Expand All @@ -23,7 +24,15 @@ public abstract class AntlrCpdLexer extends CpdLexerBase<AntlrToken> {
@Override
protected final TokenManager<AntlrToken> makeLexerImpl(TextDocument doc) throws IOException {
CharStream charStream = CharStreams.fromReader(doc.newReader(), doc.getFileId().getAbsolutePath());
return new AntlrTokenManager(getLexerForSource(charStream), doc);
return new AntlrTokenManager(getLexerForSource(charStream), doc, getLexerBehavior());
}

/**
* Override this method to customize some aspects of the
* lexer.
*/
protected AntlrLexerBehavior getLexerBehavior() {
return new AntlrLexerBehavior();
}

protected abstract Lexer getLexerForSource(CharStream charStream);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
/**
* BSD-style license; for more info see http://pmd.sourceforge.net/license.html
*/

package net.sourceforge.pmd.lang.ast.impl.antlr4;

import org.antlr.v4.runtime.Token;

import net.sourceforge.pmd.cpd.CpdLanguageProperties;

/**
* Strategy to customize some aspects of the mapping
* from Antlr tokens to PMD/CPD tokens.
*/
public class AntlrLexerBehavior {


/**
* Return the image that the token should have, possibly applying a transformation.
* The default just returns {@link Token#getText()}.
* Transformations here are usually normalizations, for instance, mapping
* the image of all keywords to uppercase/lowercase to implement case-insensitivity,
* or replacing the image of literals by a placeholder to implement {@link CpdLanguageProperties#CPD_ANONYMIZE_LITERALS}.
Copy link
Member

@jsotuyod jsotuyod Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to now these things would have been done in the token filter, which is supported by both Javacc and antlr.

I'm a little wary of adding yet another mechanism that can overlap. The only reason to get something into the lexer itself (the way Javacc is implemented) is because the behavior should apply both to cpd and pmd, as it's inherent to the language. The way this is hooked into antlr languages it's only applicable to cpd.

Yes, the tokenfilter was a catch all that made for some complex implementations and would probably use some cleanup, but I'm unsure this is is.

I feel we need to have a better defined boundary / limit extension points. Once this is published we will have to support it. I'd love to see this behavior pushed into the lexer rather the token / token manager as is the case for Javacc.

Copy link
Member Author

@oowekyala oowekyala Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see this behavior pushed into the lexer rather the token / token manager as is the case for Javacc.

Do you mean having this behavior pushed into the Antlr lexer, so that the Antlr parser also sees the normalized image? In that case I agree, it would be nicer. I can look into it.

I think overall using the token filter to do that is an abuse when the language itself applies specific normalizations. That's why I believe our Javacc extension point is justified, and we need a mirror extension point for Antlr.

Edit: I see now that on the line you commented on, the reference to CPD_ANONYMIZE_LITERALS should be removed, as this part should be exclusive to the CpdLexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you got my intent right.

Try and push this down to the antlr lexer, the same way the Javacc one works. It will seamlessly apply to pmd and cpd, make it non optional (it's language behavior), and exclude it as an extension point.

Other cpd specific behaviors such as anonymize literals should be done in cpd specific code. Right now that is the token filter, but as we both agree, at some point that api will need to evolve.

*
* @param token A token from the Antlr Lexer
*
* @return The image
*/
protected String getTokenImage(Token token) {
return token.getText();
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,13 @@
*/
public class AntlrToken implements GenericToken<AntlrToken> {

private final Token token;
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
private final AntlrToken previousComment;
private final TextDocument textDoc;
private final String image;
private final int endOffset;
private final int startOffset;
private final int channel;
private final int kind;
AntlrToken next;


Expand All @@ -30,10 +34,22 @@ public class AntlrToken implements GenericToken<AntlrToken> {
* @param previousComment The previous comment
* @param textDoc The text document
*/
public AntlrToken(final Token token, final AntlrToken previousComment, TextDocument textDoc) {
this.token = token;
AntlrToken(final Token token, final AntlrToken previousComment, TextDocument textDoc, AntlrLexerBehavior behavior) {
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
this.previousComment = previousComment;
this.textDoc = textDoc;
this.image = behavior.getTokenImage(token);
this.startOffset = token.getStartIndex();
this.endOffset = token.getStopIndex() + 1; // exclusive
this.channel = token.getChannel();
this.kind = token.getType();
}

/**
* @deprecated Don't create antlr tokens directly, use an {@link AntlrTokenManager}
*/
@Deprecated
public AntlrToken(final Token token, final AntlrToken previousComment, TextDocument textDoc) {
this(token, previousComment, textDoc, new AntlrLexerBehavior());
}

@Override
Expand All @@ -48,13 +64,13 @@ public AntlrToken getPreviousComment() {

@Override
public CharSequence getImageCs() {
return token.getText();
return image;
}

/** Returns a text region with the coordinates of this token. */
@Override
public TextRegion getRegion() {
return TextRegion.fromBothOffsets(token.getStartIndex(), token.getStopIndex() + 1);
return TextRegion.fromBothOffsets(startOffset, endOffset);
}

@Override
Expand All @@ -74,14 +90,14 @@ public int compareTo(AntlrToken o) {

@Override
public int getKind() {
return token.getType();
return kind;
}

public boolean isHidden() {
return !isDefault();
}

public boolean isDefault() {
return token.getChannel() == Lexer.DEFAULT_TOKEN_CHANNEL;
return channel == Lexer.DEFAULT_TOKEN_CHANNEL;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,20 @@ public class AntlrTokenManager implements TokenManager<AntlrToken> {

private final Lexer lexer;
private final TextDocument textDoc;
private final AntlrLexerBehavior behavior;
private AntlrToken previousToken;


public AntlrTokenManager(final Lexer lexer, final TextDocument textDocument) {
this(lexer, textDocument, new AntlrLexerBehavior());
}

public AntlrTokenManager(final Lexer lexer,
final TextDocument textDocument,
final AntlrLexerBehavior behavior) {
this.lexer = lexer;
this.textDoc = textDocument;
this.behavior = behavior;
resetListeners();
}

Expand All @@ -40,7 +48,7 @@ public AntlrToken getNextToken() {

private AntlrToken getNextTokenFromAnyChannel() {
final AntlrToken previousComment = previousToken != null && previousToken.isHidden() ? previousToken : null;
final AntlrToken currentToken = new AntlrToken(lexer.nextToken(), previousComment, textDoc);
final AntlrToken currentToken = new AntlrToken(lexer.nextToken(), previousComment, textDoc, this.behavior);
if (previousToken != null) {
previousToken.next = currentToken;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,11 @@ public String getImage() {
return image.toString();
}

/** Returns the original text of the token. The image may be normalized. */
public Chars getText() {
return document.getTextDocument().sliceOriginalText(getRegion());
}

@Override
public final TextRegion getRegion() {
return TextRegion.fromBothOffsets(startOffset, endOffset);
Expand Down