Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexical Preservation (including whitespace) #124

Closed
ptitjes opened this issue Mar 13, 2015 · 29 comments
Closed

Lexical Preservation (including whitespace) #124

ptitjes opened this issue Mar 13, 2015 · 29 comments
Assignees
Labels
Improvement Not a bug, but a way that JP can be be enhanced to work better.
Milestone

Comments

@ptitjes
Copy link
Contributor

ptitjes commented Mar 13, 2015

A long requested feature...

@ptitjes ptitjes added the Improvement Not a bug, but a way that JP can be be enhanced to work better. label Mar 13, 2015
@ptitjes ptitjes added this to the 3.0 milestone Mar 13, 2015
@ptitjes
Copy link
Contributor Author

ptitjes commented Mar 13, 2015

As part of #115.

@larsac07
Copy link

How can I help? Using javaparser in my master's thesis, and need lexical preservation for marking findings correctly in an editor.

@ptitjes
Copy link
Contributor Author

ptitjes commented Nov 24, 2015

@larsac07 I do not contribute to JavaParser anymore as it moved to dual-license with Apache License. As you are doing research, I see no problem that you use my LGPL-only clone in its lex-preserving branch (https://github.com/ptitjes/javaparser/tree/lex-preserving). As an alternative, you can use another of my projects (https://github.com/ptitjes/jlato) that also is LGPL-only, has lexical preservation and is much more compliant than this branch.

@rpau
Copy link
Contributor

rpau commented Nov 24, 2015

@larsac07 I have also lexical preservation for the walkmod project at https://github.com/rpau/javalang
It has LGPL v3.

@ptitjes
Copy link
Contributor Author

ptitjes commented Nov 24, 2015

@rpau Wrong URL! I guess you meant https://github.com/rpau/javalang, right ?

@ftomassetti
Copy link
Member

Hi @larsac07, if you intend to contribute to JavaParser I would be glad to discuss with you and help. For the sake of clarity, jlato is unrelated to JavaParser (it is a different project, a "competitor" if you want).

Consider that while we do not store whitespace we store the position of elements in the original source code.

JavaParser is currently a mature project, with an extended userbase and business-friendly license (GPL and Apache can both used) so a contribution to this project would have a wider reach. However if you have not already invested time using JavaParser and you need an already implemented solution there are alternatives. In that case I suggest you taking a look at https://github.com/rpau/javalang

@rpau
Copy link
Contributor

rpau commented Nov 24, 2015

@ptitjes yes, sorry :) I have upgraded the comment

@ptitjes
Copy link
Contributor Author

ptitjes commented Nov 24, 2015

@rpau :p BTW, how much lexical preservation do you have ? Can you parse a file, modify the ast and write it back as it was (modulo the changes) ?

@ftomassetti
Copy link
Member

Perhaps you can discuss the differences between your own projects on a different venue :)
I think it is fair to point to their existence but this a page of the JavaParser project.

@rpau
Copy link
Contributor

rpau commented Nov 24, 2015

No problem @ftomassetti :-)

@larsac07
Copy link

Thank you, guys! This really helps. @ptitjes and @rpau, I will have a look at your suggested repos. @ftomassetti, thank you for clarifying this. I must have missed that while experimenting.

Still, LP would be very helpful, so any suggestion as to where the problem lies would be much appreciated. If I find a solution, I can contribute my findings and make a PR if desirable.

@ftomassetti
Copy link
Member

I think JavaParser can be used for many different goals and not in all fo them PR makes sense. For example, if I have this statement:

myMethodCall(1, 2, foo +
                                         2*3); 

and I remove 2*3, what happens?
Should I get:

myMethodCall(1, 2, foo); 

or:

myMethodCall(1, 2 
                                 ); 

I do not know and I think there are many cases which are not obvious. Another example:

void foo() {
   bla();
      foo();
baz();
}

If I add a statement in the function foo, how much should I indent it? Should I use some standard value? Should I indent it as much as the statement following or the one preceeding?

LP is something that several users could desire but I am not sure everyone would want the same solution.

In my opinion it would be brilliant to have it as something optional. Looks at how comments works: they are extracted separately from the other AST nodes and added later (if the user wants them). We have already information on the position of tokens on from that we coul derive the presence of the whitespace characters. We could need to look into the original source to differentiate between spaces and tabs. Once we have extracted all the information about the whitespace in the original source we could have a variant of the DumpVisitor which uses that: instead of adding "standard" whitespaces it could put back the original whitespaces. Of course it would need some sort of logic to decide what to do around the nodes added/removed.

All in all I think it could be not too difficult to implement this feature and me and the other committers could help you and give feedback on that.

@ptitjes
Copy link
Contributor Author

ptitjes commented Nov 26, 2015

@ftomassetti
The white-space you give in your first example can either be attached to the + expression itself as a separator between the operator symbol and the right operand, be attached to the previous + symbol directly, be attached to the following 2 * 3 expression, or be split in two with one part being attached to the + symbol and the other to the right operand.

As for the AST modification you give as an example, note that you can't simply "remove 2 * 3" as this would make the surrounding expression incomplete (i.e. foo + ???). You could however

  • replace the foo + 2 * 3 by foo to get your first example result, in which case the white-space is simply lost as it can only be attached to the + expression or one of its right sub-expression, but in no way to the foo name sub-expression, or
  • remove the last argument of the surrounding method invocation, in which case the white-space is also simply lost.

I personally found out while developing JLaTo that the more flexible way is to consider every white-space to be split in one, two or three parts depending on whether the previous and next meaningful AST constituents were non-terminals and as such be considered to capture additional white-space, and on what white-space would be expected by a formatter at that location. This proved to be particularly efficient and versatile. This shows, contrary to what you said, that you can devise a way to handle modifications of lexically-preserving AST in a global and uniform way with just some common sense and some sensible defaults.

As for your proposal about using the parsed source locations to retrieve the original white-spaces, I warn you that this may not be a really good idea. As I explained in a comment 9 months ago (#22 (comment)), I produced something similar some years ago and realized this was really hard to make something stable.

Also, you have to understand that the positions (are mutable and) may change due to user manipulation and as such you cannot make an algorithm using source positions that can be fully trusted. A first step would be to make nodes' positions immutable.

@ftomassetti
Copy link
Member

My point was simply to present some examples to make clear that whitespace preservation is not trivial when the code is modified. Decisions have to be made and there are not clear cut rules.

If your comments help @larsac07 then I am thankful for them.

@matozoid
Copy link
Contributor

For me the usual thing goes: what is lost cannot be regained. Once information about the original look of the code is thrown out, it cannot be put back in place. So if we want to have lexical preservation as something optional, then it should be built into the core with a pretty printing step after it by default. When you turn pretty printing off, you will have your original formatting again.

The order of work should then be: make sure lexical preservation works. Add a pretty printer later on. When you go about it like this, the pretty printing part becomes the optional part - which makes sense to me.

@larsac07
Copy link

larsac07 commented Dec 9, 2015

Thanks for more info. @ftomassetti, I'm not sure you actually store the original position of elements correctly. The line number is always correct, but columns seem to consider a tab character as 8 spaces. Take a look at this example:

public class SomeClass {
...
    |<- this should be column 5, but is 9
    public static void main(String[] args) {
        |<- this should be column 9, but is 17
        int divider = 10;
        double[] values = new double[] {4, 3, 3};
        double entropy = calculateEntropy(divider, values);
        System.out.println("Entropy: " + entropy);
        double infoGain = calculateInformationGain(entropy,divider,new double[][] {{3,1,1},{1,2,2}});
        System.out.println("InfoGain: " + infoGain);
    }
...
}

Is this intended behavior?

@ftomassetti
Copy link
Member

Well, the same tab can be represented as any number of spaces so we have been using the arbitrary value 8. I am not sure what other options there are

@larsac07
Copy link

I see, of course. This is different for each IDE, right?

It could be solved with an option for how many columns a tab character is, I guess, but at the same time, that could clutter the API for such a small detail.

@ftomassetti
Copy link
Member

Not just every IDE, every user can typically configure that value and set it to 1, 3, 4, 8, one billion. I think it would make sense to make the value configurable also in JavaParser (if it is not already configurable)

@matozoid
Copy link
Contributor

Let's pick this up step by step. What would be the first steps to make? Can we make issues for those steps? Make sure that no step leaves the code unusable, so we can always release.

Hopefully this will break down this ball of unknowns.

@ptitjes
Copy link
Contributor Author

ptitjes commented Jan 21, 2016

BTW, you can toy with JavaCharStream.get/setTabSize() to change that value of 8 used by JavaCC.

@matozoid matozoid removed this from the 3.0 milestone Jul 15, 2016
@matozoid matozoid added this to the 4.0.0 milestone Jul 29, 2016
@ftomassetti
Copy link
Member

Maybe we can look into this article for ideas:

http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2011-027.pdf

@matozoid
Copy link
Contributor

Very interesting document, let's get going on that for version 4.0 :)

@theangrydev
Copy link

As mentioned in #606, I would be interested in seeing some options in the pretty printer around how to format the code. Something similar to the options an IDE gives you for auto formatting.

@matozoid
Copy link
Contributor

@ftomassetti made something interesting: http://tomassetti.me/lexical-preservation-javaparser/

Personally I'd prefer what's the in the PDF. It is comparable, but I think pointing to the original source by pointing at the actual tokens is more reliable, or more direct, than depending on our position information. The position information comes from the tokens :)

@ftomassetti
Copy link
Member

Yes, I see that using tokens would be a better solution from an engineering point of view.

I think the only benefit of my solution is to be compatible with JavaParser as it is.

Of we can get access to the tokens within a reasonable timeframe I am all for using them.

@matozoid
Copy link
Contributor

I do think it's awesome that you built something :)

It shouldn't be too hard to get the tokens in the nodes. Currently we're keeping track of the location of the begin and end token. We only have to start keeping track of the begin and end token instead of that - so mostly a search/replace.

@ftomassetti
Copy link
Member

I am now trying to adapt my experiment to use tokens: we start by saving the list of elements for each Node. Such elements are either placeholder for children or tokens. For example a class could have as elements: CLASS_KW, SPACE, {, SPACE, fieldA, SPACE ,} where fieldA is a placeholder for the child of type FieldDeclaration

@ftomassetti
Copy link
Member

Closed through #654

@matozoid matozoid added this to the next release milestone Mar 4, 2017
ftomassetti added a commit to ftomassetti/javaparser that referenced this issue Jan 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvement Not a bug, but a way that JP can be be enhanced to work better.
Projects
None yet
Development

No branches or pull requests

6 participants