Lexical Preservation (including whitespace) #124

ptitjes · 2015-03-13T18:22:31Z

A long requested feature...

ptitjes · 2015-03-13T18:23:26Z

As part of #115.

larsac07 · 2015-11-24T15:33:42Z

How can I help? Using javaparser in my master's thesis, and need lexical preservation for marking findings correctly in an editor.

ptitjes · 2015-11-24T15:40:09Z

@larsac07 I do not contribute to JavaParser anymore as it moved to dual-license with Apache License. As you are doing research, I see no problem that you use my LGPL-only clone in its lex-preserving branch (https://github.com/ptitjes/javaparser/tree/lex-preserving). As an alternative, you can use another of my projects (https://github.com/ptitjes/jlato) that also is LGPL-only, has lexical preservation and is much more compliant than this branch.

rpau · 2015-11-24T15:44:13Z

@larsac07 I have also lexical preservation for the walkmod project at https://github.com/rpau/javalang
It has LGPL v3.

ptitjes · 2015-11-24T15:46:17Z

@rpau Wrong URL! I guess you meant https://github.com/rpau/javalang, right ?

ftomassetti · 2015-11-24T15:46:46Z

Hi @larsac07, if you intend to contribute to JavaParser I would be glad to discuss with you and help. For the sake of clarity, jlato is unrelated to JavaParser (it is a different project, a "competitor" if you want).

Consider that while we do not store whitespace we store the position of elements in the original source code.

JavaParser is currently a mature project, with an extended userbase and business-friendly license (GPL and Apache can both used) so a contribution to this project would have a wider reach. However if you have not already invested time using JavaParser and you need an already implemented solution there are alternatives. In that case I suggest you taking a look at https://github.com/rpau/javalang

rpau · 2015-11-24T15:55:01Z

@ptitjes yes, sorry :) I have upgraded the comment

ptitjes · 2015-11-24T16:02:54Z

@rpau :p BTW, how much lexical preservation do you have ? Can you parse a file, modify the ast and write it back as it was (modulo the changes) ?

ftomassetti · 2015-11-24T16:05:28Z

Perhaps you can discuss the differences between your own projects on a different venue :)
I think it is fair to point to their existence but this a page of the JavaParser project.

rpau · 2015-11-24T16:13:42Z

No problem @ftomassetti :-)

larsac07 · 2015-11-25T09:05:40Z

Thank you, guys! This really helps. @ptitjes and @rpau, I will have a look at your suggested repos. @ftomassetti, thank you for clarifying this. I must have missed that while experimenting.

Still, LP would be very helpful, so any suggestion as to where the problem lies would be much appreciated. If I find a solution, I can contribute my findings and make a PR if desirable.

ftomassetti · 2015-11-26T18:04:09Z

I think JavaParser can be used for many different goals and not in all fo them PR makes sense. For example, if I have this statement:

myMethodCall(1, 2, foo +
                                         2*3);

and I remove 2*3, what happens?
Should I get:

myMethodCall(1, 2, foo);

or:

myMethodCall(1, 2 
                                 );

I do not know and I think there are many cases which are not obvious. Another example:

void foo() {
   bla();
      foo();
baz();
}

If I add a statement in the function foo, how much should I indent it? Should I use some standard value? Should I indent it as much as the statement following or the one preceeding?

LP is something that several users could desire but I am not sure everyone would want the same solution.

In my opinion it would be brilliant to have it as something optional. Looks at how comments works: they are extracted separately from the other AST nodes and added later (if the user wants them). We have already information on the position of tokens on from that we coul derive the presence of the whitespace characters. We could need to look into the original source to differentiate between spaces and tabs. Once we have extracted all the information about the whitespace in the original source we could have a variant of the DumpVisitor which uses that: instead of adding "standard" whitespaces it could put back the original whitespaces. Of course it would need some sort of logic to decide what to do around the nodes added/removed.

All in all I think it could be not too difficult to implement this feature and me and the other committers could help you and give feedback on that.

ptitjes · 2015-11-26T20:17:22Z

@ftomassetti
The white-space you give in your first example can either be attached to the + expression itself as a separator between the operator symbol and the right operand, be attached to the previous + symbol directly, be attached to the following 2 * 3 expression, or be split in two with one part being attached to the + symbol and the other to the right operand.

As for the AST modification you give as an example, note that you can't simply "remove 2 * 3" as this would make the surrounding expression incomplete (i.e. foo + ???). You could however

replace the foo + 2 * 3 by foo to get your first example result, in which case the white-space is simply lost as it can only be attached to the + expression or one of its right sub-expression, but in no way to the foo name sub-expression, or
remove the last argument of the surrounding method invocation, in which case the white-space is also simply lost.

I personally found out while developing JLaTo that the more flexible way is to consider every white-space to be split in one, two or three parts depending on whether the previous and next meaningful AST constituents were non-terminals and as such be considered to capture additional white-space, and on what white-space would be expected by a formatter at that location. This proved to be particularly efficient and versatile. This shows, contrary to what you said, that you can devise a way to handle modifications of lexically-preserving AST in a global and uniform way with just some common sense and some sensible defaults.

As for your proposal about using the parsed source locations to retrieve the original white-spaces, I warn you that this may not be a really good idea. As I explained in a comment 9 months ago (#22 (comment)), I produced something similar some years ago and realized this was really hard to make something stable.

Also, you have to understand that the positions (are mutable and) may change due to user manipulation and as such you cannot make an algorithm using source positions that can be fully trusted. A first step would be to make nodes' positions immutable.

ftomassetti · 2015-11-27T10:01:40Z

My point was simply to present some examples to make clear that whitespace preservation is not trivial when the code is modified. Decisions have to be made and there are not clear cut rules.

If your comments help @larsac07 then I am thankful for them.

matozoid · 2015-11-29T12:26:44Z

For me the usual thing goes: what is lost cannot be regained. Once information about the original look of the code is thrown out, it cannot be put back in place. So if we want to have lexical preservation as something optional, then it should be built into the core with a pretty printing step after it by default. When you turn pretty printing off, you will have your original formatting again.

The order of work should then be: make sure lexical preservation works. Add a pretty printer later on. When you go about it like this, the pretty printing part becomes the optional part - which makes sense to me.

larsac07 · 2015-12-09T14:26:53Z

Thanks for more info. @ftomassetti, I'm not sure you actually store the original position of elements correctly. The line number is always correct, but columns seem to consider a tab character as 8 spaces. Take a look at this example:

public class SomeClass {
...
    |<- this should be column 5, but is 9
    public static void main(String[] args) {
        |<- this should be column 9, but is 17
        int divider = 10;
        double[] values = new double[] {4, 3, 3};
        double entropy = calculateEntropy(divider, values);
        System.out.println("Entropy: " + entropy);
        double infoGain = calculateInformationGain(entropy,divider,new double[][] {{3,1,1},{1,2,2}});
        System.out.println("InfoGain: " + infoGain);
    }
...
}

Is this intended behavior?

ftomassetti · 2015-12-10T07:44:52Z

Well, the same tab can be represented as any number of spaces so we have been using the arbitrary value 8. I am not sure what other options there are

larsac07 · 2015-12-10T23:27:37Z

I see, of course. This is different for each IDE, right?

It could be solved with an option for how many columns a tab character is, I guess, but at the same time, that could clutter the API for such a small detail.

ftomassetti · 2015-12-14T09:20:45Z

Not just every IDE, every user can typically configure that value and set it to 1, 3, 4, 8, one billion. I think it would make sense to make the value configurable also in JavaParser (if it is not already configurable)

matozoid · 2016-01-21T08:47:27Z

Let's pick this up step by step. What would be the first steps to make? Can we make issues for those steps? Make sure that no step leaves the code unusable, so we can always release.

Hopefully this will break down this ball of unknowns.

ptitjes · 2016-01-21T11:19:01Z

BTW, you can toy with JavaCharStream.get/setTabSize() to change that value of 8 used by JavaCC.

ftomassetti · 2016-11-11T08:44:20Z

Maybe we can look into this article for ideas:

http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2011-027.pdf

matozoid · 2016-11-11T21:31:55Z

Very interesting document, let's get going on that for version 4.0 :)

theangrydev · 2016-12-05T17:46:04Z

As mentioned in #606, I would be interested in seeing some options in the pretty printer around how to format the code. Something similar to the options an IDE gives you for auto formatting.

matozoid · 2016-12-13T09:16:52Z

@ftomassetti made something interesting: http://tomassetti.me/lexical-preservation-javaparser/

Personally I'd prefer what's the in the PDF. It is comparable, but I think pointing to the original source by pointing at the actual tokens is more reliable, or more direct, than depending on our position information. The position information comes from the tokens :)

ftomassetti · 2016-12-13T09:21:27Z

Yes, I see that using tokens would be a better solution from an engineering point of view.

I think the only benefit of my solution is to be compatible with JavaParser as it is.

Of we can get access to the tokens within a reasonable timeframe I am all for using them.

matozoid · 2016-12-13T09:45:10Z

I do think it's awesome that you built something :)

It shouldn't be too hard to get the tokens in the nodes. Currently we're keeping track of the location of the begin and end token. We only have to start keeping track of the begin and end token instead of that - so mostly a search/replace.

ftomassetti · 2016-12-17T09:10:20Z

I am now trying to adapt my experiment to use tokens: we start by saving the list of elements for each Node. Such elements are either placeholder for children or tokens. For example a class could have as elements: CLASS_KW, SPACE, {, SPACE, fieldA, SPACE ,} where fieldA is a placeholder for the child of type FieldDeclaration

ftomassetti · 2017-02-26T17:58:49Z

Closed through #654

TypeUsage is now Type

ptitjes added the Improvement Not a bug, but a way that JP can be be enhanced to work better. label Mar 13, 2015

ptitjes added this to the 3.0 milestone Mar 13, 2015

ptitjes mentioned this issue Mar 13, 2015

Lex preserving (Feature Request #124) #125

Closed

matozoid removed this from the 3.0 milestone Jul 15, 2016

matozoid added this to the 4.0.0 milestone Jul 29, 2016

matozoid mentioned this issue Dec 5, 2016

DumpVisitor should use the Position information to insert appropriate new lines #606

Closed

matozoid mentioned this issue Dec 7, 2016

Refactor list of tokens #614

Closed

matozoid mentioned this issue Dec 14, 2016

Comment support refactoring approach brainstorm issue #638

Open

8 tasks

matozoid removed this from the 4.0.0 milestone Jan 10, 2017

matozoid assigned ftomassetti Feb 1, 2017

ftomassetti mentioned this issue Feb 25, 2017

POC for preservation of unicode escapes in source code ftomassetti/javaparser#5

Closed

ftomassetti closed this as completed Feb 26, 2017

matozoid added this to the next release milestone Mar 4, 2017

ftomassetti added a commit to ftomassetti/javaparser that referenced this issue Jan 10, 2018

Merge pull request javaparser#124 from pevers/patch-2

83cbe21

TypeUsage is now Type

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lexical Preservation (including whitespace) #124

Lexical Preservation (including whitespace) #124

ptitjes commented Mar 13, 2015

ptitjes commented Mar 13, 2015

larsac07 commented Nov 24, 2015

ptitjes commented Nov 24, 2015

rpau commented Nov 24, 2015

ptitjes commented Nov 24, 2015

ftomassetti commented Nov 24, 2015

rpau commented Nov 24, 2015

ptitjes commented Nov 24, 2015

ftomassetti commented Nov 24, 2015

rpau commented Nov 24, 2015

larsac07 commented Nov 25, 2015

ftomassetti commented Nov 26, 2015

ptitjes commented Nov 26, 2015

ftomassetti commented Nov 27, 2015

matozoid commented Nov 29, 2015

larsac07 commented Dec 9, 2015

ftomassetti commented Dec 10, 2015

larsac07 commented Dec 10, 2015

ftomassetti commented Dec 14, 2015

matozoid commented Jan 21, 2016

ptitjes commented Jan 21, 2016

ftomassetti commented Nov 11, 2016

matozoid commented Nov 11, 2016

theangrydev commented Dec 5, 2016

matozoid commented Dec 13, 2016

ftomassetti commented Dec 13, 2016

matozoid commented Dec 13, 2016

ftomassetti commented Dec 17, 2016

ftomassetti commented Feb 26, 2017

Lexical Preservation (including whitespace) #124

Lexical Preservation (including whitespace) #124

Comments

ptitjes commented Mar 13, 2015

ptitjes commented Mar 13, 2015

larsac07 commented Nov 24, 2015

ptitjes commented Nov 24, 2015

rpau commented Nov 24, 2015

ptitjes commented Nov 24, 2015

ftomassetti commented Nov 24, 2015

rpau commented Nov 24, 2015

ptitjes commented Nov 24, 2015

ftomassetti commented Nov 24, 2015

rpau commented Nov 24, 2015

larsac07 commented Nov 25, 2015

ftomassetti commented Nov 26, 2015

ptitjes commented Nov 26, 2015

ftomassetti commented Nov 27, 2015

matozoid commented Nov 29, 2015

larsac07 commented Dec 9, 2015

ftomassetti commented Dec 10, 2015

larsac07 commented Dec 10, 2015

ftomassetti commented Dec 14, 2015

matozoid commented Jan 21, 2016

ptitjes commented Jan 21, 2016

ftomassetti commented Nov 11, 2016

matozoid commented Nov 11, 2016

theangrydev commented Dec 5, 2016

matozoid commented Dec 13, 2016

ftomassetti commented Dec 13, 2016

matozoid commented Dec 13, 2016

ftomassetti commented Dec 17, 2016

ftomassetti commented Feb 26, 2017