Lexical Preservation (including whitespace) #124
As part of #115. |
How can I help? Using javaparser in my master's thesis, and need lexical preservation for marking findings correctly in an editor. |
@larsac07 I do not contribute to JavaParser anymore as it moved to dual-license with Apache License. As you are doing research, I see no problem that you use my LGPL-only clone in its lex-preserving branch (https://github.com/ptitjes/javaparser/tree/lex-preserving). As an alternative, you can use another of my projects (https://github.com/ptitjes/jlato) that also is LGPL-only, has lexical preservation and is much more compliant than this branch. |
@larsac07 I have also lexical preservation for the walkmod project at https://github.com/rpau/javalang |
@rpau Wrong URL! I guess you meant https://github.com/rpau/javalang, right? |
Hi @larsac07, if you intend to contribute to JavaParser I would be glad to discuss with you and help. For the sake of clarity, jlato is unrelated to JavaParser (it is a different project, a "competitor" if you want). Consider that while we do not store whitespace, we do store the position of elements in the original source code. JavaParser is currently a mature project with an extended user base and a business-friendly license (both GPL and Apache can be used), so a contribution to this project would have a wider reach. However, if you have not already invested time using JavaParser and you need an already implemented solution, there are alternatives. In that case I suggest you take a look at https://github.com/rpau/javalang |
@ptitjes yes, sorry :) I have upgraded the comment |
@rpau :p BTW, how much lexical preservation do you have? Can you parse a file, modify the AST and write it back as it was (modulo the changes)? |
Perhaps you can discuss the differences between your own projects on a different venue :) |
No problem @ftomassetti :-) |
Thank you, guys! This really helps. @ptitjes and @rpau, I will have a look at your suggested repos. @ftomassetti, thank you for clarifying this. I must have missed that while experimenting. Still, LP would be very helpful, so any suggestion as to where the problem lies would be much appreciated. If I find a solution, I can contribute my findings and make a PR if desirable. |
I think JavaParser can be used for many different goals, and not in all of them does a PR make sense. For example, if I have this statement:
and I remove
or:
I do not know, and I think there are many cases which are not obvious. Another example:
If I add a statement in the function... LP is something that several users could desire, but I am not sure everyone would want the same solution. In my opinion it would be brilliant to have it as something optional. Look at how comments work: they are extracted separately from the other AST nodes and added later (if the user wants them). We already have information on the position of tokens, and from that we could derive the presence of the whitespace characters. We would need to look into the original source to differentiate between spaces and tabs. Once we have extracted all the information about the whitespace in the original source, we could have a variant of the DumpVisitor which uses it: instead of adding "standard" whitespace it could put back the original whitespace. Of course it would need some sort of logic to decide what to do around the nodes added/removed. All in all I think it might not be too difficult to implement this feature, and the other committers and I could help you and give feedback on that. |
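The approach proposed above, recovering the exact whitespace from the original source between token positions, could look roughly like this. This is a hypothetical sketch, not actual JavaParser code; it assumes token positions have already been mapped to 0-based character offsets into the source string.

```java
// Hypothetical sketch: recover the verbatim whitespace (spaces vs. tabs vs.
// newlines) between two consecutive tokens by slicing the original source,
// instead of regenerating "standard" whitespace while dumping the AST.
public class WhitespaceRecovery {

    // Returns the verbatim text between the end of the previous token and the
    // start of the next one. Offsets are 0-based character indices; a real
    // implementation would derive them from (line, column) token positions.
    public static String originalWhitespaceBetween(String source,
                                                   int endOfPrevToken,
                                                   int startOfNextToken) {
        String gap = source.substring(endOfPrevToken, startOfNextToken);
        // Sanity check: only whitespace should separate two adjacent tokens.
        if (!gap.chars().allMatch(Character::isWhitespace)) {
            throw new IllegalStateException("Non-whitespace between tokens: " + gap);
        }
        return gap;
    }

    public static void main(String[] args) {
        String source = "int\tx =  1;";
        // The gap between "int" (ends at offset 3) and "x" (starts at 4) is a tab.
        System.out.println("[" + originalWhitespaceBetween(source, 3, 4) + "]");
    }
}
```

A dumping visitor could then emit these recovered gaps verbatim, falling back to formatter defaults only around added or removed nodes.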
@ftomassetti As for the AST modification you give as an example, note that you can't simply "remove
I personally found out while developing JLaTo that the most flexible way is to consider every white-space run as split into one, two or three parts, depending on whether the previous and next meaningful AST constituents are non-terminals (and as such can capture additional white-space), and on what white-space a formatter would expect at that location. This proved to be particularly efficient and versatile. It shows that, contrary to what you said, you can devise a way to handle modifications of a lexically-preserving AST in a global and uniform way with just some common sense and sensible defaults. As for your proposal about using the parsed source locations to retrieve the original white-space, I warn you that this may not be a good idea. As I explained in a comment 9 months ago (#22 (comment)), I produced something similar some years ago and realized it was really hard to make it stable. Also, you have to understand that the positions are mutable and may change due to user manipulation, so you cannot fully trust an algorithm that relies on source positions. A first step would be to make nodes' positions immutable. |
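For illustration, one possible reading of the "one, two or three parts" idea (hypothetical code, not JLaTo's actual implementation) is to split the whitespace run between two nodes at its first and last newline: the head stays attached to the previous node, the tail becomes the next node's indentation, and any blank lines in between are kept separately.

```java
// Hypothetical sketch: split a whitespace gap into up to three parts.
// part[0]: trailing whitespace owned by the previous node (through the first newline)
// part[1]: middle whitespace (blank lines), owned by the enclosing construct
// part[2]: leading indentation owned by the next node (after the last newline)
public class WhitespaceSplit {

    public static String[] split(String gap) {
        int firstNl = gap.indexOf('\n');
        if (firstNl < 0) {
            // Single-line gap: one part only, attached to the previous node.
            return new String[] { gap, "", "" };
        }
        int lastNl = gap.lastIndexOf('\n');
        return new String[] {
            gap.substring(0, firstNl + 1),
            gap.substring(firstNl + 1, lastNl + 1),
            gap.substring(lastNl + 1)
        };
    }
}
```

With such a split, moving or deleting a node can carry "its" whitespace along naturally, which is one way to get the uniform handling described above.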
My point was simply to present some examples to make clear that whitespace preservation is not trivial when the code is modified. Decisions have to be made and there are not clear cut rules. If your comments help @larsac07 then I am thankful for them. |
For me the usual thing goes: what is lost cannot be regained. Once information about the original look of the code is thrown out, it cannot be put back in place. So if we want to have lexical preservation as something optional, then it should be built into the core with a pretty printing step after it by default. When you turn pretty printing off, you will have your original formatting again. The order of work should then be: make sure lexical preservation works. Add a pretty printer later on. When you go about it like this, the pretty printing part becomes the optional part - which makes sense to me. |
Thanks for more info. @ftomassetti, I'm not sure you actually store the original position of elements correctly. The line number is always correct, but columns seem to treat a tab character as 8 spaces. Take a look at this example:

```java
public class SomeClass {
    ...
    // <- "public" should be at column 5, but is reported at 9
    public static void main(String[] args) {
        // <- "int" should be at column 9, but is reported at 17
        int divider = 10;
        double[] values = new double[] {4, 3, 3};
        double entropy = calculateEntropy(divider, values);
        System.out.println("Entropy: " + entropy);
        double infoGain = calculateInformationGain(entropy, divider, new double[][] {{3,1,1},{1,2,2}});
        System.out.println("InfoGain: " + infoGain);
    }
    ...
}
```

Is this intended behavior? |
Well, the same tab can be represented as any number of spaces so we have been using the arbitrary value 8. I am not sure what other options there are |
I see, of course. This is different for each IDE, right? It could be solved with an option for how many columns a tab character is, I guess, but at the same time, that could clutter the API for such a small detail. |
Not just every IDE, every user can typically configure that value and set it to 1, 3, 4, 8, one billion. I think it would make sense to make the value configurable also in JavaParser (if it is not already configurable) |
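A configurable tab width would only affect how columns are computed from the raw character offsets. A minimal sketch (hypothetical helper, not JavaParser's actual API) that reproduces the discrepancy reported above:

```java
// Hypothetical sketch: compute the 1-based column of the character at index
// `offset` within `line`, expanding each tab stop to the next multiple of
// `tabSize`, the way most editors do.
public class ColumnCalculator {

    public static int columnAt(String line, int offset, int tabSize) {
        int col = 1;
        for (int i = 0; i < offset; i++) {
            if (line.charAt(i) == '\t') {
                col += tabSize - ((col - 1) % tabSize); // jump to next tab stop
            } else {
                col++;
            }
        }
        return col;
    }
}
```

With `tabSize = 8`, a line starting with one tab puts its first token at column 9 (and two tabs at column 17); with `tabSize = 4` the same offsets yield columns 5 and 9, matching what the example above expected.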
Let's pick this up step by step. What would be the first steps to make? Can we make issues for those steps? Make sure that no step leaves the code unusable, so we can always release. Hopefully this will break down this ball of unknowns. |
BTW, you can toy with JavaCharStream.get/setTabSize() to change that value of 8 used by JavaCC. |
Maybe we can look into this article for ideas: http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2011-027.pdf |
Very interesting document, let's get going on that for version 4.0 :) |
As mentioned in #606, I would be interested in seeing some options in the pretty printer around how to format the code. Something similar to the options an IDE gives you for auto formatting. |
@ftomassetti made something interesting: http://tomassetti.me/lexical-preservation-javaparser/ Personally I'd prefer what's in the PDF. It is comparable, but I think pointing to the original source via the actual tokens is more reliable, or at least more direct, than depending on our position information. The position information comes from the tokens anyway :) |
Yes, I see that using tokens would be a better solution from an engineering point of view. I think the only benefit of my solution is that it is compatible with JavaParser as it is. If we can get access to the tokens within a reasonable timeframe, I am all for using them. |
I do think it's awesome that you built something :) It shouldn't be too hard to get the tokens into the nodes. Currently we're keeping track of the location of the begin and end tokens. We only have to start keeping track of the tokens themselves instead, so it's mostly a search/replace. |
I am now trying to adapt my experiment to use tokens: we start by saving the list of elements for each Node. Such elements are either placeholders for children or tokens. For example a class could have as elements: |
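A minimal sketch of that element-list idea (hypothetical types, not the actual implementation): each node stores an ordered list whose entries are either verbatim tokens or placeholders for child nodes, and re-printing a node concatenates the tokens verbatim while recursing into the children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-node element lists for lexical preservation.
// A TokenElement carries verbatim text (which may include the surrounding
// whitespace); a ChildElement defers to a child node's own element list.
public class NodeElements {

    interface Element { String print(); }

    static class TokenElement implements Element {
        final String text;
        TokenElement(String text) { this.text = text; }
        public String print() { return text; }
    }

    static class ChildElement implements Element {
        final NodeElements child;
        ChildElement(NodeElements child) { this.child = child; }
        public String print() { return child.print(); }
    }

    final List<Element> elements = new ArrayList<>();

    NodeElements add(Element e) { elements.add(e); return this; }

    String print() {
        StringBuilder sb = new StringBuilder();
        for (Element e : elements) sb.append(e.print());
        return sb.toString();
    }

    public static void main(String[] args) {
        // "class A { }" where the body is a separate child node.
        NodeElements body = new NodeElements()
                .add(new TokenElement("{ "))
                .add(new TokenElement("}"));
        NodeElements clazz = new NodeElements()
                .add(new TokenElement("class "))
                .add(new TokenElement("A "))
                .add(new ChildElement(body));
        System.out.println(clazz.print()); // prints "class A { }"
    }
}
```

Editing the AST then means editing the child placeholders, while untouched tokens keep their original text and whitespace.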
Closed through #654 |