-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for line/column/indentation based languages #5
Comments
My toy experiment
proved to me that trying to shoehorn a line-based language in a CFG parser is not feasible. It's too bad, as I could really use a parser for the structure inside a line (not today, but likely tomorrow). I think the whole problem starts with the parsing theory seeing input as a contiguous stream of bytes/characters, while in reality, humans think and work in lines of text. To compensate, we throw big iron like GLR and GLL parsers at the stream to reconstruct the line structure. Your grammar specification already proves it. Most grammar lines fit on a single line, the terminating My experiment above runs above LR(1) parsing capabilities, while in a line-based setting it would be mostly trivial something like
where However, I don't know of any tool that allows me to write this in a convenient way. In a line-based setting, you quickly end up in 'sections' of text. I have |
I am attempting to implement an indentation-based language using parglare. I believe that I have come up with a working solution that uses a combination of custom recognizers/actions to examine/manipulate global variables that keep track of the indentation level (among other things). Keeping this extra parsing state in global variables feels a little clumsy and obviously will not work with GLR. I would rather pass in a custom state object when the parser is created, and have the parser pass it to each recognizer and action as an argument. I would think that a GLR parser could clone this custom state object along with the rest of the parser state when the parser is forked. Would the GSSNode class be the place to store this custom state object? I have started trying to implement this, but I would like some feedback before spending too much time going down this road. Thanks! |
Hmm, it all still sounds pretty complicated. Custom state object for something as simple as indent managing, which pretty much boils down to tracking number of "{" brackets, which a parser is perfectly capable to handle, isn't it? My solution to the section languageI had to implement my language, and nothing nice parser-tool crossed my path. (Maybe a packrat(?) parser could have worked, not sure.) In the end I opted for a top-down parser (simple LL(1)) implemented in pure Python functions. It has 3 levels.
(In the actual implementation, the top-level grouped lines of the same section together, and then handed them over to the middle parser for handling creation of the section object, but that is probably just more convenient for a manually written parser.) Call stack is top -> middle -> bottom, of course. Stacked parsers?As a thought-experiment, what would happen if I implement line-parsing (ie middle+bottom) as a standard parser *seeing EOL as EOF, and implement the top level as another parser that takes matched lines from the line-parser as tokens to implement the sections (or indent levels)? In other words, I am trying to keep the indentation (or section) management in a parser by stacking it on top of a line-parser. It would work for my section language, I think except maybe for passing section information down. Of course, if you can hack a parser-tool, you can fold the section-language and the line-languages (one for each section in general) all in one grammar specification, making it much nicer. More questionsSwitching between sections are lines of their own, I implemented that in the top-level by recognizing the section header lines only, and treating all other lines as 'lines in section X'. This may be tricky in a proper parser setting? I am not sure if indenting could be done in a parser at all without external indent-level tracking. Say you have a language consisting of '-' and 'a' only (every other character is ignored), could you write a parser that groups 'a' based on the length of '-' prefixes?
to More wonderingMore general, isn't it a case of embedding a lines language inside a section language (or Python line language inside an indenting language?) Embedding has know solutions, doesn't it? Maybe my section header lines should be part of the section language rather than the line language. |
@jwcraftsman Hi and thanks for the interest in parglare. I agree with @alberth that keeping state seems complicated. I don't know the details to reason about whether you are on the right path. Maybe that's the only way to do it. I haven't put much thought in this issue yet. What I would like is to make those indent/column constraints a part of the grammar, thus making it more explicit and visible. Here are some of the ideas I was thinking about:
Lets look at the 'if' statement: if condition:
if_body
else:
else_body We could write the grammar rule:
Each
In order for the parser to reduce this to As I said, I haven't give it much thought yet so these ideas are probably full of holes. I'm not sure that with this approach I would still be in the realm of the deterministic LR parsing as the "state" of the parser would now depend on the stored indentation levels in Anyway, I don't want to discourage you. The insights you make along the way could be valuable to us. As a tip, there is a Happy hacking! :) |
While PEP8 recommends 4 spaces indentation with Python, the interpreter actually accepts any indentation for a sub-block that is larger than its parent indentation. Also, different sub-blocks don't need the same indentation. Eg for x in [0,1]:
if x == 0:
print("0")
else:
print("1") works |
Yes, you are right. There are many gotchas lurking around the corner. Probably even more when we take into account the quirks of all indentation/column based languages. The above idea was just a rough draft that needs refinement. E.g. expression The bigger problem here is coming up with the way to calculate LR tables and investigating what would be the implications to LR parsing. |
So maybe my section-based language doesn't quite belong in this issue and should be split away to a different issue, as it doesn't care at all about indentation, it only cares about the text of a line. |
I guess it belongs here as it is also a language where whitespace layout has special meaning. I haven't touch line-based languages with above ideas. We have some support for those in textX. For example, you can see how pyTabs is implemented. In this language music composition is specified in a language that is in a way 2-dimensional. Also, pyFlies is interesting as there is experiment specification given like a plain text table. Grammar specifies the first line to be variable names while each other line is an experiment state with the values of the given variables. Something along these lines could be implemented in parglare in my view. |
In my case, the whitespace rules are fairly complicated, allowing for line continuation characters, "partially" unindented blocks, and ignoring newlines and indentation inside of brackets. Perhaps my use case will be a good test for any proposed parglare features related to this issue. I will try to post a working example soon that illustrates what I am trying to do. |
Regardless of whether my case could be handled by some future built-in features, I think it would still be useful to be able to easily manipulate some external state from actions that could be examined by recognizers, as there will always be something someone is trying to do that hasn't been anticipated. It is always nice to have a "big hammer" available to solve the problem by brute force when the required features are not available. It seems like others need this sort of functionality as well (e.g. @alensuljkanovic in #44 and #45). Along those lines, it seems that actions are being used for two separate purposes: 1) constructing AST tree nodes, and 2) manipulating external state. This is illustrated by the new option added in #45. What would you think about splitting off the external state functionality into something called "side_effects"? These actions would always be called, and would not return any value. If you are open to the idea, I will open a new issue proposing this change. |
Sure, I'm open to new ideas. I would like for parglare to be as much flexible as it can while preserving simplicity and ease of use. |
Yes, I have found that to be the case so far. I think the design is elegant, but also very practical. Thanks for making it available. |
Usually I'm in dire needs for indent/block-based DSL each 6 months or so. Feels like karma.
|
I can feel your pain :) After resolving #52 this issue became probably the most important to resolve together with some new ways for static disambiguation (issues #42 #43 #48). PR #51 will provide a way to hack indentation support but in this case I'm more for a proper solution that would enable declarative specification, i.e. indentation based rules would be a part of the grammar. That still requires some research and design work but I'm interested also for this to be resolved as I find indentation-based DSLs to be very readable and appealing to the users. Thanks for the useful references. |
Since you asked, here is some code that I have been playing with that illustrates one way to implement a parser for a language that uses significant indentation: https://github.com/codecraftingtools/wumps This requires a custom branch of parglare that includes an implementation for #51, although it lags a bit behind the upstream project: https://github.com/codecraftingtools/parglare.git See the wumps README file for more information. Sample input and output files are also included in case you don't want to run it for yourself. The language includes quite a few things that add extra complexity (like implicit continuation lines, disabling indentation significance inside braces, etc.), so please note that just implementing basic indent/unindent blocks could be quite a bit simpler than what you see here. |
@jwcraftsman thanks for sharing, you grammar has some nice ideas for me to explore.
Because being dropped all intermediate steps of grammar's incremental design it's not immediately obvious how and why some rules were introduced, or how to repeat the process one piece at a time. FYI, I think next article (with pdf available) is rather decent read about complexities and pitfalls of indentation-based languages with some formalism and examples of Haskell/Python rules: |
Okay, I hacked something together that may be easier to understand: |
Cool! Thanks so much. |
e.g. FORTRAN as column-based or Python as indentation-based.
The text was updated successfully, but these errors were encountered: