Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for line/column/indentation based languages #5

Open
igordejanovic opened this issue Aug 29, 2017 · 18 comments
Open

Support for line/column/indentation based languages #5

igordejanovic opened this issue Aug 29, 2017 · 18 comments
Assignees
Labels

Comments

@igordejanovic
Copy link
Owner

e.g. FORTRAN as column-based or Python as indentation-based.

@alberth
Copy link
Contributor

alberth commented Feb 9, 2018

My toy experiment

define type
    real
    float is a real
    double is a real

proved to me that trying to shoehorn a line-based language in a CFG parser is not feasible. It's too bad, as I could really use a parser for the structure inside a line (not today, but likely tomorrow).

I think the whole problem starts with the parsing theory seeing input as a contiguous stream of bytes/characters, while in reality, humans think and work in lines of text. To compensate, we throw big iron like GLR and GLL parsers at the stream to reconstruct the line structure.

Your grammar specification already proves it. Most grammar lines fit on a single line, the terminating ; is just clutter. If you defined the LHS to be at the left margin, and everything else not at the left margin, that would work nicer.

My experiment above runs above LR(1) parsing capabilities, while in a line-based setting it would be mostly trivial something like

typesection : typeheader_line definetype_line+
typeheader_line : "define" "type"
definetype_line : NAMETK | NAMETK "is" "a" NAMETK

where *_line nonterminals only match at full lines, and I conveniently consider an empty line to be white space.

However, I don't know of any tool that allows me to write this in a convenient way.

In a line-based setting, you quickly end up in 'sections' of text. I have define type as header to denote a type section, column/indentation is another form of creating a section, I think.

@jwcraftsman
Copy link
Contributor

I am attempting to implement an indentation-based language using parglare. I believe that I have come up with a working solution that uses a combination of custom recognizers/actions to examine/manipulate global variables that keep track of the indentation level (among other things). Keeping this extra parsing state in global variables feels a little clumsy and obviously will not work with GLR. I would rather pass in a custom state object when the parser is created, and have the parser pass it to each recognizer and action as an argument. I would think that a GLR parser could clone this custom state object along with the rest of the parser state when the parser is forked. Would the GSSNode class be the place to store this custom state object? I have started trying to implement this, but I would like some feedback before spending too much time going down this road. Thanks!

@alberth
Copy link
Contributor

alberth commented Jul 17, 2018

Hmm, it all still sounds pretty complicated. Custom state object for something as simple as indent managing, which pretty much boils down to tracking number of "{" brackets, which a parser is perfectly capable to handle, isn't it?
After reading your comment, and thinking about it again, I wonder if there is another solution.

My solution to the section language

I had to implement my language, and nothing nice parser-tool crossed my path. (Maybe a packrat(?) parser could have worked, not sure.) In the end I opted for a top-down parser (simple LL(1)) implemented in pure Python functions. It has 3 levels.

  • Top-level is tracking of sections (define type enters a typedef section, define component enters a component definition section, and so on.) It also has sub-sections, but let's ignore that.
  • Middle level is "single line parser", it takes a single line, and decides what it is. Generally, it knows the section I assume, since I coded it also top-down. Not sure how much that is needed though.
  • Bottom level is "elementary words", much like a scanner providing tokens, implemented with regular expression matching, and tracking column and length for each "word". (My words were often sub-phrases like [ ]+is[ ]+an?, but that's just what you see as elementary word.)

(In the actual implementation, the top-level grouped lines of the same section together, and then handed them over to the middle parser for handling creation of the section object, but that is probably just more convenient for a manually written parser.)

Call stack is top -> middle -> bottom, of course.
You also seem to end up at 3 levels, where 'middle' and 'bottom' is a regular parser, and 'top' is a custom state object, so that looks like a pattern :p

Stacked parsers?

As a thought-experiment, what would happen if I implement line-parsing (ie middle+bottom) as a standard parser *seeing EOL as EOF, and implement the top level as another parser that takes matched lines from the line-parser as tokens to implement the sections (or indent levels)?

In other words, I am trying to keep the indentation (or section) management in a parser by stacking it on top of a line-parser. It would work for my section language, I think except maybe for passing section information down. Of course, if you can hack a parser-tool, you can fold the section-language and the line-languages (one for each section in general) all in one grammar specification, making it much nicer.

More questions

Switching between sections are lines of their own, I implemented that in the top-level by recognizing the section header lines only, and treating all other lines as 'lines in section X'. This may be tricky in a proper parser setting?

I am not sure if indenting could be done in a parser at all without external indent-level tracking. Say you have a language consisting of '-' and 'a' only (every other character is ignored), could you write a parser that groups 'a' based on the length of '-' prefixes?
Eg

a
--a
--a
----------a
a

to (a, (a, a, (a)), a) ?
(Possible solution is to see each - as a group?)

More wondering

More general, isn't it a case of embedding a lines language inside a section language (or Python line language inside an indenting language?) Embedding has know solutions, doesn't it?

Maybe my section header lines should be part of the section language rather than the line language.

@igordejanovic
Copy link
Owner Author

@jwcraftsman Hi and thanks for the interest in parglare.

I agree with @alberth that keeping state seems complicated. I don't know the details to reason about whether you are on the right path. Maybe that's the only way to do it. I haven't put much thought in this issue yet.

What I would like is to make those indent/column constraints a part of the grammar, thus making it more explicit and visible.

Here are some of the ideas I was thinking about:

  • For column based languages we could have a special recognizer column(x) where x is the column. This recognizer would succeed without consuming any input only if called with the position at the given column.
    Thus we could write something like:
Statement :  column(7) 'SomeKeyword' ....;

Statement could be reduced only if SomeKeyword is found on column 7, i.e. if column(7) recognizer succeeds.

  • For indentation languages (Python, YAML etc.) similar approach could be taken. For example, lets say we have a special recognizer indent(x) which recognizes indent level and returns token with the indent level stored inside.

Lets look at the 'if' statement:

    if condition:
        if_body
    else:
        else_body

We could write the grammar rule:

IfStatement: indent(a) 'if' condition 
                   indent(a+1) if_body 
                   indent(a) 'else:' 
                   indent(a+1) else_body;

Each indent recognizer produces indent token with the indent level remembered and connected to the expression inside the parentheses. When the time for reduction comes on the LR stack we have:

[ indent(a), 'if', condition, indent(a+1), if_body, indent(a), 'else:', indent(a+1), else_body ]

In order for the parser to reduce this to if statement all indent token expressions must be solved so that a single value for a is found and that each expression in parentheses equals to the stored indentation level.

As I said, I haven't give it much thought yet so these ideas are probably full of holes. I'm not sure that with this approach I would still be in the realm of the deterministic LR parsing as the "state" of the parser would now depend on the stored indentation levels in indent tokens. The table calculation would probably be influenced greatly also.

Anyway, I don't want to discourage you. The insights you make along the way could be valuable to us.

As a tip, there is a Context object that is passed around. If you give an instance to the parser it will use the same instance during the course of parsing. It is passed to actions, it is not passed to recognizers yet but you could easily hack that to try your approach. Try first to implement it with LR to avoid dealing with GLR complexity.

Happy hacking! :)

@alberth
Copy link
Contributor

alberth commented Jul 17, 2018

While PEP8 recommends 4 spaces indentation with Python, the interpreter actually accepts any indentation for a sub-block that is larger than its parent indentation. Also, different sub-blocks don't need the same indentation. Eg

for x in [0,1]:
    if x == 0:
     print("0")
    else:
            print("1")

works

@igordejanovic
Copy link
Owner Author

Yes, you are right. There are many gotchas lurking around the corner. Probably even more when we take into account the quirks of all indentation/column based languages.

The above idea was just a rough draft that needs refinement. E.g. expression a+1 might also be a+k. We could introduce any number of variables and search for a solution where all variable values are positive in case of reduction. Also, we could say that indent level might be a fraction number (in case of non-four indentation). These are just some of the possible solution. The idea is to try to come up with some formal way of specifying those languages, possibly reusing as much of already existing formalisms, without too much of kludges. This would have a positive impact on understandability and maintainability of the language.

The bigger problem here is coming up with the way to calculate LR tables and investigating what would be the implications to LR parsing.

@alberth
Copy link
Contributor

alberth commented Jul 17, 2018

So maybe my section-based language doesn't quite belong in this issue and should be split away to a different issue, as it doesn't care at all about indentation, it only cares about the text of a line.

@igordejanovic
Copy link
Owner Author

I guess it belongs here as it is also a language where whitespace layout has special meaning. I haven't touch line-based languages with above ideas.

We have some support for those in textX. For example, you can see how pyTabs is implemented. In this language music composition is specified in a language that is in a way 2-dimensional. Also, pyFlies is interesting as there is experiment specification given like a plain text table. Grammar specifies the first line to be variable names while each other line is an experiment state with the values of the given variables.

Something along these lines could be implemented in parglare in my view.

@jwcraftsman
Copy link
Contributor

In my case, the whitespace rules are fairly complicated, allowing for line continuation characters, "partially" unindented blocks, and ignoring newlines and indentation inside of brackets. Perhaps my use case will be a good test for any proposed parglare features related to this issue. I will try to post a working example soon that illustrates what I am trying to do.

@jwcraftsman
Copy link
Contributor

jwcraftsman commented Jul 17, 2018

Regardless of whether my case could be handled by some future built-in features, I think it would still be useful to be able to easily manipulate some external state from actions that could be examined by recognizers, as there will always be something someone is trying to do that hasn't been anticipated. It is always nice to have a "big hammer" available to solve the problem by brute force when the required features are not available. It seems like others need this sort of functionality as well (e.g. @alensuljkanovic in #44 and #45).

Along those lines, it seems that actions are being used for two separate purposes: 1) constructing AST tree nodes, and 2) manipulating external state. This is illustrated by the new option added in #45. What would you think about splitting off the external state functionality into something called "side_effects"? These actions would always be called, and would not return any value. If you are open to the idea, I will open a new issue proposing this change.

@igordejanovic
Copy link
Owner Author

Sure, I'm open to new ideas. I would like for parglare to be as much flexible as it can while preserving simplicity and ease of use.

@jwcraftsman
Copy link
Contributor

Yes, I have found that to be the case so far. I think the design is elegant, but also very practical. Thanks for making it available.

@amerlyq
Copy link

amerlyq commented Oct 22, 2018

Usually I'm in dire needs for indent/block-based DSL each 6 months or so. Feels like karma.
But 20+ parsers I tried in different languages do not provide any sane solution to the problem or any solution at all :(
And still parglare belongs to the few of most sane ones from the bunch.
Will PR #51 introducing side effects allow to implement lexem INDENT / DEDENT to be used as block braces? How to use side effects to introduce them in this case?
Akin to these:

@igordejanovic
Copy link
Owner Author

I can feel your pain :) After resolving #52 this issue became probably the most important to resolve together with some new ways for static disambiguation (issues #42 #43 #48). PR #51 will provide a way to hack indentation support but in this case I'm more for a proper solution that would enable declarative specification, i.e. indentation based rules would be a part of the grammar. That still requires some research and design work but I'm interested also for this to be resolved as I find indentation-based DSLs to be very readable and appealing to the users.

Thanks for the useful references.

@jwcraftsman
Copy link
Contributor

@amerlyq,

Since you asked, here is some code that I have been playing with that illustrates one way to implement a parser for a language that uses significant indentation:

https://github.com/codecraftingtools/wumps

This requires a custom branch of parglare that includes an implementation for #51, although it lags a bit behind the upstream project:

https://github.com/codecraftingtools/parglare.git

See the wumps README file for more information. Sample input and output files are also included in case you don't want to run it for yourself.

The language includes quite a few things that add extra complexity (like implicit continuation lines, disabling indentation significance inside braces, etc.), so please note that just implementing basic indent/unindent blocks could be quite a bit simpler than what you see here.

@amerlyq
Copy link

amerlyq commented Oct 23, 2018

@jwcraftsman thanks for sharing, you grammar has some nice ideas for me to explore.
However I would much appreciate if you had simpler example for sharing of new features ;)
Having introduced simple unencumbered grammar for blocks of outlines woud be nice of you -- i.e. to produce AST = [[a, [b, [c], b]], [d, [e]]] from sample input:

a
  b
    c
  b

d
  e

Because being dropped all intermediate steps of grammar's incremental design it's not immediately obvious how and why some rules were introduced, or how to repeat the process one piece at a time.


FYI, I think next article (with pdf available) is rather decent read about complexities and pitfalls of indentation-based languages with some formalism and examples of Haskell/Python rules:
https://michaeldadams.org/papers/layout_parsing/
It does not provide immediate solution, though, and made me think I bit more then I can chew at once.

@jwcraftsman
Copy link
Contributor

Okay, I hacked something together that may be easier to understand:

https://github.com/jwcraftsman/pgindent

@amerlyq
Copy link

amerlyq commented Oct 23, 2018

Cool! Thanks so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants