This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

support off-side rule languages #5

Closed
CircleCode opened this issue Jun 25, 2013 · 5 comments

@CircleCode
Member

CircleCode commented Jun 25, 2013

It seems Hoa\Compiler cannot parse off-side rule languages.

It might be sufficient to have the compiler automatically add INDENT (respectively UNINDENT) tokens each time the indentation increases (respectively decreases) by one level.

The tricky part seems to be the mapping between spaces, tabs, and indentation width…
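
To make the spaces/tabs question concrete, here is a minimal sketch in Python (not Hoa\Compiler code). It assumes tab stops every 8 columns, which is the convention Python's own lexer uses; the function name `indent_width` and the `tab_size` parameter are hypothetical.

```python
def indent_width(line: str, tab_size: int = 8) -> int:
    """Count the columns of leading whitespace, expanding tabs to tab stops."""
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # A tab advances to the next multiple of tab_size (1..tab_size columns).
            width += tab_size - (width % tab_size)
        else:
            break  # first non-blank character: the indentation ends here
    return width
```

Under this assumption, a line starting with four spaces and a tab and a line starting with a single tab both measure 8 columns, which is exactly the kind of ambiguity a grammar author would have to be aware of.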

@CircleCode
Member Author

Since lookahead is supported in tokens, maybe this can be done with some magic tokens… I'll look into it.

By the way, even if that is possible, it would mean one can no longer skip \s, which makes parsing a little more tedious. So even if some clever tokens can do this, I think it would be better if the compiler handled it itself.

@CircleCode
Member Author

After thinking about it: since INDENT (or UNINDENT) is relative to the previous line, it would require lookbehind assertions, which I suppose are not supported (because tokens trim the text from the left).

@CircleCode
Member Author

Maybe something from this paper can be used: http://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf

@CircleCode
Member Author

Side note: here are the rules used by Python's lexer to add INDENT and DEDENT tokens (from http://docs.python.org/2/reference/lexical_analysis.html#indentation):

First, tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight (this is intended to be the same rule as used by Unix). The total number of spaces preceding the first non-blank character then determines the line’s indentation. Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation.

The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows.

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

It does not seem too hard to implement, but the difficulty comes from the fact that it has to be mixed with a user-defined grammar.

If I find some time, I'll try to play with this.

Note: since we are parsing the stream as a single string (and not line by line), we have to include newlines in our analysis, and the indentation handling has to take precedence over user-defined tokens.
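
For reference, the stack rule quoted above could be sketched as follows, operating on the whole input as a single string. This is an illustrative Python sketch, not Hoa\Compiler code: the function name `layout_tokens`, the token kinds LINE and NEWLINE, and the tuple shape are all hypothetical. It also simplifies by treating every physical line as a logical line (no backslash continuations, no comments).

```python
from typing import Iterator, Tuple

Token = Tuple[str, str]  # (kind, text); hypothetical shape for illustration


def layout_tokens(source: str, tab_size: int = 8) -> Iterator[Token]:
    """Yield INDENT/DEDENT tokens around the lines of `source`."""
    stack = [0]  # the initial zero is never popped
    for line in source.splitlines():
        body = line.lstrip(" \t")
        if not body:
            continue  # blank lines do not change the indentation level
        prefix = line[: len(line) - len(body)]
        # The prefix holds only spaces and tabs, so its expanded length
        # is the column of the first non-blank character.
        width = len(prefix.expandtabs(tab_size))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        else:
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                # The new level must be one already on the stack.
                raise ValueError(f"inconsistent dedent to column {width}")
        yield ("LINE", body)  # stand-in for the user-defined tokens
        yield ("NEWLINE", "\n")
    while stack[-1] > 0:  # close every block still open at end of input
        stack.pop()
        yield ("DEDENT", "")
```

In this sketch the LINE token is only a placeholder for whatever the user-defined grammar would lex from the rest of the line. How such a pass would be interleaved with user-defined tokens in the compiler is the open question of this issue and is not answered here.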

@Hywan
Member

Hywan commented Aug 22, 2017

Closing because it's old :-).
