
Compiler Implementation #18

keean opened this issue Sep 27, 2016 · 492 comments


keean commented Sep 27, 2016

Discussion about the ongoing implementation of the compiler.


keean commented Sep 27, 2016

The parser now has an implementation of the indentation aware parser described in this paper: https://pdfs.semanticscholar.org/cd8e/5faaa60dfa946dd8a79a5917fe52b4bd0346.pdf

Here's the implementation of the indentation parser:

    function IndentationParser(init) {
        this.indent = init
    }
    IndentationParser.prototype.get = function() {
        return this.indent
    }
    IndentationParser.prototype.set = function(i) {
        this.indent = i
    }
    IndentationParser.prototype.relative = function(relation) {
        var self = this
        return Parsimmon.custom((success, failure) => {
            return (stream, i) => {
                var j = 0
                while (stream.charAt(i + j) == ' ') {
                    j = j + 1
                }
                if (relation.op(j, self.indent)) {
                    self.indent = j
                    return success(i + j, j)
                } else {
                    return failure(i, 'indentation error: ' + j + relation.err + self.indent)
                }
            }
        })
    }
    IndentationParser.prototype.absolute = function(target) {
        var self = this
        return Parsimmon.custom((success, failure) => {
            return (stream, i) => {
                var j = 0
                while (stream.charAt(i + j) == ' ') {
                    j = j + 1
                }
                if (j == target) {
                    self.indent = j
                    return success(i + j, target)
                } else {
                    return failure(i, 'indentation error: ' + j + ' does not equal ' + target)
                }
            }
        })
    }
    IndentationParser.prototype.eq  = {op: (x, y) => {return x == y}, err: ' does not equal '}
    IndentationParser.prototype.ge  = {op: (x, y) => {return x >= y}, err: ' is not equal or greater than '}
    IndentationParser.prototype.gt  = {op: (x, y) => {return x > y}, err: ' is not greater than '}
    IndentationParser.prototype.any = {op: (x, y) => {return true}, err: ' cannot fail '}

This is what a parser using these new parser combinators looks like:

    block = Parsimmon.succeed({}).chain(() => {
        var indent = Indent.get()
        return Parsimmon.seqMap(
            Indent.relative(Indent.gt).then(statement),
            (cr.then(Indent.relative(Indent.eq)).then(statement)).many(),
            (first, blk) => {
                blk.unshift(first)
                Indent.set(indent)
                return {'blk' : blk}
            }
        )
    })

This parses a block of statements: the first line of the block must be more indented than the previous line, and the remaining lines must be indented by the same amount as the first line.
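
For a concrete picture, here is a minimal sketch of how the block rule might be exercised (the statement and cr parsers here are stand-ins, since the real ones are defined elsewhere):

    var Indent = new IndentationParser(0)
    var statement = Parsimmon.regex(/[a-z]+/)   // stand-in statement parser
    var cr = Parsimmon.string('\n')             // stand-in newline parser
    // ... block defined exactly as above ...
    var result = block.parse('   foo\n   bar')
    // result.status === true
    // result.value => { blk: ['foo', 'bar'] }  (both lines indented by 3, which is > 0)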


shelby3 commented Sep 27, 2016

@keean I will catch up with you later on the parser combinator implementation. I haven't employed them ever, so I will need to dedicate some time to that. My first priority is to write the grammar into an EBNF file and check that it is conflict-free, LL(k), and hopefully also context-free. I read that parser combinators can't check those attributes.

Also I will want to understand whether using a monadic parser combinator library forces our AST into a monadic structure, and whether that is the ideal way for us to implement. Anyway, you are rolling on implementation, so I don't want to discourage you at all. I will try to rally around one way and help code. I will need to study. My focus so far has been on nailing down the syntax and early design decisions. Btw, congrats on getting rolling so quickly on the implementation!

Btw, I hate semicolons. Any particular reason you feel you need to litter the code with them? There are only a very few ASI gotchas in JavaScript when not including semicolons (and these I think can be checked with JSLint), and these are easy to memorize, such as not leaving the rest of the line blank after a return, as this will return undefined.
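
For example (illustrative snippet):

    function f() {
        return         // ASI inserts ';' here because the rest of the line is blank...
            { ok: true }
    }
    f()                // ...so this call evaluates to undefined, not { ok: true }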

Also I prefer the style of this latest code compared to what I saw before, because I don't like trying to cram too many operations into one LOC. It makes the code difficult to read IMO.

Also, I think I would prefer to employ arrow functions as follows (we'll be porting to self-hosted later, so we'll have arrow functions as standard regardless of ES version), and to compromise at 3 spaces of indentation (even though I prefer 2 spaces lately):

block = Parsimmon.succeed({}).chain(() => {
   var indent = Indent.get()
   return Parsimmon.seqMap(
      Indent.relative(Indent.gt).then(statement),
      (cr.then(Indent.relative(Indent.eq)).then(statement)).many(),
      (first, blk) => {
         blk.unshift(first)
         Indent.set(indent)
         return {'blk' : blk}
      } 
   )
})

Also I would horizontally align as follows because I love pretty code, which is easier to read:

IndentationParser.prototype.eq  = {op: eq(x, y) => {return x == y}, err: ' does not equal '              }
IndentationParser.prototype.ge  = {op: ge(x, y) => {return x >= y}, err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = {op: gt(x, y) => {return x >  y}, err: ' is not greater than '         }
IndentationParser.prototype.any = {op: gt(x, y) => {return true  }, err: ' cannot fail '                 }

I may prefer:

IndentationParser.prototype.eq  = { op: eq(x, y) => {return x == y},
                                   err: ' does not equal '              }
IndentationParser.prototype.ge  = { op: ge(x, y) => {return x >= y},
                                   err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = { op: gt(x, y) => {return x >  y},
                                   err: ' is not greater than '         }
IndentationParser.prototype.any = { op: gt(x, y) => {return true  },
                                   err: ' cannot fail '                 }

Above you are implicitly making the argument again that we should have the ability to name inline functions (without let) in our programming language. Note this would be an alternative solution to the ugly syntax for the case where we need to specify the return type, but afaics we can't unify around (x, y) => x == y without the prefixed name unless we don't use parentheses for anonymous product (tuple) types and remain LL(k). Any idea how ES6 is parsing their arrow functions? LR grammar? Alternatively you would be writing that in our language:

let eq(x, y) => x == y
let ge(x, y) => x >= y
let gt(x, y) => x >  y
let gt(x, y) => true
IndentationParser.prototype.eq  = {op: eq, err: ' does not equal '              }
IndentationParser.prototype.ge  = {op: ge, err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = {op: gt, err: ' is not greater than '         }
IndentationParser.prototype.any = {op: gt, err: ' cannot fail '                 }

Which would have helped you catch the error in the duplication of the gt name (a copy+paste typo). The only reason you are adding the redundant naming above is for browser debugging stack traces, correct?

Or (unless we change the syntax):

IndentationParser.prototype.eq  = { op: x y => x == y,
                                   err: ' does not equal '              }
IndentationParser.prototype.ge  = { op: x y => x >= y,
                                   err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = { op: x y => x >  y,
                                   err: ' is not greater than '         }
IndentationParser.prototype.any = { op: x y => true,
                                   err: ' cannot fail '                 }


keean commented Sep 27, 2016

The main reason to use function is backwards compatibility, not all browsers support => yet.

With regards to our syntax, function definition should be an expression, so you should be able to include it inline in the object declaration. I think we would end up with something like this:

data Relation = Relation { op : (A, A) : Bool, err : String }

let eq = Relation { op: eq(x, y) => x == y, err: ' does not equal ' }


shelby3 commented Sep 27, 2016

@keean wrote:

The main reason to use function is backwards compatibility, not all browsers support => yet.

I know. That is why I wrote:

Also, I think I would prefer to employ arrow functions as follows (we'll be porting to self-hosted later so we'll have arrow functions as standard to any ES version

I had already explained we will get backwards compatibility for free, and by not putting function we are more compatible with the way it will be written in our language when we port over.

Who can't run our compiler in a modern browser in the meantime? This is only alpha.

Please re-read my prior comment, as I added much to the end of it.


keean commented Sep 27, 2016

Regarding semi-colons, Douglas Crockford in "JavaScript: The Good Parts" recommends always using semi-colons explicitly, because JavaScript's semi-colon insertion can result in the code not doing what you intended.


keean commented Sep 27, 2016

I think you are right about '=>' for functions, as it is running in Node which supports them, however, I don't think porting will be that straightforward, as we won't directly support prototypes etc.


shelby3 commented Sep 27, 2016

@keean wrote:

because JavaScript's semi-colon insertion can result in the code not doing what you intended.

Did you not read what I wrote?

There are only a very few ASI gotchas in JavaScript when not including semicolons (and these I think can be checked with JSLint), and these are easy to memorize, such as not leaving the rest of the line blank after a return, as this will return undefined.

http://benalman.com/news/2013/01/advice-javascript-semicolon-haters/


keean commented Sep 27, 2016

Regarding semi-colons:

... the specification is clear about this. JavaScript’s syntactic grammar specifies that semi-colons are required. Omitting semi-colons results in invalid JavaScript code. That code won’t throw (thanks to ASI), but it’s still invalid.


shelby3 commented Sep 27, 2016

Semicolons won't help you here:

return
   some long shit;

You have to know the rules, whether you use semicolons or not. That is why I am happy we are going to use Python-style indenting.

Semicolons are training wheels that don't protect against every failure.


keean commented Sep 27, 2016

Also jshint wants you to put them in, and I am using jshint as part of the build process.

jshint catches the above error :-)


shelby3 commented Sep 27, 2016

JSHint can be configured to allow ASI. And I think it will still warn you about ambiguous implicit cases, if I am not mistaken (it should).


keean commented Sep 27, 2016

without semi-colons JSHint cannot recognise the above error because you might mean:

return;
some long stuff

or

return some long stuff;


shelby3 commented Sep 27, 2016

Bottom line: if you have something at the start of a line which could possibly be a line continuation, then check to make sure you have made it unambiguous.

That is the simple golden rule and it applies whether using semicolons or not. That is not complicated. One simple rule.


keean commented Sep 27, 2016

JavaScript was never designed to be used without semi-colons... let's design our new language not to require them, but I don't see any point in fighting JavaScript... We will emit the semi-colons into JS :-)


shelby3 commented Sep 27, 2016

@keean wrote:

without semi-colons JSHint cannot recognise the above error because you might mean:

It should be warning that the case is ambiguous. I can't be faulted for the JSHint programmers being derelict (if they are; I did not confirm).


shelby3 commented Sep 27, 2016

@keean wrote:

JavaScript was never designed to be used without semi-colons...

The intent isn't relevant. What is, is what is relevant. We need to know the rules whether we use semicolons or not. We are not stupid programmers who need to make ourselves feel we are more secure by not knowing the rules. I lost my training wheels 30 years ago.

Otherwise we need to find a linter that warns of all ambiguous cases with or without semicolons.

Bottom line: if you have something at the start of a line which could possibly be a line continuation, then check to make sure you have made it unambiguous.

That is the simple golden rule and it applies whether using semicolons or not. That is not complicated. One simple rule.

If JSHint isn't doing that checking, then it is derelict. Need to find a better linter.

Wonder if Douglas Crockford ever considered that. If some influential people decide that semicolons everywhere is the prescribed way, then why the heck did JS offer ASI anyway?

Perhaps he could have realized that the only sure way, is to have a linter which properly warns of every ambiguous case, whether using semicolons or not. Instead perhaps these talking heads influenced the JSHint people to not add proper checking for the ASI case? Sigh.


keean commented Sep 27, 2016

So here's what the guy that created JS thinks: https://brendaneich.com/2012/04/the-infernal-semicolon/


shelby3 commented Sep 27, 2016

It doesn't matter. It is just logic.

There you go. Crockford doesn't agree to support ASI in his tool and thus promulgates that ASI is an error:

Some argue that JSMin has a bug. Doug Crockford does not want to change JSMin, and that’s his choice.

That's right:

And (my point here), neither is newline.

Know the rules. Newline is not a statement nor expression terminator in JavaScript. Simple as that. Resolve all ambiguous cases.

It's analogous superfluous redundancy: one wouldn't write ;;;;;;; at the end of every statement or expression to make sure they got it right. They also don't need to write ; to make sure they got it right, if they are using a linter which can warn them when the expression on the prior line could be joined to the next line, and thus that a semicolon or other syntax needs to be inserted to resolve the ambiguity.
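
For example, the kind of line a proper linter should flag (illustrative snippet only):

    var total = a + b
    (c + d).toString()    // no ASI here: parsed as  var total = a + b(c + d).toString()

    var total = a + b
    ;(c + d).toString()   // unambiguous: a leading ';' (or other syntax) resolves it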


keean commented Sep 27, 2016

The moral of this story: ASI is (formally speaking) a syntactic error correction procedure. If you start to code as if it were a universal significant-newline rule, you will get into trouble. A classic example from ECMA-262:


keean commented Sep 27, 2016

So I don't write code with syntactic errors... I write Python without them, I write C++ with them... it doesn't bother me, I go with what the language standard says...


shelby3 commented Sep 27, 2016

The moral of this story: ASI is (formally speaking) a syntactic error correction procedure. If you start to code as if it were a universal significant-newline rule, you will get into trouble. A classic example from ECMA-262:

Then why did he put it in JavaScript? Linters should do their job correctly.

There is absolutely no reason you ever need a ; after a block { }. The } terminates the statement or expression.


keean commented Sep 27, 2016

I wish I had made newlines more significant in JS back in those ten days in May, 1995. Then instead of ASI, we would be cursing the need to use infix operators at the ends of continued lines, or perhaps \ or brute-force parentheses, to force continuation onto a successive line. But that ship sailed almost 17 years ago.


shelby3 commented Sep 27, 2016

I wish I had made newlines more significant in JS back in those ten days in May, 1995. Then instead of ASI, we would be cursing the need to use infix operators at the ends of continued lines, or perhaps \ or brute-force parentheses, to force continuation onto a successive line. But that ship sailed almost 17 years ago.

There you go. JS can't require semicolons. So why do you? Probably because we can't use a proper (non-derelict) linter, since JSHint probably doesn't warn of all ambiguities with ASI allowed (but I didn't confirm that).

We are moving to block indenting to avoid this entire mess.


keean commented Sep 27, 2016

Okay, so conclusion, I will use '=>' for anonymous functions, but leave the ';' in for now...

Our language won't require semi-colons, just like Python does not...


shelby3 commented Sep 27, 2016

I go with what the language standard says...

The language standard says ASI is a supported feature. You have to know the rules whether you use semicolons or not. I will not repeat this again.

Let's try to find a linter which isn't brain-dead.


keean commented Sep 27, 2016

My two cents: be careful not to use ASI as if it gave JS significant newlines. And please don’t abuse && and || where the mighty if statement serves better.

The standard says it is a syntax error to omit the semi-colon.


shelby3 commented Sep 27, 2016

The standard says it is a syntax error to omit the semi-colon.

Then why does it compile? Brendan told you that JS cannot require semicolons everywhere because it breaks other things.


keean commented Sep 27, 2016

You are right. If you really can't work with the semi-colons, I will get rid of them for this project.


keean commented Sep 27, 2016

Can I delete the semi-colon discussion, as it's cluttering the implementation thread... I am going to remove them.

I discovered 'this' does not work inside =>-defined functions... that is a bit weird.


shelby3 commented Feb 10, 2018

Why parser combinators don’t make grammar analyzers extinct:

https://research.swtch.com/yaccalive

I had made a similar argument to @keean about the need to prove the lack of ambiguities that force backtracking.


keean commented Feb 10, 2018

That article is referring to regular expressions not parser combinators as far as I can see.


shelby3 commented Feb 10, 2018

See the section Linear vs Exponential Runtime. He is writing about context-free grammars in general.


shelby3 commented May 28, 2018

@NodixBlockchain wrote:

At the basic level it can generate a parser, but that's only the lower level; the power of it comes from the listener/visitor classes which can work on the parse tree.

That sounds like it is in contention with my (our) stated goal:

I prefer the compiler to eventually be bootstrapped to self-hosted.

Because OOP is an anti-pattern and we’ll probably prefer to use functional programming and typeclasses. But I have not yet dug into this detail. Perhaps @keean has an opinion given he has experience in creating a PL which I do not have?

As far as I know it's the best tool of its kind.

Intuitively that probably also means carrying around a lot of cruft. Being a jack-of-all-trades has its downsides. But I would need to dig in to analyse the issues accurately. Btw, I did briefly look at ANTLR in the past.

Lexers/tokenizers are easy to do, but making a good parser for a complex grammar can become tricky, especially for prototyping when things can change often; it's easy for bugs to creep in, or for the code to become a bit ugly, without a real grammar parser like ANTLR.

That’s the same argument I made up-thread to @keean in favor of some integration with a generator. Whereas, @keean had begun implementation with parser combinator libraries. Although I presume his work paused due to ongoing discussions and lack of consensus about what the design of the PL should be. Yet I think for myself the essence of what the design should be has solidified, although I am still remaining open to discussion on the design while moving forward on implementation.


NodixBlockchain commented May 28, 2018 via email

@NodixBlockchain

Really? Got any references for that? The reason I ask is because it seems to me that generating good error reports was the main reason. You just cannot get good language specific errors out of a parser generator.

https://softwareengineering.stackexchange.com/questions/17824/should-i-use-a-parser-generator-or-should-i-roll-my-own-custom-lexer-and-parser

ANTLR v4 is supposed to have improved the error handling, I didn't test it heavily though. But the hard part is generally not the grammar itself, but the step after it, and there it's possible to have complete error messages with context etc.

But even for the grammar parsing, ANTLR v4 is supposed to be able to generate error messages that are better than before.

For the performance I guess it depends, but if the grammar file is not well structured, it can become slow.

antlr/antlr4#1540

ANTLR4 is around 40x slower.

For a simple straightforward grammar it's probably ok, but it can be easy to write rules that become heavy to process without realizing it.

http://www.antlr3.org/pipermail/antlr-interest/2009-January/032345.html

With a hand-made parser, it's much less likely it will choke on complex rules.

A parser generator's grammar definition can still be optimized, but it's easy to write grammar rules that become slow to process, and as it's general purpose, it might also generate some information that is not useful for the task.

But it seems to depend on the type of parsing; sometimes the ANTLR method is not best adapted to the type of grammar and can become inefficient.

For simple grammars they seem to be equivalent to an average hand-made parser, but when the grammar becomes more complex it can probably become slower, depending on the case.


shelby3 commented May 28, 2018

@NodixBlockchain wrote:

What language would you use otherwise? I'm not sure Go is the best language to make a compiler.

I don’t know which language I would suggest to bootstrap with. Maybe whichever language accomplishes the bootstrapping most quickly, since the goal is to bootstrap to self-hosted. @keean might have an opinion on this?

There are not that many languages that support typeclasses that I know of, other than Haskell and Scala, and I'm not sure it's extremely useful for making a parser/compiler.

And Rust. And Swift has something similar to typeclasses.

Haskell might be the best choice for bootstrapping, but I feel awkward using it. I just can’t seem to adjust to every function taking one argument.


NodixBlockchain commented May 28, 2018 via email


shelby3 commented May 28, 2018

@NodixBlockchain I saw your comment in the other thread about ANTLR. I’m currently working on a draft of the proposed syntax for a hopefully context-free LL(k=1) grammar using the analysis and generation SLK tool. When it’s ready, you’re free to try to plug it into ANTLR and see what advantages and disadvantages fall out.

EDIT: here is what I have thus far, and it is LL(k=2). Also I found some criticism of the SLK tool I am using on the ANTLR forum.


shelby3 commented Jun 25, 2018

@keean wrote in the Modularity thread #39:

I am tempted to suggest starting with an s-expression grammar, that way we can separate the language AST from the parser.

You want the AST to have a canonical structure so that changes to the parser grammar don’t require changes to compiler stages that manipulate the AST?

Yet doesn’t this still require a translation stage from the parser grammar to the s-expression grammar?

Sounds like this will make the compiler more complex because for example error messages won’t be finely tuned to the parser grammar. And the compiler would be slower.

You’re presumably envisioning a more reusable compiler layer upon which different parser grammar can be deployed. Or envisioning much experimentation and changes to parser grammar. This has been your thesis since the beginning that the compiler should be built in canonical orthogonal layers.

Frankly that sounds like an academic research project.

Am I missing any important advantages you envision?

Also I want a self-hosted compiler so there wouldn’t be homoiconicity when programming the compiler passes over the AST. So we won’t be able to write s-expression macros and remain self-hosted, unless we’re using a different s-expression programming language for the AST compiler stages. As you know, a self-hosted compiler provides that anyone who knows the programming language that the compiler compiles, also can read the source code of the compiler.

I’m trying to build a commercial tool. I had better know what parser grammar the market wants; otherwise, I shouldn’t be doing this job. I don’t say “we” because I think your focus is different than mine. I think I sort of know what you want based on our past interactions. You want a research project which will be a foundational layer for programming language design experimentation. However, if I’m somehow missing some key advantages or if my understanding is myopic, I do hope to learn from you why we should do it the way you’re contemplating?

I need a commercial tool completed yesterday. I’m not wanting to dig into some “project for the ages.”

I know you have much more hands-on experience than I both building compilers and using many different programming languages. So I always value your knowledge and opinion. But I also have to balance this against my instincts, intuitions, and some modicum of observation of extant programming languages and their issues. I have observed as I read various posts by others on Quora and what not, that I seem to be attuned with what other programmers are thinking and wanting.

We need a very streamlined Go with typeclass polymorphism, better modules, and some functional programming features. And we need it pronto.


keean commented Jun 25, 2018

@shelby3 software spends longer in the support and maintenance phase than it does in the writing stage. Separating the grammar allows unit testing on the parser (tests consist of input syntax and expected output s-expression) and other layers of the compiler. For something like a compiler where regression testing after bug-fixing is essential it's vital that we follow a Test-Driven-Design approach, and write the tests before the code we want to test.


shelby3 commented Jun 30, 2018

@keean I fail to understand why regression tests must operate on an s-expression instead of an AST in our preferred programming language and data structures.


keean commented Jun 30, 2018

@shelby3 well you have to write the tests somehow, and human readable is better for validation. Remember an s-expression is just a universal way of writing a tree (abstract or otherwise): the 'function', i.e. the first word after an open parenthesis '(', is the name of the node, and all the subsequent words are its children. For example we can parse an expression like (3 * x) + (2 * y) as:

(sum (product (literal "3") (variable "x")) (product (literal "2") (variable "y")))

This is just a way of writing the abstract syntax tree that is human readable. Now it has some drawbacks, and the main one is that we only know the semantics of the arguments to the tree nodes by the position of their arguments, which is fine for things like literal "3" but if we start wanting to decorate the tree with other information it can get hard to tell what is what. Now I would say you are better off keeping the tree as simple as possible and adding derived information by adding new 'synthetic' nodes to the tree like this:

(has-type int32 (sum (product (literal "3") (variable "x")) (product (literal "2") (variable "y"))))

The alternative is to use something like XML or JSON to give each node named properties:

{type: int32, sum-left: {product-left: {literal: "3"}, product-right: {variable: "x"}}, sum-right: {product-left: {literal: "2"}, product-right: {variable: "y"}}}

or

<expr type="int32"><sum><product><literal value="3"/><variable name="x"/></product><product><literal value="2"/><variable name="y"/></product></sum></expr>

What's interesting is that both the s-expression and XML have named objects, and both the JSON and the XML have named properties, although XML is not regular in that named properties (attributes) do not contain nested XML, which actually has positional arguments like s-expressions as well. This annoys me; it's like there are so many formats, but none of them quite get it right.

So let's just try something like:

(expr type:int32 (sum (product (literal "3") (variable "x")) (product (literal "2") (variable "y"))))

So here we can imagine that a named argument like "type" could contain further nested s-expressions, and we choose to omit the argument name where it is obvious. We have named objects and named properties for decoration, and those properties can be full expressions, so type itself could be a tree structure like:

(expr type:(list(pair int string)) ...)

It's probably also worth mentioning "i-expressions" at this point too, which are s-expressions with the parentheses replaced by indentation like this (and keeping the named arguments too):

expr
  type:int32
  sum
    product
      literal "3"
      variable "x"
    product
      literal "2"
      variable "y"

Which seems a nice universal way to write down data-structures, but sometimes using the parentheses on a single line can be clearer for short expressions. Overall allowing mix-and-match seems best:

expr
  type:int32
  sum
    product (literal "3") (variable "x")
    product (literal "2") (variable "y")


shelby3 commented Jul 5, 2018

@keean we agreed in the Macros and DSLs thread to just use the language syntax as the textual serialization format.

I already have named and positional arguments for data in the Zer0 syntax.

I like the idea of being able to use indenting in lieu of parentheses when the nesting is deep. I added that to the rough draft proposal for the Zer0 syntax as an optional function application syntax along with the parenthetical tuple.

Note the draft Zer0 grammar I proposed requires instead:

expr:
  type=int32
  sum:
    product:
      literal("3")
      variable("x")
    product:
      literal("2")
      variable("y")
expr:
  type=int32
  sum:
    product(literal("3"), variable("x"))
    product(literal("2"), variable("y"))


sighoya commented Aug 11, 2018

I haven't read all the comments, but which parser technology do you prefer, LL or LR or some extension of them, and why?


keean commented Aug 11, 2018

I personally prefer parser combinators, and parsing and lexing together, otherwise literals like strings cause problems. As for LL or LR, with parser combinators you can be LL where possible, and then backtrack where necessary. You can do this with combinators that are LL but then have decision points that save the state for backtracking.
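
A minimal sketch of that idea (plain functions standing in for the combinators, not the actual library code):

    // A parser here is a function (input, i) -> { ok, value, next }.
    function lit(s) {
        return function (input, i) {
            return input.startsWith(s, i)
                ? { ok: true, value: s, next: i + s.length }
                : { ok: false }
        }
    }
    // The decision point: remember the position, try the first alternative,
    // and only if it fails restore the saved position and try the second.
    function choice(p, q) {
        return function (input, i) {
            var saved = i
            var r = p(input, i)
            return r.ok ? r : q(input, saved)
        }
    }
    choice(lit('let'), lit('leq'))('leq x', 0)   // => { ok: true, value: 'leq', next: 3 }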


sighoya commented Aug 12, 2018

@keean
ty for answering. I think that LL parsing ("top down") is more natural and modular than LR. LR feels a bit like low-level programming with gotos.


keean commented Aug 12, 2018

@sighoya have a look at my parser combinator library for C++. This works about as well as I could get it in C++. The interesting thing is the state for backtracking: currently this just saves a copy of the whole state at a decision point, and backtracking happens by resetting the file offset and restoring the saved state. I think something better could be done with a read-only state and 'diff'-based updates that can be undone, a bit like how we can unpick state in a Prolog interpreter by rolling back unifications. Prolog makes writing compilers so easy. If the self-interpreter was Lisp's special trick, I think the self-compiler is probably Prolog's party piece.


shelby3 commented Aug 13, 2018

I am working on a proposed LL(k) grammar for the syntax of Zer0 and I hope to entirely avoid backtracking, hopefully with k limited to 2 tokens of lookahead. I will continue to update that linked post as I do more work on the grammar. I got sidetracked with the recent discussion about Subtyping and typeclasses. I need those design decisions so I know what to put in the grammar.

Regardless of whether we implement that grammar with parser combinators or a table-driven LL(k) algorithm, I think checking the grammar in an LL(k) parser generator such as SLKpg is important. This allows me (us) to experiment rapidly with grammar tweaks and to check that it’s sound. Note these comments from the ANTLR creators about the SLKpg parser generator. I suppose soon I will submit that grammar to a Zer0 Git repository so others can contribute. I wanted to get something fairly solid before doing that, since our design discussions are still in flux. I am thinking of using GitLab instead of GitHub since GitHub has been bought by Microsoft.

My recollection from prior discussion is that @keean thinks parser combinators are better for more precise compile-time errors.

Note the SLKpg documentation says:

  1. Is a hand-written parser more powerful than one that is generated by a parser generator?

    Yes and No. It is generally much easier to code local parsing exceptions in recursive-descent. However, maintaining the parser is difficult and it is nearly impossible to make the parser correctly handle all error inputs. It should be noted that for modern programming languages, i.e. post C++, hand-written recursive descent may be a good choice. Its weakness is actually an advantage in this case. That is the language ambiguities that cause so much trouble for a parser generator are easily just ignored by the programmer. This works because the vast bulk of code to be compiled is correct, or nearly so with most types of errors being well known. If a particular parse derivation does not occur in practice, there is no need to account for it in the parser. This can cause incorrect programs to be silently accepted by the compiler. But good testing can usually handle much of this.

P.S. I hope others will help develop Zer0. I really don’t want to try to do it all by myself. I am trying to find someone to hire to work on it with us. If anyone knows anyone sufficiently qualified who wants to work on it and be compensated, please have them contact me via email: shelby@coolpage or they can add me on LinkedIn: https://www.linkedin.com/in/shelby-moore-iii-b31488b0

I do not know yet whether what I want to do for Zer0 can incorporate a sufficient set of what @keean wants so that we work on the same code base. @keean had wanted to make the typeclasses sort of like a Prolog logic which compiler builders could build on top of. I think we discussed that up-thread. Also see me arguing against that on a specific issue recently in the Subtyping discussion. I want to go more directly toward the specific goal that I have for Zer0 and not try to build the most generalized underlying programmable engine possible, because my needs are for this to be production ready ASAP, not an ongoing experimental project. Yet I remain open-minded to whatever seems to make the most sense and is realistic within available resources and mutual goals.


shelby3 commented Aug 21, 2018

Now that we’re contemplating targeting Go as the initial compiler target for Zer0, and since we won’t have a self-hosting compiler in the first working version, what programming language do you guys think would be best to implement the compiler in? In which PL could we most readily complete the compiler?

Haskell? (OMG I suck at Haskell but I guess I could learn).
TypeScript?


shelby3 commented Dec 9, 2018

@keean wrote in 2016 in response to my comment:

This is not my experience having worked on several reasonably complex parsers commercially including HTML5 and JavaScript parsers in the past. Parser generators never work because you end up having to hand modify the code they output, and the output code is normally of poor quality.

Could you give me an example of hand modification you would envision, if my responses below don’t ameliorate your point?

Because I’m contemplating that if we are the ones creating the parser generator, then we can modify it as necessary to accommodate whatever we need, including making sure the performance of the generated code is awesome.

LL(k) table-driven parsing is more efficient than parser combinators or hand-coded recursive descent, because there’s no backtracking.

Writing your own recursive descent parser quickly becomes unmaintainable and very difficult to make changes to.

Agreed. I was not proposing to do that.

I was proposing to have the source code file parsed by the generated parser into the initial AST. Hand-coded compilation stages would operate on that automatically generated AST.

combinators resulting in shorter and more readable code, that you can actually make changes to without messing up the whole grammar (modularity is a very important feature, and yes, modularity does require backtracking. In my experience the loss in performance is minor, compared to the gain in the ability to change and add new parsers).

I presume what you mean is that a particular function can operate on a subset of the grammar?

Why wouldn’t that also be the case with a LL(k) grammar?

Backtracking is because you don’t have an LL(k) table of (up to k) lookahead tokens which informs the parser (in advance of executing the processing for a branch) which branch of the multiple-choice productions of the grammar will be matched.
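
To illustrate (a hypothetical toy grammar, not the Zer0 grammar): with an LL(1) table the single token of lookahead selects the production before any work is done, so nothing is ever tried and then undone:

    // Toy productions:  stmt -> 'let' IDENT | 'return' IDENT | IDENT
    function parseDecl(tokens, i)   { return { decl: tokens[i + 1], next: i + 2 } }
    function parseReturn(tokens, i) { return { rtn:  tokens[i + 1], next: i + 2 } }
    function parseExpr(tokens, i)   { return { exp:  tokens[i],     next: i + 1 } }

    var stmtTable = {          // lookahead token -> production (the FIRST sets are disjoint)
        'let':    parseDecl,
        'return': parseReturn
    }

    function parseStmt(tokens, i) {
        var branch = stmtTable[tokens[i]] || parseExpr   // default production
        return branch(tokens, i)                         // no backtracking needed
    }

    parseStmt(['let', 'x'], 0)   // => { decl: 'x', next: 2 }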

Again not my experience, once you introduce object hierarchies and inheritance into the AST it adds a large amount of boilerplate into the code. For example take a look at my code generator for JavaScript:

function gen_fn(ast) {
    return 'function ' + ast.fn + '(' + ast.args.join(',') + '){' + gen_blk(ast.body) + '}'
}

function gen_exp(ast) {
    if (ast.lit) {
        return ast.lit.toString()
    } else if (ast.var) {
        return ast.var
    } else if (ast.app) {
        return ast.app + '(' + ast.args.map(gen_exp).join(',') + ')'
    } else if (ast.fn !== undefined) {
        return gen_fn(ast)
    } else {
        return ''
    }
}

function gen_stmt(ast) {
    if (ast.fn) {
        return gen_fn(ast)
    } else if (ast.decl) {
        return 'var ' + ast.decl + '=' + gen_exp(ast.exp) + ';'
    } else if (ast.ass) {
        return ast.ass + '=' + gen_exp(ast.exp) + ';'
    } else if (ast.rtn) {
       return 'return ' + gen_exp(ast.rtn) + ';'
    } else if (ast.app) {
        return ast.app + '(' + ast.args.map(gen_exp).join(',') + ');'
    } else {
        var exp = gen_exp(ast)
        if (exp !== '') {
            exp += ';'
        }
        return exp
    }
}

function gen_blk(ast) {
    if (ast.blk) {
        return ast.blk.map(gen_stmt).join('')
    } else {
        return gen_stmt(ast)
    }
}

var exports = function generate(ast) {
    return gen_blk(ast)
}

Have you any idea of the amount of boilerplate the object-oriented approach would generate? And it gets worse when you want to do subtree-matching and tree-rewriting for the intermediate pass stages of the compiler. I have treated the JS objects as a tagged union, as we would write data Ast = Blk {...} | ... and we can then simply match on the tag.

Trying to grok your point. Seems you’re referring to the boilerplate necessary to visit each node (object) of the DAG tree of the AST and do useful work on it, such as transforming it or (apparently as shown in your example) serializing it to the output code of the compiler.

So AFAICT it seems that for every action we want to apply to the AST, we define a type class for that action and write a type class instance for every data type (i.e. where data types are declared with class or data depending on the PL) of object in the AST. That type class walks all of its child objects, implicitly (by the implicit type class instance resolution mechanism) applying the specific type class instance necessary for each child object’s data type. This has nothing to do with the parser generator, since these type classes are hand-coded.

The main boilerplate to eliminate in this hand-coded code is to automate the case where the type class instance does nothing but walk its children, so the only thing that changes between type classes is the name of the type class. Is it possible to express this in Haskell? I think we can do this in Scala with a macro.

Seems very straightforward and elegant to me. What if anything am I missing from your point?


It’s interesting to compare how the above would be accomplished without type classes. In that case we would write functions that input each object type, and which call other such functions for their child objects. So superficially there seems to be no significant advantage for type classes.

But for example let’s consider the multiple choice productions of a grammar rule such as:

member:
   data
   class
   instance
   value
   variable

The possible productions of member forms a logical disjunction. Any of those productions (which would each be a data type output by the parser generator) which appear as identical productions elsewhere in the grammar would form non-disjoint disjunctions (i.e. different disjunctions would share some possible value(s) and thus logically be overlapping and not disjoint). If member is made a supertype of those said productions which appear identically elsewhere in the grammar, I posit the said supertype would destroy the disjoint invariant for nominal types and allow possibilities for vacuous reasoning about nominal types in the type parametrised case. If instead member is given the type of an anonymous, structural union Data | Class | Instance | Value | Variable then only the anonymous, structural unions don’t have the disjoint invariant, which is an intentional necessary feature of anonymous, structural unions.

So in the type class case the type class bounded function (i.e. a function with require clause) is invoked with object of type Member = Data | Class | Instance | Value | Variable. If the compiler is smart enough to automatically create the switch-case logic for each possible value in the union then it can automatically call the type class instance for each of those possible types in the union. This is necessarily dynamic dispatch and can’t be monomorphised by the compiler because the type of the value of Member isn’t known at compile-time. When a new production is added to member, the parser generator automatically generates the new data type corresponding to the new production and only a new type class instance must be hand-coded and then the compiler will automatically patch up all the uses everywhere in the code base.

Without type classes the switch-case logic must be hand-coded¹ unless there’s function overloading available. Unless function overloading is available, removing, changing, or adding a production requires hand-coding¹ to remove, change, or add respectively the function call to said logic. Additionally, some compilers may not even warn us if a case in that logic isn’t possible (if production removed or changed) or missing (if production was added). Perhaps this is the boilerplate @keean is referring to? Whereas, the type class case will always warn if missing an instance of a type class.

Note function overloading seems to accomplish much of what type classes do, except no concept of a default function and no associated output types (other than a return type of the overloaded function if overloading allows varying the return type). The §10.2.3 Ad-hoc polymorphism is recognizing the congruence to function overloads with dependent return types (which is analogous to a type class with output associated type): (t₁ → t₂) ⨅ (t₁' → t₂').

As I alluded to, the case for type classes becomes even more compelling if some actions on the AST only process some types and so many of the types will simply apply the action to any children of the said type (thus walking the AST). So if we have some way to make a compiler smarter about this case, then we can write one type class that does nothing but walk the AST. And then we can tell the compiler that when an instance for a type class of a specific action is not available, then this means use the said default type class instead which only walks, but apply the action type class to any children. And repeat that recursively. So in this way we would only have to write instances for types which do an action, unlike in the case without type classes, were we would have to hand-code all the functions for every type for every action even if only the default of walking children applies to a type for a particular action. Those who don’t understand type classes well may not be able to visualize what I’m describing but they will learn by example when I implement this.
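
A rough JavaScript analogue of that shape (not type classes, and the node layout is hypothetical), just to make the walk-by-default idea concrete:

    // An "action" only supplies handlers for the node tags it cares about;
    // every other node falls through to the default behaviour of walking its children.
    function walk(action, ast) {
        var handler = action[ast.tag]
        if (handler) {
            handler(action, ast)
        } else {
            var children = ast.children || []
            children.forEach(function (child) { walk(action, child) })
        }
    }

    var collectVariables = {
        vars: [],
        variable: function (self, ast) { self.vars.push(ast.name) }
    }

    walk(collectVariables, {
        tag: 'sum',
        children: [{ tag: 'variable', name: 'x' }, { tag: 'literal', value: 3 }]
    })
    // collectVariables.vars => ['x']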

Of course we may be able to accomplish analogous automation of coding and refactoring without type classes by employing macros but @keean already argued that macros are bad because they essentially fork the PL. Much better to have a consistent type class feature in the PL which all users of that PL already understand and so they don’t have to learn custom macros for each different code base.

¹ One could presumably write a macro in some PLs which would automate that logic, but @keean already argued that macros are bad because they essentially fork the PL.


A tangential issue is how will we simulate anonymous, structural unions in Scala 2 (until Scala 3 which has unions is ready to deploy ~2020).

The only way I can think that will really work correctly is to employ a sealed trait² (c.f. also) for every distinct anonymous, structural union type. And those anonymous, structural unions which have a subtyping relationship need to be in a trait hierarchy. So this means we have to do whole program analysis so that each type which is a member of any said unions inherit from every said trait which corresponds to said unions of which the type is a member. The programmer will have to do this manually if coding in Scala. A Zer0 transpiler could automate this.

Note employing such inheritance thus allows the aforementioned “vacuous reasoning” if the said traits are employed in some type parametrised logic. Normally we would want the compiler (e.g. Zer0) to exclude anonymous, structural unions from those cases that can cause “vacuous reasoning”. So if manually coding said trait simulation of unions, then programmer should avoid passing the said traits into type parameters which would cause “vacuous reasoning”. I think that could be accomplished by only applying type class bound operations on anonymous, structural unions (and thus their trait simulations), which effectively removes any overlapping non-disjoint portions of any union of said unions because the type class instances are resolved at compile-time and can’t be overlapping.³

² With some algorithmically generated type name such as all the possible member types concatenated with underscore separator character, e.g. sealed trait Data_Class_Instance_Value_Variable.

³ Haskell does have an optional extension for overlapping instances but presuming not enabling overlapping instances.


sighoya commented Dec 9, 2018

@shelby3 wrote:

Backtracking is because you don’t have a LL(k) table of (up to k number of) lookahead tokens which informs the parser (in advance of executing processing for a branch) which branch of multiple choice productions of the grammar will be matched.

Can't we look ahead in parser combinators? For the rest of the non-LL(k) grammar, why not spawn multiple threads parsing different rules and write their result to some shared state? The key point is that only one will write the result, as the others will kill themselves because no production is satisfied, implying no synchronization.

@keean wrote:

Have you any idea of the amount of boilerplate the object-oriented approach would generate? And it gets worse when you want to do subtree-matching and tree-rewriting for the intermediate pass stages of the compiler. I have treated the JS objects as a tagged union, as we would write data Ast = Blk {...} | ... and we can then simply match on the tag.

I don't see the reason why inheritance causes this bloat. What you may be complaining about is the use of the visitor anti-pattern, a fact accepted even by a fair number of OOP lovers.

@shelby3 wrote:

This is necessarily dynamic dispatch and can’t be monomorphised by the compiler because the type of the value of Member isn’t known at compile-time

Dynamic dispatch, yes, but the functions are monomorphized anyway; only an initial branching is required to choose the appropriate monomorphized function, instead of calling one function accepting a box of possible types.

Without type classes the switch-case logic must be hand-coded¹.

Without ad hoc polymorphism, method overloading would also suffice. And macros are indeed a bad deal to handle ad hoc dispatching.

The only way I can think that will really work correctly is to employ a sealed trait² (c.f. also) for every distinct anonymous, structural union type.

Or inheritance in general: create wrapper types all inheriting from the same supertype encoding the specified union type, where each wrapper type wraps the values of exactly one type in the union. Then you would wrap and unwrap values and do some ugly casts.


shelby3 commented Dec 9, 2018

@sighoya wrote:

Can't we lookahead in parser combinators?

Apparently not holistically, to avoid all cases of backtracking, because that requires a holistic analysis of the grammar. You could for example familiarize yourself (if not already familiar) with the concepts of FIRST and FOLLOW sets, which form a global table for the entire LL(k) grammar.
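
For a flavour of what that global analysis gives you (a hypothetical toy grammar, comments only):

    // stmt  ->  'let' IDENT '=' expr  |  expr
    // expr  ->  IDENT expr'
    // expr' ->  '+' IDENT expr'  |  ε
    //
    // FIRST(stmt) = { 'let', IDENT }    FIRST(expr') = { '+', ε }
    // FOLLOW(expr') = FOLLOW(expr) = FOLLOW(stmt) = { EOF }
    //
    // The two stmt productions start with disjoint FIRST sets ('let' vs IDENT), so one
    // token of lookahead picks the production; a combinator library only discovers a
    // wrong choice by failing and backtracking.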

Dynamic dispatch, yes but the functions are monomorphized anyways, only an initial branching is required to choose the appropriate monomorphized function instead to call one function accepting a box of possible types.

Agreed we’re not referring to the boxed Member throughout the type class instance functions, but rather only the dispatch is dynamic, i.e. choosing a type class instance dynamically that applies to the type of the value of the Member union type.

Without ad hoc polymorphism, method overloading would also suffice.

This entails adding virtual methods to a base class (of all types in the AST) for each action, instead of declaring a new type class for each action we want to perform on all types in the AST. Note that cases where the type is known at compile-time will not monomorphise the dispatch as they would with type classes, because virtual methods are always dynamic dispatch. Statically resolved function overloading would be monomorphised dispatch, not require adding methods to a base class, and avoid the subtyping issue below.

But that doesn’t remove the boilerplate of needing to implement that virtual method for every type even if all that type does for a particular action is walk its children and delegate the action to its children. Whereas, I proposed an extension for type classes (not currently available in Haskell apparently) which I posit would automate that boilerplate.

The problem with method overloading is it requires subtyping and in my prior post I have already explained (and linked to detailed explanations) that this subtyping causes unsoundness in the type system. I posited that only subtyping of the anonymous, structural unions can possibly be sound by restricting them only to type class bounded operations. All System F and programming languages with subtyping (along with intersections and/or generics) are apparently unsound! I think that is a profound realization. And basically vindicates @keean’s dislike of the type systems in Scala and Java.

Then you would wrap and unwrap values and do some ugly casts.

Scala has Either but that is (as you admit) a mess of boilerplate compared to my trait idea.


keean commented Dec 11, 2018

@shelby3 @sighoya There are many comments I could make, but being pushed for time, I will go for the one that stands out the most.

Of course overloading (providing it is multiple dispatch, so all function arguments take part in specialisation) can do what type-classes can do; type-classes are constrained (controlled) overloading after all. The question is more the other way around: can you do everything useful with type-classes that you can do with overloads? If so then type-classes have the advantage of having a static (polymorphic) type known at compile time for all overloads. As such they are a specification for all future overloads that can be defined. Ad-hoc overloading is fine for uni-typed languages, but in typed languages a future overload could break existing code (where there is dynamic dispatch).

Further, finding the least general supertype is not solvable for all cases (not decidable) and is not principal (there is not always one clear least general supertype).

Type-classes provide generalisation of types over overloads.


shelby3 commented Dec 16, 2018

Let me give my interpretation of @keean’s latest post above.

AFAICT, he is essentially reiterating the point I was making¹ over at the Pony developer mailing list, that typeclasses can substitute for the need to have intersection types in the type system. Because instead of modeling the overloaded function as an intersection of function types in the type system (for each specific permutation of concrete types for the overloaded input arguments), we instead model the function as polymorphic (i.e. generic type parameters) on those overload input arguments. The overloaded implementations are created by the compiler by finding instances of the required typeclass interfaces, but the type system is not involved in this unless the required typeclasses have associated types. Yet as I had explained, even associated types don’t introduce the need for intersection types nor subtyping.

So typeclasses enable controlling of overloading of function definition’s input argument types orthogonal to the type system. The function definition is fully polymorphic so it will have the same polymorphic type in the future. Future not yet contemplated overloads will be added orthogonal to the type of the function definition, at the function call site. Whereas, with concrete (i.e. not polymorphic) ad hoc overloading, the function definition’s type (which is an intersection type) changes when adding new overloads. This is why I suppose typeclasses are referred to as ad-hoc polymorphism, as a distinct terminology to ad hoc overloading.

@keean is pointing out that intersection types inherently introduce subtyping, and given a union or intersection of types, finding the least upper bound (LUB, aka the least general supertype) makes principal type inference undecidable. That is why MLsub put all unions and intersections in the ground types, so that the LUB is always the intersection or union itself.

Note however as I cited for the MLsub paper, when the return type of a function depends on input types via associated types, then the principal types are intersections which is a contradiction of @keean’s point.

AFAICT, what is really going on is that Haskell has no other subtyping other than what would be introduced by the inherent intersection types created by associated types. Haskell is able to attain global principal type inference by placing restrictions on the rest of the type system (i.e. no other subtyping in the type system). I pointed out at the Pony developer mailing list how subtyping of traits and interfaces is ostensibly making the type system unwieldy and which could be discarded in favor of typeclasses as a suitable (and IMO superior) replacement.

So I really don’t find @keean’s type system related point to be entirely consistent and compelling (the devil is in the details and function overloading could also form principal types if the rest of the type system is limited as necessary for Haskell). Rather I think my point was the salient one, in that typeclasses provide a more structured form of overloading and thus can enable us to better automate boilerplate.

It was very difficult for me to learn the generative essence of typeclasses, because everyone seems to mislead the reader, as for example @keean is doing here. The truth is actually what I wrote. But nowhere will anyone tell you this!

¹ Although my point was about intersection types in general for forming subtypes from intersections of traits and interfaces. @keean is focused on intersections of function types, which as I explain AFAICT we still need with typeclasses if we have associated types. My point over at the Pony developer mailing list was to simplify (and remove a source of unsoundness from) the type system by removing intersection types for traits and interfaces, because we can attain analogous functionality with typeclasses.
