Example: Dart lexer. #38

Closed
modulovalue opened this issue Nov 15, 2023 · 1 comment

I have a rough specification of the lexical structure of Dart, which I believe is a visibly pushdown language (VPL). It looks like it needs only a little massaging to become a great real-world example grammar for owl.

One of my parsing systems attempts to simulate visibly pushdown automata (VPAs) with a set of regular automata and a stack that tracks which automaton is active. Push and pop actions associated with certain literals switch between the automata. The specification is designed around that idea.
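
To make the mechanism concrete, here is a minimal sketch of a lexer driven by a stack of modes, where each mode is a set of regular rules and some rules push or pop a mode after they match. It is an illustration only, not the implementation behind the specification: `Rule`, `StackAction`, `Token`, `lex`, and `modes` are hypothetical names, and the longest-match policy is an assumption. The rule names (`startslsq`, `slsqpop`, `advancedinterpolation`, and so on) are borrowed from the specification for readability.

```dart
// Hypothetical sketch: none of these types belong to the real specification.

/// What a rule does to the mode stack after it matches.
enum StackAction { none, push, pop }

/// One lexical rule: a regular pattern tried at the current position.
class Rule {
  final String name;
  final RegExp pattern;
  final StackAction action;
  final String? target; // mode to push when action == StackAction.push

  Rule(this.name, String source, {this.action = StackAction.none, this.target})
      : pattern = RegExp(source);
}

class Token {
  final String name;
  final String text;
  Token(this.name, this.text);

  @override
  String toString() => '$name($text)';
}

/// Lexes [input] with one rule set per mode and a stack that records
/// which mode is currently active.
List<Token> lex(String input, Map<String, List<Rule>> modes, String start) {
  final stack = <String>[start];
  final tokens = <Token>[];
  var pos = 0;
  while (pos < input.length) {
    if (stack.isEmpty) {
      throw FormatException('Unbalanced pop', input, pos);
    }
    // Assumption: the longest non-empty match in the active mode wins.
    Rule? best;
    Match? bestMatch;
    for (final rule in modes[stack.last]!) {
      final m = rule.pattern.matchAsPrefix(input, pos);
      if (m != null &&
          m.end > pos &&
          (bestMatch == null || m.end > bestMatch.end)) {
        best = rule;
        bestMatch = m;
      }
    }
    if (best == null || bestMatch == null) {
      throw FormatException('No rule matches in mode ${stack.last}', input, pos);
    }
    tokens.add(Token(best.name, bestMatch.group(0)!));
    pos = bestMatch.end;
    switch (best.action) {
      case StackAction.push:
        stack.add(best.target!);
        break;
      case StackAction.pop:
        stack.removeLast();
        break;
      case StackAction.none:
        break;
    }
  }
  return tokens;
}

/// Two illustrative modes: ordinary code and a single-quoted string body.
final modes = <String, List<Rule>>{
  'base': [
    Rule('ws', r'[ \t\r\n]+'),
    Rule('identifier', r'[_$a-zA-Z][_$0-9a-zA-Z]*'),
    Rule('startslsq', "'", action: StackAction.push, target: 'slsq'),
    Rule('lcur', r'\{', action: StackAction.push, target: 'base'),
    Rule('rcur', r'\}', action: StackAction.pop),
  ],
  'slsq': [
    Rule('slsqcontent', r"[^'$\\\n\r]+"),
    Rule('advancedinterpolation', r'\$\{',
        action: StackAction.push, target: 'base'),
    Rule('slsqpop', "'", action: StackAction.pop),
  ],
};
```

Calling `lex(r"'a${b}c'", modes, 'base')` walks the stack from `base` into `slsq`, back into `base` for the interpolation, and out again. The closing `}` is recognised as `rcur` only because `base` is the active mode at that point.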

Here's a snippet from my declarative specification for the lexical structure of Dart, written in Dart (it's a little verbose, I apologize):
    final key_blockcomment = LexiKey();
    final key_mldq = LexiKey();
    final key_mlsq = LexiKey();
    final key_sldq = LexiKey();
    final key_slsq = LexiKey();
    final key_rmldq = LexiKey();
    final key_rmlsq = LexiKey();
    final key_rsldq = LexiKey();
    final key_rslsq = LexiKey();
    final key_base = LexiKey();
    return LexiLexerInfoImpl(
      root: inner(
        name: "top",
        atoms: [],
        children: () {
          final lang_base = leaf(
            name: "base",
            atoms: [
              apgm(v: "'", n: "startslsq", k: key_slsq),
              apgm(v: '"', n: "startsldq", k: key_sldq),
              argm(r: r"'''((\\? )*\\?(\n|\r|\r\n))?", n: "startmlsq", k: key_mlsq),
              argm(r: r'"""((\\? )*\\?(\n|\r|\r\n))?', n: "startmldq", k: key_mldq),
              apgm(v: "r'", n: "startrslsq", k: key_rslsq),
              apgm(v: 'r"', n: "startrsldq", k: key_rsldq),
              argm(r: r"r'''((\\? )*\\?(\n|\r|\r\n))?", n: "startrmlsq", k: key_rmlsq),
              argm(r: r'r"""((\\? )*\\?(\n|\r|\r\n))?', n: "startrmldq", k: key_rmldq),
              arnm(r: r'[0-9]+([eE][+\-]?[0-9]+)?', n: "integer"),
              arnm(r: r'[0-9]*\.[0-9]+([eE][+\-]?[0-9]+)?', n: "decimal"),
              arnm(r: r'0[xX][0-9ABCDEFabcdef]+', n: "hex"),
              apgm(v: r'{', k: key_base, n: "lcur"),
              aplm(v: r'}', n: "rcur"),
            ],
            handle: key_base,
          );
          return [
            () {
              final base_identifier = arnm(r: r'[_$a-zA-Z][_$0-9a-zA-Z]*', n: "identifier");
              final base_kabstract = apnm(v: r'abstract', n: "kabstract");
              final base_kas = apnm(v: r'as', n: "kas");
              final base_kassert = apnm(v: r'assert', n: "kassert");
              final base_kasync = apnm(v: r'async', n: "kasync");
              final base_kawait = apnm(v: r'await', n: "kawait");
              final base_ksealed = apnm(v: r'sealed', n: "ksealed");
              final base_kbase = apnm(v: r'base', n: "kbase");
              final base_kwhen = apnm(v: r'when', n: "kwhen");
              final base_kbreak = apnm(v: r'break', n: "kbreak");
              final base_kcase = apnm(v: r'case', n: "kcase");
              final base_kcatch = apnm(v: r'catch', n: "kcatch");
              final base_kclass = apnm(v: r'class', n: "kclass");
              final base_kconst = apnm(v: r'const', n: "kconst");
              final base_kcontinue = apnm(v: r'continue', n: "kcontinue");
              final base_kcovariant = apnm(v: r'covariant', n: "kcovariant");
              final base_kdefault = apnm(v: r'default', n: "kdefault");
              final base_kdeferred = apnm(v: r'deferred', n: "kdeferred");
              final base_kdo = apnm(v: r'do', n: "kdo");
              final base_kdynamic = apnm(v: r'dynamic', n: "kdynamic");
              final base_kelse = apnm(v: r'else', n: "kelse");
              final base_kenum = apnm(v: r'enum', n: "kenum");
              final base_kexport = apnm(v: r'export', n: "kexport");
              final base_kextends = apnm(v: r'extends', n: "kextends");
              final base_kextension = apnm(v: r'extension', n: "kextension");
              final base_kexternal = apnm(v: r'external', n: "kexternal");
              final base_kfactory = apnm(v: r'factory', n: "kfactory");
              final base_kfalse = apnm(v: r'false', n: "kfalse");
              final base_kfinal = apnm(v: r'final', n: "kfinal");
              final base_kfinally = apnm(v: r'finally', n: "kfinally");
              final base_kfor = apnm(v: r'for', n: "kfor");
              final base_kfunction = apnm(v: r'Function', n: "kfunction");
              final base_kget = apnm(v: r'get', n: "kget");
              final base_khide = apnm(v: r'hide', n: "khide");
              final base_kif = apnm(v: r'if', n: "kif");
              final base_kimplements = apnm(v: r'implements', n: "kimplements");
              final base_kimport = apnm(v: r'import', n: "kimport");
              final base_kin = apnm(v: r'in', n: "kin");
              final base_kinterface = apnm(v: r'interface', n: "kinterface");
              final base_kis = apnm(v: r'is', n: "kis");
              final base_klate = apnm(v: r'late', n: "klate");
              final base_klibrary = apnm(v: r'library', n: "klibrary");
              final base_kmixin = apnm(v: r'mixin', n: "kmixin");
              final base_knew = apnm(v: r'new', n: "knew");
              final base_knull = apnm(v: r'null', n: "knull");
              final base_kof = apnm(v: r'of', n: "kof");
              final base_kon = apnm(v: r'on', n: "kon");
              final base_koperator = apnm(v: r'operator', n: "koperator");
              final base_kpart = apnm(v: r'part', n: "kpart");
              final base_krequired = apnm(v: r'required', n: "krequired");
              final base_krethrow = apnm(v: r'rethrow', n: "krethrow");
              final base_kreturn = apnm(v: r'return', n: "kreturn");
              final base_kset = apnm(v: r'set', n: "kset");
              final base_kshow = apnm(v: r'show', n: "kshow");
              final base_kstatic = apnm(v: r'static', n: "kstatic");
              final base_ksuper = apnm(v: r'super', n: "ksuper");
              final base_kswitch = apnm(v: r'switch', n: "kswitch");
              final base_ksync = apnm(v: r'sync', n: "ksync");
              final base_kthis = apnm(v: r'this', n: "kthis");
              final base_kthrow = apnm(v: r'throw', n: "kthrow");
              final base_ktrue = apnm(v: r'true', n: "ktrue");
              final base_ktry = apnm(v: r'try', n: "ktry");
              final base_ktypedef = apnm(v: r'typedef', n: "ktypedef");
              final base_kvar = apnm(v: r'var', n: "kvar");
              final base_kvoid = apnm(v: r'void', n: "kvoid");
              final base_kwhile = apnm(v: r'while', n: "kwhile");
              final base_kwith = apnm(v: r'with', n: "kwith");
              final base_kyield = apnm(v: r'yield', n: "kyield");
              return inner(
                name: "kw",
                atoms: [
                  base_identifier,
                  // region keywords
                  base_kabstract,
                  base_kas,
                  base_kassert,
                  base_kasync,
                  base_kawait,
                  base_ksealed,
                  base_kbase,
                  base_kwhen,
                  base_kbreak,
                  base_kcase,
                  base_kcatch,
                  base_kclass,
                  base_kconst,
                  base_kcontinue,
                  base_kcovariant,
                  base_kdefault,
                  base_kdeferred,
                  base_kdo,
                  base_kdynamic,
                  base_kelse,
                  base_kenum,
                  base_kexport,
                  base_kextends,
                  base_kextension,
                  base_kexternal,
                  base_kfactory,
                  base_kfalse,
                  base_kfinal,
                  base_kfinally,
                  base_kfor,
                  base_kfunction,
                  base_kget,
                  base_khide,
                  base_kif,
                  base_kimplements,
                  base_kimport,
                  base_kin,
                  base_kinterface,
                  base_kis,
                  base_klate,
                  base_klibrary,
                  base_kmixin,
                  base_knew,
                  base_knull,
                  base_kof,
                  base_kon,
                  base_koperator,
                  base_kpart,
                  base_krequired,
                  base_krethrow,
                  base_kreturn,
                  base_kset,
                  base_kshow,
                  base_kstatic,
                  base_ksuper,
                  base_kswitch,
                  base_ksync,
                  base_kthis,
                  base_kthrow,
                  base_ktrue,
                  base_ktry,
                  base_ktypedef,
                  base_kvar,
                  base_kvoid,
                  base_kwhile,
                  base_kwith,
                  base_kyield,
                  // endregion
                ],
                children: [
                  lang_base,
                ],
                r: LexiResolutionHandlerExplicitImpl(
                  resolvers: [
                    LexiResolverImpl(match: [base_identifier, base_kabstract], prefer: base_kabstract),
                    LexiResolverImpl(match: [base_identifier, base_kas], prefer: base_kas),
                    LexiResolverImpl(match: [base_identifier, base_kwhen], prefer: base_kwhen),
                    LexiResolverImpl(match: [base_identifier, base_ksealed], prefer: base_ksealed),
                    LexiResolverImpl(match: [base_identifier, base_kbase], prefer: base_kbase),
                    LexiResolverImpl(match: [base_identifier, base_kassert], prefer: base_kassert),
                    LexiResolverImpl(match: [base_identifier, base_kasync], prefer: base_kasync),
                    LexiResolverImpl(match: [base_identifier, base_kawait], prefer: base_kawait),
                    LexiResolverImpl(match: [base_identifier, base_kbreak], prefer: base_kbreak),
                    LexiResolverImpl(match: [base_identifier, base_kcase], prefer: base_kcase),
                    LexiResolverImpl(match: [base_identifier, base_kcatch], prefer: base_kcatch),
                    LexiResolverImpl(match: [base_identifier, base_kclass], prefer: base_kclass),
                    LexiResolverImpl(match: [base_identifier, base_kconst], prefer: base_kconst),
                    LexiResolverImpl(match: [base_identifier, base_kcontinue], prefer: base_kcontinue),
                    LexiResolverImpl(match: [base_identifier, base_kcovariant], prefer: base_kcovariant),
                    LexiResolverImpl(match: [base_identifier, base_kdefault], prefer: base_kdefault),
                    LexiResolverImpl(match: [base_identifier, base_kdeferred], prefer: base_kdeferred),
                    LexiResolverImpl(match: [base_identifier, base_kdo], prefer: base_kdo),
                    LexiResolverImpl(match: [base_identifier, base_kdynamic], prefer: base_kdynamic),
                    LexiResolverImpl(match: [base_identifier, base_kelse], prefer: base_kelse),
                    LexiResolverImpl(match: [base_identifier, base_kenum], prefer: base_kenum),
                    LexiResolverImpl(match: [base_identifier, base_kexport], prefer: base_kexport),
                    LexiResolverImpl(match: [base_identifier, base_kextends], prefer: base_kextends),
                    LexiResolverImpl(match: [base_identifier, base_kextension], prefer: base_kextension),
                    LexiResolverImpl(match: [base_identifier, base_kexternal], prefer: base_kexternal),
                    LexiResolverImpl(match: [base_identifier, base_kfactory], prefer: base_kfactory),
                    LexiResolverImpl(match: [base_identifier, base_kfalse], prefer: base_kfalse),
                    LexiResolverImpl(match: [base_identifier, base_kfinal], prefer: base_kfinal),
                    LexiResolverImpl(match: [base_identifier, base_kfinally], prefer: base_kfinally),
                    LexiResolverImpl(match: [base_identifier, base_kfor], prefer: base_kfor),
                    LexiResolverImpl(match: [base_identifier, base_kfunction], prefer: base_kfunction),
                    LexiResolverImpl(match: [base_identifier, base_kget], prefer: base_kget),
                    LexiResolverImpl(match: [base_identifier, base_khide], prefer: base_khide),
                    LexiResolverImpl(match: [base_identifier, base_kif], prefer: base_kif),
                    LexiResolverImpl(match: [base_identifier, base_kimplements], prefer: base_kimplements),
                    LexiResolverImpl(match: [base_identifier, base_kimport], prefer: base_kimport),
                    LexiResolverImpl(match: [base_identifier, base_kin], prefer: base_kin),
                    LexiResolverImpl(match: [base_identifier, base_kinterface], prefer: base_kinterface),
                    LexiResolverImpl(match: [base_identifier, base_kis], prefer: base_kis),
                    LexiResolverImpl(match: [base_identifier, base_klate], prefer: base_klate),
                    LexiResolverImpl(match: [base_identifier, base_klibrary], prefer: base_klibrary),
                    LexiResolverImpl(match: [base_identifier, base_kmixin], prefer: base_kmixin),
                    LexiResolverImpl(match: [base_identifier, base_knew], prefer: base_knew),
                    LexiResolverImpl(match: [base_identifier, base_knull], prefer: base_knull),
                    LexiResolverImpl(match: [base_identifier, base_kof], prefer: base_kof),
                    LexiResolverImpl(match: [base_identifier, base_kon], prefer: base_kon),
                    LexiResolverImpl(match: [base_identifier, base_koperator], prefer: base_koperator),
                    LexiResolverImpl(match: [base_identifier, base_kpart], prefer: base_kpart),
                    LexiResolverImpl(match: [base_identifier, base_krequired], prefer: base_krequired),
                    LexiResolverImpl(match: [base_identifier, base_krethrow], prefer: base_krethrow),
                    LexiResolverImpl(match: [base_identifier, base_kreturn], prefer: base_kreturn),
                    LexiResolverImpl(match: [base_identifier, base_kset], prefer: base_kset),
                    LexiResolverImpl(match: [base_identifier, base_kshow], prefer: base_kshow),
                    LexiResolverImpl(match: [base_identifier, base_kstatic], prefer: base_kstatic),
                    LexiResolverImpl(match: [base_identifier, base_ksuper], prefer: base_ksuper),
                    LexiResolverImpl(match: [base_identifier, base_kswitch], prefer: base_kswitch),
                    LexiResolverImpl(match: [base_identifier, base_ksync], prefer: base_ksync),
                    LexiResolverImpl(match: [base_identifier, base_kthis], prefer: base_kthis),
                    LexiResolverImpl(match: [base_identifier, base_kthrow], prefer: base_kthrow),
                    LexiResolverImpl(match: [base_identifier, base_ktrue], prefer: base_ktrue),
                    LexiResolverImpl(match: [base_identifier, base_ktry], prefer: base_ktry),
                    LexiResolverImpl(match: [base_identifier, base_ktypedef], prefer: base_ktypedef),
                    LexiResolverImpl(match: [base_identifier, base_kvar], prefer: base_kvar),
                    LexiResolverImpl(match: [base_identifier, base_kvoid], prefer: base_kvoid),
                    LexiResolverImpl(match: [base_identifier, base_kwhile], prefer: base_kwhile),
                    LexiResolverImpl(match: [base_identifier, base_kwith], prefer: base_kwith),
                    LexiResolverImpl(match: [base_identifier, base_kyield], prefer: base_kyield),
                  ],
                ),
              );
            }(),
            lang_base,
            inner(
              name: "sy",
              atoms: [
                apnm(v: r'[', n: "slbra"),
                apnm(v: r']', n: "srbra"),
                apnm(v: r'(', n: "slpar"),
                apnm(v: r')', n: "srpar"),
                apnm(v: r',', n: "scomma"),
                apnm(v: r':', n: "scolon"),
                apnm(v: r';', n: "ssemicolon"),
                apnm(v: r'&', n: "sand"),
                apnm(v: r'&=', n: "sandeq"),
                apnm(v: r'^', n: "scaret"),
                apnm(v: r'^=', n: "scareteq"),
                apnm(v: r'>', n: "sg"),
                apnm(v: r'<', n: "sl"),
                apnm(v: r'%', n: "spercent"),
                apnm(v: r'%=', n: "spercenteq"),
                apnm(v: r'+', n: "splus"),
                apnm(v: r'+=', n: "spluseq"),
                apnm(v: r'|', n: "spipe"),
                apnm(v: r'|=', n: "spipeeq"),
                apnm(v: r'??', n: "sqq"),
                apnm(v: r'??=', n: "sqqeq"),
                apnm(v: r'~/', n: "stildeslash"),
                apnm(v: r'~/=', n: "stildeslasheq"),
                apnm(v: r'*', n: "sstar"),
                apnm(v: r'*=', n: "sstareq"),
                apnm(v: r'/', n: "sslash"),
                apnm(v: r'/=', n: "sslasheq"),
                apnm(v: r'-', n: "sminus"),
                apnm(v: r'-=', n: "sminuseq"),
                apnm(v: r'=', n: "seq"),
                apnm(v: r'==', n: "seqeq"),
                apnm(v: r'&&', n: "sandand"),
                apnm(v: r'@', n: "sat"),
                apnm(v: r'.', n: "sdot"),
                apnm(v: r'..', n: "sdotdot"),
                apnm(v: r'...', n: "sdotdotdot"),
                apnm(v: r'...?', n: "sdotdotdotq"),
                apnm(v: r'--', n: "sminusminus"),
                apnm(v: r'!', n: "sbang"),
                apnm(v: r'!=', n: "sbangeq"),
                apnm(v: r'++', n: "splusplus"),
                apnm(v: r'#', n: "shash"),
                apnm(v: r'||', n: "spipepipe"),
                apnm(v: r'?', n: "squestion"),
                apnm(v: r'?.', n: "squestiondot"),
                apnm(v: r'?..', n: "squestiondotdot"),
                apnm(v: r'~', n: "stilde"),
              ],
              children: [
                lang_base,
              ],
            ),
            leaf(
              name: "blockcomment",
              handle: key_blockcomment,
              atoms: [
                argm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*/*/\*", k: key_blockcomment, n: "bcstart"),
                arlm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*\**\*/", n: "bcend"),
              ],
            ),
            inner(
              name: "t",
              atoms: [
                arnm(r: r'( |\t|\n|\r)+', n: "ws"),
                arnm(r: r'#![^\n\r]*(\r|\n|\r\n)?', n: "scripttag"),
                // TODO • can't parse '/**' because this will mess up parsing '/**/' I think I'll need a new language for the body of bcstart if it starts with *.
                apgm(v: r'/*', k: key_blockcomment, n: "bcinit"),
                // TODO • have a separate language for line comments to parse '///' correctly.
                arnm(r: r'//([^\n\r]+)?(\r|\n|\r\n)?', n: "singlelinenondoccomment"),
              ],
              children: [
                lang_base,
              ],
              r: const LexiResolutionHandlerExplicitImpl(
                resolvers: [],
              ),
            ),
            inner(
              name: "bsc",
              atoms: [
                arnm(r: r'\\n', n: "escaped_lf"),
                arnm(r: r'\\r', n: "escaped_cr"),
                arnm(r: r'\\b', n: "escaped_b"),
                arnm(r: r'\\t', n: "escaped_ht"),
                arnm(r: r'\\v', n: "escaped_vt"),
                arnm(r: r'\\x[0-9a-fA-F][0-9a-fA-F]', n: "escaped_byte"),
                arnm(r: r'\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]', n: "escaped_unicodesimple"),
                arnm(r: r'\\u{[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?}', n: "escaped_unicoderaw"),
                arnm(r: r'\\[^bnrtuvx]', n: "escaped_unescaped"),
                arnm(r: r'$[_a-zA-Z][_0-9a-zA-Z]*', n: "simpleinterpolation"),
                argm(r: r'${', k: key_base, n: "advancedinterpolation"),
              ],
              children: [
                leaf(
                  name: "slsq",
                  atoms: [
                    aplm(v: "'", n: "slsqpop"),
                    arnm(r: r"[^'$\\\n\r]+", n: "slsqcontent"),
                  ],
                  handle: key_slsq,
                ),
                leaf(
                  name: "sldq",
                  atoms: [
                    aplm(v: '"', n: "sldqpop"),
                    arnm(r: r'[^"$\\\n\r]+', n: "sldqcontent"),
                  ],
                  handle: key_sldq,
                ),
                leaf(
                  name: "mlsq",
                  atoms: [
                    aplm(v: "'''", n: "mlsqpop"),
                    arnm(r: r"[^'$\\]+", n: "mlsqcontent"),
                    arnm(r: r"('|'')", n: "mlsqq"),
                  ],
                  handle: key_mlsq,
                ),
                leaf(
                  name: "mldq",
                  atoms: [
                    aplm(v: '"""', n: "mldqpop"),
                    arnm(r: r'[^"$\\]+', n: "mldqcontent"),
                    arnm(r: r'("|"")', n: "mldqq"),
                  ],
                  handle: key_mldq,
                ),
              ],
            ),
            leaf(
              name: "rslsq",
              atoms: [
                aplm(v: "'", n: "rslsqend"),
                arnm(r: r"[^\n\r']+", n: "rslsqcontent"),
              ],
              handle: key_rslsq,
            ),
            leaf(
              name: "rsldq",
              atoms: [
                aplm(v: '"', n: "rsldqend"),
                arnm(r: r'[^\n\r"]+', n: "rsldqcontent"),
              ],
              handle: key_rsldq,
            ),
            leaf(
              name: "rmlsq",
              atoms: [
                aplm(v: r"'''", n: "rmlsqend"),
                arnm(r: r"(('|'')?[^'])+", n: "rmlsqcontent"),
              ],
              handle: key_rmlsq,
            ),
            leaf(
              name: "rmldq",
              atoms: [
                aplm(v: '"""', n: "rmldqend"),
                arnm(r: r'(("|"")?[^"])+', n: "rmldqcontent"),
              ],
              handle: key_rmldq,
            ),
          ];
        }(),
      ),
      start: key_base,
    );

Here's a diagram that I use to visualize that specification:
dot.pdf

Here's the ANTLR specification of Dart.

ANTLR uses rule order to disambiguate between identifiers and keywords. In my specification, the only ambiguity that needs an explicit resolution is the one between identifiers and keywords, and it is always resolved in favor of the keyword.
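
The difference from order-based disambiguation can be sketched as follows. This is a simplified illustration, not the `LexiResolverImpl` machinery itself: `Resolver`, `resolve`, and `matching` are hypothetical names, and treating an undeclared ambiguity as an error is an assumption made for the sketch.

```dart
// Hypothetical sketch of explicit ambiguity resolution between atoms.

/// Declares that when exactly the atoms in [match] accept the same lexeme,
/// the atom named [prefer] wins.
class Resolver {
  final Set<String> match;
  final String prefer;
  Resolver(this.match, this.prefer);
}

/// Returns the names of all atoms whose pattern matches [lexeme] entirely.
Set<String> matching(String lexeme, Map<String, String> atoms) => {
      for (final e in atoms.entries)
        if (RegExp('^(?:${e.value})\$').hasMatch(lexeme)) e.key,
    };

/// Picks one atom out of [candidates], the atoms that all matched the same
/// lexeme. Assumption: an ambiguity without a declared resolver is an error.
String resolve(Set<String> candidates, List<Resolver> resolvers) {
  if (candidates.length == 1) return candidates.single;
  for (final r in resolvers) {
    if (r.match.length == candidates.length &&
        r.match.containsAll(candidates)) {
      return r.prefer;
    }
  }
  throw StateError('Unresolved ambiguity between $candidates');
}
```

With atoms `identifier`, `kas`, and `kassert` and one resolver per keyword, the lexeme `assert` matches both `identifier` and `kassert` and resolves to `kassert`, while `asserts` matches only `identifier`. In this sketch, any overlap that has no resolver raises an error rather than being settled silently by declaration order.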

I plan to translate that specification to owl just to see how it works out.
