
Polyglot Language Understanding

prabhu edited this page Jan 3, 2024 · 32 revisions

This is a survey of projects/research that try to understand multiple programming languages in a "unified" way.

There are many lexing / syntax-highlighting-only projects toward the end of the page. The more interesting ones attempt something closer to parsing, and even semantic analysis.

But the simpler projects are naturally the most comprehensive in terms of the number of languages supported. They're valuable "corpuses" of language info.

This page is editable -- feel free to add other projects, with links, a description, and why they're interesting.


I made a rough categorization into light vs. heavy, which refers to how much code is shared between language "back ends". If no code is shared, it's "heavy".

That is, you could "simply" import entire compiler front ends and output protobufs, which is what Google Kythe did, I believe. That would be heavy. Or you could rewrite lightweight lexers/parsers for every language in your own DSL, which would be light.

(Note: light is not necessarily better than heavy!)
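To make the "light" end of the spectrum concrete, here is a minimal sketch in Python of one shared lexing engine driven by per-language rule tables. The `RULES` tables, the token kinds, and the `lex` function are all hypothetical names made up for illustration; they are not taken from any project listed on this page, and the patterns are far too simple for real use.

```python
import re

# Hypothetical per-language rule tables: (token kind, regex).
# Order matters: earlier rules win when two could match at the same position.
RULES = {
    "python": [
        ("comment", r"#[^\n]*"),
        ("string", r'"[^"\n]*"'),
        ("word", r"[A-Za-z_]\w*"),
    ],
    "c": [
        ("comment", r"//[^\n]*|/\*.*?\*/"),
        ("string", r'"(?:\\.|[^"\\\n])*"'),
        ("word", r"[A-Za-z_]\w*"),
    ],
}


def lex(lang, source):
    """Shared engine: the only language-specific input is the rule table."""
    pattern = re.compile(
        "|".join(f"(?P<{kind}>{regex})" for kind, regex in RULES[lang]),
        re.DOTALL,  # lets /* ... */ comments span lines
    )
    # Characters that match no rule (whitespace, punctuation) are skipped.
    return [(m.lastgroup, m.group()) for m in pattern.finditer(source)]


if __name__ == "__main__":
    print(lex("python", 'x = "a" # hi'))
    print(lex("c", "int x; // hi"))
```

Adding a language here means adding a table, not a new engine. The "heavy" approach would instead call into each language's own front end and only share the output format.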


Note that finding patterns for syntax highlighting kind of "bleeds in" to the problem of finding patterns that indicate bugs and security issues.

Lightweight Implementations

Concept: Island Grammars. An island grammar only precisely defines small portions of the syntax of a language. The rest of the syntax is defined imprecisely, for instance as a list of characters, or a list of tokens.
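As a rough illustration of the idea, the sketch below (Python) treats function headers as the only "island" and everything else as undifferentiated "water". The `ISLAND` pattern and the `parse_islands` function are hypothetical and deliberately naive; a real island grammar would usually be written in a parser formalism rather than with a single regex.

```python
import re

# Island: a function header, defined precisely (hypothetical pattern that
# happens to cover Python-style "def" and JavaScript-style "function").
ISLAND = re.compile(r"\b(?:def|function)\s+(?P<name>[A-Za-z_]\w*)\s*\(")


def parse_islands(source):
    """Split source into ('island', name) and ('water', text) chunks."""
    chunks, pos = [], 0
    for m in ISLAND.finditer(source):
        if m.start() > pos:
            # Everything between islands is kept as raw characters.
            chunks.append(("water", source[pos:m.start()]))
        chunks.append(("island", m.group("name")))
        pos = m.end()
    if pos < len(source):
        chunks.append(("water", source[pos:]))
    return chunks


if __name__ == "__main__":
    for kind, text in parse_islands("x = 1\ndef foo(a):\n  return a\n"):
        print(kind, repr(text))
```

The trade-off is visible in the sketch: it never fails on syntax it does not understand, but it will also report a "function" that appears inside a string or comment, because the water is not analyzed at all.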

Heavyweight Implementations

  • semgrep / coccinelle (OCaml)

    • Semgrep: a static analysis journey (2021) - How an academic project for the Linux kernel evolved into a multilingual security tool
    • INRIA -> Facebook -> r2c
    • facebook/pfff repo (OCaml) style issues and potential bugs.*
  • https://github.com/github/semantic -- appears inactive

    • Haskell
  • Google Kythe - open source version of code search project started by Steve Yegge

  • ROSE

    • Developed at Lawrence Livermore National Laboratory (LLNL), ROSE is an open source compiler infrastructure to build source-to-source program transformation and analysis tools for large-scale C (C89 and C99), C++ (C++98 and C++11), UPC, Fortran (77, 95, 2003), OpenMP, Java, Python, PHP, and binary applications.
    • ROSE is particularly well suited for building custom tools for static analysis, program optimization, arbitrary program transformation, domain-specific optimizations, complex loop optimizations, performance analysis, and cyber-security.
    • Written in C++ - https://github.com/rose-compiler/rose/tree/weekly/src/AstNodes/Expression
  • Doxygen

    • Doxygen automates the generation of documentation from source code comments, parsing information about classes, functions, and variables to produce output in formats like HTML and PDF.
    • Doxygen provides robust support for documenting C++ code, recognizing the intricacies of the language and generating comprehensive documentation.
    • Next to C++, Doxygen also supports C, Python, PHP, Java, C#, Objective-C, Fortran, VHDL, Splice, IDL, and Lex.
  • Atom

    • Atom is a novel intermediate representation and a CLI tool for parsing and slicing codebases in multiple programming languages.
    • Generates usages, data flows, and reachable flow slices for codebases in JSON format.
    • Exports the various representations, including data flows, to GraphML and DOT formats for advanced visualization and analysis.
    • Written in Scala and distributed as a container image and npm package.

Polyglot Interfaces

Syntax Highlighting

Other Surveys
