Compiler Engineer Notes

andychu edited this page Mar 31, 2022 · 2 revisions

Old Notes for Compiler Engineer Job

More Background

  • Oil is written as a high level "executable spec" in a subset of typed Python. This is ~40K lines of code painstakingly hand-crafted over 5 years! It recapitulates 50 years of shell history.
  • We have several code generators that translate this spec into ~90K lines of C++.
  • The primary one is called "mycpp", which reuses the MyPy front end. (A secondary one is Zephyr ASDL for algebraic data types, and another important one is re2c.)
  • There is a small C++ runtime implementing garbage-collected, statically typed data structures like List<T>, Dict<K, V>, and Tuple2<A, B>. This is a straightforward translation of MyPy's type system to C++'s (they match quite closely!).
  • This translation passes more than half the OSH spec tests! 1115 out of 1900 or so.

However, mycpp needs a rewrite. It's using the internals of MyPy in a hacky way.

And I don't like the "inverted" visitor style. The current idea is to rewrite the type checker with Python 3.10's match statement (only released in October 2021!). We can reuse the ast library, which in recent versions of Python supports type comments.

However, most of the time I spent on this issue was in the relatively small C++ runtime. The garbage collector is a Cheney semi-space collector, and it took me a long time to debug! It's still not done.
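For readers who haven't seen a Cheney collector, here is a toy model of the copying step in Python. It is only a sketch of the general technique, not the Oil runtime: objects are Python lists, a hypothetical `Ref` class stands in for a pointer, and forwarding addresses live in a dict rather than in the old object's header, as a real C++ implementation would likely do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A pointer: the index of an object within a space (hypothetical)."""
    index: int


def collect(from_space, roots):
    """Copy everything reachable from roots into a fresh to-space.

    from_space: list of objects; each object is a list of fields, and a
                field is either a Ref or a plain value.
    roots:      list of Ref into from_space.
    Returns (to_space, new_roots).
    """
    to_space = []
    forwarding = {}  # from-space index -> to-space index

    def copy(ref):
        # Copy an object at most once; later visits reuse the forwarding entry.
        if ref.index not in forwarding:
            forwarding[ref.index] = len(to_space)
            to_space.append(list(from_space[ref.index]))
        return Ref(forwarding[ref.index])

    new_roots = [copy(r) for r in roots]

    # Cheney scan: to_space doubles as the work queue. Objects before `scan`
    # have had their fields fixed up; objects at or after it may still hold
    # from-space references.
    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = copy(field)
        scan += 1

    return to_space, new_roots
```

Unreachable objects are never copied, so they disappear when the old space is discarded. The hard debugging work in a real runtime is elsewhere: finding every root precisely and handling object sizes and layouts, none of which this model covers.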

High Level Idea / Work Estimates

You should look at this problem and think "I can do this whole thing!" (I believe there are many compiler engineers out there with this skillset.)

Again, we want to translate 40K lines of typed Python to a similar amount of C++. And the result has to be debugged and work!

  • The translator should be around 3K lines of code for a type checker, and maybe 3K lines for a code generator. You get the parser for free with Python.
  • The C++ runtime is probably 3-5K lines for the data structures and garbage collector.
  • The C++ syscall bindings might be another 3K lines. This is so you can translate code like os.execvpe() to C++.

So the whole job should be approximately 10K to 15K lines of code?

However, note that we already have a working prototype, which was started in 2019. It passes over half the tests (though the garbage collector is not turned on; it only works on small examples). So I consider this project "low risk" in some sense.

Compensation / Starting Date

  • TBD! This is a draft.
  • To be transparent, we will try to get funding from a variety of sources
    • Probably apply for NLNet sponsorship (an organization sponsoring a lot of Nix projects, I noticed)
    • GitHub Sponsors (Zig is using this)
    • Solicit donations on the blog (which is popular e.g. on Hacker News)
  • I want to prioritize the right person over the funding / compensation. If the right person shows up, and has a somewhat flexible starting date, I think we can make it happen. (I can even front a salary myself, but I would prefer not to.)
  • We should accept "bids" from multiple candidates. But again we are not just choosing the lowest price! We want the best quality result, at any price.
  • The candidate should provide time estimates of the work above. It's like a "contracting" position.
  • I think that the person should be paid every 2 weeks, similar to a US job.

Location: can be anywhere in the world. You will probably be video conferencing with me in the US eastern time zone.

Other details

  • All the code will be open source under the existing license (Apache 2)

Code Requirements

  • Output should be human readable, and run in a debugger (already done in oil-native, must not regress)
    • We often choose to avoid "safe" features in favor of ones that can be easily and transparently stepped through in a debugger, similar to C code. But we're more type safe than C code in general.
  • Should be compatible with common tools like ASAN and profilers (already done in oil-native, must not regress)
  • Should be portable C++. There should be almost zero #ifdefs for specific architectures or OSes.
    • I will help with automated testing on multiple compilers/architectures to verify this.
    • We are currently compiling with -std=c++11 for wider platform support
  • The translator should run relatively fast.
  • The translator mostly needs to handle our 40K lines of code, so shortcuts should be taken at first. But eventually the corner cases should turn into hard errors. (Example: Python flat function scope vs. C++ block scope. You can simply ignore this issue at first, and let the C++ compiler handle it.)
  • The output should compile relatively fast. We are tracking compile times with GCC and Clang. (Right now I think the header structure is suboptimal.)
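The flat function scope vs. block scope example from the list above can be made concrete. In the sample below, `sign` is first assigned inside an if block but used after it, which is fine in Python; a naive translation that declares the variable at its first assignment would put it inside C++ braces. The checker `names_first_assigned_in_block` is a hypothetical sketch of how such a corner case might eventually be turned into a hard error, not something the current translator does.

```python
import ast

# Legal Python: `sign` is visible after the if/else.
SOURCE = '''
def label(n):
    if n > 0:
        sign = "positive"
    else:
        sign = "non-positive"
    return sign
'''


def names_first_assigned_in_block(func):
    """Return names whose first assignment sits inside a nested block.

    A naive translation would declare these inside C++ braces, so a later
    use outside the braces would fail to compile.
    """
    top_level = set()
    nested = []

    def visit(stmts, depth):
        for stmt in stmts:
            if isinstance(stmt, ast.Assign):
                for target in stmt.targets:
                    if isinstance(target, ast.Name):
                        name = target.id
                        if depth == 0:
                            top_level.add(name)
                        elif name not in top_level and name not in nested:
                            nested.append(name)
            # Recurse into compound statements (if/while/for/try bodies).
            # Sketch only: except handlers and nested defs are not treated
            # specially.
            for field in ('body', 'orelse', 'finalbody'):
                visit(getattr(stmt, field, []), depth + 1)

    visit(func.body, 0)
    return nested


func = ast.parse(SOURCE).body[0]
print(names_first_assigned_in_block(func))
```

A translator could either reject the names this reports or hoist their declarations to the top of the C++ function body.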

Questions

Who will I be working with?

Mainly me :) I will be working on the language, documentation, and improving the (very large) build system. This mostly shouldn't affect what you're doing, although the code will be evolving. (We catch any regressions in type checking or tests on every commit.)

Why not do this yourself?

Basically because it's going too slowly, and potential users like Nix need a fast version of Oil.

Why not recruit open source contributors to do it?

I think this problem can be finished with a big block of time to concentrate on it. I'm spread too thin, and other contributors have jobs.

Note that there have been ~47 contributors to Oil's codebase, although none that are consistent over many months.

Why C++?

  • For the widest OS and CPU architecture support. Shells are used on all sorts of systems. To easily turn up new systems, the shell should be compiled with the same compiler as the kernel.
  • People have started rewriting Oil in other languages, like Nim and Rust. I encourage those efforts and hope Oil will have multiple implementations. But they're not part of the project's core, whereas the C++ version is.
  • Note that the estimated ~10K lines of C++ here is the only C++ in the Oil project! We try to avoid writing C++ by hand.
    • This also means that this part of the project has to be well-written and well-tested! It must be of professional quality.

Why not plain C?

C++ makes the translator easier to write, and the generated code should be easier to read and step through with GDB.

  • Python classes are translated to C++ classes; functions are translated to functions
  • Dict[K, V] in MyPy is translated to Dict<K, V>* in C++
  • Python exceptions are C++ exceptions
  • Technical note: ASDL makes use of multiple inheritance in C++ to implement first-class variant types. (Rust apparently can't express this.) This makes the layout slightly more efficient, and note that we don't inherit any fields. We could do this in plain C without types, but it's nice to have the C++ type system as a sanity check.
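To show how direct the type mapping in the list above is, here is a minimal sketch that turns a MyPy-style type string into a C++ spelling. Only the `Dict[K, V]` to `Dict<K, V>*` rule comes from this page; the function names, the `Str*` spelling for `str`, and the pointer suffix on `List` and `Tuple2` are assumptions for illustration. It relies on the Python 3.9+ ast layout, where a subscript's slice is the index expression itself.

```python
import ast

# Hypothetical C++ spellings for scalar types; `Str*` is an assumption.
SCALARS = {'int': 'int', 'bool': 'bool', 'str': 'Str*'}


def cpp_type(annotation):
    """Translate a type string such as 'Dict[str, int]' to a C++ type."""
    return _convert(ast.parse(annotation, mode='eval').body)


def _convert(node):
    if isinstance(node, ast.Name) and node.id in SCALARS:
        return SCALARS[node.id]
    if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
        # Python 3.9+: node.slice is the index expression itself.
        index = node.slice
        params = index.elts if isinstance(index, ast.Tuple) else [index]
        args = ', '.join(_convert(p) for p in params)
        container = node.value.id
        if container in ('List', 'Dict'):
            return '%s<%s>*' % (container, args)
        if container == 'Tuple':
            # The arity is part of the C++ class name, e.g. Tuple2<A, B>.
            return 'Tuple%d<%s>*' % (len(params), args)
    raise ValueError('unsupported type: %s' % ast.dump(node))


print(cpp_type('Dict[str, List[int]]'))
```

Because the C++ runtime provides generic containers with matching names, the translation is mostly a change of bracket style; in plain C each instantiation would need its own struct or lose its element types.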

Why didn't you write the whole thing in C++ to begin with?

bash is at least 140K lines of C code. We implement most of it, and "engulf" it in the much richer Oil language. So you're basically asking why I didn't write 200K or 300K lines of C or C++ by hand :)

Our code is also memory safe by construction, since the metalanguage can't express anything unsafe. So we aim to have 5-10K lines of hand-written native code, rather than 200K or 300K lines.

Why Work on this? / What You Get For Free

TODO

  • A useful project
  • Shell corner cases ironed out. You are working with the metalanguage, not the language! So you don't have to know much about shell. (Although as mentioned, all the dev tools are written in shell + Python)
  • Prototype of all the work that passes over half the tests.

Milestones

  1. First make more tests pass with the existing translator and old runtime. (2 weeks)
  2. Make Python's configure run (the OSH 0.0 milestone!)
  3. Make the existing ~1100 OSH tests pass
  4. Make all the OSH tests pass.
    • This is the "delivery" of Four Features That Justify a New Unix Shell
  5. Make it fast
    • maybe optimize the garbage collector itself, or optimize the code to generate less garbage
    • generally we aim to be on par with bash or better. If OSH is a lot slower than bash in some use case, then that's considered a bug.
  6. Later: Translate the Oil language.
    • This depends on some work I have to do, like rewriting the expression evaluator.

TODO: When is it considered "done"? There are corner cases like the ./configure script, extended globs, etc.

Similar Work

  • Pyxell (does type checking and C++ code generation in a single pass)
    • This is done entirely by a talented undergrad! (who does competitive programming) Oil probably needs more polish and "production quality", but it's very similar.
  • Languist?

Prerequisite Work

  • Contributors need to be able to run CI
    • Otherwise we're ready to start!

To Discuss

  • The new "Tea" translator should run on itself! So we can compile our code faster.

Fun Stuff in the Future

This is definitely out of the scope of the project.

But the person who is a good fit for this job might be excited or interested by future work.

TODO: Tea language, bootstrapping, etc.
