Un-inline float intrinsics into runtime. #694

Merged

Conversation

@bollu (Contributor) commented Sep 28, 2021:

This patch extracts the floating-point functions in the core library that are marked @[extern c inline ...] into runtime functions. This aids the development of new backends, such as an LLVM backend, since the new backend can then link against the corresponding runtime function instead of having to parse and code-generate arbitrary C expressions.
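A minimal sketch of the transformation, using Float.div as an example (the declarations here are simplified; the exact signatures in src/Init differ):

-- Before: the backend must parse and expand an arbitrary C expression.
@[extern c inline "#1 / #2"]
constant Float.div : Float → Float → Float

-- After: the backend merely emits a call to (or links against) the named
-- runtime function lean_float_div, whose C body is `return a / b;`.
@[extern "lean_float_div"]
constant Float.div : Float → Float → Float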

If this PR is of interest, I shall send a sequence of PRs that applies the same transformation to the rest of the src/Init/* library.

@leodemoura (Member):

@bollu What about https://github.com/leanprover/lean4/blob/master/src/Init/Data/UInt.lean? We have a bunch of @[extern c inline ...] declarations there too. Are they creating problems for you as well?

@bollu (Contributor, Author) commented Sep 28, 2021:

@leodemoura Yes, they create problems as well. I'd be happy to modify this patch to replace all of these in one shot, or send separate patches for each component with @[extern c inline...] declarations.

@tydeu (Member) commented Sep 28, 2021:

I was also stumbling into this issue while coding an LLVM emitter for Lean. I don't think extracting them into runtime functions is a good idea. For instance, in LLVM, linking to an external function and performing a call instead of just using an fadd instruction introduces overhead and can make many desirable optimization passes fail.

Instead, if possible, I would suggest replacing the @[extern c inline <code>] attribute on these functions with a new @[intrinsic <name>] attribute (e.g., @[intrinsic uint64_add] or @[intrinsic id]) that the emitters could match on to produce the relevant code.

Alternatively, one way to resolve this problem without modifying Lean or parsing C expressions is to special-case these definitions by matching on the name itself (e.g., UInt64.add). The custom emitter can then simply ignore the attribute and generate the desired code for the definition.
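A minimal sketch of that second option, as a hypothetical emitter hook (emitLn, argToCString, FunId, Arg, and M are the EmitC.lean-style helpers that appear in the snippets below; this is a sketch, not actual Lean source):

-- Hypothetical: dispatch on the declaration's name, ignoring the extern attribute.
def emitSpecialCased (f : FunId) (ys : Array Arg) : M Unit := do
  if f == `UInt64.add then
    emitLn s!"{argToCString ys[0]} + {argToCString ys[1]};"
  else
    throw s!"no special case for '{f}'"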

@bollu (Contributor, Author) commented Sep 29, 2021:

I feel that floating the @[extern c inline <code>] definitions into the runtime is the simplest short-term solution. While it leads to a performance penalty, it still enables one to bring up new backends. Furthermore, if one is very performance conscious, one can build the runtime as an LLVM IR (.ll) file and link it in (either with the llvm-link command-line tool or via the JIT API) before running optimization passes. This pulls in all the definitions from the runtime, allowing LLVM to inline the runtime calls. (This is the approach I use.)

In the medium to long term, I quite like the intrinsics solution, since it means that these intrinsics will be documented, which makes it easy to judge what a new backend needs to support to get the Lean runtime up and running.

I am not too fond of special-casing on the names, since it feels like working with a contract written in magic string constants with no stability guarantees. We also lose the nice benefit of exhaustiveness checking for pattern matching on intrinsics.

@tobiasgrosser (Contributor):

I am sympathetic to avoiding 'inline' definitions and instead just calling the runtime library. I see two benefits:

  1. The C++ backend should not see a performance penalty as all these calls will be inlined by default.
  2. Linking with the runtime library is always sufficient to get a working backend which simplifies bootstrapping new backends and maintenance of existing ones.

AFAIU GHC has hundreds of primops, which makes writing a new backend rather painful.

I like the fact that backend development could be incremental: first, we just always use the Lean runtime; second, we inline the definitions from the Lean runtime for performance (or emit the corresponding IR directly); then, we optimize the most performance-critical functions by emitting custom IR for them that outperforms the IR clang emits.

@DanielFabian (Contributor) commented Sep 29, 2021:

We had loads of conversations about doing intrinsics. Long-term it would be desirable to express the intrinsics directly in Lean somehow, but that entails defining a whole new C-like language in which to write them, and that is not easy.

You'd have to define a language that's sufficiently C-like that we can emit it both through the C emitter and through the LLVM backend.

The @[intrinsic <name>] option may seem like a nice way of doing it, but it leads to having to re-implement each intrinsic for each backend, and the chances that we'd introduce bugs would be extremely high. It would also make the burden of introducing a new backend substantially higher. (And you can't readily fix bugs in intrinsics.)

The low-level embedded DSL approach is more desirable as a long-term solution.

@tobiasgrosser (Contributor) commented Sep 29, 2021:

Right, long-term an embedded DSL would be great. This reminds me very much of Terra+Lua (https://terralang.org/). AFAIU Siddharth had an idea to build something similar, but this seems orthogonal to this bug report.

@DanielFabian (Contributor):

Orthogonal, yes. I just wanted to be transparent about the thinking we have done and what its outcome was.

@tydeu (Member) commented Sep 29, 2021:

@DanielFabian

> The @[intrinsic <name>] option may seem like a nice way of doing it, but it leads to having to re-implement each intrinsic for each backend.

I fail to see how it would be possible not to re-implement the intrinsics in each (completely new) backend. Most of the current intrinsics (i.e., @[extern c inline] expressions) are primitive operations like casts and arithmetic on specific hardware types. Virtually every backend will implement those primitive operations differently, likely through whatever that backend's own primitive operations are. A DSL would just make the notation prettier; it wouldn't prevent the backend from still having to implement said operations.

An exception would be cases like @tobiasgrosser's, where the idea is only to incrementally modify Lean's current backend; there, one may want to fall back on the existing implementation of an intrinsic. But that would be just as possible with the @[intrinsic <name>] route: an incremental backend could special-case a few of the intrinsics and fall back on Lean's implementation for the others.

> And the chances that we'd introduce bugs would be extremely high. It would also make the burden of introducing a new backend substantially higher. (And you can't readily fix bugs in intrinsics.)

How would the intrinsic attribute make bugs likelier or harder to fix? The only difference I see is that, instead of specifying the relevant C code at the declaration, it is specified in the emitter.

For example, the current definition of @[extern c inline "#1 + #2"] def UInt64.add is implemented in the emitter like so:

def emitExternCall (f : FunId) (ps : Array Param) (extData : ExternAttrData) (ys : Array Arg) : M Unit :=
  match getExternEntryFor extData `c with
  | some (ExternEntry.inline _ pat) => do emit (expandExternPattern pat (toStringArgs ys)); emitLn ";"
  | _ => ... -- remaining cases elided

where expandExternPattern replaces the #n tags with the nth argument. Thus, for UInt64.add, it reduces to:

emitLn s!"{argToCString ys[0]} + {argToCString ys[1]};"

With the intrinsic attribute, one would instead have @[intrinsic uint64_add] def UInt64.add and emitter code like the following:

def emitExternCall (f : FunId) (ps : Array Param) (extData : ExternAttrData) (ys : Array Arg) : M Unit :=
  match getExternEntryFor extData `c with
  | some (ExternEntry.intrinsic `uint64_add) => emitLn s!"{argToCString ys[0]} + {argToCString ys[1]};"
  | _ => ... -- remaining cases elided

which is essentially equivalent. If anything, I would think it would be easier to debug, since it would no longer be necessary to support arbitrary C injections (which can introduce all kinds of interesting bugs).

@tobiasgrosser (Contributor):

@tydeu thank you for your comment. I am curious to understand the difference between your intrinsic ids and the current extern c references to the Lean runtime. As far as I understand, both assign a string that refers to some external functionality. The difference seems to be that extern c ships a default implementation of this functionality in the Lean runtime, while the intrinsic annotation requires a backend to emit the relevant definitions inline.

Are there situations and reasons where we would prefer not to have a default implementation available? I guess your argument is that some functionality would be prohibitively slow as an external call, so one should require a backend to implement it natively? Is this understanding correct?

@tydeu (Member) commented Sep 29, 2021:

@tobiasgrosser in my view, there are a few reasons:

  • The @[extern <symbol>] tag implies that the def in question is implemented by invoking an external symbol. For a backend to special-case on this and output different code based on what said symbol is feels deceptive to me. Furthermore, it encourages backends to special-case different externs based on their needs, which makes programs harder to reason about from the front end -- i.e., what exactly is in the trusted codebase is no longer deducible from the front end. The @[intrinsic] tag, on the other hand, makes it clear that the code in question is going to be determined by the backend being used.
  • Along the lines of @DanielFabian's DSL suggestion, the @[intrinsic] tag could be more flexible than a simple name. Some possible variants (sketched in code after this list):
    • No-argument intrinsics, e.g., @[intrinsic] def UInt64.add, where the matching occurs on the name of the def instead.
    • Parameterized intrinsics, e.g., @[intrinsic cast i64 i32] def UInt64.toUInt32 or @[intrinsic add i64] def UInt64.add, where a class of intrinsics takes arguments, making it easier for the backend to abstract an emitter across classes of intrinsics.
    • Intelligent intrinsics, e.g., @[intrinsic tag] def Bool.tag or the erroneous @[intrinsic tag] def UInt32.tag, where the backend (and/or possibly the elaborator) smartly emits code, or fails to emit code, based on the kind of intrinsic.
  • Finally, what you mentioned is also a factor. Most intrinsic operations just don't make much conceptual sense as external functions (they are simple primitive operations which are prohibitively slow unless the optimizer undoes the external call for you). As such, I feel it makes sense to distinguish the two rather than conflate them. I also feel that just because an optimizer can make inefficient code efficient doesn't mean one should produce inefficient code when there is a viable way not to.
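To make the variants concrete, here is roughly how such declarations might look (purely hypothetical syntax -- no @[intrinsic] attribute exists in Lean today, the Bool.tag/UInt32.tag signatures are guesses, and the two spellings for UInt64.add are alternatives, not a compilable file):

@[intrinsic] constant UInt64.add : UInt64 → UInt64 → UInt64           -- no-argument: match on the name
@[intrinsic add i64] constant UInt64.add : UInt64 → UInt64 → UInt64   -- parameterized: class add, type i64
@[intrinsic cast i64 i32] constant UInt64.toUInt32 : UInt64 → UInt32  -- parameterized cast
@[intrinsic tag] constant Bool.tag : Bool → UInt8                     -- "intelligent": fine for an enum-like type
@[intrinsic tag] constant UInt32.tag : UInt32 → UInt8                 -- rejected: a UInt32 carries no tag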

@DanielFabian (Contributor):

@tydeu the kinds of intrinsics the DSL is mostly meant for are reference counting, memory layout, unboxing, etc.

Not the one-line ones, but rather the 10-20 line ones which are inexpressible in Lean.

@tydeu (Member) commented Sep 30, 2021:

@DanielFabian I was just using you as inspiration. I didn't mean to imply that that was what you meant by the DSL. XD

@DanielFabian (Contributor):

@tydeu my comment was responding to how the DSL can help with re-implementing intrinsics for all backends. The non-trivial intrinsics, like ref counting or memory layout, do change, albeit sparingly; I certainly remember a fix in thunks not too long back.

If they are implemented in the DSL, they'll make it to all backends. If they are implemented in the backend itself, each backend will have the bug when we find one, or will have to make similar changes when we want to change the memory layout. Also, keeping 3-4 code bases in sync is harder than maintaining 1.

E.g., adding support for structs (value types, unboxed types, however you want to call them) is something we are considering a bit further out. This will almost certainly entail changing the memory layout, and then the DSL would help compared with intrinsics implemented in each backend.

@tydeu (Member) commented Sep 30, 2021:

@DanielFabian

> The non-trivial intrinsics, like ref counting or memory layout, do change, albeit sparingly.

By 'backend', are we talking about the Lean C++ implementation, the IR, or the C emitter? I was focusing on alternate emitters, which I don't think would be heavily affected by such changes.

@DanielFabian (Contributor) commented Sep 30, 2021:

I did mean the emitter, yes, because the DSL would be a new low-level IR that sits below Lean's usual IR. But it would be simple: it would maybe know loops (I'm not even sure of that), it would know pointers, and maybe 2-3 more concepts.

C and LLVM would then have to emit code for the low-level IR. OTOH, you'd be able to write intrinsics in Lean (using the DSL).
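To illustrate the shape, a purely hypothetical sketch of such a minimal low-level IR as Lean inductives (nothing like this exists in Lean today):

-- A hypothetical low-level IR with just a handful of concepts:
-- locals, pointers, calls, ref-count operations, and (maybe) loops.
inductive LowExpr where
  | var  (name : String)                      -- a local
  | load (ptr : LowExpr)                      -- pointer dereference
  | call (fn : String) (args : List LowExpr)  -- runtime/helper call

inductive LowStmt where
  | store  (ptr val : LowExpr)                -- pointer write
  | incRef (obj : LowExpr)                    -- reference-count increment
  | decRef (obj : LowExpr)                    -- reference-count decrement
  | loop   (cond : LowExpr) (body : List LowStmt)

Each backend would then only need to lower these few constructors, while the intrinsics themselves would be written once, in Lean, against this IR.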

@DanielFabian (Contributor):

> @DanielFabian
>
> > The non-trivial intrinsics, like ref counting or memory layout, do change, albeit sparingly.
>
> By 'backend', are we talking about the Lean C++ implementation, the IR, or the C emitter? I was focusing on alternate emitters, which I don't think would be heavily affected by such changes.

Yes, they would, because e.g. ref counting has to be written in C++ right now, since it's inexpressible in Lean.

@leodemoura (Member):

Regarding different emitters, we already have support for them in the extern attribute. One can already write:

import Lean

@[extern c inline "#1 + #2 * 2" llvm "llvm_foo" javascript "jfoo"]
constant foo (x y : Nat) : Nat

open Lean
def test : MetaM Unit := do
  IO.println (getExternNameFor (← getEnv) `llvm `foo)

#eval test
-- some llvm_foo

There are other APIs for extracting the extern information. This is just a simple one.

As far as I understand, @bollu is developing support for LLVM without writing a new emitter. That is, he is reusing the C emitter. @bollu Could you confirm?

@bollu (Contributor, Author) commented Sep 30, 2021:

@leodemoura I'm writing a new EmitLLVM.lean / EmitMLIR.lean, based on the current EmitC.lean.

I was unaware that one can declare custom externs of the form: @[extern c inline "c_foo" llvm "llvm_foo"]. Is the suggestion that we add LLVM annotations to the functions which I wish to extract out into the runtime?

My feeling is that the runtime intrinsics ought not to depend on the @[extern llvm ...] feature, as a C expression can expand into multiple LLVM/WASM instructions. For example, the above #1 + #2 * 2 corresponds to two LLVM instructions: %prod = mul i32 #2, 2 followed by %out = add i32 #1, %prod. This leads to fresh-name problems (%prod ought to be a fresh name, not a fixed name prod).

These problems are avoided if we have a first-class concept of an intrinsic (as @tydeu suggests) which the backends EmitC/EmitLLVM/EmitMLIR/EmitWASM/... know how to code generate.

@leodemoura (Member):

> Is the suggestion that we add LLVM annotations to the functions which I wish to extract out into the runtime?

If you are building a new emitter, this is the preferred way.

> For example, the above #1 + #2 * 2 corresponds to two LLVM instructions: %prod = mul i32 #2, 2 followed by %out = add i32 #1, %prod. This leads to fresh-name problems (%prod ought to be a fresh name, not a fixed name prod).

I don't think this is a problem. The annotation is just a string. The emitter can do whatever it wants with it, even creating fresh names.

> These problems are avoided if we have a first-class concept of an intrinsic (as @tydeu suggests) which the backends EmitC/EmitLLVM/EmitMLIR/EmitWASM/... know how to code generate.

I am sorry, I don't have cycles to discuss new features (e.g., intrinsics). At first sight, it looks unnecessary. One can simulate it right now using:

@[extern "f_impl"]
constant f : Nat → Nat

The f_impl is treated as the implementation of f by all emitters. Each emitter can do whatever it wants with this information.
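For instance, reusing getExternNameFor from the snippet above, every backend sees the same symbol (a sketch; the printed output is what I would expect, not a verified transcript):

open Lean in
def checkF : MetaM Unit := do
  -- The backend name is irrelevant here: a default @[extern "f_impl"] entry applies to all emitters.
  IO.println (getExternNameFor (← getEnv) `llvm `f)
  IO.println (getExternNameFor (← getEnv) `c `f)

#eval checkF
-- some f_impl
-- some f_impl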

I wish we had more cycles to explore all these exciting possibilities, but right now our top priority is the Mathlib port, bugs, and missing features.

@cpehle (Contributor) commented Oct 3, 2021:

@DanielFabian

> > > The non-trivial intrinsics, like ref counting or memory layout, do change, albeit sparingly.
> >
> > By 'backend', are we talking about the Lean C++ implementation, the IR, or the C emitter? I was focusing on alternate emitters, which I don't think would be heavily affected by such changes.
>
> Yes, they would, because e.g. ref counting has to be written in C++ right now, since it's inexpressible in Lean.

It might make sense to implement those as a "lean_rt" MLIR dialect; that way the lowering can be implemented relatively easily in MLIR, there is a clear separation of concerns, and, in particular, one doesn't have to invent a whole new C-like intermediate language.

@bollu (Contributor, Author) commented Oct 4, 2021:

@cpehle The MLIR dialect defined at https://github.com/bollu/lz/blob/e9c5c0e65975637cb3b3d675afedf69b4a7b2b60/lib/Hask/HaskDialect.cpp#L45-L73 may be of interest. The file name is a misnomer: the dialect combines lambdapure primitives with primitives for laziness as an experiment. In particular, refcounting is represented by IncOp and DecOp, which are later lowered to their respective Lean runtime calls.

@leodemoura (Member):

@bollu Sorry for the delay. Thanks again for the PR. @Kha and I discussed it today. We want to propose the following:

  • Remove c from the extern declarations; it is not needed.
  • Include UInt inline functions in the PR, as discussed above.

Then we will run !bench here and check whether there was any performance impact (we are not expecting any), and merge if everything looks good.
Does it sound reasonable to you?

@bollu (Contributor, Author) commented Oct 6, 2021:

Yes, this works perfectly. I imagine we'll do the same for extern c declarations in Float.lean as well?

A review thread on the diff:

@@ -1757,6 +1762,7 @@ static inline uint64_t lean_name_hash(b_lean_obj_arg n) {
static inline double lean_float_div(double a, double b) { return a / b; }
static inline double lean_float_negate(double a) { return -a; }


A reviewer commented:

Spurious whitespace change?

@Kha (Member) commented Oct 13, 2021:

You shouldn't need to update stage 0 here, I think

Commit: "This is necessary to support backends such as LLVM which do not emit C."
@bollu force-pushed the 2021-sep-27-RTS-extract-functions-float branch from bd4f624 to 13a8412 on October 13, 2021.
@bollu (Contributor, Author) commented Oct 13, 2021:

Right. I've rebased and fixed the spurious change to stage0.

@Kha (Member) commented Oct 13, 2021:

!bench

@leanprover-bot (Collaborator):

Here are the benchmark results for commit 13a8412.

  Benchmark   Metric              Change
  ================================================
+ stdlib      C code generation      -2% (-27.8 σ)

(Outdated review threads on src/Init/Data/UInt.lean and src/Init/Prelude.lean were resolved.)
@leodemoura merged commit da4ad46 into leanprover:master on Oct 18, 2021.
Anderssorby pushed a commit to lurk-lab/lean4 that referenced this pull request on Nov 10, 2021:
* outline all intrinsics into the runtime.

This is necessary to support backends such as LLVM which do not emit C.

* fix style