Allow compiler to inline code #3

Merged (1 commit) on May 30, 2017

Conversation

denzp (Contributor) commented May 29, 2017

With the help of #[inline], Rust will be able to inline the code into dependent crates.
This might be useful in cases where we need --emit asm.

For example, Rust code can be built for the NVPTX arch, which allows running it on a CUDA GPU, but unfortunately all code must be inlined for that to work.
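
As a rough illustration of the change (the bodies below are placeholders rather than the actual math.rs code; only the #[inline] hints matter):

#[inline]
pub fn cosf(x: f32) -> f32 {
    // thin wrapper around the f64 implementation
    cos(x as f64) as f32
}

#[inline]
pub fn cos(x: f64) -> f64 {
    // stand-in for the real polynomial approximation in math.rs
    x
}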

nagisa (Owner) commented May 29, 2017

I do not think this achieves what you want it to achieve, sadly, unless you’re calling the functions from this library directly, as opposed to calling them via intrinsics/operations.

I also would prefer not to #[inline] everything. That is, adding an #[inline] hint on cosf might make sense, as it is a very thin wrapper, but I would be fairly reluctant to #[inline] cos itself, which is a fairly large function.

Finally, I doubt the NVPTX backend will use any of these functions at all (if the proper intrinsics are called), and will generate relevant hardware instructions instead.

denzp (Contributor, Author) commented May 29, 2017

I agree that marking large functions as #[inline] is not ideal and might be confusing, but I think (or at least hope) the Rust compiler will still decide for itself where to inline the function and where to call it.

Without #[inline] we don't stand a chance of inlining with --emit asm, because as far as I understand the compiler can inline small functions from other crates only at the linking phase with LTO, and there is no linking phase for asm output.

My problem with the NVPTX backend is that it doesn't have std, so no math functions are available at all, which is critical because CUDA is used for all kinds of computations.

So here is exactly the problem: without #[inline] the generated assembly was

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

.extern .func  (.param .b64 func_retval0) exp
(
	.param .b64 exp_param_0
)
;

// ...

call.uni (retval0), exp, (param0);

which isn't valid code to run on CUDA.

And with #[inline] the code got inlined, without any .extern .func.

In general it's possible to use NVVM intrinsics from the code, but that doesn't mean we have to. Intrinsics are a faster alternative to the standard functions in CUDA.
Still, it's up to the developer to decide whether to use the CUDA runtime library or the intrinsics. So far I haven't found a way to link the runtime library, so my hope was in a native Rust implementation.

As a compromise, we could add #[inline] only when an "inline" feature is active, as sketched below.
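
Something like this (just a sketch of the idea, with a hypothetical Cargo feature named "inline" declared as [features] inline = [] in Cargo.toml; the body is a placeholder):

// The #[inline] hint is applied only when the "inline" feature is enabled.
#[cfg_attr(feature = "inline", inline)]
pub fn cosf(x: f32) -> f32 {
    x // placeholder; the real implementation stays unchanged
}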

nagisa (Owner) commented May 29, 2017

My problem with the NVPTX backend is that it doesn't have std, so no math functions are available at all, which is critical because CUDA is used for all kinds of computations.

Not exactly true. Something like

#![no_core]
#![feature(intrinsics, no_core, lang_items)]

// Minimal lang items rustc requires when building without libcore.
#[lang = "sized"] trait Sized {}
#[lang = "copy"] trait Copy {}
#[lang = "freeze"] trait Freeze {}

// cosf32 is the compiler intrinsic for the f32 cosine.
extern "rust-intrinsic" {
    fn cosf32(x: f32) -> f32;
}

pub fn foo(f: f32) -> f32 {
    unsafe {
        cosf32(f)
    }
}

will gladly compile to

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 32

	// .globl	_ZN3foo3foo17hf6aa9aa894843982E
                                        // @_ZN3foo3foo17hf6aa9aa894843982E
.visible .func  (.param .b32 func_retval0) _ZN3foo3foo17hf6aa9aa894843982E(
	.param .b32 _ZN3foo3foo17hf6aa9aa894843982E_param_0
)
{
	.reg .f32 	%f<3>;

// BB#0:                                // %start
	ld.param.f32 	%f1, [_ZN3foo3foo17hf6aa9aa894843982E_param_0];
	cos.approx.f32 	%f2, %f1;
	st.param.f32	[func_retval0+0], %f2;
	ret;
}

without any special support otherwise.


I see what the problem is here, though. Like I said, I fear the #[inline] attribute does not exactly do what you want here. It might be working now, but that's only by coincidence.

I feel that to avoid this problem you might rather want to use rustc to emit LLVM IR (--emit=llvm-ir) instead of assembly (--emit=asm) for each crate you use, then use llvm-link to combine the IR for all the crates into one module, and finally llc to compile the combined IR to PTX.
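
Roughly like this (the crate names, file names and the NVPTX target spec name below are only illustrative and depend on your setup):

# Emit LLVM IR instead of PTX for each crate.
rustc -O --emit=llvm-ir --target nvptx64-nvidia-cuda math.rs -o math.ll
rustc -O --emit=llvm-ir --target nvptx64-nvidia-cuda kernel.rs -o kernel.ll

# Merge the modules, then lower the combined IR to PTX.
llvm-link -S math.ll kernel.ll -o combined.ll
llc -mcpu=sm_20 combined.ll -o kernel.ptx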

The only problem I see with this approach is that LLVM is excessively buggy and even something like this:

target triple = "nvptx-nvidia-cuda"

define float @foo(float) unnamed_addr #0 {
start:
  %1 = tail call float @llvm.exp.f32(float %0)
  ret float %1
}

define float @expf(float) unnamed_addr #0 {
start:
  ret float 0.0
}

; Function Attrs: nounwind readnone
declare float @llvm.exp.f32(float) #1

attributes #0 = { nounwind readnone uwtable }
attributes #1 = { nounwind readnone }

will fail to compile with an error

LLVM ERROR: Cannot select: 0xcca058: i32 = ExternalSymbol'expf'
In function: foo

So with all that in mind I’m fine with accepting this PR, I guess. I’ll think about it a bit tomorrow.

denzp (Contributor, Author) commented May 30, 2017

Thanks for the explanation and for the llvm-link approach. It is pretty good, but more complicated than just inlining the code ;)

I didn't know PTX has any math instructions like trigonometry. I tried a similar example before, with --emit asm and the expf64 function from extern "rust-intrinsic", and got a similar error:

LLVM ERROR: Cannot select: t118: i64 = ExternalSymbol'exp'
In function: bilateral_filter

That's why I abandoned my attempts to go with intrinsics.

It looks like this happened because PTX only has instructions for a limited set of the math functions, compared to what the intrinsics expose.

Btw, I found that the inlined and optimised math.rs sqrt function is faster than the PTX sqrt.rn.f64 instruction, while the final computation result stays the same.

sqrt:

test filter::bilateral_cuda::tests::bench_1024 ... bench:  56,563,586 ns/iter (+/- 372,705)

sqrt.rn.f64:

test filter::bilateral_cuda::tests::bench_1024 ... bench:  60,532,764 ns/iter (+/- 636,010)

nagisa merged commit fafb15b into nagisa:master on May 30, 2017