Allow compiler to inline code #3

Merged (1 commit) on May 30, 2017

Conversation

denzp (Contributor) commented May 29, 2017

With the help of #[inline], Rust will be able to inline the code into dependent crates.
This might be useful in cases where we need --emit asm.

For example, Rust code can be built for the NVPTX arch, which allows running it on a CUDA GPU, but unfortunately all code must be inlined for that to work.
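
As a rough illustration of the change (the bodies below are placeholders rather than the actual math.rs code; only the #[inline] hints matter):

#[inline]
pub fn cosf(x: f32) -> f32 {
    // thin wrapper around the f64 implementation
    cos(x as f64) as f32
}

#[inline]
pub fn cos(x: f64) -> f64 {
    // stand-in for the real polynomial approximation in math.rs
    x
}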

nagisa (Owner) commented May 29, 2017

I do not think this achieves what you want it to achieve, sadly, unless you’re calling the functions from this library directly, as opposed to calling them via intrinsics/operations.

I also would prefer not to #[inline] everything. That is, adding an #[inline] hint on cosf might make sense, as it is a very thin wrapper, but I would be fairly reluctant to #[inline] cos itself, which is a fairly large function.

Finally, I doubt the NVPTX backend will use any of these functions at all (if the proper intrinsics are called), and will generate relevant hardware instructions instead.

denzp (Contributor, Author) commented May 29, 2017

I agree that marking large functions as #[inline] is not ideal and might be confusing, but I think (or at least hope) the Rust compiler will still decide for itself where to inline the function and where to call it.

Without #[inline] we don't stand a chance of inlining with --emit asm, because as far as I understand the compiler can inline small functions from other crates only at the linking phase with LTO, and there is no linking phase for asm output.

My problem with the NVPTX backend is that it doesn't have std, so no math functions are available at all, which is critical because CUDA is used for all kinds of computations.

So here is exactly the problem: without #[inline] the generated assembly was

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

.extern .func  (.param .b64 func_retval0) exp
(
	.param .b64 exp_param_0
)
;

// ...

call.uni (retval0), exp, (param0);

which isn't valid code to run on CUDA.

And with #[inline] the code got inlined, without any .extern .func.

In general it's possible to use NVVM intrinsics from the code, but that doesn't mean we have to. Intrinsics are a faster alternative to the standard functions in CUDA.
Still, it's up to the developer to decide whether to use the CUDA runtime library or the intrinsics. So far I haven't found a way to link the runtime library, so my hope was in a native Rust implementation.

As a compromise, we could add #[inline] only when an "inline" feature is active, as sketched below.
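
Something like this (just a sketch of the idea, with a hypothetical Cargo feature named "inline" declared as [features] inline = [] in Cargo.toml; the body is a placeholder):

// The #[inline] hint is applied only when the "inline" feature is enabled.
#[cfg_attr(feature = "inline", inline)]
pub fn cosf(x: f32) -> f32 {
    x // placeholder; the real implementation stays unchanged
}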

nagisa (Owner) commented May 29, 2017

My problem with the NVPTX backend is that it doesn't have std, so no math functions are available at all, which is critical because CUDA is used for all kinds of computations.

Not exactly true. Something like

#![no_core]
#![feature(intrinsics, no_core, lang_items)]

// Minimal lang items rustc requires when building without libcore.
#[lang = "sized"] trait Sized {}
#[lang = "copy"] trait Copy {}
#[lang = "freeze"] trait Freeze {}

// cosf32 is the compiler intrinsic for the f32 cosine.
extern "rust-intrinsic" {
    fn cosf32(x: f32) -> f32;
}

pub fn foo(f: f32) -> f32 {
    unsafe {
        cosf32(f)
    }
}

will gladly compile to

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 32

	// .globl	_ZN3foo3foo17hf6aa9aa894843982E
                                        // @_ZN3foo3foo17hf6aa9aa894843982E
.visible .func  (.param .b32 func_retval0) _ZN3foo3foo17hf6aa9aa894843982E(
	.param .b32 _ZN3foo3foo17hf6aa9aa894843982E_param_0
)
{
	.reg .f32 	%f<3>;

// BB#0:                                // %start
	ld.param.f32 	%f1, [_ZN3foo3foo17hf6aa9aa894843982E_param_0];
	cos.approx.f32 	%f2, %f1;
	st.param.f32	[func_retval0+0], %f2;
	ret;
}

without any special support otherwise.


I see what the problem is here, though. Like I said, I fear the #[inline] attribute does not exactly do what you want here. It might be working now, but that's only by coincidence.

I feel that to avoid this problem you might rather want to use rustc to emit LLVM IR (--emit=llvm-ir) instead of assembly (--emit=asm) for each crate you use, then use llvm-link to combine the IR for all the crates into one module, and finally llc to compile the combined IR to PTX.
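
Roughly like this (the crate names, file names and the NVPTX target spec name below are only illustrative and depend on your setup):

# Emit LLVM IR instead of PTX for each crate.
rustc -O --emit=llvm-ir --target nvptx64-nvidia-cuda math.rs -o math.ll
rustc -O --emit=llvm-ir --target nvptx64-nvidia-cuda kernel.rs -o kernel.ll

# Merge the modules, then lower the combined IR to PTX.
llvm-link -S math.ll kernel.ll -o combined.ll
llc -mcpu=sm_20 combined.ll -o kernel.ptx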

The only problem I see with this approach is that LLVM is excessively buggy and even something like this:

target triple = "nvptx-nvidia-cuda"

define float @foo(float) unnamed_addr #0 {
start:
  %1 = tail call float @llvm.exp.f32(float %0)
  ret float %1
}

define float @expf(float) unnamed_addr #0 {
start:
  ret float 0.0
}

; Function Attrs: nounwind readnone
declare float @llvm.exp.f32(float) #1

attributes #0 = { nounwind readnone uwtable }
attributes #1 = { nounwind readnone }

will fail to compile with an error

LLVM ERROR: Cannot select: 0xcca058: i32 = ExternalSymbol'expf'
In function: foo

So with all that in mind I’m fine with accepting this PR, I guess. I’ll think about it a bit tomorrow.

denzp (Contributor, Author) commented May 30, 2017

Thanks for the explanation and for the llvm-link approach. It is pretty good, but more complicated than just inlining the code ;)

I didn't know PTX has any math instructions like trigonometry. I tried a similar example before, with --emit asm and the expf64 function from extern "rust-intrinsic", and got a similar error:

LLVM ERROR: Cannot select: t118: i64 = ExternalSymbol'exp'
In function: bilateral_filter

That's why I abandoned my attempts to go with intrinsics.

It looks like this happened because PTX only has instructions for a limited set of the math functions, compared to what the intrinsics expose.

Btw, I found that the inlined and optimised math.rs sqrt function is faster than the PTX sqrt.rn.f64 instruction, while the final computation result stays the same.

sqrt:

test filter::bilateral_cuda::tests::bench_1024 ... bench:  56,563,586 ns/iter (+/- 372,705)

sqrt.rn.f64:

test filter::bilateral_cuda::tests::bench_1024 ... bench:  60,532,764 ns/iter (+/- 636,010)

nagisa merged commit fafb15b into nagisa:master on May 30, 2017