# CUDAnative.jl

An ordinary Julia package: no changes to Julia itself are required.

In [1]:
using CUDAnative

In [2]:
function vadd(a, b, c)
    # Global 1-based index of this thread across all blocks
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

vadd (generic function with 1 method)

In [3]:
using CuArrays

In [4]:
a = CuArray([1,2,3])
b = CuArray([4,5,6])
c = zero(a)

3-element CuArray{Int64,1}:
 0
 0
 0

In [5]:
@cuda threads=length(a) vadd(a, b, c)
c

3-element CuArray{Int64,1}:
 5
 7
 9
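
The index arithmetic inside `vadd` can be checked without a GPU. The following is a minimal CPU-only sketch that emulates which global index each (block, thread) pair computes; `global_index` and `launch_indices` are hypothetical helper names introduced here for illustration and are not part of CUDAnative.

```julia
# CPU-only emulation of the index arithmetic used in `vadd` (no GPU required).
# `global_index` and `launch_indices` are hypothetical helpers, not CUDAnative API.
global_index(block, blockdim, thread) = (block - 1) * blockdim + thread

# Enumerate the index computed by every (block, thread) pair of a launch
# with `blocks` blocks of `threads` threads each.
function launch_indices(blocks, threads)
    return [global_index(b, threads, t) for b in 1:blocks for t in 1:threads]
end

# One block of three threads, as in `@cuda threads=length(a)` above.
single_block = launch_indices(1, 3)

# Two blocks of four threads: every index from 1 to 8 appears exactly once.
two_blocks = launch_indices(2, 4)
```

Because `vadd` indexes the arrays directly with `i`, a launch with more threads than elements would need a guard such as `i <= length(c)` before the store.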

In [6]:
@device_code_ptx @cuda threads=length(a) vadd(a, b, c)

//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_35
.address_size 64

.extern .func  (.param .b32 func_retval0) vprintf
(
	.param .b64 vprintf_param_0,
	.param .b64 vprintf_param_1
)
;
.func ptx_throw_boundserror
()
;
.extern .func jl_gc_queue_root
(
	.param .b64 jl_gc_queue_root_param_0
)
;
.extern .func  (.param .b64 func_retval0) jl_gc_pool_alloc
(
	.param .b64 jl_gc_pool_alloc_param_0,
	.param .b32 jl_gc_pool_alloc_param_1,
	.param .b32 jl_gc_pool_alloc_param_2
)
;
.extern .func  (.param .b64 func_retval0) jl_gc_big_alloc
(
	.param .b64 jl_gc_big_alloc_param_0,
	.param .b64 jl_gc_big_alloc_param_1
)
;
.global .align 1 .b8 __unnamed_1[41] = {69, 82, 82, 79, 82, 58, 32, 97, 32, 98, 111, 117, 110, 100, 115, 101, 114, 114, 111, 114, 32, 101, 120, 99, 101, 112, 116, 105, 111, 110, 32, 111, 99, 99, 117, 114, 114, 101, 100, 10, 0};
                                        // -- Begin function julia_vadd_37900
                                        // @julia_vadd_37900
.f

It's fast! We outperform `nvcc` on the Rodinia benchmark suite.

![CUDAnative performance](img/cudanative_perf.png)

# CuArrays.jl

Array-based abstractions of GPU computations:

In [7]:
a = CuArray(rand(2,2))
b = CuArray(rand(2,2))

2×2 CuArray{Float64,2}:
 0.560497   0.248382 
 0.0249621  0.0141561

In [8]:
a * b

2×2 CuArray{Float64,2}:
 0.478548  0.213071
 0.221448  0.098413

But we have a Julia-to-GPU compiler, which makes our abstractions **much more powerful**:

In [9]:
reduce(+, a)

1.6450283402849906

In [10]:
map((x,y) -> x*y, a, b)

2×2 CuArray{Float64,2}:
 0.47045     0.0805833
 0.00976213  0.0012765

`map` generalizes to `broadcast`, where singleton dimensions are extended so that the shapes match:

In [11]:
c = CuArray(rand(2))
broadcast((x,y) -> x*y, a, c)

2×2 CuArray{Float64,2}:
 0.51599    0.199446  
 0.0200414  0.00462107
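
How the shapes are extended is easier to see with small, known values. The sketch below uses plain CPU `Array`s instead of `CuArray`s so that it runs without a GPU; that substitution is an assumption made for illustration, relying on both array types following the same `broadcast` semantics.

```julia
# Plain CPU arrays stand in for CuArrays here (assumption: same broadcast semantics).
a = [1.0 2.0; 3.0 4.0]   # 2×2 matrix
c = [10.0, 100.0]        # 2-element vector, treated as a 2×1 column

# The column `c` is virtually repeated along the second dimension,
# so row 1 is scaled by 10 and row 2 by 100.
extended = broadcast((x, y) -> x * y, a, c)

# The same result with the repetition written out explicitly.
explicit = map(*, a, hcat(c, c))
```

Broadcasting avoids materializing the repeated column that `hcat(c, c)` allocates in the explicit version.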

Convenient shorthand syntax:

In [12]:
a .* c

2×2 CuArray{Float64,2}:
 0.51599    0.199446  
 0.0200414  0.00462107

In [13]:
@device_code_ptx a .* c

//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_35
.address_size 64

.extern .func  (.param .b32 func_retval0) vprintf
(
	.param .b64 vprintf_param_0,
	.param .b64 vprintf_param_1
)
;
.extern .func jl_gc_queue_root
(
	.param .b64 jl_gc_queue_root_param_0
)
;
.extern .func  (.param .b64 func_retval0) jl_gc_pool_alloc
(
	.param .b64 jl_gc_pool_alloc_param_0,
	.param .b32 jl_gc_pool_alloc_param_1,
	.param .b32 jl_gc_pool_alloc_param_2
)
;
.extern .func  (.param .b64 func_retval0) jl_gc_big_alloc
(
	.param .b64 jl_gc_big_alloc_param_0,
	.param .b64 jl_gc_big_alloc_param_1
)
;
.global .align 1 .b8 __unnamed_1[38] = {69, 82, 82, 79, 82, 58, 32, 97, 110, 32, 117, 110, 107, 110, 111, 119, 110, 32, 101, 120, 99, 101, 112, 116, 105, 111, 110, 32, 111, 99, 99, 117, 114, 114, 101, 100, 10, 0};
                                        // -- Begin function julia__25_38117
                                        // @julia__25_38117
.func julia__25_38117(
	.param .b64 julia__25_38117

Dotted expressions are **fused together** into a single broadcast application:

In [14]:
Meta.@lower sin.(a .* c .- c)

:($(Expr(:thunk, CodeInfo(
1 ─ %1 = (Base.getproperty)(Base.Broadcast, :materialize)
│   %2 = (Base.getproperty)(Base.Broadcast, :broadcasted)
│   %3 = (Base.getproperty)(Base.Broadcast, :broadcasted)
│   %4 = (Base.getproperty)(Base.Broadcast, :broadcasted)
│   %5 = (%4)(*, a, c)
│   %6 = (%3)(-, %5, c)
│   %7 = (%2)(sin, %6)
│   %8 = (%1)(%7)
└──      return %8
))))
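
The lowered code above nests three `broadcasted` calls and evaluates them with one `materialize`. A minimal sketch of the same steps written by hand follows; it uses plain CPU `Array`s with made-up values (an assumption, so that it runs without a GPU) and calls `Base.Broadcast.broadcasted` and `Base.Broadcast.materialize` directly.

```julia
using Base.Broadcast: broadcasted, materialize, Broadcasted

# Plain CPU arrays with illustrative values stand in for the CuArrays above.
a = [1.0 2.0; 3.0 4.0]
c = [0.5, 0.25]

# Build the same nested lazy expression that the lowered code constructs:
# %5 = broadcasted(*, a, c), %6 = broadcasted(-, %5, c), %7 = broadcasted(sin, %6)
lazy = broadcasted(sin, broadcasted(-, broadcasted(*, a, c), c))

# No element has been computed yet; `lazy` only describes the computation.
is_lazy = lazy isa Broadcasted

# `materialize` evaluates the whole expression in a single pass over the output.
fused = materialize(lazy)

# The dotted syntax lowers to exactly these calls, so the results agree.
dotted = sin.(a .* c .- c)
```

Since the intermediate results of `a .* c` and `... .- c` are never stored as arrays, the whole expression needs only one output allocation.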