Freed reference problem when combining cuTENSOR and Zygote #169

Closed
hxjz233 opened this issue Apr 20, 2024 · 8 comments · Fixed by #170

@hxjz233

hxjz233 commented Apr 20, 2024

I tested my TensorOperations + Zygote code, which runs fine on the CPU, a bit further on the GPU, and I am now running into a new problem: the program throws either a freed reference or a non-bitstype exception. In most of the cases demonstrated here, @tensor passes while @cutensor gives me some error. Below is an example of the problem. I also explain briefly why I used a somewhat odd scalar function; it is not essential to the issue, but feel free to comment on those scalar functions if their behavior intrigues you.

Example code
using LinearAlgebra, TensorOperations
using ChainRulesCore, Zygote
using CUDA, cuTENSOR

function free_ref_or_nonbits(A; use_complex=false, normalize=false)
    if use_complex
        A = ComplexF64.(A)
    end
    B = ones(Float64, 2, 2, 2)
    C = ones(Float64, 2, 2, 2)
    if isa(A, Array)
        @tensor D[5,1] := (B[1,2,3] * A[4,2]) * C[3,4,5]
    else
        @cutensor D[5,1] := (B[1,2,3] * A[4,2]) * C[3,4,5]
    end
    println(D)
    if normalize
        normcoef = maximum(abs.(D))
        D = D / normcoef
    end
    println(maximum(abs.(D)))
    return maximum(abs.(D))
end

function scalar_func(A; use_sum=0)   # a possibly related case
    B = ones(Float64, 2, 2)
    if isa(A, Array)
        @tensor D[1,3] := B[1,2] * A[2,3]           # passes
    else
        @cutensor D[1,3] := B[1,2] * A[2,3]         # passes
    end
    println(D)
    if use_sum == 0
        println(maximum(abs.(D)))
        return maximum(abs.(D))
    elseif use_sum == 1
        println(sum(D))
        return sum(D)
    else 
        println(reduce(+, D))
        return reduce(+, D)
    end
end

function AD()
    ######################
    ##  The two flags of `free_ref_or_nonbits()` demonstrate different exception behaviors,
    ##  and `use_sum` in `scalar_func()` explains why I used the cumbersome `maximum(abs.(...))` to return a scalar.
    ##
    ##  Using maximum(abs) to return the target scalar (as in free_ref_or_nonbits()):
    ##      if !(use_complex && normalize): @tensor passes, @cutensor throws a freed reference error
    ##      if use_complex && normalize:    @tensor passes, @cutensor throws
    ##          "KernelError: passing and using non-bitstype argument" at the line `normcoef = maximum(abs.(D))`
    ##
    ##  On the scalar_func side:
    ##      with maximum(abs) (use_sum=0): both @tensor and @cutensor pass
    ##      with sum()        (use_sum=1): @tensor throws
    ##          MethodError: no method matching StridedViews.StridedView(::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}),
    ##          while @cutensor passes
    ##      with reduce()     (use_sum=2): @tensor passes,
    ##          while @cutensor throws "Zygote: try/catch is not supported" at the line `println(reduce(+, D))`
    ######################
    f(x) = free_ref_or_nonbits(x; use_complex=true, normalize=true)   # test by swapping in one of the functions above
    g(x) = gradient(f, x)[1]
    initval = Matrix{Float64}([1 0; 0 0])
    println("gradient: $(g(initval))")              
    initval = CuArray(Matrix{Float64}([1 0; 0 0]))  
    println("gradient: $(g(initval))")              
    return nothing
end

AD()
Version Info
Status `D:\Julia\depot\environments\v1.9\Project.toml`
⌅ [052768ef] CUDA v5.1.2
  [d360d2e6] ChainRulesCore v1.23.0
  [4db3bf67] StridedViews v0.2.2
  [6aa20fa7] TensorOperations v4.1.1
  [409d34a3] VectorInterface v0.4.5
  [cd998857] Yota v0.8.5
  [e88e6eb3] Zygote v0.6.69
⌃ [011b41b2] cuTENSOR v1.2.1

I am looking for help on what can be done to resolve the freed reference and non-bitstype issues.

@lkdvos
Collaborator

lkdvos commented Apr 20, 2024

Hi hxjz233,

I looked into this for a bit, firstly:
The MethodError: no method matching StridedViews.StridedView(::FillArrays...) is an incredibly annoying feature of Zygote, which generates these kinds of arrays when differentiating through sum. We currently do not have an implementation of the TensorOperations primitives that supports FillArrays, which is what makes that call fail (see the discussions here, here and this PR in Zygote).
I do agree that we should probably fix that, but I haven't yet found the time to write fallback TensorOperations methods for generic arrays.
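
In the meantime, a possible user-side workaround (just a sketch, assuming the only obstacle is the Fill cotangent produced by Zygote's pullback for sum; densesum is a made-up name) is to route the reduction through a wrapper whose pullback materializes a dense cotangent, e.g. calling densesum(D) instead of sum(D) in scalar_func:

using Zygote

# Hypothetical helper, not part of TensorOperations or Zygote: same value as `sum`,
# but its pullback seeds a dense Array cotangent instead of a FillArrays.Fill, so
# the @tensor pullback never has to handle a Fill.
densesum(x) = sum(x)
Zygote.@adjoint densesum(x) = sum(x), Δ -> (fill(Δ, size(x)),)

# usage sketch: gradient(A -> densesum(2 .* A), rand(2, 2))[1]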

The freed reference is actually interesting, and it's something I overlooked. We try to offload some of the GPU memory pressure by inserting CUDA.unsafe_free! on the temporary tensors generated during the @(cu)tensor forward pass. We cannot do that in reverse-mode AD, since those objects are then of course no longer temporary: the pullback may still need them. I'll try to get a fix going, and update you when I figure it out!
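
To illustrate the mechanism, here is a hand-written sketch (not the actual TensorOperations internals; contract_and_free is a made-up name):

using CUDA

# Hand-written sketch, not the actual TensorOperations code: eagerly freeing a
# temporary is safe in a plain forward evaluation because nothing refers to it afterwards.
function contract_and_free(A, B)
    tmp = A * B                # intermediate of the contraction
    out = sum(abs2, tmp)       # the value we actually keep
    CUDA.unsafe_free!(tmp)     # reduces GPU memory pressure
    return out
end

# Under reverse-mode AD, however, the pullback of `sum(abs2, tmp)` still needs `tmp`,
# so freeing it in the forward pass shows up as "freed reference" errors when the
# gradient is evaluated.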

@lkdvos
Collaborator

lkdvos commented Apr 21, 2024

Ok, I think that, at the very least, that first PR should fix the freed reference errors. Do you mind testing it out and letting me know whether it resolves the issues?

@hxjz233
Author

hxjz233 commented Apr 21, 2024

Thanks for the quick fix! It works quite well for the freed reference problem (and it finally makes my whole CUDA + TensorOperations + Zygote project run, when using real numbers).
The non-bitstype problem with use_complex=true, normalize=true persists. Let's keep in touch on that!

@hxjz233
Author

hxjz233 commented May 29, 2024

Hi there, it's been a while, and the earlier fix for unsafe_free has proven very useful! But I am still confused by the non-bitstype problem. Is there a way to fix it, or at least to understand it? BTW, I am fine with the current sum implementation.

To reproduce, run the example code above with use_complex=true, normalize=true, for instance with the call sketched below. On my side I see
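
Minimal call that triggers it (assumes the definitions and `using` statements from the example code above):

f(x) = free_ref_or_nonbits(x; use_complex=true, normalize=true)
gradient(f, CuArray(Matrix{Float64}([1 0; 0 0])))[1]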

Error message
ERROR: LoadError: GPU compilation of MethodInstance for (::GPUArrays.var"#broadcast_kernel#38")(::CUDA.CuKernelContext, ::CuDeviceMatrix{ComplexF64, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1412#1416"{1, Int64}, Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float64, 2}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, ::Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1412#1416"{1, Int64}, Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float64, 2}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float64, 2}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}} which is not isbits.
      .x is of type Matrix{Float64} which is not isbits.


Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob)
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\validation.jl:92
  [2] macro expansion
    @ D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:123 [inlined]
  [3] macro expansion
    @ D:\Julia\depot\packages\TimerOutputs\RsWnF\src\TimerOutput.jl:253 [inlined]
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:121
  [5] codegen
    @ D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:110 [inlined]
  [6] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:106
  [7] compile
    @ D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:98 [inlined]
  [8] #1075
    @ D:\Julia\depot\packages\CUDA\rXson\src\compiler\compilation.jl:247 [inlined]
  [9] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\driver.jl:47
 [10] compile(job::GPUCompiler.CompilerJob)
    @ CUDA D:\Julia\depot\packages\CUDA\rXson\src\compiler\compilation.jl:246
 [11] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\execution.jl:125
 [12] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler D:\Julia\depot\packages\GPUCompiler\U36Ed\src\execution.jl:103
 [13] macro expansion
    @ D:\Julia\depot\packages\CUDA\rXson\src\compiler\execution.jl:359 [inlined]
 [14] macro expansion
    @ .\lock.jl:267 [inlined]
 [15] cufunction(f::GPUArrays.var"#broadcast_kernel#38", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{ComplexF64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1412#1416"{1, Int64}, Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float64, 2}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA D:\Julia\depot\packages\CUDA\rXson\src\compiler\execution.jl:354
 [16] cufunction
    @ D:\Julia\depot\packages\CUDA\rXson\src\compiler\execution.jl:351 [inlined]
 [17] macro expansion
    @ D:\Julia\depot\packages\CUDA\rXson\src\compiler\execution.jl:104 [inlined]
 [18] #launch_heuristic#1118
    @ D:\Julia\depot\packages\CUDA\rXson\src\gpuarrays.jl:17 [inlined]
 [19] launch_heuristic
    @ D:\Julia\depot\packages\CUDA\rXson\src\gpuarrays.jl:15 [inlined]
 [20] _copyto!
    @ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\broadcast.jl:70 [inlined]
 [21] copyto!
    @ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\broadcast.jl:51 [inlined]
 [22] copy
    @ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\broadcast.jl:42 [inlined]
 [23] materialize
    @ .\broadcast.jl:873 [inlined]
 [24] broadcast(::Zygote.var"#1412#1416"{1, Int64}, ::Matrix{Float64}, ::CuArray{ForwardDiff.Dual{Nothing, Float64, 2}, 2, CUDA.Mem.DeviceBuffer})
    @ Base.Broadcast .\broadcast.jl:811
 [25] #1411
    @ D:\Julia\depot\packages\Zygote\jxHJc\src\lib\broadcast.jl:325 [inlined]
 [26] ntuple
    @ .\ntuple.jl:48 [inlined]
 [27] bc_fwd_back
    @ D:\Julia\depot\packages\Zygote\jxHJc\src\lib\broadcast.jl:324 [inlined]
 [28] #4163#back
    @ D:\Julia\depot\packages\ZygoteRules\M4xmc\src\adjoint.jl:72 [inlined]
 [29] #291
    @ D:\Julia\depot\packages\Zygote\jxHJc\src\lib\lib.jl:206 [inlined]
 [30] #2173#back
    @ D:\Julia\depot\packages\ZygoteRules\M4xmc\src\adjoint.jl:72 [inlined]
 [31] Pullback
    @ .\broadcast.jl:1311 [inlined]
 [32] (::Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcasted), typeof(abs), CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcastable), CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}}, Tuple{}}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.var"#2849#back#673"{Zygote.var"#map_back#667"{typeof(Base.Broadcast.broadcastable), 1, Tuple{Tuple{}}, Tuple{Val{0}}, Tuple{}}}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing, Nothing, Nothing}, Tuple{}}, Zygote.var"#4163#back#1426"{Zygote.var"#bc_fwd_back#1414"{1, CuArray{ForwardDiff.Dual{Nothing, Float64, 2}, 2, CUDA.Mem.DeviceBuffer}, Tuple{CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}}, Val{1}}}}}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing}, Tuple{}}, Zygote.var"#combine_styles_pullback#1168"{Tuple{Nothing, Nothing}}}}}})(Δ::Matrix{Float64})
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
 [33] Pullback
    @ D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:22 [inlined]
 [34] (::Zygote.Pullback{Tuple{var"##free_ref_or_nonbits#3", Bool, Bool, typeof(free_ref_or_nonbits), CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Any})(Δ::Float64)     
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
 [35] Pullback
    @ D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:5 [inlined]
 [36] (::Zygote.Pullback{Tuple{typeof(Core.kwcall), NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}, typeof(free_ref_or_nonbits), CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Any})(Δ::Float64)
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
 [37] Pullback
    @ D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:67 [inlined]
 [38] (::Zygote.Pullback{Tuple{var"#f#5", CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{typeof(Core.kwcall), NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}, typeof(free_ref_or_nonbits), CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Any}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.Pullback{Tuple{Type{NamedTuple{(:use_complex, :normalize)}}, Tuple{Bool, Bool}}, Tuple{Zygote.Pullback{Tuple{Type{NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}}, Tuple{Bool, Bool}}, Tuple{Zygote.var"#2224#back#315"{Zygote.Jnew{NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}, Nothing, true}}}}}}}})(Δ::Float64)
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
 [39] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{var"#f#5", CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{typeof(Core.kwcall), NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}, typeof(free_ref_or_nonbits), CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Any}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.Pullback{Tuple{Type{NamedTuple{(:use_complex, :normalize)}}, Tuple{Bool, Bool}}, Tuple{Zygote.Pullback{Tuple{Type{NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}}, Tuple{Bool, Bool}}, Tuple{Zygote.var"#2224#back#315"{Zygote.Jnew{NamedTuple{(:use_complex, :normalize), Tuple{Bool, Bool}}, Nothing, true}}}}}}}}})(Δ::Float64)
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:91
 [40] gradient(f::Function, args::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:148
 [41] g
    @ D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:68 [inlined]
 [42] AD()
    @ Main D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:72
 [43] top-level scope
    @ D:\MagBEC\juliatest\t_adjulia\cuTensorAD2.jl:76

@hxjz233
Author

hxjz233 commented Jun 3, 2024

I guess the original issue reporter cannot reopen an issue once it has been closed by the contributors. Please feel free to let me know whether the problem is reproducible on your side, and reopen the issue if you feel it is necessary. Many thanks! @lkdvos

@lkdvos lkdvos reopened this Jun 3, 2024
@lkdvos
Collaborator

lkdvos commented Jun 3, 2024

Hi @hxjz233, my apologies, it seems this previous message went under my radar. I'll try to find some time to investigate this week; in the meantime, I have re-opened the issue so it's less likely that I'll miss it 😉 Should I forget, definitely don't hesitate to ping me once more!

@lkdvos
Collaborator

lkdvos commented Jun 3, 2024

I started playing around with this a bit, and I actually don't know what is going on here, but I think it's worth opening an issue with either CUDA or Zygote, as the issue is not actually TensorOperations-related. The following code produces the same error:

using CUDA, Zygote

A = CUDA.rand(ComplexF64, 2, 2)
function f(A)
    normcoeff = maximum(abs.(A))
    A = A / normcoeff
    return maximum(abs.(A))
end
gradient(f, A)[1]

I don't know enough about the inner workings of either Zygote or CUDA to know where to point you, but I hope they will be able to help you further.
PS: I also tried norm(A, Inf), hoping that it would work better, but my preliminary testing yielded the same result.

@hxjz233
Author

hxjz233 commented Jun 4, 2024

Thanks for testing it out! This is probably a CUDA or Zygote issue.

My purpose was to write a line that puts a soft ceiling on a tensor that gets updated iteratively (A here). Since I have just found that norm(A, 2) is far faster than this clumsy maximum(abs.(A)) normalization before taking the gradient (and also faster than norm(A, Inf) and maximum(abs, A), which at least saves the memory access for the intermediate abs.(A)), I think I will stick with that solution for now.

It is a bit surprising for a beginner like me that finding the maximum element is not computationally cheaper than computing the squared norm, on both CPU and GPU. But anyway, many thanks for the inspiration to try out other normalizations!
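
For reference, a minimal sketch of what I ended up doing (illustrative names; it mirrors the reproducer above, so the returned value is trivially 1, but the point is that it differentiates without the non-bitstype error):

using LinearAlgebra, CUDA, Zygote

# Sketch of the workaround: normalize by norm(A, 2) instead of maximum(abs.(A)).
function f_norm(A)
    normcoeff = norm(A)    # 2-norm instead of maximum(abs.(A))
    A = A / normcoeff
    return norm(A)         # trivially 1 here; the gradient is ~0, but no kernel error
end

A = CUDA.rand(ComplexF64, 2, 2)
gradient(f_norm, A)[1]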

Closing this issue, since it is indeed not TensorOperations-related. If you feel there is anything about those normalization functions worth mentioning, I would surely appreciate it!

@hxjz233 hxjz233 closed this as completed Jun 4, 2024