I created a new kind of MPOHamiltonian for quantum chemistry Hamiltonians, because pre-summing the environments gains some speed there. I also put a lot of effort into making the effective Hamiltonian fast. In this notebook I benchmark the MPSKit effective Hamiltonians against my (hopefully optimized) implementation.

This benchmark does not use the pre-summation, so both codes should do exactly the same thing.

In [1]:
using Revise, MPSKit, MPSKitModels, TensorKit
using LinearAlgebra, Base.Threads, BenchmarkTools
using MPSKitExperimental:FusedMPOHamiltonian

In [2]:
BLAS.set_num_threads(1)

In [3]:
nthreads()

4

## Heisenberg spin 1

In [4]:
fmps_len = 10;
middle_site = Int(round(fmps_len/2))

ts = FiniteMPS(rand,ComplexF64,fmps_len,Rep[SU₂](1=>1),Rep[SU₂](i => 20 for i in 0:10));

th_orig = heisenberg_XXX(SU2Irrep);
th_fused = convert(FusedMPOHamiltonian,repeat(th_orig,length(ts)));

env_orig = environments(ts,th_orig);
env_fused = environments(ts,th_fused);

ac = ts.AC[middle_site]
ac_eff_orig = MPSKit.∂∂AC(middle_site,ts,th_orig,env_orig);
ac_eff_fused = MPSKit.∂∂AC(middle_site,ts,th_fused,env_fused);


ac2 = ts.AC[middle_site]*MPSKit._transpose_tail(ts.AR[middle_site+1])
ac2_eff_orig = MPSKit.∂∂AC2(middle_site,ts,th_orig,env_orig);
ac2_eff_fused = MPSKit.∂∂AC2(middle_site,ts,th_fused,env_fused);

In [5]:
@benchmark ac_eff_orig(ac)

BenchmarkTools.Trial: 3028 samples with 1 evaluation.
 Range (min … max):  859.053 μs … 13.965 ms  ┊ GC (min … max):  0.00% … 80.16%
 Time  (median):       1.368 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.643 ms ±  1.638 ms  ┊ GC (mean ± σ):  13.85% ± 12.20%

In [6]:
@benchmark ac_eff_fused(ac)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   34.610 μs …  16.455 ms  ┊ GC (min … max): 0.00% … 97.11%
 Time  (median):      88.554 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   105.244 μs ± 402.377 μs  ┊ GC (mean ± σ):  9.71% ±  2.54%

In [7]:
@benchmark ac2_eff_orig(ac2)

BenchmarkTools.Trial: 2227 samples with 1 evaluation.
 Range (min … max):  1.316 ms … 24.615 ms  ┊ GC (min … max):  0.00% … 87.28%
 Time  (median):     1.787 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.237 ms ±  3.018 ms  ┊ GC (mean ± σ):  19.29% ± 12.93%

In [8]:
@benchmark ac2_eff_fused(ac2)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.529 μs …  17.418 ms  ┊ GC (min … max): 0.00% … 95.38%
 Time  (median):     78.534 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.214 μs ± 297.227 μs  ┊ GC (mean ± σ):  5.30% ±  1.66%

## some weird Hamiltonian

In [9]:
using MAT
include("haagerup/H3_anyon.jl")

In [10]:
fmps_len = 10;
middle_site = Int(round(fmps_len/2))

physical = Vect[H3](H3(4)=>1);
virtual = Vect[H3](sector => 20 for sector in TensorKit.SectorValues{H3}());
ts = FiniteMPS(rand,ComplexF64,fmps_len,physical,virtual);

mpotensor = TensorMap(ones,ComplexF64,physical*physical,physical*physical);
blocks(mpotensor)[H3(1)]*=0;
blocks(mpotensor)[H3(5)]*=0;
blocks(mpotensor)[H3(6)]*=0;
th_orig = MPOHamiltonian(-mpotensor);
th_fused = convert(FusedMPOHamiltonian,repeat(th_orig,length(ts)));

env_orig = environments(ts,th_orig);
env_fused = environments(ts,th_fused);

ac = ts.AC[middle_site]
ac_eff_orig = MPSKit.∂∂AC(middle_site,ts,th_orig,env_orig);
ac_eff_fused = MPSKit.∂∂AC(middle_site,ts,th_fused,env_fused);


ac2 = ts.AC[middle_site]*MPSKit._transpose_tail(ts.AR[middle_site+1])
ac2_eff_orig = MPSKit.∂∂AC2(middle_site,ts,th_orig,env_orig);
ac2_eff_fused = MPSKit.∂∂AC2(middle_site,ts,th_fused,env_fused);

In [11]:
@benchmark ac_eff_orig(ac)

BenchmarkTools.Trial: 747 samples with 1 evaluation.
 Range (min … max):  3.480 ms … 28.923 ms  ┊ GC (min … max):  0.00% … 75.71%
 Time  (median):     4.967 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.684 ms ±  4.785 ms  ┊ GC (mean ± σ):  24.63% ± 23.09%

In [12]:
@benchmark ac_eff_fused(ac)

BenchmarkTools.Trial: 7108 samples with 1 evaluation.
 Range (min … max):  463.256 μs …  16.480 ms  ┊ GC (min … max): 0.00% … 92.75%
 Time  (median):     618.374 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   696.599 μs ± 685.468 μs  ┊ GC (mean ± σ):  4.42% ±  4.40%

In [13]:
@benchmark ac2_eff_orig(ac2)

BenchmarkTools.Trial: 266 samples with 1 evaluation.
 Range (min … max):  12.984 ms … 36.734 ms  ┊ GC (min … max):  0.00% … 57.73%
 Time  (median):     14.589 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   18.823 ms ±  8.341 ms  ┊ GC (mean ± σ):  22.97% ± 23.63%

In [14]:
@benchmark ac2_eff_fused(ac2)

BenchmarkTools.Trial: 4869 samples with 1 evaluation.
 Range (min … max):  804.375 μs … 21.560 ms  ┊ GC (min … max): 0.00% … 90.57%
 Time  (median):     897.784 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.017 ms ±  1.048 ms  ┊ GC (mean ± σ):  6.60% ±  6.10%

## manual

A manually written-out @tensor contraction, compared to a version that uses DelayedFact & friends. Recycling intermediate tensors matters mostly when multithreading, so it is not very applicable here.
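To make the recycling idea concrete, here is a minimal sketch with plain dense matrices, using only the standard library. `BufferPool`, `acquire!`, `release!` and `apply!` are hypothetical names made up for this illustration; they are not the DelayedFact or TransposeFact API, and the sketch is not thread-safe (a threaded version would presumably need a lock or one pool per task).

```julia
using LinearAlgebra

# Hypothetical stand-in for the recycling idea: a pool that hands out
# matrices of one fixed size and takes them back after use.
struct BufferPool{T}
    dims::Tuple{Int,Int}
    idle::Vector{Matrix{T}}
end
BufferPool{T}(m::Int, n::Int) where {T} = BufferPool{T}((m, n), Matrix{T}[])

# Reuse a returned buffer when one is available, otherwise allocate a new one.
acquire!(p::BufferPool{T}) where {T} =
    isempty(p.idle) ? Matrix{T}(undef, p.dims...) : pop!(p.idle)

# Give a buffer back to the pool so that a later call can reuse it.
release!(p::BufferPool, buf::Matrix) = (push!(p.idle, buf); nothing)

# y = l * x * r, with the intermediate l * x taken from the pool.
function apply!(y, l, x, r, pool)
    tmp = acquire!(pool)
    mul!(tmp, l, x)      # tmp = l * x, written in place
    mul!(y, tmp, r)      # y = tmp * r, written in place
    release!(pool, tmp)  # the intermediate is free to be reused
    return y
end

l = rand(ComplexF64, 64, 64)
x = rand(ComplexF64, 64, 64)
r = rand(ComplexF64, 64, 64)
y = similar(x)
pool = BufferPool{ComplexF64}(64, 64)

apply!(y, l, x, r, pool)  # first call allocates the intermediate
apply!(y, l, x, r, pool)  # second call reuses it
```

After the first call the pool holds one idle buffer, so repeated applications no longer allocate the intermediate.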

In [15]:
using MPSKitExperimental:DelayedFact,TransposeFact,free!

In [16]:
virtual = Rep[SU₂](i => 20 for i in 0:10);
physical = Rep[SU₂](1=>1)

ac = TensorMap(rand,ComplexF64,virtual*physical,virtual)

fun = let        
    e = transpose(TensorMap(rand,ComplexF64,oneunit(physical)*physical,physical*oneunit(physical)),(1,),(3,4,2))
    l = transpose(TensorMap(rand,ComplexF64,virtual*oneunit(physical)',virtual),(3,1),(2,))
    le = transpose(l*e,(2,5,4),(1,3))

    r = TensorMap(rand,ComplexF64,virtual*oneunit(physical),virtual)


    function tobench(x)
        @plansor y[-1 -2;-3] := le[-1 -2 5;2 3]*x[2 3;4]*r[4 5;-3]
    end
end

@benchmark fun(ac)

BenchmarkTools.Trial: 5192 samples with 1 evaluation.
 Range (min … max):  548.926 μs … 14.786 ms  ┊ GC (min … max):  0.00% … 90.50%
 Time  (median):     704.005 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   955.215 μs ±  1.664 ms  ┊ GC (mean ± σ):  23.67% ± 12.61%

In [17]:
virtual = Rep[SU₂](i => 20 for i in 0:10);
physical = Rep[SU₂](1=>1)

ac = TensorMap(rand,ComplexF64,virtual*physical,virtual)

fun = let        
    e = transpose(TensorMap(rand,ComplexF64,oneunit(physical)*physical,physical*oneunit(physical)),(1,),(3,4,2))
    l = transpose(TensorMap(rand,ComplexF64,virtual*oneunit(physical)',virtual),(3,1),(2,))
    le = transpose(l*e,(2,5,4),(1,3))

    r = TensorMap(rand,ComplexF64,virtual*oneunit(physical),virtual)


    function tobench(x)

        lex = le*x
        lex_trans = transpose(lex,(1,2),(4,3))
        lex_trans*r
    end
end

@benchmark fun(ac)

BenchmarkTools.Trial: 6814 samples with 1 evaluation.
 Range (min … max):  457.697 μs … 13.747 ms  ┊ GC (min … max):  0.00% … 91.21%
 Time  (median):     523.471 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   725.849 μs ±  1.418 ms  ┊ GC (mean ± σ):  23.83% ± 11.48%

In [18]:
virtual = Rep[SU₂](i => 20 for i in 0:10);
physical = Rep[SU₂](1=>1)

ac = TensorMap(rand,ComplexF64,virtual*physical,virtual)

fun = let        
    e = transpose(TensorMap(rand,ComplexF64,oneunit(physical)*physical,physical*oneunit(physical)),(1,),(3,4,2))
    l = transpose(TensorMap(rand,ComplexF64,virtual*oneunit(physical)',virtual),(3,1),(2,))
    le = transpose(l*e,(2,5,4),(1,3))

    r = TensorMap(rand,ComplexF64,virtual*oneunit(physical),virtual)

    multfactory_lex = DelayedFact(codomain(le)←virtual,storagetype(r))
    transposefactor_lex = TransposeFact(codomain(le)←virtual,storagetype(r),(1,2),(4,3))
    multfact_out = DelayedFact(virtual*physical←virtual,storagetype(r))

    function tobench(x)
        t_1 = multfactory_lex()
        mul!(t_1,le,x)

        t_2 = transposefactor_lex(t_1)
        free!(multfactory_lex,t_1);

        y = multfact_out()

        mul!(y,t_2,r)

        free!(transposefactor_lex,t_2)
        y
    end
end

@benchmark fun(ac)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  233.359 μs …  15.651 ms  ┊ GC (min … max):  0.00% … 94.49%
 Time  (median):     246.359 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   309.244 μs ± 731.047 μs  ┊ GC (mean ± σ):  15.04% ±  6.25%

## Team Topo

In [19]:
# original

using TensorKit, TensorOperations
using BenchmarkTools

tensorexpr(name::Symbol, inds) = Expr(:ref, name, inds...)
@generated function apply_transfer2(A, x::AbstractTensorMap{S,N,0}) where {S,N}
    N > 0 || error("undefined behaviour for length 0")

    out_part = tensorexpr(:y, -1:-1:-N)

    in_part = Expr(:call, :*, tensorexpr(:x, 1:2:2N), tensorexpr(:A, (2N, -1, 1, 2)))
    for i = 2:N
        push!(in_part.args, tensorexpr(:A, (2i - 2, -i,  2i - 1, 2i)))
    end

    return :(@tensor $out_part := $in_part)
end


println(Threads.nthreads())
for n = 3:5
    println("tests for n = $n")
    let N = n, sp = ℂ^2
        A = permute(TensorMap(rand, ComplexF64, sp^2 ← sp^2),(4,1),(3,2))
        X = Tensor(rand, ComplexF64, (sp)^N)
        @btime apply_transfer2($A, $X)
        nothing
    end

    let N = n, sp = Z2Space(0 => 1, 1 => 1)
        A = permute(TensorMap(rand, ComplexF64, sp^2 ← sp^2),(4,1),(3,2))
        X = Tensor(rand, ComplexF64, (sp)^N)
        @btime apply_transfer2($A, $X)
        nothing
    end
end

4
tests for n = 3
  15.569 μs (88 allocations: 10.17 KiB)
  114.559 μs (2504 allocations: 169.06 KiB)
tests for n = 4
  37.380 μs (126 allocations: 18.50 KiB)
  198.259 μs (4891 allocations: 367.80 KiB)
tests for n = 5
  64.930 μs (172 allocations: 33.98 KiB)
  384.678 μs (10019 allocations: 859.91 KiB)


In [18]:
# now with transposefactory
using ProfileView
using PProf
function apply_transfer_fast(A,x::AbstractTensorMap{S,N,0}) where {S,N}
    N > 0 || error("undefined behaviour for length 0")

    trans_x_first = TransposeFact(codomain(x)←domain(x),storagetype(x),(1,),tuple(N:-1:2...))
    trans_A_first = TransposeFact(codomain(A)←domain(A),storagetype(A),(1,2,4),(3,))
    
    multcache_1 = DelayedFact(trans_A_first.delayed.cod←trans_x_first.delayed.dom,storagetype(x))

    trans_AX_first = TransposeFact(multcache_1,(3,N+2),(2,1,4:N+1...))

    trans_A_middle = TransposeFact(codomain(A)←domain(A),storagetype(A),(2,4),(1,3))
    multcache_middle = DelayedFact(trans_A_middle.delayed.cod←prod(space(x,1)' for i in 1:N),storagetype(A))
    trans_AX_middle = TransposeFact(multcache_middle.cod←multcache_middle.dom,storagetype(A),(2,N+2),(1,3:N+1...))

    function ff(A,x)
        for top in 1:1000
            t_A1 = trans_A_first(A)
            t_x1 = trans_x_first(x)
            
            t_Ax1 = multcache_1()
            mul!(t_Ax1,t_A1,t_x1)
            free!(trans_A_first,t_A1)
            free!(trans_x_first,t_x1)

            t_Ax2 = trans_AX_first(t_Ax1)
            free!(multcache_1,t_Ax1)

            for i in 2:N-1
                tA = trans_A_middle(A)
                t_m = multcache_middle()
                mul!(t_m,tA,t_Ax2)
                
                free!(trans_A_middle,tA)
                if i == 2
                    free!(trans_AX_first,t_Ax2)
                else
                    free!(trans_AX_middle,t_Ax2)
                end

                t_Ax2 = trans_AX_middle(t_m)
                free!(multcache_middle,t_m)
            end
        end
    end

end

println(Threads.nthreads())
n = 4
using Profile
Profile.clear()
let N = n, sp = Z2Space(0 => 1, 1 => 1)
    A = TensorMap(rand, ComplexF64, sp*sp ← sp*sp)
    X = Tensor(rand, ComplexF64, (sp)^N)
    fastfun = apply_transfer_fast(A,X)
    #@btime $fastfun($A, $X)
    fastfun(A,X)
    #VSCodeServer.@profview fastfun(A,X)
    ProfileView.@profview fastfun(A,X)
end

4


FATAL ERROR: Gtk state corrupted by error thrown in a callback:
ERROR: AssertionError: g_stack === nothing && !prev
Stacktrace:
 [1] g_sigatom(f::Any)
   @ Gtk.GLib ~/.julia/packages/Gtk/oo3cW/src/GLib/signals.jl:174
 [2] gtk_main()
   @ Gtk ~/.julia/packages/Gtk/oo3cW/src/events.jl:1

ERROR: AssertionError: g_stack === nothing && !prev
Stacktrace:
  [1] g_sigatom(f::Any)
    @ Gtk.GLib ~/.julia/packages/Gtk/oo3cW/src/GLib/signals.jl:174
  [2] macro expansion
    @ ~/.julia/packages/Gtk/oo3cW/src/GLib/signals.jl:200 [inlined]
  [3] show
    @ ~/.julia/packages/Gtk/oo3cW/src/base.jl:39 [inlined]
  [4] Gtk.GtkWindowLeaf(title::String, w::Int64, h::Int64, resizable::Bool, toplevel::Bool)
    @ Gtk ~/.julia/packages/Gtk/oo3cW/src/windows.jl:14
  [5] GtkWindowLeaf
    @ ~/.julia/packages/Gtk/oo3cW/src/windows.jl:2 [inlined]
  [6] #GtkWindow#196
    @ ~/.julia/packages/Gtk/oo3cW/src/GLib/gtype.jl:226 [inlined]
  [7] GtkWindow
    @ ~/.julia/packages/Gtk/oo3cW/src/GLib/gtype.jl:225 [inlined]
  [8] viewgui(fcolor::FlameGraphs.FlameColors, gdict::Dict{Symbol, Dict{Symbol, LeftChildRightSiblingTrees.Node{FlameGraphs.NodeData}}}; data::Vector{UInt64}, lidict::Dict{UInt64, Vector{

error in running finalizer: AssertionError(msg="xor(prev, current_task() !== g_stack)")
g_siginterruptible at /home/maarten/.julia/packages/Gtk/oo3cW/src/GLib/signals.jl:209
unknown function (ip: 0x7fce8fb69f36)
GClosureMarshal at /home/maarten/.julia/packages/Gtk/oo3cW/src/GLib/signals.jl:58
unknown function (ip: 0x7fce8fb55f8a)
jlcapi_GClosureMarshal_1208 at /home/maarten/.julia/compiled/v1.9/Gtk/Vjnq0_ClZO7.so (unknown line)
g_closure_invoke at /home/maarten/.julia/artifacts/b6ebc4def1211ec4043d7c055f450d59f56747cc/lib/libgobject-2.0.so.0 (unknown line)
signal_emit_unlocked_R at /home/maarten/.julia/artifacts/b6ebc4def1211ec4043d7c055f450d59f56747cc/lib/libgobject-2.0.so.0 (unknown line)
g_signal_emit_valist at /home/maarten/.julia/artifacts/b6ebc4def1211ec4043d7c055f450d59f56747cc/lib/libgobject-2.0.so.0 (unknown line)
g_signal_emit at /home/maarten/.julia/artifacts/b6ebc4def1211ec4043d7c055f450d59f56747cc/lib/libgobject-2.0.so.0 (unknown line)
gtk_widget_dispose at /workspace/srcd

AssertionError: AssertionError: g_stack === nothing && !prev