# Memory allocations on different streams in PyTorch, with profile excerpts


I want to show where the `cudaFree` calls seen in profiles come from, and attribute them to allocations made on a stream other than the main (default) one.

Screenshots of the relevant parts of the profiles are included below each experiment.

In [3]:
import torch
total_mem = torch.cuda.get_device_properties(0).total_memory
print(total_mem)

to_alloc = int(0.9 * total_mem)

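# Warm-up allocation: the first call goes to cudaMalloc; after `del x` the block
# stays cached in the allocator's pool for the default stream.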

x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
del x

50887524352


In [None]:
# without streams

for _ in range(10):
    x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
    del x

torch.cuda.synchronize()

The profile shows no GPU operations: every iteration reuses the block cached during warm-up, so the caching allocator works as intended.
<img src="./img/1.png">
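The same observation can be cross-checked without a profiler by reading the allocator's own counters. The following is a minimal sketch, assuming the `segment.all.allocated` and `segment.all.freed` entries of `torch.cuda.memory_stats()` track segments obtained from and returned to the driver; `segment_delta` is a hypothetical helper, and the 1 MiB size is a stand-in for the notebook's `to_alloc`.

```python
import torch


def segment_delta(before: dict, after: dict) -> dict:
    """Difference in segment counters between two memory_stats() snapshots."""
    keys = {
        "allocated": "segment.all.allocated",  # segments obtained from the driver
        "freed": "segment.all.freed",          # segments returned to the driver
    }
    return {name: after.get(key, 0) - before.get(key, 0) for name, key in keys.items()}


if torch.cuda.is_available():
    n_bytes = 1 << 20  # assumption: 1 MiB here; the notebook uses 0.9 * total_memory

    # Warm up so the block is already cached before measuring.
    x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    del x

    before = torch.cuda.memory_stats()
    for _ in range(10):
        x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
        del x
    torch.cuda.synchronize()
    after = torch.cuda.memory_stats()

    print(segment_delta(before, after))
```

If the allocator serves every iteration from its cache, both deltas should stay at zero, which would match the empty profile above.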

In [5]:
# with streams, no sync

s = torch.cuda.Stream()

for _ in range(10):
    with torch.cuda.stream(s):
        x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
        del x
    x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
    del x

torch.cuda.synchronize()

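The profile shows `cudaMalloc` and `cudaFree` calls for every `torch.empty` call. The caching allocator keeps a separate pool of cached blocks per stream, so a block freed on `s` cannot serve an allocation on the default stream, and vice versa. Because each request is 90% of device memory, the cached block has to be released with `cudaFree` before a new one can be obtained with `cudaMalloc`. The screenshot zooms in on one of these calls: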


<img src="./img/2.png">
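The mechanism can be illustrated with a toy model. This is a deliberately simplified sketch, not PyTorch's actual allocator: `ToyCachingAllocator` and `run` are hypothetical names, sizes are abstract units, and the only rules modeled are "cached blocks are reused only within their own stream's pool" and "when a new block does not fit, release the cached blocks and retry".

```python
from collections import defaultdict


class ToyCachingAllocator:
    """Toy model: one pool of cached free blocks per stream, fixed capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reserved = 0               # units currently held from the "driver"
        self.pools = defaultdict(list)  # stream name -> sizes of cached free blocks
        self.n_malloc = 0               # simulated cudaMalloc calls
        self.n_free = 0                 # simulated cudaFree calls

    def alloc(self, size: int, stream: str) -> int:
        pool = self.pools[stream]
        # 1. Reuse a cached block, but only from this stream's own pool.
        for i, block in enumerate(pool):
            if block >= size:
                return pool.pop(i)
        # 2. Otherwise go to the driver; if the request does not fit,
        #    release every cached block first and check again.
        if self.reserved + size > self.capacity:
            self._release_cached()
        if self.reserved + size > self.capacity:
            raise MemoryError("out of memory")
        self.reserved += size
        self.n_malloc += 1
        return size

    def free(self, block: int, stream: str) -> None:
        # A freed block is cached in the pool of the stream it was allocated on.
        self.pools[stream].append(block)

    def _release_cached(self) -> None:
        for pool in self.pools.values():
            while pool:
                self.reserved -= pool.pop()
                self.n_free += 1


def run(streams, capacity=100, size=90, iters=10):
    """Allocate and free `size` on each stream in turn; return (mallocs, frees)."""
    allocator = ToyCachingAllocator(capacity)
    allocator.free(allocator.alloc(size, "default"), "default")  # warm up
    for _ in range(iters):
        for stream in streams:
            allocator.free(allocator.alloc(size, stream), stream)
    return allocator.n_malloc, allocator.n_free


print(run(["default"]))          # (1, 0): only the warm-up reaches the driver
print(run(["side", "default"]))  # (21, 20): every allocation frees, then mallocs
```

In this model, shrinking the request so that both streams' blocks fit at once, for example `run(["side", "default"], size=40)`, returns `(2, 0)`: each pool keeps its own block and nothing is ever released. That suggests the frees in the profile come from the combination of per-stream pools and the 90% request size, not from streams alone.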

In [7]:
# with streams, sync

s = torch.cuda.Stream()

for _ in range(10):
    with torch.cuda.stream(s):
        x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
        del x
    torch.cuda.synchronize()
    x = torch.empty(to_alloc, dtype=torch.uint8, device="cuda")
    del x
    torch.cuda.synchronize()

torch.cuda.synchronize()

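Even with explicit synchronization between the allocations, the profile still shows `cudaMalloc` and `cudaFree` calls for every `torch.empty` call: synchronizing does not move cached blocks between the per-stream pools. The screenshot zooms in on one of these calls: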

<img src="./img/3.png">