# Bark Memory Profiling
Bark has two ways to reduce GPU memory usage:
 - Small models: a smaller version of each model. This can be enabled with the environment variable `SUNO_USE_SMALL_MODELS`.
 - Offloading models to CPU: holding only one model at a time on the GPU and moving the models back to the CPU between generations. This can be enabled with the environment variable `SUNO_OFFLOAD_CPU`.
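
To make the offloading idea concrete, here is a minimal sketch of the pattern. It is not Bark's actual implementation: `on_device` and `FakeModel` are hypothetical names introduced for illustration, and the only assumption about a model is that it has a `.to(device)` method, as `torch.nn.Module` does.

```python
from contextlib import contextmanager


@contextmanager
def on_device(model, device, offload=True):
    """Move `model` to `device` for the duration of the block.

    With offload=True the model is moved back to the CPU afterwards,
    so only the model currently in use occupies the accelerator.
    """
    model.to(device)
    try:
        yield model
    finally:
        if offload:
            model.to("cpu")


class FakeModel:
    """Hypothetical stand-in for a torch.nn.Module; it only tracks its device."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


text_model = FakeModel("text")
with on_device(text_model, "cuda") as m:
    print(m.name, "runs on", m.device)
print(text_model.name, "is back on", text_model.device)
```

The trade-off is extra transfer time on every step in exchange for a lower peak memory footprint.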

## First, we'll use the most memory-efficient configuration

In [3]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["SUNO_USE_SMALL_MODELS"] = "1"
os.environ["SUNO_OFFLOAD_CPU"] = "1"

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark import generate_audio, SAMPLE_RATE

import torch

In [7]:
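# Reset the peak-memory counter, but only when a CUDA device is present
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()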
preload_models()
audio_array = generate_audio("madam I'm adam", history_prompt="v2/en_speaker_5")
max_utilization = torch.cuda.max_memory_allocated()
print(f"max memory usage = {max_utilization / 1024 / 1024:.0f}MB")

No GPU being used. Careful, inference might be very slow!




max memory usage = 0MB
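
The run above printed "No GPU being used", and `torch.cuda.max_memory_allocated()` reports 0 when no CUDA device is in use, so the 0MB figure means "not measured" rather than "no memory needed". A small guard makes that explicit. This is a sketch: `peak_gpu_memory_mib` is a hypothetical helper name, and `torch` is imported lazily only so the snippet also runs where PyTorch is not installed.

```python
def peak_gpu_memory_mib():
    """Return peak allocated CUDA memory in MiB, or None without a CUDA device."""
    try:
        import torch  # lazy import: assumption made so the sketch runs anywhere
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # max_memory_allocated() returns bytes; divide twice by 1024 to get MiB
    return torch.cuda.max_memory_allocated() / 1024 / 1024


peak = peak_gpu_memory_mib()
if peak is None:
    print("no CUDA device: peak GPU memory was not measured")
else:
    print(f"max memory usage = {peak:.0f}MB")
```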


# Memory Profiling
We can profile the memory consumption of four scenarios:
 - Small models, offloading to CPU
 - Large models, offloading to CPU
 - Small models, not offloading to CPU
 - Large models, not offloading to CPU
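
The four scenarios are the Cartesian product of two boolean switches. As a sketch, they can be enumerated in the order listed above with `itertools.product`; `scenario_grid` is a hypothetical helper name, not part of Bark.

```python
from itertools import product


def scenario_grid():
    """Return (offload_to_cpu, use_small_models) pairs for all four scenarios."""
    return list(product((True, False), repeat=2))


for offload, small in scenario_grid():
    size = "Small" if small else "Large"
    mode = "offloading to CPU" if offload else "not offloading to CPU"
    print(f"{size} models, {mode}")
```

The offloading switch varies slowest, which corresponds to a nested loop with offloading in the outer loop and model size in the inner one.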

In [8]:
import os

from bark.generation import (
    generate_text_semantic,
    preload_models,
    models,
)
import bark.generation

from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

import torch
import time

In [10]:
for offload_models in (True, False):
    # Override the module-level flag so it can be changed on the fly;
    # the easier way to do this is with `os.environ["SUNO_OFFLOAD_CPU"] = "1"`
    bark.generation.OFFLOAD_CPU = offload_models
    for use_small_models in (True, False):
        # Rebind the cache on the module itself: assigning to a notebook-level
        # `models` name would leave bark.generation.models untouched
        bark.generation.models = {}
        torch.cuda.empty_cache()
        # Reset the peak counter so each scenario is measured separately
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
        preload_models(
            text_use_small=use_small_models,
            coarse_use_small=use_small_models,
            fine_use_small=use_small_models,
            force_reload=True,
        )
        t0 = time.time()
        audio_array = generate_audio("madam I'm adam", history_prompt="v2/en_speaker_5", silent=True)
        dur = time.time() - t0
        # Reports 0 when no CUDA device is in use
        max_utilization = torch.cuda.max_memory_allocated()
        print(f"Small models {use_small_models}, offloading to CPU: {offload_models}")
        print(f"\tmax memory usage = {max_utilization / 1024 / 1024:.0f}MB, time {dur:.0f}s\n")

No GPU being used. Careful, inference might be very slow!
No GPU being used. Careful, inference might be very slow!


Small models True, offloading to CPU: True
	max memory usage = 0MB, time 97s



No GPU being used. Careful, inference might be very slow!


Small models False, offloading to CPU: True
	max memory usage = 0MB, time 110s



No GPU being used. Careful, inference might be very slow!


Small models True, offloading to CPU: False
	max memory usage = 0MB, time 84s

Small models False, offloading to CPU: False
	max memory usage = 0MB, time 83s

