# Guidance acceleration

When a single Guidance program contains multiple generation or LLM-directed control-flow statements, we can significantly improve inference performance by keeping session state with the LLM inferencer and reusing the key/value (KV) caches as we progress through the prompt. This is much faster than letting the model generate all of the structural tokens itself (for example, when the structure is demonstrated with a one-shot example), and also faster than simply re-calling the model without any state at each point in the Guidance program.

We call this "guidance acceleration". An early implementation is available in `guidance.models.Transformers` (`guidance.llms.Transformers` in earlier releases), as demonstrated below. We also show the performance impact of token healing, which has not yet been highly optimized.
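
To make the caching argument concrete, here is a minimal sketch of a token-counting cost model. It is not part of Guidance: the function `count_encoded_tokens` and the segment sizes are hypothetical, and it assumes that cost is proportional to the number of tokens the model has to process.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (kind, n_tokens); kind is "text" or "gen"


def count_encoded_tokens(segments: List[Segment]) -> Tuple[int, int]:
    """Return (stateless, cached) token-processing counts for a program.

    stateless: every generation call re-encodes the whole context so far.
    cached:    the KV cache is kept, so each token is processed once.
    """
    stateless = 0
    cached = 0
    context = 0  # tokens already in the prompt before the current segment
    for kind, n_tokens in segments:
        if kind == "gen":
            # A stateless call re-reads the full context, then emits n_tokens.
            stateless += context + n_tokens
        # With a cache, fixed text and generated tokens are each seen once.
        cached += n_tokens
        context += n_tokens
    return stateless, cached


# Hypothetical program shaped like the benchmark below: 20 rounds of
# fixed text (assumed 30 tokens) followed by a short generation (assumed 3 tokens).
program = [("text", 30), ("gen", 3)] * 20
stateless, cached = count_encoded_tokens(program)
print("stateless:", stateless, "cached:", cached)
```

Under these assumptions the stateless count (6930) grows quadratically with the number of rounds, while the cached count (660) grows linearly.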

In [1]:
import time
import torch
import guidance

In [2]:
# Define a trivial string we can extend with small models easily
reps = 20
prefix = "Repeat this. Repeat this. "*5 + "Repeat this. Repeat this."
print(prefix)

Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this. Repeat this.


In [None]:
llm = guidance.models.Transformers('gpt2', device='cpu')
print("# prefix tokens:", len(llm.encode(prefix)))
del llm.model_obj

In [None]:
dir(llm)

Define a function for benchmarking:

In [3]:
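def time_model(guidance_model) -> float:
    """Return the wall-clock seconds taken to extend the model `reps` times."""
    start = time.time()
    for _ in range(reps):
        # Each pass appends the fixed prefix, one generated sentence, and a closing period
        guidance_model += prefix + guidance.gen(name='story', stop=".", max_tokens=50) + "."
    end = time.time()
    return end - start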
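
A single wall-clock measurement can be noisy. As a sketch only, a generic helper such as the hypothetical `best_of` below repeats a measurement and reports the fastest run. It uses only the standard library and is not part of Guidance. Repeated runs may benefit from warmed caches, so the first run and later runs are not necessarily comparable.

```python
import time
from typing import Callable, List


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Run fn `repeats` times and return the shortest elapsed time in seconds."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload used purely for illustration.
elapsed = best_of(lambda: sum(range(100_000)), repeats=5)
print(f"fastest of 5 runs: {elapsed:.6f} s")
```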

Run the benchmark on a default model load:

In [4]:
model = 'gpt2-large'
device = 'cuda'
llm_gpt2_large_gpu = guidance.models.Transformers(model, device=device)
print("Model loaded")
# guidance.models.Transformers.cache.clear()

t_accel_and_heal = time_model(llm_gpt2_large_gpu)

print("With guidance acceleration and token healing:", t_accel_and_heal)

torch.cuda.empty_cache()

With guidance acceleration and token healing: 8.942150831222534


In [None]:
llm_gpt2_large_gpu_no_healing = guidance.models.Transformers(model, device=device, token_healing=False)
# guidance.llms.Transformers.cache.clear()
t_accel_and_noheal = time_model(llm_gpt2_large_gpu_no_healing)
print("                  With guidance acceleration:", t_accel_and_noheal)

torch.cuda.empty_cache()

In [None]:
# Note: this cell uses the older `guidance.llms` interface and handlebars-style templates
llm = guidance.llms.Transformers(model, device=device, acceleration=False, token_healing=False)
guidance.llms.Transformers.cache.clear()
start = time.time()
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}."""*reps
program = guidance(template, llm=llm)
b = program(prefix=prefix)
end = time.time()
print("               Without guidance acceleration:", end - start)
del llm.model_obj
torch.cuda.empty_cache()

In [5]:
llm_gpt2_large_gpu_basic = guidance.models.Transformers(model, device=device)
# guidance.llms.Transformers.cache.clear()
start = time.time()

# max_tokens=len(llm.encode(str(a))) - len(llm.encode("Repeat this. Repeat this."))
llm_gpt2_large_gpu_basic += prefix + guidance.gen(name='story', max_tokens=500) + "."
end = time.time()
print("       Single generation call of same length:", end - start)

torch.cuda.empty_cache()

       Single generation call of same length: 335.5419445037842
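
For a rough sense of scale, the two timings recorded in the outputs above can be turned into a ratio. This is simple arithmetic on one run of each cell, so treat the result as indicative rather than a careful benchmark; the variable names are illustrative only.

```python
# Timings copied from the cell outputs above (seconds, one run each).
accelerated_s = 8.942150831222534   # with guidance acceleration and token healing
single_call_s = 335.5419445037842   # single generation call of same length

speedup = single_call_s / accelerated_s
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 37.5x"
```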


<hr style="height: 1px; opacity: 0.5; border: none; background: #cccccc;">
<div style="text-align: center; opacity: 0.5">Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!</div>