# Guidance acceleration

When a single Guidance program uses multiple generation or LLM-directed control flow statements, we can significantly improve inference performance by maintaining session state with the LLM inferencer and reusing the key/value caches as we progress through the prompt. This is much faster than letting the model generate all of the structural tokens itself (for example, if the structure were demonstrated using a one-shot example), and it is also faster than simply calling the model again without any state at each point in the Guidance program.

We call this "guidance acceleration". An early implementation is available in `guidance.llms.Transformers`, as demonstrated below. The demonstration also shows the performance impact of token healing, which has not yet been highly optimized.
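Before the timing run, a rough cost model may help explain why the three strategies differ. The sketch below is not part of Guidance: the `Cost` class, the function names, and the assumption of three generated tokens per `gen` call are all hypothetical, and the model only counts sequential forward passes and token positions processed. It assumes that tokens already known (template text) can be processed in one batched pass, while each generated token needs its own pass.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    forward_passes: int    # sequential calls into the model
    tokens_processed: int  # token positions run through the network


def accelerated(reps: int, fixed: int, gen: int) -> Cost:
    """Session state kept: only the new template tokens are encoded each time."""
    passes = tokens = 0
    for _ in range(reps):
        passes += 1        # one batched pass over the new template tokens
        tokens += fixed
        passes += gen      # one pass per generated token
        tokens += gen
    return Cost(passes, tokens)


def stateless(reps: int, fixed: int, gen: int) -> Cost:
    """No session state: every call re-encodes the whole context so far."""
    passes = tokens = context = 0
    for _ in range(reps):
        context += fixed
        passes += 1        # one batched pass, but over the full context
        tokens += context
        passes += gen
        tokens += gen
        context += gen
    return Cost(passes, tokens)


def single_generation(reps: int, fixed: int, gen: int) -> Cost:
    """One call: after the first prefix, the model generates everything itself."""
    total = reps * (fixed + gen)
    generated = total - fixed
    return Cost(1 + generated, total)


# 20 repetitions and a 36-token prefix match the cell below;
# 3 generated tokens per `gen` call is an assumed value for illustration.
for strategy in (accelerated, stateless, single_generation):
    print(f"{strategy.__name__:>18}: {strategy(reps=20, fixed=36, gen=3)}")
```

Under these assumptions the stateless strategy needs the same number of passes as the accelerated one but processes far more token positions, while the single generation call processes the same positions in many more sequential passes. Real timings also depend on hardware, batching efficiency, and token healing, so this should be read as intuition rather than a prediction.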

In [3]:
import time
import torch
import guidance

# Define a trivial string we can extend with small models easily
reps = 20
prefix = "Repeat this. Repeat this. " * 5 + "Repeat this. Repeat this."
llm = guidance.llms.Transformers("gpt2", device="cpu")
print("# prefix tokens:", len(llm.encode(prefix)))
del llm.model_obj

model = "gpt2-large"
device = "cuda"
llm = guidance.llms.Transformers(model, device=device)
guidance.llms.Transformers.cache.clear()
start = time.time()
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}.""" * reps
program = guidance(template, llm=llm)
a = program(prefix=prefix)
end = time.time()
print(f"With guidance acceleration and token healing:", end - start)
del llm.model_obj
torch.cuda.empty_cache()

llm = guidance.llms.Transformers(model, device=device, token_healing=False)
guidance.llms.Transformers.cache.clear()
start = time.time()
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}.""" * reps
program = guidance(template, llm=llm)
b = program(prefix=prefix)
end = time.time()
print(f"                  With guidance acceleration:", end - start)
del llm.model_obj
torch.cuda.empty_cache()

llm = guidance.llms.Transformers(
    model, device=device, acceleration=False, token_healing=False
)
guidance.llms.Transformers.cache.clear()
start = time.time()
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}.""" * reps
program = guidance(template, llm=llm)
b = program(prefix=prefix)
end = time.time()
print(f"               Without guidance acceleration:", end - start)
del llm.model_obj
torch.cuda.empty_cache()

llm = guidance.llms.Transformers(model, device=device)
guidance.llms.Transformers.cache.clear()
start = time.time()
template = """{{prefix}}{{gen 'story' max_tokens=max_tokens}}"""
program = guidance(template, llm=llm)
b = program(
    prefix=prefix,
    max_tokens=len(llm.encode(str(a))) - len(llm.encode("Repeat this. Repeat this.")),
)
end = time.time()
print(f"       Single generation call of same length:", end - start)
del llm.model_obj
torch.cuda.empty_cache()

# prefix tokens: 36
With guidance acceleration and token healing: 1.780165433883667
                  With guidance acceleration: 1.4087750911712646
               Without guidance acceleration: 2.530573606491089
       Single generation call of same length: 13.744869470596313
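
To make the comparison easier to read, the timings above can be expressed relative to the accelerated run without token healing. This is a small sketch that only redoes arithmetic on the numbers already printed; the dictionary keys are labels chosen here for illustration, and the ratios describe this one run on this one machine.

```python
# Timings in seconds, copied from the output above.
timings = {
    "accelerated + token healing": 1.780165433883667,
    "accelerated": 1.4087750911712646,
    "not accelerated": 2.530573606491089,
    "single generation call": 13.744869470596313,
}

# Express every run as a multiple of the fastest configuration.
baseline = timings["accelerated"]
slowdowns = {name: seconds / baseline for name, seconds in timings.items()}

for name, ratio in slowdowns.items():
    print(f"{name:>28}: {ratio:.2f}x the accelerated time")
```

In this run the single generation call takes roughly 9.8 times as long as the accelerated program, the unaccelerated program roughly 1.8 times as long, and token healing adds about 26% on top of the accelerated time.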

