Continuous generation in Outlines #667

Open

rlouf opened this issue Feb 15, 2024 · 8 comments

@rlouf
Member

rlouf commented Feb 15, 2024

I am opening this issue to roughly sketch the next big milestone for Outlines, tentatively called "continuous generation". There are many rough edges still, and open questions.

The first goal is to allow sampling of sequences like these:

from outlines import generate, models

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.text(model)

sequence = "What are the most popular types of vehicles?\n"
for i in range(6):
    sequence += f"{i}, "
    sequence += generator(sequence, stop_at=["\n"])
    sequence += "\n"

By "sampling these sequences" I mean being able to run, for instance, beam search and optimize the sequence as a whole rather than each generation separately.

All we have to do is to return a Sequence object instead of a string, with the following attributes and methods:

from typing import Tuple

import torch


class Sequence:
    token_ids: torch.Tensor
    weights: torch.Tensor
    kv_cache: Tuple
    tokenizer: Tokenizer

    def __str__(self):
        return self.tokenizer.decode(self.token_ids)

Sequence should have the same feel as a string. Besides being able to print it, we should be able to slice it, add it to another string or another Sequence, etc., and carry on:

class Sequence:
    ...
    def __getitem__(self, key):
        if isinstance(key, int):
            # Just return the character? There's not much more we can do here.
            ...
        if isinstance(key, slice):
            # Different behavior depending on whether `start` is 0. If `start = 0` we can
            # keep part of the KV Cache. Otherwise we need to re-compute the KV
            # Cache, i.e. consider the `Sequence` as a new prompt.
            #
            # We will likely need to split tokens. For instance if we call `sequence[:10]` and
            # 10 is the letter `m` in `formida`. In this case we can encode and append `afor`
            # to the previous token ids. Edge cases should automatically be handled when
            # aligning prompt and generation.
            ...

    def __add__(self, other):
        if isinstance(other, str):
            # Signal that KV cache + logprob need to be re-computed
            ...
        if isinstance(other, Sequence):
            # Concatenate token_ids
            # Concatenate logprobs
            # Signal that KV Cache after `other` needs to be recomputed
            ...
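
As a purely illustrative sketch of the concatenation path (assuming a HF-style tokenizer whose encode returns a list of token ids; kv_cache_valid_until is a made-up bookkeeping field, not an existing API):

import torch


class Sequence:
    def __init__(self, token_ids, weights, kv_cache, tokenizer):
        self.token_ids = token_ids      # 1-D tensor of token ids
        self.weights = weights          # 1-D tensor, e.g. per-token log-probabilities
        self.kv_cache = kv_cache
        self.tokenizer = tokenizer
        # Index up to which the KV cache is still valid; anything past it must
        # be re-computed on the next forward pass.
        self.kv_cache_valid_until = len(token_ids)

    def __str__(self):
        return self.tokenizer.decode(self.token_ids)

    def __add__(self, other):
        if isinstance(other, str):
            # Assumed HF-style tokenizer: `encode` returns a list of token ids.
            appended_ids = torch.tensor(self.tokenizer.encode(other))
            # Appended text carries no generation weight yet.
            appended_weights = torch.zeros(len(appended_ids))
        elif isinstance(other, Sequence):
            appended_ids = other.token_ids
            appended_weights = other.weights
        else:
            return NotImplemented
        new = Sequence(
            torch.cat([self.token_ids, appended_ids]),
            torch.cat([self.weights, appended_weights]),
            self.kv_cache,
            self.tokenizer,
        )
        # Only the prefix coming from `self` keeps a usable cache; the appended
        # part has to go through the model before generation can continue.
        new.kv_cache_valid_until = self.kv_cache_valid_until
        return new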

This should be enough to bring Outlines to feature parity with other DSLs, while not being a DSL.

@cpfiffer

It may also be interesting to get the joint token likelihood, if available. I'm not super familiar with Outlines, but I'd love to be able to compare Sequences probabilistically.

@rlouf
Member Author

rlouf commented Feb 24, 2024

We could store that in addition to the sequence weights (which can be, but are not necessarily, the log-probability of the sequence).
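
For illustration only (hypothetical numbers, not Outlines' API), the search weight and the joint log-probability need not coincide:

import torch

# Hypothetical per-token log-probabilities accumulated during generation.
token_logprobs = torch.tensor([-0.3, -1.2, -0.7])

joint_logprob = token_logprobs.sum()            # log p(sequence), what @cpfiffer asks for
weight = joint_logprob / len(token_logprobs)    # e.g. a length-normalized beam score

# The two coincide only when the sampler's weight *is* the joint log-probability.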

@jeffreyftang

Hi @rlouf, I was directed to this issue by @lapp0 as a prerequisite for #657. I'm interested in contributing, but would like to get a sense of the scope of work involved so that I don't make promises I can't keep.

@miftahmoha
Contributor

I'm also interested and am currently working on it.

@rlouf
Member Author

rlouf commented Mar 1, 2024

Great! It is fairly involved: there are many important design decisions to be made, and we need to handle computation of the KV cache after concatenating text to a previous generation.
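
For concreteness, here is a rough sketch with Hugging Face transformers of what extending the cache means mechanically; this is not Outlines' internal API, just the underlying model call:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix_ids = tokenizer("What are the most popular types of vehicles?\n0, ", return_tensors="pt").input_ids
out = model(prefix_ids, use_cache=True)
past = out.past_key_values           # KV cache covering the prefix only

# Text appended after a generation: only the *new* tokens need a forward pass,
# the cached keys/values for the prefix are reused as-is.
appended_ids = tokenizer("\n1, ", return_tensors="pt").input_ids
out = model(appended_ids, past_key_values=past, use_cache=True)
past = out.past_key_values           # cache now also covers the appended text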

Don't hesitate to open a draft PR ASAP so I can give feedback early on.

@rlouf
Member Author

rlouf commented Mar 1, 2024

> would like to get a sense of the scope of work involved so that I don't make promises I can't keep.

It is fairly involved; interleaving function calls should be easier to implement, though.

@lucasavila00

LmScript, a graphical interface for Outlines programs, makes heavy use of continuous generation.

We currently re-send the accumulated prompt for every generation call and handle the chat template on our end.

Better performance for continuous generation would be highly appreciated.

@roberthoenig

Super excited for this feature!

One note: it'd be great if continuous generation were implemented so that intermediate outputs can be processed and reused during generation:

sequence = "What are the most popular names of vehicles and the length of their names?\n"
for i in range(6):
      sequence += f"{i}, "
      vehicle_name_gen = generator(sequence, stop_at=["\n"])
      name_len = process(len, vehicle_name_gen)   # `process` would be part of the outlines API and execute the given function during generation
      sequence += vehicle_name_gen + ",  " + name_len + " characters long."  
      sequence += "\n"
