
[FEA]: Create LLM Engine Core Functionality #1178

Closed · 2 tasks done · Tracked by #1305 ...

mdemoret-nv opened this issue Sep 8, 2023 · 4 comments · Fixed by #1261 or #1292
Labels: feature request (New feature or request), sherlock (Issues/PRs related to Sherlock workflows and components)

Comments

@mdemoret-nv
Contributor

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem this feature solves

To support working with LLMs directly, an LLM Engine should be added to Morpheus, enabling complex LLM functions to be executed directly within a pipeline.

Describe your ideal solution

The LLM engine will be composed of 3 components:

  • Prompt Generator
  • LLM Service
  • Task Generator

The Prompt Generator is responsible for taking some input message and converting it into a prompt to be fed into the LLM service. The Prompt Generator can be as simple as a basic templating engine, or as complex as a full LangChain or LlamaIndex pipeline.

The LLM Service will be responsible for executing queries with an LLM. It will be generic to allow multiple LLM services to be used but initially should target the service created by #1130.

The Task Generator is responsible for taking the output from the LLM service and converting it into a task to be processed. This task can either be fed back into the Prompt Generator or exit the LLM engine to be processed by the pipeline.

This issue will contain the design discussion for the above components since the API has not been finalized.
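
For illustration only, the flow described above could be sketched as follows. None of these names are a proposed API; they are placeholders showing how the three components interact:

# Rough conceptual sketch of the component flow; every name here is a
# placeholder, not a proposed API.
def run_llm_engine(message, prompt_generator, llm_service, task_generator):
    pending = [message]
    completed = []

    while pending:
        current = pending.pop(0)

        prompt = prompt_generator.generate(current)          # message -> prompt
        response = llm_service.execute(prompt)               # prompt -> LLM response
        tasks = task_generator.generate(current, response)   # response -> tasks

        for task in tasks:
            if task.needs_llm:      # fed back into the Prompt Generator
                pending.append(task)
            else:                   # exits the engine to be processed by the pipeline
                completed.append(task)

    return completed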

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@mdemoret-nv
Contributor Author

mdemoret-nv commented Sep 12, 2023

LLM Engine Core API

The LLMEngine class will utilize the following components:

  1. LLMPromptGenerator
    1. For a given incoming task, the prompt generator is responsible for converting the input ControlMessage into a prompt to be run by the model
  2. LLMService
    1. Primary responsibility is configuring LLMClient objects to be able to process LLM requests given input prompts
    2. Configures what type of service, which model, parameters to use with that model, options, etc.
    3. Takes the output from the LLMPromptGenerator and sends it to the LLM service for inference
  3. LLMTaskHandler
    1. The task handler is responsible for turning the output of the model into tasks to be further processed.
    2. Task handlers can generate additional tasks to be run through the LLMEngine again, or new tasks to be run by the rest of the pipeline.

LLMEngine API

The LLMEngine is mainly a container for the prompt generators, LLM service, and task handlers. Users can configure the engine by adding different generators and handlers depending on their desired functionality. For example, a simple use case might add a TemplatePromptGenerator which is capable of taking input tasks and building prompts using a pre-defined template. Or a user might add a SpearPhishingQualityCheck task handler to ensure that the emails generated by an LLM pass some basic checks, and if the checks don't pass, it could try regenerating the emails in a second pass.

The public facing API for the LLMEngine is as follows:

class LLMEngine:
    def __init__(self, llm_service: LLMService):
        ...

    @property
    def llm_service(self) -> LLMService:
        ...

    # Managing Prompt Generators
    def add_prompt_generator(self, prompt_generator: LLMPromptGenerator):
        ...
    def insert_prompt_generator(self, prompt_generator: LLMPromptGenerator, index: int):
        ...
    def remove_prompt_generator(self, prompt_generator: LLMPromptGenerator) -> bool:
        ...

    # Managing Task Handlers
    def add_task_handler(self, task_handler: LLMTaskHandler):
        ...
    def insert_task_handler(self, task_handler: LLMTaskHandler, index: int):
        ...
    def remove_task_handler(self, task_handler: LLMTaskHandler) -> bool:
        ...

    # Executing a task with a given input message
    def run(self, message: ControlMessage) -> list[ControlMessage]:
        ...

Notes:

  • The LLMEngine requires a LLMService when it is created. This is necessary to allow prompt generators and task handlers to access the LLMService when they are added to the engine.
  • Prompt generators and task handlers are ordered by priority. Not every prompt generator or task handler may be able to handle a specific task. When one cannot, the engine passes the message on to the next generator/handler until one is capable of handling the task (see the sketch after these notes)
  • If no handlers/generators can process a task, an error is generated
  • The run call only works with ControlMessage since it needs the specific llm_engine task on the control message to function properly
  • Multiple messages can be returned by run since some rows may need to be reprocessed and others are complete.
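
To make the priority ordering and fallback behavior concrete, the sketch below shows what run() might do internally. This is illustrative only; the attribute names (_prompt_generators, _llm_service, _task_handlers) are assumptions, not the actual implementation.

# Illustrative sketch of run() based on the notes above; attribute names are
# assumptions, and re-queuing of messages that still carry an llm_engine task
# is omitted for brevity.
def run(self, message: ControlMessage) -> list[ControlMessage]:
    # Pull the llm_engine task off the message before handing it to the handlers
    input_task = message.remove_task("llm_engine")

    # Try each prompt generator in priority order until one handles the task
    prompt_or_result = None
    for generator in self._prompt_generators:
        prompt_or_result = generator.try_handle(self, input_task, message)
        if prompt_or_result is not None:
            break

    if prompt_or_result is None:
        raise RuntimeError("No prompt generator could handle the task")

    # Only run the LLM if the generator returned a prompt rather than a finished result
    if isinstance(prompt_or_result, LLMGeneratePrompt):
        client = self._llm_service.get_client(model_name=prompt_or_result.model_name,
                                              model_kwargs=prompt_or_result.model_kwargs)
        result = LLMGenerateResult(model_name=prompt_or_result.model_name,
                                   model_kwargs=prompt_or_result.model_kwargs,
                                   prompts=prompt_or_result.prompts,
                                   responses=client.generate(prompt_or_result.prompts))
    else:
        result = prompt_or_result

    # Try each task handler in priority order until one handles the result
    for handler in self._task_handlers:
        out_messages = handler.try_handle(self, input_task, message, result)
        if out_messages is not None:
            return out_messages

    raise RuntimeError("No task handler could handle the result")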

LLMPromptGenerator API

The LLMPromptGenerator has a single purpose: to generate prompts from input messages. While the API is quite simple, the prompt generator can be as basic or complex as needed. For example, it's possible to run entire LangChain pipelines inside of the prompt generator without using any additional task handlers. This approach would make it easy to load LLM functionality from external libraries at the cost of performance (one request at a time).

The public facing API is as follows:

class LLMPromptGenerator(ABC):

    @abstractmethod
    def try_handle(self, engine: "LLMEngine", input_task: dict,
                   input_message: ControlMessage) -> LLMGeneratePrompt | LLMGenerateResult | None:
        pass

Notes:

  • The try_handle function takes in the LLMEngine to allow it to run queries itself using the engine's LLMService. This is necessary to allow things like LangChain to operate completely inside of the engine without modification.
  • The try_handle returns None if it cannot handle the current task/message.
  • The try_handle function can return either LLMGeneratePrompt, to have the engine execute the LLM in the next step, or LLMGenerateResult if the LLM model does not need to be run
    • LLMGenerateResult may be returned if the prompt generator has already run the model, or determines it does not need to be run.
  • The input_task is the result of input_message.remove_task("llm_engine"). This could be left on the message, but since this may return None, it seemed safer to pull this off the message before passing it to the handlers.

LLMService API

The LLMService object is an abstraction around executing LLM queries/completions using any external service or model type. It takes inspiration from the LLM libraries of LangChain and LlamaIndex. The LLMService does not have methods for running LLM models directly. Instead, you request an LLMClient from the service and use that object for executing prompts.

Using the service and client architecture allows multiple models to be used within the LLMEngine all with a single configuration. Additionally, clients can work with their service to provide things like dynamic batching, or caching.

The public facing API is as follows:

class LLMClient(ABC):

    @typing.overload
    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...

    @typing.overload
    @abstractmethod
    def generate(self, prompt: list[str]) -> list[str]:
        ...

    @abstractmethod
    def generate(self, prompt: str | list[str]) -> str | list[str]:
        pass


class LLMService(ABC):

    @abstractmethod
    def get_client(self, model_name: str, model_kwargs: dict | None = None) -> LLMClient:
        pass

Notes:

  • A new LLMClient needs to be created for each model/model_kwargs combination. This is necessary to allow for multiple requests to be dynamically batched by forcing the user to specify these properties before running inference.
  • Multiple clients, even for the same model, can be created from a single service.
  • Client objects are not thread safe. Instead, each thread should use its own copy of the client.
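
As an illustration of the intended service/client split (a hypothetical toy implementation for testing, not something proposed by this issue), a service could be implemented and used as follows:

# Hypothetical echo implementation showing how the service/client split is
# meant to be used; illustrative only.
class EchoLLMClient(LLMClient):

    def __init__(self, model_name: str, model_kwargs: dict | None):
        self._model_name = model_name
        self._model_kwargs = model_kwargs or {}

    def generate(self, prompt: str | list[str]) -> str | list[str]:
        # A real client would call the external service here, possibly batching
        # multiple prompts into a single request
        if isinstance(prompt, str):
            return f"[{self._model_name}] {prompt}"
        return [f"[{self._model_name}] {p}" for p in prompt]


class EchoLLMService(LLMService):

    def get_client(self, model_name: str, model_kwargs: dict | None = None) -> LLMClient:
        # One client per model/model_kwargs combination, as required above
        return EchoLLMClient(model_name, model_kwargs)


# Usage: request a client from the service, then execute prompts with it
client = EchoLLMService().get_client(model_name="my-model")
print(client.generate(["Hello", "World"]))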

LLMTaskHandler API

Finally, the LLMTaskHandler needs to determine what to do with the output of the LLMService. Task handlers will vary from application to application and can be very basic or very complex. Basic applications might convert every response into a new ControlMessage to be processed by the pipeline. Complex applications may perform tasks such as evaluating arbitrary code or running LLM agents. The only requirement is that the output is turned back into a ControlMessage.

The public facing API is as follows:

class LLMTaskHandler(ABC):

    @abstractmethod
    def try_handle(self, engine: "LLMEngine", input_task: dict, input_message: ControlMessage,
                   responses: LLMGenerateResult) -> list[ControlMessage] | None:
        pass

Notes:

  • Much like the LLMPromptGenerator, the LLMTaskHandler returns None from try_handle if it cannot process the message
  • The responses input is always of type LLMGenerateResult, since this is what is output by either the prompt generator or the LLM service
  • The input_message is passed to the handlers to allow for reuse of either the payload, the message, or both. This eliminates overhead in creating new messages for each loop.

Outstanding Questions

Given the above design, there are still a few outstanding questions that need to be resolved:

  • How will we avoid inheritance slicing in the LLMEngine for Python classes derived from LLMPromptGenerator and LLMTaskHandler?
    • Ideally, the LLMEngine would be fully implemented in C++ and maintain a list of std::shared_ptr<LLMPromptGenerator> and std::shared_ptr<LLMTaskHandler>. However, this may suffer from inheritance slicing and cause the Python portion to be dropped.
    • Theoretically, since the LLMEngine should outlive any of the generators/handlers, we could work around this issue by maintaining a reference to any of the objects passed in to add_prompt_generator/add_task_handler.
  • How do we handle asyncio calls from the LLMEngine?
    • Most of the LLM libraries have async versions of each method allowing for efficient dispatching of requests without blocking.
    • We should try to take advantage of this where possible to maximize efficiency with python implementations.

mdemoret-nv added this to the 23.11 - Sherlock milestone Sep 12, 2023
@mdemoret-nv
Contributor Author

After some initial prototyping, the above outstanding questions have been resolved:

  • While inheritance slicing is an issue, unlike our message classes, ownership of the task handlers and prompt generators is very clear. So we can ensure the lifetime of the Python objects by simply creating a pybind11 trampoline class and saving the Python object in add_task_handler and add_prompt_generator
  • From pybind11, it's possible to start a C++ thread that can be used as an asyncio loop, and if the LLMPromptGenerator or LLMTaskHandler returns a coroutine from try_handle, that coroutine can be run asynchronously via asyncio.run_coroutine_threadsafe
    • The big benefit here is that while we wait for the returned future to complete, we can call boost::this_fiber::yield(), which works perfectly to bridge the C++ async code with the Python async code (a Python-side sketch follows below)
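
A minimal Python-side sketch of this pattern is shown below (illustrative only; the C++/fiber half of the bridge is omitted, and the coroutine is a stand-in for one returned by an async try_handle):

import asyncio
import threading

# Run a dedicated asyncio loop on a background thread, the way the C++ engine
# thread would host one, and submit coroutines returned from try_handle to it.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()


async def try_handle_coro():
    # Stand-in for a coroutine returned by an async try_handle implementation
    await asyncio.sleep(0.1)
    return "handled"


future = asyncio.run_coroutine_threadsafe(try_handle_coro(), loop)

# In C++, waiting on this future would repeatedly call boost::this_fiber::yield()
# instead of blocking; here we simply block until the coroutine completes.
print(future.result())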

@mdemoret-nv
Contributor Author

Below is an example showing how all of the components would be used together to run a LangChain agent executor with the LLMEngine:

# Simple class which uses the sync implementation of try_handle but always fails
class AlwaysFailPromptGenerator(LLMPromptGenerator):

    def try_handle(self, engine: LLMEngine, task: LLMTask,
                   message: ControlMessage) -> typing.Optional[typing.Union[LLMGeneratePrompt, LLMGenerateResult]]:
        # Always return None to skip this generator
        return None


# Prompt generator wrapper around a LangChain agent executor
class LangChainAgentExecutorPromptGenerator(LLMPromptGenerator):

    def __init__(self, agent_executor: AgentExecutor):
        self._agent_executor = agent_executor

    async def try_handle(self, engine: LLMEngine, input_task: LLMTask,
                         input_message: ControlMessage) -> LLMGeneratePrompt | LLMGenerateResult | None:

        if (input_task["task_type"] != "dictionary"):
            return None

        input_keys = input_task["input_keys"]

        with input_message.payload().mutable_dataframe() as df:
            input_dict: list[dict] = df[input_keys].to_dict(orient="records")

        results = []

        for x in input_dict:
            # Await the result of the agent executor
            single_result = await self._agent_executor.arun(**x)

            results.append(single_result)

        return LLMGenerateResult(model_name=input_task["model_name"],
                                 model_kwargs=input_task["model_kwargs"],
                                 prompts=[],
                                 responses=results)


class SimpleTaskHandler(LLMTaskHandler):

    def try_handle(self, engine: LLMEngine, task: LLMTask, message: ControlMessage,
                   result: LLMGenerateResult) -> typing.Optional[list[ControlMessage]]:

        with message.payload().mutable_dataframe() as df:
            df["response"] = result.responses

        return [message]


# Create the NeMo LLM Service using our API key and org ID
llm_service = NeMoLLMService(api_key="my_api_key", org_id="my_org_id")

engine = LLMEngine(llm_service=llm_service)

# Create a LangChain agent executor using the NeMo LLM Service and specified tools
agent = initialize_agent(tools,
                         NeMoLangChainWrapper(engine.llm_service.get_client(model_name="gpt-43b-002")),
                         agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)

# Add 2 prompt generators to test fallback
engine.add_prompt_generator(AlwaysFailPromptGenerator())
engine.add_prompt_generator(LangChainAgentExecutorPromptGenerator(agent))

# Add our task handler
engine.add_task_handler(SimpleTaskHandler())

# Create a control message with a single task which uses the LangChain agent executor
message = ControlMessage()

message.add_task("llm_query", LLMDictTask(input_keys=["input"], model_name="gpt-43b-002").dict())

payload = cudf.DataFrame({
    "input": [
        "What is the product of the 998th, 999th and 1000th prime numbers?",
        "What is the product of the 998th and 1000th prime numbers?"
    ],
})
message.payload(MessageMeta(payload))

# Finally, run the engine
result = engine.run(message)

Another easier example showing simple templating with verification of the output:

class TemplatePromptGenerator(LLMPromptGenerator):

    def __init__(self, template: str):
        self._template = template

    def try_handle(self, engine: LLMEngine, input_task: dict,
                   input_message: ControlMessage) -> LLMGeneratePrompt | LLMGenerateResult | None:

        if (input_task["task_type"] != "template"):
            return None

        input_keys = input_task["input_keys"]

        with input_message.payload().mutable_dataframe() as df:
            input_dict: list[dict] = df[input_keys].to_dict(orient="records")

        prompts = [self._template.format(**x) for x in input_dict]

        return LLMGeneratePrompt(model_name=input_task["model_name"],
                                 model_kwargs=input_task["model_kwargs"],
                                 prompts=prompts)

class QualityCheckTaskHandler(LLMTaskHandler):

    def _check_quality(self, response: str) -> bool:
        # Some sort of check here
        pass

    def try_handle(self, engine: LLMEngine, task: LLMTask, message: ControlMessage,
                   result: LLMGenerateResult) -> typing.Optional[list[ControlMessage]]:

        # Loop over all responses and check if they pass the quality check
        passed_check = [self._check_quality(r) for r in result.responses]

        with message.payload().mutable_dataframe() as df:
            df["emails"] = result.responses

            if (not all(passed_check)):
                # Need to separate into 2 messages
                failed_check = [not x for x in passed_check]

                good_message = ControlMessage()
                good_message.payload(MessageMeta(df[passed_check]))

                bad_message = ControlMessage()
                bad_message.payload(MessageMeta(df[failed_check]))

                # Set a new llm_engine task on the bad message
                bad_message.add_task("llm_query",
                                     LLMDictTask(input_keys=["input"], model_name="gpt-43b-002").dict())

                return [good_message, bad_message]

            else:
                # Return a single message
                return [message]

llm_service = NeMoLLMService(api_key="my_api_key", org_id="my_org_id")

engine = LLMEngine(llm_service=llm_service)

# Add a templating prompt generator to convert our payloads into prompts
engine.add_prompt_generator(
    TemplatePromptGenerator(
        template=("Write a brief summary of the email below to use as a subject line for the email. "
                    "Be as brief as possible.\n\n{body}")))

# Add our task handler
engine.add_task_handler(QualityCheckTaskHandler())

# Create a control message with a single task which uses the templating prompt generator
message = ControlMessage()

message.add_task("llm_query", LLMDictTask(input_keys=["body"], model_name="gpt-43b-002").dict())

payload = cudf.DataFrame({
    "body": [
        "Email body #1...",
        "Email body #2...",
    ],
})
message.payload(MessageMeta(payload))

# Finally, run the engine
result = engine.run(message)

print(result)

@mdemoret-nv
Contributor Author

Fixed by #1292
