# Deploying LLMs in Production

With the rapidly growing use of LLMs, it is necessary to think about how to deploy them in production. So far, there have been two classes of LLM interfaces:

The first class covers use cases that rely on external LLM providers (such as OpenAI, Anthropic, etc.) and build a system around these services. For these deployments, the bulk of the compute takes place outside of what you are building, and your service focuses on the logic that goes around it. LangChain makes implementing this business logic easy: prompt templating, building chat messages, caching, building vector embedding databases, preprocessing, and so on.

The second class of systems is becoming more attractive because of privacy and cost concerns. If you build a scalable business with LLMs, the cost of querying LLM provider APIs will likely add up quickly and become a major expense for your product. Likewise, if your system handles sensitive data such as medical data, contracts, or documents, you may not want to send it to LLM providers. With the increasing quality of open-source models, self-hosting LLMs is a viable alternative in both cases. LangChain already provides integrations for interfacing with self-hosted models (e.g. Hugging Face), and you can leverage them to build your product.

In this notebook, we will cover the general concepts related to the deployment of LLMs. In particular:

- We will cover the general requirements for serving both self-hosted and externally hosted LLMs.
- We will talk about general solutions for reliability, availability, and flexibility.

## General Architecture and Requirements

The general architecture is simple. The server application can consist of different end-points that serve different needs. Each end-point can abstract a variety of Python logic and model inference. Each incoming request should be parsed and routed to the correct logic while the server remains able to serve other incoming requests. You need to incorporate some logic that routes each incoming request to the end-point that can serve it (e.g. translating French vs. Spanish may need different prompts, and therefore different end-points).
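
As a rough illustration of the routing idea above, here is a minimal sketch in plain Python. The names `PROMPTS`, `fake_llm`, `make_endpoint`, and `route`, as well as the request fields `target_lang` and `text`, are assumptions made for this example and are not part of LangChain or of any serving framework.

```python
from typing import Callable, Dict

# Hypothetical prompt templates, one per end-point.
PROMPTS: Dict[str, str] = {
    "translate_fr": "Translate the following text to French:\n{text}",
    "translate_es": "Translate the following text to Spanish:\n{text}",
}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (self-hosted or provider API)."""
    return f"<completion for: {prompt.splitlines()[0]}>"


def make_endpoint(template: str, llm: Callable[[str], str]) -> Callable[[dict], str]:
    """Build an end-point handler that fills a prompt template and calls the model."""

    def handler(request: dict) -> str:
        return llm(template.format(text=request["text"]))

    return handler


# One handler per end-point; each could wrap a different model or chain.
ENDPOINTS = {name: make_endpoint(tpl, fake_llm) for name, tpl in PROMPTS.items()}


def route(request: dict) -> str:
    """Pick the end-point from the request's target language and dispatch to it."""
    name = f"translate_{request['target_lang']}"
    if name not in ENDPOINTS:
        raise ValueError(f"no end-point for target language {request['target_lang']!r}")
    return ENDPOINTS[name](request)
```

Calling `route({"target_lang": "fr", "text": "hello"})` dispatches to the French end-point. In a real deployment the routing table would more likely live in the web framework or serving layer than in a module-level dictionary.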

Each end-point can be cloned several times to create replicas that can serve an increasing number of requests while remaining reliable in case of a failure. At the front of the server there is usually a load balancer that dispatches incoming (possibly heavy) request traffic to the end-points, and from there to the replicas of the logic.

In the case of deploying self-hosted models, it is very important to be able to dynamically scale up and down and allocate the right resources for each model to handle the request load while saving cost. Therefore, it is imperative to have an auto-scaling mechanism to provision replicas as the demand for your end-point changes. 



![Alt text](assets/deploy_llm/Langchain%20+%20ray%20joint%20docs.png)

## Load Balancing

The load balancer sits in front of the model instances and receives incoming requests from clients. The load balancer then determines which instance of the model should handle the request based on a set of rules or algorithms. These rules or algorithms take into account factors such as the current load on each instance, the geographic location of the client, and any other relevant factors.

Once the load balancer has determined which instance of the model should handle the request, it forwards the request to that instance. The model instance then processes the request and sends the result back to the load balancer, which in turn sends it back to the client. 
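
One of the simplest rules a load balancer can apply is "send the request to the instance with the fewest requests in flight". The sketch below illustrates only that rule; `Replica` and `LeastLoadedBalancer` are hypothetical names, and production load balancers also account for health checks, locality, and concurrency.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    in_flight: int = 0  # requests currently being processed

    def handle(self, request: str) -> str:
        # Stand-in for model inference on this replica.
        return f"{self.name} handled {request!r}"


class LeastLoadedBalancer:
    """Forward each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas: List[Replica]):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = replicas

    def pick(self) -> Replica:
        # min() returns the first minimum, so ties go to the earliest replica.
        return min(self.replicas, key=lambda r: r.in_flight)

    def dispatch(self, request: str) -> str:
        replica = self.pick()
        replica.in_flight += 1
        try:
            return replica.handle(request)
        finally:
            # Always release the slot, even if inference raised.
            replica.in_flight -= 1
```

Other rules, such as round-robin or routing by client location, could be swapped in by changing only `pick`.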

## Fault Tolerance

Application errors, such as exceptions in your model inference or business logic code, can cause your application to fail and stop serving traffic. Another source of failure is the machine running your application breaking down. One way to guard against both is to add redundancy by increasing the number of replicas and to provide a recovery mechanism for failed replicas. But a model replica is not the only point of failure, and you need sufficient tolerance to the different kinds of failures that can happen anywhere in the stack.
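
To make the redundancy idea concrete, the following sketch retries a request on the next replica when one fails. It covers only failures inside a replica, not failures elsewhere in the stack. All names here (`call_with_failover`, `AllReplicasFailed`, and the two example replicas) are made up for illustration.

```python
from typing import Callable, Sequence


class AllReplicasFailed(RuntimeError):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def broken_replica(request: str) -> str:
    # Simulates an application error during model inference.
    raise RuntimeError("simulated out-of-memory error during inference")


def healthy_replica(request: str) -> str:
    return f"ok: {request}"
```

With `[broken_replica, healthy_replica]` the request is served by the second replica. A recovery mechanism would additionally restart or replace the failed replica, which this sketch leaves out.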


## Managing Resources

If you do any compute-intensive logic in your application, you need a way to allocate the right amount of resources to each component. For example, if part of your traffic is served by an OpenAI end-point and another part by a self-hosted model, it is crucial to be able to express how many resources each one needs. A self-hosted model will need a few GPUs, while for querying external APIs a single CPU might be sufficient.

You also need to be able to scale the resource allocation up or down as a function of traffic. This is often referred to as auto-scaling and can have significant implications for the cost of running your application. A good auto-scaling system trades off cost against responsiveness: you do not want to sacrifice the responsiveness of your application to save costs, but you also do not want to over-provision resources and leave them idle.

There are many different strategies that could fit your traffic pattern, and it is crucial to be able to adapt them to your needs. For example, you may decide to spawn more machines as incoming traffic accumulates in your buffer, to keep latency under some threshold. Another option is to auto-scale based on external signals from other parts of your application that serve as a proxy for when to scale up or down (user logins, for example).
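
The buffer-based strategy can be written as a small policy function. This is a sketch under the assumption that queue length is the only scaling signal; the function name `desired_replicas` and its parameters are hypothetical, and real auto-scalers usually add smoothing and cool-down periods to avoid oscillating.

```python
import math


def desired_replicas(
    queued_requests: int,
    target_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return the replica count that keeps the per-replica backlog at or below the target.

    The result is clamped to [min_replicas, max_replicas]: the lower bound
    protects responsiveness, the upper bound caps cost.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(queued_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For instance, 23 queued requests with a target of 5 per replica yields 5 replicas, while an empty queue falls back to `min_replicas`. An external signal such as login counts could replace `queued_requests` without changing the structure.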


## Flexibility

Different LLM applications have different requirements, so it is crucial to work with a deployment system that provides enough flexibility when it comes to serving LLMs, similar to the composability of LangChain. Several key features tie back to flexibility.



### Model composition

To deploy systems like LangChain, it is crucial to be able to put together different models and connect them with some logic. For example, say you want to build a SQL query engine that takes natural language inputs. Being able to query an LLM and get the SQL command is only part of the system. You need to take the input query, extract metadata from the connected database, construct a prompt from which the LLM can derive the SQL query, possibly run the SQL query on a SQL engine, collect the response, keep feeding the response back to the LLM while the requested query runs, and then present the results to the user. Even in this simple example, you need to be able to quickly stitch together different complex components that are all built in Python and put them into a dynamic, complex chain of logical blocks that can be served together.
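
The pipeline described above can be sketched with the standard-library `sqlite3` module and a stubbed model. This is not LangChain's SQL chain: `get_schema`, `build_prompt`, `stub_llm`, and `answer` are hypothetical helpers, the stub always returns the same query, and the step that feeds results back to the LLM is omitted.

```python
import sqlite3
from typing import Callable, List, Tuple


def get_schema(conn: sqlite3.Connection) -> str:
    """Extract table definitions (the metadata) from the connected database."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def build_prompt(schema: str, question: str) -> str:
    """Combine the schema and the user's question into a prompt for the model."""
    return f"Schema:\n{schema}\n\nWrite a SQL query that answers: {question}\nSQL:"


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always returns the same query."""
    return "SELECT COUNT(*) FROM users"


def answer(
    conn: sqlite3.Connection, question: str, llm: Callable[[str], str]
) -> List[Tuple]:
    """Chain the steps: metadata -> prompt -> model -> SQL engine -> rows."""
    prompt = build_prompt(get_schema(conn), question)
    sql = llm(prompt)
    return conn.execute(sql).fetchall()


# Small in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
```

`answer(conn, "How many users are there?", stub_llm)` runs the whole chain end to end. Executing model-generated SQL directly is risky, so a real system would at least use a read-only connection and validate the query first.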

### Independent scalability of sub-components

If you are self-hosting your models, it is important to be able to scale them independently. For example, suppose you have two translation models, one fine-tuned for French and another for Spanish. Depending on the incoming requests, you may want to scale those two deployments independently.

### Batching requests 

LLMs can serve batches of queries efficiently. One way to leverage this efficiency is to accumulate incoming requests, run them through the model in batches, and then dispatch the results back to the clients. This can greatly increase the utilization of your resources and effectively reduce cost.
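
A common way to implement this is micro-batching: wait a short time for requests to accumulate, then run them through the model in one call. The `asyncio` sketch below assumes a hypothetical `batched_model` function and a made-up `MicroBatcher` class; serving frameworks typically provide their own batching mechanism, which should be preferred over hand-rolled code like this.

```python
import asyncio
from typing import List


def batched_model(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for a model that processes a whole batch in one call."""
    return [p.upper() for p in prompts]


class MicroBatcher:
    """Collect requests for a short window, then run them through the model together."""

    def __init__(self, max_batch_size: int = 8, max_wait_s: float = 0.01):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Called once per client request; resolves when the batch has been processed."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = batched_model([prompt for prompt, _ in batch])
            # Hand each result back to the request that is waiting for it.
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo() -> List[str]:
    batcher = MicroBatcher(max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    worker.cancel()
    return list(results)
```

Running `asyncio.run(demo())` submits three concurrent requests and returns their results in submission order. The two knobs, `max_batch_size` and `max_wait_s`, express the trade-off between throughput and added latency.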

## Serving platforms

Overall, a lot of complexity can arise when serving LLMs, even when you rely on LLM providers. It is important to know the trade-offs and what to look for when you are assessing serving frameworks. Here is a list of frameworks that can help you productionize your LLM applications.

- [Ray Serve](<link_to_ecosystem>)
- [BentoChain](<link_to_ecosystem>)
- [Modal](<link_to_ecosystem>)