## OpenLLM


With OpenLLM, you can run inference with any open-source large language model (LLM), deploy to the cloud or on-premises, and build powerful AI apps.

🚂 SOTA LLMs: built-in support for a wide range of open-source LLMs and model runtimes, including StableLM, Falcon, Dolly, Flan-T5, ChatGLM, StarCoder, and more.

🔥 Flexible APIs: serve LLMs over a RESTful API or gRPC with one command, and query via the Web UI, the CLI, our Python/JavaScript clients, or any HTTP client.

⛓️ Freedom To Build: First-class support for LangChain and BentoML allows you to easily create your own AI apps by composing LLMs with other models and services.

🎯 Streamline Deployment: automatically generate Docker images for your LLM server, or deploy it as a serverless endpoint via ☁️ BentoCloud.

🤖️ Bring your own LLM: Fine-tune any LLM to suit your needs with LLM.tuning(). (Coming soon)



https://github.com/bentoml/OpenLLM

In [None]:
!pip install openllm
!pip install mlflow
!pip install pyngrok

In [None]:
from pyngrok import ngrok

# Kill any running ngrok process, which closes tunnels left open by a previous run
ngrok.kill()

In [None]:
import os
# Read the token from the environment instead of hard-coding a secret in the notebook
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]
ngrok.set_auth_token(NGROK_AUTH_TOKEN)



In [None]:
# Port 3000 is where `openllm start` serves its HTTP API and Web UI
ngrok_tunnel = ngrok.connect(addr="3000", proto="http", bind_tls=True)
print("OpenLLM server URL:", ngrok_tunnel.public_url)

OpenLLM server URL: https://e749-35-187-254-110.ngrok-free.app


In [None]:
! openllm -h

Usage: openllm [OPTIONS] COMMAND [ARGS]...

   ██████╗ ██████╗ ███████╗███╗   ██╗██╗     ██╗     ███╗   ███╗
  ██╔═══██╗██╔══██╗██╔════╝████╗  ██║██║     ██║     ████╗ ████║
  ██║   ██║██████╔╝█████╗  ██╔██╗ ██║██║     ██║     ██╔████╔██║
  ██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██║     ██║     ██║╚██╔╝██║
  ╚██████╔╝██║     ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
   ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝╚══════╝╚═╝     ╚═╝

  An open platform for operating large language models in production.
  Fine-tune, serve, deploy, and monitor any LLMs with ease.

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  build            Package a given models...
  download-models  Setup LLM interactively.
  models           List all supported models.
  prune            Remove all saved models...
  query            Ask a LLM interactively,...
  start            Start any LLM as a REST...
  start-grpc       Start any LLM as a gRPC...


## Starting an LLM Server
To start an LLM server, use openllm start. For example, to start a dolly-v2 server:

    openllm start dolly-v2

Following this, a Web UI will be accessible at http://localhost:3000, where you can experiment with the endpoints and sample input prompts.
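
Loading model weights can take a while, so the address above may not respond the moment the command is issued. Here is a minimal sketch of a readiness check, assuming only that the server answers plain HTTP GET requests at that address. The helper name wait_until_ready and its timeout values are illustrative and not part of OpenLLM.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url, timeout=300.0, interval=2.0):
    """Poll `url` until it answers with HTTP 2xx or `timeout` seconds pass.

    Hypothetical helper, not part of OpenLLM.
    """
    # Bypass any configured HTTP proxy, since the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=5) as response:
                if 200 <= response.status < 300:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out, or an HTTP error status: retry.
            pass
        time.sleep(interval)
    return False
```

Calling wait_until_ready("http://localhost:3000") before creating a client avoids connection errors while the model is still loading.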

OpenLLM provides a built-in Python client, allowing you to interact with the model. In a different terminal window or a Jupyter notebook, create a client to start interacting with the model:

    >>> import openllm
    >>> client = openllm.client.HTTPClient('http://localhost:3000')
    >>> client.query('Explain to me the difference between "further" and "farther"')

You can also use the openllm query command to query the model from the terminal:

    export OPENLLM_ENDPOINT=http://localhost:3000
    openllm query 'Explain to me the difference between "further" and "farther"'

Visit http://localhost:3000/docs.json for OpenLLM's API specification.
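
The specification at /docs.json can also be read programmatically. The sketch below assumes the document follows the OpenAPI layout, with a top-level "paths" object mapping each route to its HTTP methods. list_endpoints and fetch_spec are hypothetical helpers, and the sample spec is made up for illustration rather than copied from a real server.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI-style spec dict."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_spec(base_url="http://localhost:3000"):
    """Download and parse the API specification from a running server."""
    with urllib.request.urlopen(f"{base_url}/docs.json", timeout=10) as response:
        return json.load(response)


# Illustrative input only; a real server's spec will list different routes.
sample_spec = {
    "paths": {
        "/example": {"post": {}, "parameters": []},
        "/metrics": {"get": {}},
    }
}
print(list_endpoints(sample_spec))
```

With a server running, list_endpoints(fetch_spec()) lists the routes that the server actually exposes.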

Users can also specify different variants of the model to be served by providing the --model-id argument, e.g.:

    openllm start flan-t5 --model-id google/flan-t5-large

Use the openllm models command to see the list of models and their variants supported in OpenLLM.
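
When the server is launched from Python rather than typed by hand, the same command line can be assembled from the model name and an optional variant. This is a small sketch: build_start_command is a hypothetical helper, and it relies only on the start subcommand and the --model-id argument shown above.

```python
import shlex


def build_start_command(model_name, model_id=None):
    """Return the argv list for `openllm start`, optionally pinning a variant."""
    command = ["openllm", "start", model_name]
    if model_id is not None:
        command += ["--model-id", model_id]
    return command


argv = build_start_command("flan-t5", model_id="google/flan-t5-large")
print(shlex.join(argv))  # prints: openllm start flan-t5 --model-id google/flan-t5-large
```

Keeping the command as a list of arguments avoids shell-quoting problems if a model ID ever contains unusual characters.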



In [None]:
!openllm start dolly-v2

A new version of the following files was downloaded from https://huggingface.co/databricks/dolly-v2-3b:
- instruct_pipeline.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
2023-06-15T21:27:29+0000 [INFO] [cli] Environ for worker 0: set CUDA_VISIBLE_DEVICES to 0
2023-06-15T21:27:29+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-06-15T21:27:30+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-06-15T21:28:46+0000 [INFO] [api_server:llm-dolly-v2-service:2] 136.226.49.3:0 (scheme=https,method=GET,path=/,type=,length=) (status=200,type=text/html; charset=utf-8,length=2859) 0.506ms (trace=9430b4af100ceea75bb222a865f604a0,span=8ce872da8273e89b,sampled=1,service.name=llm-dolly-v2-service)
2023-06-15T21:28:46+0000 [IN
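
Because openllm start keeps running until it is interrupted, launching it with ! occupies the cell, and the cells below cannot run in the same notebook until it stops. One possible workaround, sketched here with Python's standard subprocess module, is to start the server as a background process and send its logs to a file. The helper names and the log file name are illustrative assumptions, not part of OpenLLM.

```python
import subprocess


def start_in_background(command, log_path):
    """Launch `command` without blocking and append its output to `log_path`."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    # The child holds its own copy of the file descriptor, so ours can be closed.
    log_file.close()
    return process


def stop(process, grace_seconds=10.0):
    """Ask the process to exit, and kill it if it does not stop in time."""
    if process.poll() is None:
        process.terminate()
        try:
            process.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
    return process.returncode


# In the notebook this would be:
#   server = start_in_background(["openllm", "start", "dolly-v2"], "openllm.log")
#   ...run the client cells...
#   stop(server)
```

The log file can then be inspected with !tail openllm.log while the server keeps running.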

In [None]:
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query("Explain to me the difference between further and farther")

In [None]:
# `!export` runs in a throwaway subshell, so use the %env magic to set the variable for later cells
%env OPENLLM_ENDPOINT=http://localhost:3000
!openllm query 'Explain to me the difference between "further" and "farther"'