> **NOTE**: This tutorial demonstrates how to use the GPU provided by Colab to run an LLM with a local Xinference server, and how to interact with the model in different ways (OpenAI-compatible endpoints, Xinference's built-in client, and LangChain).


# Xinference

Xorbits Inference (Xinference) is an open-source platform that streamlines the operation and integration of a wide array of AI models. With Xinference, you can run inference with open-source LLMs, embedding models, and multimodal models, either in the cloud or on your own premises, and build AI-driven applications on top of them.




* [Docs](https://inference.readthedocs.io/en/latest/index.html)
* [Built-in Models](https://inference.readthedocs.io/en/latest/models/builtin/index.html)
* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)
* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)
* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)


## Set up the environment

> **NOTE**: We recommend running this demo on a GPU. To change the runtime type, click **Runtime** > **Change runtime type** in the toolbar menu, then select the **T4** GPU.


### Check memory and GPU resources

In [9]:
import psutil
import torch
import subprocess  # used to run system commands

# Check system memory
ram = psutil.virtual_memory()
ram_total = ram.total / (1024**3)
print(f'RAM Total: {ram_total:.2f} GB')

# Check GPU information
print('\n============= GPU INFO =============')
if torch.cuda.is_available():
    try:
        # Run nvidia-smi (no explicit path needed; it is resolved via PATH)
        result = subprocess.run(
            ["nvidia-smi"],
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        print(result.stdout)
    except FileNotFoundError:
        print("The nvidia-smi tool is not installed")
    except subprocess.CalledProcessError as e:
        print(f"Execution error: {e.stderr}")
    except Exception as e:
        print(f"Unknown error: {str(e)}")
else:
    print('GPU NOT available, or the CUDA driver is not installed')

# Additional GPU details from PyTorch (optional)
if torch.cuda.is_available():
    print(f'\n[PyTorch] Available GPUs: {torch.cuda.device_count()}')
    for i in range(torch.cuda.device_count()):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
        print(f'Total GPU memory: {torch.cuda.get_device_properties(i).total_memory/1024**3:.2f} GB')


RAM Total: 12.67 GB

Mon Feb 17 05:24:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                           

In [10]:
import subprocess
import sys

try:
    import torch
except ImportError:
    print("PyTorch is not installed. Run this first: pip install torch")
    sys.exit(1)

def get_cuda_version_from_driver():
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True
        )
        driver_version = result.stdout.strip().split('\n')[0]

        # Parse the CUDA version from the full nvidia-smi output
        full_info = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            check=True
        )
        cuda_version = "unknown"
        for line in full_info.stdout.split('\n'):
            if "CUDA Version" in line:
                # The header line ends with "... CUDA Version: 12.4 |"
                cuda_version = line.split()[-2]
                break
        return driver_version, cuda_version
    except Exception as e:
        return f"Unable to get driver info ({str(e)})", "unknown"

# Main program
print("\n======= PyTorch version info =======")
print(f"PyTorch version: {torch.__version__}")

if torch.cuda.is_available():
    print("\n======= CUDA info =======")
    # CUDA version that this PyTorch build uses
    print(f"CUDA version used by PyTorch: {torch.version.cuda}")

    # Query the driver
    driver_ver, cuda_ver = get_cuda_version_from_driver()
    print(f"Highest CUDA version supported by the driver: {cuda_ver}")
    print(f"NVIDIA driver version: {driver_ver}")

    # Device details
    print("\n======= Device info =======")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory/1024**3:.1f} GB")
else:
    print("\n======= CUDA status =======")
    print("CUDA is not available")
    print("Possible causes:")
    print("1. The NVIDIA driver is not installed")
    print("2. The installed PyTorch build has no CUDA support (install with: pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cuXXX)")



PyTorch version: 2.5.1+cu124

CUDA version used by PyTorch: 12.4
Highest CUDA version supported by the driver: 12.4
NVIDIA driver version: 550.54.15

Current GPU: Tesla T4
Total GPU memory: 14.7 GB
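
The two CUDA versions printed above can also be compared programmatically. The sketch below is a conservative check, assuming that a PyTorch build works when the driver's highest supported CUDA version is at least the version PyTorch uses. `parse_version` and `driver_supports` are hypothetical helper names, not part of PyTorch or Xinference. Versions are compared as integer tuples, because a plain string comparison would order "12.10" before "12.4".

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '12.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(pytorch_cuda: str, driver_max_cuda: str) -> bool:
    """Return True if the driver's highest CUDA version covers PyTorch's."""
    return parse_version(pytorch_cuda) <= parse_version(driver_max_cuda)


# Values taken from the output above: both sides report 12.4.
print(driver_supports("12.4", "12.4"))
```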


### Install Xinference and dependencies

In [7]:
%pip install "xinference[vllm]"


In [9]:
%pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4

Looking in indexes: https://flashinfer.ai/whl/cu124/torch2.4
Collecting flashinfer
  Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.0.post1/flashinfer-0.2.0.post1%2Bcu124torch2.4-cp311-cp311-linux_x86_64.whl (420.9 MB)
INFO: pip is looking at multiple versions of flashinfer to determine which version is compatible with other requirements. This could take a while.
  Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.0/flashinfer-0.2.0%2Bcu124torch2.4-cp311-cp311-linux_x86_64.whl (418.6 MB)
  Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6%2Bcu124torch2.4-cp311-cp311-linux_x86_64.whl (1327.8 MB)

In [None]:
import subprocess

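# Note: subprocess.run blocks until xinference-local exits, so this cell
# keeps running for as long as the server is up.
result = subprocess.run(
    ['xinference-local', '--host', '0.0.0.0', '--port', '9997'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)

print("stdout:", result.stdout)
print("stderr:", result.stderr)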




In [1]:
%pip install -U -q xinference[transformers] openai langchain

In [2]:
!pip show xinference

Name: xinference
Version: 1.2.2
Summary: Model Serving Made Easy
Home-page: https://github.com/xorbitsai/inference
Author: Qin Xuye
Author-email: qinxuye@xprobe.io
License: Apache License 2.0
Location: /usr/local/lib/python3.11/dist-packages
Requires: aioprometheus, async-timeout, click, fastapi, gradio, huggingface-hub, modelscope, nvidia-ml-py, openai, passlib, peft, pillow, pydantic, python-jose, requests, setproctitle, sse-starlette, tabulate, timm, torch, tqdm, typing-extensions, uvicorn, xoscar
Required-by: 


## A Quick Start Demo
### Start Local Server


To start a local instance of Xinference, run `xinference-local` in the background via `nohup`:

In [3]:
!nohup xinference-local  > xinference.log 2>&1 &

Congrats! You now have Xinference running on the Colab machine. The default host and port are 127.0.0.1 and 9997, respectively.
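
Because the server is started in the background, a client call issued before it is listening will fail with a connection-refused error. Below is a minimal sketch of a readiness check, assuming the default address 127.0.0.1:9997. `wait_for_port` is a hypothetical helper, not part of Xinference, and it only confirms that the TCP port accepts connections, not that the server is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval)
    return False


# Example usage in a notebook cell (address assumed from the defaults above):
# if not wait_for_port("127.0.0.1", 9997):
#     print("Server did not come up; check xinference.log")
```

If the check times out, the `xinference.log` file written by the `nohup` command is the first place to look.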


Once Xinference is running, there are multiple ways to try it: via the web UI, via cURL, via the command line, or via Xinference's Python client.

The command line tool is `xinference`. You can list the commands that can be used by running:

In [4]:
!xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

  Xinference command-line interface for serving and deploying models.

Options:
  -v, --version       Show the current version of the Xinference tool.
  --log-level TEXT    Set the logger level. Options listed from most log to
                      (Default level is INFO)
  -H, --host TEXT     Specify the host address for the Xinference server.
  -p, --port INTEGER  Specify the port number for the Xinference server.
  --help              Show this message and exit.

Commands:
  cached         List all cached models in Xinference.
  cal-model-mem  calculate gpu mem usage with specified model size and...
  chat           Chat with a running LLM.
  engine         Query the applicable inference engine by model name.
  generate       Generate text using a running LLM.
  launch         Launch a model with the Xinference framework with the...
  list           List all running models in Xinference.
  login          Login when the cluster is authen

### Run Qwen-Chat

Xinference supports a variety of LLMs. Learn more at https://inference.readthedocs.io/en/latest/models/builtin/.

Let’s start by running a built-in model: `Qwen-1_8B-Chat`.
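
Before launching, it can help to sanity-check that the model fits on the GPU. The following is a back-of-the-envelope sketch, assuming 16-bit weights (2 bytes per parameter) and counting weights only. It ignores activations, the KV cache, and framework overhead, so real usage will be higher. `estimate_weight_memory_gib` is a hypothetical helper and is unrelated to the `cal-model-mem` command.

```python
def estimate_weight_memory_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough memory needed for the model weights alone, in GiB."""
    return num_params_billion * 1e9 * bytes_per_param / (1024 ** 3)


weights_gib = estimate_weight_memory_gib(1.8)  # Qwen 1.8B, assumed fp16 weights
gpu_gib = 15360 / 1024  # 15360 MiB reported by nvidia-smi for the T4
print(f"Estimated weights: {weights_gib:.2f} GiB of {gpu_gib:.1f} GiB GPU memory")
```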


We can specify the model's UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate one. The following command creates a new model instance with the unique ID `my-llm`:


In [5]:
!xinference launch -u my-llm -n qwen-chat -s 1_8 -f pytorch -en transformers

Launch model name: qwen-chat with kwargs: {}
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/urllib3/connectionpool.py", line 493, in _make_request
    conn.request(
  File "/usr/local/lib/python3.11/dist-packag

When you start a model for the first time, Xinference downloads the model parameters from Hugging Face, which might take a few minutes depending on the size of the model weights. Xinference caches the model files locally, so there is no need to download them again on subsequent starts.
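
The launch command can also be assembled and run from Python, which makes a failed launch easier to detect than a `!` shell cell. This is a sketch using only the flags that appear in this tutorial (`-u`, `-n`, `-s`, `-f`, `-en`); `build_launch_command` is a hypothetical helper, not an Xinference API.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    model_name: str,
    size: str,
    model_format: str,
    engine: str,
    model_uid: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `xinference launch`."""
    cmd = ["xinference", "launch", "-n", model_name, "-s", size,
           "-f", model_format, "-en", engine]
    if model_uid is not None:
        # Without -u, Xinference generates a UID itself.
        cmd += ["-u", model_uid]
    return cmd


cmd = build_launch_command("qwen-chat", "1_8", "pytorch", "transformers", model_uid="my-llm")
print(shlex.join(cmd))

# To actually launch (requires a running server):
# import subprocess
# subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```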


## Interact with the running model

Congrats! You now have the model running in Xinference. Once the model is running, we can try it out through the OpenAI-compatible endpoint, via cURL, via Xinference's Python client, or via LangChain:



### 1. Use the OpenAI-compatible endpoint

Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:


In [None]:
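import openai

# Conversation in the OpenAI chat format
messages = [
    {
        "role": "user",
        "content": "Who are you?"
    }
]

# Point the OpenAI client at the local Xinference server
client = openai.Client(api_key="empty", base_url="http://127.0.0.1:9997/v1")
client.chat.completions.create(
    model="my-llm",  # the model UID chosen at launch
    messages=messages,
)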

ChatCompletion(id='chatff300498-10d7-11ef-97ae-0242ac1c000c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="As an AI language model, I don't have personal experiences or feelings, but I'm designed to assist and provide information on various topics. How can I help you today?", role='assistant', function_call=None, tool_calls=None))], created=1715570535, model='my-llm', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=35, prompt_tokens=23, total_tokens=58))
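
In practice you usually want only the reply text rather than the whole `ChatCompletion` object. Here is a small sketch, assuming a client created as in the cell above; `ask` is a hypothetical helper, and it reads the reply from `response.choices[0].message.content`, the same field visible in the printed output.

```python
def ask(client, model_uid: str, question: str) -> str:
    """Send a single user message and return the assistant's reply text."""
    response = client.chat.completions.create(
        model=model_uid,
        messages=[{"role": "user", "content": question}],
    )
    # The first choice holds the assistant message.
    return response.choices[0].message.content


# Example usage (requires the running server and the client from above):
# print(ask(client, "my-llm", "Who are you?"))
```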

### 2. Send a request using cURL




In [None]:
!curl -k -X 'POST' -N \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "model": "my-llm", "messages": [ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is the largest animal?"} ]}'

{"id":"chat07fe7c94-10d8-11ef-88d5-0242ac1c000c","object":"chat.completion","created":1715570550,"model":"my-llm","choices":[{"index":0,"message":{"role":"assistant","content":"The largest animal in the world is the blue whale (Balaenoptera musculus). The blue whale can grow up to 100 feet long and weigh as much as 200 tons. It is the deepest-diving mammal in the world, capable of diving to depths of over 35,000 feet without using its breathing tube. Despite its massive size, the blue whale is relatively small compared to other marine animals, and it spends most of its life in the open ocean."},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":105,"total_tokens":130}}
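
The same request can be sent from Python without extra dependencies. This sketch uses only the standard library and assumes the server address and model UID used in this tutorial; `build_chat_request` is a hypothetical helper that mirrors the URL, headers, and JSON body of the cURL call.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model_uid: str, messages: list) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = json.dumps({"model": model_uid, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://127.0.0.1:9997",
    "my-llm",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"},
    ],
)

# To send it (requires the running server):
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```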

### 3. Use Xinference's Python client

In [None]:
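from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llm")  # look up the running model by its UID
# Note: this call uses the prompt/chat_history form of chat(). Newer Xinference
# releases may expect messages=[...] instead, so check the client docs for your version.
model.chat(
    prompt="hello",
    chat_history=[
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
)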

{'id': 'chat1200b1f8-10d8-11ef-88d5-0242ac1c000c',
 'object': 'chat.completion',
 'created': 1715570567,
 'model': 'my-llm',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'Hello! How can I assist you today?'},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 31, 'completion_tokens': 9, 'total_tokens': 40}}
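
The client returns a plain dictionary in the chat-completion shape shown above, so the reply and token counts can be pulled out with ordinary indexing. This is a sketch; `summarize_completion` is a hypothetical helper, and the sample dictionary is copied from the output above.

```python
def summarize_completion(result: dict) -> dict:
    """Extract the reply text, finish reason, and total token count."""
    choice = result["choices"][0]
    usage = result.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


sample = {
    "id": "chat1200b1f8-10d8-11ef-88d5-0242ac1c000c",
    "object": "chat.completion",
    "created": 1715570567,
    "model": "my-llm",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 31, "completion_tokens": 9, "total_tokens": 40},
}

summary = summarize_completion(sample)
print(summary["text"])
```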

### 4. LangChain integration

In [None]:
from langchain.llms import Xinference
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = Xinference(server_url='http://127.0.0.1:9997', model_uid='my-llm')

template = 'What is the largest {kind} on the earth?'

prompt = PromptTemplate(template=template, input_variables=['kind'])

llm_chain = LLMChain(prompt=prompt, llm=llm)

generated = llm_chain.run(kind='plant')
print(generated)

  warn_deprecated(
  warn_deprecated(


?
The answer is:
The tallest plant on Earth is a tree named Ginkgo biloba. It stands at a height of approximately 83 meters (276 feet) and has roots that reach deep into the ground. Ginkgo biloba is native to China but has been planted in gardens around the world as well. It is known for its beautiful, delicate leaves and its ability to survive even in harsh environments. Despite being over 100 years old, it continues to grow and thrive today.
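
To see what text the chain sends to the model, the template can be rendered by hand. This sketch uses plain `str.format` rather than LangChain, on the assumption that `PromptTemplate` fills simple `{name}` placeholders the same way; `render_prompt` is a hypothetical helper.

```python
def render_prompt(template: str, **variables: str) -> str:
    """Fill {name} placeholders in the template with the given values."""
    return template.format(**variables)


template = "What is the largest {kind} on the earth?"
rendered = render_prompt(template, kind="plant")
print(rendered)
```

Because the model receives this rendered string as raw text to continue, its completion can begin with stray punctuation, as in the output above.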
