# Deploying DeepSeek-LLM-7B-Chat using vLLM

vLLM is an open-source library designed to deliver high throughput and low latency for large language model (LLM) inference. It optimizes text generation workloads by efficiently batching requests and making full use of GPU resources, empowering developers to manage complex tasks like code generation and large-scale conversational AI.

This tutorial guides you through setting up and running vLLM on AMD Instinct™ GPUs using the ROCm software stack. Learn how to configure your environment, containerize your workflow, and send test queries to the vLLM-supported inference server.

## Deploying the LLM using vLLM

The steps below deploy the LLM (`deepseek-ai/deepseek-llm-7b-chat`) with vLLM from within the Jupyter notebook environment:

### Start the vLLM server 

Open a new tab in this Jupyter server, click the terminal icon to open a new terminal, then copy and run the following command to launch the vLLM server:

```bash
HIP_VISIBLE_DEVICES=0 vllm serve /home/user/Models/deepseek-ai/deepseek-llm-7b-chat \
        --gpu-memory-utilization 0.9 \
        --swap-space 16 \
        --disable-log-requests \
        --dtype float16 \
        --max-model-len 2048 \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 3000 \
        --num-scheduler-steps 10 \
        --max-num-seqs 128 \
        --max-num-batched-tokens 2048 \
        --distributed-executor-backend "mp"
```

Once the server has started successfully, the terminal displays `INFO:     Application startup complete.`.

**Note**: In a multi-GPU environment, set `HIP_VISIBLE_DEVICES=x`, where `x` is the index of your preferred GPU, to control which device the LLM is deployed on.
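
Loading the model can take a while, so instead of watching the terminal for the startup message you may prefer to poll the server from the notebook. The following is a minimal sketch using only the Python standard library. The helper names `server_is_up` and `wait_for_server` are hypothetical, and the sketch assumes the server answers `GET /v1/models` with HTTP 200 once it is ready, which is the usual behavior of an OpenAI-compatible endpoint but is not stated in this tutorial.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_server(url, attempts=60, delay=5.0, probe=server_is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` probes have failed.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

Calling `wait_for_server("http://localhost:3000/v1/models")` from a notebook cell would then return `True` as soon as the server responds, or `False` after roughly five minutes with the default settings.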

### Start the client

With the server running as described above, run the following code to send a test request from the client:

```python
import requests

# The port must match the --port value passed to `vllm serve`.
url = "http://localhost:3000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    # The model name is the path the server was launched with.
    "model": "/home/user/Models/deepseek-ai/deepseek-llm-7b-chat",
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."
        },
        {
            "role": "user",
            "content": "Explain the concept of AI Agents."
        }
    ],
    "stream": False,
    "max_tokens": 128
}

# Send the chat completion request and print the decoded JSON response.
response = requests.post(url, headers=headers, json=data)
print(response.json())
```


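If the server is not running yet, or the port in the URL does not match the server's port, the request fails with a `ConnectionError` like this one:

```bash
ConnectionError: HTTPConnectionPool(host='localhost', port=3000): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe053f66660>: Failed to establish a new connection: [Errno 111] Connection refused'))
```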

**Note**: Make sure the `--port` value passed to the server (**3000**) matches the port in the URL, for instance, http://localhost:**3000**. If the port is already used by another application, choose a different number and update it in both places.
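
To see whether a port is already taken before launching the server, a quick TCP probe from the notebook can help. This is a minimal sketch, not part of vLLM: the helper name `port_in_use` is hypothetical, and it only detects listeners that accept connections on the given host (assumed here to be `127.0.0.1`).

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(3000)` returning `True` before you start the server suggests that another application holds the port, so a different `--port` value is the safer choice.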

If the connection is successful, the output looks like this:

``` bash
{'id': 'chatcmpl-3ba8e0bf51524fffa686d7b67c4e9b6b', 'object': 'chat.completion', 'created': 1751455990, 'model': '/home/user/Models/deepseek-ai/deepseek-llm-7b-chat', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'An AI agent is a software program that learns and interacts with its environment to achieve specific goals or tasks. It is designed to make decisions and take actions based on the information it receives from the environment, and it uses machine learning algorithms to improve its performance over time. AI agents can be used in a variety of applications, such as game playing, robotics, natural language processing, and virtual assistants...}
```
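
The response is a nested dictionary, and the generated text sits under `choices[0]["message"]["content"]`, as the output above shows. The sketch below pulls out just that text. The helper name `extract_reply` is hypothetical, and `sample` is an abbreviated stand-in for a real response, not actual server output.

```python
def extract_reply(payload):
    """Return the assistant's text from a chat completion response dict."""
    choices = payload.get("choices") or []
    if not choices:
        # Error responses typically carry no "choices" key.
        raise ValueError(f"no choices in response: {payload!r}")
    return choices[0]["message"]["content"]


# Abbreviated stand-in for the dictionary returned by response.json().
sample = {
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "An AI agent is ..."}}
    ],
}
print(extract_reply(sample))
```

In the client cell, `print(extract_reply(response.json()))` would print only the model's answer instead of the full dictionary.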