<a href="https://colab.research.google.com/github/nhut-ngnn/DeepSeek_Inferencce/blob/master/deepseek_r1_distill_qwen_fast_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Running DeepSeek R1 Distill Qwen 1.5B (AWQ) with vLLM and FastAPI

### Install Required Packages

In [None]:
!pip install fastapi nest-asyncio pyngrok uvicorn
!pip install vllm  # optionally use `!pip install --quiet vllm` instead if you don't want to be prompted to restart the runtime

Collecting fastapi
  Downloading fastapi-0.115.11-py3-none-any.whl.metadata (27 kB)
Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Collecting uvicorn
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting starlette<0.47.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.46.1-py3-none-any.whl.metadata (6.2 kB)
Downloading fastapi-0.115.11-py3-none-any.whl (94 kB)
Downloading pyngrok-7.2.3-py3-none-any.whl (23 kB)
Downloading uvicorn-0.34.0-py3-none-any.whl (62 kB)
Downloading starlette-0.46.1-py3-none-any.whl (71 kB)
Installing collected packages: uvicorn, pyngrok, s

### Load and Run Model in the background



In [None]:
import subprocess
model = 'jakiAJK/DeepSeek-R1-Distill-Qwen-1.'

vllm_process = subprocess.Popen([
    'vllm',
    'serve',
    model,
    '--trust-remote-code',
    '--dtype', 'half',
    '--max-model-len', '16384',
    '--enable-chunked-prefill', 'false',
    '--tensor-parallel-size', '1'
], stdout=subprocess.PIPE, stderr=subprocess.PIPE, start_new_session=True)
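
The launch above pipes both stdout and stderr. If nothing reads those pipes for a while, a chatty child process can fill the OS pipe buffer and stall. One possible alternative, sketched below, is to send the output to a log file instead. The helper name `launch_with_log` and the log path are assumptions made for illustration, and the demo uses a harmless stand-in command rather than `vllm serve`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_with_log(command, log_path):
    """Start `command` in the background, appending stdout and stderr to `log_path`.

    Hypothetical helper, not part of vLLM: writing to a file means the child
    never blocks on a full pipe, and the log can be inspected at any time.
    """
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdout=log_file,
        stderr=subprocess.STDOUT,  # merge stderr into the same log
        start_new_session=True,
    )
    log_file.close()  # the child keeps its own copy of the file descriptor
    return process


# Demo with a stand-in command instead of `vllm serve ...`
demo_log = Path(tempfile.gettempdir()) / "launch_with_log_demo.log"
demo_log.write_bytes(b"")  # start from an empty log
demo = launch_with_log([sys.executable, "-c", "print('server starting')"], demo_log)
demo.wait()
print(demo_log.read_text())
```

With this approach the monitoring cell would read the log file rather than `vllm_process.stdout`, so it is a trade-off rather than a drop-in replacement.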

### Check and Test vLLM


In [None]:
import requests
import time
from typing import Tuple
import sys

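def check_vllm_status(url: str = "http://localhost:8000/health") -> bool:
    """Check if the vLLM server is running and healthy."""
    try:
        # A timeout keeps the probe from hanging if the server accepts the
        # connection but is not ready to answer yet.
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False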

def monitor_vllm_process(vllm_process: subprocess.Popen, check_interval: int = 5) -> Tuple[bool, str, str]:
    """
    Monitor VLLM process and return status, stdout, and stderr.
    Returns: (success, stdout, stderr)
    """
    print("Starting VLLM server monitoring...")

    while vllm_process.poll() is None:
        if check_vllm_status():
            print("✓ VLLM server is up and running!")
            return True, "", ""

        print("Waiting for VLLM server to start...")
        time.sleep(check_interval)

        if vllm_process.stdout.readable():
            stdout = vllm_process.stdout.read1().decode('utf-8')
            if stdout:
                print("STDOUT:", stdout)

        if vllm_process.stderr.readable():
            stderr = vllm_process.stderr.read1().decode('utf-8')
            if stderr:
                print("STDERR:", stderr)

    stdout, stderr = vllm_process.communicate()
    return False, stdout.decode('utf-8'), stderr.decode('utf-8')
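
The monitoring loop above keeps polling for as long as the process is alive, so a server that hangs during startup would be waited on indefinitely. A minimal sketch of a polling helper with an overall deadline follows. The name `wait_until_ready` and its parameters are assumptions for illustration; `check` can be any zero-argument callable, such as `check_vllm_status`.

```python
import time


def wait_until_ready(check, timeout=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    Returns True if the check succeeded, False if the deadline passed first.
    `sleep` and `clock` are injectable so the helper is easy to test.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a fake check that succeeds on the third attempt
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout=1.0, interval=0.01))
```

In the notebook this could be called as `wait_until_ready(check_vllm_status, timeout=600)`, where the timeout value is a guess that depends on model download and load time.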

In [None]:
try:
    success, stdout, stderr = monitor_vllm_process(vllm_process)

    if not success:
        print("\n❌ VLLM server failed to start!")
        print("\nFull STDOUT:", stdout)
        print("\nFull STDERR:", stderr)
        sys.exit(1)

except KeyboardInterrupt:
    print("\n⚠️ Monitoring interrupted by user")
    # Interrupting only stops the monitoring loop, not the vLLM server.
    # Uncomment the lines below to stop the server as well.
    # vllm_process.terminate()
    # try:
    #     vllm_process.wait(timeout=5)
    # except subprocess.TimeoutExpired:
    #     vllm_process.kill()

    # Note: communicate() blocks until the vLLM process exits.
    stdout, stderr = vllm_process.communicate()
    if stdout: print("\nFinal STDOUT:", stdout.decode('utf-8'))
    if stderr: print("\nFinal STDERR:", stderr.decode('utf-8'))
    sys.exit(0)

Starting VLLM server monitoring...
Waiting for VLLM server to start...
STDOUT: INFO 03-15 14:58:07 __init__.py:207] Automatically detected platform cuda.

STDERR: 2025-03-15 14:57:58.503791: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742050678.800372    2207 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742050678.891522    2207 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-15 14:57:59.527630: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild Te

### Define the Request Helpers and Run a Test

In [None]:
import json

import requests
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

class QuestionRequest(BaseModel):
    question: str


def ask_model(question: str):
    """
    Sends a request to the model server and fetches a response.
    """
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question
            }
        ]
    }

    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()
    return response.json()

result = ask_model("What is the capital of France?")
print(json.dumps(result, indent=2))

def stream_llm_response(question:str):
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": True  # 🔥 Enable streaming
    }

    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            if line:
                # OpenAI-style streaming responses are prefixed with "data: ".
                # Strip only that leading prefix, not matches inside the payload.
                decoded_line = line.decode("utf-8").removeprefix("data: ")
                yield decoded_line + "\n"

{
  "id": "chatcmpl-9aa06ff1f7a84d57b4583ab0e543b49f",
  "object": "chat.completion",
  "created": 1742050810,
  "model": "jakiAJK/DeepSeek-R1-Distill-Qwen-1.5B_AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\n\n</think>\n\nThe capital of France is Paris.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 22,
    "completion_tokens": 12,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
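
As the output above shows, the `content` field carries the model's reasoning inside `<think>...</think>` tags, followed by the final answer. A small helper can separate the two. This is a sketch only: `split_reasoning` is a hypothetical name, and it assumes at most one `<think>` block at the start of the text.

```python
def split_reasoning(content: str):
    """Split an R1-style completion into (reasoning, answer).

    Assumes at most one <think>...</think> block; if the tags are missing,
    the whole text is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = content.find(open_tag)
    end = content.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", content.strip()
    reasoning = content[start + len(open_tag):end].strip()
    answer = content[end + len(close_tag):].strip()
    return reasoning, answer


# Content string copied from the response above
sample = "<think>\n\n</think>\n\nThe capital of France is Paris."
reasoning, answer = split_reasoning(sample)
print(repr(reasoning))
print(answer)
```

With the `result` returned by `ask_model`, the input would be `result["choices"][0]["message"]["content"]`.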


### Define the FastAPI App and Routes

In [None]:
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import nest_asyncio
from pyngrok import ngrok, conf
import uvicorn
import getpass

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

@app.get('/')
async def root():
    return {'hello': 'world'}

@app.post("/api/v1/generate-response")
def generate_response(request: QuestionRequest):
    """
    API endpoint to generate a response from the model.
    """
    try:
        response = ask_model(request.question)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/api/v1/generate-response-stream")
def stream_response(request: QuestionRequest):
    """
    API endpoint to stream a response from the model.
    """
    try:
        response = stream_llm_response(request.question)
        return StreamingResponse(response, media_type="text/event-stream")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
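
The streaming endpoint forwards one chunk per line. A client would typically want only the generated text from each chunk. The sketch below assumes the chunks follow the OpenAI-style streaming shape (`choices[0].delta.content`, ending with a `[DONE]` sentinel). The helper name `extract_stream_text` is hypothetical, and the sample lines are hand-written for illustration rather than captured from the server.

```python
import json


def extract_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming chunk lines.

    Each line is expected to be a JSON chunk, optionally still prefixed with
    "data: ". Blank lines and the "[DONE]" sentinel are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if line.startswith("data: "):
            line = line[len("data: "):]
        if not line or line == "[DONE]":
            continue
        chunk = json.loads(line)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample chunks, not real server output
sample_lines = [
    '{"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    '{"choices": [{"index": 0, "delta": {"content": "Par"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "is"}}]}',
    "[DONE]",
]
print("".join(extract_stream_text(sample_lines)))
```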

### ngrok Authentication

In [None]:
! ngrok config add-authtoken <add your token>

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


### Start the ngrok Tunnel and the Service


In [None]:
port = 8081
public_url = ngrok.connect(port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{port}\"")

 * ngrok tunnel "https://b988-35-204-31-160.ngrok-free.app" -> "http://127.0.0.1:8081"


In [None]:
nest_asyncio.apply()
uvicorn.run(app, port=port)

INFO:     Started server process [312]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8081 (Press CTRL+C to quit)


INFO:     116.110.42.60:0 - "GET / HTTP/1.1" 200 OK
INFO:     116.110.42.60:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     116.110.42.60:0 - "GET /api/v1/generate-response HTTP/1.1" 405 Method Not Allowed


### Example curl Request

In [None]:
curl --location 'https://b988-35-204-31-160.ngrok-free.app/api/v1/generate-response' \
--header 'Content-Type: application/json' \
--data '{
    "question": "Add the question in here"
}'
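
The same request can be made from Python. The sketch below uses only the standard library and builds the request without sending it. The helper name `build_generate_request` and the URL are placeholders; substitute the public URL printed by ngrok. Note that the endpoint accepts only POST, which is why the GET request in the server log above returned 405.

```python
import json
import urllib.request


def build_generate_request(base_url: str, question: str) -> urllib.request.Request:
    """Build a POST request for the /api/v1/generate-response endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/generate-response",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace with the tunnel URL from the ngrok cell
request = build_generate_request(
    "https://example.ngrok-free.app", "What is the capital of France?"
)
print(request.get_method(), request.full_url)

# To actually send it (requires the tunnel and the server to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```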

#### Example Full Response

In [None]:
{
  "id": "chatcmpl-680bc07cd6de42e7a00a50dfbd99e833",
  "object": "chat.completion",
  "created": 1738129381,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, so I'm trying to find out what the capital of France is. Hmm, I remember hearing a few cities named after the myths or something. Let me think. I think Neuch portfolio is where the comma was named. Yeah, that's right, until sometimes they changed it, but I think it's still there now. Then there's Charles-de-Lorraine. I've seen that name written before in various contexts, maybe managers or something. And then I think there's Saint Mal\u25e6e as a significant city in France. Wait, I'm a bit confused about the last one. Is that the capital or somewhere else? I think the capital blew my mind once, and I still don't recall it. Let me think of the names that come to mind. Maybe Paris? But is there something else? I've heard about places likequalification, Guiness, and Agoura also named after mythological figures, but are they capitals? I don't think so. So among the prominent ones, maybe Neuch portfolio, Charles-de-Lorraine, and Saint Mal\u25e6e are the names intended for the capital, but I'm unsure which one it is. Wait, I think I might have confused some of them. Let me try to look up the actual capital. The capital of France is a city in the eastern department of\u5c55\u51fa. Oh, right, there's a special place called Place de la Confluense. Maybe that's where the capital is. So I think the capital is Place de la Confluense, not the city name. So the capital isn't the town; it's quite a vein-shaped area. But I'm a bit confused because some people might refer to just the town as the capital, but in reality, it's a larger area. So to answer the question, the capital of France is Place de la Confluense, and its formal name is la Confluense. I'm not entirely certain if there are any other significant cities or names, but from what I know, the others I listed might be historical places but not exactly capitals. Maybe the\u6bebot\u00e9 family name is still sometimes used for the capital, but I think it's not the actual name. 
So putting it all together, the capital is Place de la Confluense, and the correct name is \"la Confluense.\" The other names like Neuch portfolio are places, not capitals. So, overall, my answer would be the capital is la Confluense named at Place de la Confluense.\n</think>\n\nThe capital of France is called Place de la Confluense. Its official name is \"la Confluense.\"",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 550,
    "completion_tokens": 540,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

### Stop the vLLM Server

In [None]:
vllm_process.terminate()
vllm_process.wait()  # Wait for process to terminate

0
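
`terminate()` followed by `wait()` works when the server shuts down promptly. If it does not, the cell would wait indefinitely. A variant that escalates to `kill()` after a timeout, mirroring the commented-out lines in the monitoring cell, could look like the sketch below. `stop_process` is a hypothetical helper, and the demo uses a stand-in child process rather than the real server.

```python
import subprocess
import sys


def stop_process(process: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask `process` to terminate, escalating to kill if it does not exit in time."""
    if process.poll() is None:
        process.terminate()
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
    return process.returncode


# Demo with a stand-in child process that would otherwise sleep for a minute
sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(sleeper, timeout=5.0)
print("exited:", sleeper.poll() is not None)
```

In the notebook the call would be `stop_process(vllm_process)`.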