# Running an LLM in Colab

The experiment here also works with other LLMs, not only DeepSeek.

The minimum GPU is a T4, and even that is a tight fit: GPU usage was 13 of 15 GB.

Ref:

https://medium.com/@hakimnaufal/trying-out-vllm-deepseek-r1-in-google-colab-a-quick-guide-a4fe682b8665


## 1. Install the required pip packages


In [2]:
!pip install fastapi nest-asyncio pyngrok uvicorn
!pip install vllm  # if you don't want to be prompted to restart the runtime, you could run this instead: !pip install --quiet vllm
!pip install fastai

Collecting pyngrok
  Downloading pyngrok-7.4.0-py3-none-any.whl.metadata (8.1 kB)
Downloading pyngrok-7.4.0-py3-none-any.whl (25 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.4.0
Collecting vllm
  Downloading vllm-0.10.2-cp38-abi3-manylinux1_x86_64.whl.metadata (16 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.7-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (217 bytes)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer==0.11.3 (from vllm)
  Downloading lm_format_enforcer-0.11.3-py3-none-any.whl.metadata (17 kB)
Collecting llguidance<0.8.0,>=0.7.11 (from vllm)
  Downloading llguidance-0.7.30-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting outlines_core==0.2.11 (from vllm)
  Downloading outlines_core-0.2.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_



What each package is for:

- FastAPI: a Python web framework for building APIs. Here it lets a user send data and get a response back from the model.

- uvicorn: an ASGI (Asynchronous Server Gateway Interface) server. Uvicorn hosts the web application built with FastAPI and runs it. Put simply, FastAPI defines the application and Uvicorn is the server that runs it.

- nest-asyncio: patches asyncio so that an event loop can be started inside one that is already running. A notebook already runs its own event loop, so without this patch Uvicorn cannot start its loop from a cell.

- pyngrok: a Python wrapper for ngrok. Ngrok opens a tunnel that exposes a local port to the public internet. This is very handy during development: the local application is reachable from the internet without having to host it anywhere.

- vllm: the key Python library here, used for LLM inference and serving. Its advantage is speed and efficiency: it can handle more requests per second and uses GPU memory more efficiently.

- fastai: a library built on top of PyTorch that simplifies training and deploying neural networks. Where vllm improves LLM inference, fastai provides the tools needed for training.
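To make the nest-asyncio entry concrete, the sketch below reproduces the restriction it works around, using only the standard library. It is meant to be run as a standalone script, where `outer()` stands in for the notebook's already-running event loop; the names `inner` and `outer` are illustrative and not part of the original notebook.

```python
import asyncio


async def inner() -> str:
    return "ok"


async def outer() -> str:
    # A loop is already running at this point, just as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the running one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` before the nested call is the step that is expected to lift this restriction, which is why the package is installed alongside Uvicorn.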


**However**, we will not use ngrok or uvicorn here for now.



You can try other models from here: https://huggingface.co/deepseek-ai/DeepSeek-R1#3-model-downloads



## 2. Run the model in the background

In [3]:
# Launch the model
import subprocess
# Models can be picked from here: https://huggingface.co/deepseek-ai/DeepSeek-R1#3-model-downloads
model = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'

# Start vLLM as a background process
vllm_process = subprocess.Popen([
    'vllm',
    'serve',
    model,
    '--trust-remote-code',
    '--dtype', 'half',
    '--max-model-len', '16384',
    '--tensor-parallel-size', '1'  # a single GPU, so no tensor parallelism
], stdout=subprocess.PIPE, stderr=subprocess.PIPE, start_new_session=True)
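Because stdout and stderr are pipes, a long-running child can stall once the operating system's pipe buffer fills up and nothing is reading from it. One common way around this, shown here only as a sketch, is to drain the pipe on a background thread. The helper `start_reader` and the stand-in child process are assumptions for illustration; in the notebook the child would be the `vllm_process` object.

```python
import subprocess
import sys
import threading
from queue import Empty, Queue


def start_reader(stream, sink: Queue) -> threading.Thread:
    """Copy lines from a pipe into a queue on a daemon thread."""
    def pump() -> None:
        # readline returns b"" at end of file, which ends the loop
        for raw in iter(stream.readline, b""):
            sink.put(raw.decode("utf-8", errors="replace").rstrip())
        stream.close()

    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return thread


# Stand-in child process (hypothetical); it only prints two lines and exits.
child = subprocess.Popen(
    [sys.executable, "-c", "print('line 1'); print('line 2')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
lines: Queue = Queue()
reader = start_reader(child.stdout, lines)

child.wait()
reader.join(timeout=5)

# Collect whatever the reader thread has gathered so far, without blocking.
collected = []
while True:
    try:
        collected.append(lines.get_nowait())
    except Empty:
        break
print(collected)
```

With this pattern the main cell never blocks on a read, and the child keeps running even when its output is not being printed.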

## 3. Check and test vLLM

This step is the first test: it tells us whether vLLM is running properly once it has been started.

In [4]:
import requests
import time
from typing import Tuple
import sys

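def check_vllm_status(url: str = "http://localhost:8000/health") -> bool:
    """Check whether the vLLM server responds normally."""
    try:
        # A timeout keeps one hung request from stalling the monitor loop
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Covers connection errors and timeouts while the server is still starting
        return False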

def monitor_vllm_process(vllm_process: subprocess.Popen, check_interval: int = 5) -> Tuple[bool, str, str]:
    """
    Monitor the vLLM process: its status, stdout, and stderr.
    Returns: (success, stdout, stderr)
    """
    print("Starting VLLM server monitoring...")

    while vllm_process.poll() is None:  # Loop for as long as the process is still running
        if check_vllm_status():
            print("✓ VLLM server is up and running!")
            return True, "", ""

        print("Waiting for VLLM server to start...")
        time.sleep(check_interval)

        # Print any output that is available.
        # Note: readable() only says the stream supports reading; read1() can still block until data arrives.
        if vllm_process.stdout.readable():
            stdout = vllm_process.stdout.read1().decode('utf-8')
            if stdout:
                print("STDOUT:", stdout)

        if vllm_process.stderr.readable():
            stderr = vllm_process.stderr.read1().decode('utf-8')
            if stderr:
                print("STDERR:", stderr)

    # Reaching this point means the process has exited
    stdout, stderr = vllm_process.communicate()
    return False, stdout.decode('utf-8'), stderr.decode('utf-8')
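The monitoring loop keeps waiting for as long as the process is alive, with no upper bound. A minimal sketch of the same polling idea with a deadline is shown below. The helper `wait_until` and the stand-in check `fake_ready` are hypothetical names used only for this illustration.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False


# Stand-in readiness check that succeeds on the third call.
calls = {"n": 0}


def fake_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_ready, timeout=2.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ready, timed_out)
```

In the notebook, `wait_until(check_vllm_status)` would be the corresponding call, using the health check defined in this cell.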

## 4. Branch on whether vLLM started successfully

In [5]:
try:
    success, stdout, stderr = monitor_vllm_process(vllm_process)

    if not success:
        print("\n❌ VLLM server failed to start!")
        print("\nFull STDOUT:", stdout)
        print("\nFull STDERR:", stderr)
        sys.exit(1)

except KeyboardInterrupt:
    print("\n⚠️ Monitoring interrupted by user")
    # This only stops the probing, not vLLM itself. To stop vLLM as well, uncomment the lines below.
    # vllm_process.terminate()
    # try:
    #     vllm_process.wait(timeout=5)
    # except subprocess.TimeoutExpired:
    #     vllm_process.kill()

    # communicate() waits for the process to exit, so only collect output once vLLM has stopped.
    if vllm_process.poll() is not None:
        stdout, stderr = vllm_process.communicate()
        if stdout: print("\nFinal STDOUT:", stdout.decode('utf-8'))
        if stderr: print("\nFinal STDERR:", stderr.decode('utf-8'))
    sys.exit(0)

Starting VLLM server monitoring...
Waiting for VLLM server to start...
STDOUT: INFO 10-03 07:53:01 [__init__.py:216] Automatically detected platform cuda.

STDERR: 2025-10-03 07:52:54.589819: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1759477974.610806     804 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1759477974.616665     804 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1759477974.632021     804 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1759477974.632041     804 computation_placer.cc:177] computation placer already regist

## 5. Run it, plus helper functions

In [6]:
import requests
import json
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from fastapi.responses import StreamingResponse

# Request schema for the input
class QuestionRequest(BaseModel):
    question: str


def ask_model(question: str):
    """
    Send a request to the model and return its response.
    """
    url = "http://localhost:8000/v1/chat/completions"  # Adjust this if your server runs at a different URL
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question
            }
        ]
    }

    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Raise an exception if the server returned an HTTP error
    return response.json()
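The JSON returned by `ask_model` nests the text under `choices[0].message.content`. The sketch below pulls that text out and, under the assumption that the model ends its reasoning with a `</think>` marker, separates the reasoning from the final answer. The helpers `split_reasoning` and `extract_answer` are hypothetical, and `sample` is a hand-written dictionary, not real model output.

```python
from typing import Tuple


def split_reasoning(content: str, marker: str = "</think>") -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when the marker is absent."""
    reasoning, found, answer = content.partition(marker)
    if not found:
        return "", content.strip()
    # Assumption: an opening <think> tag may or may not be present in the text.
    return reasoning.replace("<think>", "").strip(), answer.strip()


def extract_answer(result: dict) -> Tuple[str, str]:
    """Read the assistant message from a chat-completions style response."""
    content = result["choices"][0]["message"]["content"]
    return split_reasoning(content)


# Hand-written sample in the same shape as the server's response.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Let me think.\n</think>\n\nGeorge Washington.",
            },
        }
    ]
}
reasoning, answer = extract_answer(sample)
print(answer)
```

If the marker never appears, the whole text is returned as the answer, so the helper is safe to use with models that do not emit reasoning.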

## 6. Try the model



In [7]:
question = """
the first president of america?
"""

# Assumes ask_model from the earlier cell (Wt2lqQ_vfrdn) is available
try:
    result = ask_model(question)
    import json
    print(json.dumps(result, indent=2))
    abc = result['choices'][0]['message']['content']
    print(abc)
except requests.exceptions.RequestException as e:
    print(f"Error sending request: {e}")

{
  "id": "chatcmpl-0173672a9aa64e4bb3314dee93b2e3fc",
  "object": "chat.completion",
  "created": 1759478183,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Okay, so I need to figure out who the first president of America was. I remember hearing that it was George Washington, but I'm not entirely sure. Let me think through this step by step.\n\nFirst, George Washington was born on February 22, 1732, in Virginia. He was the second president of the United States, so the first one must have been someone before him. I think it was John Adams, but I'm not 100% certain. Maybe I should look into the historical timeline.\n\nIn the 1700s, there was a time when the U.S. was a colony of England, and the first president would have been someone from England. That would make sense because England was a neighboring country at that time. So, maybe John Adams was the first president. He 