# Running an LLM on Colab

The experiment here also works with other LLMs, not just DeepSeek.

The minimum GPU is a T4, and even that is tight: GPU usage reaches 13 of 15 GB.
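A rough way to see where those 13 GB go. This is back-of-the-envelope arithmetic, not a measurement. It assumes the nominal 1.5B parameter count, 2 bytes per parameter for `--dtype half`, and vLLM's default `--gpu-memory-utilization` of 0.9: vLLM reserves that fraction of GPU memory at startup and fills whatever the weights do not use with activations and KV cache. The helper names are hypothetical.

```python
# Back-of-the-envelope GPU memory estimate.
# Assumptions: nominal parameter count, 2 bytes/parameter (fp16),
# vLLM default gpu_memory_utilization = 0.9.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def vllm_reserved_gb(total_gpu_gb: float, gpu_memory_utilization: float = 0.9) -> float:
    """Memory vLLM claims at startup; weights, activations and KV cache share it."""
    return total_gpu_gb * gpu_memory_utilization


weights = weight_memory_gb(1.5e9)   # "1.5B" model in fp16
reserved = vllm_reserved_gb(15.0)   # a T4 shows about 15 GB in Colab
print(f"weights: ~{weights:.1f} GB")
print(f"reserved by vLLM: ~{reserved:.1f} GB")
print(f"left for activations + KV cache: ~{reserved - weights:.1f} GB")
```

So most of the reported usage is the reservation, not the weights. If memory is tight, lowering `--gpu-memory-utilization` or `--max-model-len` in the `vllm serve` command shrinks what vLLM asks for.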

Ref:

https://medium.com/@hakimnaufal/trying-out-vllm-deepseek-r1-in-google-colab-a-quick-guide-a4fe682b8665


Goals:
- communicate with the LLM
- analyze PDF files

## 1. Install the required pip packages


In [1]:
!pip install vllm  # To avoid the prompt to restart the runtime, you could use: !pip install --quiet vllm
!pip install fastai

Collecting vllm
  Downloading vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl.metadata (17 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.7-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (217 bytes)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer==0.11.3 (from vllm)
  Downloading lm_format_enforcer-0.11.3-py3-none-any.whl.metadata (17 kB)
Collecting llguidance<0.8.0,>=0.7.11 (from vllm)
  Downloading llguidance-0.7.30-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting outlines_core==0.2.11 (from vllm)
  Downloading outlines_core-0.2.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting diskcache==5.6.3 (from vllm)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting lark==1.2.2 (from vllm)
  Downloading lark-1.2.2-py3-none-any.whl.met



What each package is for:

- FastAPI: a Python web framework for building APIs. Here it lets a user send data and get a response back from the model.

- uvicorn: an ASGI (Asynchronous Server Gateway Interface) server. Uvicorn takes the web application built with FastAPI and runs it. Put simply, ASGI is the protocol, FastAPI builds an application that speaks it, and Uvicorn is the server that runs that application.

- nest-asyncio: patches Python's `asyncio` so that an event loop can be started inside one that is already running. Colab (like Jupyter) already runs its own event loop, so starting the loop that FastAPI/Uvicorn need from a notebook cell fails without this patch.

- pyngrok: a Python wrapper for ngrok. ngrok creates a tunnel that exposes a server running on our machine to the internet through a public URL. This is very handy during development: we don't have to host the application anywhere, because our local app is already reachable from the internet.

- vllm: the key Python library here for running (serving) LLMs. Its advantages are speed and efficiency: it can handle more requests per second and uses GPU memory more efficiently (PagedAttention manages the KV cache in small blocks).

- fastai: a library built on top of PyTorch that simplifies training and deploying neural networks. Whereas vLLM is there to speed up LLM inference, fastai provides the tools we need for training.
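The problem nest-asyncio solves can be reproduced with the standard library alone: inside an already-running event loop, `asyncio.run()` refuses to start another one. In the sketch below, `notebook_cell()` plays the role of code in a Colab cell and `start_server_stub()` is a hypothetical stand-in for a server coroutine such as Uvicorn's. Run it as a plain script; in a notebook, replace the last `asyncio.run(...)` with `await notebook_cell()`.

```python
import asyncio


async def start_server_stub() -> str:
    """Hypothetical stand-in for a server coroutine (e.g. what Uvicorn runs)."""
    await asyncio.sleep(0)
    return "server started"


async def notebook_cell() -> str:
    """Runs inside an already-running loop, like code in a Jupyter/Colab cell."""
    coro = start_server_stub()
    try:
        return asyncio.run(coro)   # nested run: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()               # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"


result = asyncio.run(notebook_cell())
print(result)
```

With nest-asyncio installed, calling `nest_asyncio.apply()` once near the top of the notebook patches `asyncio` so that this nested call is allowed, which is why it is usually paired with Uvicorn in Colab.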


**However**, we will not use ngrok and Uvicorn here yet.



You can try other models here: https://huggingface.co/deepseek-ai/DeepSeek-R1#3-model-downloads



## 2. Run the model in the background

In [1]:
# Launch the model server
import subprocess
# Models can be taken from here: https://huggingface.co/deepseek-ai/DeepSeek-R1#3-model-downloads
model = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'
#model = 'ibm-granite/granite-4.0-h-small'

# Start vLLM as a background process; its output is piped so the next cells can show it
vllm_process = subprocess.Popen([
    'vllm',
    'serve',
    model,
    '--trust-remote-code',
    '--dtype', 'half',               # fp16: the T4 does not support bfloat16
    '--max-model-len', '16384',      # maximum context length (prompt + reply) in tokens
    '--tensor-parallel-size', '1'    # number of GPUs to split the model across
], stdout=subprocess.PIPE, stderr=subprocess.PIPE, start_new_session=True)
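One caveat with `stdout=subprocess.PIPE`: once nothing reads the pipe anymore (for example after the server is up), the OS pipe buffer, typically 64 KiB on Linux, can fill with vLLM's request logs, and the server may then stall on its next write. A minimal alternative is to send both streams to a log file. The sketch below uses a stand-in command instead of `vllm serve` so it runs anywhere; `start_in_background` and `tail` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path


def start_in_background(cmd: list[str], log_path: str) -> subprocess.Popen:
    """Start cmd in its own session, writing stdout and stderr to one log file."""
    with open(log_path, "w") as log_file:
        # The child gets its own copy of the file descriptor, so closing ours
        # when the with-block ends does not affect the running process.
        return subprocess.Popen(
            cmd,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # merge stderr into the same file
            start_new_session=True,
        )


def tail(log_path: str, n: int = 5) -> list[str]:
    """Return the last n lines of the log, e.g. while waiting for the server."""
    return Path(log_path).read_text(errors="replace").splitlines()[-n:]


# Stand-in command; in the notebook this would be the ['vllm', 'serve', model, ...] list.
demo_cmd = [sys.executable, "-c",
            "import sys; print('server starting'); print('a warning', file=sys.stderr)"]
proc = start_in_background(demo_cmd, "vllm.log")
proc.wait(timeout=30)
print(tail("vllm.log"))
```

With this setup, the monitoring code would read `vllm.log` (for example with `tail`) instead of `vllm_process.stdout`.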

## 3. Check and test vLLM

This is the first test: it tells us whether vLLM is running properly after it has been started.

In [2]:
import requests
import select
import subprocess
import sys
import time
from typing import Tuple

def check_vllm_status(url: str = "http://localhost:8000/health") -> bool:
    """Check whether the vLLM server is up and responding normally."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Connection refused or timed out: the server is not ready yet
        return False

def monitor_vllm_process(vllm_process: subprocess.Popen, check_interval: int = 5) -> Tuple[bool, str, str]:
    """
    Monitor the vLLM server status, its process, and its stdout and stderr.
    Returns: (success, stdout, stderr)
    """
    print("Starting VLLM server monitoring...")

    while vllm_process.poll() is None:  # Loop while the process is still running
        if check_vllm_status():
            print("✓ VLLM server is up and running!")
            return True, "", ""

        print("Waiting for VLLM server to start...")
        time.sleep(check_interval)

        # Print any new output. select() with a 0 s timeout checks whether data is
        # waiting, so read1() never blocks on an empty pipe (works on Linux/Colab).
        for label, stream in (("STDOUT", vllm_process.stdout), ("STDERR", vllm_process.stderr)):
            if select.select([stream], [], [], 0)[0]:
                text = stream.read1().decode('utf-8', errors='replace')
                if text:
                    print(f"{label}:", text)

    # If we get here, the process has exited
    stdout, stderr = vllm_process.communicate()
    return False, stdout.decode('utf-8'), stderr.decode('utf-8')
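The loop above stops when the process exits, but it has no overall deadline: if the server hangs while loading without exiting, it keeps polling forever. Below is a sketch of a polling check with a timeout, using only the standard library (`urllib` instead of `requests`) and a tiny local HTTP server standing in for vLLM's `/health` endpoint. `wait_until_healthy` and its 900-second default are hypothetical choices.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def wait_until_healthy(url: str, timeout_s: float = 900, interval_s: float = 5) -> bool:
    """Poll url until it answers 200 or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError (e.g. connection refused), HTTPError and timeouts all derive from OSError
            pass
        time.sleep(interval_s)
    return False


class FakeHealth(BaseHTTPRequestHandler):
    """Stand-in for vLLM's /health endpoint."""

    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def log_message(self, format, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), FakeHealth)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

up = wait_until_healthy(f"http://127.0.0.1:{port}/health", timeout_s=10, interval_s=0.5)
down = wait_until_healthy("http://127.0.0.1:1/health", timeout_s=1, interval_s=0.2)
print(up, down)
server.shutdown()
```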

## 4. Handle vLLM startup success or failure

In [3]:
try:
    success, stdout, stderr = monitor_vllm_process(vllm_process)

    if not success:
        print("\n❌ VLLM server failed to start!")
        print("\nFull STDOUT:", stdout)
        print("\nFull STDERR:", stderr)
        sys.exit(1)

except KeyboardInterrupt:
    print("\n⚠️ Monitoring interrupted by user")
    # This only stops the monitoring loop, not the vLLM server itself.
    # To stop the server as well, uncomment the lines below.
    # vllm_process.terminate()
    # try:
    #     vllm_process.wait(timeout=5)
    # except subprocess.TimeoutExpired:
    #     vllm_process.kill()

    # Only collect the remaining output if the server has exited;
    # communicate() on a still-running server would block forever.
    if vllm_process.poll() is not None:
        stdout, stderr = vllm_process.communicate()
        if stdout: print("\nFinal STDOUT:", stdout.decode('utf-8'))
        if stderr: print("\nFinal STDERR:", stderr.decode('utf-8'))
    sys.exit(0)

Starting VLLM server monitoring...
Waiting for VLLM server to start...
STDOUT: INFO 10-10 11:43:53 [__init__.py:216] Automatically detected platform cuda.

STDERR: 2025-10-10 11:43:55.232578: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered

Waiting for VLLM server to start...
STDOUT: (APIServer pid=743) INFO 10-10 11:44:07 [api_server.py:1839] vLLM API server version 0.11.0

E0000 00:00:1760096635.253427     743 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760096635.259439     743 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1760096635.277095     743 computation_placer.cc:177] computation placer already registered. Please check linka
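The commented-out `terminate()`/`kill()` lines in the cell above signal only the main process. Because the server was started with `start_new_session=True`, it runs in its own process group, so the whole group (vLLM plus any child processes it may spawn) can be signalled at once. A POSIX-only sketch (fine on Colab's Linux runtime), demonstrated on a `sleep` stand-in instead of the real server; `stop_server` is a hypothetical helper.

```python
import os
import signal
import subprocess
import sys


def stop_server(proc: subprocess.Popen, grace_s: float = 10) -> int:
    """Stop a process started with start_new_session=True together with its children."""
    if proc.poll() is not None:
        return proc.returncode            # already exited
    pgid = os.getpgid(proc.pid)           # with a new session, this equals proc.pid
    os.killpg(pgid, signal.SIGTERM)       # ask the whole group to shut down cleanly
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)   # force it if SIGTERM was ignored
        return proc.wait()


# Stand-in for the vLLM server: a process that would otherwise run for a minute
demo = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"],
                        start_new_session=True)
rc = stop_server(demo)
print(rc)  # a negative value -N means the process was ended by signal N
```

In the notebook this would be called as `stop_server(vllm_process)`.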

## 5. Run it and add a helper function

In [4]:
import requests
import json
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from fastapi.responses import StreamingResponse

# Request schema for the input (for a FastAPI endpoint later)
class QuestionRequest(BaseModel):
    question: str


def ask_model(question: str):
    """
    Send a request to the model and return its response.
    """
    url = "http://localhost:8000/v1/chat/completions"  # Change this if your server uses a different URL
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question
            }
        ]
    }

    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Raise an exception on HTTP errors
    return response.json()
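The request above relies on the server's default sampling settings. The OpenAI-compatible endpoint also accepts fields such as `temperature` and `max_tokens`. A sketch of a payload builder (hypothetical name `build_chat_payload`); 0.6 is the temperature DeepSeek's model card recommends for the R1 models (treat it as a starting point), and the 2048-token cap is an arbitrary example value.

```python
import json


def build_chat_payload(model: str, question: str,
                       temperature: float = 0.6, max_tokens: int = 2048) -> dict:
    """OpenAI-style chat completion body with explicit sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,  # lower = more deterministic
        "max_tokens": max_tokens,    # upper bound on generated tokens (reasoning + answer)
    }


payload = build_chat_payload("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
                             "who is first president of indonesia?")
print(json.dumps(payload, indent=2))
```

Inside `ask_model`, this dictionary would take the place of `data` (`requests.post(url, headers=headers, json=payload)`). The prompt tokens plus `max_tokens` still have to fit within `--max-model-len`.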

## 6. Try out the model



In [5]:
import json

question = """
who is first president of indonesia?
"""

# Assumes ask_model from the previous cell has already been defined
try:
    result = ask_model(question)
    print(json.dumps(result, indent=2))
    answer = result['choices'][0]['message']['content']
    print(answer)
except requests.exceptions.RequestException as e:
    print(f"Error sending request: {e}")

{
  "id": "chatcmpl-c9d2dbb7ebde44d5bb32a723a6648c76",
  "object": "chat.completion",
  "created": 1760096828,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Okay, so I need to figure out who the first president of Indonesia is. I remember hearing that Indonesia was formed in 1904, but I'm not exactly sure who the first president was. I think it was someone named Riaih Fani, but I'm not certain. Let me try to piece this together.\n\nFirst, I should probably start by recalling some basic facts about Indonesia. It's a country in Southeast Asia, known for its rich culture, especially with its traditional dance form, kaya. The government is the Indonesian People's RepUBLIC, and they have a president and a prime minister. I believe there are some political changes over the years, so maybe the first president was someone before the country was established.\n\nI think Indonesia 

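The `content` above is the model's reasoning: R1-style models think first and close their reasoning with a `</think>` tag before giving the final answer. A minimal sketch that separates the two parts, assuming the tag appears in `content`; `split_reasoning` is a hypothetical helper and the sample reply is made up. vLLM can also do this on the server with `--reasoning-parser deepseek_r1`, which returns the reasoning in a separate `reasoning_content` field; check the documentation for your vLLM version.

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split an R1-style reply into (reasoning, answer) at the last '</think>' tag."""
    marker = "</think>"
    if marker not in content:
        # No closing tag: the reply probably stopped while the model was still thinking
        return content.strip(), ""
    reasoning, answer = content.rsplit(marker, 1)
    return reasoning.replace("<think>", "").strip(), answer.strip()


# Made-up sample reply in the same shape as the model's output
reply = "Okay, so I need to figure out who the first president was...\n</think>\n\nSukarno was Indonesia's first president."
reasoning, answer = split_reasoning(reply)
print("REASONING:", reasoning)
print("ANSWER:", answer)
```

In the notebook: `reasoning, answer = split_reasoning(result['choices'][0]['message']['content'])`.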
## The following is for PDF analysis

# Task
Implement a solution to upload a PDF file, extract its text content, and send the text to the DeepSeek model for analysis.

## Install necessary libraries

### Subtask:
Install libraries for handling PDF files in Python.


**Reasoning**:
Install the PyMuPDF library using pip.



In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 100.3 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.4


## Upload PDF file

### Subtask:
Create a mechanism to upload a PDF file to the Colab environment.


**Reasoning**:
Import the necessary module and use the upload function to allow the user to upload a PDF file.



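In [None]:
from google.colab import files
# files.upload() opens a file picker and returns {filename: file bytes}.
# It is called later, inside analyze_pdf_workflow().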
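`files.upload()` returns a dictionary that maps each uploaded file name to its bytes. A sketch of a hypothetical helper, `save_pdfs`, that writes those bytes to disk and skips anything not starting with the `%PDF-` signature that PDF files normally begin with. The demo uses a fake dictionary so it runs outside Colab.

```python
import tempfile
from pathlib import Path


def save_pdfs(uploaded: dict[str, bytes], out_dir: str = ".") -> list[Path]:
    """Write the {filename: bytes} dict from files.upload() to disk, keeping only PDFs."""
    saved = []
    for name, content in uploaded.items():
        if not content.startswith(b"%PDF-"):    # PDF files normally start with this signature
            print(f"Skipping {name}: not a PDF")
            continue
        path = Path(out_dir) / Path(name).name  # keep only the base name of the file
        path.write_bytes(content)
        saved.append(path)
    return saved


# Fake upload result (in Colab: uploaded = files.upload())
fake_upload = {"report.pdf": b"%PDF-1.7\n...", "notes.txt": b"just text"}
out_dir = tempfile.mkdtemp()
saved = save_pdfs(fake_upload, out_dir)
print([p.name for p in saved])
```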

## Integrate PDF analysis into the workflow

### Subtask:
Combine the PDF processing and model interaction steps into a cohesive workflow that the user can easily use.


**Reasoning**:
Define the main workflow function to orchestrate the PDF processing and model analysis steps.



In [None]:
import fitz  # PyMuPDF
import requests
from typing import Optional
from google.colab import files

# Global variables holding the extracted text and the conversation history
global_extracted_text = ""
global_message_history = []

def ask_model(question: str, history: Optional[list] = None):
    """
    Send a request to the model and return its response, including the previous
    conversation history. Note: the reply is not appended to `history`.
    """
    url = "http://localhost:8000/v1/chat/completions"  # Change this if your server uses a different URL
    headers = {"Content-Type": "application/json"}

    # Prepend the history (if any) to the new user message
    messages = (history or []) + [{"role": "user", "content": question}]

    data = {
        "model": model,
        "messages": messages
    }

    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Raise an exception on HTTP errors
    return response.json()


def analyze_pdf_workflow():
    """
    Orchestrates the PDF workflow: upload, extract the text, and store it as the
    first message of the conversation history for later questions.
    No chunking is done: the whole text must fit in the server's --max-model-len.
    """
    global global_extracted_text
    global global_message_history

    print("Please upload your PDF file.")
    uploaded = files.upload()

    if not uploaded:
        print("No file uploaded. Exiting.")
        return

    # If several files are uploaded, only the last one ends up in the globals
    for file_name, file_content in uploaded.items():
        print(f"Processing file: {file_name}")

        # Save the uploaded file locally
        with open(file_name, 'wb') as f:
            f.write(file_content)

        # Extract the text of every page of the PDF
        with fitz.open(file_name) as doc:
            extracted_text = "".join(page.get_text() for page in doc)
        global_extracted_text = extracted_text  # Store the extracted text globally
        print(f"Extracted {len(global_extracted_text)} characters from {file_name}.")

        # Create an initial message for the model with the extracted text.
        # A long PDF can exceed the context window (16384 tokens here) and the
        # request will then fail, so split or shorten the text if that happens.
        initial_message_content = f"Here is the content of the document for analysis:\n\n{global_extracted_text}"
        global_message_history = [{"role": "user", "content": initial_message_content}]

        print("\n--- Initial Analysis (if applicable) ---")
        # You might want to send a specific prompt for initial analysis here.
        # For now, we just store the text for subsequent interactions.
        print("PDF content loaded into model's context. You can now ask questions about it.")


# # Call the main workflow function to execute the process
# analyze_pdf_workflow()
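`analyze_pdf_workflow` sends the whole document as one message, so a long PDF can exceed the 16384-token `--max-model-len` set when the server was started. A minimal character-based chunker with overlap, assuming the common rule of thumb of about 4 characters per token for English text (the real ratio depends on the tokenizer and the language). `chunk_text` and its defaults are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 24_000, overlap: int = 500) -> list[str]:
    """Split text into pieces of at most max_chars characters that overlap slightly."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back a little so text cut at a border appears in both chunks
        start = end - overlap
    return chunks


# 24,000 characters is roughly 6,000 tokens at ~4 chars/token, leaving room in a
# 16,384-token context for the question and the model's (long) reasoning.
document = "x" * 50_000  # stand-in for global_extracted_text
chunks = chunk_text(document)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be sent with its own question (for example, a summary per chunk), and the partial answers combined in a final request.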

With the workflow defined, upload the PDF by running `analyze_pdf_workflow()`. The goal is for the model to remember the content of the PDF when answering later questions.

In [None]:
# Run this cell to upload a PDF and load its content into the model's memory
analyze_pdf_workflow()

In [None]:
# After uploading the PDF, you can ask questions about it like this:
question = "Can you summarize the main points of the document?"
response = ask_model(question, history=global_message_history)

# Print the model's response
if 'choices' in response and len(response['choices']) > 0 and 'message' in response['choices'][0] and 'content' in response['choices'][0]['message']:
    analysis_content = response['choices'][0]['message']['content']
    print("Model's response:")
    print(analysis_content)
else:
    print("Unexpected response format:", response)
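`ask_model` sends the history but never adds the model's reply to it, so a follow-up question loses the previous answer. A sketch of a hypothetical `chat_turn` helper that records both sides of each exchange. The `send` argument is an assumption that keeps the sketch testable without a server; in the notebook it can simply be `ask_model`, as in `chat_turn(global_message_history, question, ask_model)`.

```python
from typing import Callable


def chat_turn(history: list[dict], question: str,
              send: Callable[[str, list[dict]], dict]) -> str:
    """Ask one question with the current history, then record the question and answer in it."""
    response = send(question, history)                 # e.g. ask_model(question, history)
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})
    return answer


def fake_send(question: str, history: list[dict]) -> dict:
    """Stand-in for the server: reports how many earlier messages it received."""
    return {"choices": [{"message": {"role": "assistant",
                                     "content": f"seen {len(history)} earlier messages"}}]}


history = [{"role": "user", "content": "Here is the content of the document..."}]
print(chat_turn(history, "Summarize the document.", fake_send))  # seen 1 earlier messages
print(chat_turn(history, "And the conclusion?", fake_send))      # seen 3 earlier messages
```

Every turn adds two messages, so the history grows toward the context limit; trimming older turns (while keeping the document message) may be needed for long conversations.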