LLMeQueue

A simple distributed system for offloading LLM tasks (embeddings and chat completions) to a local GPU.

What is this?

If you have a machine with a GPU but want to run your main application somewhere else (like AWS), this project lets you:

  1. Submit requests from anywhere via OpenAI-compatible REST API
  2. Process with Ollama on your local GPU (embeddings or chat completions)
  3. Retrieve results when they're ready

The queue server runs in the cloud, your GPU worker runs locally, and they communicate over HTTP.

Architecture

Your App (cloud)          Queue Server (cloud)           Your PC (local GPU)
      │                          │                              │
      ├── Submit request ───────►│                              │
      │                          │◄──── Worker polls for tasks ─┤
      │                          │                              │
      │                          │──── Send task to process ───►│
      │                          │                              ├── Ollama processes
      │                          │◄──── Return result ──────────┤    (embed or chat)
      │                          │                              │
      │◄── Get result ───────────┤                              │
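
A minimal sketch of the worker side of that loop, in Python. The worker-facing paths and task fields shown here (/worker/next, /worker/result, type, payload) are hypothetical placeholders for illustration only; the real polling loop lives in worker/worker.py and talks to a local Ollama instance.

import time
import requests

SERVER_URL = "http://your-aws-ip:8000"                  # queue server
OLLAMA_URL = "http://localhost:11434"                   # local Ollama (default port)
HEADERS = {"Authorization": "Bearer your-secret-token"}
POLL_INTERVAL = 2                                       # seconds, see POLL_INTERVAL in .env

while True:
    # Ask the queue server for the next pending task (hypothetical endpoint).
    resp = requests.get(f"{SERVER_URL}/worker/next", headers=HEADERS, timeout=10)
    if resp.status_code != 200 or not resp.json():
        time.sleep(POLL_INTERVAL)
        continue
    task = resp.json()

    # Run it on the local GPU via Ollama (embedding or chat).
    endpoint = "/api/embeddings" if task["type"] == "embedding" else "/api/chat"
    result = requests.post(f"{OLLAMA_URL}{endpoint}", json=task["payload"], timeout=300).json()

    # Report the result back to the queue server (hypothetical endpoint).
    requests.post(f"{SERVER_URL}/worker/result",
                  json={"id": task["id"], "result": result},
                  headers=HEADERS, timeout=10)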

Quick Start

Option A: NVIDIA GPU (Linux/Windows)

# Start all services (3 workers for parallel processing)
docker compose -f docker-compose.nvidia.yml up --scale worker=3 --build

# Pull models (first time only)
docker exec -it llmequeue-ollama-1 ollama pull nomic-embed-text
docker exec -it llmequeue-ollama-1 ollama pull llama3.2

Option B: Mac (Apple Silicon GPU)

# Install and run Ollama natively for GPU acceleration
brew install ollama
ollama serve  # In a separate terminal

# Pull models
ollama pull nomic-embed-text
ollama pull llama3.2

# Start server and 3 workers (parallel processing)
docker compose -f docker-compose.mac.yml up --scale worker=3 --build
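
With either option running, a quick way to confirm the server is reachable is the /health endpoint; per the API table below it expects the bearer token like every other endpoint. For example:

import requests

resp = requests.get(
    "http://localhost:8000/health",
    headers={"Authorization": "Bearer your-secret-token"},
    timeout=5,
)
print(resp.status_code, resp.text)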

Use the API (OpenAI-compatible)

Embeddings

curl:

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Authorization: Bearer your-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"input": "The quick brown fox jumps over the lazy dog", "model": "nomic-embed-text"}'

Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    api_key="your-secret-token",
    base_url="http://localhost:8000/v1"
)

response = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="nomic-embed-text"
)

embedding = response.data[0].embedding

Response (OpenAI format):

{
  "object": "list",
  "data": [{"object": "embedding", "embedding": [0.123, -0.456, ...], "index": 0}],
  "model": "nomic-embed-text"
}

The server waits up to 30 seconds for the result. If processing takes longer, it returns a task ID for polling.

Chat Completions

curl:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer your-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "model": "llama3.2"}'

Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    api_key="your-secret-token",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="llama3.2"
)

print(response.choices[0].message.content)

Response (OpenAI format):

{
  "id": "...",
  "object": "chat.completion",
  "choices": [{"message": {"role": "assistant", "content": "Hello! How can I help?"}, "finish_reason": "stop"}],
  "model": "llama3.2"
}

The server waits up to 180 seconds (3 minutes) for the result. If processing takes longer, it returns a task ID:

{"id": "550e8400-e29b-41d4-a716-446655440000"}

Then poll for the result:

curl http://localhost:8000/tasks/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer your-secret-token"

API Endpoints

Method  Endpoint              Description
POST    /v1/embeddings        OpenAI-compatible embeddings
POST    /v1/chat/completions  OpenAI-compatible chat completions
GET     /tasks/{id}           Get task status and result (for polling)
GET     /tasks/{id}/result    Get only the result
GET     /health               Health check

All endpoints require the Authorization: Bearer your-secret-token header.

OpenAI Endpoint Details

POST /v1/embeddings

Field  Type    Default           Description
input  string  required          Text to embed
model  string  nomic-embed-text  Model name (optional)

POST /v1/chat/completions

Field        Type    Default   Description
messages     array   required  Array of message objects with role and content
model        string  llama3.2  Model name (optional)
temperature  float   0.7       Sampling temperature
max_tokens   int     null      Maximum tokens to generate
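
The optional sampling fields map directly onto the OpenAI SDK call. For example, assuming the server forwards them to Ollama as documented above:

from openai import OpenAI

client = OpenAI(api_key="your-secret-token", base_url="http://localhost:8000/v1")

# temperature and max_tokens are optional; the defaults above apply if omitted.
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize what a task queue does in one sentence."}],
    model="llama3.2",
    temperature=0.2,
    max_tokens=64,
)
print(response.choices[0].message.content)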

Configuration

Edit the .env file to customize:

Variable         Default            Description
AUTH_TOKEN       your-secret-token  API authentication token
EMBEDDING_MODEL  nomic-embed-text   Ollama embedding model
CHAT_MODEL       llama3.2           Ollama chat model
POLL_INTERVAL    2                  Worker poll frequency (seconds)
SERVER_PORT      8000               Server port
DB_PATH          data/llmequeue.db  SQLite database path
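
For reference, a .env built entirely from the defaults above looks like this (values shown are the documented defaults, not recommendations):

AUTH_TOKEN=your-secret-token
EMBEDDING_MODEL=nomic-embed-text
CHAT_MODEL=llama3.2
POLL_INTERVAL=2
SERVER_PORT=8000
DB_PATH=data/llmequeue.db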

Requirements

  • Docker and Docker Compose
  • An NVIDIA GPU (Linux/Windows) or an Apple Silicon Mac
  • Ollama (installed natively on Mac; runs in a container with the NVIDIA setup)

Project Structure

LLMeQueue/
├── server/              # FastAPI queue server
│   ├── main.py          # API endpoints
│   ├── database.py      # SQLite operations
│   ├── models.py        # Request/response models
│   └── config.py        # Configuration
├── worker/              # GPU worker client
│   ├── worker.py        # Polling loop
│   ├── embedder.py      # Ollama embeddings client
│   ├── chat.py          # Ollama chat client
│   └── config.py        # Configuration
├── docker-compose.yml
├── lightsail-deploy.json  # AWS Lightsail deployment
├── .env                 # Environment variables
└── data/                # SQLite database (persistent)

Task States

  • pending - Waiting to be processed
  • processing - Worker is processing the task
  • completed - Result ready
  • failed - Error occurred (check error field)

Deployment

Server (AWS)

Note: The server no longer has default models. Clients must specify the model field in requests.

# On AWS instance
git clone https://github.com/yourusername/LLMeQueue.git
cd LLMeQueue
cp .env.example .env
# Edit .env with your AUTH_TOKEN
docker compose up -d --build

Worker (Local GPU)

Workers should specify default models via environment variables:

# Create .env file with your models
echo "EMBEDDING_MODEL=nomic-embed-text" >> .env
echo "CHAT_MODEL=llama3.2" >> .env
echo "SERVER_URL=http://your-aws-ip:8000" >> .env
echo "AUTH_TOKEN=your-secret-token" >> .env

# Deploy worker
docker compose -f docker-compose.cloud.yml up -d --build
