Model Training Manager

A professional-grade application for managing AI model training with worker orchestration, dataset management, and seamless Ollama integration. Supports QLoRA, Unsloth, RAG, and standard fine-tuning workflows through a clean React/TypeScript UI.

Features

  • Dataset management — upload and manage CSV/JSON training datasets
  • Training job orchestration — create, monitor, and manage training jobs
  • Worker pool management — scale training with concurrent worker threads
  • Multiple training types — QLoRA, Unsloth, RAG, and standard fine-tuning
  • Real-time monitoring — track training progress, epochs, and loss metrics
  • Interactive API docs — Swagger UI and ReDoc included

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.11+ / FastAPI / SQLAlchemy (SQLite) / Pydantic |
| Frontend | React 18+ / TypeScript / Vite / TailwindCSS / React Query |
| LLM Backend | Ollama (local) |

Requirements

  • Python 3.11+
  • Node.js 20+ and npm
  • Ollama running locally
  • 8GB+ RAM recommended

Quick Start

1. Install and start Ollama

```bash
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
ollama pull llama3.2:3b   # in a new terminal
```
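Before moving on, you can sanity-check that Ollama is reachable. A minimal Python sketch using Ollama's `/api/tags` endpoint, which lists locally pulled models:

```python
import json
import urllib.request
import urllib.error


def ollama_available(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            # Ollama responds with {"models": [...]} on this endpoint
            return "models" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_available())
```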

2. Backend

```bash
git clone <repository-url>
cd trainers/backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

3. Frontend

```bash
cd ../frontend
npm install
npm run dev
```

Access:

  • Frontend: http://localhost:5173
  • API docs: http://localhost:8000/api/docs
  • ReDoc: http://localhost:8000/api/redoc

Configuration

Create a .env file in backend/:

| Variable | Default | Description |
| --- | --- | --- |
| `DATABASE_URL` | `sqlite:///./trainers.db` | Database connection string |
| `UPLOAD_DIR` | `./uploads` | Dataset upload directory |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `OLLAMA_MODEL` | `llama3.2:3b` | Default training model |
| `MAX_WORKERS` | `8` | Maximum concurrent workers |
| `DEBUG` | `false` | Enable debug mode |
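The backend loads these via Pydantic settings; the stdlib sketch below mirrors that behavior (field names follow the table, but the class itself is illustrative, not the backend's actual settings module):

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Reads the environment variables above, falling back to their defaults."""
    database_url: str = field(
        default_factory=lambda: os.getenv("DATABASE_URL", "sqlite:///./trainers.db"))
    upload_dir: str = field(
        default_factory=lambda: os.getenv("UPLOAD_DIR", "./uploads"))
    ollama_base_url: str = field(
        default_factory=lambda: os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"))
    ollama_model: str = field(
        default_factory=lambda: os.getenv("OLLAMA_MODEL", "llama3.2:3b"))
    max_workers: int = field(
        default_factory=lambda: int(os.getenv("MAX_WORKERS", "8")))
    debug: bool = field(
        default_factory=lambda: os.getenv("DEBUG", "false").lower() == "true")
```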

Training parameters (configurable via Settings page):

| Parameter | Range | Default |
| --- | --- | --- |
| Batch Size | 1–64 | 4 |
| Learning Rate | 1e-6 to 1.0 | 2e-4 |
| Epochs | 1–100 | 3 |
| LoRA Rank | 1–256 | 16 |
| LoRA Alpha | 1–512 | 32 |
| LoRA Dropout | 0.0–1.0 | 0.05 |
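The Settings page enforces these ranges; a stand-alone sketch of that validation (field names are illustrative, not the backend's actual schema):

```python
from dataclasses import dataclass

# (min, max) bounds taken from the table above
_RANGES = {
    "batch_size": (1, 64),
    "learning_rate": (1e-6, 1.0),
    "epochs": (1, 100),
    "lora_rank": (1, 256),
    "lora_alpha": (1, 512),
    "lora_dropout": (0.0, 1.0),
}


@dataclass
class TrainingParams:
    batch_size: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

    def __post_init__(self) -> None:
        # Reject any field outside its documented range
        for name, (lo, hi) in _RANGES.items():
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside allowed range [{lo}, {hi}]")
```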

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/health` | Health check |
| POST | `/api/v1/datasets/` | Upload dataset |
| GET | `/api/v1/datasets/` | List datasets |
| POST | `/api/v1/jobs/` | Create training job |
| GET | `/api/v1/jobs/` | List training jobs |
| GET | `/api/v1/jobs/{id}/status` | Get job status |
| POST | `/api/v1/jobs/{id}/start` | Start job manually |
| POST | `/api/v1/jobs/{id}/cancel` | Cancel job |
| GET | `/api/v1/config/` | Get configuration |
| PATCH | `/api/v1/config/` | Update configuration |
| GET | `/api/v1/workers/` | Get worker status |
| POST | `/api/v1/workers/` | Control workers |
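A tiny stdlib client sketch for these endpoints. The URL shapes come from the table above; the example payload fields (`dataset_id`, `training_type`) are hypothetical — check the Swagger UI at `/api/docs` for the real request schemas:

```python
import json
import urllib.request


class TrainersClient:
    """Minimal client for the endpoints listed above."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def url(self, resource: str, *parts: str) -> str:
        """Build an /api/v1/ URL; collection endpoints end with a slash."""
        path = "/".join([f"/api/v1/{resource}", *parts])
        return self.base_url + (path if parts else path + "/")

    def post(self, url: str, payload: dict) -> dict:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)


# Usage (requires the backend running; payload fields are illustrative):
# client = TrainersClient()
# job = client.post(client.url("jobs"), {"dataset_id": 1, "training_type": "qlora"})
# status_url = client.url("jobs", "1", "status")  # GET /api/v1/jobs/1/status
```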

Training Types

QLoRA — 4-bit quantization with low-rank adaptation. Best for limited GPU memory scenarios.

Unsloth — Optimized LoRA training with faster throughput and reduced memory usage. Best for efficient fine-tuning.

RAG — Builds a vector index from the dataset and trains retrieval and generation components. Best for knowledge-intensive tasks.

Standard fine-tuning — Traditional supervised fine-tuning with rule-based configuration. Best for straightforward adaptation tasks.


Testing

```bash
# Backend
cd backend
source venv/bin/activate
pytest --cov=app --cov-report=term-missing

# Frontend
cd ../frontend
npm test
```

Project Structure

trainers/
├── backend/
│   ├── app/
│   │   ├── api/           # API endpoints
│   │   ├── core/          # Configuration, database
│   │   ├── models/        # SQLAlchemy models
│   │   ├── schemas/       # Pydantic schemas
│   │   ├── services/      # Business logic
│   │   └── workers/       # Training workers
│   ├── tests/
│   └── requirements.txt
└── frontend/
    ├── src/
    │   ├── components/    # React components
    │   ├── pages/         # Page components
    │   ├── services/      # API client
    │   └── types/         # TypeScript types
    └── package.json

Troubleshooting

Ollama connection refused:

```bash
ollama serve
curl http://localhost:11434/api/tags
```

Database errors:

```bash
rm backend/trainers.db   # reset — tables recreate on next start
```

Port conflicts:

```bash
lsof -i :5173   # frontend
lsof -i :8000   # backend
lsof -i :11434  # ollama
```
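Where `lsof` is unavailable (e.g. on Windows), a portable Python check works just as well:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the TCP connection succeeds
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port, name in [(5173, "frontend"), (8000, "backend"), (11434, "ollama")]:
        print(f"{name:8} port {port}: {'in use' if port_in_use(port) else 'free'}")
```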

Out of memory: Reduce batch size, reduce max workers, or switch to QLoRA training type.

Python import errors: Ensure the virtual environment is activated before running.

Frontend build errors:

```bash
cd frontend && rm -rf node_modules package-lock.json && npm install
```

License

MIT

About

Multi-GPU model training manager with worker orchestration, dataset management, and Ollama integration for fully local LLM workflows.
