Description
Objective
- Do we need a queue system that scales to thousands of requests?
Motivation
Null Pointer Errors?
- Currently, inference requests are handled in FIFO order
- We are adopting an OpenAI-compatible API, which means we will receive requests across Chat, Audio, Vision, etc.
- Given that users are on laptops with limited RAM and VRAM, we are likely to have to switch models between requests
Preparing for Cloud Native
- Our long-term future is likely as an enterprise OpenAI alternative, which will be multi-user and have a queue system
- Should we bake in this abstraction now, and use a local file-based queue (which is later swapped out for a more sophisticated queue)? See the sketch below.
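A minimal sketch of what that abstraction could look like, assuming a TypeScript codebase; the `InferenceQueue` interface, the `InferenceJob` shape, and the JSONL file layout are all hypothetical, and the file-backed class is only a stand-in for whatever queue we eventually choose:

```ts
import { promises as fs } from "fs";

// A queued inference request; fields are illustrative only.
interface InferenceJob {
  id: string;
  endpoint: "chat" | "audio" | "vision";
  model: string;
  payload: unknown;
}

// The abstraction we would bake in now: callers only see enqueue/dequeue,
// so the backing store can later be swapped for a real message broker.
interface InferenceQueue {
  enqueue(job: InferenceJob): Promise<void>;
  dequeue(): Promise<InferenceJob | undefined>;
}

// Local file-based FIFO queue: one JSON job per line, consumed from the top.
// Fine for a single local process; not safe for concurrent writers.
class FileQueue implements InferenceQueue {
  constructor(private path: string) {}

  async enqueue(job: InferenceJob): Promise<void> {
    await fs.appendFile(this.path, JSON.stringify(job) + "\n", "utf8");
  }

  async dequeue(): Promise<InferenceJob | undefined> {
    let data: string;
    try {
      data = await fs.readFile(this.path, "utf8");
    } catch {
      return undefined; // queue file does not exist yet
    }
    const lines = data.split("\n").filter((l) => l.trim().length > 0);
    if (lines.length === 0) return undefined;
    const [head, ...rest] = lines;
    await fs.writeFile(this.path, rest.map((l) => l + "\n").join(""), "utf8");
    return JSON.parse(head) as InferenceJob;
  }
}
```

If request handlers only depend on `InferenceQueue`, a later multi-user deployment could swap in an implementation backed by Redis, SQS, or similar without touching the handlers themselves.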