Overview of the LMoe Process #26
The airoboros dataset generation tool inherently generates many distinct types of training data via "instructors". Each instructor has its own prompt/config and can be used to generate task-specific training data. During dataset generation, each output record is labeled with a "category" field corresponding to the instructor that generated it. I then just split the data into the experts using that category, an example here: One of the routing options is a faiss index search, which requires packaging the fine-tuning data used to train each expert together with the adapter. The lmoe package has routing data, training data, and adapters. Routing data here is just the system prompt + instruction, without the response. To use the faiss index, you specify
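A minimal sketch of that category-based split, assuming each generated record is a dict carrying the "category" field; the mapping from categories to expert names below is purely illustrative, not the actual airoboros mapping:

```python
from collections import defaultdict

def split_by_category(samples):
    """Bucket generated samples into per-expert training sets
    using the 'category' label each instructor attaches."""
    # Hypothetical category -> expert mapping, for illustration only
    expert_for_category = {
        "coding": "code",
        "trivia": "general",
        "plan": "function",
    }
    experts = defaultdict(list)
    for sample in samples:
        category = sample.get("category", "general")
        expert = expert_for_category.get(category, "general")
        experts[expert].append(sample)
    return dict(experts)

samples = [
    {"instruction": "Write a bubble sort.", "category": "coding"},
    {"instruction": "Capital of France?", "category": "trivia"},
]
buckets = split_by_category(samples)
```

Each bucket would then become the fine-tuning set for one LoRA adapter, and (minus responses) its routing data.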
This isn't a perfect solution by any means - an embedding model with a larger context window would be better - but as it turns out, you can actually average the embeddings of multiple chunks and still get reasonable performance, as long as the input document doesn't cover a huge variety of topics:
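A rough sketch of that chunk-averaging idea; the toy encoder below is a stand-in for a real embedding model (which would be limited to e.g. 512 tokens per chunk), and the normalization keeps the averaged vector usable for inner-product/cosine search:

```python
import numpy as np

def embed_document(chunks, embed_fn):
    """Embed each chunk separately, then average into a single
    document vector and re-normalize to unit length."""
    vecs = np.stack([embed_fn(chunk) for chunk in chunks])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def toy_embed(text):
    """Deterministic toy encoder for illustration only."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

doc_vec = embed_document(["first chunk of text", "second chunk"], toy_embed)
```

The averaged vector degrades gracefully when the chunks are topically similar, which is why this works for single-topic documents but not for ones spanning many subjects.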
Yes, there is a "function" adapter, trained on data generated by the "rewoo"-style execution-planning and basic function-calling instructors. The function router is loaded by default as the first adapter and is used as the routing mechanism each time if you specify
I have a vllm inference option, but the output quality is quite low, so something must be off; I haven't had a chance to really dig into it. With vllm, you need to adjust the weights each time an adapter is selected, but again, something is off here, so I wouldn't use this option yet. The other/default API server uses bettertransformers and flash attention to improve inference speed, but it's still fairly slow in comparison (though much higher quality). The last routing option is just manually adding

For this proof-of-concept, the LoRAs are all loaded into vram before the API server starts up. For the 7b model, for example, the base model load consumes ~13.5GB of vram, and with all adapters loaded it consumes ~17.5GB. In practice, this doesn't scale all that well on a single machine; you'd want multiple backend servers with the routing placed in front, each server hosting a handful of adapters. You could also dynamically load the adapters from a very fast memory store instead of caching them in vram, but that would be significantly slower.

I'm hoping to create a hosted airoboros service where people can contribute to creating many of these adapters, so we can indeed have thousands. I can optimize for that use case if and when we get to that point.
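The faiss-style routing step described above can be sketched roughly like this. Plain numpy inner-product search stands in for a real faiss index, and the expert names, per-expert routing matrices, and mean-top-k scoring are illustrative assumptions rather than the actual lmoe implementation:

```python
import numpy as np

def route(query_vec, expert_indexes, k=2):
    """Score each expert by the mean inner product of the query
    against its k most similar routing vectors (system prompt +
    instruction embeddings), then pick the best-scoring expert."""
    scores = {}
    for name, matrix in expert_indexes.items():  # matrix: (n, d) unit rows
        sims = matrix @ query_vec
        top_k = np.sort(sims)[-k:]
        scores[name] = float(top_k.mean())
    return max(scores, key=scores.get)

# Toy routing data: two experts, 2-d embeddings
expert_indexes = {
    "code": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "general": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
choice = route(np.array([1.0, 0.0]), expert_indexes, k=2)
```

In the real setup, each expert's adapter would ship with its routing data, the matrices would live in per-expert faiss indexes, and the winning expert's LoRA would be activated for the request.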
Thank you for taking the time to respond to my question, and for providing the links to the code. You answered all the questions I had. I'm excited to see the future of this project!
Good evening,
I have been interested in using a Mixture of Experts for some time. I built a rudimentary version of this scheme using hierarchical K-means (similar to MoLoRA), but your code is far more advanced. I have several questions about your procedure, if you don't mind.
Regarding the FAISS search option: how are the adapters selected? In my implementation, I performed K-means on the data, generating a set of clusters. The centroids were saved in the FAISS vector store, and the embedding of the query would select the K closest centroids, which were then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts? One thing I struggled with was the size of the embeddings: many embedding models only support a context length of 512, which means large training samples get truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.
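The centroid-selection step described here could be sketched as follows (numpy only, with made-up centroids; in the actual scheme the centroids would come from K-means over embedded training data and be stored in a FAISS index):

```python
import numpy as np

def nearest_centroids(query_vec, centroids, k=2):
    """Return the indices of the k closest cluster centroids by
    L2 distance, i.e. which expert adapters to combine for a query."""
    dists = np.linalg.norm(centroids - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Three toy cluster centroids in 2-d embedding space
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
picked = nearest_centroids(np.array([1.0, 0.5]), centroids, k=2)
```

The selected adapters would then be merged (e.g. by weighted-averaging their LoRA deltas) into a per-query adapter, as in the MoLoRA-style approach mentioned above.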
Regarding the agent-based routing: if I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling over the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Also, what dataset did you use to train your function agent?
Regarding the inference server: how were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.
I will be experimenting with the code over the weekend to understand it better. Thanks for your time!