
Overview of the LMoe Process #26

Closed
psych0v0yager opened this issue Sep 4, 2023 · 2 comments
@psych0v0yager

Good evening,

I have been interested in using a Mixture of Experts for some time. I built a rudimentary version of this scheme using hierarchical K-means (similar to MoLoRA), but your code is far more advanced. I have several questions about your procedure, if you don't mind.

Regarding the FAISS search option: How are the adapters being selected? In my implementation, I performed K-means on the data, generating a set of clusters. The centroids would be saved inside the FAISS vectorstore, and the embeddings of the query would select the K closest centroids, which are then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts? One thing I struggled with was the size of the embeddings. Many embedding models only support a context length of 512, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.

Regarding the Agent-Based Routing: If I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?

Regarding the Inference Server: How were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.

I will be experimenting with the code during the weekend to understand it more. Thanks for your time

@jondurbin
Owner

> Regarding the FAISS search option: How are the adapters being selected? In my implementation, I performed K-means on the data, generating a set of clusters. The centroids would be saved inside the FAISS vectorstore, and the embeddings of the query would select the K closest centroids, which are then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts?

The airoboros dataset generation tool inherently generates many separate types of training data via "instructors". Each instructor has its own prompt/config, and can be used to generate task-specific training data.
https://github.com/jondurbin/airoboros/tree/main/airoboros/instructors

During the dataset generation process, the output data is labeled with a "category" field corresponding to the instructor that generated it. I then just split the data into experts using this category; an example is here:
https://github.com/jondurbin/airoboros/blob/main/scripts/segment_experts.py
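
Roughly, the splitting step looks something like the sketch below. This is a simplified illustration, not the actual script: the file names and the category-to-expert mapping here are made up.

```python
# Simplified sketch of splitting labeled instructions into per-expert files
# by the "category" field.  See scripts/segment_experts.py for the real logic;
# the mapping and file names below are placeholders.
import json
from collections import defaultdict

CATEGORY_TO_EXPERT = {  # hypothetical mapping for illustration
    "coding": "code",
    "orca": "reasoning",
    "roleplay": "creative",
}

by_expert = defaultdict(list)
with open("instructions.jsonl") as infile:
    for line in infile:
        item = json.loads(line)
        expert = CATEGORY_TO_EXPERT.get(item["category"], "general")
        by_expert[expert].append(item)

for expert, items in by_expert.items():
    with open(f"expert_{expert}.jsonl", "w") as outfile:
        for item in items:
            outfile.write(json.dumps(item) + "\n")
```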

One of the routing options is faiss index search, which requires packaging up the fine-tuning data used to train each expert with the adapter. The lmoe package has routing data, training data, and adapters. Routing data here is just the system prompt + instruction, without response.
https://huggingface.co/jondurbin/airoboros-lmoe-70b-2.1/tree/main

To use the faiss index, you specify --router-max-samples (how many random samples from the routing data to include in each expert's index; higher values produce better results but are slower to load) and --router-k, which is the k in the approximate kNN search. The input system prompt + instruction is used to search against the faiss indices, and the average distance from the kNN search is the selection mechanism: the lowest score (most similar) wins.
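
In rough pseudocode, the routing idea looks like this. This is a simplified sketch, not the exact airoboros code; the embedding model used here is just a placeholder.

```python
# Sketch of the faiss routing described above: one index per expert built from
# routing-data samples, and the query's average k-NN distance picks the winner.
# The embedding model is an illustrative placeholder, not the one airoboros uses.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def build_index(routing_samples, max_samples=1000):
    # --router-max-samples caps how many routing samples go into each index.
    vectors = embedder.encode(routing_samples[:max_samples]).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index

def route(system_prompt, instruction, indices, k=25):
    # --router-k is the k in the k-NN search.
    query = embedder.encode([f"{system_prompt}\n{instruction}"]).astype("float32")
    scores = {}
    for expert, index in indices.items():
        distances, _ = index.search(query, k)
        scores[expert] = float(distances.mean())
    # Lowest mean distance (most similar routing data) wins.
    return min(scores, key=scores.get)
```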

> One thing I struggled with was the size of the embeddings. Many embedding models only support a context length of 512, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.

This isn't a perfect solution by any means (an embedding model with a larger context window would be better), but it turns out you can average the embeddings of multiple chunks and still get reasonable performance, as long as the input document doesn't cover a huge variety of topics:
https://github.com/jondurbin/airoboros/blob/main/airoboros/embeddings.py
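
The core idea is just mean-pooling chunk embeddings into a single vector, roughly like the sketch below. The character-based chunking and the embedding model are illustrative choices, not the actual implementation in embeddings.py.

```python
# Sketch of averaging chunk embeddings to work around a small context window.
# Chunk size and embedding model are placeholders; see airoboros/embeddings.py
# for the real implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_long_text(text, chunk_chars=1500):
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)] or [""]
    vectors = embedder.encode(chunks)
    # Mean-pool the per-chunk embeddings into one document-level vector.
    return np.mean(vectors, axis=0)
```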

> Regarding the Agent-Based Routing: If I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?

Yes, there is a "function" adapter, trained on data generated by the "rewoo"-style execution planning and basic function-calling instructors. The function router is loaded by default as the first adapter and is used as the routing mechanism on each request if you specify --agent-router. The dataset is custom synthetic data generated by this tool.
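
Conceptually, the flow is something like the sketch below: activate the "function" adapter, ask it which expert should handle the request, then activate that expert's adapter for the actual generation. The prompt format and expert names here are made up for illustration and are not the actual airoboros prompts.

```python
# Rough sketch of agent-based routing: the "function" adapter picks an expert,
# then that expert's adapter is activated for the real response.  Prompt text
# and expert names are hypothetical.
EXPERTS = ["code", "reasoning", "creative", "general"]

def agent_route(model, tokenizer, instruction):
    model.set_adapter("function")  # routing adapter is loaded first / by default
    prompt = (
        "Select the best expert for the following request.\n"
        f"Options: {', '.join(EXPERTS)}\n"
        f"Request: {instruction}\nExpert:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    choice = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    expert = choice if choice in EXPERTS else "general"
    model.set_adapter(expert)  # swap to the selected expert for the actual response
    return expert
```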

> Regarding the Inference Server: How were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.

I have a vllm inference option, but the output quality is quite low, so something must be off; I haven't had a chance to really dig into it.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/vllm.py

With vllm, you need to adjust the weights each time an adapter is selected, but again, something is off here, so I wouldn't use this option yet.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/lora.py
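
"Adjusting the weights" here means applying the standard LoRA math: merge the low-rank delta into the base weight for the selected adapter and subtract it again before swapping to the next one. The sketch below shows that general technique, not necessarily the exact code in lmoe/lora.py.

```python
# Generic sketch of merging/unmerging a LoRA delta into a base weight tensor
# (e.g. layer.weight.data), following W' = W + (alpha / r) * B @ A.
import torch

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    # weight: (out, in); lora_a: (rank, in); lora_b: (out, rank)
    scaling = alpha / rank
    weight += scaling * (lora_b @ lora_a)
    return weight

def unmerge_lora(weight, lora_a, lora_b, alpha, rank):
    # Undo the merge before activating a different adapter.
    scaling = alpha / rank
    weight -= scaling * (lora_b @ lora_a)
    return weight
```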

The other/default API server uses bettertransformers and flash attention to improve inference speed, but it's still fairly slow compared to vllm (though with much higher output quality).

The last routing option is just manually adding "expert": "{expert}" in the JSON payload if you know ahead of time which adapter you'd like to use.
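
For example, a request with a manually pinned expert might look like this. The endpoint path, port, model name, and expert name below are assumptions for illustration, not documented values.

```python
# Hypothetical request showing manual expert selection via the "expert" field.
import requests

payload = {
    "model": "airoboros-lmoe-7b-2.1",          # placeholder model name
    "expert": "code",                            # pin the adapter instead of routing
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json())
```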

For this proof-of-concept, the LoRAs are all loaded into vram before the API server actually starts up. For the 7b model, for example, the base model load consumes ~13.5GB vram, and with all adapters loaded into vram it consumes ~17.5GB. In practice, this doesn't scale all that well on a single machine; you'd want multiple backend servers with the routing placed in front of them, each server hosting a handful of adapters. You could also dynamically load the adapters from a very fast memory store instead of caching them in vram, but it would be significantly slower.
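
Pre-loading all adapters can be done with peft along the lines of the sketch below; the model and adapter paths and names are placeholders, not the actual repository layout.

```python
# Sketch of pre-loading every adapter into memory before the server starts,
# using peft.  Paths and adapter names are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
# The first adapter creates the PeftModel; the rest are attached to it.
model = PeftModel.from_pretrained(base, "adapters/function", adapter_name="function")
for name in ["code", "reasoning", "creative", "general"]:
    model.load_adapter(f"adapters/{name}", adapter_name=name)

# At request time, the router just activates the chosen adapter:
model.set_adapter("code")
```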

I'm hoping to create a hosted airoboros service where people can contribute to creating many of these adapters so we can indeed have thousands. I can optimize for that use case if and when we get to that point.

@psych0v0yager
Author

Thank you for taking the time to respond to my question, as well as providing the links to the code.

You answered all the questions that I had. I am excited to see the future of this project!
