
Overview of the LMoe Process #26

Closed
psych0v0yager opened this issue Sep 4, 2023 · 2 comments
@psych0v0yager

Good evening,

I have been interested in using a Mixture of Experts for some time. I built a rudimentary version of this scheme using hierarchical K-means (similar to MoLoRA), but your code is far more advanced. I have several questions about your procedure, if you don't mind.

Regarding the FAISS search option: How are the adapters being selected? In my implementation, I performed K-means on the data, generating a set of clusters. The centroids would be saved inside the FAISS vectorstore, and the embeddings of the query would select the K closest centroids, which are then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts? One thing I struggled with was the size of the embeddings. Many embedding models only support a context length of 512, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.

Regarding the Agent-Based Routing: If I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?

Regarding the Inference Server: How were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.

I will be experimenting with the code during the weekend to understand it more. Thanks for your time

@jondurbin
Owner

> Regarding the FAISS search option: How are the adapters being selected? In my implementation, I performed K-means on the data, generating a set of clusters. The centroids would be saved inside the FAISS vectorstore, and the embeddings of the query would select the K closest centroids, which are then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts?

The airoboros dataset generation tool inherently generates many separate types of training data via "instructors". Each instructor has its own prompt/config, and can be used to generate task-specific training data.
https://github.com/jondurbin/airoboros/tree/main/airoboros/instructors

During the dataset generation process, the output data is labeled with a "category" field corresponding to the instructor that generated it. I then just split the data into experts using this category; an example is here:
https://github.com/jondurbin/airoboros/blob/main/scripts/segment_experts.py
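
Roughly, the splitting step looks something like the sketch below. This is a simplified illustration, not the actual script: the file names and the category-to-expert mapping here are made up.

```python
# Simplified sketch of splitting labeled instructions into per-expert files
# by the "category" field.  See scripts/segment_experts.py for the real logic;
# the mapping and file names below are placeholders.
import json
from collections import defaultdict

CATEGORY_TO_EXPERT = {  # hypothetical mapping for illustration
    "coding": "code",
    "orca": "reasoning",
    "roleplay": "creative",
}

by_expert = defaultdict(list)
with open("instructions.jsonl") as infile:
    for line in infile:
        item = json.loads(line)
        expert = CATEGORY_TO_EXPERT.get(item["category"], "general")
        by_expert[expert].append(item)

for expert, items in by_expert.items():
    with open(f"expert_{expert}.jsonl", "w") as outfile:
        for item in items:
            outfile.write(json.dumps(item) + "\n")
```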

One of the routing options is faiss index search, which requires packaging up the fine-tuning data used to train each expert with the adapter. The lmoe package has routing data, training data, and adapters. Routing data here is just the system prompt + instruction, without response.
https://huggingface.co/jondurbin/airoboros-lmoe-70b-2.1/tree/main

To use the faiss index, you specify --router-max-samples (how many random samples from the routing data to include in each expert's index; higher values produce better results but are slower to load) and --router-k, which is the k in the approximate kNN search. The input system prompt + instruction is used to search against the faiss indices, and the average distance from the kNN search is the selection mechanism: the lowest score (most similar) wins.
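
In rough pseudocode, the routing idea looks like this. This is a simplified sketch, not the exact airoboros code; the embedding model used here is just a placeholder.

```python
# Sketch of the faiss routing described above: one index per expert built from
# routing-data samples, and the query's average k-NN distance picks the winner.
# The embedding model is an illustrative placeholder, not the one airoboros uses.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def build_index(routing_samples, max_samples=1000):
    # --router-max-samples caps how many routing samples go into each index.
    vectors = embedder.encode(routing_samples[:max_samples]).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index

def route(system_prompt, instruction, indices, k=25):
    # --router-k is the k in the k-NN search.
    query = embedder.encode([f"{system_prompt}\n{instruction}"]).astype("float32")
    scores = {}
    for expert, index in indices.items():
        distances, _ = index.search(query, k)
        scores[expert] = float(distances.mean())
    # Lowest mean distance (most similar routing data) wins.
    return min(scores, key=scores.get)
```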

> One thing I struggled with was the size of the embeddings. Many embedding models only support a context length of 512, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.

This isn't a perfect solution by any means (an embedding model with a larger context window would be better), but it turns out you can average the embeddings of multiple chunks and still get reasonable performance, as long as the input document doesn't cover a huge variety of topics:
https://github.com/jondurbin/airoboros/blob/main/airoboros/embeddings.py
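
The core idea is just mean-pooling chunk embeddings into a single vector, roughly like the sketch below. The character-based chunking and the embedding model are illustrative choices, not the actual implementation in embeddings.py.

```python
# Sketch of averaging chunk embeddings to work around a small context window.
# Chunk size and embedding model are placeholders; see airoboros/embeddings.py
# for the real implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_long_text(text, chunk_chars=1500):
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)] or [""]
    vectors = embedder.encode(chunks)
    # Mean-pool the per-chunk embeddings into one document-level vector.
    return np.mean(vectors, axis=0)
```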

> Regarding the Agent-Based Routing: If I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?

Yes, there is a "function" adapter, trained on data generated by the "rewoo"-style execution planning and basic function-calling instructors. The function router is loaded by default as the first adapter and is used as the routing mechanism on each request if you specify --agent-router. The dataset is custom synthetic data generated by this tool.
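
Conceptually, the flow is something like the sketch below: activate the "function" adapter, ask it which expert should handle the request, then activate that expert's adapter for the actual generation. The prompt format and expert names here are made up for illustration and are not the actual airoboros prompts.

```python
# Rough sketch of agent-based routing: the "function" adapter picks an expert,
# then that expert's adapter is activated for the real response.  Prompt text
# and expert names are hypothetical.
EXPERTS = ["code", "reasoning", "creative", "general"]

def agent_route(model, tokenizer, instruction):
    model.set_adapter("function")  # routing adapter is loaded first / by default
    prompt = (
        "Select the best expert for the following request.\n"
        f"Options: {', '.join(EXPERTS)}\n"
        f"Request: {instruction}\nExpert:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    choice = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    expert = choice if choice in EXPERTS else "general"
    model.set_adapter(expert)  # swap to the selected expert for the actual response
    return expert
```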

> Regarding the Inference Server: How were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.

I have a vllm inference option, but the output quality is quite low, so something must be off; I haven't had a chance to really dig into it.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/vllm.py

With vllm, you need to adjust the weights each time an adapter is selected, but again, something is off here, so I wouldn't use this option yet.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/lora.py
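
"Adjusting the weights" here means applying the standard LoRA math: merge the low-rank delta into the base weight for the selected adapter and subtract it again before swapping to the next one. The sketch below shows that general technique, not necessarily the exact code in lmoe/lora.py.

```python
# Generic sketch of merging/unmerging a LoRA delta into a base weight tensor
# (e.g. layer.weight.data), following W' = W + (alpha / r) * B @ A.
import torch

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    # weight: (out, in); lora_a: (rank, in); lora_b: (out, rank)
    scaling = alpha / rank
    weight += scaling * (lora_b @ lora_a)
    return weight

def unmerge_lora(weight, lora_a, lora_b, alpha, rank):
    # Undo the merge before activating a different adapter.
    scaling = alpha / rank
    weight -= scaling * (lora_b @ lora_a)
    return weight
```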

The other/default API server uses bettertransformers and flash attention to improve inference speed, but it's still fairly slow compared to vllm (though with much higher output quality).

The last routing option is just manually adding "expert": "{expert}" in the JSON payload if you know ahead of time which adapter you'd like to use.
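
For example, a request with a manually pinned expert might look like this. The endpoint path, port, model name, and expert name below are assumptions for illustration, not documented values.

```python
# Hypothetical request showing manual expert selection via the "expert" field.
import requests

payload = {
    "model": "airoboros-lmoe-7b-2.1",          # placeholder model name
    "expert": "code",                            # pin the adapter instead of routing
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json())
```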

For this proof-of-concept, the LoRAs are all loaded into vram before the API server actually starts up. For the 7b model, for example, the base model load consumes ~13.5GB vram, and with all adapters loaded into vram it consumes ~17.5GB. In practice, this doesn't scale all that well on a single machine; you'd want multiple backend servers with the routing placed in front of them, each server hosting a handful of adapters. You could also dynamically load the adapters from a very fast memory store instead of caching them in vram, but it would be significantly slower.
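
Pre-loading all adapters can be done with peft along the lines of the sketch below; the model and adapter paths and names are placeholders, not the actual repository layout.

```python
# Sketch of pre-loading every adapter into memory before the server starts,
# using peft.  Paths and adapter names are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
# The first adapter creates the PeftModel; the rest are attached to it.
model = PeftModel.from_pretrained(base, "adapters/function", adapter_name="function")
for name in ["code", "reasoning", "creative", "general"]:
    model.load_adapter(f"adapters/{name}", adapter_name=name)

# At request time, the router just activates the chosen adapter:
model.set_adapter("code")
```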

I'm hoping to create a hosted airoboros service where people can contribute to creating many of these adapters so we can indeed have thousands. I can optimize for that use case if and when we get to that point.

@psych0v0yager
Author

Thank you for taking the time to respond to my question, as well as providing the links to the code.

You answered all the questions that I had. I am excited to see the future of this project!
