
LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4 - Predibase #645

irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
base-model llm base models not finetuned for chat code-generation code generation models and tools like copilot and aider finetuning Tools for finetuning of LLMs e.g. SFT or RLHF llm-applications Topics related to practical applications of Large Language Models in various fields llm-benchmarks testing and benchmarking large language models New-Label Choose this option if the existing labels are insufficient to describe the content accurately openai OpenAI APIs, LLMs, Recipes and Evals

Comments


LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4 - Predibase

DESCRIPTION:
TL;DR: We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. LoRA Land’s 25 task-specialized large language models (LLMs) were all fine-tuned with Predibase for less than $8.00 each on average and are all served from a single A100 GPU using LoRAX, our open source framework that allows users to serve hundreds of adapter-based fine-tuned models on a single GPU. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
Join our webinar on February 29th to learn more!

LLM Benchmarks: 25 fine-tuned Mistral-7b adapters that outperform GPT-4.

The Need for Efficient Fine-Tuning and Serving
With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs) and the emergence of large language models (LLMs) with billions of parameters, it has become increasingly challenging to adapt them to specific downstream tasks, especially in environments with limited computational resources or budgets. Parameter Efficient Fine-Tuning (PEFT) and Quantized Low Rank Adaptation (QLoRA) offer an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning.
Predibase has incorporated these best practices into its fine-tuning platform and, to demonstrate the accessibility and affordability of adapter-based fine-tuning of open-source LLMs, has fine-tuned 25 models for less than $8 each on average in terms of GPU costs.
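The LoRA idea underpinning this can be sketched in a few lines: instead of updating a full weight matrix during fine-tuning, train a low-rank pair of matrices and add their (scaled) product back onto the frozen weight. A minimal, framework-free sketch; the dimensions and rank here are illustrative, not Predibase's settings:

```python
import numpy as np

# Minimal sketch of the LoRA idea (not Predibase's implementation):
# instead of updating a d_out x d_in weight W, train two small matrices
# B (d_out x r) and A (r x d_in) and use W' = W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, low rank
B = np.zeros((d_out, r))               # trainable, initialized to zero

W_eff = W + (alpha / r) * B @ A        # effective weight after merging

# With B initialized to zero, the adapter starts as a no-op:
assert np.allclose(W_eff, W)

# Parameter savings: full fine-tune vs. adapter
full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.1f}%)")
```

At realistic model dimensions (thousands rather than 64) and small ranks, the adapter's share of trainable parameters falls well below 1%, which is what makes fine-tuning this cheap.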
Fine-tuned LLMs have historically also been very expensive to put into production and serve, requiring dedicated GPU resources for each fine-tuned model. For teams that plan on deploying multiple fine-tuned models to address a range of use cases, these GPU expenses can often be a bottleneck for innovation. LoRAX, the open-source platform for serving fine-tuned LLMs developed by Predibase, enables teams to deploy hundreds of fine-tuned LLMs for the cost of one from a single GPU.

URL: https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4

Suggested labels

{'label-name': 'adapter-based-fine-tuning', 'label-description': 'Efficient approach to fine-tuning large language models using adapters', 'gh-repo': 'https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4', 'confidence': 64.54}


Related issues

#505: LoRAX: Dynamic loading and optimized inference of LoRA adapter models.

### Details
Similarity score: 0.91 · [LoRAX Docs](https://predibase.github.io/lorax/?h=cpu#features)

LoRAX Docs

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

📖 What is LoRAX?

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

🌳 Features

  • 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter in your request; it is loaded just-in-time without blocking concurrent requests.
  • 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  • 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
  • 👬 Optimized Inference: high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, token streaming.
  • 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation.
  • 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.
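Dynamic adapter loading is driven entirely by the request: the client names an adapter and LoRAX loads it just-in-time. A minimal sketch of such a request, following the shape of LoRAX's `/generate` API; the endpoint address and the adapter name `predibase/customer_support` are hypothetical placeholders:

```python
import json
import urllib.request

# Sketch of a LoRAX request that targets a specific fine-tuned adapter.
# The server URL and adapter_id below are illustrative placeholders.
payload = {
    "inputs": "Classify the sentiment: 'The product arrived broken.'",
    "parameters": {
        "max_new_tokens": 64,
        "adapter_id": "predibase/customer_support",  # hypothetical adapter
    },
}
req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; left unsent in this sketch.
print(json.dumps(payload, indent=2))
```

Because the adapter is named per request, hundreds of fine-tunes can share one deployment; requests for different adapters are batched together by the server.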

URL: https://predibase.github.io/lorax/?h=cpu#features

Suggested labels

{ "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }

#636: S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe

### Details
Similarity score: 0.9 · [S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe](https://openpipe.ai/blog/s-lora)

S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe

DESCRIPTION:
S-LoRA describes a set of optimizations for running thousands of separate LLMs simultaneously on the same GPU. At OpenPipe we’ve been running S-LoRA in production since January 4th, which critically allowed us to eliminate the cold-start problem for infrequently-used models. I wanted to share some of our learnings from the implementation process here!
But first, the headline result: after enabling the S-LoRA based pipeline, our average cold-start response time dropped to roughly one second (chart omitted from this excerpt).

The Problem of Weights
Modern LLMs require a lot of GPU RAM. A “small” model like Mistral 7B requires 14GB of RAM just to hold the weights, in addition to the working memory required for the KV cache, which can be multiple GB for long sequences. This means that even a very beefy GPU like an A100-40GB only has room to load one or maybe two 7B LLMs in RAM at once. Quantization can reduce the required memory, but it also leads to decreased throughput, and often hurts response quality as well.
This is not really a problem if you’re using one general-purpose model for everything, and just steering its behavior via prompting. In that case you can just load up your model on one GPU and call it a day. But fine-tuning is a far more reliable way of directing model behavior than prompting. Concretely, we’ve found that 7B models fine-tuned on a good dataset consistently outperform prompted GPT-3.5 (20B parameters), and even come within striking distance of GPT-4 (1.7T parameters)!
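As a quick sanity check of the 14 GB figure quoted above:

```python
# Back-of-envelope check: a 7B-parameter model in 16-bit precision
# needs about 14 GB just to hold its weights.
params = 7.0e9
bytes_per_param = 2  # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")
```

That is before any KV cache or activation memory, which is why even an A100-40GB fits only one or two such models.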

The downside, of course, is that now you have to figure out how to serve all those task-specific fine-tuned models efficiently. Spinning up a dedicated GPU for each model is a non-starter because it leads to low GPU utilization, which is an existential issue because of how expensive GPU time is ($2+/hr for an A100). How do we square the circle?

Serving all the models everywhere all at once
First, a bit of background: in 2021 a new fine-tuning method called LoRA was published. The key insight is that fine-tuning only a tiny fraction of the base model’s weights can give you similar results to fine-tuning all of them, since you want your fine-tuned model to keep most of the world understanding and reasoning ability of its base. The LoRA technique involves cleverly inserting extra adapter layers in a few carefully-selected locations and only fine-tuning those. These adapters are analogous to a “git diff” that encodes only the difference in weights between the base model and your fine-tune.
These adapters can be tiny. In OpenPipe’s case, our Mistral adapters are 80MB each, only 0.5% the size of the 14GB base model. This immediately points to the shape of the solution: is it possible to load many adapters from the same base model onto one GPU and use them simultaneously, efficiently?
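The ~80 MB figure is plausible from first principles. A rough estimate of LoRA adapter size for Mistral-7B's attention projections; the rank and target modules here are assumptions for illustration, not OpenPipe's published configuration:

```python
# Estimated LoRA parameter count for rank-r adapters on Mistral-7B's
# attention projections. Dimensions reflect Mistral-7B's architecture
# (hidden size 4096, grouped-query attention with 1024-dim K/V, 32 layers);
# the rank r = 64 is an assumption.
hidden, kv_dim, layers, r = 4096, 1024, 32, 64

# q_proj/o_proj are hidden x hidden; k_proj/v_proj are kv_dim x hidden (GQA).
per_layer = (
    2 * r * (hidden + hidden)    # q_proj, o_proj
    + 2 * r * (hidden + kv_dim)  # k_proj, v_proj
)
total = per_layer * layers
size_mb = total * 2 / 1e6        # fp16, 2 bytes per parameter
print(f"{total / 1e6:.1f}M params ≈ {size_mb:.0f} MB")
```

This lands in the same order of magnitude as the ~80 MB quoted above; the exact size depends on the chosen rank and which modules get adapters.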
It turns out the answer is “yes”! Two influential papers from late 2023 help define the solution.
Punica implements a clever CUDA kernel that is able to batch-process requests from many LoRA adapters simultaneously. This custom kernel is essential, because the naive approach taken by most libraries pre-Punica required swapping adapters for each request, eliminating the critical throughput increases from serving many requests in parallel.
S-LoRA builds on Punica and adds a tiered caching architecture. It dynamically stores the most-recently-used adapters in GPU RAM, less-recently-used adapters in system RAM, and the least-recently-used adapters on disk. For a typical setup with 10GB of available GPU RAM and 1TB of system RAM, S-LoRA might store 125 adapters on the GPU and over 10K in system RAM. The overhead of restoring an adapter from system RAM to the GPU is negligible in practice; an A100 has 31GB/s of interconnect bandwidth so an 80MB adapter can be transferred in 2.4ms. This can happen in parallel with serving other requests.
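The tiered caching idea can be sketched as a bounded LRU whose evictions fall through to a larger host-memory tier. This is illustrative only; the real system moves GPU tensors, not strings:

```python
from collections import OrderedDict

# Minimal sketch of S-LoRA-style tiered adapter caching: a bounded LRU
# of "GPU-resident" adapters backed by a larger "host RAM" tier.
class TieredAdapterCache:
    def __init__(self, gpu_slots):
        self.gpu = OrderedDict()  # adapter_id -> weights, in LRU order
        self.host = {}            # overflow tier ("system RAM")
        self.gpu_slots = gpu_slots

    def fetch(self, adapter_id, load_fn):
        if adapter_id in self.gpu:            # hot: already "on GPU"
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]
        weights = self.host.pop(adapter_id, None)
        if weights is None:                   # cold: load from "disk"
            weights = load_fn(adapter_id)
        if len(self.gpu) >= self.gpu_slots:   # evict LRU adapter to host
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.host[evicted_id] = evicted
        self.gpu[adapter_id] = weights
        return weights

cache = TieredAdapterCache(gpu_slots=2)
for name in ["a", "b", "a", "c", "b"]:
    cache.fetch(name, load_fn=lambda n: f"weights::{n}")
print(sorted(cache.gpu), sorted(cache.host))  # -> ['b', 'c'] ['a']
```

Because promotions from host to GPU take only milliseconds, the working set can be far larger than GPU memory without user-visible cold starts.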
This actually works!
On January 4th we deployed an experimental inference pipeline based on a vLLM fork that implements the relevant optimizations. After manually moving a few models over and closely monitoring performance, we enabled the pipeline for all new models on January 10th, and began porting over old models as well.
Over the course of this transition, the average number of GPUs in use has dropped by over 70%, even as the number of requests we serve has continued increasing! Our average response time for models coming up from a cold start (i.e., weights not already loaded onto a GPU) decreased from 45 seconds to 1 second, giving customers a lot more flexibility to deploy many small specialist models. And ultimately, that’s exactly what we’re here to do. 🙂

URL: https://openpipe.ai/blog/s-lora

Suggested labels

{'label-name': 'GPU-Optimization', 'label-description': 'Optimizing GPU resource utilization for running multiple models efficiently on a single GPU.', 'gh-repo': 'openpipe/openpipe-ai', 'confidence': 54.2}

#494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details
Similarity score: 0.87 · [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)

Awesome-Efficient-LLM

A curated list for Efficient Large Language Models:


Inference Acceleration


Updates

  • Sep 27, 2023: Add tag for papers accepted at NeurIPS'23.
  • Sep 6, 2023: Add a new subdirectory project/ to organize those projects designed for developing a lightweight LLM.
  • July 11, 2023: Create a new subdirectory efficient_plm/ for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.

Contributing

If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I would add your paper to the list at my earliest convenience.

Suggested labels

{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

#628: LLaVA/README.md at main · haotian-liu/LLaVA

### Details
Similarity score: 0.87 · [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1)

LLaVA/README.md at main · haotian-liu/LLaVA

🌋 LLaVA: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

📢 LLaVA-NeXT Blog Project Page Demo Data Model Zoo

🤝Community Contributions: llama.cpp Colab 🤗Space Replicate AutoGen BakLLaVA

Improved Baselines with Visual Instruction Tuning Paper HF

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) Paper HF

Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Release

  • [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.
  • [11/10] LLaVA-Plus is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). Project Page Demo Code Paper
  • [11/2] LLaVA-Interactive is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Project Page Demo Code Paper
  • [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts) (script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
  • [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! 🤗 Demo
  • [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here.
  • [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project LLaVA-RLHF.
  • [9/22] LLaVA is accepted by NeurIPS 2023 as oral presentation, and LLaVA-Med is accepted by NeurIPS 2023 Datasets and Benchmarks Track as spotlight presentation.
More
  • [11/6] Support Intel dGPU and CPU platforms. More details here.
  • [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support!
  • [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
  • [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5.
  • [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants".

  • [7/19] We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
  • [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out Slides Notes YouTube Bilibili.
  • [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations here.
  • [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the paper and page.
  • [5/6] We are releasing LLaVA-Lightning-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
  • [5/2] We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
  • [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here.
  • [4/17] We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the paper and demo.

Code License

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Suggested labels

#174: SparseLLM/ReluLLaMA-7B · Powerinfer - faster CPU inference

### Details
Similarity score: 0.86 · [SparseLLM/ReluLLaMA-7B · Hugging Face](https://huggingface.co/SparseLLM/ReluLLaMA-7B)

ReluLLaMA-7B

Model creator: Meta
Original model: Llama 2 7B
Fine-tuned by: THUNLP and ModelBest

Background

Sparse computation is increasingly recognized as an important direction in enhancing the computational efficiency of large language models (LLMs). Among various approaches, the mixture-of-experts (MoE) method, exemplified by models like Mixtral, has shown particular promise. MoE works by selectively activating different model components (experts), thus optimizing resource usage.

Recent studies (Zhang et al., 2021; Liu et al., 2023; Mirzadeh et al., 2023) reveal that LLMs inherently exhibit properties conducive to sparse computation when employing the ReLU activation function. This insight opens up new avenues for model efficiency, akin to MoE's selective activation. By dynamically choosing model parameters for computation, we can substantially boost efficiency.

However, the widespread adoption of ReLU-based models in the LLM field remains limited. Referring to the transformation methods from existing works (Zhang et al., 2021; Mirzadeh et al., 2023), we convert existing models to ReLU-activated versions through fine-tuning. We hope these open-source ReLU LLMs could promote the development of sparse LLMs.
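The connection between ReLU and sparse computation is easy to see empirically: for roughly zero-centered pre-activations, ReLU zeroes out about half of them, and the corresponding rows and columns of downstream weight matrices can be skipped. A quick illustration:

```python
import numpy as np

# Why ReLU helps sparse computation: after ReLU, a large fraction of
# activations are exactly zero, so the matching parts of the next
# weight matrix need not be touched at all.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=100_000)  # roughly zero-centered
post = np.maximum(pre_activations, 0.0)     # ReLU
sparsity = np.mean(post == 0.0)
print(f"{sparsity:.0%} of activations are zero")
```

Real trained networks often show even higher activation sparsity than this symmetric toy input, which is what sparse-inference engines exploit.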

#311: Introduction | Mistral AI Large Language Models

### Details
Similarity score: 0.86 · [Introduction | Mistral AI Large Language Models](https://docs.mistral.ai/)

Mistral AI currently provides two types of access to Large Language Models:

  • An API providing pay-as-you-go access to our latest models,
  • Open source models available under the Apache 2.0 License, available on Hugging Face or directly from the documentation.
Where to start?

API Access

Our API is currently in beta to ramp up the load and provide good quality of service. Access the platform to join the waitlist. Once your subscription is active, you can immediately use our chat endpoint:

curl --location "https://api.mistral.ai/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header "Authorization: Bearer $MISTRAL_API_KEY" \
  --data '{
    "model": "mistral-tiny",
    "messages": [{"role": "user", "content": "Who is the most renowned French painter?"}]
  }'

Or our embeddings endpoint:

curl --location "https://api.mistral.ai/v1/embeddings" \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header "Authorization: Bearer $MISTRAL_API_KEY" \
  --data '{
    "model": "mistral-embed",
    "input": ["Embed this sentence.", "As well as this one."]
  }'

For a full description of the models offered on the API, head on to the model docs.

For more examples on how to use our platform, head on to our platform docs.

Raw model weights

Raw model weights can be used in several ways:

  • For self-deployment, on cloud or on premise, using either TensorRT-LLM or vLLM, head on to Deployment.
  • For research, head on to our reference implementation repository.
  • For local deployment on consumer grade hardware, check out the llama.cpp project or Ollama.
Get Help

Join our Discord community to discuss our models and talk to our engineers. Alternatively, reach out to our business team if you have enterprise needs, want more information about our products or if there are missing features you would like us to add.

Contributing

Mistral AI is committed to open source software development and welcomes external contributions. Please open a PR!

Suggested labels

{ "key": "llm-api", "value": "Accessing Large Language Models through the Mistral AI API" }
