Merged
5 changes: 5 additions & 0 deletions .gitignore
@@ -396,3 +396,8 @@ FodyWeavers.xsd

# JetBrains Rider
*.sln.iml
*.egg-info

# build
build/*
dist/*
121 changes: 121 additions & 0 deletions DOCUMENT.md
@@ -0,0 +1,121 @@
# LLMLingua Documentation

## Principles

- The most important principle is that **sensitivity to compression varies across the components of a prompt**: instructions and questions are more sensitive, while context or documents are less so. It is therefore advisable to separate the components of the prompt and pass them to the compressor as demonstrations, instruction, and question.
- **Divide demonstrations and context into independent granularities**, such as documents in multi-document QA or examples in few-shot learning. This benefits both the budget controller and document reordering.
- **Preserving essential characters required by scenario-specific rules** is not yet available; support will be provided soon.
- Experiment with different target compression ratios and other hyperparameters to optimize performance.
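As a concrete illustration of the first two principles, a raw prompt can be sliced into instruction, context granularities, and question before compression. The splitter below is a hypothetical sketch (paragraph-based splitting is an assumption for illustration, not part of the library):

```python
def split_prompt(raw_prompt: str) -> dict:
    """Illustrative splitter: treat the first paragraph as the instruction,
    the last paragraph as the question, and everything in between as
    independent context granularities (documents or demonstrations)."""
    parts = [p for p in raw_prompt.split("\n\n") if p.strip()]
    return {
        "instruction": parts[0],
        "context": parts[1:-1],  # one entry per document/demonstration
        "question": parts[-1],
    }

prompt = "Answer the question using the documents.\n\nDoc 1 ...\n\nDoc 2 ...\n\nQ: Who wrote Doc 1?"
components = split_prompt(prompt)
```

Each entry of `components["context"]` can then be passed as one element of the `context` list, keeping the more compression-sensitive instruction and question separate.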

## Initialization

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
model_name: str = "NousResearch/Llama-2-7b-hf",
device_map: str = "cuda",
use_auth_token: bool = False,
open_api_config: dict = {},
)
```
### Parameters

- model_name(str), the name of the small language model from Hugging Face. Default set to "NousResearch/Llama-2-7b-hf";
- device_map(str), the device on which to run the small model, e.g. 'cuda', 'cpu', 'balanced', 'balanced_low_0', 'auto'. Default set to "cuda";
- use_auth_token(bool, optional), whether to use a Hugging Face auth token. Default set to False;
- open_api_config(dict, optional), the OpenAI configuration used by the OpenAI embedding ranker in coarse-level prompt compression. Default set to {};

## Function Call

```python
compressed_prompt = llm_lingua.compress_prompt(
context: List[str],
instruction: str = "",
question: str = "",
ratio: float = 0.5,
target_token: float = -1,
iterative_size: int = 200,
force_context_ids: List[int] = None,
force_context_number: int = None,
use_sentence_level_filter: bool = False,
use_context_level_filter: bool = True,
use_token_level_filter: bool = True,
keep_split: bool = False,
keep_first_sentence: int = 0,
keep_last_sentence: int = 0,
keep_sentence_number: int = 0,
high_priority_bonus: int = 100,
context_budget: str = "+100",
token_budget_ratio: float = 1.4,
condition_in_question: str = "none",
reorder_context: str = "original",
dynamic_context_compression_ratio: float = 0.0,
condition_compare: bool = False,
add_instruction: bool = False,
rank_method: str = "longllmlingua",
concate_question: bool = True,
)

# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
# 'origin_tokens': 2365,
# 'compressed_tokens': 211,
# 'ratio': '11.2x',
# 'saving': ', Saving $0.1 in GPT-4.'}
```
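The reported `ratio` field is simply the original token count divided by the compressed token count, which you can verify from the response yourself:

```python
# Values taken from the example response above.
response = {"origin_tokens": 2365, "compressed_tokens": 211}

# Recompute the compression ratio reported as the 'ratio' field.
actual_ratio = response["origin_tokens"] / response["compressed_tokens"]
print(f"{actual_ratio:.1f}x")  # → 11.2x
```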

### Parameters

- **context**(str or List[str]), the context, documents, or demonstrations in the prompt; low sensitivity to compression;
- **instruction**(str), the general instruction in the prompt, placed before the context; high sensitivity to compression;
- **question**(str), the general question in the prompt, placed after the context; high sensitivity to compression;
- **ratio**(float, optional), the target compression ratio; the larger the value, the fewer tokens will be retained; mutually exclusive with **target_token**; default set to 0.5;
- **target_token**(float), the target number of tokens after compression; mutually exclusive with **ratio**; default set to -1;
- **iterative_size**(int), the segment size used in Iterative Token-level Prompt Compression; default set to 200;
- **force_context_ids**(List[int], optional), the indices of the **context** entries to retain forcefully; default set to None;
- **force_context_number**(int, optional), the number of contexts to retain forcefully in Coarse-level Prompt Compression; default set to None;
- **use_sentence_level_filter**(bool, optional), enables sentence-level prompt compression; default set to False;
- **use_context_level_filter**(bool, optional), enables coarse-level prompt compression; default set to True;
- **use_token_level_filter**(bool, optional), enables token-level prompt compression; default set to True;
- **keep_split**(bool, optional), whether to retain all newline separators ("\n\n") in the prompt; default set to False;
- **keep_first_sentence**(int, optional), if positive, retains the first k sentences of each context; default set to 0;
- **keep_last_sentence**(int, optional), if positive, retains the last k sentences of each context; default set to 0;
- **keep_sentence_number**(int, optional), the number of sentences to retain in each context; default set to 0;
- **high_priority_bonus**(int, optional), the perplexity bonus granted to retained sentences; only used when **keep_first_sentence** or **keep_last_sentence** is set; default set to 100;
- **context_budget**(str, optional), the budget in Coarse-level Prompt Compression, expressed with an operator such as "*1.5" or "+100"; default set to "+100";
- **token_budget_ratio**(float, optional), the budget ratio in sentence-level prompt compression; default set to 1.4;
- **condition_in_question**(str, optional), controls question-aware coarse-level prompt compression; supports "none", "after", and "before". For LongLLMLingua, it must be set to "after" or "before"; default set to "none";
- **reorder_context**(str, optional), controls document reordering before compression in LongLLMLingua; supports "original", "sort", and "two_stage"; default set to "original";
- **dynamic_context_compression_ratio**(float, optional), the ratio of dynamic context compression in LongLLMLingua; default set to 0.0;
- **condition_compare**(bool, optional), enables Iterative Token-level Question-aware Fine-Grained Compression in LongLLMLingua; default set to False;
- **add_instruction**(bool, optional), whether to add an instruction before the prompt in Iterative Token-level Question-aware Fine-Grained Compression; default set to False;
- **rank_method**(str, optional), the ranking method used in Coarse-level Prompt Compression; supports "llmlingua", "longllmlingua", "bm25", "gzip", "sentbert", and "openai"; default set to "longllmlingua";
- **concate_question**(bool, optional), whether to include the question in the compressed prompt; default set to True;
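To make the **context_budget** operator strings concrete, here is a sketch of how a string such as "+100" or "*1.5" could be applied on top of a base token budget. This helper is hypothetical — the library's internal handling may differ:

```python
def apply_context_budget(base_budget: int, context_budget: str) -> int:
    """Hypothetical helper: interpret a budget string like '+100' or '*1.5'
    as an adjustment applied to the base token budget."""
    op, value = context_budget[0], float(context_budget[1:])
    if op == "+":
        return int(base_budget + value)
    if op == "*":
        return int(base_budget * value)
    raise ValueError(f"Unsupported operator in context_budget: {context_budget!r}")

print(apply_context_budget(200, "+100"))  # → 300
print(apply_context_budget(200, "*1.5"))  # → 300
```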

### Response

- **compressed_prompt**(str), the compressed prompt;
- **origin_tokens**(int), the number of tokens in the original prompt;
- **compressed_tokens**(int), the number of tokens in the compressed prompt;
- **ratio**(str), the actual compression ratio;
- **saving**(str), the estimated cost saving in GPT-4.

## Post-processing

```python
compressed_prompt = llm_lingua.recover(
original_prompt: str,
compressed_prompt: str,
response: str,
)
```

### Parameters

- **original_prompt**(str), the original prompt;
- **compressed_prompt**(str), the compressed prompt;
- **response**(str), the response returned by the black-box LLM for the compressed prompt;

### Response

- **recovered_response**(str), the recovered response;
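A minimal sketch of what recovery could look like: garbled spans in the response are mapped back to the best-matching substring of the original prompt via fuzzy matching. The use of `difflib` here is an illustrative stand-in; the library's actual recovery algorithm may differ:

```python
import difflib

def recover_span(original_prompt: str, garbled_span: str) -> str:
    """Illustrative recovery: find the window of the original prompt
    that best matches a garbled span from the compressed-prompt response."""
    words = original_prompt.split()
    n = len(garbled_span.split())
    # Candidate windows of the same word length as the garbled span.
    candidates = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    match = difflib.get_close_matches(garbled_span, candidates, n=1, cutoff=0.0)
    return match[0] if match else garbled_span

original = "Sam bought a dozen boxes of highlighter pens"
print(recover_span(original, "dozen boxs of highlighter"))  # → dozen boxes of highlighter
```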
15 changes: 15 additions & 0 deletions Makefile
@@ -0,0 +1,15 @@
.PHONY: install style_check_on_modified style

export PYTHONPATH = src

PYTHON := python3
CHECK_DIRS := llmlingua

install:
@${PYTHON} setup.py bdist_wheel
@${PYTHON} -m pip install dist/sdtools*

style:
black $(CHECK_DIRS)
isort -rc $(CHECK_DIRS)
flake8 $(CHECK_DIRS)
98 changes: 90 additions & 8 deletions README.md
@@ -1,14 +1,96 @@
# Project
<p align="center" width="100%">
<img src="images/LLMLingua_logo.png" alt="LLMLingua" style="width: 20%; min-width: 100px; display: block; margin: auto;">
</p>

> This repo has been populated by an initial template to help get you started. Please
> make sure to update the content to build a great experience for community-building.
# LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [paper] & LongLLMLingua [paper]

As the maintainer of this project, please make a few updates:
https://github.com/microsoft/LLMLingua/assets/30883354/ef52995c-ef3c-4eac-a9fd-1acb491c325b

- Improving this README.MD file to provide a great experience
- Updating SUPPORT.MD with content about this project's support experience
- Understanding the security reporting process in SECURITY.MD
- Remove this section from the README
You can try the LLMLingua demo in [HF Space](https://huggingface.co/spaces/microsoft/LLMLingua).

## TL;DR

LLMLingua uses a well-trained small language model after alignment, such as GPT2-small or LLaMA-7B, to detect the unimportant tokens in the prompt, enabling inference with the compressed prompt on black-box LLMs and achieving up to 20x compression with minimal performance loss.
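The core idea — drop the tokens the small model finds predictable — can be sketched with a toy surprisal estimate. Here a unigram frequency model stands in for the real causal LM (the library actually uses per-token perplexity from a model such as LLaMA-7B), so this is purely illustrative:

```python
import math
from collections import Counter

def compress_tokens(tokens, keep_ratio=0.5):
    """Toy token-level compression: score each token by unigram surprisal
    (-log p) and keep only the most surprising fraction, preserving order."""
    counts = Counter(tokens)
    total = sum(counts.values())
    surprisal = {t: -math.log(c / total) for t, c in counts.items()}
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-surprisal tokens, restored to original order.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal[tokens[i]], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]

tokens = "the the the cat sat on the mat the the".split()
print(compress_tokens(tokens, keep_ratio=0.4))  # → ['cat', 'sat', 'on', 'mat']
```

Frequent, low-information tokens ("the") are dropped while the rare, content-bearing tokens survive — the same intuition, applied with a real language model, drives LLMLingua's token-level filter.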

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (EMNLP 2023).<br>
_Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang and Lili Qiu_

LongLLMLingua is a method that enhances LLMs' ability to perceive key information in long-context scenarios using prompt compression, achieving up to $28.5 in cost savings per 1,000 samples while also improving performance.

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression (Under Review).<br>
_Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu_

## 🎥 Overview

![image](./images/LLMLingua_motivation.png)

- Have you ever tried to input a long text and ask ChatGPT to summarize it, only to be told that it exceeds the token limit? ​
- Have you ever spent a lot of time fine-tuning the personality of ChatGPT, only to find that it forgets the previous instructions after a few rounds of dialogue? ​
- Have you ever used the GPT3.5/4 API for experiments, and got good results, but also received a huge bill after a few days? ​

Large language models, such as ChatGPT and GPT-4, impress us with their amazing generalization and reasoning abilities, but they also come with some drawbacks, such as the prompt length limit and the prompt-based pricing scheme.​

![image](./images/LLMLingua_framework.png)

Now you can use **LLMLingua** & **LongLLMLingua**!​

A simple and efficient method to compress prompts by up to **20x**.

- 💰 **Saving cost**: reduces both the prompt length and the generation length;
- 📝 **Supports longer contexts** while delivering enhanced performance;
- ⚖️ **Robustness**: requires no training of the LLMs;
- 🕵️ **Keeps** the original prompt knowledge, such as ICL and reasoning;
- 📜 **KV-Cache compression** to speed up inference;

![image](./images/LLMLingua_demo.png)

If you find this repo helpful, please cite the following paper:

```bibtex
@inproceedings{jiang-etal-2023-llmlingua,
title = "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models",
author = "Huiqiang Jiang and Qianhui Wu and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
publisher = "Association for Computational Linguistics",
}
```
```bibtex
@inproceedings{jiang-etal-2023-longllmlingua,
title = "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression",
author = "Huiqiang Jiang and Qianhui Wu and Xufang Luo and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
}
```

## 🎯 Quick Start

Install LLMLingua,

```bash
pip install llmlingua
```

Then, you can use LLMLingua to compress your prompt,

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
# 'origin_tokens': 2365,
# 'compressed_tokens': 211,
# 'ratio': '11.2x',
# 'saving': ', Saving $0.1 in GPT-4.'}
```

You can refer to this [document](./DOCUMENT.md) for more recommendations on how to use LLMLingua effectively.

## Frequently Asked Questions

See [Transparency_FAQ.md](./Transparency_FAQ.md).

## Contributing

14 changes: 1 addition & 13 deletions SUPPORT.md
@@ -1,13 +1,3 @@
# TODO: The maintainer of this repo has not yet edited this file

**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?

- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.

*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*

# Support

## How to file issues and get help
@@ -16,9 +6,7 @@ This project uses GitHub Issues to track bugs and feature requests. Please searc
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.

For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
For help and questions about using this project, please refer the [document](./DOCUMENT.md).

## Microsoft Support Policy

41 changes: 41 additions & 0 deletions Transparency_FAQ.md
@@ -0,0 +1,41 @@
# LLMLingua's Responsible AI FAQ

## What is LLMLingua?

- LLMLingua is a simple and efficient method to compress prompts by up to 20x while keeping the original prompt knowledge, such as ICL and reasoning.
- LLMLingua takes user-defined prompts and compression goals as input, and outputs a compressed prompt, which may often result in a form of expression that is difficult for humans to understand.

## What can LLMLingua do?

- LLMLingua can simultaneously reduce the length of prompts and the output of LLMs (by 20%-30%), thus saving API costs;
- Compressed prompts from LLMLingua can be directly used with black-box LLMs, such as ChatGPT, GPT-4, and Claude;
- By compressing prompts, LLMLingua allows for more information to be included within the original token length, thereby improving model performance;
- LLMLingua relies on a small language model, like GPT-2 or LLaMA-7b, for perplexity calculations, which is a relatively low-cost approach;
- Compressed prompts generated by LLMLingua can be understood by LLMs, preserving their original capabilities in downstream tasks and keeping the original prompt knowledge like ICL, reasoning, etc. LLMs can also recover the essential information from the compressed prompts;
- LLMLingua is robust and requires no training of the LLMs;
- Additionally, LLMLingua can be used to compress KV-Cache, which speeds up inference.

## What is/are LLMLingua’s intended use(s)?

- Users who call black-box LLM APIs similar to GPT-4, those who utilize ChatGPT to handle longer content, as well as model deployers and cloud service providers, can benefit from these techniques.

## How was LLMLingua evaluated? What metrics are used to measure performance?

- In our experiments, we conducted a detailed evaluation of the performance of compressed prompts across various tasks, particularly in those involving LLM-specific capabilities, such as In-Context Learning, reasoning tasks, summarization, and conversation tasks. We assessed our approach using compression ratio and performance loss as evaluation metrics.

## What are the limitations of LLMLingua? How can users minimize the impact of LLMLingua’s limitations when using the system?

- The potential harmful, false or biased responses using the compressed prompts would likely be unchanged. Thus using LLMLingua has no inherent benefits or risks when it comes to those types of responsible AI issues.
- LLMLingua may struggle to perform well at particularly high compression ratios, especially when the original prompts are already quite short.

## What operational factors and settings allow for effective and responsible use of LLMLingua?

- Users can set parameters such as the boundaries between different components (instruction, context, question) in the prompt, compression goals, and the small model used for compression calculations. Afterward, they can input the compressed prompt into black-box LLMs for use.

## What is instruction, context, and question?

In our approach, we divide the prompts into three distinct modules: instruction, context, and question. Each prompt necessarily contains a question, but the presence of context and instruction is not always guaranteed.

- Question: This refers to the directives given by the user to the LLMs, such as inquiries, questions, or requests. Positioned after the instruction and context modules, the question module has a high sensitivity to compression.
- Context: This module provides the supplementary context needed to address the question, such as documents, demonstrations, web search results, or API call results. Located between the instruction and question modules, its sensitivity to compression is relatively low.
- Instruction: This module consists of directives given by the user to the LLMs, such as task descriptions. Placed before the context and question modules, the instruction module exhibits a high sensitivity to compression.
Binary file added images/LLMLingua_demo.png
Binary file added images/LLMLingua_framework.png
Binary file added images/LLMLingua_logo.png
Binary file added images/LLMLingua_motivation.png
6 changes: 6 additions & 0 deletions llmlingua/__init__.py
@@ -0,0 +1,6 @@
# flake8: noqa
from .prompt_compressor import PromptCompressor
from .version import VERSION as __version__


__all__ = ["PromptCompressor"]