- LLMLingua is a simple and efficient method that compresses prompts by up to 20x while preserving the original prompt's knowledge, such as ICL, reasoning, etc.
- LLMLingua takes a user-defined prompt and a compression target as input and outputs a compressed prompt, which may often be phrased in a way that is difficult for humans to understand.
- LLMLingua can simultaneously reduce the length of prompts and of LLM outputs (by 20%-30%), thus saving API costs;
- Compressed prompts from LLMLingua can be used directly with black-box LLMs, such as ChatGPT, GPT-4, and Claude;
- By compressing prompts, LLMLingua allows more information to be included within the original token budget, thereby improving model performance;
- LLMLingua relies on a small language model, such as GPT-2 or LLaMA-7B, for perplexity calculations, which keeps the compression step relatively cheap;
- Compressed prompts generated by LLMLingua remain understandable to LLMs, preserving their capabilities in downstream tasks and the original prompt's knowledge, such as ICL and reasoning. LLMs can also recover the essential information from compressed prompts;
- LLMLingua is robust and requires no training of the LLMs;
- Additionally, LLMLingua can be used to compress the KV-cache, which speeds up inference.
- Users who call black-box LLM APIs such as GPT-4, users of ChatGPT handling longer content, as well as model deployers and cloud service providers can all benefit from these techniques.
- In our experiments, we conducted a detailed evaluation of compressed prompts across various tasks, particularly those involving LLM-specific capabilities, such as in-context learning, reasoning, summarization, and conversation. We assessed our approach using the compression ratio and performance loss as evaluation metrics.
What are the limitations of LLMLingua? How can users minimize the impact of LLMLingua’s limitations when using the system?
- Potentially harmful, false, or biased responses would likely be unchanged when using compressed prompts, so LLMLingua has no inherent benefits or risks with respect to these responsible-AI issues.
- LLMLingua may struggle to perform well at particularly high compression ratios, especially when the original prompts are already quite short.
- Users can set parameters such as the boundaries between the different components of the prompt (instruction, context, question), the compression target, and the small model used for the compression calculations, as sketched below. Afterward, they can feed the compressed prompt into a black-box LLM.
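For example, a minimal configuration sketch (the model name and device below are illustrative choices, not requirements):

from llmlingua import PromptCompressor

# Choose the small model that computes the token-importance (perplexity) estimates
# and the device it runs on.
llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # e.g. "gpt2" is a cheaper estimator
    device_map="cpu",  # or "cuda" if a GPU is available
)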
In our approach, we divide the prompt into three distinct modules: instruction, context, and question. Each prompt necessarily contains a question, but the presence of context and instruction is not always guaranteed. A sketch of how these modules map to the compression API follows the list below.
- Question: This refers to the directives given by the user to the LLMs, such as inquiries, questions, or requests. Positioned after the instruction and context modules, the question module has a high sensitivity to compression.
- Context: This module provides the supplementary context needed to address the question, such as documents, demonstrations, web search results, or API call results. Located between the instruction and question modules, its sensitivity to compression is relatively low.
- Instruction: This module consists of directives given by the user to the LLMs, such as task descriptions. Placed before the context and question modules, the instruction module exhibits a high sensitivity to compression.
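Continuing the configuration sketch above, the three modules map onto the arguments of compress_prompt roughly as follows (the strings and the 500-token target are illustrative, not recommendations):

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # default small model; see the configuration sketch above
result = llm_lingua.compress_prompt(
    context=["<retrieved document 1>", "<retrieved document 2>"],  # context: low sensitivity
    instruction="Answer the question based on the given documents.",  # instruction: high sensitivity
    question="What does the report conclude about revenue?",  # question: high sensitivity
    target_token=500,  # compression goal, in tokens
)
print(result["compressed_prompt"])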
Refer to the discussion.
TL;DR: Fine-tuning is beneficial, but the improvement is not very significant.
Our current understanding is that any language model can be used to estimate the importance distribution of tokens. We believe that the better the LM compresses text itself (following the "LM is a compressor" view), the more accurate the estimation will be, in particular because such a model has been exposed to more tokens during pre-training.
Therefore, we consider that any LM can potentially serve as a compressor for prompt compression, with different LMs sharing essentially the same token distribution. In our previous experiments, we found that alignment might have some impact, but it is minimal – about 1-2 points. Perhaps a more refined alignment method could significantly enhance performance.
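To make the idea concrete, here is a minimal sketch, not the library's internal implementation, of estimating per-token importance from a small causal LM's log-probabilities using Hugging Face transformers and GPT-2 (all names below are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, vocab_size)

# Log-probability of each token given its prefix (the first token has no prefix).
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
next_ids = enc["input_ids"][:, 1:]
token_logps = log_probs.gather(-1, next_ids.unsqueeze(-1)).squeeze(-1)

# Lower log-probability (higher surprisal) suggests a more informative token to keep.
for tok, lp in zip(tokenizer.convert_ids_to_tokens(next_ids[0].tolist()), token_logps[0]):
    print(f"{tok:>12s}  {lp.item():7.3f}")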
Refer to the discussion and issue.
Refer to issue1, issue2, and issue3.
We require an API that can return the logprobs of the input prompt. So far, we have found that OpenAI and FastChat offer this feature, and we plan to support it soon.
import openai  # legacy (pre-1.0) OpenAI Python SDK

# Request the log-probabilities of the prompt tokens themselves:
# echo=True returns the prompt, and max_tokens=0 generates no new tokens.
logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
Out[3]:
<OpenAIObject text_completion> JSON: {
  "id": "",
  "object": "text_completion",
  "created": 1707295146,
  "model": "davinci-002",
  "choices": [
    {
      "text": "Please return the logprobs",
      "index": 0,
      "logprobs": {
        "tokens": [
          "Please",
          " return",
          " the",
          " log",
          "pro",
          "bs"
        ],
        "token_logprobs": [
          null,
          -6.9668007,
          -2.047512,
          -8.885729,
          -13.960022,
          -5.479665
        ],
        "top_logprobs": null,
        "text_offset": [
          0,
          6,
          13,
          17,
          21,
          24
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}
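The per-token values can then be read directly from the response, for example:

# Dict-style access works on the legacy SDK's response object.
tokens = logp["choices"][0]["logprobs"]["tokens"]
token_logprobs = logp["choices"][0]["logprobs"]["token_logprobs"]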
We released the parameters in issue1 and issue2.
LLMLingua:
prompt = compressor.compress_prompt(
    context=xxx,
    instruction=xxx,
    question=xxx,
    ratio=0.75,
    iterative_size=100,
    context_budget="*2",
)
LongLLMLingua:
compressed_prompt = llm_lingua.compress_prompt(
    demonstration.split("\n"),
    instruction,
    question,
    0.55,  # compression ratio
    use_sentence_level_filter=False,
    condition_in_question="after_condition",
    reorder_context="sort",  # enables document reordering
    dynamic_context_compression_ratio=0.3,  # or 0.4; enables dynamic compression ratio
    condition_compare=True,
    context_budget="+100",
    rank_method="longllmlingua",
)
Experiments in LLMLingua and most experiments in LongLLMLingua were conducted in completion mode, whereas chat mode tends to be more sensitive to token-level compression. However, OpenAI has disabled completion mode for GPT-3.5-turbo; you can use GPT-3.5-turbo-instruct or the Azure OpenAI service instead.
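For illustration only (assuming the legacy OpenAI SDK shown earlier), feeding a compressed prompt to a completion-mode model might look like this, where compressed_prompt is the dict returned by compress_prompt above:

import openai

response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=compressed_prompt["compressed_prompt"],  # the compressed text
    max_tokens=256,  # illustrative value
    temperature=0,
)
print(response["choices"][0]["text"])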
LLMLingua-2:
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # whether to use LLMLingua-2
)
compressed_prompt = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=['\n', '?'])

# Or use the LLMLingua-2 small model
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # whether to use LLMLingua-2
)
You can find the details of the LLMLingua-2 experiments under experiments/llmlingua2.
Thanks to the contributions of Ayo Ayibiowu (@thehapyone), (Long)LLMLingua can be seamlessly integrated into LangChain. Here's an example of how to initialize (Long)LLMLingua within LangChain:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.retrievers.document_compressors import LLMLinguaCompressor
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
# `retriever` is an existing LangChain retriever defined elsewhere in the example.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)
For a more detailed guide, please refer to Notebook.
Thanks to the contributions of Jerry Liu (@jerryjliu), (Long)LLMLingua can be seamlessly integrated into LlamaIndex. Here's an example of how to initialize (Long)LLMLingua within LlamaIndex:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # Enables document reordering
        "dynamic_context_compression_ratio": 0.4,  # Enables dynamic compression ratio
    },
)
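As a follow-up sketch (assuming a LlamaIndex retriever named `retriever` was built elsewhere and the legacy llama_index API shown above is in use), the postprocessor can then be attached to a query engine:

# `retriever` is assumed to come from an existing LlamaIndex index.
retriever_query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[node_postprocessor],
)
response = retriever_query_engine.query("Summarize the key decisions in the meeting.")  # illustrative question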
For a more detailed guide, please refer to the RAG LlamaIndex example.