## LLMLingua-2

<a target="_blank" href="https://colab.research.google.com/github/microsoft/LLMLingua/blob/main/examples/LLMLingua2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a target="_blank" href="https://arxiv.org/abs/2403.12968">LLMLingua-2</a> focuses on task-agnostic prompt compression for better generalizability and efficiency. It is a small yet powerful prompt compression method that treats compression as token classification with a BERT-level encoder, trained via data distillation from GPT-4, and it excels in <b>task-agnostic compression</b>. It surpasses LLMLingua in handling <b>out-of-domain data</b> and offers <b>3x-6x faster</b> performance.

Below, we showcase the usage and compression results of <i>LLMLingua-2</i> on both <b>in-domain</b> and <b>out-of-domain</b> datasets, covering tasks such as single-document QA, multi-document QA, summarization, and in-context learning.
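
To make the token-classification idea above concrete, here is a toy sketch: each token gets a "keep" probability, and the highest-scoring tokens are kept in their original order until a target rate is reached. This is only an illustration under stated assumptions, not the LLMLingua-2 implementation; the function name `keep_top_tokens` and the probabilities are made up for the example.

```python
from typing import List, Sequence


def keep_top_tokens(
    tokens: Sequence[str],
    keep_probs: Sequence[float],
    rate: float,
    force_tokens: Sequence[str] = (),
) -> List[str]:
    """Keep roughly `rate` of the tokens, preferring high keep-probabilities.

    Hypothetical helper: the probabilities would normally come from a
    token-classification model; here they are supplied by hand.
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    n_keep = max(1, round(len(tokens) * rate))
    # Rank token positions by keep-probability, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = set(ranked[:n_keep])
    # Always preserve tokens the caller marked as mandatory (e.g. punctuation).
    kept.update(i for i, tok in enumerate(tokens) if tok in force_tokens)
    # Emit survivors in their original order so the text stays readable.
    return [tokens[i] for i in sorted(kept)]


tokens = ["the", "council", "will", "vote", "on", "resolution", "31669", "."]
probs = [0.10, 0.90, 0.20, 0.80, 0.10, 0.95, 0.99, 0.30]  # made-up scores
compressed = keep_top_tokens(tokens, probs, rate=0.5, force_tokens=["."])
print(" ".join(compressed))
```

Forced tokens are added on top of the rate budget in this sketch, so the output can be slightly longer than `rate` alone would suggest.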


In [None]:
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

### Target LLM Config

In [2]:
!pip install openai==0.28

In [None]:
# Option 1: OpenAI (OAI)
import openai

openai.api_key = "<insert_openai_key>"

# Option 2: Azure OpenAI (AOAI).
# These settings override Option 1, so keep only the block you need.
openai.api_key = "<insert_openai_key>"
openai.api_base = "<insert_openai_base>"
openai.api_type = "azure"
openai.api_version = "2023-05-15"
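
As an alternative to pasting keys into the notebook, the same settings could be read from environment variables. The sketch below is a suggestion rather than part of the original notebook: the variable names `OPENAI_API_KEY`, `OPENAI_API_BASE`, and `OPENAI_API_VERSION` and the helper `load_openai_settings` are assumptions chosen for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect client settings from environment variables (names are assumed)."""
    env = os.environ if env is None else env
    settings = {"api_key": env["OPENAI_API_KEY"]}
    base = env.get("OPENAI_API_BASE")
    if base:
        # A custom base URL is treated here as an Azure OpenAI deployment.
        settings.update(
            api_base=base,
            api_type="azure",
            api_version=env.get("OPENAI_API_VERSION", "2023-05-15"),
        )
    return settings


# Example with an explicit mapping instead of the real environment.
example = load_openai_settings(
    {"OPENAI_API_KEY": "<insert_openai_key>", "OPENAI_API_BASE": "<insert_openai_base>"}
)
print(sorted(example))

# With openai==0.28 the values can then be applied as module attributes:
# for name, value in load_openai_settings().items():
#     setattr(openai, name, value)
```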

## In-Domain

Below, we present the results of <i>LLMLingua-2</i> compared with strong baselines on in-domain data: the test set of <a href="https://aclanthology.org/2023.acl-long.906/">MeetingBank</a>.
Although our compressors are much smaller than the LLaMA-2-7B used in the baselines,
our approach achieves <b>significantly better performance</b> on both the QA and summarization tasks, and <b>comes close to matching the performance of the original prompt</b>.

### MeetingBank


In [117]:
# Download the original prompt and dataset
from datasets import load_dataset

dataset = load_dataset("huuuyeah/meetingbank", split="test")
context = dataset[0]["transcript"]

question = "What is the agenda item three resolution 31669 about?\nAnswer:"
reference = "Encouraging individualized tenant assessment."

In [5]:
# The response from original prompt, using GPT-4-32k
import json

prompt = "\n\n".join([context, question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94T49ZkAUgmY2EQQVuzS8EcklBZQO",
    "object": "chat.completion",
    "created": 1710852069,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Agenda item three resolution 31669 is about encouraging the use of an individualized tenant assessment using the Fair Housing Act's discriminatory effect standards to avoid Fair Housing Act violations when criminal history is used as a screening criterion in the Landlord Screening Process. The resolution aims to ensure that landlords understand the law when it comes to making decisions based on criminal history. It also highlights the policies that the Department of Housing and Urban Development (HUD) is currently promoting and the policy direction that the city will be pursuing"
            },
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 1362,
        "complet

In [None]:
# Compress the context to roughly 33% of its original tokens (rate=0.33)
compressed_prompt = llm_lingua.compress_prompt(
    context,
    rate=0.33,
    force_tokens=["!", ".", "?", "\n"],
    drop_consecutive=True,
)

In [7]:
# The response from the compressed prompt, using GPT-4-32k
import json

prompt = "\n\n".join([compressed_prompt["compressed_prompt"], question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94T4RJZkt4dZz01FQv5gZhNp0qGq5",
    "object": "chat.completion",
    "created": 1710852087,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The agenda item three resolution 31669 is about individualized tenant assessment under the Fair Housing Act. It aims to avoid discriminatory standards and violations in the landlord screening process, particularly in relation to criminal history screening criteria. The resolution also discusses the Certificate of Restoration of Opportunity, a state legislation designed to provide potential employers and housing providers with information about individuals who have served prison time and have been released, to facilitate their reintegration into society."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 444,
        "completion_tokens": 87,
        "tot

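The `prompt_tokens` values reported in the two responses above (1362 for the original prompt, 444 for the compressed one) can be turned into a compression rate and ratio with a few lines of arithmetic. This is a small illustrative helper, not part of the LLMLingua API; the name `compression_stats` is made up here. Note that these counts include the question and chat formatting, so they only approximate the compression of the context itself.

```python
from typing import Dict


def compression_stats(original_tokens: int, compressed_tokens: int) -> Dict[str, float]:
    """Summarize how much shorter the compressed prompt is."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return {
        "rate": compressed_tokens / original_tokens,   # fraction of tokens kept
        "ratio": original_tokens / compressed_tokens,  # "Nx" compression
        "saved_tokens": original_tokens - compressed_tokens,
    }


# Token counts taken from the "usage" fields of the two responses above.
stats = compression_stats(original_tokens=1362, compressed_tokens=444)
print(
    f"{stats['ratio']:.2f}x compression, "
    f"{stats['rate']:.1%} of tokens kept, "
    f"{stats['saved_tokens']} tokens saved"
)
```

The resulting rate is close to the `rate=0.33` requested in the compression call.
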
## Out-of-Domain

Because our compressor is trained only on meeting transcripts from MeetingBank, here we explore its generalization ability across benchmarks covering long-context scenarios, reasoning, and in-context learning.
Despite this narrow training data, <i>LLMLingua-2</i> is also effective on <b>out-of-domain</b> data,
with performance <b>comparable to or even surpassing</b> the SOTA <i>task-agnostic</i> compression baselines.

Below, we showcase several compression results on <a href="https://arxiv.org/abs/2308.14508">LongBench</a> and <a href="https://arxiv.org/abs/2110.14168">GSM8K</a>, including single-document QA, multi-document QA, summarization and in-context learning tasks.

### Load LongBench Prompt

In [25]:
dataset2prompt = {
    "narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {context}\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
    "gov_report": "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{context}\n\nNow, write a one-page summary of the report.\n\nSummary:",
    "triviaqa": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{context}\n\n{input}",
}

dataset2maxlen = {
    "narrativeqa": 128,
    "gov_report": 512,
    "triviaqa": 32,
}
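
The templates above are filled in with `str.format(**sample)` in the cells that follow. As a quick illustration of that mechanism, the sketch below uses a shortened, hypothetical template and a hand-written sample; keys that the template does not mention (such as `answers`) are simply ignored, while a missing placeholder key would raise `KeyError`.

```python
# Hypothetical, shortened template in the same style as dataset2prompt.
toy_template = "Story: {context}\n\nQuestion: {input}\n\nAnswer:"

# Hand-written sample; real benchmark samples carry additional fields.
toy_sample = {
    "context": "A short story.",
    "input": "Who is the hero?",
    "answers": ["Nobody"],  # unused by the template, ignored by format()
}

toy_prompt = toy_template.format(**toy_sample)
print(toy_prompt)

# Swapping in a compressed context only requires replacing one field.
compressed_sample = dict(toy_sample, context="short story")
compressed_toy_prompt = toy_template.format(**compressed_sample)
```

Copying the sample with `dict(sample, context=...)` leaves the original untouched, which can be convenient when comparing original and compressed prompts side by side.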

### Single-Doc QA

In [18]:
task = "narrativeqa"
dataset = load_dataset("THUDM/LongBench", task, split="test")
sample = dataset[3]
context = sample["context"]
reference = sample["answers"]
print(reference)

['To smuggle Socrates out of prison and into a life of exile.']


In [12]:
# The response from original prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94TFlOkf2ps8qW0Y4jKf6cnsn6Mbi",
    "object": "chat.completion",
    "created": 1710852789,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "To convince Socrates to escape from prison."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9059,
        "completion_tokens": 9,
        "total_tokens": 9068
    },
    "system_fingerprint": null
}


In [None]:
# 3000 Compression
compressed_prompt = llm_lingua.compress_prompt(
    context,
    target_token=3000,
    force_tokens=["!", ".", "?", "\n"],
    drop_consecutive=True,
)

In [14]:
# The response from the compressed prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
sample["context"] = compressed_prompt["compressed_prompt"]
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94TG5pKpMAsmKwGKcQy3ZvFjKPM5S",
    "object": "chat.completion",
    "created": 1710852809,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "To persuade Socrates to escape from prison."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 3064,
        "completion_tokens": 9,
        "total_tokens": 3073
    },
    "system_fingerprint": null
}
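
The request cells in this notebook repeat the same payload with only the prompt and `max_tokens` changing. One possible way to factor that out is sketched below; `build_chat_request` is a hypothetical helper introduced for illustration, and the actual API call is left as a comment because it needs credentials.

```python
from typing import Any, Dict


def build_chat_request(prompt: str, max_tokens: int) -> Dict[str, Any]:
    """Build the deterministic chat payload used throughout this notebook."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding for reproducible comparisons
        "top_p": 1,
        "n": 1,
        "stream": False,
    }


request_data = build_chat_request("Question: ...\n\nAnswer:", max_tokens=128)
print(request_data["max_tokens"], request_data["messages"][0]["role"])

# With openai==0.28 configured as above, the call would look like:
# response = openai.ChatCompletion.create(engine="gpt-4-32k", **request_data)
```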


### Multi-Doc QA

In [105]:
task = "triviaqa"
dataset = load_dataset("THUDM/LongBench", task, split="test")
sample = dataset[0]
context = sample["context"]
reference = sample["answers"]
print(reference)

['The United States of America', 'United States Of Amerca', 'Us of a', 'U.–S.–A.', 'Americaland', 'United States (U.S.A.)', 'Amurika', 'Unite states of america', 'United States of America (redirect)', 'The U S A', 'Unietd States', 'EE UU', 'The U.S.A.', 'U.-S.-A.', 'Usa', 'United Staets of America', 'Unites States', "États-Unis d'Amérique", 'Verenigde State', 'U.–S.', 'The United States of America.', 'The U-S-A', 'EEUU', 'U. S. A.', 'Nagkaisang mga Estado', 'The U. S. of America', 'The USA', 'America (United States)', 'The U. S. A.', 'U S of America', 'UNITED STATES', 'Estados Unidos', 'The U–S', 'American United States', 'US and A', 'Unitd states', 'The US of A', 'EE.UU.', 'U-S', 'The U-S', 'Etymology of the United States', 'U.S.A.)', 'EE. UU.', 'United states of america', 'US of america', 'Verenigde State van Amerika', 'Nited States', 'United-States', 'Unite States', 'Estados Unidos de América', 'UnitedStates', 'Estaos Unios', 'US of America', 'The Usa', 'United states of America', '

In [106]:
# The response from original prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94U7u0jJgszqemEVuDemWOAA09ivp",
    "object": "chat.completion",
    "created": 1710856146,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "United States"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 5527,
        "completion_tokens": 2,
        "total_tokens": 5529
    },
    "system_fingerprint": null
}


In [None]:
context_list = context.split("\nPassage:")
context_list = ["\nPassage:" + c for c in context_list]

# 2000 Compression
compressed_prompt = llm_lingua.compress_prompt(
    context_list,
    target_token=2000,
    force_tokens=["\nPassage:", ".", "?", "\n"],
    drop_consecutive=True,
    use_context_level_filter=True,
)
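
A note on the splitting step above: `str.split` removes the marker, so it has to be added back to each piece. If the text does not start with the marker, re-prefixing every piece also prefixes the first one. The sketch below is a hypothetical variation (the helper `split_keep_marker` is not part of the notebook or of LLMLingua) that leaves the first piece untouched, so joining the chunks reproduces the input exactly.

```python
from typing import List


def split_keep_marker(text: str, marker: str = "\nPassage:") -> List[str]:
    """Split `text` at each `marker`, keeping the marker on the later pieces."""
    head, *rest = text.split(marker)
    # Keep the leading piece as-is; drop it only if it is empty or whitespace.
    chunks = [head] if head.strip() else []
    chunks.extend(marker + piece for piece in rest)
    return chunks


toy_context = "Passage: first doc\nQuestion: q1\nPassage: second doc\nQuestion: q2"
toy_chunks = split_keep_marker(toy_context)
for chunk in toy_chunks:
    print(repr(chunk))
```

The resulting list can be passed to `compress_prompt` in place of `context_list`, with each element treated as one context.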

In [112]:
# The response from the compressed prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
sample["context"] = compressed_prompt["compressed_prompt"]
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94UAXMDXa6LtrDpYz35inJxWPChyc",
    "object": "chat.completion",
    "created": 1710856309,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "United States"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1805,
        "completion_tokens": 2,
        "total_tokens": 1807
    },
    "system_fingerprint": null
}


### Summarization

In [20]:
task = "gov_report"
dataset = load_dataset("THUDM/LongBench", task, split="test")
sample = dataset[0]
context = sample["context"]
reference = sample["answers"]
print(reference)

["Multiyear procurement (MYP) and block buy contracting (BBC) are special contracting mechanisms that Congress permits the Department of Defense (DOD) to use for a limited number of defense acquisition programs. Compared to the standard or default approach of annual contracting, MYP and BBC have the potential for reducing weapon procurement costs by a few or several percent. Under annual contracting, DOD uses one or more contracts for each year's worth of procurement of a given kind of item. Under MYP, DOD instead uses a single contract for two to five years' worth of procurement of a given kind of item without having to exercise a contract option for each year after the first year. DOD needs congressional approval for each use of MYP. There is a permanent statute governing MYP contracting—10 U.S.C. 2306b. Under this statute, a program must meet several criteria to qualify for MYP. Compared with estimated costs under annual contracting, estimated savings for programs being proposed for

In [21]:
# The response from original prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94TMClPIEoBfTqHqw78DJV2W5r97x",
    "object": "chat.completion",
    "created": 1710853188,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The report discusses the use of multiyear procurement (MYP) and block buy contracting (BBC) by the Department of Defense (DOD) for defense acquisition programs. These special contracting mechanisms, permitted by Congress, have the potential to reduce weapon procurement costs by a few or several percent. The report explores whether MYP and BBC should be used more or less frequently in the future, and whether a permanent statute should be created to govern the use of BBC, similar to the one that exists for MYP. It also discusses whether the Coast Guard should start using MYP and BBC. The report clarifies that MYP and BBC are contracting mechanisms, not funding approaches, and that they can significantly change t

In [22]:
# 3000 Compression
compressed_prompt = llm_lingua.compress_prompt(
    context,
    target_token=3000,
    force_tokens=["!", ".", "?", "\n"],
    drop_consecutive=True,
)

In [24]:
# The response from the compressed prompt, using GPT-4-32k
import json

prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
sample["context"] = compressed_prompt["compressed_prompt"]
prompt = prompt_format.format(**sample)

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": max_gen,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    engine="gpt-4-32k",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-94TN0NVjsaQRlqjLtyTNQP88shi6f",
    "object": "chat.completion",
    "created": 1710853238,
    "model": "gpt-4-32k",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The report discusses the issues related to multiyear procurement (MYP) and block buy contracting (BBC), special mechanisms used by the Department of Defense (DOD) for certain defense acquisition programs. These mechanisms can potentially reduce weapon procurement costs. However, they also affect defense practices, funding, and the industrial base. The report highlights that most DOD programs use traditional full funding and annual contracting, with a few using incremental funding. MYP and BBC are used in limited DOD programs. \n\nThe report explains that MYP is an alternative to annual contracting, allowing for a single contract for two to five years of procurement without congressional approval. The savings f

### In-Context Learning (GSM8K)

In [62]:
!wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
with open("./prompt_hardest.txt") as f:
    prompt_complex = f.read()
gsm8k = load_dataset("gsm8k", "main")
gsm8k_test = gsm8k["test"]


In [63]:
# select an example from GSM8K
question, answer = [gsm8k_test[2][key] for key in ["question", "answer"]]
# Ground-truth Answer
print("Question:", question)
print("Answer:", answer)

Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?
Answer: The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000
He increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000
So the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000
So he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000
#### 70000


In [67]:
# The response from original prompt
import json

instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + prompt_complex + "\n\nQuestion: " + question

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\n\n",
}
response = openai.Completion.create(
    engine="gpt-35-turbo-instruct",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "cmpl-94TkfUcusW4yGAXSr0mTpc9PURcnJ",
    "object": "text_completion",
    "created": 1710854705,
    "model": "gpt-35-turbo-instruct",
    "choices": [
        {
            "text": "\nLet's think step by step\nThe value of the house increased by 150%, meaning it is now worth 100% + 150% = 250% of its original value.\nIf the original value of the house was $80,000, then the new value is 250% * $80,000 = $200,000.\nJosh spent $80,000 to buy the house and $50,000 on repairs, so his total investment was $80,000 + $50,000 = $130,000.\nHis profit is the new value of the house ($200,000) minus his total investment ($130,000), so his profit is $200,000 - $130,000 = $70,000.\nThe answer is $70,000",
            "index": 0,
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 2428,
        "completion_tokens": 158,
        "total_tokens": 2586
    }
}


In [74]:
# Compress the few-shot demonstrations to about 150 tokens
compressed_prompt = llm_lingua.compress_prompt(
    prompt_complex.split("\n\n"),
    target_token=150,
    force_tokens=["+", "-", "*", "×", "/", "÷", "=", "The answer is", "\n"],
    drop_consecutive=True,
    force_reserve_digit=True,
    use_context_level_filter=True,
)

{'compressed_prompt': 'Sam bought dozen boxes 30 highlighter pens $10 rearranged five boxes six highlighters sold $3 per package sold rest three pens $2 profit\n Sam bought 12 boxes x $10 = $120 highlighters\n 12 * 30 = 360 highlighters\n 5 boxes × 6 highlighters/box = 30\n sold 5 * $3 = $15\n 5 360 - 30 = 330 highlighters remaining\n 330 / 3 = 110 groups three pens\n sold $2 110 * 2 = $220\n earned $220 + $15 = $235.\n original cost $120 earned $235 - $120 = $115 profit\nThe answer is 115', 'compressed_prompt_list': ['Sam bought dozen boxes 30 highlighter pens $10 rearranged five boxes six highlighters sold $3 per package sold rest three pens $2 profit\n Sam bought 12 boxes x $10 = $120 highlighters\n 12 * 30 = 360 highlighters\n 5 boxes × 6 highlighters/box = 30\n sold 5 * $3 = $15\n 5 360 - 30 = 330 highlighters remaining\n 330 / 3 = 110 groups three pens\n sold $2 110 * 2 = $220\n earned $220 + $15 = $235.\n original cost $120 earned $235 - $120 = $115 profit\nThe answer is 115'], 

In [75]:
instruction = "Please reference the following examples to answer the math question,\n"
prompt = (
    instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question
)

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
response = openai.Completion.create(
    engine="gpt-35-turbo-instruct",
    **request_data,
)
print("Response:", response)

Response: {
  "id": "cmpl-94Tof3oyRFQlgEzjhurEiOoYiYDsR",
  "object": "text_completion",
  "created": 1710854953,
  "model": "gpt-35-turbo-instruct",
  "choices": [
    {
      "text": "\n\nTo find the new value of the house, we need to multiply the original value by 150% and add it to the original value.\n150% of $80,000 = $80,000 * 1.5 = $120,000\nNew value of the house = $80,000 + $120,000 = $200,000\nProfit = New value - (Original value + Repair cost)\n= $200,000 - ($80,000 + $50,000)\n= $200,000 - $130,000\n= $70,000\nJosh made a profit of $70,000.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 211,
    "completion_tokens": 128,
    "total_tokens": 339
  }
}
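
To compare the two GSM8K responses above against the reference, the final number has to be pulled out of differently formatted text: the reference ends with `#### 70000`, the first response with `The answer is $70,000`, and the second with a plain sentence. The sketch below is one possible way to do that; `extract_final_number` and its fallback order are assumptions for illustration, not an official GSM8K scorer.

```python
import re
from typing import Optional

# A number with optional sign, dollar sign, thousands separators, and decimals.
_NUMBER = r"-?\$?\d[\d,]*(?:\.\d+)?"


def extract_final_number(text: str) -> Optional[str]:
    """Return the final answer as a plain digit string, or None if absent."""
    match = re.search(r"####\s*(" + _NUMBER + ")", text)
    if match is None:
        match = re.search(r"The answer is\s*(" + _NUMBER + ")", text)
    if match is not None:
        raw = match.group(1)
    else:
        # Fallback: assume the last number mentioned is the answer.
        numbers = re.findall(_NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace("$", "").replace(",", "")


reference_tail = "So he made a profit of 200,000-130,000=$70,000\n#### 70000"
original_tail = "so his profit is $200,000 - $130,000 = $70,000.\nThe answer is $70,000"
compressed_tail = "= $200,000 - $130,000\n= $70,000\nJosh made a profit of $70,000."

answers = [
    extract_final_number(t) for t in (reference_tail, original_tail, compressed_tail)
]
print(answers)
```

The last-number fallback is a heuristic and can pick the wrong value when a response mentions further numbers after its answer.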
