# Task 1: Custom Training Pipeline

### I. Import Libraries

In [1]:
# !pip install llmlingua
# !pip install openai==0.28
# !pip install spacy
# !python -m spacy download en_core_web_sm
# !pip install scikit-learn
# !pip install tensorboard

# !pip install datasets

### II. Collect Data

* Format data to a list of dict (idx, prompt)

In [2]:
import json
import os

from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="train")
# Fig 12 example idx=3715
data = []
for idx, instance in enumerate(ds):
    if idx == 500:
        break
    temp = {}
    temp["idx"] = idx
    temp["prompt"] = "Question: " + instance["question"] + instance["answer"]
    # temp["summary"] = instance["summary"]
    data.append(temp)

# Write the file once, after all records have been collected.
os.makedirs("../../../results/gsm8k/origin/", exist_ok=True)
with open("../../../results/gsm8k/origin/gsm8k_train_formated.json", "w") as f:
    json.dump(data, f, indent=4)

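
The record format described above (a list of dicts with `idx` and `prompt`) can be tried on an in-memory sample without downloading the dataset. This is a minimal sketch: the helper name `format_records` and the sample row are illustrative assumptions, not part of the LLMLingua scripts.

```python
import json


def format_records(instances, limit=500):
    """Build the list of {"idx", "prompt"} dicts used as input to the compression step."""
    records = []
    for idx, instance in enumerate(instances):
        if idx == limit:
            break
        records.append({
            "idx": idx,
            "prompt": "Question: " + instance["question"] + instance["answer"],
        })
    return records


# Hypothetical row with the same two fields as a GSM8K instance.
sample = [{"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}]
records = format_records(sample)
print(json.dumps(records, indent=4))
```

Note that the question and the answer are concatenated without a separator, which is why the prompts shown later read `...market?Janet sells...`.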


* Instruct GPT-4 to compress the original context

In [3]:
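# The raw URL is needed here; the github.com/.../blob/... URL returns an HTML page, not the script.
#!wget -O compress.py https://raw.githubusercontent.com/microsoft/LLMLingua/main/experiments/llmlingua2/data_collection/compress.py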
!python compress.py --load_origin_from ../../../results/gsm8k/origin/gsm8k_train_formated.json \
 --chunk_size 512 \
 --compressor gpt4 \
 --model_name gpt-4 \
 --save_path ../../../results/gsm8k/origin/compression_gsm8k_train_formated.json


num data: 500
You are an excellent linguist and very good at compressing passages into short expressions by removing unimportant words, while retaining as much information as possible.
Compress some text to short expressions, and such that you (GPT-4) can reconstruct it as close as possible to the original. Unlike the usual text compression, I need you to comply with the 5 conditions below: 1. You can ONLY remove unimportant words. 2. Do not change the order of words. 3. Do not change the original words, e.g. 'asking'->'ask' is NOT OK, 'current'->'now' is NOT OK. 4. Do not use abbreviations or emojis, e.g. 'without'->'w/o' is NOT OK, 'as soon as possible'->'ASAP' is NOT OK. 5. Do not add new words or symbols, this is very important. For example, 'dedicate 3 hours to each chapter'->'3 hours/chapter' is NOT OK because you add new token '/', just compress it into '3 hours each chapter'. '30 eggs plus 20 eggs equals 50 eggs'->'30+20=50' is also NOT OK becuase you add new symbols + and =, j
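
The conditions in the instruction above imply that a valid compression is a subsequence of the original words: nothing is added, changed, or reordered. The following is a minimal sketch of such a check, assuming plain whitespace tokenization. The helper `is_word_subsequence` is hypothetical and is not part of `compress.py`.

```python
def is_word_subsequence(original: str, compressed: str) -> bool:
    """Return True if every word of `compressed` appears in `original` in the same order."""
    remaining = iter(original.split())
    # `word in remaining` consumes the iterator up to the first match,
    # so later words can only match later positions in the original.
    return all(word in remaining for word in compressed.split())


original = "dedicate 3 hours to each chapter"
print(is_word_subsequence(original, "3 hours each chapter"))  # words only removed
print(is_word_subsequence(original, "3 hours/chapter"))       # adds the token '/'
print(is_word_subsequence(original, "chapter 3 hours"))       # reorders words
```

A check like this could be run over the GPT-4 outputs to flag samples that violate the instruction before they are labeled.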

* Assign word labels and filter out poor compression samples

In [6]:
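# The raw URL is needed here; the github.com/.../blob/... URL returns an HTML page, not the script.
#!wget -O label_word.py https://raw.githubusercontent.com/microsoft/LLMLingua/main/experiments/llmlingua2/data_collection/label_word.py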
!python label_word.py --load_prompt_from ../../../results/gsm8k/origin/compression_gsm8k_train_formated.json \
--window_size 400 \
--save_path ../../../results/gsm8k/origin/annotation_gsm8k_train_formated.json

106it [00:04, 26.55it/s]
Question: John buys 10 packs of magic cards.  Each pack has 20 cards and 1/4 of those cards are uncommon.  How many uncommon cards did he get?Each pack has 20/4 = <<20/4=5>>5 uncommons
So he got 10*5 = <<10*5=50>>50 uncommons
#### 50
--------------------------------------------------
"John buys 10 packs magic cards. Each pack 20 cards, 1/4 uncommon. How many uncommon cards get? Each pack 20/4 = 5 uncommons. So got 10*5 = 50 uncommons. 50."
--------------------------------------------------
John buy 10 pack magic card . each pack 20 card 1/4 uncommon . how many uncommon card pack 20/4 = uncommon so get 10 * 5 = uncommon 50
--------------------------------------------------
['question', ':', 'John', 'buy', '10', 'pack', 'of', 'magic', 'card', '.', ' ', 'each', 'pack', 'have', '20', 'card', 'and', '1/4', 'of', 'those', 'card', 'be', 'uncommon', '.', ' ', 'how', 'many', 'uncommon', 'card', 'do', 'he', 'get?each', 'pack', 'have', '20/4', '=', '<', '<', '20/4=5>>5', '
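
The labeling step turns each (original, compressed) pair into per-word keep/drop labels, which become the training targets for the token classifier. The sketch below is a simplified illustration, not the logic of `label_word.py`: it assumes exact, case-insensitive matching and a forward-only search, whereas the script output above suggests the real implementation also lemmatizes words. The function name `label_words` is hypothetical; `window_size` mirrors the `--window_size` flag.

```python
def label_words(original_words, compressed_words, window_size=400):
    """Mark each original word True if a compressed word is matched to it."""
    labels = [False] * len(original_words)
    start = 0  # resume after the previous match so word order is preserved
    for word in compressed_words:
        end = min(len(original_words), start + window_size)
        for i in range(start, end):
            if original_words[i].lower() == word.lower():
                labels[i] = True
                start = i + 1
                break
    return labels


original = "John buys 10 packs of magic cards".split()
compressed = "John buys 10 packs magic cards".split()
labels = label_words(original, compressed)
print(list(zip(original, labels)))
```

Only the dropped word `of` receives a `False` label here; every other word is matched and kept.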

In [7]:
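# The raw URL is needed here; the github.com/.../blob/... URL returns an HTML page, not the script.
#!wget -O filter.py https://raw.githubusercontent.com/microsoft/LLMLingua/main/experiments/llmlingua2/data_collection/filter.py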
!python filter.py --load_path ../../../results/gsm8k/origin/annotation_gsm8k_train_formated.pt \
--save_path ../../../results/gsm8k/origin/annotation_kept_gsm8k_train_formated.pt

500
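
The filtering criteria are not shown in this notebook. As one plausible illustration of what "poor compression" could mean, a sample can be scored by the fraction of compressed words that never occur in the original, and samples above a threshold dropped. The names `variation_rate` and `filter_samples` and the threshold `max_rate=0.2` are assumptions made for this sketch, not values taken from `filter.py`.

```python
def variation_rate(original: str, compressed: str) -> float:
    """Fraction of compressed words that never occur in the original text."""
    original_words = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 1.0  # treat an empty compression as maximally unreliable
    unseen = sum(word not in original_words for word in compressed_words)
    return unseen / len(compressed_words)


def filter_samples(samples, max_rate=0.2):
    """Keep (original, compressed) pairs whose variation rate is at most `max_rate`."""
    return [pair for pair in samples if variation_rate(*pair) <= max_rate]


samples = [
    ("30 eggs plus 20 eggs equals 50 eggs", "30 eggs plus 20 eggs equals 50"),
    ("30 eggs plus 20 eggs equals 50 eggs", "30+20=50"),  # introduces new symbols
]
kept = filter_samples(samples)
print(len(kept), "of", len(samples), "samples kept")
```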


### III. Train

In [9]:
!python train_roberta.py --data_path ../../../results/gsm8k/origin/annotation_kept_gsm8k_train_formated.pt \
    --save_path ../../../results/models/xlm-roberta-large-gsm8k-only


### IV. Inference

In [10]:
from datasets import load_dataset

ds_test = load_dataset("openai/gsm8k", "main", split="test")
# Same prompt format as the training data: question followed directly by the answer.
original_prompts = [
    "Question: " + instance["question"] + instance["answer"] for instance in ds_test
]

In [55]:
from llmlingua import PromptCompressor

model_meetingbank="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
model_gsm8k = '../../../results/models/xlm-roberta-large-gsm8k-only'

compressor_meetingbank = PromptCompressor(
    model_name=model_meetingbank,
    use_llmlingua2=True
)

compressor_gsm8k = PromptCompressor(
    model_name=model_gsm8k,
    use_llmlingua2=True
)


In [60]:
def compare(prompt):
    """Compress `prompt` with both models and return (MeetingBank, GSM8K) outputs."""
    # Shared settings so both compressors run under identical conditions.
    kwargs = dict(
        rate=0.6,
        force_tokens=['\n', '.', '!', '?', ','],
        chunk_end_tokens=['.', '\n'],
        return_word_label=True,
        drop_consecutive=True,
    )
    results_meetingbank = compressor_meetingbank.compress_prompt_llmlingua2(prompt, **kwargs)
    results_gsm8k = compressor_gsm8k.compress_prompt_llmlingua2(prompt, **kwargs)
    return results_meetingbank['compressed_prompt'], results_gsm8k['compressed_prompt']
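
As far as the LLMLingua documentation describes it, `rate` is the target ratio of compressed size to original size, so `rate=0.6` asks each model to keep roughly 60% of the tokens. The achieved ratio can be approximated after the fact. The sketch below assumes whitespace-delimited words rather than model tokens, so it only approximates what the compressor measures; `word_keep_ratio` is a hypothetical helper, and the sample strings are abridged from one of the test prompts.

```python
def word_keep_ratio(original: str, compressed: str) -> float:
    """Ratio of whitespace-delimited words kept after compression."""
    original_count = len(original.split())
    if original_count == 0:
        return 0.0
    return len(compressed.split()) / original_count


original = "A robe takes 2 bolts of blue fiber and half that much white fiber."
compressed = "robe takes 2 bolts blue half white fiber."
ratio = word_keep_ratio(original, compressed)
print(f"kept {ratio:.0%} of the words")
```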

In [61]:
prompt = original_prompts[0]
compressed_prompt_before, compressed_prompt_after = compare(prompt)

print("Original prompt:")
print(prompt)

print("Compressed prompt before training")
print(compressed_prompt_before)

print("Compressed prompt after training")
print(compressed_prompt_after)


Original prompt:
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.
She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.
#### 18
Compressed prompt before training
Janet’s ducks lay 16 eggs day. eats three breakfast bakes muffins friends. sells remainder farmers' market $2 per fresh duck egg. dollars farmers' market?Janet sells 16 - 3 - 4 =-3-4=9>>9 duck eggs.
 makes 9 * 2 = $<9*2=18>>18 farmer’s market.

Compressed prompt after training
Janet’s ducks lay 16 eggs per day. eats three for breakfast morning bakes muffins friends day with four. sells remainder at farmers market daily for $2 per fresh duck egg. dollars market?Janet sells 16 3 4 duck eggs a day.
 makes 9 2 $ every farmer’s 

| keywords | before training | after training |
|----------|----------|----------|
| 16 eggs  | yes  | yes  |
| three for breakfast | yes  | yes |
| muffins with four | no  | yes  |
| sells remainder for $2 | yes  | yes  |
| 16-3-4=9 | yes  | no  |
| 9*2=18 | yes  | no  |
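
The keyword table above was filled in by reading the outputs. A check of this kind can also be scripted, for example by testing whether a keyword's words appear in order in the compressed text. This is a rough sketch under the assumption that a keyword counts as retained when all of its words survive in order; `contains_in_order` is a hypothetical helper, and the two strings are excerpts from the compressed outputs above.

```python
import re


def contains_in_order(text, keywords):
    """Return True if all `keywords` occur in `text` as words, in the given order."""
    words = iter(re.findall(r"\w+", text.lower()))
    # Membership tests on an iterator consume it, which enforces the ordering.
    return all(keyword.lower() in words for keyword in keywords)


before = "eats three breakfast bakes muffins friends."
after = "eats three for breakfast morning bakes muffins friends day with four."

for name, text in [("before training", before), ("after training", after)]:
    print(name, contains_in_order(text, ["muffins", "four"]))
```

For the "muffins with four" row this agrees with the manual table: the keyword is missing before training and present after.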


In [62]:
prompt = original_prompts[1]
compressed_prompt_before, compressed_prompt_after = compare(prompt)

print("Original prompt:")
print(prompt)

print("Compressed prompt before training")
print(compressed_prompt_before)

print("Compressed prompt after training")
print(compressed_prompt_after)


Original prompt:
Question: A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?It takes 2/2=<<2/2=1>>1 bolt of white fiber
So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric
#### 3
Compressed prompt before training
robe takes 2 bolts blue half white fiber. How many bolts total takes 2/2=<2/2=1>>1 bolt white fiber
 total fabric 2+1<2+1=3>>3 bolts fabric

Compressed prompt after training
A robe takes 2 bolts of blue fiber half that much white fiber. How many bolts in total take?It takes bolt white fiber
 total amount fabric bolts fabric



| keywords | before training | after training |
|----------|----------|----------|
| 2 bolts blue  | yes  | yes  |
| half white | yes  | yes |
| 2/2=1 | yes  | no  |
| 2+1=3 | yes  | no  |

In [63]:
prompt = original_prompts[2]
compressed_prompt_before, compressed_prompt_after = compare(prompt)

print("Original prompt:")
print(prompt)

print("Compressed prompt before training")
print(compressed_prompt_before)

print("Compressed prompt after training")
print(compressed_prompt_after)


Original prompt:
Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000
He increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000
So the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000
So he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000
#### 70000
Compressed prompt before training
Josh house. buys house $80, 000 puts $50, 000 repairs. increased value house 150%. profit cost house repairs 80, 000+50, 000=$<80000+50000=130000>>130, 000
 increased value house 80, 000*1. 5=<80000*1. 5=120000>>120, 000
 new value 120, 000+80, 000=$<120000+80000=200000>>200, 000
 profit 200, 000-130, 000=$<<200000-130000=70000>>70, 000

Compressed prompt after training
: Josh decides try flipping house. buys house for $80, 

| keywords | before training | after training |
|----------|----------|----------|
| house 80,000  | yes  | yes  |
| $50,000 in repairs | yes  | yes |
| increased by 150% | yes  | yes |
| 80000+50000=130000 | yes  | no  |
| 80000*1.5=120000 | yes  | no  |
| 120000+80000=200000 | yes  | no  |
| 200000-130000=70000 | yes  | yes |


* This notebook trains a custom compressor on GSM8K (grade-school math questions and answers), following the steps in https://github.com/microsoft/LLMLingua/tree/main/experiments/llmlingua2
* In these test examples, the pretrained LLMLingua-2 model (trained on MeetingBank) works well on this out-of-domain dataset: its compressed prompts retain most of the key information in the original prompts, such as numbers and operators.
* The custom-trained model does not improve the compressed prompts. In the examples above it often drops the numbers and operators in the solution steps, probably because the training set is small (500 examples).
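
The observation about numbers can be made measurable. The sketch below computes the fraction of distinct numbers from the original text that survive compression. The metric and the name `number_retention` are assumptions introduced for illustration and are not part of LLMLingua. The strings are the solution part of the second test example and its two compressed versions.

```python
import re


def number_retention(original: str, compressed: str) -> float:
    """Fraction of distinct numbers in `original` that still appear in `compressed`."""
    numbers = set(re.findall(r"\d+", original))
    if not numbers:
        return 1.0  # nothing to retain
    kept = set(re.findall(r"\d+", compressed))
    return len(numbers & kept) / len(numbers)


original = (
    "It takes 2/2=<<2/2=1>>1 bolt of white fiber\n"
    "So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric"
)
meetingbank_output = (
    "takes 2/2=<2/2=1>>1 bolt white fiber\n total fabric 2+1<2+1=3>>3 bolts fabric"
)
gsm8k_output = "It takes bolt white fiber\n total amount fabric bolts fabric"

print("MeetingBank model:", number_retention(original, meetingbank_output))
print("GSM8K model:", number_retention(original, gsm8k_output))
```

Averaging such a score over the whole test split would give a less anecdotal comparison than the three examples inspected here.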