Notebook for loading a PDF, extracting its text, and pre-processing that text with a 1B or 3B model

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [14]:
pdf_path = './2402.13116v3.pdf'
DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct"
#DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct" <- Larger alternative; probably not necessary for this step

In [31]:
import textwrap  # used by create_html_diff below
from difflib import HtmlDiff
from IPython.display import HTML, display

In [18]:
# Import necessary libraries
import PyPDF2
from typing import Optional
import os
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.notebook import tqdm
import warnings

accelerator = Accelerator()
device = accelerator.device

warnings.filterwarnings('ignore')

In [9]:
def validate_pdf(file_path: str) -> bool:
    if not os.path.exists(file_path):
        print(f"Error: File not found at path: {file_path}")
        return False
    if not file_path.lower().endswith('.pdf'):
        print("Error: File is not a PDF")
        return False
    return True

In [10]:
def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Get total number of pages
            num_pages = len(pdf_reader.pages)
            print(f"Processing PDF with {num_pages} pages...")
            
            extracted_text = []
            total_chars = 0
            
            # Iterate through all pages
            for page_num in range(num_pages):
                # Extract text from page
                page = pdf_reader.pages[page_num]
                text = page.extract_text()
                
                # Check if adding this page's text would exceed the limit
                if total_chars + len(text) > max_chars:
                    # Only add text up to the limit
                    remaining_chars = max_chars - total_chars
                    extracted_text.append(text[:remaining_chars])
                    print(f"Reached {max_chars} character limit at page {page_num + 1}")
                    break
                
                extracted_text.append(text)
                total_chars += len(text)
                print(f"Processed page {page_num + 1}/{num_pages}")
            
            final_text = '\n'.join(extracted_text)
            print(f"\nExtraction complete! Total characters: {len(final_text)}")
            return final_text
            
    except PyPDF2.errors.PdfReadError:  # PyPDF2 3.x exposes this exception in the errors module
        print("Error: Invalid or corrupted PDF file")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return None
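
The run further down reports 100016 extracted characters even though max_chars is 100000. The likely reason is that 17 page strings (16 full pages plus one truncated page) are joined with 16 newline characters, and the budget check does not count those separators. A minimal sketch of a budget that does count them, using plain strings in place of PDF pages; collect_with_budget is a hypothetical helper for illustration, not part of PyPDF2 or of this notebook:

```python
from typing import List


def collect_with_budget(pages: List[str], max_chars: int, sep: str = "\n") -> str:
    """Join page texts with `sep` so the result never exceeds `max_chars`."""
    pieces: List[str] = []
    used = 0
    for text in pages:
        # Every piece after the first also costs one separator
        sep_cost = len(sep) if pieces else 0
        remaining = max_chars - used - sep_cost
        if remaining <= 0:
            break
        piece = text[:remaining]
        pieces.append(piece)
        used += sep_cost + len(piece)
        if len(piece) < len(text):
            break  # budget exhausted in the middle of this page
    return sep.join(pieces)


# Three fake 40-character "pages" and a 100-character budget
pages = ["a" * 40, "b" * 40, "c" * 40]
result = collect_with_budget(pages, max_chars=100)
print(len(result))
```

The same bookkeeping could be folded into extract_text_from_pdf if an exact limit matters; for a rough cap, the original behaviour is usually close enough.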


In [11]:
# Get PDF metadata
def get_pdf_metadata(file_path: str) -> Optional[dict]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            metadata = {
                'num_pages': len(pdf_reader.pages),
                'metadata': pdf_reader.metadata
            }
            return metadata
    except Exception as e:
        print(f"Error extracting metadata: {str(e)}")
        return None

In [12]:
# Extract metadata first
print("Extracting metadata...")
metadata = get_pdf_metadata(pdf_path)
if metadata:
    print("\nPDF Metadata:")
    print(f"Number of pages: {metadata['num_pages']}")
    print("Document info:")
    for key, value in metadata['metadata'].items():
        print(f"{key}: {value}")

# Extract text
print("\nExtracting text...")
extracted_text = extract_text_from_pdf(pdf_path)

# Display first 500 characters of extracted text as preview
if extracted_text:
    print("\nPreview of extracted text (first 500 characters):")
    print("-" * 50)
    print(extracted_text[:500])
    print("-" * 50)
    print(f"\nTotal characters extracted: {len(extracted_text)}")

# Optional: Save the extracted text to a file
if extracted_text:
    output_file = 'extracted_text.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(extracted_text)
    print(f"\nExtracted text has been saved to {output_file}")

Extracting metadata...

PDF Metadata:
Number of pages: 44
Document info:
/Author: 
/CreationDate: D:20240311015030Z
/Creator: LaTeX with hyperref
/Keywords: 
/ModDate: D:20240311015030Z
/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
/Producer: pdfTeX-1.40.25
/Subject: 
/Title: 
/Trapped: /False

Extracting text...
Processing PDF with 44 pages...
Processed page 1/44
Processed page 2/44
Processed page 3/44
Processed page 4/44
Processed page 5/44
Processed page 6/44
Processed page 7/44
Processed page 8/44
Processed page 9/44
Processed page 10/44
Processed page 11/44
Processed page 12/44
Processed page 13/44
Processed page 14/44
Processed page 15/44
Processed page 16/44
Reached 100000 character limit at page 17

Extraction complete! Total characters: 100016

Preview of extracted text (first 500 characters):
--------------------------------------------------
1
A Survey on Knowledge Distillation of Large
Language Models
Xiaohan Xu1, M

In [20]:
device = "cuda" if torch.cuda.is_available() else "cpu"

SYS_PROMPT = """
You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.

The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.

Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive

The goal is to use this in a podcast research transcript so a lot of the emails, citations, and things like that can be removed-please be smart with what you remove and be creative ok?

Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RETURNING AS IS

Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.

ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
Here is the text:
"""

In [22]:
# Reuses the `accelerator` created in the imports cell
model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL)
model, tokenizer = accelerator.prepare(model, tokenizer)

In [32]:
def create_html_diff(text1, text2, chunk_num):
    """Create HTML diff between two texts"""
    # Wrap text to make it more readable
    text1_lines = textwrap.wrap(text1, width=80)
    text2_lines = textwrap.wrap(text2, width=80)
    
    # Create diff
    diff = HtmlDiff(wrapcolumn=80)
    html = diff.make_file(
        text1_lines, 
        text2_lines,
        fromdesc=f"Original (Chunk {chunk_num})",
        todesc=f"Processed (Chunk {chunk_num})",
        context=True
    )
    
    return html
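
A small standalone sketch of the same difflib call may help when checking the diff output without loading the model. The two sample strings are made up for illustration; only textwrap.wrap and HtmlDiff.make_file from the standard library are used:

```python
import textwrap
from difflib import HtmlDiff

# Made-up "before" and "after" strings standing in for a raw and a cleaned chunk
original = "Knowledge distillation (Gou et al., 2021) transfers knowledge from a teacher model to a student."
cleaned = "Knowledge distillation transfers knowledge from a teacher model to a student."

# HtmlDiff expects sequences of lines, so wrap each text first
html = HtmlDiff(wrapcolumn=80).make_file(
    textwrap.wrap(original, width=80),
    textwrap.wrap(cleaned, width=80),
    fromdesc="Original (sample)",
    todesc="Processed (sample)",
    context=True,
)

print(len(html) > 0)
```

In a notebook, the resulting string can be written to an .html file or shown inline with display(HTML(html)).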

In [34]:
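def process_chunk(text_chunk):
    """Run one chunk of raw text through the model and return the cleaned text."""
    conversation = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": text_chunk},
    ]

    # add_generation_prompt=True appends the assistant header so the model starts its reply
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    # The chat template already contains the BOS token, so do not add it a second time
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)

    with torch.no_grad():  # No gradients are needed for inference
        output = model.generate(
            **inputs,
            do_sample=True,  # temperature and top_p only apply when sampling
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=8126,
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) is unreliable because special tokens are dropped on decode.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()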
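
generate() returns the prompt tokens followed by the continuation, so the prompt has to be stripped before the result is used. The toy sketch below illustrates why stripping by token count is more reliable than slicing the decoded string by len(prompt). It does not use the real tokenizer: toy_decode and the token lists are hypothetical stand-ins that only mimic how skip_special_tokens=True drops special tokens.

```python
# Toy stand-ins: each list element plays the role of one token
SPECIAL = {"<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"}


def toy_decode(tokens, skip_special_tokens=True):
    """Concatenate tokens, optionally dropping the special ones."""
    return "".join(t for t in tokens if not (skip_special_tokens and t in SPECIAL))


prompt_tokens = [
    "<|begin_of_text|>", "<|start_header_id|>", "user", "<|end_header_id|>",
    "Clean this text.", "<|eot_id|>",
]
generated_tokens = ["Cleaned output text that should be returned in full."]
output_tokens = prompt_tokens + generated_tokens  # prompt followed by continuation

# The prompt string still contains the special tokens
prompt_str = toy_decode(prompt_tokens, skip_special_tokens=False)

# String slicing: the decoded output is shorter than the offset assumes,
# so the slice starts too late and loses generated text
by_string = toy_decode(output_tokens)[len(prompt_str):]

# Token slicing: drop exactly the prompt tokens, then decode the rest
by_tokens = toy_decode(output_tokens[len(prompt_tokens):])

print(repr(by_string))
print(repr(by_tokens))
```

With a real model the mismatch depends on how many special tokens the chat template inserts, which is why the offset error can look arbitrary in the logged output.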

In [35]:
INPUT_FILE = "./extracted_text.txt"  # Replace with your file path
CHUNK_SIZE = 1000
output_file = f"clean_{os.path.basename(INPUT_FILE)}"
diff_file = f"diff_{os.path.basename(INPUT_FILE)}.html"

KeyError: ' font-family'

In [30]:
with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text = file.read()

# Calculate number of chunks
num_chunks = (len(text) + CHUNK_SIZE - 1) // CHUNK_SIZE

# Create output file name
output_file = f"clean_{os.path.basename(INPUT_FILE)}"

# Process chunks and write to file
processed_text = ""

with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num in tqdm(range(num_chunks), desc="Processing chunks"):
        # Get the next fixed-size chunk (chunks do not overlap)
        start_idx = chunk_num * CHUNK_SIZE
        end_idx = start_idx + CHUNK_SIZE
        
        chunk = text[start_idx:end_idx]
        
        # Process chunk and append to complete text
        processed_chunk = process_chunk(chunk)
        processed_text += processed_chunk + "\n"
        
        # Write chunk immediately to file, with the same separator as above
        out_file.write(processed_chunk + "\n")
        
        # Force flush the file to disk
        out_file.flush()
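
Fixed-size character slices can cut words in half, as the logged chunks below show (one chunk ends in "Distillati...", another starts with "ed knowledge"). A minimal sketch of an alternative that breaks only between words; chunk_on_whitespace is a hypothetical helper, and it assumes that collapsing newlines and repeated spaces into single spaces is acceptable for this clean-up step:

```python
from typing import List


def chunk_on_whitespace(text: str, target_size: int) -> List[str]:
    """Split text into chunks of at most target_size characters, breaking
    only between words. A single word longer than target_size becomes its
    own (oversized) chunk."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        # +1 accounts for the space joining this word to the previous one
        added = len(word) + (1 if current else 0)
        if current and current_len + added > target_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(word)
        current.append(word)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


# Sample string built by repeating the paper title
sample = ("A Survey on Knowledge Distillation of Large Language Models " * 3).strip()
chunks = chunk_on_whitespace(sample, target_size=40)
for piece in chunks:
    print(repr(piece))
```

The chunks could then be fed to process_chunk in place of the text[start_idx:end_idx] slices; whether the small model cleans them better is something to verify on the actual PDF.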

Processing chunks:   0%|          | 0/101 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
1
A Survey on Knowledge Distillation of Large
Language Models
Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk jl0725@connect.hku.hk
Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati...

PROCESSED TEXT:
...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ed knowledge to smaller models and its utility in model compression and self-
improvement. Our survey is meticulously structured around three foundational pillars: algorithm ,skill, and verticalization – providing
a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications
across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how
DA emerges as a powerfu...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
on and
proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the
potential for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate for compliance with the legal
terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available
at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.
Index Terms —Large lang...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 have un-
locked new realms of possibility, from generating human-
like text to offering sophisticated problem-solving capa-
bilities. The core significance of these LLMs lies in their
emergent abilities (Wei et al., 2022a,b; Xu et al., 2024a), a
phenomenon where the models display capabilities beyond
their explicit training objectives, enabling them to tackle a
diverse array of tasks with remarkable proficiency. Their
deep understanding of context, nuance, and the intrica-
cies of human languag...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
g to revolutionize industries,
augment human creativity, and redefine our interaction with
technology.
Despite the remarkable capabilities of proprietary LLMs
like GPT-4 and Gemini, they are not without their shortcom-
ings, particularly when viewed in light of the advantages
offered by open-source models. A significant drawback is
their limited accessibility and higher cost (OpenAI et al.,
2023). These proprietary models often come with substantial
usage fees and restricted access, making them ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
straints of accessibility, cost, and adaptability
thus present significant challenges in leveraging the full
potential of proprietary LLMs.
In contrast to proprietary LLMs, open-source modelsarXiv:2402.13116v3  [cs.CL]  8 Mar 2024
2
like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al.,
2023a) bring several notable advantages. One of the primary
benefits of open-source models is their accessibility and
adaptability. Without the constraints of licensing fees or
restrictive usage policies, t...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
y stemming from their relatively
limited scale and resources compared to their proprietary
counterparts. One of the most significant limitations is
the smaller model scale, which often results in lower per-
formance on real-world tasks with a bunch of instruc-
tions (Zheng et al., 2023a). These models, with fewer pa-
rameters, may struggle to capture the depth and breadth
of knowledge embodied in larger models like GPT-4. Ad-
ditionally, the pre-training investment in these open-source
models is...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ized applications. This
limitation becomes particularly evident when these models
are compared to the highly fine-tuned proprietary LLMs,
which are often tailored to excel in a wide array of complex
scenarios (OpenAI et al., 2023).
Primarily, recognizing the disparities between propri-
etary and open-source LLMs, KD techniques have surged
as a means to bridge the performance gap between these
models (Gou et al., 2021; Gupta and Agrawal, 2022). Knowl-
edge distillation, in this context, involves ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
t al., 2021) has emerged as a
prevalent paradigm to achieve knowledge distillation of
LLMs, where a small seed of knowledge is used to prompt
the LLM to generate more data with respect to a specific
skill or domain (Taori et al., 2023). Secondly, KD still retains
its fundamental role in compressing LLMs, making them
more efficient without significant loss in performance. (Gu
et al., 2024; Agarwal et al., 2024). More recently, the strategy
of employing open-source LLMs as teachers for their own
s...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:

via self-generated knowledge.
A key aspect of the knowledge distillation is the en-
hancement of skills such as advanced context following
(e.g., in-context learning (Huang et al., 2022a) and in-
struction following (Taori et al., 2023)), improved align-
ment with user intents (e.g., human values/principles (Cui
et al., 2023a), and thinking patterns like chain-of-thought
(CoT) (Mukherjee et al., 2023)), and NLP task specialization
(e.g., semantic understanding (Ding et al., 2023a), and code
gen...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
rom the
proprietary models that have been extensively trained and
fine-tuned in these areas.
The benefits of knowledge distillation in the era of
LLMs are multifaceted and transformative (Gu et al., 2024).
Through a suite of distillation techniques, the gap between
proprietary and open-source models is significantly nar-
rowed (Chiang et al., 2023; Xu et al., 2023a) and even
filled (Zhao et al., 2023a). This process not only streamlines
computational requirements but also enhances the environ-
m...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 growth across various industries
and research domains.
The escalating need for a comprehensive survey on the
knowledge distillation of LLMs stems from the rapidly
evolving landscape of AI (OpenAI et al., 2023; Team et al.,
2023) and the increasing complexity of these models. As AI
continues to penetrate various sectors, the ability to effi-
ciently and effectively distill knowledge from proprietary
LLMs to open-source ones becomes not just a technical
aspiration but a practical necessity. This ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
eRankOptimizationy,1y,2y3y1y2y3≻≻rank……
DataCuration
X,YrawdatasynthesizefeedbackFeedback
input
outputSelf-Knowledge
outputinputinput
YlabelLabelingExpansion
X,YdemonstrationsexpandFeature
featureinput,outputextractSec.4Sec.5
Sec.3.1Sec.3.2
Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that ‘Section’ is abbreviated
as ‘Sec.’ in this figure. RM S(·)denotes the student reward model.
the growing demand for more accessible, cost-effective, and
adaptable ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
posing direc-
tions for future research.
Survey Organization. The remainder of this survey is orga-
nized into several comprehensive sections, each designed to
offer a deep dive into the multifaceted aspects of knowledge
distillation within the realm ofLLMs. Following this intro-
duction, §2 provides a foundational overview of knowledge
distillation, comparing traditional techniques with those
emerging in the era of LLMs and highlighting the role of
data augmentation (DA) in this context. §3 del...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
guage understanding (NLU), genera-
tion (NLG), information retrieval, recommendation systems,
and the evaluation of text generation. In §5, we ventureinto domain-specific vertical distillation, showcasing how
knowledge distillation techniques are applied within spe-
cialized fields such as law, healthcare, finance, and science,
illustrating the practical implications and transformative
impact of these approaches. The survey suggests open
problems in §6, identifying current challenges and gaps in...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
large, complex model (teacher) to a
smaller, more efficient model (student) (Gou et al., 2021).
This technique is pivotal in mitigating the challenges posed
by the computational demands and resource constraints of
deploying large-scale models in practical applications.
Historically, knowledge distillation techniques, prior to
the era of LLMs, primarily concentrated on transferring
knowledge from complex, often cumbersome neural net-
works to more compact and efficient architectures (Sanh
et al.,...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 (Chenglin et al., 2023)
ExpansionSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023)
Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a),
WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b)
CurationUltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a),
Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), WaveCoder (Yu et al., 2024)
ZeroGen (Ye et al...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2024)
Self-KnowledgeSelf-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a),
ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023),
Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022)
DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a),
Self-Instruct (Wang et al., 2022a), Baize (Xu et al., 2023b), STaR (Zelikman et a...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
kill
DistillationContext FollowingInstruction FollowingSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023),
WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023),
WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a),
Multi-turn DialogueVicuna (Chiang et al., 2023), Baize (Xu et al., 2023b), UltraLLaMA (Ding et al., 2023b),
CAMEL (Li et al., 2023b), OpenChat (Wang et al., 2023c), Zephyr (Tunstall et al., 2023),...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023), UltraFeedback (Cui et al., 2023a),
ValueCAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b),
Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a)
AgentTool UsingToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023),
ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a),
Confucius (Gao et al., 2023b), MLLM-Tool (Wang et al., 2024), α-UMi (...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
OMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023),
ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a),
ChatGPT NMT (Yang and Nicolai, 2023),
Information RetrievalQUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022),
AugTriever (Meng et al., 2023), (Sun et al., 2023a), RankVicuna (Pradeep et al., 2023a),
RankZephyr (Pradeep et al., 2023b), ExaRanker (Ferraretto et al., 2023),
Recommendation NDR (Mysore et al., 20...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
lti-ModalityLLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c),
LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b),
Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e),
Verticalization
DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Medical & Healthcare (Zhang et al., 2023c; Chen ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
network to mimic the
output of a larger teacher network, often through techniques
like soft target training, where the student learns from
the softened softmax output of the teacher. Please refer to
the survey (Gou et al., 2021) for more details on general
knowledge distillation techniques in AI and DL.
In contrast, the advent of LLMs has revolutionized
the knowledge distillation landscape. The current era of
knowledge distillation in LLMs shifts the focus from mere
architecture compression to t...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
r reduce the model size , the current focus in LLM-based
knowledge distillation is to extract and transfer the rich,
nuanced understanding that these models have developed.
The key to this modern approach lies in heuristic and
carefully designed prompts, which are used to elicit specific
knowledge (Ding et al., 2023b) or capabilities (Chaudhary,
2023) from the LLMs. These prompts are crafted to tap
into the LLM’s understanding and capabilities in various
domains, ranging from natural language un...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
their explicit training objectives.
Furthermore, this era of knowledge distillation also em-
phasizes the transfer of more abstract qualities such as
reasoning patterns (Mitra et al., 2023), preference align-
ment (Cui et al., 2023a), and value alignment (Sun et al.,
2024b). This is in stark contrast to the earlier focus on output
replication (Taori et al., 2023), indicating a shift towards
a more holistic and comprehensive transfer of cognitive
capabilities. The current techniques involve not j...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 al., 2022) emerges as a critical paradigm integral
to the process of knowledge distillation. Unlike traditional
DA techniques such as paraphrasing (Gangal et al., 2022) orback-translation (Longpre et al., 2019), which primarily aim
at expanding the training dataset in a somewhat mechanical
manner. DA within the context of LLMs focuses on the
generation of novel, context-rich training data tailored to
specific domains and skills. This innovation is driven by the
unique capabilities of LLMs to ge...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
capability gap between proprietary and open-
source models. Through DA, LLMs are prompted to create
targeted, high-quality datasets that are not merely larger in
volume but are also rich in diversity and specificity. This
approach enables the distillation process to be more effec-
tive, ensuring that the distilled models not only replicate
the teacher model’s output behavior but also embody its
deep-seated understanding and cognitive strategies.
The significance and necessity of DA for achieving...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ssible approach to harnessing
the power of LLMs. It empowers open-source models with
the ability to approximate the contextual adeptness, ethical
alignment, and deep semantic insights characteristic of their
proprietary counterparts, thereby democratizing access to
advanced AI capabilities and fostering innovation across a
broader spectrum of applications and users.
2.3 Survey Scope
Building on the discussions introduced earlier, this survey
aims to comprehensively explore the landscape of knowl...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ions and methodologies of knowledge distillation. It
includes an in-depth exploration of the processes involved
in constructing knowledge from teacher models (e.g., pro-
prietary LLMs) and integrating this knowledge into student
models (e.g., open-source LLMs). Under the umbrella of
‘knowledge ’, we delve into strategies such as labeling (Hsieh
et al., 2023), expansion (Taori et al., 2023), curation (Gu-
nasekar et al., 2023), feature understanding (Agarwal et al.,
6
2024), feedback mechanisms (...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
This analysis
aims to illuminate how these algorithms facilitate the trans-
fer of knowledge, ensuring that open-source models can
replicate and, in some cases, surpass the capabilities of their
proprietary counterparts.
Skill Distillation. This facet examines the specific compe-
tencies and capabilities enhanced through KD. It encom-
passes detailed discussions on context following (Taori et al.,
2023; Luo et al., 2023c), with subtopics like instruction
following and retrieval-augmented generat...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
roader AI and ML ecosystem.
By navigating through these facets, this survey en-
deavors to provide an extensive and nuanced analysis of
knowledge distillation in the era of LLMs. It serves as a
guide for researchers, practitioners, and enthusiasts in the
field, shedding light on current methodologies, challenges,
and opportunities for innovation in this rapidly evolving
domain.
Declaration. This survey represents our earnest effort to
provide a comprehensive and insightful overview of knowl-
edg...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 their impacts
across a range of applications.
2.4 Distillation Pipeline in LLM Era
SeedKnowledgeSkill/Domain
TeacherLLMKnowledgeElicitationStudentModelDistillationAlgorithmsteer
driveGeneratedKnowledgeLearningObjectivetrain
Fig. 4: An illustration of a general pipeline to distill knowl-
edge from a large language model to a student model.
The general distillation pipeline of LLMs is a structured
and methodical process aimed at transferring knowledge
from a sophisticated teacher model to a less ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
lves directing the teacher LLM towards a
specific target skill or domain. This is achieved through care-
fully crafted instructions or templates that guide the LLM’s
focus. These instructions are designed to elicit responses
that demonstrate the LLM’s proficiency in a particular area,
be it a specialized domain like healthcare or law, or a skill
such as reasoning or language understanding. The objective
here is to utilize the teacher LLM’s extensive training and
nuanced capabilities to generate ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 seed knowledge is crucial as it provides a
foundation upon which the teacher model can build and
expand, thereby creating more comprehensive and in-depth
knowledge examples.
III. Generation of Distillation Knowledge. In response
to the seed knowledge and steering instructions, the teacher
LLM generates knowledge examples. These examples are
predominantly in the form of question-and-answer (QA)
dialogues or narrative explanations, aligning with the nat-
ural language processing/understanding cap...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ge examples to train the student
model. This training is guided by a loss function that aligns
with the learning objectives. The loss function quantifies
the student model’s performance in replicating or adapting
the knowledge from the teacher model. By minimizing this
loss, the student model learns to emulate the target skills or
domain knowledge of the teacher, thereby acquiring similar
capabilities. The process involves iteratively adjusting the
student model’s parameters to reduce the discre...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ch the LLM can
explore to generate novel knowledge, Parse( o, s)stands for
to parse the distillation example ( e.g., (x, y)) from the
teacher LLM’s output o(plus the input sin some cases),
andpTrepresents the teacher LLM with parameters θT.
Given the datasets D(kd)
Ibuilt for distillation, we then define
a learning objective as
L=X
ILI(D(kd)
I;θS), (2)
whereP
Idenotes there could be multiple tasks or skills
being distilled into one student model, LI(·;·)stands for a
specific learning objective, ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 LLMs (Eq.1), and ‘Distillation,’
centered on injecting this knowledge into student models
(Eq.2). We will elaborate on these two processes in the
subsequent sections.
3.1 Knowledge
This section focuses on the approaches to elicit knowledge
from teacher LLMs. According to the manners to acquire
knowledge, we divided them into Labeling ,Expansion ,DataCuration ,Feature ,Feedback , and Self-Knowledge . Figure 5
shows an illustration of these knowledge elicitation meth-
ods.
3.1.1 Labeling
Labeling...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
lable through the
predefined Iandc. This process can be formulated as
follows:
D(lab)={x, y|x∼ X, y∼pT(y|I⊕c⊕x)}. (3)
Input xcould be sourced from existing NLP task
datasets, which serve as typical reservoirs for distillation
efforts. Numerous works have sought to harness the capa-
bilities of powerful LLMs as teachers for annotating dataset
samples across a range of tasks. For instance, efforts in
natural language understanding involve using LLMs to cat-
egorize text (Gilardi et al., 2023; Ding...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
t al., 2023d;
Liu et al., 2023g), among others. Rather than concentrating
on specific tasks, many current works focus on labeling
outputs based on instructions, thereby teaching student
models to solve tasks in a more flexible way by following in-
structions. Collections of various NLP tasks, complemented
by instructional templates, serve as valuable input sources
forx. For instance, FLAN-v2 collections (Longpre et al.,
2023) offers extensive publicly available sets of tasks with
instructions, w...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
from forums like Quora and Stack Overflow.
Moreover, the process of labeling could be guided by
instructions Ior demonstrations c. A commonly used in-
struction type for guiding labeling is chain-of-thought (CoT)
prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al.,
2023). Mukherjee et al. (2023) add multiple system messages
(e.g. “You must generate a detailed and long answer.” or
“explain like I’m five, think step-by-step”) to elicit rich
signals. Yue et al. (2023a) and Chenglin et al. ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
acher LLMs. Labeling : The teacher generates
the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in-
context learning; Data Curation : The teacher synthesizes data according to meta-information, such as a topic or an entity;
Feature : Feed the data into the teacher and extract its internal knowledge, such as logits and features; Feedback : The teacher
provides feedback on the student’s generations, such as preferences, corrections, exp...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 is constrained by the scale
and variety of the input data. In real-world applications,
especially those involving user conversations, there are also
concerns regarding the privacy of the data involved. To
address these limitations, various expansion methods have
been proposed (Wang et al., 2022a; Taori et al., 2023; Chaud-
hary, 2023; Si et al., 2023; Ji et al., 2023a; Luo et al., 2023b,a;
Wu et al., 2023c; Sun et al., 2024b; Xu et al., 2023a; Guo
et al., 2023c; Rozi `ere et al., 2023; West et ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
as follows:
D(exp)={(x, y)|x∼pT(x|I⊕c), y∼pT(y|I⊕x)}.(4)
In this formulation, xand yrepresent the new input-
output pairs generated by the teacher LLM. The input x
is generated based on a set of input-output demonstrations
c. The output yis then generated in response to the new
input xunder the guidance of an instruction I. Note thatthe demonstrations could be predefined or dynamically
updated by adding the newly generated samples.
Expansion techniques have been widely utilized to
extract extens...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 text-
davinci-003, to distill 52K high-quality data. To improve
the diversity and coverage during expansion, Wu et al.
(2023c) and (Sun et al., 2024b) prompt the teacher LLM to
generate instructions corresponding to some specific topics.
Xu et al. (2023a) propose an Evol-Instruct method to ex-
pand the instructions from two dimensions: difficulty (e.g.
rewriting the question to be more complex) and diversity
(e.g. generating more long-tailed instructions). This Evol-
Instruct method is domain-a...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
023b) proposes the Targeted Data Generation (TDG)
framework, which automatically identifies challenging sub-
groups within data and generates new samples for these
subgroups using LLMs through in-context learning.
In summary, the expansion method leverages the in-
9
context learning strengths of LLMs to produce more var-
ied and extensive datasets with both inputs and outputs.
However, the quality and diversity of the generated data
are heavily reliant on the teacher LLMs and the initial seed
de...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
emergence
of the Data Curation approach. This method arises in re-
sponse to the limitations observed in both the Labeling and
Expansion approaches. These methods often yield data of
variable quality and face constraints in quantity. In Labeling,
the seed knowledge is sourced from task datasets, leading
to potential noise and dirty data. Meanwhile, in Expansion,
the input xis derived from seed demonstrations, which
can result in homogeneous data when generated in large
quantities. To overcome th...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
to this process to generate controllable x
andy. Thus, this process can be meticulously controlled
to yield datasets that are not only large in scale but also
of high quality. The formulation for Data Curation can be
represented as:
D(cur)={(x, y)|x∼pT(x|I⊕m), y∼pT(y|I⊕x)}.(5)
In this formulation, mrepresents the diverse meta-
information used to guide the synthesis of x, and Iis the
instruction guiding teacher LLMs to generate xory.
Different studies primarily vary in their source and
method of...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
 broad array
of instructions and conversations, achieving a substantial
scale of 1.5 million instances. UltraChat stands out with its
lexical and topical diversity. The UltraLLaMA model, fine-
tuned on this data, consistently surpasses other open-source
models. Another notable series, phi(Gunasekar et al., 2023;
Li et al., 2023a; Mar, 2023), focuses on distilling smaller,
high-quality datasets akin to ”textbooks.” Phi-1 (Gunasekar
et al., 2023) experiments with synthesizing ”textbook qual-
ity” ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
coding benchmarks like Hu-
manEval and MBPP while being 10 times smaller in model
size and 100 times smaller in dataset size. MFTCoder (Liu
et al., 2023d) utilizes hundreds of Python knowledge points
as meta-information to create a CodeExercise Dataset. In
contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu
et al., 2024) get raw code collections from open-source
code datasets, using this as meta-information for generating
instructional data. In the context of NLU tasks, certain
studies (Ye ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ts
that are not only high-quality and diverse but also large
in scale. The success of models like phi-1 in specialized
domains underscores the efficacy of this method. The ability
to create synthetic datasets will become a crucial technical
skill and a key area of focus in AI (Li et al., 2023a).
3.1.4 Feature
The previously discussed knowledge elicitation methods
are typically applied to powerful black-box models, which
are expensive and somewhat unreproducible due to calling
API. In contrast, w...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
e context of
generative LLMs (Timiryasov and Tastet, 2023; Liang et al.,
2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a;
Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin
et al., 2023b; Boizard et al., 2024; Zhong et al., 2024).
The typical method for acquiring this feature knowledge
involves teacher LLMs annotating the output sequence y
with its internal representations. These annotations are then
distilled into the student model using methods such as
Kullback-Leible...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
d dataset of sequences with
token-level probability distributions (Sanh et al., 2019; Wen
et al., 2023). To leverage the rich semantic and syntactic
knowledge in intermediate layers of the teacher model,
TED (Liang et al., 2023a) designs task-aware layer-wise
distillation. They align the student’s hidden representations
with those of the teacher at each layer, selectively extracting
knowledge pertinent to the target task. Gu et al. (2024) and
Agarwal et al. (2024) introduce a novel approach wher...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
rve the original output
distribution when quantizing the LLMs, ensuring minimal
loss of performance. Additionally, feature knowledge could
serve as a potent source for multi-teacher knowledge distil-
lation. Timiryasov and Tastet (2023) leverages an ensemble
of GPT-2 and LLaMA as teacher models to extract output
distributions. Similarly, FuseLLM (Wan et al., 2024a) inno-
vatively combines the capabilities of various LLMs through
a weighted fusion of their output distributions, integrating
them i...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ma-
tion. While showing promise, especially in smaller models,
its application is not suitable for black-box LLMs where
internal parameters are inaccessible. Furthermore, student
models distilled from white-box LLMs may underperform
compared to their black-box counterparts, as the black-box
teacher LLMs (e.g. GPT-4) tend to be more powerful.
3.1.5 Feedback
Most previous works predominantly focus on one-way
knowledge transfer from the teacher to the student for
imitation, without considering feed...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
(x, y, ϕfb(x, y; θT)) | x ∼ X, y ∼ pS(y|x)}, (7)
where y denotes the output generated by the student
model in response to x, and ϕfb(·; θT) represents providing
feedback from teacher LLMs. This operation evaluates the
student’s output y given the input x, by offering assessment,
corrective information, or other forms of guidance.
This feedback knowledge can not only be distilled into
the student to also generate feedback (such as creating a
student preference model) but, more importantly, enable
the st...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
preferences could be distilled from teachers
by prompting it with specific criteria. Bai et al. (2022a) in-
troduce RLAIF for distilling harmlessness preferences from
LLMs. This involves using an SFT-trained LLM to generate
response pairs for each prompt, then ranking them for
harmlessness to create a preference dataset. This dataset is
distilled into a Preference Model (PM), which then guides
the RL training of a more harmless LLM policy. Wizard-
Math (Luo et al., 2023b) places emphasis on math...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
-following, truthfulness, honesty and
helpfulness.
Beyond merely assessing student generations, teachers
can also furnish extensive feedback on instances where
students underperform. In Lion (Jiang et al., 2023b), teacher
model pinpoints instructions that pose challenges to the
student model, generating new, more difficult instructions
aimed at bolstering the student’s abilities. PERsD (Chen
et al., 2023a) showcases a method where teacher offers
tailored refinement feedback on incorrect code sni...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ent an innovative strategy
wherein the student model initially generates sequences,
followed by teacher model producing an output distribution
as feedback. This method leverages the teacher’s insight
to directly inform and refine the student model’s learning
process.
3.1.6 Self-Knowledge
The knowledge could also be elicited from the student itself,
which we refer to as Self-Knowledge. In this setting, the same
model acts both as the teacher and the student, iteratively
improving itself by disti...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
elf-generated outputs y, which
could include but is not limited to filtering, rewarding, or
any other mechanisms for enhancing or evaluating y. It
could be governed by external tools or the student itself θS.
Recent research in this area has proposed various innovative
methodologies to elicit self-knowledge, demonstrating its
potential for creating more efficient and autonomous learn-
ing systems. (Allen-Zhu and Li, 2020; Wang et al., 2022a;
Sun et al., 2024b; Yang et al., 2024a; Jung et al., 20...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ne-tunes the original
model. Other methods aim to elicit targeted knowledge
from student models by modifying prompts, and leveraging
these data for further refinement. In Self-Align (Sun et al.,
2024b), they find that models fine-tuned by Self-Instruct
data tend to generate short or indirect responses. They
prompt this model with verbose instruction to produce in-
depth and detailed responses. Then, they employ context-
distillation (Askell et al., 2021) to distill these responses
paired with no...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:

tence summarization tasks, implementing filters based on
entailment, length, and diversity to screen self-generated
summaries. LMSI (Huang et al., 2023a) generates multiple
CoT reasoning paths and answers for each question, and
then retains only those paths that lead to the most consistent
answer.
Note that refined self-knowledge can be iteratively ac-
quired as the student model continuously improves, further
enhancing the student’s capabilities. This is Gulcehre et al.
(2023) introduces a Rei...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio
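
The input chunk above describes LMSI as sampling several CoT reasoning paths per question and retaining only the paths that lead to the most consistent answer. A minimal sketch of that filtering step is given below, assuming each sampled path is a (reasoning, answer) pair and that "most consistent" means the most frequent answer. The function name `keep_consistent_paths` and the sample data are hypothetical and not taken from the LMSI paper or from this notebook.

```python
from collections import Counter


def keep_consistent_paths(paths):
    """Keep only the (reasoning, answer) pairs whose answer is the majority answer."""
    if not paths:
        return []
    # Count how often each final answer occurs across the sampled paths.
    counts = Counter(answer for _, answer in paths)
    majority_answer, _ = counts.most_common(1)[0]
    return [(reasoning, answer) for reasoning, answer in paths
            if answer == majority_answer]


# Hypothetical sampled paths for one question.
sampled = [
    ("6 * 7 = 42", "42"),
    ("7 + 7 + 7 + 7 + 7 + 7 = 42", "42"),
    ("6 * 7 = 48", "48"),
]
kept = keep_consistent_paths(sampled)
```

The retained pairs could then serve as self-generated fine-tuning data. Ties between answers are broken by first occurrence here, which is an arbitrary choice.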


INPUT TEXT:
,
2024a) introduces a framework resembling iterative DPO,
where the language model is fine-tuned to differentiate the
self-generated responses from the human-annotated data.
These self-generated responses could be seen as “negative
knowledge” to promote the student to better align with
the target distribution. Self-Rewarding (Yuan et al., 2024a)
explores a novel and promising approach by utilizing the
language model itself as a reward model. It employs LLM-
as-a-Judge prompting to autonomously a...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
g and Rank Optimization,
as shown in Figure 3.
3.2.1 Supervised Fine-Tuning
Supervised Fine-Tuning (SFT), or called Sequence-Level KD
(SeqKD) (Kim and Rush, 2016), is the simplest and one of
the most effective methods for distilling powerful black-box
LLMs. SFT finetunes student model by maximizing the like-
lihood of sequences generated by the teacher LLMs, aligning
the student’s predictions with those of the teacher. This
process can be mathematically formulated as minimizing
the objective fu...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
, 2022a;
Huang et al., 2023c; Xu et al., 2023b; Zelikman et al., 2022).
Due to the large number of KD works applying SFT, we
only list representative ones here. More detailed works can
be found in §4.
3.2.2 Divergence and Similarity
This section mainly concentrates on algorithms designed for
distilling feature knowledge from white-box teacher LLMs,
including distributions and hidden state features. These
algorithms can be broadly categorized into two groups:
those minimizing divergence in probab...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
(x, y))∥2
L1-Norm Distance: ∥ΦT(fT(x, y)) − ΦS(fS(x, y))∥1
Cross-Entropy Loss: −Σ ΦT(fT(x, y)) log(ΦS(fS(x, y)))
Maximum Mean Discrepancy: MMD(ΦT(fT(x, y)), ΦS(fS(x, y)))
TABLE 2: Summary of similarity functions in knowledge distillation.
and student models, represented by a general divergence
function D:
LDiv = E_{x∼X, y∼Y}[D(pT(y|x), pS(y|x))], (10)
The specific form of D varies depending on the type of
divergence employed. Table 1 outlines the functional forms
of D for different divergence measures. The ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ss to tokens with low probability
under the teacher’s distribution (cf. Figure 6 blue curve).
This mode-covering phenomenon can potentially lead to
hallucinations and low-quality generations. Alternatively,
mode-seeking divergences like reverse KL prioritize tokens
where the teacher assigns high probabilities (cf. Figure 6
green curve). This approach can mitigate the risk of low-
quality outputs, fostering more accurate generations. How-
ever, it often does so at the cost of reduced diversity. G...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio
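
The input chunk above contrasts the mode-covering behaviour of forward KL with the mode-seeking behaviour of reverse KL. The toy sketch below illustrates that contrast on a three-token vocabulary. It assumes forward KL means KL(teacher || student) and reverse KL means KL(student || teacher). The distributions `teacher`, `covering` and `seeking` are made-up numbers chosen for illustration, not values from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.495, 0.01, 0.495]   # two modes separated by a low-probability token
covering = [0.34, 0.32, 0.34]    # student that spreads mass over every token
seeking = [0.98, 0.01, 0.01]     # student that commits to a single mode

# Forward KL: expectation taken under the teacher.
forward_covering = kl(teacher, covering)
forward_seeking = kl(teacher, seeking)

# Reverse KL: expectation taken under the student.
reverse_covering = kl(covering, teacher)
reverse_seeking = kl(seeking, teacher)

print(f"forward KL: covering={forward_covering:.3f}, seeking={forward_seeking:.3f}")
print(f"reverse KL: covering={reverse_covering:.3f}, seeking={reverse_seeking:.3f}")
```

Under these numbers forward KL scores the covering student better, even though it puts substantial mass on the token the teacher considers unlikely, while reverse KL scores the single-mode student better at the cost of ignoring the second mode.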


INPUT TEXT:
r variations, while reverse KL
divergence is preferable for tasks like dialogue generation
and instruction tuning, which involve multiple modes and
a wider range of potential responses. Thus, the nature of the
task significantly influences the selection of the divergence
function for optimal performance.
Similarity. Similarity-based methods in knowledge distilla-
tion aim to align the hidden states or features of the student
[Figure 6 labels: p; argmin_q KL(p||q); argmin_q KL(q||p)]
Fig. 6: Comparison of Forward and Revers...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
tive is to ensure that the student model not only
produces similar outputs to the teacher but also processes
information in a comparable manner. The formulation for a
similarity-based objective might look like this:
LSim = E_{x∼X, y∼Y}[LF(ΦT(fT(x, y)), ΦS(fS(x, y)))], (11)
where fT(x, y) and fS(x, y) are the feature maps of the
teacher and student models, respectively. The transformation
functions ΦT and ΦS are applied to these feature maps
to ensure they are in the same shape, facilitating direct
compari...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
iltered
representations in both teacher and student models. While
similarity-based approaches are common in encoder-based
LMs (Sun et al., 2019, 2020; Jiao et al., 2020; Hou et al.,
2020; Zuo et al., 2022; Liang et al., 2021), their application in
LLM knowledge distillation is not as widespread. However,
considering their effectiveness, we anticipate an increase in
research exploring these methods for LLM distillation in the
near future.
3.2.3 Reinforcement Learning
This section explores advance...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
model rϕ using the feedback data D(fd)
generated by teacher LLMs. Preference data, as one of the
typical feedback, is employed to train the student reward
model (Bai et al., 2022a; Cui et al., 2023a; Lee et al., 2023a;
Kim et al., 2023a). They usually consist of input-output
pairs (x, yw, yl). Here, yw and yl represent “winning” and
“losing” outputs relative to the teacher’s preferences. The
loss function for the reward model is defined as:
LRM(rϕ, D(fd)) = −E_{(x, yw, yl)∼D(fd)}[log σ(rϕ(x, yw) − rϕ(x, yl)...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio
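
The input chunk above gives the reward-model loss as the negative expected log-sigmoid of the score gap between the "winning" and "losing" outputs. A minimal pure-Python sketch of that pairwise loss follows, assuming the reward model has already produced a scalar score for each output. The function names and the example scores are hypothetical. A real implementation would compute the scores with a neural network and backpropagate through them.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(score_pairs):
    """Average of -log sigmoid(r_w - r_l) over (winner_score, loser_score) pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs) / len(score_pairs)


# Hypothetical scores: the first list ranks the pairs correctly, the second does not.
well_ordered = [(2.0, -1.0), (1.5, 0.5)]
mis_ordered = [(-1.0, 2.0), (0.5, 1.5)]

loss_good = reward_model_loss(well_ordered)
loss_bad = reward_model_loss(mis_ordered)
```

The loss shrinks as the winner's score rises above the loser's, and equals log 2 when the two scores are tied.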


INPUT TEXT:
second stage,
the student model, represented by a policy πθ, is optimized
to maximize the expected reward as per the trained reward
model. Simultaneously, it minimizes the divergence from
a reference policy πref, typically the initial policy of the
student model trained by SFT, controlled by a factor β. The
RL objective is given by:
max_{πθ} E_{x∼X, y∼πθ(y|x)}[rϕ(x, y)] − β DKL[πθ(y|x) ∥ πref(y|x)] (13)
This RL framework not only ensures that the student model
learns the explicit content from the teacher b...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio
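
The input chunk above states the RL objective as expected reward minus a KL penalty, weighted by β, against a reference policy. The sketch below evaluates a simple sample-based estimate of that quantity. It assumes each sample was drawn from the current policy and carries its reward together with its log-probability under the policy and under the reference policy, so the KL term is approximated by the mean log-ratio. The function name and the numbers are hypothetical, and practical RLHF trainers typically apply the penalty per token rather than per sequence.

```python
def kl_regularized_objective(samples, beta):
    """Monte Carlo estimate of E[r(x, y)] - beta * KL(policy || reference).

    samples: iterable of (reward, logp_policy, logp_ref) for outputs y ~ policy.
    """
    samples = list(samples)
    total = 0.0
    for reward, logp_policy, logp_ref in samples:
        # The log-ratio is a single-sample estimate of the KL term.
        total += reward - beta * (logp_policy - logp_ref)
    return total / len(samples)


# Hypothetical samples: the first output has drifted away from the reference.
samples = [(1.0, -2.0, -2.5), (0.5, -1.0, -1.0)]
objective = kl_regularized_objective(samples, beta=0.1)
```

With β set to 0 the estimate reduces to the mean reward. Larger values of β penalise outputs whose policy log-probability exceeds the reference log-probability.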


INPUT TEXT:
putational cost compared to employing a
smaller distilled reward model.
3.2.4 Ranking Optimization
Ranking optimization presents a stable and computationally
efficient alternative to RL for injecting preference feedback
into language models (Rafailov et al., 2023; Song et al.,
2023a; Yuan et al., 2023b). This method, diverging from
traditional RL approaches, directly incorporates ranking
information into language models from a fixed preference
dataset during fine-tuning. Intuitively, it directly...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
v et al., 2023) to distill the
preference alignment in teacher LLMs. DPO streamlines
the objective of reinforcement learning (as in Eq. 13),
which involves reward maximization with a KL-divergence
constraint, into a single-stage policy training. Specifically,
DPO’s training goal is to maximize the following expecta-
tion:
E_{(x, yw, yl)∼D(fd)}[log σ(β log(πθ(yw|x) / πref(yw|x)) − β log(πθ(yl|x) / πref(yl|x)))], (14)
where yw is preferred over yl according to the teacher
LLM. Hong et al. (2023) (Hong et al., 202...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio
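
The input chunk above describes the DPO training goal in Eq. 14: maximise the expected log-sigmoid of the β-scaled difference between the policy-to-reference log-ratios of the preferred and the dispreferred output. A minimal pure-Python sketch of that quantity is given below, assuming sequence-level log-probabilities are already available for both the policy and the reference model. The function names and the example values are hypothetical. A training loop would minimise the negative of this value with automatic differentiation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_objective(batch, beta):
    """Average DPO objective over (logp_w, ref_logp_w, logp_l, ref_logp_l) tuples."""
    batch = list(batch)
    total = 0.0
    for logp_w, ref_logp_w, logp_l, ref_logp_l in batch:
        # beta * log(pi(y_w|x) / pi_ref(y_w|x)) - beta * log(pi(y_l|x) / pi_ref(y_l|x))
        margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
        total += log_sigmoid(margin)
    return total / len(batch)


# Policy identical to the reference: the margin is zero.
neutral = dpo_objective([(-5.0, -5.0, -5.0, -5.0)], beta=0.1)
# Policy raises the preferred output and lowers the dispreferred one.
improved = dpo_objective([(-4.0, -5.0, -6.0, -5.0)], beta=0.1)
```

The objective increases when the policy assigns relatively more probability to the preferred output than the reference does, which is the behaviour the chunk attributes to DPO.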


INPUT TEXT:
uding
Context Following, Alignment, Agent, NLP Task Specialization
and Multi-Modality. Context Following focuses on the
student’s ability to comprehend and respond effectively
to input information. Alignment delves into the student’s
capability to align its output with the teacher’s responses.
Moving forward, Agent underscores the autonomous nature
of language models. NLP Task Specialization highlights the
LLM’s versatility in specializing across various Natural
Language Processing tasks, demo...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
A Expansion SFT
Lion (Jiang et al., 2023b) IF Alpaca Data ChatGPT LLaMA Labeling + Expansion + Feedback -
BabyLlama (Timiryasov and Tastet, 2023) IF 10M-word BabyLM dataset GPT-2 + small LLaMA 58M-parameter LLaMA Feature D&S
MiniLLM (Gu et al., 2024) IF Dolly Dataset GPT2 + OPT + LLaMA GPT2 + OPT + LLaMA Feature D&S
Self-Align (Sun et al., 2024b) IF Human-written Principles LLaMA LLaMA Expansion + Self-Knowledge SFT
Self-Rewarding (Yuan et al., 2024a) IF Human-written Samples LLaMA LLaMA Self-Kn...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
 Human Conversation ChatGPT LLaMA Labeling SFT
Baize (Xu et al., 2023b) IF/MD Quora + Stack Overflow ChatGPT LLaMA Expansion + Self-Knowledge SFT
UltraChat (Ding et al., 2023b) IF/MD Wikidata + Text Material + C4 ChatGPT LLaMA Curation SFT
Orca (Mukherjee et al., 2023) IF/TP FLAN-v2 ChatGPT + GPT4 LLaMA Labeling SFT
Orca2 (Mitra et al., 2023) IF/TP FLAN-v2 + Few-Shot/Math/Synthetic GPT4 LLaMA Labeling SFT
SelFee (Ye et al., 2023) IF/TP Human Conv, Flan/Code/Math Collection ChatGPT LLaMA Labeling...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
MA Label SFT
KARD (Kang et al., 2023b) IF/RAG MedQAUSMLE ChatGPT T5 + OPT Label SFT + D&S
Self-RAG (Asai et al., 2023) IF/RAG Open-Instruct GPT4 LLaMA Labeling SFT
Alignment
OpenChat (Wang et al., 2023c) IF/Preference Human Conversation ChatGPT + GPT4 LLaMA Labeling SFT + RL
Zephyr (Tunstall et al., 2023) IF/Preference Mixed Datasets GPT4 Mistral Labeling + Feedback SFT + RO
ALMoST (Kim et al., 2023a) IF/Preference Human-written Prompts LLaMA LLaMA Expansion + Labeling SFT + RL
RLCD (Yang et al....

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
itten Prompts Self-defined Student Model Self-defined Model Labeling + Expansion + Feedback SFT + RL
SANDBOX (Liu et al., 2023b) Value Simulation text-davinci-002/-003 + GPT4 + ChatGPT LLaMA Data Curation SFT + RL
Agent
Toolformer (Schick et al., 2023) Tool CCNet GPT-J GPT-J Labeling SFT
Graph-ToolFormer (Zhang, 2023) Tool Mixed Graph Dataset ChatGPT GPT-J + LLaMA Labeling SFT
Gorilla (Patil et al., 2023) Tool Online API Documentation GPT4 LLaMA Expansion SFT
GPT4Tools (Yang et al., 2023b) Tool Im...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
3a) Planning Mixed Interactive Tasks GPT4 LLaMA Labeling SFT
AUTOACT (Qiao et al., 2024) Planning Mixed QA Tasks LLaMA LLaMA Labeling SFT
NLP Task Specialization
AugGPT (Dai et al., 2023a) NLU Amazon/Symptoms/PubMed20k Dataset ChatGPT BERT Label SFT
TDG (He et al., 2023b) NLU SST + QQP + MNLI GPT3 BERT Expansion SFT
SunGen (Gao et al., 2023a) NLU Text Classification Tasks GPT2 DistilBERT Curation SFT
UDG (Wang et al., 2021a) NLU NLU Tasks GPT3 BERT Expansion SFT
InheritSumm (Xu et al., 2023c) NL...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
tGPT LLaMA Labeling SFT
RankZephyr (Pradeep et al., 2023b) IR IR Datasets ChatGPT + GPT4 Mistral Labeling SFT
NDR (Mysore et al., 2023) Recommendation Recommendation Datasets GPT3 MPnet-110M Labeling SFT
InstructRec (Zhang et al., 2023b) Recommendation 39 instruction templates ChatGPT Flan-T5 Expansion + Self-Knowledge SFT
ONCE (Liu et al., 2023c) Recommendation Recommendation Dataset ChatGPT LLaMA Labeling SFT
PandaLM (Wang et al., 2023b) Evaluation Alpaca Data ChatGPT LLaMA Labeling SFT
Promet...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
tarCoder Expansion SFT
Magicoder (Wei et al., 2023) Code Existing Source Codes ChatGPT LLaMa Curation SFT
WaveCoder (Yu et al., 2024) Code Existing Source Codes GPT4 LLaMa Curation SFT
Code Alpaca (Chaudhary, 2023) Code Code Instructions ChatGPT LLaMA Expansion + Self-Knowledge SFT
Code Llama (Rozière et al., 2023) Code Human-written Instructions LLaMA LLaMA Expansion + Self-Knowledge SFT
Code Clean (Jain et al., 2023) Code Code Datasets ChatGPT LLaMA Labeling SFT
Multi-Modality
LLaVA (Liu et ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
atBridge (Zhao et al., 2023d) Multiple Modalities Task-Specific/Multimodal-Chat Data GPT4 + ChatGPT LLaMA Labeling SFT
TABLE 3: A summary of skill distillation works. IF: Instruction Following, MD: Multi-turn Dialoue, TP: Think Pattern,
RAG: Retrieval-Augmented Generation, NLU: Natural Language Understanding, NLG: Natural Language Generation, IR:
Information Retrieval, SFT: Supervised Fine-Tuning, D&S: Divergence and Similarity, RL: Reinforcement Learning, RO:
Ranking Optimization.
Finally, Mult...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
uctional
formats with templates, such as prefacing machine transla-
tion data with ”Translate this sentence to Spanish:” . However,
these approaches have limitations. Manual data creation is
labor-intensive, while template-based transformation lacks
diversity in instructions and may not align well with natural
human input. LLMs like GPT-4 offer an efficient alternative
for creating diverse and controlled SFT data by their capabil-
ities of in-context learning and instruction following. Most
rele...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
,
ensuring a broad spectrum of general instructions. Addi-
tionally, a filtering and post-processing stage is introduced
to eliminate redundant or similar instructions. Notably,
through training with this enriched dataset, GPT-3 acquires
the ability to follow instructions, enabling it to perform
comparably to InstructGPT in zero-shot instruction tasks
and when provided with expert-written instructions for
novel tasks. Based on the self-instruct method, Taori et al.
(2023) train an Alpaca model u...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ctions (Xu et al., 2023a; Luo et al.,
2023b,a; Guo et al., 2023c). According to Xu et al. (2023a), in-
struction datasets derived from human-written seeds often
exhibit low to moderate complexity. To enhance the com-
plex instruction-following capabilities of smaller models,
WizardLM (Xu et al., 2023a) introduces Evol-Instruct . This
method gradually transforms instructions into more com-
plex forms through a multi-step evolution process, focusing
on both increasing difficulty levels and expandi...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
nstruction Fusion (Guo et al., 2023c)
further uses teacher LLMs to increase the complexity by
fusing two distinct evolved instructions. Furthermore, this
concept of “evolving” instructions has been extended to
distill specific skills such as coding (Luo et al., 2023a) and
mathematics (Luo et al., 2023b).
Human Instructions. In contrast to works that rely on gener-
ating instructions from ChatGPT, which may lack diversity
and have gaps with real human instructions, Vicuna (Chiang
et al., 2023) an...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
t al., 2023).
System Instructions. To encourage student models to learn
the reasoning process, Orca and Orca 2 (Mukherjee et al.,
2023; Mitra et al., 2023) enhance the prompt, response data
pairs by introducing a system message (e.g., ”explain like
I’m five, think step-by-step”) to encourage student mod-
els to grasp the reasoning process. This system message
prompts GPT-4 to provide explanation traces that eluci-
date the teacher’s reasoning process. Orca 2 (Mitra et al.,
2023) further trains t...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
 by various meta-
information. The UltraLLaMA model, fine-tuned on this
data, consistently surpasses other open-source models. The
Phi series models (Gunasekar et al., 2023; Li et al., 2023a;
Mar, 2023) prioritize data quality and employ synthetic
methods to generate data of “textbook quality” to enhance
the learning experience for smaller models. Notably, Phi
exhibits the ability to follow instructions effectively even
without specific instruction fine-tuning. What’s particularly
remarkable is ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ent-
16
ing vanilla instructions with specialized Expert Identity
descriptions. Reflection-Tuning (Li et al., 2023e) improves
both the instruction and response sequentially by reflecting
on specific criteria. DEITA (Liu et al., 2023h) proposes to
enhance and score instructions in three directions includ-
ing complexity, quality, and diversity to get high-quality
distillation data. MUFFIN (Lou et al., 2023) proposes to
scale the instruction according to the input by diversifying
these tasks with ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ility, like diver-
sity, complexity and explanation. However, student mod-
els trained on instruction data expanded by ChatGPT of-
ten mimic ChatGPT’s style without replicating its factual
accuracy (Gudibande et al., 2023). Achieving a more ca-
pable instruction-following capability requires a stronger
teacher LLM (Gudibande et al., 2023) and access to di-
verse, high-quality instruction data, such as the one used
in Orca (Mukherjee et al., 2023; Mitra et al., 2023), which
incorporates extensive...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
knowl-
edge from teacher LLMs (Chiang et al., 2023; Xu et al., 2023b;
Ding et al., 2023b; Li et al., 2023b; Wang et al., 2023c; Tunstall
et al., 2023).
ShareGPT serves as a platform for users to share their
conversations with ChatGPT, offering a vast repository of
multi-turn conversations readily available. Some small chat
models are trained using this data to acquire the capability
for engaging in multi-turn dialogues (Chiang et al., 2023; Ye
et al., 2023; Wang et al., 2023c). For example, Vicu...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
) enhance the quality of multi-turn
data from ShareGPT by generating self-feedback on model
responses and iteratively refining the responses based on
the received feedback.
3. MT-Bench: a multi-turn question set, where the generations of
models are evaluated by LLM, like GPT-4.To enhance the multi-turn capabilities of student models,
another line of research focuses on expanding conversa-
tional datasets through self-chat and using them to train
smaller models (Xu et al., 2023b; Ding et al., 202...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ogues from ChatGPT. Notably, UltraChat encom-
passes a wide range of topics and instructions. Building
upon the UltraChat dataset, they fine-tune a LLaMA model,
resulting in the creation of a powerful chat model known as
UltraLLaMA. UltraLLaMA consistently outperforms other
open-source chat models, including Vicuna and Baize. Fur-
thermore, UltraChat is employed in conjunction with an
AI preference-aligned chat model named Zephyr (Tunstall
et al., 2023). Zephyr enhances intent alignment through
...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ave been proposed (Kang et al., 2023a; Luo
et al., 2023c; Asai et al., 2023).
SAIL (Luo et al., 2023c) starts by retrieving search results
for each training case using search APIs, creating search-
augmented instructions that include both the instruction
and grounding information. To encourage the language
model to prioritize informative retrieval results, they input
each retrieved passage along with the ground truth response
into the entailment model to label each retrieval result for
relevance...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
ionales serve
as a means to retrieve relevant knowledge d, and the student
LM is subsequently fine-tuned using the rationales along-
side questions and knowledge. However, during inference,
only questions are available. To address this, the Reranker
is trained to mimic how the retriever scores passages with
the rationale by minimizing the KL divergence between
Retriever(d|r) and Reranker(d|x). However, the integra-
tion of a fixed number of passages in language models,
without considering their ...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
n Retrieve . To distill
this critic data, GPT-4 is prompted to assess the need for
retrieval using few-shot demonstrations I, the task input
x, and output y to predict a reflection token r as follows:
p(r|I, x, y).
4.2 Alignment
4.2.1 Thinking Pattern
Most existing methods mainly focus on directly aligning the
direct responses of the student models to the responses of
teacher models (Taori et al., 2023). Though effective, these
models might suffer the problems that they tend to learn to
imitate t...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio


INPUT TEXT:
2022; Madaan et al., 2023; Saunders
et al., 2022), SelFee (Ye et al., 2023) proposes to train a
model that has been fine-tuned to continuously revise its
own answer until it provides a high-quality response in a
single inference. During training, it utilizes both the final
response and feedback chain as the fitting target. This pat-
tern, response with the revision process, shows a promising
performance gain. Following SelFee, Reflection-Tuning (Li
et al., 2023e, 2024d) also utilizes the reflect...

PROCESSED TEXT:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillatio