To create a new virtual environment, run one of the following commands in a terminal:

For Python 3 (recommended):
python3 -m venv myenv

Or, if 'python3' doesn't work, try:
python -m venv myenv

Replace 'myenv' with your desired environment name.

After creating the environment, activate it with:
On Windows:
myenv\Scripts\activate
On macOS and Linux:
source myenv/bin/activate

To deactivate the environment when you're done:
deactivate
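
To confirm from inside Python that the environment is active, one option is to compare `sys.prefix` with `sys.base_prefix`; the two differ when the interpreter runs from an environment created by `venv`. A minimal sketch (the helper name `in_virtualenv` is illustrative, not part of any library):

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Running this before and after `activate` should show the value change from False to True.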


In [134]:
from langchain_community.document_loaders import PyPDFLoader
import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
from dotenv import load_dotenv
import tiktoken
from openai import AzureOpenAI
import os

In [138]:
load_dotenv('/Users/netraranga/Desktop/Projects/.env')

True

In [121]:
def process_arxiv_paper(pdf_path):
    # Load the PDF
    loader = PyPDFLoader(pdf_path)
    pages = loader.load_and_split()
    
    # Combine all text into one string
    full_text = ' '.join([page.page_content for page in pages])
    
    # Lowercase the text
    full_text = full_text.lower()
    
    # Find the index of the reference marker
    ref_marker = 'references\n[1'
    ref_index = full_text.find(ref_marker)
    
    if ref_index != -1:
        # Remove all text after the reference marker
        full_text = full_text[:ref_index]
    
    # Remove extra spaces
    full_text = ' '.join(full_text.split())
    
    return full_text

# Example usage
#arxiv_url = "https://arxiv.org/pdf/2402.06196"
#arxiv_url = 'https://arxiv.org/pdf/2302.07459'
#arxiv_url = "https://arxiv.org/pdf/2204.05862"
#arxiv_url = "https://arxiv.org/pdf/2303.08774"
arxiv_url = 'https://arxiv.org/pdf/2402.13116'
processed_content = process_arxiv_paper(arxiv_url)
print(f"Processed content length: {len(processed_content.split())} words")
print("First 1500 characters:")
print(processed_content[:1500])


Processed content length: 42282 words
First 1500 characters:
1 a survey on knowledge distillation of large language models xiaohan xu1, ming li2, chongyang tao3, tao shen4, reynold cheng1, jinyang li1, can xu5, dacheng tao6, tianyi zhou2 1the university of hong kong2university of maryland3microsoft 4university of technology sydney5peking university6the university of sydney {shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu ckcheng@cs.hku.hk jl0725@connect.hku.hk abstract —in the era of large language models (llms), knowledge distillation (kd) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary llms, such as gpt -4, to their open-source counterparts like llama and mistral. additionally, as open-source llms flourish, kd plays a crucial role in both compressing these models, and facilitating their self- improvement by employing themselves as teachers. this paper presents a comprehensive survey of kd’s role within the realm of
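
The reference-trimming step in `process_arxiv_paper` can be tried without downloading a PDF. The sketch below applies the same marker search to a made-up string; `strip_references` and the sample text are illustrative only, and the default marker assumes a numbered bibliography that starts with `[1`.

```python
def strip_references(text: str, marker: str = "references\n[1") -> str:
    """Lowercase `text`, cut everything from the reference marker on, collapse whitespace."""
    text = text.lower()
    ref_index = text.find(marker)
    if ref_index != -1:
        # Keep only the text that precedes the bibliography.
        text = text[:ref_index]
    return " ".join(text.split())


# Made-up input that imitates the layout of an extracted paper.
sample = "1 Introduction\nSome body   text.\nReferences\n[1] A. Author. A title. 2024."
print(strip_references(sample))  # 1 introduction some body text.
```

A paper whose bibliography is unnumbered, or whose heading is not followed directly by `[1`, passes through untrimmed, so the word count is worth checking after this step.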

In [122]:
azure_oai = dspy.AzureOpenAI(
    api_base='https://fsodnaopenai2.openai.azure.com/',
    api_version='2023-05-15',
    model='gpt-4o',
    max_tokens=4000,
)
dspy.configure(lm=azure_oai)

In [125]:
class SectionHeaderExtraction(dspy.Signature):
    """Extract section headers from a research paper"""

    input_text = dspy.InputField(desc="The full text of a research paper")
    section_headers = dspy.OutputField(desc="A list of only main section headers found in the paper, do not include subheaders. Main headers typically lead with an integer and do not contain decimals. Return the first sentence after each main section header in the following format: Number. header: first sentence")

In [126]:
class SectionHeaderExtractor(dspy.Module):
    """Custom module; it must be initialized with a prompting method (here, ChainOfThought)."""
    def __init__(self):
        super().__init__()
        self.extractor = dspy.ChainOfThought(SectionHeaderExtraction)

    def forward(self, input_text):
        """The parameter name matches the input field of the Signature."""
        result = self.extractor(input_text=input_text)
        return result

# Example usage
extractor = SectionHeaderExtractor()
section_results = extractor(input_text=processed_content)

print(section_results.rationale)
print("Section Headers Output:")
print(section_results.section_headers)

produce the section_headers. We will start by identifying the main section headers in the text. These headers typically lead with an integer and do not contain decimals. We will then extract the first sentence following each main section header.
Section Headers Output:
1. introduction: in the evolving landscape of artificial intelligence (ai), proprietary large language models (llms) such as gpt-3.5 (ouyang et al., 2022), gpt-4 (openai et al., 2023), gemini (team et al., 2023) and claude have emerged as groundbreaking technologies, reshaping our understanding of natural language processing (nlp).
2. overview: the concept of knowledge distillation in the field of ai and deep learning (dl) refers to the process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (gou et al., 2021).
3. knowledge distillation algorithms: this section navigates through the process of knowledge distillation.
4. skill distillation: building upon the fou

In [127]:

def parse_section_headers(section_results):
    section_headers = {}
    for header in section_results.section_headers.split('\n'):
        if ':' in header:
            key, value = header.split(':', 1)
            section_headers[key.strip().lower()] = value.strip().lower()
    return section_headers

section_headers = parse_section_headers(section_results)

section_headers.items()


dict_items([('1. introduction', 'in the evolving landscape of artificial intelligence (ai), proprietary large language models (llms) such as gpt-3.5 (ouyang et al., 2022), gpt-4 (openai et al., 2023), gemini (team et al., 2023) and claude have emerged as groundbreaking technologies, reshaping our understanding of natural language processing (nlp).'), ('2. overview', 'the concept of knowledge distillation in the field of ai and deep learning (dl) refers to the process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (gou et al., 2021).'), ('3. knowledge distillation algorithms', 'this section navigates through the process of knowledge distillation.'), ('4. skill distillation', 'building upon the foundation laid out in section 3 about eliciting knowledge and distillation algorithms, we shift our focus to how these techniques facilitate the distillation of specific skills in llms.'), ('5. domain-specified vertical distillation', 

In [128]:
section_headers

{'1. introduction': 'in the evolving landscape of artificial intelligence (ai), proprietary large language models (llms) such as gpt-3.5 (ouyang et al., 2022), gpt-4 (openai et al., 2023), gemini (team et al., 2023) and claude have emerged as groundbreaking technologies, reshaping our understanding of natural language processing (nlp).',
 '2. overview': 'the concept of knowledge distillation in the field of ai and deep learning (dl) refers to the process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (gou et al., 2021).',
 '3. knowledge distillation algorithms': 'this section navigates through the process of knowledge distillation.',
 '4. skill distillation': 'building upon the foundation laid out in section 3 about eliciting knowledge and distillation algorithms, we shift our focus to how these techniques facilitate the distillation of specific skills in llms.',
 '5. domain-specified vertical distillation': 'this section sh
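
Because the headers come back in the `Number. header: first sentence` format, the section number can also be captured as an integer. That allows numeric ordering, which matters once a paper has ten or more sections: as strings, `'10. ...'` sorts before `'9. ...'`. A minimal sketch, where `HEADER_LINE`, `parse_numbered_headers`, and the sample output are hypothetical names and data:

```python
import re

# Matches lines shaped like "3. knowledge distillation algorithms: first sentence".
HEADER_LINE = re.compile(r"^\s*(\d+)\.\s*([^:]+):\s*(.+)$")


def parse_numbered_headers(raw: str) -> list[tuple[int, str, str]]:
    """Return (number, header, first_sentence) tuples sorted by section number."""
    rows = []
    for line in raw.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            number, header, sentence = match.groups()
            rows.append((int(number), header.strip().lower(), sentence.strip().lower()))
    # Tuples sort by their first element, so this orders sections numerically.
    return sorted(rows)


# Made-up model output with sections deliberately out of order.
raw_output = "10. conclusion: we close the survey.\n2. overview: kd transfers knowledge.\nnot a header line"
for row in parse_numbered_headers(raw_output):
    print(row)
```

Lines that do not start with a number, such as the last one in the sample, are skipped rather than raising an error.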

In [129]:
def find_sentence(text, sentence, start=0):
    """Return the index of `sentence` in `text` at or after `start`, or -1.

    If the full sentence is not found, fall back to the first run of
    6 consecutive words from the sentence that does occur in the text.
    """
    index = text.find(sentence, start)
    if index != -1:
        return index
    words = sentence.split()
    for j in range(len(words) - 5):
        index = text.find(' '.join(words[j:j+6]), start)
        if index != -1:
            return index
    return -1


def extract_text_chunks(output_text, section_headers):
    # Convert the text to lowercase for case-insensitive matching
    output_text_lower = output_text.lower()

    # Dicts preserve insertion order, so the next section is the next item.
    # (Comparing header strings with '>' breaks from section 10 onwards,
    # because '10. ...' sorts before '9. ...'.)
    items = list(section_headers.items())
    chunks = {}

    for i, (header, content) in enumerate(items):
        start_index = find_sentence(output_text_lower, content.lower())
        if start_index == -1:
            print(f"Warning: Couldn't find the content for header: {header}")
            continue

        if i + 1 < len(items):
            # Search for the next section only after the start of this one
            next_content = items[i + 1][1].lower()
            end_index = find_sentence(output_text_lower, next_content, start_index + 1)
        else:
            end_index = len(output_text)

        # Extract the chunk
        if end_index != -1:
            chunks[header] = output_text[start_index:end_index].strip()

    return chunks

In [130]:
chunks = extract_text_chunks(processed_content, section_headers)
chunks
# print("Extracted chunks:")
# for i, (header, chunk) in enumerate(chunks.items(), 1):
#     print(f"\nChunk {i} ({header}):")
#     print(chunk[:100] + "...")

{'1. introduction': 'in the evolving landscape of artificial intelligence (ai), proprietary1large language models (llms) such as gpt- 3.5 (ouyang et al., 2022), gpt-4 (openai et al., 2023), gemini (team et al., 2023) and claude2have emerged as groundbreaking technologies, reshaping our understand- ing of natural language processing (nlp). these models, characterized by their vast scale and complexity, have un- locked new realms of possibility, from generating human- like text to offering sophisticated problem-solving capa- bilities. the core significance of these llms lies in their emergent abilities (wei et al., 2022a,b; xu et al., 2024a), a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. their deep understanding of context, nuance, and the intrica- cies of human language enables them to excel in a wide array of applications, from creative content generation to 1. 
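
The chunk above shows why an exact sentence match is not always enough: PDF extraction fuses footnote markers into words (`proprietary1large`) and leaves hyphenation breaks (`understand- ing`), so the sentence returned by the model may not appear verbatim in the text. The sketch below reproduces the six-word fallback on a made-up string; `find_with_fallback` is an illustrative name, not the notebook's function.

```python
def find_with_fallback(text: str, sentence: str, window: int = 6) -> int:
    """Return the index of `sentence` in `text`, else of its first matching
    run of `window` consecutive words, else -1."""
    index = text.find(sentence)
    if index != -1:
        return index
    words = sentence.split()
    for j in range(len(words) - window + 1):
        index = text.find(" ".join(words[j:j + window]))
        if index != -1:
            return index
    return -1


# Made-up text imitating extraction noise: a footnote marker fused into a word.
text = "abstract text. in the evolving landscape of artificial intelligence (ai), proprietary1large language models"
sentence = "in the evolving landscape of artificial intelligence (ai), proprietary large language models"

print(text.find(sentence))                 # -1: the exact sentence is not present
print(find_with_fallback(text, sentence))  # index of the first six-word window that matches
```

One caveat: when the match comes from a window other than the first, the returned index points partway into the sentence, so the chunk boundary can be off by a few words.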

In [132]:
processed_content

'1 a survey on knowledge distillation of large language models xiaohan xu1, ming li2, chongyang tao3, tao shen4, reynold cheng1, jinyang li1, can xu5, dacheng tao6, tianyi zhou2 1the university of hong kong2university of maryland3microsoft 4university of technology sydney5peking university6the university of sydney {shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu ckcheng@cs.hku.hk jl0725@connect.hku.hk abstract —in the era of large language models (llms), knowledge distillation (kd) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary llms, such as gpt -4, to their open-source counterparts like llama and mistral. additionally, as open-source llms flourish, kd plays a crucial role in both compressing these models, and facilitating their self- improvement by employing themselves as teachers. this paper presents a comprehensive survey of kd’s role within the realm of llm, highlighting its critical function in imparting advan

In [139]:
def create_client_object():
    client_val = AzureOpenAI(
        api_key=os.getenv('AZURE_OPENAI_KEY'),
        api_version='2023-12-01-preview',
        azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
    )
    return client_val

In [144]:
def generate_completion(client, model, messages):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.3,
        max_tokens=4096
    )
    return response.choices[0].message.content

In [170]:
import json
with open('main_summaries.json', 'r') as f:
    file = json.load(f)

In [176]:
file['technical']['user_prompt'].format(processed_content=processed_content)

'Create an engaging, technical summary of the following AI research paper tailored for ML engineers and AI professionals. Your summary should:\n\nOpening:\n\nBegin with a powerful statement that captures the paper\'s most innovative or impactful idea.\nFollow with a brief explanation of the paper\'s main contribution to the field of AI/ML and why it matters.\nUse language that would intrigue technical practitioners and immediately convey the significance of the research.\n\n\nKey Takeaways:\nPresent 2-3 key takeaways. For each:\n\nExplain the main idea in clear terms, avoiding excessive technical jargon.\nMention its potential impact on current ML practices or future research.\nEnsure clear linkages between concepts, showing how ideas connect or build upon each other.\nInclude enough detail to convey the importance and novelty of each takeaway, while maintaining overall brevity.\n\nImpactful Quote:\nInclude one short, impactful quote from the paper. This should be the only instance of 

In [179]:
t_message = [{"role": "system", "content": file['technical']['system_prompt']}, 
            {"role": "user", "content": file['technical']['user_prompt'].format(processed_content=processed_content)}]

In [180]:
t_message

[{'role': 'system',
  'content': 'You are an AI assistant skilled at distilling complex AI research papers into compelling, holistic summaries for data scientists, software engineers, and machine learning engineers. Your summaries are known for capturing the essence of groundbreaking research and sparking curiosity in technical minds.'},
 {'role': 'user',
  'content': 'Create an engaging, technical summary of the following AI research paper tailored for ML engineers and AI professionals. Your summary should:\n\nOpening:\n\nBegin with a powerful statement that captures the paper\'s most innovative or impactful idea.\nFollow with a brief explanation of the paper\'s main contribution to the field of AI/ML and why it matters.\nUse language that would intrigue technical practitioners and immediately convey the significance of the research.\n\n\nKey Takeaways:\nPresent 2-3 key takeaways. For each:\n\nExplain the main idea in clear terms, avoiding excessive technical jargon.\nMention its pote
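
Building the message list from the prompt file can be wrapped in a small helper so that each audience is handled the same way. This is a sketch under one assumption: that every top-level key in `main_summaries.json` holds a `system_prompt` and a `user_prompt` containing a `{processed_content}` placeholder, as the `technical` entry does. The helper name `build_messages` and the sample dictionary are hypothetical.

```python
def build_messages(prompts: dict, audience: str, processed_content: str) -> list[dict]:
    """Build the system/user message pair for one audience from a prompt dictionary."""
    entry = prompts[audience]
    return [
        {"role": "system", "content": entry["system_prompt"]},
        # str.format only interprets braces in the template, so braces that
        # occur inside the paper text itself are inserted unchanged.
        {"role": "user", "content": entry["user_prompt"].format(processed_content=processed_content)},
    ]


# Hypothetical stand-in for the contents of main_summaries.json.
sample_prompts = {
    "technical": {
        "system_prompt": "You summarize AI papers for engineers.",
        "user_prompt": "Summarize the following paper:\n{processed_content}",
    }
}

demo_messages = build_messages(sample_prompts, "technical", "a paper mentioning {curly} braces")
print(demo_messages[1]["content"])
```

Literal braces in the template itself would still need to be doubled (`{{` and `}}`), since `str.format` treats single braces there as placeholders.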

In [150]:
messages = [{"role": "system", "content": "You are an AI assistant skilled at translating complex AI research into clear, exciting summaries for business leaders who are curious about AI but may not have deep technical knowledge. Your summaries are known for their ability to convey the business potential of cutting-edge AI research in simple, relatable terms."}, 
            {"role": "user", "content": f"""Create an engaging, business-friendly summary of the following academic paper on AI technology. Your summary should:

Start with a brief, exciting explanation (2-3 sentences) of the paper's core concept and why it matters to businesses. Use simple language and avoid technical jargon.
Present 2-3 key takeaways as short paragraphs. For each:

Explain the idea in simple terms, as if you're talking to someone with no AI background.
Provide a concrete, relatable example of how this could impact or improve a common business operation.
Briefly mention how this overcomes an existing business challenge.


Include one short, impactful quote from the paper. Explain its significance for businesses in plain language.
Throughout the summary, subtly connect these ideas to current business trends or concerns.
Conclude with 1-2 sentences that excite the reader about the near-future potential of this technology for businesses.

Aim for a tone that's enthusiastic and accessible. Focus on the practical business applications and benefits, avoiding deep technical explanations. Your goal is to leave business leaders thinking, "This could really help my company - I want to learn more!"
Keep the entire summary under 300 words to maintain engagement.
Now, here's the full paper:
{processed_content}
"""}]

In [165]:
messages_2 = [{"role": "system", "content": "You are an AI assistant skilled at distilling complex AI research papers into compelling, holistic summaries for data scientists, software engineers, and machine learning engineers. Your summaries are known for capturing the essence of groundbreaking research and sparking curiosity in technical minds."}, 
            {"role": "user", "content": f"""Create an engaging, technical summary of the following AI research paper tailored for ML engineers and AI professionals. Your summary should:

Opening:

Begin with a powerful statement that captures the paper's most innovative or impactful idea.
Follow with a brief explanation of the paper's main contribution to the field of AI/ML and why it matters.
Use language that would intrigue technical practitioners and immediately convey the significance of the research.


Key Takeaways:
Present 2-3 key takeaways. For each:

Explain the main idea in clear terms, avoiding excessive technical jargon.
Mention its potential impact on current ML practices or future research.
Ensure clear linkages between concepts, showing how ideas connect or build upon each other.
Include enough detail to convey the importance and novelty of each takeaway, while maintaining overall brevity.

Impactful Quote:
Include one short, impactful quote from the paper. This should be the only instance of using the paper's exact words. Explain its significance succinctly, focusing on its potential implications for AI/ML development.
Conclusion:
Highlight the potential future impact of this research and why it's exciting for the field, drawing on the ideas presented in the takeaways.

Throughout the summary:

Use your own words to convey the paper's ideas. Do not copy exact sentences or substantial phrases from the paper, except for the single quote in section 3.
Aim for a tone that balances technical insight with accessibility and enthusiasm.
Focus on sparking curiosity and conveying the potential importance of the research.
Ensure clear connections between different concepts and ideas presented.
                         
Aim for a tone that balances technical precision with enthusiasm for the research's potential. Focus on aspects that would most interest ML practitioners and researchers, always ensuring clear explanations of how concepts interact or build upon each other. Your goal is to leave technical readers thinking, "This approach could significantly advance our current methods - I need to explore the full paper."
Keep the entire summary under 300 words to maintain engagement, remembering that detailed section-by-section summaries will follow.
Now, here's the full paper:
{processed_content}
"""}]

In [163]:
client_val = create_client_object()
test_b_summary = generate_completion(client_val, "gpt-4o", messages)
print(test_b_summary)

**Unlocking the Power of AI: How Knowledge Distillation Can Transform Your Business**

Imagine having the advanced capabilities of cutting-edge AI models like GPT-4 at your fingertips, but without the hefty price tag or the need for extensive computational resources. This is the promise of Knowledge Distillation (KD), a process that transfers the intelligence of these powerful models into more accessible, cost-effective versions. For businesses, this means harnessing top-tier AI performance to drive innovation, efficiency, and competitive advantage.

**Key Takeaways:**

1. **Enhanced AI Capabilities at Lower Costs:**
   KD allows smaller, open-source models to learn from the best proprietary models. Think of it as a master-apprentice relationship where the apprentice (smaller model) learns the tricks of the trade from the master (larger model). For instance, a small retail business could use a distilled model to implement sophisticated customer service chatbots that provide personalize

In [166]:
test_b1_summary = generate_completion(client_val, "gpt-4o", messages_2)
print(test_b1_summary)

**Opening:**

In the rapidly evolving landscape of AI, knowledge distillation (KD) has emerged as a pivotal technique for transferring the advanced capabilities of proprietary large language models (LLMs) like GPT-4 to more accessible, open-source counterparts such as LLaMA and Mistral. This paper presents a comprehensive survey of KD's role in enhancing the performance of LLMs, focusing on its applications in model compression, self-improvement, and skill enhancement. By leveraging data augmentation (DA) to generate context-rich training data, KD transcends traditional boundaries, enabling open-source models to approximate the nuanced understanding and ethical alignment of their proprietary counterparts.

**Key Takeaways:**

1. **Algorithmic Foundations of KD:**
   - **Main Idea:** The paper categorizes KD mechanisms into algorithmic strategies, skill enhancement, and verticalization, providing a structured overview of how KD can be systematically applied to LLMs.
   - **Impact:** Thi