# Evol Instruct

Demonstrates using Evol Instruct to create harder instructions (questions), starting from a set of simple seed instructions.
Models fine-tuned on datasets that include harder problems may perform better.

In [1]:
#%pip install --upgrade --quiet pydantic-ai-slim[anthropic,openai]

In [1]:
MODEL_ID="gemini-2.0-flash"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
assert os.environ["GEMINI_API_KEY"][:2] == "AI",\
       "Please specify the GEMINI_API_KEY access token in keys.env file"

In [2]:
# Needed in a Jupyter environment. See: https://ai.pydantic.dev/troubleshooting/
import nest_asyncio
nest_asyncio.apply()

### Define a dataclass for Questions and Answers
The class does some minor wrangling to remove numbering from questions

In [31]:
from dataclasses import dataclass
import re

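def strip_numbering(s: str) -> str:
    # Remove a leading "1. ", "Question 1: " or "Answer 1. " style prefix.
    # (?:Question|Answer) is an alternation group; a character class such as
    # [Question|Answer ] would match any run of those letters instead.
    pattern = r"^(?:(?:Question|Answer)\s*)?\d+[.:]\s+"
    return re.sub(pattern, "", s)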

@dataclass
class QuestionAnswer:
    question: str
    answer: str
    
    def __init__(self, question, answer):
        self.question = strip_numbering(question)
        self.answer = strip_numbering(answer)
        
    # Quote the return type: the class name is not yet bound while the class body runs
    def strip_numbering(self) -> "QuestionAnswer":
        return QuestionAnswer(self.question, self.answer)

In [32]:
QuestionAnswer("1. some question", "1. some answer")

QuestionAnswer(question='some question', answer='some answer')

In [33]:
QuestionAnswer("Question 1. some question", "Answer 1. some answer")

QuestionAnswer(question='some question', answer='some answer')

In [34]:
QuestionAnswer("some question", "some answer")

QuestionAnswer(question='some question', answer='some answer')

In [35]:
QuestionAnswer("Question 1: some question", "Answer 1: some answer")

QuestionAnswer(question='some question', answer='some answer')
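
The numbering cases exercised in the cells above can also be captured as assertions. This is a minimal sketch, assuming an explicit alternation group is the intended way to match the optional "Question"/"Answer" word; `strip_prefix` is a hypothetical stand-in name, not part of the notebook.

```python
import re

# Hypothetical helper: optional "Question"/"Answer" word, then a number,
# then "." or ":", then whitespace.
_PREFIX = re.compile(r"^(?:(?:Question|Answer)\s*)?\d+[.:]\s+")

def strip_prefix(s: str) -> str:
    """Remove a leading '1. ', 'Question 1: ' or 'Answer 1. ' style prefix."""
    return _PREFIX.sub("", s)

examples = {
    "1. some question": "some question",
    "Question 1: some question": "some question",
    "Answer 1. some answer": "some answer",
    "some question": "some question",
    # Ordinary words followed by a number must be left alone.
    "stone 3. rock": "stone 3. rock",
}
for raw, expected in examples.items():
    assert strip_prefix(raw) == expected, (raw, strip_prefix(raw))
```

The last case is the one worth keeping in a regression test, since it only passes when the prefix is matched as a whole word.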

## CIK to Symbol lookup

EDGAR filings are keyed by an ID called the CIK. Build a lookup from CIK to stock symbol for the companies we care about.

In [8]:
!wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36" "https://httpbin.io/user-agent" https://www.sec.gov/files/company_tickers.json

--2025-04-15 15:10:48--  https://httpbin.io/user-agent
Resolving httpbin.io (httpbin.io)... 52.70.33.41, 44.211.11.205, 3.224.228.208
Connecting to httpbin.io (httpbin.io)|52.70.33.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 138 [application/json]
Saving to: ‘user-agent.1’


2025-04-15 15:10:48 (287 MB/s) - ‘user-agent.1’ saved [138/138]

--2025-04-15 15:10:48--  https://www.sec.gov/files/company_tickers.json
Resolving www.sec.gov (www.sec.gov)... 23.196.153.179, 2600:1409:3c00:68b::17b2, 2600:1409:3c00:689::17b2
Connecting to www.sec.gov (www.sec.gov)|23.196.153.179|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘company_tickers.json.2’

company_tickers.jso     [ <=>                ] 723.08K  --.-KB/s    in 0.06s   

2025-04-15 15:10:49 (12.1 MB/s) - ‘company_tickers.json.2’ saved [740434]

FINISHED --2025-04-15 15:10:49--
Total wall clock time: 0.7s
Downloaded: 2 files, 723K in 0.06s

In [7]:
import json
with open("company_tickers.json") as ifp:
    lookup = json.load(ifp)
lookup = {item['cik_str']:item['ticker']  for key, item in lookup.items()}

In [8]:
import pandas as pd
lookup_df = pd.DataFrame.from_dict(lookup, orient='index', columns=['symbol'])
lookup_df

Unnamed: 0,symbol
320193,AAPL
789019,MSFT
1045810,NVDA
1018724,AMZN
1652044,GOOG
...,...
1839586,DMXCF
1663712,NRSAX
1762400,LVCE
1506721,BAFBF


In [9]:
lookup_df.to_csv('symbol_lookup.csv', index=True, header=False)

In [10]:
!head -3 symbol_lookup.csv

320193,AAPL
789019,MSFT
1045810,NVDA
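
One thing to keep in mind with a dict comprehension keyed on CIK: if several tickers share a CIK, the last one seen silently wins. The sketch below uses a small made-up sample in the same `cik_str`/`ticker` shape the notebook reads (the sample records and the helper `first_ticker_per_cik` are illustrative assumptions) and shows an alternative that keeps the first ticker instead.

```python
from typing import Dict

# Illustrative sample only; the real file is downloaded from sec.gov above.
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL"},
    "1": {"cik_str": 1652044, "ticker": "GOOGL"},
    "2": {"cik_str": 1652044, "ticker": "GOOG"},
}

def first_ticker_per_cik(records: Dict[str, dict]) -> Dict[int, str]:
    """Keep the first ticker encountered for each CIK (hypothetical helper)."""
    out: Dict[int, str] = {}
    for item in records.values():
        out.setdefault(item["cik_str"], item["ticker"])
    return out

# Dict comprehension, as in the notebook: later entries overwrite earlier ones.
last_wins = {item["cik_str"]: item["ticker"] for item in sample.values()}
first_wins = first_ticker_per_cik(sample)

print(last_wins[1652044], first_wins[1652044])
```

Which policy is right depends on which share class should appear in the prompts; either way the choice is better made explicitly.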


## Initial set of questions

From a single filing, create a set of questions.
These will form the base instructions that we will evolve to make them harder.

### CIK -> Symbol so that we can expand synonyms

In [12]:
import pandas as pd
lookup_df = pd.read_csv('symbol_lookup.csv', names=['cik', 'symbol'])
lookup_df

Unnamed: 0,cik,symbol
0,320193,AAPL
1,789019,MSFT
2,1045810,NVDA
3,1018724,AMZN
4,1652044,GOOG
...,...,...
7624,1839586,DMXCF
7625,1663712,NRSAX
7626,1762400,LVCE
7627,1506721,BAFBF


### Create initial questions.

Take on the role of a business school professor.

In [16]:
from typing import List
from pydantic_ai import Agent

# item_7 is the management discussion
def create_questions(filing, key='item_7', num_questions=3) -> List[str]:
    symbol = lookup_df[lookup_df['cik'] == int(filing['cik'])]['symbol'].values[0]
    system_prompt=f"""
    You are a professor in an MBA program.
    You will be given a passage from an SEC filing from {filing['company']} (symbol: {symbol}) made on {filing['filing_date']}
    Create {num_questions} analytical questions suitable for students of a class on company strategy based on this filing.
    
    Good questions should be:
    * Standalone. For example, make sure the question includes the name of the company, product, and year being referenced.
    * Avoid asking for factual numerical information such as revenue or capital expenditures
    * Ask "how", "why", "compare", etc.
    
    Example question: How might Google's (GOOG) reorganization of its hardware divisions affect its ability to grow Pixel phones' market share in 2023?
    """.strip()
    agent = Agent(MODEL_ID, 
                  result_type=List[str],
                  system_prompt=system_prompt)
    result = agent.run_sync(filing[key])
    questions = result.data
    return [strip_numbering(q) for q in questions]

In [17]:
import json
with open("edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/2969_10K_2021_0000002969-21-000055.json") as ifp:
    filing = json.load(ifp)
    print(filing.keys())

dict_keys(['cik', 'company', 'filing_type', 'filing_date', 'period_of_report', 'sic', 'state_of_inc', 'state_location', 'fiscal_year_end', 'filing_html_index', 'htm_filing_link', 'complete_text_filing_link', 'filename', 'item_1', 'item_1A', 'item_1B', 'item_1C', 'item_2', 'item_3', 'item_4', 'item_5', 'item_6', 'item_7', 'item_7A', 'item_8', 'item_9', 'item_9A', 'item_9B', 'item_9C', 'item_10', 'item_11', 'item_12', 'item_13', 'item_14', 'item_15', 'item_16'])


In [18]:
questions = create_questions(filing)
questions

["Here are 3 analytical questions on AIR PRODUCTS & CHEMICALS INC /DE/ (APD) suitable for an MBA class focusing on company strategy, based on their 2021 SEC filing:\n\n1.  Air Products (APD) is investing heavily in gasification, carbon capture, and hydrogen projects. How might the cyclical nature of the energy market impact the long-term profitability and strategic viability of these capital-intensive projects, particularly given the company's reliance on long-term contracts and customer relationships as of 2021?\n\n2.  Air Products (APD) reorganized its industrial gases segments effective October 1, 2021. How could this reorganization affect APD's ability to respond to regional market differences, and what are the potential benefits and risks of this change in structure regarding operational efficiency and strategic focus?\n\n3.  Air Products (APD) highlights Lu'An Clean Energy Company's reduced contributions to sales and operating income in 2021 and expects this to continue through 2
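
In the output above, the first list element contains a preamble followed by all the numbered questions in one string, so stripping a leading number from each element is not enough. A possible post-processing step is sketched below; `split_numbered` and `normalize` are hypothetical helpers, and the splitting rule (a number plus `.` or `:` at the start of a line) is an assumption about how the model formats its lists.

```python
import re
from typing import List

_ITEM_START = re.compile(r"^\s*\d+[.:]\s+", flags=re.MULTILINE)

def split_numbered(blob: str) -> List[str]:
    """Split 'intro\\n\\n1. q1\\n\\n2. q2' into ['q1', 'q2'] (hypothetical helper)."""
    parts = _ITEM_START.split(blob)
    # parts[0] is whatever precedes the first numbered item (preamble or nothing).
    return [p.strip() for p in parts[1:] if p.strip()]

def normalize(questions: List[str]) -> List[str]:
    """Expand any element that is really a numbered list; keep the rest as-is."""
    out: List[str] = []
    for q in questions:
        items = split_numbered(q)
        out.extend(items if items else [q.strip()])
    return out

raw = ["Here are 2 questions:\n\n1.  How A?\n\n2.  Why B?", "Why C?"]
print(normalize(raw))
```

A question whose body legitimately contains a line starting with a number would be split incorrectly, so this is only a heuristic.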

### Write answers to the questions

Take on the role of a business school student

In [28]:
def write_answers(questions, filing, key='item_7') -> List[QuestionAnswer]:
    system_prompt=f"""
    You are a top student in a highly-ranked MBA program.
    You are given an SEC filing from {filing['company']} made on {filing['filing_date']}
    Use that filing to answer the following questions, but if some information is not in the filing,
    answer based on your general market insights and knowledge of business strategy.
    Do not refuse to answer as that will give you zero points on the exam.
    Each answer should be 2-3 sentences.
    
    FILING:
    {filing[key]}
    """.strip()
    agent = Agent(MODEL_ID, 
                  result_type=List[QuestionAnswer],
                  system_prompt=system_prompt)
    
    prompt = "\n".join([f"Question {idx+1}: {question}" for idx, question in enumerate(questions)])
    result = agent.run_sync(prompt)
    return [qa.strip_numbering() for qa in result.data]

answers = write_answers(questions, filing)
answers

[QuestionAnswer(question="Air Products (APD) is investing heavily in gasification, carbon capture, and hydrogen projects. How might the cyclical nature of the energy market impact the long-term profitability and strategic viability of these capital-intensive projects, particularly given the company's reliance on long-term contracts and customer relationships as of 2021?", answer="The cyclical nature of the energy market could significantly impact the profitability of Air Products' (APD) capital-intensive gasification, carbon capture, and hydrogen projects. While long-term contracts and customer relationships provide some stability, a prolonged downturn in energy prices could reduce demand or pressure contract pricing, diminishing returns on these investments. APD needs to develop flexible contract terms and explore hedging strategies to mitigate these risks."),
 QuestionAnswer(question="Air Products (APD) reorganized its industrial gases segments effective October 1, 2021. How could th

## Prompt Rewriting

Evolve an instruction by rewriting the prompt

In [36]:
def evolve_instructions(questions: List[str], num_extra_questions: int) -> List[str]:
    system_prompts = []

    # Make it deeper
    num_to_generate = num_extra_questions // 3
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_to_generate} harder versions of the given questions so that they require greater skill on the part of the student. 
    Ways in which you can make the question harder:
    * Add constraints based on current market conditions and competitor actions
    * Add hypotheticals such as potential cost overruns or an acquisition failing to take place
    Do not make the question itself more verbose. It should be approximately the same length as the original question.
    
    Harder questions:
    """.strip())
    
    # Make it more concrete
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_to_generate} more concrete versions of the given questions so that they require a greater grasp of the details. 
    Ways in which you can make the question harder:
    * Instead of asking "why", ask for 3 reasons why
    * Instead of asking "how", ask for the steps
    * Ask why a specific outcome is not larger or smaller
    Do not make the question itself more verbose. It should be approximately the same length as the original question.
        
    Harder questions:
    """.strip())    
    
    # Make it require more reasoning
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_extra_questions - 2*num_to_generate} more multi-step reasoning versions of the given questions. 
    Combine two of the questions so that both questions have to be answered implicitly in order to answer the given question.
    The combined question should be no more than 10 words longer than the original question.
        
    Harder questions:
    """.strip()) 
    
    extra_questions = []
    for system_prompt in system_prompts:
        agent = Agent(MODEL_ID, result_type=List[str], system_prompt=system_prompt)
        prompt = "\n".join([f"Question {idx+1}: {question}" for idx, question in enumerate(questions)])
        result = agent.run_sync(prompt)
        extra_questions.extend(result.data)
    
    return [strip_numbering(q) for q in extra_questions]

evolve_instructions(questions, 10) 

["Considering that competitors are also investing in green hydrogen projects and government subsidies for renewable energy are uncertain, how might a combination of increased competition and fluctuating subsidies affect the return on Air Products' (APD) gasification, carbon capture, and hydrogen projects, and what strategic adjustments might be necessary to maintain profitability if a major hydrogen project faces a 20% cost overrun?",
 "If a major competitor were to aggressively discount prices in one of the reorganized industrial gas segments, how might Air Products' (APD) new organizational structure affect its ability to respond, and what specific metrics should be monitored to assess the effectiveness of the reorganization in maintaining market share and profitability under such competitive pressure?",
 "If Lu'An Clean Energy Company's contributions were to drop to zero due to unforeseen regulatory changes in China, how should Air Products (APD) adjust its emerging markets strategy
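
The way `evolve_instructions` divides the requested number of extra questions among its three strategies can be checked in isolation. The sketch below reproduces only that arithmetic; `split_budget` is a hypothetical name, and the counts are what each prompt asks for, not a guarantee of how many questions the model returns.

```python
from typing import Tuple

def split_budget(num_extra_questions: int) -> Tuple[int, int, int]:
    """Return (deeper, more_concrete, multi_step) question counts."""
    per_strategy = num_extra_questions // 3
    # The third strategy absorbs the remainder so the total is preserved.
    multi_step = num_extra_questions - 2 * per_strategy
    return per_strategy, per_strategy, multi_step

print(split_budget(10))  # (3, 3, 4)
print(split_budget(2))   # (0, 0, 2)
```

For budgets below 3, the first two prompts ask for zero questions, so callers may prefer to skip those calls.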

## Put it all together

In [37]:
def generate_question_answers(filename) -> List[QuestionAnswer]:
    import json
    with open(filename) as ifp:
        filing = json.load(ifp)
    questions = create_questions(filing, key='item_7', num_questions=3) # management discussion
    extra_questions = evolve_instructions(questions, num_extra_questions=10)
    questions.extend(extra_questions)
    return write_answers(questions, filing)

In [38]:
generate_question_answers("edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/2488_10K_2022_0000002488-23-000047.json")

[QuestionAnswer(question="In 2022, AMD acquired Xilinx and Pensando. How might these acquisitions enable AMD to better compete with NVIDIA (NVDA) in the data center market, considering NVIDIA's existing strengths in GPUs and networking?", answer="The acquisitions of Xilinx and Pensando provide AMD with adaptable hardware platforms (FPGAs, Adaptive SoCs, ACAPs) and high-performance data processing units (DPUs). These technologies, combined with AMD's existing EPYC server processors and Instinct accelerators, allow AMD to offer a more comprehensive portfolio of solutions for data centers, addressing diverse workloads and customer needs. This broader product suite enables AMD to better compete with NVIDIA's GPUs and networking solutions by offering a more integrated and optimized approach."),
 QuestionAnswer(question="AMD's Client segment revenue decreased by 10% in 2022 due to weak PC market conditions. How might AMD diversify its Client segment product offerings or target specific sub-s

## Create supervised training dataset


In [39]:
import glob
from multiprocessing.pool import ThreadPool

filenames = [filepath for filepath in glob.iglob('edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/*.json')]
print(f"Generating question-answer pairs from {len(filenames)} filings")

def threaded_task(filename: str) -> List[QuestionAnswer]:
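    try:
        return generate_question_answers(filename)
    except Exception as e:
        # Skip filings that fail, but report why instead of hiding the error
        print(f"Skipping {filename}: {e}")
        return []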

def save_to_file(qas: List[QuestionAnswer], filename: str):
    with open(filename, "w") as ofp:
        for qa in qas:
            json.dump({
                "question": qa.question,
                "answer": qa.answer
            }, ofp)
            ofp.write("\n")
    
question_answers = []
with ThreadPool() as tpool:
    completed = 0
    for qas in tpool.imap_unordered(threaded_task, filenames):
        question_answers.extend(qas)
        completed += 1
        if completed%10 == 1:
            print(f"{completed}->{len(question_answers)} examples generated")
            # save progress ...
            save_to_file(question_answers, "generated_qas.json")

Generating question-answer pairs from 1124 filings
1->13 examples generated
11->131 examples generated
21->258 examples generated
31->372 examples generated
41->490 examples generated
51->620 examples generated
61->751 examples generated
71->877 examples generated
81->1004 examples generated
91->1122 examples generated
101->1235 examples generated
111->1361 examples generated
121->1488 examples generated
131->1606 examples generated
141->1731 examples generated
151->1851 examples generated
161->1981 examples generated
171->2111 examples generated
181->2238 examples generated
191->2349 examples generated
201->2468 examples generated
211->2588 examples generated
221->2703 examples generated
231->2809 examples generated
241->2934 examples generated
251->3063 examples generated
261->3179 examples generated
271->3302 examples generated
281->3418 examples generated
291->3545 examples generated
301->3666 examples generated
311->3783 examples generated
321->3908 examples generated
331->4015 ex

In [40]:
# final ...
save_to_file(question_answers, "generated_qas.json")
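
Although the output file has a `.json` extension, `save_to_file` writes one JSON object per line, so `json.load` on the whole file will not work. A minimal sketch for reading it back, assuming that line-per-record layout (`load_jsonl` is a hypothetical helper):

```python
import json
from typing import Dict, List

def load_jsonl(filename: str) -> List[Dict[str, object]]:
    """Read a file containing one JSON object per line."""
    records: List[Dict[str, object]] = []
    with open(filename) as ifp:
        for line in ifp:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl("generated_qas.json")` would return a list of dicts with `question` and `answer` keys.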

## Evaluate questions and answers, and keep only the good ones.

Ideally, you'd also have human review here.

In [44]:
@dataclass
class Score:
    score: int
    explanation: str
    
@dataclass
class ScoredQuestionAnswer:
    question: str
    answer: str
    score: int
    explanation: str
    
def score_qa(qa: QuestionAnswer) -> ScoredQuestionAnswer:
    system_prompt="""
    You are a journalist who interviewed a number of Wall Street analysts of
    large public companies in the United States. I'll give you a question and an answer to that question.
    Now, you need to select the interview questions that will appear in an article on business strategy.
    
    Reply with a score of 1-5 where:
    1 is for questions and answers that will be obvious to your audience, or which are wrong.
    5 is for questions and answers that are genuinely insightful.
    
    Explain your reasoning.
    """
    agent = Agent(MODEL_ID, result_type=Score, system_prompt=system_prompt)
    prompt = f"Question: {qa.question}\nAnswer: {qa.answer}"
    result = agent.run_sync(prompt)
    score = result.data
    return ScoredQuestionAnswer(qa.question, qa.answer, score.score, score.explanation)

for idx in [5, 15, 25]:
    print(score_qa(question_answers[idx]), "\n")

ScoredQuestionAnswer(question='In 2020, how did UPS balance cost control in its U.S. Domestic Package segment versus its International Package segment, given varying labor regulations and infrastructure limitations, and how would a sudden 20% increase in fuel costs have altered these strategies?', answer="In 2020, UPS's cost control strategies differed between the U.S. Domestic Package and International Package segments due to factors like varying labor regulations and infrastructure. The U.S. Domestic Package segment was affected by residential volume growth, increased labor costs, and delivery density challenges. The document does not specify how a sudden 20% increase in fuel costs would have altered these strategies.", score=3, explanation="This question delves into the complexities of UPS's operational strategy, highlighting the challenges posed by different regional factors. The answer acknowledges these differences but admits a lack of information about the impact of a fuel cost 

In [None]:
def threaded_task(qa: QuestionAnswer) -> ScoredQuestionAnswer:
    try:
        return score_qa(qa)
    except Exception:
        # Fall back to a neutral score so one failed call does not stop the run
        return ScoredQuestionAnswer(qa.question, qa.answer, 3, "Failed to score")

def save_to_file(qas: List[ScoredQuestionAnswer], filename: str):
    qas = sorted(qas, key=lambda x: x.score, reverse=True)
    with open(filename, "w") as ofp:
        for qa in qas:
            json.dump({
                "question": qa.question,
                "answer": qa.answer,
                "score": qa.score,
                "explanation": qa.explanation
            }, ofp)
            ofp.write("\n")

scored_qas = []
with ThreadPool() as tpool:
    for qa in tpool.imap_unordered(threaded_task, question_answers):
        scored_qas.append(qa)
        if len(scored_qas)%1000 == 1:
            print(f"{len(scored_qas)} examples scored")
            # save progress ...
            save_to_file(scored_qas, "generated_qas_scored.json")

save_to_file(scored_qas, "generated_qas_scored.json")

1 examples scored
1001 examples scored
2001 examples scored
3001 examples scored
4001 examples scored
5001 examples scored
6001 examples scored
7001 examples scored
8001 examples scored
9001 examples scored
10001 examples scored
11001 examples scored
12001 examples scored
13001 examples scored
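
The scored file still contains every example; keeping only the good ones is a separate filtering step. The sketch below is one way to do it, assuming a cutoff of 4 on the 1-5 scale (the threshold and the helper name `keep_good` are assumptions, not something the notebook specifies) and dropping records that carry the "Failed to score" fallback explanation.

```python
from typing import Dict, List

def keep_good(records: List[Dict[str, object]], min_score: int = 4) -> List[Dict[str, object]]:
    """Keep records scored at or above min_score, excluding scoring failures."""
    return [
        r for r in records
        if int(r["score"]) >= min_score and r["explanation"] != "Failed to score"
    ]

# Illustrative records in the same shape save_to_file writes
scored = [
    {"question": "q1", "answer": "a1", "score": 5, "explanation": "insightful"},
    {"question": "q2", "answer": "a2", "score": 3, "explanation": "Failed to score"},
    {"question": "q3", "answer": "a3", "score": 2, "explanation": "obvious"},
]
print([r["question"] for r in keep_good(scored)])  # ['q1']
```

The explicit check on the explanation matters if the threshold is ever lowered to 3, since failed calls are assigned that score by default.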
