# Evol Instruct

This notebook demonstrates using Evol-Instruct to create hard instructions (questions) starting from a set of simple instructions.
The motivation is that models trained on datasets containing harder problems might perform better.

In [1]:
#%pip install --upgrade --quiet pydantic-ai-slim[anthropic,openai]

In [2]:
GEMINI="gemini-2.0-flash"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
assert os.environ["GEMINI_API_KEY"][:2] == "AI",\
       "Please specify the GEMINI_API_KEY access token in keys.env file"

In [3]:
# Needed in a Jupyter environment. See: https://ai.pydantic.dev/troubleshooting/
import nest_asyncio
nest_asyncio.apply()

### Define a dataclass for Questions and Answers
The class does some minor wrangling to strip leading numbering (such as "1." or "Question 1.") from questions and answers.

In [4]:
from dataclasses import dataclass
import re

@dataclass
class QuestionAnswer:
    question: str
    answer: str
    
    def __init__(self, question, answer):
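        # Strip a leading "1. ", "Question 1. " or "Answer 1. " prefix.
        # A group with alternation is used; square brackets would match single characters instead.
        pattern = r"^(?:(?:Question|Answer)\s+)?\d+\.\s+"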
        self.question = re.sub(pattern, "", question)
        self.answer = re.sub(pattern, "", answer)

In [5]:
QuestionAnswer("1. some question", "1. some answer")

QuestionAnswer(question='some question', answer='some answer')

In [6]:
QuestionAnswer("Question 1. some question", "Answer 1. some answer")

QuestionAnswer(question='some question', answer='some answer')

In [7]:
QuestionAnswer("some question", "some answer")

QuestionAnswer(question='some question', answer='some answer')
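
A note on regular expressions for this kind of prefix stripping: square brackets such as `[Question|Answer ]` define a set of single characters, not a choice between two words. The sketch below contrasts a character-class pattern with a grouped alternation on a few toy strings. The names `CHAR_CLASS`, `ALTERNATION`, and `strip_numbering` are illustrative and not part of the notebook.

```python
import re

# Character class: matches any run of the characters Q, u, e, s, t, i, o, n, |, A, w, r and space.
CHAR_CLASS = r"^[Question|Answer ]*\d+\.\s+"
# Grouped alternation: an optional literal word "Question" or "Answer", then the number.
ALTERNATION = r"^(?:(?:Question|Answer)\s+)?\d+\.\s+"


def strip_numbering(pattern: str, text: str) -> str:
    """Remove a leading numbering prefix from text using the given pattern."""
    return re.sub(pattern, "", text)


samples = ["1. some question", "Question 1. some question", "set 1. of the plan"]
for sample in samples:
    print(repr(sample),
          "->", repr(strip_numbering(CHAR_CLASS, sample)),
          "|", repr(strip_numbering(ALTERNATION, sample)))
```

Both patterns agree on the first two strings. On the third, the character class also consumes the word "set" because all of its letters happen to be in the set, while the alternation leaves the string untouched.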

## CIK to Symbol lookup

EDGAR filings are keyed by a CIK (Central Index Key) identifier. Build a lookup from CIK to stock symbol for the companies we care about.

In [8]:
!wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36" "https://httpbin.io/user-agent" https://www.sec.gov/files/company_tickers.json

--2025-04-15 05:02:03--  https://httpbin.io/user-agent
Resolving httpbin.io (httpbin.io)... 3.224.228.208, 44.211.11.205, 52.70.33.41
Connecting to httpbin.io (httpbin.io)|3.224.228.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 138 [application/json]
Saving to: ‘user-agent.3’


2025-04-15 05:02:03 (147 MB/s) - ‘user-agent.3’ saved [138/138]

--2025-04-15 05:02:03--  https://www.sec.gov/files/company_tickers.json
Resolving www.sec.gov (www.sec.gov)... 23.196.153.179, 2600:1409:3c00:b86::17b2, 2600:1409:3c00:b8b::17b2
Connecting to www.sec.gov (www.sec.gov)|23.196.153.179|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘company_tickers.json.4’

company_tickers.jso     [ <=>                ] 723.08K  --.-KB/s    in 0.04s   

2025-04-15 05:02:04 (19.4 MB/s) - ‘company_tickers.json.4’ saved [740434]

FINISHED --2025-04-15 05:02:04--
Total wall clock time: 0.6s
Downloaded: 2 files, 723K in 0.0

In [9]:
import json
with open("company_tickers.json") as ifp:
    lookup = json.load(ifp)
lookup = {item['cik_str']: item['ticker'] for item in lookup.values()}
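
One property of this dict comprehension is worth keeping in mind: a dict holds one value per key, so if the file lists more than one ticker for the same CIK (different share classes, for instance), only the last one seen is kept. The sketch below uses toy records that assume the same `cik_str` and `ticker` fields the notebook reads; the variable names are hypothetical.

```python
# Toy records shaped like the entries the notebook reads (illustrative, not the real file).
sample = {
    "0": {"cik_str": 1652044, "ticker": "GOOGL"},
    "1": {"cik_str": 1652044, "ticker": "GOOG"},
    "2": {"cik_str": 320193, "ticker": "AAPL"},
}

# Same shape as the notebook's comprehension: later entries overwrite earlier ones.
last_wins = {item["cik_str"]: item["ticker"] for item in sample.values()}

# Alternative (an assumption, not the notebook's approach): keep every ticker per CIK.
all_tickers = {}
for item in sample.values():
    all_tickers.setdefault(item["cik_str"], []).append(item["ticker"])

print(last_wins)
print(all_tickers)
```

Keeping a single symbol is enough for the prompts used here, which only need one ticker to name the company.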

In [10]:
import pandas as pd
lookup_df = pd.DataFrame.from_dict(lookup, orient='index', columns=['symbol'])
lookup_df

Unnamed: 0,symbol
320193,AAPL
789019,MSFT
1045810,NVDA
1018724,AMZN
1652044,GOOG
...,...
1839586,DMXCF
1663712,NRSAX
1762400,LVCE
1506721,BAFBF


In [11]:
lookup_df.to_csv('symbol_lookup.csv', index=True, header=False)

In [12]:
!head -3 symbol_lookup.csv

320193,AAPL
789019,MSFT
1045810,NVDA


## Initial set of questions

From a single filing, create a set of questions.
These will form the base instructions that we will evolve to make them harder.

### CIK -> Symbol so that we can expand synonyms

In [13]:
import pandas as pd
lookup_df = pd.read_csv('symbol_lookup.csv', names=['cik', 'symbol'])
lookup_df

Unnamed: 0,cik,symbol
0,320193,AAPL
1,789019,MSFT
2,1045810,NVDA
3,1018724,AMZN
4,1652044,GOOG
...,...,...
7624,1839586,DMXCF
7625,1663712,NRSAX
7626,1762400,LVCE
7627,1506721,BAFBF


### Create initial questions.

Take on the role of a business school professor.

In [14]:
from typing import List
from pydantic_ai import Agent

def create_questions(filing, key, num_questions=3, model_id=GEMINI) -> List[str]:
    symbol = lookup_df[lookup_df['cik'] == int(filing['cik'])]['symbol'].values[0]
    system_prompt=f"""
    You are a professor in an MBA program.
    You will be given a passage from an SEC filing from {filing['company']} (symbol: {symbol}) made on {filing['filing_date']}
    Create {num_questions} analytical questions suitable for students of a class on company strategy based on this filing.
    
    Good questions should be:
    * Standalone. For example, make sure the question includes the name of the company, product, and year being referenced.
    * Avoid asking for factual numerical information such as revenue or capital expenditures
    * Ask "how", "why", "compare", etc.
    
    Example question: How might Google's (GOOG) reorganization of its hardware divisions affect its ability to grow Pixel phones' market share in 2023?
    """.strip()
    agent = Agent(model_id, 
                  result_type=List[str],
                  system_prompt=system_prompt)
    result = agent.run_sync(filing[key])
    return result.data
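
The symbol lookup at the top of `create_questions` indexes the first match with `.values[0]`, which raises an `IndexError` when a filing's CIK is missing from the lookup table. A minimal sketch of a more forgiving lookup follows. It assumes a plain dict keyed by integer CIK rather than the notebook's DataFrame, and `symbol_for_cik` and `toy_lookup` are hypothetical names.

```python
from typing import Dict, Union

# Toy stand-in for the CIK -> symbol table (illustrative values only).
toy_lookup: Dict[int, str] = {320193: "AAPL", 789019: "MSFT"}


def symbol_for_cik(lookup: Dict[int, str], cik: Union[int, str], default: str = "unknown") -> str:
    """Return the ticker for a CIK, or a default when the CIK is not listed.

    The CIK is converted with int() because the notebook does the same
    before comparing it against the lookup table.
    """
    return lookup.get(int(cik), default)


print(symbol_for_cik(toy_lookup, "320193"))
print(symbol_for_cik(toy_lookup, "42"))
```

Whether a missing symbol should fall back to a placeholder or skip the filing altogether is a design choice; the fallback simply keeps the prompt well-formed.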

In [15]:
import json
with open("edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/2969_10K_2021_0000002969-21-000055.json") as ifp:
    filing = json.load(ifp)
    print(filing.keys())

dict_keys(['cik', 'company', 'filing_type', 'filing_date', 'period_of_report', 'sic', 'state_of_inc', 'state_location', 'fiscal_year_end', 'filing_html_index', 'htm_filing_link', 'complete_text_filing_link', 'filename', 'item_1', 'item_1A', 'item_1B', 'item_1C', 'item_2', 'item_3', 'item_4', 'item_5', 'item_6', 'item_7', 'item_7A', 'item_8', 'item_9', 'item_9A', 'item_9B', 'item_9C', 'item_10', 'item_11', 'item_12', 'item_13', 'item_14', 'item_15', 'item_16'])


In [16]:
questions = create_questions(filing, 'item_7') # management discussion
questions

['Here are 3 analytical questions based on the provided SEC filing from AIR PRODUCTS & CHEMICALS INC /DE/ (APD) made on 2021-11-18:',
 "1. How might Air Products' (APD) strategic focus on gasification, carbon capture, and hydrogen projects, as highlighted in the 2021 summary, influence its competitive positioning relative to peers in the industrial gases market in the long term?",
 "2. Considering the reorganization of Air Products' (APD) industrial gases segments effective October 1, 2021, how could this restructuring impact the company's ability to efficiently allocate resources and respond to regional market dynamics within the Americas, EMEA, and Asia?",
 "3. Given Air Products' (APD) emphasis on passing through higher energy and natural gas costs to customers, as mentioned in the 2021 results, how might this pricing strategy affect customer relationships and demand for its products and services, especially in price-sensitive markets or during periods of economic downturn?"]

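The output above includes a preamble line that is not a question, and each question carries a leading number. A hedged sketch of a post-processing step is shown below. `clean_questions` is a hypothetical helper, and keeping only entries that end in a question mark is a heuristic assumption, not something the notebook does.

```python
import re
from typing import List


def clean_questions(raw: List[str]) -> List[str]:
    """Drop non-question lines and strip leading numbering such as '1. '."""
    cleaned = []
    for line in raw:
        text = re.sub(r"^\s*\d+\.\s+", "", line).strip()
        # Heuristic: treat only entries ending in '?' as questions.
        if text.endswith("?"):
            cleaned.append(text)
    return cleaned


raw_output = [
    "Here are 3 analytical questions based on the filing:",
    "1. How might the reorganization affect growth?",
    "2. Why is pricing passed through to customers?",
]
print(clean_questions(raw_output))
```

Applying a filter like this before evolving the questions would keep the preamble from being treated as a seed instruction.
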
### Write answers to the questions

Take on the role of a business school student

In [17]:
def write_answers(questions, filing, model_id=GEMINI) -> List[QuestionAnswer]:
    system_prompt=f"""
    You are a top student in a highly-ranked MBA program.
    You will be given an SEC filing from {filing['company']} made on {filing['filing_date']}
    Use that filing to answer the following questions.
    Each answer should be 2-3 sentences and be informed by your market insights and knowledge of business strategy.
    
    Do NOT number the questions or answers.
    """.strip()
    agent = Agent(model_id, 
                  result_type=List[QuestionAnswer],
                  system_prompt=system_prompt)
    
    prompt = "\n".join([f"Question {idx+1}: {question}" for idx, question in enumerate(questions)])
    result = agent.run_sync(prompt)
    return result.data

answers = write_answers(questions, filing)
answers

[QuestionAnswer(question="How might Air Products' (APD) strategic focus on gasification, carbon capture, and hydrogen projects, as highlighted in the 2021 summary, influence its competitive positioning relative to peers in the industrial gases market in the long term?", answer="Air Products' strategic focus on gasification, carbon capture, and hydrogen projects will likely strengthen its competitive positioning by aligning with the growing demand for sustainable solutions. This focus allows APD to differentiate itself and capture market share in emerging clean energy sectors, potentially leading to long-term growth and increased profitability compared to competitors with less emphasis on these areas."),
 QuestionAnswer(question="Considering the reorganization of Air Products' (APD) industrial gases segments effective October 1, 2021, how could this restructuring impact the company's ability to efficiently allocate resources and respond to regional market dynamics within the Americas, E

## Prompt Rewriting

Evolve an instruction by rewriting the prompt

In [18]:
def evolve_instructions(questions: List[str], num_extra_questions: int, model_id=GEMINI) -> List[str]:
    system_prompts = []

    # Make it deeper
    num_to_generate = num_extra_questions // 3
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_to_generate} harder versions of the given questions so that they require greater skill on the part of the student. 
    Ways in which you can make the question harder:
    * Add constraints based on current market conditions and competitor actions
    * Add hypotheticals such as potential cost overruns or an acquisition failing to take place
    Do not make the question itself more verbose. It should be approximately the same length as the original question.
    
    Harder questions:
    """.strip())
    
    # Make it more concrete
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_to_generate} more concrete versions of the given questions so that they require greater grasp of the details. 
    Ways in which you can make the question harder:
    * Instead of asking "why", ask for 3 reasons why
    * Instead of asking "how", ask for the steps
    * Ask why a specific outcome is not larger or smaller
    Do not make the question itself more verbose. It should be approximately the same length as the original question.
        
    Harder questions:
    """.strip())    
    
    # Make it require more reasoning
    system_prompts.append(f"""
    You are creating questions for an extremely hard exam for a class on business strategy.
    Your objective is to create {num_extra_questions - 2*num_to_generate} more multi-step reasoning versions of the given questions. 
    Combine two of the questions so that both questions have to be answered implicitly in order to answer the given question.
    The combined question should be no more than 10 words longer than the original question.
        
    Harder questions:
    """.strip()) 
    
    extra_questions = []
    for system_prompt in system_prompts:
        agent = Agent(model_id, result_type=List[str], system_prompt=system_prompt)
        prompt = "\n".join([f"Question {idx+1}: {question}" for idx, question in enumerate(questions)])
        result = agent.run_sync(prompt)
        extra_questions.extend(result.data)
        
    return extra_questions

evolve_instructions(questions, 10) 

["1. Considering current trends of increased environmental regulations and competitor investments in green technologies, how might Air Products' (APD) strategic focus on gasification, carbon capture, and hydrogen projects influence its competitive positioning if a major carbon capture project faces significant cost overruns and delays?",
 "2. If a key acquisition target in Asia falls through, how would the reorganization of Air Products' (APD) industrial gases segments impact the company's ability to efficiently allocate resources and respond to regional market dynamics, particularly in offsetting the anticipated growth from the Asian market?",
 "3. In a scenario where a global recession leads to decreased industrial production and a major competitor initiates a price war, how might Air Products' (APD) strategy of passing through higher energy and natural gas costs to customers affect customer retention and overall demand, specifically among small to medium-sized enterprises with limit

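The three prompts in `evolve_instructions` divide `num_extra_questions` among themselves: the first two each ask for a third (rounded down), and the last one takes whatever is left. The arithmetic can be sketched on its own as follows; `split_counts` is a hypothetical helper that mirrors the expressions in the function.

```python
from typing import Tuple


def split_counts(num_extra_questions: int) -> Tuple[int, int, int]:
    """Mirror how evolve_instructions divides the requested count across its three prompts."""
    per_prompt = num_extra_questions // 3                # "deeper" and "more concrete" prompts
    remainder = num_extra_questions - 2 * per_prompt     # "multi-step reasoning" prompt
    return per_prompt, per_prompt, remainder


for n in (10, 9, 2):
    print(n, "->", split_counts(n))
```

For a request of 10 this gives 3, 3, and 4. For requests below 3, the first two prompts would ask for zero questions, so small values are probably best avoided. The model is also free to return a different number of questions than the prompt asks for.
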
## Put it all together

In [19]:
def generate_question_answers(filename) -> List[QuestionAnswer]:
    import json
    with open(filename) as ifp:
        filing = json.load(ifp)
    questions = create_questions(filing, key='item_7', num_questions=3) # management discussion
    extra_questions = evolve_instructions(questions, num_extra_questions=10)
    questions.extend(extra_questions)
    return write_answers(questions, filing)

In [26]:
generate_question_answers("edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/2488_10K_2022_0000002488-23-000047.json")

[QuestionAnswer(question='In 2022, AMD acquired Xilinx and Pensando. How might AMD leverage these acquisitions to create bundled product offerings or integrated solutions that address a broader range of customer needs, especially in the data center and embedded systems markets?', answer="AMD can leverage the Xilinx and Pensando acquisitions by offering integrated solutions that combine AMD CPUs/GPUs with Xilinx FPGAs and Pensando's data processing units (DPUs). This allows them to provide customized, high-performance solutions for specific workloads like AI inference, network acceleration, and storage processing, targeting data center operators looking for optimized infrastructure. These bundled solutions can provide greater value and performance than individual components, strengthening AMD's position against competitors."),
 QuestionAnswer(question="AMD's Client segment experienced a revenue decrease in 2022 due to a weak PC market. How could AMD diversify its Client segment product 

## Create supervised training dataset


In [24]:
import glob
from multiprocessing.pool import ThreadPool

filenames = glob.glob('edgar-crawler/datasets/EXTRACTED_FILINGS/10-K/*.json')
print(f"Generating question-answer pairs from {len(filenames)} filings")

def threaded_task(filename: str) -> List[QuestionAnswer]:
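    try:
        return generate_question_answers(filename)
    except Exception as exc:
        # Skip filings that fail, but report the error instead of hiding it
        print(f"Skipping {filename}: {exc!r}")
        return []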

def save_to_file(qas: List[QuestionAnswer], filename: str):
    with open(filename, "w") as ofp:
        for qa in qas:
            json.dump({
                "question": qa.question,
                "answer": qa.answer
            }, ofp)
            ofp.write("\n")
    
question_answers = []
with ThreadPool() as tpool:
    completed = 0
    for qas in tpool.imap_unordered(threaded_task, filenames):
        question_answers.extend(qas)
        completed += 1
        if completed%10 == 1:
            print(f"{completed}->{len(question_answers)} examples generated")
            # save progress ...
            save_to_file(question_answers, "generated_qas.json")

Generating question-answer pairs from 1124 filings
1->13 examples generated
11->114 examples generated
21->234 examples generated
31->325 examples generated
41->433 examples generated
51->512 examples generated
61->612 examples generated
71->683 examples generated
81->786 examples generated
91->874 examples generated
101->992 examples generated
111->1114 examples generated
121->1232 examples generated
131->1342 examples generated
141->1457 examples generated
151->1567 examples generated
161->1638 examples generated
171->1729 examples generated
181->1847 examples generated
191->1927 examples generated
201->2042 examples generated
211->2144 examples generated
221->2233 examples generated
231->2342 examples generated
241->2452 examples generated
251->2567 examples generated
261->2667 examples generated
271->2763 examples generated
281->2860 examples generated
291->2940 examples generated
301->3037 examples generated
311->3146 examples generated
321->3227 examples generated
331->3337 examp
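
The loop above relies on `imap_unordered`, which yields results as soon as each worker finishes, with failed tasks contributing an empty list. A minimal sketch of the same pattern that also records which inputs failed is shown below. `toy_task` and the failure rule are stand-ins for the real per-filing work, not part of the notebook.

```python
from multiprocessing.pool import ThreadPool
from typing import List, Optional, Tuple


def toy_task(n: int) -> Tuple[int, List[str], Optional[str]]:
    """Stand-in for per-filing work: returns (input, items, error message or None)."""
    try:
        if n % 3 == 0:
            raise ValueError(f"cannot process {n}")
        return n, [f"item-{n}"], None
    except Exception as exc:
        # Return the error alongside an empty result so the caller can track it.
        return n, [], repr(exc)


results: List[str] = []
failures = {}
with ThreadPool(4) as pool:
    # Results arrive in completion order, not input order.
    for n, items, error in pool.imap_unordered(toy_task, range(1, 7)):
        results.extend(items)
        if error is not None:
            failures[n] = error

print(sorted(results))
print(sorted(failures))
```

Tracking failures this way makes it possible to retry only the inputs that did not produce any examples.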

In [25]:
# final ...
save_to_file(question_answers, "generated_qas.json")
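
Note that `save_to_file` writes one JSON object per line (the JSON Lines layout), so calling `json.load` on the whole file would fail even though the filename ends in `.json`. A minimal sketch of reading such a file back follows. `load_qas` is a hypothetical helper, and the round trip uses a temporary file with toy records rather than the generated dataset.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_qas(filename: str) -> List[Dict[str, str]]:
    """Read a file that holds one JSON object per line."""
    with open(filename) as ifp:
        return [json.loads(line) for line in ifp if line.strip()]


# Round trip on a temporary file, written the same way save_to_file writes it.
records = [
    {"question": "toy question 1", "answer": "toy answer 1"},
    {"question": "toy question 2", "answer": "toy answer 2"},
]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_qas.json")
    with open(path, "w") as ofp:
        for record in records:
            json.dump(record, ofp)
            ofp.write("\n")
    loaded = load_qas(path)

print(loaded == records)
```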