This notebook constructs a dataset of Microsoft's directors and principal officers over time. If you already have the CSV, you can skip the data wrangling and head straight to the DatasetBuilder code.

In [None]:
# First, let's extract every Item 5.02 (Departure of Directors or Principal Officers; Election of Directors; Appointment of Principal Officers) from Microsoft's 8-K disclosures.

from datamule import Portfolio

portfolio = Portfolio('data/msft_8k')
# Note: This is slow because I'm at a hotel with slow internet.
portfolio.download_submissions(submission_type=['8-K'],ticker='MSFT', provider='sec')

In [None]:
# Now we extract item 5.02 into a new csv. We will have three columns, accession_number, filing_date, and text.
# Note: this workflow may change in the future, as I update the datamule parser to parse more documents.
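def extract_item_5_02(submission):
    """Return a row dict with the submission's item 5.02 text, or None if unavailable."""
    try:
        submission_metadata = submission.metadata['submission']
        filing_date = submission_metadata['FILED AS OF DATE']
        accession_number = submission_metadata['ACCESSION NUMBER']
        for document in submission.document_type('8-K'):
            document.parse()
            # Only the first 8-K document is used. A KeyError here means the
            # filing has no item 5.02, which the except clause turns into None.
            return {
                'accession_number': accession_number,
                'filing_date': filing_date,
                'text': document.data['document']['item502'],
            }
    except Exception:
        # Skip submissions with missing metadata, parse failures, or no item 5.02.
        return None

    # The submission contained no 8-K document.
    return None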

rows = portfolio.process_submissions(extract_item_5_02)
rows = [row for row in rows if row]

# we get 49 rows as of 1/14/25
print(len(rows))

In [31]:
# save to csv
import pandas as pd

df = pd.DataFrame(rows)
df.to_csv('data/msft_8k_item_5_02.csv', index=False)
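
Since `accession_number` later serves as the unique index for the dataset builder, it may be worth checking the rows for duplicate keys or empty text before relying on the CSV. The following is only a minimal sketch using the standard library: the helper `check_rows` and the sample rows are hypothetical, not part of datamule or txt2dataset.

```python
from collections import Counter


def check_rows(rows, key="accession_number", text_key="text"):
    """Return (duplicated keys, keys whose text is empty) for a list of row dicts."""
    counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    # Treat None, "", and whitespace-only text as empty.
    empty = [row[key] for row in rows if not str(row.get(text_key) or "").strip()]
    return duplicates, empty


# Hypothetical sample rows for illustration only, not real filings.
sample_rows = [
    {"accession_number": "A-1", "filing_date": "20130508", "text": "Item 5.02 ..."},
    {"accession_number": "A-2", "filing_date": "20160630", "text": "  "},
    {"accession_number": "A-1", "filing_date": "20130508", "text": "Item 5.02 ..."},
]

duplicates, empty = check_rows(sample_rows)
print("duplicate keys:", duplicates)
print("empty text:", empty)
```

In the notebook itself, the same helper could be called on `rows` before the DataFrame is built.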

In [1]:
# Now, we build the dataset
# Note: You can skip previous steps if you have the csv file already.
from txt2dataset import DatasetBuilder
import os


builder = DatasetBuilder()

builder.set_api_key(os.environ["GEMINI_API_KEY"])

# set base prompt, i.e. what the model looks for
base_prompt = """Extract officer changes and movements to JSON format.
    Track when officers join, leave, or change roles.
    Provide the following information:
    - date (YYYYMMDD)
    - name (First Middle Last)
    - title
    - action (one of: ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"])
    Return an empty dict if info unavailable."""

response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "date": {"type": "STRING", "description": "Date of action in YYYYMMDD format"},
            "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
            "title": {"type": "STRING", "description": "Official title/position"},
            "action": {
                "type": "STRING", 
                "enum": ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"],
                "description": "Type of personnel action"
            }
        },
        "required": ["date", "name", "title", "action"]
    }
}

builder.set_rpm(1500)
builder.set_model('gemini-1.5-flash-8b')


<txt2dataset.dataset_builder.DatasetBuilder at 0x1cec9511b50>
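
The response schema above constrains each record to four required fields, a fixed set of actions, and a YYYYMMDD date. As a rough illustration of what those constraints mean for a single record, here is a minimal standard-library sketch. The function `record_problems` is hypothetical and is not how txt2dataset or Gemini enforce the schema; it only mirrors the rules written in the schema.

```python
from datetime import datetime

# Values copied from the response schema in this notebook.
ALLOWED_ACTIONS = {"HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"}
REQUIRED_FIELDS = ("date", "name", "title", "action")


def record_problems(record):
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")

    # Dates may come back as ints after a CSV round trip, so normalise to str.
    date = str(record.get("date") or "")
    if date:
        try:
            if len(date) != 8 or not date.isdigit():
                raise ValueError(date)
            datetime.strptime(date, "%Y%m%d")
        except ValueError:
            problems.append(f"bad date: {date!r}")

    action = record.get("action")
    if action and action not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {action!r}")
    return problems


good = {"date": 20160630, "name": "Kevin Turner",
        "title": "Chief Operating Officer", "action": "RESIGNED"}
bad = {"date": "20131340", "name": "Kevin Turner",
       "title": "Chief Operating Officer", "action": "FIRED"}

print(record_problems(good))
print(record_problems(bad))
```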

In [2]:
# Build the data.
# index_column is the unique identifier; if none is specified, the row index is used.
builder.build(base_prompt=base_prompt,
              response_schema=response_schema,
              text_column='text',
              index_column='accession_number',
              input_path='data/msft_8k_item_5_02.csv',
              output_path='data/msft_officers.csv')

Loading data...
Total entries in dataset: 49
Already processed: 0
New entries to process: 49


Processed 49/49 | 688 RPM | Mem: 154MB: 100%|██████████| 49/49 [00:04<00:00, 11.52it/s]


Processing complete:
Total processed in this run: 49
Average speed: 687 RPM
Failed entries: 0





In [3]:
builder.standardize(response_schema=response_schema,
                    input_path='data/msft_officers.csv',
                    output_path='data/msft_officers_standardized.csv',
                    columns=['name'])

Loading data...
Standardized 33 unique values in name
Saved standardized data to data/msft_officers_standardized.csv


In [4]:
# Standardize titles next, overwriting the file written by the previous step.
builder.standardize(response_schema=response_schema,
                    input_path='data/msft_officers_standardized.csv',
                    output_path='data/msft_officers_standardized.csv',
                    columns=['title'])

Loading data...
Standardized 27 unique values in title
Saved standardized data to data/msft_officers_standardized.csv


In [5]:
results = builder.validate(input_path='data/msft_8k_item_5_02.csv',
                           output_path='data/msft_officers_standardized.csv',
                           text_column='text',
                           index_column='accession_number',
                           base_prompt=base_prompt,
                           response_schema=response_schema,
                           n=5,
                           quiet=False)

Validation complete: 5 correct out of 5 total


In [6]:
results

[{'input_text': "Item 5.02. Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers. On May 8, 2013, the Company announced the appointment of Amy Hood, age 41, to serve as chief financial officer. As chief financial officer, Ms. Hood is responsible for leading Microsoft's worldwide finance organization, including acquisitions, corporate strategy, treasury activities, tax planning, accounting and reporting, and internal audit and investor relations. Beginning in 2010, Ms. Hood was chief financial officer of the Microsoft Business Division. From 2006 through 2009, Ms. Hood was General Manager, Microsoft Business Division Strategy. Since joining Microsoft in 2002, Ms. Hood has also held positions in the Server and Tools Business and the corporate finance organization.",
  'process_output': [{'date': 20130508,
    'name': 'Hood, Amy',
    'title': 'Chief Financial Officer',
    'action': 'HIRED'}],
  

In [5]:
def print_validation_results(results):
    """
    Print validation results in a formatted, readable way
    
    Args:
        results (list): List of dictionaries containing validation results
    """
    for i, result in enumerate(results, 1):
        print(f"\n{'='*80}")
        print(f"Result {i}")
        print(f"{'='*80}")
        
        # Print input text
        print("\nInput Text:")
        print("-" * 40)
        print(result['input_text'])
        
        # Print processed outputs
        print("\nProcessed Output:")
        print("-" * 40)
        for output in result['process_output']:
            print(f"Date: {output['date']}")
            print(f"Name: {output['name']}")
            print(f"Title: {output['title']}")
            print(f"Action: {output['action']}")
            print("-" * 20)
        
        # Print validation status
        print("\nValidation Status:")
        print("-" * 40)
        print(f"Valid: {result['is_valid']}")
        if 'reason' in result:
            print(f"Reason: {result['reason']}")
        print()

print_validation_results(results)


Result 1

Input Text:
----------------------------------------
Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers (b) On June 30, 2016, Kevin Turner provided notice he was resigning his position as Chief Operating Officer of Microsoft.

Processed Output:
----------------------------------------
Date: 20160630
Name: Kevin Turner
Title: Chief Operating Officer
Action: RESIGNED
--------------------

Validation Status:
----------------------------------------
Valid: True
Reason: The generated JSON is valid and follows the schema.  It correctly extracts the date, name, title, and action from the provided text and maps them to the expected format.  No important details are missed or misrepresented.


Result 2

Input Text:
----------------------------------------
Item 5.02. Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arr

Results are pretty good for a simple response schema: all 5 sampled entries passed validation. Quick reminder: this is in early development, and it was created by stapling a bunch of LLM stuff together. It will need work, and should get better over the next few months.
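
To view the extracted records as officers "over time", one option is to group them by name and sort each person's events by date. The sketch below is only an illustration using the standard library. The helper `build_timeline` is hypothetical, and the two records are the ones visible in the validation output above, with names as they appear there.

```python
from collections import defaultdict
from datetime import datetime

# The two records shown in the validation output of this notebook.
records = [
    {"date": 20160630, "name": "Kevin Turner",
     "title": "Chief Operating Officer", "action": "RESIGNED"},
    {"date": 20130508, "name": "Hood, Amy",
     "title": "Chief Financial Officer", "action": "HIRED"},
]


def build_timeline(records):
    """Group records by officer name, with each officer's events in date order."""
    timeline = defaultdict(list)
    # YYYYMMDD strings sort chronologically, so a plain string sort is enough.
    for rec in sorted(records, key=lambda r: str(r["date"])):
        when = datetime.strptime(str(rec["date"]), "%Y%m%d").date()
        timeline[rec["name"]].append((when, rec["action"], rec["title"]))
    return dict(timeline)


timeline = build_timeline(records)
for name, events in timeline.items():
    for when, action, title in events:
        print(f"{when.isoformat()}  {name}: {action} ({title})")
```

With the full standardized CSV, the same grouping could be applied to every row after loading it into a list of dicts.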