# Simple Pipeline Test

Starting simple: load a few markdown files and extract entities from them.

In [9]:
%load_ext autoreload
%autoreload 2



In [10]:
# First, let's create some test markdown files
from pathlib import Path

# Create test directories
test_dir = Path("../test_data")
test_dir.mkdir(exist_ok=True)
(test_dir / "personal-notes").mkdir(exist_ok=True)
(test_dir / "work").mkdir(exist_ok=True)

In [11]:
# Create some test markdown files
test_files = [
    ("personal-notes/meeting1.md", 
     """# Meeting with Jennifer Smith\n\nDiscussed the Odyssey Project progress. Jennifer mentioned that Microsoft Corp is interested in our approach.\n\nNext steps:\n- Follow up with Bob Johnson\n- Review Sierra Initiative requirements"""),
    
    ("personal-notes/ideas.md",
     """# Random Ideas\n\nThinking about sustainable architecture. The Green Building Project could be revolutionary.\n\nNeed to contact Sarah Lee about the Tokyo Design Initiative."""),
     
    ("work/project_notes.md",
     """# Odyssey Project Update\n\nWorking with Jennifer Smith and Bob Johnson on the next phase. \n\nGoogle Inc has shown interest in partnering."""),
]

for filename, content in test_files:
    file_path = test_dir / filename
    file_path.write_text(content, encoding="utf-8")
    print(f"Created: {filename}")

Created: personal-notes/meeting1.md
Created: personal-notes/ideas.md
Created: work/project_notes.md


In [12]:
# Now let's test our pipeline!
from ara.markdown_directory import MarkdownDirectory
from ara.agents import extract_entities, count_entities_summary

In [13]:
# Load markdown files from specific paths
md_dir = MarkdownDirectory(
    base_path="../test_data",
    paths=["personal-notes", "work"]
)

# Load the files
md_dir.load()

print("\nFiles loaded:")
for file in md_dir:
    print(f"  - {file.path.relative_to(md_dir.base_path)}")

Loaded 3 markdown files

Files loaded:
  - personal-notes/ideas.md
  - personal-notes/meeting1.md
  - work/project_notes.md


In [14]:
# Apply entity extraction to each file
md_dir.apply(extract_entities)

# Let's see what entities were extracted from each file
for file in md_dir:
    entities = file.metadata.get('extract_entities', {})
    print(f"\n{file.path.name}:")
    for category, items in entities.items():
        if items:
            print(f"  {category}: {items}")


ideas.md:
  people: ['Sarah Lee']
  projects: ['The Green Building Project', 'Tokyo Design Initiative']
  other: ['Random Ideas\n\nThinking', 'Need']

meeting1.md:
  people: ['Bob Johnson']
  projects: ['Odyssey Project', 'Review Sierra Initiative']
  companies: ['Microsoft Corp']
  other: ['Meeting', 'Follow', 'Next', 'Jennifer', 'Jennifer Smith\n\nDiscussed']

project_notes.md:
  people: ['Jennifer Smith', 'Bob Johnson']
  projects: ['Odyssey Project Update\n\nWorking']
  companies: ['Google Inc']
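
Entries such as `'Random Ideas\n\nThinking'` and `'Odyssey Project Update\n\nWorking'` suggest that the capitalized-phrase pattern lets its whitespace separator match across line breaks. The snippet below is a minimal sketch of that effect, not the actual `ara.agents` code: both patterns (`LOOSE`, `STRICT`) are hypothetical and chosen only for illustration. It shows that restricting the separator to spaces and tabs keeps each match on a single line.

```python
import re

# Hypothetical patterns for illustration; the real extractor may differ.
# LOOSE: \s+ matches newlines too, so a heading can fuse with the next paragraph.
LOOSE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+")
# STRICT: only spaces and tabs may separate the words of a phrase.
STRICT = re.compile(r"[A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)+")

text = "# Random Ideas\n\nThinking about sustainable architecture."

loose_matches = LOOSE.findall(text)
strict_matches = STRICT.findall(text)

print(loose_matches)   # ['Random Ideas\n\nThinking']
print(strict_matches)  # ['Random Ideas']
```

The strict variant still treats the heading `Random Ideas` as a candidate entity, so it only removes the cross-line fusion; it does not make the classification itself any smarter.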


In [15]:
# Run the synthesis step to aggregate all entities
summary = md_dir.synthesize(count_entities_summary)

print("\n=== SUMMARY ===")
print(f"Total files processed: {summary['total_files']}")
print("\nEntity counts:")
for category, count in summary['entity_counts'].items():
    print(f"  {category}: {count}")

print("\nTop entities per category:")
for category, entities in summary['top_entities'].items():
    if entities:
        print(f"  {category}: {', '.join(entities)}")


=== SUMMARY ===
Total files processed: 3

Entity counts:
  people: 3
  projects: 5
  companies: 2
  other: 7

Top entities per category:
  people: Sarah Lee, Jennifer Smith, Bob Johnson
  projects: The Green Building Project, Review Sierra Initiative, Odyssey Project, Odyssey Project Update

Working, Tokyo Design Initiative
  companies: Google Inc, Microsoft Corp
  other: Random Ideas

Thinking, Meeting, Need, Next, Follow
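
The counts in the summary are consistent with counting distinct entities per category: Bob Johnson appears in two files, yet `people` is 3, not 4. The sketch below reproduces those numbers from the per-file output shown earlier. `summarize_unique` is a hypothetical helper written for this illustration; how `count_entities_summary` is actually implemented is not shown in this notebook.

```python
from collections import defaultdict

# Per-file results copied from the extraction output earlier in this notebook.
per_file = [
    {"people": ["Sarah Lee"],
     "projects": ["The Green Building Project", "Tokyo Design Initiative"],
     "other": ["Random Ideas\n\nThinking", "Need"]},
    {"people": ["Bob Johnson"],
     "projects": ["Odyssey Project", "Review Sierra Initiative"],
     "companies": ["Microsoft Corp"],
     "other": ["Meeting", "Follow", "Next", "Jennifer",
               "Jennifer Smith\n\nDiscussed"]},
    {"people": ["Jennifer Smith", "Bob Johnson"],
     "projects": ["Odyssey Project Update\n\nWorking"],
     "companies": ["Google Inc"]},
]

def summarize_unique(results):
    """Hypothetical helper: count distinct entities per category across files."""
    unique = defaultdict(set)
    for entities in results:
        for category, items in entities.items():
            unique[category].update(items)  # sets drop cross-file duplicates
    return {category: len(items) for category, items in unique.items()}

counts = summarize_unique(per_file)
print(counts)
```

Because the comparison is on exact strings, `'Odyssey Project'` and `'Odyssey Project Update\n\nWorking'` count as two different projects, which is why `projects` comes out as 5.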


## Next Steps

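This is deliberately simple, and the regex-based extraction is visibly noisy: matches run across line breaks (`'Random Ideas\n\nThinking'`), a list verb gets absorbed into a project name (`'Review Sierra Initiative'`), and Jennifer Smith is missing from the `people` list for meeting1.md. The real MVP would:
1. Use actual LLMs for entity extraction (not regex)
2. Add more agent types (goals, preferences, etc.)
3. Build a proper knowledge graph
4. Add the "Biographer" synthesis step

Still, this shows the basic pipeline structure (load, apply, synthesize) working end to end!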