# Traverse & Persist with ScrapingOrchestrator

The `ScrapingOrchestrator` class now handles:
- ✅ Persisting the metadata (SQLite database) and the markdown files
- ✅ Traversing the next set of `child_urls`
- ✅ Checking whether a page has already been scraped before scraping the next URL

### Features:
1. **Database Persistence**: Uses SQLAlchemy ORM with SQLite
2. **URL Deduplication**: Tracks all scraped URLs to prevent re-scraping
3. **Queue Management**: FIFO queue for traversing child URLs
4. **Batch Processing**: Scrape single URLs or batch operations
5. **Depth Control**: Traverse URLs by depth level
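
The deduplication, FIFO queue, and depth control features together amount to a depth-limited breadth-first traversal. The following is only a minimal sketch of that idea, not the orchestrator's actual code: the in-memory `LINKS` graph and the `crawl` function are hypothetical stand-ins for real pages and Firecrawl requests.

```python
from collections import deque

# Hypothetical link graph standing in for real pages and their child URLs.
LINKS = {
    "root": ["a", "b"],
    "a": ["c", "root"],  # links back to root, which must not be re-scraped
    "b": ["c"],
    "c": ["d"],
    "d": [],
}


def crawl(root, max_depth):
    """Breadth-first traversal with deduplication and a depth limit."""
    scraped = set()             # URLs already visited
    queue = deque([(root, 0)])  # FIFO queue of (url, depth) pairs
    order = []
    while queue:
        url, depth = queue.popleft()
        if url in scraped:
            continue            # skip anything seen before
        scraped.add(url)
        order.append(url)
        if depth < max_depth:   # only expand children above the depth limit
            for child in LINKS.get(url, []):
                if child not in scraped:
                    queue.append((child, depth + 1))
    return order


print(crawl("root", max_depth=2))
```

With `max_depth=2`, page `d` is never reached because it sits three links away from the root.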

In [1]:
import logging
import os

import dotenv

from models import ScrapingOrchestrator

dotenv.load_dotenv(dotenv.find_dotenv(".env"))

# Initialize the orchestrator
root_url = 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/'
api_key = os.getenv('FIRECRAWL_API_KEY')

orchestrator = ScrapingOrchestrator(
    firecrawl_api_key=api_key,
    root_url=root_url,
    db_path=None,  # Uses default: ./data/scraping.db
    log_level=logging.INFO,
    ask_ollama=False,  # Set to True if Ollama is running
    load_existing_urls=True
)

print("✅ ScrapingOrchestrator initialized")
print(f"Database location: {orchestrator.db_manager.db_path}")

2026-01-24 15:53:23,416 - models.orchestrator.ScrapingOrchestrator - INFO - ScrapingOrchestrator initialized
2026-01-24 15:53:23,445 - models.orchestrator.ScrapingOrchestrator - INFO - Loaded existing scraped URLs from database


✅ ScrapingOrchestrator initialized
Database location: /home/joestry/git-projects/github/scraping-projects/flink-docs/firecrawl_flink_docs/data/scraping.db


In [2]:
orchestrator.to_dict()

{'root_url': 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
 'total_scraped_urls': 342,
 'failed_urls': 0,
 'queue_pending': 0,
 'database_pages': 342,
 'allowed_domain': 'nightlies.apache.org/flink/flink-docs-release-1.20/docs/',
 'allow_outside_domain': False,
 'allowed_netloc': 'nightlies.apache.org',
 'allowed_path': '/flink/flink-docs-release-1.20/docs/',
 'ask_ollama': False,
 'load_existing_urls': True,
 'scraped_urls_count': 342,
 'failed_urls_count': 0,
 'scraped_urls_sample': ['https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/memory/mem_tuning',
  'https://code.visualstudio.com/',
  'https://calcite.apache.org/docs/reference.html',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/table/sql/load',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/ops/state/savepoints',
  'https://github.com/apache/flink-training',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/

### Usage Examples

#### Example 1: Scrape single URL with persistence
This scrapes a URL, saves the markdown file, and persists the metadata to the database.

In [2]:
orchestrator.scraped_urls

{'https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream/operators/process_function',
 'https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/table/overview',
 'https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/overview',
 'https://nightlies.apache.org/flink/flink-docs-stable/docs/learn-flink/fault_tolerance',
 'https://nightlies.apache.org/flink/flink-docs-stable/docs/learn-flink/overview'}

In [2]:
# Example 1: Scrape single URL and persist
# Note: You may want to use ask_ollama=False for faster testing
test_url = 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/'

metadata = orchestrator.scrape_and_persist(test_url)
if metadata:
    print("\n✅ Scraped and persisted!")
    print(f"   Page ID: {metadata.page_id}")
    print(f"   Title: {metadata.title}")
    print(f"   Child URLs found: {len(metadata.child_urls)}")
    print(f"   File saved: ./data/markdown_files/{metadata.prefix}_{metadata.page_id}.md")
else:
    print("URL already scraped or scrape failed")

2026-01-18 12:41:48,457 - models.orchestrator.ScrapingOrchestrator - INFO - URL already scraped, skipping


URL already scraped or scrape failed


In [4]:
metadata

#### Example 2: Check if URL has been scraped (deduplication)

In [None]:
# Check if URLs have been scraped
urls_to_check = [
    test_url,  # Should return True (we just scraped it)
    'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/try-flink/datastream/',
]

for url in urls_to_check:
    has_been_scraped = orchestrator.has_been_scraped(url)
    status = "✅ Already scraped" if has_been_scraped else "❌ Not yet scraped"
    print(f"{status}: {url}")
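
Deduplication only works if equivalent spellings of a URL map to the same key (for example with and without a trailing slash). How `has_been_scraped` normalizes URLs is not shown in this notebook; the sketch below is one possible approach, and `normalize_url` is a hypothetical helper, not part of `ScrapingOrchestrator`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: canonical form of a URL for deduplication."""
    parts = urlsplit(url)
    # Treat ".../overview/" and ".../overview" as the same page.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, keep the query string, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


scraped = {
    normalize_url(
        "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/"
    )
}

candidate = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview#top"
print(normalize_url(candidate) in scraped)
```

Dropping the fragment is a design choice: `#section` anchors point into the same page, so scraping them separately would waste API quota.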

#### Example 3: Queue child URLs for traversal

In [None]:
# If metadata had child URLs, they are automatically added to queue
# You can also manually add URLs:

sample_child_urls = [
    ("Getting Started", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/try-flink/"),
    ("DataStream API", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/"),
]

orchestrator.add_urls_to_queue(sample_child_urls)
print(f"Queue size: {orchestrator.queue_size()}")
print("Queue contents:")
for i, (text, url) in enumerate(list(orchestrator.url_queue), 1):
    print(f"  {i}. {text}: {url}")

#### Example 4: Batch scraping from queue

In [None]:
# Scrape a batch (e.g., first 3 URLs from queue)
# Note: Adjust max_urls based on your API quota
if orchestrator.queue_size() > 0:
    stats = orchestrator.scrape_batch(max_urls=3, stop_on_failure=False)
    print("\n📊 Batch Scraping Results:")
    print(f"   ✅ Scraped: {stats['scraped']}")
    print(f"   ❌ Failed: {stats['failed']}")
    print(f"   ⏭️  Skipped (already done): {stats['skipped']}")
    print(f"   📋 Queue remaining: {stats['queue_remaining']}")
else:
    print("Queue is empty! Add URLs first.")
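
To make the four counters in `stats` concrete, here is a rough sketch of how a batch loop could produce them. `scrape_batch_sketch` and `fake_scrape` are hypothetical names, and the sketch assumes that `max_urls` counts every URL taken off the queue, including skipped ones; the real `scrape_batch` may count differently.

```python
from collections import deque


def scrape_batch_sketch(queue, scraped, scrape_fn, max_urls, stop_on_failure=False):
    """Hypothetical stand-in for a batch scrape; not the orchestrator's code."""
    stats = {"scraped": 0, "failed": 0, "skipped": 0}
    for _ in range(max_urls):
        if not queue:
            break
        _text, url = queue.popleft()  # FIFO: oldest entry first
        if url in scraped:
            stats["skipped"] += 1     # already done, no request needed
            continue
        try:
            scrape_fn(url)
        except Exception:
            stats["failed"] += 1
            if stop_on_failure:
                break
            continue
        scraped.add(url)
        stats["scraped"] += 1
    stats["queue_remaining"] = len(queue)
    return stats


def fake_scrape(url):
    """Simulated scraper that fails for URLs containing 'broken'."""
    if "broken" in url:
        raise RuntimeError("simulated failure")


queue = deque([
    ("A", "https://example.com/a"),
    ("B", "https://example.com/broken"),
    ("A again", "https://example.com/a"),
    ("C", "https://example.com/c"),
])
print(scrape_batch_sketch(queue, set(), fake_scrape, max_urls=3))
```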

#### Example 5: Full traversal from root (depth-limited)

In [3]:
# Full traversal (WARNING: be careful with API quotas!)
# This traverses all URLs starting from the root, up to max_depth levels
stats = orchestrator.scrape_from_root(max_depth=5)

# Show the overall stats
stats = orchestrator.get_scraping_stats()
print("\n📈 Overall Scraping Statistics:")
print(f"   Root URL: {stats['root_url']}")
print(f"   Total scraped URLs: {stats['total_scraped_urls']}")
print(f"   Failed URLs: {stats['failed_urls']}")
print(f"   Pending in queue: {stats['queue_pending']}")
print(f"   Pages in database: {stats['database_pages']}")

2026-01-18 13:01:57,883 - models.orchestrator.ScrapingOrchestrator - INFO - Starting scrape from root URL
2026-01-18 13:01:57,884 - models.orchestrator.ScrapingOrchestrator - INFO - Scraping URL
2026-01-18 13:01:59,084 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-01-18 13:01:59,090 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-01-18 13:01:59,116 - models.orchestrator.ScrapingOrchestrator - INFO - Successfully scraped and persisted URL
2026-01-18 13:01:59,117 - models.orchestrator.ScrapingOrchestrator - INFO - Added URLs to queue
2026-01-18 13:01:59,118 - models.orchestrator.ScrapingOrchestrator - INFO - Processing depth level 1
2026-01-18 13:01:59,119 - models.orchestrator.ScrapingOrchestrator - INFO - Scraping URL
2026-01-18 13:02:00,522 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-01-18 13:02:00,526 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-01-18 13:02:00,540 

KeyboardInterrupt: 

#### Example 6: Query the database

In [None]:
# Get all pages from database
all_pages = orchestrator.db_manager.get_all_pages()
print(f"\n📚 All pages in database ({len(all_pages)} total):")
for page in all_pages[:5]:  # Show first 5
    print(f"   - {page.title} ({page.page_id})")
    print(f"     URL: {page.url}")
    print(f"     Scraped at: {page.scrape_timestamp}")
    print()

# Get pages by version (if multiple versions exist)
pages_v120 = orchestrator.db_manager.get_pages_by_version('flink-docs-release-1.20')
print(f"Pages for Flink 1.20: {len(pages_v120)}")
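
For readers who want to see what a per-version lookup boils down to, the sketch below runs the equivalent SQL against an in-memory SQLite database. The orchestrator itself uses SQLAlchemy; the standard-library `sqlite3` module is used here only to keep the example dependency-free. The `pages` table, its columns, the sample rows, and `pages_by_version` are all assumptions modelled on the attributes printed above, not the project's real schema.

```python
import sqlite3

# Hypothetical schema; the real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (
        page_id TEXT PRIMARY KEY,
        title TEXT,
        url TEXT UNIQUE,
        version TEXT,
        scrape_timestamp TEXT
    )
    """
)
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "Savepoints", "https://example.com/1.20/savepoints",
         "flink-docs-release-1.20", "2026-01-18T12:00:00"),
        ("p2", "Overview", "https://example.com/1.20/overview",
         "flink-docs-release-1.20", "2026-01-18T12:01:00"),
        ("p3", "Overview", "https://example.com/stable/overview",
         "flink-docs-stable", "2026-01-18T12:02:00"),
    ],
)


def pages_by_version(conn, version):
    """Return (title, url) rows for one docs version, ordered by title."""
    cur = conn.execute(
        "SELECT title, url FROM pages WHERE version = ? ORDER BY title",
        (version,),
    )
    return cur.fetchall()


for title, url in pages_by_version(conn, "flink-docs-release-1.20"):
    print(f"{title}: {url}")
```

The `UNIQUE` constraint on `url` is one way to let the database itself back up URL deduplication.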