# Firecrawl playpen

This is a simple notebook for exploring what `Firecrawl`'s response object looks like.

Reading the documentation takes time, and I got a bit impatient. :)

In [1]:
from firecrawl import Firecrawl
import dotenv, os, ast, json
import logging

from models.processdata import ResponseProcessor
proc = ResponseProcessor(root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",log_level=logging.INFO)


dotenv.load_dotenv(dotenv.find_dotenv("firecrawl-flink_docs/.env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page with scrape options
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

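# model_dump() returns the response as a plain dict; grab the markdown once
markdown = response.model_dump()['markdown']

print('\n Writing to file...')
# The payload is Markdown, so save it with a .md extension
with open("./flink_firecrawl_output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

print("\n Scrape response:")
print(markdown[:100])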




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown, i.e. it works! YES!

## /response_read

In [2]:
# read() keeps the file's own line breaks; joining readlines() with '\n' would double them
with open('./data/flink_firecrawl_markdown.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

file_response = ast.literal_eval(full_content)

# Metadata extraction

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content goes into an `.md` file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = the URL with the leading `<https://../docs/>` part removed
        2. `<page_id>` = hash of `<prefix>`
2. Metadata:
    1. page_id: hash
    2. title: str
    3. url: str
    4. parent_url: str
    5. is_root_url: bool
    6. child_urls: list[(str,str)], a list of `('link_text', 'link_url')` tuples
    7. scrape_timestamp: timestamp
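
The naming scheme and metadata fields above can be sketched as a small dataclass. This is only an illustration under stated assumptions: `PageRecord`, `make_prefix`, and `make_page_id` are hypothetical names (not the project's `PageMetadata`), SHA-256 is a guess since the notes only say "hash", and replacing slashes with underscores is inferred from the `concepts_overview` prefix seen in the processed output.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def make_prefix(url: str, docs_marker: str = "/docs/") -> str:
    """Return the URL path after `docs_marker`, with slashes turned into underscores."""
    path = urlparse(url).path
    _, _, tail = path.partition(docs_marker)
    return tail.strip("/").replace("/", "_")


def make_page_id(prefix: str) -> str:
    """Hash the prefix. SHA-256 is an assumption; the notes only say "hash"."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


@dataclass
class PageRecord:  # hypothetical stand-in for the real metadata class
    url: str
    title: str
    parent_url: Optional[str] = None
    is_root_url: bool = False
    child_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Derived fields, filled in after construction
    prefix: str = field(init=False)
    page_id: str = field(init=False)

    def __post_init__(self) -> None:
        self.prefix = make_prefix(self.url)
        self.page_id = make_page_id(self.prefix)

    @property
    def file_name(self) -> str:
        """File name following the `<prefix>_<page_id>.md` scheme."""
        return f"{self.prefix}_{self.page_id}.md"


record = PageRecord(
    url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    title="Overview | Apache Flink",
    is_root_url=True,
)
print(record.file_name)
```

Deriving `prefix` and `page_id` in `__post_init__` keeps them in sync with the URL, so the file name never has to be passed around separately.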



In [3]:
processed = proc.process_response(file_response)

2026-01-17 15:35:35,356 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-01-17 15:35:35,358 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-01-17 15:35:51,640 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-01-17 15:35:51,640 - models.processdata.ResponseProcessor - INFO - process_response completed


In [4]:
processed

< PageMetadata
    page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
    prefix=concepts_overview,
    url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview,
    title=Overview | Apache Flink,
    version=flink-docs-release-1.20,
    slug=concepts,
    summary="Learning Flink: Concepts and APIs Overview",
    headings[2]=
      -->  1: Concepts
      -->  2: Flink’s APIs,
    is_root_url=True,
    parent_url=None,
    child_urls[7]=
      -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
      -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
      -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
      -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
      -->  DataStream API (h

## /Traverse & Persist with ScrapingOrchestrator

Now we have the `ScrapingOrchestrator` class, which handles the following:
- ✅ Persisting the metadata (SQLite database) and the markdown files
- ✅ Traversing the next set of child_urls
- ✅ Checking whether a page has already been scraped before scraping its URL

### Features:
1. **Database Persistence**: Uses SQLAlchemy ORM with SQLite
2. **URL Deduplication**: Tracks all scraped URLs to prevent re-scraping
3. **Queue Management**: FIFO queue for traversing child URLs
4. **Batch Processing**: Scrape single URLs or batch operations
5. **Depth Control**: Traverse URLs by depth level
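
To make the interplay of the FIFO queue, deduplication, and depth control concrete, here is a minimal sketch of a breadth-first traversal. It is not the `ScrapingOrchestrator` implementation: `traverse`, `fetch_children`, and `toy_links` are hypothetical, and a plain dict of made-up page names stands in for real scraping.

```python
from collections import deque
from typing import Callable, Iterable


def traverse(
    root_url: str,
    fetch_children: Callable[[str], Iterable[str]],
    max_depth: int,
) -> list[str]:
    """Visit URLs breadth-first, each at most once, up to max_depth levels below the root."""
    seen: set[str] = {root_url}
    queue: deque[tuple[str, int]] = deque([(root_url, 0)])
    visited_order: list[str] = []

    while queue:
        url, depth = queue.popleft()  # FIFO: the oldest entry is handled first
        visited_order.append(url)
        if depth >= max_depth:
            continue  # depth control: do not expand pages at the limit
        for child in fetch_children(url):
            if child not in seen:  # deduplication: never queue a URL twice
                seen.add(child)
                queue.append((child, depth + 1))
    return visited_order


# Toy link graph with made-up page names; "d" sits three levels below "root"
toy_links = {
    "root": ["a", "b"],
    "a": ["b", "c"],
    "b": ["root"],
    "c": ["d"],
}

print(traverse("root", lambda u: toy_links.get(u, []), max_depth=2))
# → ['root', 'a', 'b', 'c']
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps duplicates out of the queue even when several pages link to the same child.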

In [None]:
from models import ScrapingOrchestrator
import os

# Initialize the orchestrator
root_url = 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/'
api_key = os.getenv('FIRECRAWL_API_KEY')

orchestrator = ScrapingOrchestrator(
    firecrawl_api_key=api_key,
    root_url=root_url,
    db_path=None,  # Uses default: ./data/scraping.db
    log_level=logging.INFO,
    ask_ollama=True  # Set to True if Ollama is running
)

print("✅ ScrapingOrchestrator initialized")
print(f"Database location: {orchestrator.db_manager.db_path}")

### Usage Examples

#### Example 1: Scrape single URL with persistence
This scrapes a URL, saves the markdown file, and persists the metadata to the database.

In [None]:
# Example 1: Scrape single URL and persist
# Note: You may want to use ask_ollama=False for faster testing
test_url = 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/'

metadata = orchestrator.scrape_and_persist(test_url)
if metadata:
    print("\n✅ Scraped and persisted!")
    print(f"   Page ID: {metadata.page_id}")
    print(f"   Title: {metadata.title}")
    print(f"   Child URLs found: {len(metadata.child_urls)}")
    print(f"   File saved: ./data/markdown_files/{metadata.prefix}_{metadata.page_id}.md")
else:
    print("URL already scraped or scrape failed")

#### Example 2: Check if URL has been scraped (deduplication)

In [None]:
# Check if URLs have been scraped
urls_to_check = [
    test_url,  # Should return True (we just scraped it)
    'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/try-flink/datastream/',
]

for url in urls_to_check:
    has_been_scraped = orchestrator.has_been_scraped(url)
    status = "✅ Already scraped" if has_been_scraped else "❌ Not yet scraped"
    print(f"{status}: {url}")
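
One detail worth watching for deduplication: the processed metadata earlier shows the URL without a trailing slash, while the root URL was passed in with one. How `has_been_scraped` handles this is not shown here, so the following is only a sketch of one possible normalization step; `normalize_url`, `mark_scraped`, and `already_scraped` are hypothetical helpers.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonical form for comparison: lowercase scheme and host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


scraped: set[str] = set()  # in-memory stand-in for the persisted URL registry


def mark_scraped(url: str) -> None:
    scraped.add(normalize_url(url))


def already_scraped(url: str) -> bool:
    return normalize_url(url) in scraped


mark_scraped("https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/")
# Same page, written without the trailing slash and with an anchor
print(already_scraped("https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview#concepts"))
```

Normalizing on both the write and the read path means the comparison does not depend on how a link happened to be written on the source page.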

#### Example 3: Queue child URLs for traversal

In [None]:
# If metadata had child URLs, they are automatically added to queue
# You can also manually add URLs:

sample_child_urls = [
    ("Getting Started", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/try-flink/"),
    ("DataStream API", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/"),
]

orchestrator.add_urls_to_queue(sample_child_urls)
print(f"Queue size: {orchestrator.queue_size()}")
print("Queue contents:")
for i, (text, url) in enumerate(list(orchestrator.url_queue), 1):
    print(f"  {i}. {text}: {url}")

#### Example 4: Batch scraping from queue

In [None]:
# Scrape a batch (e.g., first 3 URLs from queue)
# Note: Adjust max_urls based on your API quota
if orchestrator.queue_size() > 0:
    stats = orchestrator.scrape_batch(max_urls=3, stop_on_failure=False)
    print("\n📊 Batch Scraping Results:")
    print(f"   ✅ Scraped: {stats['scraped']}")
    print(f"   ❌ Failed: {stats['failed']}")
    print(f"   ⏭️  Skipped (already done): {stats['skipped']}")
    print(f"   📋 Queue remaining: {stats['queue_remaining']}")
else:
    print("Queue is empty! Add URLs first.")

#### Example 5: Full traversal from root (depth-limited)

In [None]:
# Uncomment to do a full traversal (WARNING: Be careful with API quotas!)
# This will traverse all URLs starting from root up to max_depth levels
# stats = orchestrator.scrape_from_root(max_depth=2)

# For now, let's just show stats
stats = orchestrator.get_scraping_stats()
print("\n📈 Overall Scraping Statistics:")
print(f"   Root URL: {stats['root_url']}")
print(f"   Total scraped URLs: {stats['total_scraped_urls']}")
print(f"   Failed URLs: {stats['failed_urls']}")
print(f"   Pending in queue: {stats['queue_pending']}")
print(f"   Pages in database: {stats['database_pages']}")

#### Example 6: Query the database

In [None]:
# Get all pages from database
all_pages = orchestrator.db_manager.get_all_pages()
print(f"\n📚 All pages in database ({len(all_pages)} total):")
for page in all_pages[:5]:  # Show first 5
    print(f"   - {page.title} ({page.page_id})")
    print(f"     URL: {page.url}")
    print(f"     Scraped at: {page.scrape_timestamp}")
    print()

# Get pages by version (if multiple versions exist)
pages_v120 = orchestrator.db_manager.get_pages_by_version('flink-docs-release-1.20')
print(f"Pages for Flink 1.20: {len(pages_v120)}")
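
For a feel of what a version query amounts to at the SQL level, here is a minimal sketch that uses the standard-library `sqlite3` module instead of SQLAlchemy. The table name `pages`, its columns, the helper `pages_by_version`, and the sample rows (including the 1.19 entry) are all hypothetical; the real schema behind `db_manager` is not shown in this notebook.

```python
import sqlite3

# Throwaway in-memory database, not the real ./data/scraping.db
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (
        page_id TEXT PRIMARY KEY,
        title   TEXT,
        url     TEXT UNIQUE,   -- the UNIQUE constraint doubles as a dedup guard
        version TEXT
    )
    """
)

# Made-up rows for illustration only
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        ("id-1", "Overview", "https://example.org/1.20/concepts/overview", "flink-docs-release-1.20"),
        ("id-2", "Data Pipelines ETL", "https://example.org/1.20/learn-flink/etl", "flink-docs-release-1.20"),
        ("id-3", "Overview", "https://example.org/1.19/concepts/overview", "flink-docs-release-1.19"),
    ],
)


def pages_by_version(connection: sqlite3.Connection, version: str) -> list[tuple]:
    """Return (page_id, title, url) rows for one documentation version."""
    cursor = connection.execute(
        "SELECT page_id, title, url FROM pages WHERE version = ? ORDER BY title",
        (version,),
    )
    return cursor.fetchall()


for page_id, title, url in pages_by_version(conn, "flink-docs-release-1.20"):
    print(f"{title} ({page_id}): {url}")
```

The `?` placeholders let SQLite bind the version string safely, so no manual quoting or string formatting is needed.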