# FireCrawl playpen

This is a simple notebook to explore what `Firecrawl`'s response object looks like...

The documentation takes time... and I got a bit impatient... :)

In [1]:
from firecrawl import Firecrawl
import dotenv, os, ast, json
import logging

from models.processdata import ResponseProcessor
proc = ResponseProcessor(
    root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    log_level=logging.INFO,
)


dotenv.load_dotenv(dotenv.find_dotenv("firecrawl-flink_docs/.env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page with scrape options
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

print('\n Writing to file...')
# The payload is markdown, so use a .md extension rather than .json
with open("./flink_firecrawl_output.md", "w", encoding="utf-8") as f:
    f.write(response.model_dump()['markdown'])

print("\n Scrape response:")
print(response.model_dump()['markdown'][:100])




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown content, i.e. it works! YES!
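Since the point of this notebook is to see what the response looks like, a small helper that lists keys, value types, and short previews can help. This is only a sketch: `describe` is a hypothetical helper, and the `sample` dict is a stand-in for `response.model_dump()` with illustrative keys, not the real Firecrawl schema.

```python
from typing import Any


def describe(value: Any, indent: int = 0, max_len: int = 40) -> list[str]:
    """Return one line per key, showing the value type and a short preview."""
    pad = " " * indent
    lines: list[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, dict):
                # Nested dicts get a header line, then their own keys indented.
                lines.append(f"{pad}{key}: dict[{len(item)}]")
                lines.extend(describe(item, indent + 2, max_len))
            else:
                preview = repr(item)
                if len(preview) > max_len:
                    preview = preview[:max_len] + "..."
                lines.append(f"{pad}{key}: {type(item).__name__} = {preview}")
    return lines


# Stand-in for response.model_dump(); the keys below are illustrative only.
sample = {
    "markdown": "# Concepts\n\nSome text",
    "metadata": {"title": "Overview | Apache Flink", "status_code": 200},
}
print("\n".join(describe(sample)))
```

In the notebook this could be called as `describe(response.model_dump())`, assuming the dump is a plain nested dict.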

## /response_read

In [2]:
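# Read the saved markdown as one string; f.read() keeps the original line breaks
# (readlines() plus '\n'.join() would double every newline)
with open('./data/flink_firecrawl_markdown.md', 'r', encoding='utf-8') as f:
    md_content = f.read()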

with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

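# The file is expected to hold a Python dict literal, which literal_eval parses without executing code
file_response = ast.literal_eval(full_content)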
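`ast.literal_eval` only parses Python literals, so the cell above works as long as the saved text is a plain dict of strings, numbers, lists, and so on. If the dump ever contains other values, a JSON file may be easier to reload. A minimal sketch, assuming a dict-shaped dump; the `dumped` contents and the file name are illustrative, not the notebook's real data:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for response.model_dump(); keys are illustrative only.
dumped = {"markdown": "# Concepts", "metadata": {"title": "Overview | Apache Flink"}}

# Hypothetical file name, written to a temp dir so the sketch has no side effects.
path = Path(tempfile.mkdtemp()) / "response_full.json"

# default=str turns values that json cannot encode (e.g. datetimes) into strings.
path.write_text(
    json.dumps(dumped, ensure_ascii=False, indent=2, default=str),
    encoding="utf-8",
)

# Reloading needs no eval-style parsing at all.
restored = json.loads(path.read_text(encoding="utf-8"))
```

The trade-off is that `default=str` is lossy: such values come back as strings, not as their original types.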

# Metadata extraction

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content into `.md`-file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = the URL with the base `<https://../docs/>` stripped, e.g. `concepts_overview`
        2. `<page_id>` = hash of `<prefix>`
2. Metadata:
    1. page_id: hash
    2. title: str
    3. url: str
    4. parent_url: str
    5. is_root_url: bool
    6. child_urls: list[(str, str)], a list of `('link_text', 'link_url')` tuples
    7. scrape_timestamp: timestamp
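
The data model above could be sketched as a small dataclass. Everything here is an assumption made for illustration: `PageRecord`, `make_prefix`, and `make_page_id` are hypothetical names (not the project's `PageMetadata` class), the prefix is built by joining the path segments after `/docs/` with `_`, and SHA-256 is assumed as the hash since the list only says "hash".

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

DOCS_MARKER = "/docs/"  # assumption: everything up to and including this is stripped


def make_prefix(url: str) -> str:
    """Drop the part up to '/docs/' and join the remaining path segments with '_'."""
    _, _, tail = url.partition(DOCS_MARKER)
    return "_".join(part for part in tail.split("/") if part)


def make_page_id(prefix: str) -> str:
    """Hash the prefix; SHA-256 is an assumption, the data model only says 'hash'."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


@dataclass
class PageRecord:  # hypothetical name, not the project's PageMetadata class
    url: str
    title: str
    parent_url: Optional[str] = None
    is_root_url: bool = False
    child_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Derived fields, filled in after construction.
    prefix: str = field(init=False)
    page_id: str = field(init=False)

    def __post_init__(self) -> None:
        self.prefix = make_prefix(self.url)
        self.page_id = make_page_id(self.prefix)

    @property
    def file_name(self) -> str:
        """File name following the `<prefix>_<page_id>.md` convention."""
        return f"{self.prefix}_{self.page_id}.md"


record = PageRecord(
    url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    title="Overview | Apache Flink",
    is_root_url=True,
)
print(record.prefix, record.file_name)
```

Deriving `prefix` and `page_id` in `__post_init__` keeps them consistent with `url`, so they cannot be set to mismatching values by hand.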



In [3]:
processed = proc.process_response(file_response)

2026-01-17 15:35:35,356 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-01-17 15:35:35,358 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-01-17 15:35:51,640 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-01-17 15:35:51,640 - models.processdata.ResponseProcessor - INFO - process_response completed


In [4]:
processed

< PageMetadata
    page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
    prefix=concepts_overview,
    url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview,
    title=Overview | Apache Flink,
    version=flink-docs-release-1.20,
    slug=concepts,
    summary="Learning Flink: Concepts and APIs Overview",
    headings[2]=
      -->  1: Concepts
      -->  2: Flink’s APIs,
    is_root_url=True,
    parent_url=None,
    child_urls[7]=
      -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
      -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
      -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
      -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
      -->  DataStream API (h

## /NEXT-STEPS

* [x] Limit scraping to specific domain
* [x] Allow pruning of child URLs that fall outside the limited domain
* [ ] Allow DB updates to existing records with LLM calls
* [ ] Check each DB record's child URLs so they only include existing URLs
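
The pruning and the "only existing URLs" check from the list above could look roughly like this. It is a sketch under assumptions, not the `ResponseProcessor` implementation: `is_within` and `prune_child_urls` are hypothetical helpers, "limited domain" is read as "same host and under the root URL's path", and `known_urls` stands in for whatever URLs the DB already holds.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def is_within(url: str, root_url: str) -> bool:
    """True if url is on the same host as root_url and under its path prefix."""
    target, root = urlparse(url), urlparse(root_url)
    # Normalise trailing slashes so '/docs' and '/docs/' compare equal.
    root_path = root.path.rstrip("/") + "/"
    target_path = target.path.rstrip("/") + "/"
    return target.netloc == root.netloc and target_path.startswith(root_path)


def prune_child_urls(
    child_urls: Iterable[tuple[str, str]],
    root_url: str,
    known_urls: Optional[set[str]] = None,
) -> list[tuple[str, str]]:
    """Keep (link_text, link_url) pairs inside root_url and, optionally, already known."""
    kept = []
    for text, url in child_urls:
        if not is_within(url, root_url):
            continue
        if known_urls is not None and url.rstrip("/") not in known_urls:
            continue
        kept.append((text, url))
    return kept


root = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/"
children = [
    ("Data Pipelines ETL", root + "learn-flink/etl"),
    ("Fault Tolerance", root + "learn-flink/fault_tolerance"),
    ("GitHub", "https://github.com/apache/flink"),
]
in_domain = prune_child_urls(children, root)
existing_only = prune_child_urls(children, root, known_urls={root + "learn-flink/etl"})
```

Here `in_domain` drops the GitHub link, and `existing_only` additionally drops any child whose URL is not in `known_urls`.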