# Firecrawl playpen

This is a simple notebook for exploring what `Firecrawl`'s response object looks like.

Reading the documentation takes time, and I got a bit impatient... :)

In [1]:
from firecrawl import Firecrawl
import dotenv, os, ast, json
import logging

from models.processdata import ResponseProcessor
# proc = ResponseProcessor(root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",log_level=logging.INFO)
proc = ResponseProcessor(root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",log_level=logging.DEBUG)

dotenv.load_dotenv(dotenv.find_dotenv(".env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

2026-02-04 19:54:00,136 - models.processdata.ResponseProcessor - DEBUG - Initializing ResponseProcessor


## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page, requesting only the main content as markdown
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

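# Dump the response once and reuse the markdown string
markdown = response.model_dump()['markdown']

print('\n Writing to file...')
# NOTE: the payload is markdown, even though the file name ends in .json
with open("./flink_firecrawl_output.json", "w", encoding="utf-8") as f:
    f.write(markdown)

print("\n Scrape response:")
print(markdown[:100])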




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown. I.e. it works!!! YES!!!

## /response_read

In [3]:
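# Read the saved markdown as-is; joining readlines() with '\n' would double every line break
with open('./data/flink_firecrawl_markdown.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

# The full response was saved as the text of a Python literal
with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

# literal_eval parses Python literals only, without executing arbitrary code
file_response = ast.literal_eval(full_content)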

# Metadata extraction

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content into a `.md` file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = the page URL with the `<https://../docs/>` base stripped
        2. `<page_id>` = hash of `<prefix>`
2. Metadata:
    1. page_id: hash
    2. title: str
    3. url: str
    4. parent_url: str
    5. is_root_url: bool
    6. child_urls: list[(str, str)], a list of `('link_text', 'link_url')` tuples
    7. scrape_timestamp: timestamp
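
The file naming and metadata fields above could be sketched roughly as follows. This is only an illustration, not the actual `ResponseProcessor` code: `PageRecord`, `make_prefix` and `make_page_id` are hypothetical names, SHA-256 is an assumed hash function, and replacing `/` and `-` with `_` is a guess based on the file names that show up later in this notebook.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Assumption: everything up to and including "/docs/" is the base to strip.
DOCS_MARKER = "/docs/"


def make_prefix(url: str) -> str:
    """Strip the docs base, then join the remaining path segments with '_'."""
    _, _, tail = url.partition(DOCS_MARKER)
    segments = [segment for segment in tail.split("/") if segment]
    # Assumption: hyphens are also turned into underscores.
    return "_".join(segments).replace("-", "_")


def make_page_id(prefix: str) -> str:
    """Hash the prefix; SHA-256 is an assumed choice."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


@dataclass
class PageRecord:
    """Hypothetical container mirroring the metadata fields listed above."""
    url: str
    title: str
    parent_url: Optional[str] = None
    is_root_url: bool = False
    child_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def prefix(self) -> str:
        return make_prefix(self.url)

    @property
    def page_id(self) -> str:
        return make_page_id(self.prefix)

    @property
    def file_name(self) -> str:
        # File name = <prefix>_<page_id>.md
        return f"{self.prefix}_{self.page_id}.md"


record = PageRecord(
    url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    title="Overview | Apache Flink",
    is_root_url=True,
)
print(record.file_name)
```

Deriving `prefix`, `page_id` and `file_name` from `url` keeps the three values from drifting apart, at the cost of recomputing the hash on every access.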



In [4]:
processed = proc.process_response(file_response)

2026-02-04 18:17:10,767 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-02-04 18:17:10,770 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-02-04 18:17:21,554 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-02-04 18:17:21,555 - models.processdata.ResponseProcessor - INFO - process_response completed


In [5]:
processed

< PageMetadata
    page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
    prefix=concepts_overview,
    url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview,
    title=Overview | Apache Flink,
    version=flink-docs-release-1.20,
    slug=flink,
    summary="Exploring Flink's Streaming APIs for stateful and timely processing",
    headings[2]=
      -->  1: Concepts
      -->  2: Flink’s APIs,
    is_root_url=True,
    parent_url=None,
    child_urls[7]=
      -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
      -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
      -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
      -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
      --

In [2]:
with open('./data/markdown_files/concepts_stateful_stream_processing_2242824968fe3664ac00b3506911daf8e28b527e9f76a7f85c2c04e20e9ff783.md', 'r') as f:
    lines = f.readlines()

concepts_content = '\n'.join(lines)

In [3]:
print(concepts_content[:500])

> This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest [stable version](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/).



# Stateful Stream Processing  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/stateful-stream-processing/\#stateful-stream-processing)



## What is State?  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/stateful-stream-pr


In [4]:
proc.extract_summaries_with_ollama(concepts_content)

2026-02-04 19:54:11,513 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-02-04 19:54:11,519 - models.processdata.ResponseProcessor - DEBUG - Requesting slug from Ollama
2026-02-04 19:54:11,520 - models.processdata.ResponseProcessor - DEBUG - Requesting slug prompt: 
 'You are senior copy writer. Given the full markdown content, write a specific 'slug' from the page.
A 'slug' is a single-word, lowercase identifier (no spaces) that will specifically summarize the page.
Only respond with this 'slug'.

MARKDOWN:
> This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest [stable version](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/).



# Stateful Stream Processing  [\#](htt' 
...
2026-02-04 19:54:11,521 - models.processdata.ResponseProcessor - DEBUG - Ollama API call for slug, attempt 1/3
2026-02-04 19:54:45,065 - models.processdata.ResponseProcessor - DEBUG -

{'slug': "Here's a summary of the provided text:\n\n**State and Fault Tolerance**\n\n* Flink provides state management for both streaming and batch programs.\n* The `ExecutionMode` enum is used to specify the execution mode: `BATCH`, `STREAMING`, or `DEFAULT`.\n* In BATCH execution mode, streams are bounded, which simplifies fault tolerance but increases recovery costs.\n\n**Checkpointing**\n\n* Checkpointing involves taking a snapshot of the program's state.\n* There are two types of checkpointing:\n\t+ **Aligned Checkpointing**: This is the default behavior, where checkpoints are aligned with the end of an operator's execution.\n\t+ **Unaligned Checkpointing**: In this mode, operators keep processing all inputs after receiving a checkpoint barrier. This reduces latency but increases recovery costs.\n\n**Savepoints**\n\n* Savepoints are manually triggered checkpoints that allow users to update their programs and Flink cluster without losing state.\n* Unlike regular checkpoints, savepo
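
The logged response above is a multi-paragraph summary, not the single-word lowercase identifier the prompt asks for. A minimal sketch of a guard that could sit between the LLM call and the metadata is shown below. `normalize_slug` is a hypothetical helper, not part of `ResponseProcessor`, and what to do when it returns `None` (retry, fall back to the prefix, ...) is left open.

```python
import re
from typing import Optional

# A slug is taken to be one lowercase alphanumeric word (assumption based on the prompt).
SLUG_PATTERN = re.compile(r"[a-z0-9]+")


def normalize_slug(raw: str) -> Optional[str]:
    """Return a cleaned slug, or None if the response is not a single word."""
    # Drop surrounding whitespace and stray quote characters, then lowercase.
    candidate = raw.strip().strip("'\"`").lower()
    if SLUG_PATTERN.fullmatch(candidate):
        return candidate
    return None


print(normalize_slug(" Flink "))
print(normalize_slug("Here's a summary of the provided text:\n\n**State and Fault Tolerance**"))
```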

## /NEXT-STEPS

* [x] Limit scraping to a specific domain
* [x] Allow pruning of child URLs based on the allowed domain
* [ ] Allow DB updates to existing records with LLM calls
* [ ] Check each DB record's child URLs so that they only include existing URLs
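
The pruning steps in this list could look roughly like the sketch below. It is an assumption about the approach, not the code used in this project: `prune_child_urls` and `keep_known_urls` are hypothetical helpers, and "allowed domain" is interpreted here as the same host plus a path prefix.

```python
from urllib.parse import urlparse


def prune_child_urls(child_urls, allowed_netloc, allowed_path_prefix="/"):
    """Keep only (link_text, link_url) pairs on the allowed host and path prefix."""
    kept = []
    for link_text, link_url in child_urls:
        parsed = urlparse(link_url)
        if parsed.netloc == allowed_netloc and parsed.path.startswith(allowed_path_prefix):
            kept.append((link_text, link_url))
    return kept


def keep_known_urls(child_urls, known_urls):
    """Keep only pairs whose URL is already stored (e.g. as a DB record)."""
    known = set(known_urls)
    return [(text, url) for text, url in child_urls if url in known]


links = [
    ("Fault Tolerance", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance"),
    ("Elsewhere", "https://example.com/somewhere"),
]
pruned = prune_child_urls(
    links,
    allowed_netloc="nightlies.apache.org",
    allowed_path_prefix="/flink/flink-docs-release-1.20/docs/",
)
print(pruned)
```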