# FireCrawl playpen

This is a simple notebook for exploring what `Firecrawl`'s response object looks like...

Reading the documentation takes time... and I got a bit impatient... :)

In [1]:
from firecrawl import Firecrawl
import dotenv, os, ast, json
import logging

dotenv.load_dotenv(dotenv.find_dotenv(".env"))

# The API key is read from the environment variable `GEMINI_API_KEY` and passed to the client explicitly.
from google import genai
gemini = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))



from models.processdata import ResponseProcessor
# proc = ResponseProcessor(root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",log_level=logging.INFO)
proc = ResponseProcessor(root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",log_level=logging.DEBUG)

firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

2026-02-09 21:39:46,033 - models.processdata.ResponseProcessor - DEBUG - Initializing ResponseProcessor


## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page with scrape options
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

print('\n Writing to file...')
# NOTE: the payload is markdown text, even though the file name ends in `.json`.
markdown = response.model_dump()['markdown']
with open("./flink_firecrawl_output.json", "w", encoding="utf-8") as f:
    f.write(markdown)

print("\n Scrape response:")
print(markdown[:100])




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown content. I.e. it works!!! YES!!!
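
When poking at an unfamiliar response object, it can help to print only its shape (keys and value types) rather than the full payload. A minimal sketch, assuming the response has already been turned into a plain dict (as `response.model_dump()` does above). `describe_structure` and the sample dict are hypothetical and only mimic the kind of nesting one might see; they are not Firecrawl's actual schema.

```python
def describe_structure(obj, max_depth=3, _depth=0):
    """Return a nested summary of keys and value types, without the values."""
    if isinstance(obj, dict):
        if _depth >= max_depth:
            return "dict"
        return {k: describe_structure(v, max_depth, _depth + 1) for k, v in obj.items()}
    if isinstance(obj, list):
        if not obj or _depth >= max_depth:
            return f"list[{len(obj)}]"
        # Summarize only the first element; assumes the list is homogeneous.
        return [describe_structure(obj[0], max_depth, _depth + 1)]
    return type(obj).__name__


# Hypothetical sample, NOT the real Firecrawl schema.
sample = {
    "markdown": "# Concepts ...",
    "metadata": {"title": "Overview | Apache Flink", "status_code": 200},
    "links": ["https://example.org/a", "https://example.org/b"],
}
shape = describe_structure(sample)
print(shape)
```

In the notebook this could be called as `describe_structure(response.model_dump())` to get an overview before digging into individual fields.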

## /response_read

In [18]:
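with open('./data/flink_firecrawl_markdown.md', 'r') as f:
    lines = f.readlines()

# NOTE: readlines() keeps each line's trailing '\n', so joining with '\n'
# doubles every line break (visible in the printed markdown further down).
md_content = '\n'.join(lines)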

with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

file_response = ast.literal_eval(full_content)
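
`ast.literal_eval` is the right tool here only if the file holds the Python `repr` of a dict (single quotes, `None`, `True`), which is an assumption about how `flink_firecrawl_response_full.txt` was written. A small sketch of the difference, using a made-up dict rather than real Firecrawl data:

```python
import ast
import json

# Made-up stand-in for a dumped response; not real Firecrawl data.
data = {"markdown": "# Concepts", "metadata": {"title": None, "cached": True}}

as_repr = str(data)         # what f.write(str(data)) would store
as_json = json.dumps(data)  # what json.dump would store

# The repr form is valid Python but not valid JSON (single quotes, None, True).
restored = ast.literal_eval(as_repr)

try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(restored == data, repr_is_json, json.loads(as_json) == data)
```

Saving with `json.dump` in the first place would make the file readable with `json.load`, so no Python literals need to be evaluated when reading it back.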

# Metadata extraction

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content goes into a `.md` file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = the URL with the leading `https://../docs/` removed (e.g. `concepts_overview`)
        2. `<page_id>` = hash of `<prefix>`
2. Metadata:
    1. page_id: hash
    2. title: str
    3. url: str
    4. parent_url: str
    5. is_root_url: bool
    6. child_urls: list[(str,str)], a list of `('link_text', 'link_url')` tuples
    7. scrape_timestamp: timestamp
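
The data model above could be sketched roughly as follows. This is only an illustration: `PageRecord`, `make_prefix` and `make_page_id` are hypothetical names (not the `ResponseProcessor` code), and SHA-256 is an assumption, since the notes only say "hash".

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PageRecord:
    """Hypothetical stand-in for the per-page metadata listed above."""
    page_id: str
    title: str
    url: str
    parent_url: str | None
    is_root_url: bool
    child_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def make_prefix(url: str, docs_marker: str = "/docs/") -> str:
    """Drop everything up to and including `/docs/`, then join path parts with `_`."""
    tail = url.split(docs_marker, 1)[1]
    return "_".join(part for part in tail.split("/") if part)


def make_page_id(prefix: str) -> str:
    """Assumption: SHA-256 hex digest; the notes only say 'hash'."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


url = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/"
prefix = make_prefix(url)
record = PageRecord(
    page_id=make_page_id(prefix),
    title="Overview | Apache Flink",
    url=url,
    parent_url=None,
    is_root_url=True,
)
file_name = f"{prefix}_{record.page_id}.md"
print(file_name)
```

Note that `make_prefix` raises an `IndexError` for URLs without `/docs/`; a real implementation would need to decide how to handle those.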



In [19]:
processed = proc.process_response(file_response)

2026-02-08 15:20:02,832 - models.processdata.ResponseProcessor - INFO - parse_raw_response called
2026-02-08 15:20:02,833 - models.processdata.ResponseProcessor - DEBUG - parse_raw_response called
2026-02-08 15:20:02,834 - models.processdata.ResponseProcessor - DEBUG - content_to_hash called
2026-02-08 15:20:02,834 - models.processdata.ResponseProcessor - DEBUG - content_to_hash result
2026-02-08 15:20:02,835 - models.processdata.ResponseProcessor - DEBUG - extract_version called
2026-02-08 15:20:02,836 - models.processdata.ResponseProcessor - DEBUG - extract_version result
2026-02-08 15:20:02,837 - models.processdata.ResponseProcessor - DEBUG - extract_prefix called
2026-02-08 15:20:02,838 - models.processdata.ResponseProcessor - DEBUG - extract_prefix result
2026-02-08 15:20:02,839 - models.processdata.ResponseProcessor - DEBUG - prefix_to_hash called
2026-02-08 15:20:02,839 - models.processdata.ResponseProcessor - DEBUG - prefix_to_hash result
2026-02-08 15:20:02,841 - models.proces

In [20]:
processed

< PageMetadata
    page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
    prefix=concepts_overview,
    url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview,
    title=Overview | Apache Flink,
    version=flink-docs-release-1.20,
    slug=concepts,
    summary="Understanding Flink's APIs for Streaming/Batch Applications",
    headings[2]=
      -->  1: Concepts
      -->  2: Flink’s APIs,
    is_root_url=True,
    parent_url=None,
    child_urls[7]=
      -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
      -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
      -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
      -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
      -->  Da

In [24]:
print(md_content)

# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\#concepts)



The [Hands-on Training](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/) explains the basic concepts

of stateful and timely stream processing that underlie Flink’s APIs, and provides examples of how these mechanisms are used in applications. Stateful stream processing is introduced in the context of [Data Pipelines & ETL](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl/#stateful-transformations)

and is further developed in the section on [Fault Tolerance](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance/).

Timely stream processing is introduced in the section on [Streaming Analytics](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics/).



This _Concepts in Depth_ section provides a deeper understanding of how Flink

In [2]:
with open('./data/markdown_files/concepts_stateful_stream_processing_2242824968fe3664ac00b3506911daf8e28b527e9f76a7f85c2c04e20e9ff783.md', 'r') as f:
    lines = f.readlines()

# NOTE: as before, joining readlines() output with '\n' doubles every line break.
concepts_content = '\n'.join(lines)
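
The cells in this notebook that read markdown files join the output of `readlines()` with `'\n'`. A tiny self-contained sketch (using `io.StringIO` in place of a real file) of why that produces extra blank lines, and what `read()` returns instead:

```python
import io

text = "line one\nline two\n"

# readlines() keeps the trailing "\n" on every line ...
lines = io.StringIO(text).readlines()
# ... so joining with "\n" inserts a second newline between lines.
joined = "\n".join(lines)
# Reading everything at once preserves the original text.
whole = io.StringIO(text).read()

print(repr(joined))
print(repr(whole))
```

If the extra blank lines are not wanted in the prompt, `f.read()` (or `''.join(lines)`) would keep the markdown exactly as stored.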

In [3]:
proc.extract_summaries_with_ollama(concepts_content,timeout=180,provider='gemini')

2026-02-08 16:14:22,025 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-02-08 16:14:22,027 - models.processdata.ResponseProcessor - DEBUG - Requesting slug from Gemini
2026-02-08 16:14:22,028 - models.processdata.ResponseProcessor - DEBUG - Requesting slug prompt: 
 'You are senior copy writer. Given the full markdown content, write a specific 'slug' from the page.
A 'slug' is a single-word, lowercase identifier (no spaces) that will specifically summarize the page.
Only respond with this 'slug'.

MARKDOWN:
> This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest [stable version](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/).



# Stateful Stream Processing  [\#](htt' 
...
2026-02-08 16:14:22,028 - models.processdata.ResponseProcessor - DEBUG - Gemini API call for slug, attempt 1/3
2026-02-08 16:14:22,029 - models.processdata.ResponseProcessor - DEBUG -

{'slug': '', 'summary': '', 'headings': []}

In [25]:
# slug_prompt = (
#     "\"\"\"Your role: senior copy writer. \n"
#     "Task: Create a slug given the full markdown content.\n"
#     "Slug definition: A 'slug' is a single-word, lowercase identifier (no spaces) that will specifically summarize the page.\n"
#     "Response: 'slug' ONLY.\n\n"
#     "MARKDOWN:\n" + f"'''{concepts_content}'''\"\"\""
# )

# slug_prompt = (
#     "\"\"\"You must output EXACTLY one word. Nothing else. \n"
#     "CONTENT:\n" + f"'''{concepts_content}'''\"\"\""
# )


slug_prompt = (
    "\"\"\"Task: Generate a slug for the content below. \n\n " +
    "Rules: \n" +
    " - Output ONLY ONE WORD \n" +
    " - Must be lowercase \n" +
    " - No spaces, no hyphens, no special characters \n" +
    " - Do not explain, do not summarize \n" +
    " - Just the single word slug \n\n" +
    "Example1: \n" +
    "Content: \n" + f"'''{md_content}'''\n" +
    "Response: 'concepts' \n\n" +
    "CONTENT:\n" + f"'''{concepts_content}'''\"\"\""
)
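
Since the prompt asks for exactly one lowercase word, the model's reply can be checked against those rules before it is stored. A minimal sketch, assuming the reply arrives as a plain string; `normalize_slug` is a hypothetical helper, not part of `ResponseProcessor`, and allowing digits is my own assumption:

```python
import re
from typing import Optional

# Assumption: digits are acceptable in a slug; the prompt only bans
# spaces, hyphens and special characters.
_SLUG_RE = re.compile(r"[a-z0-9]+")


def normalize_slug(raw: str) -> Optional[str]:
    """Return a clean slug, or None if the reply breaks the prompt's rules."""
    # Tolerate surrounding whitespace and quotes such as 'concepts',
    # and lowercase the reply instead of rejecting mixed case.
    candidate = raw.strip().strip("'\"`").lower()
    return candidate if _SLUG_RE.fullmatch(candidate) else None


for reply in ["'concepts'", " Stateful\n", "stateful-streams", "The slug is: state", ""]:
    print(repr(reply), "->", normalize_slug(reply))
```

A `None` result could be used as the signal to retry the request or to fall back to an empty slug.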



In [None]:
print(slug_prompt)

"""Task: Generate a slug for the content below. 

 Rules: 
 - Output ONLY ONE WORD 
 - Must be lowercase 
 - No spaces, no hyphens, no special characters 
 - Do not explain, do not summarize 
 - Just the single word slug 

Example1: 
Content: 
'''# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\#concepts)



The [Hands-on Training](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/) explains the basic concepts

of stateful and timely stream processing that underlie Flink’s APIs, and provides examples of how these mechanisms are used in applications. Stateful stream processing is introduced in the context of [Data Pipelines & ETL](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl/#stateful-transformations)

and is further developed in the section on [Fault Tolerance](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance/).

Timely stre

In [None]:
from google import genai

# The client gets the API key from the environment variable `GEMINI_API_KEY`.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-flash-preview", contents="Explain how AI works in a few words"
)
print(response.text)

In [None]:
$ uv sync
warning: `VIRTUAL_ENV=/home/path/to/.venv` does not match the project environment path `.venv` and will be ignored; use `--active` to target the active environment instead
Resolved 95 packages in 1ms
Audited 91 packages in 1ms

## /NEXT-STEPS

* [x] Limit scraping to a specific domain
* [x] Prune child URLs that fall outside the allowed domain
* [ ] Allow DB updates to existing records with LLM calls
* [ ] Check each DB record's child URLs so that they only include existing URLs
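
The domain-based pruning of child URLs mentioned in the list above could look roughly like this. It is a sketch only: `prune_child_urls` is a hypothetical helper and not how `ResponseProcessor` does it, and `example.org` is a made-up off-domain link.

```python
from urllib.parse import urlparse


def prune_child_urls(child_urls, allowed_prefix):
    """Keep (link_text, link_url) pairs on the same host and under the same path prefix."""
    allowed = urlparse(allowed_prefix)
    base_path = allowed.path.rstrip("/") + "/"
    kept = []
    for text, url in child_urls:
        parsed = urlparse(url)
        same_host = parsed.netloc == allowed.netloc
        # Normalize the trailing slash so `/docs` and `/docs/` compare equal.
        under_path = (parsed.path.rstrip("/") + "/").startswith(base_path)
        if same_host and under_path:
            kept.append((text, url))
    return kept


allowed_prefix = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/"
children = [
    ("Fault Tolerance", "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance"),
    ("stable version", "https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/"),
    ("Elsewhere", "https://example.org/docs/"),
]
pruned = prune_child_urls(children, allowed_prefix)
print(pruned)
```

Comparing host and path separately, rather than a plain string `startswith` on the whole URL, avoids accidentally accepting look-alike hosts.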