# FireCrawl playpen

This is a simple notebook for discovering what `Firecrawl`'s response object looks like.

Reading the documentation takes time, and I got a bit impatient... :)

In [1]:
from firecrawl import Firecrawl
import dotenv, os, ast, json
import logging
# import urllib.parse
# import hashlib

from models.processdata import ResponseProcessor
proc = ResponseProcessor(
    root_url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    log_level=logging.INFO,
)


dotenv.load_dotenv(dotenv.find_dotenv("firecrawl-flink_docs/.env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

## /crawl

In [22]:
print("\n Starting crawl...")

# Crawl with scrape options
response = firecrawl.crawl('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    limit=3,
    scrape_options={
        "maxDepth": 1,
        "render": False,
        "ignoreRobotsTxt": True,
    }
)



print("\n Crawl finished...")

print("\n Crawl response:")
print(response.model_dump())


 Starting crawl...

 Crawl finished...

 Crawl response:
{'status': 'completed', 'total': 0, 'completed': 0, 'credits_used': 0, 'expires_at': datetime.datetime(2026, 1, 4, 11, 24, 44, tzinfo=TzInfo(0)), 'next': None, 'data': []}


The above shows that the crawl does not really work: it reports `completed`, but with `total: 0` and an empty `data` list. I suspect it has to do with the `robots.txt` restrictions on Flink's docs. Not sure why that is restricted...
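
One way to test the `robots.txt` suspicion is the standard library's `urllib.robotparser`. The sketch below parses rules from a string so that it runs offline. The sample rules and the `is_allowed` helper are hypothetical and are not the actual Flink `robots.txt`; to check the real site, the file would have to be fetched first (for example with `RobotFileParser.set_url(...)` followed by `read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real nightlies.apache.org robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page")
allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/docs/page")
print(blocked, allowed)
```

If the real rules turn out to allow the docs path, the empty crawl result probably has a different cause.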

## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page and return it as markdown
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

print('\n Writing to file...')
# The payload is markdown, so use a .md extension rather than .json
with open("./flink_firecrawl_output.md", "w", encoding="utf-8") as f:
    f.write(response.model_dump()['markdown'])

print("\n Scrape response:")
print(response.model_dump()['markdown'][:100])




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown, i.e. it works!!! YES!!!

## /response_read

In [2]:
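with open('./data/flink_firecrawl_markdown.md', 'r', encoding='utf-8') as f:
    # read() keeps the file's own line breaks; joining readlines() with '\n'
    # would double every newline, because each line already ends with one
    md_content = f.read()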

with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

file_response = ast.literal_eval(full_content)

# Metadata extraction

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content into `.md`-file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = URL with the `<https://../docs/>` base removed and the remaining path segments joined by `_` (e.g. `concepts_overview`)
        2. `<page_id>` = hash of `<prefix>`
2. Meta-data:
    1. page_id: hash
    2. title: str
    3. url: str
    4. parent_url: str
    5. is_root_url: bool
    6. child_urls: list[tuple[str, str]] (a list of `(link_text, link_url)` tuples)
    7. scrape_timestamp: timestamp
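
The data model above could be sketched roughly as follows. This is only an illustration: `extract_prefix`, `make_page_id`, and `PageRecord` are hypothetical names, the use of SHA-256 is an assumption (the list only says "hash"), and the real `ResponseProcessor` may derive these values differently.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def extract_prefix(url: str, docs_marker: str = "/docs/") -> str:
    """Join the path segments that follow `docs_marker` with underscores."""
    path = urlparse(url).path  # query string and #fragment are not part of .path
    _, _, tail = path.partition(docs_marker)
    return "_".join(segment for segment in tail.split("/") if segment)


def make_page_id(prefix: str) -> str:
    """Hex digest of the prefix; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


@dataclass
class PageRecord:
    page_id: str
    title: str
    url: str
    parent_url: Optional[str]
    is_root_url: bool
    child_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def filename(self) -> str:
        # File name = <prefix>_<page_id>.md
        return f"{extract_prefix(self.url)}_{self.page_id}.md"


url = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/"
prefix = extract_prefix(url)
record = PageRecord(
    page_id=make_page_id(prefix),
    title="Overview | Apache Flink",
    url=url,
    parent_url=None,
    is_root_url=True,
)
print(prefix)
print(record.filename)
```

A URL without the `/docs/` marker yields an empty prefix in this sketch, so a real implementation would probably want to handle that case explicitly.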



In [3]:
processed = proc.process_response(file_response)

2026-01-10 20:49:06,459 - models.processdata.ResponseProcessor - INFO - extract_summaries_with_ollama called
2026-01-10 20:49:17,133 - models.processdata.ResponseProcessor - INFO - Saved markdown file
2026-01-10 20:49:17,133 - models.processdata.ResponseProcessor - INFO - process_response completed


In [4]:
processed

< PageMetadata page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
  url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview
  slug=concepts,
  summary="Understanding Flink's Abstraction Layers for Streaming/Batch Applications",
  title=Overview | Apache Flink,
  headings=
  --> , 1: Concepts
  -->  2: Flink’s APIs,
  is_root_url=True,
  parent_url=None,
  child_urls[7]=
  -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
  -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
  -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
  -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
  -->  DataStream API (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview)
 

In [8]:
proc.extract_metadata_with_ollama(markdown=md_content)

{'slug': 'concepts',
 'summary': "Flink's programming levels of abstraction explained.",
 'headings': {'headings': [{'level': 1, 'text': 'Concepts'},
   {'level': 2, 'text': 'Flink’s APIs'}]}}

In [25]:
import logging
import urllib.request
import urllib.error

class Tester:

    def __init__(self, log_level=logging.DEBUG):
        # Configure console logger
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
        self.logger.setLevel(log_level)
        
        # Add console handler if not already present
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setLevel(log_level)
            formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
            handler.setFormatter(formatter)
            self.logger.addHandler(handler)
        
        self.logger.debug("Initializing ResponseProcessor")


    def _request_ollama(self, prompt: str, model: str, host: str, timeout: int) -> str:
        payload = json.dumps({
            "model": model,
            "prompt": prompt,
            "stream": False
        }).encode('utf-8')

        url = host.rstrip('/') + "/api/generate"
        req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"}, method="POST")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                resp_text = resp.read().decode('utf-8', errors='replace')

            response_json = json.loads(resp_text)
            return response_json.get('response', '')
        except urllib.error.HTTPError as e:
            body = e.read().decode('utf-8', errors='ignore') if hasattr(e, 'read') else ''
            self.logger.exception("Ollama HTTP error", extra={"status": getattr(e, 'code', None), "body": body})
            raise
        except Exception:
            self.logger.exception("Failed contacting Ollama server")
            raise

    def extract_metadata_with_ollama(self, markdown: str, model: str = "llama3.2:3b", host: str = "http://localhost:11434", timeout: int = 30) -> dict:
        """
        Send `markdown` to an Ollama instance and ask for JSON containing:
          - slug: one-word lowercase summary
          - summary: ~100 character summary
          - headings: list of {"level":int, "text":str}

        Returns a dict with keys: `slug`, `summary`, `headings` (or raises on hard failure).
        """
        self.logger.debug("extract_metadata_with_ollama called", extra={"model": model, "host": host, "markdown_len": len(markdown)})

        slug_prompt = (
            "You are a senior copywriter. Given the full markdown content, write a specific 'slug' for the page.\n"
            "A 'slug' is a single-word, lowercase identifier (no spaces) that specifically summarizes the page.\n"
            "Only respond with this 'slug'.\n\n"
            "MARKDOWN:\n" + markdown
        )

        summary_prompt = (
            "You are a senior copywriter. Given the full markdown content, create a specific 'summary' that identifies the page.\n"
            "In this case a 'summary' is a concise, specific sentence that identifies the page and is only around 100 characters long.\n"
            "Only respond with this 'summary'.\n\n"
            "MARKDOWN:\n" + markdown
        )

        headings_prompt = (
            "You are a senior copywriter who always responds in JSON to any query. Given the full markdown content, extract the 'headings' from the page.\n"
            "In this case 'headings' is a list of objects representing every heading in the page in document order.\n"
            "Each object must have 'level' (integer) and 'text' (string).\n"
            "Example: {\"headings\":[{\"level\":1,\"text\":\"Concepts\"},{\"level\":2,\"text\":\"Flink’s APIs\"}]}\n"
            "Only respond with the JSON payload containing the 'headings' list.\n\n"
            "MARKDOWN:\n" + markdown
        )

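        response_dict = {}
        for n, prompt in zip(['slug', 'summary', 'headings'], [slug_prompt, summary_prompt, headings_prompt]):
            resp = self._request_ollama(prompt, model, host, timeout)
            if n == 'headings':
                try:
                    # The prompt asks for JSON, so parse with json.loads; this is
                    # also the call that raises the json.JSONDecodeError caught below
                    # (ast.literal_eval raises ValueError/SyntaxError instead)
                    headings_resp = json.loads(resp)
                    # Unwrap {"headings": [...]} so the value is the list itself
                    if isinstance(headings_resp, dict):
                        headings_resp = headings_resp.get('headings', [])
                    response_dict[n] = headings_resp
                except json.JSONDecodeError:
                    self.logger.error("Failed to decode headings JSON from Ollama response", extra={"response": resp})
                    response_dict[n] = []
            else:
                response_dict[n] = resp.strip()

        return response_dict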

ts = Tester()


2026-01-10 20:21:22,212 - __main__.Tester - DEBUG - Initializing ResponseProcessor


In [26]:
ts.extract_metadata_with_ollama(md_content)

2026-01-10 20:21:23,392 - __main__.Tester - DEBUG - extract_metadata_with_ollama called


{'slug': 'concepts',
 'summary': "Flink's Programming Abstraction Overview",
 'headings': [{'level': 1, 'text': 'Concepts'},
  {'level': 2, 'text': 'Flink’s APIs'}]}

In [None]:
%%bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?",
  "stream": false
}' > ./data/llama_response.json

In [20]:
# with open('./data/llama_response.json', 'r') as f:

# llama_response = json.loads('data/llama_response.json')
    
with open('./data/llama_response.json', 'r', encoding='utf-8') as f:
    llama_content = f.read()
    llama_response = json.loads(llama_content)

# try:
#     print("Trying json.loads first...")
#     llama_response = json.loads(llama_content)
# except json.JSONDecodeError:
#     try:
#         print("Trying ast.literal_eval next...")
#         llama_response = ast.literal_eval(llama_content)
#     except Exception:
#         with open('./data/llama_response.json', 'rb') as fb:
#             raw = fb.read()
#         try:
#             print("Trying json.loads with surrogateescape...")
#             llama_response = json.loads(raw.decode('utf-8', 'surrogateescape'))
#         except Exception:
#             print("Falling back to ast.literal_eval with replace...")
#             llama_response = ast.literal_eval(raw.decode('utf-8', 'replace'))


In [21]:
print(llama_response['response'])

The sky appears blue because of a phenomenon called scattering, which occurs when sunlight interacts with the tiny molecules of gases in the Earth's atmosphere.

Here's what happens:

1. Sunlight enters the Earth's atmosphere and contains all the colors of the visible spectrum (red, orange, yellow, green, blue, indigo, and violet).
2. The shorter wavelengths of light, such as blue and violet, are scattered more than the longer wavelengths by the tiny molecules of gases in the atmosphere, like nitrogen and oxygen.
3. This scattering effect is more pronounced for shorter wavelengths because they have a smaller wavelength and are more easily deflected by the gas molecules.
4. As a result, the blue light is distributed throughout the atmosphere, reaching our eyes from all directions.
5. Our brains perceive this scattered blue light as the color of the sky.

This phenomenon is known as Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 1

Here are some suggestions from Copilot. They require some work from an LLM (especially the summary and slug parts), but let's check whether we can integrate this into Ollama, i.e. without going out to external LLMs.

The name of the file is interesting:
> Suggested filename (example): overview_019b8f59-6e02.md
> (If you prefer canonical-hash, replace the UUID prefix with sha256(canonical_url)[:12].)
> Main .md file contents (save exactly as file body; no frontmatter):
>

Here is the JSON output:
```
{
"page_id": "sha256:<hex-of-canonical-url>",
"content_hash": "sha256:<hex-of-normalized-markdown>",
"slug": "overview",
"title": "Overview | Apache Flink",
"url": "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
"canonical_url": "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
"scrape_id": "019b8f59-6e02-767d-bf46-0690425307de",
"index_id": "2a63f795-1f18-4d10-a6c2-474de4abeab9",
"status_code": 200,
"content_type": "text/html",
"language": "en",
"summary": "Overview of Flink concepts, APIs, and training resources.",
"headings": [{"level":1,"text":"Concepts"},{"level":2,"text":"Flink’s APIs"}],
"assets": [{"original_url":"https://nightlies.apache.org/flink/.../fig/levels_of_abstraction.svg","inferred_filename":"levels_of_abstraction.svg","content_type":"image/svg+xml"}],
"previous_url": null,
"next_urls": [],
"is_root_url": false,
"scrape_timestamp": "2026-01-05T12:00:00Z",
"cached_at": null,
"provenance": "nightlies.apache.org",
"notes": "content taken from Firecrawl response.model_dump(); consider canonical_url normalization before dedup."
}
```
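
The `page_id`, `content_hash`, and canonical-URL note in the suggestion could be sketched as follows. The normalization rules used here (lowercase scheme and host, drop query and fragment, strip the trailing slash, trim trailing whitespace in the markdown) and all helper names are assumptions; the suggestion itself does not specify them.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, no query, no fragment, no trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def normalize_markdown(markdown: str) -> str:
    """Assumed normalization: unify line endings and trim trailing whitespace."""
    lines = [line.rstrip() for line in markdown.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip() + "\n"


def sha256_tag(text: str) -> str:
    """Return the digest in the 'sha256:<hex>' form used in the JSON above."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


url = (
    "https://nightlies.apache.org/flink/flink-docs-release-1.20"
    "/docs/dev/datastream/operators/#keyed-and-non-keyed-operators"
)
canonical_url = canonicalize_url(url)
page_id = sha256_tag(canonical_url)
content_hash = sha256_tag(normalize_markdown("# Concepts  \r\n\r\nSome text\n"))
# Short form for file names, as in sha256(canonical_url)[:12]
short_id = page_id.removeprefix("sha256:")[:12]
print(canonical_url)
```

With this scheme, a page and its in-page anchors share one `page_id`, which is what deduplication needs, while `content_hash` changes only when the normalized markdown changes.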

In [17]:
# ts.extract_prefix(response.model_dump()['metadata']['url'])
ts.extract_prefix(file_response['metadata']['url'])
# ts.extract_prefix('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/')
# ts.extract_prefix('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/#keyed-and-non-keyed-operators')

'concepts_overview'

In [18]:
ts.process_response(file_response)

< PageMetadata page_id=d699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c,
  url=https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview
  title=Overview | Apache Flink,
  is_root_url=True,
  parent_url=None,
  child_urls[7]=
  -->  Handson Training (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview)
  -->  Data Pipelines ETL (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl)
  -->  Fault Tolerance (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance)
  -->  Streaming Analytics (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics)
  -->  DataStream API (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview)
  -->  Process Function (https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/process_function)
  -->  Table API (https://nig

In [19]:
ts.parse_raw_response(file_response)

{'title': 'Overview | Apache Flink',
 'url': 'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
 'is_root_url': True,
 'parent_url': None,
 'prefix': 'concepts_overview',
 'page_id': 'd699b5373c84d3776703d9c89d472a1ecee196e604219eb74f8e5647e6a4513c',
 'child_urls': [('Handson Training',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview'),
  ('Data Pipelines ETL',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl'),
  ('Fault Tolerance',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance'),
  ('Streaming Analytics',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics'),
  ('DataStream API',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview'),
  ('Process Function',
   'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/o

In [6]:
ollama_response = json.loads('data/llama_response.json')
print(ollama_response['response'])

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
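
The `JSONDecodeError` above comes from handing `json.loads` a file path: it parses the string itself as JSON text, and a path is not valid JSON. A minimal sketch of the difference follows; `load_json_file` is a hypothetical helper, and the demo writes a temporary file so that it does not depend on the notebook's `data` directory.

```python
import json
import tempfile
from pathlib import Path


def load_json_file(path) -> dict:
    """Open the file and let json.load parse its contents."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "llama_response.json"
    sample.write_text(json.dumps({"response": "hello"}), encoding="utf-8")

    # Correct: parse the file contents
    loaded = load_json_file(sample)

    # Incorrect: parse the path string itself, which reproduces the error
    try:
        json.loads(str(sample))
    except json.JSONDecodeError as exc:
        error_message = str(exc)

print(loaded["response"])
print(error_message)
```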