# FireCrawl playpen

This is a simple notebook for exploring what `Firecrawl`'s response object looks like.

Reading the documentation takes time, and I got a bit impatient. :)

In [17]:
from firecrawl import Firecrawl
import dotenv, os, re, ast
import urllib.parse
import hashlib


dotenv.load_dotenv(dotenv.find_dotenv("firecrawl-flink_docs/.env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

## /crawl

In [22]:
print("\n Starting crawl...")

# Crawl with scrape options
response = firecrawl.crawl('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    limit=3,
    scrape_options={
        "maxDepth": 1,
        "render": False,
        "ignoreRobotsTxt": True,
    }
)



print("\n Crawl finished...")

print("\n Crawl response:")
print(response.model_dump())


 Starting crawl...

 Crawl finished...

 Crawl response:
{'status': 'completed', 'total': 0, 'completed': 0, 'credits_used': 0, 'expires_at': datetime.datetime(2026, 1, 4, 11, 24, 44, tzinfo=TzInfo(0)), 'next': None, 'data': []}


The output above shows that the crawl does not really work: `total` is 0 and `data` is empty. I suspect it has to do with the `robots.txt` restrictions on Flink's docs, although I am not sure why crawling would be restricted there.
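One way to test the `robots.txt` suspicion is to parse the rules locally and ask whether a given URL may be fetched. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the `SAMPLE_ROBOTS` rules are made up for illustration and are not the actual contents of the Flink site's `robots.txt`; the real file would have to be downloaded first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, only for illustration (NOT the real Flink robots.txt).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.org/private/page"))  # False
print(is_allowed(SAMPLE_ROBOTS, "https://example.org/docs/page"))     # True
```

If the real rules turn out to allow the docs path, the empty crawl result probably has a different cause, for example the crawl options that were passed.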

## /scrape

In [5]:
print("\n Starting scrape...")

# Scrape a single page and request markdown output
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

print('\n Writing to file...')
with open("./flink_firecrawl_output.md", "w", encoding="utf-8") as f:  # the content is markdown, not JSON
    f.write(response.model_dump()['markdown'])

print("\n Scrape response:")
print(response.model_dump()['markdown'][:100])




 Starting scrape...

 Scrape finished...

 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the markdown content of the scraped page. I.e. it works!!! YES!!!

# Metadata extraction

In [18]:
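# Read the markdown as one string. readlines() keeps each trailing newline,
# so joining the lines on '\n' would double every line break.
with open('./data/flink_firecrawl_markdown.md', 'r', encoding='utf-8') as f:
    md_content = f.read()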

with open('./data/flink_firecrawl_response_full.txt', 'r', encoding='utf-8') as f:
    full_content = f.read()

file_response = ast.literal_eval(full_content)

In [4]:
%ll ./data

total 16
-rw-rw-r-- 1 joestry 4248 Jan  5 21:19 flink_firecrawl_markdown.md
-rw-rw-r-- 1 joestry 7967 Jan  6 20:18 flink_firecrawl_response_full.txt


In [16]:
# List the top-level keys of the response dict loaded from file above
for i, k in enumerate(file_response.keys()):
    print(f"Key {i:<2}: {k}")
    # if k == 'markdown':
    #     print(file_response[k][:500])

Key 0 : markdown
Key 1 : html
Key 2 : raw_html
Key 3 : json
Key 4 : summary
Key 5 : metadata
Key 6 : links
Key 7 : images
Key 8 : screenshot
Key 9 : actions
Key 11: change_tracking
Key 12: branding


In [4]:
# test_str = response.model_dump()['markdown']
test_str = md_content

extract_markdown_links(test_str)

[('Hands-on Training',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/'),
 ('Data Pipelines & ETL',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl/#stateful-transformations'),
 ('Fault Tolerance',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance/'),
 ('Streaming Analytics',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Process Function',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/process_function/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Table API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/table/overview/'),
 ('SQL',
  'https://nightlies

## Datamodel

This part describes the data that needs to be saved for each scraped page.

1. Main content into `.md`-file:
    1. File name = `<prefix>_<page_id>.md`
        1. `<prefix>` = the url with the leading `https://../docs/` part removed (e.g. `concepts_overview`)
        2. `<page_id>` = sha256 hash of `<prefix>`
2. Meta-data:
    1. page_id: hash
    2. title: str
    3. url: str
    4. previous_url: str
    5. is_root_url: bool
    6. next_urls (a list of `(link_text, link_url)` tuples): list[tuple[str, str]]
    7. scrape_timestamp: timestamp
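The data model above could be sketched as a small dataclass plus a file-name helper. This is only an illustration under a few assumptions: the class name `PageMetadata` and the function `markdown_filename` are hypothetical, the hash is the full sha256 hex digest of the prefix, and the timestamp defaults to the current UTC time.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PageMetadata:
    """Per-page metadata, mirroring the fields listed above (hypothetical class)."""
    page_id: str
    title: str
    url: str
    previous_url: Optional[str]
    is_root_url: bool
    # Each entry is a (link_text, link_url) tuple.
    next_urls: list[tuple[str, str]] = field(default_factory=list)
    scrape_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def markdown_filename(prefix: str) -> str:
    """Build `<prefix>_<page_id>.md`, where page_id is the sha256 of the prefix."""
    page_id = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return f"{prefix}_{page_id}.md"


meta = PageMetadata(
    page_id=hashlib.sha256(b"concepts_overview").hexdigest(),
    title="Overview | Apache Flink",
    url="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
    previous_url=None,
    is_root_url=True,
)
print(markdown_filename("concepts_overview"))
```

A full 64-character digest makes for long file names; truncating it would shorten them at the cost of a higher collision risk.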



Here are some suggestions from Copilot. They require some work from an LLM (especially the summary and stub parts), but let's check whether we can integrate this with Ollama, i.e. without going out to external LLMs.

The name of the file is interesting:
> Suggested filename (example): overview_019b8f59-6e02.md
> (If you prefer canonical-hash, replace the UUID prefix with sha256(canonical_url)[:12].)
> Main .md file contents (save exactly as file body; no frontmatter):
>

Here is the suggested JSON output:
```
{
"page_id": "sha256:<hex-of-canonical-url>",
"content_hash": "sha256:<hex-of-normalized-markdown>",
"slug": "overview",
"title": "Overview | Apache Flink",
"url": "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
"canonical_url": "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/",
"scrape_id": "019b8f59-6e02-767d-bf46-0690425307de",
"index_id": "2a63f795-1f18-4d10-a6c2-474de4abeab9",
"status_code": 200,
"content_type": "text/html",
"language": "en",
"summary": "Overview of Flink concepts, APIs, and training resources.",
"headings": [{"level":1,"text":"Concepts"},{"level":2,"text":"Flink’s APIs"}],
"assets": [{"original_url":"https://nightlies.apache.org/flink/.../fig/levels_of_abstraction.svg","inferred_filename":"levels_of_abstraction.svg","content_type":"image/svg+xml"}],
"previous_url": null,
"next_urls": [],
"is_root_url": false,
"scrape_timestamp": "2026-01-05T12:00:00Z",
"cached_at": null,
"provenance": "nightlies.apache.org",
"notes": "content taken from Firecrawl response.model_dump(); consider canonical_url normalization before dedup."
}
```
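The `page_id` and `content_hash` fields in the suggestion could be computed roughly as follows. This is a sketch, not Copilot's or Firecrawl's method: the function names are hypothetical, and both normalizations are assumptions (lowercase host, no fragment or query, no trailing slash for the URL; trimmed trailing whitespace and unified line endings for the markdown). Note that the example JSON keeps the trailing slash in `canonical_url`, so the rule would need adjusting to match it.

```python
import hashlib
import urllib.parse


def canonicalize_url(url: str) -> str:
    """Assumed canonical form: lowercase scheme/host, no query, no fragment, no trailing slash."""
    parts = urllib.parse.urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urllib.parse.urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, "", "")
    )


def normalize_markdown(markdown: str) -> str:
    """Assumed normalization: unify line endings, strip trailing spaces, end with one newline."""
    lines = [line.rstrip() for line in markdown.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip() + "\n"


def sha256_tag(text: str) -> str:
    """Return the digest in the 'sha256:<hex>' form used in the JSON above."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


url = "https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/#concepts"
page_id = sha256_tag(canonicalize_url(url))
content_hash = sha256_tag(normalize_markdown("# Concepts  \r\n\r\nSome text\n\n"))
print(canonicalize_url(url))
print(page_id[:19], content_hash[:19])
```

Hashing the normalized forms means that cosmetic differences (an anchor in the URL, trailing spaces in the markdown) do not produce a new id or a false "content changed" signal.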

In [None]:
class Tester:

    def __init__(self, root_url: str = None):
        # Always set the attribute so that reading it later cannot raise AttributeError.
        self.root_url = root_url


    def extract_prefix(self, url, remove_start: str = 'https://nightlies.apache.org/', remove_end: str = '/docs/') -> str:
        pattern = re.compile(re.escape(remove_start) + r'.*?' + re.escape(remove_end))
        rest = pattern.sub('', url, 1)
        cleaned = re.sub(r'[^A-Za-z]+', '_', rest).strip('_')
        return re.sub(r'_+', '_', cleaned)
    
    def prefix_to_hash(self, prefix: str, numeric: bool = False):
        h = hashlib.sha256(prefix.encode('utf-8')).hexdigest()
        return int(h[:16], 16) if numeric else h
    
    def extract_markdown_links(self, text):
        """
        Extract unique markdown page links (text, url) from `text`, excluding image links.
        Fragments (anchors) are removed so multiple section links to the same page yield one entry.
        """

        pattern = re.compile(r'(?<!\!)\[(?P<text>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)')
        seen = set()
        ret = []

        for m in pattern.finditer(text):
            raw_url = m.group('url').replace('\\', '')
            parts = urllib.parse.urlsplit(raw_url)

            # normalize scheme and netloc, remove fragment
            scheme = parts.scheme.lower()
            netloc = parts.netloc.lower()
            if (scheme == 'http' and netloc.endswith(':80')) or (scheme == 'https' and netloc.endswith(':443')):
                netloc = netloc.rsplit(':', 1)[0]

            normalized = urllib.parse.urlunsplit((scheme, netloc, parts.path or '/','',''))
            normalized = normalized.rstrip('/')

            if normalized in seen:
                continue
            seen.add(normalized)

            desc = re.sub(r'\s+', ' ', m.group('text')).strip()
            ret.append((desc, normalized))

        return ret
    

ts = Tester()

In [22]:
# ts.extract_prefix(response.model_dump()['metadata']['url'])
ts.extract_prefix(file_response['metadata']['url'])
# ts.extract_prefix('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/')
# ts.extract_prefix('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/#keyed-and-non-keyed-operators')

'concepts_overview'

In [23]:
ts.extract_markdown_links(md_content)

[('\\#',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview'),
 ('Hands-on Training',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview'),
 ('Data Pipelines & ETL',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl'),
 ('Fault Tolerance',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance'),
 ('Streaming Analytics',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview'),
 ('Process\\\\ Function',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/process_function'),
 ('Table\\\\ API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/table/overview')]
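The output above still contains markdown escape backslashes in the link texts (`'\\#'`, `'Process\\\\ Function'`) and a heading anchor link whose text is just `#`. A possible post-processing step is sketched below. The helper names `clean_link_text` and `drop_anchor_only_links` are hypothetical, and the sketch assumes that backslashes in link text are never meaningful.

```python
import re


def clean_link_text(text: str) -> str:
    """Remove markdown escape backslashes and collapse runs of whitespace."""
    without_escapes = text.replace("\\", "")
    return re.sub(r"\s+", " ", without_escapes).strip()


def drop_anchor_only_links(links):
    """Clean each link text and skip links whose text is empty or only '#'."""
    cleaned = []
    for text, url in links:
        label = clean_link_text(text)
        if label in ("", "#"):
            continue  # heading permalink, not a real navigation link
        cleaned.append((label, url))
    return cleaned


sample = [
    ("\\#", "https://example.org/docs/concepts/overview"),
    ("Process\\\\ Function", "https://example.org/docs/process_function"),
]
print(drop_anchor_only_links(sample))
```

The example URLs are placeholders; the same function could be applied to the list returned by `ts.extract_markdown_links(md_content)`.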

In [27]:
ts.extract_markdown_links_old(md_content)

[('Hands-on Training',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/'),
 ('Data Pipelines & ETL',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl/#stateful-transformations'),
 ('Fault Tolerance',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance/'),
 ('Streaming Analytics',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Process Function',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/process_function/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Table API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/table/overview/'),
 ('SQL',
  'https://nightlies