# Firecrawl playpen

This is a simple notebook to discover what `Firecrawl`'s response objects look like...

The documentation takes time... and I got a bit impatient... :)

In [2]:
from firecrawl import Firecrawl
import dotenv
import os
import re

dotenv.load_dotenv(dotenv.find_dotenv("firecrawl-flink_docs/.env"))
firecrawl = Firecrawl(api_key=os.getenv('FIRECRAWL_API_KEY'))

## /crawl

In [22]:
print("\n Starting crawl...")

# Crawl with scrape options
response = firecrawl.crawl('https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    limit=3,
    scrape_options={
        "maxDepth": 1,
        "render": False,
        "ignoreRobotsTxt": True,
    }
)



print("\n Crawl finished...")

print("\n Crawl response:")
print(response.model_dump())


 Starting crawl...

 Crawl finished...

 Crawl response:
{'status': 'completed', 'total': 0, 'completed': 0, 'credits_used': 0, 'expires_at': datetime.datetime(2026, 1, 4, 11, 24, 44, tzinfo=TzInfo(0)), 'next': None, 'data': []}


The above shows that the crawl does not really work: the status is `completed`, but `total` is 0 and `data` is empty. I suspect it has to do with a `robots.txt` restriction on the Flink docs... Not sure why that would be restricted...
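One way to test the `robots.txt` suspicion independently of Firecrawl is Python's built-in `urllib.robotparser`. The sketch below is only a minimal check under assumptions: `is_allowed` is a hypothetical helper name, and the sample rules are made up for illustration, they are not the real rules served by the Flink docs host.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() works on the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only (NOT the actual robots.txt of the Flink docs)
sample_rules = "User-agent: *\nDisallow: /flink/\n"

blocked = is_allowed(sample_rules, "https://example.org/flink/docs/concepts/overview/")
open_page = is_allowed(sample_rules, "https://example.org/other/page/")
print(blocked, open_page)
```

To check the real site, the text of its `robots.txt` would have to be fetched first and passed in together with the user agent the crawler actually sends, which I have not verified here.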

## /scrape

In [None]:
print("\n Starting scrape...")

# Scrape a single page and request its content as markdown
response = firecrawl.scrape(
    url='https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/',
    wait_for=2000,
    only_main_content=True,
    formats=['markdown'],
)



print("\n Scrape finished...")

print('\n Writing to file...')
# The payload is markdown, so use a .md extension rather than .json
with open("./flink_firecrawl_output.md", "w", encoding="utf-8") as f:
    f.write(response.model_dump()['markdown'])

print("\n Scrape response:")
print(response.model_dump()['markdown'][:100])




 Writing to file...

 Scrape response:
# Concepts  [\#](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/concepts/overview/\


This prints the first 100 characters of the scraped page's markdown, i.e. it works! YES!

In [16]:

def extract_markdown_links(text):
    """
    Extract markdown links (text, url) from `text`, excluding image links like ![alt](url)
    and heading-anchor links whose url contains an escaped '#'.
    """
    # The negative lookbehind (?<!\!) skips image links, which start with '!'
    pattern = re.compile(r'(?<!\!)\[(?P<text>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)')
    ret_list = [(m.group('text'), m.group('url')) for m in pattern.finditer(text) if '\\#' not in m.group('url')]

    # Clean link texts: collapse whitespace and drop markdown escape backslashes
    ret_list = [(re.sub(r'\s+', ' ', desc).strip().replace('\\', ''), url) for desc, url in ret_list]

    return ret_list
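A page can link to the same target more than once, so the list returned by `extract_markdown_links` may contain repeated URLs. A minimal sketch of an order-preserving de-duplication follows, assuming the input is a list of `(text, url)` tuples; `dedupe_links` is a hypothetical helper name and the sample URLs are placeholders.

```python
from typing import Iterable, List, Tuple

Link = Tuple[str, str]  # (link text, url)


def dedupe_links(links: Iterable[Link]) -> List[Link]:
    """Keep the first occurrence of each URL, preserving the original order."""
    seen = set()
    unique = []
    for text, url in links:
        if url not in seen:
            seen.add(url)
            unique.append((text, url))
    return unique


# Placeholder data in the same (text, url) shape
sample_links = [
    ("DataStream API", "https://example.org/datastream/"),
    ("Process Function", "https://example.org/process_function/"),
    ("DataStream API", "https://example.org/datastream/"),
]
unique_links = dedupe_links(sample_links)
print(unique_links)
```

Keying on the URL alone means that two links with different texts but the same target collapse into the first one seen; keying on the whole tuple would keep both.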

In [17]:
test_str = response.model_dump()['markdown']

extract_markdown_links(test_str)

[('Hands-on Training',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/'),
 ('Data Pipelines & ETL',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/etl/#stateful-transformations'),
 ('Fault Tolerance',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/fault_tolerance/'),
 ('Streaming Analytics',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/streaming_analytics/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Process Function',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/process_function/'),
 ('DataStream API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/'),
 ('Table API',
  'https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/table/overview/'),
 ('SQL',
  'https://nightlies