Implement Firecrawl caching to reduce API credit usage - Issue 46 #151

ajalote2 wants to merge 18 commits into promptdriven:main
Conversation
This commit implements comprehensive caching functionality for Firecrawl web scraping to address issue promptdriven#46: cache Firecrawl results so they don't use up API credits.

Features implemented:
- SQLite-based persistent caching with configurable TTL
- URL normalization for consistent cache keys
- Automatic cleanup and size management
- Dual-layer caching (client-side + Firecrawl's maxAge parameter)
- CLI commands for cache management (stats, clear, info, check)
- Environment variable configuration
- Comprehensive test suite with 20+ test cases
- Complete documentation with usage examples

Files added:
- pdd/firecrawl_cache.py: Core caching functionality
- pdd/firecrawl_cache_cli.py: CLI commands for cache management
- tests/test_firecrawl_cache.py: Comprehensive test suite
- docs/firecrawl-caching.md: Complete documentation

Files modified:
- pdd/preprocess.py: Updated to use caching with dual-layer approach
- pdd/cli.py: Added firecrawl-cache command group

Configuration options:
- FIRECRAWL_CACHE_ENABLE (default: true)
- FIRECRAWL_CACHE_TTL_HOURS (default: 24)
- FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
- FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
- FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)

CLI commands:
- pdd firecrawl-cache stats: View cache statistics
- pdd firecrawl-cache clear: Clear all cached entries
- pdd firecrawl-cache info: Show configuration
- pdd firecrawl-cache check --url <url>: Check specific URL

Benefits:
- Significant reduction in API credit usage
- Faster response times for cached content
- Improved reliability with offline capability
- Transparent integration with existing <web> tags
- Comprehensive management through CLI tools
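URL normalization keeps equivalent URLs from producing separate cache entries. As a rough sketch of the idea (the function name and exact normalization rules here are illustrative, not necessarily what `pdd/firecrawl_cache.py` does), one might lowercase the scheme and host, strip fragments, and trim trailing slashes before hashing or storing the key:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Illustrative sketch: map equivalent URL forms to one cache key.

    The PR's actual normalization rules may differ; this only shows
    the general technique.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower() or "https"
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it never changes the fetched content.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With this, `HTTPS://Example.com/page/#top` and `https://example.com/page` would share one cache entry.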
Pull request overview
This PR implements a comprehensive caching system for Firecrawl web scraping to reduce API credit consumption. The caching layer uses SQLite for persistent storage and integrates seamlessly with existing <web> tags in prompts.
Key changes:
- SQLite-based caching with configurable TTL, size limits, and automatic cleanup
- CLI commands for cache management (stats, clear, info, check)
- Integration with preprocess.py to cache web scraping results automatically
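The core pattern behind an SQLite-backed TTL cache like this can be sketched in a few lines. The class and column names below are illustrative (the PR's real schema and API may differ): a single table keyed by URL, with a fetch timestamp checked against the TTL on every read.

```python
import sqlite3
import time

class FirecrawlCacheSketch:
    """Minimal sketch of an SQLite-backed TTL cache; names are illustrative."""

    def __init__(self, path: str = ":memory:", ttl_hours: float = 24.0):
        self.ttl = ttl_hours * 3600
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
        )

    def get(self, url: str):
        row = self.db.execute(
            "SELECT content, fetched_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return row[0]  # fresh hit: no API credit spent
        return None        # miss or expired: caller scrapes and re-stores

    def set(self, url: str, content: str):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (url, content, time.time()),
        )
        self.db.commit()
```

On a miss the caller falls through to the Firecrawl API and writes the result back, so only the first fetch of each URL within the TTL window costs credits.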
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pdd/firecrawl_cache.py | Core caching implementation with SQLite backend and URL normalization |
| pdd/firecrawl_cache_cli.py | CLI commands for cache management and statistics |
| pdd/preprocess.py | Integration of caching into web tag processing workflow |
| pdd/cli.py | Registration of firecrawl-cache CLI command group |
| tests/test_firecrawl_cache.py | Comprehensive test suite covering cache functionality and integration |
| docs/firecrawl-caching.md | Complete documentation for the caching feature |
```python
def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()
```
Duplicate function definition: get_firecrawl_cache_stats is defined twice (lines 384-387 and 389-391). Remove the duplicate at lines 389-391.
| ) | ||
| row = cursor.fetchone() | ||
| assert row is not None | ||
| stored_metadata = eval(row[0]) # Simple eval for test |
Using eval() is a security risk even in tests. Use json.loads() instead since the metadata is stored as JSON.
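The suggested fix is a one-line change. Since the metadata is stored as a JSON string, `json.loads` parses it safely, whereas `eval` would execute whatever the string contains. (The `row` tuple below stands in for the test's `cursor.fetchone()` result.)

```python
import json

# Stand-in for cursor.fetchone(): a one-column row holding JSON metadata.
row = ('{"title": "Example", "status": 200}',)

# json.loads parses the stored JSON; eval() would execute arbitrary code.
stored_metadata = json.loads(row[0])
```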
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Resolved merge conflicts between PR promptdriven#151 (Firecrawl caching feature) and main's modular CLI refactoring.

Changes:
- Accepted main's CLI structure (core/ and commands/)
- Registered firecrawl-cache commands in commands/__init__.py
- Merged caching logic in preprocess.py with main's Firecrawl API
- Updated context/firecrawl_example.py for new API
- Moved documentation to README.md

Feature: Adds firecrawl-cache command for managing web scraping cache to reduce API usage.

Note: Implementation will be revised in follow-up commits.
This PR adds comprehensive caching functionality for web scraping using Firecrawl, addressing API quota management and improving efficiency.
Resolves #46.