CLI tool for scraping and downloading files from JavaScript-heavy pages using Puppeteer.
Standard web fetching tools fail on Single Page Applications (SPAs) that require JavaScript to render content. This CLI uses Puppeteer (headless Chrome) to:
- Render JavaScript - Loads the page in a real browser
- Capture authenticated URLs - Intercepts network requests to find signed download URLs
- Extract files - Finds PDFs, images, audio, and other downloadable content
- Download files - Saves files with proper filenames
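As a rough illustration of the "proper filenames" step, the sketch below derives a local filename from a signed download URL. The function name `filename_from_url` and the fallback name are hypothetical and not taken from this project; the real logic in `scraper.js` may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, fallback: str = "download.bin") -> str:
    """Derive a local filename from a URL, ignoring the query string."""
    # urlsplit separates the path from the query, so signature parameters
    # on signed URLs do not end up in the filename.
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    # Replace characters that are awkward on common filesystems.
    safe = "".join(c if c.isalnum() or c in "._- " else "_" for c in name)
    # Drop leading/trailing spaces and dots (this also rejects "." and "..").
    safe = safe.strip(" .")
    return safe or fallback
```

Working from the URL path alone is a simplification: a server may also suggest a name via the `Content-Disposition` response header, which this sketch does not consult.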
cd ~/Documents/Dev/Dev-Tools/web-scraper-cli
uv sync
The first run automatically installs the Puppeteer dependencies.
uv run webscrape scrape "https://taskcards.de/board/..."
uv run webscrape scrape "https://example.com/page" --download
# or
uv run webscrape scrape "https://example.com/page" -d
uv run webscrape scrape "https://example.com" -d -o ~/Desktop/downloads
uv run webscrape scrape "https://example.com" --debug
uv run webscrape info
uv run webscrape install
- Documents: PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX
- Archives: ZIP, RAR, 7Z, TAR, GZ
- Images: JPG, JPEG, PNG, GIF, WEBP, SVG, BMP
- Audio/Video: MP3, MP4, WAV, AVI, MOV, MKV
- Data: TXT, CSV, JSON, XML
By default, files are downloaded to ~/Downloads/web-scraper/.
Each scrape also produces:
- page-screenshot.png - Full page screenshot
- scrape-results.json - Structured data about found content
Works with any JavaScript-rendered site. Tested with:
- Notion (public pages)
- Airtable (public views)
- Other SPAs with embedded files
Known Issue: TaskCards serves JPEG preview thumbnails instead of actual PDF files at their S3 URLs. This is a platform limitation, not a bug in this scraper.
The scraper will:
- Detect when downloaded files don't match their expected type (e.g., a PDF that is actually a JPEG)
- Save files with their correct extension (e.g., .jpg for JPEG images)
- Report which files are actual documents vs preview thumbnails
Workaround: For TaskCards PDFs, download manually from the website by clicking on each file.
- Python 3.11+
- Node.js 18+
- uv (Python package manager)
web-scraper-cli/
├── pyproject.toml # Python CLI config
├── web_scraper_cli/ # Python CLI package
│ ├── __init__.py
│ └── main.py # Click CLI entry point
└── scraper/ # Node.js Puppeteer scraper
    ├── package.json
    └── scraper.js # Core scraping logic
The Python CLI wraps the Node.js Puppeteer scraper, providing a consistent interface with other Dev-Tools CLIs.