Web Scraper CLI

CLI tool for scraping and downloading files from JavaScript-heavy pages using Puppeteer.

Why?

Standard web fetching tools fail on Single Page Applications (SPAs) that require JavaScript to render content. This CLI uses Puppeteer (headless Chrome) to:

Render JavaScript - Loads the page in a real browser
Capture authenticated URLs - Intercepts network requests to find signed download URLs
Extract files - Finds PDFs, images, audio, and other downloadable content
Download files - Saves files with proper filenames

Installation

cd ~/Documents/Dev/Dev-Tools/web-scraper-cli
uv sync

The first run automatically installs the Puppeteer dependencies.

Usage

Basic scraping (list files without downloading)

uv run webscrape scrape "https://taskcards.de/board/..."

Download files

uv run webscrape scrape "https://example.com/page" --download
# or
uv run webscrape scrape "https://example.com/page" -d

Custom output directory

uv run webscrape scrape "https://example.com" -d -o ~/Desktop/downloads

Debug mode (shows browser window)

uv run webscrape scrape "https://example.com" --debug

Check installation status

uv run webscrape info

Install dependencies manually

uv run webscrape install

Supported File Types

Documents: PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX
Archives: ZIP, RAR, 7Z, TAR, GZ
Images: JPG, JPEG, PNG, GIF, WEBP, SVG, BMP
Audio/Video: MP3, MP4, WAV, AVI, MOV, MKV
Data: TXT, CSV, JSON, XML
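
As a rough illustration of how the categories above could be applied to a captured URL, here is a minimal Python sketch. The function name classify_url and the FILE_TYPES table are hypothetical and are not taken from this repository; only the extension lists come from this README. The URL is parsed first so that the query string of a signed download URL does not interfere with the extension check.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension groups copied from the list above.
# The dict layout and names are assumptions made for this sketch.
FILE_TYPES = {
    "Documents": {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"},
    "Archives": {"zip", "rar", "7z", "tar", "gz"},
    "Images": {"jpg", "jpeg", "png", "gif", "webp", "svg", "bmp"},
    "Audio/Video": {"mp3", "mp4", "wav", "avi", "mov", "mkv"},
    "Data": {"txt", "csv", "json", "xml"},
}


def classify_url(url: str) -> Optional[str]:
    """Return the category for the URL's file extension, or None if unsupported."""
    # Use only the path component so query parameters are ignored.
    path = urlparse(url).path
    # Take the final suffix, lowercase it, and drop the leading dot.
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    for category, extensions in FILE_TYPES.items():
        if ext in extensions:
            return category
    return None
```

Only the last suffix is considered, so a name ending in .tar.gz is matched through its gz extension.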

Output

By default, files are downloaded to ~/Downloads/web-scraper/.

Each scrape also produces:

page-screenshot.png - Full page screenshot
scrape-results.json - Structured data about found content

Supported Sites

Designed to work with any JavaScript-rendered site (see TaskCards Limitation for one partial exception). Tested with:

Notion (public pages)
Airtable (public views)
Other SPAs with embedded files

TaskCards Limitation

Known Issue: TaskCards serves JPEG preview thumbnails instead of the actual PDF files at its S3 URLs. This is a platform limitation, not a bug in this scraper.

The scraper will:

Detect when downloaded files don't match their expected type (e.g., a .pdf file that is actually a JPEG)
Save files with their correct extension (e.g., .jpg for JPEG images)
Report which files are actual documents vs preview thumbnails
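
One common way to detect such a mismatch is to inspect the first bytes of the file ("magic bytes") rather than trusting the extension. The Python sketch below shows the idea under stated assumptions: the helper names sniff_extension and corrected_name are hypothetical, the signature table is deliberately small, and the actual check in scraper.js may work differently.

```python
from pathlib import PurePath
from typing import Optional

# Leading byte signatures for a few formats (illustrative, not exhaustive).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
]

# Extensions that name the same format as a canonical one.
ALIASES = {".jpeg": ".jpg"}


def sniff_extension(data: bytes) -> Optional[str]:
    """Return the extension implied by the leading bytes, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None


def corrected_name(filename: str, data: bytes) -> str:
    """Return filename with its extension changed to match the sniffed content."""
    actual = sniff_extension(data)
    path = PurePath(filename)
    current = path.suffix.lower()
    current = ALIASES.get(current, current)
    # Keep the name when the content is unrecognised or already matches.
    if actual is None or current == actual:
        return filename
    return str(path.with_suffix(actual))
```

For example, corrected_name("handout.pdf", data) yields "handout.jpg" when data begins with the JPEG signature, which is the TaskCards case described above.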

Workaround: For TaskCards PDFs, download manually from the website by clicking on each file.

Requirements

Python 3.11+
Node.js 18+
uv (Python package manager)

Architecture

web-scraper-cli/
├── pyproject.toml          # Python CLI config
├── web_scraper_cli/        # Python CLI package
│   ├── __init__.py
│   └── main.py             # Click CLI entry point
└── scraper/                # Node.js Puppeteer scraper
    ├── package.json
    └── scraper.js          # Core scraping logic

The Python CLI wraps the Node.js Puppeteer scraper, providing a consistent interface with other Dev-Tools CLIs.
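
The wrapping described above can be pictured as the Python side assembling a node command line and running it as a subprocess. The following is a minimal sketch, not the code in main.py: the function names are hypothetical, and the flags forwarded to scraper.js (--download, --output, --debug) are assumptions that simply mirror the CLI options shown under Usage.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Location of the Node.js entry point, following the tree above.
SCRAPER_JS = Path("scraper") / "scraper.js"


def build_command(
    url: str,
    download: bool = False,
    output_dir: Optional[str] = None,
    debug: bool = False,
) -> List[str]:
    """Assemble the argument list for the Node.js scraper."""
    # NOTE: the flag names below are hypothetical; scraper.js may expect others.
    cmd = ["node", str(SCRAPER_JS), url]
    if download:
        cmd.append("--download")
    if output_dir is not None:
        cmd += ["--output", str(output_dir)]
    if debug:
        cmd.append("--debug")
    return cmd


def run_scraper(url: str, **options) -> int:
    """Run the scraper and return its exit code."""
    if shutil.which("node") is None:
        raise RuntimeError("Node.js was not found on PATH")
    # Passing a list (no shell) keeps URLs with special characters intact.
    return subprocess.run(build_command(url, **options), check=False).returncode
```

Keeping command construction separate from execution makes the argument handling easy to check without launching a browser.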
