A lightweight Node.js tool for web crawling and image extraction. This tool parses website sitemaps, extracts images from all pages, and provides a local gallery interface to browse and optionally download the extracted images.
- Sitemap Parsing: Automatically fetches and parses a website's sitemap.xml
- Image Extraction: Crawls all URLs from the sitemap and extracts image sources
- Optional Download: Save all extracted images locally with organized naming
- Gallery Interface: Built-in HTTP server with a clean gallery UI to browse extracted images
- Page Organization: Images are grouped by their source page URL
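The sitemap-parsing step above can be sketched as a small pure function. This is an illustrative sketch, not the actual export of `crawl.js`; the function name and regex are assumptions:

```javascript
// Hypothetical sketch: extract page URLs from a sitemap.xml body by
// matching every <loc>…</loc> entry. Real sitemaps may also nest
// sitemap indexes, which this sketch does not handle.
function parseSitemap(xml) {
  return [...xml.matchAll(/<loc>\s*(.*?)\s*<\/loc>/g)].map((m) => m[1]);
}

const sitemap = `<?xml version="1.0"?>
<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

console.log(parseSitemap(sitemap));
// → [ 'https://example.com/', 'https://example.com/about' ]
```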
```bash
# Clone the repository
git clone https://github.com/hiimdjango/node-web-tools.git

# Navigate to the project directory
cd node-web-tools

# Install dependencies (if any are added in the future)
npm install
```

Extract images and view them in a local gallery without downloading:

```bash
npm start -- --url=https://example.com
```

Or directly with node:

```bash
node crawl.js --url=https://example.com
```

Extract images and save them to the `./output/images/` directory:

```bash
npm run extract -- --url=https://example.com
```

Or directly with node:

```bash
node crawl.js --url=https://example.com --download
```

After running either command, the gallery will be available at:

```
http://localhost:3001
```
The homepage displays a list of all crawled pages. Click on any page to view its extracted images.
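A minimal sketch of the gallery markup that `ui-helpers.js` might emit for one crawled page; the helper name and tag structure here are assumptions, not the tool's actual output:

```javascript
// Hypothetical sketch: render a gallery page for one crawled URL.
function renderGallery(pageUrl, imageSrcs) {
  const items = imageSrcs
    .map((src) => `<figure><img src="${src}" loading="lazy"></figure>`)
    .join('\n');
  return `<h1>Images from ${pageUrl}</h1>\n${items}`;
}

console.log(renderGallery('https://example.com/about', ['/a.png', '/b.jpg']));
```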
```
node-web-tools/
├── crawl.js            # Main entry point - handles sitemap parsing and server
├── download-helpers.js # Image downloading functionality
├── ui-helpers.js       # HTML generation for gallery interface
├── package.json        # Project metadata and scripts
└── output/             # Generated directory for downloaded images
    └── images/
```
- Sitemap Parsing: The tool fetches the sitemap.xml from the target URL and extracts all page URLs
- Page Crawling: Each URL is fetched and parsed for image sources using regex matching
- Image Processing:
  - Images are displayed in the gallery interface
  - If the `--download` flag is used, images are downloaded to `./output/images/`
- Gallery Server: An HTTP server runs on port 3001, providing:
- Homepage with links to all crawled pages
- Individual gallery pages showing all images from each URL
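The page-crawling step, which the tool performs with regex matching over fetched HTML, can be sketched like this. The function name is illustrative, and a regex-based extractor is a simplification (it will miss `srcset` and dynamically inserted images):

```javascript
// Hypothetical sketch: capture the src attribute of every <img> tag,
// accepting single- or double-quoted values.
function extractImageSources(html) {
  return [...html.matchAll(/<img[^>]*\ssrc=["']([^"']+)["']/gi)].map((m) => m[1]);
}

const html = `
  <img src="/logo.png" alt="logo">
  <img class="hero" src='https://cdn.example.com/hero.jpg'>
`;

console.log(extractImageSources(html));
// → [ '/logo.png', 'https://cdn.example.com/hero.jpg' ]
```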
- Node.js (ES Modules support required)
- Target website must have a sitemap.xml file
- `--url`: (Required) The target website URL to crawl
- `--download`: (Optional) Download images to local storage
When using the `--download` flag, images are saved to:

```
./output/images/
```

Images are named using the pattern: `{page-url-slug}_{index}.{extension}`
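The naming pattern can be sketched as follows. The exact slug rules used by `download-helpers.js` may differ; the function name and fallback slug here are assumptions:

```javascript
// Hypothetical sketch: build a {page-url-slug}_{index}.{extension}
// filename from a page URL and an image source path.
function imageFilename(pageUrl, index, src) {
  const slug = new URL(pageUrl).pathname
    .replace(/^\/|\/$/g, '')      // trim leading/trailing slashes
    .replace(/[^a-z0-9]+/gi, '-') // collapse non-alphanumerics to dashes
    || 'home';                    // root page gets a fallback slug
  const ext = src.split('.').pop().split(/[?#]/)[0]; // extension sans query string
  return `${slug}_${index}.${ext}`;
}

console.log(imageFilename('https://example.com/blog/post-1/', 0, '/img/cover.jpg?v=2'));
// → 'blog-post-1_0.jpg'
```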
MIT
hiimdjango