A Python tool to crawl websites, extract text and metadata, and save results in PDF and TXT formats.
- Python 3.6+
- Required packages:
  - requests
  - beautifulsoup4
  - tqdm
  - reportlab
- Install Python 3.6 or higher from python.org.
- Install the required packages using pip:

  ```
  pip install requests beautifulsoup4 tqdm reportlab
  ```

- Download or copy the `scrape_site.py` script to your desired directory.
Run the tool from the command line:

```
python scrape_site.py <URL> [--depth DEPTH] [--max_pages MAX_PAGES] [--output OUTPUT_DIR]
```

- `URL`: The website URL to scrape (required). It may include `http://` or `https://`, or be a bare domain (which defaults to `https://`).
- `--depth DEPTH`: Maximum crawl depth (default: 2).
- `--max_pages MAX_PAGES`: Maximum number of pages to scrape (default: 100).
- `--output OUTPUT_DIR`: Output directory and base filename (default: "output"). Creates `OUTPUT_DIR/OUTPUT_DIR.txt` and `OUTPUT_DIR/OUTPUT_DIR.pdf`.
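These flags map directly onto Python's standard argparse module. A minimal sketch of how the CLI could be wired up (illustrative only; `parse_args` is a hypothetical helper, not necessarily how scrape_site.py is organized internally):

```python
import argparse

def parse_args():
    # Hypothetical sketch of the CLI described above.
    parser = argparse.ArgumentParser(
        description="Crawl a website and save text/metadata as TXT and PDF.")
    parser.add_argument("url",
                        help="Website URL to scrape; bare domains default to https://")
    parser.add_argument("--depth", type=int, default=2,
                        help="Maximum crawl depth (default: 2)")
    parser.add_argument("--max_pages", type=int, default=100,
                        help="Maximum number of pages to scrape (default: 100)")
    parser.add_argument("--output", default="output",
                        help="Output directory and base filename (default: output)")
    args = parser.parse_args()
    # Normalize bare domains to https://, as described above.
    if not args.url.startswith(("http://", "https://")):
        args.url = "https://" + args.url
    return args

if __name__ == "__main__":
    print(parse_args())
```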
- Basic usage:

  ```
  python scrape_site.py https://example.com
  ```

  This scrapes example.com with the default settings (depth 2, max 100 pages) and saves the results to "output/output.txt" and "output/output.pdf".
- Custom depth and output:

  ```
  python scrape_site.py https://website.com --depth 3 --output mysite
  ```

  This crawls website.com up to depth 3 and saves the results to "mysite/mysite.txt" and "mysite/mysite.pdf".
- Respects robots.txt when possible (a sketch of this check follows this list)
- Extracts visible text, titles, headings, links, alt text, and metadata (see the extraction sketch below)
- Follows internal links and crawls only pages within the same domain (see the crawl-loop sketch below)
- Skips scripts, styles, ads, forms, and iframes
- Rate limiting (1 second between requests) to be polite to servers
- Progress indicator showing crawling progress
- Error handling for failed requests
- Generates clean PDF with structured content
- Generates plain TXT file with raw text
- Handles pagination through internal links (if pages link to each other)
- Adapts to different HTML structures by focusing on text extraction
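For the robots.txt check, the standard library's urllib.robotparser is enough. A minimal sketch, assuming a permissive fallback when robots.txt cannot be read (the actual script may make a stricter choice):

```python
from urllib import robotparser
from urllib.parse import urlparse, urlunparse

def can_fetch(url, user_agent="*"):
    # Consult the site's robots.txt before requesting a page.
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except Exception:
        # robots.txt unreachable: this sketch allows the fetch;
        # the real script may be stricter.
        return True
    return rp.can_fetch(user_agent, url)
```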
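The extraction step can be sketched with beautifulsoup4 from the requirements list. The `extract_page` name and the dictionary shape below are assumptions for illustration, not the script's actual internals:

```python
from bs4 import BeautifulSoup

def extract_page(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content elements, as the feature list describes.
    # (Ad detection is site-specific and omitted here.)
    for tag in soup(["script", "style", "form", "iframe"]):
        tag.decompose()
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "alt_texts": [img["alt"] for img in soup.find_all("img", alt=True)],
        "meta": {m["name"]: m.get("content", "")
                 for m in soup.find_all("meta") if m.get("name")},
        "text": soup.get_text(separator="\n", strip=True),
    }
```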
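Same-domain crawling, rate limiting, the progress indicator, and error handling all fit in one breadth-first loop. A sketch using requests, tqdm, and beautifulsoup4, with an assumed `crawl` helper name and simplified queueing:

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def crawl(start_url, depth=2, max_pages=100, delay=1.0):
    # Illustrative breadth-first crawl; the real script's loop may differ.
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen, pages = {start_url}, {}
    with tqdm(total=max_pages, desc="Crawling") as bar:
        while queue and len(pages) < max_pages:
            url, d = queue.popleft()
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # error handling: skip failed requests and move on
            pages[url] = resp.text
            bar.update(1)
            time.sleep(delay)  # rate limiting: be polite to the server
            if d < depth:
                for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                    nxt = urljoin(url, a["href"])
                    # Stay on the same domain; skip already-queued pages.
                    if urlparse(nxt).netloc == domain and nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, d + 1))
    return pages
```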
- TXT file: contains all of the raw text organized by page, with section headers for the different types of content
- PDF file: a clean, readable document with page titles as headings, body text, and organized sections for headings, metadata, links, and alt text (a generation sketch follows)
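On the PDF side, reportlab's Platypus layer is one straightforward way to produce the structure described above. A minimal sketch, assuming pages arrive as a dict of extracted fields (an illustrative shape, not the script's actual data model):

```python
from xml.sax.saxutils import escape

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

def write_pdf(pages, path):
    # pages: {url: {"title": ..., "text": ...}}, e.g. from an
    # extraction step like the sketch above.
    styles = getSampleStyleSheet()
    story = []
    for url, page in pages.items():
        # Page title (or URL as a fallback) becomes a heading.
        story.append(Paragraph(escape(page.get("title") or url), styles["Heading1"]))
        for line in page.get("text", "").splitlines():
            if line.strip():
                story.append(Paragraph(escape(line), styles["BodyText"]))
        story.append(Spacer(1, 12))  # visual gap between pages
    SimpleDocTemplate(path, pagesize=letter).build(story)
```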
- Does not extract images, videos, or other non-text assets
- May not handle JavaScript-rendered content (rendering it requires a browser-automation library such as Selenium; see the sketch after this list)
- Crawls only internal links within the same domain
- Rate limited to 1 request per second
- Respects robots.txt, but cannot guarantee compliance if the site's rules change during or after a crawl
- PDF generation may not handle complex layouts perfectly
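If JavaScript-rendered pages matter for your use case, a browser-driven fetch is the usual workaround. A minimal Selenium sketch, separate from this tool and with a deliberately naive wait (assumes Chrome and chromedriver are installed):

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_rendered(url):
    # Return the page's HTML after the browser has run its JavaScript.
    opts = Options()
    opts.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(2)  # naive wait for dynamic content to render
        return driver.page_source
    finally:
        driver.quit()
```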
- Ensure you have permission to scrape websites
- Respect robots.txt files
- Follow each website's Terms of Service
- Be considerate with crawling frequency
- This tool is for educational/research purposes only