This project is a robust web scraping application designed to extract and store news articles from various credible online sources. It utilizes a sophisticated multi-layered approach to fetch content efficiently while handling potential anti-scraping mechanisms. The extracted data is stored in a PostgreSQL database.
The scraper employs several advanced techniques to ensure reliable content extraction:
- Multi-Method Extraction:
  - Uses `newspaper3k` for initial, fast content extraction.
  - Employs Playwright with a headless Chromium browser as a fallback to handle JavaScript-heavy sites, dynamic content loading, and complex DOM structures. Special handling is included for redirect chains (e.g., from aggregators like Google News).
- Advanced Scraping Strategy:
  - Integrates a Special Strategy module (`src/scraper/scraper_hooks/strategies/special_strategy.py`) designed specifically for sites with strong anti-bot protections.
  - Utilizes `StealthyFetcher` (via the `scrapling` library) as a final fallback, providing enhanced stealth capabilities against detection systems.
- Anti-Detection Measures:
  - User-Agent Rotation: Uses a diverse pool of realistic browser user agents.
  - Realistic Headers: Sends browser-like HTTP headers, including referrers.
  - Randomized Delays: Introduces variable delays between requests and actions to mimic human browsing patterns.
  - Fingerprint Consistency: Applies consistent browser fingerprint properties (screen size, platform, etc.) within scraping sessions to avoid detection.
  - Tor Integration: Leverages a Tor proxy service (configured via Docker Compose) for IP address rotation, enhancing anonymity and bypassing IP-based blocks. The Special Strategy module can trigger Tor circuit rotation when protection is detected.
  - Behavior Simulation: Simulates human-like interactions, such as scrolling patterns, within Playwright sessions.
  - Consent/Overlay Handling: Attempts to automatically handle cookie consent dialogs and dismiss overlay elements that might obstruct content.
- Concurrency: Uses Python's `ThreadPoolExecutor` to process multiple articles and sources concurrently, maximizing throughput.
- Robust Error Handling & Retries: Implements retry logic for failed extractions and handles various exceptions gracefully. Includes adaptive delays based on detection history.
- Persistence: Stores extracted article metadata and content in a PostgreSQL database.
- Containerized Environment: Fully containerized using Docker and Docker Compose for easy setup, dependency management, and deployment across different environments. Includes services for the scraper, database (PostgreSQL), database management (pgAdmin), and the Tor proxy.
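The multi-method fallback order described above can be sketched as a chain of extractor callables. This is a minimal illustration with hypothetical names; in the real pipeline the extractors would wrap `newspaper3k`, Playwright, and `StealthyFetcher`:

```python
from typing import Callable, Optional

# Each extractor takes a URL and returns article text, or None on failure.
Extractor = Callable[[str], Optional[str]]

def extract_with_fallback(url: str, extractors: list[Extractor]) -> Optional[str]:
    """Try each extractor in order; return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(url)
        except Exception:
            continue  # treat any extractor error as a miss and fall through
        if text:
            return text
    return None

# Stand-ins for the real chain: newspaper3k -> Playwright -> StealthyFetcher.
fast = lambda url: None           # newspaper3k misses (e.g., JS-heavy page)
rendered = lambda url: "article"  # Playwright renders the page successfully
stealth = lambda url: "article"   # StealthyFetcher, only reached if needed

print(extract_with_fallback("https://example.com", [fast, rendered, stealth]))  # article
```

Ordering extractors from cheapest to most expensive means the slow, browser-based paths only run when the fast path fails.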
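The Special Strategy's decision to escalate (to `StealthyFetcher` or a fresh Tor circuit) hinges on detecting protection in a response. A hypothetical detection heuristic, not taken from the actual module:

```python
# Markers and status codes here are illustrative, not the module's real rules.
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic", "cf-challenge")

def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: decide whether a response indicates anti-bot protection."""
    if status_code in (403, 429, 503):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

# A blocked response would trigger the stealth fallback or Tor circuit rotation.
print(looks_blocked(200, "<html>Please complete the CAPTCHA</html>"))  # True
print(looks_blocked(200, "<html>Normal article text</html>"))          # False
```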
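Several of the anti-detection measures above (user-agent rotation, realistic headers, randomized delays, per-session fingerprint consistency) can be combined in one small session object. All pools, names, and values here are illustrative stand-ins, not the project's actual configuration:

```python
import random
import time

# Small illustrative pools; a real deployment would use larger, current lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]
FINGERPRINTS = {  # keep fingerprint properties consistent with the chosen UA
    "Windows NT 10.0": {"platform": "Win32", "screen": (1920, 1080)},
    "Macintosh": {"platform": "MacIntel", "screen": (1440, 900)},
}

class ScrapeSession:
    """Pick one user agent and matching fingerprint, reuse them all session."""
    def __init__(self) -> None:
        self.user_agent = random.choice(USER_AGENTS)
        key = next(k for k in FINGERPRINTS if k in self.user_agent)
        self.fingerprint = FINGERPRINTS[key]

    def headers(self, referer: str = "https://www.google.com/") -> dict:
        return {
            "User-Agent": self.user_agent,
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": referer,
        }

    def human_delay(self, low: float = 1.5, high: float = 6.0) -> None:
        time.sleep(random.uniform(low, high))  # mimic human pacing

session = ScrapeSession()
print(session.headers()["User-Agent"] == session.user_agent)  # True
```

The key point is that the random choice happens once per session: a user agent claiming Windows paired with a macOS fingerprint mid-session is itself a detection signal.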
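The concurrency and retry behavior can be sketched together as below, assuming a stand-in `scrape` function rather than the project's real extraction pipeline:

```python
import concurrent.futures
import random
import time

def process_with_retries(job, attempts: int = 3, base_delay: float = 0.1):
    """Run a job, retrying with jittered exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return job()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted retries: surface the last error
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

def scrape(url: str) -> str:  # stand-in for a real extraction call
    return f"content from {url}"

urls = [f"https://example.com/{i}" for i in range(5)]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda u: process_with_retries(lambda: scrape(u)), urls))
print(len(results))  # 5
```

Jitter on the backoff keeps concurrent workers from retrying in lockstep, which would itself look bot-like to a target site.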
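Persistence can be sketched as an upsert keyed on the article URL. This example uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; the actual application targets PostgreSQL (where the equivalent would be `INSERT ... ON CONFLICT (url) DO UPDATE` via a driver such as `psycopg2`). Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        url TEXT PRIMARY KEY,
        title TEXT,
        source TEXT,
        content TEXT,
        published_at TEXT
    )
""")

def save_article(conn, article: dict) -> None:
    """Upsert by URL so re-scraped articles do not create duplicate rows."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, source, content, published_at) "
        "VALUES (:url, :title, :source, :content, :published_at)",
        article,
    )
    conn.commit()

save_article(conn, {
    "url": "https://example.com/a1", "title": "Example", "source": "example.com",
    "content": "Body text", "published_at": "2024-01-01",
})
print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])  # 1
```

Keying on the URL makes repeated scraping runs idempotent: re-fetching a known article updates the row instead of inserting a duplicate.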
- Language: Python 3.10
- Core Libraries:
  - `requests`/`httpx` (via `scrapling`): HTTP requests
  - `newspaper3k`: Basic article extraction
  - `BeautifulSoup4`/`lxml`: HTML parsing (dependencies for `newspaper3k`)
  - `Playwright`: Advanced browser automation and rendering
  - `scrapling` (with `StealthyFetcher`): Stealthy, browser-based fetching
  - `NLTK`: Text processing (used by `newspaper3k`)
- Database: PostgreSQL
- Anonymization: Tor Proxy
- Containerization: Docker, Docker Compose
The application is designed to be run using Docker Compose.
- Prerequisites:
- Docker (https://docs.docker.com/get-docker/)
- Docker Compose (https://docs.docker.com/compose/install/)
- Configuration:
  - Review `docker-compose.yml` for service configurations and exposed ports.
  - Environment variables (e.g., `TORPASSWORD`) can be set within `docker-compose.yml` or potentially via an external `.env` file (though one is not strictly required by default).
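For example, the Tor password might be wired through the environment like this (a hypothetical fragment; consult the actual `docker-compose.yml` for the real service names and values):

```yaml
services:
  tor:
    # Service name and default value are illustrative; check docker-compose.yml.
    environment:
      - TORPASSWORD=${TORPASSWORD:-changeme}
```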
- Build and Run:

  ```bash
  docker-compose up --build -d
  ```

  This command builds the necessary Docker images (if not already built) and starts all the defined services (`postgres`, `pgadmin`, `tor`, `scraper`, `api`) in detached mode.
- Accessing Services:
  - pgAdmin: http://localhost:5050
  - API: http://localhost:8000 (if running)
- Operation:
  - The `scraper` service will automatically start processing sources and extracting articles based on the entry point defined in the `Dockerfile` (`src.main`).
  - Logs can be viewed using `docker-compose logs scraper`.
- Stopping:

  ```bash
  docker-compose down
  ```