maxdeveloperpro-dot/SecurityScraper

 
 

Security News Scraper

An automated web scraper that collects cybersecurity news and vulnerability information from reputable sources, extracts CVE identifiers, and provides a dashboard for monitoring the latest threats.

[Dashboard screenshots]

Features

  • Multi-Source Scraping: Collects news from The Hacker News, Bleeping Computer, Krebs on Security, and SecurityWeek
  • CVE Extraction: Automatically identifies and extracts CVE identifiers from articles
  • SQLite Database: Stores articles and CVEs with efficient querying capabilities
  • Web Dashboard: Flask-based interface for browsing and searching collected data
  • Scheduled Scraping: Automated scraping with APScheduler
  • Notification System: Email and Slack alerts for critical vulnerabilities
  • Search & Filtering: Filter articles by source, date, keywords, and CVEs by severity
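
The CVE extraction feature listed above can be illustrated with a short regular-expression sketch. This is not the project's actual code: the names `CVE_PATTERN` and `extract_cves` are hypothetical, and the sketch only assumes the standard identifier shape `CVE-YYYY-NNNN` with four or more digits in the sequence part.

```python
import re

# CVE identifiers have the form CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return unique CVE IDs found in text, upper-cased, in first-seen order."""
    found = []
    for match in CVE_PATTERN.findall(text):
        cve_id = match.upper()
        if cve_id not in found:
            found.append(cve_id)
    return found
```

Matching case-insensitively and normalizing to upper case keeps one canonical form per identifier, which makes later lookups simpler.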

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Setup

  1. Clone or download this repository

  2. Install dependencies:

pip install -r requirements.txt

  3. Initialize the database:

python main.py init-db

Usage

Commands

The application provides several commands through the main entry point:

# Run the scraper once
python main.py scrape

# Start the web dashboard
python main.py dashboard

# Start the scheduled scraper (runs in background)
python main.py scheduler

# Send a test notification
python main.py test-notify

# Show database statistics
python main.py stats

# Initialize/reset the database
python main.py init-db
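
For readers curious how a single entry point can dispatch the commands above, here is a minimal `argparse` sketch. It is an assumption about structure, not the contents of main.py; `COMMANDS` and `build_parser` are hypothetical names.

```python
import argparse

# Command names and descriptions taken from the list above.
COMMANDS = {
    "scrape": "Run the scraper once",
    "dashboard": "Start the web dashboard",
    "scheduler": "Start the scheduled scraper",
    "test-notify": "Send a test notification",
    "stats": "Show database statistics",
    "init-db": "Initialize/reset the database",
}


def build_parser():
    """Build a parser with one subcommand per supported command."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        subparsers.add_parser(name, help=help_text)
    return parser
```

Calling `build_parser().parse_args()` yields a namespace whose `command` attribute can then be mapped to the function that does the work.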

Web Dashboard

Start the dashboard with:

python main.py dashboard

The dashboard will be available at http://127.0.0.1:5000

Dashboard Features:

  • Overview with statistics
  • Browse all articles with filtering
  • View CVE database with severity filtering
  • Search across articles and CVEs
  • Trigger manual scraping from the web interface

Scheduled Scraping

To run the scraper automatically at regular intervals:

python main.py scheduler

By default, the scraper runs every 60 minutes. You can change this in config.py:

SCRAPE_INTERVAL_MINUTES = 60  # Change to desired interval
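
As a rough illustration of how this setting might drive APScheduler, the sketch below registers an interval job. It assumes the APScheduler 3.x API (`BlockingScheduler`, `add_job` with the `"interval"` trigger); `validated_interval` and `run_scheduler` are hypothetical helpers, not functions from scheduler.py.

```python
def validated_interval(minutes):
    """Reject intervals that are not positive whole minutes."""
    if not isinstance(minutes, int) or minutes < 1:
        raise ValueError("interval must be a positive whole number of minutes")
    return minutes


def run_scheduler(job, minutes=60):
    """Run job every `minutes` minutes until the process is interrupted."""
    # Imported lazily so the validation helper works without APScheduler installed.
    from apscheduler.schedulers.blocking import BlockingScheduler

    scheduler = BlockingScheduler()
    scheduler.add_job(job, "interval", minutes=validated_interval(minutes))
    scheduler.start()  # blocks the calling thread
```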

Manual Scraping

To scrape news sources once without scheduling:

python main.py scrape

Or run the scraper module directly:

python scraper.py

Configuration

Edit config.py to customize the application:

News Sources

Enable or disable news sources:

NEWS_SOURCES = {
    'the_hacker_news': {
        'name': 'The Hacker News',
        'rss_url': 'https://thehackernews.com/feeds/posts/default',
        'base_url': 'https://thehackernews.com',
        'enabled': True  # Set to False to disable
    },
    # ... other sources
}
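
One plausible way for the scraper to honor the `enabled` flag is to filter the dictionary before fetching anything. The helper name `enabled_sources` and the second entry (`example_source`, pointing at example.com) are hypothetical and exist only to show the filtering.

```python
NEWS_SOURCES = {
    "the_hacker_news": {
        "name": "The Hacker News",
        "rss_url": "https://thehackernews.com/feeds/posts/default",
        "base_url": "https://thehackernews.com",
        "enabled": True,
    },
    # Hypothetical disabled entry, for illustration only.
    "example_source": {
        "name": "Example Source",
        "rss_url": "https://example.com/feed",
        "base_url": "https://example.com",
        "enabled": False,
    },
}


def enabled_sources(sources):
    """Return only the sources whose 'enabled' flag is true."""
    return {key: cfg for key, cfg in sources.items() if cfg.get("enabled", False)}
```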

Scraping Settings

SCRAPE_INTERVAL_MINUTES = 60  # Scraping interval in minutes
MAX_ARTICLES_PER_SOURCE = 20  # Maximum articles to fetch per source

Notification Settings

Configure email notifications:

NOTIFICATIONS = {
    'enabled': True,
    'email': {
        'enabled': True,
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'sender_email': 'your-email@gmail.com',
        'sender_password': 'your-app-password',
        'recipients': ['recipient@example.com']
    }
}
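
To show how the email settings above map onto an actual send, here is a minimal sketch using only the standard library. It assumes the server on port 587 expects STARTTLS, which is the usual arrangement; `build_alert_email` and `send_alert_email` are hypothetical names, not functions from notifications.py.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(email_cfg, subject, body):
    """Build a plain-text message from the 'email' block of NOTIFICATIONS."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = email_cfg["sender_email"]
    msg["To"] = ", ".join(email_cfg["recipients"])
    msg.set_content(body)
    return msg


def send_alert_email(email_cfg, msg):
    """Send the message over SMTP, upgrading the connection with STARTTLS."""
    with smtplib.SMTP(email_cfg["smtp_server"], email_cfg["smtp_port"], timeout=30) as server:
        server.starttls()
        server.login(email_cfg["sender_email"], email_cfg["sender_password"])
        server.send_message(msg)
```

Keeping message construction separate from sending makes the formatting easy to check without contacting a mail server.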

Configure Slack notifications:

NOTIFICATIONS = {
    'enabled': True,
    'slack': {
        'enabled': True,
        'webhook_url': 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    }
}
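
A Slack incoming webhook accepts a JSON body with a `text` field, so posting an alert can be sketched as below. The project lists `requests` as a dependency; this sketch uses the standard library's `urllib.request` instead so that it stays self-contained, and the function names are hypothetical.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Prepare a POST request carrying a simple text message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, text):
    """Send the message and report whether Slack answered with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```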

Alert Thresholds

Configure when to send alerts:

ALERT_THRESHOLDS = {
    'min_cvss_score': 7.0,  # Alert on CVSS >= 7.0
    'keywords': ['critical', 'zero-day', 'ransomware', 'exploit']
}
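
The README does not spell out how the two thresholds combine. The sketch below assumes an alert fires when either condition holds: the CVSS score meets the minimum, or the title contains one of the keywords. `should_alert` is a hypothetical helper.

```python
ALERT_THRESHOLDS = {
    "min_cvss_score": 7.0,
    "keywords": ["critical", "zero-day", "ransomware", "exploit"],
}


def should_alert(title, cvss_score=None, thresholds=ALERT_THRESHOLDS):
    """Return True if the score or any keyword crosses the alert threshold."""
    if cvss_score is not None and cvss_score >= thresholds["min_cvss_score"]:
        return True
    lowered = title.lower()
    # Assumption: keywords are matched case-insensitively as substrings.
    return any(keyword in lowered for keyword in thresholds["keywords"])
```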

Flask Configuration

FLASK_CONFIG = {
    'host': '127.0.0.1',
    'port': 5000,
    'debug': False
}

Project Structure

SecurityNewsScraper/
├── app.py              # Flask web application
├── config.py           # Configuration settings
├── database.py         # SQLite database management
├── main.py             # CLI entry point
├── notifications.py    # Email and Slack notifications
├── scraper.py          # Web scraper and CVE extraction
├── scheduler.py        # Scheduled scraping with APScheduler
├── requirements.txt    # Python dependencies
├── templates/          # HTML templates for Flask
│   ├── base.html
│   ├── index.html
│   ├── articles.html
│   ├── article_detail.html
│   ├── cves.html
│   ├── cve_detail.html
│   └── search.html
└── security_news.db    # SQLite database (created automatically)

Web Scraping Ethics and Legal Considerations

robots.txt

This scraper uses RSS feeds provided by the news sources, which are intended for automated consumption. Always check a website's robots.txt file before scraping.

Rate Limiting

The scraper limits the load it places on news sources by:

  • Using RSS feeds instead of direct HTML scraping where possible
  • Limiting the number of articles fetched per source
  • Respecting the configured scrape interval

Terms of Service

Always review and comply with the terms of service of the websites you scrape. This tool is designed for educational and personal security monitoring purposes.

Attribution

When using scraped data, consider providing attribution to the original sources.

Data Quality Considerations

Official vs. Research

  • Official CVE announcements: Typically come from NVD or vendor advisories
  • Security research: May contain preliminary information or unverified claims
  • This scraper collects both types; always verify critical information from official sources

Duplicate Detection

The scraper automatically detects and skips duplicate articles based on URL matching.
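
URL-based duplicate detection can be delegated to SQLite with a `UNIQUE` constraint. The sketch below is one way to do it under an assumed, simplified schema; the table layout and the names `create_schema` and `save_article` are hypothetical and may differ from database.py.

```python
import sqlite3


def create_schema(conn):
    """Create a minimal articles table with a unique URL column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, "
        "url TEXT NOT NULL UNIQUE, "
        "title TEXT NOT NULL)"
    )


def save_article(conn, url, title):
    """Insert the article; return False if its URL was already stored."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the insert was ignored.
    return cursor.rowcount == 1
```

Letting the database enforce uniqueness avoids a separate lookup query before each insert.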

Data Freshness

RSS feeds are usually updated shortly after an article is published. With scheduled scraping enabled, stored data is refreshed once per scrape interval (every 60 minutes by default).

Alternative Approaches

RSS Feed Consumption

This project primarily uses RSS feeds, which is the recommended approach because:

  • Structured data format
  • Intended for automated consumption
  • Lower server load
  • More reliable than HTML scraping
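
To make the "structured data" point concrete, the sketch below pulls titles, links, and dates out of an RSS 2.0 document. The project lists feedparser as its RSS parser; this example uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without extra packages, and `parse_rss_items` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

MAX_ARTICLES_PER_SOURCE = 20


def parse_rss_items(xml_text, limit=MAX_ARTICLES_PER_SOURCE):
    """Return up to `limit` articles from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        if len(articles) >= limit:
            break
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return articles
```

No CSS selectors or page-specific logic are needed, which is why a feed tends to survive site redesigns that would break an HTML scraper.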

HTML Scraping

For sites without RSS feeds, BeautifulSoup can be used to parse HTML. The scraper includes fallback HTML parsing for article content.

API Integration

Some security news sources offer APIs. Consider using official APIs when available for more reliable data access.

Threat Intelligence Workflows

Integration with Incident Response

  1. Monitor: Set up scheduled scraping with notifications for critical vulnerabilities
  2. Assess: Use the dashboard to filter CVEs by severity and relevance
  3. Respond: Integrate with your incident response process
  4. Report: Generate reports from the database for stakeholders

Custom Workflows

You can extend this scraper by:

  • Adding custom notification channels (e.g., Microsoft Teams, Discord)
  • Integrating with vulnerability management tools
  • Adding custom data enrichment (e.g., fetching CVSS details from NVD)
  • Exporting data to SIEM systems
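
As an example of the data enrichment idea above, the following sketch looks up a CVSS base score from the NVD CVE API 2.0. The response layout it reads (`vulnerabilities` → `cve` → `metrics` → `cvssMetricV31` → `cvssData` → `baseScore`) reflects my understanding of that API and should be checked against the NVD documentation; both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

NVD_API_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def extract_base_score(nvd_response):
    """Return the first CVSS base score in an NVD response, or None."""
    for vuln in nvd_response.get("vulnerabilities", []):
        metrics = vuln.get("cve", {}).get("metrics", {})
        # Prefer newer CVSS versions when several are present.
        for key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
            for entry in metrics.get(key, []):
                score = entry.get("cvssData", {}).get("baseScore")
                if score is not None:
                    return float(score)
    return None


def fetch_cvss_score(cve_id):
    """Query NVD for one CVE and return its base score, if any."""
    query = urllib.parse.urlencode({"cveId": cve_id})
    with urllib.request.urlopen(f"{NVD_API_URL}?{query}", timeout=30) as response:
        return extract_base_score(json.load(response))
```

NVD rate-limits unauthenticated clients, so an extension like this would likely need to space out its requests.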

Troubleshooting

Database Locked Error

If you see "database is locked" errors:

  • Ensure only one instance of the scraper is running
  • Check that the dashboard isn't holding database connections
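
If the scraper and the dashboard need to run at the same time, two SQLite settings often reduce lock errors: a longer busy timeout and write-ahead logging. This is a general SQLite technique, not something the project is known to do; `open_connection` is a hypothetical helper.

```python
import sqlite3


def open_connection(db_path="security_news.db", busy_timeout_seconds=30):
    """Open the database with a busy timeout and WAL journaling."""
    # timeout: how long to wait for a lock before raising "database is locked".
    conn = sqlite3.connect(db_path, timeout=busy_timeout_seconds)
    # WAL lets readers continue while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```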

RSS Feed Errors

If scraping fails for a specific source:

  • Check if the RSS feed URL is still valid
  • Verify your internet connection
  • Some feeds may be temporarily unavailable; retry after a few minutes

Notification Errors

For email notifications:

  • Use an app-specific password for Gmail
  • Check SMTP settings and port
  • Verify that your firewall isn't blocking outbound SMTP traffic

For Slack notifications:

  • Ensure the webhook URL is correct
  • Check that the webhook has proper permissions

Dependencies

  • beautifulsoup4: HTML parsing
  • requests: HTTP requests
  • flask: Web framework for dashboard
  • apscheduler: Scheduled task execution
  • lxml: XML/HTML parser
  • feedparser: RSS feed parsing
  • python-dateutil: Date parsing

License

This project is provided for educational purposes. Always ensure compliance with applicable laws and website terms of service when using web scrapers.

Contributing

Contributions are welcome! Areas for improvement:

  • Additional news sources
  • Enhanced CVE data enrichment
  • More notification channels
  • Improved search capabilities
  • Data export features

Disclaimer

This tool is for educational and security monitoring purposes only. Users are responsible for ensuring their use complies with applicable laws and website terms of service. The authors are not responsible for misuse of this software.
