Security News Scraper

An automated web scraper that collects cybersecurity news and vulnerability information from reputable sources, extracts CVE identifiers, and provides a dashboard for monitoring the latest threats.

Features

Multi-Source Scraping: Collects news from The Hacker News, Bleeping Computer, Krebs on Security, and SecurityWeek
CVE Extraction: Automatically identifies and extracts CVE identifiers from articles
SQLite Database: Stores articles and CVEs with efficient querying capabilities
Web Dashboard: Flask-based interface for browsing and searching collected data
Scheduled Scraping: Automated scraping with APScheduler
Notification System: Email and Slack alerts for critical vulnerabilities
Search & Filtering: Filter articles by source, date, keywords, and CVEs by severity
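
The CVE extraction step can be pictured as a regular-expression pass over article text. The sketch below is an illustration rather than the code in scraper.py, and the function name extract_cves is hypothetical. It relies only on the CVE ID syntax: the prefix CVE, a four-digit year, and a sequence number of at least four digits.

```python
import re

# CVE ID syntax: "CVE-", a four-digit year, "-", then four or more digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return the unique CVE IDs found in text, upper-cased, in first-seen order."""
    seen = []
    for match in CVE_PATTERN.findall(text):
        cve_id = match.upper()
        if cve_id not in seen:
            seen.append(cve_id)
    return seen


sample = "Patch now: CVE-2021-44228 (Log4Shell) and cve-2023-4863; CVE-2021-44228 is exploited."
print(extract_cves(sample))
```

Normalizing to upper case before de-duplicating means the same identifier written in different cases is stored once.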

Installation

Prerequisites

Python 3.8 or higher
pip (Python package manager)

Setup

Clone or download this repository
Install dependencies:

pip install -r requirements.txt

Initialize the database:

python main.py init-db

Usage

Commands

The application provides several commands through the main entry point:

# Run the scraper once
python main.py scrape

# Start the web dashboard
python main.py dashboard

# Start the scheduled scraper (runs in background)
python main.py scheduler

# Send a test notification
python main.py test-notify

# Show database statistics
python main.py stats

# Initialize/reset the database
python main.py init-db
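
A command set like this is commonly wired up with argparse subcommands. The following is a minimal sketch, not the actual main.py: only the command names come from the list above, and the dispatch body is a placeholder.

```python
import argparse

# Command names are taken from this README; the help strings are paraphrased.
COMMANDS = {
    "scrape": "Run the scraper once",
    "dashboard": "Start the web dashboard",
    "scheduler": "Start the scheduled scraper",
    "test-notify": "Send a test notification",
    "stats": "Show database statistics",
    "init-db": "Initialize/reset the database",
}


def build_parser():
    """Build a parser with one subcommand per supported command."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        subparsers.add_parser(name, help=help_text)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder dispatch: a real entry point would call into the scraper,
    # dashboard, scheduler, notification, or database modules here.
    print(f"would run: {args.command}")
    return args.command


main(["stats"])
```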

Web Dashboard

Start the dashboard with:

python main.py dashboard

The dashboard will be available at http://127.0.0.1:5000

Dashboard Features:

Overview with statistics
Browse all articles with filtering
View CVE database with severity filtering
Search across articles and CVEs
Trigger manual scraping from the web interface

Scheduled Scraping

To run the scraper automatically at regular intervals:

python main.py scheduler

By default, the scraper runs every 60 minutes. You can change this in config.py:

SCRAPE_INTERVAL_MINUTES = 60  # Change to desired interval
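
As an illustration of how an interval like this is typically handed to APScheduler, here is a minimal sketch. It assumes APScheduler 3.x (BlockingScheduler with the "interval" trigger); run_scrape, validated_interval, and start_scheduler are hypothetical names, and the real scheduler.py may be organized differently.

```python
SCRAPE_INTERVAL_MINUTES = 60  # mirrors the value in config.py


def validated_interval(minutes):
    """Reject non-positive or non-integer intervals before scheduling."""
    if not isinstance(minutes, int) or minutes <= 0:
        raise ValueError("SCRAPE_INTERVAL_MINUTES must be a positive integer")
    return minutes


def run_scrape():
    # Placeholder for the real scraping job.
    print("scraping...")


def start_scheduler(minutes=SCRAPE_INTERVAL_MINUTES):
    # Imported here so the rest of the module loads without APScheduler installed.
    from apscheduler.schedulers.blocking import BlockingScheduler

    scheduler = BlockingScheduler()
    # max_instances=1 keeps a slow run from overlapping the next one.
    scheduler.add_job(run_scrape, "interval",
                      minutes=validated_interval(minutes), max_instances=1)
    scheduler.start()  # blocks until interrupted (Ctrl+C)
```

Validating the interval up front turns a typo in config.py into an immediate, readable error instead of a scheduler that never fires.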

Manual Scraping

To scrape news sources once without scheduling:

python main.py scrape

Or run the scraper module directly:

python scraper.py

Configuration

Edit config.py to customize the application:

News Sources

Enable or disable news sources:

NEWS_SOURCES = {
    'the_hacker_news': {
        'name': 'The Hacker News',
        'rss_url': 'https://thehackernews.com/feeds/posts/default',
        'base_url': 'https://thehackernews.com',
        'enabled': True  # Set to False to disable
    },
    # ... other sources
}
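
One way a scraper might consume this dictionary is to select only the enabled entries and cap the number of items taken from each feed. The helper names below are hypothetical, and the second source is a made-up entry included only to show the effect of 'enabled': False.

```python
NEWS_SOURCES = {
    "the_hacker_news": {
        "name": "The Hacker News",
        "rss_url": "https://thehackernews.com/feeds/posts/default",
        "base_url": "https://thehackernews.com",
        "enabled": True,
    },
    # Hypothetical entry, present only to demonstrate a disabled source.
    "example_source": {
        "name": "Example Source",
        "rss_url": "https://example.com/feed",
        "base_url": "https://example.com",
        "enabled": False,
    },
}
MAX_ARTICLES_PER_SOURCE = 20


def enabled_sources(sources):
    """Return (key, settings) pairs for sources whose 'enabled' flag is truthy."""
    return [(key, cfg) for key, cfg in sources.items() if cfg.get("enabled", False)]


def cap_entries(entries, limit=MAX_ARTICLES_PER_SOURCE):
    """Keep at most `limit` entries from a feed, preserving feed order."""
    return list(entries)[:limit]


for key, cfg in enabled_sources(NEWS_SOURCES):
    print(key, cfg["rss_url"])
```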

Scraping Settings

SCRAPE_INTERVAL_MINUTES = 60  # Scraping interval in minutes
MAX_ARTICLES_PER_SOURCE = 20  # Maximum articles to fetch per source

Notification Settings

Configure email notifications:

NOTIFICATIONS = {
    'enabled': True,
    'email': {
        'enabled': True,
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'sender_email': 'your-email@gmail.com',
        'sender_password': 'your-app-password',
        'recipients': ['recipient@example.com']
    }
}
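
For orientation, settings like these map onto Python's standard smtplib and email.message modules roughly as follows. This is a sketch under the assumption that port 587 is used with STARTTLS; build_alert_email and send_alert_email are hypothetical names, not functions confirmed to exist in notifications.py.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(cfg, subject, body):
    """Build a plain-text alert addressed to every configured recipient."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = cfg["sender_email"]
    msg["To"] = ", ".join(cfg["recipients"])
    msg.set_content(body)
    return msg


def send_alert_email(cfg, subject, body):
    """Send the alert over SMTP, upgrading the connection with STARTTLS first."""
    msg = build_alert_email(cfg, subject, body)
    with smtplib.SMTP(cfg["smtp_server"], cfg["smtp_port"], timeout=30) as server:
        server.starttls()
        server.login(cfg["sender_email"], cfg["sender_password"])
        server.send_message(msg)


email_cfg = {
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,
    "sender_email": "your-email@gmail.com",
    "sender_password": "your-app-password",
    "recipients": ["recipient@example.com"],
}
preview = build_alert_email(email_cfg, "[ALERT] CVE-2021-44228", "Details in the dashboard.")
print(preview["To"])
```

Keeping message construction separate from sending makes it possible to inspect an alert without contacting the mail server.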

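Configure Slack notifications by adding a 'slack' entry to the same NOTIFICATIONS dictionary (defining NOTIFICATIONS a second time would overwrite the email settings):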

NOTIFICATIONS = {
    'enabled': True,
    'slack': {
        'enabled': True,
        'webhook_url': 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    }
}
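
Slack incoming webhooks accept an HTTP POST with a JSON body containing a "text" field. The sketch below shows that request using only the standard library to stay dependency-free; the project lists requests as a dependency and may well use it instead. Both function names are hypothetical.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Prepare a POST request carrying the JSON body an incoming webhook expects."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, text, timeout=10):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


req = build_slack_request(
    "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
    "Critical vulnerability reported: CVE-2021-44228",
)
print(req.get_method(), json.loads(req.data))
```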

Alert Thresholds

Configure when to send alerts:

ALERT_THRESHOLDS = {
    'min_cvss_score': 7.0,  # Alert on CVSS >= 7.0
    'keywords': ['critical', 'zero-day', 'ransomware', 'exploit']
}
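
To make the thresholds concrete, here is a sketch of a check that alerts when either condition is met. This README does not say whether the score and keyword conditions are combined with OR or AND, so the OR used here is an assumption, as is the function name should_alert.

```python
ALERT_THRESHOLDS = {
    "min_cvss_score": 7.0,
    "keywords": ["critical", "zero-day", "ransomware", "exploit"],
}


def should_alert(title, summary="", cvss_score=None, thresholds=ALERT_THRESHOLDS):
    """Return True when the CVSS score or any keyword meets the configured threshold."""
    if cvss_score is not None and cvss_score >= thresholds["min_cvss_score"]:
        return True
    # Case-insensitive substring match, so "exploit" also matches "exploited".
    text = f"{title} {summary}".lower()
    return any(keyword.lower() in text for keyword in thresholds["keywords"])


print(should_alert("Vendor ships routine update", cvss_score=5.3))
print(should_alert("New zero-day in popular VPN appliance"))
```

Substring matching is deliberately loose; a stricter variant could match whole words only, at the cost of missing inflected forms.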

Flask Configuration

FLASK_CONFIG = {
    'host': '127.0.0.1',
    'port': 5000,
    'debug': False
}

Project Structure

SecurityNewsScraper/
├── app.py              # Flask web application
├── config.py           # Configuration settings
├── database.py         # SQLite database management
├── main.py             # CLI entry point
├── notifications.py    # Email and Slack notifications
├── scraper.py          # Web scraper and CVE extraction
├── scheduler.py        # Scheduled scraping with APScheduler
├── requirements.txt    # Python dependencies
├── templates/          # HTML templates for Flask
│   ├── base.html
│   ├── index.html
│   ├── articles.html
│   ├── article_detail.html
│   ├── cves.html
│   ├── cve_detail.html
│   └── search.html
└── security_news.db    # SQLite database (created automatically)

Web Scraping Ethics and Legal Considerations

robots.txt

This scraper uses RSS feeds provided by the news sources, which are intended for automated consumption. Always check a website's robots.txt file before scraping.

Rate Limiting

The scraper limits the load it places on news sources by:

Using RSS feeds instead of direct HTML scraping where possible
Limiting the number of articles fetched per source
Respecting the configured scrape interval

Terms of Service

Always review and comply with the terms of service of the websites you scrape. This tool is designed for educational and personal security monitoring purposes.

Attribution

When using scraped data, consider providing attribution to the original sources.

Data Quality Considerations

Official vs. Research

Official CVE announcements: Typically come from NVD or vendor advisories
Security research: May contain preliminary information or unverified claims

This scraper collects both types. Always verify critical information against official sources such as NVD or vendor advisories.

Duplicate Detection

The scraper automatically detects and skips duplicate articles based on URL matching.
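
URL-based de-duplication can be enforced by the database itself. The sketch below assumes a UNIQUE constraint on the URL column together with INSERT OR IGNORE; the two-column schema and the function names are hypothetical and much simpler than whatever database.py actually defines.

```python
import sqlite3


def init_db(conn):
    # Hypothetical minimal schema; the real articles table likely has more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE, title TEXT NOT NULL)"
    )


def save_article(conn, url, title):
    """Insert the article unless its URL is already stored; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
init_db(conn)
first = save_article(conn, "https://example.com/a", "First copy")
second = save_article(conn, "https://example.com/a", "Same URL again")
print(first, second)
```

Letting the constraint reject duplicates avoids a separate lookup query before every insert.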

Data Freshness

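RSS feeds provide near real-time updates, so data freshness is bounded by the scrape interval. With the scheduler running at the default of 60 minutes, a new article reaches the database within about an hour of appearing in the feed.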

Alternative Approaches

RSS Feed Consumption

This project primarily uses RSS feeds, which is the recommended approach because:

Structured data format
Intended for automated consumption
Lower server load
More reliable than HTML scraping
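
To show what "structured data format" means in practice, the sketch below pulls titles, links, and dates out of an RSS 2.0 document with the standard library. The project lists feedparser as a dependency, which also handles Atom and malformed feeds; this stdlib version is only an illustration, and the sample feed is made up.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Security Feed</title>
    <item>
      <title>Example advisory</title>
      <link>https://example.com/advisory-1</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of {title, link, published} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items


print(parse_rss_items(SAMPLE_RSS))
```

Each field sits in a named element, so no CSS selectors or layout assumptions are needed, which is why feed parsing tends to survive site redesigns that break HTML scrapers.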

HTML Scraping

For sites without RSS feeds, BeautifulSoup can be used to parse HTML. The scraper includes fallback HTML parsing for article content.

API Integration

Some security news sources offer APIs. Consider using official APIs when available for more reliable data access.

Threat Intelligence Workflows

Integration with Incident Response

Monitor: Set up scheduled scraping with notifications for critical vulnerabilities
Assess: Use the dashboard to filter CVEs by severity and relevance
Respond: Integrate with your incident response process
Report: Generate reports from the database for stakeholders

Custom Workflows

You can extend this scraper by:

Adding custom notification channels (e.g., Microsoft Teams, Discord)
Integrating with vulnerability management tools
Adding custom data enrichment (e.g., fetching CVSS details from NVD)
Exporting data to SIEM systems
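
As one example of data enrichment, the sketch below looks up a CVSS base score through the NVD CVE API 2.0. The endpoint and the response field names (vulnerabilities, metrics, cvssMetricV31, cvssData) reflect my understanding of that API and should be checked against the NVD documentation before use; the function names are hypothetical, and the sample response is hand-written rather than real API output.

```python
import json
import urllib.parse
import urllib.request

NVD_API_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def extract_cvss(nvd_response):
    """Return (base_score, severity) from an NVD 2.0 response, or (None, None)."""
    for vuln in nvd_response.get("vulnerabilities", []):
        metrics = vuln.get("cve", {}).get("metrics", {})
        # Prefer the newest CVSS v3 metric set that is present.
        for key in ("cvssMetricV31", "cvssMetricV30"):
            for metric in metrics.get(key, []):
                data = metric.get("cvssData", {})
                if "baseScore" in data:
                    return data["baseScore"], data.get("baseSeverity")
    return None, None


def fetch_cvss(cve_id, timeout=15):
    """Query NVD for one CVE ID. Unauthenticated requests are rate limited by NVD."""
    url = f"{NVD_API_URL}?{urllib.parse.urlencode({'cveId': cve_id})}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_cvss(json.load(resp))


# Trimmed, hand-written response in the assumed shape (not real API output).
sample = {"vulnerabilities": [{"cve": {"id": "CVE-0000-0000", "metrics": {
    "cvssMetricV31": [{"cvssData": {"baseScore": 9.8, "baseSeverity": "CRITICAL"}}]}}}]}
print(extract_cvss(sample))
```

A score obtained this way could feed the min_cvss_score threshold in ALERT_THRESHOLDS, since news articles rarely state CVSS scores in a machine-readable form.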

Troubleshooting

Database Locked Error

If you see "database is locked" errors:

Ensure only one instance of the scraper is running
Check that the dashboard isn't holding database connections
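
If the scraper and the dashboard must run at the same time, two SQLite settings usually reduce lock errors: a longer busy timeout and write-ahead logging (WAL). The sketch below shows both. Whether database.py already applies them is not stated in this README, and open_db is a hypothetical helper.

```python
import os
import sqlite3
import tempfile


def open_db(path, busy_timeout_seconds=30.0):
    """Open the database so that brief lock contention waits instead of failing."""
    # timeout: how long a connection waits for a lock before raising
    # "database is locked" (the sqlite3 module default is 5 seconds).
    conn = sqlite3.connect(path, timeout=busy_timeout_seconds)
    # WAL mode lets readers (the dashboard) proceed while one writer (the scraper) commits.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


db_path = os.path.join(tempfile.mkdtemp(), "security_news_demo.db")
conn, mode = open_db(db_path)
print(mode)
conn.close()
```

WAL mode is stored in the database file, so it only needs to be set once, but setting it on every connection is harmless.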

RSS Feed Errors

If scraping fails for a specific source:

Check if the RSS feed URL is still valid
Verify your internet connection
Retry later, since some feeds are temporarily unavailable from time to time

Notification Errors

For email notifications:

Use an app-specific password for Gmail
Check SMTP settings and port
Verify that your firewall isn't blocking outbound SMTP traffic

For Slack notifications:

Ensure the webhook URL is correct
Check that the webhook has proper permissions

Dependencies

beautifulsoup4: HTML parsing
requests: HTTP requests
flask: Web framework for dashboard
apscheduler: Scheduled task execution
lxml: XML/HTML parser
feedparser: RSS feed parsing
python-dateutil: Date parsing

License

This project is provided for educational purposes. Always ensure compliance with applicable laws and website terms of service when using web scrapers.

Contributing

Contributions are welcome! Areas for improvement:

Additional news sources
Enhanced CVE data enrichment
More notification channels
Improved search capabilities
Data export features

Disclaimer

This tool is for educational and security monitoring purposes only. Users are responsible for ensuring their use complies with applicable laws and website terms of service. The authors are not responsible for misuse of this software.
