An automated web scraper that collects cybersecurity news and vulnerability information from reputable sources, extracts CVE identifiers, and provides a dashboard for monitoring the latest threats.
- Multi-Source Scraping: Collects news from The Hacker News, Bleeping Computer, Krebs on Security, and SecurityWeek
- CVE Extraction: Automatically identifies and extracts CVE identifiers from articles
- SQLite Database: Stores articles and CVEs with efficient querying capabilities
- Web Dashboard: Flask-based interface for browsing and searching collected data
- Scheduled Scraping: Automated scraping with APScheduler
- Notification System: Email and Slack alerts for critical vulnerabilities
- Search & Filtering: Filter articles by source, date, keywords, and CVEs by severity
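The CVE extraction step can be pictured as a regular-expression pass over article text. The sketch below is a minimal illustration, not the code in scraper.py; the names `CVE_PATTERN` and `extract_cves` are hypothetical. It assumes the standard identifier shape of `CVE-`, a four-digit year, and a sequence number of four or more digits.

```python
import re
from typing import List

# CVE IDs look like CVE-<4-digit year>-<4 or more digits>.
# Names here are illustrative and may differ from scraper.py.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text: str) -> List[str]:
    """Return unique CVE IDs in order of first appearance, upper-cased."""
    seen = set()
    result = []
    for match in CVE_PATTERN.finditer(text):
        cve_id = match.group(0).upper()
        if cve_id not in seen:
            seen.add(cve_id)
            result.append(cve_id)
    return result
```

Normalizing to upper case before de-duplicating means `cve-2021-44228` and `CVE-2021-44228` are stored as one identifier.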
- Python 3.8 or higher
- pip (Python package manager)
- Clone or download this repository
- Install dependencies:

pip install -r requirements.txt

- Initialize the database:

python main.py init-db

The application provides several commands through the main entry point:
# Run the scraper once
python main.py scrape
# Start the web dashboard
python main.py dashboard
# Start the scheduled scraper (runs in background)
python main.py scheduler
# Send a test notification
python main.py test-notify
# Show database statistics
python main.py stats
# Initialize/reset the database
python main.py init-db

Start the dashboard with:

python main.py dashboard

The dashboard will be available at http://127.0.0.1:5000
Dashboard Features:
- Overview with statistics
- Browse all articles with filtering
- View CVE database with severity filtering
- Search across articles and CVEs
- Trigger manual scraping from the web interface
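Filtering of the kind the dashboard offers usually comes down to a parameterized SQL query. The following is a rough sketch under an assumed schema (an `articles` table with `title`, `source`, and `published` columns holding ISO dates); the real schema in database.py and the function name `search_articles` may differ.

```python
import sqlite3
from typing import List, Optional, Tuple


def search_articles(
    conn: sqlite3.Connection,
    source: Optional[str] = None,
    keyword: Optional[str] = None,
    since: Optional[str] = None,
) -> List[Tuple[str, str, str]]:
    """Filter articles by source, title keyword, and earliest date.

    Assumes a hypothetical table articles(title, source, published)
    where published is an ISO 8601 date string such as '2024-05-01'.
    """
    clauses: List[str] = []
    params: List[str] = []
    if source:
        clauses.append("source = ?")
        params.append(source)
    if keyword:
        # Placeholders keep user input out of the SQL text itself.
        clauses.append("title LIKE ?")
        params.append(f"%{keyword}%")
    if since:
        # ISO date strings sort correctly as plain text.
        clauses.append("published >= ?")
        params.append(since)

    sql = "SELECT title, source, published FROM articles"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY published DESC"
    return conn.execute(sql, params).fetchall()
```

Building the `WHERE` clause from fixed fragments and passing values as parameters avoids SQL injection from search input.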
To run the scraper automatically at regular intervals:
python main.py scheduler

By default, the scraper runs every 60 minutes. You can change this in config.py:

SCRAPE_INTERVAL_MINUTES = 60 # Change to desired interval

To scrape news sources once without scheduling:

python main.py scrape

Or run the scraper module directly:

python scraper.py

Edit config.py to customize the application:
Enable or disable news sources:
NEWS_SOURCES = {
    'the_hacker_news': {
        'name': 'The Hacker News',
        'rss_url': 'https://thehackernews.com/feeds/posts/default',
        'base_url': 'https://thehackernews.com',
        'enabled': True  # Set to False to disable
    },
    # ... other sources
}

SCRAPE_INTERVAL_MINUTES = 60  # Scraping interval in minutes
MAX_ARTICLES_PER_SOURCE = 20  # Maximum articles to fetch per source

Configure email notifications:
NOTIFICATIONS = {
    'enabled': True,
    'email': {
        'enabled': True,
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'sender_email': 'your-email@gmail.com',
        'sender_password': 'your-app-password',
        'recipients': ['recipient@example.com']
    }
}

Configure Slack notifications:
NOTIFICATIONS = {
    'enabled': True,
    'slack': {
        'enabled': True,
        'webhook_url': 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    }
}

Configure when to send alerts:
ALERT_THRESHOLDS = {
    'min_cvss_score': 7.0,  # Alert on CVSS >= 7.0
    'keywords': ['critical', 'zero-day', 'ransomware', 'exploit']
}

FLASK_CONFIG = {
    'host': '127.0.0.1',
    'port': 5000,
    'debug': False
}

SecurityNewsScraper/
├── app.py # Flask web application
├── config.py # Configuration settings
├── database.py # SQLite database management
├── main.py # CLI entry point
├── notifications.py # Email and Slack notifications
├── scraper.py # Web scraper and CVE extraction
├── scheduler.py # Scheduled scraping with APScheduler
├── requirements.txt # Python dependencies
├── templates/ # HTML templates for Flask
│ ├── base.html
│ ├── index.html
│ ├── articles.html
│ ├── article_detail.html
│ ├── cves.html
│ ├── cve_detail.html
│ └── search.html
└── security_news.db # SQLite database (created automatically)
This scraper uses RSS feeds provided by the news sources, which are intended for automated consumption. Always check a website's robots.txt file before scraping.
The scraper keeps its request volume low by:
- Using RSS feeds instead of direct HTML scraping where possible
- Limiting the number of articles fetched per source
- Respecting the configured scrape interval
Always review and comply with the terms of service of the websites you scrape. This tool is designed for educational and personal security monitoring purposes.
When using scraped data, consider providing attribution to the original sources.
- Official CVE announcements: Typically come from NVD or vendor advisories
- Security research: May contain preliminary information or unverified claims
- This scraper collects both types; always verify critical information from official sources
The scraper automatically detects and skips duplicate articles based on URL matching.
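One common way to get URL-based de-duplication in SQLite is a `UNIQUE` constraint combined with `INSERT OR IGNORE`. The sketch below shows that approach under an assumed schema; the table layout and the helper names `init_db` and `save_article` are hypothetical, and database.py may implement the check differently (for example with an explicit `SELECT` first).

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical articles table with a unique URL column."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE,
            title TEXT NOT NULL,
            source TEXT NOT NULL
        )
        """
    )


def save_article(conn: sqlite3.Connection, url: str, title: str, source: str) -> bool:
    """Insert an article; return False if the URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, source) VALUES (?, ?, ?)",
        (url, title, source),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when it was ignored.
    return cur.rowcount == 1
```

Letting the database enforce uniqueness keeps the check atomic, so two scraper runs cannot both insert the same URL.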
RSS feeds provide near real-time updates. The scheduled scraping ensures regular data refreshes.
This project primarily uses RSS feeds, which is the recommended approach because:
- Structured data format
- Intended for automated consumption
- Lower server load
- More reliable than HTML scraping
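To illustrate why a structured feed is easier to consume than HTML, here is a small sketch that reads items from an RSS 2.0 document using only the standard library. The project itself lists feedparser as a dependency, so this is a stand-in for illustration; `SAMPLE_RSS` is made-up data and `parse_rss_items` is a hypothetical helper. Atom feeds use different element names and would need separate handling.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Made-up sample feed for illustration only.
SAMPLE_RSS = """
<rss version="2.0">
  <channel>
    <title>Example Security Feed</title>
    <item>
      <title>Vendor releases security update</title>
      <link>https://example.com/a</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Researchers describe phishing campaign</title>
      <link>https://example.com/b</link>
      <pubDate>Tue, 02 Jan 2024 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text: str, max_items: int = 20) -> List[Dict[str, str]]:
    """Return up to max_items entries with title, link, and published fields."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for item in root.iterfind("./channel/item"):
        items.append(
            {
                "title": item.findtext("title", default="").strip(),
                "link": item.findtext("link", default="").strip(),
                "published": item.findtext("pubDate", default="").strip(),
            }
        )
        if len(items) >= max_items:
            break
    return items
```

The `max_items` parameter mirrors the idea behind `MAX_ARTICLES_PER_SOURCE`: only a bounded number of entries per feed is processed.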
For sites without RSS feeds, BeautifulSoup can be used to parse HTML. The scraper includes fallback HTML parsing for article content.
Some security news sources offer APIs. Consider using official APIs when available for more reliable data access.
- Monitor: Set up scheduled scraping with notifications for critical vulnerabilities
- Assess: Use the dashboard to filter CVEs by severity and relevance
- Respond: Integrate with your incident response process
- Report: Generate reports from the database for stakeholders
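For the Monitor step, the alert decision can be sketched as a check against the `ALERT_THRESHOLDS` values from config.py. This example assumes an alert fires when either the CVSS score meets the minimum or a configured keyword appears in the title; the actual rule in notifications.py may combine them differently, and `should_alert` is a hypothetical name.

```python
from typing import Optional

# Same values as the ALERT_THRESHOLDS example in config.py.
ALERT_THRESHOLDS = {
    "min_cvss_score": 7.0,
    "keywords": ["critical", "zero-day", "ransomware", "exploit"],
}


def should_alert(
    title: str,
    cvss_score: Optional[float] = None,
    thresholds: Optional[dict] = None,
) -> bool:
    """Return True if the score or any keyword crosses the threshold.

    Assumption: score and keywords are combined with OR.
    """
    thresholds = thresholds or ALERT_THRESHOLDS
    if cvss_score is not None and cvss_score >= thresholds["min_cvss_score"]:
        return True
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in thresholds["keywords"])
```

With these defaults, `should_alert("New zero-day in browser")` is true on the keyword alone, while an article scored 4.3 with no matching keyword is skipped.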
You can extend this scraper by:
- Adding custom notification channels (e.g., Microsoft Teams, Discord)
- Integrating with vulnerability management tools
- Adding custom data enrichment (e.g., fetching CVSS details from NVD)
- Exporting data to SIEM systems
If you see "database is locked" errors:
- Ensure only one instance of the scraper is running
- Check that the dashboard isn't holding database connections
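If both the scraper and the dashboard need the database at the same time, two general SQLite settings can reduce lock errors: a longer busy timeout and write-ahead logging (WAL). The sketch below shows how they could be applied; it is not taken from database.py, and `open_connection` is a hypothetical helper.

```python
import sqlite3


def open_connection(db_path: str = "security_news.db", timeout_seconds: float = 30.0) -> sqlite3.Connection:
    """Open SQLite with a busy timeout and WAL journaling.

    timeout: how long a connection waits for a lock before raising
    "database is locked". WAL lets readers proceed while one writer
    is active.
    """
    conn = sqlite3.connect(db_path, timeout=timeout_seconds)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

WAL mode is stored in the database file, so it only needs to be set once, but setting it on every connection is harmless.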
If scraping fails for a specific source:
- Check if the RSS feed URL is still valid
- Verify your internet connection
- Some feeds may temporarily be unavailable
For email notifications:
- Use an app-specific password for Gmail
- Check SMTP settings and port
- Verify that your firewall isn't blocking outbound SMTP traffic
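When debugging email delivery, it can help to test the SMTP settings in isolation. The sketch below assumes the port 587 setup from the configuration example, which uses STARTTLS; the function names are hypothetical and notifications.py may be structured differently.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_alert_email(sender: str, recipients: List[str], subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text alert message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert_email(msg: EmailMessage, smtp_server: str, smtp_port: int, password: str) -> None:
    """Send the message over SMTP, upgrading the connection with STARTTLS."""
    with smtplib.SMTP(smtp_server, smtp_port, timeout=30) as server:
        server.starttls()  # Port 587 expects STARTTLS before login.
        server.login(msg["From"], password)
        server.send_message(msg)
```

A failure at `starttls()` usually points to a port or firewall problem, while a failure at `login()` points to the credentials, such as a missing app password.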
For Slack notifications:
- Ensure the webhook URL is correct
- Check that the webhook has proper permissions
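A quick way to confirm that a webhook URL works is to post a minimal message to it. Slack incoming webhooks accept a JSON body with a `text` field. The sketch below uses only the standard library, whereas the project lists requests as a dependency and likely uses it instead; both function names are hypothetical.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a minimal Slack webhook payload."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # for example when the webhook URL is wrong or revoked.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_slack_alert` with your real webhook URL should return 200 and post the text to the configured channel.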
- beautifulsoup4: HTML parsing
- requests: HTTP requests
- flask: Web framework for dashboard
- apscheduler: Scheduled task execution
- lxml: XML/HTML parser
- feedparser: RSS feed parsing
- python-dateutil: Date parsing
This project is provided for educational purposes. Always ensure compliance with applicable laws and website terms of service when using web scrapers.
Contributions are welcome! Areas for improvement:
- Additional news sources
- Enhanced CVE data enrichment
- More notification channels
- Improved search capabilities
- Data export features
This tool is for educational and security monitoring purposes only. Users are responsible for ensuring their use complies with applicable laws and website terms of service. The authors are not responsible for misuse of this software.



