A web scraper built with Scrapy, a free and open-source web crawling framework written in Python.
- "Any content that can be viewed on a webpage can be scraped. Period."
A prevalent problem facing society today is fake news. It can be combated with a machine-learning-based tool that classifies articles, or parts of articles, as untrue. Machine learning, however, requires a lot of data. By building a robust web scraper, I hope to gather the data needed to build a dataset for a fake news detector.
- Python 3.5+
- Scrapy
The quick way:
pip install scrapy
See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.
The scraper aims to overcome four distinct threat defense mechanisms:
- User agent filtering
- Obfuscated JavaScript redirects
- CAPTCHAs
- Header consistency checks
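As a rough illustration of the first and last items in the list above, the sketch below builds a settings dictionary that sends a browser-like user agent together with headers that match it. `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` are standard Scrapy settings; the helper name `build_scrapy_settings`, the constant `BROWSER_USER_AGENT`, and the specific header values are assumptions made for this example, not this project's actual configuration.

```python
# Illustrative sketch only: the user agent string and header values are
# assumptions, not values taken from this project.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_scrapy_settings(user_agent=BROWSER_USER_AGENT):
    """Return a dict that can be used as a spider's `custom_settings`.

    USER_AGENT sets the User-Agent header sent with each request.
    DEFAULT_REQUEST_HEADERS supplies the other default headers, chosen
    here to look like what a desktop browser would send.
    """
    return {
        "USER_AGENT": user_agent,
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": (
                "text/html,application/xhtml+xml,"
                "application/xml;q=0.9,*/*;q=0.8"
            ),
            "Accept-Language": "en-US,en;q=0.5",
        },
    }


if __name__ == "__main__":
    settings = build_scrapy_settings()
    print(settings["USER_AGENT"])
```

One way to use this would be to assign the returned dict to the `custom_settings` class attribute of a `scrapy.Spider` subclass, or to copy the two keys into the project's `settings.py`. Obfuscated redirects and CAPTCHAs are not addressed by this sketch.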