A web scraper built using Scrapy, a free open-source web crawling framework written in Python.

hshah032/Scrapy

Scrapy

Overview

A web scraper built using Scrapy, a free open-source web crawling framework written in Python.

  • "Any content that can be viewed on a webpage can be scraped. Period."

Purpose

Fake news is a prevalent problem today. One way to combat it is a machine-learning tool that classifies articles, or parts of articles, as untrue. Machine learning needs a lot of data, however. By building a robust web scraper, I hope to gather the data needed to build a dataset for a fake news detector.

Requirements

  • Python 3.5+
  • Scrapy

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Problems

The scraper has to overcome four distinct defense mechanisms that sites use against automated clients:

  • User agent filtering
  • Obfuscated JavaScript redirects
  • CAPTCHAs
  • Header consistency checks
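
As an illustration of the first and last items, here is a minimal sketch of one possible approach, not necessarily the one this repository uses. It assumes that a site rejects requests whose User-Agent does not match the other headers, so it keeps each User-Agent bundled with its companion headers and always picks a whole bundle. The names `BROWSER_PROFILES`, `pick_headers`, and `is_consistent` are hypothetical, and the header values are example strings only.

```python
import random

# Hypothetical browser profiles (illustrative values, not taken from this repo).
# Each profile bundles a User-Agent with headers a browser of that kind might send.
BROWSER_PROFILES = [
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) "
            "Gecko/20100101 Firefox/115.0"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
    {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
]


def pick_headers(rng=random):
    """Return a copy of one whole profile so the headers stay mutually consistent."""
    return dict(rng.choice(BROWSER_PROFILES))


def is_consistent(headers):
    """Check that a header set matches one known profile exactly (no mixing)."""
    return any(headers == profile for profile in BROWSER_PROFILES)
```

In a Scrapy spider, a dictionary like this could be passed per request, for example `scrapy.Request(url, headers=pick_headers())`, or a single profile could be set project-wide through the `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` settings. Picking a whole profile, instead of rotating only the User-Agent, is what keeps the header set from contradicting itself.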
