korhanyuzbas/python-articlecrawler


Article Crawler

Article Crawler is a tool for crawling articles from websites. It supports Python 3.5 or newer.

Features

  • Crawling articles from HTML websites (PDF content support is still in progress).
  • Collecting the image URLs of each article.
  • Exporting article data to JSON or SQL (only SQLite3 is supported).
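
To illustrate the last two features, here is a minimal sketch using only the Python standard library. It is not the project's actual implementation: the names `ImageURLCollector`, `collect_image_urls`, `export_json`, and `export_sqlite`, as well as the article fields (`url`, `title`, `body`, `image_urls`) and the `articles` table layout, are assumptions made for this example.

```python
import json
import sqlite3
from html.parser import HTMLParser


class ImageURLCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip <img> tags without src.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def collect_image_urls(html):
    """Return the image URLs found in an HTML string, in document order."""
    collector = ImageURLCollector()
    collector.feed(html)
    return collector.image_urls


def export_json(article):
    """Serialize an article dict to a JSON string."""
    return json.dumps(article, ensure_ascii=False, indent=2)


def export_sqlite(article, db_path=":memory:"):
    """Store an article in an SQLite3 database and return the connection.

    The table layout is an assumption; image URLs are stored as JSON text.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT, image_urls TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
        (
            article["url"],
            article["title"],
            article["body"],
            json.dumps(article["image_urls"]),
        ),
    )
    conn.commit()
    return conn
```

Storing the image URLs as a JSON string keeps the sketch to a single table; a real schema might prefer a separate table for images.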

Usage

Check out the code:

git clone https://github.com/korhanyuzbas/python-articlecrawler.git
cd python-articlecrawler

Set up a virtualenv, install the requirements, and run the crawler:

virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt
python main.py <URL> --export=sql

Test

python -m unittest

TODO

  • Support for SQL databases other than SQLite3.
  • Better documentation.
  • Performance improvements.
  • Test cases.
