hannu/crawl

Crawl, aka UutisDiff

Crawl is an experimental project that tracks popular Finnish news sites and reveals what changes have been made to articles since they were published.

Its modular structure also supports external site parsers.

Git is used to store the articles. Article changes can be browsed with standard Git tools, but the project also contains a web front end built on Sinatra and Backbone.js.
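
The idea of storing articles in Git can be illustrated with a minimal sketch. This is not the project's actual code (which uses the grit gem); it shells out to the git command line instead, and the helper names `git` and `store_article`, the committer identity, and the file layout are all assumptions made for illustration. It writes an article's text to a file in the repository and commits only when the text has changed, so each commit corresponds to one detected edit.

```ruby
require "open3"
require "fileutils"

# Hypothetical helper (not part of Crawl): run a git command inside
# repo_path and return its combined output, raising on failure.
def git(repo_path, *args)
  out, status = Open3.capture2e("git", "-C", repo_path, *args)
  raise "git #{args.join(' ')} failed: #{out}" unless status.success?

  out
end

# Hypothetical helper: write the article text to a file in the repository
# and commit it only if it differs from the last committed version.
# Returns true when a new commit was created, false otherwise.
def store_article(repo_path, relative_path, text)
  full_path = File.join(repo_path, relative_path)
  FileUtils.mkdir_p(File.dirname(full_path))
  File.write(full_path, text)

  git(repo_path, "add", "--", relative_path)
  # `git status --porcelain` prints nothing when the file is unchanged.
  return false if git(repo_path, "status", "--porcelain", "--", relative_path).empty?

  # The committer identity below is a placeholder assumption.
  git(repo_path, "-c", "user.name=crawl", "-c", "user.email=crawl@example.invalid",
      "commit", "-q", "-m", "Update #{relative_path}")
  true
end
```

After a few runs, `git log -p -- <path>` inside the repository shows each stored version of that article as a diff against the previous one.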

Setup

Install the required gems (nokogiri, json, grit, thin, sinatra) with

bundle install

Initialize an empty Git repository. The default path is ./repository

git init repository

The Sinatra backend can be started with the command

bundle exec ruby sinatra-backend.rb

Usage

To detect changes, the crawlers should be run regularly, for example every hour (a cron task is recommended). Run all crawlers in the ./crawlers directory with the command

bundle exec ruby crawler.rb

crawler.rb also accepts a list of files as parameters to run only specific crawlers.
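
The selection behaviour described above could be sketched roughly as follows. This is an illustrative assumption about how such a runner might pick its files, not the actual contents of crawler.rb; the function name `select_crawler_files` is hypothetical.

```ruby
# Hypothetical sketch (not taken from crawler.rb): decide which crawler
# files to run. If file names were passed on the command line, use exactly
# those; otherwise fall back to every .rb file in the crawlers directory.
def select_crawler_files(args, crawlers_dir = "./crawlers")
  return args.dup unless args.empty?

  # Sort so the run order is stable across platforms.
  Dir.glob(File.join(crawlers_dir, "*.rb")).sort
end
```

Under this sketch, calling `select_crawler_files(ARGV)` with no arguments would return every crawler in ./crawlers, while passing one or more paths would limit the run to those files.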
