internetarchive/crawling-for-nomore404

crawling-for-nomore404

Crawling-related code for no-more-404s projects.

There are multiple projects that are mostly independent of each other. Here is a summary of each project. See the README in each project's subdirectory for more details.

wikipedia

This project scrapes the Wikipedia IRC channel for updated articles, extracts newly added citations, and feeds those URLs to the crawler. The scraper and the crawl scheduler communicate through Kafka messaging, so other applications can also read the feed of new citations as well as the original IRC notifications.
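
As a rough illustration of the "extract newly added citations" step, the sketch below compares two revisions of an article's wikitext and returns the external URLs that appear only in the newer one. It is not taken from this repository: the function names `extract_urls` and `new_citations`, the URL pattern, and the idea of diffing raw wikitext are all assumptions made for the example. The actual scraper may work differently.

```python
import re

# Hypothetical pattern: an http(s) URL running up to whitespace or a
# common wikitext delimiter (], |, }, <, >, ").
URL_RE = re.compile(r'https?://[^\s\]|}<>"]+')


def extract_urls(wikitext):
    """Return the external URLs found in wikitext, in order of first appearance."""
    seen = []
    for url in URL_RE.findall(wikitext):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def new_citations(old_text, new_text):
    """Return URLs present in new_text but absent from old_text."""
    old_urls = set(extract_urls(old_text))
    return [url for url in extract_urls(new_text) if url not in old_urls]


if __name__ == "__main__":
    old = "See [http://example.com/a A]."
    new = old + " <ref>{{cite web|url=http://example.org/b|title=B}}</ref>"
    for url in new_citations(old, new):
        print(url)
```

In a pipeline like the one described above, each URL returned by `new_citations` would be published to the messaging layer for the crawl scheduler to pick up. That step is omitted here because the topic names and message format are not specified in this summary.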

wordpress

This project reads WordPress's official blog update stream and schedules the permalink URL of each new post for crawling. It is currently implemented as a single application.
