New module which parses and dumps URLs to a file for use by archiver daemon

Allows rate-limiting of requests; see file headers.
commit 9ba38be24a91ff30f028e6aa40ec9b2083117711 1 parent 1bc9242
gwern authored
Showing with 39 additions and 1 deletion.
  1. +6 −1 plugins/WebArchiver.hs
  2. +33 −0 plugins/WebArchiverBot.hs
plugins/WebArchiver.hs
@@ -1,11 +1,16 @@
{-| Scans a page of Markdown looking for http links. When it finds them, it submits them
to webcitation.org / https://secure.wikimedia.org/wikipedia/en/wiki/WebCite
(It will also submit them to Alexa (the source for the Internet Archive), but Alexa says that
-its bots take weeks to visit and may not ever.)
+its bots take weeks to visit and may not ever.) See also the WebArchiverBot.hs plugin and the
+archiver daemon <http://hackage.haskell.org/package/archiver>.
Limitations:
* Only parses Markdown, not ReST or any other format; this is because 'readMarkdown'
is hardwired into it.
+* No rate limiting or throttling; it will fire off all requests as fast as possible.
+ If pages have more than 20 external links or so, this may result in your IP being temporarily
+ banned by WebCite. To avoid this, you can use WebArchiverBot.hs instead, which will parse & dump
+ URLs into a file processed by the archiver daemon (which *is* rate-limited).
By: Gwern Branwen; placed in the public domain -}
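(For context: the per-link submission that WebArchiver.hs makes is essentially one HTTP GET against WebCite's archiving endpoint. Below is a minimal sketch of such a request, assuming the old Network.HTTP package and a webcitation.org/archive?url=...&email=... style interface; the endpoint and parameters are assumptions, not code from this commit.)

import Network.HTTP (simpleHTTP, getRequest)
import Network.URI (escapeURIString, isUnreserved)

-- Sketch only: the endpoint and parameters are assumptions about WebCite's API.
-- Submit one URL to WebCite for archiving; WebCite notifies the given email
-- address when the snapshot is ready.
-- e.g. submitToWebCite "nobody@example.com" "http://example.com"
submitToWebCite :: String -> String -> IO ()
submitToWebCite email url = do
  let escaped = escapeURIString isUnreserved url
  _ <- simpleHTTP (getRequest ("http://www.webcitation.org/archive?url="
                               ++ escaped ++ "&email=" ++ email))
  return ()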
plugins/WebArchiverBot.hs
@@ -0,0 +1,33 @@
+{-| Scans a page of Markdown looking for http links; when it finds them, it appends them to a default file.
+This plugin is meant to be run in conjunction with archiver <http://hackage.haskell.org/package/archiver>.
+If you do not wish to run the daemon (for example, because no page of yours has more than a dozen external http links),
+then you should use the original WebArchiver.hs plugin.
+
+Limitations:
+* Only parses Markdown, not ReST or any other format; this is because 'readMarkdown'
+is hardwired into it.
+
+By: Gwern Branwen; placed in the public domain -}
+
+module WebArchiverBot (plugin) where
+
+import System.Directory (getHomeDirectory)
+import Network.Gitit.Interface (liftIO, bottomUpM, Plugin(PreCommitTransform), Inline(Link))
+import Text.Pandoc (defaultParserState, readMarkdown)
+
+plugin :: Plugin
+plugin = PreCommitTransform archivePage
+
+-- archivePage :: (MonadIO m) => String -> m String
+archivePage x = do let p = readMarkdown defaultParserState x
+                   -- force evaluation and archiving side-effects
+                   _p' <- liftIO $ bottomUpM archiveLinks p
+                   return x -- note: this is read-only - don't actually change page!
+
+archiveLinks :: Inline -> IO Inline
+archiveLinks x@(Link _ ('!':_, _)) = return x -- skip interwiki links
+archiveLinks x@(Link _ ('#':_, _)) = return x -- skip section links
+archiveLinks x@(Link _ (uln, _)) = do homedir <- getHomeDirectory
+                                      appendFile (homedir++"/.urls.txt") (uln++"\n")
+                                      return x
+archiveLinks x = return x
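(To see how the file dump enables rate-limiting, here is a hypothetical sketch of the consumer side, in the spirit of the archiver daemon: a loop that drains ~/.urls.txt one URL at a time with a fixed delay. archiveOne and the 30-second interval are illustrative assumptions, not the archiver package's actual implementation.)

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Directory (getHomeDirectory)

main :: IO ()
main = forever $ do
  home <- getHomeDirectory
  let file = home ++ "/.urls.txt"
  contents <- readFile file
  length contents `seq` return () -- force the lazy read so the handle closes before we rewrite
  case lines contents of
    []       -> threadDelay (60 * 1000000)        -- queue empty; sleep a minute
    (u:rest) -> do archiveOne u                   -- submit one URL (stub below)
                   writeFile file (unlines rest)  -- pop it off the queue
                   threadDelay (30 * 1000000)     -- assumed pacing: ~1 request per 30s

-- Stub standing in for the real submission logic.
archiveOne :: String -> IO ()
archiveOne u = putStrLn ("would archive: " ++ u)

Queueing through a flat file decouples link discovery (fast and bursty, inside gitit) from submission (slow and steady), which is exactly the rate-limiting the WebArchiver.hs header says it lacks.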