Skip to content
This repository
Newer
Older
100644 63 lines (51 sloc) 3.192 kb
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
1 {-| Scans page of Markdown looking for http links. When it finds them, it submits them
2 to webcitation.org / https://secure.wikimedia.org/wikipedia/en/wiki/WebCite
3 (It will also submit them to Alexa (the source for the Internet Archive), but Alexa says that
9ba38be2 »
2011-04-07 +New module which parses and dumps URLs to a file for use by archiver…
4 its bots take weeks to visit and may not ever.) See also the WebArchiverBot.hs plugin and the
5 archiver daemon <http://hackage.haskell.org/package/archiver>.
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
6
7 Limitations:
8 * Only parses Markdown, not ReST or any other format; this is because 'readMarkdown'
9 is hardwired into it.
9ba38be2 »
2011-04-07 +New module which parses and dumps URLs to a file for use by archiver…
10 * No rate limitation or choking; will fire off all requests as fast as possible.
11 If pages have more than 20 external links or so, this may result in your IP being temporarily
12 banned by WebCite. To avoid this, you can use WebArchiverBot.hs instead, which will parse & dump
13 URLs into a file processed by the archiver daemon (which *is* rate-limited).
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
14
15 By: Gwern Branwen; placed in the public domain -}
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
16
17 module WebArchiver (plugin) where
18
0a939797 »
2010-07-06 Made WebArchiver plugin more parallel (gwern).
19 import Control.Concurrent (forkIO)
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
20 import Control.Monad (when)
21 import Control.Monad.Trans (MonadIO)
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
22 import Data.Maybe (fromJust)
23 import Network.Browser (browse, formToRequest, request, Form(..))
24 import Network.HTTP (getRequest, rspBody, simpleHTTP, RequestMethod(POST))
25 import Network.URI (isURI, parseURI, uriPath)
26
27 import Network.Gitit.Interface (askUser, liftIO, processWithM, uEmail, Plugin(PreCommitTransform), Inline(Link))
28 import Text.Pandoc (defaultParserState, readMarkdown)
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
29
30 plugin :: Plugin
31 plugin = PreCommitTransform archivePage
32
33 -- archivePage :: (MonadIO m) => String -> ReaderT (Config, Maybe User) (StateT IO) String
34 archivePage x = do mbUser <- askUser
35 let email = case mbUser of
36 Nothing -> "nobody@mailinator.com"
37 Just u -> uEmail u
38 let p = readMarkdown defaultParserState x
39 -- force evaluation and archiving side-effects
40 _p' <- liftIO $ processWithM (archiveLinks email) p
41 return x -- note: this is read-only - don't actually change page!
42
43 archiveLinks :: String -> Inline -> IO Inline
0a939797 »
2010-07-06 Made WebArchiver plugin more parallel (gwern).
44 archiveLinks e x@(Link _ (uln, _)) = forkIO (checkArchive e uln) >> return x
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
45 archiveLinks _ x = return x
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
46
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
47 -- | Error check the URL and then archive it both ways
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
48 checkArchive :: (MonadIO m) => String -> String -> m ()
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
49 checkArchive email url = when (isURI url) $ liftIO (webciteArchive email url >> alexaArchive url)
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
50
0a939797 »
2010-07-06 Made WebArchiver plugin more parallel (gwern).
51 webciteArchive :: String -> String -> IO ()
52 webciteArchive email url = ignore $ openURL ("http://www.webcitation.org/archive?url=" ++ url ++ "&email=" ++ email)
1a3c2313 »
2009-06-25 Changed plugin names, made them all work.
53 where openURL = simpleHTTP . getRequest
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
54 ignore = fmap $ const ()
55
56 alexaArchive :: String -> IO ()
112e8b06 »
2010-02-26 Plugins: hlint and import cleanup (gwern).
57 alexaArchive url = do let archiveform = Form POST
58 (fromJust $ parseURI "http://www.alexa.com/help/crawlrequest")
59 [("url", url), ("submit", "")]
77fb4358 »
2009-12-21 Modified WebArchiver plugin to make Alexa requests (gwern).
60 (uri, resp) <- browse $ request $ formToRequest archiveform
61 when (uriPath uri /= "/help/crawlthanks") $
62 error $ "Request failed! Did Alexa change webpages? Response:" ++ rspBody resp
Something went wrong with that request. Please try again.