Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Newer
Older
100644 57 lines (45 sloc) 2.759 kB
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
1 {-| Scans page of Markdown looking for http links. When it finds them, it submits them
2 to webcitation.org / https://secure.wikimedia.org/wikipedia/en/wiki/WebCite
3 (It will also submit them to Alexa (the source for the Internet Archive), but Alexa says that
4 its bots take weeks to visit and may not ever.)
5
6 Limitations:
7 * Only parses Markdown, not ReST or any other format; this is because 'readMarkdown'
8 is hardwired into it.
9
10 By: Gwern Branwen; placed in the public domain -}
1a3c231 @jgm Changed plugin names, made them all work.
authored
11
12 module WebArchiver (plugin) where
13
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
14 import Control.Concurrent (forkIO, ThreadId)
1a3c231 @jgm Changed plugin names, made them all work.
authored
15 import Control.Monad (when)
16 import Control.Monad.Trans (MonadIO)
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
17 import Data.Maybe (fromJust)
18 import Network.Browser (browse, formToRequest, request, Form(..))
19 import Network.HTTP (getRequest, rspBody, simpleHTTP, RequestMethod(POST))
20 import Network.URI (isURI, parseURI, uriPath)
21
22 import Network.Gitit.Interface (askUser, liftIO, processWithM, uEmail, Plugin(PreCommitTransform), Inline(Link))
23 import Text.Pandoc (defaultParserState, readMarkdown)
1a3c231 @jgm Changed plugin names, made them all work.
authored
24
25 plugin :: Plugin
26 plugin = PreCommitTransform archivePage
27
28 -- archivePage :: (MonadIO m) => String -> ReaderT (Config, Maybe User) (StateT IO) String
29 archivePage x = do mbUser <- askUser
30 let email = case mbUser of
31 Nothing -> "nobody@mailinator.com"
32 Just u -> uEmail u
33 let p = readMarkdown defaultParserState x
34 -- force evaluation and archiving side-effects
35 _p' <- liftIO $ processWithM (archiveLinks email) p
36 return x -- note: this is read-only - don't actually change page!
37
38 archiveLinks :: String -> Inline -> IO Inline
39 archiveLinks e x@(Link _ (uln, _)) = checkArchive e uln >> return x
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
40 archiveLinks _ x = return x
1a3c231 @jgm Changed plugin names, made them all work.
authored
41
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
42 -- | Error check the URL and then archive it both ways
1a3c231 @jgm Changed plugin names, made them all work.
authored
43 checkArchive :: (MonadIO m) => String -> String -> m ()
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
44 checkArchive email url = when (isURI url) $ liftIO (webciteArchive email url >> alexaArchive url)
1a3c231 @jgm Changed plugin names, made them all work.
authored
45
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
46 webciteArchive :: String -> String -> IO ThreadId
47 webciteArchive email url = forkIO (ignore $ openURL ("http://www.webcitation.org/archive?url=" ++ url ++ "&email=" ++ email))
1a3c231 @jgm Changed plugin names, made them all work.
authored
48 where openURL = simpleHTTP . getRequest
77fb435 @jgm Modified WebArchiver plugin to make Alexa requests (gwern).
authored
49 ignore = fmap $ const ()
50
51 alexaArchive :: String -> IO ()
52 alexaArchive url = do let archiveform = Form POST (fromJust $ parseURI "http://www.alexa.com/help/crawlrequest")
53 [("url", url), ("submit", "")]
54 (uri, resp) <- browse $ request $ formToRequest archiveform
55 when (uriPath uri /= "/help/crawlthanks") $
56 error $ "Request failed! Did Alexa change webpages? Response:" ++ rspBody resp
Something went wrong with that request. Please try again.