Xkcd #21

Open
fazo96 opened this Issue Sep 16, 2015 · 25 comments

@fazo96

fazo96 commented Sep 16, 2015

I plan to archive all the comics in http://xkcd.com/

I think I'll use (comicnumber)-(comictitle).png for the image and figure out how to save the alt-text in the PNG metadata.

Please post if you want to keep a copy of the archive or you manage to create it before I do :)

@whyrusleeping

Member

whyrusleeping commented Sep 16, 2015

@fazo96 how are you going to manage the comics that are dynamic or contain multiple sequential images? Or the map ones that have a larger version available on click?

I would loooooove to have this. But we should make sure that Randall is okay with it first, I'm not sure if there is any sort of copyright involved here. (too bad he doesn't use GitHub, or we could just ping him)

@jbenet

Member

jbenet commented Sep 16, 2015

I would loooooove to have this. But we should make sure that Randall is okay with it first, I'm not sure if there is any sort of copyright involved here.

absolutely. thanks for saying this.

looks like everything is released CC-BY-NC

(too bad he doesn't use GitHub, or we could just ping him)

i'm sure he has an account. just have to find it \o/

@fazo96

  • would be great to include a web viewer with the archive.
  • maybe make a dir for every comic
  • put the image in both image.png and <original-img-filename> so that we respect his filenames too, but also make them predictably linked
  • put the alt text in a file, like alt.txt
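The layout in the list above could be sketched roughly as follows. This is only an illustration in Python, not the script used in this thread: the function name `write_comic_dir`, its parameters, and the choice to name the directory after the comic number are all assumptions made for the example.

```python
from pathlib import Path, PurePosixPath


def write_comic_dir(root, number, image_bytes, original_filename, alt_text):
    """Write one comic into its own directory (hypothetical layout).

    Creates <root>/<number>/ containing:
      image.png            predictable name for linking
      <original filename>  the name used by the source site
      alt.txt              the alt text as UTF-8
    """
    comic_dir = Path(root) / str(number)
    comic_dir.mkdir(parents=True, exist_ok=True)

    # Keep only the last path component so a URL-like name such as
    # "comics/foo.png" cannot write outside the comic directory.
    original_name = PurePosixPath(original_filename).name

    (comic_dir / "image.png").write_bytes(image_bytes)
    (comic_dir / original_name).write_bytes(image_bytes)
    (comic_dir / "alt.txt").write_text(alt_text, encoding="utf-8")
    return comic_dir
```

Storing the image under two names doubles the files on disk, but identical content resolves to the same hash in IPFS, so the duplicate should not cost extra storage there. Note that calling every image `image.png` is misleading for comics that are not PNG files.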

(Alternatively, mirror the RSS feed?)

@rnhmjoj

rnhmjoj commented Sep 16, 2015

The title and the alt text could be stored in the PNG metadata. You can use ImageMagick: see this.
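For readers without ImageMagick, here is a minimal sketch of the same idea using only the Python standard library. It inserts a PNG `tEXt` chunk directly after the `IHDR` chunk. The function names are invented for this example. `tEXt` is limited to Latin-1, so alt text with other characters would need an `iTXt` chunk instead, which this sketch does not cover.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png, keyword, text):
    """Return a copy of `png` with a tEXt chunk inserted right after IHDR.

    Raises UnicodeEncodeError if keyword or text is not Latin-1.
    """
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk:
    # 4-byte length + 4-byte type + 13 data bytes + 4-byte CRC.
    ihdr_end = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


def read_text_chunks(png):
    """Collect all tEXt chunks of a PNG into a {keyword: text} dict."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # length + type + CRC fields, plus the data
    return found
```

`Title` and `Description` are among the keywords the PNG specification predefines, so they seem like natural homes for the comic title and the alt text.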

@fazo96

fazo96 commented Sep 16, 2015

Looks like the license is not an issue as long as we credit Randall and include a copy of the license.

Also:

  • Storing title and alt-text in the png metadata looks like the way to go!
  • @whyrusleeping As for unconventional comics, we'll figure out a solution for each one
  • @jbenet a viewer would be great, but at this point, what do you guys think about including the entire website?

Uhm, I just found this in the About page of xkcd.com:

Is there an interface for automated systems to access comics and metadata?
Yes. You can get comics through the JSON interface, at URLs like http://xkcd.com/info.0.json (current comic) and http://xkcd.com/614/info.0.json (comic #614).

Getting the data will be a lot easier this way (no html parsing involved)
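A rough Python sketch of using that JSON interface is shown below. The two URL shapes are the ones quoted from the About page; the function names are made up here, and the field names (`num`, `title`, `alt`, `img`) are what the interface is understood to return, so treat them as an assumption and check a live response before relying on them.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://xkcd.com"


def info_url(number=None):
    """URL of the JSON metadata for one comic, or the current comic if None."""
    if number is None:
        return f"{BASE_URL}/info.0.json"
    return f"{BASE_URL}/{number}/info.0.json"


def parse_info(raw):
    """Pick the fields an archive needs out of a JSON metadata document."""
    data = json.loads(raw)
    return {
        "num": data["num"],
        "title": data["title"],
        "alt": data["alt"],
        "img": data["img"],
    }


def fetch_info(number=None, timeout=10):
    """Download and parse the metadata for one comic (needs network access)."""
    with urlopen(info_url(number), timeout=timeout) as response:
        return parse_info(response.read().decode("utf-8"))
```

Calling `fetch_info()` with no argument asks for the current comic, whose `num` gives the upper bound for a loop over all comics.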

EDIT:

I wrote a node script that downloads and organizes data from xkcd.com and it worked!

I created a partial copy of xkcd.com to see if you like the setup (so that we can create a full copy later). I included Randall's about and license pages and my script in the folder 👍

You can check it out here: QmSeYATNaa2fSR3eMqRD8uXwujVLT2JU9wQvSjCd1Rf8pZ

I'm thinking about writing a simple index.html to include in every comic's folder so that alt-text, image (and transcript) can all be seen comfortably on the same browser tab
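One way such a per-comic index.html might be generated is sketched here in Python. The markup, the function name `render_index`, and its parameters are invented for illustration and are not taken from the actual archive.

```python
from html import escape


def render_index(title, image_name, alt_text, transcript=""):
    """Build a small HTML page showing image, alt text and transcript together."""
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{escape(title)}</title></head><body>",
        f"<h1>{escape(title)}</h1>",
        # escape() also escapes quotes by default, so the values are safe
        # inside the double-quoted attributes below.
        f'<img src="{escape(image_name)}" alt="{escape(alt_text)}" title="{escape(alt_text)}">',
        f"<p>{escape(alt_text)}</p>",
    ]
    if transcript:
        parts.append(f"<pre>{escape(transcript)}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Because the image is referenced by a relative name, the page keeps working whichever gateway or hash the folder is served from.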

@davidar

Member

davidar commented Sep 18, 2015

👍

a viewer would be great, but at this point, what do you guys think about including the entire website?

I think #7 is also quite relevant here.

@davidar davidar added the in progress label Sep 19, 2015

@fazo96

fazo96 commented Sep 26, 2015

I completed the archive (every image file and more is now available via IPFS); it just needs a viewer and probably a better folder structure.

Here you go: QmPVP4sDre9rtYahGvcjv3Fqet3oQyqrH5xS33d4YBVFme

@davidar

Member

davidar commented Oct 1, 2015

@fazo96 👏

@cryptix

cryptix commented Oct 1, 2015

@fazo96 👍

Can we zero pad the numbers on the next pass? :)
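To illustrate why padding helps: directory listings usually sort names as strings, so unpadded numbers come out in the wrong order. A small Python sketch follows; the helper name `padded` and the width of 4 are assumptions for the example.

```python
def padded(number, width=4):
    """Left-pad a comic number with zeros, e.g. 7 -> '0007'."""
    return str(number).zfill(width)


numbers = [1, 2, 10, 100, 1862]

plain = sorted(str(n) for n in numbers)
fixed = sorted(padded(n) for n in numbers)

print(plain)  # ['1', '10', '100', '1862', '2']
print(fixed)  # ['0001', '0002', '0010', '0100', '1862']
```

A width of 4 keeps the order correct up to comic 9999. Larger numbers are not truncated, but they would sort out of order again.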

@fazo96

fazo96 commented Oct 1, 2015

@cryptix yeah I figured it was necessary :) if you'd like to try, the script I used to generate the directory tree is included in the directory. It's named xkcd-downloader.js

If I have time I'll implement it

@mateon1

mateon1 commented Aug 11, 2016

I have scraped the entirety of xkcd.com and some of its subdomains (apparently cross-subdomain interlinking didn't work). The result is a very well functioning copy, available at the end of this comment.
EDIT:
Instructions for updating the archive:

  1. Download and install HTTrack. (Windows/linux/OSX)
  2. run httrack xkcd.com -d -%F "" -%N1 -n +*.css +*.js +*.png +*.jpg +*.jpeg +*.gif -*.pdf -O $mirror,$cache (or httrack xkcd.com what-if.xkcd.com ... to archive what-if as well)
  3. The command should be done within a few hours on a decent link.
  4. There may be some .delayed files in imgs.xkcd.com/comics; they contain proper data but have an invalid name. I have no clean solution, so use this command to fix it up:
    cd $mirror/imgs.xkcd.com/comics && ls -1 | awk -F. '/delayed/ {print $0 " " $1".png"}' | xargs -n 2 mv
  5. ipfs add -r $mirror
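The rename in step 4 can also be written as a small Python function, which avoids the shell pipeline's trouble with file names containing spaces. Like the awk one-liner, this sketch keeps everything before the first dot and appends `.png`, so it assumes every delayed file really is a PNG. Unlike `mv`, it skips a file rather than overwrite an existing target. The function name `fix_delayed` is made up for the example.

```python
from pathlib import Path


def fix_delayed(directory):
    """Rename files whose name contains 'delayed' to '<first dot field>.png'.

    Returns a list of (old_name, new_name) pairs for the files renamed.
    """
    renamed = []
    # Take a sorted snapshot first so renaming does not disturb the iteration.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or "delayed" not in path.name:
            continue
        target = path.with_name(path.name.split(".")[0] + ".png")
        if target.exists():
            continue  # never clobber a file that is already in place
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

It would be called as `fix_delayed("imgs.xkcd.com/comics")` from inside the mirror directory, before running `ipfs add`.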

Switch explanation:

  • -d - Allow mirroring of subdomains (edit: Doesn't seem to work for some reason.)
  • -%F "" - Disable footer text (by default including timestamp), allowing deduplication of HTML across updates.
  • -%N1 - Untested, but should fix the 'delayed' files for known file extensions.
  • -n - Archive resources "near" an HTML file (scripts, css, images)
  • +*.css +*.js +*.png +*.jpg +*.jpeg +*.gif - Also archive all css, js, and images seen outside of HTML (included from JS or CSS, for example)
  • -*.pdf - Don't download external PDFs (when archiving what-if.xkcd.com)
  • -O $mirror,$cache - The resulting webpage is put into $mirror, while httrack runtime info, logs, caches are put into $cache.
  • (optional) -%v2 - Add a progress and statistics display during crawl.

Archiving notes:

  • While HTTrack supports an --update switch, it's broken if the -%F option has an empty argument, so we need to re-crawl the site completely to update.
  • I don't recommend archiving what-if.xkcd.com using the command above, as for some reason, the crawler enters Wikipedia and downloads way too much.
  • TODO: Check how well m.xkcd.com archives
  • TODO: Archive "Hoverboard" game/comic (+ other interactive, if sensible)

Archive links (newest to oldest): (My IPNS entry might be more up to date)

| Date | Last Comic | Size | Hash | Notes |
|---|---|---|---|---|
| 2016-09-30 | 1740 | 169MB | QmNogExCdnMJwWE1bpEweMUQyo3X2LP6tuWVvmLYJxUc6o | |
| 2016-09-28 | 1739 | 169MB | QmTGXzCqJNRpKVWmt84oQFmHLiSPQ43JLMpshY95Xkfy1N | |
| 2016-09-26 | 1738 | 169MB | Qmam87KnuC93dVF2PidnDh1KpH8U1V3osWx41tkYCQfont | Includes uncorrected 1738: Moon Shapes image |
| 2016-09-21 | 1736 | 169MB | QmZR6JT1nnNdcBcPjnA4GfT3uqRsHmYrg2fKWrT2BEiTmk | |
| 2016-09-21 | 1735 | 168MB | QmRtXAxyXHWA5krxXMrRJHKJ5qFYXpsz48htquiHp9KbUs | |
| 2016-09-15 | 1733 | 167MB | QmTfagPa7QTtpcZVLYSsKBNMSZz4ytSwStAsNX5mJXhyEF | |
| 2016-09-12 | 1732 | 167MB | QmWoJ5aLozwkNuPQh7RSX7RCn5eXLRwSQezxNBNKsWxsc2 | |
| 2016-09-09 | 1731 | 166MB | QmZLdQQJHMCZFZ8jVSSwpmeGfHw6fF2V2QBzo2SJVerWHN | |
| 2016-09-05 | 1729 | 166MB | QmXzDGjRT7McpuLHfRP42ST6bZbjX2KvDGxAZ68gXFdbBz | |
| 2016-09-02 | 1728 | 166MB | Qmc5MG1kL2rR5PNVqr7uqZKAi7g7FcRgvf8mPxtWhb3tNp | |
| 2016-09-01 | 1727 | 166MB | QmTz7tvjVCYz5GPN3YZYNQHadbrSmSzCLp4RWKrh664pJL | |
| 2016-08-29 | 1726 | 166MB | QmRZGA4dMVn13acXQsQL32c8ANcpLSsuTMkx5t98y5oeJL | |
| 2016-08-26 | 1725 | 165MB | QmbLgCaps5oiEh1KSBcnAXAado3tVmpocqsgshVrU2jLoR | |
| 2016-08-24 | 1724 | 166MB | QmNvQQwupNbfUkkTvGSxSyoDjC1WVbEob6NhzVh9qFydCR | |
| 2016-08-23 | 1723 | 165MB | QmRrbEHYyDLSF3d7ghVSpRS2TqrLRhkFxXCsJSFBjuSaCs | |
| 2016-08-19 | 1722 | 165MB | QmdDtTn5W1cQyKDQjubVwEACazjhhP2f7VaNew5bZaBsk7 | |
| 2016-08-17 | 1721 | 164MB | QmPMrtopMKBmsW2AMtNNkvdYu9VoHAEwAYgWTFKZgNNqe2 | |
| 2016-08-15 | 1720 | 165MB | QmXfn9kftq3DNHPoEbwonYdmEdwyH6BRENMCHscGXaymRm | |
| 2016-08-13 | 1719 | 165MB | QmZJHTHXjGnZN4FtrxzZprNtSyFRg8x9t2pLpuE2jjrzad | Fix .delayed files causing some comics to be broken. |
| 2016-08-13 | 1719 | 165MB | QmauMY4ux6jQVkGmphhzWwULasZ4RPMYxvWkppGv1ZpAL3 | |
| 2016-08-11 | 1718 | 172MB | QmbcvivamWCKUuQjdTbCHNBy74qehU6uWTCdyBw3sN8X6b | This archive unfortunately includes the cache folder. |

@flyingzumwalt flyingzumwalt added backlog and removed in progress labels Jan 15, 2017

@Kubuxu

Member

Kubuxu commented Feb 7, 2017

Looks like the currently referenced version on the website isn't fully available.

@leerspace

Contributor

leerspace commented Feb 8, 2017

@fazo96 do you have the original archive that's currently linked to on the archives.ipfs.io site? https://ipfs.io/ipfs/QmPVP4sDre9rtYahGvcjv3Fqet3oQyqrH5xS33d4YBVFme

It doesn't currently seem to be fully available, but if you still have it I can pin it to my ipfs node. I'd try to reproduce the archive using the script in the archive, but I could only guess what the exact text was in the about and license files.

@leerspace

Contributor

leerspace commented Feb 10, 2017

FWIW I just generated a new version of fazo96's archive that's linked to from the site and pinned it to my ipfs node, so the comics that I couldn't access through the gateway before (in the archive linked to from archives.ipfs.io) now seem to be accessible. The about and license files still seem to be unavailable, but I just added the relevant pages from the website to the version of the archive I just created.

Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ

@lgierth

Member

lgierth commented Feb 10, 2017

Awesome, gonna pull that onto one of our storage nodes too. @leerspace wanna make a PR to update the site?

@lgierth

Member

lgierth commented Feb 10, 2017

Cool thanks, I just updated https://archives.ipfs.io

@fazo96

fazo96 commented Feb 13, 2017

@leerspace sorry for replying late, looks like I lost my copy of the original archive. Thanks for updating it! 👍

@kenXengineering

kenXengineering commented Jul 12, 2017

Hello, I've updated the archive using the xkcd-downloader.js script offered in the repo, and it now has all comics up to the latest today (1862). It is currently pinned on my laptop, but I will pin it to my server when I get home so it will be available at all times.

QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm

@lgierth

Member

lgierth commented Jul 12, 2017

Awesome, thanks @chosenken -- also pinned it on nihal.i.ipfs.io

@kenXengineering

kenXengineering commented Jul 17, 2017

Updated again to 1864, but this time attached it to an ipns: QmTaW8vRj4SkM6JhqVhAsibQE9PdJb5PQ2FMwPPc6gBi2h. I might work on a script that pulls new comics down and updates the ipns when it changes.
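A possible shape for such an update script is sketched below in Python. It is not the script mentioned above, and the function names are hypothetical. `missing_comics` works out which numbers still need downloading; it skips 404 by default because that number is understood to have no comic on xkcd.com. `republish` shells out to the `ipfs` CLI, so it needs a local IPFS node and is shown for orientation only.

```python
import subprocess


def missing_comics(have, latest, skip=(404,)):
    """Comic numbers from 1 to `latest` that are neither archived nor skipped."""
    have = set(have)
    return [n for n in range(1, latest + 1) if n not in have and n not in skip]


def republish(archive_dir):
    """Add the archive to IPFS and point this node's IPNS name at the new root.

    Requires the `ipfs` command and a running node; returns the root hash.
    """
    added = subprocess.run(
        ["ipfs", "add", "-r", "-Q", archive_dir],  # -Q prints only the final hash
        check=True, capture_output=True, text=True,
    )
    root_hash = added.stdout.strip()
    subprocess.run(["ipfs", "name", "publish", f"/ipfs/{root_hash}"], check=True)
    return root_hash
```

Publishing only when `missing_comics` returned something avoids republishing an unchanged archive.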

@carsonfarmer

carsonfarmer commented Jun 7, 2018

I'd like to update this one again, but to facilitate programmatic access, I'd like to change the structure slightly to something more like:

/ipfs/Qmahash/1/1 - Barrel - Part 1.png
...
/ipfs/Qmbhash/2003/2003 - Presidential Succession.png

where the comic files are contained within a 'folder' defined by the number rather than number and name. Any issues with this? I can host on our server, but I'd also be happy to submit a PR to update the archives.
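To make the programmatic-access point concrete, here is a small Python sketch that builds gateway URLs for this layout. The function names are invented, `QmExampleHash` stands in for a real root hash, and the `https://ipfs.io` gateway is just the one already linked in this thread.

```python
from urllib.parse import quote


def comic_dir_url(root_hash, number, gateway="https://ipfs.io"):
    """URL of a comic's folder; only the comic number is needed."""
    return f"{gateway}/ipfs/{root_hash}/{number}/"


def comic_image_url(root_hash, number, title, gateway="https://ipfs.io", ext=".png"):
    """URL of the image file, whose name still contains the comic title."""
    filename = f"{number} - {title}{ext}"
    # Spaces and other special characters in the title must be percent-encoded.
    return f"{gateway}/ipfs/{root_hash}/{number}/{quote(filename)}"


print(comic_dir_url("QmExampleHash", 2003))
print(comic_image_url("QmExampleHash", 2003, "Presidential Succession"))
```

With folders named by number alone, a client can reach any comic's folder without knowing its title and then list the folder to find the image.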

@olizilla

Member

olizilla commented Jul 4, 2018

@carsonfarmer that'd be rad. I've no objection to simplifying the folder structure.

@olizilla

Member

olizilla commented Jul 4, 2018

I plan to feature this data set on the start page of the new IPLD Explorer page in the ipfs-webui.

@olizilla

Member

olizilla commented Jul 4, 2018

@carsonfarmer could we get some zero padding on those indexes?

/ipfs/QmHash/0001/0001 - Barrel - Part 1.png
...
/ipfs/Qmbhash/2003/2003 - Presidential Succession.png

@carsonfarmer

carsonfarmer commented Jul 18, 2018

Ah sorry, was on vacation. Yes I'll update the indexes and post here when ready.

@HugoReeves

HugoReeves commented Aug 24, 2018

I've written a new program in Go that creates an archive such as the following: /ipfs/QmdAChzF2JQCx9icrmYHZhFdRSv9TpRjq5q1v5b3ANpxRf. It also includes a CSV with an index of post titles, published dates and post numbers. I have submitted a PR, ipfs/awesome-ipfs#193
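For comparison, a CSV index of that kind could be produced along these lines. This is a Python sketch, not the Go program above; the column names and the function name `write_index` are assumptions. It expects each comic as a dict with `num`, `title`, `year`, `month` and `day`, which is how the xkcd JSON interface is understood to report them.

```python
import csv
import io


def write_index(comics, out):
    """Write one CSV row per comic: number, title, ISO publication date."""
    writer = csv.writer(out)
    writer.writerow(["num", "title", "published"])
    for comic in sorted(comics, key=lambda c: c["num"]):
        # year/month/day may arrive as strings, so convert before formatting.
        published = "{:04d}-{:02d}-{:02d}".format(
            int(comic["year"]), int(comic["month"]), int(comic["day"])
        )
        writer.writerow([comic["num"], comic["title"], published])


# Made-up sample rows, only to show the output shape.
buffer = io.StringIO()
write_index(
    [
        {"num": 2, "title": "A, B", "year": "2006", "month": "1", "day": "2"},
        {"num": 1, "title": "First", "year": "2006", "month": "1", "day": "1"},
    ],
    buffer,
)
print(buffer.getvalue())
```

The `csv` module quotes titles that contain commas, so they do not break the column layout.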
