Merge pull request #67 from ipfs/fix/build-q1-2020
docs:  updated notes on snapshot generation
lidel committed Feb 7, 2020
2 parents 2106f78 + bb9f48c commit 918d684
Showing 6 changed files with 128 additions and 19 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
.DS_Store
.cache
*.zim
out
IPFS_PATH
80 changes: 67 additions & 13 deletions README.md
@@ -65,40 +65,94 @@ First, download the latest wiki lists using `bash getzim.sh cache_update`
After that create a download command using `bash getzim.sh choose`

### Step 2: Unpack the ZIM snapshot
Unpack the ZIM snapshot using https://github.com/dignifiedquire/zim/commit/a283151105ab4c1905d7f5cb56fb8eb2a854ad67

### Step 3: Enable Directory Sharding on your IPFS Node
Unpack the ZIM snapshot using `extract_zim` from https://github.com/dignifiedquire/zim:

```sh
$ extract_zim --skip-link ./wikipedia_en_all_maxi_2018-10.zim --out ./out
Extracting file: ./wikipedia_en_all_maxi_2018-10.zim to ./out
Creating map
Extracting entries: 126688
Spawning 126687 tasks across 64 threads
Extraction done in 315434ms
Main page is User:Stephane_(Kiwix)_Landing.html
```

> ### ℹ️ Main page
> Note the string after `Main page is`: it is the name of the landing page set for the ZIM archive.
> You may also decide to use the original landing page instead (e.g. `Main_Page` in `en`, `Anasayfa` in `tr`).
> The main page name needs to be passed as `--main` when calling `execute-changes.sh` later.
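
If the `extract_zim` output was saved to a file or piped onward, the landing page name can be pulled out of it mechanically. The following is a minimal sketch, not part of this repository: `main_page_from_log` is a hypothetical helper name, and it assumes the log contains a line starting with `Main page is `, as in the sample output above.

```sh
# Sketch: pull the landing page name out of extract_zim output.
# main_page_from_log is a hypothetical helper, not part of this repository.
main_page_from_log() {
  # Print whatever follows "Main page is " on the last matching line of stdin.
  sed -n 's/^Main page is //p' | tail -n 1
}

# Sample input copied from the extract_zim output shown earlier.
printf '%s\n' \
  'Extraction done in 315434ms' \
  'Main page is User:Stephane_(Kiwix)_Landing.html' \
  | main_page_from_log
```

The printed value is what would later be passed as `--main`.
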
### Step 3: Configure your IPFS Node

#### Enable Directory Sharding

Configure your IPFS node to enable directory sharding
```sh
$ ipfs config --json 'Experimental.ShardingEnabled' true
```

### Step 4: Add the data to IPFS
Add all the data to your node using `ipfs add`. Use the following command, replacing `$unpacked_wiki` with the path to the unpacked ZIM snapshot that you created in Step 2. **Don't share the hash yet.**
#### Optional: Switch to `badgerds`

Consider using a [datastore backed by BadgerDB](https://github.com/ipfs/go-ds-badger) for improved performance.
An existing repository can be converted to badgerds with [ipfs-ds-convert](https://github.com/ipfs/ipfs-ds-convert):

```sh
$ ipfs config profile apply badgerds
$ ipfs-ds-convert convert
```

### Step 4: Import unpacked data to IPFS

#### Add immutable copy

Add all the data to your node using `ipfs add`. Use the following command, replacing `$unpacked_wiki` with the path to the unpacked ZIM snapshot that you created in Step 2 (`./out`). **Don't share the hash yet.**

```sh
$ ipfs add -r --cid-version 1 $unpacked_wiki
```

If you find it takes too long, and your IPFS node is located on the same machine,
consider running this step in offline mode:

```sh
$ ipfs add -r --cid-version 1 --offline $unpacked_wiki
```

Save the last hash of the output from the above process.
It is the CID representing data in the original `$unpacked_wiki`.
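
When scripting this step, the last hash can be captured instead of copied by hand. The sketch below is illustrative only: `last_added_cid` is a hypothetical helper, the CIDs in the sample are placeholders, and it assumes `ipfs add` prints one `added <cid> <path>` line per entry with the root directory last.

```sh
# Sketch: keep only the CID from the final "added <cid> <path>" line.
# last_added_cid is a hypothetical helper name.
last_added_cid() {
  awk '$1 == "added" { cid = $2 } END { if (cid != "") print cid }'
}

# Dry run on placeholder output (these are not real CIDs).
printf '%s\n' \
  'added bafyPLACEHOLDER1 out/A/Main_Page.html' \
  'added bafyPLACEHOLDERROOT out' \
  | last_added_cid

# Assumed real usage, with $unpacked_wiki set as in the command above:
#   ROOT_CID=$(ipfs add -r --cid-version 1 "$unpacked_wiki" | last_added_cid)
```

The `-Q` flag that `execute-changes.sh` passes to `ipfs add` prints only the final hash and may be a simpler alternative.
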

#### Create mutable copy

Now, copy the immutable snapshot to `/root` on [MFS](https://docs-beta.ipfs.io/concepts/file-systems/#mutable-file-system-mfs):

```sh
$ ipfs add -w -r --raw-leaves $unpacked_wiki
$ ipfs files cp /ipfs/$ROOT_CID /root
```

Save the last hash of the output from that process. You will use that in the next step.
**Tip:** if anything goes wrong later, remove `/root` from MFS and create it again with the above command.
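
The reset described in the tip could be wrapped in a small helper. This is a sketch under assumptions: `reset_mfs_root` and the `IPFS` override variable are hypothetical names introduced here so the commands can be dry-run, and the CID in the example is a placeholder.

```sh
# Sketch: recreate /root in MFS from the immutable snapshot.
# reset_mfs_root and the IPFS override variable are hypothetical, added so
# the commands can be dry-run with IPFS=echo.
reset_mfs_root() {
  reset_root_cid=$1
  reset_ipfs_cmd=${IPFS:-ipfs}
  # Remove the old copy; ignore the error if /root does not exist yet.
  "$reset_ipfs_cmd" files rm -r /root 2>/dev/null || true
  # Copy the immutable snapshot back into MFS.
  "$reset_ipfs_cmd" files cp "/ipfs/$reset_root_cid" /root
}

# Dry run: print the commands instead of executing them.
IPFS=echo reset_mfs_root bafyPLACEHOLDERROOT
```
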


### Step 5: Add mirror info and search bar to the snapshot
**IMPORTANT: The snapshots must say who disseminated them.** This effort to mirror Wikipedia snapshots is not affiliated with the Wikimedia Foundation and is not connected to the volunteers whose contributions are contained in the snapshots. _The snapshots must include information explaining that they were created and disseminated by independent parties, not by Wikipedia._

We have provided a script that adds the necessary information. It also adds a decentralized, serverless search utility to the page.
We intend to make this part easier. See [the issue](https://github.com/ipfs/distributed-wikipedia-mirror/issues/21).

Write a copy of the snapshot from IPFS to `/root` on your machine
Within `execute-changes.sh`, update `IPNS_HASH` and `SNAP_DATE`. The `IPNS_HASH` value should be the IPNS hash for the language version of Wikipedia you're adding. `SNAP_DATE` should be today's date.

```sh
$ ipfs files cp /ipfs/$YOUR_WIKI_HASH /root
```
----
### 🚧 **Warning:** The `execute-changes.sh` script does not work correctly with the file structures present in the latest ZIM files.

Fixing this step is tracked [here](https://github.com/ipfs/distributed-wikipedia-mirror/issues/64). Comment there if you have spare time and want to help.

_[We intend to make this part easier. See [the issue](https://github.com/ipfs/distributed-wikipedia-mirror/issues/21)]_ Within `execute-changes.sh`, update `IPNS_HASH` and `SNAP_DATE`. The `IPNS_HASH` value should be the IPNS hash for the language version of Wikipedia you're adding. `SNAP_DATE` should be today's date.
----

Now run the script. It will process the content you copied into `/root`
Now run the script. It will process the content you copied into `/root`:

```sh
$ ./execute-changes.sh
$ ./execute-changes.sh /root
```

This will apply the modifications to your snapshot, add the modified version of the snapshot to IPFS, and return the hash of your new, modified version. That is the hash you want to share.
12 changes: 6 additions & 6 deletions execute-changes.sh
@@ -83,27 +83,27 @@ NEW_BODYJS=$(
cat - <(sed -e 's/{{SEARCH_CID}}/'"$SEARCH"'/' scripts/search-shim.js)
else
cat -
fi | ipfs add -Q
fi | ipfs add --cid-version 1 -Q
)

ipfs-replace "-/j/body.js" "/ipfs/$NEW_BODYJS"
ipfs-replace "I/s/Wikipedia-logo-v2-200px-transparent.png" \
"/ipfs/$(ipfs add -q assets/wikipedia-on-ipfs-small-flat-cropped-offset-min.png)"
"/ipfs/$(ipfs add --cid-version 1 -q assets/wikipedia-on-ipfs-small-flat-cropped-offset-min.png)"
ipfs-replace "I/s/wikipedia-on-ipfs.png" \
"/ipfs/$(ipfs add -Q assets/wikipedia-on-ipfs-100px.png)"
"/ipfs/$(ipfs add --cid-version 1 -Q assets/wikipedia-on-ipfs-100px.png)"

if [ -n "$SEARCH" ]; then
ipfs-replace "-/j/search.js" "/ipfs/$(ipfs add -Q scripts/search.js)"
ipfs-replace "-/j/search.js" "/ipfs/$(ipfs add --cid-version 1 -Q scripts/search.js)"
fi

# comment out some debug stuff in head.js
HEAD_JS_LOCATION="$(ipfs files stat --hash "$ROOT")/-/j/head.js"
HEAD_JS_HASH="$(ipfs cat "$HEAD_JS_LOCATION" | sed -e "s|^\tdocument.getElementsByTagName( 'head' )|//\0|" | ipfs add -Q)"
HEAD_JS_HASH="$(ipfs cat "$HEAD_JS_LOCATION" | sed -e "s|^\tdocument.getElementsByTagName( 'head' )|//\0|" | ipfs add --cid-version 1 -Q)"

ipfs-replace "-/j/head.js" "/ipfs/$HEAD_JS_HASH"

ipfs-replace "/wiki/index.html" "$ROOT/wiki/$MAIN"
ipfs-replace "/index.html" "/ipfs/$(ipfs add -Q redirect-page/index_root.html)"
ipfs-replace "/index.html" "/ipfs/$(ipfs add --cid-version 1 -Q redirect-page/index_root.html)"

ipfs files flush "$ROOT"
echo "We are done !!!"
Empty file modified getzim.sh
100644 → 100755
Empty file.
21 changes: 21 additions & 0 deletions tools/find_main_page_name.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# vim: set ts=2 sw=2:

set -euo pipefail

# Every Wikipedia language version uses a different name for the main page

usage() {
echo "USAGE:"
echo " $0 <lang code>";
echo ""
exit 2
}

if [ -z "${1-}" ]; then
echo "Missing language code"
usage
fi

# Follow redirects from the language root and keep the page name from the final URL
MAIN_PAGE=$(curl -Ls -o /dev/null -w '%{url_effective}' "https://${1}.wikipedia.org" | cut -d"/" -f5)
echo -n "${MAIN_PAGE}.html"
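
To see why field 5 is the page name, the same `cut` call can be run on a fixed URL instead of a live redirect. This is a sketch for illustration: `page_name_from_url` is a hypothetical helper, and it assumes the redirect target has the form `https://<lang>.wikipedia.org/wiki/<page>`.

```sh
# Sketch: the same cut-based extraction, run on a fixed URL instead of a live
# curl redirect. page_name_from_url is a hypothetical helper.
page_name_from_url() {
  # Splitting on "/" gives: 1="https:", 2="", 3=host, 4="wiki", 5=page name.
  printf '%s\n' "$1" | cut -d"/" -f5
}

page_name_from_url 'https://en.wikipedia.org/wiki/Main_Page'
```
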
32 changes: 32 additions & 0 deletions tools/find_original_main_page_url.sh
@@ -0,0 +1,32 @@
#!/bin/bash
# vim: set ts=2 sw=2:

set -euo pipefail

# Landing pages shipped with a ZIM file are either truncated or Kiwix-specific.
# This script finds the URL of the original version of the landing page
# matching the timestamp of the snapshot in the unpacked ZIM directory

usage() {
echo "USAGE:"
echo " $0 <main page name> <unpacked zim dir>";
echo ""
exit 2
}

if [ -z "${1-}" ]; then
echo "Missing main page name (eg. Main_Page.html) "
usage
fi

if [ -z "${2-}" ]; then
echo "Missing unpacked zim dir (eg. ./out) "
usage
fi

MAIN_PAGE=$1
ZIM_ROOT=$2

SNAPSHOT_URL=$(grep -io 'https://[^"]*oldid=[^"]*' "$ZIM_ROOT/A/$MAIN_PAGE")

echo $SNAPSHOT_URL
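
The `grep` pattern above can be tried on a single line of HTML to see what it matches. This is a sketch only: `snapshot_url_from_html` is a hypothetical helper, and the sample link (including its `oldid` value) is made up for illustration rather than taken from a real snapshot.

```sh
# Sketch: the same grep pattern applied to stdin.
# snapshot_url_from_html is a hypothetical helper name.
snapshot_url_from_html() {
  # Match from "https://" up to the closing quote, requiring "oldid=" inside.
  grep -io 'https://[^"]*oldid=[^"]*'
}

# Made-up sample line; a real landing page carries its own revision link.
printf '%s\n' \
  '<a href="https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=12345">source</a>' \
  | snapshot_url_from_html
```

Because the match stops at the first double quote, only the URL is printed and the surrounding markup is dropped.
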
