Create selections with the best articles of a WM project

README

This tool makes it easy to build non-topic-centric selections of
Wikipedia articles.

== Requirements ==

To run it, you need:
* MANDATORY: a GNU/Linux system
* MANDATORY: Internet access
* MANDATORY: access to a Wikipedia database
* OPTIONAL: access to the enwp10 rating database for the Wikipedia in English

== Context ==

Many Wikipedias, in different languages, have more than 500.000
articles, and even if we can provide offline versions of a reasonable
size, this is still too much for many devices. That's why we need to
build offline versions containing only a selection of the best
articles.

== Principle ==

This tool builds lists of key values (pageviews, links, ...) about
Wikipedia articles and puts them in a directory. These key values are
all the input we have to build smart selection algorithms. To get more
details about the lists, read the README in the language-specific
directory.
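
As an illustrative sketch only (the on-disk format below is an
assumption, not necessarily the one the scripts use), such per-article
key-value lists can be merged with standard tools, in the spirit of
merge_lists.pl:

```shell
# Hypothetical key-value lists: one "article<TAB>value" pair per line.
printf 'Earth\t1200\nMoon\t800\nSun\t950\n' > pageviews.tsv
printf 'Earth\t45\nMoon\t30\nSun\t40\n' > links.tsv

# join(1) needs sorted input on the join field.
sort pageviews.tsv > pageviews.sorted
sort links.tsv > links.sorted

# Merge both lists on the article title (column 1); the result has one
# line per article with all its key values side by side.
join -t "$(printf '\t')" pageviews.sorted links.sorted > merged.tsv
cat merged.tsv
```

The merged file then carries every key value for an article on a
single line, which is the convenient shape for a scoring step.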

== Tools ==

* build_biggest_wikipedia_list.sh gives you the list of all
  Wikipedia languages with more than 500.000 entries.

* build_selections.sh takes a language code ('en' for example) as first
  argument and creates the directory with all the key values.

* build_all_selections.sh builds/uploads the lists for all Wikipedias
  with more than 500.000 pages.

* add_target_ids_to_pagelinks.pl writes to the output a copy of
  pagelinks with the link target page id in the last column.

* build_en_vital_articles_list.sh generates the list of the Wikipedia
  in English vital articles
  (https://en.wikipedia.org/wiki/Wikipedia:Vital_articles)

* build_projects_lists.pl generates, for each project, the list of
  articles sorted by score in descending order.
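
The scoring step can be pictured with a toy example; the weights and
the file format below are illustrative assumptions, not what
build_scores.pl actually computes:

```shell
# Hypothetical merged list: article, pageviews, incoming links (tab-separated).
printf 'Earth\t1200\t45\nMoon\t800\t30\nSun\t950\t40\n' > merged.tsv

# Toy score: pageviews + 10 * links, then sort by score in descending
# order, as build_projects_lists.pl does with its own scores.
awk -F '\t' '{ print $1 "\t" ($2 + 10 * $3) }' merged.tsv \
    | sort -t "$(printf '\t')" -k2,2nr > scored.tsv
cat scored.tsv
```

Truncating such a sorted list at the desired size yields a selection
of the best-scoring articles.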

== Download ==

You can download the output of these scripts directly from
download.kiwix.org/wp1/ using FTP, HTTP(S) or rsync.

You might be interested in downloading only the latest version; here
is a small command (based on rsync) to retrieve the right directory
names.
for ENTRY in $(rsync --recursive --list-only download.kiwix.org::download.kiwix.org/wp1/ \
    | tr -s ' ' | cut -d ' ' -f5 | grep wiki | grep -v '/' | sort -r) ; do
  RADICAL=$(echo "$ENTRY" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
  if [ "$LAST" != "$RADICAL" ] ; then echo "$ENTRY" ; LAST="$RADICAL" ; fi
done
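
To see what the deduplication above does, here is the same logic run
on a few made-up directory names (examples only, not real snapshots):
stripping the _YYYY-MM suffix gives a per-wiki radical, and since the
list is sorted in reverse order, only the newest entry of each radical
is printed.

```shell
# Made-up dated directory names, as they would appear under wp1/.
printf '%s\n' enwiki_2024-03 enwiki_2024-05 frwiki_2024-04 | sort -r > entries.txt

# Keep only the first (i.e. newest) entry per radical.
LAST=''
while read ENTRY ; do
  RADICAL=$(echo "$ENTRY" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
  if [ "$LAST" != "$RADICAL" ] ; then echo "$ENTRY" ; LAST="$RADICAL" ; fi
done < entries.txt > latest.txt
cat latest.txt
```

Here only enwiki_2024-05 and frwiki_2024-04 survive; the older
enwiki_2024-03 is skipped because its radical was already seen.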