= Create selections with the best articles of a WM project =
This tool allows you to easily build non-topic-centric selections of Wikipedia articles.

== Requirements ==

To run it, you need:
* MANDATORY: a GNU/Linux system
* MANDATORY: Internet access
* MANDATORY: access to a Wikipedia database
* OPTIONAL: access to the enwp10 rating database for the Wikipedia in English

== Context ==

Many Wikipedias, in different languages, have over 500,000 articles; even though we can provide offline versions of a reasonable size, this is still too much for many devices. That is why we need to build offline versions containing only a selection of the best articles.

== Principle ==

This tool builds lists of key values (page views, links, ...) about Wikipedia articles and puts them in a directory. These key values are everything we have as input to build smart selection algorithms. For more details about the lists, read the README in the per-language directory.

== Tools ==

* build_biggest_wikipedia_list.sh gives you the list of all Wikipedia languages with more than 500,000 entries.
* build_selections.sh takes a language code ('en' for example) as its first argument and creates the directory with all the key values.
* build_all_selections.sh builds/uploads the lists for all Wikipedias with more than 500,000 pages.
* add_target_ids_to_pagelinks.pl writes to its output a copy of pagelinks with the link target page id appended as a last column.
* build_en_vital_articles_list.sh generates the list of the Wikipedia in English vital articles (https://en.wikipedia.org/wiki/Wikipedia:Vital_articles).
* build_projects_lists.pl generates the list for projects, with articles sorted by score in reverse order.

== Download ==

You can download the output of these scripts directly from download.kiwix.org/wp1/ using FTP, HTTP(S), or rsync. You might be interested in downloading only the latest version; here is a small command (based on rsync) to retrieve the right directory name:
 for ENTRY in `rsync --recursive --list-only download.kiwix.org::download.kiwix.org/wp1/ \
     | tr -s ' ' | cut -d ' ' -f5 | grep wiki | grep -v '/' | sort -r`
 do
     RADICAL=`echo $ENTRY | sed 's/_20[0-9][0-9]-[0-9][0-9]//g'`
     if [[ $LAST != $RADICAL ]]
     then
         echo $ENTRY
         LAST=$RADICAL
     fi
 done
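As a rough illustration of what add_target_ids_to_pagelinks.pl computes, here is a minimal sketch of the same join in awk. The file names, columns, and sample data below are simplified assumptions for the example; the real script operates on MediaWiki dump data, whose schema is more complex.

```shell
# Hypothetical simplification: 'page.tsv' maps page_id -> title, and
# 'pagelinks.tsv' holds source_page_id -> target_title.
printf '1\tEarth\n2\tMoon\n' > page.tsv
printf '1\tMoon\n2\tEarth\n' > pagelinks.tsv

# First pass (NR==FNR): remember each title's page id.
# Second pass: append the target page id as a last column.
awk -F'\t' -v OFS='\t' \
    'NR==FNR { id[$2] = $1; next } { print $0, id[$2] }' \
    page.tsv pagelinks.tsv
```

This prints each pagelinks row followed by the id of the target page (e.g. the row for "Moon" gains a trailing 2), which is the shape of output the Tools section describes for add_target_ids_to_pagelinks.pl.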