[WIP] Attributing scores to Wikidata items, making those available via a web API and dumps, under a CC0 license

Wikidata Rank

STATUS: WIP

Attributing scores to Wikidata items, making those available via a web API and dumps, under a CC0 license.

Motivation: when re-using Wikidata data, it can be useful to sort a set of items by some kind of score [1], [2]. So instead of spamming query.wikidata.org with one SPARQL request per item, we pre-calculate a score for every item from a Wikidata dump, and serve those scores in bulk.

There is existing work on a Wikidata Page Rank, but it offers no API to cherry-pick items of interest, and the data isn't under CC0. Other motivations may include traces of just having fun with scoring algorithms.

Summary

Web API

GET /scores?ids=Q8027|Q1001|Q216092|Q79969
GET /scores?ids=Q8027|Q1001|Q216092|Q79969&subscores=true
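As an illustration of the two requests above, here is a minimal sketch of a helper that builds the query URL. The function name `buildScoresUrl` and the `options` argument are hypothetical, and the host is a placeholder (the port matches the development server described below). The shape of the response is not documented here, so the sketch stops at building the URL.

```javascript
// Hypothetical helper: builds a /scores URL from a list of Wikidata item ids.
// `host` is a placeholder, point it at wherever your server runs.
function buildScoresUrl (host, ids, options = {}) {
  // Ids are pipe-separated, as in the examples above
  const query = `ids=${ids.join('|')}`
  const subscores = options.subscores ? '&subscores=true' : ''
  return `${host}/scores?${query}${subscores}`
}

console.log(buildScoresUrl('http://localhost:7264', ['Q8027', 'Q1001']))
console.log(buildScoresUrl('http://localhost:7264', ['Q8027', 'Q1001'], { subscores: true }))
```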

Dumps

coming soon

Development setup

Dependencies

  • NodeJS >v6.4.0 (recommended way to install: NVM)

Install

git clone https://github.com/maxlath/wikidata-rank
cd wikidata-rank
npm install
# Start the server on port 7264 and watch for file changes to restart it
npm run watch

At this point, your server is set up, but it has nothing to serve: we need to populate the database with item scores.

Calculate scores

Base scores

item base score = number of labels
+ number of descriptions * 0.5
+ number of aliases * 0.25
+ number of statements * 2
+ number of qualifiers
+ number of references
+ number of sitelinks * 4
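The formula above can be sketched as a small function. This is only an illustration: the names `weights` and `baseScore` are hypothetical, and the function takes pre-computed counts, since how the scripts count each part of an item (for instance, aliases per language or per alias) is not specified here.

```javascript
// Weights copied from the base score formula
const weights = {
  labels: 1,
  descriptions: 0.5,
  aliases: 0.25,
  statements: 2,
  qualifiers: 1,
  references: 1,
  sitelinks: 4
}

// counts: an object with one number per key of `weights` (missing keys count as 0)
function baseScore (counts) {
  return Object.keys(weights).reduce((sum, key) => {
    return sum + (counts[key] || 0) * weights[key]
  }, 0)
}

// Made-up counts, for illustration only
const exampleCounts = {
  labels: 10,
  descriptions: 4,
  aliases: 8,
  statements: 5,
  qualifiers: 3,
  references: 2,
  sitelinks: 6
}

// 10 + 2 + 2 + 10 + 3 + 2 + 24
console.log(baseScore(exampleCounts)) // 53
```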

wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
cat latest-all.json.gz | gzip -d | ./scripts/calculate_base_scores
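The internals of the script aren't shown here. As a rough sketch of how a decompressed dump stream can be consumed, assuming the usual layout of Wikidata JSON dumps (one big JSON array, with one entity per line and a trailing comma), each line can be parsed on its own. The function name `parseDumpLine` is hypothetical.

```javascript
// Parse one line of a Wikidata JSON dump.
// Returns the entity object, or null for the array delimiters and empty lines.
function parseDumpLine (line) {
  // Drop surrounding whitespace and the comma that separates entities
  const trimmed = line.trim().replace(/,$/, '')
  if (trimmed === '' || trimmed === '[' || trimmed === ']') return null
  return JSON.parse(trimmed)
}

// Possible usage, reading the decompressed dump from stdin:
//
// const readline = require('readline')
// readline.createInterface({ input: process.stdin })
//   .on('line', line => {
//     const entity = parseDumpLine(line)
//     if (entity) console.log(entity.id)
//   })
```

Parsing line by line avoids loading the whole dump into memory, which matters given the size of a full Wikidata dump.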

Network scores

item network score = sum of the base scores of items linking to the item

./scripts/calculate_network_scores
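To make the definition above concrete, here is a minimal sketch under stated assumptions: base scores are held in a plain object, and links are given as `[from, to]` pairs. Both data structures and the name `networkScores` are hypothetical. How the script extracts links, and whether an item linking several times to the same target counts once or several times, isn't specified here; this sketch simply adds one contribution per pair it is given.

```javascript
// baseScores: { itemId: baseScore }
// links: array of [fromId, toId] pairs, meaning "fromId links to toId"
function networkScores (baseScores, links) {
  const scores = {}
  for (const [from, to] of links) {
    // Each incoming link adds the base score of the linking item
    scores[to] = (scores[to] || 0) + (baseScores[from] || 0)
  }
  return scores
}

// Made-up data, for illustration only
const exampleBaseScores = { Q1: 10, Q2: 5, Q3: 1 }
const exampleLinks = [
  ['Q1', 'Q3'],
  ['Q2', 'Q3'],
  ['Q3', 'Q1']
]

console.log(networkScores(exampleBaseScores, exampleLinks)) // { Q3: 15, Q1: 1 }
```

Items without incoming links are absent from the result, so a consumer would treat a missing key as a score of 0.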

Secondary network scores

item secondary network score = sum of the network scores of items linking to the item

./scripts/calculate_secondary_network_scores

Total scores

item total score = base score + network score * 0.25 + secondary network score * 0.1

./scripts/calculate_total_scores
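As a small worked example of the formula above (the function name `totalScore` and the input values are made up for illustration):

```javascript
// Combine the three scores with the weights from the total score formula
function totalScore (base, network, secondaryNetwork) {
  return base + network * 0.25 + secondaryNetwork * 0.1
}

// base 53, network 40 (contributes 10), secondary network 100 (contributes 10)
console.log(totalScore(53, 40, 100))
```

The decreasing weights mean that an item's own content dominates its total score, while the scores of items one and two links away contribute progressively less.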

All scores

You can alternatively calculate all those scores at once:

./scripts/calculate_all_scores dump.json

Deploy to Toolforge

See the Hub deploy doc, simply replacing hub with wd-rank, especially in step 4:

echo "module.exports = {
  host: 'https://tools.wmflabs.org',
  // Customize root to match the URL passed by Nginx
  root: '/wd-rank'
}" > config/local.js

Install NodeJS with NVM

We can't access the Wikidata entities dump at /mnt/nfs/dumps-labstore1006.wikimedia.org/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz from the NodeJS webservice (see Phabricator ticket T193646), so a workaround is to install our own NodeJS using NVM:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
. $HOME/.nvm/nvm.sh
# Use the same version as `webservice --backend=kubernetes nodejs shell`
nvm install 6.11.0

Run with custom NodeJS

npm operations still need to be done from the webservice, as I can't find a way to make the environment use that new node binary rather than /usr/bin/node:

webservice --backend=kubernetes nodejs shell
cd ~/www/js
npm install
exit
# Force the use of our custom node binary
sed -i 's@node "./scripts@~/.nvm/versions/node/v6.11.0/bin/node "./scripts@' ./scripts/calculate_all_scores
./scripts/calculate_all_scores

Run as a daemon

cd
git clone https://github.com/AvianFlu/aeternum.git
cd aeternum
make
cd ~/www/js
~/aeternum/aeternum -o ./calculate_all_scores.log -e ./calculate_all_scores.err -- ./scripts/calculate_all_scores
# Follow the logs
tail -f ./calculate_all_scores*