Wikidata Rank

STATUS: WIP

Attributing scores to Wikidata items, making those available via a web API and dumps, under a CC0 license.

Motivation: when re-using Wikidata data, it can be useful to sort a set of items by some kind of score [1], [2]. So instead of spamming query.wikidata.org with one SPARQL request per item, we pre-calculate a score for all items from a Wikidata dump, and serve them in bulk.

There is prior work on a Wikidata PageRank, but no API to cherry-pick items of interest, and the data isn't released under CC0. Other motivations include just having fun with scoring algorithms.

Summary

Web API

GET /scores?ids=Q8027|Q1001|Q216092|Q79969
GET /scores?ids=Q8027|Q1001|Q216092|Q79969&subscores=true
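
For example, a minimal sketch of querying a local development server from NodeJS (assuming a Node version with built-in fetch; the response format isn't documented yet, so it is just printed as parsed JSON):

// Request scores for a few items from a local dev server
// (port 7264 is the development default, see the setup below)
const ids = [ 'Q8027', 'Q1001', 'Q216092', 'Q79969' ]
fetch(`http://localhost:7264/scores?ids=${ids.join('|')}&subscores=true`)
  .then(res => res.json())
  .then(scores => console.log(scores))
  .catch(console.error)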

Dumps

coming soon

Development setup

Dependencies

  • NodeJS > v6.4.0 (recommended way to install: NVM)

Install

git clone https://github.com/maxlath/wikidata-rank
cd wikidata-rank
npm install
# Start the server on port 7264 and watch for file changes to restart
npm run watch

At this point, your server is set up, but it has nothing to serve: we need to populate the database with item scores.

Calculate scores

Base scores

item base score = number of labels
                + number of descriptions * 0.5
                + number of aliases * 0.25
                + number of statements * 2
                + number of qualifiers
                + number of references
                + number of sitelinks * 4

wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
gzip -dc latest-all.json.gz | ./scripts/calculate_base_scores
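
To illustrate the formula, here is a minimal sketch of computing such a base score for one parsed entity, assuming the standard Wikidata entity JSON layout and a recent NodeJS; this is not the actual calculate_base_scores script:

const count = obj => Object.keys(obj || {}).length
// Aliases and qualifiers are grouped by language/property in entity JSON,
// so count the grouped values rather than the keys (an assumption about
// what "number of aliases/qualifiers" means here)
const countGrouped = obj => {
  return Object.values(obj || {}).reduce((sum, arr) => sum + arr.length, 0)
}

const baseScore = entity => {
  let statements = 0
  let qualifiers = 0
  let references = 0
  Object.values(entity.claims || {}).forEach(propClaims => {
    statements += propClaims.length
    propClaims.forEach(claim => {
      qualifiers += countGrouped(claim.qualifiers)
      references += (claim.references || []).length
    })
  })
  return count(entity.labels) +
    count(entity.descriptions) * 0.5 +
    countGrouped(entity.aliases) * 0.25 +
    statements * 2 +
    qualifiers +
    references +
    count(entity.sitelinks) * 4
}

// Tiny made-up entity: 2 labels, 1 description, 1 alias, 1 statement, 1 sitelink
const example = {
  labels: { en: {}, fr: {} },
  descriptions: { en: {} },
  aliases: { en: [ 'some alias' ] },
  claims: { P31: [ { qualifiers: {}, references: [] } ] },
  sitelinks: { enwiki: {} }
}
console.log(baseScore(example)) // 2 + 0.5 + 0.25 + 2 + 0 + 0 + 4 = 8.75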

Network scores

item network score = sum of the base scores of items linking to the item

./scripts/calculate_network_scores

Secondary network scores

item secondary network score = sum of the network scores of items linking to the item

./scripts/calculate_secondary_network_scores
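
Both link-based passes are the same operation applied to different inputs. Here is a hedged sketch; the inlinks map (item id -> ids of items linking to it) and the example scores are made-up illustrations, not the actual data structures used by the scripts:

// Sum, for each item, the previous-pass scores of the items linking to it
const sumOfInlinkScores = (inlinks, previousScores) => {
  const scores = {}
  Object.keys(inlinks).forEach(id => {
    scores[id] = inlinks[id].reduce((sum, linkingId) => {
      return sum + (previousScores[linkingId] || 0)
    }, 0)
  })
  return scores
}

const baseScores = { Q1: 10, Q2: 4, Q3: 7 }
const inlinks = { Q1: [ 'Q2', 'Q3' ], Q2: [ 'Q1' ], Q3: [] }

// Network scores: sum of the base scores of linking items
const networkScores = sumOfInlinkScores(inlinks, baseScores)
// => { Q1: 11, Q2: 10, Q3: 0 }
// Secondary network scores: the same pass, fed with the network scores
const secondaryNetworkScores = sumOfInlinkScores(inlinks, networkScores)
// => { Q1: 10, Q2: 11, Q3: 0 }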

Total scores

item total score = base score + network score * 0.25 + secondary network score * 0.1

./scripts/calculate_total_scores
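
For example, an item with a base score of 100, a network score of 200 and a secondary network score of 300 gets a total score of 100 + 200 * 0.25 + 300 * 0.1 = 180.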

All scores

You can alternatively calculate all those scores at once:

./scripts/calculate_all_scores dump.json

Deploy to Toolforge

See the Hub deploy doc, simply replacing hub with wd-rank, especially at step 4:

echo "module.exports = {
  host: 'https://tools.wmflabs.org',
  // Customize root to match the URL passed by Nginx
  root: '/wd-rank'
}" > config/local.js

install NodeJS with NVM

We can't access the Wikidata entities dump at /mnt/nfs/dumps-labstore1006.wikimedia.org/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz from the NodeJS webservice (see Phabricator ticket T193646), so a workaround is to install our own NodeJS using NVM:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
. $HOME/.nvm/nvm.sh
# Use the same version as `webservice --backend=kubernetes nodejs shell`
nvm install 6.11.0

run with custom NodeJS

npm operations still need to be done from the webservice shell, as I can't find a way to make the environment use the new node binary rather than /usr/bin/node:

webservice --backend=kubernetes nodejs shell
cd ~/www/js
npm install
exit
# Force the use of our custom node binary
sed -i 's@node "./scripts@~/.nvm/versions/node/v6.11.0/bin/node "./scripts@' ./scripts/calculate_all_scores
./scripts/calculate_all_scores

run as a daemon

cd
git clone https://github.com/AvianFlu/aeternum.git
cd aeternum
make
cd ~/www/js
~/aeternum/aeternum -o ./calculate_all_scores.log -e ./calculate_all_scores.err -- ./scripts/calculate_all_scores
# Follow the logs
tail -f ./calculate_all_scores*
