| Collection                           | EC | ER | EL | TTI |
|--------------------------------------|----|----|----|-----|
| Mapping from Freebase to DBpedia IDs | +4 |    | +  |     |
| Entity surface forms from DBpedia    | +5 |    | +6 |     |
| Entity surface forms from FACC       | +7 |    | +  |     |
| Word2vec trained on Google News      |    |    |    | +8  |
- 1 for entity ID-based lookup and DBpedia2Freebase mapping functionalities
- 2 only for building the Elastic entity index; not used later in the retrieval process
- 3 for entity-centric TTI method
- 4 for Freebase2DBpedia mapping functionality
- 5 for entity surface form lookup from DBpedia
- 6 for all EL methods other than "commonness"
- 7 for entity surface form lookup from FACC
- 8 for LTR TTI method
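As an illustration of how these MongoDB collections are accessed, the sketch below resolves a Freebase ID to a DBpedia ID with pymongo. The database, collection, and field names (`nordlys`, `fb2dbp`, `dbp_id`) are assumptions made for illustration, not the actual Nordlys schema; replace them with the names defined in your configuration.

```python
# Minimal sketch, assuming hypothetical database/collection/field names;
# adjust them to match your local Nordlys setup.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
coll = client["nordlys"]["fb2dbp"]  # hypothetical: Freebase-to-DBpedia mapping

doc = coll.find_one({"_id": "/m/02mjmr"})  # Freebase ID used as key (assumed)
if doc:
    print(doc.get("dbp_id"))  # hypothetical field holding the DBpedia ID
```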
| Index                  | ER | EL | TTI |
|------------------------|----|----|-----|
| DBpedia URI-only index | +3 |    |     |
| DBpedia types index    |    |    | +4  |
- 1 for all EL methods other than "commonness"
- 2 only for entity-centric TTI method
- 3 only for ELR entity ranking method
- 4 only for type-centric TTI method
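The Elasticsearch indices above are queried with ordinary search requests. The sketch below runs a match query against a types index, roughly what a type-centric TTI scoring step looks like; the index name `dbpedia_types` and field name `content` are assumptions, and the body-style request targets the older elasticsearch-py clients.

```python
# Minimal sketch: a match query against a types index, assuming the
# hypothetical index name "dbpedia_types" and field name "content".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
res = es.search(
    index="dbpedia_types",
    body={"query": {"match": {"content": "action films set in space"}}},
    size=5,
)
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```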
Raw data sources
DBpedia is distributed, among other formats, as a set of .ttl.bz2 files. We use a selection of these .ttl files, as defined in data/config/dbpedia2mongo.config.json. You can download these files from the DBpedia website. We provide a minimal sample from DBpedia under data/dbpedia-2015-10-sample, which can be used for testing Nordlys on a local machine. See data/raw-data/dbpedia-2015-10 for detailed information.
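As a rough illustration, these dumps can be streamed line by line without decompressing them to disk; labels_en.ttl.bz2 is one example file from the 2015-10 release.

```python
# Minimal sketch: stream triples from a bzip2-compressed Turtle dump.
# "labels_en.ttl.bz2" is one example file from the DBpedia 2015-10 release.
import bz2

with bz2.open("labels_en.ttl.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):   # skip Turtle comments
            continue
        print(line.rstrip())       # <subject> <predicate> object .
        break                      # show only the first triple in this sketch
```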
The Freebase Annotations of the ClueWeb Corpora (FACC) is used for building the entity surface form dictionary. You can download the collection from its main website and further process it using our scripts. Alternatively, you can download the preprocessed data from our server. See the README file under data/raw-data/facc for detailed information.
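To give an idea of the processing involved, the sketch below aggregates (surface form, Freebase entity) counts from a FACC file, assuming the public tab-separated FACC1 format with the mention in the third field and the Freebase ID in the last; the input file name is only an example. Counts like these are what the "commonness" EL method is computed from.

```python
# Minimal sketch: build surface-form statistics from a FACC1 file, assuming
# the tab-separated format (mention = 3rd field, Freebase ID = last field).
# The input file name is an example, not a file shipped with Nordlys.
from collections import Counter, defaultdict

surface_forms = defaultdict(Counter)

with open("facc1-annotations.tsv", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        mention, entity = fields[2].lower(), fields[-1]
        surface_forms[mention][entity] += 1

# Commonness of entity e for mention m: count(m, e) / count(m).
for mention, counts in surface_forms.items():
    total = sum(counts.values())
    entity, n = counts.most_common(1)[0]
    print(f"{mention}\t{entity}\t{n / total:.3f}")
```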
We use 300-dimensional Word2vec vectors trained on the Google News corpus, which can be downloaded from its website. See the README file under data/raw-data/word2vec for detailed information.
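These vectors are typically loaded with gensim, as in the sketch below; the file name is the standard distribution name of the pretrained model.

```python
# Minimal sketch: load the pretrained 300D Google News vectors with gensim.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)
print(vectors["film"].shape)                # -> (300,)
print(vectors.similarity("film", "movie"))  # cosine similarity of two words
```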