Data components

MongoDB collections

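A "+" marks each component that uses the collection: entity catalog (EC), entity retrieval (ER), entity linking (EL), and target type identification (TTI). A "-" means the collection is not used. Numbers refer to the notes below the table.

Name                   Description                           EC   ER   EL   TTI
dbpedia-2015-10        DBpedia                               +1   +2   -    +3
fb2dbp-2015-10         Mapping from Freebase to DBpedia IDs  +4   -    +    -
surface_forms_dbpedia  Entity surface forms from DBpedia     +5   -    +6   -
surface_forms_facc     Entity surface forms from FACC        +7   -    +    -
word2vec-googlenews    Word2vec trained on Google News       -    -    -    +8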
  • 1 for entity ID-based lookup and DBpedia2Freebase mapping functionalities
  • 2 only for building the Elastic entity index; not used later in the retrieval process
  • 3 for entity-centric TTI method
  • 4 for Freebase2DBpedia mapping functionality
  • 5 for entity surface form lookup from DBpedia
  • 6 for all EL methods other than "commonness"
  • 7 for entity surface form lookup from FACC
  • 8 for LTR TTI method

Elastic indices

Name                   Description             ER   EL   TTI
dbpedia_2015_10        DBpedia index           +    +1   +2
dbpedia_2015_10_uri    DBpedia URI-only index  +3   -    -
dbpedia_2015_10_types  DBpedia types index     -    -    +4
  • 1 for all EL methods other than "commonness"
  • 2 only for entity-centric TTI method
  • 3 only for ELR entity ranking method
  • 4 only for type-centric TTI method
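
To check which data components a deployment needs, the two tables above can be transcribed into a small lookup. This is only a sketch: the dictionary and function names are hypothetical and not part of Nordlys, and it assumes that the unnumbered "+" marks for fb2dbp-2015-10 and surface_forms_facc belong to the EL column.

```python
# Data components per service, transcribed from the two tables above.
# Assumption: the unnumbered "+" marks for fb2dbp-2015-10 and
# surface_forms_facc sit in the EL column.
MONGO_COLLECTIONS = {
    "EC": ["dbpedia-2015-10", "fb2dbp-2015-10",
           "surface_forms_dbpedia", "surface_forms_facc"],
    "ER": ["dbpedia-2015-10"],  # only while building the Elastic entity index
    "EL": ["fb2dbp-2015-10", "surface_forms_dbpedia", "surface_forms_facc"],
    "TTI": ["dbpedia-2015-10", "word2vec-googlenews"],
}

ELASTIC_INDICES = {
    "EC": [],
    "ER": ["dbpedia_2015_10", "dbpedia_2015_10_uri"],
    "EL": ["dbpedia_2015_10"],
    "TTI": ["dbpedia_2015_10", "dbpedia_2015_10_types"],
}


def required_components(services):
    """Return (collections, indices) needed by the given services, sorted."""
    collections, indices = set(), set()
    for service in services:
        collections.update(MONGO_COLLECTIONS[service])
        indices.update(ELASTIC_INDICES[service])
    return sorted(collections), sorted(indices)
```

For example, `required_components(["EL"])` lists the three MongoDB collections and the single Elastic index that entity linking relies on. The lookup does not capture the method-specific notes, such as dbpedia_2015_10_uri being needed only for the ELR method.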

Raw data sources

DBpedia

DBpedia is distributed, among other formats, as a set of .ttl.bz2 files. We use a selection of these files, as defined in data/config/dbpedia2mongo.config.json. You can download them from the DBpedia website. We provide a minimal DBpedia sample under data/dbpedia-2015-10-sample, which can be used for testing Nordlys on a local machine. Check data/raw-data/dbpedia-2015-10 for detailed information.
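
To inspect one of the .ttl.bz2 files before loading it, a minimal sketch like the following can help. It assumes the file holds one triple per line in the form `<subject> <predicate> object .`, and it is not the loader that Nordlys itself uses; `iter_triples` is a hypothetical helper name.

```python
import bz2
import re

# One triple per line: <subject> <predicate> object .
# Assumption: no triple spans multiple lines and no prefixes are used.
TRIPLE_RE = re.compile(r"^(<[^>]*>)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$")


def iter_triples(path):
    """Yield (subject, predicate, object) strings from a .ttl.bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            match = TRIPLE_RE.match(line)
            if match:
                yield match.groups()
```

The generator reads the compressed file as a stream, so even large dump files can be scanned without decompressing them to disk first.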

FACC

The Freebase Annotations of the ClueWeb Corpora (FACC) collection is used to build the entity surface form dictionary. You can download the collection from its main website and process it further using our scripts. Alternatively, you can download the preprocessed data from our server. Check the README file under data/raw-data/facc for detailed information.
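
As an illustration of what a surface form dictionary holds, the sketch below counts how often each surface form is annotated with each entity and derives a commonness score as the share of a surface form's annotations that point to a given entity. It is not the Nordlys processing script: the input is assumed to be already reduced to (surface form, Freebase ID) pairs, the lowercasing is an assumption, and the function names and IDs are hypothetical.

```python
from collections import Counter, defaultdict


def build_surface_forms(pairs):
    """Map each lowercased surface form to a Counter of entity IDs."""
    dictionary = defaultdict(Counter)
    for surface_form, entity_id in pairs:
        dictionary[surface_form.lower()][entity_id] += 1
    return dictionary


def commonness(dictionary, surface_form, entity_id):
    """Fraction of a surface form's annotations that point to entity_id."""
    counts = dictionary.get(surface_form.lower())
    if not counts:
        return 0.0
    return counts[entity_id] / sum(counts.values())


# Made-up placeholder IDs, not real Freebase identifiers.
annotations = [
    ("Paris", "/m/aaa"),
    ("paris", "/m/aaa"),
    ("Paris", "/m/bbb"),
    ("Oslo", "/m/ccc"),
]
surface_forms = build_surface_forms(annotations)
```

Here `commonness(surface_forms, "Paris", "/m/aaa")` is two out of three annotations, and an unseen surface form scores 0.0.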

Word2Vec

We use 300-dimensional Word2Vec vectors trained on the Google News corpus, which can be downloaded from its website. Check the README file under data/raw-data/word2vec for detailed information.
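
For a quick look at the vectors without extra dependencies, the sketch below reads the word2vec binary format using only the standard library. It assumes the downloaded file has been decompressed and follows the usual layout: a text header with vocabulary size and dimensionality, then each word followed by a space and its little-endian float32 values. `read_word2vec_bin` is a hypothetical helper, not part of Nordlys.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Read up to `limit` vectors from a word2vec binary file into a dict."""
    vectors = {}
    with open(path, "rb") as handle:
        # Header line: "<vocab_size> <dim>\n"
        vocab_size, dim = map(int, handle.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            chars = []
            while True:
                ch = handle.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that ends the previous entry
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # dim little-endian float32 values follow the word.
            vectors[word] = struct.unpack("<%df" % dim, handle.read(4 * dim))
    return vectors
```

Passing a small `limit` keeps memory use low, since the full model holds one 300-dimensional vector per vocabulary entry.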