Data components

MongoDB collections

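The EC, ER, EL, and TTI columns mark which Nordlys components use each collection: entity catalog, entity retrieval, entity linking, and target type identification, respectively.

Name	Description	EC	ER	EL	TTI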
`dbpedia-2015-10`	DBpedia	+¹	+²		+³
`fb2dbp-2015-10`	Mapping from Freebase to DBpedia IDs	+⁴		+
`surface_forms_dbpedia`	Entity surface forms from DBpedia	+⁵		+⁶
`surface_forms_facc`	Entity surface forms from FACC	+⁷		+
`word2vec-googlenews`	Word2vec trained on Google News				+⁸

¹ for entity ID-based lookup and DBpedia2Freebase mapping functionalities
² only for building the Elastic entity index; not used later in the retrieval process
³ for entity-centric TTI method
⁴ for Freebase2DBpedia mapping functionality
⁵ for entity surface form lookup from DBpedia
⁶ for all EL methods other than "commonness"
⁷ for entity surface form lookup from FACC
⁸ for LTR TTI method

Elastic indices

Name	Description	ER	EL	TTI
`dbpedia_2015_10`	DBpedia index	+	+¹	+²
`dbpedia_2015_10_uri`	DBpedia URI-only index	+³
`dbpedia_2015_10_types`	DBpedia types index			+⁴

¹ for all EL methods other than "commonness"
² only for entity-centric TTI method
³ only for ELR entity ranking method
⁴ only for type-centric TTI method
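
To see which resources a given component depends on, the two tables above can be transcribed into a small lookup. The following is only an illustrative sketch: the dictionary layout and the `required_resources` helper are hypothetical and not part of Nordlys, and the footnote conditions (for example, "only for ELR entity ranking method") are not modeled, so the result is an upper bound on what a component may need.

```python
# Usage marks transcribed from the tables above.
# The data structure and helper name are assumptions made for illustration.
MONGO_COLLECTIONS = {
    "dbpedia-2015-10": {"EC", "ER", "TTI"},
    "fb2dbp-2015-10": {"EC", "EL"},
    "surface_forms_dbpedia": {"EC", "EL"},
    "surface_forms_facc": {"EC", "EL"},
    "word2vec-googlenews": {"TTI"},
}

ELASTIC_INDICES = {
    "dbpedia_2015_10": {"ER", "EL", "TTI"},
    "dbpedia_2015_10_uri": {"ER"},
    "dbpedia_2015_10_types": {"TTI"},
}


def required_resources(component):
    """Return (collections, indices) marked with '+' for the given component."""
    collections = sorted(
        name for name, used_by in MONGO_COLLECTIONS.items() if component in used_by
    )
    indices = sorted(
        name for name, used_by in ELASTIC_INDICES.items() if component in used_by
    )
    return collections, indices


if __name__ == "__main__":
    for component in ("EC", "ER", "EL", "TTI"):
        print(component, required_resources(component))
```

Because some resources are needed only by specific methods, check the footnotes before concluding that a collection or index must be loaded.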

Raw data sources

DBpedia

DBpedia is distributed, among other formats, as a set of .ttl.bz2 files. We use a selection of these .ttl files, as defined in data/config/dbpedia2mongo.config.json. You can download these files from the DBpedia website. We provide a minimal sample from DBpedia under data/dbpedia-2015-10-sample, which can be used for testing Nordlys on a local machine. Check data/raw-data/dbpedia-2015-10 for detailed information.
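
To inspect one of the .ttl.bz2 files before loading it, a file can be streamed without decompressing it to disk. The sketch below is not the Nordlys loader; it assumes each triple sits on a single line of the form `<subject> <predicate> object .`, and the helper name `iter_triples` and the path in the usage part are hypothetical. It is a simplification rather than a full Turtle parser.

```python
import bz2
import re

# Matches simple one-line triples: <subject> <predicate> object .
# Assumption: one triple per line; multi-line Turtle syntax is not handled.
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$")


def iter_triples(path):
    """Yield (subject, predicate, object) tuples from a bz2-compressed .ttl file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as ttl_file:
        for line in ttl_file:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            match = TRIPLE_RE.match(line)
            if match:
                yield match.groups()


if __name__ == "__main__":
    import itertools

    # Hypothetical path; point it at any file from the sample directory.
    sample_path = "path/to/some_file.ttl.bz2"
    for triple in itertools.islice(iter_triples(sample_path), 5):
        print(triple)
```

Subjects and predicates are returned without angle brackets, while the object is kept verbatim (a bracketed URI or a quoted literal), so the caller can tell the two kinds apart.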

FACC

The Freebase Annotations of the ClueWeb Corpora (FACC) is used for building the entity surface form dictionary. You can download the collection from its main website and further process it using our scripts. Alternatively, you can download the preprocessed data from our server. Check the README file under data/raw-data/facc for detailed information.

Word2Vec

We use Word2Vec vectors (300D) trained on the Google News corpus, which can be downloaded from its website. Check the README file under data/raw-data/word2vec for detailed information.
