

StoryLens is a multistream corpus created for evaluating Recognyze in the InVID project.


If you use this corpus in your evaluations, please cite the following paper (BibTeX):

        author = {Adrian M. P. Bra{\c{s}}oveanu and Lyndon J.B. Nixon and Albert Weichselbraun},
        title  = {StoryLens: A Multiple Views Corpus for Location and Event Detection},
        booktitle = {Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics (WIMS 2018)},
        address = {Novi Sad, Serbia},
        publisher = {ACM},
        year   = {2018},
        date   = {25-27 June 2018}


A multistream corpus contains content from different types of streams.

The current corpus contains annotations for the following stream types:

  • news - 100 documents
  • twitter - 200 documents
  • youtube - 100 documents

We may add more documents over time.


The YouTube, Twitter and newsmedia documents are not provided with this corpus due to copyright reasons.

The original documents can be retrieved by crawling their URLs. To make this possible for third parties, we provide lists of document IDs in the following folder: List. Here are the links to the individual lists:

The Twitter partition of the corpus contains only the annotations because of copyright restrictions, but the actual tweet texts can be downloaded by ID using freely available scripts (for example, Tweet Downloader).
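
As a rough illustration of retrieving documents by ID, the sketch below reads a list of IDs and turns them into URLs to crawl. It makes several assumptions that the repository does not confirm: that each list file holds one ID per line, that the function names `read_ids` and `build_urls` are purely illustrative, and that the public URL patterns shown for tweets and YouTube videos are suitable for your crawler. News documents are left out because their ID format is not described here.

```python
from pathlib import Path

# Assumed public URL patterns; these are not taken from the corpus itself.
URL_TEMPLATES = {
    "twitter": "https://twitter.com/i/web/status/{doc_id}",
    "youtube": "https://www.youtube.com/watch?v={doc_id}",
}


def read_ids(path):
    """Read document IDs from a text file, one per line (assumed format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines and strip surrounding whitespace.
    return [line.strip() for line in lines if line.strip()]


def build_urls(stream, doc_ids):
    """Build one URL per ID for a stream that has a known template."""
    template = URL_TEMPLATES[stream]
    return [template.format(doc_id=doc_id) for doc_id in doc_ids]
```

The resulting URLs can then be passed to any crawler or tweet downloader; the actual download step is deliberately not shown, since it depends on the tool and its terms of use.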


The focus is on location entities; therefore, all types of conflicts between locations and other entity types are included.

The annotations taken into account when building the gold standard files are the following:

  • Natural Location (LOC) - e.g., Danube River, Alps
  • Geo-Political Entity (GPE) - e.g., Vienna, Austria
  • Facility (FAC) - e.g., Brooklyn Bridge, Interstate 66
  • Person (PER) - e.g., Prince Charles, Donald Trump
  • Organization (ORG) - e.g., Google, Apple
  • Product (PROD) - e.g., iPhone, Samsung Galaxy 8
  • Work (WORK) - e.g., Mona Lisa, Star Trek
  • Event (EVENT) - e.g., 9/11, Grenfell Tower fire
  • misc (MISC) - any other type of entity

The ontology can be found here: Recognyze Ontology.
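
When loading the gold standard files into an evaluation script, the entity types listed above could be represented roughly as follows. This is a minimal sketch, not part of the corpus tooling: the `EntityType` enum, the `is_location` helper, and the grouping of LOC, GPE and FAC as "location-like" types are assumptions made for illustration, and the tag strings are assumed to match the abbreviations in the list.

```python
from enum import Enum


class EntityType(Enum):
    """Entity types from the list above, keyed by their abbreviation."""
    LOC = "Natural Location"
    GPE = "Geo-Political Entity"
    FAC = "Facility"
    PER = "Person"
    ORG = "Organization"
    PROD = "Product"
    WORK = "Work"
    EVENT = "Event"
    MISC = "misc"


# Assumption: LOC, GPE and FAC are treated as the location-like types.
LOCATION_TYPES = frozenset({EntityType.LOC, EntityType.GPE, EntityType.FAC})


def is_location(tag):
    """Return True if the tag abbreviation names a location-like type."""
    return EntityType[tag] in LOCATION_TYPES
```

A helper like this makes it easy to separate location annotations from the conflicting types (for example, a FAC versus an ORG reading of the same mention) when computing scores.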


The Annotation Guideline is based on TAC and ACE guidelines.

It can be found in the following folder: Guideline.


The Gold folder contains the judged results.

The links provided are based on the DBpedia Live version available during annotation (September to December 2017), which would correspond to DBpedia 2017-10 or 2018-04; link changes can therefore occur.

If you find any of the following error types, please feel free to contact us so that we can update the corpus:

  • New entities that were not annotated
  • Different possibilities to annotate various entities
  • New links (where no entity was found before or where NIL entities currently exist)


The Lenses folder contains some example lenses.

We currently provide:

  • Long - longest match for any entity
  • Embedded - includes embedded entities
  • (DBpediaLens - lens related to a certain DBpedia version (e.g., 2016-10 or 2016-04) - currently in preparation)
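
The difference between the Long and Embedded lenses can be sketched as a filter over annotation spans. The code below is only an illustration under stated assumptions: the `Span` structure, the character-offset convention, and the functions `long_lens` and `embedded_lens` are hypothetical and do not reflect the actual file format or tooling of the corpus. It assumes that "longest match" means dropping any span strictly contained in a longer one.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    surface: str  # annotated text
    tag: str      # entity type abbreviation, e.g. "FAC"


def long_lens(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not contained in a strictly longer span."""
    kept = []
    for s in spans:
        contained = any(
            o.start <= s.start and s.end <= o.end
            and (o.end - o.start) > (s.end - s.start)
            for o in spans
        )
        if not contained:
            kept.append(s)
    return kept


def embedded_lens(spans: List[Span]) -> List[Span]:
    """Keep all spans, including those nested inside longer ones."""
    return list(spans)


# Example: "Grenfell Tower" (FAC) is embedded in "Grenfell Tower fire" (EVENT).
example = [
    Span(0, 19, "Grenfell Tower fire", "EVENT"),
    Span(0, 14, "Grenfell Tower", "FAC"),
]
```

Under these assumptions, `long_lens(example)` keeps only the EVENT span, while `embedded_lens(example)` keeps both the event and the facility nested inside it.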

For future versions of the corpora we will also include:

  • events - arguably only named events (EVENT) such as Grenfell Tower Disaster
  • stories - the narratives focused around big events


Because the publication associated with this dataset is still under review and the DBpedia Live version used during annotation is not available as a dump, we reserve the right to change small parts of this dataset in the near future.

Example updates might include:

  • New entities - typically entities detected during evaluations or reported by third-party users
  • New Links - if available
  • New Lenses - if needed for a particular use case


To download the full tweets, please use any tweet downloader, for example Tweet Downloader.


If you need this corpus in formats other than the ones we provide, please contact us.


The official version is published on GitHub without the original documents for copyright reasons.

If you plan to use this corpus in an evaluation suite, please contact us.

If you discover errors in this dataset (e.g., missing annotations, wrong types, etc.), feel free to contact us and we will update it.

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)


A mixed document corpus for evaluating Named Entity Recognition and Linking (NER/NEL) systems.






