The Context Tracking Dataset

Overview

The Context Tracking dataset consists of annotated human-human social conversations. Each conversation contains annotations for the people and location entities mentioned, their properties and the relationships between them. The annotated data enables several subtasks like slot tagging, coreference resolution, resolving plural mentions and entity linking. In addition, the turn-by-turn annotations also enable the development of computationally efficient models which can handle long conversations. The conversations cover a wide range of topics mentioning multiple people and locations. The dataset contains unseen entity names and topics in the test set to quantify the generalization capability of various modeling approaches.

The datasets are provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of this dataset.

Important Links

Data

The Context Tracking dataset consists of human-human conversations with annotations for people and location entities along with additional data for each entity. These conversations are meant to represent typical interactions between two people on instant messaging platforms such as Google Messages, iMessage and WhatsApp. For each turn or message, the person sending the message is referred to as the "sender". The conversations and annotations were created by hundreds of paid crowd-workers, followed by a round of verification and filtering. Each conversation is seeded by a scenario describing the overall flow of the conversation. Scenarios are not repeated across the train and test sets. The data is provided as text files and can be processed using the data processing code here, which also contains an implementation of a baseline end-to-end model for this task. Additional information about the dataset, the data collection process and quality control details can be found in the paper.

Conversation Representation

Each conversation contains metadata comprising the conversation ID, the names of the participants and the initial annotation. The metadata is followed by a sequence of turns. To mimic natural conversations, the sequence of turns does not always strictly alternate between the two participants. Each turn consists of 3 parts separated by the | character. The first part represents the sender/speaker of the turn. The second part represents the text content of the turn. For data processing convenience, all characters other than alphanumeric characters and spaces have been removed. The third part contains the annotations described in the Annotation Format section. Conversations are separated by a blank line.
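The turn layout described above can be read with a few lines of Python. This is a minimal sketch, not the official data processing code: the names `Turn`, `parse_turn` and `split_conversations` are hypothetical, the sample line in the usage note is made up, and the contents of each bracketed annotation are kept as raw strings because their internal layout is not assumed here. Metadata lines are left unparsed for the same reason.

```python
import re
from typing import List, NamedTuple


class Turn(NamedTuple):
    """One conversation turn (hypothetical container, not part of the dataset code)."""
    sender: str
    text: str
    annotations: List[str]  # raw contents of each [...] group, left unparsed


# Matches the text between one pair of square brackets.
_BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def parse_turn(line: str) -> Turn:
    """Split a turn line into sender, text and raw annotation strings."""
    # maxsplit=2 keeps any further '|' characters inside the annotation part.
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated parts, got {len(parts)}")
    sender, text, raw_annotations = (part.strip() for part in parts)
    return Turn(sender, text, _BRACKETED.findall(raw_annotations))


def split_conversations(raw: str) -> List[List[str]]:
    """Split file contents into conversations, using blank lines as separators."""
    blocks = re.split(r"\n\s*\n", raw.strip())
    return [block.splitlines() for block in blocks if block.strip()]
```

For example, `parse_turn("alice|see you at the park|[first][second]")` yields a `Turn` with sender `alice`, the text `see you at the park` and two raw annotation strings. Each list returned by `split_conversations` still starts with the metadata lines of that conversation, which the caller has to handle separately.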

Annotation Format

As mentioned in the Conversation Representation section, the third part of each turn contains the annotations. The annotations contain information related to the people and location entities mentioned in the conversation. The annotation data for each entity reference is enclosed by square brackets []. Each annotation consists of 3 parts:

The name or unique identifier of the entity in the conversation.
The properties of the entity that can be inferred from the conversation.
The span of tokens representing each entity reference.

For entities representing people, the properties annotated include the grammatical gender, the plurality (a plural entity typically refers to a group of people) and, in the case of plural entities, the constituent singular entities, if they have been mentioned so far in the conversation. For location entities, the properties indicate the plurality of the entity and the constituent singular entities for a plural entity. The span is represented as start_token_index-end_token_index, where the token indices are 0-based indices into the tokens of the conversation turn. Both entries of the span are inclusive.
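Because both ends of a span are inclusive, converting a span to a Python slice needs an end index of end_token_index + 1. The sketch below illustrates this under two assumptions: tokens are obtained by splitting the turn text on whitespace, and the span string has exactly the start_token_index-end_token_index form described above. The function names `parse_span` and `span_tokens` are hypothetical.

```python
from typing import List, Tuple


def parse_span(span: str) -> Tuple[int, int]:
    """Parse 'start-end' into a pair of 0-based, inclusive token indices."""
    start_str, end_str = span.split("-")
    start, end = int(start_str), int(end_str)
    if start < 0 or end < start:
        raise ValueError(f"invalid span: {span!r}")
    return start, end


def span_tokens(turn_text: str, span: str) -> List[str]:
    """Return the tokens of a turn covered by an inclusive span."""
    # Assumption: whitespace tokenization of the turn text.
    tokens = turn_text.split()
    start, end = parse_span(span)
    if end >= len(tokens):
        raise ValueError(f"span {span!r} exceeds {len(tokens)} tokens")
    # The end index is inclusive, so the slice stops at end + 1.
    return tokens[start:end + 1]
```

With the made-up turn text `see you at the park tomorrow`, the span `3-4` selects the tokens `the` and `park`, and the span `4-4` selects the single token `park`.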

License

The Context Tracking dataset is released under the CC BY-SA 4.0 license. For the full license, see LICENSE. Please cite the following paper if you use the dataset in your work:

@article{DBLP:journals/corr/abs-2201-12409,
  author    = {Ulrich R{\"{u}}ckert and
               Srinivas Sunkara and
               Abhinav Rastogi and
               Sushant Prakash and
               Pranav Khaitan},
  title     = {A Unified Approach to Entity-Centric Context Tracking in Social Conversations},
  journal   = {CoRR},
  volume    = {abs/2201.12409},
  year      = {2022},
  url       = {https://arxiv.org/abs/2201.12409},
  eprinttype = {arXiv},
  eprint    = {2201.12409},
  timestamp = {Wed, 02 Feb 2022 15:00:01 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2201-12409.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	Context Tracking dataset
alternateName	Contrack dataset
url	https://github.com/google-research-datasets/context-tracking
description	The dataset consists of annotated human-human conversations in various social settings. Along with the conversations, the dataset contains annotations for people and location entities present in the conversation along with the properties of those entities and their relationships. The annotated data enables several subtasks like slot tagging, coreference resolution, resolving plural mentions and entity linking.
provider	name: Google; sameAs: https://en.wikipedia.org/wiki/Google
citation	https://identifiers.org/arxiv:2201.12409
