ChatterBot Corpus Specification

Gunther Cox edited this page Jan 6, 2017 · 7 revisions

Specification status

This document is a work in progress. The future format of ChatterBot's corpus data files is still being researched.

Design goals

The format for the ChatterBot dialog corpus should be flexible enough to accommodate new fields in the future while remaining backwards compatible with older corpus files that lack those fields.
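One way a reader of the corpus could honor this goal is sketched below in Python. It is only an illustration, not ChatterBot's actual loader: `KNOWN_DEFAULTS`, `read_statement`, and the `mood` field are hypothetical names introduced here. Missing optional fields fall back to a default, and fields the reader does not know about are carried through untouched.

```python
# Hypothetical reader-side helper; KNOWN_DEFAULTS and read_statement are
# illustrative names, not part of ChatterBot.
KNOWN_DEFAULTS = {"created_at": None}  # optional fields and their defaults


def read_statement(raw):
    """Return a statement dict that tolerates missing and unknown fields."""
    statement = dict(KNOWN_DEFAULTS)  # start from defaults, for older files
    statement.update(raw)             # keep every field the file provides
    return statement


# An older file that lacks the optional field still loads.
old_style = read_statement({"text": "Hello"})

# A newer file with a made-up future field ("mood") loads as well.
new_style = read_statement({"text": "Hello", "mood": "friendly"})
```

Because defaults are applied on read, older files never need to be rewritten when a field is added.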

File extension

  • Each corpus file should end with the .yml file extension.

Data representation schema

Serialized dialog data will be represented in YAML. YAML is designed to be easy to read, which may help authors avoid syntax errors and makes the structure of each conversation visible at a glance.

- - text: Hello, how are you doing today?
  - text: I am doing well, thank you.
- - text: I cannot find my keys.
  - text: Where was the last place you remember having them?
  • Each inner list is one conversation. Within a conversation, each statement is a response to the statement immediately before it.
  • Multiple responses to the same statement are represented by listing a separate conversation for each response. Each differing response is effectively a different conversation, so the data should respect that separation.
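The two rules above can be sketched in Python as follows. The nested lists mirror what a YAML parser would typically return for the sample; `response_pairs` and `responses_by_statement` are hypothetical helper names used only for illustration.

```python
from collections import defaultdict

# Nested lists of dicts, as a YAML parser would return for the sample above.
corpus = [
    [
        {"text": "Hello, how are you doing today?"},
        {"text": "I am doing well, thank you."},
    ],
    [
        {"text": "I cannot find my keys."},
        {"text": "Where was the last place you remember having them?"},
    ],
]


def response_pairs(conversations):
    """Yield (statement, response) text pairs from consecutive statements."""
    for conversation in conversations:
        for statement, response in zip(conversation, conversation[1:]):
            yield statement["text"], response["text"]


def responses_by_statement(conversations):
    """Group every known response under the statement it answers."""
    grouped = defaultdict(list)
    for statement, response in response_pairs(conversations):
        grouped[statement].append(response)
    return dict(grouped)


pairs = list(response_pairs(corpus))
```

Adding a second conversation that opens with the same statement but continues differently would simply add another entry to that statement's list in `responses_by_statement`.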

Required attributes

  • text: The text of each statement

Optional attributes

  • created_at: The date and time at which the statement was spoken.
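A minimal validation sketch for these attributes is shown below. It assumes that `created_at`, when given as a string, is in ISO 8601 form; the specification does not fix a timestamp format, so that choice and the `validate_statement` name are assumptions made for the example.

```python
from datetime import datetime


def validate_statement(raw):
    """Check the required field and parse the optional timestamp."""
    text = raw.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("each statement needs a non-empty 'text' attribute")

    created_at = raw.get("created_at")
    if isinstance(created_at, str):
        # Assumption: string timestamps are ISO 8601; the spec names no format.
        created_at = datetime.fromisoformat(created_at)

    return {"text": text, "created_at": created_at}


minimal = validate_statement({"text": "Hello"})
dated = validate_statement({"text": "Hello", "created_at": "2017-01-06T12:30:00"})
```

Some YAML parsers already convert timestamp-like values to `datetime` objects; such values pass through this sketch unchanged.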

Undecided features

This section lists features, ideas, and functionality that have yet to be decided for the structure of the ChatterBot corpus.

Support for wildcards

Other existing chat-bot language notations support features such as wildcards in statements. For example, a statement such as "My favorite color is {color}" could be filled in by any valid color.

  • Pros: It makes it easy for the developer to teach the chat bot valid formats for a particular statement.
  • Cons: Different conversations are possible depending on the color specified; the color isn't arbitrary. For example, if a person tells the bot: "My favorite color is red.", the bot may reply: "Did you know 27% of people favor the color red?". This response depends strictly on the color itself.
    • TODO: Consider whether language structure and response structure are distinct things that could be taught independently.
    • Grammars, learning to build a grammar
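To make the trade-off concrete, here is a rough Python sketch of how a wildcard statement could be matched. Nothing here is decided for the corpus format: the `{name}` syntax is taken from the example above, while `template_to_regex`, the `replies` table, and the fallback reply are hypothetical and exist only for illustration.

```python
import re


def template_to_regex(template):
    """Compile a "{name}" template into a regex with one named group per wildcard."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal text, matched exactly
        else:
            pattern += "(?P<%s>.+?)" % part   # wildcard, captured by name
    return re.compile(pattern, re.IGNORECASE)


favorite_color = template_to_regex("My favorite color is {color}")
match = favorite_color.fullmatch("My favorite color is red")
color = match.group("color") if match else None

# The "Cons" point: a good reply depends on the captured value, so the
# template alone does not determine the response.
replies = {"red": "Did you know 27% of people favor the color red?"}
reply = replies.get(color, "That is a nice color.")  # fallback text is made up
```

The template handles the language structure (which inputs are valid), while the lookup table handles the response structure, which matches the separation raised in the TODO item.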

References

  1. https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
  2. https://github.com/rkadlec/ubuntu-ranking-dataset-creator
  3. http://www.alicebot.org/aiml/aaa/
  4. https://github.com/bwilcox-1234/ChatScript