Natural Stories Corpus
This is a corpus of naturalistic stories meant to contain varied, low-frequency syntactic constructions. There are a variety of annotations and psycholinguistic measures available for the stories.
The stories in with their various annotations are coordinated around the file
words.tsv, which specifies a unique code for each token in the story under a variety of different tokenization schemes.
For example, the following lines in
words.tsv cover the phrase
the long-bearded mill owners.:
1.54.whole the 1.54.word the 1.54.1 the 1.55.whole long - bearded 1.55.word long - bearded 1.55.1 long 1.55.2 - 1.55.3 bearded 1.56.whole mill 1.56.word mill 1.56.1 mill 1.57.whole owners . 1.57.word owners 1.57.1 owners 1.57.2 .
The first column is the token code; the second is the token itself. For example,
1.57.whole represents the token
1.57.word represents the token
The token code consists of three fields:
- The id of the story the token is found in,
- The number of the token in the story,
- An additional field whose value is
wholefor the entire token including punctuation,
wordfor the token stripped of punctuation to the left and right, and then 1 through n for each sub-token in
wholeas segmented by NLTK's TreebankWordTokenizer.
The various annotations (frequencies, parses, RTs, etc.) should reference these codes so that we can track tokens uniformly.