CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info #1064

jacksonllee · 2015-08-04T00:06:56Z

Fixes for `_get_words` in the class `CHILDESCorpusReader`

Set up the corpus reader in the usual way:

>>> import nltk
>>> corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-NA-MOR/')
>>> from nltk.corpus.reader import CHILDESCorpusReader
>>> valian = CHILDESCorpusReader(corpus_root, "Valian/.*.xml")

The corpus reader does not seem to handle word tokens with multiple stems properly (note "Lastname's" below):

>>> valian.words('Valian/01a.xml')[:5]
['at', 'Parent', "Lastname's", 'house', 'with']
>>> valian.words('Valian/01a.xml', stem=True)[:5]
['at', 'Parent', 'Lastname', 's', 'house']
>>> valian.tagged_words('Valian/01a.xml')[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ("Lastname's", 'n:prop'), ('house', 'n'), ('with', 'prep')]
>>> valian.tagged_words('Valian/01a.xml', stem=True)[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ('Lastname', 'n:prop'), 's', ('house', 'n')]

Issues:

When `stem=True`, the stems of a single word are returned as separate strings, so it is difficult or impossible to tell which stems belong to which word.
If a stem is encoded as a suffix in the XML data, its PoS tag is lost (see the untagged `'s'` in the last output above).
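
The alignment problem can be seen by pairing the two `words` outputs shown above position by position. This is only an illustration built from the outputs quoted in this description, not code from the patch; the `startswith` check is a crude stand-in for "this stem belongs to this word".

```python
# Outputs copied from the session above (before the fix).
words = ['at', 'Parent', "Lastname's", 'house', 'with']
stems = ['at', 'Parent', 'Lastname', 's', 'house']

# Pair tokens with stems by position. The extra 's' entry shifts
# every stem after it one slot to the right.
pairs = list(zip(words, stems))

# Crude check (an assumption for illustration): a stem should be a
# prefix of the word it was derived from.
misaligned = [(w, s) for w, s in pairs if not w.startswith(s)]
print(misaligned)
```

Every pair after "Lastname's" ends up matched with the wrong stem.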

Results after the fixes are implemented:

>>> valian.words('Valian/01a.xml')[:5]
['at', 'Parent', "Lastname's", 'house', 'with']
>>> valian.words('Valian/01a.xml', stem=True)[:5]
['at', 'Parent', 'Lastname~s', 'house', 'with']
>>> valian.tagged_words('Valian/01a.xml')[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ("Lastname's", 'n:prop~poss'), ('house', 'n'), ('with', 'prep')]
>>> valian.tagged_words('Valian/01a.xml', stem=True)[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ('Lastname~s', 'n:prop~poss'), ('house', 'n'), ('with', 'prep')]

(~ is used as the separator for multiple "stems" of a word, following the CHILDES convention.)
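
A minimal sketch of the gluing idea, assuming the stem and suffix pieces of each word have already been collected as `(stem, pos)` pairs. The `glue_stems` helper and the `analysed` structure are hypothetical names for illustration and are not the actual code changed in `_get_words`.

```python
SEP = "~"  # separator for multiple "stems" of one word, per the note above


def glue_stems(parts):
    """Join the (stem, pos) pieces of ONE word token into a single pair.

    `parts` is a hypothetical intermediate structure, not NLTK's own.
    """
    stems = SEP.join(stem for stem, _ in parts)
    tags = SEP.join(pos for _, pos in parts)
    return stems, tags


# One inner list per source word; "Lastname's" has a stem and a suffix.
analysed = [
    [("at", "prep")],
    [("Parent", "n:prop")],
    [("Lastname", "n:prop"), ("s", "poss")],
    [("house", "n")],
    [("with", "prep")],
]

tagged = [glue_stems(parts) for parts in analysed]
print(tagged)
```

Because each source word yields exactly one output item, the stemmed list has the same length as the unstemmed one and the two can be aligned by position.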

stevenbird · 2015-08-05T18:37:55Z

@tomonori-nagano or @alexisdimi – would you please review this?

stevenbird · 2015-09-01T03:55:18Z

These changes appear reasonable, and in the absence of any objection from @tomonori-nagano or @alexisdimi, I'm merging them.

jacksonllee added 2 commits August 3, 2015 12:56

stems from the same word now treated as one word

f676b7e

glue multiple stems for a word; salvage the lost PoS info of suffixes

d69cb90

jacksonllee changed the title ~~multiple stems correspond to their source token; keep suffix PoS info~~ CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info Aug 4, 2015

stevenbird added this to the 3.0.5 milestone Aug 5, 2015

stevenbird self-assigned this Aug 5, 2015

stevenbird added a commit that referenced this pull request Sep 1, 2015

Merge pull request #1064 from JacksonLLee/develop

7082b8b

CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info

stevenbird merged commit 7082b8b into nltk:develop Sep 1, 2015
