CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info #1064

Merged
merged 2 commits into from Sep 1, 2015

Conversation

jacksonllee
Contributor

Fixes for _get_words in the class CHILDESCorpusReader

After running standard setup code such as the following:

>>> import nltk
>>> corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-NA-MOR/')
>>> from nltk.corpus.reader import CHILDESCorpusReader
>>> valian = CHILDESCorpusReader(corpus_root, "Valian/.*.xml")

The corpus reader does not seem to handle word tokens with multiple stems properly (note "Lastname's" below):

>>> valian.words('Valian/01a.xml')[:5]
['at', 'Parent', "Lastname's", 'house', 'with']
>>> valian.words('Valian/01a.xml', stem=True)[:5]
['at', 'Parent', 'Lastname', 's', 'house']
>>> valian.tagged_words('Valian/01a.xml')[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ("Lastname's", 'n:prop'), ('house', 'n'), ('with', 'prep')]
>>> valian.tagged_words('Valian/01a.xml', stem=True)[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ('Lastname', 'n:prop'), 's', ('house', 'n')]

Issues:

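  • When stem is True, each stem of a multi-stem word is returned as a separate list item, so the stemmed output no longer lines up with the word tokens and it is difficult or impossible to tell which stems belong to which word.
  • If a stem is encoded as a suffix in the XML data, its PoS tag is lost: in the tagged output above, 's' appears as a bare string rather than a (stem, tag) tuple.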
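
To illustrate the alignment problem in the first point, here is a minimal sketch that pairs the unstemmed and stemmed outputs quoted above. It uses the literal lists from the session rather than reading the corpus, and the variable names are only for illustration:

```python
# Outputs copied from the session above (before the fix).
words = ['at', 'Parent', "Lastname's", 'house', 'with']
stems_before_fix = ['at', 'Parent', 'Lastname', 's', 'house']

# Pairing by position goes wrong as soon as a word yields two stems:
# everything after "Lastname's" is shifted by one.
pairs = list(zip(words, stems_before_fix))
for word, stem in pairs:
    print(word, '->', stem)
```

From the fourth pair onward, each word is matched with a stem that belongs to the previous word.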

Results after the fixes are implemented:

>>> valian.words('Valian/01a.xml')[:5]
['at', 'Parent', "Lastname's", 'house', 'with']
>>> valian.words('Valian/01a.xml', stem=True)[:5]
['at', 'Parent', 'Lastname~s', 'house', 'with']
>>> valian.tagged_words('Valian/01a.xml')[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ("Lastname's", 'n:prop~poss'), ('house', 'n'), ('with', 'prep')]
>>> valian.tagged_words('Valian/01a.xml', stem=True)[:5]
[('at', 'prep'), ('Parent', 'n:prop'), ('Lastname~s', 'n:prop~poss'), ('house', 'n'), ('with', 'prep')]

(~ is used as the separator for multiple "stems" of a word, following the CHILDES convention.)
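
Since each token now carries all of its stems, downstream code that wants the individual morphemes can split on ~ itself. The following is a minimal sketch, not part of this pull request: split_morphemes is a hypothetical helper, and it assumes that a stemmed token and its tag contain the same number of ~-separated parts (otherwise it leaves the pair untouched).

```python
def split_morphemes(token, tag, sep="~"):
    """Split a '~'-joined (token, tag) pair into per-morpheme pairs.

    Hypothetical helper for illustration; falls back to the original
    pair when token and tag do not have the same number of parts
    (e.g. unstemmed output such as ("Lastname's", 'n:prop~poss')).
    """
    stems = token.split(sep)
    tags = tag.split(sep)
    if len(stems) != len(tags):
        return [(token, tag)]
    return list(zip(stems, tags))


# Output of tagged_words(..., stem=True) after the fix, copied from above.
tagged = [('at', 'prep'), ('Parent', 'n:prop'),
          ('Lastname~s', 'n:prop~poss'), ('house', 'n'), ('with', 'prep')]

# One inner list per source token, so alignment with words() is preserved.
aligned = [split_morphemes(tok, tag) for tok, tag in tagged]
print(aligned[2])
```

Keeping one inner list per token means the result still has the same length as the unstemmed word list.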

@jacksonllee jacksonllee changed the title multiple stems correspond to their source token; keep suffix PoS info CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info Aug 4, 2015
@stevenbird
Member

@tomonori-nagano or @alexisdimi – would you please review this?

@stevenbird stevenbird added this to the 3.0.5 milestone Aug 5, 2015
@stevenbird stevenbird self-assigned this Aug 5, 2015
@stevenbird
Member

These changes appear reasonable, and in the absence of any objection from @tomonori-nagano or @alexisdimi, I'm merging them.

stevenbird added a commit that referenced this pull request Sep 1, 2015
CHILDESCorpusReader: multiple stems correspond to their source token; keep suffix PoS info
@stevenbird stevenbird merged commit 7082b8b into nltk:develop Sep 1, 2015