Programmatic access to collocation/concordance lists #2196

NQNStudios · 2018-11-30T18:34:26Z

The methods nltk.Text.collocations() and nltk.Text.concordance() print their output directly to the console, but I would like to be able to access and manipulate those lists through code. I propose adding helper functions get_collocations() and get_concordances() that return lists of objects (ConcordanceInfo and CollocationInfo, or maybe just tuples) containing the raw information, and refactoring the existing methods to print their output from those helper results.

I'll open a PR soon when I start working on this.

alvations · 2018-12-17T16:09:18Z

@NQNStudios There's a Text.concordance_list() function from #1333 and #1910

For example:

from nltk.corpus import gutenberg
from nltk.text import Text

corpus = gutenberg.words('melville-moby_dick.txt')
text = Text(corpus)
con_list = text.concordance_list("monstrous")
print(con_list[0])  # show the first ConcordanceLine

[out]:

ConcordanceLine(left=['Whales', 'and', 'other', 'monsters', 'of', 'the', 'sea', ',', 'appeared', '.', 'Among', 'the', 'former', ',', 'one', 'was', 'of', 'a', 'most'], query='monstrous', right=['size', '.', '...', 'This', 'came', 'towards', 'us', ',', 'open', '-', 'mouthed', ',', 'raising', 'the', 'waves', 'on', 'all', 'sides'], offset=899, left_print='ong the former , one was of a most', right_print='size . ... This came towards us , ', line='ong the former , one was of a most monstrous size . ... This came towards us , ')

Each ConcordanceLine object contains the left and right context of the query word, the query word itself, the formatted line in which it occurs, and the offset, which is the token position of the query word in the text. The left_print and right_print fields hold the truncated context strings that are printed in the concordance.
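
As a rough illustration of working with these fields in code, here is a minimal sketch. It assumes a short made-up token list instead of the Gutenberg corpus used above, so that no corpus download is needed, and it only reads fields that appear in the output shown earlier.

```python
from nltk.text import Text

# Made-up sample tokens, used only so the sketch runs without corpus data.
tokens = "The whale was a monstrous size and a monstrous shape".split()
text = Text(tokens)

# concordance_list() returns ConcordanceLine objects instead of printing.
lines = text.concordance_list("monstrous")

# Collect (offset, left context, right context) for each hit.
hits = [(line.offset, line.left, line.right) for line in lines]
offsets = [offset for offset, _, _ in hits]

for offset, left, right in hits:
    print(offset, " ".join(left), "|", " ".join(right))
```

Because the result is an ordinary list, it can be filtered, sorted or serialized like any other Python data.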

Collocations, on the other hand, rely heavily on the hard-coded BigramCollocationFinder, TrigramCollocationFinder and QuadgramCollocationFinder.

Example:

finder = BigramCollocationFinder.from_words(tokens, window_size)
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
bigram_measures = BigramAssocMeasures()
collocations = finder.nbest(bigram_measures.likelihood_ratio, num)

So it might not be as easy as it seems.

If the proposal is for collocations in the Text object to have a structure similar to that of concordance_list, then I would suggest using Text.collocations_list as the function name.

NQNStudios · 2019-01-10T16:24:39Z

@alvations I think I see what you mean. Returning a list that simply contains the word pairs would be easy, but returning context info would require refactoring all three ngram finders in ways that might create side effects. Is that the gist of the problem?

For my purposes, a simple list of the word pairs would suffice. Would a PR that provides that in a collocations_list function be a good start?
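
For what it's worth, a plain list of word pairs along those lines could be sketched as below. The helper name collocation_pairs, its parameters and the sample tokens are hypothetical, chosen only for illustration. The body follows the finder calls quoted earlier in this thread and is not the code from any PR.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures


def collocation_pairs(tokens, num=20, window_size=2, ignored_words=()):
    """Return up to `num` bigram collocations as (w1, w2) tuples.

    Hypothetical helper: the name and signature are assumptions.
    """
    ignored = {w.lower() for w in ignored_words}
    finder = BigramCollocationFinder.from_words(tokens, window_size)
    # Drop pairs seen fewer than twice.
    finder.apply_freq_filter(2)
    # Drop pairs containing very short or ignored words.
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored)
    # Rank the remaining pairs by likelihood ratio and keep the best `num`.
    return finder.nbest(BigramAssocMeasures.likelihood_ratio, num)


# Made-up sample tokens; a real call would pass a corpus token list.
sample = "new york is big . new york is loud . paris is calm .".split()
pairs = collocation_pairs(sample, num=5, ignored_words={"the"})
print(pairs)
```

Returning the pairs rather than printing them leaves formatting to the caller, which is the separation the original proposal asks for.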

alvations added corpus text labels Dec 17, 2018

NQNStudios mentioned this issue Feb 7, 2019

Updating collocations.py docs #2226

Closed

NQNStudios added a commit to NQNStudios/nltk that referenced this issue Feb 7, 2019

Close nltk#2196 by refactoring Text.collocations()

f81ccb2

stevenbird closed this as completed in 4115ed1 Apr 17, 2019
