Programmatic access to collocation/concordance lists #2196

Closed
NQNStudios opened this issue Nov 30, 2018 · 2 comments

Comments

@NQNStudios
Contributor

The functions nltk.Text.collocations() and nltk.Text.concordance() print their output directly to the console, but I would like to access and manipulate those lists in code. I propose adding helper functions get_collocations() and get_concordances() that return lists of objects (ConcordanceInfo and CollocationInfo, or maybe just tuples) containing the raw information, and refactoring the existing methods to print their output from those helper results.

I'll open a PR soon when I start working on this.
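
To make the proposal concrete, here is a rough sketch of the "return the data, then print from it" split in plain Python. It does not use NLTK at all: ConcordanceInfo, get_concordances() and print_concordances() are hypothetical names taken from or modelled on the proposal above, and the `context` parameter is an assumption made for illustration.

```python
from collections import namedtuple

# Hypothetical record type; the name comes from the proposal, not from NLTK.
ConcordanceInfo = namedtuple("ConcordanceInfo", ["left", "query", "right", "offset"])


def get_concordances(tokens, word, context=3):
    """Return one ConcordanceInfo per case-insensitive match of `word` in `tokens`."""
    target = word.lower()
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            results.append(ConcordanceInfo(left, token, right, i))
    return results


def print_concordances(tokens, word, context=3):
    """Printing wrapper: only formats whatever get_concordances() returns."""
    for info in get_concordances(tokens, word, context):
        print(" ".join(info.left), info.query, " ".join(info.right))
```

With this shape, callers who want the data use get_concordances() directly, while the printing function stays a thin layer on top of it.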

@alvations
Contributor

@NQNStudios There's a Text.concordance_list() function from #1333 and #1910

For example:

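from nltk.corpus import gutenberg  # needs the Gutenberg corpus: nltk.download('gutenberg')
from nltk.text import Text

corpus = gutenberg.words('melville-moby_dick.txt')
text = Text(corpus)
# concordance_list() returns a list of ConcordanceLine objects instead of printing
con_list = text.concordance_list("monstrous")
print(con_list[0])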

[out]:

ConcordanceLine(left=['Whales', 'and', 'other', 'monsters', 'of', 'the', 'sea', ',', 'appeared', '.', 'Among', 'the', 'former', ',', 'one', 'was', 'of', 'a', 'most'], query='monstrous', right=['size', '.', '...', 'This', 'came', 'towards', 'us', ',', 'open', '-', 'mouthed', ',', 'raising', 'the', 'waves', 'on', 'all', 'sides'], offset=899, left_print='ong the former , one was of a most', right_print='size . ... This came towards us , ', line='ong the former , one was of a most monstrous size . ... This came towards us , ')

Each ConcordanceLine object contains the left and right context of the query word, the line in which the query word occurs, and the offset, which is the position of the query word in the text. The left_print and right_print fields hold the context strings that should be printed in the concordance.
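
As a small illustration of working with those fields in code rather than on the console, something like the following should do. The token list is made-up sample data, chosen so that no corpus download is needed; only Text.concordance_list() and the ConcordanceLine fields shown in the output above are used.

```python
from nltk.text import Text

# Small in-memory token list (illustrative data only, not from a real corpus).
tokens = "a most monstrous size was seen and a monstrous wave rose".split()
text = Text(tokens)

# concordance_list() returns ConcordanceLine namedtuples instead of printing.
con_list = text.concordance_list("monstrous")

# The fields can be read like any other namedtuple attributes.
offsets = [c.offset for c in con_list]
contexts = [(c.left, c.query, c.right) for c in con_list]

for c in con_list:
    print(c.offset, c.line)
```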


Collocations, on the other hand, rely heavily on the hard-coded BigramCollocationFinder, TrigramCollocationFinder and QuadgramCollocationFinder.

Example:

 finder = BigramCollocationFinder.from_words(tokens, window_size)
 finder.apply_freq_filter(2)
 finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
 bigram_measures = BigramAssocMeasures()
 collocations = finder.nbest(bigram_measures.likelihood_ratio, num)

So it might not be as easy as it seems.

If the proposal is for collocations in the Text object to have a structure similar to that of concordance_list, then I would suggest using Text.collocations_list as the function name.
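
For the simplest case, a list of word pairs with no context information, a standalone sketch could look like the one below. This is not an existing NLTK method: collocations_list() here is a hypothetical free function, and its parameters (`num`, `window_size`, `ignored_words`) are assumptions that mirror the names in the snippet above. It only handles bigrams.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures


def collocations_list(tokens, num=20, window_size=2, ignored_words=()):
    """Hypothetical helper: return the top `num` bigram collocations as (w1, w2) tuples."""
    ignored = {w.lower() for w in ignored_words}
    finder = BigramCollocationFinder.from_words(tokens, window_size)
    # Same filters as the snippet above: drop rare pairs, short words and ignored words.
    finder.apply_freq_filter(2)
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored)
    return finder.nbest(BigramAssocMeasures().likelihood_ratio, num)


# Made-up sample tokens; a real call would pass text.tokens and a stopword list.
tokens = "the sperm whale swam and the sperm whale dove while one old man sat".split()
pairs = collocations_list(tokens, ignored_words=["the"])
print(pairs)
```

Returning the plain tuples keeps the finder logic untouched; anything richer, such as per-pair context, would need changes to the finders themselves.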

@NQNStudios
Contributor Author

@alvations I think I see what you mean. Returning a list that simply contains the word pairs would be easy, but returning context info would require refactoring all three ngram finders in ways that might create side effects. Is that the gist of the problem?

For my purposes, a simple list of the word pairs would suffice. Would a PR that provides that in a collocations_list function be a good start?
