Programmatic access to collocation/concordance lists #2196
Comments
@NQNStudios There's a For example:
from nltk.corpus import gutenberg
from nltk.text import Text
corpus = gutenberg.words('melville-moby_dick.txt')
text = Text(corpus)
con_list = text.concordance_list("monstrous")
[out]:
Each
Collocations, on the other hand, relies heavily on hard-coded
Example:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
# tokens, window_size, ignored_words and num are supplied by the surrounding code
finder = BigramCollocationFinder.from_words(tokens, window_size)
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
bigram_measures = BigramAssocMeasures()
collocations = finder.nbest(bigram_measures.likelihood_ratio, num)
So it might not be as easy as it seems. If the proposal is for
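A minimal sketch of how the finder calls above could be wrapped so that the result is returned rather than printed. The function name `get_collocations` and its parameters are hypothetical (this is not presented as NLTK's API), and the short token list only stands in for a real tokenized corpus.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures


def get_collocations(tokens, num=20, window_size=2, ignored_words=()):
    """Return up to `num` bigram collocations as a list of (w1, w2) tuples.

    Hypothetical helper: same steps as the snippet above, but the
    settings are passed in as parameters instead of being hard-coded.
    """
    ignored = {w.lower() for w in ignored_words}
    finder = BigramCollocationFinder.from_words(tokens, window_size)
    finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored)
    return finder.nbest(BigramAssocMeasures().likelihood_ratio, num)


# Toy input standing in for a tokenized corpus.
tokens = (
    "sperm whale saw the sperm whale and the white whale near a white whale"
).split()
pairs = get_collocations(tokens, num=5, ignored_words=["the", "and"])
for w1, w2 in pairs:
    print(w1, w2)
```

Because the function returns a plain list of tuples, callers can filter, sort, or store the pairs themselves. Context information for each pair is not covered by this sketch.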
@alvations I think I see what you mean. Returning a list that simply contains the word pairs would be easy, but returning context info would require refactoring all three ngram finders in ways that might create side effects. Is that the gist of the problem? For my purposes, a simple list of the word pairs would suffice. Would a PR that provides that in a
The functions nltk.Text.collocations() and nltk.Text.concordance() print their output directly to the console, but I would like to be able to access and manipulate those lists through code. I'd propose adding helper functions get_collocations() and get_concordances() that return lists of objects (ConcordanceInfo and CollocationInfo, or maybe just tuples) that contain raw information, and refactoring the existing methods to print output based on those helper results.
I'll open a PR soon when I start working on this.
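A minimal sketch of the proposed split, using a hypothetical `MiniText` class rather than NLTK's real `Text`: `get_concordances()` returns raw records and `concordance()` only formats them. `ConcordanceInfo` and its field names are illustrative guesses at what such an object might hold, not an existing NLTK type.

```python
from collections import namedtuple

# Hypothetical record type; the field names are an assumption.
ConcordanceInfo = namedtuple("ConcordanceInfo", ["left", "word", "right", "offset"])


class MiniText:
    """Toy stand-in for nltk.Text, used only to show the refactoring."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def get_concordances(self, word, context=3):
        """Return one ConcordanceInfo per case-insensitive match of `word`."""
        results = []
        for i, tok in enumerate(self.tokens):
            if tok.lower() == word.lower():
                left = self.tokens[max(0, i - context):i]
                right = self.tokens[i + 1:i + 1 + context]
                results.append(ConcordanceInfo(left, tok, right, i))
        return results

    def concordance(self, word, context=3):
        """Print matches, delegating all lookup work to the helper."""
        for info in self.get_concordances(word, context):
            print(" ".join(info.left), f"[{info.word}]", " ".join(info.right))


text = MiniText("it was a monstrous size and a most monstrous fable".split())
hits = text.get_concordances("monstrous")  # data for further processing
text.concordance("monstrous")              # console output as before
```

Keeping the printing method as a thin wrapper means existing callers see the same console output, while new code can work with the returned list directly.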