
Interface to very large corpora #53

Closed
alexrudnick opened this issue Jan 17, 2012 · 1 comment
Comments

@alexrudnick
Member

Suggested by Mark Liberman:

A high-performance search (whether term-based or based on other things,
like dates) would yield a result denoting smallish segments of a very large
corpus, and this result could be interpreted by NLTK to delimit areas for
further processing.

(This could use the PyLucene API to interrogate Lucene, assuming a text-only version of the annotated corpus has already been indexed by Lucene.)
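The core idea above — a fast search that returns small spans of a huge corpus, which are then handed to NLTK for deeper processing — can be sketched without any search engine at all. The following is a minimal, illustrative inverted index over fixed-size character segments; the function names and the segmentation scheme are assumptions for demonstration, not part of any NLTK or Lucene API.

```python
# Minimal sketch: index terms to the (start, end) character offsets of
# the fixed-size segments containing them, so a caller could pass only
# those spans of a very large corpus on to NLTK for further processing.
# All names here are illustrative.
import re
from collections import defaultdict

def build_segment_index(text, segment_size=100):
    """Split `text` into fixed-size segments and map each term to the
    set of (start, end) offsets of the segments that contain it."""
    index = defaultdict(set)
    for start in range(0, len(text), segment_size):
        end = min(start + segment_size, len(text))
        for term in re.findall(r"\w+", text[start:end].lower()):
            index[term].add((start, end))
    return index

def search(index, term):
    """Return the sorted (start, end) spans whose segment contains
    `term`; downstream code would process only these spans."""
    return sorted(index.get(term.lower(), set()))

corpus = ("The quick brown fox jumps over the lazy dog. " * 10
          + "A corpus query should delimit areas for further processing. ")
idx = build_segment_index(corpus, segment_size=50)
spans = search(idx, "corpus")
for start, end in spans:
    print((start, end), corpus[start:end])
```

Note that words straddling a segment boundary are indexed only as fragments in this sketch; a real index (Lucene, Whoosh) tokenizes the full stream and stores positions, avoiding that problem.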

Migrated from http://code.google.com/p/nltk/issues/detail?id=268

@alvations
Contributor

@alexrudnick PyLucene could be replaced by Whoosh. There's an easy example of Whoosh usage with text corpora at https://github.com/alvations/shiva-something/blob/master/ngram_context.py that we could port to NLTK.

stevenbird pushed a commit that referenced this issue Aug 25, 2016
Syncing with bleeding edge