
Interface to very large corpora #53

Closed
alexrudnick opened this issue Jan 17, 2012 · 1 comment
Comments

@alexrudnick
Member

Suggested by Mark Liberman:

A high-performance search (whether term-based or based on other things,
like dates) would yield a result denoting smallish segments of a very large
corpus, and this result could be interpreted by NLTK to delimit areas for
further processing.

(This could use the PyLucene API to interrogate Lucene, assuming a text-only version of the annotated corpus has already been indexed by Lucene.)
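The core idea above — a fast search that returns small spans of a huge corpus, which are then handed to NLTK for deeper processing — can be sketched without any search engine at all. The following is a minimal, illustrative inverted index over fixed-size character segments; the function names and the segmentation scheme are assumptions for demonstration, not part of any NLTK or Lucene API.

```python
# Minimal sketch: index terms to the (start, end) character offsets of
# the fixed-size segments containing them, so a caller could pass only
# those spans of a very large corpus on to NLTK for further processing.
# All names here are illustrative.
import re
from collections import defaultdict

def build_segment_index(text, segment_size=100):
    """Split `text` into fixed-size segments and map each term to the
    set of (start, end) offsets of the segments that contain it."""
    index = defaultdict(set)
    for start in range(0, len(text), segment_size):
        end = min(start + segment_size, len(text))
        for term in re.findall(r"\w+", text[start:end].lower()):
            index[term].add((start, end))
    return index

def search(index, term):
    """Return the sorted (start, end) spans whose segment contains
    `term`; downstream code would process only these spans."""
    return sorted(index.get(term.lower(), set()))

corpus = ("The quick brown fox jumps over the lazy dog. " * 10
          + "A corpus query should delimit areas for further processing. ")
idx = build_segment_index(corpus, segment_size=50)
spans = search(idx, "corpus")
for start, end in spans:
    print((start, end), corpus[start:end])
```

Note that words straddling a segment boundary are indexed only as fragments in this sketch; a real index (Lucene, Whoosh) tokenizes the full stream and stores positions, avoiding that problem.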

Migrated from http://code.google.com/p/nltk/issues/detail?id=268

@alvations
Contributor

@alexrudnick PyLucene could be replaced by Whoosh. There's an easy example of Whoosh usage with text corpora at https://github.com/alvations/shiva-something/blob/master/ngram_context.py that we could port to NLTK.

stevenbird pushed a commit that referenced this issue Aug 25, 2016
Syncing with bleeding edge