expanded functionality for nltk.Text #546

Closed
stevenbird opened this issue Nov 10, 2013 · 3 comments

@stevenbird
Member

Add the common corpus reader methods to nltk.Text, including words(), sents(), tagged_words(), and tagged_sents(). Use Punkt for sentence tokenisation and the Stanford tagger for tagging. Possibly add parsed_sents() and an interface to the Stanford parser.

@stevenbird
Member Author

Support loading of local files and URLs. For example, load the contents of a URL, strip HTML markup using BeautifulSoup.get_text(), and tokenize using nltk.word_tokenize():

Text("http://www.gutenberg.org/files/2554/2554.txt").words()
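The strip-then-tokenize step proposed here could look roughly like the sketch below. It uses only the standard library so it runs without extra installs: `html.parser` stands in for `BeautifulSoup.get_text()` and a regular expression stands in for `nltk.word_tokenize()`. The names `strip_html` and `simple_tokenize` are hypothetical and are not part of NLTK.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the visible text of an HTML string (stand-in for get_text())."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


def simple_tokenize(text):
    """Split into word and punctuation tokens (stand-in for word_tokenize())."""
    return re.findall(r"\w+|[^\w\s]", text)


page = "<p>Hello, <b>world</b>!</p><script>var x = 1;</script>"
tokens = simple_tokenize(strip_html(page))
```

In a real implementation the two stand-ins would presumably be swapped for the BeautifulSoup and NLTK calls named above.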

@kmike
Member

kmike commented Nov 13, 2013

I personally think it makes more sense to initialize Text with text, and provide a classmethod for loading from URL or a file.

Some API ideas can be stolen from TextBlob - hey, they even bundle NLTK inside their package :) I took a brief look and some of the APIs look controversial, but it is still worth a look.
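One possible shape for the classmethod approach, as a sketch only: the constructor takes the text itself, and separate alternate constructors handle files and URLs. `TextSketch`, `from_file` and `from_url` are hypothetical names chosen for illustration, not an existing NLTK API, and the regex tokenizer is a placeholder for whatever tokenizer the real class would use.

```python
import re
from urllib.request import urlopen


class TextSketch:
    """Hypothetical stand-in for nltk.Text, initialised with raw text."""

    def __init__(self, raw):
        self.raw = raw

    @classmethod
    def from_file(cls, path, encoding="utf-8"):
        """Build an instance from a local file."""
        with open(path, encoding=encoding) as handle:
            return cls(handle.read())

    @classmethod
    def from_url(cls, url):
        """Build an instance from a URL, decoding with the declared charset."""
        with urlopen(url) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return cls(response.read().decode(charset, errors="replace"))

    def words(self):
        # Placeholder tokenizer; a real version might call nltk.word_tokenize().
        return re.findall(r"\w+|[^\w\s]", self.raw)
```

With this layout, `TextSketch("some text").words()` never touches the network or the file system, and the loading behaviour is explicit at the call site, e.g. `TextSketch.from_url(...)`.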

@jnothman
Contributor

> Use punkt for sentence tokenisation

  • Punkt assumes pre-tokenised text.
  • Do you intend to use a pre-trained model (do we need a parameter in Text that specifies language, or otherwise a set of models to use?), or to train and predict on the text?
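One way to leave the pre-trained versus train-on-the-text question open is to let the caller pass in the sentence tokenizer. The sketch below assumes that design; `SentText` and `naive_sent_tokenize` are hypothetical names, and the default splitter is a crude regex placeholder rather than Punkt. Unlike the corpus reader `sents()`, it returns sentence strings rather than lists of words, to keep the example short.

```python
import re


def naive_sent_tokenize(text):
    """Placeholder splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class SentText:
    """Hypothetical text wrapper with a pluggable sentence tokenizer."""

    def __init__(self, raw, sent_tokenize=naive_sent_tokenize):
        self.raw = raw
        # Any callable mapping a string to a list of sentence strings works.
        # With NLTK installed, a caller could pass, for example:
        #   lambda s: nltk.sent_tokenize(s, language="german")  # pre-trained model
        #   PunktSentenceTokenizer(raw).tokenize                 # trained on this text
        self._sent_tokenize = sent_tokenize

    def sents(self):
        return self._sent_tokenize(self.raw)


sentences = SentText("One. Two! Three?").sents()
```

A `language` keyword on the constructor would be an alternative to passing a callable; which of the two fits better depends on the answer to the question above.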
