expanded functionality for nltk.Text #546

stevenbird · 2013-11-10T00:34:35Z

Add common corpus reader methods to nltk.Text, including words(), sents(), tagged_words(), tagged_sents(). Use Punkt for sentence tokenisation and the Stanford tagger for tagging. Possibly add parsed_sents() and an interface to the Stanford parser.


stevenbird · 2013-11-13T05:31:34Z

Support loading of local files and URLs, e.g., load the contents of a URL, strip HTML markup using BeautifulSoup.get_text(), and tokenize using nltk.word_tokenize():

Text("http://www.gutenberg.org/files/2554/2554.txt").words()

kmike · 2013-11-13T09:35:30Z

I personally think it makes more sense to initialize Text with text, and provide a classmethod for loading from URL or a file.

Some API ideas can be stolen from TextBlob - hey, they even bundle NLTK inside their package :) I had a brief look and some of the APIs look controversial, but it's still worth a look.
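A minimal sketch of that split, to make the suggestion concrete: the constructor takes text only, and loading lives in classmethods. `TextSketch`, `from_file` and `from_url` are hypothetical names that do not exist in NLTK, and the regex in `words()` is only a placeholder for nltk.word_tokenize(). HTML stripping is left out.

```python
import re
from pathlib import Path
from urllib.request import urlopen


class TextSketch:
    """Hypothetical stand-in for nltk.Text; not part of NLTK."""

    def __init__(self, raw):
        # The constructor only ever receives text, never a path or URL.
        self.raw = raw

    @classmethod
    def from_file(cls, path, encoding="utf-8"):
        # Alternate constructor: read a local file, then defer to __init__.
        return cls(Path(path).read_text(encoding=encoding))

    @classmethod
    def from_url(cls, url, encoding="utf-8"):
        # Alternate constructor: fetch the URL body and decode it as text.
        with urlopen(url) as response:
            return cls(response.read().decode(encoding, errors="replace"))

    def words(self):
        # Naive tokenizer (word runs or single punctuation marks);
        # a placeholder for nltk.word_tokenize().
        return re.findall(r"\w+|[^\w\s]", self.raw)
```

With this shape the earlier example would read `TextSketch.from_url("http://www.gutenberg.org/files/2554/2554.txt").words()`, and `TextSketch("some text")` never has to guess whether its argument is content or a location.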

jnothman · 2013-11-20T06:05:47Z

> Use punkt for sentence tokenisation

Punkt assumes pre-tokenised text.
Do you intend to use a pre-trained model (do we need a parameter in Text that specifies language, or otherwise a set of models to use?), or to train and predict on the text?
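For reference, a rough sketch of the two options that need no downloaded model, assuming NLTK is installed: passing the text to `PunktSentenceTokenizer` trains its parameters on that text, while calling it with no arguments uses untrained defaults. `split_sentences` and its `train_on_text` flag are hypothetical names for illustration. A pre-trained language model would need a separately downloaded data package and is not shown.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer


def split_sentences(raw_text, train_on_text=True):
    """Split raw_text into sentences with Punkt (hypothetical helper)."""
    if train_on_text:
        # Unsupervised training: Punkt learns its parameters
        # (e.g. likely abbreviations) from raw_text itself.
        tokenizer = PunktSentenceTokenizer(raw_text)
    else:
        # No training text: default, untrained parameters.
        tokenizer = PunktSentenceTokenizer()
    return tokenizer.tokenize(raw_text)
```

Training on the text itself is only likely to help when the text is long enough for abbreviations to recur, which is one reason a language or model parameter on Text may still be wanted.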

stevenbird mentioned this issue Nov 10, 2013

Text class nltk/nltk_book#36

Open

ghost assigned stevenxxiu and stevenbird Nov 28, 2013

stevenbird added the enhancement label Jul 8, 2014

stevenbird added the inactive label Aug 22, 2019

stevenbird closed this as completed Aug 22, 2019
