
Add TextDirectoryCorpus that yields one doc per file recursively read from directory #1387

Closed
macks22 opened this issue Jun 2, 2017 · 7 comments

Comments

@macks22
Contributor

macks22 commented Jun 2, 2017

Description

Plain text corpora are sometimes represented using a directory structure with, for instance, a top-level directory and subdirectories that represent categories. Within each subdirectory, each file might be a document. The nesting may run deeper in such directory structures to reflect additional categorization. The popular 20 newsgroups dataset has such a structure. Here is a subset of that structure:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

It would be useful to have a gensim corpus available to handle this sort of corpus structure.

Steps/Code/Corpus to Reproduce

I'm envisioning something like this:

import sys

from gensim.corpora.textcorpus import TextCorpus


class TextDirectoryCorpus(TextCorpus):
    """Read documents recursively from a directory,
    where each file is interpreted as a plain text document.
    """

    def __init__(self, input, metadata=False, min_depth=0, max_depth=None, pattern=None,
                 exclude_pattern=None, **kwargs):
        self.min_depth = min_depth
        self.max_depth = sys.maxsize if max_depth is None else max_depth
        self.pattern = pattern  # regex: only matching file names are read
        self.exclude_pattern = exclude_pattern  # regex: matching file names are skipped
        super(TextDirectoryCorpus, self).__init__(input, metadata, **kwargs)

    # ... iter_filepaths, get_texts, etc.

corpus = textcorpus.TextDirectoryCorpus('20_newsgroups', exclude_pattern="README.*")
paths = corpus.iter_filepaths()
texts = corpus.get_texts()
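As a rough sketch of how the traversal could work (not a final API; the standalone function, depth handling, and regex matching on file names are just illustrative assumptions):

import os
import re


def iter_filepaths(top, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None):
    """Yield paths of files found under `top`, honoring depth limits and
    optional regex include/exclude patterns on file names (illustrative only)."""
    if max_depth is None:
        max_depth = float('inf')
    for dirpath, dirnames, filenames in os.walk(top):
        # Depth of the current directory relative to `top`.
        depth = dirpath[len(top):].count(os.sep)
        if depth < min_depth or depth > max_depth:
            continue
        for name in filenames:
            if pattern is not None and not re.match(pattern, name):
                continue
            if exclude_pattern is not None and re.match(exclude_pattern, name):
                continue
            yield os.path.join(dirpath, name)

# e.g. iter_filepaths('20_newsgroups', exclude_pattern="README.*")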

Expected Results

Actual Results

Versions

This should be available in all versions.

@piskvorky
Owner

piskvorky commented Jun 3, 2017

I'm not sure this is something we want in core gensim... there's no end to the number of similar utility functions and their combinations we could potentially support; it's a Pandora's box.

I summarized my thoughts on this topic in a blog post: API Bondage.

@menshikh-iv @gojomo @cscorley @tmylk your thoughts?

@menshikh-iv
Contributor

I think it's a useful feature. I have written similar wrappers many times.
Moreover, it seems this is one of the most popular use cases.

I understand your position @piskvorky, but where is the line after which we no longer want to add new features?

@macks22
Contributor Author

macks22 commented Jun 3, 2017

I can certainly understand not wanting to bloat a code base. I've also written similar wrappers many times. I figured since there's already a TextCorpus, this is a similar thing and likely just as useful. However, I can also conceive of a separate repo that aggregates corpus classes for parsing various data sources in a variety of formats. I would think that if this shouldn't be in gensim, then the wikicorpus probably should live somewhere else as well. @piskvorky have you considered splitting out the corpora classes into another repo?

One additional benefit of grouping this and similar corpus classes: it would open up opportunities to provide the distribution used by the wikicorpus for arbitrary text corpora. With a bit of redesign, the chunking and distribution code could be moved to TextCorpus and used across all subclasses. The subclasses would then provide hooks for preprocessing (e.g. preprocess_text in the PR) and possibly filtering out garbage texts (e.g. those that are empty or marked as garbage during preprocessing). Or some alternative method could be provided using callbacks on the stream, or something of that sort. Doing this nicely might require using dill (#558) or using one of the more lightweight solutions based on copy_reg.
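To make the hook idea concrete, here is a minimal sketch (not the PR's actual code; it assumes a getstream() that yields one document's raw text at a time, and the class and method names are placeholders):

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus


class HookedTextCorpus(TextCorpus):
    """Illustrative sketch of preprocessing and garbage-filtering hooks."""

    def preprocess_text(self, text):
        # Hook for subclasses; default: unicode conversion, lowercasing, tokenization.
        return utils.simple_preprocess(text)

    def keep_text(self, tokens):
        # Hook for subclasses; default: discard empty (garbage) documents.
        return len(tokens) > 0

    def get_texts(self):
        for text in self.getstream():
            tokens = self.preprocess_text(text)
            if self.keep_text(tokens):
                yield tokens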

@piskvorky
Owner

piskvorky commented Jun 4, 2017

Yeah, a separate dedicated subpackage (or even a repo) that focuses on various efficient, streamed (parallelized?) readers, for sundry data formats, sounds good to me.

I don't mind including this "subdirs reader" as a blueprint example; it's a common use case, as you say. But I'd be -1 on adding many such readers in an ad-hoc manner, at ad-hoc locations, throughout gensim. Having a clear structure and plan behind it sounds better.

@macks22
Contributor Author

macks22 commented Jun 4, 2017

Having a dedicated subpackage opens up some interesting options. What do you think of the following proposal? It's a tad long, but I hope it's closer to a clear structure and plan.

(1) Add a new datasets subpackage that will provide various Dataset classes. The Dataset class will be a simple extension of the corpus interface that includes a URL (or list of URLs) from which a dataset can be obtained, along with code for downloading it (a rough sketch of this interface appears at the end of this comment). This will satisfy #717 and provide an obvious place to put one-off or narrowly applicable corpus code. In this case, my motivation for implementing the TextDirectoryCorpus is really to use it for the 20 newsgroups dataset (which is a nice starter dataset for exploration and clustering use cases). I've implemented that in this gist. That code would fit well in the proposed datasets subpackage.

(2) Combine the various text-based corpus classes into gensim.corpora.textcorpus or transfer them to the new datasets subpackage. From what I can tell, there are currently 8 of these, including the WikiCorpus. It may be best to keep the WikiCorpus where it is, so let's consider the other 7 for now (8 if you include TextDirectoryCorpus). These are:

  1. gensim.corpora.textcorpus.TextCorpus
  2. gensim.test.test_miislita.CorpusMiislita: a simple case of TextCorpus that does stopword filtering in addition to unicode conversion, lowercasing, and tokenization. Augmenting TextCorpus with an optional stopwords list would allow it to subsume this. This could be included in a Dataset in the datasets package if it corresponds to some known URLs for download.
  3. gensim.models.word2vec.BrownCorpus: a simple case of TextDirectoryCorpus that does unicode conversion, lowercasing, and tokenization, and accounts for the POS tags in the Brown corpus. A good candidate to move to the datasets package.
  4. gensim.models.word2vec.LineSentence: a special case of TextCorpus in which each line, instead of being a single document, may be one or more. Also introduces an option to limit the number of lines considered. Performs unicode conversion and tokenization. This could be reimplemented either by composing with TextCorpus or by subclassing it and overriding getstream (the new variant from the PR); a rough sketch follows this list.
  5. gensim.models.word2vec.Text8Corpus: a simple case of LineSentence with one line. Performs unicode conversion, tokenization, and whitespace stripping. Could be implemented by overriding the getstream method in the modified LineSentence discussed above.
  6. gensim.models.doc2vec.TaggedBrownCorpus: almost the exact same code as BrownCorpus, but wraps the words output in a TaggedDocument. Could also be moved to the datasets package.
  7. gensim.models.doc2vec.TaggedLineDocument: input is the same as that to TextCorpus, but it only does unicode conversion and tokenization before wrapping in a TaggedDocument with the tag derived from the line number. Since this format is specific to the Doc2Vec model, it makes sense to leave it where it is, but it could leverage all of the other data formats if it assumed its input was an iterator of lists of tokens.

So, common functionality includes taking in one or more text files as either file handles or FS paths, opening and reading them, performing unicode conversion and tokenization, then yielding the words in some format. Each line may be converted into one or more documents, and lowercasing and stopword removal may be performed. Additional preprocessing may occur (as in the POS tag handling). Finally, the words may be wrapped up in some other wrapper (as in TaggedDocument).
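To make item 4 concrete, here is a rough illustration of what reimplementing LineSentence on top of TextCorpus could look like (it assumes the PR's getstream() yields one line of raw text at a time; the class name and default are placeholders, not a final design):

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus


class LineSentenceCorpus(TextCorpus):
    """Illustrative sketch: each line yields one or more token lists,
    splitting lines longer than `max_sentence_length` tokens."""

    def __init__(self, input, max_sentence_length=10000, **kwargs):
        self.max_sentence_length = max_sentence_length
        super(LineSentenceCorpus, self).__init__(input, **kwargs)

    def get_texts(self):
        for line in self.getstream():
            tokens = utils.to_unicode(line).split()
            # A long line becomes several "documents", as LineSentence allows.
            for i in range(0, len(tokens), self.max_sentence_length):
                yield tokens[i:i + self.max_sentence_length]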

So to reorganize, my initial thought would be this:

gensim.corpora.textcorpus

  • TextCorpus
  • TextDirectoryCorpus
  • LineSentence
  • Text8Corpus

gensim.datasets

  • BrownCorpus
  • MiislitaCorpus
  • TaggedBrownCorpus

An alternative might be to put tagged corpora in a datasets.tagged subpackage or something like that. One could also argue that the Text8Corpus should go in datasets, but I think that format is not uncommon for word2vec datasets. So the parsing code should probably be in textcorpus with a simple wrapper Dataset class.
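And for (1), here is a rough sketch of what the proposed Dataset interface might look like. Everything in it (the class name, the urls attribute, the directory layout) is an assumption for illustration, not a finished design:

import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


class Dataset(object):
    """Illustrative sketch: a corpus plus the public URL(s) it can be downloaded from."""

    urls = []  # subclasses list their download URLs here

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def download(self):
        # Fetch any files not already present in `data_dir`.
        if not os.path.exists(self.data_dir):
            os.makedirs(self.data_dir)
        for url in self.urls:
            destination = os.path.join(self.data_dir, os.path.basename(url))
            if not os.path.exists(destination):
                urlretrieve(url, destination)

    def get_texts(self):
        # Subclasses parse the downloaded files and yield tokenized documents.
        raise NotImplementedError

A 20 newsgroups Dataset could then wrap the TextDirectoryCorpus proposed above, with urls pointing at the public archive.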

@piskvorky
Owner

piskvorky commented Jun 5, 2017

Awesome! Thanks for the thoughtful investigation, @macks22.

Since this looks like an architectural change, let's also include the bundling of datasets and models in the discussion. We've been wanting to include some common datasets and trained models in gensim for playing around with (beyond the tiny data we have as part of the unit tests now).

Since it's related to how we handle corpus inputs, I think it should be discussed in the same place.

@macks22
Contributor Author

macks22 commented Jun 5, 2017

So you want to discuss how to include pre-trained models or make them accessible for download? For the datasets, I would think you could just provide code to download them from public URLs (as scikit does). For models, I'm not sure what the best approach is. It might be worth checking out how SpaCy handles that, since they do have some mechanism for loading pre-trained GloVe word vectors.

Are these the sorts of concerns you are wanting to include in the discussion?
