
Add TextDirectoryCorpus that yields one doc per file recursively read from directory #1387

Closed
macks22 opened this issue Jun 2, 2017 · 7 comments

Comments

@macks22
Contributor

macks22 commented Jun 2, 2017

Description

Plain text corpora are sometimes represented using a directory structure with, for instance, a top-level directory and subdirectories that represent categories. Within each subdirectory, each file might be a document. The nesting may run deeper in such directory structures to reflect additional categorization. The popular 20 newsgroups dataset has such a structure. Here is a subset of that structure:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

It would be useful to have a gensim corpus available to handle this sort of corpus structure.

Steps/Code/Corpus to Reproduce

I'm envisioning something like this:

import sys

from gensim.corpora.textcorpus import TextCorpus


class TextDirectoryCorpus(TextCorpus):
    """Read documents recursively from a directory,
    where each file is interpreted as a plain text document.
    """

    def __init__(self, input, metadata=False, min_depth=0, max_depth=None, pattern=None,
                 exclude_pattern=None, **kwargs):
        self.min_depth = min_depth
        self.max_depth = sys.maxsize if max_depth is None else max_depth
        self.pattern = pattern  # regex: only matching file names are read
        self.exclude_pattern = exclude_pattern  # regex: matching file names are skipped
        super(TextDirectoryCorpus, self).__init__(input, metadata, **kwargs)

    # ... iter_filepaths, get_texts, etc.

corpus = textcorpus.TextDirectoryCorpus('20_newsgroups', exclude_pattern="README.*")
paths = corpus.iter_filepaths()
texts = corpus.get_texts()
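As a rough sketch of how the traversal could work (not a final API; the standalone function, depth handling, and regex matching on file names are just illustrative assumptions):

import os
import re


def iter_filepaths(top, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None):
    """Yield paths of files found under `top`, honoring depth limits and
    optional regex include/exclude patterns on file names (illustrative only)."""
    if max_depth is None:
        max_depth = float('inf')
    for dirpath, dirnames, filenames in os.walk(top):
        # Depth of the current directory relative to `top`.
        depth = dirpath[len(top):].count(os.sep)
        if depth < min_depth or depth > max_depth:
            continue
        for name in filenames:
            if pattern is not None and not re.match(pattern, name):
                continue
            if exclude_pattern is not None and re.match(exclude_pattern, name):
                continue
            yield os.path.join(dirpath, name)

# e.g. iter_filepaths('20_newsgroups', exclude_pattern="README.*")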

Expected Results

Actual Results

Versions

This should be available in all versions.

@piskvorky
Owner

piskvorky commented Jun 3, 2017

I'm not sure this is something we want in core gensim... there's no end to the number of similar utility functions and their combinations we could potentially support; it's a Pandora's box.

I summarized my thoughts on this topic in a blog post: API Bondage.

@menshikh-iv @gojomo @cscorley @tmylk your thoughts?

@menshikh-iv
Contributor

I think it's a useful feature. I have written similar wrappers many times.
Moreover, it seems this is one of the most popular use cases.

I understand your position @piskvorky, but where is the line after which we no longer want to add new features?

@macks22
Contributor Author

macks22 commented Jun 3, 2017

I can certainly understand not wanting to bloat a code base. I've also written similar wrappers many times. I figured since there's already a TextCorpus, this is a similar thing and likely just as useful. However, I can also conceive of a separate repo that aggregates corpus classes for parsing various data sources in a variety of formats. I would think that if this shouldn't be in gensim, then the wikicorpus probably should live somewhere else as well. @piskvorky have you considered splitting out the corpora classes into another repo?

One additional benefit of grouping this and similar corpus classes: it would open up opportunities to provide the distribution used by the wikicorpus for arbitrary text corpora. With a bit of redesign, the chunking and distribution code could be moved to TextCorpus and used across all subclasses. The subclasses would then provide hooks for preprocessing (e.g. preprocess_text in the PR) and possibly filtering out garbage texts (e.g. those that are empty or marked as garbage during preprocessing). Or some alternative method could be provided using callbacks on the stream, or something of that sort. Doing this nicely might require using dill (#558) or using one of the more lightweight solutions based on copy_reg.
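To make the hook idea concrete, here is a minimal sketch (not the PR's actual code; it assumes a getstream() that yields one document's raw text at a time, and the class and method names are placeholders):

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus


class HookedTextCorpus(TextCorpus):
    """Illustrative sketch of preprocessing and garbage-filtering hooks."""

    def preprocess_text(self, text):
        # Hook for subclasses; default: unicode conversion, lowercasing, tokenization.
        return utils.simple_preprocess(text)

    def keep_text(self, tokens):
        # Hook for subclasses; default: discard empty (garbage) documents.
        return len(tokens) > 0

    def get_texts(self):
        for text in self.getstream():
            tokens = self.preprocess_text(text)
            if self.keep_text(tokens):
                yield tokens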

@piskvorky
Owner

piskvorky commented Jun 4, 2017

Yeah, a separate dedicated subpackage (or even a repo) that focuses on various efficient, streamed (parallelized?) readers, for sundry data formats, sounds good to me.

I don't mind including this "subdirs reader" as a blueprint example; it's a common use case, as you say. But I'd be -1 on adding many such readers in an ad-hoc manner, at ad-hoc locations, throughout gensim. Having a clear structure and plan behind it sounds better.

@macks22
Contributor Author

macks22 commented Jun 4, 2017

Having a dedicated subpackage opens up some interesting options. What do you think of the following proposal? It's a tad long, but I hope it's closer to a clear structure and plan.

(1) Add a new datasets subpackage that will provide various Dataset classes. The Dataset class will be a simple extension of the corpus interface that includes a URL (or list of URLs) from which a dataset can be obtained, along with code for downloading it (a rough sketch of this interface appears at the end of this comment). This will satisfy #717 and provide an obvious place to put one-off or narrowly applicable corpus code. In this case, my motivation for implementing the TextDirectoryCorpus is really to use it for the 20 newsgroups dataset (which is a nice starter dataset for exploration and clustering use cases). I've implemented that in this gist. That code would fit well in the proposed datasets subpackage.

(2) Combine the various text-based corpus classes into gensim.corpora.textcorpus or transfer them to the new datasets subpackage. From what I can tell, there are currently 8 of these, including the WikiCorpus. It may be best to keep the WikiCorpus where it is, so let's consider the other 7 for now (8 if you include TextDirectoryCorpus). These are:

  1. gensim.corpora.textcorpus.TextCorpus
  2. gensim.test.test_miislita.CorpusMiislita: a simple case of TextCorpus that does stopword filtering in addition to unicode conversion, lowercasing, and tokenization. Augmenting TextCorpus with an optional stopwords list would allow it to subsume this. This could be included in a Dataset in the datasets package if it corresponds to some known URLs for download.
  3. gensim.models.word2vec.BrownCorpus: a simple case of TextDirectoryCorpus that does unicode conversion, lowercasing, and tokenization, and accounts for the POS tags in the Brown corpus. A good candidate to move to the datasets package.
  4. gensim.models.word2vec.LineSentence: a special case of TextCorpus in which each line, instead of being a single document, may be one or more. Also introduces an option to limit the number of lines considered. Performs unicode conversion and tokenization. This could be reimplemented either by composing with TextCorpus or by subclassing it and overriding getstream (the new variant from the PR); a rough sketch follows this list.
  5. gensim.models.word2vec.Text8Corpus: a simple case of LineSentence with one line. Performs unicode conversion, tokenization, and whitespace stripping. Could be implemented by overriding the getstream method in the modified LineSentence discussed above.
  6. gensim.models.doc2vec.TaggedBrownCorpus: almost the exact same code as BrownCorpus, but wraps the words output in a TaggedDocument. Could also be moved to the datasets package.
  7. gensim.models.doc2vec.TaggedLineDocument: input is the same as that to TextCorpus, but it only does unicode conversion and tokenization before wrapping in a TaggedDocument with the tag derived from the line number. Since this format is specific to the Doc2Vec model, it makes sense to leave it where it is, but it could leverage all of the other data formats if it assumed its input was an iterator of lists of tokens.

So, common functionality includes taking in one or more text files as either file handles or FS paths, opening and reading them, performing unicode conversion and tokenization, then yielding the words in some format. Each line may be converted into one or more documents, and lowercasing and stopword removal may be performed. Additional preprocessing may occur (as in the POS tag handling). Finally, the words may be wrapped up in some other wrapper (as in TaggedDocument).
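To make item 4 concrete, here is a rough illustration of what reimplementing LineSentence on top of TextCorpus could look like (it assumes the PR's getstream() yields one line of raw text at a time; the class name and default are placeholders, not a final design):

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus


class LineSentenceCorpus(TextCorpus):
    """Illustrative sketch: each line yields one or more token lists,
    splitting lines longer than `max_sentence_length` tokens."""

    def __init__(self, input, max_sentence_length=10000, **kwargs):
        self.max_sentence_length = max_sentence_length
        super(LineSentenceCorpus, self).__init__(input, **kwargs)

    def get_texts(self):
        for line in self.getstream():
            tokens = utils.to_unicode(line).split()
            # A long line becomes several "documents", as LineSentence allows.
            for i in range(0, len(tokens), self.max_sentence_length):
                yield tokens[i:i + self.max_sentence_length]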

So to reorganize, my initial thought would be this:

gensim.corpora.textcorpus

  • TextCorpus
  • TextDirectoryCorpus
  • LineSentence
  • Text8Corpus

gensim.datasets

  • BrownCorpus
  • MiislitaCorpus
  • TaggedBrownCorpus

An alternative might be to put tagged corpora in a datasets.tagged subpackage or something like that. One could also argue that the Text8Corpus should go in datasets, but I think that format is not uncommon for word2vec datasets. So the parsing code should probably be in textcorpus with a simple wrapper Dataset class.
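And for (1), here is a rough sketch of what the proposed Dataset interface might look like. Everything in it (the class name, the urls attribute, the directory layout) is an assumption for illustration, not a finished design:

import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


class Dataset(object):
    """Illustrative sketch: a corpus plus the public URL(s) it can be downloaded from."""

    urls = []  # subclasses list their download URLs here

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def download(self):
        # Fetch any files not already present in `data_dir`.
        if not os.path.exists(self.data_dir):
            os.makedirs(self.data_dir)
        for url in self.urls:
            destination = os.path.join(self.data_dir, os.path.basename(url))
            if not os.path.exists(destination):
                urlretrieve(url, destination)

    def get_texts(self):
        # Subclasses parse the downloaded files and yield tokenized documents.
        raise NotImplementedError

A 20 newsgroups Dataset could then wrap the TextDirectoryCorpus proposed above, with urls pointing at the public archive.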

@piskvorky
Owner

piskvorky commented Jun 5, 2017

Awesome! Thanks for the thoughtful investigation, @macks22.

Since this looks like an architectural change, let's also include the bundling of datasets and models in the discussion. We've been wanting to include some common datasets and trained models in gensim for playing around with (beyond the tiny data we have as part of the unit tests now).

Since it's related to how we handle corpus inputs, I think it should be discussed in the same place.

@macks22
Contributor Author

macks22 commented Jun 5, 2017

So you want to discuss how to include pre-trained models or make them accessible for download? For the datasets, I would think you could just provide code to download them from public URLs (as scikit does). For models, I'm not sure what the best approach is. It might be worth checking out how SpaCy handles that, since they do have some mechanism for loading pre-trained GloVe word vectors.

Are these the sorts of concerns you are wanting to include in the discussion?
