Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

mmunozm · 2017-10-31T18:31:14Z

Description

The new sklearn_api.hdp (and also ldamodel) modules and/or their combination with matutils.Sparse2Corpus yield the wrong results when fitting models from sklearn vectorizers or other sklearn-styled sparse matrices, since they are stored in CSR format. This error might occur in other sklearn_api classes, too.

I believe that either the default value of Sparse2Corpus constructor's documents_columns parameter should be changed:

class Sparse2Corpus(object):
    """
    Convert a matrix in scipy.sparse format into a streaming gensim corpus.
    This is the mirror function to `corpus2csc`.
    """

    # Change this to def __init__(self, sparse, documents_columns=False) ?
    def __init__(self, sparse, documents_columns=True):
        if documents_columns:
            self.sparse = sparse.tocsc()
        else:
            self.sparse = sparse.tocsr().T  # make sure shape[1]=number of docs (needed in len())

or the following call should include that documents_columns=False:

# Same happens in lda model
class HdpTransformer(TransformerMixin, BaseEstimator):
    ...
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.HdpModel
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X) # Change to matutils.Sparse2Corpus(X, False) ?
        ...

Steps/Code/Corpus to Reproduce

Example: The number of processed documents corresponds to the number of features.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api.hdp import HdpTransformer

# Small dataset configuration
ngs = fetch_20newsgroups()
samples = ngs.data[:100]

# Simple count vectorization
vectorizer = CountVectorizer()
# x is a sparse matrix
x = vectorizer.fit_transform(samples)
print("%d documents, %d features" % x.shape) # 100 documents, 6547 features

inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}

# Train a HDP
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x)

# Should be 100 but got 6547
print("Processed documents %d" % hdp_transformer.gensim_model.m_num_docs_processed) # 6547

Expected Results

We should expect that the hdp gensim model had processed the 100 documents in samples.

print("%d documents, %d features" % x.shape)
# 100 documents, 6547 features

Actual Results

HDP processed 6547 documents

print("Processed documents: %d" % hdp_transformer.gensim_model.m_num_docs_processed)
# Processed documents: 6547

Workaround, for the record

To make it work correctly, the sparse matrix x should be transposed.

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(samples)
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}

hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x.T)

Versions

Windows-8.1-6.3.9600
('Python', '2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.12.1')
('SciPy', '1.0.0')
('gensim', '3.0.1')
('FAST_VERSION', 0)

The text was updated successfully, but these errors were encountered:

menshikh-iv · 2017-11-01T05:48:09Z

Thanks for the detailed report @mmunozm, add explicitly documents_columns to wrappers is a better idea I think (than change default behavior).

menshikh-iv · 2017-11-01T06:11:57Z

@chinmayapancholi13 are you agree? can you fix this?

chinmayapancholi13 · 2017-11-02T06:34:39Z

@menshikh-iv I agree that it would be better to change the call being made from the sklearn_wrapper code than to change the default behaviour, as doing that might affect other parts of the existing code as well.
Yes, I can get this done (along with the issue in fasttext's load method) within this week. My schedule has been a little swamped recently so sorry again for the delay and thanks a lot for your patience! :)

(piskvorky#1704) * updated param passed to 'Sparse2Corpus' * updated unit test * updated 'partial_fit' and passed named params

menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Nov 1, 2017

chinmayapancholi13 mentioned this issue Nov 9, 2017

Update 'documents_columns' param passed to 'Sparse2Corpus' #1704

Merged

menshikh-iv closed this as completed in fff6529 Nov 20, 2017

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this issue Nov 26, 2017

Fix sklearn_api bug with documents_columns parameter. Fix piskvorky#1676

76983da

(piskvorky#1704) * updated param passed to 'Sparse2Corpus' * updated unit test * updated 'partial_fit' and passed named params

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

mmunozm commented Oct 31, 2017 •

edited

menshikh-iv commented Nov 1, 2017

menshikh-iv commented Nov 1, 2017

chinmayapancholi13 commented Nov 2, 2017

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

Comments

mmunozm commented Oct 31, 2017 • edited

Description

Steps/Code/Corpus to Reproduce

Expected Results

Actual Results

Workaround, for the record

Versions

menshikh-iv commented Nov 1, 2017

menshikh-iv commented Nov 1, 2017

chinmayapancholi13 commented Nov 2, 2017

mmunozm commented Oct 31, 2017 •

edited