Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

Closed
mmunozm opened this issue Oct 31, 2017 · 3 comments
Closed

Bug in sklearn_api.hdp and in sklearn_api.ldamodel #1676

mmunozm opened this issue Oct 31, 2017 · 3 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@mmunozm
Copy link

mmunozm commented Oct 31, 2017

Description

The new sklearn_api.hdp (and also ldamodel) modules and/or their combination with matutils.Sparse2Corpus yield the wrong results when fitting models from sklearn vectorizers or other sklearn-styled sparse matrices, since they are stored in CSR format. This error might occur in other sklearn_api classes, too.

I believe that either the default value of Sparse2Corpus constructor's documents_columns parameter should be changed:

class Sparse2Corpus(object):
    """
    Convert a matrix in scipy.sparse format into a streaming gensim corpus.
    This is the mirror function to `corpus2csc`.
    """

    # Change this to def __init__(self, sparse, documents_columns=False) ?
    def __init__(self, sparse, documents_columns=True):
        if documents_columns:
            self.sparse = sparse.tocsc()
        else:
            self.sparse = sparse.tocsr().T  # make sure shape[1]=number of docs (needed in len())

or the following call should include that documents_columns=False:

# Same happens in lda model
class HdpTransformer(TransformerMixin, BaseEstimator):
    ...
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.HdpModel
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X) # Change to matutils.Sparse2Corpus(X, False) ?
        ...

Steps/Code/Corpus to Reproduce

Example: The number of processed documents corresponds to the number of features.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api.hdp import HdpTransformer

# Small dataset configuration
ngs = fetch_20newsgroups()
samples = ngs.data[:100]

# Simple count vectorization
vectorizer = CountVectorizer()
# x is a sparse matrix
x = vectorizer.fit_transform(samples)
print("%d documents, %d features" % x.shape) # 100 documents, 6547 features

inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}

# Train a HDP
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x)

# Should be 100 but got 6547
print("Processed documents %d" % hdp_transformer.gensim_model.m_num_docs_processed) # 6547

Expected Results

We should expect that the hdp gensim model had processed the 100 documents in samples.

print("%d documents, %d features" % x.shape)
# 100 documents, 6547 features

Actual Results

HDP processed 6547 documents

print("Processed documents: %d" % hdp_transformer.gensim_model.m_num_docs_processed)
# Processed documents: 6547

Workaround, for the record

To make it work correctly, the sparse matrix x should be transposed.

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(samples)
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}

hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x.T)

Versions

Windows-8.1-6.3.9600
('Python', '2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.12.1')
('SciPy', '1.0.0')
('gensim', '3.0.1')
('FAST_VERSION', 0)
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Nov 1, 2017
@menshikh-iv
Copy link
Contributor

Thanks for the detailed report @mmunozm, add explicitly documents_columns to wrappers is a better idea I think (than change default behavior).

@menshikh-iv
Copy link
Contributor

@chinmayapancholi13 are you agree? can you fix this?

@chinmayapancholi13
Copy link
Contributor

@menshikh-iv I agree that it would be better to change the call being made from the sklearn_wrapper code than to change the default behaviour, as doing that might affect other parts of the existing code as well.
Yes, I can get this done (along with the issue in fasttext's load method) within this week. My schedule has been a little swamped recently so sorry again for the delay and thanks a lot for your patience! :)

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this issue Nov 26, 2017
 (piskvorky#1704)

* updated param passed to 'Sparse2Corpus'

* updated unit test

* updated 'partial_fit' and passed named params
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
Development

No branches or pull requests

3 participants