Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

D2VTransformer raises if passed a Pandas series without index key 0 #2556

Closed
rpetchler opened this issue Jul 14, 2019 · 10 comments
Closed

D2VTransformer raises if passed a Pandas series without index key 0 #2556

rpetchler opened this issue Jul 14, 2019 · 10 comments
Assignees
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix Hacktoberfest Issues marked for hacktoberfest impact LOW Low impact on affected users reach MEDIUM Affects a significant number of users

Comments

@rpetchler
Copy link

D2VTransformer raises if passed a Pandas series with an index that does not contain the key 0:

import pandas as pd
from gensim.sklearn_api import D2VTransformer
from gensim.test.utils import common_texts

series = pd.Series(common_texts)
series.index += 1  # Increment the index so that it does not contain the key 0

transformer = D2VTransformer(min_count=1, size=5)
transformer.fit(series)

Output:

Traceback (most recent call last):
  File "main.py", line 9, in <module>
    transformer.fit(series)
  File "venv/lib/python3.7/site-packages/gensim/sklearn_api/d2vmodel.py", line 162, in fit
    if isinstance(X[0], doc2vec.TaggedDocument):
  File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4375, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

This occurs because the fit and transform methods of D2VTransformer require __getitem__ on the passed iterable not to raise an exception for key 0.

Versions:

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1

@mpenkov mpenkov added the bug Issue described a bug label Jul 21, 2019
@mpenkov mpenkov self-assigned this Sep 28, 2019
@piskvorky piskvorky added difficulty easy Easy issue: required small fix Hacktoberfest Issues marked for hacktoberfest impact LOW Low impact on affected users reach MEDIUM Affects a significant number of users labels Oct 8, 2019
@piskvorky
Copy link
Owner

piskvorky commented Oct 8, 2019

TODO: check other sklearn_api models too – same issue there as in D2VTransformer?

@Hiyorimi
Copy link
Contributor

Hiyorimi commented Oct 9, 2019

So I need to check here for the first element, not element at index 0?
Will that be enough to close the issue?

Hiyorimi added a commit to Hiyorimi/gensim that referenced this issue Oct 10, 2019
@Hiyorimi
Copy link
Contributor

@piskvorky ready for review and, hopefully, a merge.

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 12, 2019

@Hiyorimi just looking at this code again, I've realized we're making a copy of a list. This is unnecessary:

        if isinstance([i for i in X[:1]][0], doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

You could do something like:

def _get_first(some_list):
    for elem in some_list:
        return elem

...

        if isinstance(_get_first(X), doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

WDYT?

@mpenkov mpenkov reopened this Oct 12, 2019
@Hiyorimi
Copy link
Contributor

Are you sure that it is better rather than making a copy of a slice?

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 13, 2019

Not 100%. Making a copy of the list seems wasteful, though. Do you disagree?

@Hiyorimi
Copy link
Contributor

It is 1 line + copy of 1 element
vs
3 lines of function

I have no clue what is better here.

@piskvorky
Copy link
Owner

piskvorky commented Oct 13, 2019

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.

@Hiyorimi
Copy link
Contributor

Since

>>> a = pd.DataFrame([[1,2], [2,4]], columns=['1', '2'])
>>> a.index += 1
>>> _get_first = lambda X: next(iter(X))
>>> _get_first(a)
'1'

I just added method.

@Hiyorimi
Copy link
Contributor

Opened a PR

@mpenkov mpenkov closed this as completed in 8624aa2 Nov 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix Hacktoberfest Issues marked for hacktoberfest impact LOW Low impact on affected users reach MEDIUM Affects a significant number of users
Projects
None yet
Development

No branches or pull requests

4 participants