Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] device should be attribute on SentenceEncoder #33

Closed
CarloLepelaars opened this issue Dec 27, 2022 · 4 comments · Fixed by #34
Closed

[BUG] device should be attribute on SentenceEncoder #33

CarloLepelaars opened this issue Dec 27, 2022 · 4 comments · Fixed by #34

Comments

@CarloLepelaars
Copy link
Contributor

CarloLepelaars commented Dec 27, 2022

The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

Error message:
AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction:
Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    .
    .
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
@CarloLepelaars CarloLepelaars changed the title [BUG] repr breaks on SentenceEncoder [BUG] device should be attribute on SentenceEncoder Dec 27, 2022
@koaning
Copy link
Owner

koaning commented Dec 27, 2022

Feel free to make a PR. It might make sense to have the actual reference to the torch object to be self._device if that makes the __repr__ look better.

@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Dec 27, 2022

Cool! I made PR #34 which solves this and adds some tests for SentenceEncoder. Defining self._device additionally does not seem necessary as __repr__ output is already clean.

As for testing text embedders it looks like Sense2VecEncoder is a lot harder to test since it depends on loading a file from disk. Any ideas to test that or do you think its not necessary?

@koaning
Copy link
Owner

koaning commented Dec 27, 2022

There's no harm in adding 'self.path' I think. Feel free to make that PR as well!

@CarloLepelaars
Copy link
Contributor Author

There's no harm in adding 'self.path' I think. Feel free to make that PR as well!

Cool! Created #35 for the Sense2Vec attribute fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants