
Support for word embeddings #26

Closed
stephantul opened this issue Nov 5, 2022 · 3 comments
@stephantul
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, GloVe, etc.)? The embedder would need:

  • A filename pointing to a local embedding file (e.g., glove.6B.100d.txt)
  • Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
  • A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course.
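Just as a rough sketch, something along these lines (the class and parameter names are placeholders, not anything that exists in embetter today):

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StaticWordEmbedder(BaseEstimator, TransformerMixin):
    """Pools static word vectors (word2vec/GloVe text format) per document."""

    def __init__(self, path, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
        self.path = path
        self.token_pattern = token_pattern
        self.pooling = pooling

    def fit(self, X, y=None):
        # Load "word v1 v2 ... vn" lines into a lookup table.
        self.vectors_ = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                self.vectors_[word] = np.asarray(values, dtype=np.float32)
        self.dim_ = len(next(iter(self.vectors_.values())))
        return self

    def transform(self, X, y=None):
        pool = {"mean": np.mean, "max": np.max, "sum": np.sum}[self.pooling]
        out = np.zeros((len(X), self.dim_), dtype=np.float32)
        for i, doc in enumerate(X):
            tokens = re.findall(self.token_pattern, doc.lower())
            vecs = [self.vectors_[t] for t in tokens if t in self.vectors_]
            if vecs:
                # Pool the word vectors of known tokens; unknown-only docs stay zero.
                out[i] = pool(np.stack(vecs), axis=0)
        return out
```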
If you think it's a good idea, I can do the PR sometime next week.

Stéphan

@koaning
Owner

koaning commented Nov 5, 2022

The whatlies library, which I've also written, supports that. The downside of supporting everything is that many of those models are trained on dated datasets, and pooling word embeddings over longer sentences dilutes the information.

@stephantul
Author

Ok, cool, I guess that means it's a no go. I didn't know whatlies contained static word embedders, nice.

@koaning
Owner

koaning commented Nov 5, 2022

In a way, whatlies is the precursor to this package. But the goal for embetter is also to embed more than just text, and to keep things relatively simple by focusing mainly on sensible defaults.
