
Support for word embeddings #26

Closed
stephantul opened this issue Nov 5, 2022 · 3 comments
@stephantul
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, GloVe, etc.)? The embedder would need:

  • A filename pointing to a local embedding file (e.g., glove.6B.100d.txt)
  • Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
  • A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course.
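Just as a rough sketch, something along these lines (the class and parameter names are placeholders, not anything that exists in embetter today):

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StaticWordEmbedder(BaseEstimator, TransformerMixin):
    """Pools static word vectors (word2vec/GloVe text format) per document."""

    def __init__(self, path, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
        self.path = path
        self.token_pattern = token_pattern
        self.pooling = pooling

    def fit(self, X, y=None):
        # Load "word v1 v2 ... vn" lines into a lookup table.
        self.vectors_ = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                self.vectors_[word] = np.asarray(values, dtype=np.float32)
        self.dim_ = len(next(iter(self.vectors_.values())))
        return self

    def transform(self, X, y=None):
        pool = {"mean": np.mean, "max": np.max, "sum": np.sum}[self.pooling]
        out = np.zeros((len(X), self.dim_), dtype=np.float32)
        for i, doc in enumerate(X):
            tokens = re.findall(self.token_pattern, doc.lower())
            vecs = [self.vectors_[t] for t in tokens if t in self.vectors_]
            if vecs:
                # Pool the word vectors of known tokens; unknown-only docs stay zero.
                out[i] = pool(np.stack(vecs), axis=0)
        return out
```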
If you think it's a good idea, I can do the PR sometime next week.

Stéphan

@koaning
Owner

koaning commented Nov 5, 2022

The whatlies library, which I've also written, supports that. The downside of supporting everything is that many of those models are trained on dated datasets, and pooling word embeddings over longer sentences dilutes the information.

@stephantul
Author

Ok, cool, I guess that means it's a no go. I didn't know whatlies contained static word embedders, nice.

@koaning
Owner

koaning commented Nov 5, 2022

In a way, whatlies is the precursor to this package. But the goal for embetter is also to embed more than just text, and to keep things relatively simple by focusing mainly on sensible defaults.
