
embetter: better embeddings #15

Closed
koaning opened this issue Oct 31, 2021 · 3 comments

koaning commented Oct 31, 2021

This is conceptual work in progress. The maintainer is actively researching this; please do not work on it.

Problem Statement

When you submit "where is my phoone" and ask for similar sentences, you may get things like:

  • where is my phone
  • where is my credit card

Depending on your task, either the "where is" part of the sentence or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So, to put it more generally:

[image]

The similarity in our embedded space is very much "general". I'm using "general" here, as opposed to "specific", to indicate that these similarities have been constructed without a particular task in mind.
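
To make the "general" issue concrete, here's a minimal sketch of what such a similarity lookup might look like. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model purely as illustrative choices; neither is prescribed by this issue.

```python
# Illustrative sketch of "general" similarity with an off-the-shelf encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "where is my phoone"           # note the deliberate typo
candidates = ["where is my phone", "where is my credit card"]

# Encode everything and compare with cosine similarity.
vecs = encoder.encode([query] + candidates)
q, cands = vecs[0], vecs[1:]
sims = cands @ q / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q))

for text, sim in zip(candidates, sims):
    print(f"{sim:.3f}  {text}")
# Whatever the numbers turn out to be, the ranking reflects a task-agnostic
# notion of similarity: nothing tells the encoder whether the "where is"
# part or the "phone" part matters for you.
```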

Similar Issue

Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the same first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
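
As a hedged illustration of why labels change the picture: with even a handful of labeled duplicate/non-duplicate pairs, a simple model over per-field agreement features can pick up which fields carry signal. The field names and data below are made up for illustration.

```python
# Hypothetical sketch: learn per-field importance from labeled pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: does the pair agree on [zipcode, city, first_name, last_name]?
X = np.array([
    [1, 1, 1, 1],  # duplicate
    [0, 1, 1, 1],  # duplicate
    [1, 1, 0, 0],  # same city/zip, different person
    [0, 1, 0, 1],  # same city, same last name, different person
])
y = np.array([1, 1, 0, 0])  # 1 = duplicate, 0 = not a duplicate

model = LogisticRegression().fit(X, y)
print(dict(zip(["zipcode", "city", "first_name", "last_name"], model.coef_[0])))
# The learned weights reflect that agreement on first_name is informative
# while agreement on city (constant in these pairs) carries little signal -
# exactly the kind of thing an unsupervised encoding cannot know on its own.
```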

koaning commented Oct 31, 2021

Embetter: making better embeddings.

So how might we go about making our embeddings a bit more "specific"?

I think the main thing to do is to have a human steer it by labeling.

[image]

But how do we connect the two? By training an embedding on top of the encoder!

[image]
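
A rough sketch of what "training an embedding on top of the encoder" could look like, assuming PyTorch, precomputed frozen-encoder vectors, and human labels on pairs. The layer sizes, loss, and training loop are illustrative only, not a design this issue commits to.

```python
# Minimal sketch: a trainable head on top of a frozen encoder, steered by labels.
import torch
import torch.nn as nn

emb_dim, out_dim = 384, 128          # 384 matches e.g. a MiniLM encoder
head = nn.Linear(emb_dim, out_dim)   # the trainable embedding on top

# x1, x2: frozen-encoder vectors for pairs of texts; y: +1 = "should be
# similar for my task", -1 = "should not be". These labels come from a human.
x1 = torch.randn(32, emb_dim)
x2 = torch.randn(32, emb_dim)
y = torch.randint(0, 2, (32,)) * 2 - 1

loss_fn = nn.CosineEmbeddingLoss(margin=0.2)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(head(x1), head(x2), y.float())
    loss.backward()
    opt.step()
# After training, head(encoder(text)) defines the task-specific space.
```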

koaning commented Oct 31, 2021

The idea is that this will allow us to "fine-tune" what similarity actually means in our embedded space.

[image]
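
For a sense of what that fine-tuned similarity would look like in use, here's a hedged sketch that reuses the hypothetical `encoder` and `head` objects from the sketches above.

```python
# Sketch: score text pairs in the task-specific space rather than the general one.
import torch
import torch.nn.functional as F

def specific_sim(a: str, b: str) -> float:
    """Cosine similarity computed after the trained head, i.e. head(encoder(text))."""
    va = head(torch.tensor(encoder.encode(a)))
    vb = head(torch.tensor(encoder.encode(b)))
    return F.cosine_similarity(va, vb, dim=0).item()

# The same pair of texts can now be scored as similar or dissimilar depending
# on what the labels taught the head - not on the general encoder alone.
print(specific_sim("where is my phone", "where is my credit card"))
```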

I'm not sure whether this is best done as a separate package called embetter or as a submodule here. Either way, I wanted to have this idea written down somewhere so that I might discuss it with certain folks.

@koaning koaning changed the title embetter submodule embetter: better embeddings Oct 31, 2021
@koaning koaning self-assigned this Oct 31, 2021
koaning commented Oct 31, 2021

Aaaaand it's going in a separate repo: https://github.com/koaning/embetter

@koaning koaning closed this as completed Oct 31, 2021