This is conceptual work in progress. The maintainer is actively researching this; please do not work on it.
Problem Statement
When you submit "where is my phoone" and you look up similar sentences, you may get results like:

where is my phone
where is my credit card
Depending on your task, either the "where is" part of the sentence or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So, to put it more generally:
The similarity in an embedded space is, in our case, very much "general". I'm using "general" here, as opposed to "specific", to indicate that these similarities have been constructed without a particular task in mind.
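To make the "general" point concrete, here's a minimal sketch. It uses a toy character-bigram encoder as a stand-in for a real sentence encoder (purely hypothetical; a real setup would use something like a transformer-based embedding). Both candidates come back with a nonzero score, and nothing in the encoder knows whether the "where is" intent or the "phone" entity should dominate:

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy 'general' encoder: character-bigram counts.

    A stand-in for a real sentence encoder; it has no notion of
    which parts of the sentence matter for your task.
    """
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = embed("where is my phoone")
for candidate in ["where is my phone", "where is my credit card"]:
    print(candidate, round(cosine(query, embed(candidate)), 3))
```

The ranking here happens to put "where is my phone" first, but the scores are task-agnostic: the encoder would produce the exact same numbers whether you care about intents or about entities.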
Similar Issue
Suppose that we are deduplicating records and we have a zipcode, a city, a first name, and a last name. How would our encoding understand that sharing a city is not a strong signal while sharing a first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
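This is exactly where labels could help. As a minimal sketch (not a proposed implementation), suppose we had labeled record pairs with per-field match indicators; even a plain logistic regression learns that city matches carry little weight. The data below is made up to illustrate the point:

```python
from math import exp

# Hypothetical labeled record pairs: per-field match indicators
# [zipcode, city, first_name, last_name] plus whether the pair is
# a true duplicate. The labels encode that sharing a city is weak
# evidence while sharing a first name is strong evidence.
pairs = [
    ([1, 1, 1, 1], 1),
    ([0, 1, 1, 1], 1),
    ([1, 1, 1, 0], 1),
    ([1, 1, 0, 0], 0),
    ([0, 1, 0, 1], 0),
    ([0, 0, 0, 1], 0),
]

# Plain logistic regression trained with gradient descent.
w, b, lr = [0.0] * 4, 0.0, 0.5
for _ in range(2000):
    for x, y in pairs:
        p = 1 / (1 + exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
        b -= lr * (p - y)
        w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]

for name, wi in zip(["zipcode", "city", "first_name", "last_name"], w):
    print(name, round(wi, 2))
```

The learned weight for the first-name field ends up well above the city weight, which is the kind of task-specific knowledge a generic encoding has no way to pick up on its own.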
The idea is that this will allow us to "fine-tune" what similarity actually means in our embedded space.
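One hypothetical way to picture this "fine-tuning": keep the embeddings frozen and learn nonnegative per-dimension weights from labeled pairs. The 2-d embeddings below are made up (dimension 0 loosely encoding the "where is" intent, dimension 1 the object mentioned), and by construction the plain dot product ties at 0.90 for both pairs; the learned weights break the tie in favor of the task:

```python
# Made-up 2-d embeddings for three sentences.
emb = {
    "where is my phone": [0.9, 0.9],
    "where is my credit card": [0.9, 0.1],
    "my phone is broken": [0.1, 0.9],
}

# Labeled pairs for an intent-routing task: both "where is"
# questions should be similar; the complaint should not be.
data = [
    ("where is my phone", "where is my credit card", 1),
    ("where is my phone", "my phone is broken", 0),
]

def wsim(a, b, w):
    # Similarity = dot product reweighted per dimension.
    return sum(wi * ai * bi for wi, ai, bi in zip(w, emb[a], emb[b]))

# Start from the plain dot product: both pairs score 0.90, a tie.
w = [1.0, 1.0]
lr = 0.1
for _ in range(100):
    for a, b, y in data:
        for i, (ai, bi) in enumerate(zip(emb[a], emb[b])):
            step = lr * ai * bi           # d(similarity)/d(w_i)
            w[i] += step if y else -step  # pull similar pairs up, push others down
            w[i] = max(w[i], 0.0)         # keep weights nonnegative
```

After training, the intent dimension dominates and the "where is" pair outranks the complaint; had we labeled for an entity task instead, the same procedure would upweight the other dimension. That's the sense in which similarity becomes task-specific.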
I'm not sure whether this is best done as a separate package called embetter or as a submodule here. Either way, I wanted this idea written down somewhere so that I can discuss it with certain folks.
koaning changed the title from "embetter submodule" to "embetter: better embeddings" on Oct 31, 2021.