BERT Vector Space shows issues with unknown words #164
I'm comparing the embedding vectors of sentences via cosine similarity. A simple version looks like:

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
This works okay in most cases, but I have some edge cases I don't understand. The model is the cased English model.
I compare short sentences to unknown terms; in this case, for testing purposes, the unknown terms are random strings of 3 characters:
```python
import string
from random import choice

import numpy as np
from service.client import BertClient

def GenRandomText(length=8, chars=string.ascii_letters + string.digits):
    return ''.join(choice(chars) for _ in range(length))

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

if __name__ == '__main__':
    bc = BertClient(ip='localhost', port=5555)
    for i in range(1, 10):
        leftTerm = GenRandomText(3, string.ascii_letters)
        rightTerm = "how are you today?"
        # encode() returns a (1, dim) array; take the first row
        leftV = bc.encode([leftTerm])[0]
        rightV = bc.encode([rightTerm])[0]
        cosine_similarity = cosine_sim(leftV, rightV)
        print("left: %s right: %s similarity: %f"
              % (leftTerm, rightTerm, cosine_similarity))
```
This is what happens: the similarity shows high values for these embeddings:
If I compute cosine similarity or Word Mover's Distance (WMD) similarity on these sentences and terms, I get something different:
These are the outputs for
We have also tried different metrics; the results seem to confirm this issue. Here, for example, given the sentences
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance, since cosine distance is computed in a linear space where all dimensions are weighted equally.
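The point about cosine distance weighting all dimensions equally can be illustrated with a minimal sketch (synthetic numbers, not real BERT activations): a downstream linear layer can learn to ignore uninformative dimensions, but cosine similarity cannot.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two vectors that agree on a task-relevant dimension (index 0) but
# carry large, uninformative activations in another dimension (index 1).
a = np.array([1.0, 50.0])
b = np.array([1.0, -50.0])

print(cosine_sim(a, b))   # ~ -0.999: cosine calls them near-opposites

# A downstream linear layer with hypothetical learned weights can
# ignore dimension 1 entirely:
w = np.array([1.0, 0.0])
print(a @ w, b @ w)       # both 1.0: identical for the classifier
```

So embeddings can be useful features for a trained model while still giving misleading raw cosine similarities.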
hey @jacobdevlin-google I don't understand from your answer whether this is an issue or expected behavior of BERT when getting the embedding of "not meaningful words", or whether it is due to the average pooling. Maybe @hanxiao has a better idea?
In any case, this behavior isn't what I would expect from a sentence embedding, even one built from centroids of word tokens. In fact, as I have shown above, cosine similarity (or better, Word Mover's Distance) yields reasonable similarity values among those kinds of token sequences.
As a real-world example, with behavior like this it would be impossible to represent ham or spam tokens (say, for a classifier task), since the latter tokens seem to be equidistant from all the others (!).
Thank you guys in advance.
@loretoparisi there was a bug in my avg pooling, max pooling, and concat mean-max pooling. It is fixed in the latest master: https://github.com/hanxiao/bert-as-service. Please check it out; it may produce different results.
In principle, this bug matters most when max_seq_len is much longer than the actual sequence length.
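A minimal sketch of why such a bug would make every sentence vector look alike (synthetic data, not real BERT states; the shared fixed padding vector is an assumption for illustration): hidden states at padded positions are not zero, so if the mean is taken over all max_seq_len positions instead of only the real tokens, the shared padding representation dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
max_seq_len, dim = 128, 768

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A fixed "padding" hidden state shared across sequences (simplifying
# assumption; real pad-position states vary somewhat per input).
pad_state = rng.normal(size=dim)

def sentence_vec(n_real_tokens, masked):
    tokens = rng.normal(size=(n_real_tokens, dim))   # distinct content
    if masked:
        return tokens.mean(axis=0)                   # average real tokens only
    padded = np.vstack([tokens,
                        np.tile(pad_state, (max_seq_len - n_real_tokens, 1))])
    return padded.mean(axis=0)                       # buggy: average all positions

a_bug, b_bug = sentence_vec(4, masked=False), sentence_vec(6, masked=False)
a_fix, b_fix = sentence_vec(4, masked=True), sentence_vec(6, masked=True)

print("unmasked mean:", cosine_sim(a_bug, b_bug))   # close to 1.0
print("masked mean:  ", cosine_sim(a_fix, b_fix))   # near 0 for random tokens
```

With only a handful of real tokens out of 128 positions, the unmasked mean is almost entirely the padding vector for every input, which would explain uniformly high similarities regardless of content.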