BERT Vector Space shows issues with unknown words #164

Closed
loretoparisi opened this Issue Nov 22, 2018 · 5 comments


loretoparisi commented Nov 22, 2018

I'm comparing the embedding vectors of sentences via cosine similarity. A simple version looks like:

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

This works ok in most cases, but there are some edge cases I don't understand. The model is the cased English model - cased_L-12_H-768_A-12 - and I'm using bert-as-service to test this issue.

I compare a short sentence against unknown terms - in this case, for testing purposes, random strings of 3 chars:

import string
from random import choice

import numpy as np

from service.client import BertClient


def GenRandomText(length=8, chars=string.ascii_letters + string.digits):
    # generate a random string, almost certainly out-of-vocabulary
    return ''.join(choice(chars) for _ in range(length))


def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


if __name__ == '__main__':
    bc = BertClient(ip='localhost', port=5555)

    for i in range(1, 10):
        leftTerm = GenRandomText(3, string.ascii_letters)
        rightTerm = "how are you today?"

        leftV = bc.encode([leftTerm])
        rightV = bc.encode([rightTerm])

        cosine_similarity = cosine_sim(leftV[0], rightV[0])

        print("left: %s right: %s similarity: %f" % (leftTerm, rightTerm, cosine_similarity))

This is what happens: the similarity shows high values for these embeddings:

left: Dzq right: how are you today? similarity: 0.803445
left: qqC right: how are you today? similarity: 0.713830
left: HSQ right: how are you today? similarity: 0.745146
left: jMB right: how are you today? similarity: 0.831154
left: naR right: how are you today? similarity: 0.861142
left: Bzi right: how are you today? similarity: 0.833868
left: dCc right: how are you today? similarity: 0.815975
left: qCp right: how are you today? similarity: 0.784781
left: wQM right: how are you today? similarity: 0.836569

If I instead compute a cosine similarity or WMD similarity on the same sentence and term using word-level vectors, I get something quite different:

These are the outputs for the pair left: Dzq, right: how are you today?:

{
    "wmd_similarities_norm": [
        -0.11699746160433133
    ],
    "cosine_similarities": [
        0.19988850682356737
    ],
    "wmd_similarities": [
        0.44150126919783433
    ]
}

where wmd_similarities is the Word Mover's similarity based on Word Mover's Distance, and cosine_similarities is the cosine similarity.
The WMD was calculated using gensim functionality over the FastText Wikipedia model here.
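
For reference, a minimal sketch of how such a WMD comparison can be set up with gensim; the vector file path is hypothetical, and the distance-to-similarity conversion (1/(1+d)) is just one common choice, not necessarily the exact pipeline that produced the numbers above:

from gensim.models import KeyedVectors

# hypothetical path to FastText Wikipedia vectors in word2vec text format
wv = KeyedVectors.load_word2vec_format('wiki.en.vec')

left = "Dzq".lower().split()
right = "how are you today".lower().split()

# gensim's wmdistance drops tokens missing from the vocabulary; a fully OOV
# string like "Dzq" needs the full FastText model (with subword information)
# to get a finite, comparable result
wmd = wv.wmdistance(left, right)
print("WMD distance: %f similarity: %f" % (wmd, 1.0 / (1.0 + wmd)))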

We have also tried different metrics, and the results seem to confirm this issue. For example, given the sentences ["drive a coupe you can stand in (it's lit)"] and ["dfg"]:

Euclidean distance is 16.9716377258
Manhattan distance is 367.4368
Chebyshev similarity is 0.309262271971
Canberra distance is 533.25599833
Cosine similarity is 0.824640512466
WMD similarity (word2vec) 0.250232081318
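
As a sketch, these extra metrics can be reproduced with SciPy over the bert-as-service vectors; how the Chebyshev and WMD rows above were converted into similarities is not specified, so only the raw distances plus the cosine similarity are printed here:

from scipy.spatial import distance

from service.client import BertClient

bc = BertClient(ip='localhost', port=5555)
u = bc.encode(["drive a coupe you can stand in (it's lit)"])[0]
v = bc.encode(["dfg"])[0]

print("Euclidean distance", distance.euclidean(u, v))
print("Manhattan distance", distance.cityblock(u, v))
print("Chebyshev distance", distance.chebyshev(u, v))
print("Canberra distance", distance.canberra(u, v))
print("Cosine similarity", 1.0 - distance.cosine(u, v))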

hanxiao commented Nov 23, 2018

related to hanxiao/bert-as-service#44


jacobdevlin-google commented Nov 23, 2018

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance, since cosine distance operates in a linear space where all dimensions are weighted equally.
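
To illustrate the pooling step being discussed (a sketch with made-up shapes, not the actual bert-as-service code): the "sentence vector" is simply the mean of the per-token vectors from the final layer.

import numpy as np

def mean_pool(token_vectors):
    # token_vectors: [seq_len, hidden_size] final-layer outputs for one sentence;
    # averaging over the sequence dimension yields one fixed-size vector
    return token_vectors.mean(axis=0)

token_vectors = np.random.randn(12, 768)  # e.g. 12 tokens, hidden size 768
sentence_vector = mean_pool(token_vectors)
print(sentence_vector.shape)  # (768,)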


loretoparisi commented Nov 24, 2018

hey @jacobdevlin-google, from your answer I can't tell whether this is an issue or the expected behavior of BERT when embedding "not meaningful" words, or whether it is due to the average pooling. Maybe @hanxiao has a better idea?

Let's consider that this behavior is, in any case, not what I would expect from a sentence embedding, even one built as the centroid of the word tokens. In fact, as I have shown above, the cosine similarity (or better, the Word Mover's similarity) over word-level vectors produces reasonable "similarity" values for these kinds of token sequences.

To give a real-world example: with behavior like that, it would be impossible to represent ham or spam tokens (say, for a classification task), since those tokens appear to be roughly equidistant from all the others (!).

Thank you guys in advance.


hanxiao commented Dec 5, 2018

@loretoparisi there was a bug in my avg pooling, max pooling and concat-mean-max pooling. It is fixed in the latest master of https://github.com/hanxiao/bert-as-service; please check it out, it may produce different results.

In principle, this bug matters most when max_seq_len is much longer than the actual sequence length.
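
To illustrate why (a rough sketch with made-up shapes, not the actual fix): if the average is taken over all max_seq_len positions instead of only the real tokens, the padding vectors dominate short inputs, so the distortion grows as max_seq_len exceeds the true sequence length. A mask-aware average avoids that:

import numpy as np

max_seq_len, hidden = 25, 768
token_vectors = np.random.randn(max_seq_len, hidden)
input_mask = np.zeros(max_seq_len)
input_mask[:5] = 1  # only 5 real tokens, the rest is padding

naive_avg = token_vectors.mean(axis=0)  # buggy: averages over padding positions too
masked_avg = (token_vectors * input_mask[:, None]).sum(axis=0) / input_mask.sum()  # real tokens only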


loretoparisi commented Dec 5, 2018

@hanxiao thank you very much for your investigation and fix!
