BERT encode emojis as [UNK] token #587

Open
rbahumi opened this issue Apr 18, 2019 · 2 comments

@rbahumi

rbahumi commented Apr 18, 2019

Hi,

I have a classification task on texts that contain many emojis, and I am trying to understand whether the latest version of BERT supports them.
The latest commit message in master (d66a146) suggests that the tokenizer now supports emojis:
(1) Updating TF Hub classifier (2) Updating tokenizer to support emojis

However, when I run the tokenizer following the code in the Colab notebook, I get the [UNK] token for emojis.

I am using the BERT code from the installed bert-tensorflow pip package.

Here is the code snippet that demonstrates the issue:

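import tensorflow as tf  # TF 1.x API (tf.Session, hub.Module)
import tensorflow_hub as hub
import bert
from bert import tokenization

# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"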

def create_tokenizer_from_hub_module():
  """Get the vocab file and casing info from the Hub module."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
  return bert.tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()


print(tokenizer.tokenize('I ❤️ you'))
### Prints:  
### ['i', '[UNK]', 'you']
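
For readers who want to see how an output like the one above can arise, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It is a simplified illustration, not the library's code: the real tokenizer also handles punctuation splitting, accent stripping, and other cleanup. `TOY_VOCAB`, `wordpiece`, and `tokenize` are hypothetical names, and the toy vocabulary is an assumption that says nothing about what the actual BERT vocab file contains.

```python
def wordpiece(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single whitespace-delimited token."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, used only for this illustration.
TOY_VOCAB = {"i", "you", "emoji", "##s"}


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each token."""
    out = []
    for tok in text.lower().split():
        out.extend(wordpiece(tok, vocab))
    return out


# The heart emoji is written as two code points: U+2764 followed by U+FE0F.
print(tokenize("I \u2764\ufe0f you", TOY_VOCAB))  # ['i', '[UNK]', 'you']
print(tokenize("emojis", TOY_VOCAB))              # ['emoji', '##s']
```

In this sketch, a token maps to [UNK] whenever some position in it has no matching vocabulary piece, so any character absent from the vocabulary is enough to turn the whole token into [UNK].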

Thanks,
Roei.

@dataislife

Hey!
In the committed code, I do not see any change related to emojis, despite the two points in the commit message. Is there any further information on that?

@Souradeep15

Where can you get the vocab file and casing info from the Hub module?
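
In the snippet from the original post, the vocab file path and the casing flag come from the Hub module's "tokenization_info" signature. Once that path is available, a minimal sketch like the following could check whether particular characters are in the vocabulary. It assumes the vocab file is plain text with one token per line; `load_vocab` and `missing_tokens` are hypothetical helper names, and a temporary toy file stands in for the real vocab file so the example runs without TensorFlow.

```python
import os
import tempfile


def load_vocab(vocab_file):
    """Read a vocab file (assumed: one token per line) into a set."""
    with open(vocab_file, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def missing_tokens(candidates, vocab):
    """Return the candidates that do not appear in the vocabulary."""
    return [tok for tok in candidates if tok not in vocab]


# Toy stand-in for the real vocab file; its contents are an assumption.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("[UNK]\ni\nyou\n")
    toy_vocab_path = tmp.name

toy_vocab = load_vocab(toy_vocab_path)
os.remove(toy_vocab_path)

# U+2764 is the heart character; it is not in the toy vocabulary.
print(missing_tokens(["i", "\u2764", "you"], toy_vocab))  # ['❤']
```

With the real tokenizer, the same check would be run against the `vocab_file` value returned by `sess.run` in the original snippet instead of the toy file.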
