Hi,
I have a classification task for text that contains many emojis, and I am trying to understand whether the latest version of BERT supports them.
The latest commit message in master (d66a146) suggests that the tokenizer now supports emojis:
(1) Updating TF Hub classifier (2) Updating tokenizer to support emojis
However, when I run the tokenizer following the code in the Colab notebook, I get the [UNK] token for emojis.
The BERT code I am using is the installed bert-tensorflow pip package.
Here is the code snippet that demonstrates the issue:
import tensorflow as tf
import tensorflow_hub as hub
import bert.tokenization  # from the bert-tensorflow pip package

# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                                  tokenization_info["do_lower_case"]])
    return bert.tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()
print(tokenizer.tokenize('I ❤️ you'))
### Prints:
### ['i', '[UNK]', 'you']
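To help narrow this down, here is a minimal diagnostic sketch that does not depend on TensorFlow or BERT. It assumes the heart in the snippet above is encoded as U+2764 followed by the variation selector U+FE0F (the usual encoding, but worth checking on the actual input), and it uses a small made-up vocabulary in place of the real vocab file. The helper names `describe_code_points` and `missing_from_vocab` are hypothetical and are not part of the BERT code.

```python
import unicodedata


def describe_code_points(text):
    """Return (code point, Unicode category, name) for each character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def missing_from_vocab(words, vocab):
    """Return the words that have no exact entry in vocab."""
    return [w for w in words if w not in vocab]


# Assumption: the emoji is U+2764 followed by U+FE0F.
heart = "\u2764\ufe0f"
for row in describe_code_points(heart):
    print(row)

# Hypothetical toy vocabulary standing in for the real vocab file.
toy_vocab = {"i", "you", "[UNK]"}
print(missing_from_vocab(["i", "\u2764", "you"], toy_vocab))
```

Running the same membership check against the vocabulary that the tokenizer loaded from `vocab_file` should show whether the emoji characters are present in it at all. If they are not, that would be consistent with the [UNK] output above.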
Thanks,
Roei.