How to load bert embeddings #2

Closed
bizbard opened this issue Jul 2, 2023 · 2 comments


bizbard commented Jul 2, 2023

I tried to load BERT embeddings of news texts with 'bert-token.yaml', using 'dcn.yaml' as the recommended model. After preprocessing the data with bert_processor.py, I realized it only tokenizes the text. When loading data.npy in embedding_loader.py, I printed out the embedding and found only token IDs, no BERT embeddings. How can I extract the BERT embeddings and load them into the model?

Printed embedding variable:

```
{'nid': array([0, 1, 2, ..., 65235, 65236, 65237], dtype=object),
 'cat': array([list([9580]), list([2740]), list([2739]), ..., list([2739])], dtype=object),
 'title': array([list([1996, 9639, 3035, 3870, 1010, 3159, 2798, 1010, 1998, 3159, 5170, 8415, 2011]), ..., list([3901, 1997, 4916, 2237, 5998, 2007, 3571, 2044, 9288])], dtype=object),
 'abs': array([list([4497, 1996, 14960, 2015, 1010, 17764, 1010, 1998, 2062, 2008, 1996, 15426, 2064, 1005, 1056, 2444, 2302, 1012]), list([2122, 9428, 19741, 14243, 2024, 3173, 2017, 2067, 1998, 4363, 2017, 2013, 8328, 4667, 2008, 18162, 7579, 6638, 2005, 2204, 1012]), ..., list([])], dtype=object)}
```

The error:

```
Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 196, in __init__
    self.embedding_manager.load_pretrained_embedding(**Obj.raw(embedding_info))
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_manager.py", line 66, in load_pretrained_embedding
    self.pretrained[vocab_name] = EmbeddingInfo(**kwargs).load()
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 39, in load
    self.embedding = getter(self.path)
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 21, in get_numpy_embedding
    return torch.tensor(embedding, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
```
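For context, this TypeError appears whenever a ragged object-dtype array is passed to torch.tensor; a minimal sketch (not from the repository) reproducing it:

```python
import numpy as np
import torch

# Variable-length token lists can only be stored as dtype=object;
# torch.tensor() cannot convert object arrays, hence the error above.
ragged = np.array([[1996, 9639, 3035], [3901, 1997]], dtype=object)
torch.tensor(ragged, dtype=torch.float32)  # raises the same TypeError
```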

Besides, the configs look a little confusing to me. If I want to load BERT embeddings and not use image features, can I use the following config?
mind.yaml --> dcn/din/bst/pnn.yaml --> tt.yaml --> bert-token.yaml


Jyonn commented Jul 2, 2023

You should prepare the BERT token embeddings yourself. One approach is to use the transformers package:

```python
from transformers import BertModel
import numpy as np

# Extract the (vocab_size, hidden_size) token embedding matrix
bert = BertModel.from_pretrained('bert-base-uncased')
embeds = bert.embeddings.word_embeddings.weight.detach().numpy()

# np.save takes the file path first, then the array
np.save('bert-token.npy', embeds)
```

The above code is just an illustrative draft and has not been tested.
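To sanity-check the saved file before pointing the config at it, a quick sketch (the expected shape assumes bert-base-uncased):

```python
import numpy as np

embeds = np.load('bert-token.npy')
# bert-base-uncased: vocab_size=30522, hidden_size=768
print(embeds.shape, embeds.dtype)  # expected: (30522, 768) float32
```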

In this case, the embed configuration should be:

```yaml
name: bert-token  # can be any name
embeddings:
  -
    vocab_name: english
    vocab_type: numpy
    path: /path/to/bert-token.npy
    frozen: true  # not trainable
```


Jyonn commented Jul 2, 2023

> Besides, the configs look a little confusing to me. If I want to load BERT embeddings and not use image features, can I use the following config?
> mind.yaml --> dcn/din/bst/pnn.yaml --> tt.yaml --> bert-token.yaml

Sure! I recommend using tt-ctr.yaml for CTR models such as DCN and PNN.

Jyonn closed this as completed Jul 2, 2023