How to load bert embeddings #2

Closed
bizbard opened this issue Jul 2, 2023 · 2 comments


bizbard commented Jul 2, 2023

I tried to load BERT embeddings of news texts with 'bert-token.yaml', using 'dcn.yaml' as the recommended model. After preprocessing the data with bert_processor.py, I realized it only tokenizes the text. When loading data.npy in embedding_loader.py, I printed out the embedding and found only token IDs, no BERT embeddings. How can I extract the BERT embeddings and load them into the model?

Printed embedding variable:

```
{'nid': array([0, 1, 2, ..., 65235, 65236, 65237], dtype=object),
 'cat': array([list([9580]), list([2740]), list([2739]), ..., list([2739])], dtype=object),
 'title': array([list([1996, 9639, 3035, 3870, 1010, 3159, 2798, 1010, 1998, 3159, 5170, 8415, 2011]), ..., list([3901, 1997, 4916, 2237, 5998, 2007, 3571, 2044, 9288])], dtype=object),
 'abs': array([list([4497, 1996, 14960, 2015, 1010, 17764, 1010, 1998, 2062, 2008, 1996, 15426, 2064, 1005, 1056, 2444, 2302, 1012]), list([2122, 9428, 19741, 14243, 2024, 3173, 2017, 2067, 1998, 4363, 2017, 2013, 8328, 4667, 2008, 18162, 7579, 6638, 2005, 2204, 1012]), ..., list([])], dtype=object)}
```

The error:

```
Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 196, in __init__
    self.embedding_manager.load_pretrained_embedding(**Obj.raw(embedding_info))
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_manager.py", line 66, in load_pretrained_embedding
    self.pretrained[vocab_name] = EmbeddingInfo(**kwargs).load()
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 39, in load
    self.embedding = getter(self.path)
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 21, in get_numpy_embedding
    return torch.tensor(embedding, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
```
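For context, this TypeError appears whenever a ragged object-dtype array is passed to torch.tensor; a minimal sketch (not from the repository) reproducing it:

```python
import numpy as np
import torch

# Variable-length token lists can only be stored as dtype=object;
# torch.tensor() cannot convert object arrays, hence the error above.
ragged = np.array([[1996, 9639, 3035], [3901, 1997]], dtype=object)
torch.tensor(ragged, dtype=torch.float32)  # raises the same TypeError
```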

Besides, the configs look a little confusing to me. If I want to load BERT embeddings and not use image features, can I use the following config?
mind.yaml --> dcn/din/bst/pnn.yaml --> tt.yaml --> bert-token.yaml


Jyonn commented Jul 2, 2023

You should prepare the BERT token embeddings yourself. One approach is to use the transformers package:

```python
from transformers import BertModel
import numpy as np

# Extract the (vocab_size, hidden_size) token embedding matrix
bert = BertModel.from_pretrained('bert-base-uncased')
embeds = bert.embeddings.word_embeddings.weight.detach().numpy()

# np.save takes the file path first, then the array
np.save('bert-token.npy', embeds)
```

The above code is just an illustrative draft and has not been tested.
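To sanity-check the saved file before pointing the config at it, a quick sketch (the expected shape assumes bert-base-uncased):

```python
import numpy as np

embeds = np.load('bert-token.npy')
# bert-base-uncased: vocab_size=30522, hidden_size=768
print(embeds.shape, embeds.dtype)  # expected: (30522, 768) float32
```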

In this case, the embed configuration should be:

```yaml
name: bert-token  # can be any name
embeddings:
  -
    vocab_name: english
    vocab_type: numpy
    path: /path/to/bert-token.npy
    frozen: true  # not trainable
```


Jyonn commented Jul 2, 2023

> Besides, the configs look a little confusing to me. If I want to load BERT embeddings and not use image features, can I use the following config?
> mind.yaml --> dcn/din/bst/pnn.yaml --> tt.yaml --> bert-token.yaml

Sure! I recommend using tt-ctr.yaml for CTR models such as DCN and PNN.

Jyonn closed this as completed Jul 2, 2023