How to load vocfile for Albert #17

Closed

gargomeiste opened this issue Oct 30, 2019 · 1 comment

@gargomeiste

Hello,

I tried to use your library to fine-tune ALBERT:

import bert
from tensorflow import keras
from tensorflow.keras.metrics import Precision, Recall

def create_model(max_seq_len):
    """Creates a classification model."""
    albert_model_name = "albert_base"
    albert_dir = bert.fetch_tfhub_albert_model(albert_model_name, ".models")

    albert_params = bert.albert_params(albert_model_name)
    l_bert = bert.BertModelLayer.from_params(albert_params, name="albert")

    input_ids      = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="input_ids")
    token_type_ids = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="token_type_ids")
    output         = l_bert([input_ids, token_type_ids])

    print("bert shape", output.shape)
    output = keras.layers.Lambda(lambda x: x[:, 0, :])(output)   # take the [CLS] token
    output = keras.layers.Dense(1, activation="sigmoid")(output)

    model = keras.Model(inputs=[input_ids, token_type_ids], outputs=output)
    model.build(input_shape=[(None, max_seq_len)])

    for weight in l_bert.weights:
        print(weight.name)

    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="binary_crossentropy",
                  metrics=["accuracy", Precision(), Recall()])

    bert.load_albert_weights(l_bert, albert_dir)

    model.summary()

    return model

When training, I get an index error in the embeddings (my assumption is that the vocab / vocab_size differs between BERT and ALBERT). I tried to import the vocab file from TF-Hub but wasn't able to find it. Am I missing something?
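
A minimal way to check that suspicion (just a sketch; it assumes the params object returned by bert.albert_params exposes a vocab_size field, and the example ids below are made up) is to compare the largest id the current tokenizer produces against the size of the embedding table:

import numpy as np
import bert

albert_params = bert.albert_params("albert_base")
print("albert vocab_size:", albert_params.vocab_size)

# hypothetical ids from a BERT WordPiece tokenizer -- replace with your own batch
token_ids = np.array([[101, 7592, 1010, 2088, 102]])
print("max token id:", token_ids.max())
assert token_ids.max() < albert_params.vocab_size, \
    "token ids exceed the embedding table -> wrong vocab for this checkpoint"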

Thank you in advance! You are doing amazing work!

@kpe
Owner

kpe commented Oct 30, 2019

@antoinegargot - ALBERT uses SentencePiece for tokenization, so you might need something like:

!pip install sentencepiece

import os
import tensorflow as tf
import sentencepiece as spm

# the sentencepiece model ships in the TF-Hub module's assets/ directory
sp_model = tf.io.gfile.glob(os.path.join(albert_dir, "assets/*"))[0]
sp = spm.SentencePieceProcessor()
sp.load(sp_model)

tokens = sp.encode_as_pieces("Hello, World!")
token_ids = list(map(sp.piece_to_id, tokens))
# or directly
token_ids = sp.encode_as_ids("Hello, World!")
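
To wire those ids into the classifier from the first comment, a rough sketch (assuming the TF-Hub ALBERT sentencepiece model defines the [CLS]/[SEP] pieces, and reusing create_model and max_seq_len from above) could look like:

import numpy as np

max_seq_len = 128
cls_id, sep_id = sp.piece_to_id("[CLS]"), sp.piece_to_id("[SEP]")

# one padded example: [CLS] tokens [SEP] followed by zero padding
ids = [cls_id] + sp.encode_as_ids("Hello, World!") + [sep_id]
ids = ids[:max_seq_len] + [0] * (max_seq_len - len(ids))
segments = [0] * max_seq_len                 # single-segment input

model = create_model(max_seq_len)
pred = model.predict([np.array([ids]), np.array([segments])])
print(pred)                                  # sigmoid output of the Dense(1) head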


kpe closed this as completed Oct 30, 2019