
Load BioBERT pre-trained weights into BERT model with PyTorch Hugging Face run_classifier.py code #457

Closed
sheetalsh456 opened this issue Apr 8, 2019 · 12 comments

Comments

@sheetalsh456

These are the steps I followed to get BioBERT working with the existing Hugging Face PyTorch BERT code.

  1. I downloaded the pre-trained weights 'biobert_pubmed_pmc.tar.gz' from the Releases page.

  2. I ran this command to convert the TF checkpoint to a PyTorch model:

python pytorch-pretrained-BERT/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path="biobert/pubmed_pmc_470k/biobert_model.ckpt.index" --bert_config_file="biobert/pubmed_pmc_470k/bert_config.json" --pytorch_dump_path="biobert/pubmed_pmc_470k/Pytorch/biobert.model"

This created a file 'biobert.model' in the specified path.

  3. As mentioned in this link, I compressed the 'biobert.model' created above and 'biobert/pubmed_pmc_470k/bert_config.json' together into a biobert_model.tar.gz

  4. I then ran the run_classifier.py of the Hugging Face BERT with the following command, using the tar.gz created above.

python pytorch-pretrained-BERT/examples/run_classifier.py --data_dir="Data/" --bert_model="biobert_model.tar.gz" --task_name="qqp" --output_dir="OutputModels/Pretrained/" --do_train --do_eval --do_lower_case

I get the error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

at the line

tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

(Byte 0x8b at position 1 is the second byte of the gzip magic number 0x1f 0x8b, which suggests the tokenizer is reading the compressed archive itself as if it were a plain-text vocab file.)

Am I doing something wrong?

I just want to run the run_classifier.py code provided by Hugging Face with BioBERT pre-trained weights, in the same way that we run it with BERT. Is there a way to do this?

@thomwolf
Member

Have you tried the solutions discussed in the other issues on this topic:

@stale

stale bot commented Jun 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@nikhilsid

Have you tried the solutions discussed in the other issues on this topic:

Hi @thomwolf ,

I followed the instructions here to convert the checkpoint, then placed the files (pytorch_model.bin, bert_config.json, and vocab.txt) in one folder and compressed it. I did not need to ignore any weights as mentioned in the solutions you referenced above (#312 or #239).

I copied the compressed folder to the home folder of 'pytorch-transformers', then tried to run the example code ('examples/run_glue.py') on my data from there with the following command:

python ./examples/run_glue.py \
  --model_type bert \
  --model_name_or_path biobert.gz \
  --task_name=sts-b \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir=$DIR \
  --max_seq_length 128 \
  --per_gpu_eval_batch_size=8 \
  --per_gpu_train_batch_size=8

But I get the same error as mentioned in the main discussion:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

at this location:

File "./examples/run_glue.py", line 424, in main config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name)

Can you please tell me what to change?

@stefan-it
Collaborator

Try to pass the extracted folder of your converted bioBERT model to the --model_name_or_path :)

Here's a short example:

  • Download the BioBERT v1.1 (+ PubMed 1M) model (or any other model) from the bioBERT repo
  • Extract the downloaded file, e.g. with tar -xzf biobert_v1.1_pubmed.tar.gz
  • Convert the bioBERT model TensorFlow checkpoint to a PyTorch and PyTorch-Transformers compatible one: pytorch_transformers bert biobert_v1.1_pubmed/model.ckpt-1000000 biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/pytorch_model.bin
  • Move the config: mv biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/config.json

Then pass the folder name to the --model_name_or_path argument. You can run this simple script to check if everything works:

from pytorch_transformers import BertModel
model = BertModel.from_pretrained('biobert_v1.1_pubmed')
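
If the vocab file is in the same folder, the tokenizer can be checked the same way. A minimal sketch, assuming vocab.txt was extracted into biobert_v1.1_pubmed alongside the converted weights:

from pytorch_transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('biobert_v1.1_pubmed')  # reads biobert_v1.1_pubmed/vocab.txt
print(tokenizer.tokenize('The patient was given ibuprofen.'))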

@nt-sumit

How do we load a tensor as a pretrained model in BERT?

@nipunsadvilkar

@stefan-it As per the new transformers-cli, the third command would change as follows:

transformers-cli convert --model_type bert \
--tf_checkpoint biobert_v1.1_pubmed/model.ckpt-1000000 \
--config biobert_v1.1_pubmed/bert_config.json \
--pytorch_dump_output biobert_v1.1_pubmed/pytorch_model.bin
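
Under the renamed transformers package, the sanity check from above becomes the following (a small sketch, assuming the converted files live in biobert_v1.1_pubmed/):

from transformers import BertModel
model = BertModel.from_pretrained('biobert_v1.1_pubmed')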

@JessicaLopezEspejel

JessicaLopezEspejel commented Apr 13, 2020

Hello!
To complement @stefan-it's instructions in step 3, the following code works for me:

import os
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch

path_bert = 'my_bert_directory/'  # folder containing the TF checkpoint and bert_config.json
path_bin = os.path.join(path_bert, 'pytorch_model.bin')  # converted PyTorch weights

# Convert only if the PyTorch weights do not already exist
if not os.path.exists(path_bin):
    convert_tf_checkpoint_to_pytorch(
        path_bert + 'biobert_model.ckpt',
        path_bert + 'bert_config.json',
        path_bin,
    )
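
Once converted, the folder can be loaded directly with pytorch_pretrained_bert. A quick check, assuming vocab.txt and bert_config.json are also in my_bert_directory/:

from pytorch_pretrained_bert import BertModel
model = BertModel.from_pretrained('my_bert_directory/')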

My folder (biobert_v1.0_pmc) originally contained 5 files:

  • 3 TensorFlow checkpoint files
  • A vocab file (vocab.txt)
  • A config file (bert_config.json)

@tokarev-i-v

tokarev-i-v commented Apr 16, 2020

  • Move the config: mv biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/config.json

Thank you very much! It did help me!

@abkds

abkds commented May 1, 2020

Thank you everyone, this works fine now!

@mariaBio

Dear all,
Thank you very much for the suggestions on how to prepare the model for use with Hugging Face. I am trying to use the Hugging Face BertTokenizer to perform NER on biomedical data with the pre-trained weights.
It seems to work fine up to the point where I want to map annotated tokens to entity labels. I have token ids and prediction ids, but I cannot figure out how/where to get the label_list to follow the mapping example at https://huggingface.co/transformers/usage.html#named-entity-recognition
Thank you very much for any help you can provide!
Maria
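
For anyone hitting the same question: the label_list is not stored in the base BioBERT weights; it comes from whatever dataset the token-classification head was fine-tuned on, in the same order used during training. A minimal sketch of the mapping, assuming a fine-tuned NER model in the placeholder folder 'my_biobert_ner' and the CoNLL-style label list from the linked usage example:

import torch
from transformers import BertForTokenClassification, BertTokenizer

# Must match the label order the model was fine-tuned with;
# this is the CoNLL-2003 list used in the linked usage example.
label_list = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER',
              'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

tokenizer = BertTokenizer.from_pretrained('my_biobert_ner')  # placeholder path
model = BertForTokenClassification.from_pretrained('my_biobert_ner')

inputs = tokenizer.encode('The patient was treated at the Mayo Clinic.', return_tensors='pt')
with torch.no_grad():
    predictions = torch.argmax(model(inputs)[0], dim=2)

# Pair each wordpiece token with its predicted entity label
tokens = tokenizer.convert_ids_to_tokens(inputs[0].tolist())
print([(t, label_list[p]) for t, p in zip(tokens, predictions[0].tolist())])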

@TamMinhVo

Dear all,
I am a newbie and I do not have much experience. Does anyone have a full tutorial or code for a regression task? Please share it with me! I would greatly appreciate it! Thank you.
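
As a starting point: run_glue.py with --task_name=sts-b (as in the command earlier in this thread) is already a regression fine-tune. Under the hood, a BERT regression head is just BertForSequenceClassification with num_labels=1, which makes it use an MSE loss. A minimal sketch with placeholder model name and target value:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# num_labels=1 turns the classification head into a regression head (MSE loss)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)

inputs = tokenizer.encode('A sentence to score.', return_tensors='pt')
labels = torch.tensor([2.5])  # a float target, e.g. an STS-B similarity score
loss, logits = model(inputs, labels=labels)[:2]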

@varunp2k

varunp2k commented Jun 3, 2020

@stefan-it @nipunsadvilkar Thank you for your solutions.
