
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #1620

Closed
lipingbj opened this issue Oct 24, 2019 · 7 comments
@lipingbj

When I load the pretrained model from the local bin file, there is a decoding problem.

@LysandreJik
Member

Hi, could you provide more information: e.g. respect the template? Please tell us which model, which bin file, with which command?

@lipingbj
Author

> Hi, could you provide more information: e.g. respect the template? Please tell us which model, which bin file, with which command?

```python
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/bert-base-cased-pytorch_model.bin")
model = XLNetModel.from_pretrained("/data2/liping/xlnet/xlnet-base-cased-pytorch_model.bin")
```

Either of these commands makes the problem occur.

@stefan-it
Collaborator

@lipingbj With the latest versions of transformers you need to pass the path to the PyTorch-compatible model, so in your example use:

```python
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/")
```

The following files must be located in that folder:

- `vocab.txt` - vocabulary file
- `pytorch_model.bin` - the PyTorch-compatible (and converted) model
- `config.json` - JSON-based model configuration

Please make sure that these files exist, and e.g. rename `bert-base-cased-pytorch_model.bin` to `pytorch_model.bin`.

That should work :)
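The rename-and-check step above can be sketched with the standard library alone (the helper name `prepare_model_dir` is illustrative, not part of transformers; the filenames are the three that `from_pretrained` expects):

```python
from pathlib import Path

def prepare_model_dir(model_dir, old_bin_name):
    """Rename a custom-named checkpoint to the pytorch_model.bin
    filename that from_pretrained expects, then return the list of
    required files that are still missing from the directory."""
    model_dir = Path(model_dir)
    old_bin = model_dir / old_bin_name
    target = model_dir / "pytorch_model.bin"
    if old_bin.is_file() and not target.exists():
        old_bin.rename(target)
    required = ("vocab.txt", "pytorch_model.bin", "config.json")
    return [name for name in required if not (model_dir / name).is_file()]
```

If the returned list is empty, passing the directory (not the `.bin` file) to `BertTokenizer.from_pretrained(...)` should then find everything it needs.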

@lipingbj
Author

lipingbj commented Oct 24, 2019

> @lipingbj With the latest versions of transformers you need to pass the path to the PyTorch-compatible model, so in your example use `BertTokenizer.from_pretrained("/home/liping/liping/bert/")` [...] Please make sure that these files exist and e.g. rename bert-base-cased-pytorch_model.bin to pytorch_model.bin.

```python
encoder_model = BertModel.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")
```

`vocab.txt`, `pytorch_model.bin`, and `config.json` are all included in the directory `bert/pytorch-bert-model`, but I still get:

```
OSError: Model name '/home/liping/liping/bert/pytorch-bert-model' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.
```

@LysandreJik
Member

As the error says, "We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url."

Your data does not seem to be in "/home/liping/liping/bert/pytorch-bert-model"
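One quick way to confirm what `from_pretrained` will actually see in that directory (`report_model_files` is just an illustrative stdlib helper, not a transformers API; the directory path is the one from the traceback):

```python
import os

def report_model_files(model_dir):
    """Return, for each file from_pretrained looks for, whether it
    actually exists at that exact path (watch for typos and case)."""
    required = ("config.json", "vocab.txt", "pytorch_model.bin")
    return {name: os.path.isfile(os.path.join(model_dir, name))
            for name in required}

# e.g. report_model_files("/home/liping/liping/bert/pytorch-bert-model")
```

Any `False` entry in the result points at the file the loader could not find.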

@phaniram-sayapaneni

Hello,

I'm trying to load biobert into pytorch, seeing a different error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

any hints? @LysandreJik

@khu834

khu834 commented Aug 26, 2020

> Hello,
>
> I'm trying to load biobert into pytorch, seeing a different error:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
>
> any hints? @LysandreJik

Can you show the code that you are running to load from pre-trained weights?
For example:

```python
model = BertForSequenceClassification.from_pretrained('/path/to/directory/containing/model_artifacts/')
```

As stefan-it mentioned above, the directory must contain the 3 required files.
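The byte values in these tracebacks can also hint at what the file actually is: `0x80` at position 0 is consistent with a pickle stream (the legacy `torch.save` format) being opened as text, while `0x8b` at position 1 matches the gzip magic number (`1f 8b`), e.g. a download that was never decompressed. A small sniffing sketch (the signature table below is an assumption based on common file magics, not something stated in this thread):

```python
def sniff_checkpoint(path):
    """Guess a checkpoint file's container format from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\x1f\x8b"):
        return "gzip archive (decompress it before loading)"
    if head.startswith(b"PK\x03\x04"):
        return "zip container (newer torch.save format)"
    if head.startswith(b"\x80"):
        return "raw pickle stream (legacy torch.save format)"
    return "unknown format"
```

If the result says gzip, extracting the archive first and then pointing `from_pretrained` at the extracted directory may resolve the `0x8b` error.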
