
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #1620

Closed
lipingbj opened this issue Oct 24, 2019 · 7 comments
@lipingbj

When I load the pretrained model from the local bin file, there is a decoding problem.

@LysandreJik
Member

Hi, could you provide more information: e.g. respect the template? Please tell us which model, which bin file, with which command?

@lipingbj
Author

> Hi, could you provide more information: e.g. respect the template? Please tell us which model, which bin file, with which command?

```python
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/bert-base-cased-pytorch_model.bin")
model = XLNetModel.from_pretrained("/data2/liping/xlnet/xlnet-base-cased-pytorch_model.bin")
```

Either of these commands makes the problem occur.

@stefan-it
Collaborator

@lipingbj With the latest versions of transformers you need to pass the path to the PyTorch-compatible model, so in your example use:

```python
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/")
```

The following files must be located in that folder:

- `vocab.txt` - vocabulary file
- `pytorch_model.bin` - the PyTorch-compatible (and converted) model
- `config.json` - JSON-based model configuration

Please make sure that these files exist, and e.g. rename `bert-base-cased-pytorch_model.bin` to `pytorch_model.bin`.

That should work :)
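The rename-and-check step above can be sketched with the standard library alone (the helper name `prepare_model_dir` is illustrative, not part of transformers; the filenames are the three that `from_pretrained` expects):

```python
from pathlib import Path

def prepare_model_dir(model_dir, old_bin_name):
    """Rename a custom-named checkpoint to the pytorch_model.bin
    filename that from_pretrained expects, then return the list of
    required files that are still missing from the directory."""
    model_dir = Path(model_dir)
    old_bin = model_dir / old_bin_name
    target = model_dir / "pytorch_model.bin"
    if old_bin.is_file() and not target.exists():
        old_bin.rename(target)
    required = ("vocab.txt", "pytorch_model.bin", "config.json")
    return [name for name in required if not (model_dir / name).is_file()]
```

If the returned list is empty, passing the directory (not the `.bin` file) to `BertTokenizer.from_pretrained(...)` should then find everything it needs.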

@lipingbj
Author

lipingbj commented Oct 24, 2019

> @lipingbj With the latest versions of transformers you need to pass the path to the PyTorch-compatible model, so in your example use `BertTokenizer.from_pretrained("/home/liping/liping/bert/")` [...] Please make sure that these files exist and e.g. rename bert-base-cased-pytorch_model.bin to pytorch_model.bin.

```python
encoder_model = BertModel.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")
```

`vocab.txt`, `pytorch_model.bin`, and `config.json` are all included in the directory `bert/pytorch-bert-model`, but I still get:

```
OSError: Model name '/home/liping/liping/bert/pytorch-bert-model' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.
```

@LysandreJik
Member

As the error says, "We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url."

Your data does not seem to be in "/home/liping/liping/bert/pytorch-bert-model"
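One quick way to confirm what `from_pretrained` will actually see in that directory (`report_model_files` is just an illustrative stdlib helper, not a transformers API; the directory path is the one from the traceback):

```python
import os

def report_model_files(model_dir):
    """Return, for each file from_pretrained looks for, whether it
    actually exists at that exact path (watch for typos and case)."""
    required = ("config.json", "vocab.txt", "pytorch_model.bin")
    return {name: os.path.isfile(os.path.join(model_dir, name))
            for name in required}

# e.g. report_model_files("/home/liping/liping/bert/pytorch-bert-model")
```

Any `False` entry in the result points at the file the loader could not find.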

@phaniram-sayapaneni

Hello,

I'm trying to load biobert into pytorch, seeing a different error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

any hints? @LysandreJik

@khu834

khu834 commented Aug 26, 2020

> Hello,
>
> I'm trying to load biobert into pytorch, seeing a different error:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
>
> any hints? @LysandreJik

Can you show the code that you are running to load from pre-trained weights?
For example:

```python
model = BertForSequenceClassification.from_pretrained('/path/to/directory/containing/model_artifacts/')
```

As stefan-it mentioned above, the directory must contain the 3 required files.
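The byte values in these tracebacks can also hint at what the file actually is: `0x80` at position 0 is consistent with a pickle stream (the legacy `torch.save` format) being opened as text, while `0x8b` at position 1 matches the gzip magic number (`1f 8b`), e.g. a download that was never decompressed. A small sniffing sketch (the signature table below is an assumption based on common file magics, not something stated in this thread):

```python
def sniff_checkpoint(path):
    """Guess a checkpoint file's container format from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\x1f\x8b"):
        return "gzip archive (decompress it before loading)"
    if head.startswith(b"PK\x03\x04"):
        return "zip container (newer torch.save format)"
    if head.startswith(b"\x80"):
        return "raw pickle stream (legacy torch.save format)"
    return "unknown format"
```

If the result says gzip, extracting the archive first and then pointing `from_pretrained` at the extracted directory may resolve the `0x8b` error.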
