Unable to get valid value from layoutlm model #91

Closed
NancyNozomi opened this issue Mar 25, 2020 · 14 comments

Comments

@NancyNozomi

NancyNozomi commented Mar 25, 2020

Hi there,

Thank you very much for your work. I'd like to use the LayoutLM model for sequence labeling, but I've run into a problem, detailed as follows:

For the environment, I tried to set it up as described in README.md, but the command
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
failed, so I used my existing environment, which also meets the requirements.

For the data, I downloaded FUNSD and preprocessed its testing_data following README.md.

For the main script, after preprocessing I run run_seq_labeling.py directly rather than from the command line. I can finish training, but prediction fails. My configuration is equivalent to the following command:
python run_seq_labeling.py
--data_dir data
--model_type layoutlm
--model_name_or_path layoutlm-large-uncased \ (downloaded from Google Drive)
--output_dir out
--labels data/labels.txt
--do_predict
--do_lower_case
--overwrite_output_dir
The other options are left at their defaults. It errors at
line 357: eval_loss += tmp_eval_loss.item()

and the error is
RuntimeError: CUDA error: device-side assert triggered

I debugged it, and the error may originate from
line 349: outputs = model(**inputs)

The input is a dict of tensors containing {input_ids, attention_mask, labels, bbox, token_type_ids}, but the output is a tuple of two tensors whose data shows "Unable to get repr for 'torch.Tensor'" in the debugger.

I ran it as I understood from the paper, but I don't know whether that is correct. I've spent a long time on this without any result; could you please provide some help? Sincere thanks.
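
A minimal sketch for getting a more readable error (assuming a standard PyTorch setup; debug_batch_on_cpu is a hypothetical helper, not part of run_seq_labeling.py):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def debug_batch_on_cpu(model, inputs):
    # Re-run a single failing batch on CPU, where a bad index (e.g. an
    # out-of-range label id) raises a plain Python error instead of
    # "CUDA error: device-side assert triggered".
    model_cpu = model.cpu()
    inputs_cpu = {k: v.cpu() for k, v in inputs.items() if torch.is_tensor(v)}
    return model_cpu(**inputs_cpu)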

@donglixp
Contributor

@wolfshow

@ranpox
Contributor

ranpox commented Mar 26, 2020

Hi @NancyNozomi ,
Thanks for your feedback. I ran the predict step again without apex, but I cannot reproduce this bug. Could you share the training command you used?

@NancyNozomi
Author

NancyNozomi commented Mar 27, 2020

Hi, @ranpox

I know how valuable your time is, so thank you for taking the time to respond.

I can finish training after preprocessing. I run run_seq_labeling.py with a configuration equivalent to the following command:

py run_seq_labeling.py
--data_dir data
--model_type layoutlm
--model_name_or_path layoutlm-large-uncased
--output_dir out
--labels data/labels.txt
--config_name \ (left empty in the script, which means the same as model_name, as the help explains)
--tokenizer_name \ (ditto)
--cache_dir \ (ditto)
--max_seq_length 128 \ (the default 512 runs my GPU out of memory)
--do_train
--do_lower_case
--overwrite_output_dir

// the following are the default values

--per_gpu_train_batch_size 8
--per_gpu_eval_batch_size 8
--gradient_accumulation_steps 1
--learning_rate 5e-5
--weight_decay 0.0
--adam_epsilon 1e-8
--max_grad_norm 1.0
--num_train_epochs 3.0
--max_steps -1
--warmup_steps 0
--logging_steps 50
--save_steps 50
--seed 42
--fp16_opt_level 'O1'
--local_rank -1
--srv_ip ''
--srv_port ''

I run it as above and get train.log and other files in the out directory. But when I change --do_train to --do_predict only, it hits the error described above. I'll show the values of the model's input and output at the end. Besides, I notice an error during the run like:

THCudaCheck FAIL file=/tmp/pip-req-build-ocx5vxk7/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=59 :
device-side assert triggered /tmp/pip-req-build-ocx5vxk7/aten/src/THCUNN/ClassNLLCriterion.cu:106:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]:
block: [0,0,0], thread: [1,0,0] Assertion 't >= 0 && t < n_classes' failed.

I'm not sure whether my input is in the wrong format because of the value -100 in the labels dict. I hope this information is useful.
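
A quick sanity check of the labels tensor might look like the sketch below (check_label_range is a hypothetical helper; num_labels should be whatever classifier head size the checkpoint was trained with):

import torch

def check_label_range(labels: torch.Tensor, num_labels: int) -> None:
    # -100 is CrossEntropyLoss's default ignore_index, so it is harmless.
    vals = labels[labels != -100]
    bad = vals[(vals < 0) | (vals >= num_labels)]
    if bad.numel() > 0:
        print("label ids outside [0, %d):" % num_labels, bad.unique().tolist())
    else:
        print("all label ids are -100 or within [0, %d)" % num_labels)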

Sincere thanks again.
[Screenshots: Selection_028, Selection_029, Selection_031]

@yts19871111

Why does predict calculate the loss?

@yts19871111

[Screenshot: Selection_103]

@yts19871111

[Screenshot: Selection_104]
When I use batch size 8 and max_seq_length 512, inputs['labels'] has shape (8, 512) but logits has shape (8, 512, 2); these are what get passed to nn.CrossEntropyLoss().
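
For reference, the token-classification loss is typically computed by flattening both tensors; a minimal sketch with the shapes above (ignore_index=-100 skips positions that should not contribute to the loss):

import torch
import torch.nn as nn

batch_size, seq_len, num_labels = 8, 512, 2
logits = torch.randn(batch_size, seq_len, num_labels)         # (8, 512, 2)
labels = torch.randint(0, num_labels, (batch_size, seq_len))  # (8, 512)

loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
print(loss)
# This only works while every label is -100 or < num_labels; any larger label id
# is exactly what trips the "t >= 0 && t < n_classes" assert on the GPU.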

@ranpox
Contributor

ranpox commented Apr 2, 2020

Hi @NancyNozomi ,
I'm sorry for my late reply. I used the exact command you provided to rerun the experiment, but I cannot reproduce this bug.
CrossEntropyLoss will ignore the value "-100", so I think that part is OK.
You mentioned that you "preprocessed its testing_data". If you don't mind, could you share your preprocessing steps for the test dataset? Sometimes the wrong input format triggers this assertion.
Please also check the file "data/labels.txt".

B-ANSWER
B-HEADER
B-QUESTION
E-ANSWER
E-HEADER
E-QUESTION
I-ANSWER
I-HEADER
I-QUESTION
O
S-ANSWER
S-HEADER
S-QUESTION
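
If it helps, one way to compare labels.txt against the trained checkpoint is a sketch like the following (the paths and the classifier key name are assumptions and may differ in your setup):

import torch

labels = [l.strip() for l in open("data/labels.txt") if l.strip()]
state = torch.load("out/pytorch_model.bin", map_location="cpu")
for key, weight in state.items():
    if key.endswith("classifier.weight"):
        print("labels.txt has %d classes; checkpoint head outputs %d"
              % (len(labels), weight.shape[0]))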

Thanks.

@NancyNozomi
Author

NancyNozomi commented Apr 3, 2020

Hi, @ranpox ,
Thank you for your patient response, and I'm very sorry to trouble you.

My data/labels.txt file looks correct, as you describe, and the preprocessing steps are equivalent to the following commands (this is how I ran the Python script):

wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip dataset.zip && mv dataset data

python scripts/funsd_preprocess.py
// For this step I changed the parameters and ran funsd_preprocess.py directly; the parameters are equivalent to:
// python scripts/funsd_preprocess.py
// --data_dir ../data/testing_data/annotations
// --data_split test
// --output_dir ../data
// --model_name_or_path ../layoutlm-large-uncased
// --max_len 128

cat data/test.txt | cut -d$'\t' -f 2 | grep -v "^$"| sort | uniq > data/labels.txt

I tried to find where the error happens, and found that execution reaches torch/nn/modules/loss.py, line 914, in the function:
def forward(self, input, target)

where self = CrossEntropyLoss(), input = tensor(...), target = tensor(...).
There, input.shape = torch.Size([982, 2]) and target.shape = torch.Size([982]),
and target.data equals the labels value from the model's input dict.

As far as I know, cross entropy seems to expect matching dimensions? So I debugged and evaluated F.cross_entropy(input, target, ...), and the result was "Unable to get repr for <class 'torch.Tensor'>"; the input and target also changed to invalid values as above. I'll show it at the end.

So I think something is wrong in the input, but I can't tell where the problem comes from. I sincerely thank you in advance again and look forward to your help.
[Screenshots: QQ screenshot 20200403102639, QQ screenshot 20200403101251]
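
A minimal CPU reproduction with the shapes above gives a readable error instead of the CUDA assert (a sketch; the label value 12 is just an example of an id outside the two-column logits):

import torch
import torch.nn.functional as F

logits = torch.randn(982, 2)                        # input.shape seen in the debugger
targets = torch.full((982,), 12, dtype=torch.long)  # an id >= logits.shape[1]

try:
    F.cross_entropy(logits, targets, ignore_index=-100)
except Exception as e:
    # On CPU this raises something like "IndexError: Target 12 is out of bounds."
    print(type(e).__name__, e)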

@elnazsn1988

@ranpox based on your response to my thread regarding CUDA memory, I also tried @NancyNozomi's configuration by setting --per_gpu_train_batch_size=8; however, I get the same error.

@wolfshow
Contributor

@NancyNozomi which GPU did you use for the inference?

@NancyNozomi
Author

Hi, @wolfshow
Thank you for taking the time to respond. My GPU is a 1080 Ti, and it's the only GPU in my machine.

@wolfshow
Contributor

wolfshow commented Apr 14, 2020

@NancyNozomi have you ever tried updating the pytorch version and reducing the batch size?

@NancyNozomi
Author

@wolfshow,
Thank you for your advice.
For the PyTorch version, I set up the environment for LayoutLM separately and installed pytorch==1.3.1 as required. I just tried installing pytorch==1.4.0, but it still doesn't work, same as before.
For the batch size, I can finish training, and the error happens only during evaluation. I also reduced it; unfortunately, the result is still a CUDA error.
Finally, sincere thanks to you and the others who have helped me. If you haven't seen this case before and can't think of a possible reason for it, so be it, since I've already taken up too much of your time. Thank you very much all the same.

@shubhangi27397

@wolfshow,
Hi,
Actually, I am unable to see the results of this model and also unable to fine-tune it.
Could you please share the detailed steps for this?
I want to do document classification with the LayoutLM model.
Thank you in advance; oshbhu876@gmail.com is my personal email, so you can also reach me there.
Please help me out.
