Unable to get valid value from layoutlm model #91

Closed
NancyNozomi opened this issue Mar 25, 2020 · 14 comments

Comments

@NancyNozomi

NancyNozomi commented Mar 25, 2020

Hi there,

Thank you very much for your work. I'd like to use the LayoutLM model for sequence labeling, but I've run into a problem, detailed as follows:

For the environment, I tried to set it up as described in README.md, but the command
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
failed, so I used my existing environment, which also meets the requirements.

For the data, I downloaded FUNSD and preprocessed its testing_data following README.md.

For the main script, after preprocessing I run run_seq_labeling.py directly rather than from the command line. I can finish training, but prediction fails. My configuration is equivalent to the following command:
python run_seq_labeling.py
--data_dir data
--model_type layoutlm
--model_name_or_path layoutlm-large-uncased \ (downloaded from Google Drive)
--output_dir out
--labels data/labels.txt
--do_predict
--do_lower_case
--overwrite_output_dir
The other options are left at their defaults. It errors at
line 357: eval_loss += tmp_eval_loss.item()

and the error is
RuntimeError: CUDA error: device-side assert triggered

I debugged it, and the error may originate from
line 349: outputs = model(**inputs)

The input is a dict of tensors containing {input_ids, attention_mask, labels, bbox, token_type_ids}, but the output is a tuple of two tensors whose data shows "Unable to get repr for 'torch.Tensor'" in the debugger.

I ran it as I understood from the paper, but I don't know whether that is correct. I've spent a long time on this without any result; could you please provide some help? Sincere thanks.
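
A minimal sketch for getting a more readable error (assuming a standard PyTorch setup; debug_batch_on_cpu is a hypothetical helper, not part of run_seq_labeling.py):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def debug_batch_on_cpu(model, inputs):
    # Re-run a single failing batch on CPU, where a bad index (e.g. an
    # out-of-range label id) raises a plain Python error instead of
    # "CUDA error: device-side assert triggered".
    model_cpu = model.cpu()
    inputs_cpu = {k: v.cpu() for k, v in inputs.items() if torch.is_tensor(v)}
    return model_cpu(**inputs_cpu)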

@donglixp
Contributor

@wolfshow

@ranpox
Contributor

ranpox commented Mar 26, 2020

Hi @NancyNozomi ,
Thanks for your feedback. I ran the predict step again without apex, but I cannot reproduce this bug. Could you share the training command you used?

@NancyNozomi
Author

NancyNozomi commented Mar 27, 2020

Hi, @ranpox

I know how valuable your time is, so thank you for taking the time to respond.

I can finish training after preprocessing. I run run_seq_labeling.py with a configuration equivalent to the following command:

py run_seq_labeling.py
--data_dir data
--model_type layoutlm
--model_name_or_path layoutlm-large-uncased
--output_dir out
--labels data/labels.txt
--config_name \ (left empty in the script, which means the same as model_name, as the help explains)
--tokenizer_name \ (ditto)
--cache_dir \ (ditto)
--max_seq_length 128 \ (the default 512 runs my GPU out of memory)
--do_train
--do_lower_case
--overwrite_output_dir

// the following are the default values

--per_gpu_train_batch_size 8
--per_gpu_eval_batch_size 8
--gradient_accumulation_steps 1
--learning_rate 5e-5
--weight_decay 0.0
--adam_epsilon 1e-8
--max_grad_norm 1.0
--num_train_epochs 3.0
--max_steps -1
--warmup_steps 0
--logging_steps 50
--save_steps 50
--seed 42
--fp16_opt_level 'O1'
--local_rank -1
--srv_ip ''
--srv_port ''

I run it as above and get train.log and other files in the out directory. But when I change --do_train to --do_predict only, it hits the error described above. I'll show the values of the model's input and output at the end. Besides, I notice an error during the run like:

THCudaCheck FAIL file=/tmp/pip-req-build-ocx5vxk7/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=59 :
device-side assert triggered /tmp/pip-req-build-ocx5vxk7/aten/src/THCUNN/ClassNLLCriterion.cu:106:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]:
block: [0,0,0], thread: [1,0,0] Assertion 't >= 0 && t < n_classes' failed.

I'm not sure whether my input is in the wrong format because of the value -100 in the labels dict. I hope this information is useful.
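
A quick sanity check of the labels tensor might look like the sketch below (check_label_range is a hypothetical helper; num_labels should be whatever classifier head size the checkpoint was trained with):

import torch

def check_label_range(labels: torch.Tensor, num_labels: int) -> None:
    # -100 is CrossEntropyLoss's default ignore_index, so it is harmless.
    vals = labels[labels != -100]
    bad = vals[(vals < 0) | (vals >= num_labels)]
    if bad.numel() > 0:
        print("label ids outside [0, %d):" % num_labels, bad.unique().tolist())
    else:
        print("all label ids are -100 or within [0, %d)" % num_labels)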

Sincere thanks again.
[Screenshots: Selection_028, Selection_029, Selection_031]

@yts19871111

Why does predict calculate the loss?

@yts19871111

[Screenshot: Selection_103]

@yts19871111

[Screenshot: Selection_104]
When I use batch size 8 and max_seq_length 512, inputs['labels'] has shape (8, 512) but logits has shape (8, 512, 2); these are what get passed to nn.CrossEntropyLoss().
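
For reference, the token-classification loss is typically computed by flattening both tensors; a minimal sketch with the shapes above (ignore_index=-100 skips positions that should not contribute to the loss):

import torch
import torch.nn as nn

batch_size, seq_len, num_labels = 8, 512, 2
logits = torch.randn(batch_size, seq_len, num_labels)         # (8, 512, 2)
labels = torch.randint(0, num_labels, (batch_size, seq_len))  # (8, 512)

loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
print(loss)
# This only works while every label is -100 or < num_labels; any larger label id
# is exactly what trips the "t >= 0 && t < n_classes" assert on the GPU.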

@ranpox
Contributor

ranpox commented Apr 2, 2020

Hi @NancyNozomi ,
I'm sorry for my late reply. I used the exact command you provided to rerun the experiment, but I cannot reproduce this bug.
CrossEntropyLoss will ignore the value "-100", so I think that part is OK.
You mentioned that you "preprocessed its testing_data". If you don't mind, could you share your preprocessing steps for the test dataset? Sometimes the wrong input format triggers this assertion.
Please also check the file "data/labels.txt".

B-ANSWER
B-HEADER
B-QUESTION
E-ANSWER
E-HEADER
E-QUESTION
I-ANSWER
I-HEADER
I-QUESTION
O
S-ANSWER
S-HEADER
S-QUESTION
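
If it helps, one way to compare labels.txt against the trained checkpoint is a sketch like the following (the paths and the classifier key name are assumptions and may differ in your setup):

import torch

labels = [l.strip() for l in open("data/labels.txt") if l.strip()]
state = torch.load("out/pytorch_model.bin", map_location="cpu")
for key, weight in state.items():
    if key.endswith("classifier.weight"):
        print("labels.txt has %d classes; checkpoint head outputs %d"
              % (len(labels), weight.shape[0]))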

Thanks.

@NancyNozomi
Author

NancyNozomi commented Apr 3, 2020

Hi, @ranpox ,
Thank you for your patient response, and I'm very sorry to trouble you.

My data/labels.txt file looks correct, as you describe, and the preprocessing steps are equivalent to the following commands (this is how I ran the Python script):

wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip dataset.zip && mv dataset data

python scripts/funsd_preprocess.py
// For this step I changed the parameters and ran funsd_preprocess.py directly; the parameters are equivalent to:
// python scripts/funsd_preprocess.py
// --data_dir ../data/testing_data/annotations
// --data_split test
// --output_dir ../data
// --model_name_or_path ../layoutlm-large-uncased
// --max_len 128

cat data/test.txt | cut -d$'\t' -f 2 | grep -v "^$"| sort | uniq > data/labels.txt

I tried to find where the error happens, and found that execution reaches torch/nn/modules/loss.py, line 914, in the function:
def forward(self, input, target)

where self = CrossEntropyLoss(), input = tensor(...), target = tensor(...).
There, input.shape = torch.Size([982, 2]) and target.shape = torch.Size([982]),
and target.data equals the labels value from the model's input dict.

As far as I know, cross entropy seems to expect matching dimensions? So I debugged and evaluated F.cross_entropy(input, target, ...), and the result was "Unable to get repr for <class 'torch.Tensor'>"; the input and target also changed to invalid values as above. I'll show it at the end.

So I think something is wrong in the input, but I can't tell where the problem comes from. I sincerely thank you in advance again and look forward to your help.
[Screenshots: QQ screenshot 20200403102639, QQ screenshot 20200403101251]
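
A minimal CPU reproduction with the shapes above gives a readable error instead of the CUDA assert (a sketch; the label value 12 is just an example of an id outside the two-column logits):

import torch
import torch.nn.functional as F

logits = torch.randn(982, 2)                        # input.shape seen in the debugger
targets = torch.full((982,), 12, dtype=torch.long)  # an id >= logits.shape[1]

try:
    F.cross_entropy(logits, targets, ignore_index=-100)
except Exception as e:
    # On CPU this raises something like "IndexError: Target 12 is out of bounds."
    print(type(e).__name__, e)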

@elnazsn1988

@ranpox based on your response to my thread regarding CUDA memory, I also tried @NancyNozomi's configuration by setting --per_gpu_train_batch_size=8; however, I get the same error.

@wolfshow
Contributor

@NancyNozomi which GPU did you use for the inference?

@NancyNozomi
Author

Hi, @wolfshow
Thank you for taking the time to respond. My GPU is a 1080 Ti, and it's the only GPU in my machine.

@wolfshow
Contributor

wolfshow commented Apr 14, 2020

@NancyNozomi have you ever tried updating the pytorch version and reducing the batch size?

@NancyNozomi
Author

@wolfshow,
Thank you for your advice.
For the PyTorch version, I set up the environment for LayoutLM separately and installed pytorch==1.3.1 as required. I just tried installing pytorch==1.4.0, but it still doesn't work, same as before.
For the batch size, I can finish training, and the error happens only during evaluation. I also reduced it; unfortunately, the result is still a CUDA error.
Finally, sincere thanks to you and the others who have helped me. If you haven't seen this case before and can't think of a possible reason for it, so be it, since I've already taken up too much of your time. Thank you very much all the same.

@shubhangi27397

@wolfshow,
Hi,
Actually, I am unable to see the results of this model and also unable to fine-tune it.
Could you please share the detailed steps for this?
I want to do document classification with the LayoutLM model.
Thank you in advance; oshbhu876@gmail.com is my personal email, so you can also reach me there.
Please help me out.
