LXMERT pre-training tasks #7266
Comments
Tagging the LXMERT implementation author @eltoto1219
Hi, "unc-nlp/lxmert-base-uncased" was trained with all tasks specified in the paper (as aforementioned). We have benchmarked the pre-trained model to make sure it reaches the same performance on all QA tasks. If you do run into any troubles though, please let me know! |
Hello @eltoto1219, thank you for the answer! I suppose it was a weird question on my part, since I was asking this to make sure that I am loading a pre-trained LXMERT model and not some random weights, especially because of the cross-modality matching behavior I observe. New question: do you know how it can be that LXMERT randomly guesses on cross-modality matching, even though it was pre-trained to deliver a score (after the softmax, of course) smaller than 0.5 if the caption does not describe the image and a score bigger than 0.5 if the caption and the image match?
+1 I am also interested in the question/answer!
I also meet this problem. Does anyone have ideas about why this happens? @eltoto1219
@ecekt Well observed, thank you very much! Now I took a closer look at this and with your proposed solution in #8333 I see how the features change, but not the performance. Still random guessing.
Hi @LetiP! Super interesting question. I was curious, so I ran a test on 1000 COCO val images with 5013 captions in total. Using the original implementation (without considering #8333, i.e. with wrong color ordering for local files) I received 56 % correct classifications for images with correct captions, so the same result you got 👍. Interestingly, this model gets 99.7 % correct for wrong image-caption combinations (image with a randomly drawn COCO caption). Hence we have 56 % Recall, but 99.7 % Specificity. Fixing the bug noted in #8333 (see code below), Recall goes up to 71 %, Specificity is at 99.2 %, Precision at 98 %, and Accuracy is at 85 %. From this result I conclude that the fix is clearly worth applying:

```python
# transformers/examples/lxmert/utils.py
def img_tensorize(im, input_format="RGB"):
    [...]
    assert img is not None, f"could not connect to: {im}"
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # <=== indent this line, so it works for local and url images.
    if input_format == "RGB":
        [...]
```
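As a side note on how numbers like Recall, Specificity, Precision, and Accuracy can be derived from such match/no-match predictions, here is a small illustrative sketch; the helper and variable names (`binary_metrics`, `y_true`, `y_pred`) are hypothetical and not part of the original evaluation:

```python
# Minimal sketch: compute Recall, Specificity, Precision, and Accuracy
# from binary match/no-match predictions (1 = "caption matches image").
# y_true / y_pred are illustrative placeholders, not the original data.
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "recall": tp / (tp + fn),       # correct detections of matching pairs
        "specificity": tn / (tn + fp),  # correct rejections of mismatched pairs
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(y_true),
    }
```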
Hello @phiyodr, looking at your results, I cannot help but wonder: how did you decide whether a classification is correct or not? I am asking because the `cross_relationship_score` is a tensor with two logit entries for an image-text pair. How do you decide which logit represents a match and which one a mismatch?
Indeed, the documentation is not super clear about whether the first or second value in `cross_relationship_score` indicates a match. I used 5013 correct image-caption pairs and 5013 wrong image-caption combinations, then I made a confusion matrix and checked whether the first or the second value of the tensor is more plausible as the match logit, both without considering #8333 and considering #8333. From both confusion matrices, the first value is likely to be the "no match" logit and the second value the "match" logit.
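For anyone who wants to repeat this kind of check, here is a minimal sketch of how the index assignment could be decided empirically; the names `scores` and `is_match` are placeholders, not from the original experiment:

```python
import torch

# Sketch: decide which logit index encodes "match" by testing both hypotheses
# against ground truth. `scores` (N, 2) holds cross_relationship_score logits,
# `is_match` (N,) holds 0/1 ground-truth labels; both are placeholders.
def best_index_assignment(scores: torch.Tensor, is_match: torch.Tensor):
    preds = scores.argmax(dim=-1)                                  # 0 or 1 per pair
    acc_if_index1_is_match = (preds == is_match).float().mean()
    acc_if_index0_is_match = (preds == (1 - is_match)).float().mean()
    if acc_if_index1_is_match >= acc_if_index0_is_match:
        return "index 1 = match", acc_if_index1_is_match.item()
    return "index 0 = match", acc_if_index0_is_match.item()
```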
Hello @phiyodr, thank you for your quick response! I think that this issue has evolved into the question of which logit was intended to represent a match. I understand that you pick the logit delivering better results, but I followed the documentation, which (as I understand it) assigns the first logit to the "match" case. @eltoto1219, do you perhaps know which logit in the `cross_relationship_score` tensor was intended to predict a match during pre-training?
Yeah, the documentation is actually vice versa.
Still waiting for confirmation about what is happening here, about the way that the model was trained and which logit was intended to predict the match. I do not see any reason why one could simply invert the logits based on wishful thinking.
Is there any entry-level example of LXMERT? Following the example from the LXMERT docs,

```python
from transformers import LxmertTokenizer, LxmertModel
import torch

tokenizer = LxmertTokenizer.from_pretrained('unc-nlp/lxmert-base-uncased')
model = LxmertModel.from_pretrained('unc-nlp/lxmert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```

an error comes up.
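For reference, here is a sketch of a minimal variant that does run, with random tensors standing in for real Faster R-CNN ROI features; the dimensions assume the default `LxmertConfig` (2048-dim features, 4-dim normalized boxes), and the output attributes are those of `LxmertModelOutput`:

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Random tensors stand in for real Faster R-CNN ROI features and boxes;
# in practice these come from a feature extractor (see the LXMERT demo).
num_boxes = 36
visual_feats = torch.rand(1, num_boxes, 2048)  # ROI-pooled features (default visual_feat_dim)
visual_pos = torch.rand(1, num_boxes, 4)       # normalized bounding boxes (default visual_pos_dim)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
language_hidden = outputs.language_output      # text-stream hidden states
vision_hidden = outputs.vision_output          # vision-stream hidden states
```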
Hi @LetiP! I am super sorry for all of the confusion regarding the correct/incorrect logit for the cross_relationship_score. Per the documentation you pointed out, it is indeed a bit ambiguous where the correct position of the "is_matched" index is. However, while pre-training, one must provide the sentence/image matching labels if cross-modality matching is included in the loss regime.

The PyTorch loss that was used can be found here. This should also indicate the assignments of the indices. Thus, if the sentence does match the image, the model will have maximized the likelihood of the first index (index 1) being 1. In the case of a mismatch, the zeroth index (index 0) will have been maximized to be 1 (aka True). If for some reason this is not occurring properly, please let me know!

Hey @yezhengli-Mr9! That is also a very misleading example for LXMERT, as one must provide the visual positions (normalized bounding boxes) and the FRCNN ROI-pooled visual features in order for the model to run. For all optional/non-optional inputs, please see the docs! I should be able to fix that sometime soon. For now, if needed, you can refer to the LXMERT pytests.

I will be making a pull request to remove the image tensorization from URLs, as it seems to be outside the scope of the demo and will remove one source of error. I will also formalize the feature-extraction code, as using a batch size larger than one entails image padding which, consequently, lowers the quality of the image features. Perhaps I can include that in an example too.
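To spell out that convention, here is a small sketch; the tensor values are made up, and only the index assignment (match = index 1, mismatch = index 0) follows the explanation above:

```python
import torch
import torch.nn.functional as F

# Made-up logits for one image-text pair; shape (1, 2) like cross_relationship_score.
# Index 0 <-> "sentence does NOT match the image", index 1 <-> "sentence matches".
score = torch.tensor([[0.3, 2.1]])
p_match = F.softmax(score, dim=-1)[0, 1].item()  # probability that caption and image match

# During pre-training, a matching pair would carry label 1 under this convention,
# so the cross-entropy loss pushes probability mass onto index 1 for matches.
matched_label = torch.tensor([1])
loss = F.cross_entropy(score, matched_label)
```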
Hello @eltoto1219, thank you, this is the answer I have been looking for! Indeed, the behavior is exactly as you say: index 1 stands for a match and index 0 for a mismatch. It is good to hear which part of the documentation is correct (or, as you say, not ambiguous 😉). For helping everyone and to avoid any confusion, I would suggest adapting the documentation of `cross_relationship_score` so that it states explicitly that the first logit (index 0) corresponds to a mismatch and the second logit (index 1) to a match; that would do the job.
+1 for the suggested documentation change. I was trying to figure this out as well.
Feel free to open a PR to update the documentation, we'll gladly merge it!
Hello @LetiP @eltoto1219 @ecekt, I tried what I believe is the same experiment: predicting match/no-match over the MSCOCO 2017 val set. Specifically, I used all image-caption pairs in the val set (25014 pairs over 5000 images) and sampled captions from random different images to create an equal number of negative examples (leading to a total of 50028 examples). I am getting results that differ from the ones reported above, and I am trying to understand what caused the differences. I used the script provided by @eltoto1219 in #8769 to extract image features, and I am getting the prediction directly from the cross_relationship_score output.
I would appreciate any help in figuring out what is causing these differences. So far I have not tested on VQA/GQA. Thanks!
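For clarity, here is a rough sketch of the negative-sampling setup described above; it is purely illustrative, and `coco` is a hypothetical `{image_id: [captions]}` mapping rather than an actual API:

```python
import random

def build_pairs(coco, seed=0):
    """Pair every caption with its image (label 1) and, for each positive,
    one caption drawn from a different image (label 0)."""
    rng = random.Random(seed)
    image_ids = list(coco.keys())
    pairs = []
    for img_id, captions in coco.items():
        for cap in captions:
            pairs.append((img_id, cap, 1))                      # positive pair
            other = rng.choice([i for i in image_ids if i != img_id])
            pairs.append((img_id, rng.choice(coco[other]), 0))  # negative pair
    return pairs
```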
Hello @LetiP @eltoto1219 @ecekt, a reminder in case any of you have thoughts or suggestions about my question. Thanks!
Hello @aishwaryap, I was able to exactly reproduce the numbers of @phiyodr 🥳 (here is the relevant excerpt):
The script you are referring to for image feature extraction is unknown to me, therefore I did not use it. For reading in the images I closely followed the original LXMERT demo in this Colab Notebook. Can you reproduce my and @phiyodr 's numbers with the data loading code from that Notebook as well? Sorry for the late answer, I had too much going on.
Hi @LetiP, thanks a lot for sharing your script! Unfortunately, I am not able to reproduce those numbers using that notebook. Using the first 1000 val images on MSCOCO with all their paired captions as positive examples, and one randomly sampled caption from a different image as negative examples, I still get different numbers. I did have to modify the script you provided in order to run on a remote server, load MSCOCO images, and sample negatives, but I don't think I changed anything that should result in different numbers. Just in case, here are the Python script I used and the bash script which shows the other steps. Overall, I'm still confused about why I'm unable to reproduce your and @phiyodr's results and would appreciate any suggestions. Thanks a lot!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
❓ Questions & Help
Hello, congrats to all contributors for the awesome work with LXMERT! It is exciting to see multimodal transformers coming to huggingface/transformers. Of course, I immediately tried it out and played with the demo.
LXMERT pre-trained model, trained on what exactly?
Question:
Does the line

```python
lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")
```

load an already pre-trained LXMERT model on the tasks enumerated in the original paper, "(1) masked cross-modality language modeling, (2) masked object prediction via RoI-feature regression, (3) masked object prediction via detected-label classification, (4) cross-modality matching, and (5) image question answering" (Tan & Bansal, 2019)? If the pre-training tasks are not all the ones from the paper, would that line load pre-trained weights at all, and if yes, trained on what?

Thanks in advance! 🤗
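One way to sanity-check which pre-training objectives a checkpoint is configured for is to inspect its config. A small sketch follows; the flag names are those exposed by `LxmertConfig` in transformers and are worth verifying against the installed version:

```python
from transformers import LxmertConfig

config = LxmertConfig.from_pretrained("unc-nlp/lxmert-base-uncased")

# Pre-training objective flags as exposed by LxmertConfig.
print("masked cross-modality LM :", config.task_mask_lm)
print("masked object prediction :", config.task_obj_predict)
print("cross-modality matching  :", config.task_matched)
print("image question answering :", config.task_qa)
```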
A link to the original question on the forum/Stack Overflow: here is the link to the huggingface forum.