
LXMERT pre-training tasks #7266

Closed
LetiP opened this issue Sep 20, 2020 · 23 comments

@LetiP

LetiP commented Sep 20, 2020

❓ Questions & Help

Hello, congrats to all contributors for the awesome work with LXMERT! It is exciting to see multimodal transformers coming to huggingface/transformers. Of course, I immediately tried it out and played with the demo.

LXMERT pre-trained model, trained on what exactly?

Question:
Does the line lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased") load an already pre-trained LXMERT model on the tasks enumerated in the original paper “(1) masked crossmodality language modeling, (2) masked object prediction via RoI-feature regression, (3) masked object prediction via detected-label classification, (4) cross-modality matching, and (5) image question answering.” (Tan & Bansal, 2019)? If the pre-training tasks are not all the ones from the paper, would that line load pre-trained weights at all and if yes, on what?

Thanks in advance! 🤗

A link to the original question on the forum/Stack Overflow: Here is the link to the Hugging Face forum.

@LysandreJik
Member

Tagging the LXMERT implementation author @eltoto1219

@eltoto1219
Contributor

Hi, "unc-nlp/lxmert-base-uncased" was trained with all tasks specified in the paper (as aforementioned). We have benchmarked the pre-trained model to make sure it reaches the same performance on all QA tasks. If you do run into any troubles though, please let me know!

@LetiP
Author

LetiP commented Sep 22, 2020

Hello @eltoto1219, thank you for the answer! I suppose it was a weird question on my part, since I was asking to make sure that I am loading a pre-trained LXMERT model and not some random weights. I ask especially because I look at the output_lxmert['cross_relationship_score'] of COCO images and captions (so not on out-of-distribution images and captions) after loading LXMERT with the aforementioned code lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased"). It seems that on cross-modality matching LXMERT performs with 50% accuracy (random guessing). So I wanted to make sure that I am loading weights pre-trained on (4) cross-modality matching in the first place.

New question: Do you know how it can be that LXMERT randomly guesses on cross-modality matching, even though it was pre-trained to deliver a score (after the softmax, of course) below 0.5 if the caption does not describe the image and above 0.5 if the caption and the image match?
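For reference, this is roughly how I obtain the score; a minimal sketch, with random tensors standing in for the Faster R-CNN features that I actually extract with the demo code:

import torch
from transformers import LxmertTokenizer, LxmertForPreTraining

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")

# Placeholders: in the real experiment these are the RoI-pooled features and
# normalized bounding boxes produced by the FRCNN code of the demo.
features = torch.rand(1, 36, 2048)        # (batch, num_boxes, feature_dim)
normalized_boxes = torch.rand(1, 36, 4)   # (batch, num_boxes, 4)

inputs = tokenizer("A cat sits on a couch.", return_tensors="pt")
output_lxmert = lxmert_base(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    visual_feats=features,
    visual_pos=normalized_boxes,
)

# cross_relationship_score has shape (batch_size, 2); a softmax turns the two
# logits into match/no-match probabilities.
probs = torch.softmax(output_lxmert["cross_relationship_score"], dim=-1)
print(probs)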

@iacercalixto

+1 I am also interested in the question/answer!

@procedure2012

I also ran into this problem. Does anyone have ideas about why this happens? @eltoto1219

@ecekt

ecekt commented Nov 5, 2020

Hello @LetiP,

Were you loading images from URLs or locally from image files? I have noticed a discrepancy in how they are processed in FRCNN and I was getting different visual features. I opened an issue about it here: #8333

Best,
Ece

@LetiP
Author

LetiP commented Nov 12, 2020

@ecekt Well observed, thank you very much!
I did not notice that difference between URLs and local files, because I did not look closely at the features in this respect. I conducted my experiments over many samples with local files. However, I also tested with 10-20 images from URLs and observed a similar random-guessing behavior in the cross_relationship_score there as well.

Now I took a closer look at this, and with your proposed solution from #8333 I see that the features change, but the performance does not. Still random guessing.

@phiyodr
Contributor

phiyodr commented Dec 18, 2020

Hi @LetiP! Super interesting question. I was curious, so I ran a test on 1000 COCO val images with 5013 captions in total.

Using the original implementation (without considering #8333, i.e. with the wrong color ordering for local files), I received 56 % correct classifications for images with correct captions, so the same result you got 👍. Interestingly, this model gets 99.7 % correct for wrong image-caption combinations (an image with a randomly drawn COCO caption). Hence we have 56 % recall, but 99.7 % specificity.

Fixing the bug noted in #8333 (see the code below), recall goes up to 71 %, specificity is at 99.2 %, precision at 98 %, and accuracy is at 85 %. From this result I conclude that "unc-nlp/lxmert-base-uncased" was trained on cross-modality matching. :)

# transformers/examples/lxmert/utils.py
def img_tensorize(im, input_format="RGB"):
    [...]
        assert img is not None, f"could not connect to: {im}"
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # <=== indent this line, so it works for local and url images.
    if input_format == "RGB":
    [...]

@LetiP
Author

LetiP commented Dec 18, 2020

Hello @phiyodr,

looking at your results, I cannot help but keep wondering: how did you decide whether a classification is correct or not?

I am asking because the cross_relationship_score is a tensor with two logit entries for an image-text pair.

How do you decide which logit represents a match and which one a mismatch?

@phiyodr
Contributor

phiyodr commented Dec 18, 2020

Indeed, the documentation is not super clear about whether the first or the second value in cross_relationship_score means is_match.

I used 5013 correct image-caption pairs and 5013 wrong image-caption combinations, then I made a confusion matrix and decided whether the first or second value of the tensor is more plausible for is_match.

Without considering #8333:

  • Using the first entry as is_match yields an accuracy of 22 %.
  • Using the second entry as is_match yields an accuracy of 78 % (Recall=56 %, Specificity=99.7 %, TP=2830, FN=2183, FP=14, TN=5002). You looked at recall, which is indeed close to random guessing.

Considering #8333:

  • Using the first entry as is_match yields an accuracy of 15 %.
  • Using the second entry as is_match yields an accuracy of 85 %.

Hence the first value is likely to be no_match and the second value is likely to be is_match.
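For reference, the bookkeeping behind these numbers is essentially the following; a minimal sketch, assuming labels and preds are 0/1 lists where 1 means "caption matches image" and the prediction uses the second logit of cross_relationship_score as is_match:

def confusion_metrics(labels, preds):
    # labels/preds: 0 = no_match, 1 = is_match
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn)        # correct image-caption pairs recognized
    specificity = tn / (tn + fp)   # wrong combinations correctly rejected
    return tp, fn, fp, tn, accuracy, recall, specificity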

@LetiP
Author

LetiP commented Dec 18, 2020

Hello @phiyodr, thank you for your quick response!

I think that this issue evolved into the question:

How do you decide which logit represents a match and which one a mismatch?

I understand that you pick the logit delivering better results. But I followed the documentation, which (as I understand it) assigns the first logit to is_match (True).

cross_relationship_score – (torch.FloatTensor of shape (batch_size, 2)): Prediction scores of the textual matching objective (classification) head (scores of True/False continuation before SoftMax).

@eltoto1219 Do you perhaps know which logit in the output_lxmert['cross_relationship_score'] represents a match and which one a mismatch? How to interpret the documentation?

@phiyodr
Contributor

phiyodr commented Dec 18, 2020

Yeah, the documentation is actually the other way around.
Actually, looking at specificity makes more sense than accuracy: specificity of 99.7 % for the second value vs. 0.3 % for the first.

@LetiP
Author

LetiP commented Dec 18, 2020

Still waiting for confirmation about what is happening here: how the model was trained and which logit was intended to predict the match. I do not see any reason why one could simply invert the logits out of wishful thinking.

@yezhengli-Mr9

Is there any entry-level example for LXMERT? The following example from the LXMERT documentation

from transformers import LxmertTokenizer, LxmertModel
import torch

tokenizer = LxmertTokenizer.from_pretrained('unc-nlp/lxmert-base-uncased')
model = LxmertModel.from_pretrained('unc-nlp/lxmert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

fails with

File "/Users/yezli/miniconda3/lib/python3.8/site-packages/transformers/models/lxmert/modeling_lxmert.py", line 933, in forward
    assert visual_feats is not None, "`visual_feats` cannot be `None`"
AssertionError: `visual_feats` cannot be `None` 

@eltoto1219
Contributor

eltoto1219 commented Dec 27, 2020

Hi @LetiP! I am super sorry for all of the confusion regarding the correct/incorrect logit for the cross_relationship_score. Per the documentation you pointed out, it is indeed a bit ambiguous where the correct position of the "is_matched" index is. However, while pre-training, one must provide the sentence/image matching labels if cross-modality matching is included in the loss regime.
It is listed here that:

matched_label (tf.Tensor of shape (batch_size,), optional) –

Labels for computing the whether or not the text input matches the image (classification) loss. Input should be a sequence pair (see input_ids docstring) Indices should be in [0, 1]:

0 indicates that the sentence does not match the image,

1 indicates that the sentence does match the image.

The PyTorch loss that was used can be found here. This should also indicate the assignments of the indices.

Thus, if the sentence does match the image, the model will have maximized the likelihood of the logit at index 1. In the case of a mismatch, the zero'th index will have been maximized to be 1 (aka True). If for some reason this is not occurring properly, please let me know!

Hey @yezhengli-Mr9! That is also a very misleading example for LXMERT, as one must provide the visual positions (normalized bounding boxes) and the FRCNN RoI-pooled visual features in order for the model to run. For all optional/non-optional inputs, please see the docs! I should be able to fix that sometime soon. For now, if needed, you can refer to the LXMERT pytests.
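Roughly, something like the following runs; a minimal sketch with random tensors standing in for the real FRCNN features, only meant to show the required inputs (if I recall the output fields correctly, there is no single last_hidden_state for this two-stream model):

import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Dummy visual inputs; in practice use the RoI-pooled features and normalized
# boxes produced by the Faster R-CNN code in examples/lxmert.
visual_feats = torch.rand(1, 36, 2048)  # (batch, num_boxes, feature_dim)
visual_pos = torch.rand(1, 36, 4)       # (batch, num_boxes, 4), normalized boxes

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)

# Separate language and vision hidden states plus a pooled output.
language_output = outputs.language_output
vision_output = outputs.vision_output
pooled_output = outputs.pooled_output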


I will be making a pull request to remove the image tensorization from URLs, as it seems to be outside the scope of the demo and removing it will eliminate one source of error. I will also formalize the feature-extraction code, as using a batch size larger than one entails image padding, which consequently lowers the quality of the image features. Perhaps I can include that in an example too.

@LetiP
Author

LetiP commented Dec 27, 2020

Hello @eltoto1219, thank you, this is the answer I have been looking for! Indeed, the behavior is exactly as you say:

In the case of a mismatch, the zero'th index will have been maximized to be 1 (aka True)

It is good to hear which part of the documentation is correct (or, as you say, not ambiguous 😉). To help everyone and avoid any confusion, I would suggest adapting the documentation of cross_relationship_score accordingly.
Just replacing:

... (scores of True/False continuation before SoftMax)

with

(scores of False (index 0)/True (index 1) continuation before SoftMax)

would do the job.

@aishwaryap

+1 for the suggested documentation change. I was trying to figure this out as well.

@LysandreJik
Member

Feel free to open a PR to update the documentation, we'll gladly merge it!

@aishwaryap

aishwaryap commented Feb 19, 2021

Hello @LetiP @eltoto1219 @ecekt

I tried what I believe is the same experiment: predict match/no-match over the MSCOCO 2017 val set. Specifically, I used all image-caption pairs in the val set (25014 pairs over 5000 images) and sampled captions from random different images to create an equal number of negative examples (leading to a total of 50028 examples). I am getting the following results using this:

  • Number of examples = 50028
  • Number of true positives (TP) = 17485
  • Number of false positives (FP) = 17485
  • Number of true negatives (TN) = 7529
  • Number of false negatives (FN) = 7529
  • Accuracy = 0.5
  • Precision = 0.5
  • Recall = 0.6990085552090829
  • F1 = 0.5829887970125367
This is a higher recall than @LetiP and @ecekt got, but a much lower specificity and precision.

I am trying to understand what caused the differences. I used this script provided by @eltoto1219 in #8769 to extract image features, and I am getting the prediction by performing
pred = torch.argmax(softmax(output["cross_relationship_score"])).item()
and treating 0 as no-match and 1 as match (the prediction step is sketched in full after the list below).
I did confirm that there is no color format issue in the feature extraction. Also, as in the demo, I am

  • Loading the model and tokenizer from unc-nlp/lxmert-base-uncased.
  • Not performing any preprocessing on the caption (not even lower casing) as this did not seem required based on the demo and tokenizer source.
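For reference, the prediction step is essentially the following; a minimal sketch of my own code, not the demo's:

import torch

def predict_is_match(output):
    logits = output["cross_relationship_score"]     # shape (1, 2)
    probs = torch.softmax(logits, dim=-1)           # softmax does not change the argmax
    return int(torch.argmax(probs, dim=-1).item())  # 0 = no-match, 1 = match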

I would appreciate any help in figuring out what is causing these differences. So far I have not tested on VQA/GQA.

Thanks!

@aishwaryap

Hello @LetiP @eltoto1219 @ecekt

A reminder in case any of you have thoughts/suggestions about my question.

Thanks!

@LetiP
Author

LetiP commented Mar 27, 2021

Hello @aishwaryap,

I could exactly reproduce @phiyodr's numbers 🥳 (here is the relevant excerpt):

Without considering #8333:

  • Using the second entry as is_match yields an accuracy of 78 % (Recall=56 %, Specificity=99.7 %, TP=2830, FN=2183, FP=14, TN=5002).

Considering #8333:

  • Using the second entry as is_match yields an accuracy of 85 %.

The script you are referring to for image feature extraction is unknown to me, so I did not use it. For reading in the images, I closely followed the original LXMERT demo in this Colab notebook.

Can you reproduce my and @phiyodr's numbers with the data loading code from that notebook as well?

Sorry for the late answer, I had too much going on.

@aishwaryap

Hi @LetiP,

Thanks a lot for sharing your script!

Unfortunately, I am not able to reproduce those numbers using that notebook. Using the first 1000 MSCOCO val images with all their paired captions as positive examples, and one randomly sampled caption from a different image as a negative example, I get:

  • Total number of examples tested = 10004
  • TP = 3546
  • FP = 3546
  • TN = 1456
  • FN = 1456
  • Accuracy = 0.5
  • Precision = 0.5
  • Recall = 0.7089164334266294
  • Specificity = 0.29108356657337064
This is a significantly higher recall than what the two of you got but a much lower specificity. Note that this did require me to change the transformers source to prevent color format conversion for local images (#8333). Without that change, recall was 55.6% and specificity was 44.37%.

I did have to modify the script you provided in order to run on a remote server, load MSCOCO images, and sample negatives, but I don't think I changed anything that should result in different numbers. Just in case, here are the Python script I used and the bash script which shows the other steps.

Overall, I'm still confused about why I'm unable to reproduce your and @phiyodr's results and would appreciate any suggestions.
Tagging @eltoto1219 as well in case he can provide further insight.

Thanks a lot!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
