
[Model] How to evaluate Idefics Model's ability with in context examples? #25803

Closed
Luodian opened this issue Aug 28, 2023 · 5 comments

Luodian commented Aug 28, 2023

Hi, the recent release of the Idefics-9B/80B-Instruct models is superbly promising!

We would like to evaluate them on a customized benchmark with in-context examples. May I ask how I should arrange the prompt template, especially for the instruct version?

We previously had some problems when evaluating the model on single images (the model would ramble and wouldn't stop), but we managed to resolve them.

For single images, we use the following template to evaluate the instruct version of the model:

User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

Would this be correct (i.e., matching your training template), or do you have a better recommendation? Sorry, we have a customized pipeline, so it's not easy to adopt your IdeficsProcessor. 😭

We also migrated the code for building image_attention_mask as follows:

# supporting idefics processing
import torch
# image_attention_mask_for_packed_input_ids and incremental_to_binary_attention_mask
# are ported from transformers' Idefics processing code.

def get_formatted_prompt(prompt: str = "", in_context_prompts: list = []) -> str:
    # For reference, the interleaved list format that IdeficsProcessor accepts:
    #     [
    #         "User:",
    #         "https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg",
    #         "Describe this image.\nAssistant: An image of two kittens in grass.\n",
    #         "User:",
    #         "http://images.cocodataset.org/train2017/000000190081.jpg",
    #         "Describe this image.\nAssistant:",
    #     ]
    # NOTE: in_context_prompts is currently unused; only the zero-shot case is handled here.
    return f"User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:"

def get_image_attention_mask(output_input_ids, max_num_images, tokenizer, include_image=True):
    if include_image:
        # For each text token, mark which of the images it should attend to.
        image_attention_mask, _ = image_attention_mask_for_packed_input_ids(output_input_ids, tokenizer)
        image_attention_mask = incremental_to_binary_attention_mask(
            image_attention_mask, num_classes=max_num_images
        )
    else:
        # In pure-language mode, set the image mask to all zeros.
        image_attention_mask = torch.zeros(
            output_input_ids.shape[0], output_input_ids.shape[1], 1, dtype=torch.bool
        )
    return image_attention_mask

lang_x = self.tokenizer(
    [
        get_formatted_prompt(question, []),
    ],
    return_tensors="pt",
)
image_attention_mask = get_image_attention_mask(lang_x['input_ids'], 1, self.tokenizer)
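
For completeness, here is a minimal sketch (not from the thread) of wiring these helpers into generation, assuming the HuggingFaceM4/idefics-9b-instruct checkpoint; the image path and question are placeholders:

import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
image_processor = CLIPImageProcessor()

raw_image = Image.open("burger.jpg")     # placeholder image
question = "What is in this image?"      # placeholder question

lang_x = tokenizer([get_formatted_prompt(question, [])], return_tensors="pt")
# (1, C, H, W) -> (1, num_images=1, C, H, W): Idefics expects a 5D pixel_values tensor
vision_x = image_processor.preprocess([raw_image], return_tensors="pt")["pixel_values"].unsqueeze(0)
image_attention_mask = get_image_attention_mask(lang_x["input_ids"], 1, tokenizer)

generated = model.generate(
    input_ids=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    pixel_values=vision_x,
    image_attention_mask=image_attention_mask,
    max_new_tokens=64,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))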

I have read all the related blogs and docs but am still confused about the usage of <end_of_utterance>. Is it used to separate the in-context examples from the query example?

My guess is:

User:<fake_token_around_image><image><fake_token_around_image>{in_context_prompt} Assistant: {in_context_answer} <end_of_utterance> User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

Also, I'm very curious why the model generates <end_of_utterance> at the end of a sentence instead of the usual <|endofchunk|>?

VictorSanh (Member) commented

Hi @Luodian,

For single images, we use the following template to evaluate the instruct version of the model:
User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

To perfectly match the format used during the training of the instructed versions, you should slightly modify the template you are showing:
User:<fake_token_around_image><image><fake_token_around_image>{prompt}<end_of_utterance>\nAssistant: {assistant_answer}<end_of_utterance>

Beyond the additional \n and <end_of_utterance>, everything looks correct!
The rest of the code snippet looks about right too. How are you getting the pixel values?

I have read all the related blogs and docs but am still confused about the usage of <end_of_utterance>. Is it used to separate the in-context examples from the query example?

Also, I'm very curious why the model generates <end_of_utterance> at the end of a sentence instead of the usual <|endofchunk|>?

We use <end_of_utterance> in the dialogue setup to have an easier exit condition: it marks the end of both a user turn and an assistant turn. We found that not having this token makes it harder to stop the generation in a dialogue setup.
The end of a dialogue is marked by an </s> during training.
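
For reference, the officially supported path goes through IdeficsProcessor, which builds the image tokens and attention masks itself. A sketch following the pattern from the Idefics model card (the prompt contents are illustrative):

import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(checkpoint)

# The processor takes interleaved text and images (PIL images or URLs) and inserts
# <fake_token_around_image><image><fake_token_around_image> for you.
prompts = [
    [
        "User:",
        "http://images.cocodataset.org/train2017/000000190081.jpg",
        "Describe this image.<end_of_utterance>",
        "\nAssistant:",
    ],
]
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt")

# Stop generating as soon as the model emits <end_of_utterance>.
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(
    ["<image>", "<fake_token_around_image>"], add_special_tokens=False
).input_ids

generated_ids = model.generate(
    **inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_new_tokens=64
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])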

Luodian (Author) commented Aug 28, 2023

Thanks! Then for in-context examples, should it be like this?

User:<fake_token_around_image><image><fake_token_around_image>{in_context_prompt}<end_of_utterance>\n
Assistant: {in_context_answer}<end_of_utterance>\n
User:<fake_token_around_image><image><fake_token_around_image>{prompt}<end_of_utterance>\n
Assistant:

VictorSanh (Member) commented

No need for double line breaks, but otherwise it is correct; that is the most straightforward way to do in-context evaluation.
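
Putting the confirmed format together, a small prompt-building helper might look like this (a sketch; the shot structure and names are illustrative, and each shot is assumed to carry its own image):

def build_icl_prompt(shots, query_prompt):
    # `shots` is a list of (prompt, answer) pairs; one <image> placeholder is
    # emitted per turn, with a single "\n" between turns as confirmed above.
    parts = []
    for shot_prompt, shot_answer in shots:
        parts.append(
            f"User:<fake_token_around_image><image><fake_token_around_image>"
            f"{shot_prompt}<end_of_utterance>\n"
            f"Assistant: {shot_answer}<end_of_utterance>\n"
        )
    parts.append(
        f"User:<fake_token_around_image><image><fake_token_around_image>"
        f"{query_prompt}<end_of_utterance>\nAssistant:"
    )
    return "".join(parts)

The images for the shots and the final query would then be stacked, in the same order, along the num_images dimension of pixel_values.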

Luodian (Author) commented Aug 29, 2023

Btw, we use self.image_processor = transformers.CLIPImageProcessor() to get the pixel values.

vision_x = self.image_processor.preprocess([raw_image], return_tensors="pt")["pixel_values"].unsqueeze(0)

Will it differ from IdeficsImageProcessor?
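
One way to check empirically (a sketch; the checkpoint name is as above) is to compare the preprocessing settings of the checkpoint's own image processor against a default-constructed CLIPImageProcessor, since resize and normalization must match what the model saw during training:

from transformers import AutoProcessor, CLIPImageProcessor

# Image processor actually shipped with this checkpoint.
idefics_proc = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct").image_processor
clip_proc = CLIPImageProcessor()  # library defaults, as in the snippet above

# IdeficsImageProcessor stores a single image_size; CLIPImageProcessor a size dict.
# (None here means the processor falls back to its built-in defaults.)
print(type(idefics_proc).__name__, idefics_proc.image_size, idefics_proc.image_mean, idefics_proc.image_std)
print(type(clip_proc).__name__, clip_proc.size, clip_proc.image_mean, clip_proc.image_std)

If the two disagree on size or normalization, the pixel values (and downstream generations) will differ.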

Here's the model's output on the burger example:

[screenshot: the model's generated answer for the burger image]

github-actions commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
