
[Model] How to evaluate Idefics Model's ability with in context examples? #25803

Closed
Luodian opened this issue Aug 28, 2023 · 5 comments

Luodian commented Aug 28, 2023

Hi, the recent release of the Idefics-9B/80B-Instruct models is superbly promising!

We would like to evaluate them on a customized benchmark with in-context examples. May I ask how I should arrange the prompt template, especially for the instruct version?

We previously had some problems when evaluating the model on single images (the model would ramble and wouldn't stop), but we managed to resolve them.

For single images, we use the following template to evaluate the instruct version of the model:

User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

Would this be correct (i.e., matching your training template), or do you have a better recommendation? Sorry, we have a customized pipeline, so it's not easy to adopt your IdeficsProcessor. 😭

We also migrated the code for building image_attention_mask as follows:

# supporting idefics processing
import torch
# image_attention_mask_for_packed_input_ids and incremental_to_binary_attention_mask
# are ported from transformers' Idefics processing code.

def get_formatted_prompt(prompt: str = "", in_context_prompts: list = []) -> str:
    # For reference, the interleaved list format that IdeficsProcessor accepts:
    #     [
    #         "User:",
    #         "https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg",
    #         "Describe this image.\nAssistant: An image of two kittens in grass.\n",
    #         "User:",
    #         "http://images.cocodataset.org/train2017/000000190081.jpg",
    #         "Describe this image.\nAssistant:",
    #     ]
    # NOTE: in_context_prompts is currently unused; only the zero-shot case is handled here.
    return f"User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:"

def get_image_attention_mask(output_input_ids, max_num_images, tokenizer, include_image=True):
    if include_image:
        # For each text token, mark which of the images it should attend to.
        image_attention_mask, _ = image_attention_mask_for_packed_input_ids(output_input_ids, tokenizer)
        image_attention_mask = incremental_to_binary_attention_mask(
            image_attention_mask, num_classes=max_num_images
        )
    else:
        # In pure-language mode, set the image mask to all zeros.
        image_attention_mask = torch.zeros(
            output_input_ids.shape[0], output_input_ids.shape[1], 1, dtype=torch.bool
        )
    return image_attention_mask

lang_x = self.tokenizer(
    [
        get_formatted_prompt(question, []),
    ],
    return_tensors="pt",
)
image_attention_mask = get_image_attention_mask(lang_x['input_ids'], 1, self.tokenizer)
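
For completeness, here is a minimal sketch (not from the thread) of wiring these helpers into generation, assuming the HuggingFaceM4/idefics-9b-instruct checkpoint; the image path and question are placeholders:

import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
image_processor = CLIPImageProcessor()

raw_image = Image.open("burger.jpg")     # placeholder image
question = "What is in this image?"      # placeholder question

lang_x = tokenizer([get_formatted_prompt(question, [])], return_tensors="pt")
# (1, C, H, W) -> (1, num_images=1, C, H, W): Idefics expects a 5D pixel_values tensor
vision_x = image_processor.preprocess([raw_image], return_tensors="pt")["pixel_values"].unsqueeze(0)
image_attention_mask = get_image_attention_mask(lang_x["input_ids"], 1, tokenizer)

generated = model.generate(
    input_ids=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    pixel_values=vision_x,
    image_attention_mask=image_attention_mask,
    max_new_tokens=64,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))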

I have read all the related blogs and docs but am still confused about the usage of <end_of_utterance>. Is it used to separate the in-context examples from the query example?

My guess is:

User:<fake_token_around_image><image><fake_token_around_image>{in_context_prompt} Assistant: {in_context_answer} <end_of_utterance> User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

Also, I'm very curious why the model generates <end_of_utterance> at the end of a sentence instead of the usual <|endofchunk|>?

VictorSanh (Member) commented

Hi @Luodian,

For single images, we use the following template to evaluate the instruct version of the model:
User:<fake_token_around_image><image><fake_token_around_image>{prompt} Assistant:

To perfectly match the format used during the training of the instructed versions, you should slightly modify the template you are showing:
User:<fake_token_around_image><image><fake_token_around_image>{prompt}<end_of_utterance>\nAssistant: {assistant_answer}<end_of_utterance>

Beyond the additional \n and <end_of_utterance>, everything looks correct!
The rest of the code snippet looks about right too. How are you getting the pixel values?

I have read all the related blogs and docs but am still confused about the usage of <end_of_utterance>. Is it used to separate the in-context examples from the query example?

Also, I'm very curious why the model generates <end_of_utterance> at the end of a sentence instead of the usual <|endofchunk|>?

We use <end_of_utterance> in the dialogue setup to have an easier exit condition: it marks the end of both a user turn and an assistant turn. We found that not having this token makes it harder to stop the generation in a dialogue setup.
The end of a dialogue is marked by an </s> during training.
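
For reference, the officially supported path goes through IdeficsProcessor, which builds the image tokens and attention masks itself. A sketch following the pattern from the Idefics model card (the prompt contents are illustrative):

import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(checkpoint)

# The processor takes interleaved text and images (PIL images or URLs) and inserts
# <fake_token_around_image><image><fake_token_around_image> for you.
prompts = [
    [
        "User:",
        "http://images.cocodataset.org/train2017/000000190081.jpg",
        "Describe this image.<end_of_utterance>",
        "\nAssistant:",
    ],
]
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt")

# Stop generating as soon as the model emits <end_of_utterance>.
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(
    ["<image>", "<fake_token_around_image>"], add_special_tokens=False
).input_ids

generated_ids = model.generate(
    **inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_new_tokens=64
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])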

Luodian (Author) commented Aug 28, 2023

Thanks! Then for in-context examples, should it be like this?

User:<fake_token_around_image><image><fake_token_around_image>{in_context_prompt}<end_of_utterance>\n
Assistant: {in_context_answer}<end_of_utterance>\n
User:<fake_token_around_image><image><fake_token_around_image>{prompt}<end_of_utterance>\n
Assistant:

VictorSanh (Member) commented

No need for double line breaks, but otherwise it is correct; that is the most straightforward way to do in-context evaluation.
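
Putting the confirmed format together, a small prompt-building helper might look like this (a sketch; the shot structure and names are illustrative, and each shot is assumed to carry its own image):

def build_icl_prompt(shots, query_prompt):
    # `shots` is a list of (prompt, answer) pairs; one <image> placeholder is
    # emitted per turn, with a single "\n" between turns as confirmed above.
    parts = []
    for shot_prompt, shot_answer in shots:
        parts.append(
            f"User:<fake_token_around_image><image><fake_token_around_image>"
            f"{shot_prompt}<end_of_utterance>\n"
            f"Assistant: {shot_answer}<end_of_utterance>\n"
        )
    parts.append(
        f"User:<fake_token_around_image><image><fake_token_around_image>"
        f"{query_prompt}<end_of_utterance>\nAssistant:"
    )
    return "".join(parts)

The images for the shots and the final query would then be stacked, in the same order, along the num_images dimension of pixel_values.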

Luodian (Author) commented Aug 29, 2023

Btw, we use self.image_processor = transformers.CLIPImageProcessor() to get the pixel values.

vision_x = self.image_processor.preprocess([raw_image], return_tensors="pt")["pixel_values"].unsqueeze(0)

Will it differ from IdeficsImageProcessor?
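
One way to check empirically (a sketch; the checkpoint name is as above) is to compare the preprocessing settings of the checkpoint's own image processor against a default-constructed CLIPImageProcessor, since resize and normalization must match what the model saw during training:

from transformers import AutoProcessor, CLIPImageProcessor

# Image processor actually shipped with this checkpoint.
idefics_proc = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct").image_processor
clip_proc = CLIPImageProcessor()  # library defaults, as in the snippet above

# IdeficsImageProcessor stores a single image_size; CLIPImageProcessor a size dict.
# (None here means the processor falls back to its built-in defaults.)
print(type(idefics_proc).__name__, idefics_proc.image_size, idefics_proc.image_mean, idefics_proc.image_std)
print(type(clip_proc).__name__, clip_proc.size, clip_proc.image_mean, clip_proc.image_std)

If the two disagree on size or normalization, the pixel values (and downstream generations) will differ.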

Here's the model's output on the burger example:

[screenshot: the model's generated answer for the burger image]

github-actions commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
