Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flax/run_hybrid_clip] Fix duplicating images when captions_per_image exceeds the number of captions, enable truncation #12752

Merged

Conversation

edugp
Copy link
Contributor

@edugp edugp commented Jul 16, 2021

What does this PR do?

Do not duplicate images when the number of captions in an example is smaller than captions_per_image.
Truncate tokens when number of tokens exceeds max_length.

Currently, if an example contains a number of captions smaller than captions_per_image, the dataset will get scrambled, because the image paths are going to be duplicated but the number of captions is not.
For example, if an example contains a single caption and captions_per_image is equal to 2, one caption will be appended to the captions list, but two image path will be appended to the image_paths list. This results in mismatching file_path-caption pairs.
This pull request fixes this issue.

Currently, if a tokenized sentence contains a number of tokens > max_length, the dimensions of inputs within a batch won't match.
This pull request fixes this issue by truncating the tokens to max_length.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X ] Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patil-suraj

@edugp edugp force-pushed the fix-hybrid-clip-dataset-and-tokenizer branch from 6747944 to fec501e Compare July 16, 2021 05:29
…smaller than captions_per_image. Trunacte tokens when number of tokens exceeds max_length.
@edugp edugp force-pushed the fix-hybrid-clip-dataset-and-tokenizer branch from fec501e to d842968 Compare July 16, 2021 05:46
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Aug 23, 2021
@patil-suraj patil-suraj reopened this Sep 2, 2021
Copy link
Contributor

@patil-suraj patil-suraj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to only come back to this now.

Great catch @edugp! Thanks a lot for fixing this.

@patil-suraj patil-suraj changed the title Avoid duplicating images when captions_per_image exceeds the number of captions in an example and truncate tokens when they exceed max_length. [Flax/run_hybrid_clip] Fix duplicating images when captions_per_image exceeds the number of captions, enable truncation Sep 2, 2021
@patil-suraj patil-suraj merged commit 0a22335 into huggingface:master Sep 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants