Add YttmTokenizer, ImageTextDataset from @rom1504, Single-GPU trainin… #3

Closed
wants to merge 14 commits into from

@afiaka87 afiaka87 commented Dec 7, 2021

…g script

from torch.utils.data import Dataset


class ImageTextDataset(Dataset):

@rom1504 rom1504 Dec 7, 2021

I think there might be some value in making an independent package of such vision/text dataset readers (this one and webdataset at least), and depending on it here, in dalle pytorch, in clip retrieval, and probably in a bunch of other places.
What do you think?

Author

Agreed. It's a bit strange that PyTorch doesn't have something for this by now. Are you aware of anything?

Besides the discussions at pytorch/pytorch#38419, no, I don't know of anything.
I think PyTorch has been taking the approach of letting users build their own things on the data side.

Although, as a reminder, image+text datasets only started being useful in 2021, so this isn't that old yet :)
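
For readers following this thread, the kind of reader being discussed can be sketched roughly as below. This is only an illustration, not the `ImageTextDataset` from this PR: the class name `PairedImageTextFolder`, the extension set, and the choice to return raw image bytes are all assumptions made to keep the example small. It pairs each image with a caption file of the same stem and returns `(caption, image)` in the same order the training loop unpacks `(text, images)`.

```python
from pathlib import Path

from torch.utils.data import Dataset

# Assumption: which suffixes count as images is not taken from the PR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


class PairedImageTextFolder(Dataset):  # hypothetical name, for illustration only
    """Pairs `name.<image ext>` with `name.txt` found under one folder."""

    def __init__(self, folder, transform=None):
        root = Path(folder)
        captions = {p.stem: p for p in root.rglob("*.txt")}
        images = {
            p.stem: p for p in root.rglob("*") if p.suffix.lower() in IMAGE_EXTENSIONS
        }
        # Keep only stems that have both an image and a caption file.
        self.keys = sorted(captions.keys() & images.keys())
        self.captions = captions
        self.images = images
        self.transform = transform

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        caption = self.captions[key].read_text(encoding="utf-8").strip()
        # Raw bytes are returned here; a real reader would decode them,
        # for example inside `transform`.
        image = self.images[key].read_bytes()
        if self.transform is not None:
            image = self.transform(image)
        return caption, image
```

A real reader would also tokenize the caption and decode and resize the image; those steps are left out here because they depend on the tokenizer and image library in use.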

for batch_idx, (text, images) in current_epoch_pbar:
    with autocast(enabled=args.amp):
        text, images = map(lambda t: t.cuda(), (text, images))
        mask = torch.ones_like(text).bool()

@MicPie MicPie Dec 10, 2021

A recent update adapted the text loss to incorporate a text mask:

text_to_image = masked_mean(text_to_image, text_to_image_mask, dim = -1)

To make use of this, the dataset could return the accompanying text masks so that the masked mean ignores padded positions, instead of the training script building an all-ones mask.
Really nice work! :-)
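
A minimal sketch of the suggestion above, under stated assumptions: the tokenizer pads with id 0 (`PAD_ID` is hypothetical), and the `masked_mean` here is a stand-in written for this example, not necessarily the repository's implementation. It shows how a mask derived from the padded token ids changes the result compared with averaging over every position.

```python
import torch


def masked_mean(t: torch.Tensor, mask: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Mean of `t` along `dim`, counting only positions where `mask` is True."""
    t = t.masked_fill(~mask, 0.0)
    # clamp avoids division by zero for rows that are entirely padding
    return t.sum(dim=dim) / mask.sum(dim=dim).clamp(min=1)


PAD_ID = 0  # assumption: the tokenizer pads with id 0

# Two captions padded to length 4; the second ends in two padding tokens.
text = torch.tensor([[5, 9, 3, 7], [4, 8, PAD_ID, PAD_ID]])
text_mask = text != PAD_ID  # what the dataset could return alongside `text`

# Stand-in per-token scores; the padded positions hold junk values.
scores = torch.tensor([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 100.0, 100.0]])

pooled = masked_mean(scores, text_mask, dim=-1)  # padded positions ignored
naive = scores.mean(dim=-1)                      # padded positions included
```

With `torch.ones_like(text).bool()` as the mask, `masked_mean` reduces to the plain mean, which is why returning a real mask from the dataset matters.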

@afiaka87 afiaka87 marked this pull request as ready for review December 16, 2021 19:08

afiaka87 commented Dec 16, 2021

@lucidrains Let me know if there are any glaring mistakes, but this should provide functionality similar to what we had in dalle-pytorch. The main things missing are webdataset support and multi-GPU training, but I figured folks may want to start using this, and I don't know how long it will take me to implement those.

Romain made a decent point about how everyone seems to just rewrite or copy-paste the text-image dataloader, but unfortunately I can't commit to maintaining a pip package for that either.

@afiaka87
Author

@MicPie Thanks for the DDP code. I've rebased your branch onto this one so we can hopefully get that upstream.


afiaka87 commented Jan 1, 2022

Apologies, have not had the time to get this branch working. Closing for now.

@afiaka87 afiaka87 closed this Jan 1, 2022