Add YttmTokenizer, ImageTextDataset from @rom1504, Single-GPU trainin… #3
from torch.utils.data import Dataset


class ImageTextDataset(Dataset):
I think there might be some value to making an independent package with such vision/text dataset readers (this one and webdataset at least), and depending on it here and in dalle pytorch (and in clip retrieval and probably a bunch of other places)
what do you think?
agreed. at this point it's a bit strange that pytorch doesn't have something for this by now - are you aware of anything?
besides the discussion at pytorch/pytorch#38419, no, I don't know of anything
I think pytorch has been taking the approach of letting the user build their own things on the data side
although as a reminder, image+text datasets only started being useful in 2021, so it's not that old yet :)
for batch_idx, (text, images) in current_epoch_pbar:
    with autocast(enabled=args.amp):
        text, images = map(lambda t: t.cuda(), (text, images))
        mask = torch.ones_like(text).bool()
A recent update adapted the text loss to incorporate a text mask:
Line 314 in ac62779
text_to_image = masked_mean(text_to_image, text_to_image_mask, dim = -1)
To make use of this, the dataset could also return the accompanying text masks, so that the masked mean averages only over non-padding tokens instead of the all-ones mask built in the training loop.
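A rough sketch of what that could look like, not the code from this PR. It assumes the tokenizer pads with id 0 and that the captions are already tokenized and padded; `MaskedTextDataset` and `PAD_ID` are hypothetical names, and the `masked_mean` here is a stand-in written for illustration, so the repository's own helper may differ in detail.

```python
import torch
from torch.utils.data import Dataset, DataLoader

PAD_ID = 0  # assumption: the tokenizer pads sequences with id 0


class MaskedTextDataset(Dataset):
    """Hypothetical dataset returning (tokens, mask, image) triples."""

    def __init__(self, token_ids, images, pad_id=PAD_ID):
        self.token_ids = token_ids  # LongTensor (n, seq_len), already padded
        self.images = images        # FloatTensor (n, c, h, w)
        self.pad_id = pad_id

    def __len__(self):
        return self.token_ids.shape[0]

    def __getitem__(self, idx):
        tokens = self.token_ids[idx]
        mask = tokens != self.pad_id  # True for real tokens, False for padding
        return tokens, mask, self.images[idx]


def masked_mean(t, mask, dim=-1, eps=1e-6):
    """Illustrative stand-in: mean of t over dim, ignoring masked-out positions."""
    t = t.masked_fill(~mask, 0.0)
    numer = t.sum(dim=dim)
    denom = mask.sum(dim=dim).float().clamp(min=eps)
    return numer / denom


# Toy data: the first caption has two real tokens and two pads.
token_ids = torch.tensor([[5, 7, 0, 0], [3, 4, 9, 2]])
images = torch.zeros(2, 3, 8, 8)
loader = DataLoader(MaskedTextDataset(token_ids, images), batch_size=2)
text, mask, imgs = next(iter(loader))

# Per-token scores; the large values sit on padding and should be ignored.
scores = torch.tensor([[1.0, 3.0, 100.0, 100.0], [1.0, 2.0, 3.0, 4.0]])
pooled = masked_mean(scores, mask)
```

With something like this, the training loop would unpack `(text, mask, images)` from the loader and pass that mask on, rather than creating `torch.ones_like(text).bool()`.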
Really nice work! :-)
@lucidrains Let me know if there are any glaring mistakes, but this should provide similar functionality to what we had in dalle-pytorch. The main things missing are webdataset support and multi-GPU training, but I figured folks may want to start using this, and I don't know how long it will take me to implement those. Romain made a decent point about how everyone seems to just rewrite or copy-paste the text-image dataloader, but unfortunately I can't commit to maintaining a pip package for that either.
@MicPie Thanks for the DDP code. I've rebased your branch onto this one so we can hopefully get that upstream.
Apologies, have not had the time to get this branch working. Closing for now.
…g script