How to sample or generate a new image? #17
Comments
I have the same issue. I want to generate images from text. How do I pass a string or sentence as input and get images as output, as mentioned here? Please let me know. Thanks! |
I am not sure whether this is possible using this demo, since it contains only the encoder and decoder, and we know nothing about the language embedding part (apart from the vocab size). I agree it would be cool to have it, but I doubt it will be released. |
Here -- use THIS! This is how to do it |
Thank you @fractaldna22 for sharing! I have looked at it. Is there a way I can train it on my dataset? I request you to please let me know if there is a way to train on my dataset along with the text/captions. |
@amish-logicwind I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch |
@TPreece101 It's not my notebook; it belongs to @advadnoun, who also created several variations. But I believe that's sort of what it does. CLIP isn't evaluating the quality, but rather how closely the generated image matches the tokenized text input. The DALL-E encoder and decoder convert the image and text into tokens and map them onto the latent space; then CLIP, using a Vision Transformer (ViT-B/32), evaluates whether the image matches the categories denoted in the text input. It returns a loss to DALL-E, then DALL-E unmaps the pixels again, adjusts them some more, maps them, sends them to the ViT, and so on, until it converges or collapses, depending on various factors.

IMO it's the CLIP ViT-B/32 model that's the real gem here. It's CLIP that is so good at matching images with text and going multi-modal with it. It's only limited by the size of the ViT and its training dataset. There are multiple models you can use for CLIP: the regular ViT-B-32.pt, RN50, RN101, RN50x4, and there are several new ViTs out, including ViT-L-32, which is 1.4 GB vs. the default B-32, which is only 300 MB. There are also hybrid models. I don't know how to set up the new ViT models so that they load in CLIP. The CLIP model settings specify that if a model is not in the list or downloaded via a URL, it will refuse to load it. And the ViT has to contain certain init words like "apply" and "version". I have no idea how to set that, even though the ViTs are pre-trained on a large dataset.

"I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch" @amish-logicwind and @TPreece101 Eh. I haven't seen any images made with that. And that notebook focuses on training the VAE, which is just an autoencoder and not a vision transformer, so training it on a dataset does nothing. Try these: https://www.kaggle.com/abhinand05/vision-transformer-vit-tutorial-baseline/notebook , https://github.com/rwightman/pytorch-image-models/ Just make sure it's one of the compatible models: ViT-B-32, RN50, RN101, RN50x4.
If anyone can figure out how to load a different pretrained model into CLIP with this notebook, I would love to know how. If you're going to train something, train the Vision Transformer model using either JAX, CLIP, or timm recipes. The only thing the ViT-Base-32 model really struggles with, at least in this implementation, is focusing its vision on a central point; it sees kind of cross-eyed. Especially when it's trying to make a human face, it tends to overlap two faces slightly, as if its literal eyes were bad. Maybe someone will see the problem, and it could be something as simple as changing a 48 to a 32 in one of the parameters. You can just set perceptor, preprocess = clip.load('ViT-B/32', torch.device('cuda:0'), jit=True) |
@fractaldna22 Thanks for the explanation, it makes sense, although I will dig in a bit more later to get a better understanding. I've been playing around with the notebook, and for some reason all of my pictures come out with a lot of white in the middle (see pic for prompt). I'm guessing something is exploding somewhere and pushing the RGB values to the max, but I'm going to investigate in more detail. Just wondering if anyone has any ideas? |
@TPreece101 It has to do with the default temperature, or tau, in the latent coordinates. In this version of the notebook it's defined by "hadies" (temperature, lol). The best measure against the image collapsing is to turn it up to 1.4. You can turn it up to 2, 2.5, or any positive number, but 1.4-1.7 generally works best. The higher you go, the thicker the image, but going too high sometimes introduces some weirdness or double vision. I would use this instead; I'll edit the things that I usually change when using the default notebook. Latent coordinates
for the next cell, train / generate:
Hopefully those work, because I'm just eyeballing it here, lol. It's definitely the tau / hadies though, to get more layers. Also, I lowered the learning rate from 1.5 to 1.0. When it's too fast, it's more likely to collapse. 1 is usually good; sometimes even lower is better for quality. |
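To illustrate why raising the temperature can help against collapse, here is a minimal sketch. It assumes that "hadies" acts as a softmax temperature over per-position codebook logits; the notebook's actual code is not shown in this thread, so `softmax_with_temperature` and the logits below are hypothetical.

```python
import math


def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau); tau must be positive."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one latent position over four codebook entries.
logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax_with_temperature(logits, tau=0.5)  # low temperature
soft = softmax_with_temperature(logits, tau=1.4)   # value suggested above
```

A higher `tau` flattens the distribution, so no single codebook entry dominates as strongly. That is one plausible reading of why values around 1.4 keep the image from collapsing.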
Great, thanks a lot @fractaldna22, it has stopped collapsing now. Have you managed to get any good results out of it so far? |
@fractaldna22 The results seem to be really good. It would be great if you can share your colab link. I tried changing as per your suggestions, but the results are not satisfactory. Have a look at this: https://colab.research.google.com/drive/1oA1fZP7N1uPBxwbGIvOEXbTsq2ORa9vb?usp=sharing |
@fractaldna22 Have you managed to find a url to download ViT-L-32? I can't seem to find any references to it anywhere? |
I have, but I can't figure out how to configure it for use by CLIP. It's obviously possible, but it's so confusing and I'm still new. Look for ViT PyTorch; those repos usually have pretrained models. The most accurate model currently is ViT-H_14, the "huge" model. It achieves 99% accuracy in some tests. If it can classify it, it can do the reverse and guide the generation of images. The more accurate CLIP is, the more accurate the VAE is. |
well? no dopamine reward? lol |
@fractaldna22 Thanks for your reply! But I'm still a little confused. In the paper, the image generation steps are: 1. use the transformer to generate image tokens from the input text tokens; 2. feed the generated image tokens into the dVAE decoder to produce RGB images; 3. use CLIP to select the best one. However, in your script, image generation proceeds as follows: 1. randomly initialize the index matrix; 2. feed the index matrix into the dVAE decoder to produce an RGB image; 3. use CLIP to calculate the similarity between the generated image and the text tokens as a loss to optimize the index matrix, while keeping the weights of the network unchanged. Your pipeline has no transformer predictor like the one mentioned in the paper. Could you please explain this? Thank you! |
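The script's three steps can be sketched as a toy optimization loop. Everything here is a stand-in and an assumption: a fixed identity `codebook` replaces the frozen dVAE decoder, a fixed vector replaces the CLIP text embedding, and finite differences replace autograd. Only the latent logits are updated, which mirrors "keeping the weights of the network unchanged".

```python
import math
import random


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, codebook):
    """Frozen stand-in decoder: probability-weighted mix of codebook vectors."""
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * entry[d] for p, entry in zip(probs, codebook))
            for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def loss(logits, codebook, text_embedding):
    """Negative similarity: a lower loss means a closer match to the text."""
    return -cosine(decode(logits, codebook), text_embedding)


def optimize(logits, codebook, text_embedding, steps=200, lr=1.0, eps=1e-5):
    """Gradient descent on the logits only; codebook and text stay fixed."""
    logits = list(logits)
    for _ in range(steps):
        base = loss(logits, codebook, text_embedding)
        grad = []
        for i in range(len(logits)):
            bumped = list(logits)
            bumped[i] += eps  # forward finite difference for one coordinate
            grad.append((loss(bumped, codebook, text_embedding) - base) / eps)
        logits = [x - lr * g for x, g in zip(logits, grad)]
    return logits


rng = random.Random(0)
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical
text_embedding = [0.0, 1.0, 0.0]                                # hypothetical
initial_logits = [rng.uniform(-1, 1) for _ in codebook]  # step 1: random init
initial_loss = loss(initial_logits, codebook, text_embedding)
final_logits = optimize(initial_logits, codebook, text_embedding)
final_loss = loss(final_logits, codebook, text_embedding)
```

In a real notebook the decoder and CLIP would be neural networks, and the gradient would come from backpropagation through both. The point of the sketch is only that the latent is the sole trainable parameter.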
Thanks @fractaldna22 that notebook is much easier to use for tweaking and such. |
@JohnDreamer It's not my code, but the reason is that DALL-E didn't release its transformer and released a VERY small, simple autoencoder that is hardly necessary. We did fine without the transformer, using the autoencoder only as a decoder to remap the pixels. CLIP is the real mastermind behind the image generation, because Attention is all you need. |
@fractaldna22 Got it! Thank you very much! |
I just made this with another CLIP + (different VAE) notebook, which I'm not at liberty to share, but I can share the content: https://www.youtube.com/watch?v=5HcBxeS7jkQ |
@fractaldna22 In fact, if we follow the paper's pipeline, we input the text tokens into GPT3 and predict the image tokens one by one. If the input text tokens are determined, then the generated image tokens are determined. How does Dall-E generate diverse images? Does it choose the next image token by the probability? Do you have any idea about this? |
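On the question of diversity: a common way for autoregressive models to produce different outputs from the same input is to sample each next token from the predicted distribution instead of always taking the most likely one. Whether the unreleased DALL-E transformer does exactly this is an assumption here. The sketch below uses a hypothetical four-entry distribution to contrast the two choices.

```python
import random


def sample_token(probs, rng):
    """Draw one token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(probs):
    """Always pick the most likely token, so the output never varies."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical next-token distribution over a four-entry image vocabulary.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
greedy_draws = {greedy_token(next_token_probs) for _ in range(50)}
sampled_draws = {sample_token(next_token_probs, rng) for _ in range(50)}
```

Greedy selection yields one fixed token for a fixed input, while sampling yields several different tokens across runs. Repeating that at every position gives different images for the same text.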
It just knows what objects are labeled; it can label images going in and do the same thing in reverse: find images based on labels, like a search. It chooses pixels using cosine similarity, that is, how strongly the CLIP neurons associated with a label react to an image. It's not like an image has one token; it has billions. Each pixel is a token. There's no deterministic overlap between a text token and a pixel; that would literally be magic. It takes the whole label, compares it to a dataset of images with that label, and then the VAE maps it onto the image.

Loss functions, weight decay, and things like softmax, normalization, augmentations, etc. are the only way to make it not seem like Google image search. They add noise, and noise is the key to making it continuously choose new ways to match a label with a stack of image shapes.
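The "like a search" idea can be sketched as ranking candidate images by cosine similarity to a label embedding. The embeddings below are hypothetical 3-d vectors chosen for illustration; real CLIP embeddings have hundreds of dimensions and come from the trained encoders.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(label_embedding, image_embeddings):
    """Return candidate names sorted from best to worst match."""
    scores = {name: cosine_similarity(label_embedding, emb)
              for name, emb in image_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings for one label and three candidate images.
label = [0.9, 0.1, 0.0]
candidates = {
    "image_a": [1.0, 0.0, 0.0],
    "image_b": [0.0, 1.0, 0.0],
    "image_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_similarity(label, candidates)
```

The same score, negated, is what a CLIP-guided notebook would typically use as its loss.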
|
@fractaldna22 Great! Is it possible to change output image resolution? (pic - result from "A samurai fighting with ninja" text input) |
I'll just place it here. "A samurai fighting with ninja", 1.3k iterations (I don't know why one of the samurai is fighting with a horse and the ninja is fighting with a tree): |
@Mimocro @fractaldna22 @JohnDreamer @TPreece101 Is there a way to generate a group of images (say 10 or 12)? From what I can see, it can generate a single image and improve it. It would be great if multiple images could be generated. Something similar to this notebook. |
@amish-logicwind All of the Colabs here (including mine, if you disable the cleaner) can generate a group of images, though not exactly 10 or 12; it's a bit more. Also, to see the differences between iterations in my version of the Colab, you can run the cells under the main cell to save them in GIF or ZIP format. |
@Mimocro @fractaldna22 @JohnDreamer @TPreece101 I am working on a project in which I need to generate images based on some training images. I have done that here, but the output images are very small. I am a beginner in GANs and don't know which parameters must be changed in the discriminator and generator to generate output images of size 200 * 200. Please guide me a little. It would be great if any of you could have a look at the above Colab and suggest the necessary changes. |
@amish-logicwind maybe change in 4rd cell |
@Mimocro Bro, this is not a one-line change. We need to change the parameters in the layers also. |
@amish-logicwind Why? These lines change the size of the output that is drawn with plt. The original sizes are larger than 200x200, and I don't think they need to change. The error occurs because plt can't draw a big picture in a small one. Just remove all the plt calls and replace the display function that runs every n epochs with something like display(generated_images) or an equivalent. |
@Mimocro Don't you think that the parameters in the first layers of the generator model must change? As we have increased the image size, we need more layers to deconvolve it, right? |
@amish-logicwind I don't think so. As far as I can see, the 3rd cell gives the real sizes, while height and width just set the size of the displayed output image. I mean, in the output of the training cell you see a downscaled image, while the network generates images at the sizes set by the categories in the first cell. I'm not a pro at this and don't know the best way to view the real-size images, but I don't think you really need the real sizes; they are larger than 200x200. |
The transformer used to generate the images from the text is not part of this code release. I've since modified the README to state this explicitly. There are Colab notebooks available that can be used to generate images by steering a generative model with CLIP, but these are unrelated to this release. |
Hi, I'm wondering: is the index matrix learnable? |
During optimization, which is continuous and is the main and only way to generate images from text using this method, the loss is essentially negative and has no lower limit. It is always a multiple of -1 or -.1, to taste, and each time step is only slightly higher or lower: -1.53, -1.3, -0.5, -1.6, -.1, etc. It has no end state or "finished" point except whenever you decide to end it based on subjective preference. The loss is less important than the actual image itself; the loss tells you nothing, really.

And we actually want continuous improvement, with the model constantly striving to improve the image. Especially if you're making an animation, you just keep adding noise: encode the image, replace the tensor in the optimizer with the new encoded tensor, take an optimizer step, decode the tensor and add noise, encode, take an optimizer step, etc. If you zoom in or alter the image with an affine transform ever so slightly at every time step, it forces the model to recognize new features and evolves the image as well as any SOTA text-to-video model.
…On Sat, 27 Aug 2022 at 11:57, Asthestarsfalll ***@***.***> wrote:
Hi, I'm wondering: is the index matrix learnable?
My understanding is that during the training stage, the input index matrix is the only parameter, and that training ends as soon as the score generated by CLIP is low enough.
Thanks for your help in advance!
|
I was thinking of VQGAN, sorry. The dVAE is similar but much less fluid and less robust against collapsing. There is typically only one image from dVAE generation, and it only improves the image it initially settles on; it doesn't tend to evolve like VQGAN does.
|
Hi, this is great work! But I am a little confused about how to generate a new image. Should I provide the sentence tokens and then use them to predict the image tokens? And where should the noise be injected? I would very much appreciate it if you could answer these questions. Thank you!