
How to sample or generate a new image? #17

Closed
JohnDreamer opened this issue Mar 5, 2021 · 36 comments

Comments

@JohnDreamer

Hi, this is great work! But I am a little confused about how to generate a new image. Should I provide the sentence tokens and then use them to predict the image tokens? And where is the noise injected? It would be greatly appreciated if you could answer these questions, thank you!

@amish-logicwind

I have the same issue. I want to generate images from text. How do I give a string/sentence as input and get images as output, as mentioned here? Please let me know. Thanks!

@EmilEOGG

EmilEOGG commented Mar 8, 2021

I am not sure whether this is possible using this demo, since it contains only the encoder and decoder, and we have no knowledge whatsoever of the language embedding part (apart from the vocab size). I agree it would be cool to have it, but I doubt this will be released.

@fractaldna22

https://colab.research.google.com/drive/1oA1fZP7N1uPBxwbGIvOEXbTsq2ORa9vb?usp=sharing#scrollTo=O78RfTZfh7ji

Here -- use THIS! This is how to do it

@amish-logicwind

Thank you @fractaldna22 for sharing! I have looked at it. Is there a way I can train it on my own dataset along with the text/captions? Please let me know.

@TPreece101

@amish-logicwind I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch
@fractaldna22 Can you explain the notebook a little bit? Am I right in thinking that you are fitting some kind of latent space and evaluating the quality of it using CLIP?

@fractaldna22

@TPreece101 It's not my notebook; it belongs to @advadnoun, who also created several variations. But I believe that's roughly what it does.

CLIP isn't evaluating the quality so much as how closely the generated image matches the tokenized text input. The DALL-E encoder and decoder map images to and from tokens in the latent space; CLIP, using a Vision Transformer (ViT-B/32), evaluates whether the decoded image matches the concepts in the text input. That score is used as a loss, the latents are adjusted, the pixels are decoded again and sent back to the ViT, and so on, until it converges or collapses, depending on various factors.
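To make that loop concrete, here is a minimal sketch of the feedback cycle, assuming a placeholder dvae_decoder for the released decoder and generic variable names rather than the notebook's exact ones (CLIP's input normalization is omitted for brevity):

import torch
import clip

device = torch.device('cuda:0')
perceptor, preprocess = clip.load('ViT-B/32', device=device, jit=False)

text = clip.tokenize(['a hedgehog made of violins']).to(device)
with torch.no_grad():
    text_features = perceptor.encode_text(text)

# The only thing being optimized: a grid of logits over the 8192 dVAE codes.
latents = torch.randn(1, 8192, 64, 64, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latents], lr=0.1)

for step in range(500):
    image = dvae_decoder(torch.softmax(latents, dim=1))           # tokens -> pixels (placeholder decoder)
    crops = torch.nn.functional.interpolate(image[:, :3], (224, 224), mode='bilinear')
    image_features = perceptor.encode_image(crops)                # CLIP embeds the decoded image
    loss = -torch.cosine_similarity(text_features, image_features).mean()
    optimizer.zero_grad()
    loss.backward()                                               # gradients flow only into `latents`
    optimizer.step()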

IMO it's CLIP's ViT-B/32 model that's the real gem here. It's CLIP that is so good at matching images with text and going multi-modal with it. It's only limited by the size of the ViT and its training dataset.

There are multiple models you can use with CLIP: the regular ViT-B-32.pt, RN50, RN101, and RN50x4, and there are several new ViTs out, including ViT-L-32, which is 1.4 GB versus the default B-32 at only 300 MB. There are also hybrid models.

I don't know how to set up the new ViT models so that they load in CLIP. CLIP's model loading specifies that if a checkpoint isn't in its list, or isn't downloaded via one of its URLs, it will refuse to load it. And inside the ViT checkpoint there have to be certain init keys like "apply" and "version". I have no idea how to set that up, even though the ViTs are pre-trained on a large dataset.
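For what it's worth, the list of checkpoints the clip package will accept can be inspected directly; a small sketch (the exact list depends on the installed clip version):

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', 'ViT-B/32', ...] depending on the version installed

# clip.load() refuses names outside that list (or files not fetched from its own URLs),
# which is the restriction described above.
perceptor, preprocess = clip.load('ViT-B/32', device='cuda:0', jit=False)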

"I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch"

@amish-logicwind and @TPreece101

Eh, I haven't seen any images from that. And that notebook focuses on training the VAE, which is just an autoencoder and not a vision transformer, so training it on a dataset does nothing for this.

Try these: https://www.kaggle.com/abhinand05/vision-transformer-vit-tutorial-baseline/notebook , https://github.com/rwightman/pytorch-image-models/

Just make sure it's one of the compatible models: ViT-B-32, RN50, RN101, RN50x4. If anyone can figure out how to load a different pretrained model into CLIP with this notebook, I would love to know how.

If you're going to train something, train the vision transformer model using either the JAX, CLIP, or timm recipes.
But if you have a pretrained model that's already trained on all of ImageNet, CIFAR, etc., why do you need to train it? It can imagine practically anything; it can combine concepts based on what it already knows, as long as you give it the right prompt. You can say "JFK as an anime character" and it will make that, without training on any anime dataset. It already has the knowledge and can infer what you meant.

The only thing the ViT-B/32 model really struggles with, at least in this implementation, is focusing its vision on a central point; it sees kind of cross-eyed. Especially when it's trying to make a human face, it tends to overlap two faces slightly, as if its literal eyes were bad. Maybe someone will spot the problem; it could be something as simple as changing a 48 to a 32 in one of the parameters.

You can just set

perceptor, preprocess = clip.load('ViT-B/32', torch.device('cuda:0'), jit=True)

and use perceptor.train() instead of perceptor.eval(), and I think that will cause it to improve itself over time the more you use it.
When you're done, !cp the model out of '/root/.cache/clip/ViT-B-32.pt' to your Google Drive so you can load it back into .cache/clip next time and keep giving it practice.
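In Colab, that save/restore step would look roughly like this (assuming Google Drive is mounted at /content/drive; the Drive path is just an example):

from google.colab import drive
drive.mount('/content/drive')

# Save the cached CLIP weights to Drive at the end of a session.
!cp /root/.cache/clip/ViT-B-32.pt /content/drive/MyDrive/ViT-B-32.pt

# Next session, restore them before calling clip.load().
!mkdir -p /root/.cache/clip
!cp /content/drive/MyDrive/ViT-B-32.pt /root/.cache/clip/ViT-B-32.pt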

@TPreece101

@fractaldna22 Thanks for the explanation - makes sense, although I will dig in a bit more later to get a better understanding.

I've been playing around with the notebook and for some reason all of my pictures come out with a lot of white in the middle. See the pic for the prompt 'A portrait of Abe Lincoln':
https://ibb.co/nBtdC4T

I'm guessing something is exploding somewhere and pushing the RGB values to the max, but I'm going to investigate in more detail. Just wondering if anyone has any ideas?

@fractaldna22

fractaldna22 commented Mar 16, 2021

@TPreece101 It has to do with the default temperature, or tau, in the latent coordinates.

In this version of the notebook it's defined by "hadies" (temperature, lol). The best defence against the image collapsing is turning it up to 1.4; you can turn it up to 2, 2.5, or any positive number, but 1.4-1.7 generally works best. The higher you go, the thicker the image, but it also introduces some weirdness or double vision if you go too high.
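As a quick illustration of what that multiplier does (a generic example, not the notebook's exact tensors): scaling the logits before the softmax controls how peaked the distribution over the 8192 codebook entries becomes.

import torch

logits = torch.randn(8192)                 # scores over the codebook for one spatial position
for tau in (1.0, 1.4, 2.5):
    probs = torch.softmax(tau * logits, dim=0)
    # a larger multiplier gives a sharper, more one-hot-like distribution
    print(tau, probs.max().item())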

I would use this instead; I'll edit in the things that I usually change when using the default notebook.

Latent coordinates

class Pars(torch.nn.Module):
    def __init__(self):
        super(Pars, self).__init__()

        # One-hot vectors for each of the 8192 dVAE codebook entries.
        hots = torch.nn.functional.one_hot((torch.arange(0, 8192).to(torch.int64)), num_classes=8192)

        # Random initialization of a 64x64 grid of code logits for each batch element.
        rng = torch.zeros(batch_size, 64*64, 8192).uniform_()**torch.zeros(batch_size, 64*64, 8192).uniform_(.1,1)
        for b in range(batch_size):
          for i in range(64**2):
            rng[b,i] = hots[[np.random.randint(8191)]]

        rng = rng.permute(0, 2, 1)

        # The only learnable parameter: the "index matrix" of token logits.
        self.normu = torch.nn.Parameter(rng.cuda().view(batch_size, 8192, 64, 64))

    def forward(self):
      # `hadies` is the global temperature set below; a higher value sharpens the softmax.
      normu = torch.softmax(hadies*self.normu.reshape(batch_size, 8192//2, -1), dim=1).view(batch_size, 8192, 64, 64)
      return normu


lats = Pars().cuda()
mapper = [lats.normu]
optimizer = torch.optim.Adam([{'params': mapper, 'lr': .10}])
eps = 0


tx = clip.tokenize(text_input)
t = perceptor.encode_text(tx.cuda()).detach().clone()

# CLIP's input normalization statistics.
nom = torchvision.transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))


will_it = False
hadies = 1.4
with torch.no_grad():
  al = unmap_pixels(torch.sigmoid(model(lats()).cpu().float())).numpy()
  for allls in al:
    displ(allls[:3])
    print('\n')
  # print(lats())
  # print(lats().sum())

For the next cell, train / generate:

def checkin(loss):
  global hadies
  print('\n##########################################################\n', loss, ' (loss)\n', itt)

  with torch.no_grad():
    al = unmap_pixels(torch.sigmoid(model(lats())[:, :3]).cpu().float()).numpy()
    for allls in al:
      displ(allls)
      display.display(display.Image(str(3)+'.png'))
      print('\n')





def ascend_txt():
  # Decode the current latents into an RGB image with the dVAE decoder.
  out = unmap_pixels(torch.sigmoid(model(lats())[:, :3].float()))

  cutn = 32 # number of random crops fed to CLIP each step; improves quality
  p_s = []
  for ch in range(cutn):
    size = int(sideX*torch.zeros(1,).uniform_(.5, .99))#.normal_(mean=.4, std=.80).clip(.5, .98))
    offsetx = torch.randint(0, sideX - size, ())
    offsety = torch.randint(0, sideX - size, ())
    apper = out[:, :, offsetx:offsetx + size, offsety:offsety + size]
    apper = torch.nn.functional.interpolate(apper, (224,224), mode='nearest')
    p_s.append(apper)
  into = torch.cat(p_s, 0)
  # into = torch.nn.functional.interpolate(out, (224,224), mode='bilinear')

  into = nom(into)

  # CLIP embeddings of the crops; the loss pushes them toward the text embedding `t`.
  iii = perceptor.encode_image(into)

  lat_l = 0

  return [lat_l, 100*-torch.cosine_similarity(t, iii).view(-1, batch_size).T.mean(1)]

def train(i):
  global hadies
  loss1 = ascend_txt()
  loss = loss1[0] + loss1[1]
  loss = loss.mean()
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  
  # hadies /= 1.01
  # hadies = max(hadies, 1.5)

  for g in optimizer.param_groups:
    g['lr'] = g['lr']*1.0
    
  
  if itt % 10 == 0:
    # print('temp', hadies)
    # print(g['lr'], 'lr')
    checkin(loss1)


itt = 0
for asatreat in range(10000):
  train(itt)
  itt+=1



Hopefully those work, because I'm just eyeballing it here, lol. It's definitely the tau / hadies that matters, though, for getting more layers.

Also, I lowered the learning rate from 1.5 to 1.0. When it's too fast, it's more likely to collapse. 1 is usually good; sometimes even lower is better for quality.

@TPreece101

Great, thanks a lot @fractaldna22, it has stopped collapsing now. Have you managed to get any good results out of it so far?

@fractaldna22

image

"a hedgehog made of violins. A hedgehog with the texture of violin"

@amish-logicwind

amish-logicwind commented Mar 18, 2021

@fractaldna22 The results seem to be really good. It would be great if you could share your Colab link. I tried making the changes you suggested, but the results are not satisfactory. Have a look at this: https://colab.research.google.com/drive/1oA1fZP7N1uPBxwbGIvOEXbTsq2ORa9vb?usp=sharing
Actually, I want it to generate textile design patterns. Is there a way I can train it on my data?

@TPreece101

@fractaldna22 Have you managed to find a URL to download ViT-L-32? I can't seem to find any references to it anywhere.

@fractaldna22

I have, but I can't figure out how to configure it for use with CLIP. It's obviously possible, but it's confusing and I'm still new to this.

Look for ViT PyTorch repos; those usually have pretrained models. The most accurate model currently is ViT-H_14, the "huge" model; it achieves 99% accuracy in some tests.

If it can classify something, it can do the reverse and guide the generation of images. The more accurate CLIP is, the more accurate the VAE's output is.

@fractaldna22

Custom Notebook

@fractaldna22

well? no dopamine reward? lol

@JohnDreamer
Author

@fractaldna22 Thanks for your reply! But I'm still a little confused. In the paper, the image generation procedure is: 1. use the transformer to generate image tokens, given the text tokens as input; 2. put the generated image tokens into the dVAE decoder to produce an RGB image; 3. use CLIP to select the best one. However, in your script, image generation works as follows: 1. randomly initialize the index matrix; 2. put the index matrix into the dVAE decoder to produce an RGB image; 3. use CLIP to compute the similarity between the generated image and the text tokens as a loss to optimize the index matrix, while keeping the weights of the network unchanged. In your pipeline, there isn't the transformer predictor mentioned in the paper. Could you please explain this? Thank you!
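To restate that difference as pseudocode (a sketch only; every helper name here is a placeholder, and the transformer prior in the paper's step 1 is exactly the part that was not released):

# Paper's pipeline:
def paper_pipeline(text_tokens, n_candidates=64):
    candidates = []
    for _ in range(n_candidates):
        image_tokens = transformer_prior.sample(text_tokens)                # step 1 (not in this repo)
        candidates.append(dvae_decoder(image_tokens))                        # step 2
    return max(candidates, key=lambda im: clip_score(im, text_tokens))       # step 3: rerank with CLIP

# Notebook's workaround: no transformer at all.
def notebook_pipeline(text, steps=1000):
    logits = init_random_index_matrix()        # step 1: random "index matrix"
    for _ in range(steps):
        image = dvae_decoder(softmax(logits))  # step 2: decoder weights stay frozen
        loss = -clip_similarity(image, text)   # step 3: CLIP similarity as the loss
        logits = gradient_step(logits, loss)   # only the index matrix is updated
    return image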

@TPreece101

Thanks @fractaldna22 that notebook is much easier to use for tweaking and such.

@fractaldna22

@JohnDreamer It's not my code, but the reason is that OpenAI didn't release DALL-E's transformer; they released only a VERY small, simple autoencoder that is hardly necessary. We did fine without it, using it only as a decoder to remap the pixels.

CLIP is the real mastermind behind the image generation because Attention is all you need.

@JohnDreamer
Author

@fractaldna22 Got it! Thank you very much!

@fractaldna22

I just made this with another CLIP + (different VAE) notebook, which I'm not at liberty to share, but I can share the result: https://www.youtube.com/watch?v=5HcBxeS7jkQ

@JohnDreamer
Author

@fractaldna22 In fact, if we follow the paper's pipeline, we input the text tokens into the GPT-3-style transformer and predict the image tokens one by one. If the input text tokens are fixed, then the generated image tokens would be fixed too. How does DALL-E generate diverse images? Does it choose the next image token by sampling from the predicted probabilities? Do you have any idea about this?
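For what it's worth, autoregressive models of this kind usually get diversity by sampling each next image token from the predicted distribution (often with a temperature and/or top-k cutoff) rather than taking the argmax; a generic sketch, not DALL-E's actual code:

import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    # logits: unnormalized scores over the image-token vocabulary for the next position
    values, indices = torch.topk(logits / temperature, top_k)
    probs = torch.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)   # stochastic, so repeated runs differ
    return indices[choice]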

@fractaldna22

fractaldna22 commented Mar 23, 2021 via email

@watashiwa-toki

watashiwa-toki commented Mar 23, 2021

Custom Notebook

@fractaldna22 Great! Is it possible to change the output image resolution? (pic: result from the "A samurai fighting with ninja" text input)

samurai_ninja

@Mimocro

Mimocro commented Mar 25, 2021

I'll just put it here.
For my own purposes I modified the code from this #17 (comment), and I have now translated it back into English.
https://colab.research.google.com/drive/1fS8D61V-CTlup7nsSK-KsXGwLVFio20o?usp=sharing
For now this notebook has:
- no errors when setting up the environment
- the significant parameters exposed
- almost all the code in one cell (not counting the counter, which must be initialized outside the main cell)
- the option to download all generated pictures in zip or gif format
- a rough English translation
- an easy interface
- a checkbox to enable or disable normalization (dry colors)

a red ball, 1.2k iterations:
an red ball

A samurai fighting with ninja, 1.3k iterations (I don't know why one of the samurai is fighting a horse and the ninja is fighting a tree):
A samurai fighting with ninja

2.2k iterations:
A samurai fighting with ninja 2

@amish-logicwind

amish-logicwind commented Mar 25, 2021

@Mimocro @fractaldna22 @JohnDreamer @TPreece101 Is there a way to generate a group of images (say 10 or 12)? From what I can see, it generates a single image and iteratively improves it. It would be great if multiple images could be generated, something similar to this notebook.

@Mimocro

Mimocro commented Mar 25, 2021

@amish-logicwind All of the Colabs here (including mine, if you disable the cleaner) can generate a group of images, though more like a few rather than 10 or 12. Also, to see the differences between iterations in my version of the Colab, you can run the cells below the main cell to save them in gif or zip format.
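In the notebook code earlier in this thread, the number of images produced per run is whatever batch_size is set to, since Pars allocates one (8192, 64, 64) grid of logits per batch element and the display loops iterate over the decoded batch; a small sketch of the change (memory permitting):

batch_size = 8        # set before the cell that defines Pars; each element starts from its own random init
lats = Pars().cuda()  # now holds batch_size independent latent grids, all optimized toward the same prompt
# checkin() already loops over the decoded batch, so every image is displayed;
# GPU memory use grows roughly linearly with batch_size.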

@amish-logicwind

@Mimocro @fractaldna22 @JohnDreamer @TPreece101 I am working on a project in which I need to generate images based on some training images. I have done that here, but the output images are very small. I am a beginner with GANs and don't know which parameters must be changed in the discriminator and generator to produce images with an output size of 200 x 200. Please guide me a little. It would be great if any of you could have a look at the above Colab and suggest the necessary changes.
Thanks!

@Mimocro

Mimocro commented Mar 25, 2021

@amish-logicwind Maybe change, in the 4th cell,
HEIGHT = 32
WIDTH = 54
to
HEIGHT = 200
WIDTH = 200
?
Also, I'm a beginner too, and this Colab doesn't work for me because I don't have the required directory in my Drive, so I can't do much with it. But try changing the code that displays the image via matplotlib to just show the image directly.

@amish-logicwind

@Mimocro Bro, this is not a one-line change. We need to change the parameters in the layers also.

@Mimocro

Mimocro commented Mar 25, 2021

@amish-logicwind Why? Those lines change the size of the output that is drawn with plt. The original sizes are more than 200x200 and I don't think they need to change. The problem is that plt can't draw a big picture at a small size. Just remove all the plt calls and replace the display function that runs every n epochs with something like display(generated_images) or an equivalent.

@amish-logicwind

@Mimocro Don't you think the parameters in the first layers of the generator model must change? Since we have increased the image size, we need more layers to deconvolve it, right?

@Mimocro

Mimocro commented Mar 25, 2021

@amish-logicwind I don't think so. As far as I can see, the 3rd cell gives the real sizes, while HEIGHT and WIDTH just set the displayed output image size. I mean that in the output of the training cell you see a downscaled image, while the network generates images at the sizes set by the categories in the first cell. I'm not a pro at this and don't know the best way to view the real-size images, but I don't think you really need the real sizes; they are already more than 200x200.

@adityaramesh
Collaborator

The transformer used to generate the images from the text is not part of this code release. I've since modified the README to state this explicitly.

There are Colab notebooks available that can be used to generate images by steering a generative model with CLIP, but these are unrelated to this release.

@Asthestarsfalll

@fractaldna22 Thanks for your reply! But I'm still a little confused. In the paper, the image generation procedure is: 1. use the transformer to generate image tokens, given the text tokens as input; 2. put the generated image tokens into the dVAE decoder to produce an RGB image; 3. use CLIP to select the best one. However, in your script, image generation works as follows: 1. randomly initialize the index matrix; 2. put the index matrix into the dVAE decoder to produce an RGB image; 3. use CLIP to compute the similarity between the generated image and the text tokens as a loss to optimize the index matrix, while keeping the weights of the network unchanged. In your pipeline, there isn't the transformer predictor mentioned in the paper. Could you please explain this? Thank you!

Hi, I'm wondering: is the index matrix learnable?
My understanding is that during the optimization stage, the input index matrix is the only parameter, and the optimization ends once the loss computed from CLIP is low enough.
Thanks for your help in advance!
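For reference, that reading matches the notebook code earlier in the thread: self.normu is the only torch.nn.Parameter and the only tensor given to the optimizer, so CLIP and the dVAE decoder stay frozen; a quick way to check (assuming the same variable names as that code):

# Only the latent "index matrix" is registered as a parameter of Pars:
print([name for name, _ in lats.named_parameters()])     # -> ['normu']

# And only it is handed to Adam, so nothing else receives updates:
optimizer = torch.optim.Adam([{'params': [lats.normu], 'lr': 0.1}])

# There is no fixed stopping score in that code; the loop simply runs for a set
# number of iterations (or until you stop it), with the CLIP loss driving the updates.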

@fractaldna22

fractaldna22 commented Aug 28, 2022 via email

@fractaldna22

fractaldna22 commented Aug 28, 2022 via email
