Hi! I'm working on training a text-to-video model using the starter code provided, but have been struggling to get anything other than noise, even after 5000+ training steps. I have since moved on to trying to implement text-to-image, but here as well I am only able to generate a black square after many training steps. I'm wondering if I am doing something wrong. I'm currently using ~600 images and captions from the Flickr8K dataset, and my code is below. Any help would be greatly appreciated!
```python
### IMPORTS
import torch
import torchvision
import imageio.v3 as iio
import numpy as np
import os
import cv2
import pandas

### DATA PREPROCESSING
f = []
images = []
for (dirpath, dirnames, filenames) in os.walk('/content/flickr8k/images'):
    f.extend(filenames)
    break

for filename in f:
    images.append(filename)
images = sorted(images)

captions = pandas.read_csv("/content/flickr8k/captions.txt")

count = 0
img_tensors = []
img_captions = []

# Get images, resize, and convert to tensor
# Also get corresponding captions and append both to lists
for img in images:
    im_path = '/content/flickr8k/images/' + str(img)
    im = iio.imread(im_path, index=None)
    im = cv2.resize(im, (64, 64))
    img_arr = np.asarray(im)
    # print(img_arr.shape)
    # img_arr = np.moveaxis(img_arr, -1, 1)
    img_tensor = torch.from_numpy(img_arr)
    img_tensor = img_tensor.type(torch.float32)
    img_tensor = img_tensor.to(device="cuda")
    img_caption = captions.loc[captions['image'] == str(img)].iloc[0]['caption']
    img_tensors.append(img_tensor)
    img_captions.append(img_caption)

# Create image batches
img_batches = []
for i in range(0, len(img_tensors), 2):
    if len(img_tensors) - (i + 2) > 0:
        a = img_tensors[i]
        b = img_tensors[i + 1]
        z = torch.stack((a, b), dim=0)
        img_batches.append(z)

# Create caption batches
caption_batches = []
batch_size = 2
for i in range(0, len(img_captions), batch_size):
    if len(img_captions) - (i + batch_size) > 0:
        a = img_captions[i]
        b = img_captions[i + 1]
        z = [a, b]
        caption_batches.append(z)
### INITIALIZE IMAGEN
from imagen_pytorch import t5
from imagen_pytorch import Unet, Imagen, ImagenTrainer
from google.colab import drive

# Tokenize captions using t5-large
embed_batches = []
for i in range(len(caption_batches)):
    encoded_input = t5.t5_encode_text(caption_batches[i], name='t5-large')
    embed_batches.append(encoded_input)
# Use a single u-net and instantiate imagen, trainer
unet1 = Unet(
    dim=64,
    cond_dim=512,
    dim_mults=(1, 2, 4, 8),
    num_resnet_blocks=3,
    layer_attns=(False, True, True, True),
)
imagen = Imagen(
    unets=(unet1,),
    text_encoder_name='t5-large',
    image_sizes=(64,),
    timesteps=1000,
    cond_drop_prob=0.1
).cuda()

trainer = ImagenTrainer(imagen)
### TRAINING LOOP
drive.mount('/content/gdrive', force_remount=True)
save_path = '/content/gdrive/My Drive/Research'
print(f'Checkpoints will be saved at {save_path}')

# Load trainer from checkpoint if available, save every 5 epochs
current_epoch = 25
checkpoint = 'imagen_1unet_text2image_epoch' + str(current_epoch) + '.ckpt'
num_epochs = 50

if checkpoint:
    trainer.load(os.path.join(save_path, checkpoint))
else:
    current_epoch = 0

for epoch in range(num_epochs):
    print(f'Beginning epoch {current_epoch + epoch + 1}...')
    for i in range(len(img_batches)):
        image_batch = img_batches[i]
        image_batch = image_batch.moveaxis(-1, 1)
        image_batch = image_batch.to(device="cuda")
        embed_batch = embed_batches[i]
        loss = trainer(
            image_batch,
            text_embeds=embed_batch,
            unet_number=1,
            max_batch_size=4
        )
        trainer.update(unet_number=1)
        if i % 5 == 0 or i == (len(img_batches) - 1):
            print(f'Image batches processed: {i}/{len(img_batches)}')
    if ((current_epoch + epoch + 1) % 5) == 0:
        ckpt_name = 'imagen_1unet_text2image_epoch' + str(current_epoch + epoch + 1) + '.ckpt'
        trainer.save(os.path.join(save_path, ckpt_name))
        print(f'Saved checkpoint for epoch {current_epoch + epoch + 1}')
        print(f'\n\n')
### SAMPLE AND VISUALIZE
from matplotlib.pyplot import imshow

images = trainer.sample(texts=caption_batches[0], cond_scale=3.)
batch_idx = 0
x = images[batch_idx].cpu().detach().numpy()
x = np.moveaxis(x, 0, -1)
x = x.astype(np.uint8)
imshow(x)
print(caption_batches[0][batch_idx])
```
Closing after receiving some help from the awesome @HReynaud (thread in question can be found here: #305)
For future reference, the main issue was that I was using trainer.update() for my gradient updates rather than trainer.train_step().
trainer.update() only does the backpropagation step, i.e. tensor.backward() in PyTorch. trainer.train_step() first runs the forward pass and then automatically calls trainer.update() to train the model. If you don't run the forward pass first, the model has no gradients to backpropagate through.
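For anyone who runs into the same thing, here is a minimal sketch of the README-style loop built around trainer.train_step(). It follows the library's unconditional example (condition_on_text=False), so the folder path, image size, and step count are placeholders rather than my actual setup; for text-to-image you would still need to get the text embeddings to the trainer, e.g. through a dataloader.

```python
from imagen_pytorch import Unet, Imagen, ImagenTrainer
from imagen_pytorch.data import Dataset

# Single small u-net, as in my setup above
unet = Unet(
    dim=64,
    dim_mults=(1, 2, 4, 8),
    num_resnet_blocks=3,
    layer_attns=(False, True, True, True),
)

# Unconditional Imagen, just to keep the example short
imagen = Imagen(
    condition_on_text=False,
    unets=unet,
    image_sizes=64,
    timesteps=1000
).cuda()

trainer = ImagenTrainer(imagen)

# Placeholder path: point this at a folder of training images
dataset = Dataset('/path/to/training/images', image_size=64)
trainer.add_train_dataset(dataset, batch_size=16)

for step in range(10000):
    # train_step() pulls a batch from the dataset, runs the forward pass to
    # compute the loss, and then calls trainer.update() internally, so the
    # gradients exist before the optimizer update is applied.
    loss = trainer.train_step(unet_number=1, max_batch_size=4)
    if step % 50 == 0:
        print(f'step {step}: loss {loss}')
```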