Description
Hi, there.
I am new to CLIP and I've found that it really improves my development productivity. Thanks for your great work.
But I have run into a problem when using CLIP for an image retrieval task.
In every epoch, I train and then validate the model. The pseudocode of the training process is as follows:
```python
clip_model, clip_preprocess = clip.load('RN50x4', device=device, jit=False)
...
for epoch in range(num_epoch):
    clip_model.train()  # with or without clip_model.train() makes a noticeable difference in validation accuracy
    for image, text, label in train_dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            img_feat = clip_model.encode_image(image)
            text_feat = clip_model.encode_text(text)
            loss = loss_function(img_feat, text_feat, label)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    clip_model.eval()
    for image, text, label in val_dataloader:
        ...
```
At first, I forgot to call clip_model.train() before the training loop.
Then I added clip_model.train() before the training loop, but the accuracy on the validation set drops noticeably (I have tried many times and the performance gap always exists).
In other words, merely adding or removing model.train() changes the performance noticeably.
This phenomenon is very strange, and I would like to understand the reason behind it and how to solve the problem.
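For reference, here is a minimal standalone sketch of one thing that model.train() toggles and that might be relevant here: the RN50x4 image encoder contains BatchNorm layers, and BatchNorm normalizes with batch statistics (while updating its running statistics) in train mode but with the stored running statistics in eval mode. This is only a guess about the cause, not a confirmed diagnosis, and the module below is a toy example, not the CLIP code itself:

```python
import torch
import torch.nn as nn

# Toy example: a single BatchNorm layer on data whose statistics
# differ from BN's initial running stats (mean 0, var 1).
torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5

bn.train()
out_train = bn(x)  # normalizes with batch stats AND updates running stats

bn.eval()
out_eval = bn(x)   # normalizes with the (partially updated) running stats

# The two outputs differ, and every train-mode forward pass keeps moving
# the running stats (momentum 0.1 by default), so whether train() was
# called during training changes what eval() later normalizes with.
print(torch.allclose(out_train, out_eval))          # False: modes disagree
print(bool(torch.all(bn.running_mean == 0)))        # False: stats were updated
```

If this is the mechanism, the gap would come from the running statistics drifting during fine-tuning rather than from the loss itself.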
Thanks a lot.