
CLIP Training Code #83 (Open)
vinson2233 opened this issue Apr 8, 2021 · 222 comments

@vinson2233

vinson2233 commented Apr 8, 2021

Not really an issue, I just want to share my training code since some people still have difficulty writing it. Just modify the code to suit your use case.
Feel free to ask questions or point out any mistakes in my code.

# Latest Update : 18 July 2022, 09:55 GMT+7

# TO ADD :
# Gradient Checkpointing
# Filter out bias from weight decay
# Decaying learning rate with cosine schedule
# Half-precision Adam statistics
# Half-precision stochastically rounded text encoder weights

# BATCH_SIZE must be larger than 1

import clip
import torch
import torch.nn as nn
import torch.optim as optim
from PIL import Image
from torch.utils.data import Dataset, DataLoader

device = "cuda:0" if torch.cuda.is_available() else "cpu" # If using GPU then use mixed precision training.
model, preprocess = clip.load("ViT-B/32", device=device, jit=False) # Must set jit=False for training

class image_title_dataset(Dataset):
    def __init__(self, list_image_path,list_txt):

        self.image_path = list_image_path
        self.title = clip.tokenize(list_txt) # you can tokenize everything at once here (slow at the beginning), or tokenize in the training loop.

    def __len__(self):
        return len(self.title)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_path[idx])) # Image from PIL module
        title = self.title[idx]
        return image,title

# use your own data
list_image_path = ['folder/image1.jpg','folder2/image2.jpg'] 
list_txt = ['description for image1.jpg' , 'description for image2.jpg']
dataset = image_title_dataset(list_image_path,list_txt)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model):
    # Convert weights (and grads, if present) back to fp32 so the optimizer step runs in full precision
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None:
            p.grad.data = p.grad.data.float()


if device == "cpu":
  model.float()
else :
  clip.model.convert_weights(model) # Actually this line is unnecessary since CLIP is already in float16 by default

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2) # Params from the paper; the lr here is smaller, which is safer when fine-tuning on a new dataset

# add your own code to track the training progress.
for epoch in range(EPOCH):
  for batch in train_dataloader :
      optimizer.zero_grad()

      images,texts = batch 
    
      images= images.to(device)
      texts = texts.to(device)
    
      logits_per_image, logits_per_text = model(images, texts)

      ground_truth = torch.arange(len(images),dtype=torch.long,device=device)

      total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
      total_loss.backward()
      if device == "cpu":
         optimizer.step()
      else : 
        convert_models_to_fp32(model)
        optimizer.step()
        clip.model.convert_weights(model)
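A minimal sketch (editor's addition, not part of the original training code) of two items from the TO ADD list above: filtering bias/1-D parameters out of weight decay and a cosine learning-rate schedule. The parameter-name checks are a heuristic assumption, not something CLIP-specific.

# Hypothetical optimizer setup: no weight decay for biases and 1-D params (LayerNorm gains, logit_scale)
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.ndim < 2 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = optim.Adam(
    [{"params": decay, "weight_decay": 0.2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=5e-5, betas=(0.9, 0.98), eps=1e-6)

# Cosine decay over the whole run; call scheduler.step() right after optimizer.step() in the loop above
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCH * len(train_dataloader))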
NOTE :

Code to save the model :

torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        }, f"model_checkpoint/model_10.pt") #just change to your preferred folder/filename

Code to load the saved model :

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you use default model setting(not training setting) of the clip. For example, if you set context_length to 100 since your string is very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] 
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution #default is 224
checkpoint['model_state_dict']["context_length"] = model.context_length # default is 77
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])
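Editor's sketch of running inference with the fine-tuned checkpoint (the image path and prompts below are placeholders):

model.eval()
image = preprocess(Image.open("folder/image1.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability of each prompt matching the image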

Alternative training code :

@nikky4D

nikky4D commented Apr 8, 2021

Very helpful. Thank you

@vkmavani

Not really an issue, I just want to share my training code since some people still have difficulty writing it.
Feel free to ask questions or point out any mistakes in my code.

train_dataloader = DataLoader(...,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model): 
    for p in model.parameters(): 
        p.data = p.data.float() 
        p.grad.data = p.grad.data.float() 

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
clip.model.convert_weights(model)

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) #Params from paper

for batch in train_dataloader :
    optimizer.zero_grad()

    list_image, list_txt = batch # list_image is a list of images as numpy arrays (np.uint8)
    
    images= torch.stack([preprocess(Image.fromarray(img)) for img in list_image],dim=0)
    texts = clip.tokenize(list_txt)
    
    logits_per_image, logits_per_text = model(images, texts)

    ground_truth = torch.arange(BATCH_SIZE).to(device)
    total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
    total_loss.backward()

    convert_models_to_fp32(model)
    optimizer.step()
    clip.model.convert_weights(model)

Hi, thank you for this training code.
I have a dataset where I want to check image similarity, and I want to use CLIP. But I don't know how to prepare a dataset (image_size, embedding_size, transforms, etc.) to feed into this training code. Can you please provide the dataset class if possible?

@vinson2233 (Author)

vinson2233 commented Apr 12, 2021

@vkmavani sure. The preprocess object from CLIP takes care of all of the preprocessing steps for the image part, so you don't need to worry about image_size or transforms (see https://github.com/openai/CLIP/blob/main/clip/clip.py line 58).
For example, maybe your data looks like this :

| image  | caption  |
---------------------
| url1   | caption1 |
| url2   | caption2 |

where the URL is the path to the image and the caption is the string of the caption.

Here's the dataset class definition for image-text similarity :

from PIL import Image
from torch.utils.data import Dataset, DataLoader

class image_caption_dataset(Dataset):
    def __init__(self, df):

        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        
        images = preprocess(Image.open(self.images[idx])) #preprocess from clip.load
        caption = self.caption[idx]
        return images,caption

dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

With this dataset definition, you can omit the Image.fromarray() and the preprocess step after loading the batch, since the actual data is already in tensor format.

If you are interested in doing image-image similarity, just modify the dataset to return a pair of images, and adjust the training code accordingly; the big change will be in creating the logits. Change the forward call logits_per_image, logits_per_text = model(images, texts) following https://github.com/openai/CLIP/blob/main/clip/model.py, line 354 (see the sketch below).
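Editor's sketch of that change for image-image pairs, mirroring the forward method of model.py; images_a and images_b are assumed to be two preprocessed image batches produced by your modified dataset:

image_features_a = model.encode_image(images_a)
image_features_b = model.encode_image(images_b)
image_features_a = image_features_a / image_features_a.norm(dim=-1, keepdim=True)
image_features_b = image_features_b / image_features_b.norm(dim=-1, keepdim=True)

logit_scale = model.logit_scale.exp()
logits_a = logit_scale * image_features_a @ image_features_b.t()   # similarity of each a to each b
logits_b = logits_a.t()

ground_truth = torch.arange(len(images_a), dtype=torch.long, device=device)
total_loss = (loss_img(logits_a, ground_truth) + loss_txt(logits_b, ground_truth)) / 2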

@lonngxiang

What does clip.model.convert_weights do? And can you provide complete training code if possible?

@vinson2233 (Author)

@lonngxiang For more information, read #57. clip.model.convert_weights basically converts the CLIP model weights into float16. This will help accelerate training and reduce memory usage.
The definition of clip.model.convert_weights can be found at https://github.com/openai/CLIP/blob/main/clip/model.py line 371

I can't give fully working example code since I'm using a private dataset, but I believe the training code and dataset code that I provided are sufficient.
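Editor's note: roughly, clip.model.convert_weights applies a function like the simplified one below to every submodule (a paraphrase; see the linked model.py for the full version, which also handles attention and projection parameters):

def _convert_weights_to_fp16(layer):
    # Cast the weights of conv and linear layers to half precision
    if isinstance(layer, (nn.Conv1d, nn.Conv2d, nn.Linear)):
        layer.weight.data = layer.weight.data.half()
        if layer.bias is not None:
            layer.bias.data = layer.bias.data.half()

model.apply(_convert_weights_to_fp16)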

@lonngxiang

@lonngxiang For more information, read #57, clip.model.convert_weights basically convert the CLIP model weight into float16. This will help accelerate and reduce memory usage during training.
The definition of clip.model.convert_weight can be found at https://github.com/openai/CLIP/blob/main/clip/model.py line 371

I can't give a fully working example code since I'm using a private dataset, but I believe the training code and dataset code that I provided is sufficient.

Thank you for your kind reply

@lonngxiang

There is an error when running this training code:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.JpegImagePlugin.JpegImageFile'>

@vkmavani

@vkmavani sure. The preprocess object from CLIP takes care of all of the preprocessing steps for the image part, so you don't need to worry about image_size or transform(see https://github.com/openai/CLIP/blob/main/clip/clip.py line 58).
For example, maybe your data look like this :

| image  | caption  |
---------------------
| url1   | caption1 |
| url2   | caption2 |

where the URL is the path to the image and the caption is the string of the caption.

Here's the dataset class definition for image-text similarity :

from PIL import Image

class image_caption_dataset(Dataset):
    def __init__(self, df):

        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        
        images = Image.open(self.images[idx])
        caption = self.caption[idx]
        return images,caption

dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

With this dataset definition, you can omit the Image.fromarray() since the actual data already in PIL format.

If you are interested in doing image-image similarity, just modify the dataset to return pair of images and
for the training code, adjust the code accordingly, a big change will happen in the creating the logits part. Change the forward method logits_per_image, logits_per_text = model(images, texts) according to https://github.com/openai/CLIP/blob/main/clip/model.py, line 354.

Thank you very much. It really helps a lot.

@vinson2233 (Author)

@lonngxiang oh you are correct. Pardon me, I have edited my code above. The dataset should return something that can be put into a PyTorch tensor.

@lonngxiang

@lonngxiang oh you are correct. pardon me, I have edited my code above. The dataset should return something that can be put on PyTorch tensor.

One more thing: when you use preprocess inside the image_caption_dataset class, is the preprocess in the torch.stack line still needed?

@lonngxiang

@lonngxiang oh you are correct. pardon me, I have edited my code above. The dataset should return something that can be put on PyTorch tensor.

There is still an error at images = torch.stack([preprocess(Image.fromarray(img)) for img in list_image], dim=0):

AttributeError: 'Tensor' object has no attribute '__array_interface__'

@vinson2233 (Author)

Yeah, if you're already using preprocess inside the class, the result from the batch can be passed directly to CLIP. So that line can be changed to: images = list_image

@lonngxiang

Yeah, if already using preprocess inside the class. The result from the batch can be used directly to the CLIP. So that line can be change into this : images = list_image

Then I get another error:
RuntimeError: "unfolded2d_copy" not implemented for 'Half'

@vinson2233 (Author)

Hmmmm, that error is new to me. Did the error occur when calculating the loss?

@lonngxiang

Hmmmm, that error is new for me. Is the error occurred when calculating the loss?

Yes, the error occurred in this line:
logits_per_image, logits_per_text = model(images, texts)

Using model(images.float(), texts.float()) still gives the error:
RuntimeError: "unfolded2d_copy" not implemented for 'Half'

@vinson2233 (Author)

Are you using CPU by any chance? Mixed-precision training usually doesn't work on CPU.

@lonngxiang

Are you using CPU by any chance? The mixed precision training usually don't work on CPU

Yes, I run it on CPU.

@vinson2233 (Author)

@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU
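Editor's aside, not from the thread: on GPU, an alternative to the manual fp16/fp32 round-trip is PyTorch's automatic mixed precision, assuming the model is first kept in fp32 with model.float():

scaler = torch.cuda.amp.GradScaler()
for batch in train_dataloader:
    optimizer.zero_grad()
    images, texts = batch
    images, texts = images.to(device), texts.to(device)
    with torch.cuda.amp.autocast():   # ops run in fp16/fp32 automatically
        logits_per_image, logits_per_text = model(images, texts)
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)
        total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2
    scaler.scale(total_loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()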

@lonngxiang

@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU

OK, so kind of you; thank you for your patience.

@lonngxiang

@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU
I ran it on CPU; there's still a problem. The total_loss is always 0.

[screenshot of the training output omitted]

@lonngxiang

@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU

How do I set BATCH_SIZE to get the ground_truth labels?

@vinson2233 (Author)

@lonngxiang Hmmmm, I don't have the faintest idea why the loss is = 0.

BATCH_SIZE is just an integer that you set. Since the image-text pairs are aligned, the first image corresponds to the first text, so the ground truth for the first image is 0; the second image corresponds to the second text, so its ground truth is 1.
This pattern keeps repeating until the last image-text pair.
So the ground truth is a torch tensor like this : torch.tensor([0,1,2,3,...,BATCH_SIZE-1]).
Since the pre-trained CLIP used a massive batch size, just use the largest BATCH_SIZE your system can take.

You can read more info about cross-entropy loss https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target. Also the CLIP paper, page 5, the upper left part.
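Editor's illustration of the shapes involved (len(images) = N, the current batch size):

N = len(images)
ground_truth = torch.arange(N, dtype=torch.long, device=device)   # tensor([0, 1, ..., N-1])
# logits_per_image has shape [N, N]; entry (i, j) is the similarity of image i and text j,
# so loss_img(logits_per_image, ground_truth) pushes the diagonal to be the largest value in each row.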

@lonngxiang

@lonngxiang Hmmmm, I don't have the faintest idea why the loss is = 0.

BATCH_SIZE is just an integer that you set. Since the image-text are in pairs, the first image will correspond to the first text. So the ground truth for the first image is 0, the second image will correspond to the second image, so the ground truth is 1.
This pattern keeps repeating until the last image-text pair.
So the ground truth is a torch tensor like this : torch.tensor([0,1,2,3,...,BATCH_SIZE-1]).
Since the pre-trained CLIP use a massive batch size, just try to use the largest BATCH_SIZE as your system can take.

You can read more info about cross-entropy loss https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target. Also the CLIP paper, page 5, the upper left part.

Thanks for your reply; so if I have five pairs, my BATCH_SIZE is five, right?

@vinson2233 (Author)

vinson2233 commented Apr 13, 2021

Your BATCH_SIZE determines the number of pairs in each batch.

For example, if you have 1000 pairs and set BATCH_SIZE = 20,
then each iteration of for batch in train_dataloader gives you 20 pairs, and the loop repeats 50 times to cover all the data for 1 epoch.
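Editor's one-liner for the arithmetic (the numbers are just the example above):

import math
steps_per_epoch = math.ceil(1000 / 20)   # = 50 iterations of the inner loop per epoch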

@lonngxiang

Your BATCH_SIZE will determince the number of pairs for each batch

For example, If you have 1000 pairs, and set BATCH_SIZE = 20.
Then for each loop of for batch in train_dataloader, the variable batch will give you 20 pairs. The loop will be repeated 50 times to cover all the data for 1 epoch.

Yes, but when I set BATCH_SIZE = 1, the total_loss is always 0. Is this right? What's wrong with it?

@vinson2233 (Author)

vinson2233 commented Apr 13, 2021

Yes, that's the problem. BATCH_SIZE must be greater than 1.
The reason is that your prediction returns the cosine similarity between that image and that text.
CrossEntropyLoss is a combination of softmax and log loss.
Since each row only has 1 prediction (because BATCH_SIZE=1), the softmax returns probability 1 for that entry (it doesn't matter whether the logit is high or low), which automatically corresponds to the correct ground truth, so the loss is 0.
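Editor's tiny demonstration of that effect:

import torch
import torch.nn.functional as F

logits = torch.tensor([[3.14]])            # 1 image x 1 text, any value
target = torch.tensor([0])
print(F.cross_entropy(logits, target))     # tensor(0.) -- softmax over a single entry is always 1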

@lonngxiang

Yes, that's the problem. BATCH_SIZE must be greater than 1.
The reason is your prediction will return cosine similarity for that image and that text.
CrossEntropyLoss is combination of softmax with logloss.
Since one row only has 1 prediction(because BATCH_SIZE=1), the softmax will return probability=1 for that entry(It doesn't matter whether the logits is high or low), where it automatically correspond to the correct ground truth.

Thank you for helping me a lot and learning a lot

@dmoham1476

  1. Don't we need to do clip.load_state_dict after clip.load?
  2. Are we not doing model.encode_image and model.encode_text and then doing norm before training?
  3. Can you please add demo code for early stopping, saving the model (.pt) and metrics as well
  4. Are we fine-tuning only ViT and not the text part? How did this impact performance on custom dataset?

@Benjizhang

Benjizhang commented Aug 6, 2023

Hi, thanks for your helpful code. In my project, I have a dataset of pairs of text and another modality (e.g., tactile information, represented as a one-dimensional vector for each tactile signal). If I still use the CLIP pipeline to align the text embedding and the tactile embedding, may I ask for any suggestions or modifications for training on this dataset?

@xuntianci

@lamnt2008

Your question is very broad. What is your specific use case? If you are doing image classification, I am assuming that you have in your training dataset images paired with labels? If that is the case, you have two options as I understand:

  1. Use only CLIP's image encoder as a feature extractor and train an MLP classification head on top of it to perform classification. You may want to experiment with a lower learning rate for the CLIP image encoder and a higher learning rate for the MLP head. Freezing the CLIP image encoder altogether and only training the MLP head is also a decent option for faster training on simpler classification tasks (see the sketch after this quote).
  2. Treat your image-label pairs as positive examples and tune CLIP using the same contrastive approach* as outlined by @vinson2233 in his code. It is probably a good idea to build prompts around your class labels (e.g. banana -> "a photo of a banana").

*For a summary of the contrastive training approach see the CLIP blog post: https://openai.com/research/clip

As for which approach is better, the most recent paper I've read suggests that 2. comes out on top: https://arxiv.org/abs/2212.00638
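Editor's sketch of option 1 with a fully frozen CLIP image encoder; the head sizes, number of classes, and clf_dataloader (assumed to yield preprocessed image tensors and integer class ids) are all placeholders:

for p in model.parameters():
    p.requires_grad = False                      # freeze CLIP entirely

num_classes = 10                                 # placeholder
head = nn.Sequential(
    nn.Linear(512, 256),                         # 512 = ViT-B/32 image embedding size
    nn.ReLU(),
    nn.Linear(256, num_classes),
).to(device)

head_optimizer = optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in clf_dataloader:            # labels: integer class ids
    head_optimizer.zero_grad()
    with torch.no_grad():
        feats = model.encode_image(images.to(device)).float()
    loss = criterion(head(feats), labels.to(device))
    loss.backward()
    head_optimizer.step()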

For option 1, I use CLIP's image encoder and add an MLP head, making all parameters learnable, but I get a NaN loss at batch 2. When I only train the head, it is OK. I traced the NaN: the CrossEntropyLoss returned a NaN value. Can you give me some advice? I would appreciate it.

@xuntianci

For option 1, I use CLIP's image encoder and add an MLP head, making all parameters learnable, but I get a NaN loss at batch 2. When I only train the head, it is OK. I traced the NaN: the CrossEntropyLoss returned a NaN value. Can you give me some advice? I would appreciate it.

Hey! I have a similar question. Did you solve your problem? I use CLIP's image encoder and add an MLP head, making all parameters learnable, but I get a NaN loss at batch 2. When I only train the head, it is OK. I traced the NaN: the CrossEntropyLoss returned a NaN value.

@xiyangyang99

xiyangyang99 commented Aug 23, 2023

@vinson2233
Firstly, thank you for sharing the fine-tuning code. However, on my own dataset I only fine-tuned the image encoder and did not fine-tune the text encoder. I used ViT-B/16 as the pre-trained weights, but after fine-tuning the .pt file size increased by 5 times. Also, how should the .pt model generated after fine-tuning be used for inference? Looking forward to your guidance, thank you.

@TaylorLi123

(quotes the full training, save, and load code from the original post)

Thanks for sharing the fine-tuning code, but when I load the saved model to test after fine-tuning, the scores for each category are all equal. How did this happen? Looking forward to your guidance, thank you.
tensor([0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250],
device='cuda:0', dtype=torch.float16)

@xiyangyang99

@TaylorLi123
What is your dataset like? How much data is there? My dataset is very small, and after training I get the same behaviour as you: the final output values are all the same.

@TaylorLi123

@QzYER
My dataset contains over 20,000 images
picture format:
-- images
--00001.jpg
--00002.jpg
description text:
--caption.txt
--description for image1.jpg
--description for image2.jpg

@TaylorLi123

@vinson2233
How do I output the evaluation metrics (accuracy, precision, recall, and F1) during training? Does anyone have these?

@p1k0pan

p1k0pan commented Oct 27, 2023

Hi @vinson2233, thanks a lot for sharing your code! If I understand well, you never put the model in training mode with model.train(). Do you do that on purpose, to freeze dropout and batchnorm layers during fine-tuning? And a second question: do you have any ideas about good metrics to keep track of during evaluation to understand whether fine-tuning is going well (validation loss apart)?

I have checked the architecture by printing out the state_dict of the model loaded by clip.load(). I found there are no Dropout layers in the model, although dropout is a standard layer in a standard ViT and Transformer. Does this mean the model is not complete and not sufficient for fine-tuning?

@IliasParas13

Where can I find a dataset with texts?

@keshavsharma347

@uplusv If you want to modify CLIP as a classifier(Single label, multi class), here's some modification you can do :

  1. Change the ground_truth = torch.arange(BATCH_SIZE).to(device) to integer vector that specify which class your image are on (for example torch.tensor([0,1,2,1,2,3,4,5])). With this now you can set your batch size in arbitrary size.
  2. One image should match 1 label, but 1 label can match will multiple images. You can omit the loss_txt in the total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2 to total_loss = loss_img(logits_per_image,ground_truth)

I'm not sure what you meant by "After fine-tuning, the model outputs sample feature for every image"
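Editor's sketch of the classifier modification described in the quote above; the class prompts, clf_dataloader (assumed to yield preprocessed images and integer class ids), and the omitted GPU fp16/fp32 steps are all assumptions:

class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]   # placeholders
text_tokens = clip.tokenize(class_prompts).to(device)

for images, class_ids in clf_dataloader:
    optimizer.zero_grad()
    logits_per_image, _ = model(images.to(device), text_tokens)   # shape [batch, num_classes]
    ground_truth = class_ids.to(device)                           # e.g. tensor([0, 1, 2, 1, ...])
    total_loss = loss_img(logits_per_image, ground_truth)         # loss_txt is omitted here
    total_loss.backward()
    optimizer.step()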

Thank you for your reply and advice, I will try it soon! By "After fine-tuning, the model outputs sample feature for every image", I mean that, with "image_features = model.encode_image(image_input)" I print this "image_features" and get image_features: tensor([ [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039], [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039], [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039], ..., [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039], [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039], [ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039]]) while the original model outputs: image_features: tensor([ [ 0.0304, -0.0169, -0.0383, ..., 0.0927, 0.0261, 0.0203], [ 0.0013, -0.0067, -0.0524, ..., 0.1029, 0.0028, 0.0169], [ 0.0115, -0.0006, -0.0392, ..., 0.0616, 0.0317, 0.0171], ..., [ 0.0173, -0.0152, -0.0431, ..., 0.0836, 0.0405, 0.0268], [ 0.0287, -0.0236, -0.0401, ..., 0.0856, 0.0119, 0.0287], [ 0.0150, 0.0013, -0.0537, ..., 0.0792, 0.0104, 0.0062]]) After fine-tuning, the features become same and smaller so I get identical and large logits(like 99.8856) for every image😢.

Were you able to fine-tune for a classification task? If so, can you provide some reference? My dataset has an image, a caption, and the class it belongs to.

@Dinosaurcubs

(quotes the full training, save, and load code from the original post)

Really nice reply. I am trying to borrow the image encoder part of CLIP and fine-tune the encoder only, because I plan to use it as the feature-extraction part of my own model. In detail, I am trying to add some parameters to it and train only those params, but I don't know how to split the visual encoder out of CLIP and modify it. Can you provide some guidance? Thanks.

@deadpipe

@uplusv If you want to modify CLIP as a classifier(Single label, multi class), here's some modification you can do :

  1. Change the ground_truth = torch.arange(BATCH_SIZE).to(device) to integer vector that specify which class your image are on (for example torch.tensor([0,1,2,1,2,3,4,5])). With this now you can set your batch size in arbitrary size.
  2. One image should match 1 label, but 1 label can match will multiple images. You can omit the loss_txt in the total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2 to total_loss = loss_img(logits_per_image,ground_truth)

I'm not sure what you meant by "After fine-tuning, the model outputs sample feature for every image"

Hi,

Can you please explain the first point you mentioned? Actually, I want to fine-tune CLIP for multi-label image classification, where one image may belong to multiple classes.

@ItzHaad

ItzHaad commented Jan 7, 2024

Hi, could you clarify why we are using torch.arange? Suppose the data is randomly shuffled after every epoch: we will have image pairs at different positions every time, so essentially we are not learning anything apart from position (which also changes randomly every time). Instead, this approach makes a lot more sense: https://github.com/moein-shariatnia/OpenAI-CLIP (with the projection head), since we are mapping to a common representation between image and text embeddings.

@vinson2233 (Author)

vinson2233 commented Jan 7, 2024

@ItzHaad I can give you 2 answers:

  1. actual answer:
    you shuffle the pairs of data instead of shuffling the images and texts independently; that's how you retain the labels (see the small illustration below).
    Let's say the original data is (Image1, Text1), (Image2, Text2), (Image3, Text3).
    It makes no difference if you shuffle the data so the order becomes (Image2, Text2), (Image3, Text3), (Image1, Text1).

  2. lazy answer : that is what is presented in the paper (read Figure 3 of the paper)
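Editor's illustration: the DataLoader shuffles indices, and __getitem__ returns the matching (image, text) for each index, so the pairs always stay aligned:

train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)  # shuffles pair indices, not images/texts separately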

@Heathcliff-saku

Hi @vinson2233!
Thank you very much for your training script. In my code, I have adopted your dataset method. However, my dataset is quite large (about 2 million img-text pairs), which has led to an unusual phenomenon during training: at the start of each epoch, specifically during the loading of the first batch, there is a prolonged delay (approximately 30 to 40 minutes), and the GPU utilization remains at 0%. Have you encountered this issue before? Is this normal? Do you have any recommended solutions to mitigate this?
PS. I have set an appropriate number of workers and enabled pin memory, but this waiting time still seems unavoidable.

@vinson2233 (Author)

@Heathcliff-saku It's expected because of how the dataset object is created. If you don't want the huge overhead up front, another way is to do the image preprocessing and clip.tokenize after the data is produced by the data loader, but this creates redundant work every epoch (one variation is sketched below).

If anyone can give recommendations as well, feel free to do so, since I don't use CLIP anymore.
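Editor's sketch of one variation: keep the raw strings and tokenize inside __getitem__ instead of up front, which removes the long start-up delay at the cost of re-tokenizing every epoch. The class and variable names here are placeholders.

class LazyImageTitleDataset(Dataset):
    def __init__(self, list_image_path, list_txt):
        self.image_path = list_image_path
        self.text = list_txt                          # keep raw strings, no up-front tokenization

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_path[idx]))
        title = clip.tokenize([self.text[idx]])[0]    # tokenize on demand
        return image, title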

@manas6266

How many epochs does CLIP need to be fine-tuned for, and what should the batch size be?

@anas2908

anas2908 commented Mar 3, 2024

Cell In[13], line 64, in image_title_dataset.__getitem__(self, idx)
     63 def __getitem__(self, idx):
---> 64     image_path = self.image_path[idx] # Get the image path at the specified index
     65     image = preprocess(Image.open(image_path)) # Open the image using PIL
     66     title = self.title[idx] # Get the title corresponding to the image

TypeError: list indices must be integers or slices, not list

Is there any problem in my dataloader? I am using from torch.utils.data import DataLoader.
I am stuck, can you please help me?

@anas2908

anas2908 commented Mar 3, 2024

Hi!@vinson2233 Thank you very much for your training script. In my code, I have adopted your dataset method. However, my dataset is quite large (about 2 million img-text pairs), which has led to an unusual phenomenon during training: at the start of each epoch, specifically during the loading of the first batch, there is a prolonged delay (approximately 30 to 40 minutes), and the GPU utilization remains at 0%. Have you encountered this issue before? Is this normal? Do you have any recommended solutions to mitigate this? PS. I have set an appropriate number of workers and enabled pin memory, but this waiting time still seems unavoidable.

Hey, can you share your code with me?

@BaochaoZhu

@vinson2233, hi, I would like to train CLIP on my own custom dataset. Can you please advise me on how many images I need to prepare per class? Thank you.

@lalit-pivotchain

@lonngxiang i have update the code for save and load, basically to load the model use this code :

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) 
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you use default model setting(not training setting) of the clip. For example, if you set context_length to 100 since your string is very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] 
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution 
checkpoint['model_state_dict']["context_length"] = model.context_length
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])

Just modify the dict key to match your dict key when saving to .pt file

I implemented the above code to load the saved model (.pt), but I encountered this error:
AttributeError: 'CLIP' object has no attribute 'input_resolution'

@Shadedog838

I have implemented the training code above and have been trying to train the model on the Flickr dataset, but my loss keeps increasing until it eventually stays the same. I don't know what the problem is; can anyone provide some insight?

@seidasoeun

(quoting the __getitem__ TypeError from the comment above)

You just need to fix your import, like below:

from torch.utils.data import Dataset

@nightrain-vampire

@lonngxiang Hmmmm, I don't have the faintest idea why the loss is = 0.

BATCH_SIZE is just an integer that you set. Since the image-text are in pairs, the first image will correspond to the first text. So the ground truth for the first image is 0, the second image will correspond to the second image, so the ground truth is 1. This pattern keeps repeating until the last image-text pair. So the ground truth is a torch tensor like this : torch.tensor([0,1,2,3,...,BATCH_SIZE-1]). Since the pre-trained CLIP use a massive batch size, just try to use the largest BATCH_SIZE as your system can take.

You can read more info about cross-entropy loss https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target. Also the CLIP paper, page 5, the upper left part.

Hello, I set the batch size to 64, and my total loss is always 4.02734375 during training. I don't know what the problem is; can anyone provide some insight?
