the inference results of finetuned coca model is not as expected #751

Closed
lilisandy opened this issue Nov 29, 2023 · 2 comments

lilisandy commented Nov 29, 2023

I used the example from the documentation to fine-tune CoCa. The parameters are the same as in the example, except that I also added the pretrained weights:

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 -m training.main \
    --dataset-type "csv" \
    --train-data "path/to/data/dir/train2014.csv" \
    --csv-img-key "filepath" \
    --csv-caption-key "title" \
    --csv-separator "\t" \
    --warmup 1000 \
    --batch-size 128 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 2 \
    --workers 4 \
    --model "coca_ViT-L-14" \
    --pretrained "mscoco_finetuned_laion2B-s13B-b90k" \
    --report-to "wandb" \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --log-every-n-steps 100

An example of the CSV dataset (tab-separated):
filepath title
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A restaurant has modern wooden tables and chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long restaurant table with rattan rounded back chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg a long table with a plant on top of it surrounded with wooden chairs
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long table with a flower arrangement in the middle for meetings
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A table is adorned with wooden chairs with blue accents.
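
Before training on a file like this, it can help to confirm that it really parses with the separator and column names passed on the command line. The following is a minimal sketch using only the Python standard library; `check_caption_csv` is a hypothetical helper, not part of open_clip, and it assumes the file has a header row and plain tab-separated fields.

```python
import csv
import os


def check_caption_csv(path, img_key="filepath", caption_key="title", sep="\t"):
    """Return a list of human-readable problems found in a caption CSV.

    Hypothetical helper: the default keys and separator mirror the
    --csv-img-key, --csv-caption-key and --csv-separator values used above.
    """
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=sep)
        fields = reader.fieldnames or []
        # Both columns must exist, otherwise the separator or keys are wrong.
        for key in (img_key, caption_key):
            if key not in fields:
                problems.append(f"missing column: {key!r}")
        if problems:
            return problems
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(reader, start=2):
            img = (row[img_key] or "").strip()
            caption = (row[caption_key] or "").strip()
            if not caption:
                problems.append(f"row {row_number}: empty caption")
            if not os.path.isfile(img):
                problems.append(f"row {row_number}: image not found: {img}")
    return problems
```

Calling `check_caption_csv("path/to/data/dir/train2014.csv")` returns an empty list when every row has a caption and an image file that exists on disk. It does not check whether the images can be decoded.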

However, the inference results of the trained model are not as expected. Below is my inference code:

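import open_clip
import torch
from PIL import Image

# Load the fine-tuned checkpoint together with the matching image transform.
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="path/to/model/epoch_1.pt",
    precision="amp",
)

# Preprocess a single image and add a batch dimension.
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption tokens without tracking gradients.
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode the tokens and strip the special start/end markers.
print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))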

But the inference result is:
"turnpike turnpike turnpike turnpike parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway"

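A degenerate caption like the one above can be flagged automatically when checking many images or checkpoints. The sketch below is one simple way to do that in plain Python; `clean_caption` and `distinct_word_ratio` are hypothetical helper names, and treating a low ratio as a sign of collapse is an assumption, not an open_clip feature.

```python
def clean_caption(decoded):
    """Strip special tokens as the inference snippet does, then trim whitespace."""
    return decoded.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()


def distinct_word_ratio(caption):
    """Fraction of words in the caption that are unique (0.0 for an empty caption)."""
    words = caption.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

A caption that repeats only a couple of words many times gives a ratio close to 0, while an ordinary COCO-style sentence is usually much closer to 1. An empty caption gives 0.0, so the same check also catches empty predictions.
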
The loss curves for the two epochs are shown below. The loss converges very quickly and drops to 0, so it seems something went wrong:
[image: loss curve]
[image: loss curve]

How should I modify this to get correct results?

@Thomas2419

I am also having problems with CoCa training. With enough additional training, the models produce empty output for the caption predictions.

@Thomas2419

@lilisandy, Hello. I pulled the open_clip repository, applied to src/open_clip/coca_model.py exactly the line-by-line changes from Pull Request #710 by gpucce, and then ran pip install -e . in the repository's main directory to install the edited version. This change made CoCa training succeed for me. Cheers.
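
After an editable install like this, it is worth confirming that Python actually imports the edited copy rather than an older installed package. A minimal sketch using the standard library follows; `module_origin` is a hypothetical helper name, and it only handles top-level module names.

```python
import importlib.util


def module_origin(name):
    """Return the file path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The module is not installed in this environment.
        return None
    return spec.origin
```

If the editable install worked, `module_origin("open_clip")` should point inside the local checkout (under src/open_clip) rather than into site-packages.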
