@lilisandy, Hello. I git-pulled the open_clip repository, edited the lines in src/open_clip/coca_model.py exactly as changed in Pull Request #710 by gpucce, and then ran `pip install -e .` in the repository's root directory to install it with the edits. This change made CoCa training succeed for me. Cheers.
I followed the example in the documentation to fine-tune CoCa. The parameters are the same as in the example, except that I also added the pretrained parameters:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 -m training.main --dataset-type "csv" --train-data "path/to/data/dir/train2014.csv" --csv-img-key "filepath" --csv-caption-key "title" --csv-separator "\t" --warmup 1000 --batch-size 128 --lr 1e-5 --wd 0.1 --epochs 2 --workers 4 --model "coca_ViT-L-14" --pretrained "mscoco_finetuned_laion2B-s13B-b90k" --report-to "wandb" --coca-contrastive-loss-weight 0 --coca-caption-loss-weight 1 --log-every-n-steps 100
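Since the command passes `--csv-separator "\t"`, the training file must actually be tab-separated, with column headers matching `--csv-img-key "filepath"` and `--csv-caption-key "title"`. A minimal sketch (hypothetical file name and rows, using only the stdlib `csv` module) of producing and re-reading such a file:

```python
import csv

# Hypothetical (filepath, caption) pairs; one row per caption, paths may repeat.
rows = [
    ("/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg",
     "A restaurant has modern wooden tables and chairs."),
    ("/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg",
     "A long restaurant table with rattan rounded back chairs."),
]

# Write a tab-separated file with the expected header row.
with open("train2014.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filepath", "title"])
    writer.writerows(rows)

# Read it back with the same delimiter to confirm the columns split correctly.
with open("train2014.csv", newline="") as f:
    parsed = list(csv.reader(f, delimiter="\t"))

print(parsed[0])  # ['filepath', 'title']
```

If the captions themselves contain tab characters, the columns will split incorrectly and the loader will see corrupted caption text, so it is worth re-parsing the file like this as a sanity check.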
The CSV dataset looks like this:
filepath title
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A restaurant has modern wooden tables and chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long restaurant table with rattan rounded back chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg a long table with a plant on top of it surrounded with wooden chairs
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long table with a flower arrangement in the middle for meetings
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A table is adorned with wooden chairs with blue accents.
However, the inference results of the trained model are not as expected. Below is my inference code:
import open_clip
import torch
from PIL import Image

# Load the fine-tuned checkpoint into the CoCa model.
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="path/to/model/epoch_1.pt",
    precision="amp"
)

im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode the generated token ids and strip the special tokens.
print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))
but the inference result is:
"turnpike turnpike turnpike turnpike parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway"
The loss curves for the two epochs are shown below. Convergence is very fast and the loss quickly drops to 0, so it seems something went wrong:
![image](https://private-user-images.githubusercontent.com/121156190/287700317-d7975a12-3ab6-476a-9c04-53fdfb4bd01f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE1MDg1ODIsIm5iZiI6MTcyMTUwODI4MiwicGF0aCI6Ii8xMjExNTYxOTAvMjg3NzAwMzE3LWQ3OTc1YTEyLTNhYjYtNDc2YS05YzA0LTUzZmRmYjRiZDAxZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzIwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyMFQyMDQ0NDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xOWI3ZjE4YjE4MmI2ZTEwZjA1ZDZlZDU3MGFjZDVmODRlMDhhNGU5YjM1OTg3ODExZTk1NTU4YTM5NWViOTJjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.0bp7h5X13ts1dVieCCjNpF7TeJZoEbRpATiht3rkylY)
![image](https://private-user-images.githubusercontent.com/121156190/287700537-b346a22c-66d2-4a82-8076-912a71df205c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE1MDg1ODIsIm5iZiI6MTcyMTUwODI4MiwicGF0aCI6Ii8xMjExNTYxOTAvMjg3NzAwNTM3LWIzNDZhMjJjLTY2ZDItNGE4Mi04MDc2LTkxMmE3MWRmMjA1Yy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzIwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyMFQyMDQ0NDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lNmViOGYzZjUyZWYxNTQ1ZGVjMjBkODYxYWFiZTM0NGI0ODEwMTBkMjJhN2ZjMzM3NGFiNzg0Yjc5N2YzNzQyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.Gxl24ZLj7Py-D-U3V8xpM-CSAPA1HTTRjKQlfASvNeE)
How should I modify this to get the correct result?