[X-CLIP] Video demo inference code #57
Thanks for your interest and suggestion. For now, you can try open-set recognition by loading the pretrained model and configuring your own dataset, including the videos and candidate categories. We will also provide a simple inference script later. |
Hi, for open-set video recognition, should I load the "Zero-shot", "Few-shot", or "Fully supervised" model? |
@fixedwater Hi, you should load the zero-shot pretrained model. |
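For reference, a minimal sketch of loading that checkpoint via the Hugging Face transformers port (the checkpoint name is taken from a later comment in this thread):

```python
from transformers import XCLIPModel, XCLIPProcessor

# zero-shot X-CLIP checkpoint on the Hugging Face Hub
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16-zero-shot")
processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16-zero-shot")
```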
Hi, I've got something which will make your life a bit easier ;) see the notebook at #61 (comment) |
Thanks @NielsRogge, the Colab notebook in that comment is super helpful! I wanted to check if anyone could verify something about this result from X-CLIP:

```python
inputs = processor(text=["playing sports", "eating spaghetti", "go shopping"], videos=list(video), return_tensors="pt", padding=True)

# forward pass
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_video.softmax(dim=1)
```

Presumably all text provided to the model this way also goes through the trained video-specific prompt generator before being scored against the video? I'm just trying to make sure I understand the semantics of these pre-trained models correctly. For example, in this case I would expect models fully supervised on Kinetics-400 to expect only succinct text labels as text input, and longer, more descriptive captions would be outside the training distribution. |
Yes, that's correct.
Yes, but note that the authors of X-CLIP started from the weights of OpenAI's CLIP model, which has seen 400 million (image, text) pairs. This allows the model to also work on longer text descriptions. It's basically a sort of fine-tuning of CLIP. |
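As a small illustration of that point, a hedged sketch: succinct labels and longer captions can be scored in the same call. This assumes `model`, `processor`, and `video` are set up as in the snippet above; the caption string is made up for this example:

```python
import torch

texts = [
    "eating spaghetti",                                     # succinct Kinetics-style label
    "a person sitting at a table eating a plate of pasta",  # longer CLIP-style caption (made up)
]
inputs = processor(text=texts, videos=list(video), return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_video.softmax(dim=1))  # probabilities over the two texts
```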
Hi @NielsRogge, thanks for the notebook.
and pass the NumPy array to
I get |
Hi, it depends on which model you're using. If you use https://huggingface.co/microsoft/xclip-base-patch16-zero-shot, then the number of frames should be 32 (as this model was trained on 32 frames per video, as seen here). You can also check this as follows:
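A minimal sketch of that check, assuming the Hugging Face transformers implementation (the frame count lives in the model's vision config):

```python
from transformers import XCLIPModel

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16-zero-shot")
print(model.config.vision_config.num_frames)  # should print 32 for this checkpoint
```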
|
Oh, somehow I missed that. Thanks for pointing it out! |
GPU is not working, is there any other way?

```python
model = XCLIPModel.from_pretrained(model_name).to('cuda:0')
```
|
What's the error message you're getting? |
Excuse me, is your simple inference code ready? :) |
I got an error message: segmentation fault (core dumped). This part doesn't work: `model = model.to(device)`. I want to use the GPU; it works fine on the CPU.

```python
import numpy as np
import torch
from decord import VideoReader, cpu
from transformers import XCLIPModel, XCLIPProcessor

np.random.seed(0)

def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # sample `clip_len` evenly spaced frame indices from a random window of the video
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices

vr = VideoReader(file_path, num_threads=1, ctx=cpu(0))  # `file_path` defined elsewhere

# sample 8 frames
vr.seek(0)
indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=len(vr))
video = vr.get_batch(indices).asnumpy()

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

#model_name = "microsoft/xclip-base-patch32-16-frames"
#model_name = "microsoft/xclip-base-patch32"
#model_name = "microsoft/xclip-base-patch16-kinetics-600"
model_name = "microsoft/xclip-large-patch14-kinetics-600"
model = XCLIPModel.from_pretrained(model_name)
model = model.to(device)  # <- this line segfaults
print("model load")

processor = XCLIPProcessor.from_pretrained(model_name)
# `k600_names` is the list of Kinetics-600 class names, defined elsewhere
inputs = processor(text=k600_names, videos=list(video), return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}  # inputs must be on the same device as the model

# forward pass
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_video.softmax(dim=1)

np.set_printoptions(suppress=True)
result = dict(zip(k600_names, probs[0].cpu().numpy().tolist()))  # move back to CPU before .numpy()
res_topk5 = sorted(result.items(), key=lambda item: item[1], reverse=True)[:5]
for i in res_topk5:
    print(i)
```
|
Hi, PyTorch places your model on the device in-place, so there's no need to reassign with `model = model.to(device)`; calling `model.to(device)` is enough. |
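For completeness, the inputs do need an explicit move, since tensors (unlike `nn.Module`s) are not moved in-place; a minimal sketch of the usual GPU inference pattern (standard PyTorch, not specific to X-CLIP):

```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)  # in-place for nn.Module; no reassignment needed

inputs = processor(text=k600_names, videos=list(video), return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}  # tensor .to() returns a new copy on `device`

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_video.softmax(dim=1).cpu()  # back to CPU for NumPy post-processing
```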
Hi, I used torch 1.11 + CUDA 10.2. I modified the code, but the same error message comes up. Is there any other reason?
|
@NielsRogge thank you, I just downgraded my torch version from 1.11.0 to 1.8.0 and GPU inference is working fine now :) |
Dear author:
Thanks for publishing your work, it's really insightful! I want to try out the open-set video recognition performance, so a simple inference example like this would be very helpful:
Thanks.