
[X-CLIP] Video demo inference code #57

Closed
dragen1860 opened this issue Aug 11, 2022 · 20 comments

@dragen1860

Dear authors,

Thanks for publishing your work, it's really insightful! I would like to try out the open-set video recognition performance, so a simple inference script would be very helpful.

Thanks.

@nbl97
Collaborator

nbl97 commented Aug 11, 2022

Thanks for your interest and suggestion. For now, you can try open-set recognition by loading the pretrained model and configuring your own dataset, including the videos and candidate categories. We will also provide a simple inference script later.

@fixedwater

> Thanks for your interest and suggestion. For now, you can try open-set recognition by loading the pretrained model and configuring your own dataset, including the videos and candidate categories. We will also provide a simple inference script later.

Hi, for open-set video recognition, should I load the "Zero-shot", "Few-shot", or "Fully-supervised" model?

@nbl97
Collaborator

nbl97 commented Aug 30, 2022

> Thanks for your interest and suggestion. For now, you can try open-set recognition by loading the pretrained model and configuring your own dataset, including the videos and candidate categories. We will also provide a simple inference script later.

> Hi, for open-set video recognition, should I load the "Zero-shot", "Few-shot", or "Fully-supervised" model?

@fixedwater Hi, you should load the zero-shot pretrained model.
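
For reference, open-set inference along these lines is also possible with the Hugging Face port of X-CLIP that comes up later in this thread. A minimal, untested sketch; the video path and candidate labels are placeholders for your own open vocabulary:

```python
import numpy as np
import torch
from decord import VideoReader, cpu
from transformers import XCLIPProcessor, XCLIPModel

# Zero-shot checkpoint mentioned further down in this thread.
model_name = "microsoft/xclip-base-patch16-zero-shot"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# Placeholder open-set categories; swap in your own candidate labels.
candidate_labels = ["playing sports", "eating spaghetti", "go shopping"]

# Uniformly sample as many frames as the checkpoint expects (32 here).
vr = VideoReader("my_video.mp4", num_threads=1, ctx=cpu(0))
num_frames = model.config.vision_config.num_frames
indices = np.linspace(0, len(vr) - 1, num=num_frames).astype(np.int64)
video = vr.get_batch(indices).asnumpy()

inputs = processor(text=candidate_labels, videos=list(video), return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(candidate_labels, probs[0].tolist())))
```

The `text` list acts as the open vocabulary: any set of candidate categories can be scored against the video without retraining.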

@fixedwater

fixedwater commented Aug 31, 2022

Thanks! I have a task to label my own dataset with predefined categories (around 800 of them), so it will probably be necessary to train on my own data for transfer. Is it enough to follow the training guide in your README.md (screenshot attached), or is there anything else I need to change?

[screenshot: training instructions from README.md]

@nbl97
Collaborator

nbl97 commented Aug 31, 2022

> Thanks! I have a task to label my own dataset with predefined categories (around 800 of them), so it will probably be necessary to train on my own data for transfer. Is it enough to follow the training guide in your README.md (screenshot attached), or is there anything else I need to change?

According to my understanding, you can follow the README.md to prepare your dataset and train the model : )

@fixedwater

> Thanks! I have a task to label my own dataset with predefined categories (around 800 of them), so it will probably be necessary to train on my own data for transfer. Is it enough to follow the training guide in your README.md (screenshot attached), or is there anything else I need to change?

> According to my understanding, you can follow the README.md to prepare your dataset and train the model : )

Perfect! Thanks for your amazing project.

@NielsRogge

Hi,

I've got something which will make your life a bit easier ;) see the notebook at #61 (comment)

@dribnet

dribnet commented Sep 20, 2022

Thanks @NielsRogge - the colab notebook in that comment is super helpful!

I wanted to check if anyone could verify something about this result from the X-CLIP processor.

```python
inputs = processor(text=["playing sports", "eating spaghetti", "go shopping"], videos=list(video), return_tensors="pt", padding=True)

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
```

Presumably all text provided to the model in this way is also going through the trained video-specific prompt generator before being scored against the video?

Just trying to make sure I understand the semantics of these pre-trained models correctly. For example, in this case I would expect models fully supervised on Kinetics-400 to expect only succinct text labels as input, with longer, more descriptive captions falling outside the training distribution.

@NielsRogge

> Presumably all text provided to the model in this way is also going through the trained video-specific prompt generator before being scored against the video?

Yes, that's correct.

> For example, in this case I would expect models fully supervised on Kinetics-400 to expect only succinct text labels as input, with longer, more descriptive captions falling outside the training distribution.

Yes, but note that the authors of X-CLIP started from the weights of OpenAI's CLIP model, which has seen 400 million (image, text) pairs. This allows the model to also work on longer text descriptions. It's basically a sort of fine-tuning of CLIP.
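
For example, you can score a succinct label and a longer caption side by side and compare the probabilities. A quick, untested sketch, reusing the `video` frames from the notebook snippet above; the caption wording is just an illustration:

```python
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

texts = [
    "eating spaghetti",                                         # succinct Kinetics-style label
    "a person eating a plate of spaghetti at a kitchen table",  # longer CLIP-style caption
]
inputs = processor(text=texts, videos=list(video), return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits_per_video.softmax(dim=1))  # one probability per text prompt
```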

@RaiAmanRai

RaiAmanRai commented Sep 27, 2022

> Hi,
>
> I've got something which will make your life a bit easier ;) see the notebook at #61 (comment)

Hi @NielsRogge, thanks for the notebook.
I was playing around with it and noticed I cannot pass more than 8 frames to the model:

```python
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices

vr = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 17 frames
vr.seek(0)
indices = sample_frame_indices(clip_len=17, frame_sample_rate=1, seg_len=len(vr))
video = vr.get_batch(indices).asnumpy()
print(video.shape)
```

```
>>> (17, 360, 640, 3)
```

and pass the NumPy array to

```python
model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

inputs = processor(text=["reading", "writing", "inspecting"], videos=list(video), return_tensors="pt", padding=True)

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
probs
```

I get `RuntimeError: shape '[2, 8, 768]' is invalid for input of size 13056`.

@NielsRogge

Hi,

It depends on which model you're using. If you use https://huggingface.co/microsoft/xclip-base-patch16-zero-shot, then the number of frames should be 32 (as this model was trained on 32 frames per video as seen here).

You can also check this as follows:

```python
from transformers import XCLIPModel

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32")
print(model.config.vision_config.num_frames)
```
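
So a simple way to avoid the shape mismatch is to read the expected frame count from the config and sample exactly that many frames. A small, untested sketch, reusing `sample_frame_indices` and `vr` from the snippet above:

```python
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# Sample exactly as many frames as the vision backbone was trained with (8 here).
num_frames = model.config.vision_config.num_frames
indices = sample_frame_indices(clip_len=num_frames, frame_sample_rate=1, seg_len=len(vr))
video = vr.get_batch(indices).asnumpy()
```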

@RaiAmanRai

Oh, somehow I missed that. Thanks for pointing it out!

@nbl97 nbl97 closed this as completed Sep 27, 2022
@Hwijune

Hwijune commented Sep 30, 2022

The GPU is not working; is there any other way?

```python
model = XCLIPModel.from_pretrained(model_name).to('cuda:0')
```

@NielsRogge

What's the error message you're getting?

@opentld

opentld commented Oct 10, 2022

> Thanks for your interest and suggestion. For now, you can try open-set recognition by loading the pretrained model and configuring your own dataset, including the videos and candidate categories. We will also provide a simple inference script later.

Excuse me, is your simple inference code ready? :)

@Hwijune

Hwijune commented Oct 19, 2022

> What's the error message you're getting?

I get the error message `segmentation fault (core dumped)`.

This part doesn't work: `model = model.to(device)`

I want to use the GPU. It works fine on the CPU.

```python
np.random.seed(0)

def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices

vr = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 8 frames
vr.seek(0)
indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=len(vr))
video = vr.get_batch(indices).asnumpy()

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

#model_name = "microsoft/xclip-base-patch32-16-frames"
#model_name = "microsoft/xclip-base-patch32"
#model_name = "microsoft/xclip-base-patch16-kinetics-600"
model_name = "microsoft/xclip-large-patch14-kinetics-600"

model = XCLIPModel.from_pretrained(model_name)
model = model.to(device)
print("model load")

processor = XCLIPProcessor.from_pretrained(model_name)
inputs = processor(text=k600_names, videos=list(video), return_tensors="pt", padding=True)

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
np.set_printoptions(suppress=True)

result = dict(zip(k600_names, probs[0].numpy().tolist()))

res_topk5 = sorted(result.items(), key = lambda item: item[1], reverse = True)[:5]
for i in res_topk5:
    print(i)
```

@NielsRogge

Hi,

PyTorch places your model on the device in-place, no need to do `model = model.to(device)`, just `model.to(device)` is enough.
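
Also note that for GPU inference the processor outputs have to be on the same device as the model, otherwise the forward pass fails with a device mismatch. A minimal, untested sketch, reusing `video` and `k600_names` from the snippet above:

```python
import torch
from transformers import XCLIPProcessor, XCLIPModel

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "microsoft/xclip-large-patch14-kinetics-600"

processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)
model.to(device)   # moves the model in-place
model.eval()

inputs = processor(text=k600_names, videos=list(video), return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}  # move input tensors to the same device

with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
print(probs[0].cpu().numpy())  # bring results back to the CPU before converting to NumPy
```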

@Hwijune

Hwijune commented Oct 19, 2022

> Hi,
>
> PyTorch places your model on the device in-place, no need to do `model = model.to(device)`, just `model.to(device)` is enough.

Hi,

I am using torch 1.11 + CUDA 10.2.

I modified the code, but the same error message comes out. Is there any other reason?

```python
model = XCLIPModel.from_pretrained(model_name)
model.to(device)
print("model load")

processor = XCLIPProcessor.from_pretrained(model_name)
inputs = processor(text=k600_names, videos=list(video), return_tensors="pt", padding=True)

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_video.softmax(dim=1)
```

@Hwijune

Hwijune commented Oct 19, 2022

> I get the error message `segmentation fault (core dumped)`.
>
> This part doesn't work: `model = model.to(device)`

@NielsRogge thank you!

I just downgraded my torch version from 1.11.0 to 1.8.0, and GPU inference works fine now. :)

@zyhzyh88

Dear authors,

Thanks for your promising work. We followed your code for the zero-shot setting on UCF-101, where the test set has only one category. However, as training progresses, the test performance gradually decreases starting from the 1st epoch (our training log is attached). We would appreciate your help. Thank you!
log_rank0.txt
