# Fine-tuning for Video Classification with 🤗 Transformers
### Abstract
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

https://arxiv.org/pdf/2103.15691

![image.png](vivit.png)


## Embeddings
### Uniform frame sampling 
A straightforward method of tokenising the input video is to uniformly sample $n_t$ frames from the input video clip, embed each 2D frame independently using the same method as ViT, and concatenate all these tokens together. Concretely, if $n_h \cdot n_w$ non-overlapping image patches are extracted from each frame, then a total of $n_t \cdot n_h \cdot n_w$ tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT.
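
As a quick sanity check on the token count, the arithmetic can be sketched in a few lines of Python. The function name and the example sizes (32 frames, 224x224 pixels, 16x16 patches) are assumptions chosen for illustration, not values taken from this notebook.

```python
def num_tokens_uniform(num_frames, height, width, patch_size):
    """Token count for uniform frame sampling: n_t * n_h * n_w."""
    n_h = height // patch_size   # patches along the frame height
    n_w = width // patch_size    # patches along the frame width
    return num_frames * n_h * n_w


# 32 frames, each split into 14 x 14 = 196 patches
print(num_tokens_uniform(num_frames=32, height=224, width=224, patch_size=16))  # 6272
```

The sequence length grows linearly with the number of sampled frames, which is why long clips quickly become expensive for self-attention.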

### Tubelet embedding
An alternative method is to extract non-overlapping, spatio-temporal “tubes” from the input volume and to linearly project each one to $\mathbb{R}^d$. This method is an extension of ViT’s embedding to 3D, and corresponds to a 3D convolution.
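
To make the idea concrete, here is a minimal NumPy sketch of tubelet embedding. It cuts a `(T, H, W, C)` video into non-overlapping `t x h x w` tubes, flattens each tube, and applies one shared linear projection, which is what a 3D convolution with kernel size equal to its stride computes. The function name, the tubelet size of 2x16x16, and the tiny embedding width `d = 8` are assumptions for illustration only; this is not the Hugging Face implementation.

```python
import numpy as np


def tubelet_embed(video, weight, bias, t=2, h=16, w=16):
    """Embed non-overlapping t x h x w tubelets of a (T, H, W, C) video.

    weight has shape (t * h * w * C, d) and bias has shape (d,).
    Returns an array of shape (n_t * n_h * n_w, d).
    """
    T, H, W, C = video.shape
    n_t, n_h, n_w = T // t, H // h, W // w
    # Crop any remainder so the volume divides evenly into tubelets.
    video = video[: n_t * t, : n_h * h, : n_w * w]
    # Split every axis into (number of tubelets, extent inside a tubelet).
    x = video.reshape(n_t, t, n_h, h, n_w, w, C)
    # Move the tubelet indices to the front, then flatten each tubelet.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(n_t * n_h * n_w, t * h * w * C)
    # One shared linear projection per tubelet.
    return x @ weight + bias


rng = np.random.default_rng(0)
video = rng.random((32, 224, 224, 3), dtype=np.float32)
d = 8  # illustrative embedding width
weight = rng.standard_normal((2 * 16 * 16 * 3, d)).astype(np.float32)
bias = np.zeros(d, dtype=np.float32)

tokens = tubelet_embed(video, weight, bias)
print(tokens.shape)  # (3136, 8)
```

With a temporal extent of 2, the same 32-frame clip yields 16 x 14 x 14 = 3136 tokens, half as many as embedding every frame separately.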

### Hugging Face ViViT
https://huggingface.co/docs/transformers/main/model_doc/vivit

# Dataset
https://paperswithcode.com/dataset/kinetics-400-1

# Download Dataset sayakpaul/ucf101-subset
#### Complete UCF101
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.

With 13320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc, it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories.

https://www.crcv.ucf.edu/research/data-sets/ucf101/

In [None]:
from huggingface_hub import hf_hub_download
import os
hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset", local_dir=".")
file_path

In [None]:
os.getcwd()

In [None]:
import tarfile
import os
with tarfile.open("UCF101_subset.tar.gz") as t:
     t.extractall("./data")

In [None]:
import torch  # used below for device selection and the optimizer
import wandb
from transformers import Trainer, TrainingArguments

from data_handling import frames_convert_and_create_dataset_dictionary
# Wildcard import kept as in the original; later cells may rely on names it exports.
from model_configuration import *
from model_configuration import initialise_model
from preprocessing import create_dataset

In [None]:
from dotenv import load_dotenv
import os
env_path =  ".env"
load_dotenv(env_path)

# Base Model

https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

### google/vivit-b-16x2-kinetics400

![image.png](models.png)


https://huggingface.co/docs/transformers/main/model_doc/vivit

In [None]:
import model_configuration
from model_configuration import compute_metrics
import cv2
import av
from data_handling import sample_frame_indices, read_video_pyav

In [None]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")

In [None]:
container.streams.video[0].frames

In [None]:
# moviepy.editor was removed in MoviePy 2.0, so pin to a 1.x release
! pip install "moviepy<2" -q

In [None]:
import moviepy.editor

In [None]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")
indices = sample_frame_indices(clip_len=50, frame_sample_rate=2,seg_len=container.streams.video[0].frames)
video = read_video_pyav(container=container, indices=indices)

In [None]:
indices

In [None]:
video.shape

In [None]:
# from importlib import reload
# reload(model_configuration)



In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

In [None]:
path_files = "data/UCF101_subset"
video_dict, class_labels = frames_convert_and_create_dataset_dictionary(path_files)


In [None]:
len(video_dict)

In [None]:
video_dict[0].keys()

In [None]:
video_dict[0]['video'].shape

In [None]:
video_dict[0]['labels']

In [None]:
num_frames, height, width, channels =  video_dict[0]['video'].shape
num_frames, height, width, channels 

# Display Video sample

In [None]:
# filename = "./tmp/saved.mp4"
# codec_id = "mp4v" # ID for a video codec.
# fourcc = cv2.VideoWriter_fourcc(*codec_id)
# out = cv2.VideoWriter(filename, fourcc=fourcc, fps=2, frameSize=(width, height))

# for frame in np.split(video_dict[0]['video'], num_frames, axis=0):
#     out.write(frame)

In [None]:
# container2 = av.open("./tmp/saved.mp4")
# moviepy.editor.ipython_display(container2.name)

In [None]:
class_labels = sorted(class_labels)
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")

In [None]:
shuffled_dataset = create_dataset(video_dict)

In [None]:
shuffled_dataset['train'].features

In [None]:

model = model_configuration.initialise_model(shuffled_dataset, device)

In [None]:
training_output_dir = "/tmp/results"
training_args = TrainingArguments(
    output_dir=training_output_dir,         
    num_train_epochs=3,             
    per_device_train_batch_size=2,   
    per_device_eval_batch_size=2,    
    learning_rate=5e-05,            
    weight_decay=0.01,              
    logging_dir="./logs",           
    logging_steps=10,                
    seed=42,                       
    eval_strategy="steps",    
    eval_steps=10,                   
    warmup_steps=int(0.1 * 20),      
    optim="adamw_torch",          
    lr_scheduler_type="linear",      
    fp16=torch.cuda.is_available(),  # mixed precision requires a CUDA device
    report_to="wandb"
)

In [None]:
wandb_key =  os.getenv("WANDB_API_KEY")
wandb.login(key=wandb_key)

PROJECT = "ViViT"
MODEL_NAME = "google/vivit-b-16x2-kinetics400"
DATASET = "sayakpaul/ucf101-subset"

wandb.init(project=PROJECT, # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes ="Fine tuning ViViT with ucf101-subset")

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08)
# Define the trainer
trainer = Trainer(
    model=model,                      
    args=training_args,              
    train_dataset=shuffled_dataset["train"],      
    eval_dataset=shuffled_dataset["test"],       
    optimizers=(optimizer, None),  
    compute_metrics = compute_metrics
)

In [None]:
with wandb.init(
    project=PROJECT,  # the project I am working on
    job_type="train",
    tags=[MODEL_NAME, DATASET],
    notes=f"Fine tuning {MODEL_NAME} with {DATASET}.",
):
    train_results = trainer.train()

In [None]:
trainer.save_model("model")
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

In [None]:
custom_path = "./model"

In [None]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("ViViT-Fine-tuned", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)


# Inference

In [None]:
path_files_val = "data/UCF_101_subset_val"
video_dict_val, class_labels_val = frames_convert_and_create_dataset_dictionary(path_files_val)

In [None]:
val_dataset = create_dataset(video_dict_val)

In [None]:
import wandb
run = wandb.init()
artifact = run.use_artifact('olonok69/ViViT/ViViT-Fine-tuned:v0', type='model')
artifact_dir = artifact.download()

In [None]:
artifact_dir

In [None]:
val_dataset

In [None]:
from data_handling import generate_all_files
import os
import numpy as np
import av
from pathlib import Path
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices
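
Note that `sample_frame_indices` calls `np.random.randint(converted_len, seg_len)`, which raises a `ValueError` when the video has no more than `clip_len * frame_sample_rate` frames. One possible way to guard against short clips is sketched below. The wrapper `sample_or_fallback` is a hypothetical helper that is not part of this notebook or of `data_handling`, and its even-spacing fallback is an assumption rather than the behaviour used for the results here.

```python
import numpy as np


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Same logic as the helper defined above, repeated so this cell is self-contained.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)


def sample_or_fallback(clip_len, frame_sample_rate, seg_len):
    """Hypothetical wrapper: fall back to even spacing for short videos."""
    if seg_len > clip_len * frame_sample_rate:
        return sample_frame_indices(clip_len, frame_sample_rate, seg_len)
    # Video too short for a random window: spread indices over all frames.
    # Indices may repeat when seg_len < clip_len.
    return np.linspace(0, seg_len - 1, num=clip_len).astype(np.int64)


short = sample_or_fallback(clip_len=10, frame_sample_rate=3, seg_len=20)
regular = sample_or_fallback(clip_len=10, frame_sample_rate=1, seg_len=164)
print(len(short), len(regular))  # 10 10
```

Both calls return exactly `clip_len` indices, so the downstream `image_processor` and model always receive the frame count they were configured for.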

In [None]:
labels = val_dataset['train'].features['labels'].names
config = VivitConfig.from_pretrained(artifact_dir)
config.num_labels = len(labels)  # the config field is num_labels, not num_classes
config.id2label = {str(i): c for i, c in enumerate(labels)}
config.label2id = {c: str(i) for i, c in enumerate(labels)}
config.num_frames=10
config.video_size= [10, 224, 224]

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
from transformers import VivitImageProcessor, VivitForVideoClassification

In [None]:
image_processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
fine_tune_model = VivitForVideoClassification.from_pretrained(artifact_dir,config=config)

In [None]:
directory =  "data/UCF_101_subset_val"

In [None]:
class_labels = []
true_labels=[]
predictions = []
predictions_labels = []
all_videos=[]
video_files= []
sizes = []
for p in generate_all_files(Path(directory), only_files=True):
    # Expected layout: <directory>/<train or test>/<class>/<file name>
    set_files, cls, file = Path(p).parts[-3:]
    #file name path
    file_name= os.path.join(directory, set_files, cls, file)
    true_labels.append(cls)   
    # Process class
    if cls not in class_labels:
        class_labels.append(cls)
    # process video File
    container = av.open(file_name)
    #print(f"Processing file {file_name} number of Frames: {container.streams.video[0].frames}")  
    indices = sample_frame_indices(clip_len=10, frame_sample_rate=1,seg_len=container.streams.video[0].frames)
    video = read_video_pyav(container=container, indices=indices)
    inputs = image_processor(list(video), return_tensors="pt")
    with torch.no_grad():
        outputs = fine_tune_model(**inputs)
        logits = outputs.logits

    # the fine-tuned model predicts one of the UCF101-subset classes
    predicted_label = logits.argmax(-1).item()
    prediction = fine_tune_model.config.id2label[str(predicted_label)]
    predictions.append(prediction)
    predictions_labels.append(predicted_label)
    print(f"file {file_name} True Label {cls}, predicted label {prediction}")

In [None]:
from sklearn.metrics import classification_report

In [None]:
report = classification_report(true_labels, predictions)
print(report)

In [None]:
file_name = "./tmp/6540601-uhd_2560_1440_25fps.mp4"
container = av.open(file_name)

In [None]:

#moviepy.editor.ipython_display(container.name)

In [None]:
indices = sample_frame_indices(clip_len=10, frame_sample_rate=3,seg_len=container.streams.video[0].frames)
print(f"Processing file {file_name} number of Frames: {container.streams.video[0].frames}")  
video = read_video_pyav(container=container, indices=indices)
inputs = image_processor(list(video), return_tensors="pt")

In [None]:

with torch.no_grad():
    outputs = fine_tune_model(**inputs)
    logits = outputs.logits

In [None]:
predicted_label = logits.argmax(-1).item()
prediction = fine_tune_model.config.id2label[str(predicted_label)]
prediction