# Convert and Optimize TimeSformer with OpenVINO™
TimeSformer (from Time-Space Transformer), proposed in [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095), is a convolution-free video classification model that relies only on self-attention and feedforward layers to extract features from frames. Each frame is split into patches, and the main innovation is "divided" space-time attention: within each block, temporal attention (across the same patch position in different frames) and spatial attention (across the patches of a single frame) are applied separately, one after the other. This lets TimeSformer capture both appearance and motion without the cost of joint attention over all patches of all frames.

The model variant used here is the base model fine-tuned on the Kinetics-400 dataset. The pretrained checkpoint is available on [Hugging Face](https://huggingface.co/facebook/timesformer-base-finetuned-k400).

The tutorial consists of the following steps:
- Validate the original model
- Convert PyTorch model to OpenVINO IR
- Validate converted model
- Prepare and run optimization pipeline
- Compare performance of the FP32 and quantized models
- Compare accuracy of the FP32 and quantized models

# Validating the Original Model

The first step is to download the pretrained model and run it on test data from the Kinetics-400 dataset. Because the model was fine-tuned on Kinetics-400, it can be expected to perform well on this data. The model is hosted on Hugging Face and can be downloaded directly from there.

In [None]:
# Download the model. It could also be fetched with wget and loaded via torch.load,
# but TimesformerForVideoClassification.from_pretrained is simpler, and the aim here
# is mainly to demonstrate model performance.
from transformers import TimesformerForVideoClassification
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")
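
To make the model's output concrete: a classification model like this returns one logit per class, and Hugging Face models expose the class names through `model.config.id2label`. The following is a minimal, dependency-free sketch of turning logits into top-k labels. The three-class `toy_id2label` dictionary and the logit values are made up for illustration, and `top_k_labels` is a hypothetical helper, not part of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Toy stand-ins (assumption): the real model has 400 classes in model.config.id2label.
toy_id2label = {0: "drinking beer", 1: "climbing tree", 2: "tai chi"}
toy_logits = [0.2, 2.5, -1.0]

for label, prob in top_k_labels(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the logits would come from the model's output for a preprocessed clip rather than from a hand-written list.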

The next step is to prepare the data preprocessing pipeline. This could be built with torchvision's `transforms.Compose`, but here we use `AutoImageProcessor` from `transformers`, which takes care of it for us. More about this class [here](https://huggingface.co/docs/transformers/v4.26.1/en/autoclass_tutorial).

In [None]:
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
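
For intuition, the core of what an image processor does to video frames can be sketched in NumPy: rescale pixel values to [0, 1], normalize each channel, and rearrange to (frames, channels, height, width). This is a simplified illustration, not the `AutoImageProcessor` implementation. It skips resizing and cropping, and the mean/std values and the 8-frame, 224x224 input are assumptions; the actual values are stored in the processor's configuration.

```python
import numpy as np

# Assumed normalization constants, for illustration only.
ASSUMED_MEAN = np.array([0.45, 0.45, 0.45], dtype=np.float32)
ASSUMED_STD = np.array([0.225, 0.225, 0.225], dtype=np.float32)

def normalize_frames(frames, mean=ASSUMED_MEAN, std=ASSUMED_STD):
    """frames: uint8 array of shape (T, H, W, 3). Returns float32 (T, 3, H, W)."""
    x = frames.astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - mean) / std                     # per-channel normalization (broadcast over last axis)
    return np.transpose(x, (0, 3, 1, 2))     # channels-last -> channels-first

# Random stand-in for 8 RGB frames of size 224x224 (assumed clip shape).
rng = np.random.default_rng(0)
dummy_frames = rng.integers(0, 256, size=(8, 224, 224, 3), dtype=np.uint8)
pixel_values = normalize_frames(dummy_frames)
print(pixel_values.shape, pixel_values.dtype)
```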

Now it is time to load the test split of the Kinetics-400 dataset.

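In [5]:
!pip install requests
import requests
url = "https://storage.googleapis.com/deepmind-media/Datasets/kinetics400.tar.gz"
# Stream the archive to disk in chunks instead of holding it all in memory
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('kinetics400.tar.gz', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)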
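
A large archive like this is worth downloading only once, so re-running the notebook should not fetch it again. The following is a minimal, standard-library-only sketch of that idea; `download_if_missing` is a hypothetical helper name, and writing to a temporary `.part` file first is one possible design choice so that an interrupted download is never mistaken for a complete one.

```python
import shutil
import urllib.request
from pathlib import Path

def download_if_missing(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` unless `dest` already exists. Returns True if downloaded."""
    dest = Path(dest)
    if dest.exists():
        return False
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    tmp.replace(dest)  # rename only after the download has completed
    return True

# Usage (commented out because the archive is large):
# download_if_missing(
#     "https://storage.googleapis.com/deepmind-media/Datasets/kinetics400.tar.gz",
#     "kinetics400.tar.gz",
# )
```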




In [7]:
# Extract the archive into the current directory
import tarfile
with tarfile.open('kinetics400.tar.gz') as archive:
    archive.extractall('./')
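
Tar archives can contain member paths that point outside the extraction directory, so it can be worth checking member names before extracting anything fetched from the network. The sketch below shows one basic way to do that; `safe_extract` is a hypothetical helper, it only checks member paths (not symlink targets), and newer Python versions also offer extraction filters in `tarfile` for the same purpose.

```python
import tarfile
from pathlib import Path

def safe_extract(archive_path, dest="."):
    """Extract a tar archive, refusing members that would land outside `dest`."""
    dest = Path(dest).resolve()
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            # A safe target is `dest` itself or something underneath it.
            if target != dest and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return dest

# Usage (assumes the archive has already been downloaded):
# safe_extract("kinetics400.tar.gz", "./")
```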


In [16]:
import pandas as pd
# Load the test annotations and drop the 'split' column
test_dataset = pd.read_csv('./kinetics400/test.csv').drop(columns='split')

In [17]:
test_dataset

| Unnamed: 0 | label | youtube_id | time_start | time_end |
|---|---|---|---|---|
| 0 | drinking beer | --6bJUbfpnQ | 17 | 27 |
| 1 | climbing tree | --8YXc8iCt8 | 2 | 12 |
| 2 | surfing water | --coBvtS-eQ | 57 | 67 |
| 3 | stomping grapes | --q6ElFyVq0 | 148 | 158 |
| 4 | tai chi | --q_mvQ8zP8 | 67 | 77 |
| ... | ... | ... | ... | ... |
| 34746 | catching or throwing softball | zzHsdlYe_5I | 11 | 21 |
| 34747 | skiing (not slalom or crosscountry) | zzl-3zkieiE | 436 | 446 |
| 34748 | jumping into pool | zzpqbqLllzA | 1 | 11 |
| 34749 | gargling | zzy_artj1B8 | 210 | 220 |


Now that the dataset is loaded, we need a function that downloads each video, cuts it to the annotated time range, preprocesses it, and passes it to the model.
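
Two small building blocks of such a function can be sketched without any video library: turning an annotation row into a download specification, and choosing which frames of the clip to keep. This is only an illustrative sketch. `clip_spec` and `sample_frame_indices` are hypothetical helpers, the choice of 8 evenly spaced frames is an assumption about the model's expected input, and the download and decoding steps themselves are omitted.

```python
def clip_spec(row):
    """Build a download spec from one annotation row (dict-like, with the CSV's columns)."""
    return {
        "url": f"https://www.youtube.com/watch?v={row['youtube_id']}",
        "start": int(row["time_start"]),
        "end": int(row["time_end"]),
        "label": row["label"],
    }

def sample_frame_indices(total_frames, num_frames=8):
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames."""
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_frames
    # Take the middle of each of the num_frames equal-length segments.
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(num_frames)]

# Example row copied from the table above; 250 frames is an assumed clip length.
row = {"label": "drinking beer", "youtube_id": "--6bJUbfpnQ", "time_start": 17, "time_end": 27}
spec = clip_spec(row)
indices = sample_frame_indices(total_frames=250, num_frames=8)
print(spec["url"], spec["start"], spec["end"])
print(indices)
```

Because `clip_spec` only indexes by column name, it should work equally on a plain dictionary and on a row of the pandas DataFrame.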