While looking for a way to work with the Kinetics dataset, I found a very interesting framework, [FiftyOne](https://docs.voxel51.com/). Let's try to use it for this task. Unfortunately, it still has a lot of bugs, and I could not reproduce or adapt the [guide](https://medium.com/voxel51/the-kinetics-dataset-train-and-evaluate-video-classification-models-1d26e699a9e7) on loading and classifying videos through Flash (PyTorch Lightning) for our task. So we will use the framework only to download the data we need.

# Imports

In [1]:
import os
import random
import time
import warnings



import torch
import timm
import pandas as pd
import numpy as np
import fiftyone.zoo as foz
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
import torch.nn.functional as F
import albumentations as A
from albumentations.pytorch.transforms import ToTensorV2

from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video
warnings.simplefilter("ignore", UserWarning)



In [2]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device

'cuda'

In [3]:
BATCH_SIZE = 8
EPOCHS = 10
SEED = 42

In [4]:
# Fix the random seeds
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
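    # Note: these cuDNN settings favor speed, so runs are not guaranteed
    # to be bit-for-bit reproducible even with fixed seeds
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True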
    
set_seed(seed=SEED)

# Building the dataset

First we need the list of classes whose name contains the word **dancing**

In [5]:
df = pd.read_csv("train.csv")
df = df[df['label'].str.contains('dancing')]
class_values = df['label'].unique().tolist()
class_values

['tap dancing',
 'breakdancing',
 'belly dancing',
 'dancing charleston',
 'dancing ballet',
 'square dancing',
 'jumpstyle dancing',
 'salsa dancing',
 'robot dancing',
 'country line dancing',
 'dancing macarena',
 'mosh pit dancing',
 'dancing gangnam style',
 'swing dancing',
 'tango dancing']

Now let's download the videos. Since there are only 15 classes, we download roughly 20 videos per class for training and roughly 3 videos per class for validation. That gives 296 and 48; the numbers are chosen to be exact multiples of the batch size.
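The divisibility claim is easy to check. The sketch below uses a hypothetical helper, `batches_per_epoch` (not part of the notebook), to show how many full batches each split yields at a batch size of 8. The resulting counts, 37 and 6, match the progress bars in the training logs further down.

```python
def batches_per_epoch(num_samples, batch_size):
    """Return (number of full batches, size of the leftover partial batch)."""
    return divmod(num_samples, batch_size)


BATCH_SIZE = 8  # same value as in the notebook

train_batches = batches_per_epoch(296, BATCH_SIZE)   # (37, 0)
valid_batches = batches_per_epoch(48, BATCH_SIZE)    # (6, 0)

# For comparison: exactly 20 videos per class would leave a partial batch.
naive_batches = batches_per_epoch(15 * 20, BATCH_SIZE)  # (37, 4)

print(train_batches, valid_batches, naive_batches)
```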

In [6]:
# Load Kinetics
dataset_train = foz.load_zoo_dataset(
    "kinetics-700-2020",
    dataset_dir = "videos",
    split="train",
    classes=class_values,
    max_samples=296,
    num_workers = -1,
    shuffle = True,
)

Downloading split 'train' to 'videos\train' if necessary
Existing download of split 'train' is sufficient
Loading existing dataset 'kinetics-700-2020-train-296'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


On the first run this cell raises an error, but it still downloads all the required videos. Once the videos are downloaded, re-running the cell no longer raises an error.

In [7]:
dataset_valid = foz.load_zoo_dataset(
    "kinetics-700-2020",
    dataset_dir = "videos",
    split="validation",
    classes=class_values,
    max_samples=48,
    num_workers = -1,
    shuffle = True,
)

Downloading split 'validation' to 'videos\validation' if necessary
Existing download of split 'validation' is sufficient
Loading existing dataset 'kinetics-700-2020-validation-48'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


Now let's build a dataframe with the following columns: **'path', 'class_str', 'target'**

In [8]:
train_dir = os.path.join('videos', 'train')
valid_dir = os.path.join('videos', 'validation')

In [9]:
def create_df(folder_name):
    # first collect the file paths and the accompanying label info
    all_paths = []
    all_labels = []
    all_targets = []
    folders = sorted([f for f in os.listdir(folder_name) if os.path.isdir(os.path.join(folder_name, f))])

    for i, folder in enumerate(folders):
        temp_paths = [os.path.join(folder_name, folder, f) for f 
                      in os.listdir(os.path.join(folder_name, folder))]
        
        all_paths += temp_paths
        all_labels += [str(folder)] * len(temp_paths)
        all_targets += [i] * len(temp_paths)
       
    # build the dataframe
    df = pd.DataFrame({'path': all_paths,
                       'class_str': all_labels,
                       'target': all_targets})

    return df

In [10]:
df_train = create_df(train_dir)
df_valid = create_df(valid_dir)
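One thing to watch: `create_df` numbers the class folders separately for each split. If the validation folder happened to lack one of the 15 classes (the notebook does not show whether it does), every class after the gap would get a different `target` than in training. A minimal sketch of a safer variant builds the mapping once from the training folders and reuses it. The helper names `build_class_index` and `list_samples` and the toy folder layout are assumptions made for illustration.

```python
import os
import tempfile


def build_class_index(train_root):
    """Map each class folder name in the training split to an integer target."""
    classes = sorted(
        d for d in os.listdir(train_root)
        if os.path.isdir(os.path.join(train_root, d))
    )
    return {name: i for i, name in enumerate(classes)}


def list_samples(root, class_to_idx):
    """Return (path, class_str, target) rows using a shared class mapping."""
    rows = []
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        target = class_to_idx[name]  # KeyError if the class is unknown
        for fname in sorted(os.listdir(folder)):
            rows.append((os.path.join(folder, fname), name, target))
    return rows


# Toy layout: the validation split is missing "salsa dancing".
layout = {
    "train": ["breakdancing", "salsa dancing", "tap dancing"],
    "validation": ["breakdancing", "tap dancing"],
}
with tempfile.TemporaryDirectory() as tmp:
    for split, classes in layout.items():
        for cls in classes:
            os.makedirs(os.path.join(tmp, split, cls))
            open(os.path.join(tmp, split, cls, "clip.mp4"), "w").close()

    class_to_idx = build_class_index(os.path.join(tmp, "train"))
    valid_rows = list_samples(os.path.join(tmp, "validation"), class_to_idx)
    valid_targets = [target for _, _, target in valid_rows]

# "tap dancing" keeps target 2; per-split numbering would have given it 1.
print(class_to_idx, valid_targets)
```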

In [11]:
df_train.tail()

| | path | class_str | target |
|---|---|---|---|
| 291 | videos\train\tap dancing\KwvMol8NsZQ_000035_00... | tap dancing | 14 |
| 292 | videos\train\tap dancing\lR_t-6WxR_g_000070_00... | tap dancing | 14 |
| 293 | videos\train\tap dancing\TlOXXxGDJKA_000016_00... | tap dancing | 14 |
| 294 | videos\train\tap dancing\TTg2eg_lJ-o_000011_00... | tap dancing | 14 |
| 295 | videos\train\tap dancing\yEoEK_KskJ4_000037_00... | tap dancing | 14 |


Now let's create the Dataset

In [12]:
class DanceDataset(Dataset):
    def __init__(self, df):
        self.df = df
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.loc[idx]
        video_path = row['path']
        target = row['target']

        video, audio, info = read_video(video_path, pts_unit="sec")
        # Keep only a subset of the frames to reduce computation
        if len(video) > 0:
            if len(video) < 128:
                video = video[:32] 
            else:
                video = video[:128:4]

            video = video.float()  # uint8 frames (T, H, W, C) -> float32
            resize_transform = transforms.Resize((112, 112))
            video_resized = torch.stack([resize_transform(frame.permute(2, 0, 1)).permute(1, 2, 0) for frame in video])
            video_normalized = video_resized.permute(3, 0, 1, 2)
            tensor_3d = video_normalized / 255 
        else:
            # Unreadable video: return a zero clip instead of uninitialized memory
            tensor_3d = torch.zeros(3, 32, 112, 112)
            
        label = torch.tensor(target).long()
        
        return tensor_3d, label
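The frame selection in `__getitem__` takes every 4th frame of the first 128, or the first 32 frames of a shorter video. A video with fewer than 32 frames would therefore yield a shorter clip, which the default `DataLoader` collation cannot stack with 32-frame clips. Whether such videos occur in this subset is not shown. The sketch below reproduces the same selection rule on indices only and pads short videos by repeating the last frame. The helper `sample_frame_indices` and the padding strategy are assumptions, not part of the notebook.

```python
def sample_frame_indices(num_frames, clip_len=32, stride=4):
    """Pick `clip_len` frame indices following the notebook's rule.

    - at least clip_len * stride frames: every `stride`-th frame of the first
      clip_len * stride frames
    - fewer frames: the first `clip_len` frames, padded by repeating the
      last available frame when the video is shorter than `clip_len`
    """
    if num_frames <= 0:
        return []  # caller decides what to do with an unreadable video
    if num_frames >= clip_len * stride:
        indices = list(range(0, clip_len * stride, stride))
    else:
        indices = list(range(min(num_frames, clip_len)))
    # Pad so that every clip has exactly clip_len frames.
    indices += [indices[-1]] * (clip_len - len(indices))
    return indices


long_clip = sample_frame_indices(300)   # 0, 4, 8, ..., 124
mid_clip = sample_frame_indices(100)    # 0, 1, ..., 31
short_clip = sample_frame_indices(10)   # 0..9, then index 9 repeated
print(len(long_clip), len(mid_clip), len(short_clip))
```

With a video tensor of shape `(T, H, W, C)`, the indices can be applied as `video[indices]`, so every clip ends up with the same length.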

In [13]:
dataset_train = DanceDataset(df_train.reset_index(drop=True))
dataset_test = DanceDataset(df_valid.reset_index(drop=True))

train_loader = DataLoader(dataset_train,
                          batch_size=BATCH_SIZE,
                          shuffle=True)
valid_loader = DataLoader(dataset_test, batch_size=BATCH_SIZE)

# Building and training the model

In [14]:
model = models.video.r3d_18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 15)
model.to(device);

In [15]:
loss_f = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)

In [16]:
def train(model, optimizer, train_loader, valid_loader):
    start = time.time()
    for epoch_i in range(1, EPOCHS + 1):

        print(f'---------------------epoch:{epoch_i}/{EPOCHS}---------------------')

        # loss
        avg_train_loss = 0
        avg_val_loss = 0
        summa = 0

        ############## Train #############
        model.train()
        train_pbar = tqdm(train_loader, desc="Training")
        for X, y in (train_pbar):
            X_batch = X.to(device)
            y_batch = y.to(device)

            optimizer.zero_grad()
            res = model.forward(X_batch)
        
            loss = loss_f(res, y_batch)

            if torch.cuda.is_available():
                train_pbar.set_postfix(gpu_load=f"{torch.cuda.memory_allocated() / 1024 ** 3:.2f}GB",
                                    loss=f"{loss.item():.4f}")
            else:
                train_pbar.set_postfix(loss=f"{loss.item():.4f}")

            loss.backward()
            optimizer.step()
            # .item() keeps a plain float, so no graph or GPU tensor is retained
            avg_train_loss += loss.item() * len(y_batch)
            
            del X, res

        ########## VALIDATION ###############
        model.eval()
        valid_pbar = tqdm(valid_loader, desc="Testing")
        with torch.no_grad():
            for X, y in (valid_pbar):
                X_batch = X.to(device)
                y_batch = y.to(device)

                res = model.forward(X_batch)
                
                loss = loss_f(res, y_batch)
                avg_val_loss += loss.item() * len(y_batch)
                valid_pbar.set_postfix(loss=f"{loss.item():.4f}")

                res = res.detach().cpu()
                y_batch = y_batch.cpu()
                
                preds = torch.max(F.softmax(res, dim=1), dim=1)
                correct= torch.eq(preds[1], y_batch)
                summa += torch.sum(correct).item()

                del X, res
                

        torch.cuda.empty_cache()

        avg_train_loss = avg_train_loss / len(dataset_train)
        avg_val_loss = avg_val_loss / len(dataset_test)
        acc = summa / len(dataset_test)

        print(f'Epoch: {epoch_i}, lr_rate {optimizer.param_groups[0]["lr"]}')

        print("Loss_train: %0.4f| Loss_valid: %0.4f|" % (avg_train_loss, avg_val_loss))
        print("ACC:", acc)

        torch.save(model, f"model_ep_{epoch_i}.pt")

    elapsed_time = time.time() - start
    hours = int(elapsed_time // 3600)
    minutes = int((elapsed_time % 3600) // 60)
    seconds = int(elapsed_time % 60)
    print(f"Elapsed total time: {hours:02d}:{minutes:02d}:{seconds:02d}")

    return acc

In [20]:
model1_acc = train(model, optimizer, train_loader, valid_loader)
model1_acc

---------------------epoch:1/10---------------------


Training: 100%|██████████| 37/37 [09:44<00:00, 15.80s/it, gpu_load=2.91GB, loss=1.7112]
Testing: 100%|██████████| 6/6 [01:20<00:00, 13.49s/it, loss=2.4026]


Epoch: 1, lr_rate 0.0001
Loss_train: 2.4356| Loss_valid: 2.2392|
ACC: 0.25
---------------------epoch:2/10---------------------


Training: 100%|██████████| 37/37 [08:30<00:00, 13.81s/it, gpu_load=2.91GB, loss=1.1985]
Testing: 100%|██████████| 6/6 [01:20<00:00, 13.35s/it, loss=1.6802]


Epoch: 2, lr_rate 0.0001
Loss_train: 1.0702| Loss_valid: 2.0141|
ACC: 0.375
---------------------epoch:3/10---------------------


Training: 100%|██████████| 37/37 [08:11<00:00, 13.29s/it, gpu_load=2.91GB, loss=0.4402]
Testing: 100%|██████████| 6/6 [01:16<00:00, 12.75s/it, loss=1.8313]


Epoch: 3, lr_rate 0.0001
Loss_train: 0.4797| Loss_valid: 2.0387|
ACC: 0.3333333333333333
---------------------epoch:4/10---------------------


Training: 100%|██████████| 37/37 [08:06<00:00, 13.16s/it, gpu_load=2.91GB, loss=0.1970]
Testing: 100%|██████████| 6/6 [01:22<00:00, 13.69s/it, loss=1.7125]


Epoch: 4, lr_rate 0.0001
Loss_train: 0.2187| Loss_valid: 1.9436|
ACC: 0.375
---------------------epoch:5/10---------------------


Training: 100%|██████████| 37/37 [09:01<00:00, 14.63s/it, gpu_load=2.91GB, loss=0.1351]
Testing: 100%|██████████| 6/6 [01:19<00:00, 13.19s/it, loss=1.6633]


Epoch: 5, lr_rate 0.0001
Loss_train: 0.1197| Loss_valid: 2.0018|
ACC: 0.4166666666666667
---------------------epoch:6/10---------------------


Training: 100%|██████████| 37/37 [08:22<00:00, 13.58s/it, gpu_load=2.91GB, loss=0.0807]
Testing: 100%|██████████| 6/6 [01:15<00:00, 12.58s/it, loss=1.9010]


Epoch: 6, lr_rate 0.0001
Loss_train: 0.1016| Loss_valid: 2.0096|
ACC: 0.3125
---------------------epoch:7/10---------------------


Training: 100%|██████████| 37/37 [08:25<00:00, 13.65s/it, gpu_load=2.91GB, loss=0.0449]
Testing: 100%|██████████| 6/6 [01:18<00:00, 13.05s/it, loss=1.8808]


Epoch: 7, lr_rate 0.0001
Loss_train: 0.0705| Loss_valid: 2.0001|
ACC: 0.3958333333333333
---------------------epoch:8/10---------------------


Training: 100%|██████████| 37/37 [08:29<00:00, 13.77s/it, gpu_load=2.91GB, loss=0.0434]
Testing: 100%|██████████| 6/6 [01:16<00:00, 12.76s/it, loss=2.2138]


Epoch: 8, lr_rate 0.0001
Loss_train: 0.0631| Loss_valid: 2.0827|
ACC: 0.3541666666666667
---------------------epoch:9/10---------------------


Training: 100%|██████████| 37/37 [08:32<00:00, 13.85s/it, gpu_load=2.91GB, loss=0.0359]
Testing: 100%|██████████| 6/6 [01:20<00:00, 13.42s/it, loss=2.1420]


Epoch: 9, lr_rate 0.0001
Loss_train: 0.0413| Loss_valid: 2.0356|
ACC: 0.25
---------------------epoch:10/10---------------------


Training: 100%|██████████| 37/37 [08:26<00:00, 13.70s/it, gpu_load=2.91GB, loss=0.0171]
Testing: 100%|██████████| 6/6 [01:18<00:00, 13.16s/it, loss=2.1276]


Epoch: 10, lr_rate 0.0001
Loss_train: 0.0404| Loss_valid: 2.0524|
ACC: 0.3125
Elapsed total time: 01:39:04


0.3125
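Note that `train` returns the accuracy of the last epoch only, while the log above peaks earlier. Since a checkpoint is saved after every epoch as `model_ep_{epoch_i}.pt`, the best epoch can be looked up afterwards. A small sketch, with the accuracies copied (rounded) from the log above and a hypothetical helper `best_epoch`:

```python
# Validation accuracy per epoch for r3d_18, copied from the log above.
r3d_acc = [0.25, 0.375, 0.3333, 0.375, 0.4167,
           0.3125, 0.3958, 0.3542, 0.25, 0.3125]


def best_epoch(accuracies):
    """Return (1-based epoch number, accuracy) of the best epoch.

    Ties are resolved in favor of the earliest epoch.
    """
    best_i = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best_i + 1, accuracies[best_i]


epoch, acc = best_epoch(r3d_acc)
checkpoint = f"model_ep_{epoch}.pt"  # file name pattern used by train()
print(epoch, acc, checkpoint)
```

Picking the epoch on the same 48 validation videos makes the selected number optimistic, so it is better treated as a checkpoint-selection rule than as a final score.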

---

Now let's compare this metric against another model, which we also train first. It is trained on individual frames, and the model is `tf_efficientnetv2_s_in21k`.

In [17]:
class DanceImgDataset(Dataset):
    def __init__(self, df):
        self.df = df
        self.aug =  A.Compose([
            A.Resize(height=224, width=224, always_apply=True),
            A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225],),
            ToTensorV2(),
        ])
            
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.loc[idx]
        video_path = row['path']
        target = row['target']

        video, audio, info = read_video(video_path, pts_unit="sec")
        # Take a random frame
        if len(video) > 0:
            total_frames = video.shape[0]
            random_frame_index = torch.randint(0, total_frames, (1,)).item()
            random_frame = video[random_frame_index].numpy()
            frame_with_aug = self.aug(image=random_frame)['image']
            
        else:
            random_frame = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8).numpy()
            frame_with_aug = self.aug(image=random_frame)['image']
            
        label = torch.tensor(target).long()
        
        return frame_with_aug, label
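Because `__getitem__` draws a random frame on every call, the validation set also changes from epoch to epoch, which adds noise to the reported accuracy. One possible remedy, sketched below under the assumption that the dataset knows whether it is used for training, is to keep random sampling for training and use a fixed frame (here the middle one) for validation. The helper `pick_frame_index` and its `training` flag are hypothetical.

```python
import random


def pick_frame_index(total_frames, training, rng=random):
    """Choose which frame of a video to use.

    Training: a uniformly random frame, acting as a light augmentation.
    Validation: always the middle frame, so every epoch sees the same image.
    """
    if total_frames <= 0:
        raise ValueError("video has no frames")
    if training:
        return rng.randrange(total_frames)
    return total_frames // 2


valid_index = pick_frame_index(101, training=False)
train_index = pick_frame_index(101, training=True)
print(valid_index, train_index)
```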

In [18]:
dataset_train = DanceImgDataset(df_train.reset_index(drop=True))
dataset_test = DanceImgDataset(df_valid.reset_index(drop=True))

train_loader = DataLoader(dataset_train,
                          batch_size=BATCH_SIZE,
                          shuffle=True)
valid_loader = DataLoader(dataset_test, batch_size=BATCH_SIZE)

In [19]:
model = timm.create_model('tf_efficientnetv2_s_in21k', pretrained=True)
model.classifier = nn.Sequential(
    nn.Linear(model.classifier.in_features, 15)
)
model.to(device);

In [20]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)

In [21]:
model2_acc = train(model, optimizer, train_loader, valid_loader)
model2_acc

---------------------epoch:1/10---------------------


Training: 100%|██████████| 37/37 [04:11<00:00,  6.80s/it, gpu_load=1.42GB, loss=2.7988]
Testing: 100%|██████████| 6/6 [00:49<00:00,  8.27s/it, loss=2.5763]


Epoch: 1, lr_rate 0.0001
Loss_train: 2.7128| Loss_valid: 2.6233|
ACC: 0.14583333333333334
---------------------epoch:2/10---------------------


Training: 100%|██████████| 37/37 [04:53<00:00,  7.94s/it, gpu_load=1.42GB, loss=1.9292]
Testing: 100%|██████████| 6/6 [00:53<00:00,  8.84s/it, loss=2.5702]


Epoch: 2, lr_rate 0.0001
Loss_train: 2.3023| Loss_valid: 2.4527|
ACC: 0.2708333333333333
---------------------epoch:3/10---------------------


Training: 100%|██████████| 37/37 [04:58<00:00,  8.07s/it, gpu_load=1.42GB, loss=2.0441]
Testing: 100%|██████████| 6/6 [00:50<00:00,  8.37s/it, loss=2.7985]


Epoch: 3, lr_rate 0.0001
Loss_train: 1.8802| Loss_valid: 2.4774|
ACC: 0.14583333333333334
---------------------epoch:4/10---------------------


Training: 100%|██████████| 37/37 [04:48<00:00,  7.79s/it, gpu_load=1.42GB, loss=1.5762]
Testing: 100%|██████████| 6/6 [00:45<00:00,  7.57s/it, loss=2.5455]


Epoch: 4, lr_rate 0.0001
Loss_train: 1.5713| Loss_valid: 2.3814|
ACC: 0.25
---------------------epoch:5/10---------------------


Training: 100%|██████████| 37/37 [04:52<00:00,  7.92s/it, gpu_load=1.42GB, loss=1.3514]
Testing: 100%|██████████| 6/6 [00:50<00:00,  8.48s/it, loss=2.6042]


Epoch: 5, lr_rate 0.0001
Loss_train: 1.3142| Loss_valid: 2.3736|
ACC: 0.14583333333333334
---------------------epoch:6/10---------------------


Training: 100%|██████████| 37/37 [04:50<00:00,  7.85s/it, gpu_load=1.42GB, loss=1.2638]
Testing: 100%|██████████| 6/6 [00:47<00:00,  7.87s/it, loss=2.2956]


Epoch: 6, lr_rate 0.0001
Loss_train: 1.1241| Loss_valid: 2.3380|
ACC: 0.1875
---------------------epoch:7/10---------------------


Training: 100%|██████████| 37/37 [04:42<00:00,  7.64s/it, gpu_load=1.42GB, loss=0.8194]
Testing: 100%|██████████| 6/6 [00:49<00:00,  8.17s/it, loss=2.3588]


Epoch: 7, lr_rate 0.0001
Loss_train: 0.8955| Loss_valid: 2.3321|
ACC: 0.25
---------------------epoch:8/10---------------------


Training: 100%|██████████| 37/37 [05:22<00:00,  8.73s/it, gpu_load=1.42GB, loss=1.2200]
Testing: 100%|██████████| 6/6 [00:50<00:00,  8.34s/it, loss=2.0568]


Epoch: 8, lr_rate 0.0001
Loss_train: 0.7997| Loss_valid: 2.0698|
ACC: 0.3541666666666667
---------------------epoch:9/10---------------------


Training: 100%|██████████| 37/37 [04:44<00:00,  7.68s/it, gpu_load=1.42GB, loss=0.7231]
Testing: 100%|██████████| 6/6 [00:47<00:00,  7.91s/it, loss=2.4321]


Epoch: 9, lr_rate 0.0001
Loss_train: 0.5929| Loss_valid: 2.3486|
ACC: 0.25
---------------------epoch:10/10---------------------


Training: 100%|██████████| 37/37 [04:50<00:00,  7.85s/it, gpu_load=1.42GB, loss=0.7106]
Testing: 100%|██████████| 6/6 [00:50<00:00,  8.40s/it, loss=2.1984]


Epoch: 10, lr_rate 0.0001
Loss_train: 0.5143| Loss_valid: 2.1798|
ACC: 0.2708333333333333
Elapsed total time: 00:56:31


0.2708333333333333

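It is hard to draw conclusions from these results. With only 48 validation videos, accuracy swings from epoch to epoch (between 0.25 and 0.42 for `r3d_18`, and between 0.15 and 0.35 for `tf_efficientnetv2_s_in21k`), and the final values, 0.3125 and 0.2708, differ by just two videos. Both models also overfit: the training loss keeps falling while the validation loss stays at about 2.0 or higher.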
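To put a number on that uncertainty, here is a rough sketch of a 95% interval for each final accuracy, using the normal approximation to the binomial. The counts of correct predictions (15 and 13 out of 48) are derived from the reported accuracies, and the approximation is crude at this sample size, so the result is only indicative.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Final-epoch results: 0.3125 * 48 = 15 and 0.2708 * 48 = 13 correct videos.
r3d = accuracy_interval(15, 48)
effnet = accuracy_interval(13, 48)

# The intervals overlap if each lower bound is below the other's upper bound.
overlap = r3d[1] < effnet[2] and effnet[1] < r3d[2]

for name, (p, low, high) in {"r3d_18": r3d, "efficientnetv2_s": effnet}.items():
    print(f"{name}: acc={p:.3f}, 95% interval ~ [{low:.3f}, {high:.3f}]")
print("intervals overlap:", overlap)
```

Each interval is roughly ±0.13 wide and the two overlap almost entirely, so this run does not separate the two models. A larger validation set would be needed for that.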