# **Dimensionality Reduction**

In `video-features.ipynb`, features were extracted from the videos listed in `test_subset_10.csv`, `train_subset_10.csv`, and `val_subset_10.csv`. Each video's feature vector was stored in a file `${youtube_id}.npy` inside the `extraction` folder.

## **Load packages**

In [26]:
import numpy as np
import pandas as pd
from pathlib import Path

## **Load datasets**

In [3]:
train_path = "./data/train_subset_10.csv"
val_path = "./data/val_subset_10.csv"
test_path = "./data/test_subset_10.csv"

train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)
test_df = pd.read_csv(test_path)

## **Useful functions**

In [17]:
def get_X_y_id(path: str, df: pd.DataFrame, is_train: bool = True):

    """
    Description:
        Return the feature vectors, labels, and ids of the videos.

    Args:
        path (str): path to the directory containing the extracted `.npy` feature files
        df (pd.DataFrame): a pandas DataFrame; one of train_df, val_df, or test_df
        is_train (bool): if True, return (feature_vectors, labels, ids);
            if False, return only (feature_vectors, ids)

    """

    feature_vectors = []
    labels = []
    ids = []

    for video in Path(path).glob('*.npy'):
        
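        # File name up to the first '_' (assumes the id itself contains no '_').
        # Uses pathlib directly, since `os` is not imported in this notebook.
        id = video.name.split('_')[0]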

        if id not in df['youtube_id'].values:
            continue

        current_video = np.load(video)
        # Average over axis 0 so each video is represented by a single vector
        feature_vectors.append(np.mean(current_video, axis=0))
        ids.append(id)

        if is_train:
            labels.append(df[df['youtube_id'] == id]['label'].values[0])

    feature_vectors = pd.DataFrame(np.vstack(feature_vectors))
    ids = pd.DataFrame({'youtube_id': ids})

    if is_train:
        # Build the labels frame only when labels were collected;
        # np.vstack raises ValueError on an empty list
        labels = pd.DataFrame(np.vstack(labels))
        return feature_vectors, labels, ids
    else:
        return feature_vectors, ids
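
The function reduces each loaded array to one vector by averaging along axis 0. The sketch below illustrates that step on synthetic data. It assumes each `.npy` file holds a 2-D array with one row per clip and 512 columns; the clip count of 7 is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one loaded .npy file: 7 clips x 512 features
clip_features = rng.random((7, 512))

# Mean over axis 0 collapses the clip dimension into a single vector
video_vector = np.mean(clip_features, axis=0)

print(clip_features.shape)
print(video_vector.shape)
```

Averaging is only one way to pool clip-level features; it discards temporal order but gives every video a vector of the same length regardless of duration.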


In [19]:
path_train = './extraction/train/r21d/r2plus1d_18_16_kinetics'
path_val = './extraction/val/r21d/r2plus1d_18_16_kinetics'
path_test = './extraction/test/r21d/r2plus1d_18_16_kinetics'


In [22]:
# X_train, y_train, ids_train = get_X_y_id(path_train, train_df)
X_val, y_val, ids_val = get_X_y_id(path_val, val_df)
# X_test, ids_test = get_X_y_id(path_test, test_df, False)

In [27]:
X_val.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
0,0.983179,0.327599,0.782543,1.001214,1.577275,0.340752,1.61847,1.223442,0.823669,0.681321,...,0.737517,0.705552,1.096622,0.711036,0.974399,0.668527,0.925978,0.582594,1.312039,2.023441
1,1.061987,0.605783,0.573484,0.550123,1.083703,0.436514,0.986723,0.237259,0.870769,0.636464,...,0.710814,1.498515,0.875496,1.125336,2.201024,1.377259,1.26104,0.406046,0.522262,0.986933
2,2.448404,1.147437,1.186966,0.792676,0.788874,0.29224,0.791563,0.51228,0.616745,0.588685,...,1.055853,1.433681,2.381611,0.469194,0.610362,2.227937,1.41925,0.234973,0.533393,0.299416
3,0.312113,0.368528,0.241799,0.789037,0.732306,0.635243,1.188323,0.714574,1.577718,1.490716,...,1.056502,0.055957,1.616206,1.442354,0.660173,0.921523,0.043679,0.595073,1.01982,0.219829
4,0.232415,0.736968,0.989492,0.516418,0.407583,0.240481,0.723166,1.121362,0.571783,0.914052,...,0.694645,0.176282,0.564752,1.375056,0.592111,0.756491,0.426875,0.457333,1.180734,0.757165


In [23]:
y_val.head()

Unnamed: 0,0
0,flipping pancake
1,wrapping present
2,stretching leg
3,stretching leg
4,shot put


In [24]:
ids_val.head()


Unnamed: 0,youtube_id
0,--33Lscn6sk
1,-0WZKTu0xNk
2,-2VKVjgNuE0
3,-2VXhGGeOWg
4,-2zDnjMmI5U


This way, row 0 of X, y, and id all refer to the same video, and likewise for rows 1, 2, ...
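
The row correspondence can be checked by looking up each id in the source DataFrame and comparing against the returned labels. This is a minimal sketch on tiny hypothetical frames that mimic the structure returned by `get_X_y_id`; the ids and the three-row `df` are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature version of a subset CSV
df = pd.DataFrame({
    "youtube_id": ["a1", "b2", "c3"],
    "label": ["flipping pancake", "stretching leg", "shot put"],
})

# Hypothetical outputs, in the order the files happened to be read
ids = pd.DataFrame({"youtube_id": ["c3", "a1"]})
labels = pd.DataFrame(["shot put", "flipping pancake"])

# Look up the expected label for each id, preserving the order of `ids`
expected = df.set_index("youtube_id").loc[ids["youtube_id"], "label"].to_numpy()

# True only if every row of `labels` matches the label of the id in the same row
aligned = bool((labels.iloc[:, 0].to_numpy() == expected).all())
print(aligned)
```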

In [25]:
print(X_val.shape)
print(y_val.shape)  
print(ids_val.shape)

(426, 512)
(426, 1)
(426, 1)
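
With a matrix of this shape, a reduction step could look like the sketch below. It uses PCA computed through NumPy's SVD as one common option; this section does not state which method the notebook applies, so the technique, the synthetic data, and the choice of 50 components are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as X_val (not the real features)
X = rng.random((426, 512))

n_components = 50  # arbitrary illustrative choice

# PCA: center the columns, then project onto the top right-singular vectors
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:n_components].T

# Fraction of total variance retained by the kept components
explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()

print(X_reduced.shape)
print(explained)
```

On real data, the projection would be fitted on the training features only and then applied unchanged to the validation and test features.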
