[Info dataset Youtube8M](https://www.kaggle.com/competitions/youtube8m-2019)

**Frame-level data** <br>
The data can be downloaded to a local computer by following the instructions on the competition page. <br>
Total size: 1.53 TB (large file warning!)

Each video has:

- id: unique id for the video. In the train set it is a YouTube video id; in the test and validation sets the ids are anonymized.
- labels: list of labels for the video.
- rgb (per frame): float array of length 1024.
- audio (per frame): float array of length 128.

A subset of the validation set videos also comes with segment-level labels. In addition to id, labels, and the frame-level features described above, these videos have:

- segment_start_times: list of segment start times.
- segment_end_times: list of segment end times.
- segment_labels: list of segment labels.
- segment_scores: list of binary values indicating whether each segment is positive or negative for the corresponding segment label.
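
The four segment lists are parallel: position `i` in each list describes the same segment. A minimal sketch of how they can be zipped into one record per segment follows. The `Segment` type and the `to_segments` helper are hypothetical names introduced only for illustration; the values are the ones printed later in this notebook for the first video of `validate00.tfrecord`.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int    # segment start time
    end: int      # segment end time
    label: int    # vocabulary index of the label being rated
    score: float  # 1.0 = positive for the label, 0.0 = negative


def to_segments(starts, ends, labels, scores) -> List[Segment]:
    """Zip the four parallel lists into one record per segment."""
    if not (len(starts) == len(ends) == len(labels) == len(scores)):
        raise ValueError("segment lists must have the same length")
    return [Segment(*fields) for fields in zip(starts, ends, labels, scores)]


# Values taken from the first video of validate00.tfrecord in this notebook
segments = to_segments(
    starts=[145, 110, 135, 155, 70],
    ends=[150, 115, 140, 160, 75],
    labels=[1036, 1036, 1036, 1036, 1036],
    scores=[0.0, 0.0, 1.0, 0.0, 0.0],
)

# Keep only the segments rated positive for their label
positives = [s for s in segments if s.score == 1.0]
print(positives)
```

Grouping the fields this way makes it harder to accidentally pair a score with the wrong segment when filtering or sorting.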

Files are in TFRecord format; TensorFlow Python readers are available in the GitHub repo.

- frame-sample.zip: a sample of frame-level data including train00 and train01
- validate-sample.zip: a sample of validation set data including validate00 and validate01
- vocabulary.csv: the full data dictionary for label names and their descriptions
- sample_submission.csv: a sample submission file in the correct format


In [1]:
import os
import sys
import numpy as np
import pandas as pd
from pathlib import Path
import tensorflow.compat.v1 as tf

# Add the project root to sys.path: walk up at most 6 directory levels
# until a directory containing 'src' is found
PATH_ROOT = Path.cwd()
for _ in range(6):
    if 'src' in os.listdir(PATH_ROOT):
        break
    PATH_ROOT = PATH_ROOT.parent
sys.path.append(str(PATH_ROOT))

# Local imports
from utils.utils import Map_index2label
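
The root-path search above can also be wrapped in a small reusable function. The sketch below is one possible way to do that, not part of the original notebook: `find_project_root`, `marker`, and `max_levels` are hypothetical names, and the demo runs on a temporary directory rather than on the real project layout.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src", max_levels: int = 6) -> Path:
    """Walk up from `start` until a directory containing `marker` is found.

    Mirrors the loop in the cell above: if no match is found within
    `max_levels` steps, the directory `max_levels` levels up is returned
    without being checked.
    """
    current = start
    for _ in range(max_levels):
        if (current / marker).exists():
            break
        current = current.parent
    return current


# Demo on a throwaway layout: <tmp>/src and <tmp>/notebooks/eda
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "eda"
    nested.mkdir(parents=True)
    found_ok = find_project_root(nested) == root

print(found_ok)
```

Checking `(current / marker).exists()` avoids listing every entry of each directory, which is the only behavioural difference from the original loop.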

#### Note:
Only the validation data contains the segment start and end times, segment labels, and segment scores.

In [48]:
# Define path data
FOLDER_DATA = Path('../data/raw')
PATH_VOCABULARY = FOLDER_DATA / 'vocabulary.csv'

PATH_TF_TRAIN_00 = str(FOLDER_DATA / 'frame' / 'train00.tfrecord')
PATH_TF_TRAIN_01 = str(FOLDER_DATA / 'frame' / 'train01.tfrecord')

PATH_TF_VAL_00 = str(FOLDER_DATA / 'validate' / 'validate00.tfrecord')
PATH_TF_VAL_01 = str(FOLDER_DATA / 'validate' / 'validate01.tfrecord')


In [69]:
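def get_label_videos(path_tf):
    # Frame-level records are SequenceExamples; the video-level labels are stored in the context
    return [tf.train.SequenceExample.FromString(example).context.feature['labels'].int64_list.value
            for example in tf.python_io.tf_record_iterator(path_tf)]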

print(f"Number of videos in train00.tfrecord:       {len(get_label_videos(PATH_TF_TRAIN_00))}")
print(f"Number of videos in train01.tfrecord:       {len(get_label_videos(PATH_TF_TRAIN_01))}")
print(f"Number of videos in validate00.tfrecord:    {len(get_label_videos(PATH_TF_VAL_00))}")
print(f"Number of videos in validate01.tfrecord:    {len(get_label_videos(PATH_TF_VAL_01))}")

Number of videos in train00.tfrecord:       1015
Number of videos in train01.tfrecord:       1041
Number of videos in validate00.tfrecord:    16
Number of videos in validate01.tfrecord:    16
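
Counting the records in a file does not require parsing each protobuf message. A TFRecord file is a sequence of frames, each laid out as: a little-endian uint64 payload length, a uint32 CRC of the length, the payload bytes, and a uint32 CRC of the payload. The sketch below walks that framing with the standard library only. `count_tfrecords` is a hypothetical helper, it does not verify the CRC fields, and the demo file is synthetic (placeholder zero CRCs), not one of the dataset files.

```python
import os
import struct
import tempfile


def count_tfrecords(path: str) -> int:
    """Count records by reading each length header and skipping the payload."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            f.seek(length + 4, os.SEEK_CUR)  # skip payload + uint32 CRC of the payload
            count += 1
    return count


# Demo on a synthetic file with three payloads and placeholder CRC fields
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tfrecord")
    with open(demo_path, "wb") as f:
        for payload in (b"a", b"bb", b"ccc"):
            f.write(struct.pack("<Q", len(payload)) + b"\x00" * 4)
            f.write(payload + b"\x00" * 4)
    demo_count = count_tfrecords(demo_path)

print(demo_count)
```

Because the payloads are skipped rather than decoded, this kind of scan should be much cheaper than building a label list just to take its length, at the cost of not validating the data.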


## Explore tfrecord
    * labels (list): video-level target labels, as a list of vocabulary indices
    * id (str): video id
    * audio_embedding_numpy (array): audio embedding for the first frame, shape: (128,) | min: 0, max: 255
    * rgb_embedding_numpy (array): rgb embedding for the first frame, shape: (1024,) | min: 0, max: 255

    Only for validation data:
    * segment_start_times (list): start frame for each segment | len(): number of segments
    * segment_end_times (list): end frame for each segment | len(): number of segments
    * segment_labels (list): category index for each segment | len(): number of segments
    * segment_scores (list): score indicating whether the segment has the label | len(): number of segments
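
The embeddings are stored as one byte per dimension, which is why the decoded values range from 0 to 255 even though the features are described as float arrays. The sketch below shows one way to map the bytes back to floats. The target range of [-2, 2] and the half-step bias are assumptions modelled on the dequantization helper in the YouTube-8M starter code and should be checked against that repo; `dequantize` and its parameter names are hypothetical.

```python
from typing import List, Sequence


def dequantize(values: Sequence[int],
               min_value: float = -2.0,
               max_value: float = 2.0) -> List[float]:
    """Map 8-bit quantized values (0..255) back to floats.

    The default range [-2, 2] is an assumption, not confirmed by this notebook.
    """
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    value_range = max_value - min_value
    scale = value_range / 255.0              # width of one quantization step
    bias = value_range / 512.0 + min_value   # small offset added to every value
    return [v * scale + bias for v in values]


# Iterating over a bytes object yields integers in 0..255
raw_frame = bytes([0, 128, 255])
floats = dequantize(raw_frame)
print(floats)
```

With these assumed defaults, byte 0 maps to roughly -1.99 and byte 255 to roughly 2.01.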

In [68]:
# Decode a bytes feature into a float32 NumPy array (raw values are uint8, 0 to 255)
f_bytes2array = lambda x: tf.cast(tf.decode_raw( x.bytes_list.value[0], tf.uint8), tf.float32).numpy()
# Instantiate the index-to-label mapping
map_index2label = Map_index2label(PATH_VOCABULARY)

def get_info_1st_video(PATH_TF):
    """ Explore tfrecord for the first video
    Args:
        PATH_TF (str): path of tfrecord

    Features:
        labels (list): video-level target labels, as a list of vocabulary indices
        id (str): video id
        audio_embedding_numpy (array): audio embedding for the first frame, shape: (128,) | min: 0, max: 255
        rgb_embedding_numpy (array): rgb embedding for the first frame, shape: (1024,) | min: 0, max: 255

        Only for validation data:
        segment_start_times (list): start frame for each segment | len(): number of segments
        segment_end_times (list): end frame for each segment | len(): number of segments
        segment_labels (list): category index for each segment | len(): number of segments
        segment_scores (list): score indicating whether the segment has the label | len(): number of segments
    """
    for example in tf.python_io.tf_record_iterator(PATH_TF):
        # get data for the video 
        tf_example = tf.train.SequenceExample.FromString(example)
        
        # Only do for validation data
        segment_start_times = tf_example.context.feature['segment_start_times'].int64_list.value
        segment_end_times = tf_example.context.feature['segment_end_times'].int64_list.value
        segment_labels = tf_example.context.feature['segment_labels'].int64_list.value
        segment_scores = tf_example.context.feature['segment_scores'].float_list.value
        
        # get index labels for the video
        labels =  tf_example.context.feature['labels'].int64_list.value # list of index

        # get id video
        id = tf_example.context.feature['id'].bytes_list.value[0].decode(encoding='UTF-8') # str
        
        # get  audio and rgb embeddings
        audio_embedding_bytes =  tf_example.feature_lists.feature_list['audio'].feature
        rgb_embedding_bytes =  tf_example.feature_lists.feature_list['rgb'].feature

        # get number of frames of video
        N_FRAMES_VIDEO = len(audio_embedding_bytes)

        # only take the embedding for the first frame
        audio_embedding_numpy = f_bytes2array( audio_embedding_bytes[0] ) # shape: (128,) | min: 0 , max: 255
        rgb_embedding_numpy = f_bytes2array( rgb_embedding_bytes[0] ) # shape: (1024,) | min: 0 , max: 255
        
        break # just for exploring the data of the first video

    print(f"Get data from 1st video ({PATH_TF}):\n------------------------------------")
    print(f"ID VIDEO:               {id}")
    print(f"Number of frames:       {N_FRAMES_VIDEO}")
    print(f"Label index video:      {labels}")
    print(f"Names labels:           {[map_index2label(index) for index in labels]}")
    print(f"Label index segments:   {segment_labels}")
    print(f"Names labels segments:  {[map_index2label(index) for index in segment_labels]}")
    print(f"Start frame segments:   {segment_start_times}")
    print(f"End frame segments:     {segment_end_times}")
    print(f"Scores segments:        {segment_scores}")
    print(f"Shape video embedding:  {rgb_embedding_numpy.shape}")
    print(f"Shape audio embedding:  {audio_embedding_numpy.shape}")
    print('\n')

get_info_1st_video(PATH_TF_VAL_00)
get_info_1st_video(PATH_TF_TRAIN_00)

Get data from 1st video (../data/raw/validate/validate00.tfrecord):
------------------------------------
ID VIDEO:               Iv00
Number of frames:       190
Label index video:      [375, 1036, 1062]
Names labels:           ['', 'Laser lighting display', '']
Label index segments:   [1036, 1036, 1036, 1036, 1036]
Names labels segments:  ['Laser lighting display', 'Laser lighting display', 'Laser lighting display', 'Laser lighting display', 'Laser lighting display']
Start frame segments:   [145, 110, 135, 155, 70]
End frame segments:     [150, 115, 140, 160, 75]
Scores segments:        [0.0, 0.0, 1.0, 0.0, 0.0]
Shape video embedding:  (1024,)
Shape audio embedding:  (128,)


Get data from 1st video (../data/raw/frame/train00.tfrecord):
------------------------------------
ID VIDEO:               op00
Number of frames:       234
Label index video:      [82, 103, 346, 350]
Names labels:           ['', '', '', '']
Label index segments:   []
Names labels segments:  []
Start frame segmetn

# Vocabulary

In [54]:
vocabulary_df = pd.read_csv(PATH_VOCABULARY)
vocabulary_df.head()

| | Index | TrainVideoCount | KnowledgeGraphId | Name | WikiUrl | Vertical1 | Vertical2 | Vertical3 | WikiDescription |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 378135 | /m/01jddz | Concert | https://en.wikipedia.org/wiki/Concert | Arts & Entertainment | | | A concert is a live music performance in front... |
| 1 | 7 | 200813 | /m/0k4j | Car | https://en.wikipedia.org/wiki/Car | Autos & Vehicles | | | A car is a wheeled, self-powered motor vehicle... |
| 2 | 8 | 181579 | /m/026bk | Dance | https://en.wikipedia.org/wiki/Dance | Arts & Entertainment | | | Dance is a performance art form consisting of ... |
| 3 | 11 | 135357 | /m/02wbm | Food | https://en.wikipedia.org/wiki/Food | Food & Drink | | | Food is any substance consumed to provide nutr... |
| 4 | 12 | 130835 | /m/02vx4 | Association football | https://en.wikipedia.org/wiki/Association_foot... | Sports | | | Association football, more commonly known as f... |


In [55]:
unique_class = vocabulary_df['Index'].unique()
unique_vertical1 = vocabulary_df['Vertical1'].unique()
unique_vertical2 = vocabulary_df['Vertical2'].unique()
unique_vertical3 = vocabulary_df['Vertical3'].unique()

# Note: NaN from the sparsely filled Vertical2 and Vertical3 columns is included in this list
classes_verticals = list(unique_vertical1) + list(unique_vertical2) + list(unique_vertical3)
total_classes = np.unique(classes_verticals + list(unique_class))
N_CLASSES = len(unique_class)
N_CLASSES_VERTICAL = len(classes_verticals)
print(f"Number of classes:                      {N_CLASSES}")
print(f"Number of classes verticals:            {N_CLASSES_VERTICAL}") # some vertical classes are also in the classes
print(f"Number of unique classes + verticals:   {len(total_classes)}")

Number of classes:                      1000
Number of classes verticals:            50
Number of unique classes + verticals:   1025
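
The vertical count above concatenates the unique values of three columns, so it includes the missing-value marker from `Vertical2` and `Vertical3` and may count a vertical more than once if it appears in several columns. The sketch below shows one way to count distinct vertical names instead. It uses plain dictionaries rather than the DataFrame so that it runs on its own; `unique_verticals` is a hypothetical helper, and the rows are the five rows shown by `vocabulary_df.head()` above, with missing verticals written as `None`.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]

# The five rows shown by vocabulary_df.head(); missing verticals are None
rows: List[Row] = [
    {"Name": "Concert", "Vertical1": "Arts & Entertainment", "Vertical2": None, "Vertical3": None},
    {"Name": "Car", "Vertical1": "Autos & Vehicles", "Vertical2": None, "Vertical3": None},
    {"Name": "Dance", "Vertical1": "Arts & Entertainment", "Vertical2": None, "Vertical3": None},
    {"Name": "Food", "Vertical1": "Food & Drink", "Vertical2": None, "Vertical3": None},
    {"Name": "Association football", "Vertical1": "Sports", "Vertical2": None, "Vertical3": None},
]


def unique_verticals(rows: Sequence[Row],
                     columns: Sequence[str] = ("Vertical1", "Vertical2", "Vertical3")) -> List[str]:
    """Collect distinct vertical names across columns, skipping missing values."""
    names = {row[col] for row in rows for col in columns if row[col]}
    return sorted(names)


verticals = unique_verticals(rows)
print(len(verticals), verticals)
```

On the full vocabulary the same idea applies: drop missing values first, then deduplicate across all three columns at once.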


In [56]:
vocabulary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Index             1000 non-null   int64 
 1   TrainVideoCount   1000 non-null   int64 
 2   KnowledgeGraphId  1000 non-null   object
 3   Name              988 non-null    object
 4   WikiUrl           988 non-null    object
 5   Vertical1         1000 non-null   object
 6   Vertical2         153 non-null    object
 7   Vertical3         12 non-null     object
 8   WikiDescription   988 non-null    object
dtypes: int64(2), object(7)
memory usage: 70.4+ KB


# Explore categories
## Top 30 categories

In [66]:
# Get top n classes
from collections import Counter
label_mapping = vocabulary_df[['Index', 'Name']].set_index('Index', drop=True).to_dict()['Name'] # dict: key --> index,  values --> name of category | len(): number of categories

n = 30
labels = get_label_videos(PATH_TF_TRAIN_00)
top_n = Counter([item for sublist in labels for item in sublist]).most_common(n) # tuple --> (index, num_samples)
top_n_labels = [int(i[0]) for i in top_n]   # list: top n index 
top_n_label_names = [label_mapping[x] for x in top_n_labels if x in label_mapping] # filter out the labels that aren't in the 1,000 used for this competition

print(f"Most {n} frequent categories:\n",top_n_label_names)

Most 30 frequent categories:
 ['Concert', 'Car', 'Association football', 'Food', 'Dance', 'Motorsport', 'Racing', 'Mobile phone', 'Smartphone', 'Cooking', 'Pet', 'Dish (food)', 'Drum kit']


# Distribution of classes