# Load and Preprocess Videos

GOALS: 
- Load the data from a zip file.

- Read sequences of frames out of the video files.

- Visualize the video data.

- Wrap the frame-generator `tf.data.Dataset`.

## Installing Dependencies 

## Importing Dependencies 

In [1]:
import tensorflow as tf 
import remotezip as rz 
import collections
import pathlib
import os 
import random
import tqdm

In [2]:
tf.__version__

'2.16.2'

## Download a subset of the UCF101 dataset 

In [3]:
URL = 'https://storage.googleapis.com/thumos14_files/UCF101_videos.zip'

The above URL contains a zip file with the UCF 101 dataset. Create a function that uses the remotezip library to examine the contents of the zip file in that URL

In [4]:
def list_files_from_zip_url(zip_url): 
    files = []

    with rz.RemoteZip(zip_url) as zip: 
        for zip_info in zip.infolist(): 
            files.append(zip_info.filename)

    return files 

In [5]:
files = list_files_from_zip_url(URL)
files = [file for file in files if file.endswith('.avi')]
files[:10]

['UCF101/v_ApplyEyeMakeup_g01_c01.avi',
 'UCF101/v_ApplyEyeMakeup_g01_c02.avi',
 'UCF101/v_ApplyEyeMakeup_g01_c03.avi',
 'UCF101/v_ApplyEyeMakeup_g01_c04.avi',
 'UCF101/v_ApplyEyeMakeup_g01_c05.avi',
 'UCF101/v_ApplyEyeMakeup_g01_c06.avi',
 'UCF101/v_ApplyEyeMakeup_g02_c01.avi',
 'UCF101/v_ApplyEyeMakeup_g02_c02.avi',
 'UCF101/v_ApplyEyeMakeup_g02_c03.avi',
 'UCF101/v_ApplyEyeMakeup_g02_c04.avi']

Begin with a few videos and a limited number of classes for training. After running the above code block, notice that the class name is included in the filename of each video.

Define the `get_class` function that retrieves the class name from a filename. Then, create a function called `get_files_per_class` which converts the list of all files (files above) into a dictionary listing the files for each class:

In [6]:
def get_class(filename):
    return filename.split('_')[-3]

In [7]:
get_class('UCF101/v_ApplyEyeMakeup_g01_c01.avi')

'ApplyEyeMakeup'

In [8]:
def get_files_per_class(files): 
    """ 
    Purpose: Retrieve the files that belong to each class
    """
    files_for_class = collections.defaultdict(list)

    for filename in files: 
        class_name = get_class(filename)
        files_for_class[class_name].append(filename)

    return files_for_class

In [9]:
files_for_class = get_files_per_class(files)
classes = list(files_for_class.keys())

In [10]:
print('Num classes:', len(classes))
print('Num videos for class[0]:', len(files_for_class[classes[0]]))

Num classes: 101
Num videos for class[0]: 145


Create a new function called `select_subset_of_classes` that selects a subset of the classes present within the dataset and a particular number of files per class

In [11]:
NUM_CLASSES = 10
FILES_PER_CLASS = 50

In [12]:
def select_subset_of_classes(files_for_class, classes, files_per_class): 
    files_subset = dict()

    for class_name in classes: 
        class_files = files_for_class[class_name]
        files_subset[class_name] = class_files[:files_per_class]

    return files_subset

In [13]:
files_subset = select_subset_of_classes(files_for_class, classes[:NUM_CLASSES], FILES_PER_CLASS)
list(files_subset.keys())

['ApplyEyeMakeup',
 'ApplyLipstick',
 'Archery',
 'BabyCrawling',
 'BalanceBeam',
 'BandMarching',
 'BaseballPitch',
 'BasketballDunk',
 'Basketball',
 'BenchPress']

Define Helper Functions that split the videos into training, validation and test sets. The Videos are downloaded from a Url with the ZIP file, and placed into their respective sub-directories

In [14]:
def download_from_zip(zip_url, to_dir, file_names): 
    """
    Purpose: Download the contents of the zip file from the zip url

    Arguments:
             zip_url: A URL with a zip file containing data.
             to_dir: A directory to download data to.
             file_names: Names of files to download.
    """
    with rz.RemoteZip(zip_url) as zip:
        for filename in tqdm.tqdm(file_names): 
            class_name = get_class(filename)

            zip.extract(filename, str(to_dir / class_name))
            unzipped_file = to_dir / class_name / filename


            filename = pathlib.Path(filename).parts[-1]
            output_file = to_dir / class_name/ filename

            unzipped_file.rename(output_file)


In [15]:
def split_class_lists(files_for_class, count): 
    """
    Purpose: Returns the list of files belonging to a subset of data as well as the remainder of files that need to be downloaded

    Arguments: 
             files_for_class: Files belonging to a particular class of data 
             count: Number of files to download 
    """
    split_files = []
    remainder = {}

    for className in files_for_class: 
        split_files.extend(files_for_class[className][:count])
        remainder[className] = files_for_class[className][count:]

    return split_files, remainder

In [16]:
def download_ucf_101_subset(zip_url, num_classes, splits, download_dir): 
    """ 
    Purpose: Downloads and organizes a subset of the UCF101 dataset into different splits (like training, validation and test)

    Arguments: 
             zip_url: URL of the zip file containing the dataset
             num_classes: Number of classes to include 
             splits: Dictionary defining how to split the data 
             download_dir: where to save downloaded files
    """
    # STEP 1: Get List of Files and clean it 
    files = list_files_from_zip_url(zip_url)
    for f in files: 
        path = os.path.normpath(f)
        tokens = path.split(os.sep)
        # Remove Entries that dont have proper filenames or does not have a filename
        if len(tokens) <= 2: 
            files.remove(f) 


    # STEP 2: Organize Files by Class
    # This organizes files into their respective classes and selects only the specified number of classes.
    files_for_class = get_files_per_class(files)
    classes = list(files_for_class.keys())[:num_classes]

    # STEP 3: Shuffle Files for each class 
    # Randomly shuffles the files within each class for better distribution
    for cls in classes: 
        random.shuffle(files_for_class[cls])

    # STEP 4: Creates Splits and download files
    # * Creates directories for each split (train/test/validation)
    # * Splits the files according to the specified counts
    # * Downloads the files from the ZIP to appropriate directories
    # * Keeps track of where everything is stored
    dirs = {}

    for split_name, split_count in splits.items(): 
        split_dir = download_dir / split_name
        split_files, files_for_class = split_class_lists(files_for_class, split_count)
        download_from_zip(zip_url, split_dir, split_files)
        dirs[split_name] = split_dir

    return dir

In [None]:
download_dir = pathlib.Path('./UCF101_subset/')
subset_paths = download_ucf_101_subset(URL, num_classes=NUM_CLASSES, splits={'train': 30, "val": 10, "test": 10}, download_dir=download_dir)

 69%|██████████████████████████████████████████████████████████████████████████████████████████▌                                         | 2078/3030 [1:13:25<29:03,  1.83s/it]