# Part 1: Create a labeled image dataset with two classes

1. Indoor photographs (e.g. Bedrooms, Bathrooms, Classrooms, Offices) 
2. Outdoor photographs (e.g. Landscapes, Skyscrapers, Mountains, Beaches)

## Setup

In [14]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
import math
import sklearn
from glob import glob
import tensorflow as tf
from IPython.display import YouTubeVideo

In [16]:
%pwd

'/Users/administrator/Documents/pex_challenge/data/yt8m/frame'

In [17]:
# project directory
project_dir = Path('/Users/administrator/Documents/pex_challenge/')
data_dir = project_dir.joinpath('data/yt8m/frame')

## Step 1
Download a subset of examples from the YouTube-8M labeled video dataset: https://research.google.com/youtube8m/explore.html

In [18]:
# change directories into data_dir, where we want to download the data
%cd {data_dir}

/Users/administrator/Documents/pex_challenge/data/yt8m/frame


In [None]:
# download 1/100th of the training frame-level data
!curl data.yt8m.org/download.py | shard=1,100 partition=2/frame/train mirror=us python

>> Downloading http://us.data.yt8m.org/2/frame/train/trainZ6.tfrecord 8.0%

In [None]:
%%capture
# %%capture has to be the first line of the cell; it hides the output to manage file size
# download 1/20th of the validate frame-level data
!curl data.yt8m.org/download.py | shard=1,20 partition=2/frame/validate mirror=us python

In [None]:
%%capture
# %%capture has to be the first line of the cell; it hides the output to manage file size
# download 1/20th of the test frame-level data
!curl data.yt8m.org/download.py | shard=1,20 partition=2/frame/test mirror=us python

## Step 2

Extract relevant frames from the videos to build a balanced dataset of indoor and outdoor images. The dataset should contain a few thousand images in total. This task can be performed with tools like OpenCV or FFmpeg.

In [28]:
# get data on the labels for videos
label_file = project_dir.joinpath('data/vocabulary.csv')
# read the csv that contains information about the video labels into a dataframe
df_labels = pd.read_csv(label_file.as_posix(), sep=',')
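
One way to turn the vocabulary into two classes is to pick the label ids whose names match hand-chosen keywords, then tag each video by which set its labels fall into. The sketch below only illustrates that idea and is not necessarily how this notebook ends up doing it: the `vocab` rows and their index values are made up, the helper names `ids_matching` and `classify_video` are hypothetical, and the keywords come from the examples listed at the top of this notebook.

```python
# Hypothetical (index, name) pairs standing in for rows of vocabulary.csv;
# the index values are made up for illustration.
vocab = [
    (0, 'Bedroom'),
    (1, 'Beach'),
    (2, 'Classroom'),
    (3, 'Mountain'),
    (4, 'Guitar'),
]

INDOOR_KEYWORDS = {'bedroom', 'bathroom', 'classroom', 'office'}
OUTDOOR_KEYWORDS = {'landscape', 'skyscraper', 'mountain', 'beach'}


def ids_matching(vocab, keywords):
    '''Return the set of label ids whose name matches one of the keywords.'''
    return {idx for idx, name in vocab if name.lower() in keywords}


def classify_video(label_ids, indoor_ids, outdoor_ids):
    '''
    Return 'indoor' or 'outdoor' for a video's label ids, or None when the
    video matches neither class or both (ambiguous videos are skipped).
    '''
    is_indoor = bool(indoor_ids & set(label_ids))
    is_outdoor = bool(outdoor_ids & set(label_ids))
    if is_indoor == is_outdoor:
        return None
    return 'indoor' if is_indoor else 'outdoor'


indoor_ids = ids_matching(vocab, INDOOR_KEYWORDS)
outdoor_ids = ids_matching(vocab, OUTDOOR_KEYWORDS)
```

Skipping videos that match both classes keeps the labels clean at the cost of some data, which seems a reasonable trade when only a few thousand images are needed.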

In [56]:
def extract_data(file):
    '''
    Read the frame-level data from one tfrecord file.

    Goes through every video in the file and every frame in each video, and
    returns a dataframe with one row per frame and three columns:
      id:     the ID of the video the frame belongs to
      rgb:    the array of 1024 rgb values stored for that frame
      labels: the label IDs attached to the video

    file: the path to a tfrecord file
    '''

    # collect the rows in a list and build the dataframe once at the end;
    # appending to a dataframe inside the loop copies it on every call
    rows = []

    num_video = 1
    for e in tf.python_io.tf_record_iterator(file):
        # progress: video number and number of frames extracted so far
        print(num_video, len(rows))

        tf_seq_example = tf.train.SequenceExample.FromString(e)

        # the video id and the labels are stored once per video,
        # so read them outside the frame loop
        video_id = tf_seq_example.context.feature['id'].bytes_list.value[0].decode('utf-8')
        arr_labels = list(tf_seq_example.context.feature['labels'].int64_list.value)

        # iterate through frames
        for frame in tf_seq_example.feature_lists.feature_list['rgb'].feature:
            # decode the raw bytes as uint8 and cast to float32; this gives the
            # same values as tf.decode_raw + tf.cast without needing a TF session
            arr_rgb = np.frombuffer(frame.bytes_list.value[0], dtype=np.uint8).astype(np.float32)

            # create a row of the extracted information
            rows.append({
                'id': video_id,
                'rgb': arr_rgb,
                'labels': arr_labels
            })

        num_video += 1

    return pd.DataFrame(rows, columns=['id', 'rgb', 'labels'])

In [53]:
# get all the tfrecord files that we are going to read
tf_files = sorted(data_dir.glob('*.tfrecord'))

In [None]:
# go through each of the tfrecord files
# each tfrecord file contains many videos
# extract information about each of the frames in the videos
for file in tf_files[:1]:
    df = extract_data(file.as_posix())

1 0
2 300
3 522
4 685
5 820
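
Step 2 asks for a balanced dataset, and the per-video frame counts above vary a lot, so the two classes are unlikely to come out equal on their own. A minimal sketch of one option, downsampling every class to the size of the smallest one, is shown here. The helper name `balance_classes` and the toy `frames` list are hypothetical, and it assumes each frame has already been tagged as indoor or outdoor.

```python
import random


def balance_classes(samples, seed=0):
    '''
    Downsample so that every class has as many samples as the smallest class.

    samples: list of (item, class_name) pairs
    '''
    by_class = {}
    for item, class_name in samples:
        by_class.setdefault(class_name, []).append((item, class_name))

    n_keep = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_keep))
    return balanced


# toy input: 5 indoor frames and 2 outdoor frames
frames = [('frame_%d' % i, 'indoor') for i in range(5)]
frames += [('frame_%d' % i, 'outdoor') for i in range(5, 7)]
balanced = balance_classes(frames)
```

Downsampling throws away majority-class frames; with only a few thousand images needed in total, that is probably acceptable here.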


## Step 3

Create a train/test split of the data
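
Frames taken from the same video tend to look alike, so a purely random frame-level split can leak near-duplicates from train into test. The sketch below splits by video id instead, so that all frames of a video land on the same side. It is one possible approach under stated assumptions, not a prescribed one: the helper name `split_by_video`, the 20% test fraction, and the toy `rows` are hypothetical, and rows are assumed to carry the video id under an `'id'` key as in the dataframe built in Step 2.

```python
import random


def split_by_video(rows, test_fraction=0.2, seed=0):
    '''
    Split frame rows into train and test lists so that all frames from one
    video end up on the same side.

    rows: list of dicts with at least an 'id' key (the video id)
    '''
    # sort first so the shuffle is reproducible for a given seed
    video_ids = sorted({row['id'] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(video_ids)

    # hold out at least one video for the test set
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])

    train = [row for row in rows if row['id'] not in test_ids]
    test = [row for row in rows if row['id'] in test_ids]
    return train, test


# toy input: 10 videos with 3 frames each
rows = [{'id': 'vid%d' % v, 'frame': f} for v in range(10) for f in range(3)]
train_rows, test_rows = split_by_video(rows)
```

With a dataframe, the same split can be applied by converting it with `df.to_dict('records')`, or by filtering on the `id` column with the chosen test ids.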