# Part 1: Create labeled image dataset with two classes

1. Indoor photographs (e.g. Bedrooms, Bathrooms, Classrooms, Offices) 
2. Outdoor photographs (e.g. Landscapes, Skyscrapers, Mountains, Beaches)
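
One way to turn these two lists into labels is a simple keyword lookup. The sketch below is illustrative only: the keyword sets just reuse the example categories above, and `classify_labels` is a hypothetical helper, not part of any dataset API. It assumes each video comes with a list of human-readable label names, and it leaves a video unlabeled (`None`) when it matches both classes or neither.

```python
# Hypothetical keyword sets, taken from the example categories listed above.
INDOOR = {"bedroom", "bathroom", "classroom", "office"}
OUTDOOR = {"landscape", "skyscraper", "mountain", "beach"}


def classify_labels(label_names):
    """Return 'indoor', 'outdoor', or None for an iterable of label names."""
    names = {name.strip().lower() for name in label_names}
    is_indoor = bool(names & INDOOR)
    is_outdoor = bool(names & OUTDOOR)
    if is_indoor == is_outdoor:
        # Ambiguous (both classes) or irrelevant (neither): skip the video.
        return None
    return "indoor" if is_indoor else "outdoor"


print(classify_labels(["Bedroom", "Furniture"]))
print(classify_labels(["Beach", "Office"]))
```

Dropping ambiguous videos keeps the two classes cleanly separated, at the cost of discarding some data.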

## Setup

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
import math
import seaborn
import matplotlib.pyplot as plt
import sklearn
from glob import glob
import tensorflow as tf
from IPython.display import YouTubeVideo

In [None]:
%pwd

In [None]:
# project directory
project_dir = Path('/Users/administrator/Documents/pex_challenge/')
data_dir = project_dir.joinpath('data/yt8m/frame')

## Step 1
Download a subset of examples from the YouTube-8M labeled video dataset: https://research.google.com/youtube8m/explore.html

In [None]:
# change directories into data_dir, where we want to download the data
%cd {data_dir}

In [None]:
# download 1/100th of the training frame-level data
!curl data.yt8m.org/download.py | shard=1,100 partition=2/frame/train mirror=us python

>> Downloading http://us.data.yt8m.org/2/frame/train/train4f.tfrecord 100.0%
Succesfully downloaded train3477.tfrecord 270452664 bytes.

In [None]:
# download 1/20th of the validate frame-level data (shard=1,20)
!curl data.yt8m.org/download.py | shard=1,20 partition=2/frame/validate mirror=us python

In [None]:
# download 1/20th of the test frame-level data (shard=1,20)
!curl data.yt8m.org/download.py | shard=1,20 partition=2/frame/test mirror=us python

## Step 2

Extract relevant frames from the videos to build a balanced dataset of indoor and outdoor images. The dataset should contain a few thousand images in total. This task can be performed with tools like OpenCV or FFmpeg.
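
A minimal sketch of the balancing step, assuming the frames have already been extracted and labeled. The `(frame_id, label)` record layout and the `balance_classes` helper are hypothetical. The function caps every class at the size of the smallest one, so the result has the same number of indoor and outdoor frames.

```python
import random
from collections import defaultdict


def balance_classes(frames, per_class, seed=0):
    """Sample the same number of frames from every class.

    frames: list of (frame_id, label) pairs (a hypothetical record layout).
    per_class: upper bound on the number of frames kept per class.
    """
    by_label = defaultdict(list)
    for frame_id, label in frames:
        by_label[label].append(frame_id)
    if not by_label:
        return []

    # Cap at the size of the smallest class so every class contributes equally.
    n = min(per_class, min(len(ids) for ids in by_label.values()))
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for frame_id in rng.sample(by_label[label], n):
            balanced.append((frame_id, label))
    return balanced


# Toy input: 50 indoor frames and 20 outdoor frames.
frames = [(f"in_{i}", "indoor") for i in range(50)]
frames += [(f"out_{i}", "outdoor") for i in range(20)]
subset = balance_classes(frames, per_class=1500)
print(len(subset))
```

With `per_class=1500` and enough source frames, this would give about three thousand images, in line with the "few thousand" target.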

In [None]:
def extract_features(features, file):
    '''
    Read the frame-level data of one tfrecord file.

    Iterates over every frame of every video in the file and returns a dataframe where
    each row is an image (a frame from a video) and the columns correspond to the
    extracted features.

    features: a list of features we want to extract from the tfrecord file
    file: the path to a tfrecord file
    '''
    # collect one row per frame; the dataframe is built once at the end
    rows = []

    for e in tf.python_io.tf_record_iterator(str(file)):
        tf_seq_example = tf.train.SequenceExample.FromString(e)
        feature_list = tf_seq_example.feature_lists.feature_list
        # get the number of frames in the video
        n_frames = len(feature_list['audio'].feature)

        # iterate through frames
        for i in range(n_frames):
            # decode every requested feature of frame i from raw bytes (uint8) to float32;
            # np.frombuffer yields the same values as tf.decode_raw without needing a session
            rows.append([
                np.frombuffer(feature_list[feat].feature[i].bytes_list.value[0],
                              dtype=np.uint8).astype(np.float32)
                for feat in features
            ])

    # one row per frame, one column per requested feature
    return pd.DataFrame(rows, columns=features)

In [None]:
# collect the TFRecord files we are going to read
# tf_files is a list of three elements:
# the first element is the list of training files,
# the second is the list of validate files,
# the third is the list of test files
tf_files = [sorted(data_dir.glob('train*.tfrecord')),
            sorted(data_dir.glob('validate*.tfrecord')),
            sorted(data_dir.glob('test*.tfrecord'))]

## Step 3

Create a train/test split of the data
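
Because consecutive frames of one video look very similar, a split made frame by frame can leak near-duplicates into the test set. The sketch below therefore splits by video instead. It is only one possible approach: the `(video_id, frame_index, label)` record layout and the `split_by_video` helper are hypothetical, and it does not stratify by class.

```python
import random


def split_by_video(frames, test_fraction=0.2, seed=0):
    """Split (video_id, frame_index, label) records into train and test lists.

    All frames of one video end up on the same side of the split.
    """
    video_ids = sorted({video_id for video_id, _, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(video_ids)

    # Hold out at least one video for testing.
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])

    train = [f for f in frames if f[0] not in test_ids]
    test = [f for f in frames if f[0] in test_ids]
    return train, test


# Toy input: 10 videos with 5 frames each, alternating labels per video.
frames = [
    (f"vid{v}", i, "indoor" if v % 2 == 0 else "outdoor")
    for v in range(10)
    for i in range(5)
]
train, test = split_by_video(frames, test_fraction=0.2)
print(len(train), len(test))
```

If the class balance of the test set matters, the same idea can be applied separately to the indoor and outdoor videos.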