# Creating an Audioset subset

First, download the full [Audioset](https://research.google.com/audioset/download.html) feature archive. It is about 2.4 GB, so the download may take some time.

### NOTE: The Audioset data is stored with case-sensitive filenames. If you are using a filesystem with case-insensitive filenames (the default on macOS), 75% of the dataset will be overwritten when you decompress the archive. You should only run this on a Linux machine.
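
Before extracting, it may be worth confirming that the target directory really is case-sensitive. The following is a minimal sketch and not part of the original notebook; the helper name `is_case_sensitive` is hypothetical. It creates a temporary file with a lowercase name and checks whether the same name in uppercase resolves to it.

```python
import os
import tempfile


def is_case_sensitive(directory="."):
    """Return True if `directory` is on a case-sensitive filesystem."""
    # Create a temporary file whose name contains lowercase letters.
    with tempfile.NamedTemporaryFile(prefix="casecheck_", dir=directory) as tmp:
        head, name = os.path.split(tmp.name)
        # On a case-insensitive filesystem the upper-cased name
        # resolves to the same file, so it appears to exist.
        return not os.path.exists(os.path.join(head, name.upper()))
```

Calling `is_case_sensitive('.')` from the directory where the archive will be unpacked should return `True` before you run `tar`.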

In [1]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz

--2019-03-25 12:58:45--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.10.48, 2607:f8b0:4006:803::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.10.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2588881044 (2.4G) [application/octet-stream]
Saving to: ‘features.tar.gz’


2019-03-25 12:59:10 (99.1 MB/s) - ‘features.tar.gz’ saved [2588881044/2588881044]



In [2]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv

--2019-03-25 12:59:10--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.10.48, 2607:f8b0:4006:803::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.10.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101468408 (97M) [application/octet-stream]
Saving to: ‘unbalanced_train_segments.csv’


2019-03-25 12:59:14 (28.5 MB/s) - ‘unbalanced_train_segments.csv’ saved [101468408/101468408]



In [1]:
#!tar xvzf features.tar.gz

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd
import glob


To run this on a Linux machine, I used this notebook in [Google Colaboratory](https://colab.research.google.com/).
The following cells upload the necessary files to the Colab instance.
If you are running this locally, you can skip this section.
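
One way to make the Colab-only cells safe to run anywhere is to guard them with an import check. This is a minimal sketch; the helper name `running_in_colab` is hypothetical, and it assumes that the `google.colab` package is normally importable only inside a Colab runtime.

```python
def running_in_colab():
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only the import side effect matters)
        return True
    except ImportError:
        return False


if running_in_colab():
    print("Colab runtime detected: run the upload cells.")
else:
    print("Not in Colab: the upload cells can be skipped.")
```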

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
for k,v in uploaded.items():
  with open(k,'wb') as f:
    f.write(v)

In [None]:
uploaded.keys()

The laughter and non-laughter labels are loaded from local .csv files here. In the cell below, the positive class is read from `cheer_applause_labels.csv` and the negative class from `human_non_laugh_labels.csv`.
To create a subset from a different category of labels, create new files listing the labels you want to select as your positive and negative classes. A list of all the Audioset labels is available [here](class_labels_indices.csv).
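
To make the selection rule concrete, here is a small sketch of the logic that the following cells apply with pandas, rewritten using only the standard library. The label IDs and rows are made-up placeholders, not real Audioset entries. A clip counts as positive if any of its labels is in the positive list, and as negative only if it carries a negative-list label and no positive one.

```python
import re

# Hypothetical label IDs, for illustration only.
positive_ids = ["/m/aaa", "/m/bbb"]
negative_ids = ["/m/ccc"]

# Join the IDs into one alternation pattern, as the notebook does with '|'.join(...).
# re.escape guards against IDs that contain regex metacharacters.
pos_pattern = re.compile("|".join(re.escape(i) for i in positive_ids))
neg_pattern = re.compile("|".join(re.escape(i) for i in negative_ids))


def classify(positive_labels):
    """Return 'positive', 'negative', or None for a comma-separated label string."""
    if pos_pattern.search(positive_labels):
        return "positive"
    if neg_pattern.search(positive_labels):
        return "negative"
    return None


rows = ["/m/aaa,/m/ccc", "/m/ccc,/m/ddd", "/m/ddd"]
print([classify(r) for r in rows])  # ['positive', 'negative', None]
```

Note that this is substring matching: a label ID that happens to be a prefix of another ID would also match the longer one.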

In [6]:
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
laugh_labels = pd.read_csv('cheer_applause_labels.csv',names=['num','label','description'])
not_laugh_labels = pd.read_csv('human_non_laugh_labels.csv',names=['num','label','description'])
l_str = '|'.join(laugh_labels['label'].values)

In [4]:
n_str = '|'.join(not_laugh_labels['label'].values)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

## Eval set

In [5]:
%%time
labels = pd.read_csv('eval_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))


positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0
subset.to_csv('eval_laugh_speech_training_subset.csv')
print(subset.shape[0])


files = glob.glob('audioset_v1_embeddings/eval/*')
subset_ids = subset['# YTID'].values

i=0
writer = tf.io.TFRecordWriter('eval_laugh_speech_subset.tfrecord')
for tfrecord in files:
    for example in tf.compat.v1.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding = 'UTF-8')
        if vid_id in subset_ids:
            writer.write(example)
            i+=1
print(i)

writer.close()

W0325 14:28:04.188743 139774521296704 deprecation.py:323] From <timed exec>:19: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


240
240
CPU times: user 16.5 s, sys: 473 ms, total: 17 s
Wall time: 18 s
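
Because `sample` draws the negative rows at random without a seed, each run produces a different subset. If reproducibility matters, pandas' `DataFrame.sample` accepts a `random_state` argument. The idea can be sketched with the standard library as follows; the helper name `balanced_subset`, the IDs, and the seed value are placeholders.

```python
import random


def balanced_subset(positive, negative, seed=0):
    """Return all positives plus an equally sized random draw of negatives."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # rng.sample raises ValueError if there are fewer negatives than positives.
    return list(positive) + rng.sample(list(negative), len(positive))


pos = ["p1", "p2", "p3"]
neg = [f"n{i}" for i in range(10)]
first = balanced_subset(pos, neg, seed=42)
second = balanced_subset(pos, neg, seed=42)
print(len(first), first == second)  # 6 True
```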


## Train set

In [6]:
#!pip install tqdm
#from tqdm import tqdm

In [4]:
%%capture
from tqdm.autonotebook import tqdm
tqdm()

#### Warning: The Audioset dataset is large, so this will take a while to run. It took about 2 hours to process.
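
Part of the running time may come from the membership test `vid_id in subset_ids`: `subset_ids` is a NumPy array, so each test scans the whole array. Converting the IDs to a Python `set` once makes each lookup a hash check instead. Below is a minimal standard-library sketch with synthetic IDs; the sizes are arbitrary, and the measured speedup will vary by machine.

```python
import timeit

# Synthetic stand-ins for the YouTube IDs; sizes are arbitrary.
ids_list = [f"id{i:07d}" for i in range(10_000)]
ids_set = set(ids_list)   # built once, before the loop
probe = "id9999999"       # not in the subset: worst case for a linear scan

list_time = timeit.timeit(lambda: probe in ids_list, number=200)
set_time = timeit.timeit(lambda: probe in ids_set, number=200)
print(f"list scan: {list_time:.4f}s, set lookup: {set_time:.6f}s")
```

In the notebook this would amount to `subset_ids = set(subset['# YTID'].values)`; the rest of the loop stays the same.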

In [9]:
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
n_str = '|'.join(not_laugh_labels['label'].values)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0
subset.to_csv('laugh_speech_unbal_training_subset.csv')

print(subset.shape[0])

import glob
files = glob.glob('audioset_v1_embeddings/unbal_train/*')
subset_ids = subset['# YTID'].values

i=0
writer = tf.io.TFRecordWriter('unbal_laugh_speech_subset.tfrecord')
for tfrecord in tqdm(files):
    for example in tf.compat.v1.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding = 'UTF-8')
        if vid_id in subset_ids:
            writer.write(example)
            i+=1
print(i)

writer.close()

12658


HBox(children=(IntProgress(value=0, max=4096), HTML(value='')))


12658


In [8]:
labels = pd.read_csv('balanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
n_str = '|'.join(not_laugh_labels['label'].values)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0
subset.to_csv('laugh_speech_bal_training_subset.csv')

print(subset.shape[0])

import glob
files = glob.glob('audioset_v1_embeddings/bal_train/*')
subset_ids = subset['# YTID'].values

print('No of files', len(files))
i=0
writer = tf.io.TFRecordWriter('bal_laugh_speech_subset.tfrecord')
for tfrecord in tqdm(files):
    for example in tf.compat.v1.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding = 'UTF-8')
        if vid_id in subset_ids:
            writer.write(example)
            i+=1
            if i % 100 == 0:
                print(i)
print(i)

writer.close()

256
No of files 4070



W0401 04:31:01.321030 139676783638336 deprecation.py:323] From <ipython-input-8-084ab2e004a8>:21: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


100
200

256


The following cells are only necessary if you want to download the resulting files from Colab.

In [None]:
# Re-import: the name `files` was reassigned to a list of paths by glob.glob above.
from google.colab import files
files.download('bal_laugh_speech_subset.tfrecord')

In [None]:
files.download('eval_laugh_speech_subset.tfrecord')

In [12]:
dataset = tf.compat.v1.python_io.tf_record_iterator('unbal_laugh_speech_subset.tfrecord')

In [13]:
v = 0
for i in dataset:
    v += 1
print(v)

12658


TypeError: object of type 'TFRecordDatasetV2' has no len()
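
The `TypeError` above is what you get when calling `len()` on a `tf.data.TFRecordDataset`: the dataset is read lazily and, as the message reports, has no length. Below is a hedged sketch of counting records with the non-deprecated API that the warnings recommend, assuming TensorFlow 2.x with eager execution; the helper name `count_records` is hypothetical.

```python
def count_records(path):
    """Count the records in a TFRecord file by iterating over it once."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x (eager mode)

    dataset = tf.data.TFRecordDataset(path)
    # The dataset has no len(), so consume it and count the records.
    return sum(1 for _ in dataset)
```

With the subset file in place, `count_records('unbal_laugh_speech_subset.tfrecord')` should report the same number of records as the counting loop above.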