# AudioSet

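1. Extract a 10-second audio clip from the corresponding YouTube video.
2. Compute the STFT of the input audio with win_length = 25 ms, hop_length = 10 ms, and window = hann.
3. Compute the mel spectrogram with 64 mel bands covering 125-7500 Hz.
4. Add an offset of 0.01 to the mel spectrogram (to avoid log(0)) and take the logarithm.
5. Frame these features into non-overlapping examples of 0.96 s, each containing 64 mel bands and 96 frames of 10 ms each (10 ms × 96 = 0.96 s), which gives a 96x64 mel fbank feature.
6. Feed the 96x64 fbank into the VGG model.
7. After four groups of convolution/maxpool layers, the model produces a 128-dimensional output layer.
8. Since 0.96 s ≈ 1 s, 10 seconds of audio yields 10 groups of 128-dimensional features.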
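
Steps 2-5 and the frame count in step 8 can be sketched with NumPy alone. This is a minimal illustration, not the released feature extractor: the 16 kHz sample rate, the 512-point FFT, the HTK-style mel formula, the magnitude (rather than power) spectrogram, and the symmetric Hann window from `np.hanning` are all assumptions that the steps above do not state, and the input is random noise standing in for a real clip.

```python
import numpy as np

SAMPLE_RATE = 16000                      # assumption: not given in the steps above
WIN = int(0.025 * SAMPLE_RATE)           # 25 ms window  -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)           # 10 ms hop     -> 160 samples
N_FFT = 512                              # assumption: next power of two above WIN
N_MELS, F_MIN, F_MAX = 64, 125.0, 7500.0
LOG_OFFSET = 0.01                        # avoids log(0)


def hz_to_mel(freq_hz):
    """HTK-style mel scale (assumed variant)."""
    return 1127.0 * np.log1p(np.asarray(freq_hz, dtype=np.float64) / 700.0)


def mel_filterbank():
    """Triangular mel filters, shape (N_FFT // 2 + 1, N_MELS)."""
    bin_mels = hz_to_mel(np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1))
    edges = np.linspace(float(hz_to_mel(F_MIN)), float(hz_to_mel(F_MAX)), N_MELS + 2)
    bank = np.zeros((N_FFT // 2 + 1, N_MELS))
    for m in range(N_MELS):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_mels - lo) / (mid - lo)
        falling = (hi - bin_mels) / (hi - mid)
        bank[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel(waveform):
    """Steps 2-4: Hann-windowed STFT, mel projection, log with offset."""
    n_frames = 1 + (len(waveform) - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(WIN)
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    return np.log(magnitude @ mel_filterbank() + LOG_OFFSET)


def to_patches(features, frames_per_patch=96):
    """Step 5: non-overlapping 0.96 s patches; leftover frames are dropped."""
    n = features.shape[0] // frames_per_patch
    return features[: n * frames_per_patch].reshape(n, frames_per_patch, -1)


# 10 s of noise as a placeholder for a real AudioSet clip.
wave = np.random.default_rng(0).standard_normal(10 * SAMPLE_RATE)
patches = to_patches(log_mel(wave))
print(patches.shape)  # (10, 96, 64)
```

Under these assumptions a 10-second clip produces 998 STFT frames, which fill 10 complete 96-frame patches. That matches the 10 embeddings per clip described in step 8.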

### Note: for the mel spectrogram computation, see [mfcc-fbank](../mfcc-fbank/mfcc-fbank.ipynb)

In [1]:
import tensorflow as tf
import numpy as np

# Pre-requisites:
Please download the following files before you begin this tutorial:
- [balanced_train_segments.csv](http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv)
- [unbalanced_train_segments.csv](http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv)
- [eval_segments.csv](http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv)
- [128-dimensional audio features](https://research.google.com/audioset/download.html), i.e., the embeddings (about 2 GB in size).

`examples` should contain the YouTube IDs of examples that belong to one class; the cells below keep only the first four matches. Consider the class `Clapping`.

In [2]:
!grep Clapping ../../audioset/class_labels_indices.csv

63,/m/0l15bq,"Clapping"


In [3]:
class_label_index = !grep Clapping ../../audioset/class_labels_indices.csv

In [4]:
print(class_label_index[0].split(",")[1])

/m/0l15bq
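
The same lookup can be done without `grep`, which makes the match exact on the display name instead of a substring search over the whole line. The sketch below is only illustrative: it reads an in-memory stand-in for `class_labels_indices.csv` whose single data row is the one shown above, and the header names (`index`, `mid`, `display_name`) are an assumption about the real file. `find_class` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for class_labels_indices.csv. The header line is an assumption;
# the data row is copied from the grep output above.
SAMPLE_CSV = 'index,mid,display_name\n63,/m/0l15bq,"Clapping"\n'


def find_class(csv_file, display_name):
    """Return (index, mid) for an exact display-name match, or None."""
    for row in csv.DictReader(csv_file):
        if row["display_name"] == display_name:
            return int(row["index"]), row["mid"]
    return None


print(find_class(io.StringIO(SAMPLE_CSV), "Clapping"))  # (63, '/m/0l15bq')
```

With the real file, the `io.StringIO(...)` argument would be replaced by a file opened with `open(path, newline="")`.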


In [37]:
!grep /m/0l15bq ../../audioset/balanced_train_segments.csv |head

0FMdORf5iGs, 30.000, 40.000, "/m/04rlf,/m/081rb,/m/09x0r,/m/0l15bq"
1IxBagCJeZc, 150.000, 160.000, "/m/01j3sz,/m/09x0r,/m/0l15bq"
1_DouJRW3PM, 30.000, 40.000, "/m/028ght,/m/09x0r,/m/0l15bq"
2y9ikTsTsl0, 30.000, 40.000, "/m/028ght,/m/09x0r,/m/0l15bq"
3PliaLqMSqg, 30.000, 40.000, "/m/028ght,/m/09x0r,/m/0l15bq"
3ixOXsKUufM, 30.000, 40.000, "/m/0l15bq"
4mOTOTJLv5U, 0.000, 10.000, "/m/09x0r,/m/0l15bq,/m/0ytgt"
7Ep2a7_sbmc, 260.000, 270.000, "/m/09x0r,/m/0l15bq"
7SpYywlGPyM, 30.000, 40.000, "/m/09x0r,/m/0k65p,/m/0l15bq,/m/0ytgt"
AiGF0850kT8, 6.000, 16.000, "/m/04rlf,/m/0l15bq"


In [38]:
examples = !grep /m/0l15bq ../../audioset/balanced_train_segments.csv | head -4 | cut -c -11
print(examples)

['0FMdORf5iGs', '1IxBagCJeZc', '1_DouJRW3PM', '2y9ikTsTsl0']
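
As an alternative to `grep | head | cut`, the segment list can be parsed with the `csv` module. This is a sketch under assumptions: the sample text reuses three rows from the output above, the leading `#` comment line is an assumption about how the real file starts, and `segments_with_label` is a hypothetical helper. Matching on the split label list avoids selecting a row whose labels merely contain the ID as a substring.

```python
import csv
import io

# Stand-in for balanced_train_segments.csv, built from rows shown above.
# The '#' header line is an assumption about the real file.
SAMPLE_SEGMENTS = '''# YTID, start_seconds, end_seconds, positive_labels
0FMdORf5iGs, 30.000, 40.000, "/m/04rlf,/m/081rb,/m/09x0r,/m/0l15bq"
1IxBagCJeZc, 150.000, 160.000, "/m/01j3sz,/m/09x0r,/m/0l15bq"
3ixOXsKUufM, 30.000, 40.000, "/m/0l15bq"
'''


def segments_with_label(csv_file, mid, limit=None):
    """Return (ytid, start, end) for rows whose label list contains `mid`."""
    rows = (line for line in csv_file
            if line.strip() and not line.startswith("#"))
    matches = []
    # skipinitialspace lets the reader treat `, "..."` as a quoted field.
    for ytid, start, end, labels in csv.reader(rows, skipinitialspace=True):
        if mid in labels.split(","):
            matches.append((ytid, float(start), float(end)))
            if limit is not None and len(matches) == limit:
                break
    return matches


first_two = segments_with_label(io.StringIO(SAMPLE_SEGMENTS), "/m/0l15bq", limit=2)
print([ytid for ytid, _, _ in first_two])  # ['0FMdORf5iGs', '1IxBagCJeZc']
```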


In [39]:
tfrecord_prefixes = {i[:2] for i in examples}

In [40]:
tfrecord_filenames = ["../../audioset/audioset_v1_embeddings/bal_train/" + i + ".tfrecord" for i in tfrecord_prefixes]

In [41]:
audio_embeddings_dict = {}
audio_labels_dict = {}

# Load embeddings. Each record is a serialized tf.train.SequenceExample:
# the context holds 'video_id' and 'labels', and the feature list
# 'audio_embedding' holds one 128-byte (uint8) vector per frame.
# tf.python_io is the TF 1.x name; in TF 2.x the same iterator is
# available as tf.compat.v1.io.tf_record_iterator.
for tfrecord in tfrecord_filenames:
  for example in tf.python_io.tf_record_iterator(tfrecord):
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    context = tf_seq_example.context.feature
    vid_id = context['video_id'].bytes_list.value[0].decode(encoding='UTF-8')
    if vid_id in examples:
      example_label = list(context['labels'].int64_list.value)
      frames = tf_seq_example.feature_lists.feature_list['audio_embedding'].feature
      n_frames = len(frames)
      print(n_frames)
      # Decode each frame's raw bytes as uint8 and cast to float32
      # (same values as tf.decode_raw + tf.cast, without needing a session).
      audio_frame = [np.frombuffer(f.bytes_list.value[0], dtype=np.uint8).astype(np.float32)
                     for f in frames]
      audio_embeddings_dict[vid_id] = audio_frame
      audio_labels_dict[vid_id] = example_label

10
10
10
10


# Each example has multiple labels; the meaning of each index can be looked up in class_labels_indices.csv

In [42]:
print(audio_labels_dict)

{'0FMdORf5iGs': [0, 63, 137, 387], '1_DouJRW3PM': [0, 63, 67], '1IxBagCJeZc': [0, 16, 63], '2y9ikTsTsl0': [0, 63, 67]}
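
Because each clip carries several label indices, a common next step is to turn them into a fixed-length multi-hot target vector. The sketch below is one possible way to do this and is not part of the original notebook: `NUM_CLASSES = 527` is an assumption about the size of the AudioSet class list, `multi_hot` is a hypothetical helper, and the label lists are copied from the output above.

```python
import numpy as np

NUM_CLASSES = 527  # assumption: number of rows in class_labels_indices.csv


def multi_hot(label_indices, num_classes=NUM_CLASSES):
    """Return a float32 vector with 1.0 at every listed class index."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[list(label_indices)] = 1.0
    return vec


# Label lists taken from the printed audio_labels_dict above.
labels = {'0FMdORf5iGs': [0, 63, 137, 387], '1_DouJRW3PM': [0, 63, 67]}
targets = {vid: multi_hot(idx) for vid, idx in labels.items()}

# Index 63 (Clapping) is set for both clips.
print(targets['0FMdORf5iGs'][63], targets['1_DouJRW3PM'][63])
```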


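# The data has roughly one frame per second (each frame covers 0.96 s) over 10 seconds, so each clip is 10 frames of 128-D features: (10, 128)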

In [43]:
import matplotlib.pyplot as plt
feature = np.array(audio_embeddings_dict['0FMdORf5iGs'])

feature.shape

(10, 128)
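
To batch several clips together, every clip needs the same `(10, 128)` shape. All four clips here have 10 frames, but a clip shorter than 10 seconds would yield fewer. The sketch below zero-pads or truncates to a fixed length; `to_fixed_length` is a hypothetical helper, zero-padding is just one possible choice, and the 7-frame clip is made-up placeholder data.

```python
import numpy as np


def to_fixed_length(frames, n_frames=10, dim=128):
    """Zero-pad or truncate a list of per-frame vectors to (n_frames, dim)."""
    out = np.zeros((n_frames, dim), dtype=np.float32)
    arr = np.asarray(frames, dtype=np.float32).reshape(-1, dim)[:n_frames]
    out[: len(arr)] = arr
    return out


short_clip = [np.ones(128, dtype=np.float32)] * 7   # hypothetical 7-frame clip
full_clip = np.ones((10, 128), dtype=np.float32)    # placeholder 10-frame clip

batch = np.stack([to_fixed_length(short_clip), to_fixed_length(full_clip)])
print(batch.shape)  # (2, 10, 128)
```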