# [Exploration 05] Spectrogram Classification
- Task: classify 2-D spectrogram data.
- Compare the performance of a baseline model against a model that uses skip connections.
- Data source: [Kaggle/TensorFlow Speech Recognition Challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge)
## 1. Load data

In [1]:
import numpy as np
import os

data_path = os.getenv("HOME")+'/aiffel/speech_recognition/data/speech_wav_8000.npz'
speech_data = np.load(data_path)

print(speech_data)
print(speech_data.keys())
print(speech_data.values())


<numpy.lib.npyio.NpzFile object at 0x7f0b50ede2d0>
KeysView(<numpy.lib.npyio.NpzFile object at 0x7f0b50ede2d0>)
ValuesView(<numpy.lib.npyio.NpzFile object at 0x7f0b50ede2d0>)


- `speech_data` consists of two arrays: `wav_vals` and `label_vals`.
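Printing an `NpzFile` object (or its `keys()` and `values()` views) only shows the object's memory address, so the array names are not visible in that output. A minimal sketch of how the names could be listed through the `files` attribute follows. It builds a tiny in-memory archive with stand-in arrays (`demo_data` and its contents are hypothetical, not the real dataset) so that it runs without the data file.

```python
import io
import numpy as np

# Build a tiny stand-in archive in memory (hypothetical data, not the real dataset).
buffer = io.BytesIO()
np.savez(
    buffer,
    wav_vals=np.zeros((3, 8000), dtype=np.float32),
    label_vals=np.array([["down"], ["go"], ["silence"]]),
)
buffer.seek(0)

demo_data = np.load(buffer)

# `files` is a plain list of the array names stored in the archive.
array_names = sorted(demo_data.files)
print(array_names)

# Each array is loaded lazily when it is indexed by name.
for name in array_names:
    print(name, demo_data[name].shape, demo_data[name].dtype)
```

With the real file, the same `speech_data.files` call should list the array names directly.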

In [2]:
print("Wave data shape : ", speech_data["wav_vals"].shape)
print("Label data shape : ", speech_data["label_vals"].shape)

Wave data shape :  (50620, 8000)
Label data shape :  (50620, 1)


In [7]:
print(speech_data["wav_vals"][0])
print(len(speech_data["wav_vals"][0]))
print(speech_data["label_vals"])

[-1.27418665e-04 -1.12644804e-04 -1.86756923e-04 ... -1.62762426e-05
 -4.93293861e-04 -3.55132594e-04]
8000
[['down']
 ['down']
 ['down']
 ...
 ['silence']
 ['silence']
 ['silence']]


- The dataset contains 50,620 clips in total.
- Every clip is one second of audio.
- Sampling rate: 8,000 Hz
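The one-second duration follows from dividing the number of samples per clip by the sampling rate. A small sketch of that arithmetic, using a zero-filled stand-in array (`clip` is hypothetical) in place of a real row of `wav_vals`:

```python
import numpy as np

sr = 8000  # sampling rate in Hz, as stated above
clip = np.zeros(8000, dtype=np.float32)  # stand-in for one row of wav_vals

# Duration in seconds = number of samples / samples per second.
duration_sec = clip.shape[0] / sr

# Time stamp (in seconds) of every sample, useful as an x-axis when plotting.
time_axis = np.arange(clip.shape[0]) / sr

print(duration_sec)
print(time_axis[0], time_axis[-1])
```

The last time stamp is slightly below 1.0 because the samples cover the interval from 0 up to, but not including, one second.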


- Inspect a random sample

In [9]:
import IPython.display as ipd
import random

rand = random.randint(0, len(speech_data["wav_vals"]) - 1)  # randint includes both endpoints
print("rand num : ", rand)

sr = 8000 # sampling rate: number of samples per second
data = speech_data["wav_vals"][rand]
print("Wave data shape : ", data.shape)
print("label : ", speech_data["label_vals"][rand])

ipd.Audio(data, rate=sr)

rand num :  32476
Wave data shape :  (8000,)
label :  ['on']


## 2. Data preprocessing
- Convert the text labels to integer indices
- Split the dataset



- Check the label data

In [16]:
labels = speech_data["label_vals"]
np.unique(labels)

array(['down', 'go', 'left', 'no', 'off', 'on', 'right', 'silence',
       'stop', 'unknown', 'up', 'yes'], dtype='<U7')

- Append 'unknown' and 'silence' to the ten target words.
- Assign an integer index to each label.

In [22]:
target_list = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']

label_value = list(target_list)  # copy, so that target_list itself is not modified
label_value.append('unknown')
label_value.append('silence')

print('LABEL : ', label_value)

new_label_value = dict()
for i, l in enumerate(label_value):
    new_label_value[l] = i
label_value = new_label_value

print('Indexed LABEL : ', label_value)

LABEL :  ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence']
Indexed LABEL :  {'yes': 0, 'no': 1, 'up': 2, 'down': 3, 'left': 4, 'right': 5, 'on': 6, 'off': 7, 'stop': 8, 'go': 9, 'unknown': 10, 'silence': 11}


- Use the index mapping built above to convert the text labels to integers.

In [21]:
temp = []
for v in speech_data["label_vals"]:
    temp.append(label_value[v[0]])
label_data = np.array(temp)

label_data

array([ 3,  3,  3, ..., 11, 11, 11])
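The same conversion can be written more compactly, and an inverse mapping is handy later for turning predicted indices back into words. The following is a sketch under assumptions: `raw_labels` is a small stand-in for `speech_data["label_vals"]`, and the names `label_to_index` and `index_to_label` are introduced here only for illustration.

```python
import numpy as np

# Same mapping as the indexed labels printed above.
label_to_index = {
    'yes': 0, 'no': 1, 'up': 2, 'down': 3, 'left': 4, 'right': 5,
    'on': 6, 'off': 7, 'stop': 8, 'go': 9, 'unknown': 10, 'silence': 11,
}

# Stand-in for speech_data["label_vals"], which has shape (N, 1).
raw_labels = np.array([['down'], ['yes'], ['silence']])

# ravel() flattens (N, 1) to (N,), so each element is a single label string.
label_indices = np.array([label_to_index[name] for name in raw_labels.ravel()])
print(label_indices)

# Inverse mapping, for decoding model predictions back to words.
index_to_label = {index: name for name, index in label_to_index.items()}
print([index_to_label[i] for i in label_indices])
```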

- Data split: use `train_test_split` from `sklearn.model_selection`.

In [25]:
from sklearn.model_selection import train_test_split

sr = 8000
train_wav, test_wav, train_label, test_label = train_test_split(speech_data["wav_vals"], 
                                                                label_data, 
                                                                test_size=0.1,
                                                                shuffle=True)
#print(train_wav)

train_wav = train_wav.reshape([-1, sr, 1]) # add a channel dimension so the data can be fed to a CNN
test_wav = test_wav.reshape([-1, sr, 1])

print("train data : ", train_wav.shape)
print("train labels : ", train_label.shape)
print("test data : ", test_wav.shape)
print("test labels : ", test_label.shape)

train data :  (45558, 8000, 1)
train labels :  (45558,)
test data :  (5062, 8000, 1)
test labels :  (5062,)
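Because the split is random, it can be worth checking that the train and test sets have similar class proportions. A minimal sketch of such a check, using toy label arrays (`toy_train`, `toy_test`, and the helper `class_fractions` are hypothetical names, not part of the original notebook):

```python
import numpy as np

def class_fractions(labels, num_classes):
    """Return the fraction of samples belonging to each class index."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

# Toy integer labels standing in for train_label and test_label.
toy_train = np.array([0, 0, 1, 1, 2, 2, 2, 2])
toy_test = np.array([0, 1, 2, 2])

train_frac = class_fractions(toy_train, num_classes=3)
test_frac = class_fractions(toy_test, num_classes=3)

print("train fractions :", train_frac)
print("test fractions  :", test_frac)
```

With the real data, the call would be `class_fractions(train_label, 12)`. If the proportions differ noticeably, `train_test_split` accepts a `stratify` argument (for example `stratify=label_data`) that keeps the class ratios equal across the two sets, and a `random_state` argument that makes the split reproducible.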
