#VoiceRecognition wav file exploration and understanding

This notebook is to understand the data format and layout of recording using Python Wave, wav files, and numpy arrays all representing the same underlying audio

##1) Record some audio to investigate with

###Prep
Import what we need

In [10]:
import wave
import numpy as np
import struct
from array import array
import sys
import pyaudio

###Config for the stream

In [11]:
THRESHOLD = 150  # Originally 500, but new is_silent() works better with this value
CHUNK_SIZE = 1024
FORMAT = pyaudio.paInt16
RATE = 44100

RECORD_SECONDS = 5

###Set up and record a stream from the mic


In [13]:
p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=1, rate=RATE,
        input=True, output=True,
        frames_per_buffer=CHUNK_SIZE)

print("* recording")

frames = []

for i in range(0, int(RATE / CHUNK_SIZE * RECORD_SECONDS)):
    stream_data = stream.read(CHUNK_SIZE)
    frames.append(stream_data) # 2 bytes(16 bits) per channel
    # print len(data)

print("* done recording")

stream.stop_stream()
stream.close()
p.terminate()

* recording
* done recording


###Display some information on what we've recorded

*len(data) will show us how much data is captured durectly from stream.read() based on our CHUNK_SIZE
*len(frames) will show us how much data has been recorded directly into our frames buffer

In [14]:
print "Data recorded for each stream.read(): ", len(stream_data)
print "Total number of frames captured in frames: ", len(frames)


Data recorded for each stream.read():  2048
Total number of frames captured in frames:  215


In [15]:
print "Number of CHUNKs recorded in total:", len(stream_data) * len(frames)

Number of CHUNKs recorded in total: 440320


Now display some of the data that we've captured to get a flavour of what it looks like

In [16]:
print "Data from stream.read():"
print stream_data
print "\nData in frames:"
print frames[:2]

Data from stream.read():
                                 ��    ������    ��    ������      ��        ��    ��      ����������  ����                ��                    ��     ��  ��                                     ��      ����    ����    ����    ����                                                                ��    ��    ��      ����   ��  ��    ��        ��                    ��                  ����    ��       ��                        ��     ��         ��������������        ��             ��    ����    ����    ��    ����            ��                                                     ��                                    ��                  ��      ������    ��                     ��  ��  ����              ����  ����  ��  ��������������   ����          ��                                  ��������                         ������������������������    ������          ���������

###Write this file to disk for later

In [17]:
wf = wave.open("testing.wav", 'wb')
wf.setnchannels(1)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes('b'.join(frames))
wf.close()

###Using Numpy
Let's view some of this data in numpy as that's have we're going to manipulate it.

1) First we need to convert the frames data from integer into string (or can we get around this without as the stream.read() call returns a string in data???

In [18]:
fs = ",".join(str(f) for f in frames)
audio_data = np.array(struct.unpack_from(
        "%dh" % len(frames)*CHUNK_SIZE, 
        fs))

# Now print some stats about what we've just created in np and a snippet of the numpy array
print audio_data.shape
print audio_data[4000:5000]

(220160,)
[  768   256     0   256   256  -256    -1   511     0     0     0     0
     0     0     0   256  -256   511   768   256     0     0     0   512
   256   256     0   256   256   256     0     0     0  -512    -1   255
   512     0  -512   255     0     0  1024   512     0   768     0     0
   256   512     0     0     0     0     0     0     0     0     0     0
   512   256   512   512   256   512   768   768   768  1024  1024  1024
   768   768   768   256     0     0     0     0     0     0   256   768
   768   768   768   768   512     0     0   256   256   256   768     0
     0 11264     0     0     0    -2    -1    -1    -2    -2    -3    -1
     0    -1    -2    -2    -4    -2     0    -1    -1    -1     0     2
     0     0     1     0     2     2     3     2     0     0    -2    -1
     0    -1     0     0     0     0     0     2     2     1     1     2
     1     2     2     1     1     0     0     2     0     0     0     1
     2     0     1     1     0     1     

##2) Now start to process features
In this part of the code we are going to extract the features from the audio following a [Mel Frequency Cepstrum](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) algorithm so that we can analyse them.


We are now going to use my modified version of [Hanoop Hallur's](https://github.com/anooprh/PyOhio-Prsesentation) code to extract features from the audio into a vector. For testing purposes we'll also write the feature vectors to a file in case we want to use them later for training or testing purposes.

In [23]:
from audioUtils import extractfeatures as features
from audioUtils import audioconfig as config
import time

sample_time = time.time()
features = features.extractfeatures(audio_data)
features_filename = "features.data/features-" + str(sample_time) + ".data"
print "Writing features to " + features_filename
np.savetxt(features_filename, features)

Writing features to features.data/features-1460954637.68.data
