<a href="https://colab.research.google.com/github/marathomas/Protein-secondary-structure-prediction/blob/master/meerkat_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Meerkat Preprocessing

In this script, I preprocess audio data so that I can reproduce, with meerkat vocalizations, the unsupervised clustering analysis performed by Sainburg, Thielk and Gentner (2019) (https://www.biorxiv.org/content/10.1101/870311v1.full.pdf). I am planning to use AVGN, the package made available by Tim Sainburg (https://github.com/timsainb/avgn_paper), which should allow me to reproduce the analyses in the paper. Therefore, I need to bring my data into the format that the AVGN analysis functions expect.

Thus, the aim of this script is to generate four types of files: 
- fileID_call[number].WAV files, each containing a meerkat call
- fileID_noise[number].WAV files, each containing noise recorded before or after a meerkat call
- fileID_call[number].json files, each containing the metadata for a meerkat call
- fileID_label.csv files

The inputs to this script are long audio files, usually containing up to 300 calls, and csv files containing the manually annotated start times, durations and labels of the meerkat calls.

## Prerequisites

- Project folder must already exist; its path is saved as PROJECT_PATH
- Project folder must contain subfolder called "in_labels", containing all label tables in csv format 
- Project folder must contain subfolder called "in_wavs", containing all audio files
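
The prerequisites above can be checked before running anything else. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_subdirs` is made up for illustration, and the demo runs on a temporary folder rather than on the real project folder.

```python
import os
import tempfile

# Subfolder names taken from the prerequisites list above
REQUIRED_SUBDIRS = ["in_labels", "in_wavs"]

def missing_subdirs(project_path, required=REQUIRED_SUBDIRS):
    """Return the required subfolders that do not exist under project_path."""
    return [name for name in required
            if not os.path.isdir(os.path.join(project_path, name))]

# Demo on a throwaway folder that only contains "in_wavs"
with tempfile.TemporaryDirectory() as demo_project:
    os.mkdir(os.path.join(demo_project, "in_wavs"))
    demo_missing = missing_subdirs(demo_project)

print(demo_missing)  # ['in_labels']
```

On the real data, calling the helper with PROJECT_PATH should return an empty list before the rest of the notebook is run.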

### Mounting drive

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Or select "Mount Drive" in Files menu!

### Installing and loading libraries

In [0]:
import os

In [3]:
os.system('pip install pydub')

0

("Currently, software installations within Google Colaboratory are not persistent, in that you must reinstall libraries every time you (re-)connect to an instance.")

In [0]:
import os
import sys
import re
import glob
import statistics
from datetime import datetime, time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
import librosa.display
from pydub import AudioSegment
from IPython.display import Audio
from pandas.core.common import flatten  # pandas-internal helper, used below to flatten nested lists

### Setting constants

Setting project, input and output folders.

In [0]:
# these directories should already exist
PROJECT_PATH = "/content/drive/My Drive/meerkat/" 
AUDIO_IN = PROJECT_PATH+"in_wavs/" 
LABELS_IN = PROJECT_PATH+"in_labels/" 

# these directories are created during execution
AUDIO_OUT = PROJECT_PATH+"segmented_audios/" 
LABELS_OUT = PROJECT_PATH+"labels/"
JSON_OUT = PROJECT_PATH+'json_files/'
NOISE_OUT = PROJECT_PATH+'noise_files/'

dirs2create = [AUDIO_OUT, LABELS_OUT, JSON_OUT, NOISE_OUT]

In [0]:
noise_params = {
    "min_noise_ms": 1000,
    "max_noise_ms": 2000
}

Constants for parsing label files:

- column names in labels CSV that indicate start and duration
- irrelevant labels that are discarded in the process (labels for beeps, noise, synch calls..)
- minimum call duration in ms

In [0]:
START_COL = 'Start'
DUR_COL = 'Duration'

IRRELEVANT_LABELS = ['SYNCH', 'START', 'END', 'NOISE', 'BEEP', 'CHEW']
IRRELEVANT_LABELS = IRRELEVANT_LABELS+[item.lower() for item in IRRELEVANT_LABELS]

MIN_DURATION = 100 
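
To see what the irrelevant-label filter will and will not catch, here is a small sketch using only the `re` module. It assumes that the later pandas call (`str.contains` with the labels joined by `|`) behaves like a regular-expression search on each label name, which is its default mode. The label names in `example_labels` are invented for illustration.

```python
import re

IRRELEVANT = ['SYNCH', 'START', 'END', 'NOISE', 'BEEP', 'CHEW']
IRRELEVANT = IRRELEVANT + [item.lower() for item in IRRELEVANT]

# Same pattern that is later passed to str.contains
pattern = re.compile('|'.join(IRRELEVANT))

# Hypothetical label names, for illustration only
example_labels = ['cc', 'START', 'sn noise', 'Marker END', 'weekend', 'Synch']

# Keep a label only if none of the irrelevant words occurs anywhere in it
kept = [name for name in example_labels if not pattern.search(name)]

print(kept)  # ['cc', 'Synch']
```

Two things are worth noting: the match is a substring match, so a label such as 'weekend' is dropped because it contains 'end', and mixed-case spellings such as 'Synch' are not caught because only the all-uppercase and all-lowercase forms are listed.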

### Installing AVGN

Cloning the AVGN_paper repository:

In [20]:
# not sure if this works, need to test it
os.chdir(PROJECT_PATH)

if (not os.path.exists(PROJECT_PATH+'avgn_paper')):
  # git clone creates the avgn_paper folder itself
  os.system('git clone https://github.com/timsainb/avgn_paper.git')

# install the package in development mode from inside the repository
os.chdir("avgn_paper")
os.system('python setup.py develop')

0

In [0]:
# jupyter notebook style:

#os.mkdir('avgn_paper')
#! git clone https://github.com/timsainb/avgn_paper.git
#os.chdir('avgn_paper')
#!python setup.py develop

In [0]:
# jupyter notebook style
#!pip install pathlib2

In [0]:
# need to check if this works:
os.system('pip install pathlib2')

In [0]:
import pathlib2

In [0]:
from importlib.machinery import SourceFileLoader
#avgn = SourceFileLoader('avgnpaper/avgn', join(PROJECT_PATH, 'utils/somelib.py')).load_module()
avgn = SourceFileLoader('avgnpaper/avgn', PROJECT_PATH+'avgn_paper/avgn/'+'utils/__init__.py').load_module()
import avgn


### Creating output directories

In [0]:
os.chdir(PROJECT_PATH)

for dirpath in dirs2create:
  # exist_ok=True avoids an error if the folder is already there
  os.makedirs(dirpath, exist_ok=True)

## Functions

What I need are short .wav files, each containing one single vocalization of a meerkat. What I have are long .wav files, containing many vocalizations and periods of silence (noise), and a label file (CSV) indicating at what time vocalisations occur (and what type of vocalisation they are). In addition, I need a JSON metadata file for each vocalization .wav file.

In [0]:
os.chdir(PROJECT_PATH)

### General

In [0]:
# Function that gets fileID from csv filename
# Input: csv_filename (not path!) (String)
# Output: csv_filename up to the last numeric character
#         (unchanged if it contains no numeric character)

def fileID_from_csv_filename(csv_in):
  reversed_name = csv_in[::-1]  # reverse string so that we can scan from the end
  for pos, char in enumerate(reversed_name):
    if char.isdigit():
      # drop everything behind the last digit, then reverse to normal again
      return reversed_name[pos:][::-1]
  return csv_in
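
As a cross-check of the function above, the same rule (keep everything up to and including the last digit) can be written with a regular expression. This is only an alternative sketch; the name `file_id_from_name` is made up, and the example filename is one of the label files used later in this notebook.

```python
import re

def file_id_from_name(filename):
    """Return filename up to and including its last digit (unchanged if no digit)."""
    # '.*' is greedy, so the match extends to the LAST digit in the string
    match = re.match(r'.*\d', filename, flags=re.DOTALL)
    return match.group(0) if match else filename

example = "HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163_label.CSV"
example_id = file_id_from_name(example)

print(example_id)
# HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163
```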

### Functions for parsing label files

First, I'll parse the csv label files into a pandas dataframe. Then, because PyDub segments audio in milliseconds, I have to convert the times from the format h:min:s.ms to milliseconds, so that I have the start and stop times of the calls in milliseconds (e.g. 10032002-10032144).

In [0]:
# Function that gets datetime object from timestring
# timestring must match one of the given time_patterns
# Input: some string containing a time (String)
# Output: datetime object
# Example usage: dt = get_time("01:02:30.555")
def get_time(timestring):
    time_patterns = ['%H:%M:%S.%f', '%M:%S.%f']
    for pattern in time_patterns:
        try:
            return datetime.strptime(timestring, pattern)
        except (ValueError, TypeError):
            # timestring does not match this pattern, try the next one
            pass

    # raise an error instead of calling sys.exit(0), which would signal success
    raise ValueError("Time is not in expected format: "+str(timestring))

# Function that converts time in datetime object to ms
# Input: datetime (datetime.datetime)
# Output: time in ms (float)
# Example usage: ms = get_ms(datetime_obj)
def get_ms(dt):
    return dt.microsecond/1000+dt.second*1000 + dt.minute*60*1000 + dt.hour*60*60*1000
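
A worked example of the conversion, written as a self-contained sketch so that it can be run on its own. The helper name `to_ms` and the two time strings are chosen for illustration; the patterns are the ones used above.

```python
from datetime import datetime

def to_ms(timestring):
    """Convert 'h:min:s.ms' or 'min:s.ms' to milliseconds."""
    for pattern in ('%H:%M:%S.%f', '%M:%S.%f'):
        try:
            dt = datetime.strptime(timestring, pattern)
        except ValueError:
            continue  # try the next pattern
        return (dt.hour * 3600 + dt.minute * 60 + dt.second) * 1000 + dt.microsecond / 1000
    raise ValueError("Unexpected time format: " + timestring)

start_ms = to_ms("01:02:30.555")    # 1 h + 2 min + 30 s + 555 ms
duration_ms = to_ms("00:00.142")    # only matches the min:s.ms pattern
stop_ms = start_ms + duration_ms

print(start_ms, duration_ms, stop_ms)  # 3750555.0 142.0 3750697.0
```

Because the fractional part is parsed with `%f`, a value such as "30.5" is read as 500 ms. One limitation of this approach is that `%H` only accepts hours from 0 to 23, so recordings longer than 24 hours would need a different parser.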


In [0]:
# Function that generates labels dataframe from csv file
# - adds start and stop times of calls in milliseconds
# - removes irrelevant labelled sections (noise, synch, beep etc.)
# - removes labelled sections below minimum duration
# Input: filepath to label csv (String)
# Output: pandas dataframe, each row representing one call
# Example usage: labels = prep_labels(label_filepath)

def prep_labels(label_filepath):
  print("Parsing...")
  # read in labels (the files are tab-separated despite the .csv extension)
  labels = pd.read_csv(label_filepath, sep="\t")
  
  # Remove irrelevant labels
  # find name of column that contains the labels. Should contain 'Name'
  name_col = [col for col in labels.columns if 'Name' in col]
  # hopefully only one result
  if(len(name_col)==1):
    name_col = name_col[0]
    # .copy() so that new columns can be added without a SettingWithCopyWarning
    labels = labels[~labels[name_col].str.contains('|'.join(IRRELEVANT_LABELS))].copy()
  else:
    print("Cannot find label name column")
  
  # Add start stop ms (column names are set in START_COL and DUR_COL)
  if (labels.shape[0]!=0):
    labels['start_ms'] = labels.apply(lambda row: get_ms(get_time(row[START_COL])), axis=1)
    labels['duration_ms'] = labels.apply(lambda row: get_ms(get_time(row[DUR_COL])), axis=1)
    labels['stop_ms'] = labels['start_ms']+labels['duration_ms']

    # Remove super short calls (possibly mistakes?)
    labels = labels.loc[labels['duration_ms'] >= MIN_DURATION].copy()
  
  return labels

### Functions for segmenting audio files

I'll segment the audio files based on the timings given in the label files.

In [0]:
# Function that generates audio chunks based on a label file that
# provides start and stop times in ms
# Input: filepath to audiofile (.wav) (String), 
#        filepath to labelsfile (.csv) (String)
# Output: None, audio chunks are exported in current working directory, named
#         filename_call[number].wav (numbered 0,1,2...)
# Example usage: generate_audio_chunks(audio_filepath, label_filepath)

def generate_audio_chunks(audio_filepath, label_filepath):
  
  print("Processing "+os.path.basename(audio_filepath))

  # Parse labels
  labels = prep_labels(label_filepath)
  
  # If labels is non-empty...
  if (labels.shape[0]!=0):
    # Create audio chunks based on start and stop ms 
    print("Chunking...")
    audio_filename = os.path.basename(audio_filepath)
    audio = AudioSegment.from_wav(audio_filepath)
    chunks = labels.apply(lambda row: (audio[row['start_ms']:row['stop_ms']]), axis = 1)

    # export chunks in current working directory
    chunks.index=range(chunks.shape[0])
    for index, content in chunks.items():
      content.export((audio_filename[:-4]+"_call"+str(index)+".wav"), format="wav")
  
    # save modified labels file (rows are in the same order as the exported chunks)
    labels['audio_file'] = [audio_filename[:-4]+"_call"+str(i)+".wav" for i in range(labels.shape[0])]
    labels.to_csv(LABELS_OUT+audio_filename[:-4]+"_labels.csv")

  else:
    print("No labelled calls for "+os.path.basename(audio_filepath))
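
To make the millisecond-based slicing concrete without needing PyDub or real recordings, here is a sketch that does the same cut with Python's built-in `wave` module. This is a swapped-in technique for illustration, not what the function above uses: the helper name `slice_wav`, the synthetic 440 Hz tone and the 8000 Hz sample rate are all assumptions of the demo.

```python
import math
import os
import struct
import tempfile
import wave

def slice_wav(src_path, dst_path, start_ms, stop_ms):
    """Copy the samples between start_ms and stop_ms into a new wav file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # convert milliseconds to frame positions
        start_frame = int(start_ms * rate / 1000)
        n_frames = int(stop_ms * rate / 1000) - start_frame
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)     # frame count in the header is corrected on close
        dst.writeframes(frames)

# Demo: write a 1 s mono 16-bit tone, then cut out 250-750 ms
with tempfile.TemporaryDirectory() as tmp:
    long_path = os.path.join(tmp, "long.wav")
    call_path = os.path.join(tmp, "long_call0.wav")
    rate = 8000
    samples = [int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
    with wave.open(long_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))

    slice_wav(long_path, call_path, 250, 750)

    with wave.open(call_path, "rb") as call:
        call_duration_s = call.getnframes() / call.getframerate()

print(call_duration_s)  # 0.5
```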

### Functions for generating noise files

In [0]:
# Function to generate noise file for a call wav
# Input: filepath to call wav (String), 
#        filepath to full wav (String), 
#        filepath to label csv (String),
#        fileID (String), 
#        Parameters for generating noise file (Dict)
# Output: returns file name of noise wav file or NA (String), 
#         generates noise file in NOISE_OUT directory
# Comments: - uses subfunctions extract_noise_pre and extract_noise_post
#           - adapted from Sainburg (AVGN)

def generate_noisewav(call_filepath, wav_filepath, label_filepath, fileID, noise_params):

  noise_file = "NA"  
  call_filename = os.path.basename(call_filepath)
  # not pretty but should work to get the call number
  # [:-4] to remove .wav, split _call to get number behind _call
  # (could also take row number of label table)
  call_num = call_filename[:-4].split("_call")[1]
  label_table = pd.read_csv(label_filepath)

  # these need to be converted to seconds
  call_start = label_table.loc[label_table['audio_file']==call_filename, 'start_ms'].values[0]/1000
  call_end = label_table.loc[label_table['audio_file']==call_filename, 'stop_ms'].values[0]/1000
  
  min_noise_size = noise_params["min_noise_ms"]/1000
  max_noise_size = noise_params["max_noise_ms"]/1000

  all_call_ends = np.asarray(label_table['stop_ms'])/1000

  # try to get noise from pre call
  noise_clip, sr = extract_noise_pre(call_start, call_end, wav_filepath, all_call_ends, min_noise_size, max_noise_size)
  if noise_clip is None:
    # try to get noise from post call
    all_call_starts = np.asarray(label_table['start_ms'])/1000
    noise_clip, sr = extract_noise_post(call_start, call_end, wav_filepath, all_call_starts, min_noise_size, max_noise_size)
  
  # save noise file (if one could be generated)
  # note: librosa.output.write_wav was removed in librosa 0.8,
  # so this call requires librosa < 0.8
  if noise_clip is not None:
    librosa.output.write_wav(NOISE_OUT+fileID+'_noise'+call_num+'.wav', y=noise_clip, sr=sr, norm=True)
    noise_file = fileID+'_noise'+call_num+'.wav'

  return noise_file


def extract_noise_pre(call_start, call_end, wav_filepath, all_call_ends, min_noise_size, max_noise_size):
  # try to get a noise clip from the time preceding this clip
  if call_start > min_noise_size:
    # get time of preceding pulses
    td = call_start - all_call_ends
    td = td[td > 0]
    # if there is anything within this timeframe, this timeframe is unusable
    if not np.any(td < min_noise_size):
      # get times for noise clip
      noise_start = call_start - np.min(
          list(td - 1) + [max_noise_size]
          )
      noise_end = call_start

      # load the clip
      noise_clip, sr = librosa.load(
          wav_filepath,
          mono=True,
          sr=None,
          offset=noise_start,
          duration=(noise_end - noise_start),
          )
      return noise_clip, sr
  return None, None


def extract_noise_post(call_start, call_end, wav_filepath, all_call_starts, min_noise_size, max_noise_size):
  # try to get noise clip from end of file
  wav_duration = (librosa.get_duration(filename=wav_filepath))
  if wav_duration - call_end > min_noise_size:
    td = all_call_starts - call_end
    td = td[td > 0]
    if not np.any(td < min_noise_size):
      # get times for noise clip
      noise_start = call_end
      noise_end = call_end + np.min(
          list(td - min_noise_size / 2)
          + [max_noise_size]
          )
      # load the clip
      noise_clip, sr = librosa.load(
          wav_filepath,
          mono=True,
          sr=None,
          offset=noise_start,
          duration=(noise_end - noise_start),
          )
      return noise_clip, sr
  return None, None


### Functions for generating JSON files

In [0]:
# Function to get the meerkat ID (alphanumeric String) from filename
# filename is always HM_meerkatID_*.extension
# Input: filename (String)
# Output: meerkat ID (String)
# Example use: get_meerkatID('HM_VHMM003_HLT_AUDIO_R12_file_5_(2017_08_06-06_44_59)_ASWMUX221102.wav')

def get_meerkatID(filename):
  meerkatID = filename.replace('HM_','')
  meerkatID = str.split(meerkatID, sep='_')[0]
  return meerkatID

# TODO
# Function to generate JSON file for call wav
# Input: filepath to call wav (String), filepath to full wav (String), filepath to label csv (String),
#        fileID (String), noise wav filename (String, optional)
# Output: placeholder (returns 0 for now); will generate JSON files in JSON_OUT directory
def generate_json(call_filepath, wav_filepath, label_filepath, fileID, noise_wav=None):
  return 0
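
Since the JSON function is still a TODO, here is a rough sketch of what one metadata file per call could look like, using only the standard `json` module. The field names follow my unfinished draft in the leftover code at the end of this notebook; whether AVGN needs exactly these fields is an assumption that still has to be checked against the AVGN code. The helper names `meerkat_id` and `write_call_json`, the sample rate and the call length in the demo are made up for illustration.

```python
import json
import os
import tempfile

def meerkat_id(filename):
    # mirrors get_meerkatID above: file names look like HM_<meerkatID>_*.wav
    return filename.replace("HM_", "").split("_")[0]

def write_call_json(call_filename, samplerate_hz, length_s, noise_wav, json_dir):
    """Write one metadata file for one call wav and return its path."""
    metadata = {
        "species": "Suricata suricatta",
        "common_name": "Meerkat",
        "indv": meerkat_id(call_filename),
        "samplerate_hz": samplerate_hz,
        "length_s": length_s,
        "wav_loc": call_filename,
        "noise_loc": noise_wav,
    }
    json_name = os.path.splitext(call_filename)[0] + ".json"
    json_path = os.path.join(json_dir, json_name)
    with open(json_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return json_path

# Demo with invented values, written to a temporary folder instead of JSON_OUT
demo_call = "HM_VHMM003_HLT_AUDIO_R12_file_5_(2017_08_06-06_44_59)_ASWMUX221102_call0.wav"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_call_json(demo_call, 8000, 0.142, "NA", tmp)
    with open(demo_path) as f:
        loaded = json.load(f)

print(loaded["indv"])  # VHMM003
```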

## Processing files

### Setting variables

Getting list of csvs and matching wavs

In [0]:
# Getting list of fileIDs, wavs and csvs (fileID, in_wav_loc and in_csv_loc)

in_csv_loc = glob.glob(LABELS_IN+'*.csv') + glob.glob(LABELS_IN+'*.CSV')
csv_filenames = [os.path.basename(csv) for csv in in_csv_loc]

fileIDs = [fileID_from_csv_filename(csv_filename) for csv_filename in csv_filenames]

in_wav_loc = [glob.glob(AUDIO_IN+fileID+'*') for fileID in fileIDs] # creates list of lists
in_wav_loc = [["NA"] if not x else x for x in in_wav_loc] # Replace empty lists with "NA"
in_wav_loc = list(flatten(in_wav_loc)) # Flatten list
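
One caveat of the matching above: if a fileID ever matched more than one wav, flattening would make in_wav_loc longer than in_csv_loc and the later zip would pair the wrong files. The following sketch shows a stricter variant that always yields exactly one entry per fileID. The helper name `match_wavs` and the file names in the demo are made up for illustration.

```python
import os
import tempfile

def match_wavs(file_ids, audio_dir):
    """Map each fileID to the first file in audio_dir that starts with it, or "NA"."""
    names = sorted(os.listdir(audio_dir))
    matches = {}
    for file_id in file_ids:
        hits = [name for name in names if name.startswith(file_id)]
        matches[file_id] = os.path.join(audio_dir, hits[0]) if hits else "NA"
    return matches

# Demo on a temporary folder with two empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["rec_01.wav", "rec_02.wav"]:
        open(os.path.join(tmp, name), "w").close()
    matched = match_wavs(["rec_01", "rec_03"], tmp)
    matched_names = {key: os.path.basename(value) for key, value in matched.items()}

print(matched_names)  # {'rec_01': 'rec_01.wav', 'rec_03': 'NA'}
```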

### Segmenting audio files

Segmenting the audio files in AUDIO_IN into smaller chunks, each containing one call. Start and stop times of calls are taken from label files (csv) in LABELS_IN folder. 

generate_audio_chunks generates audio chunks based on a label file that provides start and stop times in ms. It takes the filepath to an audio file (.wav) and the filepath to a label file (.csv) as input, both as strings. It returns nothing; the audio chunks are exported to the current working directory (set to AUDIO_OUT in the next cell) and named filename_call[number].wav (numbered 0,1,2...).

In [0]:
os.chdir(AUDIO_OUT)

In [0]:
for wav, csv in zip(in_wav_loc, in_csv_loc):
  if not wav=='NA':
    generate_audio_chunks(wav, csv)
  else:
    print("No matching audio for: "+csv)

In [18]:
print("Generated "+str(len(glob.glob(LABELS_OUT+'*')))+" label files")
print("Generated "+str(len(glob.glob(AUDIO_OUT+'*')))+" audio chunks")

Generated 32 label files
Generated 16116 audio chunks


Segmented audios were generated for all files, except:
- HM_VHMM002_HRT_AUDIO_R09_file_5_(2017_08_06-06_44_59)_ASWMUX221110 (where all labels were something other than calls.)

- HM_VHMM003_SOUNDFOC_20170905_2
- HM_VLF206_SOUNDFOC_20170903
- HM_VHMM003_SOUNDFOC_20170905_4
- HM_VHMM003_SOUNDFOC_20170905_3
- HM_VLF206_SOUNDFOC_20170905_2
- HM_VLF206_SOUNDFOC_20170905_1

where there was no matching wav file found.

### Removing bad files

All files that are not usable for the analysis should be removed at this point, because the later preprocessing steps are not written to handle files for which the outputs of earlier steps are missing.

The files with low recording quality that need to be removed are:
- HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163
- HM_VHMM006_RT_AUDIO_R14_file_5_(2017_08_06-06_44_59)_ASWMUX221052

as well as the one without calls, which was:

- HM_VHMM002_HRT_AUDIO_R09_file_5_(2017_08_06-06_44_59)_ASWMUX221110


In [0]:
files2delete = ['HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163', 
                'HM_VHMM006_RT_AUDIO_R14_file_5_(2017_08_06-06_44_59)_ASWMUX221052',
                'HM_VHMM002_HRT_AUDIO_R09_file_5_(2017_08_06-06_44_59)_ASWMUX221110']

In [0]:
# Removing segmented audios
for file in files2delete:
  fileList = glob.glob(AUDIO_OUT+file+'*.wav')
  # Iterate over the list of filepaths & remove each file.
  for filePath in fileList:
    try:
      os.remove(filePath)
    except OSError:
      print("Error while deleting file : ", filePath)

In [0]:
print("Remaining: "+str(len(glob.glob(AUDIO_OUT+'*')))+" audio chunks")

In [0]:
# Removing labels files and audio in files
for file in files2delete:
  try:
    os.remove(LABELS_OUT+file+'_labels.csv')
  except OSError:
    print("Error while deleting csv for file : ", file)
  try:
    os.remove(AUDIO_IN+file+'.wav')
  except OSError:
    print("Error while deleting wav for file : ", file)


In [73]:
print(len(os.listdir(LABELS_OUT)))
print(len(os.listdir(AUDIO_IN))) 

30
30


### Generating noise files

Generating the noise files: each one is a wav of the 1-2 s before or after a call, provided that no other call occurs in this time window. They can be used later to denoise the call wav. The noise files are saved in the NOISE_OUT folder and named: 
- fileID_noise[call_number].wav

A column 'noise_wav' is added to fileID_labels.csv, containing either the noise filename or "NA" if no noise file could be generated.
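
The window selection can be illustrated without loading any audio. The sketch below is a simplified version of the "noise before the call" case: it takes the stretch directly before the call, capped at the maximum noise length and at the end of the nearest preceding call, and gives up if another call ended less than the minimum noise length before. The real extract_noise_pre above additionally shortens the window by a safety margin, which is left out here. The function name `noise_window_before` and the example times are made up; the defaults of 1 s and 2 s correspond to noise_params.

```python
def noise_window_before(call_start, all_call_ends, min_noise_s=1.0, max_noise_s=2.0):
    """Return (noise_start, noise_end) in seconds, or None if no window fits."""
    # the call is too close to the beginning of the recording
    if call_start <= min_noise_s:
        return None
    # time since the end of every earlier call (later calls give negative values)
    gaps = [call_start - end for end in all_call_ends if call_start - end > 0]
    # another call ended too shortly before this one
    if any(gap < min_noise_s for gap in gaps):
        return None
    length = min(gaps + [max_noise_s])
    return (call_start - length, call_start)

print(noise_window_before(10.0, [2.0]))        # (8.0, 10.0)
print(noise_window_before(10.0, [3.0, 8.5]))   # (8.5, 10.0)
print(noise_window_before(10.0, [9.5]))        # None
print(noise_window_before(0.5, []))            # None
```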

In [0]:
# List of all wav filepaths (to full wav files, not just the calls), where I have a labels file
# therefore, do it kind of backwards instead of just: wav_fileList = glob.glob(AUDIO_IN+'*.wav')
csv_filepathList = glob.glob(LABELS_OUT+'*.csv')
wav_filepathList = [AUDIO_IN+(os.path.basename(item).replace('_labels.csv', '.wav')) for item in csv_filepathList]

Noise file generation:

In [0]:
# for each long wav file
for wav_filepath in wav_filepathList:
  fileID = os.path.basename(wav_filepath).replace('.wav', '')
  # find the matching label file
  label_filepath = glob.glob(LABELS_OUT+fileID+'*.csv')[0]
  label_table = pd.read_csv(label_filepath)
  # list the call wavs in the row order of the label table, so that the new
  # column lines up with the rows (glob returns files in arbitrary order)
  call_filepathList = [AUDIO_OUT+call_filename for call_filename in label_table['audio_file']]

  # Then:
  # for each call wav
  noise_wavs = []
  for call_filepath in call_filepathList:
    # generate NOISE file
    noise_wavs.append(generate_noisewav(call_filepath, wav_filepath, label_filepath, fileID, noise_params))
  
  # add noise_wav column to labels.csv files
  # (index=False so that no extra index column is added on every save)
  label_table['noise_wav'] = noise_wavs
  label_table.to_csv(label_filepath, index=False)


### Generating JSON files

Using the same loop as in noise files. A little redundant but I wanted to leave these steps separate so that I can more easily re-do  single steps in the pipeline.




If not still present from generating noise files...

In [0]:
csv_filepathList = glob.glob(LABELS_OUT+'*.csv')
wav_filepathList = [AUDIO_IN+(os.path.basename(item).replace('_labels.csv', '.wav')) for item in csv_filepathList]

JSON file generation:

In [0]:
# for each long wav file
for wav_filepath in wav_filepathList:
  fileID = os.path.basename(wav_filepath).replace('.wav', '')
  # find the matching label file
  label_filepath = glob.glob(LABELS_OUT+fileID+'*.csv')[0]
  label_table = pd.read_csv(label_filepath)

  # Then:
  # for each call wav, in the row order of the label table
  # (note: "NA" entries in noise_wav are read back as NaN by pandas)
  json_files = []
  for call_filename, noise_wav in zip(label_table['audio_file'], label_table['noise_wav']):
    # generate JSON file
    json_files.append(generate_json(AUDIO_OUT+call_filename, wav_filepath, label_filepath, fileID, noise_wav))
  
  # add json file column to labels.csv files
  # (index=False so that no extra index column is added on every save)
  label_table['json_file'] = json_files
  label_table.to_csv(label_filepath, index=False)

# Code leftovers

### Checking what files are present

In [9]:
# Checking which files are uploaded and which aren't even though they should have been

years = ['2017', '2019']

for year in years:
  print("Year is: "+year)
  matching = pd.read_csv(PROJECT_PATH+"/matching_"+year+".txt", sep="\t", header=None)
  matching.columns = ['name', 'wav', 'csv']

  print(matching.shape)
  matching = matching.dropna()
  print("After dropping NA:")
  print(matching.shape)

  wavs_file = [os.path.basename(x) for x in matching['wav']]
  if(year=='2017'):
    audio_dir = AUDIO_IN
  elif(year=='2019'):
    audio_dir = PROJECT_PATH+'matched_wavs_2019'
  
  wavs_drive = os.listdir(audio_dir)

  if(len(set(wavs_drive))==len(wavs_drive)):
    print("No duplicates")
  else:
    print("Duplicates!")

  diffs = pd.Series(list(set(wavs_drive).difference(set(wavs_file))))
  print("In drive, but not in file:"+str(diffs.size))
  print(diffs)

  diffs = pd.Series(list(set(wavs_file).difference(set(wavs_drive))))
  print("In file, but not in drive:"+str(diffs.size))
  print(diffs)
  diffs.to_csv(PROJECT_PATH+'diffs'+year+'.csv', sep=";")


Year is: 2017
(51, 3)
After dropping NA:
(45, 3)
No duplicates
In drive, but not in file:0
Series([], dtype: float64)
In file, but not in drive:0
Series([], dtype: float64)




Year is: 2019
(65, 3)
After dropping NA:
(65, 3)
No duplicates
In drive, but not in file:0
Series([], dtype: float64)
In file, but not in drive:0
Series([], dtype: float64)




In [0]:
# just a sanity check
for wav, label in zip(wav_filepathList, csv_filepathList):
  fileID = os.path.basename(wav).replace('.wav','')
  numfiles=len(glob.glob(AUDIO_OUT+fileID+'*.wav'))
  label_table = pd.read_csv(label)
  numlabels = label_table.shape[0]
  print(str(numfiles)+' : '+str(numlabels))

In [0]:
from avgn.utils.json import  NoIndentEncoder
from avgn.utils.audio import get_samplerate
import json
from avgn.utils.paths import DATA_DIR
import avgn

In [0]:
DATASET_ID = 'meerkat'
SPECIES = "Suricata suricatta"

# Draft: builds the metadata for one call wav and writes it to JSON_OUT.
# "datetime" and "bout_number" are not filled in yet (TODO).
def generate_json(call_filepath, audio_filepath, label_filepath, noise_filepath=None):
    call_filename = os.path.basename(call_filepath)

    # wav samplerate and duration
    sr = get_samplerate(call_filepath)
    wav_duration = librosa.get_duration(filename=call_filepath)

    # wav general information
    json_dict = {}
    json_dict["samplerate_hz"] = sr
    json_dict["length_s"] = wav_duration
    json_dict["species"] = SPECIES
    json_dict["common_name"] = "Meerkat"
    json_dict["wav_loc"] = call_filepath
    json_dict["original_wav"] = audio_filepath
    json_dict["noise_loc"] = noise_filepath
    json_dict["indv"] = get_meerkatID(call_filename)

    json_txt = json.dumps(json_dict, cls=NoIndentEncoder, indent=2)
    json_out = JSON_OUT+os.path.splitext(call_filename)[0]+".JSON"

    # save json
    with open(json_out, "w") as json_file:
        print(json_txt, file=json_file)

### Example for a single file

Do the process first with a single file, just taking a random example file.


In [0]:
label_filename = "HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163_label.CSV"
label_filepath = LABELS_IN+label_filename

audio_filename = "HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163.wav"
audio_filepath = AUDIO_IN+audio_filename

Generate the chunks:

In [160]:
generate_audio_chunks(audio_filepath, label_filepath)

Processing HM_VHMM007_LT_AUDIO_R11_file_5_(2017_08_06-06_44_59)_ASWMUX221163.wav
Parsing...
Chunking...


Looking at results:

In [0]:
durations = [librosa.get_duration(filename=myfile) for myfile in glob.glob("*.wav")]
statistics.mean(durations)
plt.hist(durations)
#print(durations.index(max(durations)))
#print(durations[160])
#print(max(durations))
#y, sr = librosa.load("HM_HMB_R11_AUDIO_file_5_(2017_08_24-06_44_59)_ASWMUX221163_call160.wav", sr=None)
#Audio(y, rate=sr)

In [0]:
# Function that adds start and stop times in milliseconds as additional columns
# to a dataframe containing start time and duration in format 
# h:min:s.ms (column 'Start') and min:s.ms (column 'Duration')
# Input: labels dataframe (Pandas dataframe)
# Output: labels dataframe with additional columns 'start_ms' and
#         'stop_ms' (Pandas dataframe)
# Example usage: labels = add_startstop_ms(labels)

def add_startstop_ms(labels):
  
  if (labels.shape[0]!=0):
  
    # Start
    start = labels.Start.str.split(":", expand=True)
    start.columns = ['h', 'min', 's']
    start = pd.concat([start.drop(columns="s"),start.s.str.split(".", expand=True)], axis=1)

    start.columns = ['h', 'min', 's', 'ms']
    for i in list(start): start[i] = start[i].astype(str).astype(int)

    start['total']= start.apply(lambda row: (row['h']*60*60*1000+ 
                              row['min']*60*1000+ row['s']*1000+row['ms']), axis = 1)
  
    duration = labels.Duration.str.split(":", expand=True)
    duration.columns = ['min', 's']
    duration = pd.concat([duration.drop(columns="s"),duration.s.str.split(".", expand=True)], axis=1)

    duration.columns = ['min', 's', 'ms']
    for i in list(duration): duration[i] = duration[i].astype(str).astype(int)

    duration['total']= duration.apply(lambda row: (row['min']*60*1000+ 
                                               row['s']*1000+row['ms']), axis = 1)

    labels['start_ms']=start['total']
    labels['stop_ms']=start['total']+duration['total']

  return labels

### Removing files:

In [0]:
fileList = glob.glob('*_call*.wav')
 # Iterate over the list of filepaths & remove each file.
for filePath in fileList:
    try:
        os.remove(filePath)
    except:
        print("Error while deleting file : ", filePath)

In [0]:
labels = pd.read_csv(label_filepath, sep="\t")

In [0]:
labels

Next, I remove irrelevant rows, i.e. those that mark start, end, synch, beep or are marked as noise:

In [0]:
irrelevant = ['SYNCH', 'START', 'END', 'NOISE', 'BEEP']
irrelevant=irrelevant+[item.lower() for item in irrelevant]

# relevant labels
labels = labels[~labels['Name'].str.contains('|'.join(irrelevant))]

False

In [0]:
# Start
start = labels.Start.str.split(":", expand=True)
start.columns = ['h', 'min', 's']
start = pd.concat([start.drop(columns="s"),start.s.str.split(".", expand=True)], axis=1)

start.columns = ['h', 'min', 's', 'ms']
for i in list(start): start[i] = start[i].astype(str).astype(int)

start['total']= start.apply(lambda row: (row['h']*60*60*1000+ 
                     row['min']*60*1000+ row['s']*1000+row['ms']), axis = 1)

In [0]:
# Call duration
duration = labels.Duration.str.split(":", expand=True)
duration.columns = ['min', 's']
duration = pd.concat([duration.drop(columns="s"),duration.s.str.split(".", expand=True)], axis=1)

duration.columns = ['min', 's', 'ms']
for i in list(duration): duration[i] = duration[i].astype(str).astype(int)

duration['total']= duration.apply(lambda row: (row['min']*60*1000+ 
                                               row['s']*1000+row['ms']), axis = 1)

In [0]:
# Add start and stop ms to labels table
labels['start_ms']=start['total']
labels['stop_ms']=start['total']+duration['total']

The start and stop times in ms are now in the labels table (start_ms, stop_ms).

In [0]:
labels

Now, I can chunk the audio file according to these start and stop times. This will create new, very short audio files.

In [0]:
%cd segmented_audios

In [0]:
audio = AudioSegment.from_wav(audio_filepath)
chunks = labels.apply(lambda row: (audio[row['start_ms']:row['stop_ms']]), axis = 1)

chunks.index=range(chunks.shape[0])

for index, content in chunks.items():
    content.export((audio_filename[:-4]+"_call"+str(index)+".wav"), format="wav")

In [0]:
# Function to get list of call files for a given full wav file
# Input: filename or filepath of the full wav file (String)
# Output: List of paths to wav call files (List of Strings)
# Example use: get_call_filepaths('HM_VHMM003_HLT_AUDIO_R12_file_5_(2017_08_06-06_44_59)_ASWMUX221102.wav')

def get_call_filepaths(wav_file):
  # fileID is the wav filename without folder and extension
  fileID = os.path.splitext(os.path.basename(wav_file))[0]
  call_filepathList = glob.glob(AUDIO_OUT+fileID+'*.wav')
  return call_filepathList