## Introduction

This notebook contains a loop that converts all `.amr` files in a selected folder to `.wav` files.

Adapted from [this process](https://gist.github.com/Kronopath/c94c93d8279e3bac19f2).

[Primer on Azure storage concepts](https://k21academy.com/microsoft-azure/dp-100/datastores-and-datasets-in-azure/?utm_source=youtube&utm_medium=referral&utm_campaign=dp10019_december21), useful for creating a dataset from an Azure datastore.

# Azure storage concepts

The idea is that a workspace can hold many registered datasets, which can then be accessed by name.

* `Datastores` store connection information for Azure data services, e.g. a blob container

* `Datasets` are references to a data source, with a schema
    * `FileDataset()` - unstructured data
    * `TabularDataset()` - easy conversion to a pandas `DataFrame`

* `DatasetConsumptionConfig` dictates how a registered Dataset is delivered to the compute target
    * `.as_download`
    * `.as_hdfs`
    * `.as_mount`


Both the Datastores and the Datasets must be registered to be exposed to the workspace.


In [33]:
import azureml.core, os
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.data_reference import DataReference
from azureml.exceptions import UserErrorException

# load subscription info from config.json
ws = Workspace.from_config()

### Registering datastore

In [11]:
# name under which the datastore is registered in the workspace
blob_datastore_name='recordings'
# name of az blob container
container_name=os.getenv("BLOB_CONTAINER", "recordings")
# storage account name
account_name=os.getenv("BLOB_ACCOUNTNAME", "callcentrerecordings")
# storage account access key (read it from the environment; never commit a real key)
account_key=os.getenv("BLOB_ACCOUNT_KEY", "<storage-account-key>")

################################################
### REGISTER DATASTORE (if not already done) ###
################################################

try:
    blob_datastore = Datastore.get(ws, blob_datastore_name)
    print("Found Blob Datastore with name: %s" % blob_datastore_name)
except UserErrorException:
    blob_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=blob_datastore_name,
        account_name=account_name, 
        container_name=container_name, 
        account_key=account_key) 
    print("Registered blob datastore with name: %s" % blob_datastore_name)

blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="recordings",
    path_on_datastore="1",
    mode='mount', 
    path_on_compute='/recordings', 
    overwrite=False)

Found Blob Datastore with name: recordings
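
The registration cell above falls back to default values when the environment variables are unset, which can silently point the notebook at the wrong storage account. A minimal sketch of a stricter lookup follows. It assumes a hypothetical helper named `require_env` (not part of `azureml`) and a dummy variable name, `DEMO_BLOB_CONTAINER`, used only for this illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly.

    Hypothetical helper for illustration; it is not part of azureml.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Dummy variable so that the lookup succeeds in this sketch.
os.environ["DEMO_BLOB_CONTAINER"] = "recordings"
print(require_env("DEMO_BLOB_CONTAINER"))
```

In the notebook, the same helper could wrap `BLOB_CONTAINER`, `BLOB_ACCOUNTNAME` and `BLOB_ACCOUNT_KEY` so that a missing value stops the run before anything is registered.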


### Registering Dataset

In [12]:
blob_dataset = 'recordings'

default_datastore = ws.get_default_datastore()

################################################
### REGISTER DATASET (if not already done) #####
################################################

try:
    # retrieve registered dataset
    dataset = Dataset.get_by_name(ws, blob_dataset)
    print("Found Dataset with name: %s" % blob_dataset)
except UserErrorException:
    # register dataset w/ ws
    dataset = Dataset.File.from_files(path=(blob_datastore,'/1/*.amr'))
    dataset = dataset.register(workspace=ws, name=blob_dataset)
    print("Registered dataset with name: %s" % blob_dataset)

Found Dataset with name: recordings


In [16]:
# listing files in our dataset
FILES = []
for path in dataset.to_path():
    print(path)
    FILES.append(path)

/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.amr
/0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.amr
/1.amr
/2.amr
/3.amr
/4.amr
/5.amr
/6.amr
/amr1.amr
/amr2.amr


### Consuming registered dataset via [`DatasetConsumptionConfig` object](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_consumption_config.datasetconsumptionconfig?view=azure-ml-py)

TODO: Downloading the files is almost certainly not the best way to do this, but it works for now 🙈

In [18]:
data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)
dataset.download(data_folder, overwrite=True)

['/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/3.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/amr1.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/4.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/amr2.amr',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/1.amr',
 '/mnt/batch/tasks/shared/LS_

Confirming that ffmpeg is available on the compute

In [9]:
!which ffmpeg

/usr/bin/ffmpeg
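
The same check can be made from Python, so that a conversion run fails early when the tool is missing. This is a sketch using the standard library's `shutil.which`; the helper name `find_tool` is an assumption made for this example.

```python
import shutil


def find_tool(name: str) -> str:
    """Return the full path of an executable found on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} was not found on PATH")
    return path


try:
    print(find_tool("ffmpeg"))
except FileNotFoundError as err:
    print(err)
```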


In [22]:
import os, argparse, subprocess
from datetime import datetime


def convert_amr_file(input_dir, input_file_name, verbose=True, remove_intermediate=False):
    
    """
    Converts a single .amr file to an intermediate .aud file, then converts that to .wav
    
    """
    # Create an additional folder to store the converted .wav files
    output_dir = os.path.join(input_dir, 'converted' )
    if not os.path.isdir(output_dir): os.mkdir(output_dir)
        
    # Find the absolute file path and append the .amr file to this path
    input_file_path = os.path.join(input_dir, input_file_name)
    if verbose: print(input_file_path)
    
    # Open this input .amr file
    input_file = open(input_file_path, 'rb')
    if verbose: print(input_file)
        
    # Replace the .amr extension in the file name with an intermediate .aud extension
    intermediate_file_name = input_file_name.replace(".amr",".aud")
    if verbose: print(intermediate_file_name)
    
    # Create a path for the .aud file in the "/converted" folder space
    intermediate_file_path = os.path.join(output_dir, intermediate_file_name)
    if verbose: print(intermediate_file_path)
    
    # Open the intermediate file
    intermediate_file = open(intermediate_file_path, 'wb')
    if verbose: print(intermediate_file)
    
    # Write the input file to the intermediate file and close both
    intermediate_file.write(input_file.read())
    input_file.close()
    intermediate_file.close()
    
    # Replace .amr with .wav naming
    output_file_name = input_file_name.replace(".amr", ".wav")
    if verbose: print(output_file_name)
        
    # Join the .wav file name to the output directory path
    output_file_path = os.path.join(output_dir, output_file_name)
    if verbose: print(output_file_path)

    # Create a file to absorb ffmpeg's console output (stdout and stderr)
    black_hole_file = open("black_hole", "w")
    
    # Convert the .aud file to a .wav file with a 16 kHz sampling rate.
    # The sampling rate option follows this method: https://stackoverflow.com/questions/63793137/convert-mp3-to-wav-with-custom-sampling-rate
    subprocess.call([ "ffmpeg", "-i", intermediate_file_path, "-ar", "16k", output_file_path],
                    stdout = black_hole_file, stderr = black_hole_file)
    
    # Close the dump file
    black_hole_file.close()

    if remove_intermediate:
        # Delete the junk files
        os.remove("black_hole")
        os.remove(intermediate_file_path)
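
The function above sends ffmpeg's output to a scratch file named `black_hole`. As a hedged alternative, the sketch below builds the same kind of ffmpeg call but discards the output through `subprocess.DEVNULL`. The helper names `build_ffmpeg_command` and `run_ffmpeg` are assumptions made for this example, and the `-y` (overwrite) and `-nostdin` flags are additions that the original call does not use.

```python
import subprocess


def build_ffmpeg_command(src_path: str, dst_path: str, sample_rate: int = 16000) -> list:
    """Build the ffmpeg argument list for a resampling conversion."""
    return [
        "ffmpeg",
        "-nostdin",                # never wait for keyboard input
        "-y",                      # overwrite dst_path if it already exists
        "-i", src_path,            # input file
        "-ar", str(sample_rate),   # output sampling rate in Hz
        dst_path,
    ]


def run_ffmpeg(src_path: str, dst_path: str, sample_rate: int = 16000) -> int:
    """Run ffmpeg quietly and return its exit code (0 means success)."""
    command = build_ffmpeg_command(src_path, dst_path, sample_rate)
    return subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


print(build_ffmpeg_command("amr1.amr", "amr1.wav"))
```

Returning the exit code lets the caller detect a failed conversion, which the original `subprocess.call` result is not used for.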
        




In [26]:
#'/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/'
data_folder

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data'

Testing single conversion

In [27]:
convert_amr_file(data_folder,'amr1.amr', verbose=True, remove_intermediate=False)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/amr1.amr
<_io.BufferedReader name='/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/amr1.amr'>
amr1.aud
/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/amr1.aud
<_io.BufferedWriter name='/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/amr1.aud'>
amr1.wav
/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/amr1.wav
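
One way to confirm that a converted file really has the requested sampling rate is to read its header with the standard library's `wave` module. This is a sketch under two assumptions: the helper name `wav_info` is made up for this example, and the file is plain PCM, which is the only kind of `.wav` that `wave` reads reliably. The demo checks a generated file of silence rather than one of the recordings.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> dict:
    """Read channel count, sampling rate and duration from a PCM .wav header."""
    with wave.open(path, "rb") as wav_file:
        frame_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": frame_rate,
            "duration_s": wav_file.getnframes() / frame_rate,
        }


# Self-check on a generated file: one second of 16 kHz, 16-bit mono silence.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    demo.setframerate(16000)
    demo.writeframes(b"\x00\x00" * 16000)

info = wav_info(demo_path)
print(info)
```

In the notebook, the call would be `wav_info(os.path.join(data_folder, 'converted', 'amr1.wav'))`.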


In [38]:
os.chdir(data_folder+'/converted')
os.listdir()

['0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.aud',
 '0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.wav',
 '0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.aud',
 '0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.wav',
 '1.aud',
 '1.wav',
 '2.aud',
 '2.wav',
 '3.aud',
 '3.wav',
 '4.aud',
 '4.wav',
 '5.aud',
 '5.wav',
 '6.aud',
 '6.wav',
 'amr1.aud',
 'amr1.wav',
 'amr2.aud',
 'amr2.wav',
 'converted']

# Bulk conversion

In [37]:
def bulk_convert(audio_src=data_folder, verbose=True, remove_intermediate=True):
    """
    Converts every .amr file in audio_src to an intermediate .aud file, then to .wav in audio_src/converted
    
    ***Args:
    - audio_src: folder containing the .amr files that need conversion
               : a relative path is resolved against the current working directory
    - verbose: will print progress if True
    - remove_intermediate: will remove the intermediate .aud files if True
    
    ***Returns:
    - None; the converted .wav versions of the .amr files are written to audio_src/converted
    
    """
    # Find the absolute file path to your audio files
    audio_src = os.path.join(os.getcwd(), audio_src)
    
    # Loop over each file in the directory and convert to .wav
    for dirname, dirnames, filenames in os.walk(audio_src):
        for filename in filenames:
            # Skip anything that is not an .amr file (e.g. earlier .aud/.wav output)
            if not filename.endswith('.amr'):
                continue
            input_path = os.path.join(dirname, filename)
            if verbose: print(f'input_path: {input_path}')
            try: 
                # Pass remove_intermediate through so the flag actually takes effect
                convert_amr_file(dirname, filename, verbose=verbose,
                                 remove_intermediate=remove_intermediate)
                if verbose: print(f"Done converting {filename}!")
            except Exception as err:
                if verbose:  print(f"ERROR on {filename}: {err}")
                    
                    
bulk_convert(audio_src=data_folder, verbose=True, remove_intermediate=False)

input_path: /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.amr
/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.amr
<_io.BufferedReader name='/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.amr'>
0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.aud
/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.aud
<_io.BufferedWriter name='/mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/0000000229520431#-10569#

In [39]:
os.chdir(data_folder+'/converted')
os.listdir()

['0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.aud',
 '0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.wav',
 '0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.aud',
 '0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.wav',
 '1.aud',
 '1.wav',
 '2.aud',
 '2.wav',
 '3.aud',
 '3.wav',
 '4.aud',
 '4.wav',
 '5.aud',
 '5.wav',
 '6.aud',
 '6.wav',
 'amr1.aud',
 'amr1.wav',
 'amr2.aud',
 'amr2.wav',
 'converted']
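
`bulk_convert` reports problems only through printed messages. The sketch below shows one possible way to return a summary instead, so that failed files can be retried or logged. The function name `convert_all` and the stand-in converter `fake_convert` are assumptions made for this example, and only top-level `.amr` files are considered.

```python
import os
import tempfile


def convert_all(src_dir, convert_one):
    """Apply convert_one(src_dir, name) to every top-level .amr file in src_dir.

    Returns (converted, failed): two lists of file names.
    """
    converted, failed = [], []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not (os.path.isfile(path) and name.lower().endswith(".amr")):
            continue  # skip sub-folders and non-.amr files
        try:
            convert_one(src_dir, name)
            converted.append(name)
        except Exception:
            failed.append(name)
    return converted, failed


# Demo with empty placeholder files and a stand-in converter that rejects one file.
demo_dir = tempfile.mkdtemp()
for demo_name in ("1.amr", "2.amr", "notes.txt"):
    open(os.path.join(demo_dir, demo_name), "wb").close()


def fake_convert(folder, name):
    if name == "2.amr":
        raise ValueError("simulated failure")


done, failed = convert_all(demo_dir, fake_convert)
print(done, failed)
```

In the notebook, `convert_all(data_folder, convert_amr_file)` would be the corresponding call. Note that `convert_amr_file` ignores ffmpeg's exit code, so only Python-level errors would end up in the failed list.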

In [43]:
converted = data_folder + '/converted'

In [49]:

# collect only the converted .wav files
local_files = [os.path.join(converted, f) for f in os.listdir(converted) if f.endswith(".wav")]

# upload the converted files to the registered blob datastore
blob_datastore.upload_files(files=local_files, target_path=None, show_progress=True)

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 10 files
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.wav
Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.wav, 1 files out of an estimated total of 10
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/1.wav
Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/1.wav, 2 files out of an estimated total of 10
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/Users/neelan/asr/data-pre-processing/data/converted/2.wav
Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/neelan-compute/code/User

$AZUREML_DATAREFERENCE_recordings

In [None]:
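from os import listdir
from os.path import join as osjoin

# NOTE: workdir was not defined in the original cell; the current working directory is assumed here
workdir = os.getcwd()
datadir = osjoin(workdir, "data")
local_files = [osjoin(datadir, f) for f in listdir(datadir) if f.endswith(".parquet")]

# get the default datastore and upload the prepared data
datastore = ws.get_default_datastore()
datastore.upload_files(files=local_files, target_path=None, show_progress=True)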