![Bilby Stampede](https://www.assemblyai.com/_next/static/image/src/modules/layout/branding/logo/assets/default.a451cfb665ce63899754f3d832630ece.svg)
   
This notebook tests the [AssemblyAI API](https://docs.assemblyai.com/#introduction) for transcribing audio files uploaded directly from Azure Blob Storage.


[Review of the AssemblyAI API](https://nordicapis.com/review-of-assemblyai-speech-to-text-api/)

In [70]:
############################################################################
###################### installing needed packages ##########################
############################################################################
#%%capture

### uncomment, install one-by-one

#!pip install requests

#%pip uninstall numpy --yes
#%pip install numpy==1.21


#!sudo apt install ffmpeg --yes #may need to install via terminal
#!conda install -c conda-forge librosa --yes
#%pip install librosa
#!sudo apt-get install blobfuse fuse


In [71]:
%%sh
############################################################################
###################### checking driver versions  ###########################
############################################################################
which ffmpeg
which blobfuse


/usr/bin/ffmpeg
/usr/bin/blobfuse


In [17]:
############################################################################
###################### importing needed packages ###########################
############################################################################

# basics
import os # to handle local files
import numpy as np
np.__version__
import pandas as pd
import glob # unix style pathname pattern expansion -- whatever that means 🙈
import requests # to handle apis
from datetime import datetime as dt # timers 
import IPython # interactive components

import matplotlib.pyplot as plt # plots
%matplotlib inline



# sound
import librosa # sound utility 
import librosa.display # plots
from matplotlib.pyplot import specgram # plots

# azure
import azureml.core, os
from azureml.core import Workspace, Datastore, Dataset, Run
from azureml.data.datapath  import DataPath
from azureml.data.data_reference import DataReference
from azureml.exceptions import UserErrorException

# load subscription info from config.json
ws = Workspace.from_config()

# to be able to write on mounted datastore
import fuse

In [15]:

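# `!export` runs in a throwaway subshell and does not persist; %env sets the variable for the kernel itself
%env AZURE_STORAGE_ACCOUNT=callcentrerecordings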

In [16]:
# `!export` would not persist beyond its subshell; %env sets the variable for the kernel.
# Keep the real key out of the notebook.
%env AZURE_STORAGE_ACCESS_KEY=<storage-account-access-key>

In [4]:
user = os.popen('whoami').read()[:-1]
#mount_path must be empty
mount_path   = "/home/azureuser/cloudfiles/data/mount"
if not os.path.exists(mount_path): os.mkdir(mount_path)
if os.listdir(mount_path) != []:
    print("mount_path not empty")

#make sure user has write access to cache_path - if not, create & chown to your user.
cache_path   = "/home/azureuser/cloudfiles/data/tmp"
if not os.path.exists(cache_path): os.mkdir(cache_path)

config_path  = "/home/azureuser/cloudfiles/code/Users/neelan/asr/pipeline/connection.cfg"
if not os.path.exists(config_path):
    print("get connection file from https://github.com/elucidate-ai/asr/blob/main/pipeline/connection.cfg")

In [10]:
def check_access(folder):
    if os.access(folder, os.R_OK):
        print(f"{folder} has read access")
    if os.access(folder, os.W_OK):
        print(f"{folder} has write access")
    if os.access(folder, os.X_OK):
        print(f"{folder} has execution access!")
    if os.access(folder, os.X_OK | os.W_OK):
        print(f"{folder}  can write file to the directory")
    print()

In [11]:
for i in [mount_path, cache_path]:
    check_access(i)

/home/azureuser/cloudfiles/data/mount has read access
/home/azureuser/cloudfiles/data/mount has write access
/home/azureuser/cloudfiles/data/mount has execution access!
/home/azureuser/cloudfiles/data/mount  can write file to the directory

/home/azureuser/cloudfiles/data/tmp has read access
/home/azureuser/cloudfiles/data/tmp has write access
/home/azureuser/cloudfiles/data/tmp has execution access!
/home/azureuser/cloudfiles/data/tmp  can write file to the directory



In [12]:
!chown $user $cache_path

In [13]:
!chown $user $mount_path

In [14]:
!blobfuse /home/azureuser/cloudfiles/data/mount --tmp-path=/home/azureuser/cloudfiles/data/tmp  --config-file=/home/azureuser/cloudfiles/code/Users/neelan/asr/pipeline/connection.cfg


In [18]:
# testing write capacity of mount
ASSEMBLY_OUT = os.path.join(mount_path,'assembly_ai')
if not os.path.exists(ASSEMBLY_OUT):
    os.mkdir(ASSEMBLY_OUT)

In [19]:
# testing read capacity of mount
os.chdir(mount_path)
os.listdir()[:3]

['0291506962#-10505#SITHATIL#TCRCBD-E45#20220215135504102.wav',
 '029512090788#-10541#JERRYL#TCR-TOSH420#20220215072323987.wav',
 '0301269833#-10505#BUSISIWEM26#TCR4-254782#20220215145030457.wav']

In [43]:
FILES=os.listdir(mount_path)
NAMES=[]
FILE_PATHS=[]
for F in FILES:
    NAMES.append(str(F))
    FILE_PATHS.append(os.path.join(mount_path,F))

In [56]:
FILE_PATHS[:3]

['/home/azureuser/cloudfiles/data/mount/0291506962#-10505#SITHATIL#TCRCBD-E45#20220215135504102.wav',
 '/home/azureuser/cloudfiles/data/mount/029512090788#-10541#JERRYL#TCR-TOSH420#20220215072323987.wav',
 '/home/azureuser/cloudfiles/data/mount/0301269833#-10505#BUSISIWEM26#TCR4-254782#20220215145030457.wav']

# Azure storage concepts
[Primer for azure storage concepts](https://k21academy.com/microsoft-azure/dp-100/datastores-and-datasets-in-azure/?utm_source=youtube&utm_medium=referral&utm_campaign=dp10019_december21) to create dataset from Azure datastore

[Moving data into and between ML pipeline steps](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/machine-learning/how-to-move-data-in-out-of-pipelines.md)

A workspace can hold many registered datastores and datasets, each of which can then be accessed by name.

* `Datastore` objects store the connection information for an Azure data service, e.g. a blob container

* `Dataset` objects are references to a data source with a schema; they are the preferred way to ingest data into a pipeline
    * `FileDataset()` - unstructured data
    * `TabularDataset()` - easy conversion to a `pandas.DataFrame`
    
* `DatasetConsumptionConfig` objects pass data to the next pipeline step and dictate how a registered `Dataset` is delivered to the compute target
    * `.as_download`
    * `.as_hdfs`
    * `.as_mount`

### How to use data in a pipeline:
* To pass a `Dataset` to a pipeline:

    * Use `TabularDataset.as_named_input()` or `FileDataset.as_named_input()` (no 's' at end) to create a `DatasetConsumptionConfig` object
    * Use `as_mount()` or `as_download()` to set the access mode
    * Pass the datasets to your pipeline steps using either the `arguments` or the `inputs` argument
    

Both the `Datastore` and the `Dataset` must be registered to be exposed to the workspace.
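
The steps listed above can be sketched as a small helper. This is a minimal sketch rather than the SDK's own recipe: `named_input` is a hypothetical helper name, and it relies only on the `as_named_input()`, `as_mount()` and `as_download()` methods already mentioned. The `FakeDataset` and `FakeConfig` classes are stand-ins, not part of the Azure ML SDK, and exist only so the example runs without a workspace.

```python
def named_input(dataset, name, mode="mount"):
    """Return a consumption config for `dataset` (hypothetical helper).

    `dataset` is expected to behave like a registered FileDataset;
    `mode` picks how the files reach the compute target.
    """
    config = dataset.as_named_input(name)  # note: no 's' at the end
    if mode == "mount":
        return config.as_mount()
    if mode == "download":
        return config.as_download()
    raise ValueError(f"unsupported mode: {mode!r}")


# Stand-ins so the sketch runs without an Azure ML workspace (not part of the SDK).
class FakeConfig:
    def __init__(self, name):
        self.name, self.mode = name, None

    def as_mount(self):
        self.mode = "mount"
        return self

    def as_download(self):
        self.mode = "download"
        return self


class FakeDataset:
    def as_named_input(self, name):
        return FakeConfig(name)


cfg = named_input(FakeDataset(), "recordings", mode="mount")
print(cfg.name, cfg.mode)
```

With the real SDK, a dataset fetched with `Dataset.get_by_name(ws, 'wav_recordings')` would take the place of `FakeDataset()`, and the returned object would be handed to a pipeline step through its `inputs` argument.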


### Details of `Azure Blob Storage` where client call recording files are uploaded

In [11]:
############################################################################
###################### Details of Azure Blob Storage #######################
############################################################################


# name of datastore to ws
blob_datastore_name='elucidate_recordings'
# name of az blob container
container_name=os.getenv("BLOB_CONTAINER", "recordings")
# storage account name
account_name=os.getenv("BLOB_ACCOUNTNAME", "callcentrerecordings")
# storage account access key
account_key=os.getenv("BLOB_ACCOUNT_KEY", "<storage-account-access-key>")  # keep the real key out of the notebook



### Registering `Datastore` to workspace (if not already done)

In [12]:
################################################
### REGISTER DATASTORE (if not already done) ###
################################################

try:
    blob_datastore = Datastore.get(ws, blob_datastore_name)
    print("Found Blob Datastore registered to ws with name: %s" % blob_datastore_name)
except UserErrorException:
    blob_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=blob_datastore_name,
        account_name=account_name, 
        container_name=container_name, 
        account_key=account_key) 
    print("Registered blob datastore with name: %s" % blob_datastore_name)

blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="recordings",
    path_on_datastore="/",
    mode='mount', 
    path_on_compute='/recordings', 
    overwrite=False)

datastore_path = DataPath(blob_datastore, '/*.wav')


Found Blob Datastore registered to ws with name: elucidate_recordings


### Registering `Dataset` (if not already done)

Here we specify which files in the `Datastore` should make up the `Dataset`.

In [13]:
blob_dataset = 'wav_recordings'

ws_datastore = ws.get_default_datastore()

################################################
### REGISTER DATASET (if not already done) #####
################################################

try:
    # retrieve registered dataset
    dataset = Dataset.get_by_name(ws, blob_dataset)
    print("Found Dataset with name: %s" % blob_dataset)
except UserErrorException:
    # register dataset w/ ws 
    ### DATA LOCATION ON BLOB
    dataset = Dataset.File.from_files(path=datastore_path)
    dataset = dataset.register(workspace=ws, name=blob_dataset)
    print("Registered dataset with name: %s" % blob_dataset)

Found Dataset with name: wav_recordings


### Mounting `Dataset` to compute

Follow [this Stack Overflow answer](https://stackoverflow.com/questions/67504900/cant-access-mounted-dataset-on-azure-machine-learning-service-notebook) to mount the dataset and start the mount context afterward.

[More context on mounting](https://social.msdn.microsoft.com/Forums/azure/en-US/7e5f3ff4-5981-4d37-8450-5b7bb7bdeedc/mounting-an-azuremldatafiledataset?forum=AzureMachineLearningService)

* Note: Where you mount the blob matters! An incorrect location can confuse the Azure SDK.

In [None]:
os.getcwd()

In [21]:

mount_path = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/asr-compute/code/Users/neelan/asr/pipeline/mnt'
if not os.path.exists(mount_path): os.mkdir(mount_path)
if os.listdir(mount_path) != []:
    print("mount_path not empty")

In [None]:

# mount dataset onto mount_path of Linux-based compute
mount_context = dataset.mount(mount_path)

In [8]:
# context manager will be returned to manage the lifecycle of the mount. 
# mount only supported on Unix or Unix-like operating systems 
# libfuse must be present

mount_context.start()

### Resources to understand how to mount blob as file system

[More on `mount_context`](https://azure.github.io/azureml-sdk-for-r/reference/mount_file_dataset.html)

[How to mount Blob storage as a file system with blobfuse](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux)

[Add Linux software repo for MS products](https://docs.microsoft.com/en-us/windows-server/administration/Linux-Package-Repository-for-Microsoft-Software)

In [20]:
# get linux version of compute
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic


In [9]:
################################################
### Add Linux software repo for MS products ####
################################################

# (probably better to do in terminal w/o `!`)

#!curl -sSL https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -

#!sudo apt-add-repository https://packages.microsoft.com/ubuntu/18.04/prod

#!sudo apt-get update

In [22]:
################################################
######### Configure MS products repo ###########
################################################

#!wget https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb
#!sudo dpkg -i packages-microsoft-prod.deb
#!sudo apt-get update

In [None]:
#################################################
######### Configure storage account credentials #
#################################################

# save following in: ~/fuse_connection.cfg
# then run: chmod 600 ~/fuse_connection.cfg  (restrict access so no other users can read it)

accountName callcentrerecordings
accountKey <storage-account-access-key>
authType Key
containerName recordings
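
The config file described above can also be written from Python so that the key never sits in a notebook cell. This is a minimal sketch, assuming the values come from the `BLOB_ACCOUNTNAME`, `BLOB_ACCOUNT_KEY` and `BLOB_CONTAINER` environment variables used earlier in this notebook. `write_blobfuse_config` is a hypothetical helper name, and the demo writes to a temporary directory instead of `~/fuse_connection.cfg`.

```python
import os
import stat
import tempfile


def write_blobfuse_config(path, account_name, account_key, container_name):
    """Write a blobfuse key-auth config file readable only by its owner."""
    lines = [
        f"accountName {account_name}",
        f"accountKey {account_key}",
        "authType Key",
        f"containerName {container_name}",
    ]
    # Create the file with mode 600 from the start, then chmod in case it already existed.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as cfg:
        cfg.write("\n".join(lines) + "\n")
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return path


# Demo against a temporary directory; the real target would be ~/fuse_connection.cfg.
demo_path = os.path.join(tempfile.mkdtemp(), "fuse_connection.cfg")
write_blobfuse_config(
    demo_path,
    account_name=os.getenv("BLOB_ACCOUNTNAME", "callcentrerecordings"),
    account_key=os.getenv("BLOB_ACCOUNT_KEY", "<storage-account-access-key>"),
    container_name=os.getenv("BLOB_CONTAINER", "recordings"),
)
print(oct(os.stat(demo_path).st_mode & 0o777))
```

Creating the file with restrictive permissions up front avoids the short window in which a freshly written file could be readable by other users before `chmod` runs.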

In [None]:
# create empty directory for mounting


Important: the current working directory differs depending on whether it is read from inside the notebook (which appears to run inside the compute) or from the terminal.

In [27]:
user = os.popen('whoami').read()[:-1]
#mount_path must be empty
mount_path   = "/home/azureuser/cloudfiles/code/Users/neelan/data/mycontainer"
if os.listdir(mount_path) != []:
    print("mount_path not empty")

#make sure user has write access to cache_path - if not, create & chown to your user.
cache_path   = "/home/azureuser/cloudfiles/code/Users/neelan/mnt/resource/blobfusetmp"
config_path  = "./connection.cfg" #/home/azureuser/cloudfiles/code/Users/neelan/asr/pipeline/connection.cfg"

In [24]:

!chown $user $cache_path

from Terminal `/home/azureuser/cloudfiles/code/Users/neelan/asr/pipeline`

from Notebook `/mnt/batch/tasks/shared/LS_root/mounts/clusters/asr-compute/code/Users/neelan/asr/pipeline`

In [33]:

# user must have full access to both mount_path and cache_path, otherwise the mount will not work
!sudo blobfuse $mount_path --tmp-path=$cache_path  --config-file=../connection.cfg #-o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120

No config file found at ../connection.cfg.


In [None]:
!sudo blobfuse /home/azureuser/cloudfiles/code/Users/neelan/data/mycontainer

In [None]:
#!sudo apt-get install blobfuse fuse --yes

In [10]:
os.listdir(mount_path)[:3]

['0000000229520431#-10569#MAGANYAM#LPT-MAGANYAM#20220209100209148.wav',
 '0000000528184405#-10557#PORTIAM13#TCRAMDA6-295#20220209071824490.wav',
 '0000003034735115#-10365#ZANDISWAM#TCRAMDA6-329#20220215075100132.wav']

### "Environment" variables

In [55]:
# from old .as_download() code
"""
# environment variables
FILE_PATHS = []

for f in os.listdir(data_folder)[:50]:
    if '.wav' in f:
        FILE_PATHS.append(os.path.join(data_folder,f))
"""

# get file references, names
FILES = []
NAMES = []
MOUNTS = []
# checking that mounted files are available
os.listdir(mount_path)

for f in os.listdir(mount_path):
    path = os.path.join(mount_path,f)
    FILES.append(path)
    MOUNTS.append(f)
    NAMES.append(str(f))


TOKEN='<your-assemblyai-api-token>' # to access assembly ai SaaS; keep the real token out of the notebook


# from old .as_download() code

# chunking for bulk processing
CHUNK_SIZE = 50
MOUNT_CHUNKS = [MOUNTS[x:x+CHUNK_SIZE] for x in range(0, len(MOUNTS), CHUNK_SIZE)]

"""
if not os.path.exists(os.path.join(data_folder,'sound_eye')):
     os.mkdir(os.path.join(data_folder,'sound_eye'))

SAVE_PATH = os.path.join(data_folder,'sound_eye')


if not os.path.exists(os.path.join(data_folder,'sound_eye')):
     os.mkdir(os.path.join(mount_path,'sound_eye'))

SAVE_PATH = os.path.join(mount_path,'sound_eye')

DPI = 200

if not os.path.exists(os.path.join(mount_path, 'sound_eye')):
    os.mkdir(os.path.join(mount_path, 'sound_eye'))


ROOT=data_folder

FILE = os.path.join(data_folder +'/1.wav') # our test file

# listening to test file 
print(FILE)
y, sr = librosa.load(FILE)
print(sr)
plt.figure(figsize=(14, 5))
IPython.display.Audio(y, rate=sr)
"""

"\nif not os.path.exists(os.path.join(data_folder,'sound_eye')):\n     os.mkdir(os.path.join(data_folder,'sound_eye'))\n\nSAVE_PATH = os.path.join(data_folder,'sound_eye')\n\n\nif not os.path.exists(os.path.join(data_folder,'sound_eye')):\n     os.mkdir(os.path.join(mount_path,'sound_eye'))\n\nSAVE_PATH = os.path.join(mount_path,'sound_eye')\n\nDPI = 200\n\nif not os.path.exists(os.path.join(mount_path, 'sound_eye')):\n    os.mkdir(os.path.join(mount_path, 'sound_eye'))\n\n\nROOT=data_folder\n\nFILE = os.path.join(data_folder +'/1.wav') # our test file\n\n# listening to test file \nprint(FILE)\ny, sr = librosa.load(FILE)\nprint(sr)\nplt.figure(figsize=(14, 5))\nIPython.display.Audio(y, rate=sr)\n"

## Using API with `requests`

[Functions largely from this blog](https://towardsdatascience.com/how-to-build-a-web-app-to-transcribe-audio-using-python-and-assemblyai-18f197253fd8)

The AssemblyAI model expects the file to be accessible via a URL. Therefore, we will need to upload the audio file to blob storage to make it accessible via a URL. Fortunately, AssemblyAI provides a quick and easy way to do this.

We need to make a `POST request` to the following AssemblyAI API endpoint: `https://api.assemblyai.com/v2/upload`

The response will contain a temporary URL to the file, which we can pass back to the AssemblyAI `transcript` API endpoint. The URL is private and accessible only to the AssemblyAI servers. All uploads are deleted immediately after transcription and never stored.

We will use Python’s `requests` library, which we installed earlier, to make the `POST request`.

In [44]:
def get_url(token, data):
    """
    Helper Function 1: Uploading a local audio file to AssemblyAI
    Adapted from: https://towardsdatascience.com/how-to-build-a-web-app-to-transcribe-audio-using-python-and-assemblyai-18f197253fd8
    
    ***Args:
    - token: The API key
    - data : Path of the local audio file to upload
    
    ***Returns:
    - url : temporary URL of the uploaded file
    """
    
    def read_file(filename, chunk_size=5242880):
        """
        Helper generator that streams the file to the server in chunks (5 MiB by default)
        """
        with open(filename, 'rb') as _file:
            while True:
                chunk = _file.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    
    headers = {"authorization": token}
    response = requests.post("https://api.assemblyai.com/v2/upload", headers=headers, data=read_file(data))
    response.raise_for_status()  # fail loudly on a bad token or a failed upload
    url = response.json()["upload_url"]
    print("Uploaded File and got temporary URL to file")
     
    return url

# getting temporary url
# MOUNTS holds bare file names, so this relies on the working directory being mount_path
URL = get_url(TOKEN, MOUNTS[0])
URL



Uploaded File and got temporary URL to file


'https://cdn.assemblyai.com/upload/b684be30-439f-4b0f-b2f2-2c2954a516c0'


Now that we have a function that returns a URL for our audio file, we pass this URL to the endpoint that actually transcribes the file.

To do so, we make a `POST request`, with the file URL in the body, to the following AssemblyAI API endpoint: `https://api.assemblyai.com/v2/transcript`


In [45]:
def get_transcribe_id(token, url):
    """
    Helper Function 2: Submitting an uploaded file for transcription
    Adapted from: https://towardsdatascience.com/how-to-build-a-web-app-to-transcribe-audio-using-python-and-assemblyai-18f197253fd8
    
    
    ***Args:
    - token: The API key
    - url : URL of the uploaded file
    
    ***Returns:
    - id : file transcribe id
    """
    endpoint = "https://api.assemblyai.com/v2/transcript"
    payload = {
        "audio_url": url
    }
    headers = {
        "authorization": token,
        "content-type": "application/json"
    }
    response = requests.post(endpoint, json=payload, headers=headers)
    response.raise_for_status()  # surface HTTP errors instead of a KeyError on "id"
    id_ = response.json()["id"]
    print("Made request and file is currently queued")
    
    return id_

ID = get_transcribe_id(TOKEN, URL)
ID

Made request and file is currently queued


'ozy4pkc108-01eb-48b8-9411-8995509e4631'

Once we have the transcription ID of our audio file, we can make a `GET request` to the following AssemblyAI API endpoint to check the status of the transcription: `https://api.assemblyai.com/v2/transcript/{transcribe_id}`

In [46]:
def get_status(token, transcribe_id):
    """
    Helper Function 3: Get transcription status
    Adapted from: https://towardsdatascience.com/how-to-build-a-web-app-to-transcribe-audio-using-python-and-assemblyai-18f197253fd8
    

    ***Args:
    - token: The API key
    - transcribe_id: The ID of the file which is being transcribed

    ***Returns:
    - result : The JSON response parsed into a dict
    """
    endpoint= f"https://api.assemblyai.com/v2/transcript/{transcribe_id}"
    headers = {
    "authorization": token
               }
    result = requests.get(endpoint, headers=headers).json()
     
    return result

STATUS = get_status(TOKEN, ID)
print(STATUS['status'])
#while STATUS['status'] != 'completed':
#    STATUS = get_status(TOKEN, ID)
#print(STATUS['text'])   

processing


## Creating bulk loops ⏳

In [62]:
def bulk_uploader(token: str, file_paths: list):
    """
    Helper function to upload all .wav files in file_paths to the AssemblyAI server
    """
    urls = []
    done = []
    for i, f in enumerate(file_paths):
        print(i)
        
        if not f.endswith('.wav'):
            print(f"{f} not .wav")
            continue  # skip this entry but keep uploading the rest
        urls.append(get_url(token, f))
        done.append(f)
    print("files uploaded to following urls")        
    return urls, done
            
            
URLS, DONE = bulk_uploader(TOKEN, MOUNT_CHUNKS[0])
REF = zip(DONE,URLS)
ref_df = pd.DataFrame(REF)
ref_df.to_csv('MOUNT_CHUNKS_0.csv')

    

0
Uploaded File and got temporary URL to file
1
Uploaded File and got temporary URL to file
2
Uploaded File and got temporary URL to file
3
Uploaded File and got temporary URL to file
4
Uploaded File and got temporary URL to file
5
Uploaded File and got temporary URL to file
6
Uploaded File and got temporary URL to file
7
Uploaded File and got temporary URL to file
8
Uploaded File and got temporary URL to file
9
Uploaded File and got temporary URL to file
10
Uploaded File and got temporary URL to file
11
Uploaded File and got temporary URL to file
12
Uploaded File and got temporary URL to file
13
Uploaded File and got temporary URL to file
14
Uploaded File and got temporary URL to file
15
Uploaded File and got temporary URL to file
16
Uploaded File and got temporary URL to file
17
Uploaded File and got temporary URL to file
18
Uploaded File and got temporary URL to file
19
Uploaded File and got temporary URL to file
20
Uploaded File and got temporary URL to file
21
Uploaded File and go

In [63]:
def bulk_get_ids(token: str, urls: list):
    ids = []
    for url in urls:
        ids.append(get_transcribe_id(token, url))
    return ids

IDS = bulk_get_ids(TOKEN, URLS)
IDS

Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently queued
Made request and file is currently

['ozytiz4h8l-9773-4106-87f4-53ad9d68b0ea',
 'ozyti4ekxj-8ad8-4bf5-a875-b8dcd357bb3d',
 'ozyti1kt1w-d7a2-431d-a751-d5ed629f2040',
 'ozyti1wb2y-71ae-492e-ad7a-914e4801853f',
 'ozytiy7hwd-184c-4125-b279-0f56f26e9a94',
 'ozytixnuy5-db63-48f6-b452-6280670ac02a',
 'ozyti71kwo-7463-4a3b-9e29-8b4b35c50534',
 'ozytilsmgv-2f2b-4e85-b233-e2f58fd8106b',
 'ozytii9s5r-0af7-422a-aaf1-9bc0b942c283',
 'ozytitlgxx-df53-4c6c-8855-21e5942bb314',
 'ozyti8881j-5a93-4b81-922d-9046a5f790d5',
 'ozyti39ijv-274e-498a-8693-7e6957d24239',
 'ozytijfdk8-56af-4468-bb7c-720037fac02d',
 'ozytihvx00-d10c-4781-ba9d-bb5f01521d02',
 'ozyticbjqa-f45e-4541-a572-849d40790c9f',
 'ozytidx4pj-4b21-4a89-a689-0034f82af148',
 'ozytipmz6v-9dfc-4b68-9a26-4fcdb8eb4118',
 'ozytias6qk-600d-4482-9fb0-355103e34853',
 'ozytinkx5a-1507-47e9-90c2-c1e5994f7c82',
 'ozytifsbgw-18ac-4824-ad63-51e2496cd449',
 'ozytiw5a5k-45fe-4f35-9923-fc23cbc42f79',
 'ozyti5mhqq-1f6b-457d-995f-99b2d6569c1f',
 'ozytqes3np-63ef-408c-82a2-12d14c4bd064',
 'ozytqox52

In [67]:
REF = zip(DONE,URLS, IDS)
ref_df = pd.DataFrame(REF, columns=['file_name', 'url', 'id'])
ref_df.to_csv()

',file_name,url,id\n0,0291506962#-10505#SITHATIL#TCRCBD-E45#20220215135504102.wav,https://cdn.assemblyai.com/upload/5627a41b-1462-42f8-a1de-407943a14419,ozytiz4h8l-9773-4106-87f4-53ad9d68b0ea\n1,029512090788#-10541#JERRYL#TCR-TOSH420#20220215072323987.wav,https://cdn.assemblyai.com/upload/c103d95c-ecc1-40da-9836-87a806f3ff02,ozyti4ekxj-8ad8-4bf5-a875-b8dcd357bb3d\n2,0301269833#-10505#BUSISIWEM26#TCR4-254782#20220215145030457.wav,https://cdn.assemblyai.com/upload/7fa1dc5e-de44-4cbb-8ec5-7853419a8382,ozyti1kt1w-d7a2-431d-a751-d5ed629f2040\n3,030136911B08#-10530#TSWALANEM#TCRLASUS-125#20220215114704654.wav,https://cdn.assemblyai.com/upload/c8b2c850-3317-4359-a619-5bfc7abd6ef4,ozyti1wb2y-71ae-492e-ad7a-914e4801853f\n4,0301758082#-10505#KHANYAM2#TCRCBD3-E40#20220215145548064.wav,https://cdn.assemblyai.com/upload/48a7d8c4-26f1-4b70-bd0b-2f1588c15510,ozytiy7hwd-184c-4125-b279-0f56f26e9a94\n5,0301786778#-10505#BUSISIWEM24#LPTP-BUSIM24#20220215074645058.wav,https://cdn.assemblyai.com/upload/556

In [72]:
ref_df

| | file_name | url | id |
|---|---|---|---|
| 0 | 0291506962#-10505#SITHATIL#TCRCBD-E45#20220215... | https://cdn.assemblyai.com/upload/5627a41b-146... | ozytiz4h8l-9773-4106-87f4-53ad9d68b0ea |
| 1 | 029512090788#-10541#JERRYL#TCR-TOSH420#2022021... | https://cdn.assemblyai.com/upload/c103d95c-ecc... | ozyti4ekxj-8ad8-4bf5-a875-b8dcd357bb3d |
| 2 | 0301269833#-10505#BUSISIWEM26#TCR4-254782#2022... | https://cdn.assemblyai.com/upload/7fa1dc5e-de4... | ozyti1kt1w-d7a2-431d-a751-d5ed629f2040 |
| 3 | 030136911B08#-10530#TSWALANEM#TCRLASUS-125#202... | https://cdn.assemblyai.com/upload/c8b2c850-331... | ozyti1wb2y-71ae-492e-ad7a-914e4801853f |
| 4 | 0301758082#-10505#KHANYAM2#TCRCBD3-E40#2022021... | https://cdn.assemblyai.com/upload/48a7d8c4-26f... | ozytiy7hwd-184c-4125-b279-0f56f26e9a94 |
| 5 | 0301786778#-10505#BUSISIWEM24#LPTP-BUSIM24#202... | https://cdn.assemblyai.com/upload/55609efa-514... | ozytixnuy5-db63-48f6-b452-6280670ac02a |
| 6 | 0301908430#-10505#PULANEL#TCRLENA-156#20220215... | https://cdn.assemblyai.com/upload/fcf8fbb3-820... | ozyti71kwo-7463-4a3b-9e29-8b4b35c50534 |
| 7 | 0303115455#-10479#SIYAMUKELAY#TCRCBD4-B12#2022... | https://cdn.assemblyai.com/upload/59716b2e-701... | ozytilsmgv-2f2b-4e85-b233-e2f58fd8106b |
| 8 | 0303226708#-10505#KARABOS2#TCRCBD3-F08#2022021... | https://cdn.assemblyai.com/upload/86c3ef61-69d... | ozytii9s5r-0af7-422a-aaf1-9bc0b942c283 |
| 9 | 0307160945#-10505#KHANYAM2#TCRCBD3-E40#2022021... | https://cdn.assemblyai.com/upload/59ea26b8-2a5... | ozytitlgxx-df53-4c6c-8855-21e5942bb314 |


In [74]:
def bulk_get_status(token: str, ids: list):
    status = []
    for id in ids:
        status.append(get_status(token, id))
    return status

STATUS = bulk_get_status(TOKEN, IDS)
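
Before writing any transcripts it can help to see how many jobs have actually finished. The following is a minimal sketch, assuming each entry of `STATUS` is the dict returned by `get_status` and carries a `status` key. `summarise_statuses` is a hypothetical helper name, and the `example` list is made-up data standing in for the real responses.

```python
from collections import Counter


def summarise_statuses(responses):
    """Count how many transcription jobs are in each state."""
    return Counter(r.get("status", "unknown") for r in responses)


# Made-up responses standing in for the real STATUS list.
example = [
    {"id": "a", "status": "completed", "text": "hello"},
    {"id": "b", "status": "processing", "text": None},
    {"id": "c", "status": "completed", "text": "world"},
]
counts = summarise_statuses(example)
print(dict(counts))
```

In the notebook this would be called as `summarise_statuses(STATUS)`; any count other than "completed" suggests waiting and calling `bulk_get_status` again before saving.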


In [80]:
ASSEMBLY_OUT = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/asr-compute/data/transcripts/assembly_ai'
os.mkdir(ASSEMBLY_OUT)

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/asr-compute/data/mount'

In [84]:
os.chdir('/mnt/batch/tasks/shared/LS_root/mounts/clusters/asr-compute/')

In [85]:
os.listdir()

['data', 'code', '.nbvm']

In [81]:
os.chdir(ASSEMBLY_OUT)
for i, json in enumerate(STATUS):
    print(i)
    with open(NAMES[i]+'.txt', 'w') as f:
        f.write(json['text'])
    

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
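
The loop above writes `json['text']` for every response and pairs responses with `NAMES` by position, so it can fail if a job has not finished and its text is still missing. A more defensive variant might look like the sketch below. It assumes each response is a dict with `status` and `text` keys, as in the `get_status` output. `save_transcripts` is a hypothetical helper name, and the sample data is made up.

```python
import os
import tempfile


def save_transcripts(names, responses, out_dir):
    """Write one .txt per completed transcript; return the names that were skipped."""
    os.makedirs(out_dir, exist_ok=True)
    skipped = []
    for name, response in zip(names, responses):
        text = response.get("text")
        if response.get("status") != "completed" or text is None:
            skipped.append(name)  # not ready (or failed): nothing to write yet
            continue
        with open(os.path.join(out_dir, name + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)
    return skipped


# Made-up data standing in for DONE and STATUS.
out_dir = os.path.join(tempfile.mkdtemp(), "assembly_ai")
skipped = save_transcripts(
    ["a.wav", "b.wav"],
    [{"status": "completed", "text": "hello"}, {"status": "processing", "text": None}],
    out_dir,
)
print(skipped, os.listdir(out_dir))
```

Passing `DONE` rather than `NAMES` as the first argument would keep each transcript aligned with the file that was actually uploaded, and the returned list shows which files need another status check.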


### Understanding Assembly AI API

The status of a transcription changes from “queued” to “processing” to “completed” as long as no errors are encountered; if something goes wrong, the status becomes “error” instead.

We need to poll this endpoint until we get a response object with the status “completed”.

We can use a while loop to make requests to the endpoint repeatedly, checking the status of the transcription on each iteration and stopping once it is “completed”. This process of making requests and waiting until the status is complete is known as polling.

In [None]:
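import time

transcript = get_status(TOKEN, ID)
while transcript['status'] not in ('completed', 'error'):
    time.sleep(5)  # wait between polls rather than hammering the endpoint
    transcript = get_status(TOKEN, ID)

text = transcript['text']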
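
The polling idea can also be wrapped in a reusable function with a timeout. This is a minimal sketch under stated assumptions: `wait_for_transcript` is a hypothetical helper name, the 5-second interval and 10-minute timeout are arbitrary choices, and `fetch_status` stands for any zero-argument callable that returns the response dict (for example `lambda: get_status(TOKEN, ID)`). A fake sequence of responses is used here so the sketch runs offline.

```python
import time


def wait_for_transcript(fetch_status, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll `fetch_status()` until the job completes, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(f"transcription failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription did not finish in time")
        time.sleep(poll_seconds)  # pause between requests


# Stand-in for `lambda: get_status(TOKEN, ID)` so the sketch runs offline.
fake_responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
transcript = wait_for_transcript(lambda: next(fake_responses), poll_seconds=0.01)
print(transcript["text"])
```

Taking the status fetcher as an argument keeps the waiting logic separate from the HTTP call, which makes it easy to try out without spending API credits.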

### Working with the API response `dict`