# Books on Tape

In this project, we attempt to classify audiobooks by author, using works by two classic writers: Jane Austen and Charles Dickens.

## Prerequisites and Preprocessing

To begin, upload one audiobook by Austen and one by Dickens, in MP3 format, to an Amazon S3 bucket. Each chapter should be in its own MP3 file.

### Permissions and environment variables

Next, configure the SageMaker execution role so that it can access the other AWS resources this notebook uses, including:

* The S3 bucket that holds the audio files

In [1]:
%%time

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
print(role)

training_image = get_image_uri(boto3.Session().region_name, 'blazingtext')

arn:aws:iam::487757292854:role/service-role/AmazonSageMaker-ExecutionRole-20190122T171669
CPU times: user 813 ms, sys: 201 ms, total: 1.01 s
Wall time: 6.62 s


### Data Preparation

The audiobooks are already stored in an Amazon S3 bucket. To prepare the data for analysis, use Amazon Transcribe to convert the audio to text, from which labelled training and validation data can be created.
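
To make the target of this step concrete, here is a minimal sketch of how one Transcribe result could be turned into labelled lines. It assumes the standard Transcribe output JSON (text under `results.transcripts[].transcript`) and the `__label__<name>` line prefix that BlazingText expects in supervised mode. The function name `transcript_to_labelled_lines` and the naive sentence splitting are illustrative assumptions, not part of the original notebook.

```python
import json
import re

def transcript_to_labelled_lines(transcript_json, label):
    """Turn a Transcribe result document into '__label__<label> <sentence>' lines."""
    doc = json.loads(transcript_json)
    # Join all transcript segments into one block of text
    text = ' '.join(t['transcript'] for t in doc['results']['transcripts'])
    # Naive sentence split: break after '.', '?' or '!' followed by whitespace
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    return ['__label__{} {}'.format(label, s) for s in sentences if s]

# Example with a hand-written document in the Transcribe result layout
sample = json.dumps(
    {'results': {'transcripts': [{'transcript': 'It is a truth. Universally acknowledged!'}]}}
)
for line in transcript_to_labelled_lines(sample, 'austen'):
    print(line)
```

A real pipeline would likely want a proper sentence tokenizer and a shuffle before splitting into training and validation sets; this sketch only illustrates the line format.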

In [3]:
# get the SageMaker session
sess = sagemaker.Session()

# operate in the default SageMaker bucket when generating output of Transcribe job
sm_bucket = sess.default_bucket() # Replace with your own bucket name if needed
print(sm_bucket)
prefix = 'transcribe'

# identify where the Dickens and Austen book files are held
mp3_bucket = 'ml-classification-books-on-tape'
austen_book = 'austen_sense_and_sensibility'
dickens_book = 'dickens_great_expectations'

sagemaker-us-east-2-487757292854


For each author, we need to list the MP3 files in the bucket and start a Transcribe job for each file. The output of each Transcribe job should be stored under an appropriate prefix in our SageMaker S3 bucket.

In [7]:
# Builds the S3 URI for the passed mp3_file
def media_file_uri(mp3_file):
    return 'https://s3-{}.amazonaws.com/{}/{}'.format(
        boto3.Session().region_name,
        mp3_bucket,
        mp3_file
    )

# Starts an Amazon Transcribe job for the passed mp3_file
def start_transcribe_job(mp3_file):
    transcribe = boto3.client('transcribe')
    response = transcribe.start_transcription_job(
        # need to remove the slash from job name
        TranscriptionJobName=mp3_file.replace('/', '--'),
        LanguageCode='en-US',
        MediaFormat='mp3',
        Media={
            'MediaFileUri': media_file_uri(mp3_file)
        },
        OutputBucketName=sm_bucket
    )
    print('Transcribe job status: {}'.format(response['TranscriptionJob']['TranscriptionJobStatus']))
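
The job name above is derived by replacing slashes, which is enough for the keys used here. As a hedged sketch, the helpers below generalize that idea: Transcribe job names may contain only letters, digits, `.`, `_` and `-`, and are limited to 200 characters. The helper names and the `transcribe/` output layout are assumptions for illustration. In recent boto3 versions the resulting key could be passed as `OutputKey` alongside `OutputBucketName`; the original cell does not do this.

```python
import re

# Characters that are NOT allowed in a Transcribe job name
_INVALID_JOB_NAME_CHARS = re.compile(r'[^0-9a-zA-Z._-]')

def transcribe_job_name(mp3_key):
    """Derive a valid Transcribe job name from an S3 object key."""
    name = mp3_key.replace('/', '--')              # same convention as the cell above
    name = _INVALID_JOB_NAME_CHARS.sub('_', name)  # replace anything else that is invalid
    return name[:200]                              # job names are capped at 200 characters

def transcript_output_key(mp3_key, output_prefix='transcribe'):
    """Hypothetical helper: S3 key under which the transcript JSON would be stored."""
    return '{}/{}.json'.format(output_prefix, transcribe_job_name(mp3_key))

key = 'austen_sense_and_sensibility/senseandsensibility_01_austen_64kb.mp3'
print(transcribe_job_name(key))
print(transcript_output_key(key))
```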

In [5]:
# Returns a list of the object keys found under the specified prefix.
# Note: list_objects returns at most 1,000 keys per call.
def get_mp3_objects_for(author_directory):
    s3 = boto3.client('s3')
    response = s3.list_objects(
        Bucket=mp3_bucket,
        Prefix=author_directory
    )

    # 'Contents' is missing from the response when no object matches the prefix
    keys = [obj['Key'] for obj in response.get('Contents', [])]

    print('Found {} MP3 files in {}'.format(len(keys), author_directory))
    return keys
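
A single listing call is sufficient for a book with a few dozen chapters. For larger collections, a paginated variant might look like the sketch below. It takes the S3 client as a parameter (an assumption made so the function can be exercised without AWS access), uses boto3's `list_objects_v2` paginator, and keeps only keys ending in `.mp3`. The name `list_mp3_keys` is hypothetical.

```python
def list_mp3_keys(s3_client, bucket, prefix):
    """Return every key under `prefix` that ends in .mp3, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix
        for obj in page.get('Contents', []):
            if obj['Key'].lower().endswith('.mp3'):
                keys.append(obj['Key'])
    return keys
```

In the notebook this could be called as `list_mp3_keys(boto3.client('s3'), mp3_bucket, austen_book)`.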

In [8]:
# start with Jane Austen...
austen_files = get_mp3_objects_for(austen_book)

print(austen_files[0])
start_transcribe_job(austen_files[0])

Found 50 MP3 files in austen_sense_and_sensibility
austen_sense_and_sensibility/senseandsensibility_01_austen_64kb.mp3
Transcribe job status: IN_PROGRESS
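
The cell above starts a job for the first chapter only, and the job is still `IN_PROGRESS` when the call returns. One possible way to wait for a batch of jobs is sketched below. It polls `get_transcription_job` until each job reports `COMPLETED` or `FAILED`. The function name, the 30-second polling interval, and the injectable `sleep` argument are assumptions for illustration rather than part of the original notebook.

```python
import time

def wait_for_transcribe_jobs(transcribe_client, job_names, poll_seconds=30, sleep=time.sleep):
    """Poll until every job is COMPLETED or FAILED; return {job_name: status}."""
    pending = set(job_names)
    final_status = {}
    while pending:
        for name in sorted(pending):
            response = transcribe_client.get_transcription_job(TranscriptionJobName=name)
            status = response['TranscriptionJob']['TranscriptionJobStatus']
            if status in ('COMPLETED', 'FAILED'):
                final_status[name] = status
        # Drop the jobs that reached a terminal state in this round
        pending -= set(final_status)
        if pending:
            sleep(poll_seconds)
    return final_status
```

After starting a job per chapter with `start_transcribe_job`, the job names (each key with `/` replaced by `--`) could be passed to this function together with `boto3.client('transcribe')`. Account-level limits on concurrent Transcribe jobs may apply, so submitting chapters in smaller batches may be necessary.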
