# Books on Tape

In this project, we will attempt to classify audio books by two classic authors, Jane Austen and Charles Dickens, according to who wrote them.

## Prerequisites and Preprocessing

To begin, upload one audio book each by Austen and Dickens, in MP3 format, to an Amazon S3 bucket. The MP3 files should be split per chapter (i.e., each chapter in its own file).

### Permissions and environment variables

Next, configure the SageMaker execution role so that it can access other AWS resources, including:

* Audio file bucket

In [5]:
%%time

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
print(role)

training_image = get_image_uri(boto3.Session().region_name, 'blazingtext')

arn:aws:iam::487757292854:role/service-role/AmazonSageMaker-ExecutionRole-20180321T112820
CPU times: user 67.7 ms, sys: 3.84 ms, total: 71.6 ms
Wall time: 137 ms


### Data Preparation

The audio books have already been stored in an Amazon S3 bucket. To prepare the data for analysis, use Amazon Transcribe to create labelled training and validation data.
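As a rough sketch of what "labelled" data could look like here: BlazingText's supervised mode reads one example per line, with each line prefixed by `__label__<name>`. The helper names below (`transcript_text`, `to_labelled_lines`, `split_train_validation`), the naive sentence split on periods, the lowercasing, and the 80/20 split are all illustrative assumptions and not part of the original notebook.

```python
import json
import random


def transcript_text(transcribe_json):
    """Pull the full transcript string out of an Amazon Transcribe result document."""
    doc = json.loads(transcribe_json)
    return doc['results']['transcripts'][0]['transcript']


def to_labelled_lines(text, label):
    """Split text into rough sentences and prefix each with a BlazingText-style label."""
    # Assumption: splitting on '.' is good enough for a first pass.
    sentences = [s.strip().lower() for s in text.split('.')]
    return ['__label__{} {}'.format(label, s) for s in sentences if s]


def split_train_validation(lines, validation_fraction=0.2, seed=0):
    """Shuffle deterministically, then hold out a fraction of lines for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Given the JSON body of one chapter's transcript, `to_labelled_lines(transcript_text(body), 'austen')` would yield lines ready to be written to a training file, and `split_train_validation` returns the training lines first and the validation lines second.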

In [6]:
# get the SageMaker session
sess = sagemaker.Session()

# operate in the default SageMaker bucket when generating output of Transcribe job
sm_bucket = sess.default_bucket() # Replace with your own bucket name if needed
prefix = 'transcribe'

# identify where the Dickens and Austen book files are held
mp3_bucket = 'ml-classification-books-on-tape'
print(mp3_bucket)
austen_book = 'austen_sense_and_sensibility'
dickens_book = 'dickens_great_expectations'

ml-classification-books-on-tape


For each of Austen and Dickens, we need to list the MP3 files and start a Transcribe job for each file. The output of each Transcribe job should be stored under a suitably prefixed output directory in our SageMaker S3 bucket.

In [8]:
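def start_transcribe_job(mp3_file):
    transcribe = boto3.client('transcribe')
    # job names may not contain '/', so derive a safe name from the S3 key
    job_name = mp3_file.replace('/', '_')
    response = transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode='en-US',
        MediaFormat='mp3',
        Media={
            # point at the chapter MP3 in the audio book bucket
            'MediaFileUri': 's3://{}/{}'.format(mp3_bucket, mp3_file)
        },
        OutputBucketName=sm_bucket,
        # a key ending in '/' stores the transcript as <prefix>/<job name>.json
        OutputKey='{}/'.format(prefix)
    )
    status = response['TranscriptionJob']['TranscriptionJobStatus']
    print('Transcribe job status: {}'.format(status))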
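Because `start_transcription_job` returns as soon as the job is queued, the transcript is not available right away. One way to wait for it, sketched here under assumptions, is to poll `get_transcription_job` until the status becomes `COMPLETED` or `FAILED`. The function name `wait_for_transcription` and the polling interval and limit are hypothetical choices, and the client is passed in as a parameter so the helper can be exercised without AWS access.

```python
import time


def wait_for_transcription(client, job_name, poll_seconds=30, max_polls=120):
    """Poll a Transcribe job until it completes or fails; return the final status."""
    for _ in range(max_polls):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        status = response['TranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return status
        # Still QUEUED or IN_PROGRESS: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError('Job {} still running after {} polls'.format(job_name, max_polls))
```

In the notebook this could be called as `wait_for_transcription(boto3.client('transcribe'), job_name)`, where `job_name` is whatever name the job was started with.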

In [11]:
def list_mp3_objects_for(author_directory):
    s3 = boto3.client('s3')
    # note: a single list_objects call returns at most 1,000 keys
    response = s3.list_objects(
        Bucket=mp3_bucket,
        Prefix=author_directory
    )

    # 'Contents' is absent when nothing matches the prefix
    keys = [obj['Key'] for obj in response.get('Contents', [])]

    print(keys)
    return keys
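The keys returned by the listing function contain a `/`, which Transcribe does not accept in a job name, so each key needs a derived name before a job can be started. The following is a minimal sketch of that bookkeeping; `transcribe_job_name` and `plan_jobs` are hypothetical helpers, and the choice to replace disallowed characters with `_` is an assumption.

```python
import re


def transcribe_job_name(s3_key):
    """Turn an S3 key into a name using only characters Transcribe accepts."""
    # Job names may contain letters, digits, '.', '_' and '-'; replace the rest
    # and cap the length at 200 characters.
    return re.sub(r'[^0-9a-zA-Z._-]', '_', s3_key)[:200]


def plan_jobs(bucket, keys, label):
    """Describe one Transcribe job per MP3 key, skipping non-MP3 objects."""
    return [
        {
            'label': label,
            'job_name': transcribe_job_name(key),
            'media_uri': 's3://{}/{}'.format(bucket, key),
        }
        for key in keys
        if key.lower().endswith('.mp3')
    ]
```

Keeping the author label alongside each job name makes it easier to tag the resulting transcripts later, for example `plan_jobs(mp3_bucket, list_mp3_objects_for(austen_book), 'austen')`.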

In [13]:
list_mp3_objects_for(austen_book)

['austen_sense_and_sensibility/senseandsensibility_01_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_02_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_03_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_04_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_05_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_06_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_07_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_08_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_09_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_10_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_11_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_12_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_13_austen_64kb.mp3',
 'austen_sense_and_sensibility/senseandsensibility_14_austen_64k