# Books on Tape

In this project, we will attempt to classify audio books by author, using works by two classic authors: Jane Austen and Charles Dickens.

## Prerequisites and Preprocessing

To begin, upload one audio book each by Austen and Dickens, in MP3 format, to an Amazon S3 bucket. The MP3 files should be split per chapter (i.e., each chapter in its own file).

### Permissions and environment variables

Next, configure the SageMaker execution role to access other AWS resources, including:

* Audio file bucket

In [1]:
%%time

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
print(role)

training_image = get_image_uri(boto3.Session().region_name, 'blazingtext')

arn:aws:iam::487757292854:role/service-role/AmazonSageMaker-ExecutionRole-20190122T171669
CPU times: user 813 ms, sys: 201 ms, total: 1.01 s
Wall time: 6.62 s


### Data Preparation: Transcription

The audio books have already been stored in an Amazon S3 bucket. To prepare the data for analysis, use Amazon Transcribe to convert the audio to text, from which the labelled training and validation data will be created.

In [3]:
# get the SageMaker session
sess = sagemaker.Session()

# operate in the default SageMaker bucket when generating output of Transcribe job
sm_bucket = sess.default_bucket() # Replace with your own bucket name if needed
print(sm_bucket)
prefix = 'transcribe'

# identify where the Dickens and Austen book files are held
mp3_bucket = 'ml-classification-books-on-tape'
austen_book = 'austen_sense_and_sensibility'
dickens_book = 'dickens_great_expectations'

sagemaker-us-east-2-487757292854


For each of Austen and Dickens, we need to list the MP3 files and start a Transcribe job for each file. The output of each Transcribe job is stored in our SageMaker S3 bucket under a key that begins with the job name, and therefore with the book prefix.

In [7]:
# Builds the HTTPS URL of the passed mp3_file in the MP3 bucket
def media_file_uri(mp3_file):
    return 'https://s3-{}.amazonaws.com/{}/{}'.format(
        boto3.Session().region_name,
        mp3_bucket,
        mp3_file
    )

# Starts an Amazon Transcribe job for the passed mp3_file
def start_transcribe_job(mp3_file):
    transcribe = boto3.client('transcribe')
    response = transcribe.start_transcription_job(
        # need to remove the slash from job name
        TranscriptionJobName=mp3_file.replace('/', '--'),
        LanguageCode='en-US',
        MediaFormat='mp3',
        Media={
            'MediaFileUri': media_file_uri(mp3_file)
        },
        OutputBucketName=sm_bucket
    )
    print('Transcribe job status: {}'.format(response['TranscriptionJob']['TranscriptionJobStatus']))

In [5]:
# Returns an array of mp3 files in the specified prefix
def get_mp3_objects_for(book):
    s3 = boto3.client('s3')
    response = s3.list_objects(
        Bucket=mp3_bucket,
        Prefix=book
    )
    
    keys=[]
    for obj in response['Contents']:
        keys.append(obj['Key'])
    
    print('Found {} MP3 files in {}'.format(len(keys), book))
    return keys
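
`list_objects` returns at most 1,000 keys per call, and its response has no `Contents` entry when the prefix matches nothing, so the loop above would raise a `KeyError` in that case. A minimal sketch of a more defensive listing follows. The function name `list_mp3_keys` and its `s3_client` parameter are hypothetical names introduced here so the function can be exercised without AWS credentials.

```python
def list_mp3_keys(s3_client, bucket, prefix):
    """Return every key under `prefix` that ends in .mp3, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix
        for obj in page.get('Contents', []):
            if obj['Key'].lower().endswith('.mp3'):
                keys.append(obj['Key'])
    return keys
```

In the notebook this would be called as `list_mp3_keys(boto3.client('s3'), mp3_bucket, austen_book)`.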

In [8]:
# start with Jane Austen...
austen_files = get_mp3_objects_for(austen_book)

# TODO: Iterate through all files and run transcribe jobs... this could take a while
print(austen_files[0])
start_transcribe_job(austen_files[0])

Found 50 MP3 files in austen_sense_and_sensibility
austen_sense_and_sensibility/senseandsensibility_01_austen_64kb.mp3
Transcribe job status: IN_PROGRESS
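
The cell above starts a job for the first chapter only. One way to cover every chapter is sketched below, assuming that a failure on one file should not stop the rest. `start_jobs_for` and `pause_seconds` are hypothetical names; the `start_job` argument is expected to be a callable such as `start_transcribe_job`.

```python
import time

def start_jobs_for(files, start_job, pause_seconds=0.5):
    """Call start_job(mp3_file) for each file; return (started, failed)."""
    started, failed = [], []
    for mp3_file in files:
        try:
            start_job(mp3_file)
            started.append(mp3_file)
        except Exception as err:
            # Record the failure and keep going with the remaining files
            failed.append((mp3_file, str(err)))
        # Short pause between requests to reduce the chance of throttling
        time.sleep(pause_seconds)
    return started, failed
```

For example, `started, failed = start_jobs_for(austen_files, start_transcribe_job)` would submit all Austen chapters and collect any errors for review.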


---

**NOTE:** Transcribing all files can take some time. Please pause here until complete!

---
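
Rather than waiting by hand, the notebook could poll Transcribe until every job has finished. The sketch below assumes the job names are the ones built in `start_transcribe_job` (the MP3 key with `/` replaced by `--`). `wait_for_jobs` and `poll_seconds` are hypothetical names, and the client is passed in so the loop can be tried with a stand-in object.

```python
import time

def wait_for_jobs(transcribe_client, job_names, poll_seconds=30):
    """Block until every job is COMPLETED or FAILED; return {job_name: status}."""
    pending = set(job_names)
    final = {}
    while pending:
        for name in sorted(pending):
            job = transcribe_client.get_transcription_job(TranscriptionJobName=name)
            status = job['TranscriptionJob']['TranscriptionJobStatus']
            if status in ('COMPLETED', 'FAILED'):
                final[name] = status
        # Drop finished jobs, then wait before polling the rest again
        pending -= set(final)
        if pending:
            time.sleep(poll_seconds)
    return final
```

A possible call is `wait_for_jobs(boto3.client('transcribe'), [f.replace('/', '--') for f in austen_files])`; any job reported as `FAILED` would need to be inspected and resubmitted.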

### Data Preprocessing

After transcribing the audio files, we need to preprocess the training data into a format that can be consumed by the BlazingText algorithm. Per the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), a *space-separated tokenized text format* with class labels (prefixed by `__label__`) on the same line as the original sentence is appropriate. Each sentence will be on its own line. We'll use the `nltk` library to tokenize the input sentences. The raw data is retrieved by processing each `.json` file that Transcribe wrote to the session default S3 bucket.

In [68]:
from random import shuffle
import csv
import json
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [95]:
def transform_chapter(author, chapter):
    curr_chapter = []
    label = '__label__{}'.format(author)
    
    # tokenize chapter into sentences
    chapter = nltk.sent_tokenize(chapter)
    for sent in chapter:
        curr_sent = []
        curr_sent.append(label)
        # then tokenize to words per BlazingText input spec
        curr_sent.extend(nltk.word_tokenize(sent.lower()))
        curr_chapter.append(curr_sent)
        
    return curr_chapter
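
To make the target format concrete, here is a small illustration of one output line. It is only a sketch: it uses a simple regular expression as a stand-in tokenizer instead of `nltk`, so its tokens can differ from `nltk.word_tokenize` (for example on contractions), and `to_blazingtext_line` is a hypothetical helper name.

```python
import re

def to_blazingtext_line(label, sentence):
    """Build one '__label__<label> token token ...' line from a sentence."""
    # Stand-in tokenizer: words (with an optional apostrophe suffix) or single punctuation marks
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", sentence.lower())
    return ' '.join(['__label__{}'.format(label)] + tokens)

print(to_blazingtext_line('austen', 'The family of Dashwood had long been settled in Sussex.'))
```

This prints `__label__austen the family of dashwood had long been settled in sussex .`, which is the shape of line that `transform_chapter` produces for each sentence.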

In [96]:
def preprocess(book, output_file):
    transformed_chapters = []
    
    s3 = boto3.resource('s3')
    files = list(s3.Bucket(sm_bucket).objects.filter(Prefix=book))
    
    for file in files:
        author = file.key[:file.key.index('_')]
        data = json.loads(file.get()['Body'].read().decode('utf-8')) # read json string
        chapter = data['results']['transcripts'][0]['transcript']
        # extend (not append) so that each sentence becomes its own row in the output file
        transformed_chapters.extend(transform_chapter(author, chapter))
       
    # randomize and hold out some % of data set for validation
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_chapters)
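
The hold-out step mentioned in the comment above is not implemented yet. A minimal sketch of one way to do it follows, assuming each row is a list of tokens whose first element is the label. The names `split_rows`, `write_rows`, `validation_fraction`, and `seed` are hypothetical.

```python
import random

def split_rows(rows, validation_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and return (train_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def write_rows(rows, output_file):
    """Write one space-separated row per line."""
    with open(output_file, 'w') as outfile:
        for row in rows:
            outfile.write(' '.join(row) + '\n')
```

For the classifier to learn both labels, the rows from the Austen and Dickens books would presumably need to be combined before calling `split_rows`, so that both the training and the validation file contain sentences from each author.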

In [97]:
%%time

## Preparing training dataset

preprocess(austen_book, 'books.train')

CPU times: user 49.2 ms, sys: 0 ns, total: 49.2 ms
Wall time: 129 ms


Preprocessing can take a few minutes. After preprocessing is complete, we need to upload the data to S3 so that SageMaker can consume it when executing training jobs. We'll use the SageMaker Python SDK to upload the training and validation files to the bucket and prefix location that we set above.

In [74]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

# Uploads stay disabled until both files exist; 'books.validation' is an assumed file name
# sess.upload_data(path='books.train', bucket=sm_bucket, key_prefix=train_channel)
# sess.upload_data(path='books.validation', bucket=sm_bucket, key_prefix=validation_channel)

# s3_train_data = 's3://{}/{}'.format(sm_bucket, train_channel)
# s3_validation_data = 's3://{}/{}'.format(sm_bucket, validation_channel)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.87 µs


Next, we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [None]:
s3_output_location = 's3://{}/{}/output'.format(sm_bucket, prefix)

## Training