### Amazon Transcribe

In this notebook, we will take a look at ASR (Automatic Speech Recognition) service called Transcribe which can be used to generate transcripts for audio files stored in S3 buckets. At the time of writing, Transcribe supports English and Spanish audio files in `mp3`, `mp4`, `wav` and `flac` formats.

To start using Transcribe API, first we need to initialize client:

In [15]:
import boto3
import IPython.display as ipd
from pprint import pprint

session = boto3.session.Session()
transcribe_client = session.client('transcribe')

### Start transcription job

First method that we cover in this notebook will be `start_transcription_job`. Since transcription can take a significant amount of time, AWS doesn't offer a call that will instantly return transcripts. In order to transcript audio file, first we need to start transcription job. 

In this example, we will analyze a short part of Episode 39 of [Syntax.fm](https://syntax.fm/show/039/is-jquery-dead).

In [16]:
ipd.Audio('./syntax039part.mp3')

However, we cannot use local mp3's directly with Transcribe - we need to use files stored in S3 buckets and our mp3 is stored in a bucket named `aws-ml-apis` under the key `transcribe/syntax039part.mp3` - and that's what we're going to use. In addition to that, we also need to provide `TranscriptionJobName`, which will be an unique identifier of our transcription job, `LanguageCode`, in our case `en-US`, as well as `MediaFormat` which in our case will be `mp3`. 

In [18]:
start_response = transcribe_client.start_transcription_job(
    TranscriptionJobName='short-syntaxfm-transcript-2',
    LanguageCode='en-US',
    MediaFormat='mp3',
    Media={
        'MediaFileUri': 'https://s3.amazonaws.com/aws-ml-apis/transcribe/syntax039part.mp3'
    }
)

pprint(start_response)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '282',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Sat, 07 Apr 2018 20:06:34 GMT',
                                      'x-amzn-requestid': '2bb6793a-3a9f-11e8-b745-e78bca8441df'},
                      'HTTPStatusCode': 200,
                      'RequestId': '2bb6793a-3a9f-11e8-b745-e78bca8441df',
                      'RetryAttempts': 0},
 'TranscriptionJob': {'CreationTime': datetime.datetime(2018, 4, 7, 22, 6, 34, 294000, tzinfo=tzlocal()),
                      'LanguageCode': 'en-US',
                      'Media': {'MediaFileUri': 'https://s3.amazonaws.com/aws-ml-apis/transcribe/syntax039part.mp3'},
                      'MediaFormat': 'mp3',
                      'TranscriptionJobName': 'short-syntaxfm-transcript-2',
                      'TranscriptionJobStatus': 'IN_PR

### Get transcription job

Second method that we cover will be `get_transcription_job`. We can use it, as name suggests, to fetch data about previously started jobs.

In [20]:
get_response = transcribe_client.get_transcription_job(
    TranscriptionJobName='short-syntaxfm-transcript-2'
)

pprint(list_response)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '55',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Sat, 07 Apr 2018 19:55:44 GMT',
                                      'x-amzn-requestid': 'a8e534d6-3a9d-11e8-b780-b712b075752e'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'a8e534d6-3a9d-11e8-b780-b712b075752e',
                      'RetryAttempts': 0},
 'Status': 'IN_PROGRESS',
 'TranscriptionJobSummaries': []}
