# Transcribe
This notebook transcribes our MP3 files and creates the initial dataframe.

### Set Up
In this notebook, we copied all of the files from our S3 bucket, "napoli-project", into the GitHub repo "Napoli-Polly".

The first step was to determine whether all of our MP3 files were in the S3 bucket. We did this by calling:

In [79]:
#!aws s3 ls napoli-project

2021-04-22 01:00:49      29668 english-batter-barter.mp3
2021-04-22 01:05:31      22144 english-beat-bit.mp3
2021-04-22 00:58:16      12740 english-crazy-glue-sticky.mp3
2021-04-22 01:02:39      16032 english-croissant.mp3
2021-04-22 01:02:09       9919 english-look-luke.mp3
2021-04-22 01:04:39      13837 english-nearby-target.mp3
2021-04-22 01:00:24      15091 english-note-not.mp3
2021-04-22 01:03:04       9449 english-roll-row.mp3
2021-04-22 00:59:32       7881 english-sheep-ship.mp3
2021-04-22 01:05:01      16502 english-th-sounds.mp3
2021-04-22 00:59:57      12427 english-vine-wine.mp3
2021-04-13 21:36:08      41423 french-batter-barter.mp3
2021-04-13 21:34:26      29354 french-beat-bit.mp3
2021-04-07 21:25:39      14621 french-crazy-glue-sticky.mp3
2021-04-07 21:20:12      16502 french-croissant.mp3
2021-04-13 21:33:16      12270 french-look-luke.mp3
2021-04-07 21:22:32      18069 french-nearby-target.mp3
2021-04-13 21:35:33      19166 french-note-not.mp3
2021-04-13 21:37:36      

Once we saw that all 11 MP3 files for each of the 4 languages were in the bucket, we copied them to our SageMaker folder (which is also our GitHub repo). Next, we started creating a dataframe.

In [124]:
#!aws s3 cp s3://napoli-project/ ../Napoli-Polly/MP3\ files --recursive
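
The check described above (11 MP3 files for each language) can also be done in code rather than by reading the listing. The following is a minimal sketch, assuming every key follows the `<language>-<phrase>.mp3` pattern seen in the listing. `count_by_language` is a hypothetical helper, and the sample keys are a small subset copied from the listing, not the full bucket contents.

```python
from collections import Counter

def count_by_language(keys):
    """Count MP3 keys per language, taking the text before the first hyphen as the language."""
    return Counter(key.split("-", 1)[0] for key in keys if key.endswith(".mp3"))

# A small subset of the keys shown in the listing (assumption: the real list comes from the bucket)
sample_keys = [
    "english-batter-barter.mp3",
    "english-beat-bit.mp3",
    "french-batter-barter.mp3",
]
counts = count_by_language(sample_keys)
print(counts)
```

With the full list of keys, each language should map to 11 if the bucket is complete.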

We need to import the following 5 packages so that the code can run (`time` is used by the polling loops that wait for each transcription job):

In [36]:
import time
import boto3
import s3fs
import numpy as np
import pandas as pd

Next, we defined the resources that we would be using:
1. s3_resource : The S3 service resource, which gives us access to our buckets

2. polly : The client that created the MP3 files

3. transcribe : The client that transcribes MP3 files

4. napoli_bucket : The bucket where all of our files were stored

In [37]:
# Defining the resource that we are using (the S3 service resource)
s3_resource = boto3.resource('s3')

# Defining the clients that we will be using
polly = boto3.client('polly') 
transcribe = boto3.client('transcribe')

# Defining the bucket that we have our data in
napoli_bucket = s3_resource.Bucket('napoli-project')

Next, we created a variable that holds a summary of every object in our S3 bucket:

In [125]:
summaries = napoli_bucket.objects.all()

Then we extracted the name (key) of each MP3 file in the bucket:

In [126]:
mp3_names = [mp3.key for mp3 in summaries]

### Dataframe Creation
We created a new dataframe and added the name of each file to it. This dataframe is the basis for everything we do in the rest of the project.

In [127]:
data = pd.DataFrame()
data['Names'] = mp3_names
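
Because each file name encodes both the language and the phrase, the `Names` column can be split into separate columns for later grouping. This is a sketch only, assuming the `<language>-<phrase>.mp3` naming pattern holds for every file. The column names `Language` and `Phrase` are hypothetical and are not part of the original dataframe.

```python
import pandas as pd

# Two names taken from the bucket listing, standing in for mp3_names
names = ["english-beat-bit.mp3", "french-croissant.mp3"]
data = pd.DataFrame({"Names": names})

# Strip the extension, then split once on the first hyphen: language, phrase
stems = data["Names"].str.replace(".mp3", "", regex=False)
parts = stems.str.split("-", n=1, expand=True)
data["Language"] = parts[0]  # hypothetical column
data["Phrase"] = parts[1]    # hypothetical column
print(data)
```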

### Adding the English transcriptions to the folder "Transcriptions"
* english-batter-barter.mp3 
* english-beat-bit.mp3
* english-crazy-glue-sticky.mp3
* english-croissant.mp3
* english-look-luke.mp3
* english-nearby-target.mp3
* english-note-not.mp3
* english-roll-row.mp3
* english-sheep-ship.mp3
* english-th-sounds.mp3
* english-vine-wine.mp3

In [120]:
# Defining each job uri path and job name
job_uri = "s3://napoli-project/english-batter-barter.mp3"
job_name= "english-batter-barter"

In [123]:
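def transcribe_job(file_name):
    job_uri = "s3://napoli-project/" + file_name
    # Job names must be unique, so a suffix is appended to the file name
    job_name = file_name + "9"

    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': job_uri},
        MediaFormat='mp3',
        LanguageCode='en-US'
    )
    # Poll every 5 seconds until the job completes or fails
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print("Not ready yet...")
        time.sleep(5)
    # A failed job has no transcript, so stop here
    if status['TranscriptionJob']['TranscriptionJobStatus'] == 'FAILED':
        print("Transcription failed for " + file_name)
        return
    # This prints the Uri, read from the final status response
    Uri = status['TranscriptionJob']['Transcript']['TranscriptFileUri']
    print(Uri)

    # Download the transcript as JSON; IPython expands {name} inside shell commands
    !wget -O "{file_name}.json" "{Uri}"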

In [94]:
for file in data["Names"]:
    transcribe_job(file)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEgaCXVzLWVhc3QtMSJHMEUCIQDNv8uq8q4UlX4jEMi5DOzY1qvR0G95J%2BVt%2BOz1Km%2BVRAIgXYVsvOTmZrfmXj5DQ%2Bpy2DCBct2TV4jce42wYHNsA1AqvQMIsf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgwyNzY2NTY0MzMxNTMiDI%2F5aMSBKoz2P2uWjyqRA%2Fsg5rs1a5VY77zJlILn39%2B3OCm7a1WUOHKIVbC688hibvXIRDVW%2BFkhokh%2FJ5ceTjVppr98I%2BXNSxhFQlBglCG1dfEL5%2FLbukyJulDIQmtb7USQNmSCjfhUGXc50wWv2evjDuPOoFgPeGpN2kzBTHBF%2BFHDe%2B9kDREO2NXe55EHJLpkmt9FoKi%2BCHAVwRxd8emZ8uI215ku%2BcGJmTm4XetxgKgMS%2FYQ7BJCnUMzXFicIL7XVwRy5huCFFKnVMn1DCm%2FvYSLE%2BIctJgy%2BdsRxRvwpZTVf0Aq5muabUplOjBd%2BHTfmVmZCrsdgHPSGNraDS6xB6FLGxJx5DnwdNMn5JjbpnNvJZHO%2BhHPA1K8W7BSihlOTf6Bp1cOBRkhwMPbw%2BLV48K9%2FutQ%2BBkVHX%2BfHbBzQXvHnihd8AVch1XYUWQXnFFjjWUxXrYY8Ul4%2FMdZrdxSkcBuKmNPNE0QmENiRBf2b9hTwVpk%2B%2BmYRpdESAID3XB8H

HTTPError: HTTP Error 403: Forbidden

In [122]:
transcribe_job("english-batter-barter.mp3")

ConflictException: An error occurred (ConflictException) when calling the StartTranscriptionJob operation: The requested job name already exists. Use a different job name.
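
The ConflictException above occurs because a job with that name had already been started. One way to avoid it is to generate a fresh name for every run. The sketch below is an illustration rather than what we used: `make_job_name` is a hypothetical helper that appends a short random suffix and replaces any character outside letters, digits, `.`, `_` and `-`, which is, as far as we know, the character set Transcribe accepts for job names.

```python
import re
import uuid

def make_job_name(file_name):
    """Build a job name from a file name plus a short random suffix (hypothetical helper)."""
    stem = file_name.rsplit(".", 1)[0]              # drop the .mp3 extension
    safe = re.sub(r"[^0-9a-zA-Z._-]", "-", stem)    # keep only the assumed allowed characters
    return f"{safe}-{uuid.uuid4().hex[:8]}"

first = make_job_name("english-batter-barter.mp3")
second = make_job_name("english-batter-barter.mp3")
print(first)
print(second)
```

Two calls with the same file name give different job names, so rerunning a cell would not collide with an earlier job.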

In [113]:
!wget "https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEgaCXVzLWVhc3QtMSJHMEUCIQDNv8uq8q4UlX4jEMi5DOzY1qvR0G95J%2BVt%2BOz1Km%2BVRAIgXYVsvOTmZrfmXj5DQ%2Bpy2DCBct2TV4jce42wYHNsA1AqvQMIsf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgwyNzY2NTY0MzMxNTMiDI%2F5aMSBKoz2P2uWjyqRA%2Fsg5rs1a5VY77zJlILn39%2B3OCm7a1WUOHKIVbC688hibvXIRDVW%2BFkhokh%2FJ5ceTjVppr98I%2BXNSxhFQlBglCG1dfEL5%2FLbukyJulDIQmtb7USQNmSCjfhUGXc50wWv2evjDuPOoFgPeGpN2kzBTHBF%2BFHDe%2B9kDREO2NXe55EHJLpkmt9FoKi%2BCHAVwRxd8emZ8uI215ku%2BcGJmTm4XetxgKgMS%2FYQ7BJCnUMzXFicIL7XVwRy5huCFFKnVMn1DCm%2FvYSLE%2BIctJgy%2BdsRxRvwpZTVf0Aq5muabUplOjBd%2BHTfmVmZCrsdgHPSGNraDS6xB6FLGxJx5DnwdNMn5JjbpnNvJZHO%2BhHPA1K8W7BSihlOTf6Bp1cOBRkhwMPbw%2BLV48K9%2FutQ%2BBkVHX%2BfHbBzQXvHnihd8AVch1XYUWQXnFFjjWUxXrYY8Ul4%2FMdZrdxSkcBuKmNPNE0QmENiRBf2b9hTwVpk%2B%2BmYRpdESAID3XB8HA53RymlUKc2%2Bm0Pb6OXGjf91bPp7m3TWwRbrJFeML7xgoQGOusBKGlVtGl3UBT1lvufpt7UGO%2B8G8vXGOrlFmTJzapvMdN2Tmz%2BWzMTLuJOySsvUQapcjRUZ5NfyFlgfw91aF3JQKjfM5xkbaMNIPVOWBflkbMMdoOSQQ2dgv2ao67T%2BRaJF0FAvcv8t%2Fq0oi4QQN7OiriqLy20h4arbdE5kDDgiDnjKsba1Kk5hKl2Wt%2FlZsItaWysdhGlo%2F%2B9NJFqaGul3JA36FMsSaOlBI7QQkdUN%2BXnzznjAtPnyfM8fN05eWVwiVqv4DXVbp0myxZQ7DYrKaZ7j8aAgbguJ3wjfGNVx6CjPTmyscK6O0lLrg%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210422T005026Z&X-Amz-SignedHeaders=host&X-Amz-Expires=899&X-Amz-Credential=ASIAUA2QCFAARREWOVHC%2F20210422%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=69912d69727d9a4a30eb06da7f389f267fb26433d9a3898bcd211923fe0587b2"

The name is too long, 1456 chars total.
Trying to shorten...
New name is asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEgaCXVzLWVhc3QtMSJHMEUCIQDNv8uq8q4UlX4jEMi5DOzY1qvR0G95J%2BVt%2BOz1Km%2BVRAIgXYVsvOTmZrfmXj5DQ%2Bpy2DCBct2TV4jce42wYHNsA1AqvQMIsf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgwyNzY2NTY0MzMxNTMi.
--2021-04-22 02:38:10--  https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEgaCXVzLWVhc3QtMSJHMEUCIQDNv8uq8q4UlX4jEMi5DOzY1qvR0G95J%2BVt%2BOz1Km%2BVRAIgXYVsvOTmZrfmXj5DQ%2Bpy2DCBct2TV4jce42wYHNsA1AqvQMIsf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgwyNzY2NTY0MzMxNTMiDI%2F5aMSBKoz2P2uWjyqRA%2Fsg5rs1a5VY77zJlILn39%2B3OCm7a1WUOHKIVbC688hibvXIRDVW%2BFkhokh%2FJ5ceTjVppr98I%2BXNSxhFQlBglCG1dfEL5%2FLbukyJulDIQmtb7USQNmSCjfhUGXc50wWv2evjDuPOoFgPeGpN2kzBTHBF%2BFHDe%2B9kDREO2NXe55EHJLpkmt9FoKi%2BCHAVwRxd8emZ8uI215ku%2BcGJmTm4XetxgKgMS%2FYQ7BJCnUMzXFicIL7XVwRy5huCFFKnV
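
The transcript link is a pre-signed URL that is only valid for a limited time. The URL in the cell above carries `X-Amz-Date=20210422T005026Z` and `X-Amz-Expires=899`, while the wget log starts at 02:38:10, so, assuming the notebook clock is in UTC, the link had probably expired by then. An expired link is rejected by S3, which may also explain the 403 Forbidden error seen earlier. The sketch below computes the expiry time from those two query parameters. `presigned_expiry` is a hypothetical helper, and the example URL is shortened to the two parameters that matter.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def presigned_expiry(url):
    """Return the UTC time at which a pre-signed URL stops being valid (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))

# Shortened stand-in for the real link, keeping only the signing time and lifetime
url = "https://example.com/asrOutput.json?X-Amz-Date=20210422T005026Z&X-Amz-Expires=899"
expiry = presigned_expiry(url)
print(expiry)
```

Calling `get_transcription_job` again returns a freshly signed link, so downloading immediately after that call avoids the problem.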

In [148]:
# This code starts a transcription job for our .mp3 file
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='mp3',
    LanguageCode='en-US'
)
# Poll every 5 seconds, printing a message until the job completes or fails
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
# This prints the full TranscriptionJob response
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'hindi-vine-wine', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 22050, 'MediaFormat': 'mp3', 'Media': {'MediaFileUri': 's3://napoli-project/hindi-vine-wine.mp3'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIQDtshPks9Hg4QLvlDvacn9mdkW34a2jQjfQb6H8PET06gIgE87ZV9NVcZFJYBlVO9Vazm85Xegi919D810WmY%2FL2hUqtAMIGhACGgwyNzY2NTY0MzMxNTMiDAHGe5Rb%2BoqapMD4OSqRA4WBertVMPGb7F8yQB4UXBnREj3Wp4RUD6jpvN0UYTkMCXt4sZfpHoM73PpLLBxXFx8GsOUe0yEcJFLp1d70ZYqLO8gQxJbj%2BQ1bf4Gok
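
The polling loop above runs until the job reports COMPLETED or FAILED, so it would wait forever if a job got stuck. The following is a sketch of the same loop with a timeout added. `wait_for_job` and its parameters are hypothetical, and the status source is passed in as a function so the example can run without calling AWS.

```python
import time

def wait_for_job(get_status, poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until it returns COMPLETED or FAILED, or raise TimeoutError."""
    waited = 0
    while True:
        state = get_status()
        if state in ("COMPLETED", "FAILED"):
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with a fake status source instead of a real Transcribe call
fake_states = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
result = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(result)
```

With the real client, the status function would be `lambda: transcribe.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus']`.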

In [149]:
# This code retrieves the job and outputs the URI of its transcript file
job = transcribe.get_transcription_job(TranscriptionJobName=job_name)

job['TranscriptionJob']['Transcript']['TranscriptFileUri']

'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj3l0XzpJmFXxvWnKpEDue80AyEvz8UWelPPLtgsDx3pmnPFJct8p3J%2BfazXqqV4j7Z1rtM8cHlVNPV2bMgRadgzMVZixA6XSSMBhLPn3pzP6KXv%2BU6VOdpQ91Ac%2BLvTZy79rbuWTkv8lPI4IRVdHHSqmaxvH%2FH1nURwJUYlsUWj2pfcNVqvydueZGIWsJ2Ok%2F3Lez5lKO92Q4kzDFP9A3YvB6pm%2FBeE3Y32KAXEH68OExmP00Zmq8CC8HLKatPK1EO3oxYnav77tfw9N7uIIcvauyDQoaWTB3EuGtp8gqW%2FjScnuGJRLrDDncZZ2FJ1JEGr%2FTPpQIkXYW2yv0QC%2Fll830iQooJUdXT%2FSXpdrvlAPtxEa2URVQMcbGyc%2FZZpJeT91yBNlq%2FxAvQ36LfzPbHgFcBuxhJERpMkUOpd3astbzd%2B%2FCh0ESFdoclvIFCwmOkm40GACKuCiv8WO%2BO%2FaeOeHEK2Rl%2BT3qWcZJs1eiYIEstYZltx08VgPZrUrkjC%2FIBDfPjiEVNdzbcdJsrXXEGDv1FLIS%2FMVen70y3WpV8w4P%2FhgwY67AGG5wJo%2B7RCpU6RGjnOk6ntyOL2
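
Once the file behind this URI has been downloaded, the transcribed text can be read out of the JSON. The sketch below assumes the usual Transcribe output layout, in which the text sits under `results` → `transcripts` → `transcript`. `extract_transcript` is a hypothetical helper, and the sample JSON, including the transcript text, is made up for illustration and is not real output from our jobs.

```python
import json

# Made-up stand-in for the contents of asrOutput.json
sample_output = '''
{
  "jobName": "hindi-vine-wine",
  "results": {"transcripts": [{"transcript": "vine wine"}], "items": []},
  "status": "COMPLETED"
}
'''

def extract_transcript(raw_json):
    """Return the full transcript text from a Transcribe output document."""
    document = json.loads(raw_json)
    return document["results"]["transcripts"][0]["transcript"]

text = extract_transcript(sample_output)
print(text)
```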

In [150]:
# This code is used to download the transcription to our local folder
!wget 'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj3l0XzpJmFXxvWnKpEDue80AyEvz8UWelPPLtgsDx3pmnPFJct8p3J%2BfazXqqV4j7Z1rtM8cHlVNPV2bMgRadgzMVZixA6XSSMBhLPn3pzP6KXv%2BU6VOdpQ91Ac%2BLvTZy79rbuWTkv8lPI4IRVdHHSqmaxvH%2FH1nURwJUYlsUWj2pfcNVqvydueZGIWsJ2Ok%2F3Lez5lKO92Q4kzDFP9A3YvB6pm%2FBeE3Y32KAXEH68OExmP00Zmq8CC8HLKatPK1EO3oxYnav77tfw9N7uIIcvauyDQoaWTB3EuGtp8gqW%2FjScnuGJRLrDDncZZ2FJ1JEGr%2FTPpQIkXYW2yv0QC%2Fll830iQooJUdXT%2FSXpdrvlAPtxEa2URVQMcbGyc%2FZZpJeT91yBNlq%2FxAvQ36LfzPbHgFcBuxhJERpMkUOpd3astbzd%2B%2FCh0ESFdoclvIFCwmOkm40GACKuCiv8WO%2BO%2FaeOeHEK2Rl%2BT3qWcZJs1eiYIEstYZltx08VgPZrUrkjC%2FIBDfPjiEVNdzbcdJsrXXEGDv1FLIS%2FMVen70y3WpV8w4P%2FhgwY67AGG5wJo%2B7RCpU6RGjnOk6ntyOL25QbckYWHivQdE1VPmFwfRGlI%2Fay5QhrA76oXNJmCXnLMDajzqlctf9feyyIJuDjO9WouHYNOCMWHLkLHJ9mevrorWiDlhthztobg5Zg0RypQZQYfMuNMx%2BtL5v6loVxB%2BckmLmWDdjmDGXkiu%2BhOt94xWKuhkiG7dohNGMKABbTkWBB9FWh3skTUGOh%2BmGxDfLWFkKyYRt34z24OciSfGExZbtzeSgtpraAColWAuvphoucVU93MfyEazEF9Qj8AXDh9M8q3bbDybD798TZlAb%2Fut0kK5ZOqzQ%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210415T181632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=ASIAUA2QCFAAZSTT5J6P%2F20210415%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=57fe290b0bba65d8c6f55c56347fc145d74c8e4e2b429d4823a8379412d60098'

The name is too long, 1442 chars total.
Trying to shorten...
New name is asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj.
--2021-04-15 18:16:48--  https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj3l0XzpJmFXxvWnKpEDue80AyEvz8UWelPPLtgsDx3pmnPFJct8p3J%2BfazXqqV4j7Z1rtM8cHlVNPV2bMgRadgzMVZixA6XSSMBhLPn3pzP6KXv%2BU6VOdpQ91Ac%2BLvTZy79rbuWTkv8lPI4IRVdHHSqmaxvH%2FH1nURwJUYlsUWj2pfcNVqvydueZGIWsJ2Ok%2F3Lez5lKO92Q4kzDFP9A3YvB6pm%2FBeE3Y32KAXEH68OExmP00Zmq8CC8HLKatPK1EO3oxYnav77tfw9N7uIIcvauyDQoaW
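
The "name is too long" message appears because wget builds the local file name from the whole URL, including the query string. Choosing the name ourselves avoids this. The sketch below derives a short name from the URL path, assuming the path always ends in `<job-name>/<id>/asrOutput.json` as it does in the links above. `transcript_filename` is a hypothetical helper, and the query string in the example is shortened.

```python
from urllib.parse import urlparse

def transcript_filename(url):
    """Build '<job-name>.json' from a transcript URL (hypothetical helper)."""
    segments = urlparse(url).path.strip("/").split("/")
    # The last three path segments are: job name, job id, asrOutput.json
    return segments[-3] + ".json"

url = ("https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/"
       "795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/"
       "asrOutput.json?X-Amz-Expires=900")
name = transcript_filename(url)
print(name)
```

In a notebook cell, the result can then be passed to wget with its `-O` option, for example `!wget -O "{name}" "{url}"`.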

In [128]:
!aws s3 ls napoli-project

2021-04-22 01:00:49      29668 english-batter-barter.mp3
2021-04-22 01:05:31      22144 english-beat-bit.mp3
2021-04-22 00:58:16      12740 english-crazy-glue-sticky.mp3
2021-04-22 01:02:39      16032 english-croissant.mp3
2021-04-22 01:02:09       9919 english-look-luke.mp3
2021-04-22 01:04:39      13837 english-nearby-target.mp3
2021-04-22 01:00:24      15091 english-note-not.mp3
2021-04-22 01:03:04       9449 english-roll-row.mp3
2021-04-22 00:59:32       7881 english-sheep-ship.mp3
2021-04-22 01:05:01      16502 english-th-sounds.mp3
2021-04-22 00:59:57      12427 english-vine-wine.mp3
2021-04-13 21:36:08      41423 french-batter-barter.mp3
2021-04-13 21:34:26      29354 french-beat-bit.mp3
2021-04-07 21:25:39      14621 french-crazy-glue-sticky.mp3
2021-04-07 21:20:12      16502 french-croissant.mp3
2021-04-13 21:33:16      12270 french-look-luke.mp3
2021-04-07 21:22:32      18069 french-nearby-target.mp3
2021-04-13 21:35:33      19166 french-note-not.mp3
2021-04-13 21:37:36      