# Transcribe and Dataframe Creation
This file was used for transcribing MP3 files and creating the initial dataframe.

## Set Up
In this file, we copied all of the files from our S3 bucket, "napoli-project", to the GitHub repo "Napoli-Polly".

The first step was to determine whether all of our MP3 files were in the S3 bucket. We did this by calling:

In [109]:
#!aws s3 ls napoli-project

Once we confirmed that the bucket held 11 MP3 files for each of the 4 languages, we copied them to our SageMaker folder (which was also our GitHub repo).

Next, we got started creating a dataframe.

In [124]:
#!aws s3 cp s3://napoli-project/ ../Napoli-Polly/MP3\ files --recursive

We need to import the following 5 packages so that the code can run:

In [128]:
import boto3
import s3fs
import numpy as np
import pandas as pd
import time

The following are the resources that we utilized in this project:
1. s3_resource : The S3 resource used to access our buckets

2. polly : The client that created the MP3 files

3. transcribe : The client that transcribes MP3 files

4. napoli_bucket : The bucket where our files were all stored

In [110]:
s3_resource = boto3.resource('s3') 

polly = boto3.client('polly') 
transcribe = boto3.client('transcribe')

napoli_bucket = s3_resource.Bucket('napoli-project')

Next, we create a variable that holds all of the objects in our S3 bucket:

In [3]:
summaries = napoli_bucket.objects.all()

## Dataframe Creation
We create a new dataframe and add the name of each file to it. This dataframe is the basis for the rest of the project.

First, we extract the name (key) of each MP3 file in the bucket:

In [4]:
mp3_names = [mp3.key for mp3 in summaries]

In [5]:
data=pd.DataFrame()
data['Names'] = mp3_names
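
As a quick sanity check on the file names before transcribing, the sketch below counts the keys per language. It assumes every key follows the `<language>-<words>.mp3` pattern seen in names such as `english-batter-barter.mp3`; the helper name `count_by_language` and the `notes.txt` key are hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_by_language(keys):
    """Count MP3 keys per language, treating the text before the first hyphen as the language."""
    counts = Counter()
    for key in keys:
        if not key.lower().endswith(".mp3"):
            continue  # skip anything in the bucket that is not an MP3
        language = key.split("-", 1)[0]
        counts[language] += 1
    return dict(counts)

# Small example; in the notebook you would pass mp3_names instead
sample_keys = [
    "english-batter-barter.mp3",
    "english-beat-bit.mp3",
    "hindi-vine-wine.mp3",
    "notes.txt",  # hypothetical non-MP3 key, ignored by the helper
]
print(count_by_language(sample_keys))  # {'english': 2, 'hindi': 1}
```

With the full bucket, each of the 4 languages should map to 11.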

## Function 1: 

In this function, we take the name of each MP3 file and output its transcription using AWS Transcribe. When executing this code, it is critical that every job name is unique: once a job name has been used, Transcribe will not accept a new job with the same name, so a rerun needs a new name (for example, a different suffix).

In [127]:
# Function to transcribe each file
def transcribe_job(file_name):
    
    # Define path to the file
    job_uri = "s3://napoli-project/"+file_name
    # Define unique job name
    job_name = file_name + "df1"
    
    # Transcription settings
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': job_uri},
        MediaFormat='mp3',
        LanguageCode='en-US'
    )
    # Poll every 15 seconds until the job finishes
    while True:
        result = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        time.sleep(15)
    # A failed job has no transcript to read
    if result['TranscriptionJob']['TranscriptionJobStatus'] != "COMPLETED":
        return None
    # Extracting the transcription from the json
    data = pd.read_json(result['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    # Diving deeper into the json to extract the transcript
    return data['results'][1][0]['transcript']
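
Because a job name cannot be reused, rerunning the function with a fixed suffix such as `"df1"` will fail. One possible workaround, sketched below, is to append a short random token to each name. The helper `make_job_name` is hypothetical. It assumes the job-name constraints in the Transcribe API reference (letters, digits, `.`, `_` and `-`, at most 200 characters), which are worth checking against the current documentation.

```python
import re
import uuid

def make_job_name(file_name, suffix="df1"):
    """Build a job name that differs on every call and uses only allowed characters."""
    # Replace anything other than letters, digits, '.', '_' and '-' with '-'
    safe = re.sub(r"[^0-9a-zA-Z._-]", "-", file_name)
    # A short random token keeps reruns from colliding with earlier jobs
    tail = f"-{suffix}-{uuid.uuid4().hex[:8]}"
    # Trim the file-name part so the whole name stays within 200 characters
    return safe[: 200 - len(tail)] + tail

print(make_job_name("english-batter-barter.mp3"))
```

Inside `transcribe_job`, this would replace the line `job_name = file_name + "df1"`.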

## Function 2:

This function gives us the average confidence level that AWS Transcribe reports for each sentence.
AWS Transcribe outputs a confidence score for each word that it transcribes; the function returns the mean of those scores across the sentence.

In [114]:
def transcribe_confidence(file_name):
    # Define path to the file
    job_uri = 's3://napoli-project/'+file_name
    # Define unique job name
    job_name = file_name + 'df2'

    # Transcription settings
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': job_uri},
        MediaFormat='mp3',
        LanguageCode='en-US'
    )
    while True:
        result = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        time.sleep(15)
    # A failed job has no confidence scores to read
    if result['TranscriptionJob']['TranscriptionJobStatus'] != "COMPLETED":
        return None
    data = pd.read_json(result['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    
    # Computing the mean of confidence for each sentence
    sum_conf = 0
    for i in range(len(data['results'][0])):
        try:
            # Diving deeper into the json to extract the confidence
            sum_conf = sum_conf + float(data['results'][0][i]["alternatives"][0]["confidence"])
        except (KeyError, IndexError, TypeError, ValueError):
            # Skip items without a numeric confidence
            pass

    # Define mean (skipped items still count in the denominator)
    mean = sum_conf / len(data['results'][0])
  
    return mean
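
Both functions above repeat the same polling loop, and that loop never stops if a job stays in progress. A minimal sketch of a shared polling helper with a timeout follows. The name `wait_for_job` and the `timeout_seconds` parameter are hypothetical additions, not part of the original notebook, and `client` is assumed to be the `transcribe` client created earlier.

```python
import time

def wait_for_job(client, job_name, poll_seconds=15, timeout_seconds=600):
    """Poll until the job is COMPLETED or FAILED; raise TimeoutError if it takes too long."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = client.get_transcription_job(TranscriptionJobName=job_name)
        status = result["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Each function could then call `result = wait_for_job(transcribe, job_name)` right after `start_transcription_job` and check the returned status as before.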

### Background on levels

**Levels for transcript**

A common theme in these functions was the use of brackets `[]` to extract information. The output from a Transcription job includes multiple levels. To get to the transcript, we can filter our result all the way down using the following:

`data['results'][1][0]['transcript']`

First, we take the data and select `['results']`. This can be further broken down by extracting its second value, "transcripts". Because Python is 0-indexed, the second element of results is `[1]`:

`data['results'][1]`

The next layer is the transcript, but it is nested inside a list, so we first take its only entry with `[0]`. All that is left is to select the full transcript with `['transcript']`. This brings us back to the full expression:

`data['results'][1][0]['transcript']`
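
The positional indexes `[0]` and `[1]` depend on the row order pandas assigns when it reads the file. As a hedged alternative, the same lookup can be written with key names on the parsed JSON. The sketch below uses a hand-written sample that mimics the layout of a Transcribe result; the sample values and the helper name `extract_transcript` are invented for illustration.

```python
import json

# Hand-written sample that mimics the layout of a Transcribe result file
sample_json = """
{
  "jobName": "example-job",
  "results": {
    "transcripts": [{"transcript": "Look at the clouds."}],
    "items": [
      {"type": "pronunciation", "alternatives": [{"confidence": "0.98", "content": "Look"}]}
    ]
  },
  "status": "COMPLETED"
}
"""

def extract_transcript(result):
    """Return the full transcript text from a parsed Transcribe result."""
    return result["results"]["transcripts"][0]["transcript"]

parsed = json.loads(sample_json)
print(extract_transcript(parsed))  # prints: Look at the clouds.
```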

**Levels for confidence**

To extract the confidence level of each word, we can do the same. We again select `data['results']`, but since the confidence scores sit in its first value, "items", we choose element `[0]`:

`data['results'][0]`

Then, since we are iterating through each word in the sentence, we take the i-th item and filter further to its "alternatives" entry.

`data['results'][0][i]["alternatives"]`

Next, we select the first element again, which holds the confidence level of the word.

`data['results'][0][i]["alternatives"][0]`

Lastly, we select the "confidence" element. These are values between 0 and 1 that tell us how confident Transcribe is in its transcription of each word.

`data['results'][0][i]["alternatives"][0]["confidence"]`
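
For comparison, here is a minimal sketch of the same walk written with key names, averaging only spoken words. This is a variation rather than the function used above: it skips items whose `type` is not `"pronunciation"`, so punctuation does not lower the mean, and its numbers would therefore differ from the "Confidence" column produced by `transcribe_confidence`. The helper name `mean_confidence` and the sample values are hypothetical.

```python
def mean_confidence(result):
    """Average the confidence of spoken words in a parsed Transcribe result."""
    confidences = []
    for item in result["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # skip punctuation items
        confidences.append(float(item["alternatives"][0]["confidence"]))
    if not confidences:
        return None  # no spoken words were found
    return sum(confidences) / len(confidences)

# Invented sample with two words and one punctuation mark
sample = {"results": {"items": [
    {"type": "pronunciation", "alternatives": [{"confidence": "0.9", "content": "Look"}]},
    {"type": "pronunciation", "alternatives": [{"confidence": "0.7", "content": "Luke"}]},
    {"type": "punctuation", "alternatives": [{"confidence": "0.0", "content": "."}]},
]}}
print(mean_confidence(sample))
```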




## Dataframe Creation (Continued)
Now that the functions are created, let's apply them to each of our 44 MP3 files.

Using the `.apply()` method, we can easily create a new column in our dataframe that holds the transcriptions.

In [103]:
data['Transcript'] = data['Names'].apply(transcribe_job)

Before moving on, we saved our progress. We converted the dataframe to a .csv file using `.to_csv()` and uploaded the file to our new bucket, "napoli-project-analysis".

In [111]:
data.to_csv("Data.csv")

In [None]:
!aws s3 cp Data.csv s3://napoli-project-analysis/Data.csv

Once the transcriptions were added to the dataframe, we added the confidence score for each file in the same way, by applying the `transcribe_confidence` function to each MP3 file. The new column is called "Confidence".

In [106]:
data['Confidence'] = data['Names'].apply(transcribe_confidence)

In [112]:
data.to_csv("Data.csv")

In [113]:
!aws s3 cp Data.csv s3://napoli-project-analysis/Data.csv

upload: ./Data.csv to s3://napoli-project-analysis/Data.csv    


**Let's take a look at our dataframe!**

In [126]:
data.head()

| Unnamed: 0 | Names | Transcript | Confidence |
|---|---|---|---|
| 0 | english-batter-barter.mp3 | The batter is ready to swing on the next pitch... | 0.906591 |
| 1 | english-beat-bit.mp3 | The beat is a bit off, but I'm sure with enoug... | 0.892737 |
| 2 | english-crazy-glue-sticky.mp3 | The crazy glue is extremely sticky. | 0.765429 |
| 3 | english-croissant.mp3 | Alexa Where did I leave my Khorasan? | 0.767625 |
| 4 | english-look-luke.mp3 | Look at the clouds luke. | 0.799667 |


To view the dataframe in your browser, click the following link:
[Data.csv](https://napoli-project-analysis.s3.amazonaws.com/Data.csv)

# Alternative way to extract JSONs

### Breaking down Transcribe

Say you have the following MP3 file:
* english-batter-barter.mp3 

To create new jobs, we have to define the path to the files, as well as create a job name.

In [None]:
#!aws s3 ls napoli-project

In [6]:
job_uri = "s3://napoli-project/english-batter-barter.mp3"
job_name= "english-batter-barter"

In order to run our Transcribe call, you need to import `time`. 

In [18]:
import time

In [20]:
# Start the transcription job for our .mp3 file.
# The job name must not have been used before in this account.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='mp3',
    LanguageCode='en-US'
)
# Check the same job every 5 seconds until it completes or fails
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

# This prints the TranscriptionJob
print(status)

{'TranscriptionJob': {'TranscriptionJobName': 'english-batter-barter', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 22050, 'MediaFormat': 'mp3', 'Media': {'MediaFileUri': 's3://napoli-project/english-batter-barter.mp3'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/english-batter-barter/3ba46f68-6d2d-412f-9432-92edf7f71c4a/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFUaCXVzLWVhc3QtMSJIMEYCIQDjoAI8IhweBZ6pOPmxCNXq%2B5DiWlhFe3rye8gjcDqFWwIhAKohFtJE93EYkZPmcK1LLz5Whq4lSNbWpOEi10iD30Z6Kr0DCL7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEQAhoMMjc2NjU2NDMzMTUzIgxo5H67S%2BZb5dhRnG4qkQOsBzWHpOi%2FpGAtu9mTP7bCJU%2F5thvKme6Y5ns5FJx6bXh7CC2QX7I0EERKrMgzOw1MSRo3GO%2F9n1Nrc3QS%2FlCL1oVQUzLm0VDFe41akIScWn%2F6Eb1SLKTSv0RV1I%2B1l%2FqmKhirKVLV3kDGKFP5kgIjkIK0bZfmvTqCctmYdnsm3Js08TqA0fNB0HyH6L2SR%2BOApknTLdi5GM%2BAZCuZALQMEiv3rfTyM%2BGTXkMO00HEQ55NMWo2rRK%2FUERmJ1eIc2qHVMC6hyjkqqO%2FxYnLlfVLi9gf3t

Now that the transcription has been created, we can retrieve the job and save it as a new variable. Use `.get_transcription_job` to select the job that was just created.

After that, you can access the URI path by using brackets (similar to what was done above).

In [21]:
job = transcribe.get_transcription_job(TranscriptionJobName=job_name)

# This code outputs the URI path
job['TranscriptionJob']['Transcript']['TranscriptFileUri']

'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/english-batter-barter/3ba46f68-6d2d-412f-9432-92edf7f71c4a/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFUaCXVzLWVhc3QtMSJGMEQCH0NcVrSHmc0hBvF1HoUpgo%2FJw5%2B7SOnizUSXXIoHemoCIQC9wr9SyUWq7%2FVQE3kviBlf4xexU8qONzyPwPTXtsoeZSq9Awi%2B%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDI3NjY1NjQzMzE1MyIM42SgSWP4DYU7KQ7CKpEDAIIT5WYrAJBHZgTXapiAM3r8ZtYDIw3fHf9LA%2Fp53F6RWXhDfAN1XzaXUIFMsshhox9G%2F%2B0Rj2C%2BotbvRcD2vAHVkkcxWZJ%2Bc61M2qQm4YzYRvWIWE4LaTgpcxA%2B2gsILh19LhWDHBl1ZYHBuh0yz8mdEM1p6nxgwcQpA9iokIB%2B2PWzDtS102nOFI1YnhQ%2BvYmKVs1jlF%2Bj%2B3iCY0Q3sihIkakj7oF%2FzU%2FJ9srxzNNxnlZgwUjfjlX5E4OBkZjmtW%2F3fT92iOJaMrvfPbXwBSat44VS4AkdLtBaBVJleUJMg3lVrKb98zrZzvQF8vVs5PwMy9SuJ5F99j2uMG6n%2BzcLx0MuXduYtGZaYnA2v1kRx63A1qV%2F2m%2FIRZnwqK6p%2FBYigrxr9ZlPsSawmC0Tq%2BbZ5JCITcdtloKcNTuHmnwWEGJLv%2FhLrdoQgvZ8FJEs3fiaW9umVZf2ZZ9WccKr9PwunYMLtXn7WeX64ov4Sfwycymtn3mfjcoMezcaw4dvi%2BJmjlRCvh9UK0YI1o%2FiFJIwyeaFhAY67AEZ78r%2BvsgzKEgEiW

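The URI path is long. Copy the entire link and use `wget` (run from the notebook with a leading `!`) to download the JSON to your local directory. The link is pre-signed and short-lived (the example URL carries `X-Amz-Expires=900`, i.e. 900 seconds), so download it soon after requesting it.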

Once it is there, you can rename it and view it manually.
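
A scripted alternative to copying the link by hand is sketched here, using only the Python standard library. The helper name `fetch_transcript_json` and its `save_as` parameter are hypothetical; the URI is assumed to be the `TranscriptFileUri` value retrieved from the job.

```python
import json
import urllib.request

def fetch_transcript_json(uri, save_as=None):
    """Download a transcript JSON from a URI, optionally save it locally, and return it parsed."""
    with urllib.request.urlopen(uri) as response:
        raw = response.read()
    if save_as is not None:
        # Save under a short, readable name instead of the long URL-derived one
        with open(save_as, "wb") as handle:
            handle.write(raw)
    return json.loads(raw.decode("utf-8"))
```

In the notebook this might be called as `fetch_transcript_json(job['TranscriptionJob']['Transcript']['TranscriptFileUri'], save_as="english-batter-barter.json")`, which avoids the long file name that `wget` derives from the URL.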

In [150]:
# This code is used to download the transcription to our local folder
!wget 'https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj3l0XzpJmFXxvWnKpEDue80AyEvz8UWelPPLtgsDx3pmnPFJct8p3J%2BfazXqqV4j7Z1rtM8cHlVNPV2bMgRadgzMVZixA6XSSMBhLPn3pzP6KXv%2BU6VOdpQ91Ac%2BLvTZy79rbuWTkv8lPI4IRVdHHSqmaxvH%2FH1nURwJUYlsUWj2pfcNVqvydueZGIWsJ2Ok%2F3Lez5lKO92Q4kzDFP9A3YvB6pm%2FBeE3Y32KAXEH68OExmP00Zmq8CC8HLKatPK1EO3oxYnav77tfw9N7uIIcvauyDQoaWTB3EuGtp8gqW%2FjScnuGJRLrDDncZZ2FJ1JEGr%2FTPpQIkXYW2yv0QC%2Fll830iQooJUdXT%2FSXpdrvlAPtxEa2URVQMcbGyc%2FZZpJeT91yBNlq%2FxAvQ36LfzPbHgFcBuxhJERpMkUOpd3astbzd%2B%2FCh0ESFdoclvIFCwmOkm40GACKuCiv8WO%2BO%2FaeOeHEK2Rl%2BT3qWcZJs1eiYIEstYZltx08VgPZrUrkjC%2FIBDfPjiEVNdzbcdJsrXXEGDv1FLIS%2FMVen70y3WpV8w4P%2FhgwY67AGG5wJo%2B7RCpU6RGjnOk6ntyOL25QbckYWHivQdE1VPmFwfRGlI%2Fay5QhrA76oXNJmCXnLMDajzqlctf9feyyIJuDjO9WouHYNOCMWHLkLHJ9mevrorWiDlhthztobg5Zg0RypQZQYfMuNMx%2BtL5v6loVxB%2BckmLmWDdjmDGXkiu%2BhOt94xWKuhkiG7dohNGMKABbTkWBB9FWh3skTUGOh%2BmGxDfLWFkKyYRt34z24OciSfGExZbtzeSgtpraAColWAuvphoucVU93MfyEazEF9Qj8AXDh9M8q3bbDybD798TZlAb%2Fut0kK5ZOqzQ%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210415T181632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=ASIAUA2QCFAAZSTT5J6P%2F20210415%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=57fe290b0bba65d8c6f55c56347fc145d74c8e4e2b429d4823a8379412d60098'

The name is too long, 1442 chars total.
Trying to shorten...
New name is asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj.
--2021-04-15 18:16:48--  https://s3.us-east-1.amazonaws.com/aws-transcribe-us-east-1-prod/795731225536/hindi-vine-wine/b4fd3625-d73a-4e9f-9500-d2934e3df281/asrOutput.json?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIAdRNaPfDMiuV3u31aSpBF7C%2FAHgH3s24wkgKSZNXdKqAiBjPmInCatJRyoHlbvGum626Y9N%2BW6tzNUNL%2FGL1ZWhpiq0AwgbEAIaDDI3NjY1NjQzMzE1MyIMgj3l0XzpJmFXxvWnKpEDue80AyEvz8UWelPPLtgsDx3pmnPFJct8p3J%2BfazXqqV4j7Z1rtM8cHlVNPV2bMgRadgzMVZixA6XSSMBhLPn3pzP6KXv%2BU6VOdpQ91Ac%2BLvTZy79rbuWTkv8lPI4IRVdHHSqmaxvH%2FH1nURwJUYlsUWj2pfcNVqvydueZGIWsJ2Ok%2F3Lez5lKO92Q4kzDFP9A3YvB6pm%2FBeE3Y32KAXEH68OExmP00Zmq8CC8HLKatPK1EO3oxYnav77tfw9N7uIIcvauyDQoaW