# Speech-to-text Transcription using Google API
This notebook shows a few options for converting video to audio, and then audio to text.

- Convert video -> audio
- Convert audio -> text using Google Cloud [speech-to-text API](https://cloud.google.com/speech-to-text/)
    - Simple conversion for small files (no GCP account required)
    - Queue up larger files up to 480 minutes

# Convert video -> audio

First, install and import libraries

In [3]:
#pip install SpeechRecognition moviepy

In [1]:
import moviepy.editor as mp

Conversion

In [14]:
# Load video file
clip = mp.VideoFileClip(r"fold_in_the_cheese.mp4") 

# Create .wav file
clip.audio.write_audiofile(r"fold_in_the_cheese.wav")

MoviePy - Writing audio in fold_in_the_cheese.wav
MoviePy - Done.


The `.wav` file should now exist in the working directory.
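
Before sending the file anywhere, it can help to confirm that it was written and to check its channel count, sample rate, and duration, since the recognition options used later depend on them. The sketch below uses only the standard-library `wave` module and assumes a plain PCM `.wav` file; the helper name `describe_wav` is hypothetical and not part of the original notebook.

```python
import wave

def describe_wav(path):
    """Return channel count, sample rate (Hz), and duration (s) of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return {"channels": channels, "sample_rate": sample_rate, "duration_s": duration}

# Example call (hypothetical usage):
# describe_wav("fold_in_the_cheese.wav")
```

The duration tells you which of the recognition modes below applies, and the sample rate is the value to pass as `sample_rate_hertz` if you set it explicitly.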

# Convert audio -> text

Google Cloud Platform offers three modes of speech recognition. More details are in the [Google Cloud Platform](https://cloud.google.com/speech-to-text/docs/basics) documentation.

1. Small audio files - synchronous recognition, for audio files up to ~1 minute long
2. Larger audio files - use the **Long Running Operation** (asynchronous), for audio up to ~480 minutes
3. Real-time conversion - *not* used in this notebook.
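
As a rough illustration of how the first two modes divide up by duration, here is a minimal sketch. The function name `choose_recognition_mode` and the constants are hypothetical; the limits are the approximate ones listed above, not exact quotas.

```python
SYNC_LIMIT_S = 60          # ~1 minute, per the list above
ASYNC_LIMIT_S = 480 * 60   # ~480 minutes, per the list above

def choose_recognition_mode(duration_s):
    """Pick 'sync' or 'long_running' from the audio duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s <= SYNC_LIMIT_S:
        return "sync"
    if duration_s <= ASYNC_LIMIT_S:
        return "long_running"
    raise ValueError("audio exceeds the ~480 minute limit; split it first")
```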

## 1. Small audio files - No Google Account

You can get a limited amount of transcription *without* a Google account by using the code below.

The amount of audio it accepts per request is small. This notebook does not run it, but you can work around the limit for longer files as follows:

1. Chunk your audio file into small pieces. Splitting on silence avoids cutting words off unexpectedly
2. Iterate through this synchronous API call with each audio chunk
3. Merge the resulting transcriptions together
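
The three steps above can be sketched as follows. This is an untested-against-this-audio sketch, not the notebook's method: it assumes `pydub` (installed later in this notebook) for splitting on silence, and the names `transcribe_in_chunks`, `merge_transcripts`, and the threshold values are hypothetical starting points that would need tuning.

```python
import io

def merge_transcripts(pieces):
    """Join per-chunk transcripts, skipping chunks where nothing was recognized."""
    return " ".join(p.strip() for p in pieces if p and p.strip())

def transcribe_in_chunks(wav_path, min_silence_ms=500, silence_offset_db=16):
    # Third-party imports are kept inside the function so that
    # merge_transcripts can be used without them installed.
    import speech_recognition as sr
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    sound = AudioSegment.from_wav(wav_path)
    # 1. Split wherever the audio stays quieter than (average loudness - offset).
    chunks = split_on_silence(
        sound,
        min_silence_len=min_silence_ms,
        silence_thresh=sound.dBFS - silence_offset_db,
        keep_silence=200,
    )

    recognizer = sr.Recognizer()
    pieces = []
    for chunk in chunks:
        # 2. Send each chunk through the same synchronous call.
        buffer = io.BytesIO()
        chunk.export(buffer, format="wav")
        buffer.seek(0)
        with sr.AudioFile(buffer) as source:
            audio = recognizer.record(source)
        try:
            pieces.append(recognizer.recognize_google(audio))
        except sr.UnknownValueError:
            pieces.append("")  # no intelligible speech in this chunk
    # 3. Merge the results.
    return merge_transcripts(pieces)
```

A chunk can still be too long if the speaker never pauses, so in practice you may also want a maximum chunk length.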

In [15]:
import speech_recognition as sr 

# Define recognizer
r = sr.Recognizer()

# Load audio file
audio_filename = "fold_in_the_cheese.wav"
audio = sr.AudioFile(audio_filename)  # file is read from the working directory

In [10]:
with audio as source:
  audio_file = r.record(source)
result = r.recognize_google(audio_file)

print(f'Output: {result}')

Output: it's not your mother's recipe to you to try to keep up next step is to fold in the cheese


Write the transcript to a text file in the working directory.

In [11]:
transcript_filename = audio_filename.split('.')[0] + '.txt'

# the context manager closes the file even if the write fails
with open(transcript_filename, "w") as f:
    f.write(result)

print(f'File {transcript_filename} is created')

File fold_in_the_cheese.txt is created
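
Deriving the output name with `split('.')[0]` works for this file, but it drops everything after the first dot, so a name such as `take.2.wav` would become `take.txt`. A minimal alternative sketch using `pathlib` is shown below; the helper names `transcript_path_for` and `save_transcript` are hypothetical.

```python
from pathlib import Path

def transcript_path_for(audio_filename, suffix=".txt"):
    """Swap only the final extension, so 'take.2.wav' becomes 'take.2.txt'."""
    return Path(audio_filename).with_suffix(suffix)

def save_transcript(audio_filename, text):
    """Write text next to the audio file and return the path that was written."""
    out_path = transcript_path_for(audio_filename)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```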


## Small audio files - with Google account

The limit is ~1 minute of audio.

In this next section, we will use a Google account to perform speech recognition.

Steps:

1. Speech-to-text API service setup
2. Storage setup - the audio file must be uploaded to Cloud Storage before it can be queued for conversion
3. Convert audio -> text

### Step 1 - Speech-to-text API service Setup

You must **first** sign up for a GCP account and create a speech-to-text service. Follow this Google [guide](https://cloud.google.com/speech-to-text/docs/quickstart-client-libraries) to do that.

The steps below come from that guide: they set the path to the JSON key file and confirm that the setup works.

In [74]:
#pip install --upgrade google-cloud-speech

Setting the environment variable `GOOGLE_APPLICATION_CREDENTIALS` from the terminal, as the guide describes, did not work for me, but the code below did.

In [12]:
import os
json_path = "your_path"  # placeholder: replace with the path to your service-account JSON key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=json_path
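
If the client later fails to authenticate, a quick check that the variable is set and points at a real file can narrow the problem down. This is a minimal sketch using only the standard library; the function name `check_credentials` is hypothetical.

```python
import os

def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the key-file path from the environment, or raise a clear error."""
    key_path = os.environ.get(env_var)
    if not key_path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {key_path}")
    return key_path
```

This only confirms that the file exists; it does not validate the key itself.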

Code below is provided by Google to test your setup. Output should read:

"Transcript: how old is the Brooklyn Bridge"

In [13]:
# Imports the Google Cloud client library
from google.cloud import speech

# Instantiates a client
client = speech.SpeechClient()

# The name of the audio file to transcribe
gcs_uri = "gs://cloud-samples-data/speech/brooklyn_bridge.raw"

audio = speech.RecognitionAudio(uri=gcs_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Detects speech in the audio file
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))

Transcript: how old is the Brooklyn Bridge


### Step 2 - Google Cloud Storage Setup

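Follow the Google [guide](https://cloud.google.com/storage) to create a storage bucket

OR 

Use the following code to create a new storage bucket. Bucket names must be globally unique, so replace the name below with your own.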

In [51]:
#pip install --upgrade google-cloud-storage

In [89]:
# import library
from google.cloud import storage

# set client
storage_client = storage.Client()

# name bucket
bucket_name = "my-new-bucket-creek"

# create bucket
bucket = storage_client.create_bucket(bucket_name)

print("Bucket {} created.".format(bucket.name))

Bucket my-new-bucket-creek created.
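
Re-running the cell above fails once the bucket exists. A minimal sketch of a re-runnable version is shown below; it relies on `lookup_bucket`, which returns `None` when the bucket is not found. The wrapper name `get_or_create_bucket` is hypothetical, and the sketch assumes the bucket, if it exists, belongs to a project you can access.

```python
def get_or_create_bucket(storage_client, bucket_name):
    """Return the bucket if it already exists, otherwise create it."""
    bucket = storage_client.lookup_bucket(bucket_name)  # None when not found
    if bucket is None:
        bucket = storage_client.create_bucket(bucket_name)
    return bucket

# With the client from the cell above this would be called as:
# bucket = get_or_create_bucket(storage_client, bucket_name)
```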


### Step 3 - Convert audio -> text

This step assumes `fold_in_the_cheese.wav` has already been uploaded to the bucket, for example with the `upload_blob` helper defined in section 2.

In [16]:
# The name of the audio file to transcribe
bucket = 'my-new-bucket-creek'
audio_file = 'fold_in_the_cheese.wav'

gcs_uri = 'gs://' + bucket + '/' + audio_file

audio = speech.RecognitionAudio(uri=gcs_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    # sample_rate_hertz=44100,  # not required for .wav files: the sample rate is read from the header
    language_code="en-US",
    model='video', #specify model if applicable
    enable_automatic_punctuation=True)

# Detects speech in the audio file
response = client.recognize(config=config, audio=audio) #synchronous

for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))

Transcript: This is not your mother's recipe. Yes, and now I'm passing it on to you. So try to keep up. Oh next step is to fold in the cheese.
Transcript:  What does that mean? What does fold in the cheese mean? He holds it in I understand that but how do you fold it? Do you fold it in half like a piece of paper and drop it in the pot or what do you do and I cannot show you everything. Okay? Well, can you show me one thing you just what you do? You just fold it in. Okay. I don't know how to fold broken cheese like that. I don't know how to be any clearer. You take that thing that's in your head, huh? And you if you say fooled in one more time.
Transcript:  I'm that's follows it in. This is your recipe you fooled in the cheese, then don't you dare you fold it in David?


## 2. Larger audio files - up to ~480 minutes

Import libraries and helper functions

In [None]:
#pip install pydub

In [24]:
from pydub import AudioSegment
from google.cloud import speech
from google.cloud import storage

def prep_audio_file(audio_file):
    '''
    Make sure the audio file meets the requirements for transcription:
    - Must be mono (the file is overwritten in place)
    '''
    # convert to a single channel
    sound = AudioSegment.from_wav(audio_file)
    sound = sound.set_channels(1)

    # Optional tuning: resample to 16000 Hz. Google recommends keeping the
    # native sample rate, so this is left disabled.
    # sound = sound.set_frame_rate(16000)
    sound.export(audio_file, format="wav")
    return

def upload_blob(bucket_name, audio_file, destination_blob_name):
    """Uploads a file to the bucket.
    Inputs:
        # bucket_name = "your bucket name"
        # audio_file = "path to the local file"
        # destination_blob_name = "storage object name"
    """

    # upload audio file to storage bucket    
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.chunk_size = 5 * 1024 * 1024 # Upload in 5 MB chunks
    blob.upload_from_filename(audio_file)

    print('File upload complete')
    return

def write_transcripts(transcript_file, transcript):
    # the context manager closes the file even if the write fails
    with open(transcript_file, "w") as f:
        f.write(transcript)
    return

def delete_blob(bucket_name, blob_name):
    """Deletes a blob from the bucket.
    Inputs:
        # bucket_name = "your bucket name"
        # blob_name = "storage object name"
    """
    storage_client = storage.Client()

    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.delete()

    print(f'Blob {blob_name} deleted')
    return

### Simple transcription - single blob of text without speakers

In [22]:
def google_transcribe_single(audio_file, bucket): 
    # convert audio to text
    gcs_uri = 'gs://' + bucket + '/' + audio_file
    transcript = ''
    
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    frame_rate = 44100 

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US',
        model='video', # optional: specify audio source. This increased transcription accuracy when turned on
        enable_automatic_punctuation=True) # optional: Enable automatic punctuation

    # Detects speech in the audio file
    operation = client.long_running_recognize(config=config, audio=audio) #asynchronous
    response = operation.result(timeout=10000)

    for result in response.results:
        transcript += result.alternatives[0].transcript
    
    return transcript

In [27]:
bucket = 'my-new-bucket-creek'
audio_file = 'fold_in_the_cheese.wav'

# do only if file is .wav
prep_audio_file(audio_file)

# upload audio file to storage bucket
upload_blob(bucket, audio_file, audio_file)

# create transcript
transcript = google_transcribe_single(audio_file, bucket)
transcript_file = audio_file.split('.')[0] + '.txt'

write_transcripts(transcript_file, transcript)
print(f'Transcript {transcript_file} created')

File upload complete
Transcript fold_in_the_cheese.txt created


In [189]:
# remove audio file from bucket
delete_blob(bucket, audio_file)
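
Because deletion happens in a separate cell, a failed transcription can leave the audio file in the bucket. One way to guard against that is to wrap the steps in `try`/`finally`. The sketch below is a hypothetical wrapper, `transcribe_with_cleanup`, that takes the upload, transcribe, and delete functions as arguments so it works with any of the transcription functions in this notebook.

```python
def transcribe_with_cleanup(audio_file, bucket, upload, transcribe, delete):
    """Upload, transcribe, and always delete the blob, even if transcription fails."""
    upload(bucket, audio_file, audio_file)
    try:
        return transcribe(audio_file, bucket)
    finally:
        delete(bucket, audio_file)

# With the notebook's helpers this would be called as:
# transcript = transcribe_with_cleanup(
#     audio_file, bucket, upload_blob, google_transcribe_single, delete_blob)
```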

### Multiple speaker transcription

In [28]:
from google.cloud import speech_v1p1beta1 as speech_beta
from google.cloud.speech_v1p1beta1 import types as types_beta

In [29]:
def google_transcribe_speakers(audio_file, bucket):
    
    gcs_uri = 'gs://' + bucket + '/' + audio_file
    transcript = ''
    
    client = speech_beta.SpeechClient()
    audio = speech_beta.RecognitionAudio(uri=gcs_uri)
    frame_rate = 44100 #set your frame rate

    config = types_beta.RecognitionConfig(
        encoding=speech_beta.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US',
        enable_automatic_punctuation=True, # Enable automatic punctuation
        model='video',
        enable_speaker_diarization=True,
        diarization_speaker_count=2)

    # Detects speech in the audio file
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=10000)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:
    result = response.results[-1] 
    words_info = result.alternatives[0].words 
    
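    # start with the first word's speaker so no empty line is emitted
    tag = words_info[0].speaker_tag if words_info else 1
    speak = ""

    for word_info in words_info:
        if word_info.speaker_tag == tag:
            # same speaker: append to the current turn
            speak = f"{speak} {word_info.word}".strip()
        else:
            # speaker changed: flush the finished turn and start a new one
            transcript += f'speaker {tag}: {speak}' + '\n'
            tag = word_info.speaker_tag
            speak = word_info.word

    transcript += f'speaker {tag}: {speak}'
    return transcript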

In [37]:
bucket = 'my-new-bucket-creek'
audio_file = 'fold_in_the_cheese.wav'

# do only if file is .wav
prep_audio_file(audio_file)

# upload audio file to storage bucket    
upload_blob(bucket, audio_file, audio_file)

transcript = google_transcribe_speakers(audio_file, bucket)
transcript_file = audio_file.split('.')[0] + '_speakers' + '.txt'

write_transcripts(transcript_file, transcript)
print(f'Transcript {transcript_file} created')

File upload complete
Transcript fold_in_the_cheese_speakers.txt created


In [340]:
# remove audio file from bucket
delete_blob(bucket, audio_file)
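
The grouping of words into speaker turns can also be tested without calling the API. Here is a minimal sketch using `itertools.groupby`; the function name `format_speaker_turns` is hypothetical, and it only assumes each word object has `word` and `speaker_tag` attributes, as the word entries used above do.

```python
from itertools import groupby

def format_speaker_turns(words):
    """Group consecutive words by speaker_tag into 'speaker N: ...' lines."""
    lines = []
    for tag, run in groupby(words, key=lambda w: w.speaker_tag):
        lines.append(f"speaker {tag}: " + " ".join(w.word for w in run))
    return "\n".join(lines)
```

Since `groupby` only merges adjacent items with the same key, a speaker who talks twice gets two separate lines, which matches the turn-by-turn output of the function above.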

### Pulling word timestamps

In [35]:
def transcribe_word_time_offsets(audio_file, bucket):
    """Transcribe the given audio file asynchronously and output the word time
    offsets."""
    gcs_uri = 'gs://' + bucket + '/' + audio_file
    transcript = ''

    client = speech_beta.SpeechClient()
    audio = speech_beta.RecognitionAudio(uri=gcs_uri)
    frame_rate = 44100

    config = types_beta.RecognitionConfig(
        encoding=speech_beta.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code='en-US',
        enable_automatic_punctuation=True, # Enable automatic punctuation
        enable_word_time_offsets=True)

    operation = client.long_running_recognize(config=config, audio=audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=10000)

    # use a separate name for the response so the loop variable does not shadow it
    for result in response.results:
        alternative = result.alternatives[0]
        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.total_seconds(),
                end_time.total_seconds()))
    return

In [36]:
bucket = 'my-new-bucket-creek'
audio_file = 'fold_in_the_cheese.wav'

# do only if file is .wav
prep_audio_file(audio_file)

# upload audio file to storage bucket    
upload_blob(bucket, audio_file, audio_file)

transcribe_word_time_offsets(audio_file, bucket)

File upload complete
Waiting for operation to complete...
Transcript: It's not your mother's recipe said now, I'm passing it on to you. So try to keep up next step is to fold in the cheese.
Confidence: 0.9347954988479614
Word: It's, start_time: 0.0, end_time: 0.2
Word: not, start_time: 0.2, end_time: 0.4
Word: your, start_time: 0.4, end_time: 0.5
Word: mother's, start_time: 0.5, end_time: 0.8
Word: recipe, start_time: 0.8, end_time: 1.2
Word: said, start_time: 1.2, end_time: 1.7
Word: now,, start_time: 1.7, end_time: 1.8
Word: I'm, start_time: 1.8, end_time: 2.0
Word: passing, start_time: 2.0, end_time: 2.3
Word: it, start_time: 2.3, end_time: 2.4
Word: on, start_time: 2.4, end_time: 2.5
Word: to, start_time: 2.5, end_time: 2.7
Word: you., start_time: 2.7, end_time: 3.0
Word: So, start_time: 3.0, end_time: 3.5
Word: try, start_time: 3.5, end_time: 3.8
Word: to, start_time: 3.8, end_time: 3.9
Word: keep, start_time: 3.9, end_time: 4.1
Word: up, start_time: 4.1, end_time: 4.4
Word: next,

TypeError: write() argument must be str, not None
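
`transcribe_word_time_offsets` prints its results and returns `None`, so passing its return value to `write_transcripts` raises the `TypeError` shown above. A minimal sketch of building a string instead is given below; the function name `format_word_offsets` and the tab-separated layout are hypothetical choices, and it assumes each word has `word`, `start_time`, and `end_time` attributes with `timedelta`-like times, as used in the function above.

```python
def format_word_offsets(words):
    """Return one 'word<TAB>start<TAB>end' line per word, times in seconds."""
    lines = []
    for w in words:
        start = w.start_time.total_seconds()
        end = w.end_time.total_seconds()
        lines.append(f"{w.word}\t{start:.1f}\t{end:.1f}")
    return "\n".join(lines)
```

The returned string can then be handed to `write_transcripts` in the same way as the plain transcripts earlier.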