## Project Setup

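I have already installed the Google Cloud SDK for Python and then added the Speech-to-Text and Cloud Storage client libraries, since the code below imports both:

```pip install --upgrade google-cloud-speech google-cloud-storage```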

I have already enabled the [Google Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text/docs/libraries) in my Google Cloud project. I've also set the required environment variable for my Google Cloud credentials.

```export GOOGLE_APPLICATION_CREDENTIALS="/PATH/TO/CREDENTIALS"```
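Since a missing or mistyped credentials path tends to surface later as a confusing authentication error, a quick check up front can help. This is a minimal sketch, not part of the original notebook; `credentials_path` is a hypothetical helper name, and it only confirms that the variable is set and points at an existing file, not that the credentials are valid.

```python
import os


def credentials_path(env=os.environ):
    """Return the configured credentials path, or None if it is unset or missing.

    Hypothetical helper: it checks only that the file exists, not that
    Google Cloud will accept the credentials inside it.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path and os.path.isfile(path):
        return path
    return None


if __name__ == "__main__":
    found = credentials_path()
    if found:
        print("Using credentials at [{}]".format(found))
    else:
        print("GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file")
```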

And, maybe more importantly, I've already recorded a couple of sample audio clips. The API is quite particular about how it wants these clips formatted. [The documentation](https://cloud.google.com/speech-to-text/docs/encoding) is well written, but you first have to remember that encoding is something you need to get right.

You could do this in Python itself, but honestly, it's a lot easier to use `ffmpeg` to do it for you. The command you're looking for (which also handles video files) is:

```ffmpeg -y -i INPUT_FILE -vn -ar 44100 -ac 1 -ab 192k -f wav -loglevel panic OUTPUT_FILE```
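If you'd rather trigger the conversion from the notebook, one option is to wrap the same command with `subprocess`. This is a rough sketch that assumes `ffmpeg` is on your `PATH`; `build_ffmpeg_command` and `convert_to_wav` are names I made up for illustration, and the flags simply mirror the command above.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=44100):
    """Build the argument list for the ffmpeg command shown above."""
    return [
        "ffmpeg", "-y",            # overwrite the output file without asking
        "-i", src,                 # input file (audio or video)
        "-vn",                     # drop any video stream
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", "1",                # single channel (mono)
        "-ab", "192k",
        "-f", "wav",               # force the WAV container
        "-loglevel", "panic",
        dst,
    ]


def convert_to_wav(src, dst, sample_rate=44100):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

Passing the arguments as a list instead of a single string avoids shell quoting problems with file names that contain spaces.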

Don't worry if you forget. The constant onslaught of error messages will remind you. 🤣😔

My samples are currently formatted as 44.1 kHz (44,100 Hz), single channel (mono), .wav files.
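To confirm a file really has that format before uploading it, Python's standard `wave` module can read the header. A small sketch, assuming an uncompressed PCM `.wav` file; `describe_wav` and `matches_expected_format` are hypothetical helper names, not something from the Google client library.

```python
import wave


def describe_wav(path):
    """Read the sample rate, channel count, and sample width from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def matches_expected_format(path, sample_rate_hz=44100, channels=1):
    """Return True if the file has the expected sample rate and channel count."""
    info = describe_wav(path)
    return info["sample_rate_hz"] == sample_rate_hz and info["channels"] == channels
```

The `wave` module raises `wave.Error` for files it cannot parse, so a clip that was never converted (an `.m4a`, for example) fails here rather than at the API.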

## Code Init

In [6]:
# From the Google Cloud client libraries
# Only the beta module is imported: importing plain `speech` first would
# just be shadowed by this alias.
from google.cloud import speech_v1p1beta1 as speech
from google.cloud import storage # required for longer files

In [2]:
# Sample file paths
fp_samples = [
    'samples/audio-sample-01.m4a', # 10 second sample, incorrectly formatted
    'samples/audio-sample-01.wav', # 10 second sample, correctly formatted
    'samples/audio-sample-02.wav', # 3+ minute sample, correctly formatted
]

In [3]:
# Handy function to upload a file to Google Cloud Storage and return the GS URI
import os
def upload_file(fp, bucket_name):
    """
    Upload the specified file to the specified Google Cloud Storage bucket
    """
    result = None

    if not os.path.exists(fp): return result
    
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    blob_fn = os.path.basename(fp)
    blob = bucket.blob(blob_fn)    

    try:
        blob.upload_from_filename(fp)
        print("Uploaded [{}] to Google Cloud Storage bucket [{}]".format(blob_fn, bucket_name))
        result = "gs://{}/{}".format(bucket_name, blob_fn)
    except Exception as err:
        print("Could not upload [{}] to Google Cloud Storage bucket [{}]. Threw exception:\n{}\n".format(fp, bucket_name, err))

    return result

In [4]:
# Upload the samples to the Google Cloud Storage bucket and save the GCS paths
gcs_samples = []
for fp in fp_samples:
    gcs_uri = upload_file(fp, 'tcp-data')
    if gcs_uri:
        gcs_samples.append(gcs_uri)
    
print("\n".join(gcs_samples))

Uploaded [audio-sample-01.m4a] to Google Cloud Storage bucket [tcp-data]
Uploaded [audio-sample-01.wav] to Google Cloud Storage bucket [tcp-data]
Uploaded [audio-sample-02.wav] to Google Cloud Storage bucket [tcp-data]
gs://tcp-data/audio-sample-01.m4a
gs://tcp-data/audio-sample-01.wav
gs://tcp-data/audio-sample-02.wav


In [19]:
client = speech.SpeechClient()
# Point the API at the last upload: the 3+ minute sample in Cloud Storage
audio = speech.types.RecognitionAudio(uri=gcs_samples[-1])
config = speech.types.RecognitionConfig(
    sample_rate_hertz=44100,            # must match the audio file
    language_code='en-US',
    enable_word_time_offsets=True,      # per-word start/end times, needed for captions
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_speaker_count=2)
    
results = None
try:
    operation = client.long_running_recognize(config=config, audio=audio)
    print('Waiting for transcription operation to complete...')
    response = operation.result(timeout=3000)
    results = response.results
except Exception as err:
    print("Could not transcribe audio file. Threw exception:\n{}\n".format(err))

if results:
    print(results)

Waiting for transcription operation to complete...
[alternatives {
  transcript: "This is from an article called updates on the twitch security. Incident posted to the twitch blog on October 6th, 2021 update regarding stream keys out of an abundance of caution. We have reset all stream key so you can get your new stream key here. Then it listen to around, depending on which broadcast offer you use. You may need to manually update yourself, or with this new key to start your next stream. Twitch Studio streamlabs Xbox PlayStation, and twitch mobile app users. Do not need to take any action for your new key to work OBS users who connected their Twitter account. Should also not need to take any action lbs users that if not connected their Twitter account. OBS will need to manually copy their string key for their twitch dashboard and paste it into OBS, for all others. Please refer to specific, set up instructions for the software virtuous, and there\'s an update late on the night of October

In [11]:
def parse_transcription_results(results):
    """
    Parse the transcription results into the full text, one segment per
    result, and a flat list of word timings
    """
    full_text = ""
    timing_text = []
    segments = []
    for result in results:
        best = result.alternatives[0]  # the most likely alternative
        full_text += "\n{}".format(best.transcript)
        timing = []  # word timings for this result only
        for word in best.words:
            time_start_in_ms = (word.start_time.seconds * 1000) + (word.start_time.microseconds / 1000)
            time_finish_in_ms = (word.end_time.seconds * 1000) + (word.end_time.microseconds / 1000)
            timecode_start = get_timecode(time_start_in_ms)
            timecode_finish = get_timecode(time_finish_in_ms)

            w = {
                'word': word.word,
                'timecode_start': timecode_start,
                'timecode_finish': timecode_finish,
                'timecode_start_as_ms': time_start_in_ms,
                'timecode_finish_as_ms': time_finish_in_ms,
                'duration_as_ms': int(time_finish_in_ms - time_start_in_ms),
                }

            timing.append(w)
            timing_text.append(w)

        # One segment per result (not per word), carrying that result's word timings
        segments.append({
            "text": best.transcript,
            "confidence": best.confidence,
            "timing": timing,
            })

    return {
        "full_text": full_text,
        "segments": segments,
        "timing": timing_text,
        }

def get_timecode(time_in_ms):
    # Figure out the timestamp
    sec_as_ms = 1000
    min_as_ms = 60*sec_as_ms
    hr_as_ms = 60*min_as_ms

    remaining = time_in_ms

    hours = int(remaining / hr_as_ms)
    remaining = remaining - (hours * hr_as_ms)

    minutes = int(remaining / min_as_ms)
    remaining = remaining - (minutes * min_as_ms)

    seconds = int(remaining / sec_as_ms)
    remaining = remaining - (seconds * sec_as_ms)

    ms = abs(int(remaining))

    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)

In [20]:
structured_results = parse_transcription_results(results)

In [21]:
print(structured_results['timing'][1])

{'word': 'is', 'timecode_start': '00:00:02,100', 'timecode_finish': '00:00:02,200', 'timecode_start_as_ms': 2100.0, 'timecode_finish_as_ms': 2200.0, 'duration_as_ms': 100}
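The timecode arithmetic can also be written more compactly with `divmod`. This is just an alternative sketch, not the function used above; `to_srt_timecode` is a made-up name, and it rounds to the nearest millisecond where `get_timecode` truncates.

```python
def to_srt_timecode(time_in_ms):
    """Format a millisecond offset as an SRT timecode (HH:MM:SS,mmm)."""
    total_ms = int(round(time_in_ms))
    hours, remainder = divmod(total_ms, 3_600_000)   # ms per hour
    minutes, remainder = divmod(remainder, 60_000)   # ms per minute
    seconds, ms = divmod(remainder, 1_000)           # ms per second
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)


print(to_srt_timecode(2100.0))     # 00:00:02,100
print(to_srt_timecode(3_723_004))  # 01:02:03,004
```

The first call reproduces the `timecode_start` shown for the word "is" above.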


In [16]:
def generate_srt(word_timing, line_size=35):
    """
    Generate the SRT file from a list of words and their timings
    """
    # NN
    # TIMECODE --> TIMECODE
    # WWW WWW WWW

    # check for a restart in the word timings
    starts_at = 0
    first_entry = { 'word': word_timing[0]['word'], 'timecode_start_as_ms': word_timing[0]['timecode_start_as_ms'] }
    for i, w in enumerate(word_timing):
        if (w['word'] == first_entry['word'] and w['timecode_start_as_ms'] == first_entry['timecode_start_as_ms']) or (int(w['timecode_start_as_ms']) < 1):
            starts_at = i

    srt = ""
    caption = 1
    current_entry = {'words': "", 'start': None, 'finish': None }
    for i, word in enumerate(word_timing):
        if i < starts_at: continue
        if not current_entry['start']: current_entry['start'] = word['timecode_start']
        if len(current_entry['words']) > line_size:
            # print the entry
            srt += "{}\n{} --> {}\n{}\n\n".format(caption, current_entry['start'], current_entry['finish'], current_entry['words'].strip())
            caption += 1
            current_entry = {'words': "", 'start': word['timecode_start'], 'finish': None }

        current_entry['words'] += " {}".format(word['word'])
        current_entry['finish'] = word['timecode_finish']

    srt += "{}\n{} --> {}\n{}\n\n".format(caption, current_entry['start'], current_entry['finish'], current_entry['words'].strip())		

    return srt
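One thing to know about `generate_srt`: it checks the line length before adding the next word, so a caption can run one word past `line_size` (the first caption it produces for my sample is 38 characters with `line_size=35`). If you want a hard limit instead, the sketch below shows one way to do it. It is a simplified variant, not the function above: `group_words` and `render_srt` are hypothetical names, the input is a list of plain `(word, start, finish)` tuples, and the timecodes in the usage example are invented.

```python
def group_words(words, line_size=35):
    """Group (word, start, finish) tuples into captions of at most line_size characters.

    A single word longer than line_size still gets a caption of its own.
    """
    captions = []
    current = None
    for text, start, finish in words:
        # Flush the current caption if adding this word would exceed the limit
        if current and len(current["text"]) + 1 + len(text) > line_size:
            captions.append(current)
            current = None
        if current is None:
            current = {"text": text, "start": start, "finish": finish}
        else:
            current["text"] += " " + text
            current["finish"] = finish
    if current:
        captions.append(current)
    return captions


def render_srt(captions):
    """Render captions as numbered SRT blocks separated by blank lines."""
    blocks = []
    for number, caption in enumerate(captions, start=1):
        blocks.append("{}\n{} --> {}\n{}\n".format(
            number, caption["start"], caption["finish"], caption["text"]))
    return "\n".join(blocks)


# Invented timings, for illustration only
sample = [
    ("This", "00:00:00,000", "00:00:00,400"),
    ("is", "00:00:00,400", "00:00:00,600"),
    ("from", "00:00:00,600", "00:00:01,000"),
    ("an", "00:00:01,000", "00:00:01,200"),
]
print(render_srt(group_words(sample, line_size=10)))
```

With `line_size=10` the sample splits into two captions, "This is" and "from an", each within the limit.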

In [22]:
print(generate_srt(structured_results['timing'], line_size=35))

1
00:00:01,600 --> 00:00:03,700
This is from an article called updates

2
00:00:03,700 --> 00:00:05,500
on the twitch security. Incident posted

3
00:00:05,500 --> 00:00:08,100
to the twitch blog on October 6th, 2021

4
00:00:08,100 --> 00:00:11,200
update regarding stream keys out of

5
00:00:11,200 --> 00:00:13,500
an abundance of caution. We have reset

6
00:00:13,500 --> 00:00:15,300
all stream key so you can get your new

7
00:00:15,300 --> 00:00:17,200
stream key here. Then it listen to around,

8
00:00:17,200 --> 00:00:19,200
depending on which broadcast offer you

9
00:00:19,200 --> 00:00:20,500
use. You may need to manually update

10
00:00:20,500 --> 00:00:22,400
yourself, or with this new key to start

11
00:00:22,400 --> 00:00:25,100
your next stream. Twitch Studio streamlabs

12
00:00:25,100 --> 00:00:26,700
Xbox PlayStation, and twitch mobile

13
00:00:26,700 --> 00:00:28,600
app users. Do not need to take any action

14
00:00:28,600 --> 00:00:31,400
for your new key to w