## Initial Setup

Before we start, let's make sure everything is working as it should.

#### 1. Python Dependencies

We need to install two libraries (`google-cloud-storage` and `google-cloud-speech`). If you're running this tutorial on [notebooks.ai](https://notebooks.ai), you're all set. Otherwise, you'll have to install them manually (check `requirements.txt`).

To verify everything is working, let's run the imports:

In [None]:
import textwrap

In [None]:
import json
from google.oauth2 import service_account

from google.cloud import storage
from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import enums
from google.cloud.speech_v1p1beta1 import types
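If one of the imports above fails, a quick check like this can tell you which package is missing. It's only a sketch using the standard library; `is_installed` and `REQUIRED_MODULES` are names I made up for illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be located (the module itself is not imported)."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `google.cloud`) is itself missing.
        return False

# Hypothetical list: the two modules this notebook imports.
REQUIRED_MODULES = ["google.cloud.storage", "google.cloud.speech_v1p1beta1"]

missing = [name for name in REQUIRED_MODULES if not is_installed(name)]
print("Missing modules:", missing or "none")
```

If anything shows up as missing, install the packages listed in `requirements.txt` and run the imports again.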

#### 2. Credentials

If you followed the tutorial in the other notebook (`Tutorial.ipynb`), you saw that you have to put your own Google credentials in the file `google-demo-speech-to-text.json`.

In [None]:
with open('google-demo-speech-to-text.json') as fp:  # closes the file when done
    info = json.load(fp)

credentials = service_account.Credentials.from_service_account_info(info)

Let's check that they loaded and haven't expired (this should show `False`):

In [None]:
credentials.expired
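If loading the credentials fails, it's often because the JSON file is still the placeholder or is missing a field. Here's a small sketch that checks a key-file dictionary for the fields I'd expect a service account key to have. The list in `REQUIRED_KEYS` is my assumption of a reasonable minimum, and `missing_key_fields` is a hypothetical helper, not part of the Google library.

```python
# Assumed minimum set of fields in a service account key file.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email", "token_uri"}

def missing_key_fields(info):
    """Return the sorted list of expected fields absent from a key-file dict."""
    return sorted(REQUIRED_KEYS - set(info))

# Made-up, incomplete example to show what the check reports.
example_info = {"type": "service_account", "project_id": "my-project"}
print(missing_key_fields(example_info))
```

In the notebook you could call `missing_key_fields(info)` right after loading the JSON; an empty list means all the expected fields are there.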

#### 3. Audio uploaded to Cloud Storage

As we mentioned in the tutorial, the audio files to be transcribed should be uploaded to _Google Cloud Storage_. Replace the following variables with the correct values:

In [None]:
# CHANGE IT TO YOUR OWN BUCKET NAME
BUCKET_NAME = 'rmotr-speech-to-text-demo'

In [None]:
# CHANGE IT TO THE FILE NAME OF YOUR AUDIO
AUDIO_FILE_NAME = 'jacob-keynote.flac'

Now we'll connect to the service using the library client:

In [None]:
storage_client = storage.Client(project=credentials.project_id, credentials=credentials)

In [None]:
bucket = storage_client.get_bucket(BUCKET_NAME)

In [None]:
blob = bucket.blob(AUDIO_FILE_NAME)

In [None]:
assert blob.exists(), "Warning! Audio file not accessible"
blob.exists()

Let's now test if the audio is correct. We'll download it in memory and play it.

**WARNING:** Only do this if your audio is small (<2MB)! Large files take time and memory, so if yours is large, just skip this step.
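If you'd rather not guess, you can turn that limit into a tiny check. This is a sketch: `small_enough_to_preview` and `MAX_PREVIEW_BYTES` are names I invented. As far as I know, `blob.size` is only filled in after the blob's metadata has been loaded (for example with `blob.reload()`), so treat that part as an assumption and note that the function handles `None`.

```python
MAX_PREVIEW_BYTES = 2 * 1024 * 1024  # the 2 MB limit suggested in the warning

def small_enough_to_preview(size_bytes, limit=MAX_PREVIEW_BYTES):
    """Return True only when the size is known and within the limit."""
    return size_bytes is not None and size_bytes <= limit

print(small_enough_to_preview(500_000))     # a small file
print(small_enough_to_preview(50_000_000))  # a large file
print(small_enough_to_preview(None))        # size unknown (metadata not loaded)

# In the notebook, assuming the metadata has been loaded, this could be:
# small_enough_to_preview(blob.size)
```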

In [None]:
audio_content = blob.download_as_string()

In [None]:
import IPython

In [None]:
IPython.display.Audio(audio_content)

We're all set! Time to do the transcription.

### Transcribing audio file

First, we'll initialize a `SpeechClient` from the library:

In [None]:
client = speech.SpeechClient(credentials=credentials)

We need to construct the URI based on bucket and file name to point the Speech to Text service to the audio file:

In [None]:
# DO NOT CHANGE THIS
audio_uri = f'gs://{BUCKET_NAME}/{AUDIO_FILE_NAME}'

In [None]:
print(f"Audio uri: {audio_uri}")

👆 Is this correct? It should be if the previous setup step worked. Just make sure.
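If you want something stricter than eyeballing the URI, a helper like this can catch the most common mistakes (an empty name, or a path pasted in as the bucket name). It's a sketch; `build_gcs_uri` is a hypothetical function, not part of the library.

```python
def build_gcs_uri(bucket_name, object_name):
    """Build a gs:// URI, rejecting empty parts and bucket names containing '/'."""
    if not bucket_name or "/" in bucket_name:
        raise ValueError(f"Invalid bucket name: {bucket_name!r}")
    if not object_name:
        raise ValueError("Object name must not be empty")
    return f"gs://{bucket_name}/{object_name}"

print(build_gcs_uri("rmotr-speech-to-text-demo", "jacob-keynote.flac"))
```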

Now it's time to create the transcription config. This is where you can customize the process. All the accepted parameters are listed in the docs (**recommended**): https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig

The most important parameters I'm specifying here are:

* `encoding`: use FLAC. Full reference: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig#AudioEncoding
* `language_code`: the language of the audio. Full reference of languages: https://cloud.google.com/speech-to-text/docs/languages
* `model`: the recognition model (`command_and_search` for short commands, `phone_call`, `video`, etc.)

Some parameters come from the _beta_ version of the service, so we can consider them more advanced: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig

* `enable_automatic_punctuation`: adds punctuation to recognition result hypotheses
* `enable_speaker_diarization`: recognizes different speakers
* `diarization_speaker_count`: how many speakers there are in the interview

Finally, the full Python docs (for the library) are here: https://google-cloud-python.readthedocs.io/en/0.32.0/index.html#

In [None]:
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    #sample_rate_hertz=16000,
    language_code='en-US',

    model='default',

    enable_automatic_punctuation=True,
    
    enable_word_time_offsets=False, # this simplifies the output.
                                    # Set to True for per-word timestamps

    # enable_speaker_diarization=True,  # good for interviews
    # diarization_speaker_count=2
)
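If you end up transcribing several kinds of audio, it can help to collect the options in a plain dictionary first and only switch on the beta options when you need them. The sketch below uses only the parameter names mentioned above; `build_config_kwargs` is a hypothetical helper of mine, and it deliberately leaves out `encoding` so it runs without the library.

```python
def build_config_kwargs(language_code="en-US", model="default", speaker_count=None):
    """Collect RecognitionConfig keyword arguments in a plain dict (hypothetical helper)."""
    kwargs = {
        "language_code": language_code,
        "model": model,
        "enable_automatic_punctuation": True,
        "enable_word_time_offsets": False,
    }
    if speaker_count is not None:
        # Beta options, only added when a speaker count is given.
        kwargs["enable_speaker_diarization"] = True
        kwargs["diarization_speaker_count"] = speaker_count
    return kwargs

interview_kwargs = build_config_kwargs(speaker_count=2)
print(interview_kwargs)

# With the library imported, this could be passed along as:
# types.RecognitionConfig(
#     encoding=enums.RecognitionConfig.AudioEncoding.FLAC, **interview_kwargs)
```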

Now we'll create a `RecognitionAudio` type:

In [None]:
audio = types.RecognitionAudio(uri=audio_uri)

Finally, we start the _long running_ transcription process. Google will read the audio file from Cloud Storage and transcribe it directly from there. This call returns immediately, **but the processing is transferred to Google's servers**.

In [None]:
operation = client.long_running_recognize(config, audio)

Now it's time to get the results from the long running operation. Depending on how long your audio is, this might take a while. For our demo audio (~30 secs), it'll be done pretty much immediately:

In [None]:
operation.done()
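For longer audio, `operation.done()` will keep returning `False` for a while. One way to wait is a small polling loop with a timeout. This is a sketch: `wait_until_done` is a hypothetical helper, and `FakeOperation` is a stand-in I wrote so the example runs without calling Google. The only thing assumed about the real operation is the `done()` method used above.

```python
import time

def wait_until_done(operation, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll `operation.done()` until it returns True or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while not operation.done():
        if time.monotonic() >= deadline:
            raise TimeoutError("The operation did not finish in time")
        time.sleep(poll_seconds)
    return operation

class FakeOperation:
    """Hypothetical stand-in that reports done after a number of checks."""
    def __init__(self, checks_needed):
        self.remaining = checks_needed

    def done(self):
        self.remaining -= 1
        return self.remaining <= 0

wait_until_done(FakeOperation(3), poll_seconds=0.01)
print("finished")
```

In the notebook you'd call `wait_until_done(operation)` with the real operation. Calling `operation.result()` also blocks until the transcription is ready, so the loop is mainly useful when you want to control the polling interval or print progress.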

We can now access the results:

In [None]:
response = operation.result()

If it worked correctly, the transcription will be divided into multiple chunks, which are accessed as `results`:

In [None]:
len(response.results)

Here's a preview:

In [None]:
response.results

Each result has multiple "alternatives", based on the configuration that we defined at the beginning. As I kept my configuration simple, I only have 1 alternative, which also lists the confidence:

In [None]:
response.results[0].alternatives

In [None]:
response.results[0].alternatives[0].confidence

This is the transcribed text:

In [None]:
response.results[0].alternatives[0].transcript
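The confidence value is handy for spotting chunks worth re-listening to. Here's a sketch that flags the chunks whose first alternative falls under a threshold. `low_confidence_chunks` is a hypothetical helper, the `0.8` threshold is an arbitrary choice, and the `Alternative`/`Result` tuples and their sample values are made-up stand-ins that only mimic the shape of the response (`alternatives`, `transcript`, `confidence`).

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic the shape of the response objects.
Alternative = namedtuple("Alternative", ["transcript", "confidence"])
Result = namedtuple("Result", ["alternatives"])

def low_confidence_chunks(results, threshold=0.8):
    """Return (index, confidence, transcript) for chunks below `threshold`."""
    flagged = []
    for index, result in enumerate(results):
        if not result.alternatives:
            continue  # nothing to inspect for this chunk
        best = result.alternatives[0]
        if best.confidence < threshold:
            flagged.append((index, best.confidence, best.transcript))
    return flagged

# Made-up sample data, not real output from the service.
sample_results = [
    Result([Alternative("Hello everyone.", 0.95)]),
    Result([Alternative("I work at Roku.", 0.62)]),
]

for index, confidence, transcript in low_confidence_chunks(sample_results):
    print(f"chunk {index} ({confidence:.2f}): {transcript}")
```

With the real response you'd pass `response.results` instead of `sample_results`.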

We can combine all the results to generate the full text:

In [None]:
full_text = "\n".join([result.alternatives[0].transcript for result in response.results])

In [None]:
print(textwrap.fill(full_text, 80))

Compare it again to our original audio:

In [None]:
import IPython

In [None]:
IPython.display.Audio("jacob-keynote.flac")

#### Fine tuning our transcription

If you listen carefully, Jacob says he's "Director of Security at the **_Heroku_**", but the transcription picks up _"Roku"_. We can improve this by passing the service a list of important terms we know are proper nouns, names, or special terms that won't be easily recognized. To do that we'll pass a _speech context_. Here are the docs: https://cloud.google.com/speech-to-text/docs/basics#phrase-hints

Let's create another configuration object. This time, look at the `speech_contexts` parameter:

In [None]:
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    #sample_rate_hertz=16000,
    language_code='en-US',

    model='default',

    enable_automatic_punctuation=True,
    
    enable_word_time_offsets=False, # this simplifies the output.
                                    # Set to True for per-word timestamps

    speech_contexts=[speech.types.SpeechContext(
        phrases=['Heroku',],  # Only one term
    )],
    # enable_speaker_diarization=True,  # good for interviews
    # diarization_speaker_count=2
)

I'll restart the process now:

In [None]:
operation = client.long_running_recognize(config, audio)

In [None]:
response = operation.result()

In [None]:
full_text = "\n".join([result.alternatives[0].transcript for result in response.results])

In [None]:
print(textwrap.fill(full_text, 80))

That's amazing! It now says _"director of security at Heroku"_. You can do this if you have a list of terms that your transcriptions have to pick up.
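When the list of terms grows, checking the transcript by eye gets tedious. This sketch reports which terms never show up as whole words in the text. `missing_terms` is a hypothetical helper, and the two sample sentences are shortened versions of the phrases quoted above.

```python
import re

def missing_terms(text, terms):
    """Return the terms that don't appear in `text` as whole words (case-insensitive)."""
    missing = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if not re.search(pattern, text, flags=re.IGNORECASE):
            missing.append(term)
    return missing

print(missing_terms("director of security at Roku", ["Heroku"]))
print(missing_terms("director of security at Heroku", ["Heroku"]))
```

In the notebook you could run `missing_terms(full_text, ['Heroku'])` after each transcription to see whether the speech context had the effect you wanted.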

### Finally, writing the results to a file

We can now write the results to a file. **Warning!** Python will erase the contents of `result.txt` if it already exists:

In [None]:
with open('result.txt', 'w') as fp:
    fp.write(textwrap.fill(full_text, 80))
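If overwriting worries you, Python's `'x'` mode refuses to open a file that already exists. The sketch below wraps that in a hypothetical `write_transcript` helper and writes to a temporary folder so it doesn't touch your real `result.txt`; the sample text is made up.

```python
import os
import tempfile
import textwrap

def write_transcript(path, text, width=80):
    """Write wrapped text to `path`, refusing to overwrite an existing file."""
    # Mode 'x' raises FileExistsError instead of erasing the file.
    with open(path, "x", encoding="utf-8") as fp:
        fp.write(textwrap.fill(text, width))

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "result.txt")

write_transcript(demo_path, "Some transcribed text. " * 10)
try:
    write_transcript(demo_path, "A second attempt")
except FileExistsError:
    print("result.txt already exists, nothing was overwritten")
```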