In this exercise you'll download a YouTube video's audio, convert it to the WAV format, and then use the Google Cloud API to transcribe its content.

The lines below import some libraries that make this quite simple. In summary, they are:

 * **pafy**, a library to download YouTube video and audio.
 * **pydub**, a library to convert audio, for example from MP3 to WAV.
 * **google api**, which contains a lot of functionality, in particular audio transcription using the Speech API. The neat thing is that the work is actually done on a Google server: you send it audio and get a transcription back. This way Google can improve its machine learning algorithms and serve them to you.


Select the 'cell' below and press CTRL+ENTER or SHIFT+ENTER to run the code inside it.

In [1]:
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentials.json'

import pafy
from pydub import AudioSegment

from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

from googleapiclient import discovery

First use `pafy` to get some information about a video.

Run the cell.

In [2]:
url = 'https://www.youtube.com/watch?v=lWWKBY7gx_0'

video = pafy.new(url)

print("Url:", url)
print("Title:", video.title)
print("Author:", video.author)
print("Description:", video.description)
print("Length:", video.length)

Url: https://www.youtube.com/watch?v=lWWKBY7gx_0
Title: Le Grand Content
Author: enlarge
Description: A Film by Clemens Kogler together with Karo Szmit. Voice by Andre Tschinder.

Le Grand Content examines the omnipresent Powerpoint-culture in search for its philosophical potential. Intersections and diagrams are assembled to form a grand 'association-chain-massacre'. which challenges itself to answer all questions of the universe and some more. Of course, it totally fails this assignment, but in its failure it still manages to produce some magical nuance and shades between the great topics death, cable tv, emotions and hamsters.

For more Information:
http://www.clemenskogler.net/grandcontent
Length: 238


Now let's actually download the audio and save it. The first line gets the best audio format from the available ones, as YouTube provides multiple formats and encodings of video and audio.

The second line downloads the file to `audio.webm`.

Run the cell.

In [3]:
audio = video.getbestaudio()

filename = audio.download(filepath="audio." + audio.extension)
print("Filename:", filename)

Filename: audio.webm


Google's API works more easily with a WAV file than with a WebM file (even though WebM is Google's own format). Moreover, there's a 60-second limit on the easy way of doing this. Longer files need the so-called `streaming api`, which is a bit harder to use.

Let's keep it simple. The recipe below is, line by line:

 * open the WEBM file
 * save it as WAV
 
Run the cell.

In [4]:
sound = AudioSegment.from_file(filename)

sound.export("audio.wav", format="wav", bitrate="128k")

<_io.BufferedRandom name='audio.wav'>

Now that we have a WAV file, we use the tools Google provides to load it into a data type Google likes. We also specify a configuration stating that we want English transcriptions. This improves the transcription quality, as the system now knows that *Je t'adore* is less likely to occur than *Shut the door*.

 * The first two lines open the file and place all of its bytes in memory.
 * The third line converts them to a format in which Google can keep track of certain relevant information.
 * The last line creates a configuration for the transcription task and specifies the language.

Run the cell.

In [5]:
with open('audio.wav', 'rb') as audio_file:
    content = audio_file.read()

audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig( language_code='en-US' )

Let's start transcribing. The first line creates a client, a sort of telephone that handles the communication with Google.

The second line does the hard work, or at least asks Google to do so.

Run the cell, **it will result in an error**.

Read the message, however cryptic it may seem. What does this error mean?

 * A) You need to pay for this Google service.
 * B) You need to make an appointment (rendezvous) with a service agent from Google.
 * C) The audio was too long, and this API only accepts smaller files.
 * D) The transcription service was permanently terminated in 2016; Google now only offers web search and e-mail.

In [6]:
client = speech.SpeechClient()

response = client.recognize(config, audio)

RetryError: RetryError(Exception occurred in retry method that was not classified as transient, caused by <_Rendezvous of RPC that terminated with (StatusCode.INVALID_ARGUMENT, Request payload size exceeds the limit: 10485760 bytes.)>)

An important part of programming is failing. No code was ever right the first time, and most code isn't even right in its final version. `github.com` is a place to distribute software, but also to register software bugs and manage their solutions. Browse some of the issues there and you'll get the point (or don't browse the site, and trust me on this).

The issue here is the file size: the file is too large, as Google only allows 60 seconds. So let's extract **the first 30 seconds** and try again.
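
The number in the error message can be checked with some arithmetic. This is a rough sketch, assuming uncompressed 16-bit stereo PCM at 44.1 kHz (the actual sample rate and sample width of `audio.wav` are not shown in this notebook) and ignoring the small WAV header. The helper `pcm_size_bytes` is hypothetical.

```python
LIMIT_BYTES = 10485760  # payload limit quoted in the error message (10 MiB)

def pcm_size_bytes(seconds, sample_rate=44100, channels=2, bytes_per_sample=2):
    """Approximate size of uncompressed PCM audio, header not included."""
    return seconds * sample_rate * channels * bytes_per_sample

full_clip = pcm_size_bytes(238)   # the whole video (pafy reported 238 seconds)
short_clip = pcm_size_bytes(30)   # only the first 30 seconds

print(full_clip, full_clip > LIMIT_BYTES)
print(short_clip, short_clip > LIMIT_BYTES)
```

Under these assumptions the full clip is about four times the limit, while a 30-second piece fits comfortably.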

To extract a time slice, the syntax is `[ start_in_milliseconds : end_in_milliseconds ]`. If you leave out the start, the slice starts from the beginning. Leaving out the end means the slice runs to the end.
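
The start and end conventions are the same as for ordinary Python slices, only counted in milliseconds. This small sketch uses a `range` of millisecond positions as a stand-in for the audio, so pydub is not needed to run it. The 60-second length and the variable names are made up for illustration.

```python
# One entry per millisecond of an imaginary 60-second clip.
timeline = range(60 * 1000)

first_five = timeline[: 5 * 1000]          # start omitted: begins at 0
after_ten = timeline[10 * 1000 :]          # end omitted: runs to the end
middle = timeline[5 * 1000 : 10 * 1000]    # both start and end given

# Print the length, first and last millisecond of each slice.
print(len(first_five), first_five[0], first_five[-1])
print(len(after_ten), after_ten[0], after_ten[-1])
print(len(middle), middle[0], middle[-1])
```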

Choose the correct line out of the 4 starting with `#`, then run the cell. Choose a line by removing the `#`; *unchoose* it by putting the `#` back.

Note: a `#` in front of a line tells the computer to ignore it. This way you can disable code or add comments.

Run the next 4 cells to test it.

In [7]:
sound = AudioSegment.from_file(filename)

### Select one of these 4 options by removing the #:
#sound = sound[:30]
#sound = sound[30*1000:]
sound = sound[0 : 30*1000]
#sound = sound[30:]

# Some YouTube videos have multiple audio channels, for example two for stereo instead of mono.
# Google can only transcribe one channel at a time. This line selects the first channel and
# ignores any others.

sound = sound.split_to_mono()[0]

sound.export("audio.wav", format="wav", bitrate="128k")

<_io.BufferedRandom name='audio.wav'>
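
To check that an exported file really is mono and has the expected length, the standard-library `wave` module can read the WAV header. This is a minimal sketch: the helper name `wav_info` is hypothetical, and the demo writes a short silent clip to memory instead of reading `audio.wav`, so it runs on its own.

```python
import io
import wave

def wav_info(file_or_path):
    """Return (channels, sample_rate_hz, duration_seconds) of a PCM WAV file."""
    with wave.open(file_or_path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration

# Demo on a synthetic file: 2 seconds of 16-bit mono silence at 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 2)
buffer.seek(0)

demo_info = wav_info(buffer)
print(demo_info)
```

On the notebook's own file the call would be `wav_info("audio.wav")`, which should report one channel and roughly 30 seconds.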

In [8]:
with open('audio.wav', 'rb') as audio_file:
    content = audio_file.read()

audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig( language_code='en-US' )

In [9]:
client = speech.SpeechClient()

response = client.recognize(config, audio)

In [10]:
response

results {
  alternatives {
    transcript: "people keep asking me questions about life summarized the questions are why how it worked and I always tell them these are the words the teenage poetry albums"
    confidence: 0.9000371694564819
  }
}
results {
  alternatives {
    transcript: " but you have to face the teenagers like coffee and that\'s a the stock market of things that wants"
    confidence: 0.9185096621513367
  }
}

The result is a transcript, together with the confidence Google had in it. Feel free to try the same with a different video; just make sure it contains speech.
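
The printed response can also be walked in code. This is a minimal sketch, assuming the response object exposes `results`, each holding a list of `alternatives` with a `transcript` and a `confidence`, as the output above suggests. The helper `best_transcript` is hypothetical, and `SimpleNamespace` objects stand in for a real response so the example runs without calling Google.

```python
# Standard-library module; unrelated to google.cloud.speech.types.
from types import SimpleNamespace

def best_transcript(response):
    """Join the first alternative of every result and average the confidences."""
    pieces, confidences = [], []
    for result in response.results:
        if not result.alternatives:
            continue
        top = result.alternatives[0]
        pieces.append(top.transcript.strip())
        confidences.append(top.confidence)
    text = " ".join(pieces)
    mean_confidence = sum(confidences) / len(confidences) if confidences else 0.0
    return text, mean_confidence

# Stand-in with the same shape as the response printed above (text shortened).
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="people keep asking me questions", confidence=0.90)]),
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript=" but you have to face the teenagers", confidence=0.92)]),
])

text, confidence = best_transcript(fake_response)
print(text)
print(round(confidence, 2))
```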

# Advanced


Let's try to systematically extract each 30-second interval of the video and transcribe it. For this we use a `for`-loop. It would be better to use the streaming version of the Google API, which allows longer audio clips, as the transcription is then done holistically. The loop below does not do that: it transcribes each piece on its own.

The `for`-loop runs the code inside it multiple times; below it is set to iterate through all the 30-second intervals of the video.
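
To see which intervals such a loop should visit, the boundaries can be worked out on plain numbers first. This is a minimal sketch, assuming a clip length given in milliseconds. The helper name `chunk_bounds` is hypothetical and not part of pydub or the Google API. It clamps the last end point to the clip length, so the final, shorter piece is included.

```python
def chunk_bounds(total_ms, chunk_ms=30 * 1000):
    """Return (start, end) pairs in milliseconds that cover the whole clip."""
    bounds = []
    for start in range(0, total_ms, chunk_ms):
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
    return bounds

# 238 seconds is the length pafy reported for the example video.
for start, end in chunk_bounds(238 * 1000):
    print("Start:", start, ", end:", end)
```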

Replace `None` with the correct expression using `start` and `end`.

In [None]:
sound = AudioSegment.from_file(filename)
sound = sound.split_to_mono()[0]

client = speech.SpeechClient()
response = None

for start in range(0, len(sound), 30 * 1000):
    # Clamp the end to the clip length so the final piece is included too.
    end = min(start + 30 * 1000, len(sound))
    print("Start:", start, ", end:", end)
    
    
    sound_piece = None # replace None with the right interval.
    sound_piece = sound[start: end]
    
    
    sound_piece.export("audio.wav", format="wav", bitrate="128k")
    
    with open('audio.wav', 'rb') as audio_file:
        content = audio_file.read()

    audio = types.RecognitionAudio(content=content)
    config = types.RecognitionConfig( language_code='en-US' )

    response = client.recognize(config, audio)
    print(response)