# GCP Diarization

Author: **Rommel Silva**

Date: **11/11/2019**

This is my attempt at creating a simple pipeline that sends audio files through GCP's Speech-to-Text API and returns the transcript supplemented with basic speaker diarization. The API exposes many options, so I would suggest at least taking a look at the [config file documentation](https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig) to get a good grasp of what this function can do.

In [1]:
from google.cloud import speech_v1p1beta1
import io, sys, getopt, json, csv, time, string
import pandas as pd

In [2]:
def sample_long_running_recognize(audio_file, upload_method):
    """
    Transcribe an audio file with speaker diarization enabled, so that each
    word in the response carries a speaker tag.

    Args:
      audio_file: Path to a local audio file, e.g. /path/audio.wav, or a
        Cloud Storage URI, e.g. 'gs://audio_analsis/jordan_peterson_mono.wav'
      upload_method: Either 'local' or 'uri'

    Returns:
      The LongRunningRecognizeResponse returned by the API.
    """

    client = speech_v1p1beta1.SpeechClient()

    # local_file_path = 'resources/commercial_mono.wav'

    # If enabled, each word in the first alternative of each result will be
    # tagged with a speaker tag to identify the speaker.
    enable_speaker_diarization = True

    # Optional. Specifies the estimated number of speakers in the conversation.
    diarization_speaker_count = 3

    # Declare the real channel count. Without it, stereo files failed with:
    # "Must use single channel (mono) audio, but WAV header indicates 2 channels."
    audio_channel_count = 2
    
    #If enabled, it will detect punctuation.
    enable_automatic_punctuation = True

    # The language of the supplied audio
    language_code = "en-US"
    config = {
        "enable_speaker_diarization": enable_speaker_diarization,
        #"diarization_speaker_count": diarization_speaker_count,
        "language_code": language_code,
        "enable_automatic_punctuation": enable_automatic_punctuation,
        "audio_channel_count": audio_channel_count,
    }
    
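    if upload_method == 'local':
        # Send the raw bytes of a local file inline with the request.
        with io.open(audio_file, "rb") as f:
            content = f.read()
        audio = {"content": content}
    elif upload_method == 'uri':
        # Reference an object that already lives in Cloud Storage.
        audio = {"uri": audio_file}
    else:
        raise ValueError("upload_method must be 'local' or 'uri'")

    operation = client.long_running_recognize(config=config, audio=audio)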

    print(u"Waiting for operation to complete...")
    response = operation.result()
    print(u"Done!")

    return response

## For files that are < 1 minute
If we're working with small files (~1 minute or less), we can upload them directly from our local machine.

In [3]:
audio_in_file_path = '###PROVIDE###' # Example: 'data/'
audio_in_file_name = '###PROVIDE###' # Example 'jordan_peterson_lecture'
audio_in_file_format = '###PROVIDE###' # Example '.wav'

audio_in_file = audio_in_file_path + audio_in_file_name + audio_in_file_format
response = sample_long_running_recognize(audio_in_file, 'local')

Waiting for operation to complete...
Done!
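Before choosing between the two upload methods, it can help to check how long a file is and how many channels it has. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM `.wav` file. The helper names `describe_wav` and `choose_upload_method`, and the 60-second threshold, are illustrative assumptions and not part of the Speech-to-Text API.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)
    return channels, sample_rate, duration


def choose_upload_method(path, max_local_seconds=60.0):
    """Suggest 'local' for short clips and 'uri' otherwise.

    The 60-second default is an assumption taken from the ~1 minute
    rule of thumb used in this notebook.
    """
    _, _, duration = describe_wav(path)
    return "local" if duration <= max_local_seconds else "uri"
```

The channel count returned by `describe_wav` is the value that `audio_channel_count` in the config needs to match.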


## For files that are > 1 minute
Audio longer than ~1 minute must use the uri field to reference an audio file in Google Cloud Storage.
You have to first upload the audio file you want to use into a cloud storage bucket (https://cloud.google.com/storage/docs/uploading-objects), and ensure that the credentials that you'll be running with have ```storage.objects.get``` access permission to the objects.

Service Account info:

```email: audioanalysis@multispeaker-tlt.iam.gserviceaccount.com```

```key id: bb977459cba035c5f5e08e540606301d03d630f4```
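The upload step can also be scripted. This is a minimal sketch that assumes the `google-cloud-storage` package is installed and that the active credentials are allowed to write to the bucket. The helper names `gcs_uri` and `upload_audio` are hypothetical, and the bucket and object names are whatever you supply.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that the 'uri' upload method expects."""
    return "gs://{}/{}".format(bucket_name, blob_name.lstrip("/"))


def upload_audio(local_path, bucket_name, blob_name):
    """Upload a local file to Cloud Storage and return its gs:// URI."""
    # Imported lazily so gcs_uri() stays usable without the storage library.
    from google.cloud import storage

    client = storage.Client()  # picks up the active credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)
```

The string returned by `upload_audio` can be passed straight to `sample_long_running_recognize(uri, 'uri')`.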

In [3]:
start_time = time.time()

audio_in_file_name = '###PROVIDE###' # Example 'jordan_peterson_lecture'
audio_in_file_format = '###PROVIDE###' # Example '.wav'
uri = '###PROVIDE###' + audio_in_file_name + audio_in_file_format # Example 'gs://audio_analsis/'

response = sample_long_running_recognize(uri, 'uri')

print("Time: " + str((time.time() - start_time)/60) + " minutes.")

Waiting for operation to complete...
Done!
Time: 4.090657250086466 minutes.


# 'LongRunningRecognizeResponse' to DataFrame

Once we receive the response from Google's Speech API, it comes as a `LongRunningRecognizeResponse`, a protocol buffer message that looks similar to JSON but is read through attributes rather than dictionary keys. The general format of the response is as follows (the fields we need are marked with *):

response

    results.alternatives
        transcript*
        words
            start_time
                seconds*
                nanos
            end_time
                seconds*
                nanos
            word*
            speaker_tag*
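The outline above can be walked with plain attribute access. The sketch below uses stand-in objects built with `SimpleNamespace` instead of a real API response, so it runs without credentials. It assumes the timestamps expose `seconds` and `nanos` as in the outline, and that the last result carries the cumulative, speaker-tagged word list, which is how Google's diarization documentation describes the response. The names `to_seconds`, `flatten_words`, and `fake_word` are illustrative only.

```python
from types import SimpleNamespace as NS


def to_seconds(ts):
    """Combine whole seconds and nanoseconds into float seconds."""
    return ts.seconds + ts.nanos / 1e9


def flatten_words(response):
    """Return one dict per word from the last result's word list."""
    if not response.results:
        return []
    words = response.results[-1].alternatives[0].words
    return [
        {
            "word": w.word,
            "start": to_seconds(w.start_time),
            "stop": to_seconds(w.end_time),
            "speaker_tag": w.speaker_tag,
        }
        for w in words
    ]


def fake_word(text, start, stop, tag):
    """Stand-in for a word entry; start and stop are (seconds, nanos) pairs."""
    return NS(
        word=text,
        speaker_tag=tag,
        start_time=NS(seconds=start[0], nanos=start[1]),
        end_time=NS(seconds=stop[0], nanos=stop[1]),
    )


# A toy response with a single result holding two tagged words.
fake_response = NS(results=[NS(alternatives=[NS(
    transcript="Okay, started.",
    words=[
        fake_word("Okay,", (10, 0), (12, 500000000), 2),
        fake_word("started.", (14, 0), (14, 500000000), 1),
    ],
)])])

rows = flatten_words(fake_response)
```

Keeping the `nanos` part gives sub-second timestamps, whereas the cells below keep only whole seconds.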

In [4]:
cols = ["source_file", "transcript_id", "word", "end_sentence", "start", "stop", "speaker_tag"]
master = pd.DataFrame(columns = cols)

end_punctuation = [".", "!", "?"]
source_file = audio_in_file_name
transcript_id = []
word = []
end_sentence = []
speaker_tag = []
start = []
stop = []

In [5]:
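# With diarization enabled, the last result repeats every word of the
# recording together with its speaker tag, so it is skipped as a transcript
# and used only to look up speaker tags. This assumes its words line up
# one-to-one, in order, with the words of the earlier results.
tagged_words = response.results[-1].alternatives[0].words
rows = []
word_index = 0  # position of the current word within tagged_words

for i in range(len(response.results) - 1):

    transcript_id = i + 1
    for words in response.results[i].alternatives[0].words:

        word = words.word

        # A word closes a sentence when it ends with terminal punctuation.
        end_sentence = 1 if word.endswith(tuple(end_punctuation)) else 0

        start = words.start_time.seconds
        stop = words.end_time.seconds
        speaker_tag = tagged_words[word_index].speaker_tag
        word_index += 1

        rows.append([source_file, transcript_id, word, end_sentence, start, stop, speaker_tag])

# Build the frame once instead of concatenating inside the loop.
master = pd.DataFrame(rows, columns = cols)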


In [6]:
master

Unnamed: 0,source_file,transcript_id,word,end_sentence,start,stop,speaker_tag
0,hbb20_0,1,Stanford,0,4,5,2
1,hbb20_0,1,University,0,5,5,2
2,hbb20_0,2,"Okay,",0,10,12,2
3,hbb20_0,2,let's,0,12,13,2
4,hbb20_0,2,get,0,13,14,2
5,hbb20_0,2,started.,1,14,14,2
6,hbb20_0,3,We,0,18,19,2
7,hbb20_0,3,pick,0,19,19,2
8,hbb20_0,3,up,0,19,19,2
9,hbb20_0,3,with,0,19,20,2


#### Output master into a .csv file

In [12]:
out_file_path = '###PROVIDE###' #Example 'data/outputs/'
out_file_name = audio_in_file_name
out_file_format = '###PROVIDE###' #Example '.csv' 

out_file = out_file_path + out_file_name + out_file_format
master.to_csv(out_file, index = False)

# Separate the sentences

Now that we have the master csv file neatly organized, we can create a file with each individual sentence.

The idea here is to build groups that span at least 30 seconds: starting from the first word of a group, we add 30 seconds, look for the next word with `end_sentence = 1` at or after that mark, and every word in between falls under the same sentence.
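The grouping rule can be tried on a handful of toy words before running it on a full transcript. This is a minimal sketch in plain Python. The function name `assign_sentence_ids` and the tuple layout `(start, stop, end_sentence)` are assumptions made for the example, and the timings are invented.

```python
def assign_sentence_ids(words, min_span=30):
    """Assign a sentence id to each (start, stop, end_sentence) tuple.

    Words must be in time order. A group closes at the first
    sentence-ending word that starts at or after the current cutoff.
    """
    ids = []
    sentence_id = 1
    cutoff = min_span
    for start, stop, end_sentence in words:
        # The closing word still belongs to the group it closes.
        ids.append(sentence_id)
        if start >= cutoff and end_sentence == 1:
            cutoff = stop + min_span
            sentence_id += 1
    return ids


toy_words = [
    (4, 5, 0), (14, 14, 1), (29, 30, 0), (31, 32, 1),
    (33, 34, 0), (40, 41, 1), (65, 66, 1), (67, 68, 0),
]
toy_ids = assign_sentence_ids(toy_words)
```

Here the sentence end at second 14 does not close the first group because it comes before the 30-second mark, while the one at second 31 does.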

In [7]:
cols_sentences = cols + ["sentence_id"]
master_with_sentences = pd.DataFrame(columns = cols_sentences)

sentence_ids = []

sentence_id = 1
# Earliest start time at which the current sentence may close. Named cutoff
# rather than time so that the imported time module is not shadowed.
cutoff = 30

for index, row in master.iterrows():

    # The word that closes a sentence still belongs to that sentence.
    sentence_ids.append(sentence_id)
    if row['start'] >= cutoff:
        if row['end_sentence'] == 1:
            cutoff = row['stop'] + 30
            sentence_id = sentence_id + 1

# Build the frame once instead of concatenating inside the loop.
sentences = pd.DataFrame({'sentence_id': sentence_ids})
    

In [8]:
master_with_sentences = master.assign(sentence_id = sentences['sentence_id'])
master_with_sentences

Unnamed: 0,source_file,transcript_id,word,end_sentence,start,stop,speaker_tag,sentence_id
0,hbb20_0,1,Stanford,0,4,5,2,1
1,hbb20_0,1,University,0,5,5,2,1
2,hbb20_0,2,"Okay,",0,10,12,2,1
3,hbb20_0,2,let's,0,12,13,2,1
4,hbb20_0,2,get,0,13,14,2,1
5,hbb20_0,2,started.,1,14,14,2,1
6,hbb20_0,3,We,0,18,19,2,1
7,hbb20_0,3,pick,0,19,19,2,1
8,hbb20_0,3,up,0,19,19,2,1
9,hbb20_0,3,with,0,19,20,2,1


#### Output master_with_sentences into a .csv file

In [34]:
out_file_path = '###PROVIDE###' #Example 'data/outputs/' 
out_file_name = audio_in_file_name
out_file_format = '###PROVIDE###' #Example '.csv'

out_file = out_file_path + out_file_name + out_file_format
master_with_sentences.to_csv(out_file, index = False)

## Creating the individual sentences
Since we know which sentence each word belongs to, we can combine the words into full sentences that each cover a minimum of 30 seconds of audio.
For this we'll create a new DataFrame that contains the following columns: `sentence, sentence_id, speaker_tag, num_speakers, start_time, stop_time`
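The same table can also be produced in one pass with `groupby`. The sketch below is an alternative formulation and assumes a word-level frame with the columns `word`, `speaker_tag`, `start`, `stop`, and `sentence_id`. The function name `build_sentences` and the toy data are made up for illustration.

```python
import pandas as pd


def build_sentences(words_df):
    """Collapse a word-level frame into one row per sentence_id."""
    grouped = words_df.groupby("sentence_id", sort=True)
    result = grouped.agg(
        sentence=("word", " ".join),
        # Attribute the sentence to the tag that covers the most words.
        speaker_tag=("speaker_tag", lambda s: s.value_counts().idxmax()),
        num_speakers=("speaker_tag", "nunique"),
        start_time=("start", "min"),
        stop_time=("stop", "max"),
    )
    return result.reset_index()


toy = pd.DataFrame({
    "word": ["Okay,", "let's", "start.", "We", "continue."],
    "speaker_tag": [2, 2, 1, 1, 1],
    "start": [10, 12, 13, 18, 19],
    "stop": [12, 13, 14, 19, 20],
    "sentence_id": [1, 1, 1, 2, 2],
})
toy_sentences = build_sentences(toy)
```

A `num_speakers` value above 1 flags a sentence in which the diarization switched speakers partway through.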

In [9]:
cols_sentence_master = ["sentence", "sentence_id", "speaker_tag", "num_speakers", "start_time", "stop_time"]
sentence_master = pd.DataFrame(columns = cols_sentence_master)

In [10]:
rows = []

for y in range(1, master_with_sentences['sentence_id'].max() + 1):
    temp = master_with_sentences.loc[master_with_sentences['sentence_id'] == y]

    # Join the words of this sentence in their original order.
    sentence = " ".join(temp["word"])

    num_speakers = temp['speaker_tag'].nunique()
    # Attribute the sentence to the speaker tag that covers the most words.
    speaker = temp['speaker_tag'].value_counts().idxmax()
    start_time = temp['start'].min()
    stop_time = temp['stop'].max()

    rows.append([sentence, y, speaker, num_speakers, start_time, stop_time])

# Build the frame once instead of concatenating inside the loop.
sentence_master = pd.DataFrame(rows, columns = cols_sentence_master)

In [11]:
sentence_master

Unnamed: 0,sentence,sentence_id,speaker_tag,num_speakers,start_time,stop_time
0,"Stanford University Okay, let's get started. W...",1,2,1,4,29


In [96]:
out_file_path = '###PROVIDE###' #Example 'data/outputs/' 
out_file_name = audio_in_file_name
out_file_format = '###PROVIDE###' #Example '.csv'

out_file = out_file_path + out_file_name + "_sentences" + out_file_format
sentence_master.to_csv(out_file, index = False)