<a href="https://colab.research.google.com/github/iamsusiep/slp2019/blob/master/preprocess_utterances.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook preprocesses WAV files from which music has already been removed (by the remove_music notebook) and converts them into mel-spectrograms. A pre-trained model can then use these spectrograms to retrieve prosody embeddings.

Each WAV file is split into utterances using timestamps obtained from transcriptions and forced alignment. One mel-spectrogram is saved per utterance, so each WAV file has several mel-spectrograms associated with it.

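As a minimal sketch of the splitting step described above, the snippet below cuts a waveform into utterances from (start, end) timestamps given in seconds. The 16000 Hz sample rate, the synthetic clip, and the timestamps are assumptions made for illustration, and plain arithmetic stands in for the library call used later in this notebook.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate in Hz


def slice_utterance(wav, start_time, end_time, sr=SAMPLE_RATE):
    """Return the samples of `wav` between start_time and end_time (in seconds)."""
    start_index = int(start_time * sr)
    end_index = int(end_time * sr)
    return wav[start_index:end_index]


# Synthetic 30-second clip standing in for one WAV file.
clip = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)

# Hypothetical (start, end) timestamps, as forced alignment might produce them.
timestamps = [(0.5, 3.0), (4.25, 9.75)]

utterances = [slice_utterance(clip, start, end) for start, end in timestamps]
for (start, end), utt in zip(timestamps, utterances):
    print(f"{start:.2f}-{end:.2f} s -> {len(utt)} samples")
```

Each slice would then be turned into its own mel-spectrogram, which is why one WAV file ends up with several output files.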
In [0]:
from google.colab import drive

drive.mount('/content/gdrive')
!git clone https://github.com/syang1993/gst-tacotron.git

Mounted at /content/gdrive
Cloning into 'gst-tacotron'...
remote: Enumerating objects: 375, done.
remote: Total 375 (delta 0), reused 0 (delta 0), pack-reused 375
Receiving objects: 100% (375/375), 421.74 KiB | 587.00 KiB/s, done.
Resolving deltas: 100% (244/244), done.


In [0]:
%cd gst-tacotron

/content/gst-tacotron


In [0]:
%%capture
!pip install -r requirements.txt

In [0]:
%%capture
!pip install --upgrade librosa

In [0]:
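import glob
import os

import librosa
import numpy as np
import pandas as pd

from util.audio import *  # provides load_wav and melspectrogram from gst-tacotron

def process_utterance(out_dir, wav_path, link, start_time, end_time, name, text):
  '''Preprocesses a single utterance.
  This slices the utterance out of the WAV file and writes its mel-scale
  spectrogram to disk as <link>_<name>-mel.npy. Nothing is written for
  utterances that fail the filter below.
  Args:
    out_dir: The directory to write the spectrogram into
    wav_path: Path to the audio file containing the speech input
    link: Identifier of the trailer segment, used as the filename prefix
    start_time: Start of the utterance within the WAV file, in seconds
    end_time: End of the utterance within the WAV file, in seconds
    name: Name of the utterance from the alignment file
    text: Transcribed text of the utterance
  '''
  # Skip utterances that start at 29.96 s, last less than 2 seconds, or have no text.
  # The isinstance check guards against pandas reading an empty text field as NaN.
  if start_time != 29.96 and abs(start_time - end_time) >= 2 and isinstance(text, str) and text.strip():
    # Load the audio to a numpy array:
    wav = load_wav(wav_path)

    # Keep only the samples between the start and end times from the transcription.
    # The rate of 16000 is assumed to match the rate at which load_wav loads the audio.
    start_index = librosa.time_to_samples(start_time, sr=16000)
    end_index = librosa.time_to_samples(end_time, sr=16000)
    wav = wav[int(start_index):int(end_index)]

    # Compute a mel-scale spectrogram from the wav:
    mel_spectrogram = melspectrogram(wav).astype(np.float32)

    # Write the transposed mel spectrogram to disk:
    mel_filename = link + '_' + name + '-mel.npy'
    np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)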

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


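The condition that decides whether an utterance gets processed may be easier to check as a standalone predicate. The sketch below mirrors the thresholds used in `process_utterance` (a start time of 29.96 s and a minimum duration of 2 seconds). The function name `is_usable_utterance` and the example rows are hypothetical, and the guard for non-string text is an addition that covers empty CSV fields, which pandas reads as NaN by default.

```python
def is_usable_utterance(start_time, end_time, text):
    """Return True if the utterance passes the same filter as process_utterance."""
    if start_time == 29.96:
        return False
    if abs(start_time - end_time) < 2:
        return False
    # Added guard (not in the original condition): reject non-string text such as NaN.
    return isinstance(text, str) and bool(text.strip())


# Hypothetical alignment rows: (start, end, text)
rows = [
    (0.0, 3.5, "hello there"),   # kept
    (1.0, 2.0, "too short"),     # dropped: shorter than 2 seconds
    (29.96, 29.96, "end"),       # dropped: starts at 29.96 s
    (5.0, 9.0, "   "),           # dropped: whitespace-only text
    (5.0, 9.0, float("nan")),    # dropped: empty field read as NaN
]

kept = [row for row in rows if is_usable_utterance(*row)]
print(kept)
```

Keeping the filter separate like this makes it possible to count how many utterances would be skipped before any audio is loaded.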

In [0]:
# Runs through all WAV files (each is 30 seconds long, so there are several per trailer)
# and generates a mel spectrogram for each utterance, excluding some erroneous transcripts.
# Note: this cell may fail with a shape-related error if your version of librosa is old.
# If it does, run pip install --upgrade librosa and rerun the cell.

wav_dir = '/content/gdrive/My Drive/no_music/'
align_dir = '/content/gdrive/My Drive/align/'
out_dir = '/content/gdrive/My Drive/preprocessed_prosody_model_inputs/'

wav_files = glob.glob(wav_dir + '*.wav')
all_alignments = os.listdir(align_dir)
existing_files = os.listdir(out_dir)

# loop through all wav files
for f in wav_files:
  # the WAV filename without its extension is the identifier to match on
  yt_link = f.split('no_music/')[1].split('.wav')[0]
  alignments = [x for x in all_alignments if yt_link in x]
  # skip files that have no alignments or that already have saved spectrograms
  if not alignments or any(yt_link in x for x in existing_files):
    continue
  for alignment in alignments:
    # amend yt_link to correspond to segment of trailer
    yt_link_segment = yt_link + alignment.split(yt_link)[1].split('.csv')[0]
    # read in the alignment (the CSV has no header row) and process each utterance
    df = pd.read_csv(align_dir + alignment, names=['name', 'start', 'end', 'text'])
    for row in df.itertuples(index=False):
      process_utterance(out_dir, f, yt_link_segment, row.start, row.end, row.name, row.text)
        
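To make the string handling in the loop easier to follow, here is a small sketch of how an identifier is taken from a WAV path, matched against alignment filenames, and turned into an output filename. The file names (`abc123.wav`, `abc123_0.csv`, and so on) and the helper names `segment_id` and `mel_filename` are hypothetical; the actual naming scheme of the alignment files is not shown in this notebook.

```python
def segment_id(yt_link, alignment_filename):
    """Keep whatever follows yt_link in the alignment filename, minus '.csv'."""
    suffix = alignment_filename.split(yt_link)[1].split('.csv')[0]
    return yt_link + suffix


def mel_filename(segment, utterance_name):
    """Build the output filename for one utterance."""
    return segment + '_' + utterance_name + '-mel.npy'


# Hypothetical inputs, for illustration only.
wav_path = '/content/gdrive/My Drive/no_music/abc123.wav'
alignment_files = ['abc123_0.csv', 'abc123_1.csv', 'xyz789_0.csv']

yt_link = wav_path.split('no_music/')[1].split('.wav')[0]
matching = [x for x in alignment_files if yt_link in x]
segments = [segment_id(yt_link, x) for x in matching]

print(yt_link)
print(segments)
print(mel_filename(segments[0], 'utt1'))
```

Note that the match is a substring test, so an identifier that happens to be contained in another identifier would also pick up that other file's alignments.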