# Speech Recognition with Wav2Vec

Originally based on the article: https://pub.towardsai.net/speech-to-text-with-wav2vec-2-0-b21c1e1ad701

Blog on Wav2Vec 2.0: https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/

Initial code drawn from https://github.com/sdhilip200/speech-to-text, but modified because librosa couldn't run on AWS instances (sndfile couldn't be installed).

Detailed article: https://maelfabien.github.io/machinelearning/wav2vec/#5-the-code



In [1]:
!pip install -r requirements.txt



# HuggingFace Library

In [2]:
# Import the necessary libraries

# audio2numpy loads audio files as NumPy arrays
from audio2numpy import open_audio

# Import PyTorch
import torch

# Import the Wav2Vec2 processor and CTC model
# from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer  # Deprecated
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

In [3]:
# Load the pretrained Wav2Vec 2.0 processor and model

# tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h") # Deprecated
# Note: the variable is still called `tokenizer`, but it now holds a Wav2Vec2Processor
tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# audio_file = "taken_clip.wav" 
# audio_file = "timeout.wav"
# audio_file = "preamble_out.wav"
# audio_file = "gettysburg10_out.wav"
audio_file = "1249120_1853182_11719913.wav"

In [5]:
# Play the selected audio clip

import IPython.display as display
display.Audio(audio_file, autoplay=True)

In [6]:
import numpy as np
from scipy import interpolate

def sample(audio_file, sample_rate=16000):
    '''
    Load an audio file and resample it to the designated sample rate.
    Returns a 1-D array (channel 0 only for multi-channel files).
    '''

    audio, rate = open_audio(audio_file)

    print(f"Audio file: {audio_file} @ {rate} Hz, duration: {audio.shape[0] / rate} sec")

    # Check ndim rather than shape[1]: a mono file loaded as a 1-D array
    # has no second axis, so audio.shape[1] would raise an IndexError.
    if audio.ndim == 2:
        # Stereo channel
        print("Stereo - channel 0 only used")
        audio_arr = audio[:, 0]
    else:
        # Mono channel
        print("Mono")
        audio_arr = audio

    if rate == sample_rate:
        return audio_arr

    # Resample by linear interpolation onto a new time grid
    print(f"Resample to {sample_rate} Hz")
    duration = audio_arr.shape[0] / rate

    time_old = np.linspace(0, duration, audio_arr.shape[0])
    time_new = np.linspace(0, duration, int(audio_arr.shape[0] * sample_rate / rate))

    interpolator = interpolate.interp1d(time_old, audio_arr.T)
    new_audio = interpolator(time_new).T

    # To keep the result, it could be written out with scipy.io.wavfile.write(path, sample_rate, new_audio)
    return new_audio
    

In [7]:
audio_arr = sample(audio_file)

Audio file: 1249120_1853182_11719913.wav @ 44100 Hz, duration: 2.1362131519274374 sec
Stereo - channel 0 only used
Resample to 16000 Hz
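
The resampling step in `sample` is plain linear interpolation. As a standalone illustration, here is a minimal sketch of the same idea using only NumPy. The function name `resample_linear` and the synthetic 440 Hz tone are our own assumptions for the example, not part of the original notebook, and the time grid here excludes the end point, which differs slightly from `sample`.

```python
import numpy as np

def resample_linear(signal, orig_rate, target_rate):
    """Resample a 1-D signal by linear interpolation (no anti-aliasing filter)."""
    signal = np.asarray(signal, dtype=np.float64)
    if orig_rate == target_rate:
        return signal
    duration = signal.shape[0] / orig_rate
    n_new = int(signal.shape[0] * target_rate / orig_rate)
    # Time stamps of the original and the new samples, in seconds
    t_old = np.linspace(0.0, duration, signal.shape[0], endpoint=False)
    t_new = np.linspace(0.0, duration, n_new, endpoint=False)
    return np.interp(t_new, t_old, signal)

# Hypothetical input: one second of a 440 Hz tone sampled at 44.1 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
resampled = resample_linear(tone, 44100, 16000)
print(resampled.shape)
```

Linear interpolation does not low-pass filter the signal first, so downsampling this way can alias high frequencies. A polyphase resampler such as `scipy.signal.resample_poly` is probably a safer choice when SciPy is available.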


In [9]:
# Convert the audio array into model input values (a PyTorch tensor)

input_values = tokenizer(audio_arr, sampling_rate=16000, return_tensors="pt").input_values

In [10]:
input_values

tensor([[ 8.7039e-05,  2.6708e-03,  8.7039e-05,  ..., -2.4967e-03,
         -5.4253e-04,  8.7039e-05]])
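
As far as we understand, the processor for this checkpoint rescales each waveform to zero mean and unit variance before it is passed to the model. A minimal sketch of that step, assuming this is the normalisation applied; the function name, the `eps` value and the random test signal are our own choices for illustration:

```python
import numpy as np

def zero_mean_unit_variance(x, eps=1e-7):
    """Rescale a waveform to zero mean and (approximately) unit variance."""
    x = np.asarray(x, dtype=np.float32)
    # eps guards against division by zero on silent clips (assumed value)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Hypothetical waveform: quiet noise with a small DC offset
rng = np.random.default_rng(0)
raw = 0.1 * rng.standard_normal(16000) + 0.02
normalized = zero_mean_unit_variance(raw)
print(normalized.mean(), normalized.std())
```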

In [11]:
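# Compute logits (non-normalized prediction scores) for every audio frame
# no_grad: inference only, so gradients are not tracked
with torch.no_grad():
    logits = model(input_values).logits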

In [12]:
# Take the most likely token ID for each frame
prediction = torch.argmax(logits, dim=-1)


In [13]:
# Pass the predicted IDs to the processor's batch_decode to get the transcription
transcription = tokenizer.batch_decode(prediction)[0]
print(transcription)

I CAN'T MOVE MY HEAD UP AND DOWN
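
The argmax followed by `batch_decode` amounts to greedy CTC decoding: pick the best token per frame, collapse consecutive repeats, then drop the blank token. A minimal sketch on a toy example follows. The four-token vocabulary, the blank ID of 0 and the `|` word delimiter are assumptions made for illustration and are not read from the real model's vocabulary.

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding for one utterance; logits has shape (frames, vocab_size)."""
    ids = np.argmax(logits, axis=-1)
    # Collapse runs of the same ID into a single occurrence
    collapsed = [ids[0]] + [b for a, b in zip(ids[:-1], ids[1:]) if b != a]
    # Drop blanks, then map the remaining IDs to characters
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()

# Hypothetical vocabulary: index 0 plays the role of the CTC blank
toy_vocab = ["<pad>", "|", "H", "I"]
# Per-frame best IDs: H H blank I I | blank
frame_ids = [2, 2, 0, 3, 3, 1, 0]
toy_logits = np.eye(len(toy_vocab))[frame_ids]  # one-hot rows as fake logits
decoded = greedy_ctc_decode(toy_logits, toy_vocab)
print(decoded)
```

The blank token is what lets CTC emit a doubled letter: two identical characters are only kept as two when a blank frame sits between them.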


# Medical Speech Tests

Let's check the model against data from a medical speech test set.

We use some sample data from this Kaggle dataset:
https://www.kaggle.com/paultimothymooney/medical-speech-transcription-and-intent

The audio files and overview-of-recordings.csv are placed in a separate directory, test_file_path, which is not committed to the repo.


In [16]:
test_file_path = "../medical_speech_transcription_and_intent/"

In [17]:
!ls {test_file_path}

1249120_44246595_89466082.wav  1249120_44246595_94621639.wav
1249120_44246595_89680770.wav  1249120_44246595_95208403.wav
1249120_44246595_92310242.wav  overview-of-recordings.csv
1249120_44246595_92462780.wav


In [32]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

recordings = pd.read_csv(test_file_path + "overview-of-recordings.csv")

In [27]:
import os

def all_wav_files(test_file_path):
    files = []
    for file in os.listdir(test_file_path):
        if file.endswith(".wav"):
            files.append(file)            
    return files

In [28]:
all_wav_files(test_file_path)

['1249120_44246595_94621639.wav',
 '1249120_44246595_92310242.wav',
 '1249120_44246595_92462780.wav',
 '1249120_44246595_89680770.wav',
 '1249120_44246595_95208403.wav',
 '1249120_44246595_89466082.wav']

In [39]:
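def evaluate_transcript(idx):
    '''
    Transcribe the recording at positional index idx and print it next to the reference phrase
    '''
    # iloc is positional: idx is the row position in `recordings`, not a file ID
    recording = recordings.iloc[idx]

    audio_arr = sample(test_file_path + recording['file_name'])

    input_values = tokenizer(audio_arr, sampling_rate=16000, return_tensors="pt").input_values
    # Inference only, so gradients are not tracked
    with torch.no_grad():
        logits = model(input_values).logits
    prediction = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(prediction)[0]

    print(f"File: {recording['file_name']}")
    print(f"Actual transcription: {recording['phrase']}")
    print(f"Predicted:            {transcription}")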

In [66]:
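# Recordings whose audio file is available locally and that have no audible background noise
mask = (
    recordings['file_name'].isin(all_wav_files(test_file_path))
    & (recordings['background_noise_audible'] == "no_noise")
)
recordings[mask][['file_name', 'background_noise_audible', 'speaker_id', 'phrase']][20:40]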

| | file_name | background_noise_audible | speaker_id | phrase |
|---|---|---|---|---|
| 3931 | 1249120_44292353_47569087.wav | no_noise | 44292353 | I feel like I went to an acupuncture's practice and had 100 needles in my shoulder |
| 3932 | 1249120_44292353_53318642.wav | no_noise | 44292353 | My shower drain is full of hair every time. |
| 3933 | 1249120_44292353_69566092.wav | no_noise | 44292353 | I love to garden but I get a terrible twinge in my lower back when I lean over. |
| 3934 | 1249120_44292353_71029438.wav | no_noise | 44292353 | My neck is annoying me I can't sleep bacause of it |
| 3935 | 1249120_44323331_16590901.wav | no_noise | 44323331 | I feel severe itching in the skin with redness |
| 3936 | 1249120_44323331_22225771.wav | no_noise | 44323331 | I have a sharp pain in my ear |
| 3937 | 1249120_44323331_25968681.wav | no_noise | 44323331 | I have severe shoulder pain |
| 3938 | 1249120_44323331_26313266.wav | no_noise | 44323331 | I am worried how cold intolerant I am, I am always shivering, even out in the sun. |
| 3939 | 1249120_44323331_33624071.wav | no_noise | 44323331 | My body feels weak after my first day in the gym, why? |
| 3940 | 1249120_44323331_38538892.wav | no_noise | 44323331 | I had a sharp pain in my stomach |


In [70]:
evaluate_transcript(5040)

Audio file: ../medical_speech_transcription_and_intent/1249120_44246595_92462780.wav @ 44100 Hz, duration: 3.529410430839002 sec
Stereo - channel 0 only used
Resample to 16000 Hz
File: 1249120_44246595_92462780.wav
Actual transcription: He was discovered to have an open wound.
Predicted:            HE WAS DISCOVERED TO HAVE IN OPEN WOUNDS
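
To put a number on the gap between the actual and predicted transcription, a common choice is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal sketch, not part of the original notebook. The helper names and the normalisation (upper-casing and stripping punctuation other than apostrophes) are our own assumptions.

```python
import re

def normalize(text):
    """Upper-case, drop punctuation except apostrophes, and split into words."""
    return re.sub(r"[^A-Z' ]", "", text.upper()).split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("He was discovered to have an open wound.",
                      "HE WAS DISCOVERED TO HAVE IN OPEN WOUNDS")
print(f"WER: {wer:.2f}")  # 2 substitutions out of 8 reference words -> WER: 0.25
```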
