references:
https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
https://huggingface.co/facebook/wav2vec2-base-960h
https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text

source-code: https://github.com/facebookresearch/fairseq/blob/main/examples/speech_to_text/prep_mtedx_data.py

datasets: https://commonvoice.mozilla.org/en/datasets

requirements:
    torch -> https://pytorch.org/get-started/locally/
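    transformers -> https://github.com/huggingface/transformers
    speech_recognition -> https://github.com/Uberi/speech_recognition
    pydub -> https://github.com/jiaaro/pydub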
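
Before running the cells below, it can help to check that the required packages are importable. The following is a minimal sketch, not part of the original notebook; the helper name `missing_modules` and the `REQUIRED` list are assumptions taken from the import cell. Note that `sr.Microphone` also needs PyAudio installed, which this check does not cover.

```python
import importlib.util

# Import names assumed from the notebook's import cell (hypothetical list).
REQUIRED = ["torch", "transformers", "speech_recognition", "pydub"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report anything that still needs to be installed.
missing = missing_modules(REQUIRED)
print("missing:", missing if missing else "none")
```

Any name reported as missing has to be installed first, using the links in the requirements list above.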

In [5]:
# import libraries

import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import speech_recognition as sr
import io
from pydub import AudioSegment

In [2]:
# load processor (feature extractor + tokenizer) and CTC model
tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
r = sr.Recognizer()

with sr.Microphone(sample_rate=16000) as source:
    while True:
        audio = r.listen(source)  # sr.AudioData captured from the microphone
        data = io.BytesIO(audio.get_wav_data())  # WAV bytes wrapped in an in-memory file
        clip = AudioSegment.from_file(data)  # pydub AudioSegment
        x = torch.FloatTensor(clip.get_array_of_samples())  # 1-D float tensor of raw samples

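        inputs = tokenizer(x, sampling_rate=16000, return_tensors='pt', padding='longest').input_values
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(inputs).logits  # shape: (batch, frames, vocab)
        tokens = torch.argmax(logits, dim=-1)  # most likely token id per frame
        text = tokenizer.batch_decode(tokens)  # collapse repeats and drop CTC blanks

        print("You said: ", text[0].lower())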
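
The last two steps of the loop above (argmax over the vocabulary, then `batch_decode`) amount to greedy CTC decoding. The sketch below illustrates the idea in pure Python and is not the library's implementation. The toy vocabulary `VOCAB`, the blank id, and the helper names are assumptions made for illustration; the real model uses the vocabulary shipped with its checkpoint.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
VOCAB = ["<pad>", "|", "H", "E", "L", "O"]
BLANK_ID = 0    # assumed: the pad token doubles as the CTC blank
WORD_SEP = "|"  # assumed: word delimiter token


def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_ctc_decode(logits):
    """Decode a (frames x vocab) list of scores into text.

    1. Pick the best token id for every frame.
    2. Collapse runs of identical ids.
    3. Drop blanks, then map ids to characters.
    """
    ids = [argmax(frame) for frame in logits]
    collapsed = [token_id for token_id, _ in groupby(ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_SEP, " ").strip()


def one_hot(index, size=len(VOCAB)):
    """Fake per-frame scores with a single clear winner."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames: H H _ E L L _ L O |   (where _ is the blank)
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 1]
logits = [one_hot(i) for i in frame_ids]
print(greedy_ctc_decode(logits))
```

The blank between the two runs of `L` is what lets the decoder keep a genuine double letter while still merging the repeated frames of a single character.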