
# Preparing Data Pipeline

The training data is a subset of LibriSpeech, a corpus of read English speech derived from audiobooks. The `train-clean-100` subset used here contains about 100 hours of transcribed audio. You can download it, along with the `test-clean` split, using torchaudio:

In [1]:
!pip install torchaudio



In [2]:
import torchaudio

train_dataset = torchaudio.datasets.LIBRISPEECH("./", url="train-clean-100", download=True)
test_dataset = torchaudio.datasets.LIBRISPEECH("./", url="test-clean", download=True)



# Data Augmentation
Data augmentation helps mitigate overfitting during training and, at the same time, increases the variety of the dataset. For speech recognition, we can change the pitch or speed of the audio, inject noise, or add reverb.

We found **Spectrogram Augmentation** (SpecAugment) to be a much **simpler and more effective approach**. SpecAugment was first introduced in the paper SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, in which the authors found that simply masking random blocks of consecutive time steps and frequency channels in the spectrogram significantly improved the model's ability to generalize.

For frequency masking and time masking, we can use:

`torchaudio.transforms.FrequencyMasking()`

`torchaudio.transforms.TimeMasking()`
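
To make the idea concrete, here is a minimal pure-Python sketch of what frequency and time masking do to a spectrogram. It is a toy illustration, not the torchaudio implementation: the helper `mask_block`, its parameters, and the nested-list spectrogram are assumptions made for this example, whereas the torchaudio transforms operate on tensors.

```python
import random


def mask_block(spec, axis, max_width, rng, fill=0.0):
    """Return a masked copy of `spec` plus the (start, width) of the mask.

    `spec` is a 2D list indexed as spec[frequency][time].
    axis=0 masks consecutive frequency bins (rows),
    axis=1 masks consecutive time frames (columns).
    Hypothetical helper for illustration only.
    """
    n_rows, n_cols = len(spec), len(spec[0])
    size = n_rows if axis == 0 else n_cols
    width = rng.randint(0, min(max_width, size))  # block width, may be 0
    start = rng.randint(0, size - width)          # first masked index
    masked = [row[:] for row in spec]             # leave the input untouched
    for r in range(n_rows):
        for c in range(n_cols):
            idx = r if axis == 0 else c
            if start <= idx < start + width:
                masked[r][c] = fill
    return masked, (start, width)


rng = random.Random(0)                 # seeded so the example is repeatable
spec = [[1.0] * 8 for _ in range(6)]   # 6 frequency bins x 8 time frames

# Apply a frequency mask first, then a time mask on the result.
freq_masked, (f0, fw) = mask_block(spec, axis=0, max_width=2, rng=rng)
both_masked, (t0, tw) = mask_block(freq_masked, axis=1, max_width=3, rng=rng)

for row in both_masked:
    print(" ".join("." if v == 0.0 else "#" for v in row))
```

Each call blanks out one contiguous band, so the model never sees exactly the same spectrogram twice and cannot rely on any single frequency range or time span.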


In [17]:
# Map each character to its integer label (one "character index" pair per line)
char_map_str = """
 ' 0
 <SPACE> 1
 a 2
 b 3
 c 4
 d 5
 e 6
 f 7
 g 8
 h 9
 i 10
 j 11
 k 12
 l 13
 m 14
 n 15
 o 16
 p 17
 q 18
 r 19
 s 20
 t 21
 u 22
 v 23
 w 24
 x 25
 y 26
 z 27
 """


In [30]:
# Class that converts between transcript text and integer label sequences
class TextTransform:
    """Maps characters to integers and vice versa."""

    def __init__(self):
        self.char_map = {}
        self.index_map = {}
        for line in char_map_str.strip().split('\n'):
            char, index = line.strip().split()
            self.char_map[char] = int(index)
            self.index_map[int(index)] = char
        # Decode the <SPACE> token (label 1) back to a literal space
        self.index_map[1] = ' '

    def text_to_int(self, text):
        """Use each character as a key to look up its integer label."""
        int_sequence = []
        for c in text:
            if c == ' ':
                ch = self.char_map['<SPACE>']
            else:
                ch = self.char_map[c]
            int_sequence.append(ch)
        return int_sequence

    def int_to_text(self, labels):
        """Convert a sequence of integer labels back to a text string."""
        string = []
        for i in labels:
            string.append(self.index_map[i])
        return ''.join(string)
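
The round trip performed by `TextTransform` can be checked with a compact, self-contained sketch. The names `CHARS`, `CHAR_TO_INT`, `encode`, and `decode` are hypothetical and exist only for this example; they rebuild the same label layout as the character map above (apostrophe is 0, space is 1, `a` to `z` are 2 to 27). Lower-casing the input is an assumption made here because the map contains only lowercase letters, so uppercase transcripts would otherwise raise a `KeyError`.

```python
import string

# Same layout as the char map above: ' -> 0, space -> 1, a..z -> 2..27.
CHARS = ["'", " "] + list(string.ascii_lowercase)
CHAR_TO_INT = {ch: i for i, ch in enumerate(CHARS)}


def encode(text):
    """Text -> list of integer labels (input is lower-cased first)."""
    return [CHAR_TO_INT[ch] for ch in text.lower()]


def decode(labels):
    """List of integer labels -> text."""
    return "".join(CHARS[i] for i in labels)


labels = encode("HELLO WORLD")
print(labels)          # integer labels for each character
print(decode(labels))  # decoded text, lower-cased
```

Any character outside the map (digits, punctuation other than the apostrophe) is not handled, which matches the behavior of `TextTransform.text_to_int`.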