https://github.com/philipperemy/timit

https://www.kaggle.com/datasets/mfekadu/darpa-timit-acousticphonetic-continuous-speech?resource=download

https://www.kaggle.com/code/julwan/phoneme-recognition-with-wav2vec2

wav2vec 2.0 is a widely used transformer-based model for speech recognition.

TIMIT contains 630 speakers, each of whom read 10 sentences: 2 dialect "shibboleth" sentences (SA), 5 phonetically compact sentences (SX), and 3 phonetically diverse sentences (SI).

- SA: expose the dialectal variants of the speakers.
- SX: provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest.
- SI: add diversity in sentence types and phonetic contexts.



| Sentence Type | #Sentences | #Speakers | Total | #Sentences/Speaker |
|---------------|------------|-----------|-------|--------------------|
| Dialect (SA)  |      2     |    630    | 1260  |         2          |
| Compact (SX)  |     450    |     7     | 3150  |         5          |
| Diverse (SI)  |    1890    |     1     | 1890  |         3          |
| Total         |    2342    |           | 6300  |        10          |


| Dialect Region (dr) | #Male   | #Female  | Total   |
|----------------------|---------|----------|---------|
|         1            | 31 (63%)| 18 (37%) | 49 (8%) |
|         2            | 71 (70%)| 31 (30%) |102(16%)|
|         3            | 79 (77%)| 23 (23%) |102(16%)|
|         4            | 69 (69%)| 31 (31%) |100(16%)|
|         5            | 62 (63%)| 36 (37%) | 98(16%)|
|         6            | 30 (65%)| 16 (35%) | 46 (7%)|
|         7            | 74 (74%)| 26 (26%) |100(16%)|
|         8            | 22 (67%)| 11 (33%) | 33 (5%)|
|        Total         | 438(70%)|192(30%)  |630(100%)|

The dialect regions are:
- dr1: New England
- dr2: Northern
- dr3: North Midland
- dr4: South Midland
- dr5: Southern
- dr6: New York City
- dr7: Western
- dr8: Army Brat (moved around)


For each utterance, the following information is available:
- .wav file: the speech waveform.
- .phn file: Time-aligned phonetic transcription.
- .wrd file: Time-aligned word transcription.
- .txt file: Associated orthographic transcription of the words the person said.
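
As a rough illustration of the `.phn` layout (one `start end label` triple per line, with positions given in samples), here is a minimal parser sketch. The function name `parse_phn` and the output keys are assumptions chosen to mirror the columns shown later in this notebook; this is not the `load_data` helper from `data_utils`, whose implementation is not shown here.

```python
def parse_phn(text):
    """Parse TIMIT-style .phn content: one 'start end label' triple per line."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        start, end = int(start), int(end)
        rows.append({
            "start_sample": start,
            "end_sample": end,
            "phoneme": label,
            "diff_sample": end - start,  # segment length in samples
        })
    return rows


# Lines in the layout of a .phn file (values mirror the SX61 rows shown later)
sample = """36940 39000 ih
39000 40200 f
40200 40960 tcl
40960 41720 t
41720 43680 h#
"""
rows = parse_phn(sample)
print(rows[0])
```

Passing `rows` to `pd.DataFrame` would give a table with the same columns as `df_phoneme`. The `.wrd` files use the same three-field layout, so the same sketch should apply with the label read as a word.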

In [150]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [151]:
import glob
import librosa
import pandas as pd

# custom modules
from data_utils import *

Ideal data structure (one entry per utterance, each holding a word-level and a phoneme-level DataFrame):
``` python
data = {
    "document1": {
        "word": pd.DataFrame,
        "phoneme": pd.DataFrame
    },
    "document2": {
        "word": pd.DataFrame,
        "phoneme": pd.DataFrame
    }
}
```
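
One possible way to get toward that structure is to group file paths by utterance first, and load each path into a DataFrame afterwards. The sketch below only does the grouping; `KIND_BY_SUFFIX` and `group_by_utterance` are hypothetical names, and the values are paths rather than DataFrames.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from TIMIT file extension to the key used in `data`
KIND_BY_SUFFIX = {".PHN": "phoneme", ".WRD": "word"}


def group_by_utterance(paths):
    """Group annotation file paths by utterance (the path without its extension)."""
    data = {}
    for p in map(PurePosixPath, paths):
        kind = KIND_BY_SUFFIX.get(p.suffix.upper())
        if kind is None:
            continue  # skip .TXT, .WAV and anything else not mapped above
        data.setdefault(str(p.with_suffix("")), {})[kind] = str(p)
    return data


paths = [
    "timit/data/TRAIN/DR4/MGAG0/SX61.TXT",
    "timit/data/TRAIN/DR4/MGAG0/SX61.WAV",
    "timit/data/TRAIN/DR4/MGAG0/SX61.PHN",
    "timit/data/TRAIN/DR4/MGAG0/SX61.WRD",
    "timit/data/TRAIN/DR4/MGAG0/SX61.WAV.wav",
]
data = group_by_utterance(paths)
print(data)
```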

## Load phoneme, word, and transcript data

In [193]:
# sx61
glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.*')

['timit/data/TRAIN/DR4/MGAG0/SX61.TXT',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WAV',
 'timit/data/TRAIN/DR4/MGAG0/SX61.PHN',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WRD',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WAV.wav']

In [194]:
# phoneme
df_phoneme = load_data(glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.PHN')[0], "phoneme")
df_phoneme.tail()

Unnamed: 0,start_sample,end_sample,phoneme,diff_sample
32,36940,39000,ih,2060
33,39000,40200,f,1200
34,40200,40960,tcl,760
35,40960,41720,t,760
36,41720,43680,h#,1960


In [195]:
# word
df_word = load_data(glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.WRD')[0], "word")
df_word.tail()

Unnamed: 0,start_sample,end_sample,word,diff_sample
4,20305,26520,fail,6215
5,26520,28393,as,1873
6,28393,29033,a,640
7,29033,36280,romantic,7247
8,36280,41720,gift,5440


In [196]:
df_transcript = load_transcript('timit/data/TRAIN/DR4/MGAG0/SX61.TXT')
df_transcript.tail()

Unnamed: 0,start_sample,end_sample,transcript,diff_sample
0,0,43725,Chocolate and roses never fail as a romantic g...,43725


## MFCC
Note that smaller `win_length` and `hop_length` values give a finer-grained alignment.

With a 25 ms window and a 5 ms hop at a 16 kHz sampling rate, `win_length` is $25 \times 10^{-3} \times 16000 = 400$ samples and `hop_length` is $5 \times 10^{-3} \times 16000 = 80$ samples.
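
The millisecond-to-sample conversion can be written as a small helper. `ms_to_samples` is a hypothetical name used only for this sketch, and the 16 kHz default is taken from the sampling rate used in this notebook.

```python
def ms_to_samples(duration_ms, sr=16000):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sr / 1000)


win_length = ms_to_samples(25)  # 25 ms analysis window
hop_length = ms_to_samples(5)   # 5 ms hop between frames
print(win_length, hop_length)
```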

https://librosa.org/doc/main/generated/librosa.feature.mfcc.html#librosa.feature.mfcc

In [202]:
df_mfcc = process_audio_file('timit/data/TRAIN/DR4/MGAG0/SX61.WAV',
                             win_length=400, 
                             hop_length=80)
df_mfcc.tail()

Unnamed: 0,start_sample,end_sample,mfcc
541,43280,43680,"[-827.1031, 32.212986, 10.357775, 4.273916, -7..."
542,43360,43760,"[-833.8036, 18.833408, 6.527279, 11.398876, 0...."
543,43440,43840,"[-845.2821, 3.8058267, 1.8199315, 11.270794, 5..."
544,43520,43920,"[-847.30914, 4.7726836, 6.2668734, 6.074581, 8..."
545,43600,44000,"[-847.56964, 9.320388, 15.307094, 7.9442987, 1..."


In [203]:
df_mfcc.shape

(546, 3)

In [204]:
400/16000

0.025

In [205]:
80/16000

0.005

### draft/explanation

y: the waveform as a 1-D array; its length is the total number of samples

sr: the sampling rate, i.e. the number of samples per second

In [None]:
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SX61.WAV', sr=16000)
y.shape, sr

((43725,), 16000)

In [200]:
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SX61.WAV')
y.shape, sr

((60259,), 22050)

In [201]:
# duration
60259/22050

2.732834467120181
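
When `sr` is not passed, `librosa.load` resamples to 22050 Hz, which is why the array length changes while the duration stays essentially the same. The arithmetic below is a sketch that assumes the resampled length is the original length scaled by the rate ratio and rounded up; that assumption is consistent with the 60259 samples observed above.

```python
import math

native_sr, native_len = 16000, 43725  # values from the sr=16000 load
default_sr = 22050                    # librosa.load default when sr is omitted

# Assumed rule: scale the length by the rate ratio and round up
expected_len = math.ceil(native_len * default_sr / native_sr)

duration_native = native_len / native_sr
duration_resampled = expected_len / default_sr
print(expected_len, duration_native, duration_resampled)
```

The two durations differ only by the rounding of the resampled length, well under a millisecond.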

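Loaded at the native 16 kHz rate, `y` has 43725 samples, which matches the `end_sample` recorded in the transcript file: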

In [198]:
load_transcript('timit/data/TRAIN/DR4/MGAG0/SX61.TXT')

Unnamed: 0,start_sample,end_sample,transcript,diff_sample
0,0,43725,Chocolate and roses never fail as a romantic g...,43725


In [197]:
# try another utterance
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SA1.WAV', sr=16000)
y.shape, sr

((45159,), 16000)

In [None]:
load_transcript('timit/data/TRAIN/DR4/MGAG0/SA1.TXT')

Unnamed: 0,start_time,end_time,transcript
0,0,45159,She had your dark suit in greasy wash water al...


In [None]:
# Number of samples = duration × sampling rate (sr, in Hz)
# Hz here means samples per second.
duration = len(y) / sr
duration

2.732834467120181

In [None]:
# If a 2.732834467120181 s recording contains 43725 samples,
# its sampling rate is approximately 16000 Hz.
43725/duration

15999.871388506283

In [None]:
n_mfcc = 20  # Number of MFCC coefficients
hop_length = 512  # Number of samples between consecutive frames (frame step)
win_length = 1024  # Length of the analysis window in samples (frame size)

In [None]:
1024/16000  # 64 ms

0.064

In [None]:
# Extract MFCC features
mfccs = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, win_length=win_length, n_mfcc=n_mfcc)

In [None]:
mfccs.shape
# First dimension (20): the number of MFCC coefficients (n_mfcc). 20 is also
# librosa's default and is often enough to capture the spectral shape.
# Second dimension (86): the number of frames, which depends on the number of
# samples in the signal and on the hop length.

(20, 86)
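
The frame count can be checked by hand. With librosa's default centered framing (`center=True`), the number of frames is `1 + n_samples // hop_length`. The sketch below assumes that default was in effect; `n_frames_centered` is a hypothetical helper name.

```python
def n_frames_centered(n_samples, hop_length):
    """Number of frames produced by centered framing (librosa's center=True)."""
    return 1 + n_samples // hop_length


# 43725 samples at 16 kHz with hop_length=512
print(n_frames_centered(43725, 512))
```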

In [None]:
mfcc = load_mfcc('timit/data/TRAIN/DR4/MGAG0/SX61.WAV')
mfcc.shape

(20, 86)

In [None]:
df_mfcc = construct_mfcc_df(mfcc)
df_mfcc.tail()

Unnamed: 0,start_sample,end_sample,mfcc
81,41472,42496,"[-494.7277, 7.1650486, -1.718417, 17.082987, -..."
82,41984,43008,"[-627.15405, 57.744026, -20.17612, 10.34127, -..."
83,42496,43520,"[-698.2599, 51.94442, -1.7764928, 8.506399, 11..."
84,43008,44032,"[-753.46246, 40.802723, 8.634056, 1.6109326, 1..."
85,43520,44544,"[-795.0889, 12.04735, 9.87564, 9.984465, 5.128..."


## Align df_mfcc and df_phoneme 

`epi`, `pau`, and `h#` all mark silence.

In [192]:
target_phoneme = ['h#', 'n', 'eh', 'v', 'axr']

# replace every non-target phoneme with "#b"
df_phoneme['phoneme'] = df_phoneme['phoneme'].apply(lambda x: "#b" if x not in target_phoneme else x)
df_phoneme.head()

Unnamed: 0,start_sample,end_sample,phoneme,diff_sample
0,0,2370,h#,2370
1,2370,3442,#b,1072
2,3442,5351,#b,1909
3,5351,5973,#b,622
4,5973,6863,#b,890


In [189]:
df_phoneme.loc[13:18]

Unnamed: 0,start_sample,end_sample,phoneme,diff_sample
13,15882,16790,#b,908
14,16790,17642,n,852
15,17642,18673,eh,1031
16,18673,19342,v,669
17,19342,20305,axr,963
18,20305,22950,#b,2645


In [207]:
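# keep the frames that overlap the span covered by phoneme rows 14-18
first_start = df_phoneme.loc[14, 'start_sample']
last_end = df_phoneme.loc[18, 'end_sample']
target_range = df_mfcc[(df_mfcc['end_sample'] >= first_start) &
                       (df_mfcc['start_sample'] <= last_end)]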
target_range

Unnamed: 0,start_sample,end_sample,mfcc
205,16400,16800,"[-489.03186, -12.311717, 58.93512, 42.58722, -..."
206,16480,16880,"[-504.2323, -29.32409, 67.523476, 37.44905, -1..."
207,16560,16960,"[-521.99133, -28.142538, 59.3443, 28.163754, -..."
208,16640,17040,"[-531.2007, -17.439379, 42.591145, 7.9479475, ..."
209,16720,17120,"[-558.5072, -11.389967, 43.460793, -2.4091964,..."
...,...,...,...
282,22560,22960,"[-482.33563, -27.126205, -14.6975765, -14.7098..."
283,22640,23040,"[-486.65332, -8.8072815, -24.472666, -2.108832..."
284,22720,23120,"[-511.60602, 11.586173, -31.901, 6.689104, -30..."
285,22800,23200,"[-525.69745, 65.10106, -14.114042, 24.237728, ..."


Assuming no phoneme segment is shorter than one frame (400 samples), a frame can overlap at most 2 phonemes, so each frame can be labeled with either a single phoneme or a pair of adjacent phonemes (a bi-phoneme):

```
h# (silence)
#b (other phonemes, noise)
#bn (bi-phonemes)
n
neh (bi-phonemes)
eh
v
axr
vaxr (bi-phonemes)
axr#b (bi-phonemes)
```

Each frame then gets a 10-dimensional label vector, with one position per label listed above.

In [173]:
# for each row in df_mfcc, find the corresponding phoneme
row = df_mfcc.loc[210]
row

start_sample                                                16800
end_sample                                                  17200
mfcc            [-623.5903, -1.6005177, 46.49632, -1.0682647, ...
Name: 210, dtype: object

In [174]:
phoneme_labels = []
# the 10 labels listed above, in the same order
phonemes = ['h#', '#b', '#bn', 'n', 'neh', 'eh', 'v', 'axr', 'vaxr', 'axr#b']

In [175]:
# locate the range of phoneme
start_sample = row['start_sample']
end_sample = row['end_sample']

phoneme_vector = [0] * 10

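# NOTE: this keeps only phonemes that lie entirely inside the frame. This frame is
# 400 samples long and sits inside `n` (852 samples), so nothing matches and the
# vector stays all zeros. An overlap test is needed to pick up `n`.
phoneme_subset = df_phoneme[(df_phoneme['start_sample'] >= start_sample) & (df_phoneme['end_sample'] <= end_sample)]

for _, phoneme_row in phoneme_subset.iterrows():
    phoneme = phoneme_row['phoneme']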
    if phoneme in phonemes:
        phoneme_index = phonemes.index(phoneme)
        phoneme_vector[phoneme_index] = 1

phoneme_labels.append(phoneme_vector)

phoneme_labels

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
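
The all-zero vector comes from requiring a phoneme to lie entirely inside the frame. A sketch of an overlap-based alternative follows. It assumes a frame's label is the concatenation of the phonemes it overlaps, in time order. `frame_label` and `one_hot` are hypothetical helpers, and the segments are copied from rows 13 to 18 of `df_phoneme` above.

```python
LABELS = ['h#', '#b', '#bn', 'n', 'neh', 'eh', 'v', 'axr', 'vaxr', 'axr#b']

# (start_sample, end_sample, phoneme), copied from rows 13-18 of df_phoneme
SEGMENTS = [
    (15882, 16790, '#b'),
    (16790, 17642, 'n'),
    (17642, 18673, 'eh'),
    (18673, 19342, 'v'),
    (19342, 20305, 'axr'),
    (20305, 22950, '#b'),
]


def frame_label(start, end, segments):
    """Concatenate, in time order, the phonemes that overlap [start, end)."""
    return ''.join(lab for s, e, lab in segments if s < end and e > start)


def one_hot(label, labels=LABELS):
    """Return a 0/1 vector with a 1 at the label's position, or all zeros if unknown."""
    vec = [0] * len(labels)
    if label in labels:
        vec[labels.index(label)] = 1
    return vec


# Frame 210 spans samples 16800-17200 and lies inside `n`
print(frame_label(16800, 17200, SEGMENTS), one_hot(frame_label(16800, 17200, SEGMENTS)))
```

Under this scheme, frame 205 (16400 to 16800) straddles the `#b`/`n` boundary and gets the label `#bn`. A frame straddling `eh` and `v` would produce `ehv`, which is not among the 10 labels, so its vector would stay all zeros unless the label list is extended.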