https://github.com/philipperemy/timit

https://www.kaggle.com/datasets/mfekadu/darpa-timit-acousticphonetic-continuous-speech?resource=download

https://www.kaggle.com/code/julwan/phoneme-recognition-with-wav2vec2

wav2vec 2.0 (used in the Kaggle notebook above) is a commonly used transformer-based model for speech recognition.

TIMIT contains 630 speakers, each of whom read 10 sentences: 2 dialect "shibboleth" sentences (SA), 5 phonetically compact sentences (SX), and 3 phonetically diverse sentences (SI).

- SA: expose the dialectal variants of the speakers.
- SX: provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest.
- SI: add diversity in sentence types and phonetic contexts.



| Sentence Type | #Sentences | #Speakers | Total | #Sentences/Speaker |
|---------------|------------|-----------|-------|--------------------|
| Dialect (SA)  |      2     |    630    | 1260  |         2          |
| Compact (SX)  |     450    |     7     | 3150  |         5          |
| Diverse (SI)  |    1890    |     1     | 1890  |         3          |
| Total         |    2342    |           | 6300  |        10          |
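
The totals in the table follow from multiplying the number of distinct sentences of each type by the number of speakers who read each one. A quick check, using only the figures from the table (the variable names are illustrative):

```python
# (distinct sentences, speakers per sentence) for each sentence type,
# copied from the table above
sentence_types = {
    "SA": (2, 630),
    "SX": (450, 7),
    "SI": (1890, 1),
}

# recordings per type = distinct sentences * speakers reading each sentence
recordings = {name: n * k for name, (n, k) in sentence_types.items()}
total_recordings = sum(recordings.values())
distinct_sentences = sum(n for n, _ in sentence_types.values())

print(recordings)                # {'SA': 1260, 'SX': 3150, 'SI': 1890}
print(total_recordings)          # 6300
print(distinct_sentences)        # 2342
print(total_recordings // 630)   # 10 sentences per speaker
```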


| Dialect Region (dr) | #Male     | #Female   | Total      |
|---------------------|-----------|-----------|------------|
|          1          | 31 (63%)  | 18 (37%)  | 49 (8%)    |
|          2          | 71 (70%)  | 31 (30%)  | 102 (16%)  |
|          3          | 79 (77%)  | 23 (23%)  | 102 (16%)  |
|          4          | 69 (69%)  | 31 (31%)  | 100 (16%)  |
|          5          | 62 (63%)  | 36 (37%)  | 98 (16%)   |
|          6          | 30 (65%)  | 16 (35%)  | 46 (7%)    |
|          7          | 74 (74%)  | 26 (26%)  | 100 (16%)  |
|          8          | 22 (67%)  | 11 (33%)  | 33 (5%)    |
|        Total        | 438 (70%) | 192 (30%) | 630 (100%) |

The dialect regions are:
- dr1: New England
- dr2: Northern
- dr3: North Midland
- dr4: South Midland
- dr5: Southern
- dr6: New York City
- dr7: Western
- dr8: Army Brat (moved around)


For each utterance, the following files are available:
- .wav file: the speech waveform.
- .phn file: time-aligned phonetic transcription.
- .wrd file: time-aligned word transcription.
- .txt file: orthographic transcription of the words the person said.
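
The `.phn` and `.wrd` files are plain text with one segment per line in the form `start_sample end_sample label`. The `load_data` helper used below lives in the custom `data_utils` module and is not shown here, so the following is only a minimal stand-in sketch of how such a file could be parsed. `parse_segments` is a hypothetical name, and the example text is illustrative rather than copied from a real file.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]


def parse_segments(text: str) -> List[Segment]:
    """Parse 'start end label' lines into (start_sample, end_sample, label)."""
    segments = []
    for line in text.splitlines():
        # maxsplit=2 keeps multi-word labels (e.g. a whole sentence) intact
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        segments.append((int(start), int(end), label.strip()))
    return segments


# Illustrative content only, not copied from a real TIMIT file.
example_phn = "0 2370 h#\n2370 3442 sh\n3442 5351 iy\n"
segments = parse_segments(example_phn)
print(segments)
```

For a real file, the text would come from something like `open(path).read()`. Because of `maxsplit=2`, the same function should also handle a `.txt` line, where the label is the full sentence.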

In [20]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [21]:
import glob
import librosa
import pandas as pd

# custom modules
from data_utils import *

Ideal data structure:
``` python
data = {
    "document1": {
        "word": pd.DataFrame,
        "phoneme": pd.DataFrame
    },
    "document2": {
        "word": pd.DataFrame,
        "phoneme": pd.DataFrame
    }
}
```
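
One way to start building this structure is to index the files first and load them afterwards. The sketch below is only an assumption about how that could look: `index_utterances` is a hypothetical helper, it keys each utterance by its path relative to the data root, and it stores file paths where the final structure would hold DataFrames.

```python
from pathlib import Path
from typing import Dict


def index_utterances(root: str) -> Dict[str, Dict[str, Path]]:
    """Map each utterance to the paths of its .WRD and .PHN files."""
    root_path = Path(root)
    index: Dict[str, Dict[str, Path]] = {}
    for phn_path in sorted(root_path.rglob("*.PHN")):
        wrd_path = phn_path.with_suffix(".WRD")
        if not wrd_path.exists():
            continue  # skip utterances without a word transcription
        # e.g. "TRAIN/DR4/MGAG0/SX61"
        key = phn_path.relative_to(root_path).with_suffix("").as_posix()
        index[key] = {"word": wrd_path, "phoneme": phn_path}
    return index
```

Calling `index_utterances("timit/data")` would give one entry per utterance; each path could then be replaced by the DataFrame returned from the loader of choice.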

## Load phoneme, word, and transcript data

In [22]:
# sx61
glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.*')

['timit/data/TRAIN/DR4/MGAG0/SX61.TXT',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WAV',
 'timit/data/TRAIN/DR4/MGAG0/SX61.PHN',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WRD',
 'timit/data/TRAIN/DR4/MGAG0/SX61.WAV.wav']

In [15]:
# phoneme
df_phoneme = load_data(glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.PHN')[0], "phoneme")
df_phoneme.tail()

Unnamed: 0,start_sample,end_sample,phoneme,diff_sample
32,36940,39000,#b,2060
33,39000,40200,#b,1200
34,40200,40960,#b,760
35,40960,41720,#b,760
36,41720,43680,h#,1960


In [None]:
# word
df_word = load_data(glob.glob('timit/data/TRAIN/DR4/MGAG0/SX61.WRD')[0], "word")
df_word.tail()

In [None]:
df_transcript = load_transcript('timit/data/TRAIN/DR4/MGAG0/SX61.TXT')
df_transcript.tail()

## MFCC
Note that we can always use a smaller `win_length` and `hop_length` for more fine-grained alignment.

If we choose a 25 ms window and a 5 ms hop at a 16 kHz sampling rate, then `win_length` is $25 \times 10^{-3} \times 16000 = 400$ samples and `hop_length` is $5 \times 10^{-3} \times 16000 = 80$ samples.

https://librosa.org/doc/main/generated/librosa.feature.mfcc.html#librosa.feature.mfcc
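
The millisecond-to-sample conversion above can be wrapped in a small helper so other window sizes are easy to try. `ms_to_samples` is a hypothetical name used only for this sketch.

```python
def ms_to_samples(duration_ms: float, sr: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sr / 1000))


sr = 16000  # sampling rate in Hz
win_length = ms_to_samples(25, sr)
hop_length = ms_to_samples(5, sr)
print(win_length, hop_length)  # 400 80
```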

In [23]:
df_mfcc = process_audio_file('timit/data/TRAIN/DR4/MGAG0/SX61.WAV',
                             win_length=400, 
                             hop_length=80)
display(df_mfcc.tail())

print(df_mfcc.shape)

Unnamed: 0,start_sample,end_sample,mfcc
541,43280,43680,"[-827.1031, 32.212986, 10.357775, 4.273916, -7.0693197, 0.10315089, -2.513186, -4.818401, -1.4071487, 8.893745, 11.381563, 7.5945396, 1.589353, 5.308113, 5.909975, -4.874711, -6.9147296, -0.20854214, 2.7190595, -4.6186132]"
542,43360,43760,"[-833.8036, 18.833408, 6.527279, 11.398876, 0.103251174, 0.8042935, 3.4333205, -4.118975, -3.5335937, 4.9041696, 8.634795, 5.300904, 7.984957, 11.643053, 4.904706, -4.5845375, 1.6711016, 6.4817266, 0.8232577, -5.371475]"
543,43440,43840,"[-845.2821, 3.8058267, 1.8199315, 11.270794, 5.8475585, -1.3674686, 2.779904, -9.955495, -3.1452274, 1.7140839, 5.052147, -0.19389728, 5.7216744, 8.643326, 0.7713368, 3.1126137, 10.645617, 11.829777, 8.475585, 5.1200647]"
544,43520,43920,"[-847.30914, 4.7726836, 6.2668734, 6.074581, 8.445786, -0.121844046, 0.40512764, -4.014574, 6.1389318, 1.0904772, -0.5691086, -4.44184, 5.037013, 10.637453, 5.1548405, 11.605476, 4.642511, 0.25984132, 11.158846, 11.875011]"
545,43600,44000,"[-847.56964, 9.320388, 15.307094, 7.9442987, 10.220013, 7.1344895, 3.296341, 6.3600864, 11.376036, 8.035162, 7.2890534, -1.6766523, 7.5888567, 8.748432, 10.811941, 20.659288, 4.826117, 0.39584345, 2.2651634, 3.354187]"


(546, 3)


### draft/explanation

`y`: the audio time series (waveform); `len(y)` is the total number of samples.

`sr`: the sampling rate, i.e. the number of samples per second.

In [None]:
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SX61.WAV', sr=16000)
y.shape, sr

In [None]:
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SX61.WAV')
y.shape, sr

In [None]:
# duration
60259/22050

We know `len(y)` = 43725 when the file is loaded at 16 kHz.

In [None]:
load_transcript('timit/data/TRAIN/DR4/MGAG0/SX61.TXT')

In [None]:
# try another 
y, sr = librosa.load('timit/data/TRAIN/DR4/MGAG0/SA1.WAV', sr=16000)
y.shape, sr

In [None]:
load_transcript('timit/data/TRAIN/DR4/MGAG0/SA1.TXT')

In [None]:
# number of samples = duration × sampling rate (sr, in Hz)
# Hz is the number of samples per second.
duration = len(y) / sr
duration

In [None]:
43725/duration  
# if an audio file is 2.732834467120181 seconds long and contains 43725 samples, it means that the audio file has a sampling rate of approximately 16000 Hz
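
The relationship between sample count, duration, and sampling rate can be checked with plain arithmetic. The sketch below uses only the two sample counts that appear in these notes: 60259 samples at `librosa.load`'s default rate of 22050 Hz, and 43725 samples at 16000 Hz. The function name is illustrative.

```python
def duration_seconds(n_samples: int, sr: int) -> float:
    """Duration of a signal with n_samples samples at sr samples per second."""
    return n_samples / sr


duration_22k = duration_seconds(60259, 22050)  # loaded at the default rate
duration_16k = duration_seconds(43725, 16000)  # loaded with sr=16000

# Both loads describe the same recording, so the durations should agree
# up to the rounding introduced by resampling.
print(duration_22k, duration_16k)

# Recover the sampling rate from a sample count and a duration.
estimated_sr = 43725 / duration_22k
print(round(estimated_sr))  # 16000
```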

In [None]:
n_mfcc = 20  # Number of MFCC coefficients
hop_length = 512  # Number of samples between consecutive frames (frame step)
win_length = 1024  # Length of the analysis window in samples (frame size)

In [None]:
1024/16000  # 64 ms

In [None]:
# Extract MFCC features
mfccs = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, win_length=win_length, n_mfcc=n_mfcc)

In [None]:
mfccs.shape
# The first dimension (20) represents the number of MFCC coefficients extracted. This is a common default value, as 20 coefficients are often sufficient to capture relevant information about the spectral characteristics of the audio signal.
# The second dimension (86) represents the number of frames extracted from the audio signal. The number of frames is determined by the duration of the audio signal and the frame length and hop length used in the feature extraction process.
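
The frame count can be predicted without running the extraction. Assuming librosa's default centered framing (`center=True`), the signal is padded so that the number of frames depends only on the signal length and `hop_length`, not on `win_length`. The helper name below is hypothetical, and 43725 is the sample count noted earlier for the file loaded at 16 kHz.

```python
def n_frames_centered(n_samples: int, hop_length: int) -> int:
    """Number of frames for a centered (padded) short-time analysis."""
    return 1 + n_samples // hop_length


print(n_frames_centered(43725, 512))  # 86
```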

## Align df_mfcc and df_phoneme 

`epi`, `pau`, and `h#` all mark silence (epenthetic silence, pause, and the utterance-boundary marker, respectively).

In [17]:
target_phoneme = ['epi', 'pau', 'h#', 'n', 'eh', 'v', 'axr']

# substitute every other phoneme with "#b"
df_phoneme['phoneme'] = df_phoneme['phoneme'].apply(lambda x: "#b" if x not in target_phoneme else x)
df_phoneme['phoneme'] = df_phoneme['phoneme'].apply(lambda x: "h#" if x in ['epi', 'pau'] else x)
df_phoneme.head()


Unnamed: 0,start_sample,end_sample,phoneme,diff_sample
0,0,2370,h#,2370
1,2370,3442,#b,1072
2,3442,5351,#b,1909
3,5351,5973,#b,622
4,5973,6863,#b,890


In [None]:
df_phoneme.loc[13:18]

In [None]:
target_range = df_mfcc[(df_mfcc['end_sample'] >= df_phoneme.loc[14, 'start_sample']) & (df_mfcc['start_sample'] <= df_phoneme.loc[18, 'end_sample'])]
target_range

Assuming a frame (400 samples, i.e. 25 ms) never overlaps more than 2 phonemes, we can use the following set of states to store the phoneme information for each frame.

```
h# (silence)
#b (other phonemes, noise)
#bn (bi-phonemes)
n
neh (bi-phonemes)
eh
ehv
v
axr
vaxr (bi-phonemes)
axr#b (bi-phonemes)
```

Each frame therefore gets an 11-dimensional vector, with one entry per state listed above.
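
A minimal sketch of how the 11 states could be mapped to vector positions is shown below. The index order simply follows the listing above, which is an assumption; the configuration actually used by the `data_utils` helpers may order or size the vector differently.

```python
from typing import List

# The 11 states from the listing above, in the same order (assumed order).
states = ["h#", "#b", "#bn", "n", "neh", "eh", "ehv", "v", "axr", "vaxr", "axr#b"]
state_to_index = {state: i for i, state in enumerate(states)}


def one_hot(state: str) -> List[float]:
    """Return a vector with 1.0 at the position of the given state."""
    vector = [0.0] * len(states)
    vector[state_to_index[state]] = 1.0
    return vector


print(len(states))     # 11
print(one_hot("neh"))
```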

In [None]:

df_phoneme

Example: a frame labeled 'n, eh'
Option 1: assign 50% probability to the bi-phoneme state 'neh'; the remaining 50% goes to 'n' and 'eh'.
Option 2: 

In [24]:
# Adjust display settings
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

new_df_mfcc = label_df_mfcc(df_mfcc, df_phoneme) 
new_df_mfcc[(new_df_mfcc['start_sample']>=16000) & (new_df_mfcc['end_sample']<=21000)] 

Unnamed: 0,start_sample,end_sample,mfcc,phoneme
200,16000,16400,"[-484.46353, 62.9625, 43.536667, 36.161552, -33.775795, 7.7863464, -7.81036, -32.905396, 0.38289544, 6.5445538, -3.362215, -19.924276, -11.869465, 11.452057, 8.279446, -4.9956255, 0.43326283, 6.2694674, -7.0054655, 1.9755795]",#b
201,16080,16480,"[-493.01233, 40.778465, 48.444675, 40.07834, -23.769608, 17.849758, -14.231265, -26.100803, 4.215081, 9.904649, -1.5833447, -25.324093, -15.639889, 9.591164, 6.5180416, -3.9707384, 3.0204115, 7.1835256, -6.532056, 6.8589716]",#b
202,16160,16560,"[-496.74826, 30.404907, 50.27643, 51.164154, -18.135803, 14.162009, -17.96069, -22.23743, 3.14307, 4.2015414, -6.2886233, -20.614836, -12.755966, 6.4162655, 2.4632564, -4.287204, 13.980566, 10.460466, -11.342501, 1.995001]",#b
203,16240,16640,"[-486.746, 15.234005, 45.627895, 56.397243, -22.324577, 1.3790563, -14.829262, -9.336773, 0.7815729, 6.828738, -5.83054, -16.375328, -7.6271615, 12.749347, 2.4277577, -11.2310505, 15.654282, 6.4201384, -8.722306, 1.0047979]",#b
204,16320,16720,"[-480.75903, 9.849446, 51.09784, 55.48481, -24.305824, 3.967523, -16.57162, -12.115267, -6.520217, 7.7633343, 5.4241953, -15.951579, -15.844574, 11.521286, 4.755411, -9.295454, 10.152107, -1.1387705, -3.4536805, 5.669768]",#b
205,16400,16800,"[-489.03186, -12.311717, 58.93512, 42.58722, -19.939394, 16.146267, -22.301338, -21.727514, 0.28492898, 6.2198315, 4.7558155, -15.748632, -9.075333, 11.358965, 5.0113773, -0.22935735, 13.739038, -1.828378, -5.681775, -1.4469748]","#b, n"
206,16480,16880,"[-504.2323, -29.32409, 67.523476, 37.44905, -13.633593, 23.534386, -28.00717, -22.00079, 14.396887, 8.668379, 2.1295342, -6.4604816, -1.8816123, 1.2565889, -2.6831994, -2.1007128, 10.023035, 2.7754722, 0.71725273, -5.889139]","#b, n"
207,16560,16960,"[-521.99133, -28.142538, 59.3443, 28.163754, -9.627581, 28.722317, -24.941507, -10.856624, 16.647911, -1.9732568, -4.1634145, -7.0630097, -4.5952353, -4.6961694, -6.205681, -11.348597, 8.628605, 0.5363274, -2.7605577, 0.04995489]","#b, n"
208,16640,17040,"[-531.2007, -17.439379, 42.591145, 7.9479475, -19.197601, 18.446426, -31.931095, -14.686649, 2.0308957, -7.3065987, 11.542793, -4.302381, -0.12107611, 13.267833, 5.3845015, -14.3902, 6.634276, -12.831449, -7.687578, 7.503215]","#b, n"
209,16720,17120,"[-558.5072, -11.389967, 43.460793, -2.4091964, -20.158928, 11.41258, -26.897453, -16.792953, -16.090136, -4.617997, 32.536724, 9.14871, 8.515188, 22.414402, 3.7902784, -8.4484825, 9.170767, -9.301361, 0.8911765, 8.615245]","#b, n"


In [25]:
new_df_mfcc2 = vectorize_label_df_mfcc(new_df_mfcc, df_phoneme)
new_df_mfcc2[(new_df_mfcc2['start_sample']>=16000) & (new_df_mfcc2['end_sample']<=21000)] 

The state n does not exist in the config.
The state eh does not exist in the config.
[... the two warnings above repeat many times; output truncated ...]

Unnamed: 0,start_sample,end_sample,mfcc,phoneme,state_weights,label
200,16000,16400,"[-484.46353, 62.9625, 43.536667, 36.161552, -33.775795, 7.7863464, -7.81036, -32.905396, 0.38289544, 6.5445538, -3.362215, -19.924276, -11.869465, 11.452057, 8.279446, -4.9956255, 0.43326283, 6.2694674, -7.0054655, 1.9755795]",#b,{'#b': 1.0},"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
201,16080,16480,"[-493.01233, 40.778465, 48.444675, 40.07834, -23.769608, 17.849758, -14.231265, -26.100803, 4.215081, 9.904649, -1.5833447, -25.324093, -15.639889, 9.591164, 6.5180416, -3.9707384, 3.0204115, 7.1835256, -6.532056, 6.8589716]",#b,{'#b': 1.0},"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
202,16160,16560,"[-496.74826, 30.404907, 50.27643, 51.164154, -18.135803, 14.162009, -17.96069, -22.23743, 3.14307, 4.2015414, -6.2886233, -20.614836, -12.755966, 6.4162655, 2.4632564, -4.287204, 13.980566, 10.460466, -11.342501, 1.995001]",#b,{'#b': 1.0},"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
203,16240,16640,"[-486.746, 15.234005, 45.627895, 56.397243, -22.324577, 1.3790563, -14.829262, -9.336773, 0.7815729, 6.828738, -5.83054, -16.375328, -7.6271615, 12.749347, 2.4277577, -11.2310505, 15.654282, 6.4201384, -8.722306, 1.0047979]",#b,{'#b': 1.0},"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
204,16320,16720,"[-480.75903, 9.849446, 51.09784, 55.48481, -24.305824, 3.967523, -16.57162, -12.115267, -6.520217, 7.7633343, 5.4241953, -15.951579, -15.844574, 11.521286, 4.755411, -9.295454, 10.152107, -1.1387705, -3.4536805, 5.669768]",#b,{'#b': 1.0},"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
205,16400,16800,"[-489.03186, -12.311717, 58.93512, 42.58722, -19.939394, 16.146267, -22.301338, -21.727514, 0.28492898, 6.2198315, 4.7558155, -15.748632, -9.075333, 11.358965, 5.0113773, -0.22935735, 13.739038, -1.828378, -5.681775, -1.4469748]","#b, n","{'#b': 0.975, 'n': 0.025}","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.975]"
206,16480,16880,"[-504.2323, -29.32409, 67.523476, 37.44905, -13.633593, 23.534386, -28.00717, -22.00079, 14.396887, 8.668379, 2.1295342, -6.4604816, -1.8816123, 1.2565889, -2.6831994, -2.1007128, 10.023035, 2.7754722, 0.71725273, -5.889139]","#b, n","{'#b': 0.775, 'n': 0.225}","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.775]"
207,16560,16960,"[-521.99133, -28.142538, 59.3443, 28.163754, -9.627581, 28.722317, -24.941507, -10.856624, 16.647911, -1.9732568, -4.1634145, -7.0630097, -4.5952353, -4.6961694, -6.205681, -11.348597, 8.628605, 0.5363274, -2.7605577, 0.04995489]","#b, n","{'#b': 0.575, 'n': 0.425}","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.575]"
208,16640,17040,"[-531.2007, -17.439379, 42.591145, 7.9479475, -19.197601, 18.446426, -31.931095, -14.686649, 2.0308957, -7.3065987, 11.542793, -4.302381, -0.12107611, 13.267833, 5.3845015, -14.3902, 6.634276, -12.831449, -7.687578, 7.503215]","#b, n","{'#b': 0.375, 'n': 0.625}","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.375]"
209,16720,17120,"[-558.5072, -11.389967, 43.460793, -2.4091964, -20.158928, 11.41258, -26.897453, -16.792953, -16.090136, -4.617997, 32.536724, 9.14871, 8.515188, 22.414402, 3.7902784, -8.4484825, 9.170767, -9.301361, 0.8911765, 8.615245]","#b, n","{'#b': 0.175, 'n': 0.825}","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.175]"
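
The `state_weights` values in this output are consistent with each phoneme's share of the 400-sample frame. The sketch below reproduces that calculation under stated assumptions: `overlap_weights` is a hypothetical function, not the `vectorize_label_df_mfcc` implementation, and the `#b`/`n` boundary at sample 16790 is inferred from the weights rather than read from the `.PHN` file. The other segment endpoints are placeholders.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int, str]  # (start_sample, end_sample, label)


def overlap_weights(frame_start: int, frame_end: int,
                    segments: List[Segment]) -> Dict[str, float]:
    """Fraction of the frame covered by each label."""
    frame_len = frame_end - frame_start
    weights: Dict[str, float] = {}
    for seg_start, seg_end, label in segments:
        # length of the intersection between frame and segment
        overlap = min(frame_end, seg_end) - max(frame_start, seg_start)
        if overlap > 0:
            weights[label] = weights.get(label, 0.0) + overlap / frame_len
    return weights


# Assumed segments: only the boundary at 16790 is inferred from the output.
segments = [(16000, 16790, "#b"), (16790, 18000, "n")]
print(overlap_weights(16400, 16800, segments))  # {'#b': 0.975, 'n': 0.025}
print(overlap_weights(16000, 16400, segments))  # {'#b': 1.0}
```

Summing overlaps per label means that two adjacent segments with the same label (such as consecutive `#b` rows) contribute to a single weight.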


In [None]:
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.max_colwidth')

In [None]:
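# Draft: build a multi-hot phoneme label vector for every MFCC frame.
# Assumption: `phonemes` is the list of labels to encode; here it is taken
# from df_phoneme, which may differ from the list originally intended.
phonemes = sorted(df_phoneme['phoneme'].unique())
phoneme_labels = []

for _, row in df_mfcc.iterrows():
    # locate the sample range of the frame
    start_sample = row['start_sample']
    end_sample = row['end_sample']

    phoneme_vector = [0] * len(phonemes)

    # select every phoneme segment that overlaps the frame
    # (requiring full containment would miss phonemes longer than a frame)
    phoneme_subset = df_phoneme[(df_phoneme['start_sample'] < end_sample) & (df_phoneme['end_sample'] > start_sample)]

    for _, phoneme_row in phoneme_subset.iterrows():
        phoneme = phoneme_row['phoneme']
        if phoneme in phonemes:
            phoneme_index = phonemes.index(phoneme)
            phoneme_vector[phoneme_index] = 1

    phoneme_labels.append(phoneme_vector)

phoneme_labels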