https://github.com/philipperemy/timit

https://www.kaggle.com/datasets/mfekadu/darpa-timit-acousticphonetic-continuous-speech?resource=download

https://www.kaggle.com/code/julwan/phoneme-recognition-with-wav2vec2

wav2vec 2.0 is a commonly used transformer-based model for speech recognition (the Kaggle notebook above uses it for phoneme recognition).

630 speakers each read 10 sentences: 2 dialect "shibboleth" sentences (SA), 5 phonetically compact sentences (SX), and 3 phonetically diverse sentences (SI).

- SA: expose the dialectal variants of the speakers.
- SX: provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest.
- SI: add diversity in sentence types and phonetic contexts.



| Sentence Type | #Sentences | #Speakers/Sentence | Total | #Sentences/Speaker |
|---------------|------------|--------------------|-------|--------------------|
| Dialect (SA)  |      2     |        630         | 1260  |         2          |
| Compact (SX)  |     450    |         7          | 3150  |         5          |
| Diverse (SI)  |    1890    |         1          | 1890  |         3          |
| Total         |    2342    |                    | 6300  |        10          |


| Dialect Region (dr) | #Male     | #Female   | Total      |
|---------------------|-----------|-----------|------------|
|          1          | 31 (63%)  | 18 (37%)  | 49 (8%)    |
|          2          | 71 (70%)  | 31 (30%)  | 102 (16%)  |
|          3          | 79 (77%)  | 23 (23%)  | 102 (16%)  |
|          4          | 69 (69%)  | 31 (31%)  | 100 (16%)  |
|          5          | 62 (63%)  | 36 (37%)  | 98 (16%)   |
|          6          | 30 (65%)  | 16 (35%)  | 46 (7%)    |
|          7          | 74 (74%)  | 26 (26%)  | 100 (16%)  |
|          8          | 22 (67%)  | 11 (33%)  | 33 (5%)    |
|        Total        | 438 (70%) | 192 (30%) | 630 (100%) |

The dialect regions are:
- dr1: New England
- dr2: Northern
- dr3: North Midland
- dr4: South Midland
- dr5: Southern
- dr6: New York City
- dr7: Western
- dr8: Army Brat (moved around)


For each utterance, the following information is available:
- .wav file: the speech waveform.
- .phn file: Time-aligned phonetic transcription.
- .wrd file: Time-aligned word transcription.
- .txt file: Associated orthographic transcription of the words the person said.

Notes:
1. Pick keywords from `PROMPTS.TXT`.
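A minimal sketch of how keyword candidates could be pulled from `PROMPTS.TXT`. It assumes each prompt line has the shape `Sentence text. (sa1)` and that lines starting with a semicolon are comments, as the file header describes. The names `parse_prompts`, `count_words`, and `PROMPT_RE` are hypothetical helpers, and the sample lines are written out by hand rather than read from the real file.

```python
import re
from collections import Counter

# Assumed line shape: "Sentence text. (sa1)" -- sentence, then type+number in parentheses.
PROMPT_RE = re.compile(r"^(?P<text>.*?)\s*\((?P<sid>[a-z]+\d+)\)\s*$")


def parse_prompts(lines):
    """Return {sentence_id: text}, skipping blank lines and ';' comment lines."""
    prompts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        match = PROMPT_RE.match(line)
        if match:
            prompts[match.group("sid")] = match.group("text")
    return prompts


def count_words(prompts):
    """Case-insensitive word counts over the sentence texts only (IDs excluded)."""
    counts = Counter()
    for text in prompts.values():
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Hand-written sample in the assumed format; replace with open('PROMPTS.TXT') lines.
sample = [
    "; Lines beginning with a semicolon are comments",
    "She had your dark suit in greasy wash water all year. (sa1)",
    "Don't ask me to carry an oily rag like that. (sa2)",
]
prompts = parse_prompts(sample)
counts = count_words(prompts)
print(counts.most_common(5))
```

Skipping the comment lines and the sentence IDs keeps header words such as `File` or `sa1` out of the counts, and lower-casing merges entries like `She` and `she`.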

In [7]:
import json

# Load the saved word counts from JSON
with open('word_counts.json', 'r') as f:
    loaded_word_counts_dict = json.load(f)


In [8]:
loaded_word_counts_dict

{'File': 2,
 'prompts': 1,
 'txt': 1,
 'updated': 1,
 '10': 1,
 '31': 1,
 '88': 1,
 'Prompt': 1,
 'form': 7,
 'of': 405,
 'each': 19,
 'TIMIT': 1,
 'sentence': 2,
 'text': 2,
 'followed': 7,
 'by': 82,
 'type': 4,
 'and': 366,
 'number': 10,
 'Lines': 1,
 'beginning': 2,
 'with': 105,
 'a': 516,
 'semicolon': 1,
 'are': 151,
 'comments': 1,
 'should': 23,
 'be': 141,
 'ignored': 4,
 'on': 126,
 'searches': 1,
 'She': 44,
 'had': 62,
 'your': 46,
 'dark': 9,
 'suit': 1,
 'in': 276,
 'greasy': 1,
 'wash': 2,
 'water': 8,
 'all': 65,
 'year': 10,
 'sa1': 1,
 'Don': 7,
 't': 84,
 'ask': 4,
 'me': 62,
 'to': 394,
 'carry': 3,
 'an': 79,
 'oily': 3,
 'rag': 3,
 'like': 40,
 'that': 107,
 'sa2': 1,
 'This': 52,
 'was': 161,
 'easy': 6,
 'for': 149,
 'us': 31,
 'sx3': 1,
 'Jane': 1,
 'may': 41,
 'earn': 1,
 'more': 38,
 'money': 12,
 'working': 4,
 'hard': 11,
 'sx4': 1,
 'is': 258,
 'thinner': 1,
 'than': 21,
 'I': 38,
 'am': 3,
 'sx5': 1,
 'Bright': 1,
 'sunshine': 1,
 'shimmers': 1,
 'the':

In [2]:
import glob

In [3]:
test_wav_path = glob.glob('timit/data/TRAIN/DR2/*/*.WAV')[0]
test_wav_path

'timit/data/TRAIN/DR2/FJKL0/SX302.WAV'

In [19]:
test_wav_path_2 = glob.glob('timit/data/TRAIN/DR2/*/*.WAV')[1]
test_wav_path_2

'timit/data/TRAIN/DR2/FJKL0/SX212.WAV'

In [7]:
glob.glob('timit/data/TRAIN/DR2/FJKL0/SX302.*')

['timit/data/TRAIN/DR2/FJKL0/SX302.TXT',
 'timit/data/TRAIN/DR2/FJKL0/SX302.WAV',
 'timit/data/TRAIN/DR2/FJKL0/SX302.WAV.wav',
 'timit/data/TRAIN/DR2/FJKL0/SX302.WRD',
 'timit/data/TRAIN/DR2/FJKL0/SX302.PHN']

In [20]:
glob.glob('timit/data/TRAIN/DR2/FJKL0/SX212.*')

['timit/data/TRAIN/DR2/FJKL0/SX212.WAV.wav',
 'timit/data/TRAIN/DR2/FJKL0/SX212.TXT',
 'timit/data/TRAIN/DR2/FJKL0/SX212.WAV',
 'timit/data/TRAIN/DR2/FJKL0/SX212.PHN',
 'timit/data/TRAIN/DR2/FJKL0/SX212.WRD']
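The paths above encode most of the metadata from the tables: the split, the dialect region, the speaker (whose ID starts with `F` or `M` for sex), and the sentence type. A small sketch of reading that back out of a path, assuming the `.../<SPLIT>/<DRn>/<SPEAKER>/<SENTENCE>.<EXT>` layout seen in this copy of the dataset; `describe_utterance` and `DIALECT_REGIONS` are hypothetical names.

```python
from pathlib import PurePosixPath

# Region names as listed in the notes above.
DIALECT_REGIONS = {
    "DR1": "New England",
    "DR2": "Northern",
    "DR3": "North Midland",
    "DR4": "South Midland",
    "DR5": "Southern",
    "DR6": "New York City",
    "DR7": "Western",
    "DR8": "Army Brat (moved around)",
}


def describe_utterance(path):
    """Split a TIMIT file path into its metadata fields (layout is an assumption)."""
    p = PurePosixPath(path)
    split, region, speaker = p.parts[-4], p.parts[-3], p.parts[-2]
    # Take everything before the first dot so 'SX302.WAV.wav' also yields 'SX302'.
    sentence_id = p.name.split(".")[0]
    return {
        "split": split,
        "region": DIALECT_REGIONS.get(region.upper(), "unknown"),
        "speaker": speaker,
        "sex": {"F": "female", "M": "male"}.get(speaker[0].upper(), "unknown"),
        "sentence_type": sentence_id[:2].upper(),  # SA, SX, or SI
        "sentence_id": sentence_id,
    }


info = describe_utterance("timit/data/TRAIN/DR2/FJKL0/SX302.WAV")
print(info)
```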

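With default settings, librosa computes MFCCs on 2048-sample windows (`n_fft=2048`) with a hop of 512 samples (`hop_length=512`). Because `librosa.load` resamples audio to 22,050 Hz unless `sr` is given, each window spans about 93 ms and consecutive windows start about 23 ms apart, so the signal is divided into heavily overlapping frames. To get the 25 ms window and 10 ms hop common in speech work on 16 kHz TIMIT audio, load with `sr=None` and pass, for example, `n_fft=400, hop_length=160` to `librosa.feature.mfcc`.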

In [22]:
import librosa

def extract_mfcc(audio_path):
    # Load the .wav file. librosa.load resamples to 22,050 Hz by default;
    # pass sr=None to keep the file's native rate (16 kHz for TIMIT).
    # sample_rate: number of audio samples per second, in hertz (Hz)
    audio_data, sample_rate = librosa.load(audio_path)
    
    # Extract MFCC features
    mfcc = librosa.feature.mfcc(y=audio_data, sr=sample_rate)

    # Print the shape of the MFCC features
    print("MFCC shape:", mfcc.shape)
    
    return mfcc

extract_mfcc(test_wav_path)
extract_mfcc(test_wav_path_2)

MFCC shape: (20, 110)
MFCC shape: (20, 152)


array([[-7.0952979e+02, -7.1390491e+02, -7.2706232e+02, ...,
        -7.2968164e+02, -7.3138129e+02, -7.3147046e+02],
       [ 1.7904613e+01,  1.6811489e+01,  9.4369221e+00, ...,
         9.3325520e+00,  8.5217667e+00,  8.6797314e+00],
       [ 3.0789194e+00,  8.3299894e+00,  1.0474550e+01, ...,
         7.1994791e+00,  7.6631918e+00,  7.9200153e+00],
       ...,
       [ 1.1139317e+01,  8.2712784e+00,  1.0931711e+00, ...,
         1.1087195e+00,  3.6310863e+00,  4.2420616e+00],
       [ 1.2204434e+01,  9.0462980e+00,  2.1593666e+00, ...,
         1.7174146e-01,  2.8015797e+00,  3.5013776e+00],
       [ 2.9520707e+00,  4.6616049e+00,  2.2765219e+00, ...,
         1.4163768e-01,  2.2473769e+00,  3.1368017e+00]], dtype=float32)

In [26]:
def print_transcript(transcript_path):
    with open(transcript_path, 'r') as file:
        transcript = file.read()
        print(transcript)
        
# Read the transcripts (upper-case extension, so this also works on case-sensitive filesystems)
print_transcript(glob.glob('timit/data/TRAIN/DR2/FJKL0/SX302.TXT')[0])
print_transcript(glob.glob('timit/data/TRAIN/DR2/FJKL0/SX212.TXT')[0])

0 40551 Tofu is made from processed soybeans.

0 56116 I gave them several choices and let them set the priorities.



In [27]:
40551/110  # samples in SX302 (from the .TXT line) divided by its number of MFCC frames

368.6454545454545

In [28]:
56116/152  # same ratio for SX212

369.1842105263158
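Both ratios come out near 370 samples per frame, which fits a 512-sample hop at 22,050 Hz expressed in 16 kHz samples (512 × 16000 / 22050 ≈ 371.5). The sketch below reproduces the two frame counts from the sample counts. It assumes the end sample in each `.TXT` line equals the length of the waveform, that resampling rounds the length up, and that frames are centered (giving `1 + n // hop` frames); `expected_mfcc_frames` is a hypothetical helper, not a librosa function.

```python
import math


def expected_mfcc_frames(n_samples, native_sr=16000, target_sr=22050, hop_length=512):
    """Estimate the MFCC frame count for a clip loaded with librosa's default settings."""
    # Length after resampling from the native rate to the target rate (rounded up).
    resampled = math.ceil(n_samples * target_sr / native_sr)
    # With centered frames, one frame starts every hop_length samples, plus the first.
    return 1 + resampled // hop_length


print(expected_mfcc_frames(40551))  # 110, matching the shape (20, 110)
print(expected_mfcc_frames(56116))  # 152, matching the shape (20, 152)
```

The extra `+ 1` frame is why the observed ratios (about 369) sit slightly below 371.5.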

In [29]:
# Read the phonemes (upper-case extension, so this also works on case-sensitive filesystems)
phn_file = glob.glob('timit/data/TRAIN/DR2/FJKL0/SX302.PHN')[0]
with open(phn_file, 'r') as file:
    phn = file.readlines()
    # remove the newline character
    phn = [line.strip() for line in phn]
    print(phn)

['0 2230 h#', '2230 3440 t', '3440 5560 ow', '5560 7600 f', '7600 9900 uw', '9900 11200 ih', '11200 12254 z', '12254 13280 m', '13280 14955 ey', '14955 15560 dcl', '15560 16962 f', '16962 17591 r', '17591 18520 em', '18520 19240 pcl', '19240 20310 p', '20310 20929 r', '20929 22558 aa', '22558 24689 s', '24689 26120 eh', '26120 28860 s', '28860 31640 oy', '31640 32410 bcl', '32410 32590 b', '32590 34656 iy', '34656 36380 n', '36380 38380 z', '38380 40480 h#']
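Each entry above is `start_sample end_sample phone`. A minimal sketch of turning those strings into typed segments with times in seconds, assuming TIMIT's 16 kHz sampling rate; `Segment` and `parse_phn_lines` are hypothetical names, and only the first few entries from the output above are used.

```python
from typing import List, NamedTuple

SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz


class Segment(NamedTuple):
    start: int  # first sample of the segment
    end: int    # sample at which the segment ends
    label: str  # phone label, e.g. 'h#' for silence


def parse_phn_lines(lines) -> List[Segment]:
    """Parse 'start end label' strings into Segment tuples, skipping blank lines."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start, end, label = line.split(maxsplit=2)
        segments.append(Segment(int(start), int(end), label))
    return segments


# First entries copied from the printed list above.
phn = ['0 2230 h#', '2230 3440 t', '3440 5560 ow']
segments = parse_phn_lines(phn)
for seg in segments:
    print(f"{seg.label:>3}  {seg.start / SAMPLE_RATE:.3f}s - {seg.end / SAMPLE_RATE:.3f}s")
```

The same function should work on `.WRD` files, since they use the same three-column layout with a word in place of the phone.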
