## Configure Training Process

For configuring the training process, you can utilize the configuration files already provided by ESPnet contributors. To use a configuration file, you'll need to create a YAML file on your local machine. For instance, you can use the [e-branchformer config](train_asr_e-branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml).

In [None]:
%%writefile ./train.yml
# network architecture

# frontend related
# This is a configuration file for training an ASR model using ESPnet.
# It specifies the architecture, optimization settings, and data processing parameters.
# The model uses a conformer encoder and a transformer decoder with hybrid CTC/attention loss
# Since we compare CTC and attention decoding, we keep the encoder input identical. 
# 80-dim log-mel is ESPnet's default input representation.

frontend: default
frontend_conf:
    n_fft: 512 # Number of FFT points for STFT, 32 ms @ 16 kHz
    win_length: 400 # Window length for STFT, 25 ms @ 16 kHz
    hop_length: 160 # Hop length for STFT, 10 ms @ 16 kHz

# this configures how audio is converted to spectrograms -> the input
# representation for the neural network


# encoder related
# The encoder is a conformer model, which combines convolutional and self-attention mechanisms.
# It consists of 12 blocks with a linear unit size of 2048 and an output size of 256.
# The attention mechanism uses 4 heads, and the model employs relative positional encoding.
# The activation function is Swish, and it uses a macaron style with CNN modules
# with a kernel size of 15. 
# A conformer combines the strengths of convolutional neural networks (CNNs, local feature extraction) and 
# transformers (long-range dependencies). Why conformer?
# Convolution (local invariance, which means it can recognize patterns regardless of their position in the input sequence) 
# is good for local feature extraction, while self-attention (global invariance) 
# is good for long-range dependencies.
# This is particularly good for small datasets, because the CNN module works as an inductive bias 
# (i.e., it helps the model generalize better from limited data), and SpecAugment makes it robust to noise.

encoder: conformer
encoder_conf:
    input_layer: conv2d # strided 2-D CNN front
    num_blocks: 12
    linear_units: 2048 # FFN inner-dim 
    dropout_rate: 0.1
    output_size: 256 # hidden dim per frame
    attention_heads: 4
    attention_dropout_rate: 0.0
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    activation_type: swish
    macaron_style: true
    use_cnn_module: true
    cnn_module_kernel: 15


# decoder related
# The decoder is a transformer model with 6 blocks, each having a linear unit size of 2048.
# It uses an embedding layer for input and has a dropout rate of 0.1.
# The transformer decoder is designed to handle sequential data.
decoder_conf:
    input_layer: embed 
    num_blocks: 6
    linear_units: 2048
    dropout_rate: 0.1

# The decoder generates text sequences from the encoded audio features.

# hybrid CTC/attention
# The model uses a hybrid CTC/attention mechanism for training.
# CTC (Connectionist Temporal Classification) is used for sequence-to-sequence tasks where the alignment 
# between input and output sequences is not known.
# The attention mechanism allows the model to focus on specific parts of the input sequence when generating each output token.
# Therefore it provides better sequence modeling.
# The CTC weight is set to 0.3, and the label smoothing weight is set to 0.1.
# This weight can be adjusted to compare CTC and attention decoding.

# Length
# ctc weight 1 - pure CTC, no auto-reg decoding -> can test greedy CTC decoding
# ctc weight 0 - pure attention, no CTC loss -> can test autoreg decoding
# ctc_weight: 0.3 - 
model_conf:
    ctc_weight: 0.3 # 0.3 * CTC loss + 0.7 * attention loss
    lsm_weight: 0.1 # label smoothing weight, which helps prevent overfitting by smoothing the target labels
    length_normalized_loss: false

# optimization related
# The optimizer used is Adam with a learning rate of 4.0.
# Gradient accumulation is set to 1, meaning gradients are updated after each batch.
# Gradient clipping is set to 3 to prevent exploding gradients.
# The maximum number of epochs for training is set to 50.
# The Noam learning rate scheduler is used with a model size of 256 and a warmup period of 25000 steps.
# The Noam scheduler gradually increases the learning rate during the warmup phase and then decreases it.
# This helps stabilize training in the initial stages.
optim: adam
accum_grad: 1
grad_clip: 3
max_epoch: 50
optim_conf:
    lr: 4.0 #  # This is NOT a raw LR!  ESPnet’s Noam LR = scale * d_model^-0.5
scheduler: noamlr
scheduler_conf:
    model_size: 256
    warmup_steps: 25000



# minibatch
# the batch means the number of samples in each training step. The type of batch is 'numel', which means 
# the batch size is determined by the total number of elements in the batch.
# The batch_bins parameter specifies the number of bins for batching, which is set to 10 million.
batch_type: numel
batch_bins: 10000000

# the best_model_criterion specifies the metric used to select the best model during training.
# In this case, it uses validation accuracy ('valid' and 'acc') and selects the maximum value.
# This means the model with the highest validation accuracy will be saved as the best model.
# The keep_nbest_models parameter specifies how many of the best models to keep during training.
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 3
# valid/acc in ESPnet = 1 – WER during dev decoding (because higher = better).


# SpeAugment is a data augmentation technique that applies various transformations to the input audio data.
# It includes time warping, frequency masking, and time masking to improve the model's robustness. 
# It masks certain frequency bands and time segments in the spectrogram to simulate noise and variability in the data.
# This helps the model generalize better to unseen data.
# The configuration specifies that time warping is applied with a window size of 5, 
# frequency masking is applied with a width range of 0 to 30, and time masking is applied with a width range of 0 to 40.
# The number of frequency masks is set to 2, and the number of time masks is set to 2.
# The time warping mode is set to bicubic interpolation, which provides smooth transformations.

specaug: specaug # Data augmentation
specaug_conf: 
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
    

# The training process will not use any visualization tools like Matplotlib or TensorBoard as of now.
# But in the future, we will work on this.
use_matplotlib: false
use_tensorboard: false

Overwriting ./train.yml


In [None]:
# preprocessing configuration
# This section specifies the preprocessing steps for the input data.
# It includes the use of a preprocessor, tokenization method (BPE), and various audio processing parameters.
# The preprocessor is set to 'default', and the configuration includes parameters for speech normalization, noise application, and tokenization.
# The BPE model is specified, and the token list is provided for the model to understand the vocabulary.
# The configuration also allows for the application of RIR (Room Impulse Response) and noise to the audio data, with specified probabilities for their application.
# The speech volume normalization and non-linguistic symbols are set to null, indicating no specific processing for these aspects.
# The g2p (grapheme-to-phoneme) and cleaner configurations are also set to null, meaning no specific processing is applied for these tasks.
# The token list is specified, which contains the vocabulary used for tokenization.
# The preprocessor_conf section specifies the names of the speech and text fields in the data.
# The use_preprocessor flag indicates whether to use a preprocessor for the data.



%%writefile ./preprocess.yaml
use_preprocessor: true

token_type: bpe # Tokenization method, same for both models
bpemodel: data/bpemodel/bpe.model # BPE model file
rir_scp: null # Room Impulse Response script
rir_apply_prob: 1.0 # Probability of applying RIR -> currently 1.0. RIR simulates the effect of different room acoustics on the audio data.
noise_scp: null # Noise script
noise_apply_prob: 1.0 # Probability of applying noise, currently 1.0
noise_db_range: '13_15' # Range of noise dB levels to apply
speech_volume_normalize: null # Speech volume normalization
non_linguistic_symbols: null 

cleaner: null # Text cleaner
g2p: null # Grapheme-to-phoneme conversion, null means no conversion
preprocessor: default 
preprocessor_conf:
  speech_name: speech # Name of the speech field in the data
  text_name: text # Name of the text field in the data

token_list: data/bpemodel/tokens.txt

Overwriting ./preprocess.yaml


## Training

To prepare the stats file before training, you can execute the `collect_stats` method. This step is required before the training process and ensuring accurate statistics for the model.


In [5]:
# !rm -r ./exp/stats
# !rm -r ./exp/train_asr_branchformer_e24_amp

In [None]:
# This script sets up the training configuration for an ASR model using ESPnet.

import espnetez as ez # Importing the espnetez library for ASR tasks
import yaml # Importing the yaml library for reading configuration files

data_info = { # Data information dictionary
    # This dictionary specifies the structure of the data used for training.
    # It includes the names of the fields for speech and text data.
    # The "speech" field contains the audio data, and the "text" field contains
    # the corresponding transcriptions.
    # The "wav.scp" file contains the audio file paths, and the "text" file contains the transcriptions.
    # This structure is used to load and preprocess the data for training the ASR model.
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}

EXP_DIR = "exp/train_asr_branchformer_e24_amp" # Directory where the training outputs will be saved
# This directory will contain the trained model, logs, and other related files.
STATS_DIR = "exp/stats" # Directory where the statistics of the training data will be saved

# load config
# For the configuration, please refer to the last cell in this notebook.
training_config = ez.config.from_yaml(
    "asr",
    "train.yml",
)

with open("preprocess.yaml") as stream:
    preprocessor_config = yaml.safe_load(stream)
    training_config.update(preprocessor_config)

with open(preprocessor_config["token_list"], "r") as f:
    training_config["token_list"] = [t.replace("\n", "") for t in f.readlines()]

# When you don't use yaml file, you can load finetune_config in the following way.
# task_class = ez.task.get_ez_task("asr")
# default_config = task_class.get_default_config()
# training_config = default_config.update(your_config_in_dict)

# Define the Trainer class
# The Trainer class is responsible for managing the training process, including data loading, model training, and evaluation.
# It takes various parameters such as the task type, training configuration, data information, output directory, and statistics directory.
# The ngpu parameter specifies the number of GPUs to use for training (0 for CPU, 1 for GPU).
# The collect_stats method is called to gather statistics from the training data, which is necessary for normalization and other preprocessing steps.
trainer = ez.Trainer(
    task='asr',
    train_config=training_config,
    train_dump_dir="dump/train",
    valid_dump_dir="dump/valid",
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=1, # number of GPU, change to 0 if run on CPU, 1 if run on GPU (colab)
)
trainer.collect_stats() # ESPnet requires statistics of the training data for normalization and other preprocessing steps.

/opt/conda/envs/espnet/bin/python /opt/conda/envs/espnet/lib/python3.10/site-packages/ipykernel_launcher.py -f /home/jovyan/.local/share/jupyter/runtime/kernel-510df032-2600-4a04-accf-ec5898abdc32.json
[jupyter-wpc0385] 2025-07-03 10:33:12,091 (asr:523) INFO: Vocabulary size: 1000
[jupyter-wpc0385] 2025-07-03 10:33:12,472 (abs_task:1383) INFO: pytorch.version=2.5.1, cuda.available=True, cudnn.version=90100, cudnn.benchmark=False, cudnn.deterministic=True
[jupyter-wpc0385] 2025-07-03 10:33:12,477 (abs_task:1384) INFO: Model structure:
ESPnetASRModel(
  (frontend): DefaultFrontend(
    (stft): Stft(n_fft=512, win_length=400, hop_length=160, center=True, normalized=False, onesided=True)
    (frontend): Frontend()
    (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
  )
  (specaug): SpecAug(
    (time_warp): TimeWarp(window=5, mode=bicubic)
    (freq_mask): MaskAlongAxis(mask_width_range=[0, 30], num_mask=2, axis=freq)
    (time_mask): MaskAlongAxis(mask_wid

Finally, we are ready to begin the training process!

In [None]:
trainer.train() # Start the training process using the Trainer class.
# This executes the training loop.
# 1. Forward pass: The model processes the input data and generates predictions.
# 2. Loss calculation: The loss function computes the difference between the predicted and actual values. 
# Combines CTC and attention losses.
# 3. Backpropagation: The gradients are calculated and used to update the model parameters.
# 4. Validation: The model is evaluated on the validation set to monitor its performance.
# 5. Model saving: The best model is saved based on the validation performance.

# MODEL ARCHITECTURE FLOW
# The model architecture consists of an encoder and a decoder.
# The encoder processes the input audio features and generates a sequence of hidden states.
# The decoder takes these hidden states and generates the output text sequence.
# The encoder uses a conformer architecture, which combines convolutional and self-attention mechanisms.
# The decoder uses a transformer architecture with attention mechanisms.
# The training process involves optimizing the model parameters using a hybrid CTC/attention loss function.
# The CTC loss allows the model to learn alignments between input and output sequences, while the attention loss helps the model focus on relevant parts of the input during decoding.
# The training process iteratively updates the model parameters to minimize the loss function, improving the model's performance on the ASR task.
# The model is trained using a dataset of audio recordings and their corresponding transcriptions.

# 1. Audio Input (16khz waveform)
# 2. Frontend Feature Extraction (mel-spectrogram)
# 3. Encoder (Conformer)
#    - Convolutional layers (Conv2D input layer)
#    - 12 Conformer blocks (self-attention, feed-forward, and CNN modules)
#    - Outputs acoustic representation (256-dimensional features)
# 4. Decoder (Transformer)
#    - 6 Transformer blocks (self-attention, feed-forward)
#    - Outputs text sequence (token IDs)
# 5. Output: BPE tokens (text sequence)

/opt/conda/envs/espnet/bin/python /opt/conda/envs/espnet/lib/python3.10/site-packages/ipykernel_launcher.py -f /home/jovyan/.local/share/jupyter/runtime/kernel-510df032-2600-4a04-accf-ec5898abdc32.json
[jupyter-wpc0385] 2025-07-03 10:34:20,649 (asr:523) INFO: Vocabulary size: 1000
[jupyter-wpc0385] 2025-07-03 10:34:21,071 (abs_task:1383) INFO: pytorch.version=2.5.1, cuda.available=True, cudnn.version=90100, cudnn.benchmark=False, cudnn.deterministic=True
[jupyter-wpc0385] 2025-07-03 10:34:21,076 (abs_task:1384) INFO: Model structure:
ESPnetASRModel(
  (frontend): DefaultFrontend(
    (stft): Stft(n_fft=512, win_length=400, hop_length=160, center=True, normalized=False, onesided=True)
    (frontend): Frontend()
    (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
  )
  (specaug): SpecAug(
    (time_warp): TimeWarp(window=5, mode=bicubic)
    (freq_mask): MaskAlongAxis(mask_width_range=[0, 30], num_mask=2, axis=freq)
    (time_mask): MaskAlongAxis(mask_wid

## Inference
You can just use the inference API of the ESPnet.

In [None]:
# How to use the trained model for inference
# Loads the trained model with specified configurations and performs inference on audio data.
# Beam search: uses beam_size = 20 for better accuracy. Beam search is a decoding strategy that explores multiple possible output sequences to find the most likely one.
# It maintains a fixed number of hypotheses (beam size) at each decoding step, allowing it to consider multiple paths in the output space.
# This helps improve the accuracy of the transcriptions by exploring different possible sequences and selecting the most likely one based on the model's predictions.
# Language model (LM) integration: uses a language model to improve transcription accuracy.
# The language model is trained separately and is used to provide context and improve the accuracy of the transcriptions.
# The LM weight is set to 0.2, which balances the contribution of the language model and the acoustic model during decoding.
# The CTC weight is set to 0.8, which determines the influence of the CTC loss in the hybrid CTC/attention decoding
# process. This weight can be adjusted to compare CTC and attention decoding.


from espnet2.bin.asr_inference import Speech2Text

m = Speech2Text(
    asr_train_config="./exp/train_asr_branchformer_e24_amp/config.yaml",
    asr_model_file="./exp/train_asr_branchformer_e24_amp/valid.acc.best.pth",
    lm_train_config="exp/lm/cy/rnn/config.yaml",  # LM config
    lm_file="exp/lm/cy/rnn/valid.loss.ave.pth",  # LM model
    token_type="bpe",
    bpemodel="data/bpemodel/bpe.model",
    device="cuda",
    beam_size=20, # Beam size for decoding
    lm_weight=0.2,
    ctc_weight=0.8,
)



In [None]:
import librosa # librosa is a Python library for audio and music analysis.

i = 5 # Index of the sample to be processed

with open("./dump/valid/wav.scp", "r") as f: # Read the audio file paths
    sample_path = f.readlines()[i]

with open("./dump/valid/text", "r") as f: # Read the transcriptions
    transription = " ".join(f.readlines()[i].split(" ")[1:-1])

y, sr = librosa.load(sample_path.split()[1], sr=16000, mono=True)
nbests = m(y) # run inference on (n) number of best performing models
text, *_ = nbests[0] # get the 1st (best) model result

print("Predicted:", text)
print("Label:", transription)

Predicted: Yr oedd yn wedwr llai, gan yr eglwys arfantus.
Label: Yr oedd yn bregethwr lleyg yn yr Eglwys
