## Word Segmentation for WER Analysis

This Jupyter Notebook performs **word segmentation** as a preprocessing step for calculating the **Word Error Rate (WER)**. The segmentation follows the evaluation approach described in [Xiao et al.](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ipr2.13123), so that the resulting WER figures are consistent and comparable with their results.

Before executing the cells, install `cantoseg` with the following command:

```pip install cantoseg```

In [None]:
import pandas as pd
import cantoseg
from tqdm import tqdm

tqdm.pandas()
df = pd.read_csv('./result_tvb_134_epoch_test_absolute_full.csv')

# Segmentation function (space-separated for readability or further processing)
def tokenize(text):
    if pd.isna(text):
        return ''
    return ' '.join(cantoseg.cut(str(text)))

# Perform segmentation using the cantoseg library
df['transcript_trad_word_segmented'] = df['transcript_trad'].progress_apply(tokenize)
df['predicted_cantonese_word_segmented'] = df['predicted_cantonese'].progress_apply(tokenize)

df.to_csv('./result_tvb_59_epoch_test_absolute_full_word_segmented.csv', index=False)

print("Word segmentation is complete. The new CSV file has been saved.")


100%|██████████| 30648/30648 [04:33<00:00, 112.21it/s]
100%|██████████| 30648/30648 [05:24<00:00, 94.51it/s] 


Word segmentation is complete. The new CSV file has been saved.


In [None]:
import pandas as pd
import cantoseg
from tqdm import tqdm

tqdm.pandas()
df = pd.read_csv('./result_icable_134_epoch_test_absolute_full.csv')

# Segmentation function (space-separated for readability or further processing)
def tokenize(text):
    if pd.isna(text):
        return ''
    return ' '.join(cantoseg.cut(str(text)))

# Perform segmentation using the cantoseg library
df['transcript_trad_word_segmented'] = df['transcript_trad'].progress_apply(tokenize)
df['predicted_cantonese_word_segmented'] = df['predicted_cantonese'].progress_apply(tokenize)

df.to_csv('./result_icable_59_epoch_test_absolute_full_word_segmented.csv', index=False)

print("Word segmentation is complete. The new CSV file has been saved.")


100%|██████████| 7590/7590 [01:31<00:00, 82.58it/s] 
100%|██████████| 7590/7590 [01:40<00:00, 75.68it/s] 


Word segmentation is complete. The new CSV file has been saved.
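
Once both columns are segmented into space-separated words, WER can be computed as the word-level edit distance between reference and prediction, divided by the number of reference words. The following is a minimal sketch of that calculation, not necessarily the exact scoring used by Xiao et al.: the helper names `edit_distance` and `corpus_wer` are hypothetical, and pooling errors over the whole corpus (rather than averaging per-utterance WER) is an assumption.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_word in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Pooled WER: total word errors divided by total reference words.

    Both arguments are iterables of space-separated strings, such as the
    segmented columns written by the cells above.
    """
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_words += len(ref_tokens)
    # Guard against an empty reference set (assumption: report 0.0).
    return total_errors / total_words if total_words else 0.0
```

With the data frames produced above, this could be called as `corpus_wer(df['transcript_trad_word_segmented'], df['predicted_cantonese_word_segmented'])`. Empty strings, which `tokenize` returns for missing values, contribute no reference words, so a non-empty prediction for such a row counts only as insertions.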
