# Benchmarking English syllabification methods

**There are four main categories of methods:**
- Dictionary lookups, e.g. the CMU dict
- Hyphenation methods, e.g. PyHyphen (see Liang's thesis: https://www.tug.org/docs/liang/liang-thesis.pdf)
- ML-based approaches, which are hard to explain
- Rule-based approaches, fast but possibly inaccurate, e.g. Sonority Sequencing in nltk
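
To make the "fast but possibly inaccurate" trade-off of rule-based approaches concrete, here is a minimal sketch of a vowel-group counter. It is not one of the benchmarked implementations; the function name and the silent-`e` rule are assumptions chosen for illustration only.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
CONSONANT_LE = re.compile(r"[^aeiouy]le$")


def count_syllables_naive(word: str) -> int:
    """Hypothetical rule-based counter: one syllable per group of vowel letters."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    # Crude rule: a final 'e' is treated as silent ("make"),
    # unless the word ends in consonant + "le" ("table").
    if word.endswith("e") and not CONSONANT_LE.search(word) and count > 1:
        count -= 1
    return max(count, 1)


for w in ["table", "contest", "make", "lion", "usually"]:
    print(w, count_syllables_naive(w))
```

With these rules "table", "contest" and "make" come out as 2, 2 and 1, but "lion" is counted as 1 and "usually" as 3, because adjacent vowel letters are merged into one group. This is the kind of error a dictionary lookup avoids.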

**We have tested the following packages and libraries:**
- `nltk` **sonority** based and **cmu** dict
- `indic` lib (moulshree)
- hyphenation libraries, `pyphen` and `hyphenate`
- `spacy` machine learning models, **spacy_syllable** and **nlp pipeline**
- machine learning models, `meow25` and `bigphoney`
- user contributions, `basirico`, `anonuser1`, `abigailb`, `hauntninja` and `tarun`
- misc libs, `syllapy` and `syllables`

**Some packages syllabify and count, others just count syllables:**

| **Method Type**             | **Implementation Type** | **Implementation Name**       |
|-----------------------------|-------------------------|-------------------------------|
| **Syllabification + Count** | Library                 | nltk (sonority-based)         |
|                             |                         | indic lib                     |
|                             | Hyphenation Library     | pyphen                        |
|                             |                         | hyphenate                     |
|                             | Machine Learning Model  | spaCy syllable                |
|                             |                         | spaCy nlp (same, but faster)  |
| **Count Only**              | User Contributed        | basirico                      |
|                             |                         | anonuser1                     |
|                             |                         | abigailb                      |
|                             |                         | hauntninja                    |
|                             |                         | tarun – sylco                 |
|                             | Library                 | nltk (cmudict)                |
|                             |                         | syllapy                       |
|                             |                         | syllables                     |
|                             | Machine Learning Model  | meow25                        |
|                             |                         | bigphoney                     |





**To be implemented:**
- Use a dictionary as the gold standard (requires human intervention)
- Web-based approaches, APIs etc. (https://stackoverflow.com/questions/10414957/using-python-to-find-syllables/10416028#10416028)
     - https://www.howmanysyllables.com/syllables/table
     - https://www.wordcalc.com/index.php

### 1. Imports for libraries and packages (ignore)

In [11]:
import syll_libraries_niket as syll_functs
import nltk
import time, os
import pandas as pd

In [12]:
# Left-align tables in markdown cells
from IPython.display import HTML
HTML("""
<style>
table {
    margin-left: 0 !important;
    margin-right: auto !important;
}
</style>
""")

### 2. Add word list for benchmarking

Some hard words are included as well. To do: find more words on which the methods diverge from each other.

| Word        | Syllabification | Syllable Count |
|-------------|------------------|----------------|
| usually     | u-su-al-ly       | 4              |
| amsterdam   | am-ster-dam      | 3              |
| latika      | la-ti-ka         | 3              |
| table       | ta-ble           | 2              |
| passing     | pass-ing         | 2              |
| contest     | con-test         | 2              |
| conflict    | con-flict        | 2              |
| construct   | con-struct       | 2              |
| table       | ta-ble           | 2              |
| lion        | li-on            | 2              |

### 3. Benchmark: number of syllables

#### 3.0 Testing function performance on test words

In [13]:
word_list = ["usually", "amsterdam", "latika", "table", "passing", 'contest', 'conflict', 'construct', 'table', 'lion']
start_time = time.perf_counter()
for word in word_list:
    
    # syll = syll_functs.syllable_sonority(word)
    # syll = syll_functs.syllable_indic(word)
    # syll = syll_functs.syllable_hyphen_pyphen(word)
    # syll = syll_functs.syllable_hyphen_hyphenate(word)
    
    # syll = syll_functs.syllable_MLspacysyll(word)  # takes about 15 minutes
    # syll = syll_functs.syllable_MLspacynlp(word)
    
    # syll = syll_functs.n_syllable_basirico(word)
    # syll = syll_functs.n_syllable_anonuser1(word)
    # syll = syll_functs.n_syllable_abigailb(word)
    # syll = syll_functs.n_syllable_hauntninja(word)
    # syll = syll_functs.n_syllable_tarunsylco(word)

    # syll = syll_functs.n_syllable_cmudict(word)
    # syll = syll_functs.n_syllable_syllapy(word)
    # syll = syll_functs.n_syllable_syllables(word)
    
    #syll = syll_functs.n_syllable_MLmeow25(word)
    syll = syll_functs.n_syllable_MLbigphone(word)
    
    print(syll)


elapsed_time = time.perf_counter() - start_time
print(elapsed_time)

{'word': 'usually', 'nsyll': 4}
{'word': 'amsterdam', 'nsyll': 3}
{'word': 'latika', 'nsyll': 3}
{'word': 'table', 'nsyll': 2}
{'word': 'passing', 'nsyll': 2}
{'word': 'contest', 'nsyll': 2}
{'word': 'conflict', 'nsyll': 2}
{'word': 'construct', 'nsyll': 2}
{'word': 'table', 'nsyll': 2}
{'word': 'lion', 'nsyll': 2}
1.9057884999929229


In [14]:
# sum(
#     1 for word, pron in cd.items()
#     if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
# ) / len(cd)
# # 0.9073751569397757
# cd


# from collections import Counter
# for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
#     word = word.lower()
#     if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
#         print(word)

# sono_df = []
# for word in cmu_df.word.to_list():
#     nsyll = syll_functs.n_syllable_syllapy(word)['nsyll']
#     sono_df.append([word,nsyll])
# sono_df = pd.DataFrame(sono_df, columns=['word','nsyll_sono'])
# sono_df

#### 3.1 Benchmark dataset #1: CMU dictionary

In [15]:
# load the dataset
cd = nltk.corpus.cmudict.dict()

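# Build a reference df to check against: vowel phonemes end in a stress digit,
# so count them per pronunciation and keep the maximum across variants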
cmu_df = []
for word, pron in cd.items():
    nsyll = max([len([y for y in x if y[-1].isdigit()]) for x in pron])
    cmu_df.append([word,nsyll])
cmu_df = pd.DataFrame(cmu_df, columns=['word','nsyll_cmu'])

# print the dataframe
print(cmu_df)

              word  nsyll_cmu
0                a          1
1               a.          1
2           a42128          6
3              aaa          3
4           aaberg          2
...            ...        ...
123450        zysk          1
123451   zyskowski          3
123452    zyuganov          3
123453  zyuganov's          3
123454     zywicki          3

[123455 rows x 2 columns]
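
The reference count above keeps the maximum over all pronunciation variants, whereas the commented-out check earlier accepts a prediction that matches any variant. The sketch below shows how the two differ. The entries are hand-written in cmudict's format for illustration and may not match the real dictionary exactly; the helper names are hypothetical.

```python
from typing import Dict, List, Set

# Hand-written sample in cmudict's format (ARPAbet; vowel phonemes end in a stress digit).
# Illustrative data only: the real dictionary comes from nltk.corpus.cmudict.dict().
SAMPLE: Dict[str, List[List[str]]] = {
    "table": [["T", "EY1", "B", "AH0", "L"]],
    "lion": [["L", "AY1", "AH0", "N"]],
    "usually": [
        ["Y", "UW1", "ZH", "AH0", "W", "AH0", "L", "IY0"],
        ["Y", "UW1", "ZH", "AH0", "L", "IY0"],
    ],
}


def syllables_in(pron: List[str]) -> int:
    """Count vowel phonemes, i.e. phonemes whose last character is a stress digit."""
    return sum(1 for phoneme in pron if phoneme[-1].isdigit())


def max_syllables(prons: List[List[str]]) -> int:
    """Strict reference: the largest count over all variants (as in the cell above)."""
    return max(syllables_in(p) for p in prons)


def accepted_counts(prons: List[List[str]]) -> Set[int]:
    """Lenient reference: any variant's count is accepted."""
    return {syllables_in(p) for p in prons}


for word, prons in SAMPLE.items():
    print(word, max_syllables(prons), sorted(accepted_counts(prons)))
```

With the strict reference, a counter that predicts 3 for a word with 3- and 4-syllable variants is scored as wrong, so accuracies measured this way are likely somewhat lower than with the lenient check.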


In [16]:
# list of our syllable functions
syllable_functions = [
    syll_functs.syllable_sonority, syll_functs.syllable_indic,
    syll_functs.syllable_hyphen_pyphen, syll_functs.syllable_hyphen_hyphenate,
    
    syll_functs.n_syllable_basirico, syll_functs.n_syllable_anonuser1, syll_functs.n_syllable_abigailb, syll_functs.n_syllable_hauntninja, syll_functs.n_syllable_tarunsylco,
    syll_functs.n_syllable_syllapy, syll_functs.n_syllable_syllables,

    # syllable_MLspacysyll gives the same results; syllable_MLspacynlp is faster, so only it is used
    syll_functs.syllable_MLspacynlp,
    syll_functs.n_syllable_MLmeow25, syll_functs.n_syllable_MLbigphone
]

# Store results here
results = []

# Loop through each function
for func in syllable_functions:
    print("Executing function:", func)
    
    correct = 0
    total = len(cmu_df)
    # for timing
    start_time = time.perf_counter()

    for _, row in cmu_df.iterrows():
        word = row['word']
        true_count = row['nsyll_cmu']
        pred_count = func(word)['nsyll']
        if pred_count == true_count:
            correct += 1

    # for timing
    elapsed_time = time.perf_counter() - start_time
    
    accuracy = correct / total
    results.append({
        'function': func.__name__,
        'accuracy': accuracy,
        'time_seconds': elapsed_time
    })

# Create a results DataFrame
results_df = pd.DataFrame(results).sort_values(by='accuracy', ascending=False)
print(results_df)
results_df.to_csv("result_benchmark_cmu.csv", index=False)

Executing function: <function syllable_sonority at 0x0000021B9FA34040>
Executing function: <function syllable_indic at 0x0000021BDEBDE8E0>
Executing function: <function syllable_hyphen_pyphen at 0x0000021BCF85C360>
Executing function: <function syllable_hyphen_hyphenate at 0x0000021BDE888AE0>
Executing function: <function n_syllable_basirico at 0x0000021BCF85C180>
Executing function: <function n_syllable_anonuser1 at 0x0000021BCD51BC40>
Executing function: <function n_syllable_abigailb at 0x0000021BDE88A3E0>
Executing function: <function n_syllable_hauntninja at 0x0000021BDE88AC00>
Executing function: <function n_syllable_tarunsylco at 0x0000021BDE88B740>
Executing function: <function n_syllable_syllapy at 0x0000021BDE88B920>
Executing function: <function n_syllable_syllables at 0x0000021BDE88B880>
Executing function: <function syllable_MLspacynlp at 0x0000021BDEBDE980>
Executing function: <function n_syllable_MLmeow25 at 0x0000021BFEF63CE0>
Executing function: <function n_syllable_MLb
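
Since the same loop is used for both datasets, it could be factored into a helper. The following is a minimal sketch, assuming every function returns a dict with an `'nsyll'` key as in the cells above. `benchmark` and `n_syllable_vowel_groups` are hypothetical names, and the stand-in counter is not one of the benchmarked methods.

```python
import re
import time
from typing import Callable, Dict, Iterable, List, Tuple


def benchmark(funcs: Iterable[Callable[[str], Dict]], pairs: List[Tuple[str, int]]) -> List[Dict]:
    """Score each function on (word, true_count) pairs; rows are sorted by accuracy."""
    results = []
    for func in funcs:
        start = time.perf_counter()
        correct = sum(1 for word, true_count in pairs if func(word)["nsyll"] == true_count)
        elapsed = time.perf_counter() - start
        results.append({
            "function": func.__name__,
            "accuracy": correct / len(pairs),
            "time_seconds": elapsed,
        })
    return sorted(results, key=lambda row: row["accuracy"], reverse=True)


def n_syllable_vowel_groups(word: str) -> Dict:
    """Hypothetical stand-in that follows the {'word', 'nsyll'} return shape."""
    return {"word": word, "nsyll": max(1, len(re.findall(r"[aeiouy]+", word.lower())))}


pairs = [("table", 2), ("lion", 2), ("contest", 2), ("amsterdam", 3)]
rows = benchmark([n_syllable_vowel_groups], pairs)
print(rows)
```

With the DataFrames used here, the pairs could be built as `list(zip(cmu_df['word'], cmu_df['nsyll_cmu']))`, which also keeps the per-row overhead of `iterrows` out of the timed region.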

#### 3.2 Benchmark dataset #2: 1-7 syllable words

In [17]:
# load the dataset
all_files = os.listdir("benchmarking_dataset")

# Iterate and create a bench_df to check against
bench_df = []

for file in all_files:
    df = pd.read_csv(os.path.join("benchmarking_dataset", file), header=None)
    # the first character of each file name is taken as the syllable count
    df['nsyll_bench'] = int(file[0])
    bench_df.append(df)

bench_df = pd.concat(bench_df, ignore_index=True)
bench_df.rename(columns={0:'word'}, inplace=True)

# print the dataframe
print(bench_df)

                word  nsyll_bench
0                the            1
1                 of            1
2                and            1
3                 to            1
4                 in            1
...              ...          ...
9456  unsatisfactory            6
9457  respectability            6
9458  unintelligible            6
9459  reconciliation            6
9460   individuality            7

[9461 rows x 2 columns]


In [18]:
# list of our syllable functions
syllable_functions = [
    syll_functs.syllable_sonority, syll_functs.syllable_indic,
    syll_functs.syllable_hyphen_pyphen, syll_functs.syllable_hyphen_hyphenate,
    
    syll_functs.n_syllable_basirico, syll_functs.n_syllable_anonuser1, syll_functs.n_syllable_abigailb, syll_functs.n_syllable_hauntninja, syll_functs.n_syllable_tarunsylco,
    syll_functs.n_syllable_syllapy, syll_functs.n_syllable_syllables,

    # syllable_MLspacysyll gives the same results; syllable_MLspacynlp is faster, so only it is used
    syll_functs.syllable_MLspacynlp,
    syll_functs.n_syllable_MLmeow25, syll_functs.n_syllable_MLbigphone
]

# Store results here
results = []

# Loop through each function
for func in syllable_functions:
    print("Executing function:", func)
    
    correct = 0
    total = len(bench_df)
    # for timing
    start_time = time.perf_counter()

    for _, row in bench_df.iterrows():
        word = row['word']
        true_count = row['nsyll_bench']
        pred_count = func(word)['nsyll']
        if pred_count == true_count:
            correct += 1

    # for timing
    elapsed_time = time.perf_counter() - start_time
    
    accuracy = correct / total
    results.append({
        'function': func.__name__,
        'accuracy': accuracy,
        'time_seconds': elapsed_time
    })

# Create a results DataFrame
results_df = pd.DataFrame(results).sort_values(by='accuracy', ascending=False)
print(results_df)
results_df.to_csv("result_benchmark_dataset.csv", index=False)

Executing function: <function syllable_sonority at 0x0000021B9FA34040>
Executing function: <function syllable_indic at 0x0000021BDEBDE8E0>
Executing function: <function syllable_hyphen_pyphen at 0x0000021BCF85C360>
Executing function: <function syllable_hyphen_hyphenate at 0x0000021BDE888AE0>
Executing function: <function n_syllable_basirico at 0x0000021BCF85C180>
Executing function: <function n_syllable_anonuser1 at 0x0000021BCD51BC40>
Executing function: <function n_syllable_abigailb at 0x0000021BDE88A3E0>
Executing function: <function n_syllable_hauntninja at 0x0000021BDE88AC00>
Executing function: <function n_syllable_tarunsylco at 0x0000021BDE88B740>
Executing function: <function n_syllable_syllapy at 0x0000021BDE88B920>
Executing function: <function n_syllable_syllables at 0x0000021BDE88B880>
Executing function: <function syllable_MLspacynlp at 0x0000021BDEBDE980>
Executing function: <function n_syllable_MLmeow25 at 0x0000021BFEF63CE0>
Executing function: <function n_syllable_MLb

#### 3.3 Benchmarking selected methods on PARINDE English

In [None]:
# see syllable_testing_parinde

#### Results

- To count syllables, treat the CMU dict as the gold standard.
- If a word is not found there, use the ML-based bigphoney, or hauntninja for faster computation.
- For syllabification, use the ML-based spaCy nlp, or hyphenate for faster computation.

Sometimes every method returns a wrong result; in that case, fall back to sonority.
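
The counting recommendation amounts to a lookup with fallbacks. A minimal sketch of that cascade follows; `make_counter`, the toy lookup table and the vowel-group fallback are all hypothetical stand-ins. In practice the lookup would be built from the CMU dict and the fallbacks would wrap the bigphoney or hauntninja functions.

```python
import re
from typing import Callable, Dict, List, Optional

Fallback = Callable[[str], Optional[int]]


def make_counter(lookup: Dict[str, int], fallbacks: List[Fallback]) -> Fallback:
    """Build a counter that tries the dictionary first, then each fallback in order."""
    def count(word: str) -> Optional[int]:
        key = word.lower()
        if key in lookup:
            return lookup[key]
        for fallback in fallbacks:
            result = fallback(key)
            if result is not None:
                return result
        return None  # nothing could handle the word
    return count


def vowel_groups(word: str) -> Optional[int]:
    """Hypothetical fast fallback: one syllable per group of vowel letters."""
    groups = re.findall(r"[aeiouy]+", word)
    return len(groups) if groups else None


# Toy lookup table standing in for the dictionary.
count = make_counter({"table": 2, "lion": 2}, [vowel_groups])
print(count("lion"))    # answered by the lookup table
print(count("latika"))  # not in the table, answered by the fallback
```

Ordering the fallbacks from most accurate to fastest keeps the slow model off the common path while still covering out-of-dictionary words.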
