# HU-LLM-Text Mistral Large: PhonologyBench for Russian 

Inspired by [PhonologyBench: Evaluating Phonological Skills of Large Language Models](https://aclanthology.org/2024.knowllm-1.1/) (Suvarna et al., KnowLLM 2024). 


## Part 1: Counting syllables in words

The idea is to test how well HU-LLM counts syllables in Russian words. I took words with different numbers of syllables from [this website](https://slogi.su/1). I expect it to do quite well on this task, since there is a clear algorithm for identifying syllables in a Russian word. 
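For reference, the rule can be sketched in a few lines: every Russian syllable contains exactly one vowel, so counting vowel letters gives the syllable count. This is only a minimal baseline, not part of the experiment itself; the names `RUSSIAN_VOWELS` and `count_syllables` are made up for illustration.

```python
# The ten Russian vowel letters; each one is the nucleus of exactly one syllable.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def count_syllables(word: str) -> int:
    """Count syllables in a Russian word by counting its vowel letters."""
    return sum(1 for ch in word.lower() if ch in RUSSIAN_VOWELS)


for w in ["кот", "молоко", "электричество"]:
    print(w, count_syllables(w))
```

Such a baseline could also serve as a sanity check on the gold labels taken from the word lists.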

The preprocessed files with words are located in `words_folder`. The code below: 

- reads each file from the folder
- counts how many words are in each file 
- creates one big CSV with all words and their syllable counts

In [36]:
import os
import pandas as pd
dfs = {}

folder_path = "/Users/maria.onoeva/Desktop/new_folder/GitHub/nlp-repo/HU_LLM/words_folder"

# Loop through all .txt files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)

        # The files contain Cyrillic text, so read them explicitly as UTF-8
        with open(file_path, 'r', encoding='utf-8') as file:
            words = file.read().strip().split(', ')
            length = len(words)

        print(f"Words in {filename}: {length}")

        # Create a DataFrame
        df = pd.DataFrame(words, columns=['word'])

        # Extract syllable count from filename (e.g., 'words_3.txt' -> 3)
        syllable_count = int(''.join([d for d in filename if d.isdigit()]))

        # Add the syllable count column
        df['syllable'] = syllable_count

        # Store in dictionary using filename without extension as key
        key = os.path.splitext(filename)[0]
        dfs[key] = df

# combine all DataFrames into one big one
combined_df = pd.concat(dfs.values(), ignore_index=True)

# saving as csv
combined_df.to_csv('combined_df.csv') 

Words in words_8.txt: 208
Words in words_9.txt: 144
Words in words_4.txt: 286
Words in words_5.txt: 278
Words in words_7.txt: 257
Words in words_6.txt: 272
Words in words_2.txt: 296
Words in words_3.txt: 298
Words in words_1.txt: 270


In [37]:
combined_df.count()

word        2309
syllable    2309
dtype: int64

Now I draw a random sample of 1000 words from the big CSV, split it into 4 batches of 250 words, and send each batch to HU-LLM via its API (1000 words in a single prompt seems to be too much). 

In [82]:
import numpy as np

sample_df_1000 = combined_df.sample(n=1000, random_state=420)

# Split into 4 approximately equal parts
split_dfs = np.array_split(sample_df_1000, 4)

# If you want lists instead of DataFrames:
split_lists = [subdf['word'].tolist() for subdf in split_dfs]

# Access the 4 random non-overlapping lists:
list1, list2, list3, list4 = split_lists



In [52]:
sample_df_1000.to_csv('sample_df_1000.csv')

In [51]:
from gradio_client import Client

for i in split_lists:
    client = Client("https://llm1-compute.cms.hu-berlin.de/")
    result_list = client.predict(
        param_0=f"Please count syllables in these Russian words {i}",
        api_name="/chat",
    )
    # Append each batch's raw response to the same file
    with open("sample_1000_result.txt", "a") as file:
        file.write(result_list)


Loaded as API: https://llm1-compute.cms.hu-berlin.de/ ✔
Loaded as API: https://llm1-compute.cms.hu-berlin.de/ ✔
Loaded as API: https://llm1-compute.cms.hu-berlin.de/ ✔
Loaded as API: https://llm1-compute.cms.hu-berlin.de/ ✔


Before assessing the results, I need to clean the result file `sample_1000_result.txt`. I manually removed everything except the results and saved them to a new file, `sample_1000_result_clean.txt`. I also need to remove the list numbers and replace the `' - '` pattern with a comma. The output is saved to `cleaned_output.csv`.

In [83]:
import re

pattern1 = re.compile(r"\d+\.\s") # matches a list number such as "1. " (anywhere in the line, not only at the start)
pattern2 = re.compile(r"\s*-\s*") # matches the " - " separator; note it also matches hyphens inside words

cleaned_rows = []

with open("sample_1000_result_clean.txt", "r") as file:
    for line in file:
        line = pattern1.sub("", line)
        line = pattern2.sub(",", line)
        cleaned_line = line.strip()
        row = cleaned_line.split(",")  # split into columns
        cleaned_rows.append(row)

# Create DataFrame and save to CSV
cleaned_output = pd.DataFrame(cleaned_rows)
cleaned_output.to_csv("cleaned_output.csv", index=False, header=False)

In [80]:
cleaned_output.count()

0    1000
1    1000
dtype: int64
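Because the second pattern also matches hyphens inside words, a hyphenated word would be split into extra columns. A minimal sketch of a stricter parser follows, assuming each cleaned line looks like `1. слово - 2`; the names `LINE_RE` and `parse_line` are hypothetical, and the real model output may need a different pattern.

```python
import re

# Assumed line shape: optional "12. " prefix, the word, " - ", then the count.
# The separator must be surrounded by whitespace, so word-internal hyphens survive.
LINE_RE = re.compile(r"^\s*(?:\d+\.\s+)?(?P<word>.+?)\s+-\s+(?P<count>\d+)\s*$")


def parse_line(line: str):
    """Return (word, count) for a well-formed line, or None otherwise."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group("word"), int(m.group("count"))


print(parse_line("1. молоко - 3"))
print(parse_line("12. кто-нибудь - 3"))
print(parse_line("Here are the syllable counts:"))
```

Lines that return `None` can be collected and inspected by hand instead of being silently turned into malformed rows.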

Now, finally, I can compare the model output with the expected syllable counts. 

In [None]:
# the two DataFrames have different indices, so I reset both before placing them side by side
sample_df_1000 = sample_df_1000.reset_index(drop=True)
cleaned_output = cleaned_output.reset_index(drop=True)

ready_df = pd.concat([sample_df_1000, cleaned_output], ignore_index=True, axis=1) # combining for comparison

In [None]:
ready_df[3] = ready_df[3].astype(int) # cast column 3 (the model's syllable count) to int

In [106]:
ready_df[4] = ready_df[1] == ready_df[3] # does the model's count match the expected count?
ready_df[5] = ready_df[0] == ready_df[2] # does the model's word match the sampled word (alignment check)?
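Concatenating by position only works if the model returns every word, in the original order and with identical spelling. A minimal sketch of aligning on the word itself instead is shown here with toy data; the frames `gold` and `model` and the column `predicted` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy data: the model skipped one word and answered in a different order.
gold = pd.DataFrame({"word": ["кот", "молоко", "корова"], "syllable": [1, 3, 3]})
model = pd.DataFrame({"word": ["молоко", "кот"], "predicted": [3, 2]})

# A left merge keeps every gold word; indicator=True adds a "_merge" column
# that marks rows with no matching model answer as "left_only".
aligned = gold.merge(model, on="word", how="left", indicator=True)
missing = aligned[aligned["_merge"] == "left_only"]

print(aligned)
print("Words without a model answer:", missing["word"].tolist())
```

Note that duplicate words in either frame would multiply rows in the merge, so duplicates should be checked first.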

In [104]:
false_ready_df = ready_df.loc[~ready_df[4]].copy() # keep only the rows where the counts differ
false_ready_df[6] = false_ready_df[1]-false_ready_df[3] # expected count minus the model's count
false_ready_df.count()

0    449
1    449
2    449
3    449
4    449
5    449
6    449
dtype: int64

Well, 449 wrong answers out of 1000 is far too many! What is going on? Is it because those words are very infrequent? Next, I check the accuracy of each of the four prompts.

In [123]:
split_ready_df = np.array_split(ready_df, 4)
attempt1 = split_ready_df[0][4].mean()
attempt2 = split_ready_df[1][4].mean()
attempt3 = split_ready_df[2][4].mean()
attempt4 = split_ready_df[3][4].mean()

print(attempt1, attempt2, attempt3, attempt4)

0.624 0.484 0.572 0.524




In [97]:
false_ready_df.to_csv("false_ready_df.csv", index=False, header=False)
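One way to dig into the errors is to break accuracy down by the expected syllable count, to see whether longer words are harder. The sketch below uses toy data only; the column names `gold`, `predicted`, `correct`, and `error` are illustrative and would need to be mapped to the numbered columns of `ready_df`.

```python
import pandas as pd

# Toy data only, not results from the experiment.
toy = pd.DataFrame({
    "word": ["кот", "дом", "молоко", "корова", "электричество"],
    "gold": [1, 1, 3, 3, 5],
    "predicted": [1, 1, 3, 2, 6],
})
toy["correct"] = toy["gold"] == toy["predicted"]
toy["error"] = toy["gold"] - toy["predicted"]  # same sign convention as column 6 above

# Per expected syllable count: number of words, share correct, mean signed error.
by_length = toy.groupby("gold").agg(
    n=("correct", "size"),
    accuracy=("correct", "mean"),
    mean_error=("error", "mean"),
)
print(by_length)
```

A positive mean error would indicate that the model tends to undercount, a negative one that it overcounts.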

## Part 2: Counting syllables in sentences and stress marking

Data from the accentological corpus