# Refine parallel dataset with character n-grams

This notebook guides you through a pipeline for refining a parallel dataset derived from standard token-to-token alignment models, so that it also includes likely character n-grams corresponding to the source word that the original parallel dataset was based on. For example, starting from a dataset containing each occurrence of the token 'when' in English and its parallels (translations) in several other languages, the notebook calculates associations between 'when' and sub-token items to discover potential morphological means of expressing the meaning of 'when'. Where a language has both lexified counterparts to 'when' and morphological means (e.g. participles or converbs alongside 'when'-counterparts), the notebook attempts to keep the lexical counterpart while also grouping together sub-token units that show significant similarities at the character n-gram level.
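
To make "character n-grams" concrete, the following is a minimal sketch of how sub-token units might be enumerated for a single word. The `$` (word start) and `@` (word end) markers mirror the ones the main script looks for in its output files; the function name `char_ngrams` and the length limits are illustrative assumptions, not the actual API of `NGramAssoc`.

```python
def char_ngrams(word, min_n=1, max_n=4, start="$", end="@"):
    """Return all character n-grams of `word`, padded with boundary markers."""
    padded = f"{start}{word}{end}"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the padded word
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


# Suffix n-grams are the ones that end with the word-end marker
suffixes = [g for g in char_ngrams("walking", min_n=3, max_n=5) if g.endswith("@")]
print(suffixes)
```

An n-gram that carries both markers (such as `$walking@`) spans a whole word, which is how full words can be told apart from suffixes later in the pipeline.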

### Imports

In [11]:
import sys
import os

# Get the current directory of the notebook
notebook_dir = os.path.dirname(os.path.realpath("__file__"))

# Construct the path to the 'src' directory
src_dir = os.path.join(notebook_dir, '../src')

# Add the 'src' directory to sys.path
if src_dir not in sys.path:
    sys.path.append(src_dir)

import pandas as pd
import numpy as np
from glob import glob

import spacy
import spacy.cli

# Check if the model is already downloaded
if not spacy.util.is_package("en_core_web_sm"):
    # If not, download the model
    spacy.cli.download("en_core_web_sm")
    
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Now you can import your modules
from utils.text_processing import find_adv_head, transform_sentence, remove_accents, extract_head_counterpart
from analysis.ngram_analysis import NGramAssoc, source_target_stopwords_assoc

# Load the English language model for dependency parsing
nlp = spacy.load("en_core_web_sm")

### Edit variables

You'll need:
1) A CSV file (its path goes in `advdf_path`) in which each row is one occurrence of the source token of interest in language A (e.g. English) and each column holds the parallel in another language, alongside a `sent_id` column and a `context` column (with the source text).
2) A list of target languages, given as ISO 639-3 codes, whose columns you want to refine. You'll get both a copy of the original CSV file with those columns modified (and the others left untouched) and a second copy containing only `sent_id`, `context`, the source language, and the target languages that were modified.
3) The alignment models in CSV format, where the first column is a `sent_id` that can be mapped to the CSV in `advdf_path`, the second is the `context` corresponding to that `sent_id`, and the third (`targ`) maps source words to target words in the format `word1 (parallel1) word2 (parallel2)`, etc. For instance, the header and first row of the alignment model for [aai] are:

```
sent_id,context,targ
40001003,and judah the father of perez and zerah by tamar and perez the father of hezron and hezron the father of ram,and (naatu) judah (judah) the (NOMATCH) father (NOMATCH) of (natun) perez (perez) and (naatu) zerah (zerah) by (NOMATCH) tamar (natunatun) and (naatu) perez (perez) the (NOMATCH) father (NOMATCH) of (natun) hezron (NOMATCH) and (naatu) hezron (NOMATCH) the (NOMATCH) father (NOMATCH) of (natun) ram (ram) 
```

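Set the folder containing the alignment models in `alignments_parent_path` below. The script takes the part of each file name before the first hyphen as the language code, so file names must start with the code followed by `-`.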

:warning: Make sure that 'empty'/null-token alignments are marked as 'NOMATCH' rather than 'NULL' or other variations, to clearly distinguish them from empty parallels due to the lack of target text (as opposed to lack of a lexical counterpart).
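
As an illustration of the `targ` format described above, here is a minimal sketch that splits an alignment string into (source, target) pairs. The name `parse_alignment` and the choice to map `NOMATCH` to `None` are assumptions made for this example; the sketch also assumes that neither side of a pair contains spaces or parentheses, which may not hold for every alignment model.

```python
import re

# Matches "source (target)" pairs; assumes no spaces or parentheses inside either token
PAIR_RE = re.compile(r"(\S+) \(([^()\s]+)\)")


def parse_alignment(targ):
    """Split an alignment string into (source_word, target_word_or_None) pairs."""
    pairs = []
    for source, target in PAIR_RE.findall(targ):
        # 'NOMATCH' marks a source word without a lexical counterpart
        pairs.append((source, None if target == "NOMATCH" else target))
    return pairs


example = "and (naatu) judah (judah) the (NOMATCH) father (NOMATCH) of (natun)"
pairs = parse_alignment(example)
print(pairs)
```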

You can also define stopwords in the source language, so that their translations are not mistaken for parallels to your word of interest if they occur very often.

In [12]:
# Set the source_token and target languages for processing
source_token = 'when'
targetlangs = ["aca","acf","acr","acu","agr","agu","ake","ame","amr","amu","apn","apu","apy","arl","arn","auc","ayo","ayr","azg","azz","bao","bkq","bmr","boa","bsn","bzd","bzj","caa","cab","cac","cag","cak","cao","cap","car","cas","cav","cax","cbc","cbi","cbr","cbs","cbt","cbu","cbv","cco","ceg","chd","chf","chq","chz","cjo","cjp","cle","cly","cni","cnl","cnt","coe","cof","cok","con","cot","cpa","cpb","cpc","cpu","cpy","crn","crq","crt","cso","cta","ctp","ctu","cub","cuc","cui","cuk","cul","cut","cux","cya","des","djk","emp","enx","ese","gnw","gub","guc","gug","guh","gui","gum","gun","guo","guq","gvc","gym","gyr","hat","hch","hix","hns","hto","hub","hus","huu","huv","icr","ign","inb","ixl","jac","jam","jic","jiv","jvn","kaq","kbc","kbh","kek","kgk","kgp","kjb","knj","kog","kpj","kvn","kwi","kyz","lac","leg","maa","maj","mam","maq","mau","mav","maz","mbc","mbj","mbl","mca","mcb","mcd","mcf","mco","mfy","mib","mie","mig","mih","mil","mio","miq","mir","mit","miy","miz","mjc","mks","moc","mop","mpm","mto","mtp","mxb","mxp","mxq","mxt","mxv","myu","myy","mza","mzh","mzl","nab","nch","ncj","ncl","ngu","nhd","nhe","nhg","nhi","nhw","nhx","nhy","noa","not","npl","nsu","ntp","ote","otm","otn","otq","ots","oym","pab","pad","pah","pap","pbb","pbc","pib","pio","pir","plg","pls","plu","poe","poh","poi","pps","pua","qub","quc","quf","qug","quh","qul","qup","quw","quy","quz","qva","qvc","qve","qvh","qvi","qvm","qvn","qvo","qvs","qvw","qvz","qwh","qxh","qxl","qxn","qxo","qxr","rkb","sab","sey","shp","sja","snn","sri","srm","srn","srq","stp","tac","tar","tav","tca","tee","ter","tfr","tku","tna","tnc","tob","toc","toj","too","top","tos","tpp","tpt","tqb","trc","trn","trq","tsz","ttc","tue","tuf","tuo","txu","tzh","tzj","tzo","ura","urb","usp","var","vmy","wap","way","wca","xav","xsu","xtd","xtm","xtn","yaa","yad","yan","yaq","ycn","yua","yuz","zaa","zab","zac","zad","zae","zai","zam","zao","zar","zas","zat","zav","zaw","zca","zos","zpc","zpi","zpl","zpm","zpo","zpq","zpt","zpu","zpv"
,"zpz","zsr","ztq","zty"]  # List of target language codes

# Define paths for the dataframe and the alignment files
advdf_path = '../datasets/when-latamecarr.csv'
# The parent folder in which all the CSV alignment files are found
alignments_parent_path = '../datasets/symgizamodel'

# Define a list of stopwords for source language
stopwords_source = ['jesus','herod','paul','peter','and','behold','then']
# Optional:
# stopwords_target = ['же','δὲ']

outputdir = '../outputs/'

### Main script

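The following cell writes several files per target language to `outputdir`:
- [langname]-[source_token]-char-ngram-assoc.txt: a TSV file containing the character n-grams most likely to correspond to the word of interest, ordered by the p-value of the chi-square association and by 'true positives', i.e. how many times a given n-gram is found as translating the word of interest.
- [langname]-[source_token]-word-assoc.txt: the same measures as above, but at the word level.
- [langname]-[source_token]-stopwords.txt: the same measures as above, but for the most likely parallels to the stopwords provided in `stopwords_source`.
- [langname]-[source_token]-with-adv-head.txt: the intermediate text file (one transformed alignment string per sentence) from which the associations are computed.

It also writes a single [source_token]-ngrams-details.txt file listing the n-gram clusters found for each language.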
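
To illustrate the kind of association measure these files report, here is a minimal sketch of a Pearson chi-square test on a 2x2 contingency table, using only the standard library. How `NGramAssoc` actually computes its statistics is not shown in this notebook, so the function `chi2_2x2`, the cell names, and the reading of `c11` as the co-occurrence count are assumptions for illustration only.

```python
import math


def chi2_2x2(c11, c12, c21, c22):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table.

    c11: sentences with both the source word and the feature
    c12: sentences with the source word but not the feature
    c21: sentences with the feature but not the source word
    c22: sentences with neither
    """
    n = c11 + c12 + c21 + c22
    denom = (c11 + c12) * (c21 + c22) * (c11 + c21) * (c12 + c22)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column carries no association information
    chi2 = n * (c11 * c22 - c12 * c21) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = chi2_2x2(c11=40, c12=10, c21=5, c22=945)
print(f"chi2={chi2:.1f}, p={p:.3g}")
```

With very strong associations the p-value can become too small to represent and may be reported as exactly 0.0, which is possibly why the main script filters on `p-value == 0.0`.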

In [None]:
# Open a file to write output related to ngrams
with open(f'{outputdir}{source_token}-ngrams-details.txt','w') as outtxtgrams:
    # Load the dataframe containing alignments
    advdf = pd.read_csv(advdf_path,dtype=str)

    # Collect all alignment CSV files from the specified folder
    alignments_paths = sorted(glob(f'{alignments_parent_path}/*.csv'))

    # Iterate through each alignment file
    for alignments_path in alignments_paths:
        # The language code is the part of the file name before the first hyphen
        language = os.path.basename(alignments_path).split('-')[0]
        if language in targetlangs:
            try:
                print(language)
                # Read the current alignment file
                df_with_parall = pd.read_csv(alignments_path,dtype=str)

                # Merge the alignment data with the source-token dataframe on 'sent_id' and 'context'
                df_with_parall = advdf.merge(df_with_parall, how='left',on=['sent_id','context'])

                # Process 'context' column to find occurrences of the source_token
                adv_heads_col = [find_adv_head(cont,source_token,nlp) for cont in df_with_parall['context']]

                # Process the results to format them for output
                df_with_parall['adv_head'] = [','.join(x) if x else None for x in adv_heads_col]

                # Process each row to transform the sentence based on the target and headword
                with open(f'{outputdir}{language}-{source_token}-with-adv-head.txt', 'w') as outtxt:
                    for index, row in df_with_parall.iterrows():
                        if not pd.isna(row['context']):
                            target = remove_accents(str(row['targ']))
                            headword = row['adv_head']
                            output_sentence = transform_sentence(str(target),str(headword))
                            outtxt.write(output_sentence + '\n')

                # Define target words for NGram alignment
                target_words=[source_token,'advhead']

                # Collect stopwords in the target language
                res = source_target_stopwords_assoc(f'{outputdir}{language}-{source_token}-with-adv-head.txt', outputdir, stopwords_source, source_token, language)
                
                # Now read in the file with stopwords matches
                df_sw= pd.read_csv(f'{outputdir}{language}-{source_token}-stopwords.txt',sep='\t')
                # Try and use only those with a p-value of 0
                # Initialize stopwords_target if not provided by the user
                if 'stopwords_target' not in locals():
                    stopwords_target = []

                # Add new stopwords to the list if the p-value is 0.0
                stopwords_target.extend(df_sw[df_sw['p-value'] == 0.0]['feature'].tolist())

                # If there are fewer words with such a p-value than the number of stopwords, then order by p-value (ascending) and take the first n (= number of stopwords)
                if len(stopwords_target) < len(stopwords_source):
                    stopwords_target = list(set(list(df_sw[df_sw['p-value']==0.0]['feature']) + list(df_sw.sort_values(by='p-value')['feature'][0:len(stopwords_source)])))

                res=NGramAssoc(f'{outputdir}{language}-{source_token}-with-adv-head.txt',outputdir,target_words,stopwords_source,stopwords_target,source_token,language)

                df= pd.read_csv(f'{outputdir}{language}-{source_token}-word-assoc.txt',sep='\t')
                    
                # Only take words matching source_token with a p-value of 0
                source_token_words = []
                perfect_pvalue = list(df[df['p-value'] == 0.0]['feature'])

                if len(perfect_pvalue) == 0:
                    best_match = list(df.sort_values(by=['score'],ascending=False)['feature'])[0]
                    source_token_words.append(best_match)
                else:
                    source_token_words = source_token_words + perfect_pvalue

                # Load the character n-gram associations for this language
                df = pd.read_csv(f'{outputdir}{language}-{source_token}-char-ngram-assoc.txt', sep='\t')

                # Only consider ngrams occurring at the end of words (signaled by '@')
                # Note that this is experimental, and by no means the best approach. Needs to be tested systematically against, e.g. templatic languages
                df = df[(df['score'] > 1) & (df['feature'].str.contains('@'))].sort_values(by=['c11'], ascending=False).head(20)

                # Calculate TF-IDF
                vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 8))
                tfidf_matrix = vectorizer.fit_transform(df['feature']).toarray()  # Convert to dense array

                # Number of clusters/components
                num_clusters_components = 3

                # K-Means clustering with different initialization methods
                np.random.seed(42)
                random_state = 42  # Random state for reproducibility

                kmeans_random = KMeans(n_clusters=num_clusters_components, init='random', random_state=random_state,n_init=10)
                df['cluster_kmeans_random'] = kmeans_random.fit_predict(tfidf_matrix)

                kmeans_kmeans_plus_plus = KMeans(n_clusters=num_clusters_components, init='k-means++', random_state=random_state,n_init=10)
                df['cluster_kmeans_kmeans++'] = kmeans_kmeans_plus_plus.fit_predict(tfidf_matrix)

                # DBSCAN clustering
                dbscan = DBSCAN(eps=1, min_samples=3)
                df['cluster_dbscan'] = dbscan.fit_predict(tfidf_matrix)

                # Agglomerative clustering
                agglomerative = AgglomerativeClustering(n_clusters=num_clusters_components)
                df['cluster_agglomerative'] = agglomerative.fit_predict(tfidf_matrix)

                # GMM clustering with fixed number of components
                gmm = GaussianMixture(n_components=num_clusters_components, random_state=random_state)
                df['cluster_gmm'] = gmm.fit_predict(tfidf_matrix)

                # Print cluster assignments
                # for algorithm, cluster_columns in [('K-Means (Random)', 'cluster_kmeans_random'),
                #                                 ('K-Means (k-means++)', 'cluster_kmeans_kmeans++'),
                #                                 ('DBSCAN', 'cluster_dbscan'),
                #                                 ('Agglomerative', 'cluster_agglomerative'),
                #                                 ('GMM', 'cluster_gmm')]:
                    
                for algorithm, cluster_columns in [('DBSCAN', 'cluster_dbscan')]:
                    print(f"\n{algorithm} Clustering:")
                    for cluster_number in sorted(df[cluster_columns].unique()):
                        # Select rows corresponding to the current cluster number
                        cluster_rows = df[df[cluster_columns] == cluster_number]

                        # Print the values in df['feature'] as a list for the current cluster
                        feature_list = cluster_rows['feature'].tolist()
                        print(f'Cluster {cluster_number}: {feature_list}')

                # Apply the function to create a new column
                df_with_parall['adv_head_transl'] = df_with_parall.apply(extract_head_counterpart, axis=1)
                # print(df_with_parall)
                ngrams_clusters = []
                full_words = []

                for name, group in df[df['cluster_dbscan'] >= 0].groupby('cluster_dbscan'):
                    # print(f'Cluster {name}')
                    cluster = []
                    # print(group['feature'])
                    for ngram in group['feature']:
                        if '$' in ngram:
                            newngram = ngram.split('$')[1].split('@')[0]
                            # print(f'ngram {newngram} is full word')
                            full_words.append(newngram)
                            # cluster.append(newngram)
                        else:
                            newngram = ngram.split('@')[0]
                            cluster.append(newngram)
                    ngrams_clusters.append(cluster)

                outtxtgrams.write(language)
                outtxtgrams.write('\n')
                outtxtgrams.write('\n')
                for iclu in range(len(ngrams_clusters)):
                    ngramcurrent = ', '.join(ngrams_clusters[iclu])
                    clustern = iclu + 1
                    outtxtgrams.write(f'ngram_{clustern}: {ngramcurrent}')
                    outtxtgrams.write('\n')

                patterntofilter = language + '-'

                lang_names_with_code = advdf.filter(like=patterntofilter, axis=1).columns
                
                print('langnameswithcodes',lang_names_with_code)
                for lang_name_with_code in lang_names_with_code:
                    print('adding column')
                    newcol = []
                    for index,row in df_with_parall.iterrows():
                        target = row['targ']
                        # print(row[lang_name_with_code])
                        if remove_accents(str(row[lang_name_with_code])) in source_token_words:
                            newcol.append(remove_accents(str(row[lang_name_with_code])))
                        elif row[lang_name_with_code] in full_words:
                            newcol.append(row[lang_name_with_code])
                        elif row[lang_name_with_code] == 'NOMATCH':
                            # print(f'Checking adv_head grams for sentence {target}')
                            anytrue = []
                            for cluster_values in ngrams_clusters:
                                ends_with_any = any(remove_accents(str(row['adv_head_transl'])).endswith(value) for value in cluster_values)
                                # Record the result so that the cluster lookup below can find a match
                                anytrue.append(ends_with_any)
                                # print(f"Head word {row['adv_head_transl']} ends with any element in {cluster_values}: {ends_with_any}")
                            if True in anytrue:
                                indexoftrue = anytrue.index(True)
                                foundcluster = ngrams_clusters[indexoftrue]
                                # print('Gram found')
                                newcol.append(f'ngram_{indexoftrue + 1}')
                            else:
                                newcol.append('NOMATCH')
                        else:
                            # print('Checking when grams')
                            anytrue = []
                            for cluster_values in ngrams_clusters:
                                ends_with_any = any(remove_accents(str(row[lang_name_with_code])).endswith(value) for value in cluster_values)
                                anytrue.append(ends_with_any)
                                # print(f"Word {row[lang_name_with_code]} ends with any element in {cluster_values}: {ends_with_any}")
                            if True in anytrue:
                                indexoftrue = anytrue.index(True)
                                foundcluster = ngrams_clusters[indexoftrue]
                                # print('Gram found')
                                newcol.append(f'ngram_{indexoftrue + 1}')
                                continue
                            else:
                                anytrue = []
                                # print(f'Checking adv_head grams for sentence {target}')
                                for cluster_values in ngrams_clusters:
                                    ends_with_any = any(remove_accents(str(row['adv_head_transl'])).endswith(value) for value in cluster_values)
                                    # print(f"Head word {row['adv_head_transl']} ends with any element in {cluster_values}: {ends_with_any}")
                                    anytrue.append(ends_with_any)
                                if True in anytrue:
                                    indexoftrue = anytrue.index(True)
                                    foundcluster = ngrams_clusters[indexoftrue]
                                    # print('Gram found')
                                    newcol.append(f'ngram_{indexoftrue + 1}')
                                else:
                                    if remove_accents(str(row[lang_name_with_code])) not in stopwords_target:
                                        # print('Gram not found')
                                        newcol.append(remove_accents(str(row[lang_name_with_code])))
                                    else:
                                        # print('Gram not found')
                                        newcol.append('NOMATCH')

                    # Write the refined column back (inside the loop, so every matching column is updated)
                    advdf[lang_name_with_code] = newcol
            except ValueError:
                advdf = advdf.drop(columns=lang_name_with_code)
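
The suffix lookup that the cell above repeats for the aligned word and for the translation of the adverbial head can be summarised in a small helper. This is a minimal sketch, not a drop-in replacement: the names `match_cluster` and `label_for` are hypothetical, the toy clusters and words are invented for illustration, and accent removal and stopword handling are left out.

```python
def match_cluster(word, ngrams_clusters):
    """Return the 1-based index of the first cluster containing a suffix of `word`, else None."""
    for index, cluster in enumerate(ngrams_clusters, start=1):
        if any(word.endswith(suffix) for suffix in cluster):
            return index
    return None


def label_for(word, head_translation, ngrams_clusters):
    """Try the aligned word first, then the translation of the adverbial head."""
    for candidate in (word, head_translation):
        # Skip missing values and null alignments
        if candidate and candidate != "NOMATCH":
            cluster = match_cluster(candidate, ngrams_clusters)
            if cluster is not None:
                return f"ngram_{cluster}"
    return None


clusters = [["spa", "pa"], ["kama"]]  # toy suffix clusters, not real data
print(label_for("NOMATCH", "rikuspa", clusters))
print(label_for("chaykama", None, clusters))
```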

### Save refined dataframes

This will write two new CSV files:
- {source_token}_withgrams.csv: the original parallel dataset, but with the columns for the target languages refined with n-grams.
- {source_token}_withgrams_selectedcols.csv: the original parallel dataset, but _only_ with the columns for the target languages refined with n-grams (besides `context`, `sent_id`, and the source language).

The former can then be used as usual to produce semantic maps (see other notebook).


In [8]:
advdf.to_csv(f'{outputdir}{source_token}_withgrams.csv',index=False)

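# Select target-language columns by their language-code prefix
# (a substring test would also match e.g. the code 'con' inside 'context')
selected_columns = [col for col in advdf.columns if col.split('-')[0] in targetlangs]
# 'eng-29' is the source-language column in this dataset
selected_columns = ['sent_id','context','eng-29'] + selected_columns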
advdf.to_csv(f'{outputdir}{source_token}_withgrams_selectedcols.csv',index=False,columns=selected_columns)
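
Column names in this dataset appear to follow a `<code>-<suffix>` pattern (e.g. `eng-29`). Under that assumption, the sketch below shows one way to pick out target-language columns by their prefix; the helper name `columns_for_languages` and the column names other than `eng-29` are hypothetical. Matching on the prefix matters because a plain substring test such as `lang in col` also matches the code `con` inside `context`.

```python
def columns_for_languages(columns, langs):
    """Keep the columns whose prefix before the first '-' is one of `langs`."""
    wanted = set(langs)
    return [col for col in columns if col.split("-")[0] in wanted]


columns = ["sent_id", "context", "eng-29", "con-1", "zty-2"]  # hypothetical column names
selected = columns_for_languages(columns, ["con", "zty"])
print(selected)
```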