<center>
<h1>Challenges in NLP for South Asian Languages</h1>
<h2>Final Project: Semantic similarity between Tamil multiword tokens and their translations in English and German</h2>
<h3>Professors: Dr Tafseer Ahmed, Dr K Sarveswaran</h3>
<h3>Name: Iro-Georgia Malta</h3>
<h3>Winter Semester 2023/2024</h3>
</center>

As a first step, I collected various multiword tokens from the Tamil treebank from the Universal Dependencies (UD) website: https://github.com/UniversalDependencies/UD_Tamil-TTB/tree/master
<br>
To process the conllu file, I use the library **pyconll**. Documentation of the library: https://pyconll.readthedocs.io/en/latest/pyconll/unit/token.html

In [1]:
!pip install pyconll



## Libraries

In [2]:
import pyconll
import pandas as pd

## Creation of a test set for Tamil multiword token forms
To create a test set for my experiment, I extracted multiword tokens from the Tamil treebank file in the Universal Dependencies (UD) format. In the file, lines representing multiword tokens are structured in the following manner:

<b>Example:</b> <br>
11-12	நிலையத்துக்குக்கான	_	_	_	_	_	_	_	_ <br>
11	நிலையத்துக்குக்க்	நிலையம்	NOUN	NND-3SN--	Case=Dat|Gender=Neut|Number=Sing|Person=3	12	nmod	12:nmod:dat	Translit=nilaiyattukkukk|LTranslit=nilaiyam <br>
12	ஆன	ஆகு	PART	Tg-------	_	13	nmod	13:nmod	Translit=āna|LTranslit=āku

In this corpus, multiword tokens are identified by a combination of numbers, known as their **ids**. For example, "11-12" indicates that the next two lines (11 and 12) contain the parts of the multiword token.

In the code, it first iterates through the sentences of the corpus and then the tokens. The **pyconll** library provides the method **token.is_multiword()**, which checks if the current token being processed is a multiword token. To ensure the correct identification of multiword tokens, the **token.id** is extracted and stored. Additionally, the **sentence.id** is stored to confirm the origin of the multiword token. <br>
The **sentence.id**, **token.id**, and **token.form** are stored as tuples in the list **multiword_tokens**. By being stored in a list, the sentences and tokens retain their order from the treebank file. In total, there are **520** multiword tokens.

In [3]:
multiword_tokens = [] # list where the multiword tokens are stored
corpus = pyconll.load_from_file('ta_ttb-ud-train.conllu')

for sentence in corpus:
    for token in sentence:
        if token.is_multiword(): # Boolean, pyconll: checks if the token is a multiword token
            multiword_tokens.append((sentence.id, token.id, token.form))
            
# print(multiword_tokens)

In [4]:
# Convert the data to a pandas dataFrame

df_ta_multiword = pd.DataFrame(multiword_tokens, columns = ['sentence id', 'token id', 'token form'])

# Create a csv file and xlsx file from the pandas df

df_ta_multiword.to_csv('tamil_mutliword_tokens.csv', index=False)
df_ta_multiword.to_excel('tamil_mutliword_tokens.xlsx', index=False)

## Tamil test set with multiword tokens and their parts


In [5]:
multiword_tokens_parts = []
corpus = pyconll.load_from_file('ta_ttb-ud-train.conllu')

for sentence in corpus:
    multiword_token = None  # Initialize a variable to store the multiword token
    for token in sentence:
        if '-' in token.id:
            multiword_token = token  # Store the multiword token
        elif multiword_token is not None:
            # Process the multiword token along with its parts
            multiword_range = multiword_token.id.split('-')
            start_id = int(multiword_range[0])
            end_id = int(multiword_range[1])
            
            # Check if the token ID is within the range of the multiword token
            if start_id <= int(token.id) <= end_id:
                multiword_tokens_parts.append((
                    sentence.id, 
                    multiword_token.id, 
                    multiword_token.form, 
                    token.id, 
                    token.form, 
                    token.upos))

In [6]:
# Convert the data to a pandas dataFrame

df_ta_multiword_parts = pd.DataFrame(multiword_tokens_parts, columns = ['sentence id', 
                                                                  'multiword id', 
                                                                  'multiword form',
                                                                  'token id', 
                                                                  'token form', 
                                                                  'token POS'])

# Create a csv file and xlsx file from the pandas df

# df_ta_multiword_parts.to_csv('tamil_mutliword_parts.csv', index=False)
# df_ta_multiword_parts.to_excel('tamil_mutliword_parts.xlsx', index=False) 

## Extracting specific multiword tokens for the project

From the **520** multiword tokens, I have decided to work instead with **20** due to time restrictions. I was provided by Professor Sarveswaran K with a list of **50** multiword tokens out of the **520**. Using this list, I extract the **50** tokens from the *df_ta_multiword_parts* data frame in order to keep the tokens and their corresponding POS tags and from the *df_ta_multiword* only the multiword token forms.

In [7]:
# Load excel file with 50 tokens
tamil_df_50 = pd.read_excel('Tamil-Multiword-list.xlsx') 

# Extract the multiword form column with the tokens and store it as a list
tamil_forms_list = tamil_df_50['multiword form'].tolist()

# Create an empty data frame, where the columns containing the tokens will be stored
final_df = pd.DataFrame(columns=df_ta_multiword_parts.columns)
final_df_onlyForms = pd.DataFrame(columns=df_ta_multiword.columns)


for token in tamil_forms_list:
    if token in df_ta_multiword_parts['multiword form'].values and \
    token in df_ta_multiword['token form'].values: # Extracting multiword tokens and their parts
        rows_containing_token = df_ta_multiword_parts[df_ta_multiword_parts['multiword form'] == token]
        rows_containing_token2 = df_ta_multiword[df_ta_multiword['token form'] == token]
        final_df = pd.concat([final_df, rows_containing_token])
        final_df_onlyForms = pd.concat([final_df_onlyForms, rows_containing_token2])
        
# Save the new data frame
#final_df.to_csv('final_df_tamil_tokens.csv', index=False)
#final_df.to_excel('final_df_tamil_tokens.xlsx', index=False)
#final_df_onlyForms.to_csv('multiword_tokens_forms_extracted.csv', index=False)
#final_df_onlyForms.to_excel('multiword_tokens_forms_extracted.xlsx', index=False)

## Models

### Sentence-Bert Multilingual

For the evaluation of multiword tokens from Tamil and their translations in English and German, I have chosen the **paraphrase-multilingual-mpnet-base-v2** model from sentence transformers (https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2). This model specializes in measuring sentence similarity using cosine similarity as the metric.<br>
To conduct the evaluation, I have prepared a test set consisting of **30** multiword tokens, along with their corresponding English and German translations. Notably, the translations were contextualized within the sentences containing the respective tokens.

In [8]:
!pip install -U sentence-transformers



In [9]:
import sentence_transformers
import itertools
from sentence_transformers import SentenceTransformer, util
model_sbert = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Load the test set 
# columns: sentence id, token id, token form, Translation EN, Translation DE, sentence, POS tags
tamil_multiwords_translations = pd.read_excel('30_multiwords_ta_EN_DE.xlsx')

# Store the token form, translation EN and translation DE columns into lists
token_forms_list = tamil_multiwords_translations['token form'].tolist()
translation_EN_list = tamil_multiwords_translations['Translation EN'].tolist()
translation_DE_list = tamil_multiwords_translations['Translation DE'].tolist()

In [10]:
# Function for embeddings generation and cosine similarity calculation for TA & EN and TA & DE

def cosine_similary_ta_en_de(multiwords_list, translation_list):
    embeddings_ta = model_sbert.encode(multiwords_list)
    embeddings_translation = model_sbert.encode(translation_list)
    
    results = []
    similarities = []
    
    # Calculate cosine similarity
    for e1, e2, token, trans in zip(embeddings_ta, embeddings_translation, multiwords_list, translation_list):
        similarity = float(util.pytorch_cos_sim(e1, e2))
        results.append({'Tamil': token, 'Translation': trans, 'Cosine Similarity': similarity})
        similarities.append(similarity)
        
    # Calculate descriptive Statistic measures        
    max_similarity = max(similarities)
    min_similarity = min(similarities)
    average_similarity = sum(similarities) / len(similarities)
    
    return results, min_similarity, average_similarity, max_similarity

In [11]:
# TA and EN cosine similarity results

results_ta_en, min_similarity_en, average_similarity_en, max_similarity_en = cosine_similary_ta_en_de(token_forms_list, 
                                                                                                      translation_EN_list)

for result in results_ta_en:
    print(f"Tamil: {result['Tamil']}\n"
          f"EN: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-EN: {min_similarity_en}\n"
      f"Average Similarity TA-EN: {average_similarity_en}\n"
      f"Maximum Similarity TA-EN: {max_similarity_en}")

Tamil: அடைக்கப்பட்டிருந்த
EN: who were detained
Cosine Similarity: 0.38040003180503845

Tamil: அதிகரிக்கப்பட்டுள்ளது
EN: it was increased
Cosine Similarity: 0.7994646430015564

Tamil: அழைக்கப்பட்டிருக்கிறார்
EN: he has been invited
Cosine Similarity: 0.6190996170043945

Tamil: அறிவிக்கப்பட்டுள்ளது
EN: it has been announced
Cosine Similarity: 0.7870899438858032

Tamil: அறிவித்துள்ளனர்
EN: they have informed 
Cosine Similarity: 0.7523360252380371

Tamil: அனுப்பப்பட்டுள்ளனர்
EN: they have been sent
Cosine Similarity: 0.8324966430664062

Tamil: அனுப்பியுள்ளோம்
EN: we have sent
Cosine Similarity: 0.8369427919387817

Tamil: ஆலோசிக்கப்படுவதாக
EN: he had said
Cosine Similarity: 0.5008760690689087

Tamil: இடம்பெற்றிருந்தது
EN: was featured
Cosine Similarity: 0.6667640805244446

Tamil: உத்தரவிடப்பட்டுள்ளது
EN: it has been ordered
Cosine Similarity: 0.6580705642700195

Tamil: உயர்ந்திருக்கிறது
EN: it has increased
Cosine Similarity: 0.6144263744354248

Tamil: உயிரிழந்துள்ளனர்
EN: they have been k

In [12]:
# TA and DE cosine similarity results

results_ta_de, min_similarity_de, average_similarity_de, max_similarity_de = cosine_similary_ta_en_de(token_forms_list, 
                                                                                                      translation_DE_list)

for result in results_ta_de:
    print(f"Tamil: {result['Tamil']}\n"
          f"DE: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")
    
print(f"Minimum Similarity TA-DE: {min_similarity_de}\n"
      f"Average Similarity TA-DE: {average_similarity_de}\n"
      f"Maximum Similarity TA-DE: {max_similarity_de}")

Tamil: அடைக்கப்பட்டிருந்த
DE: die festgehalten wurden
Cosine Similarity: 0.7735199928283691

Tamil: அதிகரிக்கப்பட்டுள்ளது
DE: er wurde erhöht
Cosine Similarity: 0.8120787143707275

Tamil: அழைக்கப்பட்டிருக்கிறார்
DE: er wurde eingeladen
Cosine Similarity: 0.686138391494751

Tamil: அறிவிக்கப்பட்டுள்ளது
DE: es wurde angekündigt
Cosine Similarity: 0.8980442881584167

Tamil: அறிவித்துள்ளனர்
DE: sie haben bekannt gegeben
Cosine Similarity: 0.8889493346214294

Tamil: அனுப்பப்பட்டுள்ளனர்
DE: sie wurden geschickt
Cosine Similarity: 0.9577236175537109

Tamil: அனுப்பியுள்ளோம்
DE: wir haben geschickt
Cosine Similarity: 0.9422879815101624

Tamil: ஆலோசிக்கப்படுவதாக
DE: er hatte erklärt
Cosine Similarity: 0.5989529490470886

Tamil: இடம்பெற்றிருந்தது
DE: vorgestellt wurde
Cosine Similarity: 0.7826376557350159

Tamil: உத்தரவிடப்பட்டுள்ளது
DE: sie wurde angeordnet
Cosine Similarity: 0.794486939907074

Tamil: உயர்ந்திருக்கிறது
DE: sie ist gestiegen
Cosine Similarity: 0.8414753675460815

Tamil: உயிரிழந்து

### Multilingual-e5-base
The second model I decided to use to generate embeddings for Tamil multiword tokens and measure their similarity to the embeddings of their translations in English and German is **Multilingual-e5-base** (https://huggingface.co/intfloat/multilingual-e5-base).

In [13]:
from sentence_transformers import SentenceTransformer
model_multi_e5 = SentenceTransformer('intfloat/multilingual-e5-base')

**Multilingual-e5-base** processes the data in a specific input way. Specifically, for symmetric tasks like semantic similarity or paraphrase retrieval, the input texts must have the "query" prefix in order to generate embeddings effectively. For example:

```python
input_texts = ['query: அடைக்கப்பட்டிருந்த',
               'query: who were detained']
```

In [14]:
# Function that prepares the data for the model, generates embeddings, and calculates cosine similarity

def embed_multiling_e5_base(multiwords_list, translation_list):
    input_ta = ['query:' + token for token in multiwords_list]
    input_trans = ['query:' + trans for trans in translation_list]
    embeddings_ta = model_multi_e5.encode(input_ta, normalize_embeddings=True)
    embeddings_translation = model_multi_e5.encode(input_trans, normalize_embeddings=True)
    
    results = []
    similarities = []
        
    for e1, e2, token, trans in zip(embeddings_ta, embeddings_translation, multiwords_list, translation_list):
        similarity = float(util.pytorch_cos_sim(e1, e2))
        results.append({'Tamil': token, 'Translation': trans, 'Cosine Similarity': similarity})
        similarities.append(similarity)
        
    # Calculate descriptive Statistic measures        
    max_similarity = max(similarities)
    min_similarity = min(similarities)
    average_similarity = sum(similarities) / len(similarities)
    
    return results, min_similarity, average_similarity, max_similarity

In [15]:
# TA and EN cosine similarity results from multilingual-e5-base

results_ta_en_2, min_similarity_en_2, average_similarity_en_2, max_similarity_en_2 = embed_multiling_e5_base(token_forms_list, 
                                                                                                             translation_EN_list)

for result in results_ta_en_2:
    print(f"Tamil: {result['Tamil']}\n"
          f"EN: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-EN: {min_similarity_en_2}\n"
      f"Average Similarity TA-EN: {average_similarity_en_2}\n"
      f"Maximum Similarity TA-EN: {max_similarity_en_2}")

Tamil: அடைக்கப்பட்டிருந்த
EN: who were detained
Cosine Similarity: 0.7933784127235413

Tamil: அதிகரிக்கப்பட்டுள்ளது
EN: it was increased
Cosine Similarity: 0.8405553698539734

Tamil: அழைக்கப்பட்டிருக்கிறார்
EN: he has been invited
Cosine Similarity: 0.8743373155593872

Tamil: அறிவிக்கப்பட்டுள்ளது
EN: it has been announced
Cosine Similarity: 0.8408982157707214

Tamil: அறிவித்துள்ளனர்
EN: they have informed 
Cosine Similarity: 0.8870394229888916

Tamil: அனுப்பப்பட்டுள்ளனர்
EN: they have been sent
Cosine Similarity: 0.8673956990242004

Tamil: அனுப்பியுள்ளோம்
EN: we have sent
Cosine Similarity: 0.8825974464416504

Tamil: ஆலோசிக்கப்படுவதாக
EN: he had said
Cosine Similarity: 0.8251854777336121

Tamil: இடம்பெற்றிருந்தது
EN: was featured
Cosine Similarity: 0.8225153684616089

Tamil: உத்தரவிடப்பட்டுள்ளது
EN: it has been ordered
Cosine Similarity: 0.8432210683822632

Tamil: உயர்ந்திருக்கிறது
EN: it has increased
Cosine Similarity: 0.8392896056175232

Tamil: உயிரிழந்துள்ளனர்
EN: they have been ki

In [19]:
# TA and DE cosine similarity results from multilingual-e5-base

results_ta_de_3, min_similarity_de_3, average_similarity_de_3, max_similarity_de_3 = embed_multiling_e5_base(token_forms_list, 
                                                                                                             translation_DE_list)

for result in results_ta_de_3:
    print(f"Tamil: {result['Tamil']}\n"
          f"DE: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-DE: {min_similarity_de_3}\n"
      f"Average Similarity TA-DE: {average_similarity_de_3}\n"
      f"Maximum Similarity TA-DE: {max_similarity_de_3}")

Tamil: அடைக்கப்பட்டிருந்த
DE: die festgehalten wurden
Cosine Similarity: 0.8398186564445496

Tamil: அதிகரிக்கப்பட்டுள்ளது
DE: er wurde erhöht
Cosine Similarity: 0.8287434577941895

Tamil: அழைக்கப்பட்டிருக்கிறார்
DE: er wurde eingeladen
Cosine Similarity: 0.8550869226455688

Tamil: அறிவிக்கப்பட்டுள்ளது
DE: es wurde angekündigt
Cosine Similarity: 0.8248528242111206

Tamil: அறிவித்துள்ளனர்
DE: sie haben bekannt gegeben
Cosine Similarity: 0.8936783671379089

Tamil: அனுப்பப்பட்டுள்ளனர்
DE: sie wurden geschickt
Cosine Similarity: 0.8826197385787964

Tamil: அனுப்பியுள்ளோம்
DE: wir haben geschickt
Cosine Similarity: 0.8963832259178162

Tamil: ஆலோசிக்கப்படுவதாக
DE: er hatte erklärt
Cosine Similarity: 0.8279762864112854

Tamil: இடம்பெற்றிருந்தது
DE: vorgestellt wurde
Cosine Similarity: 0.8159857392311096

Tamil: உத்தரவிடப்பட்டுள்ளது
DE: sie wurde angeordnet
Cosine Similarity: 0.8309434652328491

Tamil: உயர்ந்திருக்கிறது
DE: sie ist gestiegen
Cosine Similarity: 0.8292589783668518

Tamil: உயிரிழந்

## Testing of the models with corrected translations

In [61]:
# Load the test set 
# columns: sentence id, token id, token form, Translation EN, Translation DE, sentence, POS tags
tamil_multiwords_translations_2 = pd.read_excel('30_multiwords_ta_EN_DE_corrected.xlsx')

# Store the token form, translation EN and translation DE columns into lists
token_forms_list_2 = tamil_multiwords_translations_2['token form'].tolist()
translation_EN_list_2 = tamil_multiwords_translations_2['Translation EN'].tolist()
translation_DE_list_2 = tamil_multiwords_translations_2['Translation DE'].tolist()

### Sentence-Bert Multilingual

In [62]:
# TA and EN cosine similarity results with corrected translations

results_ta_en_4, min_similarity_en_4, average_similarity_en_4, max_similarity_en_4 = cosine_similary_ta_en_de(token_forms_list_2, 
                                                                                                      translation_EN_list_2)

for result in results_ta_en_4:
    print(f"Tamil: {result['Tamil']}\n"
          f"EN: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-EN: {min_similarity_en_4}\n"
      f"Average Similarity TA-EN: {average_similarity_en_4}\n"
      f"Maximum Similarity TA-EN: {max_similarity_en_4}")

Tamil: அடைக்கப்பட்டிருந்த
EN: were detained
Cosine Similarity: 0.39830151200294495

Tamil: அதிகரிக்கப்பட்டுள்ளது
EN: it was increased
Cosine Similarity: 0.7994646430015564

Tamil: அழைக்கப்பட்டிருக்கிறார்
EN: he has been invited
Cosine Similarity: 0.6190996170043945

Tamil: அறிவிக்கப்பட்டுள்ளது
EN: it has been announced
Cosine Similarity: 0.7870899438858032

Tamil: அறிவித்துள்ளனர்
EN: they have informed 
Cosine Similarity: 0.7523360252380371

Tamil: அனுப்பப்பட்டுள்ளனர்
EN: they have been sent
Cosine Similarity: 0.8324966430664062

Tamil: அனுப்பியுள்ளோம்
EN: we have sent
Cosine Similarity: 0.8369427919387817

Tamil: ஆலோசிக்கப்படுவதாக
EN: was being discussed
Cosine Similarity: 0.6270827054977417

Tamil: இடம்பெற்றிருந்தது
EN: was featured
Cosine Similarity: 0.6667640805244446

Tamil: உத்தரவிடப்பட்டுள்ளது
EN: it has been ordered
Cosine Similarity: 0.6580705642700195

Tamil: உயர்ந்திருக்கிறது
EN: it has increased
Cosine Similarity: 0.6144263744354248

Tamil: உயிரிழந்துள்ளனர்
EN: they have be

In [63]:
# TA and DE cosine similarity results with corrected translations

results_ta_de_5, min_similarity_de_5, average_similarity_de_5, max_similarity_de_5 = cosine_similary_ta_en_de(token_forms_list_2, 
                                                                                                      translation_DE_list_2)

for result in results_ta_de_5:
    print(f"Tamil: {result['Tamil']}\n"
          f"DE: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")
    
print(f"Minimum Similarity TA-DE: {min_similarity_de_5}\n"
      f"Average Similarity TA-DE: {average_similarity_de_5}\n"
      f"Maximum Similarity TA-DE: {max_similarity_de_5}")

Tamil: அடைக்கப்பட்டிருந்த
DE:  wurden festgehalten 
Cosine Similarity: 0.776469349861145

Tamil: அதிகரிக்கப்பட்டுள்ளது
DE: er wurde erhöht
Cosine Similarity: 0.8120787143707275

Tamil: அழைக்கப்பட்டிருக்கிறார்
DE: er wurde eingeladen
Cosine Similarity: 0.686138391494751

Tamil: அறிவிக்கப்பட்டுள்ளது
DE: es wurde angekündigt
Cosine Similarity: 0.8980442881584167

Tamil: அறிவித்துள்ளனர்
DE: sie haben bekannt gegeben
Cosine Similarity: 0.8889493346214294

Tamil: அனுப்பப்பட்டுள்ளனர்
DE: sie wurden geschickt
Cosine Similarity: 0.9577236175537109

Tamil: அனுப்பியுள்ளோம்
DE: wir haben geschickt
Cosine Similarity: 0.9422879815101624

Tamil: ஆலோசிக்கப்படுவதாக
DE: wurde erörtert
Cosine Similarity: 0.7718687653541565

Tamil: இடம்பெற்றிருந்தது
DE: hat stattgefunden
Cosine Similarity: 0.8956584334373474

Tamil: உத்தரவிடப்பட்டுள்ளது
DE: sie wurde angeordnet
Cosine Similarity: 0.794486939907074

Tamil: உயர்ந்திருக்கிறது
DE: sie ist gestiegen
Cosine Similarity: 0.8414753675460815

Tamil: உயிரிழந்துள்ளனர

### Multilingual-e5-base

In [64]:
# TA and EN cosine similarity results from multilingual-e5-base with corrected translations

results_ta_en_6, min_similarity_en_6, average_similarity_en_6, max_similarity_en_6 = embed_multiling_e5_base(token_forms_list_2, 
                                                                                                             translation_EN_list_2)

for result in results_ta_en_6:
    print(f"Tamil: {result['Tamil']}\n"
          f"EN: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-EN: {min_similarity_en_6}\n"
      f"Average Similarity TA-EN: {average_similarity_en_6}\n"
      f"Maximum Similarity TA-EN: {max_similarity_en_6}")

Tamil: அடைக்கப்பட்டிருந்த
EN: were detained
Cosine Similarity: 0.8353263139724731

Tamil: அதிகரிக்கப்பட்டுள்ளது
EN: it was increased
Cosine Similarity: 0.8405553698539734

Tamil: அழைக்கப்பட்டிருக்கிறார்
EN: he has been invited
Cosine Similarity: 0.8743373155593872

Tamil: அறிவிக்கப்பட்டுள்ளது
EN: it has been announced
Cosine Similarity: 0.8408982157707214

Tamil: அறிவித்துள்ளனர்
EN: they have informed 
Cosine Similarity: 0.8870394229888916

Tamil: அனுப்பப்பட்டுள்ளனர்
EN: they have been sent
Cosine Similarity: 0.8673956990242004

Tamil: அனுப்பியுள்ளோம்
EN: we have sent
Cosine Similarity: 0.8825974464416504

Tamil: ஆலோசிக்கப்படுவதாக
EN: was being discussed
Cosine Similarity: 0.8547957539558411

Tamil: இடம்பெற்றிருந்தது
EN: was featured
Cosine Similarity: 0.8225153684616089

Tamil: உத்தரவிடப்பட்டுள்ளது
EN: it has been ordered
Cosine Similarity: 0.8432210683822632

Tamil: உயர்ந்திருக்கிறது
EN: it has increased
Cosine Similarity: 0.8392896056175232

Tamil: உயிரிழந்துள்ளனர்
EN: they have bee

In [65]:
# TA and DE cosine similarity results from multilingual-e5-base with corrected translations

results_ta_de_7, min_similarity_de_7, average_similarity_de_7, max_similarity_de_7 = embed_multiling_e5_base(token_forms_list_2, 
                                                                                                             translation_DE_list_2)

for result in results_ta_de_7:
    print(f"Tamil: {result['Tamil']}\n"
          f"DE: {result['Translation']}\n"
          f"Cosine Similarity: {result['Cosine Similarity']}\n")

print(f"Minimum Similarity TA-DE: {min_similarity_de_7}\n"
      f"Average Similarity TA-DE: {average_similarity_de_7}\n"
      f"Maximum Similarity TA-DE: {max_similarity_de_7}")

Tamil: அடைக்கப்பட்டிருந்த
DE:  wurden festgehalten 
Cosine Similarity: 0.8372063636779785

Tamil: அதிகரிக்கப்பட்டுள்ளது
DE: er wurde erhöht
Cosine Similarity: 0.8287434577941895

Tamil: அழைக்கப்பட்டிருக்கிறார்
DE: er wurde eingeladen
Cosine Similarity: 0.8550869226455688

Tamil: அறிவிக்கப்பட்டுள்ளது
DE: es wurde angekündigt
Cosine Similarity: 0.8248528242111206

Tamil: அறிவித்துள்ளனர்
DE: sie haben bekannt gegeben
Cosine Similarity: 0.8936783671379089

Tamil: அனுப்பப்பட்டுள்ளனர்
DE: sie wurden geschickt
Cosine Similarity: 0.8826197385787964

Tamil: அனுப்பியுள்ளோம்
DE: wir haben geschickt
Cosine Similarity: 0.8963832259178162

Tamil: ஆலோசிக்கப்படுவதாக
DE: wurde erörtert
Cosine Similarity: 0.8881934285163879

Tamil: இடம்பெற்றிருந்தது
DE: hat stattgefunden
Cosine Similarity: 0.8335651159286499

Tamil: உத்தரவிடப்பட்டுள்ளது
DE: sie wurde angeordnet
Cosine Similarity: 0.8309434652328491

Tamil: உயர்ந்திருக்கிறது
DE: sie ist gestiegen
Cosine Similarity: 0.8292589783668518

Tamil: உயிரிழந்துள்

### Storing of the results in a dataframe

In [90]:
# # Tamil multi-word tokens
# tamil_test_tokens = [result['Tamil'] for result in results_ta_en]
# # Tamil translations in EN
# EN_trans_not_correct = [result['Translation'] for result in results_ta_en]
# # Tamil translations in DE
# DE_trans_not_correct = [result['Translation'] for result in results_ta_de]
# # Tamil translations in EN correct
# EN_trans_correct = [result['Translation'] for result in results_ta_en_4]
# # Tamil translations in DE
# DE_trans_correct = [result['Translation'] for result in results_ta_de_5]

# # Results of model 1 in EN and DE, not correct translations
# ta_en_model1 = [result['Cosine Similarity'] for result in results_ta_en] # EN
# ta_de_model1 = [result['Cosine Similarity'] for result in results_ta_de] # DE

# # Results of model 2 in EN and DE, not correct translations
# ta_en_model2 = [result['Cosine Similarity'] for result in results_ta_en_2] # EN
# ta_de_model2 = [result['Cosine Similarity'] for result in results_ta_de_3] # DE

# # Results of model 1 in EN and DE, correct translations
# ta_en_model1b = [result['Cosine Similarity'] for result in results_ta_en_4] # EN
# ta_de_model1b = [result['Cosine Similarity'] for result in results_ta_de_5] # DE

# # Results of model 2 in EN and DE, correct translations
# ta_en_model2b = [result['Cosine Similarity'] for result in results_ta_en_6] # EN
# ta_de_model2b = [result['Cosine Similarity'] for result in results_ta_de_7] # DE

In [91]:
# # Create a dictionary with column names as keys and corresponding lists as values
# # Model 1 = SBert results
# # Model 2 = multilinguial e5 base
# data_results = {
#     'Tamil Tokens': tamil_test_tokens,
#     'EN Translation (Incorrect)': EN_trans_not_correct,
#     'DE Translation (Incorrect)': DE_trans_not_correct,
#     'EN Translation (Correct)': EN_trans_correct,
#     'DE Translation (Correct)': DE_trans_correct,
#     'Cosine Similarity (TA-EN Model 1)': ta_en_model1,
#     'Cosine Similarity (TA-DE Model 1)': ta_de_model1,
#     'Cosine Similarity (TA-EN Model 2)': ta_en_model2,
#     'Cosine Similarity (TA-DE Model 2)': ta_de_model2,
#     'Cosine Similarity (TA-EN Model 1: correct translations)': ta_en_model1b,
#     'Cosine Similarity (TA-DE Model 1: correct translations)': ta_de_model1b,
#     'Cosine Similarity (TA-EN Model 2: correct translations)': ta_en_model2b,
#     'Cosine Similarity (TA-DE Model 2: correct translations)': ta_de_model2b,
# }

# # Create a DataFrame
# df_results_ta_de = pd.DataFrame(data_results)

In [92]:
# # Saving the results in an excel file
# df_results_ta_de.to_excel('results_ta_de.xlsx', index=False)