Dec 7, 2024, janabruses@pitt.edu, Jana Bruses

In [35]:
# Load the pretrained GloVe vectors (plain-text file with no header line)
from gensim.models import KeyedVectors
filename = '/Users/janabruses/Documents/ling1330/glove/glove.6B.100d.txt'
glove_vecs = KeyedVectors.load_word2vec_format(filename, binary=False,
                                               no_header=True)

In [36]:
# Load the language list, keeping only letters and spaces and lowercasing each entry
import re
languages = []
with open('/Users/janabruses/Documents/ling1330/languages-list-EN.txt', 'r') as file:
    for line in file:
        languages.append((re.sub(r"[^A-Za-z ]", "", line)).lower())

I found a text file listing languages, loaded it, and split its entries into those that are in glove_vecs and those that are not, to avoid key errors later.

In [37]:
missing_lang = []
valid_lang = []
for lang in languages:
    if lang in glove_vecs:
        valid_lang.append(lang)
    else:
        missing_lang.append(lang)
print()
print("The missing languages that are in the language list but not in glove_vecs are:", missing_lang)


The missing languages that are in the language list but not in glove_vecs are: ['sign language']
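The only missing entry, 'sign language', is a two-word phrase, so a single-key lookup cannot find it even if both words are in the vocabulary separately. Below is a minimal sketch of the same normalize-and-split step. It assumes a tiny made-up vocabulary (`toy_vocab`) and made-up input lines (`raw_lines`) in place of glove_vecs and the real file; `normalize` and `split_by_vocab` are illustrative helper names, not part of the notebook.

```python
import re

def normalize(line):
    """Keep only ASCII letters and spaces, then lowercase (same regex as the cell above)."""
    return re.sub(r"[^A-Za-z ]", "", line).lower()

def split_by_vocab(names, vocab):
    """Partition names into (found, missing) by membership in vocab."""
    found, missing = [], []
    for name in names:
        if name in vocab:
            found.append(name)
        else:
            missing.append(name)
    return found, missing

# Hypothetical stand-ins for glove_vecs and the language file.
toy_vocab = {"spanish", "catalan", "sign", "language"}
raw_lines = ["Spanish\n", "Catalan\n", "Sign Language\n"]

names = [normalize(line) for line in raw_lines]
found, missing = split_by_vocab(names, toy_vocab)
print(found)    # ['spanish', 'catalan']
print(missing)  # ['sign language']
```

Any Python container that supports `in` can be passed as `vocab`, so the same helper should work with the loaded vectors as well as with a set.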


Although the file is called a language list, its entries are better described as national adjectives: "Spanish", for instance, does not necessarily appear in the context "Spanish language"; it could just as well appear in "Spanish law". Hence, from now on I'll be working with national_adj.

The next step was to build a dictionary in which each key is the first national adjective of a pair (nat_adj1) and the value is a list of tuples, each holding the adjective it is compared to (nat_adj2) and the similarity glove_vecs gives the pair.

In [38]:
national_similarity = {}
for adj1 in valid_lang:
    if adj1 not in national_similarity:
        national_similarity[adj1] = []
    for adj2 in valid_lang:
        if adj1 != adj2:
            similarity_score = glove_vecs.similarity(adj1, adj2)
            national_similarity[adj1].append((adj2, similarity_score))

def get_similarity(tpl):
    return tpl[1]
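The scores stored here come from gensim's `similarity` method, which returns the cosine similarity of the two word vectors, so they fall between -1 and 1. Here is a small pure-Python sketch of that measure, using made-up three-dimensional vectors (`toy_vectors`) rather than real GloVe vectors; `cosine_similarity` is an illustrative helper, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen to show the three extreme cases.
toy_vectors = {
    "a": [1.0, 2.0, 0.0],
    "b": [2.0, 4.0, 0.0],    # same direction as "a"
    "c": [-2.0, 1.0, 0.0],   # perpendicular to "a"
    "d": [-1.0, -2.0, 0.0],  # opposite direction to "a"
}

for other in ("b", "c", "d"):
    score = cosine_similarity(toy_vectors["a"], toy_vectors[other])
    print(f"a - {other}: {round(score, 6)}")
```

Because only the angle matters, "a" and "b" score as identical even though "b" is twice as long, and a negative score means the two vectors point in broadly opposite directions.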

I sorted each adjective's list by similarity, from highest to lowest, so that its most similar adjectives come first.

In [39]:
for key in national_similarity.keys():
    national_similarity[key].sort(key=get_similarity, reverse=True)

In [40]:
print()
print("The most similar nationality adjectives to Spanish according to glove_vecs are:")
print()
for adj, score in national_similarity['spanish'][:10]:
    print(f"""spanish - {adj}: {score}""")


The most similar nationality adjectives to Spanish according to glove_vecs are:

spanish - portuguese: 0.8205267786979675
spanish - italian: 0.7928807139396667
spanish - french: 0.7582119107246399
spanish - brazilian: 0.6808936595916748
spanish - dutch: 0.6707138419151306
spanish - catalan: 0.652790904045105
spanish - english: 0.626021683216095
spanish - german: 0.6057244539260864
spanish - greek: 0.6016337871551514
spanish - latin: 0.5984212756156921


In [41]:
print()
print("The most similar nationality adjectives to Catalan according to glove_vecs are:")
print()
for adj, score in national_similarity['catalan'][:10]:
    print(f"""catalan - {adj}: {score}""")


The most similar nationality adjectives to Catalan according to glove_vecs are:

catalan - spanish: 0.652790904045105
catalan - flemish: 0.639177143573761
catalan - portuguese: 0.6062145829200745
catalan - italian: 0.581230103969574
catalan - slovene: 0.5724879503250122
catalan - greek: 0.5056952834129333
catalan - french: 0.49839353561401367
catalan - dutch: 0.49797070026397705
catalan - berber: 0.49046844244003296
catalan - hungarian: 0.4877297878265381


In [42]:
print()
print("The most similar nationality adjectives to Mandarin according to glove_vecs are:")
print()
for adj, score in national_similarity['mandarin'][:10]:
    print(f"""mandarin - {adj}: {score}""")


The most similar nationality adjectives to Mandarin according to glove_vecs are:

mandarin - cantonese: 0.8423231840133667
mandarin - tagalog: 0.6539232134819031
mandarin - creole: 0.5836564898490906
mandarin - malay: 0.5460904836654663
mandarin - arabic: 0.5254192352294922
mandarin - english: 0.514771580696106
mandarin - hindi: 0.5106698274612427
mandarin - hawaiian: 0.49354326725006104
mandarin - javanese: 0.4853445589542389
mandarin - afrikaans: 0.472692608833313


In [43]:
print()
print("The most similar nationality adjectives to English according to glove_vecs are:")
print()
for adj, score in national_similarity['english'][:10]:
    print(f"""english - {adj}: {score}""")


The most similar nationality adjectives to English according to glove_vecs are:

english - welsh: 0.7520771026611328
english - irish: 0.7088090181350708
english - scottish: 0.6993111371994019
english - arabic: 0.6370865106582642
english - french: 0.6333723664283752
english - spanish: 0.626021683216095
english - dutch: 0.6117767095565796
english - portuguese: 0.5823111534118652
english - german: 0.5820319056510925
english - greek: 0.5720583200454712
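The four cells above repeat the same loop with a different word each time. One way to avoid the repetition is a small helper, sketched below. The names `top_similar` and `format_ranking` are hypothetical, and `toy_similarity` holds a few scores rounded from the Spanish output above, purely for illustration.

```python
def top_similar(similarity_lists, word, n=10):
    """Return the n highest-scoring (other_word, score) tuples for word."""
    pairs = similarity_lists[word]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

def format_ranking(similarity_lists, word, n=10):
    """Build one 'word - other: score' line per result, best match first."""
    return [f"{word} - {other}: {score}"
            for other, score in top_similar(similarity_lists, word, n)]

# Toy data in the same shape as national_similarity (scores rounded for illustration).
toy_similarity = {
    "spanish": [("catalan", 0.65), ("portuguese", 0.82), ("italian", 0.79)],
}

for line in format_ranking(toy_similarity, "spanish", n=2):
    print(line)
# spanish - portuguese: 0.82
# spanish - italian: 0.79
```

The helper sorts a copy on each call, so it does not depend on the dictionary having been sorted beforehand.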


\
Then I wanted to know which nationality adjectives were the most similar across all of these pairs.\
\
**Disclaimer:**\
The list of nationality adjectives is not fully consistent. Because it was originally a list of languages, it includes adjectives such as "Catalan", which does not name a nation, yet it leaves out "Basque", a comparable regional adjective.

To analyze the pairs, I built a list of nested tuples of the form (nat_adj1, (nat_adj2, similarity)) and then sorted it by similarity:

In [44]:
all_pairs = []
for key in national_similarity.keys():
    for adj, score in national_similarity[key]:
        all_pairs.append((key, (adj, score)))

In [45]:
def get_score(item):
    return item[1][1]

In [46]:
print()
print("The most similar nationality adjective pairs according to glove_vecs are:")
print()
all_pairs.sort(key = get_score, reverse = True)
for adj1, (adj2, score) in all_pairs[:10]:
    print(f"{adj1} - {adj2}: {score}")
print()


The most similar nationality adjective pairs according to glove_vecs are:

malayalam - telugu: 0.8990049958229065
telugu - malayalam: 0.8990049958229065
bulgarian - romanian: 0.8581596612930298
bulgarian - romanian: 0.8581596612930298
romanian - bulgarian: 0.8581596612930298
romanian - bulgarian: 0.8581596612930298
cantonese - mandarin: 0.8423231840133667
mandarin - cantonese: 0.8423231840133667
hungarian - romanian: 0.8410434126853943
romanian - hungarian: 0.8410434126853943



In [47]:
print()
print("The least similar nationality adjectives according to glove_vecs are:")
print()
for adj1, (adj2, score) in all_pairs[-10:]:
    print(f"{adj1} - {adj2}: {score}")
print()


The least similar nationality adjectives according to glove_vecs are:

panjabi - egyptian: -0.15936380624771118
egyptian - panjabi: -0.15936380624771118
panjabi - korean: -0.16060343384742737
korean - panjabi: -0.16060343384742737
panjabi - spanish: -0.16152037680149078
spanish - panjabi: -0.16152037680149078
panjabi - brazilian: -0.1739526391029358
brazilian - panjabi: -0.1739526391029358
bosnian - hawaiian: -0.19012433290481567
hawaiian - bosnian: -0.19012433290481567
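Every score in these rankings appears at least twice, because the similarity of (a, b) equals that of (b, a) and both orderings are stored. A repeated entry in the language list may add further copies. Below is a sketch of one way to score each unordered pair exactly once with `itertools.combinations`. The scores in `toy_scores` are made up, and `toy_similarity` is a hypothetical stand-in for `glove_vecs.similarity`.

```python
from itertools import combinations

def unique_pair_scores(words, similarity):
    """Score every unordered pair of distinct words once, highest score first."""
    unique_words = list(dict.fromkeys(words))  # drop repeated entries, keep order
    scored = [(a, b, similarity(a, b)) for a, b in combinations(unique_words, 2)]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored

# Made-up symmetric scores, keyed by unordered pair.
toy_scores = {
    frozenset(("cantonese", "mandarin")): 0.84,
    frozenset(("cantonese", "hawaiian")): 0.10,
    frozenset(("mandarin", "hawaiian")): 0.49,
}

def toy_similarity(a, b):
    """Hypothetical stand-in for glove_vecs.similarity."""
    return toy_scores[frozenset((a, b))]

words = ["cantonese", "mandarin", "hawaiian", "mandarin"]  # note the repeated entry
ranked = unique_pair_scores(words, toy_similarity)
for a, b, score in ranked:
    print(f"{a} - {b}: {score}")
```

With the real data, the call would presumably be `unique_pair_scores(valid_lang, glove_vecs.similarity)`, and the top and bottom of the ranking could then be read from `ranked[:10]` and `ranked[-10:]` without mirrored rows.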



### Conclusion:
Nationality adjectives whose languages share linguistic roots, such as "malayalam - telugu" and "cantonese - mandarin", consistently scored high in similarity. This matches my expectations and speaks well of the model's performance.\
\
Although shared roots seem to be the main determiner of similarity, some high-scoring pairs are more surprising. As mentioned earlier, however, "Romanian" cannot be read only as the Romanian language. The high Hungarian - Romanian score therefore more likely reflects similar socio-political and cultural contexts, or simply the two nations' geographical proximity.
\
The lowest similarity scores also reflect the model's ability to capture semantic distinctions.
At the very bottom is hawaiian - bosnian, a pair with no linguistic, cultural or geographical connection, and one I would also have considered very different myself.\
\
It is very insightful to see regional clusters in these outputs. The adjectives most similar to Catalan include Spanish, Portuguese, Italian and French (all Mediterranean cultures and Romance languages).\
\
English's most similar adjectives are also very intuitive: Welsh, Irish and Scottish (all from the British Isles) sit at the top of its list.\
\
A surprising result for me was that the similarity between Spanish and Catalan is only about 0.65, lower than Spanish's similarity to Portuguese, Italian and French, and even to Brazilian and Dutch. Considering their geographic and cultural proximity, I had assumed it would be higher.
