Here's a notebook to explore data from the paper "Cultural influences on word meanings revealed through large-scale semantic alignment": https://www.nature.com/articles/s41562-020-0924-8. The data come from: https://osf.io/tngba/.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('./alignments-nel-wiki-trl.csv')

### Data Fields:

* `l1`: Language L1 (ISO2)
* `l2`: Language L2 (ISO2)
* `Concept_ID`: NorthEuraLex Concept ID
* `alignment`: **Semantic Alignment**
* `wordform_l1`: Wordform in Language `l1`
* `wordform_l2`: Wordform in Language `l2`
* `neighbour_overlap`: No. of translation pairs that are in the top `k` semantic neighbours of `Concept_ID` in both `l1` and `l2`
* `editdistance`: Edit distance between `wordform_l1` and `wordform_l2`
* `k`: Neighbour search depth of the algorithm used to compute alignment
* `n`: No. of translation pairs between `l1` and `l2` for which parallel embeddings were available
* `zipf_freq_l1`: Language-specific word frequency (*Zipf frequency*) of `wordform_l1` (from [wordfreq](https://github.com/LuminosoInsight/wordfreq))
* `zipf_freq_l2`: Language-specific word frequency (*Zipf frequency*) of `wordform_l2` (from [wordfreq](https://github.com/LuminosoInsight/wordfreq))
* `passes_concept_filter`: `True` if alignment for `Concept_ID` could be calculated in at least 20 languages
* `passes_wiki_filter`: `True` if the Wikipedia corpus for this language passes quality control criteria
* `nel_pos`: Part of speech according to NorthEuraLex
* `pos_gloss`: Gloss for `nel_pos`
* `semantic_domain`: Chapter of [Intercontinental Dictionary Series](https://en.wikipedia.org/wiki/Intercontinental_Dictionary_Series) in which `Concept_ID` appears
* `global_semantic_density_l1`: Mean cosine similarity between `wordform_l1` and all `n` words that appear in the set of translation pairs for `l1` & `l2`
* `global_semantic_density_l2`: Mean cosine similarity between `wordform_l2` and all `n` words that appear in the set of translation pairs for `l1` & `l2`
* `local_semantic_density_l1`: Sum of cosine similarities between `wordform_l1` and its `k` closest neighbours in `l1`
* `local_semantic_density_l2`: Sum of cosine similarities between `wordform_l2` and its `k` closest neighbours in `l2`
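
The two `passes_*` fields above are quality flags, so it can be worth restricting an analysis to rows where both are `True`. Here is a minimal sketch of that filter. The helper name `keep_reliable` is hypothetical, and the toy frame only stands in for `data`: its values are invented for illustration.

```python
import pandas as pd

def keep_reliable(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows where both quality flags are True."""
    # .eq(True) treats missing flags as failing the filter
    mask = df["passes_concept_filter"].eq(True) & df["passes_wiki_filter"].eq(True)
    return df[mask]

# Toy stand-in for `data` (invented values, same column names)
toy = pd.DataFrame({
    "l1": ["ab", "ru", "ru"],
    "l2": ["lez", "en", "en"],
    "alignment": [-0.14, 0.70, 0.35],
    "passes_concept_filter": [True, True, False],
    "passes_wiki_filter": [False, True, True],
})

reliable = keep_reliable(toy)
print(reliable)
```

In the notebook itself the call would be `keep_reliable(data)`.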

In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,l1,l2,alignment,wordform_l1,wordform_l2,neighbour_overlap,editdistance,k,n,...,zipf_freq_l2,passes_concept_filter,passes_wiki_filter,pos_gloss,semantic_domain,nel_pos,global_semantic_density_l1,global_semantic_density_l2,local_semantic_density_l1,local_semantic_density_l2
0,0,ab,lez,-0.140704,сара,за,93.0,3,100,109,...,,True,False,Pronouns,Kinship,PRN,0.904956,0.306323,98.640176,33.389172
1,1,ab,lez,-0.097948,акрал,пачагь,93.0,5,100,109,...,,True,False,Nouns,Social and political relations,N,0.798443,0.444899,87.030284,48.494014
2,2,ab,lez,-0.097085,ахыҧхьаӡара,сан,92.0,10,100,109,...,,True,False,Nouns,Modern world,N,0.727316,0.293339,79.277402,31.9739
3,3,ab,lez,-0.09371,агазеҭ,газета,94.0,3,100,109,...,,True,False,Nouns,Modern world,N,0.927466,0.307791,101.093773,33.549237
4,4,ab,lez,-0.090782,ахара,яргъал,93.0,5,100,109,...,,True,False,Adjectives,Spatial relations,A,0.915693,0.360167,99.810534,39.258229
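
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above look like row indexes that were saved into the CSV, so they probably carry no information of their own. Assuming that is the case, a small sketch for dropping them follows. The inline CSV text reuses the first two rows shown above so that the snippet runs without the full file.

```python
import io
import pandas as pd

# Two rows copied from the head() output, reduced to a few columns
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,l1,l2,alignment\n"
    "0,0,ab,lez,-0.140704\n"
    "1,1,ab,lez,-0.097948\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed"
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(clean.columns))
```

The same `.loc` line applied to `data` would remove the leftover index columns there.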


These are the languages they looked at in the study. The codes are ISO 639 language codes (two letters where one exists, otherwise three letters such as `lez`), not ISO 3166 country codes: https://en.wikipedia.org/wiki/ISO_639

In [5]:
data['l1'].unique()

array(['ab', 'av', 'ba', 'az', 'be', 'bg', 'bn', 'ar', 'br', 'ca', 'ce',
       'cv', 'cy', 'cs', 'da', 'el', 'de', 'es', 'et', 'en', 'eu', 'fa',
       'fi', 'ga', 'fr', 'hi', 'he', 'hr', 'hu', 'hy', 'is', 'ka', 'kk',
       'ja', 'it', 'kn', 'ku', 'kv', 'ko', 'la', 'lbe', 'lez', 'lt', 'lv',
       'mdf', 'mhr', 'mrj', 'mn', 'ml', 'myv', 'no', 'nl', 'os', 'ps',
       'pl', 'pt', 'ro', 'sah', 'se', 'sk', 'ru', 'sq', 'ta', 'te', 'sv',
       'tr', 'tt', 'udm', 'uz', 'vep', 'uk', 'xal'], dtype=object)

Here I'm filtering the data to look at the Russian-English pairs.
If you want to look at another pair of languages,
replace 'ru' and 'en' with the codes of your choice!

In [12]:
pairs = data[(data['l1']=='ru') & (data['l2']=='en')]
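
If you swap in other codes and get an empty result, the pair may be stored in the opposite order (for example `l1='en'`, `l2='ru'`). Whether the file stores each pair in one direction only is an assumption I have not checked. The hypothetical helper below matches either order; the toy rows are invented for illustration.

```python
import pandas as pd

def select_pair(df: pd.DataFrame, lang_a: str, lang_b: str) -> pd.DataFrame:
    """Return rows for the language pair, whichever way round it is stored."""
    forward = (df["l1"] == lang_a) & (df["l2"] == lang_b)
    reverse = (df["l1"] == lang_b) & (df["l2"] == lang_a)
    return df[forward | reverse]

# Toy stand-in for `data` (invented values)
toy = pd.DataFrame({
    "l1": ["ru", "ru", "ab"],
    "l2": ["en", "en", "lez"],
    "alignment": [0.70, -0.10, -0.14],
})

print(select_pair(toy, "en", "ru"))  # finds the rows stored as ru/en
```

Note that in rows matched in reverse order, the `*_l1` columns describe `lang_b` and the `*_l2` columns describe `lang_a`.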

what is the mean alignment between the word pairs across these two languages?

In [13]:
pairs.alignment.mean()

0.34792976529825265

what words are poorly aligned across the two languages?

In [14]:
pairs[pairs['alignment']<0.1]

Unnamed: 0.1,Unnamed: 0,l1,l2,alignment,wordform_l1,wordform_l2,neighbour_overlap,editdistance,k,n,...,zipf_freq_l2,passes_concept_filter,passes_wiki_filter,pos_gloss,semantic_domain,nel_pos,global_semantic_density_l1,global_semantic_density_l2,local_semantic_density_l1,local_semantic_density_l2
1638615,1638615,ru,en,-0.105349,жила,sinew,6.0,5,100,997,...,2.33,True,True,Nouns,The body,N,0.170701,0.214949,170.189102,214.304252
1638616,1638616,ru,en,-0.094918,близкий,near,4.0,7,100,997,...,5.28,True,True,Adjectives,Spatial relations,A,0.159128,0.173908,158.650299,173.385937
1638617,1638617,ru,en,-0.086786,грести,row,11.0,6,100,997,...,4.62,True,True,Verbs,Motion,V,0.261725,0.171959,260.940031,171.443214
1638618,1638618,ru,en,-0.077231,буква,character,18.0,9,100,997,...,5.10,True,True,Nouns,Speech and language,N,0.170150,0.183665,169.639364,183.114015
1638619,1638619,ru,en,-0.066755,владеть,rule,15.0,7,100,997,...,5.00,True,True,Verbs,Social and political relations,V,0.207353,0.125320,206.730610,124.943699
1638620,1638620,ru,en,-0.065444,лаять,bark,6.0,5,100,997,...,3.77,True,True,Verbs,Animals,V,0.289883,0.203942,289.013167,203.330472
1638621,1638621,ru,en,-0.065104,виднеться,show,17.0,9,100,997,...,5.64,True,True,Verbs,Sense perception,V,0.282834,0.200299,281.985428,199.698464
1638622,1638622,ru,en,-0.062084,передать,hand,12.0,8,100,997,...,5.40,True,True,Verbs,Basic actions and technology,V,0.211746,0.254232,211.111130,253.469147
1638623,1638623,ru,en,-0.059422,вина,fault,8.0,5,100,997,...,4.60,True,True,Nouns,Emotions and values,N,0.182166,0.161228,181.619463,160.744081
1638624,1638624,ru,en,-0.059395,сосуд,vein,19.0,5,100,997,...,3.75,True,True,Nouns,The body,N,0.214489,0.181214,213.845397,180.669880


what parts of speech are these poorly aligned words?

In [31]:
pairs[pairs['alignment']<0.1]['pos_gloss'].value_counts()

Verbs         49
Nouns         45
Adjectives     4
Name: pos_gloss, dtype: int64

what is the semantic domain of these poorly aligned words?

In [32]:
pairs[pairs['alignment']<0.1]['semantic_domain'].value_counts()

The body                          17
Basic actions and technology      16
Motion                            10
Modern world                       9
Emotions and values                6
Spatial relations                  5
The house                          5
Agriculture and vegetation         4
Clothing and grooming              4
Food and drink                     3
Speech and language                3
The physical world                 3
Cognition                          3
Social and political relations     3
Quantity                           2
Sense perception                   2
Animals                            2
Possession                         1
Name: semantic_domain, dtype: int64
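
Raw counts like the ones above partly reflect how many words each category has to begin with, so a large category can top the list without being unusually misaligned. One way to check is to look at the share of poorly aligned words within each category. This is a rough sketch: `low_alignment_share` is a hypothetical helper, the 0.1 threshold mirrors the cells above, and the toy values are invented.

```python
import pandas as pd

def low_alignment_share(df: pd.DataFrame, by: str, threshold: float = 0.1) -> pd.Series:
    """Fraction of rows per category whose alignment is below the threshold."""
    is_low = df["alignment"] < threshold
    # The mean of a boolean column is the fraction of True values
    return is_low.groupby(df[by]).mean().sort_values(ascending=False)

# Toy stand-in for `pairs` (invented values)
toy = pd.DataFrame({
    "pos_gloss": ["Verbs", "Verbs", "Nouns", "Nouns", "Nouns", "Nouns"],
    "alignment": [0.05, 0.40, 0.05, 0.30, 0.50, 0.80],
})

share = low_alignment_share(toy, "pos_gloss")
print(share)
```

With the real data, `low_alignment_share(pairs, 'pos_gloss')` and `low_alignment_share(pairs, 'semantic_domain')` would give the corresponding proportions.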

what words are well-aligned across the two languages?

In [10]:
pairs[pairs['alignment']>0.7]

Unnamed: 0.1,Unnamed: 0,l1,l2,alignment,wordform_l1,wordform_l2,neighbour_overlap,editdistance,k,n,...,zipf_freq_l2,passes_concept_filter,passes_wiki_filter,pos_gloss,semantic_domain,nel_pos,global_semantic_density_l1,global_semantic_density_l2,local_semantic_density_l1,local_semantic_density_l2
1639543,1639543,ru,en,0.700593,искусный,skilful,41.0,8,100,997,...,2.98,True,True,Adjectives,Basic actions and technology,A,0.202132,0.169786,201.525886,169.276958
1639544,1639544,ru,en,0.701642,утро,morning,50.0,7,100,997,...,5.33,True,True,Nouns,Time,N,0.216001,0.2103,215.353099,209.66876
1639545,1639545,ru,en,0.70576,побережье,coast,38.0,9,100,997,...,4.89,True,True,Nouns,The physical world,N,0.165395,0.157014,164.899055,156.543276
1639546,1639546,ru,en,0.713041,ночь,night,44.0,5,100,997,...,5.62,True,True,Nouns,Time,N,0.206921,0.224344,206.299831,223.670769
1639547,1639547,ru,en,0.719437,семья,family,39.0,6,100,997,...,5.65,True,True,Nouns,Kinship,N,0.165961,0.144304,165.463083,143.871315
1639548,1639548,ru,en,0.721068,черный,black,40.0,6,100,997,...,5.44,True,True,Adjectives,Sense perception,A,0.205107,0.193908,204.491772,193.326611
1639549,1639549,ru,en,0.735024,муж,husband,50.0,7,100,997,...,4.95,True,True,Nouns,Kinship,N,0.184726,0.16848,184.171695,167.974539
1639550,1639550,ru,en,0.73586,белый,white,49.0,5,100,997,...,5.46,True,True,Adjectives,Sense perception,A,0.208003,0.19172,207.378517,191.144809
1639551,1639551,ru,en,0.742038,полдень,noon,49.0,7,100,997,...,4.11,True,True,Nouns,Time,N,0.198595,0.183532,197.998907,182.981057
1639552,1639552,ru,en,0.742433,вечер,evening,45.0,7,100,997,...,4.87,True,True,Nouns,Time,N,0.212252,0.205102,211.614967,204.486321
