In [8]:
import pandas as pd

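for lang in ['es', 'it']:
    df = pd.read_json(f'./data/train-data_all/{lang}.train.json', orient="records")
    # Lists are unhashable; convert each vector to a tuple so drop_duplicates can compare it.
    df["sgns"] = df.sgns.apply(tuple)
    df["char"] = df.char.apply(tuple)
    len_df = len(df)
    # Rows left after collapsing items that share both gloss and SGNS vector.
    len_df_sgns_dedup = len(df.drop_duplicates(subset=["gloss", "sgns"]))
    # Columns: language, total items, change in item count after deduplication.
    print(lang, len_df, len_df_sgns_dedup - len_df)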

es 43608 -10
it 43608 -68


In [5]:
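for lang in ['en', 'fr', 'ru']:
    df = pd.read_json(f'./data/train-data_all/{lang}.train.json', orient="records")
    # Lists are unhashable; convert each vector to a tuple so drop_duplicates can compare it.
    df["sgns"] = df.sgns.apply(tuple)
    df["char"] = df.char.apply(tuple)
    df["electra"] = df.electra.apply(tuple)
    len_df = len(df)
    # Rows left after collapsing items that share both gloss and SGNS vector.
    len_df_sgns_dedup = len(df.drop_duplicates(subset=["gloss", "sgns"]))
    # Rows left after collapsing items that also share the ELECTRA vector.
    len_df_full_dedup = len(df.drop_duplicates(subset=["gloss", "sgns", "electra"]))
    # Columns: language, total items, change after SGNS dedup, change after full dedup.
    print(lang, len_df, len_df_sgns_dedup - len_df, len_df_full_dedup - len_df)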

en 43608 -12301 0
fr 43608 -9111 0
ru 43608 -12375 -2


All word2vec SGNS and ELECTRA embedding models for the 5 languages were trained on comparable datasets:
- around 1 billion sentences in total
- 50% of the sentences came from Wikipedia, 40% from OpenSubtitles, and the remaining 10% from book corpora such as Wikisource or gutenberg.org
- to normalize whitespace, all sentences were tokenized with NLTK's word_tokenize
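
The whitespace normalization step can be pictured as follows. This is only a minimal sketch: the helper names `normalize_whitespace` and `normalize_with_nltk` are hypothetical, and rejoining the tokens with single spaces is an assumption about how the tokenized output was stored, not something stated above.

```python
from typing import Callable, List


def normalize_whitespace(sentence: str, tokenize: Callable[[str], List[str]]) -> str:
    """Tokenize `sentence` and rejoin the tokens with single spaces."""
    return " ".join(tokenize(sentence))


def normalize_with_nltk(sentence: str) -> str:
    """Same as above, using NLTK's word_tokenize (needs nltk and its tokenizer data)."""
    # Imported lazily so the rest of the example runs without NLTK installed.
    from nltk.tokenize import word_tokenize
    return normalize_whitespace(sentence, word_tokenize)


# str.split is only a stand-in tokenizer so the example runs without NLTK.
normalized = normalize_whitespace("Hello   world ,\tagain", str.split)
print(normalized)
```

With NLTK and its tokenizer data available, `normalize_with_nltk` would be the variant closest to the description above.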

Some analysis:
- The SGNS models were found to perform comparably to other off-the-shelf word2vec models on word analogy tasks.
- The character embeddings achieved a 98% reconstruction accuracy on a held-out test set.

By "external resources", we mean both external datasets (such as WordNet, or raw corpus data used to train a language model) and pretrained models, such as BERT and the like.
You are, on the other hand, free to train language models on the CODWOE datasets themselves.

The ELECTRA vectors were computed from examples of usage: we embedded the full example of usage, then extracted the embeddings corresponding to the word being defined.
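
To make the extraction step concrete, here is a rough sketch in plain Python. It assumes the contextual model has already produced one vector per token; the function name `extract_target_vector`, the exact-match token lookup, and the averaging of several matching tokens are all illustrative assumptions, since the text above does not say how multi-token words were handled. The vectors are made-up toy values.

```python
from typing import List, Sequence


def extract_target_vector(
    tokens: Sequence[str],
    token_vectors: Sequence[Sequence[float]],
    target: str,
) -> List[float]:
    """Average the contextual vectors of every token equal to `target`."""
    if len(tokens) != len(token_vectors):
        raise ValueError("expected one vector per token")
    rows = [vec for tok, vec in zip(tokens, token_vectors) if tok == target]
    if not rows:
        raise ValueError(f"{target!r} does not occur in the example")
    dim = len(rows[0])
    # Component-wise mean over the matching rows (an assumed pooling choice).
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Toy example of usage with 2-dimensional stand-in vectors.
tokens = ["a", "bag", "made", "of", "plastic"]
token_vectors = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.2, 0.8]]
vector = extract_target_vector(tokens, token_vectors, "plastic")
print(vector)
```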

Some entries have more than one such example of usage; for a concrete example, see https://fr.wiktionary.org/wiki/plastique#Nom_commun_2, which lists three examples with the definiendum in bold. In such cases, we computed a distinct ELECTRA vector for each example of usage, and each one corresponds to a distinct item in the CODWOE datasets.
These items therefore have distinct ELECTRA vectors, but the same SGNS and char vectors, as well as the same glosses. This case only occurs in the EN/FR/RU datasets.
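
One way to single out these items is sketched below on a toy frame. The values are invented, and the column names follow the notebook cells above; the vectors are already stored as tuples so that pandas can hash them.

```python
import pandas as pd

# Toy stand-in for an EN/FR/RU training file; all values are made up.
df = pd.DataFrame({
    "gloss": ["a synthetic material", "a synthetic material", "a large feline"],
    "sgns": [(0.1, 0.2), (0.1, 0.2), (0.3, 0.4)],
    "char": [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "electra": [(0.5, 0.5), (0.6, 0.4), (0.7, 0.3)],
})

# Rows that share gloss and SGNS vector with at least one other row.
same_word = df.duplicated(subset=["gloss", "sgns"], keep=False)
# Rows whose ELECTRA vector is repeated as well.
same_everything = df.duplicated(subset=["gloss", "sgns", "electra"], keep=False)

# Same word and gloss, but a different example of usage.
multi_example = df[same_word & ~same_everything]
print(len(multi_example))
```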

Another case of duplicate glosses is due to distinct words that are defined with the same gloss in the original dataset. In the CODWOE datasets, they correspond to items with different SGNS and char vectors, but the same gloss. These are present in all datasets, including the ES/IT datasets.
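
A possible way to list such glosses is sketched below, again on invented toy data: repeated (gloss, SGNS) pairs are collapsed first, so that several examples of usage for one word are not counted as several words.

```python
import pandas as pd

# Toy data: two different words (different SGNS vectors) share one gloss.
df = pd.DataFrame({
    "gloss": ["to move quickly", "to move quickly", "a small house"],
    "sgns": [(0.1, 0.2), (0.9, 0.8), (0.3, 0.4)],
})

# Keep one row per (gloss, SGNS) pair.
one_per_word = df.drop_duplicates(subset=["gloss", "sgns"])
# Glosses that still occur more than once belong to distinct words.
words_per_gloss = one_per_word.groupby("gloss").size()
shared_glosses = words_per_gloss[words_per_gloss > 1]
print(shared_glosses)
```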

Lastly, a few entries in our sources (around 0.03% of all items) had the same word and the same gloss, but different part-of-speech tags. These did produce true duplicates.
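
If you prefer to drop these true duplicates, something along the following lines could work. It is a sketch on toy data: the `id` column and its values are assumptions made for illustration, and whether removing the repeats is desirable depends on your setup.

```python
import pandas as pd

# Toy data: the first two rows differ only in their (hypothetical) id.
df = pd.DataFrame({
    "id": ["toy.1", "toy.2", "toy.3"],
    "gloss": ["of or relating to the sea", "of or relating to the sea", "a sea animal"],
    "sgns": [(0.1, 0.2), (0.1, 0.2), (0.3, 0.4)],
    "char": [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
})

# Compare on the gloss and on whichever vector columns the file provides.
content_cols = [c for c in ["gloss", "sgns", "char", "electra"] if c in df.columns]
# True for every repeat after the first occurrence.
is_repeat = df.duplicated(subset=content_cols, keep="first")
deduped = df[~is_repeat]
print(int(is_repeat.sum()), len(deduped))
```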
