This code repository hosts the datasets, pre-trained embeddings, and scripts to run experiments found in the anonymous submission "Do Similar Entities have Similar Embeddings?" to ESWC 2024.
Required libraries and dependencies can be found in requirements.txt
.
The preprocessed_datasets
folder contains already processed datasets with all the required files to effectively carry out experiments.
In particular, files containing results for the Jaccard coefficients are names fast_jaccard2id_train_K=100_hops=X.json
where X is whether 1 (when considering 1-hop subgraphs) or 0 (when considering 2-hop subgraphs). A mapped version to labels exist, under the same naming convention fast_jaccard2labels_train_K=100_hops=X.json
.
Some of them are not provided (as they exceed the file size limit of GitHub). However, we provide fast_jaccard.py
which uses a fast implementation of Jaccard coefficient computation over subgraphs. This script can be used to generate the required files.
The dataset DBpedia50 is named db50_trans
to highlight the fact that its transductive version is used in the experiments.
Trained models and their respective embeddings can be found under the embeddings
folder. RDF2Vec embeddings were trained using pyRDF2Vec while other models were trained using PyKEEN. For RDF2Vec, we used the default hyperparameters reported in Portisch et al.: embeddings of dimension 200, with 2,000 walks maximum, a depth of 4, a window size of 5, and 25 epochs for training word2vec with the continuous skip-gram architecture. For models trained with PyKEEN, we used the provided configuration files, when available. For those datasets with no reported best hyperparameters, we used the hyperparameters reported in the original paper (e.g. YAGO4-19K, with best hyperparameters found in Hubert et al.) or performed manual hyperparameter search and kept the sets of hyperparameters leading to the best results on the validation sets (e.g. for AIFB).
Running the experiments is quite straightforward: we provide several scripts. Each of them outputs results w.r.t. a separate experimental section in the paper and aims at answering a specific research question (RQ).
run_expe_1_2.py
runs experiments related to RQ1 and RQ2. (Table 2, Figure 3, and Figure 4)
run_expe_3.py
runs experiments related to RQ3. (Table 4)
Additional files are provided to get dataset statistics : get_dataset_statistics.py
and get_rel_freq_per_class.py
. These scripts do not need to be run and are not used per se in the paper. However, they can provide insights into how predicates are globally distributed across classes, what classes observe a larger set of distinct predicates connecting their respective entities, etc.