The objective is to extract a sample
In the initial set
Also, one entity
Then, given a subset
It follows that the diversity of organisms or chemicals can be measured with Shannon's entropy over the probability distributions of elements
The proposed sampling approach is a simple greedy algorithm that, at each step, selects and adds the new document
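The greedy idea can be illustrated with a minimal sketch for a single variable. It is not the package's implementation: the function names, the dictionary input format, and the use of the natural logarithm are assumptions made for illustration, and the real sampler handles several variables at once.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log, an assumption) of entity occurrence counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def greedy_sample(documents, n):
    """Hypothetical helper: documents maps a document id to its list of entities.

    At each step, pick the remaining document whose addition gives the
    highest entropy of the pooled entity counts.
    """
    selected = []
    pool = Counter()
    remaining = dict(documents)
    for _ in range(min(n, len(remaining))):
        best_id = max(
            remaining,
            key=lambda d: shannon_entropy(pool + Counter(remaining[d])),
        )
        pool.update(remaining.pop(best_id))
        selected.append(best_id)
    return selected


docs = {"a": ["x", "x"], "b": ["x", "y"], "c": ["z"]}
print(greedy_sample(docs, 2))
```

In this toy input, document `b` is picked first because it already covers two distinct entities, and `c` follows because it adds a third one.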
Install the gme-sampler
pip install git+ssh://git@github.com/idiap/gme-sampler
The provided example considers the motivating scenario of sampling literature references for the LOTUS database.
The original snapshot of the LOTUS database (v.10 - Jan 6, 2023) used in this work is available at .
The LOTUS dataset used is licensed under CC BY 4.0. See the related article and the website for more details.
Only a subset of the original database is used in this example.
In this example, each document
structure_wikidata | structure_cid | structure_nameTraditional | organism_wikidata | organism_name | organism_taxonomy_02kingdom | reference_wikidata | reference_doi | reference_pubmed_id |
---|---|---|---|---|---|---|---|---|
http://www.wikidata.org/entity/Q43656 | 5997 | Cholesterol | http://www.wikidata.org/entity/Q1146782 | Eryngium foetidum | Archaeplastida | http://www.wikidata.org/entity/Q34502919 | 10.1002/(SICI)1099-1573(199902)13:13.0.CO;2-F | 10189959 |
http://www.wikidata.org/entity/Q121802 | 222284 | Beta-Sitosterol | http://www.wikidata.org/entity/Q1146782 | Eryngium foetidum | Archaeplastida | http://www.wikidata.org/entity/Q34502919 | 10.1002/(SICI)1099-1573(199902)13:13.0.CO;2-F | 10189959 |
http://www.wikidata.org/entity/Q104253515 | 5283638 | Clerosterol | http://www.wikidata.org/entity/Q1146782 | Eryngium foetidum | Archaeplastida | http://www.wikidata.org/entity/Q34502919 | 10.1002/(SICI)1099-1573(199902)13:13.0.CO;2-F | 10189959 |
- The item_column argument indicates the column containing the document identifiers, while the columns containing the variables whose diversity is to be maximised are specified with the on_columns argument.
- If binarised is set to True, then only the distinct set of entities per document is considered.
- dutopia is the main metric used to select the next document and represents the distance to the utopian point, see here. A second metric, sum, is also provided, which simply ranks the documents by the sum of the associated entropies.
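The difference between the two selectors can be sketched as follows. This is only an illustration under stated assumptions: the utopian point is taken to be the best entropy reachable on each variable among the current candidates, the distance is Euclidean, and the function names and candidate values are hypothetical. The package may define these details differently.

```python
import math


def select_sum(candidates):
    """Pick the candidate with the largest sum of per-variable entropies."""
    return max(candidates, key=lambda d: sum(candidates[d]))


def select_dutopia(candidates):
    """Pick the candidate closest to the utopian point.

    Assumption: the utopian point combines the best value observed on each
    variable across all candidates, and closeness is Euclidean distance.
    """
    n_vars = len(next(iter(candidates.values())))
    utopia = [max(v[i] for v in candidates.values()) for i in range(n_vars)]
    return min(candidates, key=lambda d: math.dist(candidates[d], utopia))


# Hypothetical entropies (variable 1, variable 2) obtained if each candidate were added.
candidates = {"a": (4.0, 0.0), "b": (0.0, 3.9), "c": (1.9, 1.9)}
print(select_sum(candidates), select_dutopia(candidates))
```

Here the sum favours `a`, which is strong on one variable only, while the distance to the utopian point favours `c`, which is more balanced across both variables.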
import pandas as pd
from gme.gme import GreedyMaximumEntropySampler

# Load data
data = pd.read_csv("data/test_data.csv", sep="\t", dtype=object)

# Init sampler
sampler = GreedyMaximumEntropySampler(selector="dutopia", binarised=False)

# Sample
output = sampler.sample(
    data=data,
    N=5,
    item_column="reference_doi",
    on_columns=["structure_wikidata", "organism_wikidata"],
)
The expected output is:
reference_doi | structure_wikidata | organism_wikidata |
---|---|---|
10.1016/0039-128X(82)90018-6 | 2.77259 | 0.00000 |
10.1055/S-2001-11496 | 2.89793 | 1.02910 |
10.1016/J.TALANTA.2005.04.043 | 3.32340 | 1.33408 |
10.1016/0039-128X(80)90068-9 | 3.57149 | 1.56290 |
10.1016/S0305-1978(00)00054-5 | 3.70730 | 1.75229 |
At each step, the document that maximises the diversity is sampled, and the corresponding increasing entropy values for each variable (here structure_wikidata and organism_wikidata) are reported.
On larger datasets, monitoring the entropy values can help estimate a sufficient sample size. See a complete example in our related article.
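One simple way to monitor these values is to look at the entropy gained at each step. The sketch below uses the structure_wikidata column of the expected output shown earlier; the helper name is hypothetical and no particular stopping rule is implied.

```python
# Entropy values copied from the structure_wikidata column of the expected output.
structure_entropy = [2.77259, 2.89793, 3.32340, 3.57149, 3.70730]


def marginal_gains(values):
    """Entropy gained by each newly sampled document after the first one."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


for size, gain in enumerate(marginal_gains(structure_entropy), start=2):
    print(f"sample size {size}: gain {gain:.5f}")
```

When the gains stay small over many consecutive steps, adding further documents contributes little extra diversity for that variable.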
The approach scales quadratically with the size of the dataset, which can be impractical for very large datasets. For this case, we implemented an --approx option.
With this option, instead of computing all the entropy values, the best candidate is approximated by taking the best one over a random sample of
Considering an initial sample of
For instance, for a dataset of size
- 88.19 % that it is in the top-10 in the full dataset.
- 98.60 % that it is in the top-20 in the full dataset.
- 99.99 % that it is in the top-50 in the full dataset.
- $\approx 100$ % that it is in the top-100 in the full dataset.
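Probabilities of this kind can be computed with a short counting argument: assuming the random sample is drawn uniformly without replacement, its best candidate is in the top-k of the full dataset exactly when the sample contains at least one of the k best items. The sketch below implements that formula. The function name and the dataset and sample sizes in the usage line are hypothetical, so the printed values are not meant to reproduce the percentages listed above.

```python
from math import comb


def prob_best_in_top_k(dataset_size, sample_size, k):
    """Probability that a uniform sample (without replacement) of sample_size
    items out of dataset_size contains at least one of the k best items."""
    if k >= dataset_size or sample_size > dataset_size - k:
        # The sample cannot avoid every top-k item.
        return 1.0
    # Complement of "none of the top-k items was drawn".
    return 1.0 - comb(dataset_size - k, sample_size) / comb(dataset_size, sample_size)


# Hypothetical sizes chosen for illustration only.
for k in (10, 20, 50, 100):
    print(k, round(prob_best_in_top_k(10_000, 500, k), 4))
```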