In [1]:
import json
import statistics as stats

import genpred


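def load_sp_vocabulary(task="disease", vocab_size=32000):
    """Load the sentencepiece vocabulary for a task, keeping words of at least 3 characters."""
    path = genpred.DATA_ROOT / f"{task}/sentencepiece/{vocab_size}/tokenizer.json"
    with open(path, "r", encoding="utf-8") as file:
        vocabulary = json.load(file)["model"]["vocab"]
    # Drop very short entries (fewer than 3 characters) before computing statistics.
    return [word for word in vocabulary if len(word) >= 3]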


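def print_stats(vocabulary, vocab_size=32000):
    """Print word-length statistics; `vocab_size` is only used in the header line."""
    vocabulary_size = len(vocabulary)
    word_lengths = [len(word) for word in vocabulary]
    # Lengths of the words that exceed 8 bases.
    longer_than_8 = [wl for wl in word_lengths if wl > 8]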

    print(f"Statistics for sentencepiece {vocab_size} vocabulary:", end="\n\n")
    print(f"Min. word length: {min(word_lengths)}")
    print(f"Max. word length: {max(word_lengths)}")
    print(f"Mean word length: {stats.mean(word_lengths)} +/- {stats.stdev(word_lengths)}")
    print(f"Word median: {stats.median(word_lengths)}")
    print(f"Word mode: {stats.mode(word_lengths)}")
    print(f"Number of words longer than 8 bases: {len(longer_than_8)}")
    print(f"Percentage of words longer than 8 bases: {len(longer_than_8) / vocabulary_size * 100}")


## Analysis of the sentencepiece vocabulary

Here, we analyze the 32k sentencepiece vocabulary (used in the `disease` task) to show how its properties differ from those of the k-mer vocabularies.

In [2]:
vocabulary_32k = load_sp_vocabulary(task="disease", vocab_size=32000)
print_stats(vocabulary_32k, vocab_size=32000)

Statistics for sentencepiece 32000 vocabulary:

Min. word length: 3
Max. word length: 410
Mean word length: 16.59013727758842 +/- 21.827647853364194
Word median: 9
Word mode: 8
Number of words longer than 8 bases: 21121
Percentage of words longer than 8 bases: 66.04646799462147


These statistics show that the vocabulary obtained with sentencepiece is drastically different from k-mer vocabularies, since:

- it contains words longer than 8 bases, up to 410 bases in this vocabulary (whereas k-mer vocabularies cannot feasibly include words this long, because their size explodes combinatorially with the word length);

- the majority of words (66%) are longer than 8 bases.
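The combinatorial explosion mentioned above can be made concrete with a short sketch. It assumes a four-letter nucleotide alphabet (A, C, G, T) and a vocabulary that enumerates every possible k-mer; the function name `kmer_vocabulary_size` is hypothetical and not part of `genpred`.

```python
def kmer_vocabulary_size(k, alphabet_size=4):
    """Number of distinct words of exactly length k over the alphabet.

    Assumes every possible k-mer is included in the vocabulary.
    """
    return alphabet_size ** k


# The vocabulary size is multiplied by 4 for every additional base.
for k in (6, 8, 10, 12):
    print(f"k={k:>2}: {kmer_vocabulary_size(k):,} possible k-mers")
```

Under these assumptions, each extra base multiplies the vocabulary size by four, which is why exhaustive k-mer vocabularies are limited to short words while a learned sentencepiece vocabulary of fixed size can contain much longer ones.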

Words longer than 8 bases remain common even in the 8k vocabulary used for the `capsule` task:

In [3]:
vocabulary_8k = load_sp_vocabulary(task="capsule", vocab_size=8000)
print_stats(vocabulary_8k, vocab_size=8000)

Statistics for sentencepiece 8000 vocabulary:

Min. word length: 3
Max. word length: 99
Mean word length: 8.763032581453635 +/- 5.678093322486045
Word median: 8.0
Word mode: 7
Number of words longer than 8 bases: 2000
Percentage of words longer than 8 bases: 25.062656641604008


In this case, even though the vocabulary has only about one tenth as many words as the largest k-mer vocabulary (whose words are at most 8 bases long), 25% of its words are still longer than 8 bases.
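To check how sensitive this comparison is to the 8-base cutoff, the same percentage can be computed for several thresholds. The following is a minimal sketch: `share_longer_than` is a hypothetical helper, and the toy vocabulary is made up for illustration rather than taken from the tokenizer files.

```python
def share_longer_than(vocabulary, threshold):
    """Percentage of words in `vocabulary` strictly longer than `threshold`."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    longer = sum(1 for word in vocabulary if len(word) > threshold)
    return longer / len(vocabulary) * 100


# Toy vocabulary for illustration only (word lengths 3, 8, 9 and 12).
toy_vocabulary = ["ACG", "ACGTACGT", "ACGTACGTA", "ACGTACGTACGT"]

for threshold in (3, 8, 10):
    share = share_longer_than(toy_vocabulary, threshold)
    print(f"Words longer than {threshold} bases: {share:.1f}%")
```

With the real data, the helper could be applied to the lists returned by `load_sp_vocabulary`, for example `share_longer_than(vocabulary_8k, 8)`.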