This is an experiment about using sourmash for working with natural language (instead of genomic k-mers). The goal is to use books to explain how MinHash works, and to draw analogies to why we use k-mers for genomic data.
Plotting the similarity between chapters in The Dispossessed, by Ursula K. Le Guin.
I wanted to check this because the book alternates between two different places (Anarres and Urras) and timeframes (past and present), so do the chapters also cluster that way (even and odd chapter numbers)?
First try, using only the presence of words:
Second try, considering the abundance (how many times each word appears):
Third try: find the intersection (the words present in all chapters) and remove it.
This removes the common background in all chapters, and maximizes the difference
between them. This plot is also using the abundance.
It is interesting to notice that chapters 1 and 13 are "space travel" chapters, not entirely on one planet or the other, and that the odd and even chapters each group together.
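The three tries above can be sketched in plain Python on toy data. This is only an illustration under stated assumptions: the chapter texts are made up, the tokenizer is a naive whitespace split, and the cosine measure stands in for whatever abundance-aware similarity sourmash applies, which may differ.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Presence-only similarity: shared words over all distinct words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def cosine(a, b):
    """Abundance-aware similarity: cosine of the two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical stand-ins for chapters; not text from the book.
chapters = {
    "ch1": "the ship left the moon and the port",
    "ch2": "the moon and the dust and the work",
    "ch3": "the ship and the city and the port",
}
counts = {name: Counter(tokenize(text)) for name, text in chapters.items()}

# Third try: words present in every chapter form the common background.
common = set.intersection(*(set(c) for c in counts.values()))
filtered = {
    name: Counter({w: n for w, n in c.items() if w not in common})
    for name, c in counts.items()
}

for x, y in [("ch1", "ch2"), ("ch1", "ch3"), ("ch2", "ch3")]:
    print(
        x, y,
        f"presence={jaccard(counts[x], counts[y]):.2f}",
        f"abundance={cosine(counts[x], counts[y]):.2f}",
        f"filtered={cosine(filtered[x], filtered[y]):.2f}",
    )
```

In this toy data the shared background ("the", "and") inflates every pairwise score, and removing it pulls unrelated chapters apart.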
This example aims to show the difference between Similarity and Containment, and when you might prefer one to the other. For that, we use the Torah and the Bible as examples.
The Torah is composed of five books,
which are also the first five books in the Bible.
But, since they have different sizes (the Bible being much larger),
the Similarity score is low (0.34) because similarity takes into account the
union of elements from both datasets (the denominator in this equation):
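Assuming the standard Jaccard definition, and writing $T$ and $B$ for the sets of elements (hashed words) from the Torah and the Bible, the similarity can be written as:

```latex
J(T, B) = \frac{|T \cap B|}{|T \cup B|}
```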
The Containment score uses the size of only one of the datasets in the denominator,
and so it is an asymmetric score.
Because of this,
Containment of the Torah in the Bible is high (C(T, B) = 0.91)
while the Containment of the Bible in the Torah is low (C(B, T) = 0.35).
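The asymmetry is easy to reproduce with plain Python sets. The sketch below uses made-up integer sets standing in for hashed words, sized only to mimic the pattern of the scores above. It assumes the usual definition of containment, C(A, B) = |A ∩ B| / |A|, and is not computed from the actual texts.

```python
def similarity(a, b):
    """Jaccard similarity: shared elements over the union of both sets."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Containment of a in b: fraction of a's elements also found in b."""
    return len(a & b) / len(a)


# Toy data (assumption): a small set almost fully inside a much larger one.
small = set(range(10))      # 10 elements
large = set(range(1, 27))   # 26 elements, sharing 9 with `small`

print(f"similarity          = {similarity(small, large):.2f}")
print(f"C(small, large)     = {containment(small, large):.2f}")
print(f"C(large, small)     = {containment(large, small):.2f}")
```

Similarity is dragged down by the size of the larger set, while containment of the smaller set in the larger one stays high. This is why containment is the better fit when one dataset is expected to sit inside another.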
There are two pieces in this repo:
- `ansible`, a very minimal CLI written in Rust for transforming a text file into a sourmash signature and performing intersection and subtraction of signatures.
- A Snakemake pipeline for downloading data, building signatures with `ansible`, and running sourmash `compare` and `plot` commands to generate pretty pictures.
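To make the idea of turning text into a signature concrete, here is a minimal Python sketch of hashing words into a set and then intersecting and subtracting two such sets. It is not the Rust CLI's implementation: `hash_word`, `sketch`, and the `scaled` parameter are hypothetical names, SHA-256 is a stand-in for whatever hash function sourmash uses, and the input strings are made up.

```python
import hashlib


def hash_word(word):
    """Map a word to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, scaled=1):
    """Return the hashes of distinct words that fall below 2**64 / scaled.

    scaled=1 keeps every distinct word; larger values keep roughly
    1/scaled of them, which is what makes the sketch smaller than the text.
    """
    max_hash = 2**64 // scaled
    hashes = (hash_word(w) for w in text.lower().split())
    return {h for h in hashes if h < max_hash}


a = sketch("the ship left the moon")
b = sketch("the moon and the dust")

shared = a & b   # intersection: words present in both texts
only_a = a - b   # subtraction: words in the first text but not the second

print(len(a), len(b), len(shared), len(only_a))
```

Words play the role that k-mers play for genomic data: each one is hashed, and the comparisons work on the sets of hashes rather than on the text itself.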
This project depends on Snakemake, sourmash, pandoc, wget and a Rust compiler.
If you don't want to download and install them yourself, you can use Nix to manage them for you:
- Clone the repo.
- Install nix.
- Run `nix-shell` in the repo directory to open a shell with all deps installed.
- Run `snakemake -j1` to process the pipeline and generate figures.
You can also use the conda environment specified in `environment.yml`.
To build and activate this environment run:
conda env create --force --file environment.yml
conda activate 2021-02-26-text-minhash
After installing the conda environment,
you can also use repo2docker
to create a container with all dependencies,
just like what is executed in Binder.
repo2docker .