
An experiment using sourmash for working with natural language (instead of genomic k-mers).


luizirber/2021-02-26-text-minhash


Using sourmash with text


This is an experiment in using sourmash to work with natural language instead of genomic k-mers. The goal is to use books to explain how MinHash works, and to draw analogies to why we use k-mers for genomic data.

Example 1: The Dispossessed

Plotting the similarity between chapters in The Dispossessed, by Ursula K. Le Guin.

I wanted to check this because the book alternates between two places (Anarres and Urras) and two timeframes (past and present). Do the chapters cluster the same way, with even-numbered and odd-numbered chapters grouping together?

First try, using only the presence of words:

Second try, considering the abundance (how many times each word appears):
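To make the difference between the first two tries concrete, here is a minimal sketch in plain Python. It compares two short made-up passages once by word presence (Jaccard on the sets of distinct words) and once by abundance (cosine similarity on word counts). The tokenizer, the sample sentences, and the choice of cosine similarity are assumptions for illustration; the abundance-aware score that sourmash computes may be defined differently.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into words (an assumed, very simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard(a, b):
    """Presence only: shared distinct words divided by all distinct words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(a, b):
    """Abundance-aware: cosine of the angle between the two word-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two made-up passages standing in for chapters.
chapter_a = words("the wall the wall the wall was important")
chapter_b = words("the wall was there and it was important to them")

presence = jaccard(chapter_a, chapter_b)
abundance = cosine(chapter_a, chapter_b)
print(f"presence: {presence:.2f}, abundance: {abundance:.2f}")
```

With presence only, a word repeated many times counts the same as a word used once; with abundance, frequently repeated shared words pull the score up.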

Third try: find the intersection (the words present in all chapters) and remove it. This removes the background common to all chapters and accentuates the differences between them. This plot also uses abundance.
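The background-removal step can be sketched with word counts in plain Python. The chapter names and counts below are toy values invented for illustration, not data from the book, and this is not the repo's own implementation (which operates on signatures).

```python
from collections import Counter
from functools import reduce

# Toy word counts per chapter (hypothetical values).
chapters = {
    "ch1": Counter({"the": 10, "and": 7, "ship": 4, "wall": 2}),
    "ch2": Counter({"the": 12, "and": 5, "anarres": 3, "wall": 1}),
    "ch3": Counter({"the": 9, "and": 6, "urras": 5, "ship": 1}),
}

# Words that appear in every chapter: the common background.
common = reduce(set.intersection, (set(c) for c in chapters.values()))

# Drop the background from each chapter, keeping the counts of the remaining words.
filtered = {
    name: Counter({w: n for w, n in counts.items() if w not in common})
    for name, counts in chapters.items()
}

print(sorted(common))
print(dict(filtered["ch1"]))
```

Only words shared by all chapters are removed, so a word such as "wall", which is missing from one chapter, survives the filter.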

Interestingly, chapters 1 and 13 are "space travel" chapters, set not entirely on either planet, and the odd and even chapters do group together.

Example 2: Similarity and Containment

This example aims to show the difference between Similarity and Containment, and when you might prefer one to the other. For that, we use the Torah and the Bible as examples.

The Torah is composed of five books, which are also the first five books of the Bible. But since the two texts have very different sizes (the Bible being much larger), the Similarity score is low (0.34), because Similarity takes into account the union of elements from both datasets (the denominator in this equation):

$$J(T, B) = \frac{|T \cap B|}{|T \cup B|}$$

The Containment score uses the size of only one of the datasets in the denominator, $C(T, B) = |T \cap B| / |T|$, and so it is an asymmetric score. Because of this, the Containment of the Torah in the Bible is high (C(T, B) = 0.91) while the Containment of the Bible in the Torah is low (C(B, T) = 0.35).
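A toy example in plain Python shows the same pattern. The two sets below are stand-ins made of integers, chosen so that the small set sits almost entirely inside the large one; they are not derived from the actual texts, and the function names are illustrative.

```python
def jaccard(a, b):
    """Similarity: shared elements divided by the union of both sets."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Containment of a in b: fraction of a's elements that also appear in b."""
    return len(a & b) / len(a)


small = set(range(0, 10))   # stand-in for the smaller text
large = set(range(1, 31))   # stand-in for the larger text

print(f"J(small, large) = {jaccard(small, large):.2f}")
print(f"C(small, large) = {containment(small, large):.2f}")
print(f"C(large, small) = {containment(large, small):.2f}")
```

Similarity is symmetric and is dragged down by the size of the larger set, while Containment changes depending on which set goes in the denominator. Containment is the better choice when the question is "is this text part of that one?".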

Code organization

There are two pieces in this repo:

  • ansible, a very minimal CLI written in Rust for transforming a text file into a sourmash signature and performing intersection and subtraction of signatures.
  • A Snakemake pipeline for downloading data, building signatures with ansible, and running sourmash compare and plot commands to generate pretty pictures.
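As a rough illustration of the idea behind turning text into a set of hashes, here is a minimal sketch in plain Python. It is not how ansible or sourmash are implemented: the tokenizer, the use of BLAKE2b as the hash function, and the names `word_hash` and `sketch` are all assumptions made for this example. The `scaled` parameter mimics the idea of keeping only a fraction of the hash space.

```python
import hashlib
import re

MAX_HASH = 2**64  # size of the 64-bit hash space


def word_hash(word):
    """Map a word to a 64-bit integer (BLAKE2b is a stand-in hash, an assumption)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(text, scaled=1):
    """Keep only the word hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    tokens = re.findall(r"[a-z']+", text.lower())
    return {h for h in map(word_hash, tokens) if h < threshold}


# Two made-up sentences standing in for text files.
first = sketch("the wall did not look important")
second = sketch("the wall was important to them")

shared = first & second      # intersection of the two sketches
only_first = first - second  # subtraction: hashes unique to the first text

print(len(shared), len(only_first))
```

With `scaled=1` every distinct word is kept; a larger value keeps a smaller, consistently chosen subset, which is what makes sketches of large texts cheap to compare.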

Setup

This project depends on Snakemake, sourmash, pandoc, wget, and a Rust compiler.

Nix

If you don't want to download and install them yourself, you can use Nix to manage them for you:

  • Clone the repo.
  • Install Nix.
  • Run nix-shell in the repo directory to open a shell with all dependencies installed.
  • Run snakemake -j1 to process the pipeline and generate the figures.

Conda

You can also use the conda environment specified in environment.yml. To build and activate this environment run:

conda env create --force --file environment.yml

conda activate 2021-02-26-text-minhash

After installing the conda environment, you can also use repo2docker to create a container with all dependencies, just like what is executed in Binder.

repo2docker .
