# Week 1 -- Testing protein encodings for similarity detection with sourmash

This notebook requires the following libraries: `pandas`, `os`, `re`, `argparse`, `ipfsapi`, `requests`, `shutil`. Most of these libraries come with Python, but if one is not installed you will get an error message such as `ModuleNotFoundError: No module named 'pandas'`. To install a missing library in your `rotation` environment, run:

`$ conda activate rotation`  
`(rotation) $ conda install pandas`
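
Before running the rest of the notebook, it can help to check which of the libraries listed above are actually missing. The cell below is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this example.

```python
import importlib.util

# Libraries named at the top of this notebook.
REQUIRED = ["pandas", "os", "re", "argparse", "ipfsapi", "requests", "shutil"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries were found.")
```

Anything reported as not installed can then be installed into the `rotation` environment as shown above.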

## Download data

First, download some data to use as a test set. We will download genomes, "transcriptomes" (RNA sequences computationally predicted from DNA), and amino acid sequences. We will use sequences related to the species *Treponema denticola*, *Bacteroides thetaiotaomicron*, and *Porphyromonas gingivalis*. In the CSV files that we use to cue downloads, we have recorded the taxonomic relatedness of each bug relative to our species of interest. We will use these levels to assess whether sourmash recapitulates known taxonomic relationships between sequences.

In [16]:
!git clone https://github.com/bluegenes/2018-test_datasets.git

!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/denticola.csv
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/bacteroides.csv
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/gingivalis.csv

Cloning into '2018-test_datasets'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 68 (delta 10), reused 15 (delta 6), pack-reused 48
Unpacking objects: 100% (68/68), done.
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/568/735/GCA_000568735.2_ASM56873v2/GCA_000568735.2_ASM56873v2_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/568/735/GCA_000568735.2_ASM56873v2/GCA_000568735.2_ASM56873v2_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/164/975/GCA_900164975.1_16852_2_85/GCA_900164975.1_16852_2_85_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/164/975/GCA_900164975.1_16852_2_85/GCA_900164975.1_16852_2_85_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/164/975/GCA_900164975.1_16852_2_85/GCA_900164975.1_16852_2_85_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/501/115/GCA_000501115.1_BorGarIPT101/GCA_000501115.1_BorGarIPT101_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/501/115/GCA_000501

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/475/GCA_001829475.1_ASM182947v1/GCA_001829475.1_ASM182947v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/475/GCA_001829475.1_ASM182947v1/GCA_001829475.1_ASM182947v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/475/GCA_001829475.1_ASM182947v1/GCA_001829475.1_ASM182947v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/147/745/GCA_900147745.1_138_1/GCA_900147745.1_138_1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/147/745/GCA_90014774

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/340/645/GCA_000340645.1_Trep_dent_H-22_V1/GCA_000340645.1_Trep_dent_H-22_V1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/340/645/GCA_000340645.1_Trep_dent_H-22_V1/GCA_000340645.1_Trep_dent_H-22_V1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/340/645/GCA_000340645.1_Trep_dent_H-22_V1/GCA_000340645.1_Trep_dent_H-22_V1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/165/GCA_001829165.1_ASM182916

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/603/535/GCA_001603535.1_ASM160353v1/GCA_001603535.1_ASM160353v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/974/365/GCA_000974365.1_ASM97436v1/GCA_000974365.1_ASM97436v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/974/365/GCA_000974365.1_ASM97436v1/GCA_000974365.1_ASM97436v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/974/365/GCA_000974365.1_ASM97436v1/GCA_000974365.1_ASM97436v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/550/025/GCA_000550025.1_Stap_aure_M1496_V1/GCA_000550025.1_Stap_aure_M1496_V1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/550/025/GCA_000550025.1_Stap_aure_M1496_V1/GCA_000550025.1_Stap_aure_M1496_V1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/550/025/GCA_000550025.1_Stap_aure_M1496_V1/GCA_000550025.1_Stap_aure_M1496_V1_rna_from_genomic.fna.gz
genome: https://ftp.

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/659/685/GCA_001659685.2_ASM165968v2/GCA_001659685.2_ASM165968v2_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/785/025/GCA_000785025.1_ASM78502v1/GCA_000785025.1_ASM78502v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/785/025/GCA_000785025.1_ASM78502v1/GCA_000785025.1_ASM78502v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/785/025/GCA_000785025.1_ASM78502v1/GCA_000785025.1_ASM78502v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/405/895/GCA_001405895.1_14207_7_22/GCA_001405895.1_14207_7_22_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/405/895/GCA_001405895.1_14207_7_22/GCA_001405895.1_14207_7_22_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/405/895/GCA_001405895.1_14207_7_22/GCA_001405895.1_14207_7_22_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/432/495/GCA_000

protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/816/245/GCA_001816245.1_ASM181624v1/GCA_001816245.1_ASM181624v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/816/245/GCA_001816245.1_ASM181624v1/GCA_001816245.1_ASM181624v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/611/675/GCA_000611675.1_ASM61167v1/GCA_000611675.1_ASM61167v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/611/675/GCA_000611675.1_ASM61167v1/GCA_000611675.1_ASM61167v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/611/675/GCA_000611675.1_ASM61167v1/GCA_000611675.1_ASM61167v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/142/325/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/142/325/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly/GCA_900142325.1_IMG-taxon_

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/114/365/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/114/365/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/114/365/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly/GCA_900114365.1_IMG-taxon_2651870357_annotated_assembly_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/007/585/GCA_000007585.1_ASM758v1/GCA_000007585.1_ASM758v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/007/585/GCA_000007585.1_ASM758v1/GCA_000007585.1_ASM758v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/007/585/GCA_000007585.1_ASM758v1/GCA_000007585.1_ASM758v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/769/035/GCA_000769035.1_ASM76903v1/GCA_000769035.1_ASM76903v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/769/035/GCA_000769035.1_ASM76903v1/GCA_000769035.1_ASM76903v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/769/035/GCA_000769035.1_ASM76903v1/GCA_000769035.1_ASM76903v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/768/555/GCA_001768555.1_ASM176855v1/GCA_001768555.1_ASM176855v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/768/555/GCA_001768555.1_ASM176855v1/GCA_001768555.1_ASM176855v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/768/555/GCA_001768555.1_ASM176855v1/GCA_001768555.1_ASM176855v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/085/525/GCA_900085525.1_12045_7_37/GCA_900085525.1_12045_7_37_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/085/525/GCA_9000

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/583/675/GCA_000583675.1_ASM58367v1/GCA_000583675.1_ASM58367v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/808/555/GCA_001808555.1_ASM180855v1/GCA_001808555.1_ASM180855v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/808/555/GCA_001808555.1_ASM180855v1/GCA_001808555.1_ASM180855v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/808/555/GCA_001808555.1_ASM180855v1/GCA_001808555.1_ASM180855v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/162/415/GCA_000162415.1_ASM16241v1/GCA_000162415.1_ASM16241v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/162/415/GCA_000162415.1_ASM16241v1/GCA_000162415.1_ASM16241v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/162/415/GCA_000162415.1_ASM16241v1/GCA_000162415.1_ASM16241v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/064/135/GCA

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/089/535/GCA_900089535.1_A04/GCA_900089535.1_A04_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/089/535/GCA_900089535.1_A04/GCA_900089535.1_A04_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/089/535/GCA_900089535.1_A04/GCA_900089535.1_A04_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/217/375/GCA_000217375.2_CLC_glsol081/GCA_000217375.2_CLC_glsol081_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/217/375/GCA_000217375.2_CLC_glsol081/GCA_000217375.2_CLC_glsol081_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/217/375/GCA_000217375.2_CLC_glsol081/GCA_000217375.2_CLC_glsol081_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/599/245/GCA_000599245.1_ASM59924v1/GCA_000599245.1_ASM59924v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/599/245/GCA_000599245.1_ASM59924v1/GCA_000599245.1_A

In [2]:
# look and see how many genomes were downloaded for denticola
!ls 2018-test_datasets/denticola/

GCA_000147075.1_ASM14707v1_genomic.fna.gz
GCA_000147075.1_ASM14707v1_protein.faa.gz
GCA_000147075.1_ASM14707v1_rna_from_genomic.fna.gz
GCA_000216715.2_CLC_glsol140_genomic.fna.gz
GCA_000216715.2_CLC_glsol140_protein.faa.gz
GCA_000216715.2_CLC_glsol140_rna_from_genomic.fna.gz
GCA_000217015.3_CLC_glsol119_genomic.fna.gz
GCA_000217015.3_CLC_glsol119_protein.faa.gz
GCA_000217015.3_CLC_glsol119_rna_from_genomic.fna.gz
GCA_000217655.1_ASM21765v1_genomic.fna.gz
GCA_000217655.1_ASM21765v1_protein.faa.gz
GCA_000217655.1_ASM21765v1_rna_from_genomic.fna.gz
GCA_000222305.1_ASM22230v1_genomic.fna.gz
GCA_000222305.1_ASM22230v1_protein.faa.gz
GCA_000222305.1_ASM22230v1_rna_from_genomic.fna.gz
GCA_000236685.1_ASM23668v1_genomic.fna.gz
GCA_000236685.1_ASM23668v1_protein.faa.gz
GCA_000236685.1_ASM23668v1_rna_from_genomic.fna.gz
GCA_000239475.1_ASM23947v1_genomic.fna.gz
GCA_000239475.1_ASM23947v1_protein.faa.gz
GCA_000239475.1_ASM23947v1_rna_from_genomic.fna.gz
GCA_000242595.3_ASM242
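
Reading a long `ls` listing by eye is error-prone, so here is a small sketch that counts the downloaded files by type based on the filename suffixes seen above. The helper names are hypothetical, and the folder path is just the one used in this notebook.

```python
from collections import Counter
from pathlib import Path

# Order matters: the RNA suffix also ends with "_genomic.fna.gz",
# so it has to be checked before the genome suffix.
SUFFIXES = {
    "_rna_from_genomic.fna.gz": "rna",
    "_protein.faa.gz": "protein",
    "_genomic.fna.gz": "genome",
}


def classify(filename):
    """Return 'rna', 'protein', 'genome', or 'other' for one filename."""
    for suffix, kind in SUFFIXES.items():
        if filename.endswith(suffix):
            return kind
    return "other"


def count_by_type(filenames):
    """Count how many files of each type appear in an iterable of names."""
    return Counter(classify(Path(name).name) for name in filenames)


folder = Path("2018-test_datasets/denticola")
if folder.is_dir():
    # rglob also finds files that were placed in subfolders.
    print(count_by_type(p.name for p in folder.rglob("*.gz")))
```

If the three counts differ, some assemblies are probably missing a protein or RNA file.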

## Generating signatures using sourmash

We will generate signatures for our data. One signature can hold multiple k sizes, but only one scaled value and only one molecule type/encoding (e.g. DNA or protein).
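
To make the scaled value concrete: a scaled signature keeps only the k-mer hashes that fall below a cutoff of (size of the hash space) / scaled, so a scaled value of 2000 keeps roughly 1 in 2000 distinct k-mers. The sketch below illustrates that idea only. It is not sourmash's implementation: the hash function (`hashlib.blake2b`) is a stand-in, reverse complements are ignored, and the function names are made up for this example.

```python
import hashlib

HASH_SPACE = 2**64  # we use 64-bit hashes in this illustration


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Stand-in 64-bit hash (assumption: not the hash sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k, scaled):
    """Keep only the hashes below HASH_SPACE / scaled."""
    max_hash = HASH_SPACE // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < max_hash}


seq = "ACGTACGTTGCAACGTTGCA"
print("distinct 4-mers kept with scaled=1:", len(scaled_sketch(seq, 4, 1)))
print("distinct 4-mers kept with scaled=2:", len(scaled_sketch(seq, 4, 2)))
```

Because the cutoff depends only on the hash value, a sketch made with a larger scaled value is always a subset of one made with a smaller scaled value, which is why signatures can be downsampled later.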

In [20]:
# calculate nucleotide signatures, shown here for the genomic (DNA) files; the RNA files can be processed the same way
!mkdir -p sigs/bacteroides/genomic
!for infile in 2018-test_datasets/bacteroides/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/genomic/${out_name}.sig ${infile}; done

== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
calculated 3 signatures for 2 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_00

calculated 3 signatures for 103 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
calculated 3 signatures for 59 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and

calculated 3 signatures for 179 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000759315.1_04_NF40_HMP9302v01_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
calculated 3 signatures for 77 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
=

calculated 3 signatures for 1730 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001373135.1_2e6A_assembly_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
calculated 3 signatures for 69 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Pl

calculated 3 signatures for 423 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001614375.1_ASM161437v1_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
calculated 3 signatures for 81 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Pl

calculated 3 signatures for 74 sequences in 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
saved 3 signature(s). Note: signature license is CC0.
== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_900111425.1_IMG-taxon_2693429902_annotated_assembly_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_900111425.1_IMG-taxon_2693429902_annotated_assembly_genomic.fna.gz
calculated 3 signatures for 89 sequences in 2018-test_datasets/bacteroides/genomic/GCA_900111425.1_IMG-taxon_2693429902_annotated_assembly_genomic.fn
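
The shell loop in the cell above can also be written in Python, which makes it easier to reuse for the other folders later. This is a sketch under stated assumptions: it only builds the same `sourmash compute` command lines used above (same flags, same output naming), and the helper names `sig_name` and `compute_command` are hypothetical.

```python
from pathlib import Path


def sig_name(infile, suffix=".fna.gz"):
    """Mimic `basename $infile .fna.gz` and append `.sig`."""
    name = Path(infile).name
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name + ".sig"


def compute_command(infile, outdir, ksizes=(21, 31, 51), scaled=2000):
    """Build the argument list for one `sourmash compute` call."""
    return [
        "sourmash", "compute",
        "-k", ",".join(str(k) for k in ksizes),
        "--scaled", str(scaled),
        "--track-abundance",
        "-o", str(Path(outdir) / sig_name(infile)),
        str(infile),
    ]


indir = Path("2018-test_datasets/bacteroides/genomic")
outdir = Path("sigs/bacteroides/genomic")
for infile in sorted(indir.glob("*fna.gz")):
    # Print first; once the commands look right they could be passed
    # to subprocess.run(cmd, check=True) instead.
    print(" ".join(compute_command(infile, outdir)))
```

Printing the commands before running them is a cheap way to catch a wrong path or output name.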

In [17]:
# Using the code above, and the sourmash compute help message, calculate signatures for proteins
!sourmash compute -h

usage: sourmash [--protein] [--no-protein] [--dayhoff] [--no-dayhoff] [--dna]
                [--no-dna] [-q] [--input-is-protein] [-k KSIZES]
                [-n NUM_HASHES] [--check-sequence] [-f] [-o OUTPUT]
                [--singleton] [--merge MERGED] [--name-from-first]
                [--input-is-10x] [-p PROCESSES] [--track-abundance]
                [--scaled SCALED] [--seed SEED] [--randomize]
                [--license LICENSE]
                filenames [filenames ...]
sourmash: error: the following arguments are required: filenames


In [21]:
# Now you can use sourmash compare to compare signatures that were calculated with the same k size, the same
# molecule type, and the same encoding (e.g. dayhoff or hp).
# Take a look at the MetaPalette paper for some applications of k sizes to detecting relatedness across
# evolutionary distances, or take our word that k = 21 ~ genus-level similarity, k = 31 ~ species-level
# similarity, and k = 51 ~ strain-level similarity. https://msystems.asm.org/content/1/3/e00020-16
!mkdir -p sourmash_compare
!sourmash compare -k 31 -o sourmash_compare/bacteroides_k31_dna_comp sigs/bacteroides/genomic/*.sig

== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 61 signatures total.
downsampling to scaled value of 2000
min similarity in matrix: 0.000
saving labels to: sourmash_compare/bacteroides_k31_dna_comp.labels.txt
saving distance matrix to: sourmash_compare/bacteroides_k31_dna_comp
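
For intuition about what the comparison matrix contains: each cell holds one similarity value for one pair of signatures. The simplest such measure is the Jaccard similarity between two sets of hashes, sketched below in plain Python. This is an illustration of the idea, not a reimplementation of sourmash: because the signatures above were computed with `--track-abundance`, sourmash may use an abundance-aware similarity instead. The function names and toy hash sets are made up for this example.

```python
def jaccard(a, b):
    """Shared hashes divided by all hashes seen in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_all(sketches):
    """Return sorted labels and an all-by-all similarity matrix."""
    labels = sorted(sketches)
    matrix = [[jaccard(sketches[x], sketches[y]) for y in labels] for x in labels]
    return labels, matrix


# Toy hash sets standing in for three signatures.
toy = {
    "strain_A": {1, 2, 3, 4},
    "strain_B": {2, 3, 4, 5},
    "distant_C": {9, 10},
}
labels, matrix = compare_all(toy)
for label, row in zip(labels, matrix):
    print(label, ["%.2f" % value for value in row])
```

The diagonal is always 1.0 and the matrix is symmetric, which is a useful sanity check on any real comparison matrix too.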


In [22]:
# Sourmash has built-in plotting capabilities that you can use to visualize the comparison matrix
# You can open these pdf files from your computer, or go back to the jupyter dashboard and click
# on them to view them.
!sourmash plot --labels --pdf sourmash_compare/bacteroides_k31_dna_comp

== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading comparison matrix from sourmash_compare/bacteroides_k31_dna_comp...
...got 61 x 61 matrix.
loading labels from sourmash_compare/bacteroides_k31_dna_comp.labels.txt
saving histogram of matrix values => bacteroides_k31_dna_comp.hist.pdf
wrote dendrogram to: bacteroides_k31_dna_comp.dendro.pdf
wrote numpy distance matrix to: bacteroides_k31_dna_comp.matrix.pdf


In [23]:
# Alternatively, you can output the sourmash compare matrix as a csv, and import it into R to visualize:
!sourmash compare -k 31 --csv sourmash_compare/bacteroides_k31_dna_comp.csv sigs/bacteroides/genomic/*.sig
# see this link for generic R code to visualize the matrix:
# https://sourmash.readthedocs.io/en/latest/other-languages.html#r-code-for-working-with-compare-output

== This is sourmash version 2.1.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 61 signatures total.
downsampling to scaled value of 2000
min similarity in matrix: 0.000
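
If you would rather stay in Python than switch to R, the CSV can also be read with the standard library. This sketch assumes the file has a header row of signature labels followed by one row of similarity values per signature; check that against your own file before relying on it. The helper names are hypothetical, and the example data is a made-up three-sample matrix.

```python
import csv
import io


def read_compare_csv(handle):
    """Read labels and a square similarity matrix from an open CSV file."""
    rows = [row for row in csv.reader(handle) if row]
    if not rows:
        raise ValueError("empty CSV")
    labels = rows[0]
    matrix = [[float(value) for value in row] for row in rows[1:]]
    if len(matrix) != len(labels) or any(len(r) != len(labels) for r in matrix):
        raise ValueError("expected one row and one column per label")
    return labels, matrix


def closest_match(labels, matrix, query):
    """Return the label most similar to `query`, excluding itself."""
    i = labels.index(query)
    others = [j for j in range(len(labels)) if j != i]
    best = max(others, key=lambda j: matrix[i][j])
    return labels[best], matrix[i][best]


# Made-up example in the assumed layout.
example = io.StringIO("a,b,c\n1.0,0.8,0.1\n0.8,1.0,0.2\n0.1,0.2,1.0\n")
labels, matrix = read_compare_csv(example)
print(closest_match(labels, matrix, "a"))
```

For a real file, replace the `io.StringIO` object with `open("sourmash_compare/bacteroides_k31_dna_comp.csv", newline="")`.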


## Your next rotation task!

Use this notebook and sourmash help messages to create the following signatures for all files in the bacteroides, denticola, and gingivalis folders:
+ DNA k = 21, 31, 51; scaled = 2000
+ RNA k = 21, 31, 51; scaled = 2000
+ protein k = 7, 11, 17; scaled = 2000; no encoding
+ protein k = 7, 11, 17; scaled = 2000; dayhoff encoding
+ protein k = 7, 11, 17; scaled = 2000; hp encoding
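
For intuition about the two reduced alphabets in the list above: both replace each amino acid with a group label, so that substitutions within a group no longer change the k-mer. The sketch below uses the commonly cited Dayhoff groups and a hydrophobic/polar split; the exact hp grouping is an assumption here and should be checked against the sourmash version you are running. All names are made up for this example, and unknown characters are mapped to `X`.

```python
# Residues in the same string share one code letter.
DAYHOFF_GROUPS = {
    "C": "a",
    "AGPST": "b",
    "DENQ": "c",
    "HKR": "d",
    "ILMV": "e",
    "FWY": "f",
}
# Assumed hydrophobic (h) / polar (p) split.
HP_GROUPS = {
    "AFGILMPVWY": "h",
    "CDEHKNQRST": "p",
}


def build_table(groups):
    """Expand the grouped definition into a per-residue lookup table."""
    return {aa: code for residues, code in groups.items() for aa in residues}


DAYHOFF = build_table(DAYHOFF_GROUPS)
HP = build_table(HP_GROUPS)


def encode(protein, table):
    """Translate a protein sequence into a reduced alphabet."""
    return "".join(table.get(aa, "X") for aa in protein.upper())


peptide = "MKVLAAGIC"
print("dayhoff:", encode(peptide, DAYHOFF))
print("hp:     ", encode(peptide, HP))
```

The smaller the alphabet, the more tolerant k-mer matching becomes to substitutions, which is the reason to test these encodings across larger evolutionary distances.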

Then, generate `sourmash compare` matrices for each set of signatures (e.g. one for denticola DNA k = 21, scaled = 2000). In total, this will be 45 sourmash compare matrices (3 species × 5 signature types × 3 k sizes). Choose a consistent naming scheme so that you can tell them apart.
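
One way to keep the names consistent is to generate them instead of typing them. The sketch below enumerates every combination and builds names that mirror the `bacteroides_k31_dna_comp` pattern used earlier in this notebook; the encoding labels (`dna`, `rna`, `protein`, `dayhoff`, `hp`) and the function name are just one possible choice.

```python
from itertools import product

SPECIES = ["bacteroides", "denticola", "gingivalis"]

# k sizes for each signature type, taken from the task list.
SIGNATURE_SETS = {
    "dna": (21, 31, 51),
    "rna": (21, 31, 51),
    "protein": (7, 11, 17),
    "dayhoff": (7, 11, 17),
    "hp": (7, 11, 17),
}


def compare_names():
    """Return one output name per species, signature type, and k size."""
    names = []
    for species, (encoding, ksizes) in product(SPECIES, SIGNATURE_SETS.items()):
        for k in ksizes:
            names.append(f"{species}_k{k}_{encoding}_comp")
    return names


names = compare_names()
print(len(names), "matrices, for example:", names[0])
```

Looping over this list when calling `sourmash compare` means the matrix, its labels file, and any plots all share the same predictable prefix.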

Then, generate visualizations for these compare matrices. We want to know whether they accurately capture taxonomic relationships across evolutionary distances. Feel free to be creative with visualizations! You can use the built-in sourmash plots, use the R code linked above, or even explore things like tanglegrams to compare two trees.
