# Week 1 -- Testing protein encodings for similarity detection with sourmash

This notebook requires the following libraries: `pandas`, `requests`. Neither is part of the Python standard library, so if one is not installed in your environment you will get an error message such as `ModuleNotFoundError: No module named 'pandas'`. To install a missing library in your `rotation` environment, run:

`$ conda activate rotation`  
`(rotation) $ conda install pandas`
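
If you would rather check up front than wait for the error, a minimal sketch along these lines reports which of the required libraries cannot be found in the active environment. The helper name `missing_modules` is made up for this example.

```python
import importlib.util

# Libraries this notebook says it needs.
REQUIRED = ["pandas", "requests"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
    print("Try: conda install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

Run it from a cell in the notebook so that it checks the same environment the notebook kernel uses.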

## Download data

First, download some data that we can use as a test set. We will download genomes, "transcriptomes" (RNA sequences computationally predicted from DNA), and amino acid sequences. We will use sequences related to the species *Treponema denticola*, *Bacteroides thetaiotaomicron*, and *Porphyromonas gingivalis*. In the CSV files that drive the downloads, we have recorded the taxonomic relatedness of each organism relative to our species of interest. We will use these levels to assess whether sourmash recapitulates known taxonomic relationships between sequences.

!git clone https://github.com/bluegenes/2018-test_datasets.git

!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/denticola.csv
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/bacteroides.csv
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/gingivalis.csv

In [3]:
!git clone https://github.com/bluegenes/2018-test_datasets.git

fatal: destination path '2018-test_datasets' already exists and is not an empty directory.


In [5]:
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/denticola.csv

genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/343/205/GCA_000343205.1_CA_glsol121/GCA_000343205.1_CA_glsol121_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/568/735/GCA_000568735.2_ASM56873v2/GCA_000568735.2_ASM56873v2_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/568/735/GCA_000568735.2_ASM56873v2/GCA_000568735.2_ASM56873v2_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/568/735/GCA_000568735.2_ASM56873v2/GCA_000568735.2_ASM56873v2_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/818/865/GCA_000818865.1_ASM81886v1/GCA_000818865.1_ASM81886v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/818/865/GCA_0008

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/164/975/GCA_900164975.1_16852_2_85/GCA_900164975.1_16852_2_85_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/758/165/GCA_000758165.1_Spiroch1.0/GCA_000758165.1_Spiroch1.0_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/501/115/GCA_000501115.1_BorGarIPT101/GCA_000501115.1_BorGarIPT101_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/501/115/GCA_000501115.1_BorGarIPT101/GCA_000501115.1_BorGarIPT101_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/501/115/GCA_000501115.1_BorGarIPT101/GCA_000501115.1_BorGarIPT101_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/246/8

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/475/GCA_001829475.1_ASM182947v1/GCA_001829475.1_ASM182947v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/443/305/GCA_001443305.1_ASM144330v1/GCA_001443305.1_ASM144330v1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/147/745/GCA_900147745.1_138_1/GCA_900147745.1_138_1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/147/745/GCA_900147745.1_138_1/GCA_900147745.1_138_1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/900/147/745/GCA_900147745.1_138_1/GCA_900147745.1_138_1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/340/745/GCA_000340745.1_Trep_dent_US-Tr

rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/340/645/GCA_000340645.1_Trep_dent_H-22_V1/GCA_000340645.1_Trep_dent_H-22_V1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/000/338/615/GCA_000338615.1_Trep_dent_ATCC_33520_V1/GCA_000338615.1_Trep_dent_ATCC_33520_V1_rna_from_genomic.fna.gz
genome: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/165/GCA_001829165.1_ASM182916v1/GCA_001829165.1_ASM182916v1_genomic.fna.gz
protein: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/165/GCA_001829165.1_ASM182916v1/GCA_001829165.1_ASM182916v1_protein.faa.gz
rna: https://ftp.ncbi.nih.gov/genomes/all/GCA/001/829/165/GCA_001829165.1_ASM182916v1/GCA_001829165.1_ASM18291

In [None]:
!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/bacteroides.csv

In [None]:

!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/gingivalis.csv

In [38]:
# look and see how many genomes were downloaded for denticola
!ls 2018-test_datasets/denticola/genomic/ | wc -l
!ls 2018-test_datasets/denticola/genomic/


69
GCA_000147075.1_ASM14707v1_genomic.fna.gz
GCA_000216715.2_CLC_glsol140_genomic.fna.gz
GCA_000217015.3_CLC_glsol119_genomic.fna.gz
GCA_000217655.1_ASM21765v1_genomic.fna.gz
GCA_000222305.1_ASM22230v1_genomic.fna.gz
GCA_000236685.1_ASM23668v1_genomic.fna.gz
GCA_000239475.1_ASM23947v1_genomic.fna.gz
GCA_000242595.3_ASM24259v3_genomic.fna.gz
GCA_000246415.2_CLC_glsol068_genomic.fna.gz
GCA_000246815.1_ASM24681v1_genomic.fna.gz
GCA_000260795.1_ASM26079v1_genomic.fna.gz
GCA_000301975.1_ASM30197v1_genomic.fna.gz
GCA_000338595.1_Trep_dent_ATCC_33521_V1_genomic.fna.gz
GCA_000338615.1_Trep_dent_ATCC_33520_V1_genomic.fna.gz
GCA_000338635.1_Trep_dent_ASLM_V1_genomic.fna.gz
GCA_000340605.1_Trep_dent_H1-T_V1_genomic.fna.gz
GCA_000340645.1_Trep_dent_H-22_V1_genomic.fna.gz
GCA_000340725.1_Trep_dent_AL-2_V1_genomic.fna.gz
GCA_000340745.1_Trep_dent_US-Trep_V1_genomic.fna.gz
GCA_000342705.1_CA_glsol153_genomic.fna.gz
GCA_000343205.1_CA_glsol121_genomic.fna.gz
GCA_000382565.1_ASM38256v1_genomic.fna.gz
G

## Generating signatures using sourmash

We will generate signatures for our data. One signature can hold multiple k sizes, but only one scaled value and only one molecule type/encoding (e.g. DNA or protein).
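
To make "k size" and "scaled" concrete, here is a toy sketch of the idea in plain Python. It is not sourmash's implementation: the hash function (`hashlib.sha1`) is a stand-in chosen for this example, the toy sequence is made up, and as far as I know sourmash also treats a DNA k-mer and its reverse complement as the same k-mer, which this sketch skips.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def scaled_sketch(seq, k, scaled):
    """Keep only hashes below MAX_HASH / scaled, roughly 1 in `scaled` k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Jaccard similarity of two sketches (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One "signature": several k sizes, a single scaled value, a single molecule type.
toy_seq = "ATGGCGTACGTTAGCCGATAGGCTTAACGGATCCGTAGCTAGGCTAACGTTAGC" * 3
signature = {k: scaled_sketch(toy_seq, k, scaled=2) for k in (21, 31, 51)}
for k, sketch in signature.items():
    print(k, len(sketch))
```

Two sketches are only comparable when they were built with the same k size and molecule type, which is why the commands below pass the same `-k` and `--scaled` values for every file.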

In [22]:
# calculate signatures for RNA and DNA
!mkdir -p sigs/bacteroides/genomic
!for infile in 2018-test_datasets/bacteroides/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/genomic/${out_name}.sig ${infile}; done

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
calculated 3 signatures for 2 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bac

calculated 3 signatures for 103 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
calculated 3 signatures for 59 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.sig. Note: signature
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 179 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000759315.1_04_NF40_HMP9302v01_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000759315.1_04_NF40_HMP9302v01_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
calculated 3 signatures for 77 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000785025.1_ASM7
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 1730 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001373135.1_2e6A_assembly_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001373135.1_2e6A_assembly_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
calculated 3 signatures for 69 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001398875.1_Mother10-2_ge
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 423 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001614375.1_ASM161437v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001614375.1_ASM161437v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
calculated 3 signatures for 81 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001659685.2_ASM165968v2_gen
== This is sourmash version 2.2.0. ==

saved signature(s) to sigs/bacteroides/genomic/GCA_900109385.1_IMG-taxon_2593339226_annotated_assembly_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
calculated 3 signatures for 74 sequences in 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_as
== This is sourmash version 2.2.0. ==

In [35]:
##worker Bee

!mkdir -p sigs/bacteroides/genomic2
!for infile in 2018-test_datasets/bacteroides/genomic2/*fna.gz; \
do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/genomic2/${out_name}.sig ${infile}; done


mv: cannot stat 'sigs/bacteroides/genomic/*protein.sig': No such file or directory


In [32]:
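# Using the code above, and the sourmash compute help message, calculate signatures for proteins
!sourmash compute -h

# The input files are already amino acid sequences, so request protein signatures explicitly;
# without --protein/--input-is-protein, sourmash computes nucleotide signatures only.
# Assumption: sourmash 2.x takes protein k sizes in nucleotide units (3 x amino acids),
# so -k 21,33,51 corresponds to amino acid k = 7, 11, 17.
!mkdir -p sigs/bacteroides/proteomic
!for infile in 2018-test_datasets/bacteroides/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute --protein --no-dna --input-is-protein -k 21,33,51 --scaled 2000 --track-abundance -o sigs/bacteroides/proteomic/${out_name}.sig ${infile}; done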


usage: sourmash [--protein] [--no-protein] [--dayhoff] [--no-dayhoff] [--dna]
                [--no-dna] [-q] [--input-is-protein] [-k KSIZES]
                [-n NUM_HASHES] [--check-sequence] [-f] [-o OUTPUT]
                [--singleton] [--merge MERGED] [--name-from-first]
                [--input-is-10x] [--count-valid-reads COUNT_VALID_READS]
                [--write-barcode-meta-csv WRITE_BARCODE_META_CSV]
                [-p PROCESSES] [--save-fastas SAVE_FASTAS]
                [--line-count LINE_COUNT] [--track-abundance]
                [--scaled SCALED] [--seed SEED] [--randomize]
                [--license LICENSE]
                [--rename-10x-barcodes RENAME_10X_BARCODES]
                [--barcodes-file BARCODES_FILE]
                filenames [filenames ...]
sourmash: error: unrecognized arguments: -h


In [36]:
# Now you can use sourmash compare to compare signatures that were calculated with the same k size, the same
# molecule type, and the same encoding (e.g. dayhoff or hp).
# Take a look at the MetaPalette paper for some applications of k sizes to detecting relatedness across
# evolutionary distances, or take our word that k = 21 ~ genus-level similarity, k = 31 ~ species-level
# similarity, and k = 51 ~ strain-level similarity. https://msystems.asm.org/content/1/3/e00020-16
!mkdir -p sourmash_compare
!sourmash compare -k 31 -o sourmash_compare/bacteroides_k31_dna_comp sigs/bacteroides/genomic/*.sig

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 61 signatures total.
downsampling to scaled value of 2000

min similarity in matrix: 0.000
saving labels to: sourmash_compare/bacteroides_k31_dna_comp.labels.txt
saving distance matrix to: sourmash_compare/bacteroides_k31_dna_comp


In [40]:
# Sourmash has built-in plotting capabilities that you can use to visualize the comparison matrix.
# You can open these pdf files from your computer, or go back to the jupyter dashboard and click
# on them to view them.
!sourmash plot --labels --pdf sourmash_compare/bacteroides_k31_dna_comp
%mv *.pdf /mnt/c/Users/user/Desktop/TESTcsv/


In [42]:
# Alternatively, you can output the sourmash compare matrix as a csv, and import it into R to visualize:
!sourmash compare -k 31 --csv sourmash_compare/bacteroides_k31_dna_comp.csv sigs/bacteroides/genomic/*.sig
# see this link for generic R code to visualize the matrix:
# https://sourmash.readthedocs.io/en/latest/other-languages.html#r-code-for-working-with-compare-output

%mv sourmash_compare/*.csv /mnt/c/Users/user/Desktop/TESTcsv/

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 61 signatures total.
downsampling to scaled value of 2000

min similarity in matrix: 0.000
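
If you would like a quick look at the CSV from Python before moving to R, a sketch like the following may help. It assumes the `--csv` file has a header row of signature labels followed by one row of similarity values per signature; check your own file before relying on that. The function names and the tiny demo matrix are made up for illustration.

```python
import csv


def parse_compare_csv(lines):
    """Parse CSV text: first row = labels, remaining rows = similarity values."""
    rows = [row for row in csv.reader(lines) if row]
    labels = rows[0]
    matrix = [[float(value) for value in row] for row in rows[1:]]
    if len(matrix) != len(labels) or any(len(row) != len(labels) for row in matrix):
        raise ValueError("expected a square matrix with one row and column per label")
    return labels, matrix


def load_compare_csv(path):
    """Read a compare matrix from a CSV file on disk."""
    with open(path, newline="") as handle:
        return parse_compare_csv(handle)


def most_similar_pairs(labels, matrix, top=5):
    """Return the `top` off-diagonal pairs, highest similarity first."""
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    ]
    return sorted(pairs, reverse=True)[:top]


# Tiny made-up matrix, only to show the call pattern (not real sourmash output).
demo = ["genome_a,genome_b,genome_c", "1.0,0.5,0.1", "0.5,1.0,0.2", "0.1,0.2,1.0"]
labels, matrix = parse_compare_csv(demo)
for similarity, first, second in most_similar_pairs(labels, matrix, top=2):
    print(f"{first} vs {second}: {similarity:.3f}")
```

For the real data, call `load_compare_csv` with the path to wherever the CSV ended up after the `%mv` above.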


## Your next rotation task!

Use this notebook and sourmash help messages to create the following signatures for all files in the bacteroides, denticola, and gingivalis folders:
+ DNA k = 21, 31, 51; scaled = 2000
+ RNA k = 21, 31, 51; scaled = 2000
+ protein k = 7, 11, 17; scaled = 2000; no encoding
+ protein k = 7, 11, 17; scaled = 2000; dayhoff encoding
+ protein k = 7, 11, 17; scaled = 2000; hp encoding
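
One way to keep the settings for these signature types straight is to build each command from a small table, as in the sketch below. The flags are the ones listed in the `sourmash compute` usage message printed earlier in this notebook; how they combine is an assumption to verify on one file first. Two more assumptions: protein k sizes are given in nucleotide units in sourmash 2.x (so 21,33,51 stands for amino acid k = 7, 11, 17), and `--track-abundance` is kept because the commands above use it. The usage message shows no hp flag, so this sketch deliberately refuses that type instead of guessing one. `compute_command` and `as_shell` are names invented for this example.

```python
import shlex

# Flags per signature type, taken from the `sourmash compute` usage message above.
# Assumption: protein k sizes are passed in nucleotide units (3 x amino acids).
SETTINGS = {
    "dna": {"flags": [], "ksizes": "21,31,51"},
    "rna": {"flags": [], "ksizes": "21,31,51"},
    "protein": {"flags": ["--protein", "--no-dna", "--input-is-protein"], "ksizes": "21,33,51"},
    "dayhoff": {"flags": ["--dayhoff", "--no-dna", "--input-is-protein"], "ksizes": "21,33,51"},
}


def compute_command(infile, outfile, sig_type, scaled=2000, track_abundance=True):
    """Build the argument list for one `sourmash compute` call."""
    if sig_type not in SETTINGS:
        raise ValueError(f"no known flags for signature type {sig_type!r}")
    setting = SETTINGS[sig_type]
    args = ["sourmash", "compute", *setting["flags"], "-k", setting["ksizes"], "--scaled", str(scaled)]
    if track_abundance:
        args.append("--track-abundance")
    return args + ["-o", outfile, infile]


def as_shell(args):
    """Quote the arguments so the command can be pasted after `!` in a cell."""
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical file names, just to show the resulting command line.
print(as_shell(compute_command("example_protein.faa.gz", "example_protein.sig", "dayhoff")))
```

Printing the command before running it makes it easy to compare against the help message and to spot a wrong flag early.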

Then, generate `sourmash compare` matrices for each set of signatures (e.g. one for denticola DNA, k = 21, scaled = 2000). In total, this will be 45 sourmash compare matrices. Choose a consistent naming scheme so that you can tell them apart.

Then, generate visualizations for these compare matrices. We want to know whether they accurately capture taxonomic relationships across evolutionary distances. Feel free to be creative with visualizations! You can use built-in sourmash plots, use the R code linked above, or even explore things like tanglegrams to compare two trees.


*Here I got really overwhelmed, so this is what I want to be able to run in RStudio markdown once I figure out how to install Python and its packages... >.<*

Note: where does 45 come from? Number of k sizes (3) × signature types (5) × organisms (3) = 45.
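
The count can be checked by listing every combination. The sketch below also proposes one possible naming scheme, modeled on the `bacteroides_k31_dna_comp` name used earlier; the scheme is only a suggestion, and the protein k sizes are written as amino acid lengths, as in the task list.

```python
# Organisms and signature types from the task description.
ORGANISMS = ["bacteroides", "denticola", "gingivalis"]
SIGNATURE_TYPES = {
    "dna": [21, 31, 51],
    "rna": [21, 31, 51],
    "protein": [7, 11, 17],
    "dayhoff": [7, 11, 17],
    "hp": [7, 11, 17],
}


def compare_names():
    """Return one output name per organism, signature type, and k size."""
    return [
        f"{organism}_k{k}_{sig_type}_comp"
        for organism in ORGANISMS
        for sig_type, ksizes in SIGNATURE_TYPES.items()
        for k in ksizes
    ]


names = compare_names()
print(len(names))  # 3 organisms x 5 signature types x 3 k sizes
for name in names[:3]:
    print(name)
```

Because every name encodes the organism, k size, and signature type, the matching matrix, labels file, and plot can share the same prefix.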

In [None]:
#!git clone https://github.com/bluegenes/2018-test_datasets.git
##Download files
#!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/denticola.csv
#!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/bacteroides.csv
#!python 2018-test_datasets/download_genbank_datasets.py --genbank --protein --rna --subfolder 2018-test_datasets/gingivalis.csv

#count up
# look and see how many genomes were downloaded for denticola
#!ls 2018-test_datasets/denticola/genomic/ | wc -l
#!ls 2018-test_datasets/denticola/genomic/



# make a directory for the signatures of each organism and molecule type (6 = 3 organisms x 2 types)
!mkdir -p sigs/bacteroides/genomic
!mkdir -p sigs/bacteroides/proteomic

!mkdir -p sigs/denticola/genomic
!mkdir -p sigs/denticola/proteomic

!mkdir -p sigs/gingivalis/genomic
!mkdir -p sigs/gingivalis/proteomic

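# compute signatures (sourmash reads the gzipped FASTA files directly, so no unzip step is needed)
!for infile in 2018-test_datasets/bacteroides/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/genomic/${out_name}.sig ${infile}; done

# protein input: request protein signatures explicitly
# (assumption: sourmash 2.x takes protein k sizes in nucleotide units, 3 x amino acids)
!for infile in 2018-test_datasets/bacteroides/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute --protein --no-dna --input-is-protein -k 21,33,51 --scaled 2000 --track-abundance -o sigs/bacteroides/proteomic/${out_name}.sig ${infile}; done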


# compare matrix as a csv, and import it into R to visualize:
!sourmash compare -k 31 --csv sourmash_compare/bacteroides_k31_dna_comp.csv sigs/bacteroides/genomic/*.sig
# see this link for generic R code to visualize the matrix:
# https://sourmash.readthedocs.io/en/latest/other-languages.html#r-code-for-working-with-compare-output

%mv sourmash_compare/*.csv /mnt/c/Users/user/Desktop/TESTcsv/

In [46]:
# make a directory for the signatures of each organism and molecule type (6 = 3 organisms x 2 types)
!mkdir -p sigs/bacteroides/genomic
!mkdir -p sigs/bacteroides/proteomic

!mkdir -p sigs/denticola/genomic
!mkdir -p sigs/denticola/proteomic

!mkdir -p sigs/gingivalis/genomic
!mkdir -p sigs/gingivalis/proteomic

In [47]:
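# compute signatures (sourmash reads the gzipped FASTA files directly)
# Note: the protein loops below reuse the nucleotide settings, so sourmash treats the .faa.gz input as DNA
# (see "Computing only nucleotide (and not protein) signatures" in the output). Protein signatures need
# --protein --no-dna --input-is-protein.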

!for infile in 2018-test_datasets/bacteroides/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/genomic/${out_name}.sig ${infile}; done
!for infile in 2018-test_datasets/bacteroides/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/bacteroides/proteomic/${out_name}.sig ${infile}; done


!for infile in 2018-test_datasets/denticola/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/denticola/genomic/${out_name}.sig ${infile}; done
!for infile in 2018-test_datasets/denticola/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/denticola/proteomic/${out_name}.sig ${infile}; done


!for infile in 2018-test_datasets/gingivalis/genomic/*fna.gz; do out_name=$(basename $infile .fna.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/gingivalis/genomic/${out_name}.sig ${infile}; done
!for infile in 2018-test_datasets/gingivalis/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o sigs/gingivalis/proteomic/${out_name}.sig ${infile}; done



== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
calculated 3 signatures for 2 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bac

calculated 3 signatures for 103 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
calculated 3 signatures for 59 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000495955.1_NUHP2_genomic.sig. Note: signature
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 179 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000759315.1_04_NF40_HMP9302v01_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000759315.1_04_NF40_HMP9302v01_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
calculated 3 signatures for 77 sequences in 2018-test_datasets/bacteroides/genomic/GCA_000785025.1_ASM78502v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_000785025.1_ASM7
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 1730 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001373135.1_2e6A_assembly_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001373135.1_2e6A_assembly_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
calculated 3 signatures for 69 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001398875.1_Mother10-2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001398875.1_Mother10-2_ge
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 423 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001614375.1_ASM161437v1_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001614375.1_ASM161437v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
calculated 3 signatures for 81 sequences in 2018-test_datasets/bacteroides/genomic/GCA_001659685.2_ASM165968v2_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_001659685.2_ASM165968v2_gen
== This is sourmash version 2.2.0. ==

saved signature(s) to sigs/bacteroides/genomic/GCA_900109385.1_IMG-taxon_2593339226_annotated_assembly_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
calculated 3 signatures for 74 sequences in 2018-test_datasets/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_assembly_genomic.fna.gz
saved signature(s) to sigs/bacteroides/genomic/GCA_900110645.1_IMG-taxon_2693429910_annotated_as
== This is sourmash version 2.2.0. ==

calculated 3 signatures for 4131 sequences in 2018-test_datasets/bacteroides/protein/GCA_000159855.2_Bact_sp_3_2_5_V2_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_000159855.2_Bact_sp_3_2_5_V2_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_000163055.2_ASM16305v2_protein.faa.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_000163055.2_ASM16305v2_protein.faa.gz
calculated 3 signatures for 1960 sequences in 2018-test_datasets/bacteroides/protein/GCA_000163055.2_ASM16305v2_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_000163055.2_A
== This is sourmash version 2.2.0. ==

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_000501415.1_BorBurgIPT48_protein.faa.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_000501415.1_BorBurgIPT48_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record 

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_000626635.1_ASM62663v1_protein.faa.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_000626635.1_ASM62663v1_protein.faa.gz
calculated 3 signatures for 3971 sequences in 2018-test_datasets/bacteroides/protein/GCA_000626635.1_ASM62663v1_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_000626635.1_ASM62663v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 2018-test_dat

Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record in enumerate(screed.open(filename)):
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/screed/openscreed.py", line 39, in __init__
    self.iter_fn = self.open_reader(filename, *args, **kwargs)
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/screed/openscreed.py", line 95, in open_reader
    raise ValueError("unknown file format for '%s'" % filename)
ValueError: unknown file format for '2018-test_datasets/bacteroides/protein/GCA_001049535.1_3731_protein.faa.gz'
[K== This is sourmash version 2.2.0. ==
[K== Please cite Bro

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_001405895.1_14207_7_22_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_001405895.1_14207_7_22_protein.faa.gz
[Kcalculated 3 signatures for 5036 sequences in 2018-test_datasets/bacteroides/protein/GCA_001405895.1_14207_7_22_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_001405895.1_14207_7_22_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_dat

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_001614375.1_ASM161437v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_001614375.1_ASM161437v1_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record in

[Kcalculated 3 signatures for 2030 sequences in 2018-test_datasets/bacteroides/protein/GCA_900032635.1_14555_6_27_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_900032635.1_14555_6_27_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_900104585.1_PRJEB16348_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_900104585.1_PRJEB16348_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packag

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly_protein.faa.gz
[Kcalculated 3 signatures for 3335 sequences in 2018-test_datasets/bacteroides/protein/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly_protein.faa.gz
saved signature(s) to sigs/bacteroides/proteomic/GCA_900142325.1_IMG-taxon_2698536696_annotated_assembly_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105

[Kcalculated 3 signatures for 667 sequences in 2018-test_datasets/denticola/genomic/GCA_000246415.2_CLC_glsol068_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000246415.2_CLC_glsol068_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/genomic/GCA_000246815.1_ASM24681v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/genomic/GCA_000246815.1_ASM24681v1_genomic.fna.gz
[Kcalculated 3 signatures for 1 sequences in 2018-test_datasets/denticola/genomic/GCA_000246815.1_ASM24681v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000246815.1_ASM24681v1_genomic.sig. Note:
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 194 sequences in 2018-test_datasets/denticola/genomic/GCA_000501035.1_BorGarIPT95_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000501035.1_BorGarIPT95_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/genomic/GCA_000501115.1_BorGarIPT101_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/genomic/GCA_000501115.1_BorGarIPT101_genomic.fna.gz
[Kcalculated 3 signatures for 713 sequences in 2018-test_datasets/denticola/genomic/GCA_000501115.1_BorGarIPT101_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000501115.1_BorGarIPT101_genomic.si
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 2 sequences in 2018-test_datasets/denticola/genomic/GCA_000818865.1_ASM81886v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000818865.1_ASM81886v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/genomic/GCA_000941035.1_ASM94103v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/genomic/GCA_000941035.1_ASM94103v1_genomic.fna.gz
[Kcalculated 3 signatures for 5 sequences in 2018-test_datasets/denticola/genomic/GCA_000941035.1_ASM94103v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_000941035.1_ASM94103v1_genomic.sig. Note: signa
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 373 sequences in 2018-test_datasets/denticola/genomic/GCA_001829165.1_ASM182916v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_001829165.1_ASM182916v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/genomic/GCA_001829295.1_ASM182929v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/genomic/GCA_001829295.1_ASM182929v1_genomic.fna.gz
[Kcalculated 3 signatures for 138 sequences in 2018-test_datasets/denticola/genomic/GCA_001829295.1_ASM182929v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_001829295.1_ASM182929v1_genomic.sig. N
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 122 sequences in 2018-test_datasets/denticola/genomic/GCA_002069655.1_ASM206965v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_002069655.1_ASM206965v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/genomic/GCA_002069965.1_ASM206996v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/genomic/GCA_002069965.1_ASM206996v1_genomic.fna.gz
[Kcalculated 3 signatures for 160 sequences in 2018-test_datasets/denticola/genomic/GCA_002069965.1_ASM206996v1_genomic.fna.gz
saved signature(s) to sigs/denticola/genomic/GCA_002069965.1_ASM206996v1_genomic.sig. N
== This is sourmash version 2.2.0. ==

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_000217655.1_ASM21765v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_000217655.1_ASM21765v1_protein.faa.gz
[Kcalculated 3 signatures for 1010 sequences in 2018-test_datasets/denticola/protein/GCA_000217655.1_ASM21765v1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_000217655.1_ASM21765v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/de

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_000338595.1_Trep_dent_ATCC_33521_V1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_000338595.1_Trep_dent_ATCC_33521_V1_protein.faa.gz
[Kcalculated 3 signatures for 2531 sequences in 2018-test_datasets/denticola/protein/GCA_000338595.1_Trep_dent_ATCC_33521_V1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_000338595.1_Trep_dent_ATCC_33521_V1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kc

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_000382565.1_ASM38256v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_000382565.1_ASM38256v1_protein.faa.gz
[Kcalculated 3 signatures for 832 sequences in 2018-test_datasets/denticola/protein/GCA_000382565.1_ASM38256v1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_000382565.1_ASM38256v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/den

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_000501035.1_BorGarIPT95_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_000501035.1_BorGarIPT95_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record in enu

[Kcalculated 3 signatures for 1085 sequences in 2018-test_datasets/denticola/protein/GCA_000568735.2_ASM56873v2_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_000568735.2_ASM56873v2_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_000758165.1_Spiroch1.0_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_000758165.1_Spiroch1.0_protein.faa.gz
[Kcalculated 3 signatures for 2303 sequences in 2018-test_datasets/denticola/protein/GCA_000758165.1_Spiroch1.0_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_000758165.1_Spiroch1.0_protein.sig. N
== This is sourmash version 2.2.0. ==

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_001604335.1_ASM160433v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_001604335.1_ASM160433v1_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record in enu

[Kcalculated 3 signatures for 2855 sequences in 2018-test_datasets/denticola/protein/GCA_001829505.1_ASM182950v1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_001829505.1_ASM182950v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_001830585.1_ASM183058v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_001830585.1_ASM183058v1_protein.faa.gz
[Kcalculated 3 signatures for 3119 sequences in 2018-test_datasets/denticola/protein/GCA_001830585.1_ASM183058v1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_001830585.1_ASM183058v1_protein.
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 2775 sequences in 2018-test_datasets/denticola/protein/GCA_002069965.1_ASM206996v1_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_002069965.1_ASM206996v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/denticola/protein/GCA_900018035.1_7521_5_26_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/denticola/protein/GCA_900018035.1_7521_5_26_protein.faa.gz
[Kcalculated 3 signatures for 2780 sequences in 2018-test_datasets/denticola/protein/GCA_900018035.1_7521_5_26_protein.faa.gz
saved signature(s) to sigs/denticola/proteomic/GCA_900018035.1_7521_5_26_protein.sig. Not
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 29 sequences in 2018-test_datasets/gingivalis/genomic/GCA_000482365.1_ASM48236v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_000482365.1_ASM48236v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/genomic/GCA_000503975.1_SJD2_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/genomic/GCA_000503975.1_SJD2_genomic.fna.gz
[Kcalculated 3 signatures for 117 sequences in 2018-test_datasets/gingivalis/genomic/GCA_000503975.1_SJD2_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_000503975.1_SJD2_genomic.sig. Note: signature license is
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 1 sequences in 2018-test_datasets/gingivalis/genomic/GCA_000739415.1_ASM73941v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_000739415.1_ASM73941v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/genomic/GCA_000765945.1_ASM76594v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/genomic/GCA_000765945.1_ASM76594v1_genomic.fna.gz
[Kcalculated 3 signatures for 31 sequences in 2018-test_datasets/gingivalis/genomic/GCA_000765945.1_ASM76594v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_000765945.1_ASM76594v1_genomic.sig. Note
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 196 sequences in 2018-test_datasets/gingivalis/genomic/GCA_001261715.1_HGC4_v3_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_001261715.1_HGC4_v3_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/genomic/GCA_001263815.1_ASM126381v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/genomic/GCA_001263815.1_ASM126381v1_genomic.fna.gz
[Kcalculated 3 signatures for 1 sequences in 2018-test_datasets/gingivalis/genomic/GCA_001263815.1_ASM126381v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_001263815.1_ASM126381v1_genomic.sig. Note:
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 99 sequences in 2018-test_datasets/gingivalis/genomic/GCA_001897595.1_ASM189759v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_001897595.1_ASM189759v1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/genomic/GCA_001898165.1_ASM189816v1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/genomic/GCA_001898165.1_ASM189816v1_genomic.fna.gz
[Kcalculated 3 signatures for 80 sequences in 2018-test_datasets/gingivalis/genomic/GCA_001898165.1_ASM189816v1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_001898165.1_ASM189816v1_genomic.si
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 72 sequences in 2018-test_datasets/gingivalis/genomic/GCA_900157215.1_Strain_3-3_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_900157215.1_Strain_3-3_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/genomic/GCA_900157325.1_3A1_genomic.fna.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/genomic/GCA_900157325.1_3A1_genomic.fna.gz
[Kcalculated 3 signatures for 56 sequences in 2018-test_datasets/gingivalis/genomic/GCA_900157325.1_3A1_genomic.fna.gz
saved signature(s) to sigs/gingivalis/genomic/GCA_900157325.1_3A1_genomic.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 4041 sequences in 2018-test_datasets/gingivalis/protein/GCA_000273295.1_Bact_vulg_CL09T03C04_V1_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_000273295.1_Bact_vulg_CL09T03C04_V1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_000380305.1_PgingivalisJCVISC001v1.0_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_000380305.1_PgingivalisJCVISC001v1.0_protein.faa.gz
[Kcalculated 3 signatures for 2354 sequences in 2018-test_datasets/gingivalis/protein/GCA_000380305.1_PgingivalisJCVISC001v1.0_protein.faa.gz
[K== This is sour

[Kcalculated 3 signatures for 2935 sequences in 2018-test_datasets/gingivalis/protein/GCA_000510525.1_BU063sc679_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_000510525.1_BU063sc679_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_000510545.1_BU063sc13_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_000510545.1_BU063sc13_protein.faa.gz
[Kcalculated 3 signatures for 2371 sequences in 2018-test_datasets/gingivalis/protein/GCA_000510545.1_BU063sc13_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_000510545.1_BU063sc13_protein.sig.
== This is sourmash version 2.2.0. ==

[Kcalculated 3 signatures for 3971 sequences in 2018-test_datasets/gingivalis/protein/GCA_000626635.1_ASM62663v1_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_000626635.1_ASM62663v1_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_000739415.1_ASM73941v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_000739415.1_ASM73941v1_protein.faa.gz
[Kcalculated 3 signatures for 1958 sequences in 2018-test_datasets/gingivalis/protein/GCA_000739415.1_ASM73941v1_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_000739415.1_ASM73941v1_protein.
== This is sourmash version 2.2.0. ==

Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/command_compute.py", line 567, in compute
    for n, record in enumerate(screed.open(filename)):
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/screed/openscreed.py", line 39, in __init__
    self.iter_fn = self.open_reader(filename, *args, **kwargs)
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/screed/openscreed.py", line 95, in open_reader
    raise ValueError("unknown file format for '%s'" % filename)
ValueError: unknown file format for '2018-test_datasets/gingivalis/protein/GCA_001049535.1_3731_protein.faa.gz'
[K== This is sourmash version 2.2.0. ==
[K== Please cite Brow

[Kcalculated 3 signatures for 3813 sequences in 2018-test_datasets/gingivalis/protein/GCA_001406635.1_13470_2_63_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_001406635.1_13470_2_63_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_001438125.1_ASM143812v1_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_001438125.1_ASM143812v1_protein.faa.gz
[Kcalculated 3 signatures for 1662 sequences in 2018-test_datasets/gingivalis/protein/GCA_001438125.1_ASM143812v1_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_001438125.1_ASM143812v1_prot
== This is sourmash version 2.2.0. ==

[K== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_001670785.1_RCAD0183_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_001670785.1_RCAD0183_protein.faa.gz
[Kcalculated 3 signatures for 2017 sequences in 2018-test_datasets/gingivalis/protein/GCA_001670785.1_RCAD0183_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_001670785.1_RCAD0183_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingiv

[Kcalculated 3 signatures for 4961 sequences in 2018-test_datasets/gingivalis/protein/GCA_900085525.1_12045_7_37_protein.faa.gz
saved signature(s) to sigs/gingivalis/proteomic/GCA_900085525.1_12045_7_37_protein.sig. Note: signature license is CC0.
== This is sourmash version 2.2.0. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Ksetting num_hashes to 0 because --scaled is set
[Kcomputing signatures for files: 2018-test_datasets/gingivalis/protein/GCA_900089535.1_A04_protein.faa.gz
[KComputing signature for ksizes: [21, 31, 51]
[KComputing only nucleotide (and not protein) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/gingivalis/protein/GCA_900089535.1_A04_protein.faa.gz
Traceback (most recent call last):
  File "/home/hehouts/miniconda3/envs/rotation/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/hehouts/miniconda3/envs/rotation/lib/python3.7/site-packages/sourmash/__main

...2018-test_datasets/gingivalis/protein/GCA_900157325.1_3A1_protein.faa.gz 1845 sequences
calculated 3 signatures for 1846 sequences in 2018-test_datasets/gingivalis/protein/GCA_900157325.1_3A1_protein.faa.gz
time taken to save signatures is 0.00014 seconds
saved signature(s) to sigs/gingivalis/proteomic/GCA_900157325.1_3A1_protein.sig. Note: signature license is CC0.
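
Several of the runs above stop with `ValueError: unknown file format` for a `.faa.gz` file. One possible cause is that the download did not produce a real gzipped FASTA file (for example, an empty file or an error page saved under that name). The sketch below is one way to check for that before computing signatures; it is not part of sourmash or screed. The helper name `check_gzipped_fasta` is made up for illustration, and the glob pattern assumes the folder layout seen in the log paths.

```python
import glob
import gzip


def check_gzipped_fasta(path):
    """Return 'ok', or a short reason why `path` does not look like gzipped FASTA."""
    try:
        # errors="replace" keeps odd bytes from raising a decoding error
        with gzip.open(path, "rt", errors="replace") as handle:
            first_line = handle.readline()
    except (OSError, EOFError) as error:
        # not gzip at all, a truncated gzip stream, or an unreadable file
        return f"not readable as gzip: {error}"
    if not first_line:
        return "empty file"
    if not first_line.startswith(">"):
        return "first line is not a FASTA header"
    return "ok"


# Assumed layout: 2018-test_datasets/<organism>/protein/<accession>_protein.faa.gz
for path in sorted(glob.glob("2018-test_datasets/*/protein/*.faa.gz")):
    status = check_gzipped_fasta(path)
    if status != "ok":
        print(path, status)
```

Any file this prints is worth re-downloading or dropping from the test set before rerunning `sourmash compute`.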

In [89]:
#%pwd
#!mkdir -p sourmash_compare/compare_CSVs

# 31-mer
##Bacteroides, genomic
#!sourmash compare -k 31 -o sourmash_compare/bacteroides_k31_genomic_comp sigs/bacteroides/genomic/*.sig
#!sourmash plot --labels --pdf sourmash_compare/bacteroides_k31_genomic_comp
#%mv *.pdf /mnt/c/Users/user/Desktop/CheckPlotPDFs/
#!sourmash compare -k 31 --csv sourmash_compare/compare_CSVs/bacteroides_k31_genomic_comp.csv sigs/bacteroides/genomic/*.sig
#%mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/

##Bacteroides, proteomic
#!sourmash compare -k 31 -o sourmash_compare/bacteroides_k31_proteomic_comp sigs/bacteroides/proteomic/*.sig
#!sourmash plot --labels --pdf sourmash_compare/bacteroides_k31_proteomic_comp
#%mv *.pdf /mnt/c/Users/user/Desktop/CheckPlotPDFs/
##This doesn't appear useful (looks like only one sequence is present)
#!sourmash compare -k 31 --csv sourmash_compare/compare_CSVs/bacteroides_k31_proteomic_comp.csv sigs/bacteroides/proteomic/*.sig
#%mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/
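
The note above that the proteomic comparison does not look useful may be related to the compute logs, which report "Computing only nucleotide (and not protein) signatures" even for the `.faa.gz` inputs. The sketch below builds, but does not run, a `sourmash compute` command that asks for protein signatures. Treat every flag here as an assumption to confirm with `sourmash compute --help` for version 2.2.0. The function name, the k-mer size, and the `scaled` value are placeholders; the example paths are taken from the log output.

```python
import shlex


def protein_compute_command(fasta_path, sig_path, ksizes=(21,), scaled=2000):
    """Build (without running) a `sourmash compute` command for a protein FASTA file.

    ASSUMPTION: the flag names should be verified with `sourmash compute --help`;
    `ksizes` and `scaled` are placeholder values, not the ones used in this notebook.
    """
    return [
        "sourmash", "compute",
        "-k", ",".join(str(k) for k in ksizes),
        "--scaled", str(scaled),
        "--protein", "--no-dna",   # ask for protein signatures, not nucleotide ones
        "--input-is-protein",      # input is already amino acids, so no translation
        "--track-abundance",
        "-o", sig_path,
        fasta_path,
    ]


command = protein_compute_command(
    "2018-test_datasets/denticola/protein/GCA_000217655.1_ASM21765v1_protein.faa.gz",
    "sigs/denticola/proteomic/GCA_000217655.1_ASM21765v1_protein.sig",
)
print(" ".join(shlex.quote(part) for part in command))
```

Printing the command first makes it easy to paste into a `!` cell and inspect the log for the signature type before looping over every file.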



== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading sigs/bacteroides/proteomic/GCA_000011065.1_ASM1106v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000157015.1_ASM15701v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000159855.2_Bact_sp_3_2_5_V2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000163055.2_ASM16305v2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000224595.1_Prevotella_sp_C561_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000273215.1_Bact_ovat_CL03T12C18_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000373705.1_ASM37370v1_protein.sig
Error in parsing signature; quitting.
Exception: parse error: premature EOF
                                       
                     (right here) ------^

loading sigs/bacteroides/proteomic/GCA_000403155.2_Bact_thet_dnLKV9_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_
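
The "premature EOF" parse error above suggests that at least one `.sig` file is empty or cut short, plausibly one left behind by a `sourmash compute` run that ended in a traceback. Below is a minimal sketch for finding such files. It assumes signatures are stored as JSON, and the helper name `find_unreadable_sigs` is made up for illustration.

```python
import glob
import json
import os


def find_unreadable_sigs(pattern):
    """Return (path, reason) pairs for signature files that are empty or not valid JSON."""
    problems = []
    for path in sorted(glob.glob(pattern)):
        if os.path.getsize(path) == 0:
            problems.append((path, "empty file"))
            continue
        try:
            with open(path) as handle:
                json.load(handle)
        except ValueError as error:  # json.JSONDecodeError is a subclass of ValueError
            problems.append((path, f"invalid JSON: {error}"))
    return problems


for path, reason in find_unreadable_sigs("sigs/bacteroides/proteomic/*.sig"):
    print(path, reason)
```

Files reported here can be deleted or recomputed so that `sourmash compare` can load the whole directory.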

We are going to make a turducken-style (nested) for loop:

for space = genomic, proteomic
for organism = bacteroides, gingivalis, denticola
for k = 31, 21, 51
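The nested loop outlined above can also be expressed in Python with `itertools.product`, which enumerates every (space, organism, k) combination without three levels of indentation. This is only a sketch under assumptions: `compare_commands` is a hypothetical helper, the directory layout is copied from the commands used elsewhere in this notebook, and the function only builds command strings rather than running them.

```python
from itertools import product

SPACES = ["genomic", "proteomic"]
ORGANISMS = ["bacteroides", "gingivalis", "denticola"]
KSIZES = [31, 21, 51]


def compare_commands(spaces=SPACES, organisms=ORGANISMS, ksizes=KSIZES):
    """Build one `sourmash compare` command string per (space, organism, k) combination."""
    commands = []
    for space, organism, k in product(spaces, organisms, ksizes):
        out = f"sourmash_compare/{organism}_k{k}_{space}_comp"
        commands.append(
            f"sourmash compare -k {k} -o {out} sigs/{organism}/{space}/*.sig"
        )
    return commands


# Print the commands so they can be checked before anything is executed.
for command in compare_commands():
    print(command)
```

Printing the commands first makes it easy to spot a wrong path or k value before launching the full set of comparisons.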



In [91]:
%%bash


# Sketch (not run yet): loop over sequence space, organism, and k-mer size,
# then compare, plot, and export a CSV for each combination.
#for mol in genomic proteomic; do
#    for org in bacteroides gingivalis denticola; do
#        for k in 31 21 51; do
#            sourmash compare -k ${k} -o sourmash_compare/${org}_k${k}_${mol}_comp sigs/${org}/${mol}/*.sig
#            sourmash plot --labels --pdf sourmash_compare/${org}_k${k}_${mol}_comp
#            sourmash compare -k ${k} --csv sourmash_compare/compare_CSVs/${org}_k${k}_${mol}_comp.csv sigs/${org}/${mol}/*.sig
#        done
#    done
#done
#mv *.pdf /mnt/c/Users/user/Desktop/CheckPlotPDFs/
#mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/


echo hello

hello


In [97]:
%%bash
# Nucleotide signatures: k = 21, 31, 51
for k in 21 31 51
do
    for org in bacteroides gingivalis denticola
    do
        for mol in genomic
        do
            sourmash compare -k ${k} -o sourmash_compare/${org}_k${k}_${mol}_comp sigs/${org}/${mol}/*.sig
            sourmash plot --labels --pdf sourmash_compare/${org}_k${k}_${mol}_comp
            sourmash compare -k ${k} --csv sourmash_compare/compare_CSVs/${org}_k${k}_${mol}_comp.csv sigs/${org}/${mol}/*.sig
        done
    done
done

# Protein signatures: k = 7, 9, 12, 21
for k in 7 9 12 21
do
    for org in bacteroides gingivalis denticola
    do
        for mol in proteomic
        do
            sourmash compare -k ${k} -o sourmash_compare/${org}_k${k}_${mol}_comp sigs/${org}/${mol}/*.sig
            sourmash plot --labels --pdf sourmash_compare/${org}_k${k}_${mol}_comp
            sourmash compare -k ${k} --csv sourmash_compare/compare_CSVs/${org}_k${k}_${mol}_comp.csv sigs/${org}/${mol}/*.sig
        done
    done
done

mv *.pdf /mnt/c/Users/user/Desktop/CheckPlotPDFs/
mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/

min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000
min similarity in matrix: 0.000


== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading sigs/bacteroides/genomic/GCA_000011065.1_ASM1106v1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000157015.1_ASM15701v1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000159855.2_Bact_sp_3_2_5_V2_genomic.sig
loading sigs/bacteroides/genomic/GCA_000163055.2_ASM16305v2_genomic.sig
loading sigs/bacteroides/genomic/GCA_000224595.1_Prevotella_sp_C561_V1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000273215.1_Bact_ovat_CL03T12C18_V1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000373705.1_ASM37370v1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000403155.2_Bact_thet_dnLKV9_V1_genomic.sig
loading sigs/bacteroides/genomic/GCA_000431275.1_MGS755_genomic.sig
loading sigs/bacteroides/genomic/GCA_000432495.1_MGS558_genomic.sig
loading sigs/bacteroides/genomic/GCA_000466425.1_ASM46642v1_genomic.sig
loading sigs/bac

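Every matrix above reports a minimum similarity of 0.000, which does not say which pair of sequences is responsible. One way to check is to read a CSV written by `sourmash compare --csv` and look for the lowest off-diagonal entry. The sketch below uses only the standard library; `load_compare_csv` and `least_similar_pair` are hypothetical helpers, and the code assumes the CSV holds one header row of labels followed by one row of similarities per label, with no index column.

```python
import csv


def load_compare_csv(path):
    """Read a similarity matrix CSV: one header row of labels, then one row per label."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    labels = rows[0]
    matrix = [[float(value) for value in row] for row in rows[1:]]
    return labels, matrix


def least_similar_pair(labels, matrix):
    """Return (similarity, label_a, label_b) for the lowest off-diagonal entry."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if best is None or matrix[i][j] < best[0]:
                best = (matrix[i][j], labels[i], labels[j])
    return best
```

Calling `least_similar_pair(*load_compare_csv(path))` on one of the exported files, wherever it was moved to, would name the two sequences behind the reported minimum.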
In [3]:
%%bash
for k in 7  # also tried: 9 12
do
    for org in bacteroides  # also: gingivalis denticola
    do
        for mol in proteomic
        do
            sourmash compare -k ${k} -o sourmash_compare/${org}_k${k}_${mol}_comp sigs/${org}/${mol}/*.sig
            # plot the matrix computed just above
            sourmash plot --labels --pdf sourmash_compare/${org}_k${k}_${mol}_comp
            sourmash compare -k ${k} --csv sourmash_compare/compare_CSVs/${org}_k${k}_${mol}_comp.csv sigs/${org}/${mol}/*.sig
        done
    done
done
mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/Test

== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading sigs/bacteroides/proteomic/GCA_000011065.1_ASM1106v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000157015.1_ASM15701v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000159855.2_Bact_sp_3_2_5_V2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000163055.2_ASM16305v2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000224595.1_Prevotella_sp_C561_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000273215.1_Bact_ovat_CL03T12C18_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000373705.1_ASM37370v1_protein.sig
Error in parsing signature; quitting.
Exception: parse error: premature EOF
                                       
                     (right here) ------^

loading sigs/bacteroides/proteomic/GCA_000403155.2_Bact_thet_dnLKV9_V1_protein.sig
loading sigs/

In [96]:
%%bash
for k in 7 9 12
do
    for org in bacteroides gingivalis denticola
    do
        for mol in proteomic
        do
            # plot the matrix for this organism, k, and molecule type
            sourmash plot --labels --pdf sourmash_compare/${org}_k${k}_${mol}_comp
        done
    done
done
mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/

#%mv *.pdf /mnt/c/Users/user/Desktop/CheckPlotPDFs/
#%mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/

mv: cannot stat '*.pdf': No such file or directory
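The `mv` error above appears when the `*.pdf` glob matches nothing, for example because plotting failed or the files were already moved. A small Python helper can make that case explicit instead of failing. This is a sketch only: `move_matching` is a hypothetical function, and the pattern and destination are whatever the calling cell supplies.

```python
import glob
import os
import shutil


def move_matching(pattern, destination):
    """Move files matching `pattern` into `destination`; return the new paths."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        # Nothing to move: report an empty result instead of raising an error.
        return []
    os.makedirs(destination, exist_ok=True)
    return [shutil.move(path, destination) for path in matches]
```

An empty return value then signals that the plotting step produced no PDFs, which is worth checking before moving on.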


In [12]:

# 7-mer
#Bacteroides, proteomic
!sourmash compare -k 7 -o sourmash_compare/bacteroides_k7_proteomic_comp sigs/bacteroides/proteomic/*.sig
#!sourmash plot --labels --pdf sourmash_compare/bacteroides_k7_proteomic_comp
#%mv *.pdf /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/Test

#!sourmash compare -k 7 --csv sourmash_compare/compare_CSVs/bacteroides_k7_proteomic_comp.csv sigs/bacteroides/proteomic/*.sig
#%mv sourmash_compare/compare_CSVs/*.csv /mnt/c/Users/user/Desktop/ROTATIONS/DIBlabRotationProject/CompCSVs/Test


== This is sourmash version 2.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading sigs/bacteroides/proteomic/GCA_000011065.1_ASM1106v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000157015.1_ASM15701v1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000159855.2_Bact_sp_3_2_5_V2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000163055.2_ASM16305v2_protein.sig
loading sigs/bacteroides/proteomic/GCA_000224595.1_Prevotella_sp_C561_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000273215.1_Bact_ovat_CL03T12C18_V1_protein.sig
loading sigs/bacteroides/proteomic/GCA_000373705.1_ASM37370v1_protein.sig
Error in parsing signature; quitting.
Exception: parse error: premature EOF
                                       
                     (right here) ------^

loading sigs/bacteroides/proteomic/GCA_000403155.2_Bact_thet_dnLKV9_V1_protein.sig


In [11]:
%ls sourmash_compare/bac


bacteroides_k21_genomic_comp
bacteroides_k21_genomic_comp.labels.txt
bacteroides_k21_proteomic_comp
bacteroides_k21_proteomic_comp.labels.txt
bacteroides_k31_genomic_comp
bacteroides_k31_genomic_comp.csv
bacteroides_k31_genomic_comp.labels.txt
bacteroides_k31_proteomic_comp
bacteroides_k31_proteomic_comp.labels.txt
bacteroides_k51_genomic_comp
bacteroides_k51_genomic_comp.labels.txt
bacteroides_k51_proteomic_comp
bacteroides_k51_proteomic_comp.labels.txt
compare_CSVs/
denticola_k21_genomic_comp
denticola_k21_genomic_comp.labels.txt
denticola_k21_proteomic_comp
denticola_k21_proteomic_comp.labels.txt
denticola_k31_genomic_comp
denticola_k31_genomic_comp.labels.txt
denticola_k31_proteomic_comp
denticola_k31_proteomic_comp.labels.txt
denticola_k51_genomic_comp
denticola_k51_genomic_comp.labels.txt
denticola_k51_proteomic_comp
denticola_k51_proteomic_comp.labels.txt
gingivalis_k21_genomic_comp
gingivalis_k21_genomic_comp.labels.txt
gingivalis_k21