## Datasets that are as clean as possible



### Test data - Amphimedon and Nematostella  

I'm going to use a sponge and a cnido to start with because they are the kinds of organisms I want this program to be useful for (and also they are cool). They also have decent(?) genomes, or at least well-used ones, so maybe they can help me get an idea of how "clean" I can get an assembly to be.  

I downloaded a dataset for each one from the ENA. The Amphimedon is JADHEG01 and the Nematostella is ABAV01.  

### Kraken  

I've never used kraken (or kraken2 at this point) before, so this is kind of a slow process. I found the documentation (https://github.com/DerrickWood/kraken2/wiki/Manual#special-databases), which seems to be organized in a non-intuitive way (so that's fun), and I think it has at least most of what I need. Hopefully I can get the rest from the help page and this tutorial I found: https://bioinformaticsworkbook.org/dataAnalysis/Metagenomics/Kraken.html#gsc.tab=0.  

I looked at a kraken job that Rebecca submitted (presumably sometime in the last six months) and tried to copy some of her settings whenever possible. She used a database that was already downloaded, and I think it is just the standard kraken one, so I'm going to try to use that too. If I get an error that it is not available, I'll figure out what to do next. The settings and command I tried are below.  

- CPU time: short  
- Memory: 8Gb  
- Type of PE: multi-thread  
- Number of CPUs: 12  
- Job name: amph_kraken

`kraken2 --db /data/genomics/db/Kraken/kraken2_db/ --unclassified-out amph_unclassified --report amph_report_kraken --threads $NSLOTS  Amphimedon.fasta`  

After I saved the job script to my computer and transferred it onto Hydra, I changed the module that it loads. The only one you can access in the QSub Generation Utility is the oldest installation, so I just added the version directory to the module load line manually.

Ok, I finally got this running, and updated the kraken script above to reflect the one that functions.  

I also separated the std error, because I think it prints a lot of results there, and because it didn't seem to be able to find my input files when I used > to try to capture the output in a different file. (At the time I didn't see a way to specify a main output file, but kraken2 does have an `--output` option for the per-sequence results.)  

After I got results back I ran it on the Nematostella genome also.

### Kraken output  

**Amphimedon**  
1422 sequences classified (36.73%)  
2449 sequences unclassified (63.27%)  

**Nematostella**  
15862 sequences classified (26.82%)  
43287 sequences unclassified (73.18%)  

Since the database contains just viruses, bacteria, and human sequences, it is reassuring that most of the sequences are unclassified. Also, lots of the human hits are probably misclassified, and if you were dealing with a more specific database, many of them would likely hit closer to the mark.

### Investigating kmers  

In the log file (where kraken dumps all the actual results), you can get the specific info for each contig (or sequence, or whatever you put in it). The first letter just stands for classified or unclassified, then they give the name of the sequence/contig, then what category it was assigned to (unclassified, human, bacteria species, etc.) with its taxon id, the length of the sequence/contig, and then how it got divided up in terms of kmers.  
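The field layout described above can be sketched as a small parser. This is only a sketch under assumptions: it expects five tab-separated fields with the taxon written as `name (taxid N)`, as in the example lines in this post, and the function and key names are made up for illustration.

```python
import re

# Assumed layout, based on the example lines in this post:
# status, sequence ID, "name (taxid N)", length, k-mer breakdown
TAXON_RE = re.compile(r"^(?P<name>.*) \(taxid (?P<taxid>\d+)\)$")


def parse_kraken_line(line):
    """Split one per-sequence Kraken 2 output line into a dict."""
    status, seq_id, taxon, length, kmers = line.rstrip("\n").split("\t")
    match = TAXON_RE.match(taxon)
    if match is None:
        raise ValueError(f"unexpected taxon field: {taxon!r}")
    # Each k-mer token is "taxid:count". The taxid can also be a letter
    # code such as "A" (ambiguous nucleotides), so it is kept as a string.
    runs = []
    for token in kmers.split():
        taxid, count = token.split(":")
        runs.append((taxid, int(count)))
    return {
        "classified": status == "C",
        "seq_id": seq_id,
        "taxon_name": match.group("name"),
        "taxid": int(match.group("taxid")),
        "length": int(length),
        "kmer_runs": runs,
    }


example = (
    "C\tENA|ABAV01000001|ABAV01000001.1\tHomo sapiens (taxid 9606)"
    "\t20022\t0:3438 9606:8 0:16542"
)
record = parse_kraken_line(example)
print(record["taxon_name"], record["taxid"], record["kmer_runs"])
```

Keeping the k-mer runs in their original order preserves where along the contig each hit falls, which matters when a handful of hits sit in one small stretch.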

For example, take `C	ENA|ABAV01000001|ABAV01000001.1	Homo sapiens (taxid 9606)	20022	0:3438 9606:8 0:16542`. This is saying that there were 3438 kmers that did not have a match in the database, then 8 kmers that matched human (9606), and then 16542 kmers that didn't have a match again. Somehow, this is still being classified as human, which is pretty wild. Just anecdotally, looking through the log, lots of the contigs classified as human look like this. I don't really understand whether there is a threshold or something, because others remain unclassified even when kraken finds some kmers that match human sequences. Like this one: `U	ENA|ABAV01000008|ABAV01000008.1	unclassified (taxid 0)	4802	0:508 9606:5 0:4255`. It has five kmers that match human, but it is unclassified while the one above is classified, even though those five kmers are a higher percentage of the whole sequence.  
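To put numbers on that comparison, here is a rough sketch that sums the k-mer counts for the two example contigs and reports the share assigned to human. The helper names are hypothetical, and this simple fraction is not Kraken's own scoring. Kraken 2 does have `--confidence` and `--minimum-hit-groups` options that may be behind the difference, but this sketch does not try to reproduce them.

```python
def kmer_totals(kmer_field):
    """Sum k-mer counts per taxid from a field like '0:3438 9606:8 0:16542'."""
    totals = {}
    for token in kmer_field.split():
        taxid, count = token.split(":")
        totals[taxid] = totals.get(taxid, 0) + int(count)
    return totals


def taxon_fraction(kmer_field, taxid):
    """Fraction of all k-mers in the sequence that were assigned to `taxid`."""
    totals = kmer_totals(kmer_field)
    n_kmers = sum(totals.values())
    return totals.get(str(taxid), 0) / n_kmers if n_kmers else 0.0


# (name, length, k-mer field) copied from the two example lines above
classified = ("ABAV01000001.1", 20022, "0:3438 9606:8 0:16542")
unclassified = ("ABAV01000008.1", 4802, "0:508 9606:5 0:4255")

for name, length, field in (classified, unclassified):
    n_kmers = sum(kmer_totals(field).values())
    # length - n_kmers should equal k - 1 when every k-mer is accounted for
    print(name, n_kmers, length - n_kmers, f"{taxon_fraction(field, 9606):.4%}")
```

In both cases the k-mer counts add up to the contig length minus 34, which is what you would expect with Kraken 2's default k of 35. The human share is roughly 0.04% for the classified contig and 0.10% for the unclassified one, so a simple proportion cannot be what separates them.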

I decided to sequester all of the contigs that were classified as bacterial and viral in their own file.  

`grep -v "unclassified\|Homo sapiens" nema_kraken.log > nema_contaminants.tsv`  

That way I can look at these in isolation if I decide to.  
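One caveat with `grep -v` is that it keeps every line that lacks the two phrases, including anything in the log that is not a kraken record (scheduler banner lines, for instance). A stricter version of the same filter could look like the sketch below. It assumes five tab-separated fields per record, and the function name and sample lines are illustrative only.

```python
def keep_microbial(lines, excluded=("unclassified", "Homo sapiens")):
    """Yield classified records whose taxon field names none of `excluded`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Real records have five tab-separated fields and start with C or U;
        # anything else (for example scheduler banners) is skipped.
        if len(fields) != 5 or fields[0] not in ("C", "U"):
            continue
        if fields[0] == "C" and not any(word in fields[2] for word in excluded):
            yield line


# Illustrative sample: three records plus one hypothetical non-record line
sample = [
    "= hypothetical job banner line\n",
    "C\tENA|ABAV01000001|ABAV01000001.1\tHomo sapiens (taxid 9606)\t20022\t0:3438 9606:8 0:16542\n",
    "U\tENA|ABAV01000008|ABAV01000008.1\tunclassified (taxid 0)\t4802\t0:508 9606:5 0:4255\n",
    "C\tENA|ABAV01000310|ABAV01000310.1\tcellular organisms (taxid 131567)\t10228\tA:1 0:1669 9606:2 0:401 386415:2 0:8119\n",
]
kept = list(keep_microbial(sample))
print(len(kept), kept[0].split("\t")[2])

# To run it on the real files:
# with open("nema_kraken.log") as src, open("nema_contaminants.tsv", "w") as dst:
#     dst.writelines(keep_microbial(src))
```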

`import pandas as pd`

`nema = pd.read_csv("nema_kraken.log")`

`nema.head()`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 0 | C | ENA_ABAV01000001_ABAV01000001.1 | Homo sapiens (taxid 9606) | 20022.0 | 0:3438 9606:8 0:16542 |
| 1 | U | ENA_ABAV01000002_ABAV01000002.1 | unclassified (taxid 0) | 10536.0 | 0:10502 |
| 2 | U | ENA_ABAV01000003_ABAV01000003.1 | unclassified (taxid 0) | 904.0 | 0:870 |
| 3 | U | ENA_ABAV01000004_ABAV01000004.1 | unclassified (taxid 0) | 4548.0 | 0:4514 |
| 4 | C | ENA_ABAV01000005_ABAV01000005.1 | Homo sapiens (taxid 9606) | 49951.0 | 0:11978 9606:2 0:1452 9606:1 0:902 9606:5 0:84... |


`nema[nema.length <= 5000].head()`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 2 | U | ENA_ABAV01000003_ABAV01000003.1 | unclassified (taxid 0) | 904.0 | 0:870 |
| 3 | U | ENA_ABAV01000004_ABAV01000004.1 | unclassified (taxid 0) | 4548.0 | 0:4514 |
| 6 | U | ENA_ABAV01000007_ABAV01000007.1 | unclassified (taxid 0) | 2191.0 | 0:2157 |
| 7 | U | ENA_ABAV01000008_ABAV01000008.1 | unclassified (taxid 0) | 4802.0 | 0:508 9606:5 0:4255 |
| 9 | U | ENA_ABAV01000010_ABAV01000010.1 | unclassified (taxid 0) | 972.0 | 0:938 |


This is showing me the first few contigs that are 5000 bp or shorter, but they are still in file order. I'm going to try sorting by length, so I can see the smallest ones first.  

`sorted_nema = nema.sort_values(by = ["length"])`

`sorted_nema.head(n = 20)`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 9471 | U | ENA_ABAV01009472_ABAV01009472.1 | unclassified (taxid 0) | 12.0 | 0:0 |
| 9469 | U | ENA_ABAV01009470_ABAV01009470.1 | unclassified (taxid 0) | 18.0 | 0:0 |
| 9468 | U | ENA_ABAV01009469_ABAV01009469.1 | unclassified (taxid 0) | 21.0 | 0:0 |
| 9472 | U | ENA_ABAV01009473_ABAV01009473.1 | unclassified (taxid 0) | 28.0 | 0:0 |
| 9475 | U | ENA_ABAV01009476_ABAV01009476.1 | unclassified (taxid 0) | 29.0 | 0:0 |
| 9478 | U | ENA_ABAV01009479_ABAV01009479.1 | unclassified (taxid 0) | 30.0 | 0:0 |
| 9466 | U | ENA_ABAV01009467_ABAV01009467.1 | unclassified (taxid 0) | 30.0 | 0:0 |
| 9465 | U | ENA_ABAV01009466_ABAV01009466.1 | unclassified (taxid 0) | 30.0 | 0:0 |
| 9470 | U | ENA_ABAV01009471_ABAV01009471.1 | unclassified (taxid 0) | 31.0 | 0:0 |
| 9463 | U | ENA_ABAV01009464_ABAV01009464.1 | unclassified (taxid 0) | 32.0 | 0:0 |


`sorted_nema.iloc[200:205]`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 48053 | U | ENA_ABAV01048054_ABAV01048054.1 | unclassified (taxid 0) | 404.0 | 0:370 |
| 55835 | U | ENA_ABAV01055836_ABAV01055836.1 | unclassified (taxid 0) | 404.0 | 0:370 |
| 41353 | U | ENA_ABAV01041354_ABAV01041354.1 | unclassified (taxid 0) | 404.0 | 0:370 |
| 18732 | U | ENA_ABAV01018733_ABAV01018733.1 | unclassified (taxid 0) | 404.0 | 0:370 |
| 52029 | C | ENA_ABAV01052030_ABAV01052030.1 | Homo sapiens (taxid 9606) | 405.0 | 0:313 9606:4 0:1 9606:5 0:48 |


The smallest contigs look like they are either unclassified or barely getting classified as human. I want to check out the bacteria and virus contigs specifically. I used the same grep line that I previously used on Hydra, but this time on the slightly modified csv I made of the Nematostella log file above.

`micros = pd.read_csv("nema_micros.csv")`

`micros.head()`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 0 | C | ENA_ABAV01000148_ABAV01000148.1 | Pandoraea oxalativorans (taxid 573737) | 12303.0 | 0:290 573737:1 0:994 9606:4 0:112 9606:4 13156... |
| 1 | C | ENA_ABAV01000310_ABAV01000310.1 | cellular organisms (taxid 131567) | 10228.0 | A:1 0:1669 9606:2 0:401 386415:2 0:8119 |
| 2 | C | ENA_ABAV01000479_ABAV01000479.1 | Streptomyces sp. TLI_053 (taxid 1855352) | 23069.0 | 0:721 1855352:5 0:155 1855352:5 0:27 1855352:5... |
| 3 | C | ENA_ABAV01000507_ABAV01000507.1 | Streptococcus dysgalactiae subsp. equisimilis ... | 3767.0 | 0:773 119602:5 0:1609 9606:4 0:1342 |
| 4 | C | ENA_ABAV01000535_ABAV01000535.1 | Rhodobacteraceae bacterium (taxid 1904441) | 22606.0 | 0:5819 1261031:2 0:10583 1904441:5 0:6163 |


`sorted_micros = micros.sort_values(by = ["length"])`

`sorted_micros.head(n = 20)`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 355 | C | ENA_ABAV01015544_ABAV01015544.1 | other sequences (taxid 28384) | 280.0 | 28384:25 1:158 0:3 1:5 0:55 |
| 1442 | C | ENA_ABAV01048353_ABAV01048353.1 | Pseudomonas sp. CC6-YY-74 (taxid 1930532) | 444.0 | 0:66 135621:12 1534110:5 286:3 1534110:3 0:62 ... |
| 632 | C | ENA_ABAV01023546_ABAV01023546.1 | Clostridium botulinum (taxid 1491) | 461.0 | 0:22 526969:5 131567:49 0:10 131567:1 0:29 131... |
| 1250 | C | ENA_ABAV01043368_ABAV01043368.1 | Clostridium baratii str. Sullivan (taxid 1415775) | 521.0 | 0:103 262:1 0:30 1415775:2 0:351 |
| 1866 | C | ENA_ABAV01056593_ABAV01056593.1 | Actinoalloteichus sp. AHMU CJ021 (taxid 2072503) | 537.0 | 0:342 2072503:5 2:5 2072503:2 2:5 2072503:2 2:... |
| 1329 | C | ENA_ABAV01045675_ABAV01045675.1 | Tenacibaculum todarodis (taxid 1850252) | 546.0 | 0:191 1850252:4 0:8 1312072:2 0:14 1850252:15 ... |
| 1632 | C | ENA_ABAV01051907_ABAV01051907.1 | Clostridium botulinum (taxid 1491) | 550.0 | 0:11 1491:5 0:64 9606:4 0:69 1491:5 0:3 1491:2... |
| 130 | C | ENA_ABAV01005445_ABAV01005445.1 | other sequences (taxid 28384) | 574.0 | 28384:29 1:3 0:508 |
| 1862 | C | ENA_ABAV01056569_ABAV01056569.1 | Clostridium botulinum (taxid 1491) | 589.0 | 0:303 1491:4 131567:3 0:5 1491:3 0:35 2057025:... |
| 1863 | C | ENA_ABAV01056570_ABAV01056570.1 | Bacillus cereus m1293 (taxid 526973) | 599.0 | 0:53 A:34 9606:2 0:45 9606:7 0:4 9606:17 0:22 ... |


`sorted_micros.tail()`

| Unnamed: 0 | class_unclass | contig_name | taxon | length | kmers |
| --- | --- | --- | --- | --- | --- |
| 62 | C | ENA_ABAV01002553_ABAV01002553.1 | Proteobacteria (taxid 1224) | 63096.0 | 0:17121 1620215:1 0:4106 1002804:1 0:41833 |
| 264 | C | ENA_ABAV01012069_ABAV01012069.1 | Burkholderiales (taxid 80840) | 77005.0 | 0:2551 9606:6 0:4881 9606:4 0:10216 9606:45 0:... |
| 112 | C | ENA_ABAV01004701_ABAV01004701.1 | Edwardsiella (taxid 635) | 79998.0 | 0:1716 9606:2 0:1973 9606:1 0:30 131567:2 0:94... |
| 323 | C | ENA_ABAV01014257_ABAV01014257.1 | Edwardsiella (taxid 635) | 108723.0 | 0:12075 9606:10 0:1626 9606:2 0:2 9606:5 0:206... |
| 2119 | = Thu Aug 26 10:44:19 EDT 2021 job nema_kraken... | | | | |


It looks like kraken is classifying these contigs based on a very small number of kmers in many if not all cases. This might be because it's really designed to work with short (Illumina) reads, not with the longer sequences that come from long-read tech or assembly. So I think my next move (after the new, correct database is up and running) is to map the reads for these assemblies onto the assembled contigs, and run kraken on all the reads that map. Then I'll be able to compare that output to this one (redone with the same database) to see how comparable they are. It could be that a read would get classified really well on its own, but its classification looks ridiculous in the middle of a huge contig. Some contaminating reads could definitely have made it into the assembly itself, so it would be good to know how they behave.
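The planned comparison could be tallied along these lines. This is a hypothetical sketch: it assumes the read-to-contig assignments have already been pulled out of an alignment and that a second kraken run has given each read a taxid. Every name and all of the toy data below are made up. It only counts exact taxid matches, so a read called at species level would not count as agreeing with a contig called at genus level.

```python
from collections import Counter, defaultdict


def read_agreement(contig_taxid, read_to_contig, read_taxid):
    """For each contig, report how many mapped reads share its taxid."""
    per_contig = defaultdict(Counter)
    for read, contig in read_to_contig.items():
        # Reads with no classification are counted as taxid 0 (unclassified).
        per_contig[contig][read_taxid.get(read, 0)] += 1
    summary = {}
    for contig, counts in per_contig.items():
        total = sum(counts.values())
        agree = counts[contig_taxid.get(contig, 0)]
        summary[contig] = {"reads": total, "agree": agree, "fraction": agree / total}
    return summary


# Toy inputs, invented for illustration only
contig_taxid = {"contig_1": 9606, "contig_2": 1491}
read_to_contig = {"r1": "contig_1", "r2": "contig_1", "r3": "contig_2", "r4": "contig_2"}
read_taxid = {"r1": 0, "r2": 9606, "r3": 1491, "r4": 1491}

summary = read_agreement(contig_taxid, read_to_contig, read_taxid)
for contig, stats in sorted(summary.items()):
    print(contig, stats)
```

A contig whose reads mostly come back unclassified while the contig itself carries a label would be a sign that the contig-level call rests on very little evidence.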