# Metagenomics

## 1. Setting Up/Visualizing the Files

Installed KRAKEN and Krona, and downloaded the data files to `qbb2020-answers/week13-assn10`.

What are we looking at?

In [2]:
! head week13_data/KRAKEN/SRR492183.kraken

SRR492183.7153	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
SRR492183.7155	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis;Enterococcus faecalis V583
SRR492183.7156	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
SRR492183.7157	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
SRR492183.7161	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis
SRR492183.7163	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis
SRR492183.7164	root;cellular organi

In [3]:
! head week13_data/READS/SRR492183_1.fastq

@SRR492183.2 HWI-ST330_0096:2:2:2220:2228/1
TTCTGTGTCACAATTTAAAATTGTTCTTCATTACCATTGCAAAGATATGTTTGAAAGCAAGAATTTTACTTC
+
>6<63:1.2703/69FB@F;CDA>CBEEEEEBBD<D=DDA984;:<6@A:CAECE@>22@D?D7;>2<A@9@
@SRR492183.6 HWI-ST330_0096:2:2:4535:2234/1
ACGTTCAATAATCCCTAGAACCCAATCTGTGATAATCGCCATAAAGGCCGTTGGTAAGGCACCGACTAAAATGCTAGAAGCACCATCTG
+
>@9>@ABBB>GGDGGFF>F@DDDDDBC@<@@AA3AB:DCBBEC6>DCA@D@AEE??5DBBA?FF:B.AA*;9B,519+7=;>0BC>>>C
@SRR492183.7 HWI-ST330_0096:2:2:4988:2228/1
TCATCCGATCCACTTCTTCAAATGAACGACAGCCCAAGTAGCGAAAAGCAGAAAGAGCGATGTCGTGATACACGTCATCGCTCGTTTTACTT


Use Krona-tools to visualize/parse the KRAKEN data:

In [35]:
#Krona needs to have each level separated by tabs, luckily this is not so complicated

def KRAKENparser(fname, outfile):
    file_dict = {}  # full taxonomy string -> number of reads assigned to it

    with open(fname, 'r') as f:
        for line in f:  # reads line by line
            line = line.strip('\n')
            x = line.split('\t')[1]  # just take second tabbed value
            # we just want to count the identical taxonomies and can deal with parsing it after
            file_dict.setdefault(x, 0)  # start the count at 0
            file_dict[x] += 1  # keep the huge mess as the key

    # now time to parse the keys and re-write the file at the same time!
    with open(outfile, 'w') as fw:
        for key, item in file_dict.items():
            # writing into the new file here instead of adding to dictionary and then re-parsing
            fw.write(str(item) + '\t')
            for y in key.split(';'):
                fw.write(y + '\t')
            fw.write('\n')


In [36]:
filenumbers = ['83', '86', '88', '89', '90', '93', '94', '97']

for i,j in enumerate(filenumbers):
    KRAKENparser(('week13_data/KRAKEN/SRR4921' + j + '.kraken'), (str(i) + '.kraken'))

Now we can use Krona-tools to visualize the files:

    ktImportText *kraken

output file: `text.krona.html`

*This file is really cool! Screenshot saved to folder.*

It looks like Enterococcus faecalis was the most abundant species overall, with significant amounts (~15-20% each) of Bacillales such as Staphylococcus species, and of Cutibacterium avidum (which is an actinobacterium rather than a member of Bacillales). These species aren't surprising considering the known microbiota of living humans. One strange finding is that the levels of Bacillales and especially Cutibacterium avidum seemed high at birth, dropped or disappeared a day later, and then reappeared. This makes me a bit suspicious of the quantitative sensitivity of these assays/sequences, or of my analysis of the data.

## 2. Binning

***Question 2:** What metrics in the contigs can we use to group the genomes together?*

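We can group contigs by their k-mer composition (for example tetranucleotide frequencies, or simply GC content) and by their read coverage: contigs from the same genome should have a similar sequence composition and a similar coverage across samples. KRAKEN also relies on k-mers, but it uses them to assign taxonomy rather than to bin contigs.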
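To make the composition idea concrete, here is a minimal sketch that computes GC content and a tetranucleotide frequency profile for one contig. The function names `gc_content` and `kmer_frequencies` are made up for illustration, and the example sequence is the first line of `bins.1.fa` shown later in this notebook. It is a simplification: real binners usually also merge reverse-complement k-mers and combine the profile with coverage.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G/C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible A/C/G/T k-mer (k=4: 256 tetranucleotides)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)  # ignores k-mers containing N etc.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# first sequence line of bins.1.fa, used here only as example input
contig = "TTATTATTATTAGTAGTAGTAGTAGTAGTCCTATTTCTATTATTGTTATGATTATTATTC"
profile = kmer_frequencies(contig)
print("GC content:", round(gc_content(contig), 3))
print("most common tetramers:", sorted(profile, key=profile.get, reverse=True)[:3])
```

Two contigs whose profiles are close (for instance by Euclidean distance) and whose coverage is similar are good candidates for the same bin.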

And now to use `BWA tools` and `metabat`:

    bwa index -p assembly week13_data/assembly.fasta 
    
    bash bwa-to-sam.sh
        
    metabat2 -i week13_data/assembly.fasta -o week13_data/bins

***Question 3***

**(A)** Metabat2 made 7 bins. I was kind of expecting 3 bins, based on the taxonomic data (for Firmicutes, Actinobacteria, and Bacillales).

**(B)** Prokaryotic genomes are typically a few million bp long (roughly 1-10 million bp), and most of these bins found sequences in the hundreds of thousands of bp, with one bin much shorter and another bin much longer. That's reasonable (I think?) to expect, as most bins then cover on the order of 10% of a whole prokaryotic genome, while the longest one could be a large part of a small genome. Here are the lengths of the bin sequences:

    124 691
    498 518
    116 726
    455 101
    269 228
    35 870
    1 447 137
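The first two figures match the length in the first contig header of `bins.1.fa` and `bins.2.fa` (`NODE_27_length_124691...`, `NODE_3_length_498518...`), so they may be single-contig lengths rather than bin totals. If a bin holds several contigs, its size is the sum over all of them. A minimal sketch of that sum, with a hypothetical helper `fasta_lengths` and a made-up toy bin as input:

```python
def fasta_lengths(lines):
    """Return {contig name: sequence length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header line: start a new contig
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequence line: add to the current contig
    return lengths


# toy two-contig "bin" with made-up sequences, just to show the idea
toy_bin = [">contig_a", "ACGTACGT", "ACG", ">contig_b", "TTTT"]
lengths = fasta_lengths(toy_bin)
print(lengths, "total:", sum(lengths.values()))
```

For a real bin, pass an open file instead of the list, e.g. `with open('week13_data/bins.1.fa') as fh: sum(fasta_lengths(fh).values())`.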

**(C)** The assembly is 38 070 686 nucleotides long, and the 7 bins together cover around 3 000 000 nucleotides, so around 8% of the total.
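As a quick check of that arithmetic, using only the numbers reported above (the variable names are just for illustration):

```python
# bin lengths and assembly length as reported in this notebook
bin_lengths = [124_691, 498_518, 116_726, 455_101, 269_228, 35_870, 1_447_137]
assembly_length = 38_070_686

binned = sum(bin_lengths)
fraction = binned / assembly_length
print(f"{binned:,} bp binned = {fraction:.1%} of the assembly")
# prints: 2,947,271 bp binned = 7.7% of the assembly
```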

**(D)** Use BLAST or a similar alignment tool to check for overlaps with known organisms

In [41]:
! head week13_data/bins*

==> week13_data/bins.1.fa <==
>NODE_27_length_124691_cov_12.297314
TTATTATTATTAGTAGTAGTAGTAGTAGTCCTATTTCTATTATTGTTATGATTATTATTC
CGAGTTGTATCATTCACAGCATCCATTGCTTGACGTAATTGTCTTTGTGTATCTGATTCA
TCTTGGAAATAATCATCATCAATCAATTCATCATCATCATCTTCATTAACAGTACCAATA
TTAATATTATTACTATCGTCATCATCATCAATTCTCATAGAATCTTCAATCTCATCTATT
GTAGGATATTGAGTCTCTTGATTATTTTTTGCTGTGGATGTCACATTATCAATACCATCT
ATTTGTACTAATGTGTCACCTTCATCTATACGCTCTTCATTTGCTCGAATAATGGCTTCA
TTGATGAATAGTTTAATATATTCCGATGACAATTCTAATGTACTCAAGGTGATTCTGGTC
TGGGGATCTTTAAAAGATAATTCTTTGAATAGTCTAGCAATGGTTGTCACAGGAATAGAT
TGATTGATTAATTCATTATCATTCTCAGGATCATTATGGATTGGCATTATGTATAGATTG

==> week13_data/bins.2.fa <==
>NODE_3_length_498518_cov_181.760000
CTAACCTATAAAGTTGGCATCTCTTCTTAATTATACATCATTATTCACTTACTTTCAACA
ACTGCAAGAAAATCATCCCATATAAATTCCCTCTTAATTATACATCATTATTCACTTACT
TTTAAAAGACCTATCTTAATAGTCACTTTCTACCTGTTTTTAGCGTCACTTAATAAGCCT
ATAGGCTTATTTTCCTAAAAGCTAATTCTTTTATTCAGTTCTTGTATATTATTCAAGTGC
TAATATTCTATTTCTAAATCTTTTGAAATTTTTCATTCCAAATGTTATTCTCTT

## 3. Estimate the taxonomy of your putative genomes

In [51]:
! head week13_data/KRAKEN/assembly.kraken

NODE_2_length_556123_cov_2361.439230	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis;Enterococcus faecalis V583
NODE_11_length_278925_cov_118.155370	root;cellular organisms;Bacteria;Terrabacteria group;Actinobacteria;Actinobacteria;Propionibacteriales;Propionibacteriaceae;Cutibacterium;Cutibacterium avidum;Cutibacterium avidum 44067
NODE_12_length_269228_cov_106.168966	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus;Staphylococcus aureus subsp. aureus;Staphylococcus aureus subsp. aureus ST72;Staphylococcus aureus subsp. aureus CN1
NODE_54_length_87669_cov_104.325427	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus;Staphylococcus aureus subsp. aureus;Staphylococcus aureus subsp. aureus ST72;Staphylococcus aureus subsp.

In [97]:
ref_dict = {}  # this will be built from the assembly file

# create dictionary of reference sequences and their associated organisms
with open('week13_data/KRAKEN/assembly.kraken', 'r') as f:
    for line in f:
        line = line.strip('\n')
        x = line.split('\t')
        y = x[1].split(';')  # taxonomy levels as a list
        ref_dict[x[0]] = y

In [89]:
! grep NODE_14_length_235766_cov_39.967778 week13_data/KRAKEN/assembly.kraken bins/bin.1.fa

week13_data/KRAKEN/assembly.kraken:NODE_14_length_235766_cov_39.967778	root;cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus haemolyticus;Staphylococcus haemolyticus JCSC1435
bins/bin.1.fa:>NODE_14_length_235766_cov_39.967778


In [107]:
bin_dict = {}  # taxonomy string -> number of contigs in this bin

with open('bins/bin.1.fa', 'r') as b:
    for line in b:
        if not line.startswith('>'):
            continue  # sequence line, only the headers carry the contig name
        x = line.strip('\n')
        y = x.strip('>')    # it took me waaaaay too long to realize I needed to do this so the lines match the dictionary
        if y in ref_dict:
            print(y)
            taxonomy = ';'.join(ref_dict[y])  # ref_dict values are lists of levels
            bin_dict.setdefault(taxonomy, 0)  # start the count at 0
            bin_dict[taxonomy] += 1

NODE_14_length_235766_cov_39.967778
NODE_25_length_133218_cov_37.848772
NODE_40_length_102363_cov_33.084783
NODE_49_length_91030_cov_33.607782
NODE_56_length_87142_cov_32.060330
NODE_71_length_77289_cov_32.681487
NODE_86_length_70352_cov_44.155000
NODE_91_length_68362_cov_38.675831
NODE_99_length_64419_cov_40.928516
NODE_120_length_57577_cov_32.798512
NODE_124_length_56343_cov_37.768192
NODE_136_length_52968_cov_37.941073
NODE_137_length_52857_cov_44.120014
NODE_140_length_51835_cov_34.972229
NODE_143_length_50009_cov_39.357509
NODE_167_length_44906_cov_33.704934
NODE_171_length_44362_cov_45.369129
NODE_175_length_44007_cov_36.980456
NODE_209_length_37815_cov_39.232521
NODE_217_length_36550_cov_35.435265
NODE_220_length_36390_cov_40.348232
NODE_225_length_35671_cov_31.844677
NODE_226_length_35654_cov_39.199837
NODE_242_length_34065_cov_33.039047
NODE_294_length_28949_cov_37.337198
NODE_310_length_27719_cov_53.546450
NODE_314_length_27176_cov_44.629033
NODE_315_length_26897_cov_41.69007
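The contig names above follow the `NODE_<id>_length_<bp>_cov_<coverage>` pattern (the naming used by assemblers such as SPAdes), so length and coverage can be read straight from the name. A small sketch, with a hypothetical helper `parse_node` and three names copied from the list above:

```python
import re

# NODE_<id>_length_<bp>_cov_<coverage>
NODE_PATTERN = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_node(name):
    """Split a contig name into (node id, length in bp, coverage)."""
    match = NODE_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unexpected contig name: {name}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


# three names copied from the bin.1.fa listing above
names = [
    "NODE_14_length_235766_cov_39.967778",
    "NODE_25_length_133218_cov_37.848772",
    "NODE_310_length_27719_cov_53.546450",
]
parsed = [parse_node(n) for n in names]
total_length = sum(length for _, length, _ in parsed)
# length-weighted mean coverage of these contigs
mean_cov = sum(length * cov for _, length, cov in parsed) / total_length
print(total_length, round(mean_cov, 2))
```

The coverages in this bin all fall in a fairly narrow range (roughly 32-54), which is the kind of signal a binner can use to group contigs.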

In [46]:
# strategy: build a dictionary out of the assembly.kraken file, 
# then check the NODEs in each bin file for matches to the dictionary
# and if there's a match, tally the taxonomies

ref_dict = {} #this will be built from the assembly file

    
f = open('week13_data/KRAKEN/assembly.kraken', 'r')

for line in f:
    line = line.strip('\n')
    x = line.split('\t')
    y = x.split(';')
    print(x)
    print(y)
    
    for line in f: # reads line by line 
        line = line.strip('\n')
        x = line.split('\t')[1] # just take second tabbed value
            # we just want to count the identical taxonomies and can deal with parsing it after
        file_dict.setdefault(x, 0) #start the count at 0
        file_dict[x] += 1  #keep the huge mess as the key

    # now time to parse the keys
    fw = open(outfile, 'w') # and re-write the file at the same time!

    for key, item in file_dict.items():
        #wiriting into the new file here instead of adding to dictionary and then re-parsing 
        x = key.split(';')
        fw.write(str(item) + '\t')
        for y in x:
            fw.write(y + '\t')
        fw.write('\n')

    fw.close()
    f.close()


import glob

day_dict = {}
# glob expands a shell-style wildcard into the list of matching filenames
for file in glob.glob('*.kraken'):
    abundance_dict = {}
    with open(file) as taxonomy:
        for line in taxonomy:
            print(line)
            # these files are tab-separated: read count first, then one taxonomic level per field
            fields = line.strip().split('\t')
            try:
                # assumption: species is the 10th level, as in the lineages shown in this notebook
                species = fields[10]
            except IndexError:
                continue  # lineage not resolved down to species
            abundance_dict.setdefault(species, 0)
            abundance_dict[species] += int(fields[0])  # weight by the read count
    day_dict[file] = abundance_dict

1099892	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Lactobacillales	Enterococcaceae	Enterococcus	Enterococcus faecalis	

69433	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Lactobacillales	Enterococcaceae	Enterococcus	Enterococcus faecalis	Enterococcus faecalis D32	

11742	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Bacillales	Staphylococcaceae	Staphylococcus	Staphylococcus epidermidis	Staphylococcus epidermidis ATCC 12228	

69071	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Lactobacillales	Enterococcaceae	Enterococcus	Enterococcus faecalis	Enterococcus faecalis OG1RF	

160702	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Lactobacillales	Enterococcaceae	Enterococcus	Enterococcus faecalis	Enterococcus faecalis V583	

82431	root	cellular organisms	Bacteria	Terrabacteria group	Firmicutes	Bacilli	Bacillales	Staphylococcaceae	Staphylococcus	Staphylococcus ep