April 9th, 2021: Jacob copied this notebook from Julia's earlier version to redo the analysis using the GTDB classification results.

### This notebook takes ~15 hours to run with 32 CPUs and 64G of memory. 
It should be run via a screen session on charlie

## Convoluted description of task

Basically, the 6th column (`sequence_ids_lca`) contains all of the best matches when a read has multiple best matches. As long as all of these contigs/SAGs are assigned to the same Genus, I can classify that read as originating from that Genus. Otherwise, the read is of ambiguous origin and I cannot assign it to a Genus. I need some assistance in figuring out how to parse the outputs from the DNA and RNA recruitments so that, instead of looking only at the Genus (the last column), the parser also looks at the contigs in the 6th column and confirms that all of the SAGs with matching contigs belong to that Genus. I have skimmed a couple of the output files and there are definitely reads that have best matches to multiple SAGs, for example in this file: /mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/All_190709_GoM_RNA_seq_bbmerge_reads_annotated_classified2.txt. 
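
The rule described above can be sketched as follows. This is a minimal illustration, assuming the 6th column holds a comma-separated list of contig/ORF IDs and that the SAG name is the part of each ID before the first underscore. The names `classify_read` and `sag_to_genus` are hypothetical, and the genus labels are placeholders rather than real assignments.

```python
def classify_read(sequence_ids_lca, sag_to_genus):
    """Return (status, genera) for one read.

    sequence_ids_lca: comma-separated best-match IDs (a trailing comma is tolerated).
    sag_to_genus: dict mapping SAG name -> genus (hypothetical lookup table).
    """
    hits = [h for h in sequence_ids_lca.split(",") if h]
    # Assumption: the SAG name is everything before the first underscore.
    genera = {sag_to_genus.get(h.split("_")[0], "Unclassified") for h in hits}
    if not genera:
        return "no_hit", set()
    status = "exclusive" if len(genera) == 1 else "ambiguous"
    return status, genera


# Toy lookup table; the genus labels are placeholders, not real assignments.
sag_to_genus = {"AH-707-M18": "GenusA", "AH-704-O17": "GenusA", "AH-707-B03": "GenusB"}

same = classify_read("AH-707-M18_NODE_75;2531;2986,AH-704-O17_NODE_7;76986;77843,", sag_to_genus)
mixed = classify_read("AH-707-M18_NODE_75;2531;2986,AH-707-B03_NODE_10;6131;6988,", sag_to_genus)
print(same)
print(mixed)
```

In this sketch a read whose hits all map to one genus is "exclusive", and any read touching two or more genera is "ambiguous". How to treat hits to unclassified SAGs is a design choice the sketch does not settle: here they count as their own label.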

The genus-by-date CSV file is not sufficient for this. For each date I need a CSV file that is a grid of gene name (prokka_gene) by genus, so that I can differentiate expression between different genes at different time points. Sorry for this omission; I completely spaced on that earlier.
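
One possible way to build such a gene-by-genus grid for a single date is a cross-tabulation. The sketch below uses toy records, and the column names `prokka_gene` and `genus` are assumed to match the per-read output; the output file name is hypothetical.

```python
import pandas as pd

# Toy per-read records for one date; values are placeholders.
reads = pd.DataFrame({
    "prokka_gene": ["recA", "recA", "rpoB", "recA"],
    "genus": ["GenusA", "GenusB", "GenusA", "GenusA"],
})

# Rows are gene names, columns are genera, cells are read counts.
grid = pd.crosstab(reads["prokka_gene"], reads["genus"])
print(grid)
# grid.to_csv("gene_by_genus_<date>.csv")  # hypothetical output name
```

Gene/genus combinations with no reads appear as 0 in the grid, which keeps the grids for different dates comparable once they share the same row and column labels.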

## Data to use
Right now there are 2 sets of recruitment that I need to do this on. The transcriptomic \*annotated.txt files in this folder /mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations

And the DNA recruitment \*annotated.txt files that you previously created the read_counting notebook to process in this folder /mnt/scgc/simon/microg2p/analyses/GORG_recruitment/201130_GORG_DNA_recruitment/results/annotations

This file should contain all of the SAGs as well as the Genus that they are assigned to /mnt/scgc/simon/microg2p/analyses/ani/All_GoM_SAGs_1cell_20kb_decon_clusters_added.csv

## TODO

* Count reads recruited to each gene name + functional description per Genus.
* Separate the counts by whether a read recruited exclusively to the given Genus or to several Genera (i.e. shared with other genera).
* Count reads recruited per Genus, per functional category, per metagenome.

The plan:
* The dataframe for each metagenome will have the columns ['gene_id','GEN','Exclusive','Shared'].
* It will include ec_number and prokka functions as well.
* The 'Exclusive' and 'Shared' columns will hold read counts.
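
A minimal sketch of the planned table, assuming each input row is one read assigned to one genus with 0/1 flags for exclusive and shared. All values are toy data; the only names taken from the plan are the column labels.

```python
import pandas as pd

# Toy per-read records: one row per read per genus, with 0/1 flags.
records = pd.DataFrame({
    "gene_id": ["recA", "recA", "recA", "rpoB"],
    "GEN": ["GenusA", "GenusA", "GenusB", "GenusA"],
    "ec_number": ["", "", "", "2.7.7.6"],
    "prokka_function": ["Protein RecA", "Protein RecA", "Protein RecA",
                        "DNA-directed RNA polymerase subunit beta"],
    "Exclusive": [1, 0, 0, 1],
    "Shared": [0, 1, 1, 0],
})

# Missing annotations are kept as empty strings here: groupby drops rows
# whose key is NaN by default, which would silently lose reads.
summary = (
    records
    .groupby(["gene_id", "GEN", "ec_number", "prokka_function"], as_index=False)[["Exclusive", "Shared"]]
    .sum()
)
print(summary)
```

Summing the 0/1 flags gives the read counts per gene and genus directly, so no separate counting pass is needed.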


In [1]:
import glob
import math
import os
import os.path as op
import time  # needed by safe_makedir (time.sleep)
from collections import Counter, defaultdict

import pandas as pd

import sqlite3
import sqlalchemy

def safe_makedir(dname):
    """
    Make a directory if it doesn't exist, handling concurrent race conditions.
    """
    if not dname:
        return dname
    num_tries = 0
    max_tries = 5
    while not os.path.exists(dname):
        try:
            os.makedirs(dname)
        except OSError:
            if num_tries > max_tries:
                raise
            num_tries += 1
            time.sleep(2)
    return dname

read_files = glob.glob('/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/*annotated.txt') + \
glob.glob('/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/201130_GORG_DNA_recruitment/results/annotations/*annotated.txt')

In [2]:
read_files

['/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/All_171102_GoM_RNA_seq_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/All_181030_GoM_RNA_seq_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/All_190402_GoM_RNA_seq_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/no_GEN/GEN_SAGs/results/annotations/All_190709_GoM_RNA_seq_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/201130_GORG_DNA_recruitment/results/annotations/ALL_20171102_contf_pe_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/201130_GORG_DNA_recruitment/results/annotations/ALL_20181030_contf_pe_bbmerge_reads_annotated.txt',
 '/mnt/scgc/simon/microg2p/analyses/GORG_recruitment/201130_GORG_DNA_recruitment/results/annotations/ALL_20190402_contf_pe_bbm

Building a master dataframe of all GoM ORFs

In [3]:
# file that indicates which sags belong to which category

cat_file = '/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/Summary_files/All_GoM_SAGs_1cell_20kb_decon_531normalized_predresp_rate_GTDBclass.csv'
namecolumnid = 'name'
categorycolumnid = 'GTDB_classification'

#catdf_file = '/mnt/scgc/simon/microg2p/analyses/ani/All_GoM_SAGs_1cell_20kb_decon_clusters_added.csv'
catdf = pd.read_csv(cat_file)


gadict = {l[namecolumnid]:l[categorycolumnid] for i, l in catdf.iterrows()}

In [4]:
mmseqs_cluster_file = '/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/mmseqs/analyses/GoM_sag_orfs_80minid_m80.tsv'

head = ['gene_cluster_rep', 'gene']
gcdf = pd.read_csv(mmseqs_cluster_file,
                         sep='\t', names=head)

gcdf['gene_cluster_rep'] = ["_".join(i.split("_")[:-1]) for i in gcdf['gene_cluster_rep']]
gcdf['gene'] = ["_".join(i.split("_")[:-1]) for i in gcdf['gene']]

#gcdict = {}
#gcdict = {l['gene']:l['gene_cluster_rep'] for i, l in gcdf.iterrows()}
### Use the gadict from above to check if genes are part of multiple genera 

In [6]:
gcdf['gene_sag'] = [i.split("_")[0] for i in gcdf['gene']]
gcdf['gene_genus'] = [gadict[i] for i in gcdf['gene_sag']]

# this counts the number of distinct classified genera represented in each gene cluster
genus_per_cluster = gcdf[gcdf['gene_genus'] != 'Unclassified'].drop_duplicates(subset=['gene_cluster_rep','gene_genus']).groupby('gene_cluster_rep', as_index=False)['gene'].count().rename(columns={'gene':'genus_count'})

genus_per_cluster['gene_cluster_genus_status'] = ['exclusive' if i == 1 else 'shared' for i in genus_per_cluster['genus_count']]
gcdf2 = gcdf.merge(genus_per_cluster[['gene_cluster_rep','gene_cluster_genus_status']], how='left')

gcdf2['gene_cluster_genus_status'] = gcdf2['gene_cluster_genus_status'].fillna('unclassified_only')



Counter(gcdf2['gene_cluster_genus_status']).most_common()

[('exclusive', 3471051), ('shared', 156770), ('unclassified_only', 36836)]
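
The status logic in the cell above can be checked on a toy table. This is a sketch under the assumption that the intent is "count distinct classified genera per cluster"; it uses `nunique` rather than the drop-duplicates-then-count approach above, and the cluster and genus labels are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_cluster_rep": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "gene_genus": ["GenusA", "GenusA", "Unclassified", "GenusA", "GenusB", "Unclassified"],
})

# Ignore unclassified genes when deciding whether a cluster spans genera.
classified = toy[toy["gene_genus"] != "Unclassified"]
genus_count = classified.groupby("gene_cluster_rep")["gene_genus"].nunique()
status = genus_count.map(lambda n: "exclusive" if n == 1 else "shared")

# Clusters with no classified member have no entry in `status` and get NaN.
toy["status"] = toy["gene_cluster_rep"].map(status).fillna("unclassified_only")
print(toy)
```

Note that the status belongs to the cluster, not to the gene: an unclassified gene in cluster `c1` still carries the label "exclusive", as also seen in the real table.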

In [7]:
gcdf2.head()

Unnamed: 0,gene_cluster_rep,gene,gene_sag,gene_genus,gene_cluster_genus_status
0,AH-704-O17_NODE_7;76986;77843,AH-704-O17_NODE_7;76986;77843,AH-704-O17,Akkermansiaceae,exclusive
1,AH-704-O17_NODE_7;76986;77843,AH-704-K03_NODE_40;1572;2429,AH-704-K03,Akkermansiaceae,exclusive
2,AH-704-O17_NODE_7;76986;77843,AH-707-F08_NODE_84;3340;4197,AH-707-F08,Akkermansiaceae,exclusive
3,AH-704-O17_NODE_7;76986;77843,AH-707-K14_NODE_54;1628;2485,AH-707-K14,Akkermansiaceae,exclusive
4,AH-704-O17_NODE_7;76986;77843,AH-707-B03_NODE_10;6131;6988,AH-707-B03,Unclassified,exclusive


In [7]:
gcdf2['gene_cluster-gene_genus'] = ['{}-{}'.format(l['gene_cluster_rep'],l['gene_genus']) for i, l in gcdf2.iterrows()]
gcdf2.to_csv("/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/mmseqs/analyses/GoM_sag_orfs_80minid_m80_gtdb_added.csv", index=False)

After the above step, you can submit the qsub job for the \*simpledict.py script. Before you do that, if you have changed the table above, go into that script and change the intermediate destination directory. I did this for the changes made on 7/20/21; the new intermediate output directory is: ```/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210720_GoM_recluster_analysis/GORG_recruitment_mmseq_clusts```


Below is just development scripting that does not need to be run if you run the associated python script.

In [16]:
d2 = {l['gene']:'{}--{}'.format(l['gene_cluster_rep'],l['gene_genus']) for i, l in gcdf2.iterrows()}

We are adding an additional piece of information to the output files, which are separated by genus: a new column listing the gene cluster(s) associated with the genes each read hit. The gene cluster IDs are written to the intermediate output file alongside the other fields. Whether a gene cluster is shared or exclusive can then be determined by merging the table we created above with the final output.
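
The lookup described here can be sketched in isolation. The helper name `clusters_for_hits` and the dictionary `gene_to_label` are hypothetical; the dictionary is assumed to map each gene ID to a `cluster--genus` label, and the genus labels are placeholders.

```python
def clusters_for_hits(sequence_ids_lca, gene_to_label):
    """Return (sorted genera, comma-joined sorted cluster IDs) for one read."""
    hits = [h for h in sequence_ids_lca.split(",") if h]
    # Hits missing from the lookup are skipped here instead of raising KeyError.
    labels = [gene_to_label[h] for h in hits if h in gene_to_label]
    # Split on the last "--" so a cluster ID containing "--" would stay intact.
    pairs = [lab.rsplit("--", 1) for lab in labels]
    genera = sorted({genus for _, genus in pairs})
    clusters = sorted({cluster for cluster, _ in pairs})
    return genera, ",".join(clusters)


# Toy lookup; genus labels are placeholders.
gene_to_label = {
    "AH-704-O17_NODE_7;76986;77843": "AH-704-O17_NODE_7;76986;77843--GenusA",
    "AH-704-K03_NODE_40;1572;2429": "AH-704-O17_NODE_7;76986;77843--GenusA",
    "AH-707-B03_NODE_10;6131;6988": "AH-707-B03_NODE_10;6131;6988--GenusB",
}

one = clusters_for_hits("AH-704-O17_NODE_7;76986;77843,AH-704-K03_NODE_40;1572;2429,", gene_to_label)
two = clusters_for_hits("AH-704-O17_NODE_7;76986;77843,AH-707-B03_NODE_10;6131;6988,", gene_to_label)
print(one)
print(two)
```

Sorting the cluster IDs before joining makes the `gene_clusters` string deterministic, so the same set of clusters always groups together in later counts; joining an unsorted set can produce different strings for the same clusters.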

In [42]:
!head -20 {read_files[0]}

status	sequence_id	taxonomy_id	length	taxonomy_ids_lca	sequence_ids_lca	protein_sequence	taxonomic_lineage	ec_number	prokka_function	prokka_gene	gen
U	NB501011:101:HYFW5AFXY:1:11101:22109:16099	0												
U	NB501011:101:HYFW5AFXY:1:11101:9934:16106	0												
U	NB501011:101:HYFW5AFXY:1:11101:25787:16120	0												
U	NB501011:101:HYFW5AFXY:1:11101:16339:16127	0												
U	NB501011:101:HYFW5AFXY:1:11101:18264:16134	0												
U	NB501011:101:HYFW5AFXY:1:11101:10785:16139	0												
U	NB501011:101:HYFW5AFXY:1:11101:21840:16156	0												
U	NB501011:101:HYFW5AFXY:1:11101:11213:16159	0												
U	NB501011:101:HYFW5AFXY:1:11101:21896:16164	0												
U	NB501011:101:HYFW5AFXY:1:11101:15055:16174	0												
U	NB501011:101:HYFW5AFXY:1:11101:25331:16177	0												
U	NB501011:101:HYFW5AFXY:1:11101:10181:16186	0												
C	NB501011:101:HYFW5AFXY:1:11101:13771:16204	126	110	126,	AH-707-M18_NODE_75;2531;2986,	PEEGGDDVKSSWPLCLGLHTSYN,	Bacteria; Planctomycetes; Planc

In [24]:
## testing out method...
intermediate_outdir = safe_makedir('/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/GORG_recruitment_mmseq_clusts3')
#final_outdir = safe_makedir("/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/GORG_recruitment/MMseq_cluster_summaries")

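# header for the per-genus output files; must match the 8 fields written per read below
# ('gene_clusters_hit_count' was dropped because no value is written for it)
columns=['genus',
        'ec_number', 
        'prokka_function',
        'prokka_gene',
        'seq_ids_lca',
        'exclusive',
        'shared',
        'gene_clusters']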

for infile in read_files[:1]:
    
    inid = op.basename(infile).split(".")[0]
    
    #final_gc_outfile = op.join(final_outdir, '{}_reads_by_gen_ko_and_gene_clust.csv'.format(inid))
    
    #if op.exists(final_gc_outfile):
    #    print("output for {infile} already exists, output found at {final_outfile}".format(infile=infile, final_outfile=final_outfile))
    #    continue
    #else:
    reads = 0
    recorded = 0
    with open(infile) as ih:

        outfiles = []
        if op.exists(intermediate_outdir):
            !rm -rf {op.join(intermediate_outdir, inid)}

        outdir = safe_makedir(op.join(intermediate_outdir, inid))

        for j, l in enumerate(ih):

            if j == 0:
                cids = l.strip().split("\t")

            if l.startswith('C'):
                reads += 1
                toks = dict(zip(cids, l.strip().split("\t")))
                nodes_hit = toks['sequence_ids_lca'][:-1].split(",")
                
                labels = [d2[i] for i in nodes_hit]
                gens = [i.split("--")[-1] for i in labels]
                gene_clusters = ",".join(list(set([i.split("--")[0] for i in labels])))

                if len(gens) == 0:
                    keep = l
                    break

                if len(set(gens)) > 1:
                    category = 'shared'
                    shared = 1
                    exclusive=0
                else:
                    category = 'exclusive'
                    exclusive = 1
                    shared = 0

                for gen in list(set(gens)):
                    recorded += 1
                    outfile = op.join(outdir, "{}.tsv".format(gen))
                    if not op.exists(outfile):
                        outfiles.append(outfile)
                        with open(outfile, "w") as oh:
                            print("\t".join(columns), file = oh)
                            print(gen, toks['ec_number'], toks['prokka_function'], toks['prokka_gene'], toks['sequence_ids_lca'], exclusive, shared, gene_clusters, sep="\t", file = oh)
                    else:
                        with open(outfile, "a") as oh:
                            print(gen, toks['ec_number'], toks['prokka_function'], toks['prokka_gene'], toks['sequence_ids_lca'], exclusive, shared, gene_clusters, sep="\t", file = oh)
            if j % 1000000 == 0:
                print(j, 'lines processed')

0 lines processed
1000000 lines processed
2000000 lines processed
3000000 lines processed
4000000 lines processed
5000000 lines processed
6000000 lines processed
7000000 lines processed
8000000 lines processed
9000000 lines processed
10000000 lines processed
11000000 lines processed
12000000 lines processed
13000000 lines processed
14000000 lines processed
15000000 lines processed
16000000 lines processed
17000000 lines processed
18000000 lines processed
19000000 lines processed
20000000 lines processed
21000000 lines processed
22000000 lines processed
23000000 lines processed
24000000 lines processed
25000000 lines processed
26000000 lines processed
27000000 lines processed
28000000 lines processed
29000000 lines processed
30000000 lines processed
31000000 lines processed
32000000 lines processed
33000000 lines processed
34000000 lines processed
35000000 lines processed
36000000 lines processed
37000000 lines processed
38000000 lines processed
39000000 lines processed
40000000 lines p

In [None]:
# to group/count/summarise results, do something like this...

intermediate_outdir = '/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/GORG_recruitment_mmseq_clusts3'
#final_outdir = safe_makedir("/mnt/scgc/simon/microg2p/analyses/20210325_GoM_recluster/20210325_GoM_recluster_analysis/GORG_recruitment/MMseq_cluster_summaries")


for infile in read_files:
    inid = op.basename(infile).split(".")[0]
    
    outdir = op.join(intermediate_outdir, inid)
    outfiles = glob.glob(op.join(outdir, '*.tsv'))
    # size() counts every read; count() on 'ec_number' would skip reads with no EC number (NaN)
    outdf = (pd.concat([pd.read_csv(i, sep="\t") for i in outfiles])
             .groupby(['genus', 'gene_clusters'])
             .size()
             .reset_index(name='reads_hit')
             .sort_values(by='reads_hit', ascending=False))
    outdf['read_library'] = inid
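
To attach the shared/exclusive cluster status to a summary like this, one option is to split the comma-joined `gene_clusters` field and merge on the cluster ID. The sketch below uses toy tables; it assumes the status table has been reduced to one row per cluster first (the per-gene table would duplicate rows on merge).

```python
import pandas as pd

# Toy read-count summary; a read can hit more than one cluster.
counts = pd.DataFrame({
    "genus": ["GenusA", "GenusA"],
    "gene_clusters": ["c1", "c1,c2"],
    "reads_hit": [10, 3],
})

# One row per cluster (e.g. after drop_duplicates on the per-gene table).
cluster_status = pd.DataFrame({
    "gene_cluster_rep": ["c1", "c2"],
    "gene_cluster_genus_status": ["exclusive", "shared"],
})

# One row per (summary row, cluster) pair, then look up each cluster's status.
long = counts.assign(gene_cluster_rep=counts["gene_clusters"].str.split(",")).explode("gene_cluster_rep")
annotated = long.merge(cluster_status, on="gene_cluster_rep", how="left")
print(annotated)
```

After the explode step, reads that hit several clusters appear once per cluster, so summing `reads_hit` over the annotated table would count them more than once.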