## MESO-SCOPE Meta-transcriptomic counts

There are two scripts in this folder:

In [2]:
!ls -lrth *.py

-rwxr-xr-x. 1 jmeppley delonglab 4.2K Oct 11 18:15 split_db_mult.py
-rwxr-xr-x. 1 jmeppley delonglab  14K Oct 11 18:15 plot_counts.py


#### split_db_mult.py
This splits the combined table of counts into multiple tables, one per taxonomic group.

#### plot_counts.py
This takes one taxon-specific table and generates a PDF of plots.

## Splitting the DB by taxa

We can run the script to get a description of what it does and how to use it:

In [8]:
!./split_db_mult.py -h

This script will take:

    * combined table of gene counts and annotations 
    * clade definitions YAML

it will produce:

    * table of annotated counts for each  requested clade

Output table files will be named with the input file plus the clade name
unless a naming prefix is given iwth -o <prefix>

Usage:
  split_table_mult.py [options] <annot_counts> <tax_yaml>
  split_table_mult.py -h | --help
  split_table_mult.py --version

Options:
  -h --help     Show this screen.
  --version     Show version.
  -o <out_base> Name output files with this prefix


I've also provided an example taxon definitions file with two groups defined:

 * Crocosphaera: a single genus
 * Prochlorococcus: the family Cyanobiaceae with a list of other genera excluded

In [7]:
!cat tax.defs.yaml

Crocosphaera:
    keep:
        genus:
            - Crocosphaera
Prochlorococcus:
    keep:
        family:
            - Cyanobiaceae
    drop:
        genus:
            - Synechococcus_C
            - Vulcanococcus
            - Unknown
            - Synechococcus_D
            - PCC7001
            - RCC307
            - BAIKAL-G1
            - Cyanobium
            - Synechococcus_B
            - WH-5701
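
To make the `keep`/`drop` structure concrete, here is a minimal sketch of how such definitions could be applied. It is not taken from `split_db_mult.py`, whose internals are not shown here. It assumes that a gene belongs to a clade when it matches at least one `keep` rule and no `drop` rule. The function names, the abbreviated `drop` list, and the example annotations are all hypothetical.

```python
# Hypothetical sketch: the real logic in split_db_mult.py may differ.
# Clade definitions mirroring the structure of tax.defs.yaml
# (the drop list is abbreviated for the example).
CLADES = {
    "Crocosphaera": {"keep": {"genus": ["Crocosphaera"]}},
    "Prochlorococcus": {
        "keep": {"family": ["Cyanobiaceae"]},
        "drop": {"genus": ["Synechococcus_C", "Cyanobium", "Unknown"]},
    },
}


def matches(taxonomy, rules):
    """True if any rank in `rules` lists the name this gene has at that rank."""
    return any(taxonomy.get(rank) in names for rank, names in rules.items())


def in_clade(taxonomy, definition):
    """Assumed semantics: match a `keep` rule and no `drop` rule."""
    keep = definition.get("keep", {})
    drop = definition.get("drop", {})
    return matches(taxonomy, keep) and not matches(taxonomy, drop)


def split_by_clade(annotations, clades):
    """Map each clade name to the gene IDs whose taxonomy falls inside it."""
    return {
        name: [gene for gene, tax in annotations.items() if in_clade(tax, definition)]
        for name, definition in clades.items()
    }


# Made-up annotations for illustration: gene ID -> {rank: name}
ANNOTATIONS = {
    "gene_1": {"family": "some_other_family", "genus": "Crocosphaera"},
    "gene_2": {"family": "Cyanobiaceae", "genus": "Prochlorococcus"},
    "gene_3": {"family": "Cyanobiaceae", "genus": "Synechococcus_C"},
}

result = split_by_clade(ANNOTATIONS, CLADES)
print(result)
```

Under these assumed semantics, `gene_3` is left out of the Prochlorococcus group because its genus is on the `drop` list, even though its family is on the `keep` list.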


Using the above YAML file, we can split an annotated table of counts into multiple tables, one per taxon:

In [11]:
# Normalized, annotated expression data
Norm_filter_taxa_data_file = "/data/urisheyn/Mesoscope_Trans_counts/20200128PerTimeCourse/transc_per_mL_annot_GTDB/Filtmin10_ERCCnorm_gene_counts_per_mL_annot_GTDB.tsv"
!./split_db_mult.py {Norm_filter_taxa_data_file} tax.defs.yaml -o tax.counts

This should have produced two files, one for each group defined in the YAML file:

In [12]:
!grep -c . tax.counts.*

tax.counts.Crocosphaera:5359
tax.counts.Prochlorococcus:42690


Let's peek at them to make sure they look OK:

In [13]:
!head -n 3 tax.counts.* | cut -f-6

==> tax.counts.Crocosphaera <==
	MSR00-20a-DL-SL1C04-0015	MSR00-20a-DL-SL1C06-0015	MSR00-20a-DL-SL1C09-0015	MSR00-20a-DL-SL1C10-0015	MSR00-20a-DL-SL1C11-0015
CSHLIID00-20a-S06C001-0015-151211_c100857_1	0.0	24.90813	19.483435	0.0	0.0
CSHLIID00-20a-S06C001-0015-151211_c100857_10	0.0	49.81626	272.76809000000003	196.5264734299517	488.79280434782606

==> tax.counts.Prochlorococcus <==
	MSR00-20a-DL-SL1C04-0015	MSR00-20a-DL-SL1C06-0015	MSR00-20a-DL-SL1C09-0015	MSR00-20a-DL-SL1C10-0015	MSR00-20a-DL-SL1C11-0015
CSHLIID00-20a-S06C001-0015-151211_c105159_1	64.41768421052632	49.81626	214.317785	393.0529468599034	293.27568260869566
CSHLIID00-20a-S06C001-0015-151211_c108002_1	64.41768421052632	49.81626	97.417175	343.92132850241546	195.51712173913043
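
The same check can be sketched in Python. The example below parses a two-sample excerpt of the Crocosphaera output above, relying only on the layout visible there: tab-separated, an empty first header cell, and gene IDs in the first column. The function name `read_counts` is hypothetical. It also assumes every column after the first is numeric. Only the first six columns are shown above, and the full tables probably carry annotation columns as well, which would have to be excluded before converting to floats.

```python
import csv
import io

# Excerpt of tax.counts.Crocosphaera as shown above (first two samples only).
EXAMPLE = (
    "\tMSR00-20a-DL-SL1C04-0015\tMSR00-20a-DL-SL1C06-0015\n"
    "CSHLIID00-20a-S06C001-0015-151211_c100857_1\t0.0\t24.90813\n"
    "CSHLIID00-20a-S06C001-0015-151211_c100857_10\t0.0\t49.81626\n"
)


def read_counts(handle):
    """Return (sample_names, {gene_id: [counts]}) from a tab-separated handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell is the empty index label
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row[1:]]
        if len(values) != len(samples):
            raise ValueError(
                f"{row[0]}: expected {len(samples)} values, got {len(values)}"
            )
        counts[row[0]] = values
    return samples, counts


samples, counts = read_counts(io.StringIO(EXAMPLE))
print(len(samples), "samples,", len(counts), "genes")
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.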


## Plotting
The plotting script is still incomplete. Currently it:

 * draws bar plots of counts at various taxonomic ranks
 * draws line plots of counts
 * writes all of the plots to a single PDF
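
As an illustration of the general pattern (one bar plot, one line plot, both saved to a single PDF), here is a minimal sketch using matplotlib's `PdfPages`. It does not reproduce `plot_counts.py`. The sample names, genus names, and numbers are made up, and the choice of matplotlib is itself an assumption about how such a script might be written.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render without needing a display
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Made-up values purely for illustration
samples = ["sample_1", "sample_2", "sample_3"]
totals_by_genus = {"genus_A": 120.0, "genus_B": 45.5}
gene_counts = {
    "gene_1": [0.0, 24.9, 19.5],
    "gene_2": [10.0, 49.8, 272.8],
}

pdf_path = os.path.join(tempfile.mkdtemp(), "example.pdf")

with PdfPages(pdf_path) as pdf:
    # Page 1: bar plot of summed counts per taxon at one rank
    fig, ax = plt.subplots()
    ax.bar(list(totals_by_genus), list(totals_by_genus.values()))
    ax.set_ylabel("summed counts")
    ax.set_title("Counts by genus")
    pdf.savefig(fig)
    plt.close(fig)

    # Page 2: one line per gene across samples
    fig, ax = plt.subplots()
    for gene, values in gene_counts.items():
        ax.plot(samples, values, marker="o", label=gene)
    ax.set_ylabel("counts")
    ax.set_title("Counts per gene")
    ax.legend()
    pdf.savefig(fig)
    plt.close(fig)

print(pdf_path)
```

Each `pdf.savefig(fig)` call appends one page, so adding another plot type only means adding another figure inside the `with` block.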

In [15]:
!./plot_counts.py -h

Usage:
  plot_counts.py <tax_name> <count_file> <pdf_file>
  plot_counts.py -h | --help
  plot_counts.py --version

Options:
  -h --help     Show this screen.
  --version     Show version.


In [16]:
!./plot_counts.py Crocosphaera tax.counts.Crocosphaera Crocosphaera.pdf

In [17]:
!ls -lrth *pdf

-rw-r--r--. 1 jmeppley delonglab 249K Oct 11 18:39 Crocosphaera.pdf
