# Goal

* Designing primers for Methanobrevibacter
  * Download genomes
  * QC genomes
  * Design primers

# Var

In [5]:
work_dir = '/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter'
clade = 'Methanobrevibacter'
taxid = 2172

# Init

In [24]:
library(dplyr)
library(tidyr)
library(ggplot2)
library(LeyLabRMisc)

In [25]:
df.dims()
make_dir(work_dir)

Directory already exists: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter 


# Genome download

* Downloading all GenBank *Methanobrevibacter* assemblies from NCBI with `ncbi-genome-download`

```
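# output directory for the download (same path as work_dir above)
OUTDIR=/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/
mkdir -p "$OUTDIR"
# -p: parallel downloads; -s: NCBI section; -F: file format; "archaea" is the taxonomic group
ncbi-genome-download -p 12 -s genbank -F fasta --genera Methanobrevibacter -o "$OUTDIR" archaea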
```
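
A quick check after the download can catch interrupted transfers before the QC pipeline runs. The sketch below counts the `.fna.gz` files under an output directory and flags empty ones. `count_fastas` is a hypothetical helper (not part of `ncbi-genome-download` or LeyLabRMisc), the accession in the demo is made up, and the demo uses a temporary directory rather than the real download location.

```r
# Hypothetical helper: count downloaded assemblies and flag empty files
count_fastas = function(outdir, pattern = '\\.fna\\.gz$') {
    files = list.files(outdir, pattern = pattern, recursive = TRUE, full.names = TRUE)
    sizes = file.size(files)
    # a zero-byte file usually indicates an interrupted download
    list(n_files = length(files), n_empty = sum(sizes == 0))
}

# Toy demonstration (made-up accession, temporary directory)
demo_root = file.path(tempdir(), 'genbank_demo')
demo_dir = file.path(demo_root, 'archaea', 'GCA_000000000.1')
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

# one valid gzipped fasta
con = gzfile(file.path(demo_dir, 'GCA_000000000.1_demo_genomic.fna.gz'), 'w')
writeLines(c('>contig1', 'ACGT'), con)
close(con)

# one empty file, mimicking a failed download
invisible(file.create(file.path(demo_dir, 'GCA_000000000.1_empty_genomic.fna.gz')))

res = count_fastas(demo_root)
print(res)
```

On the real data, the call would be `count_fastas(file.path(work_dir, 'genbank'))`, assuming `work_dir` is defined as in the `Var` section.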

# Genome quality

In [10]:
# all downloaded assemblies (gzipped fasta) under the genbank/ directory
D = file.path(work_dir, 'genbank')
files = list_files(D, '\\.fna\\.gz$')
samps = data.frame(Name = files %>% as.character %>% basename,
                   Fasta = files,
                   Domain = 'Archaea',
                   Taxid = taxid) %>%
    mutate(Name = gsub('\\.fna\\.gz$', '', Name),
           Fasta = gsub('/+', '/', Fasta))
samps

# writing file
outfile = file.path(work_dir, 'genomes_raw.txt')
write_table(samps, outfile)

Name,Fasta,Domain,Taxid
<chr>,<chr>,<fct>,<dbl>
GCA_000016525.1_ASM1652v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genbank/archaea/GCA_000016525.1/GCA_000016525.1_ASM1652v1_genomic.fna.gz,Archaea,2172
GCA_000024185.1_ASM2418v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genbank/archaea/GCA_000024185.1/GCA_000024185.1_ASM2418v1_genomic.fna.gz,Archaea,2172
⋮,⋮,⋮,⋮
GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genbank/archaea/GCA_902384065.1/GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic.fna.gz,Archaea,2172
GCA_902387325.1_UHGG_MGYG-HGUT-02446_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genbank/archaea/GCA_902387325.1/GCA_902387325.1_UHGG_MGYG-HGUT-02446_genomic.fna.gz,Archaea,2172


File written: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genomes_raw.txt 
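
Before handing the samples table to LLG, it may be worth confirming that genome names are unique and that every fasta path exists. The following is a minimal base-R sketch; `check_samples` is a hypothetical helper and the toy table stands in for `samps`.

```r
# Hypothetical helper: basic sanity checks on a Name/Fasta samples table
check_samples = function(samps) {
    stopifnot(all(c('Name', 'Fasta') %in% colnames(samps)))
    list(
        n_duplicated_names = sum(duplicated(as.character(samps$Name))),
        n_missing_files = sum(!file.exists(as.character(samps$Fasta)))
    )
}

# Toy table: one real (empty) temp file, one missing path, one duplicated name
f = tempfile(fileext = '.fna.gz')
invisible(file.create(f))
toy = data.frame(Name = c('genomeA', 'genomeB', 'genomeA'),
                 Fasta = c(f, file.path(tempdir(), 'does_not_exist.fna.gz'), f),
                 Domain = 'Archaea',
                 stringsAsFactors = FALSE)

res = check_samples(toy)
print(res)
```

Both counts should be zero for a table that is ready to use.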


## LLG

#### Config

In [11]:
cat_file(file.path(work_dir, 'config_llg.yaml'))

# table with genome --> fasta_file information
samples_file: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/genomes_raw.txt

# output location
output_dir: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/

# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/

# batch processing of genomes for certain steps
## increase to better parallelize
batches: 2 

# Domain of genomes ('Archaea' or 'Bacteria')
## Use "Skip" if provided as a "Domain" column in the genome table
Domain: Skip

# software parameters
# Use "Skip" to skip any of these steps. If no params for rule, use ""
# dRep MAGs are not further analyzed, but you can de-rep & then use the de-rep genome table as input.
params:
  ionice: -c 3
  # assembly assessment
  seqkit: ""
  quast: Skip #""
  multiqc_on_quast: "" 
  checkm: ""
  # de-replication (requires checkm)
  drep: -comp 90 -con 5 -sa 0.999
  # 

#### Run

```
(snakemake) @ rick:/ebio/abt3_projects/software/dev/ll_pipelines/llg
$ screen -L -S llg-thermo ./snakemake_sge.sh /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/config_llg.yaml 30 -F
```

### Samples table of high-quality genomes

In [12]:
# checkM summary
checkm = file.path(work_dir, 'LLG_output', 'checkM', 'checkm_qa_summary.tsv') %>%
    read.delim(sep='\t') 
checkm

Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,Strain.heterogeneity,Genome.size..bp.,X..ambiguous.bases,⋯,X0,X1,X2,X3,X4,X5.,assembly.Id,assembler.Id,taxon.Id,File
<fct>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<lgl>,<fct>
GCA_000016525.1_ASM1652v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,0,1853160,0,⋯,0,188,0,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|methanobrevibacter|LLG_output|checkM|1|checkm|markers_qa_summary.tsv.1,markers_qa_summary.tsv.1,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/checkM/1/checkm/markers_qa_summary.tsv.1
GCA_000151225.1_ASM15122v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,0,1729275,1500,⋯,0,188,0,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|methanobrevibacter|LLG_output|checkM|1|checkm|markers_qa_summary.tsv.2,markers_qa_summary.tsv.2,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/checkM/1/checkm/markers_qa_summary.tsv.2
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
GCA_901111125.1_PRJEB32190_genomic,p__Euryarchaeota (UID3),148,188,125,100.0,0,0,1712416,4,⋯,0,188,0,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|methanobrevibacter|LLG_output|checkM|2|checkm|markers_qa_summary.tsv.50,markers_qa_summary.tsv.50,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/checkM/2/checkm/markers_qa_summary.tsv.50
GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic,p__Euryarchaeota (UID3),148,188,125,96.8,0,0,2083511,2,⋯,7,181,0,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|methanobrevibacter|LLG_output|checkM|2|checkm|markers_qa_summary.tsv.51,markers_qa_summary.tsv.51,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/checkM/2/checkm/markers_qa_summary.tsv.51


In [33]:
# dRep summary
drep = file.path(work_dir, 'LLG_output', 'drep', 'checkm_markers_qa_summary.tsv') %>%
    read.delim(sep='\t') %>%
    mutate(Bin.Id = gsub('.+/', '', genome),
           Bin.Id = gsub('\\.fna$', '', Bin.Id))
drep

genome,completeness,contamination,Bin.Id
<fct>,<dbl>,<dbl>,<chr>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000016525.1_ASM1652v1_genomic.fna,100,0,GCA_000016525.1_ASM1652v1_genomic
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000151225.1_ASM15122v1_genomic.fna,100,0,GCA_000151225.1_ASM15122v1_genomic
⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_901111125.1_PRJEB32190_genomic.fna,100.0,0,GCA_901111125.1_PRJEB32190_genomic
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic.fna,96.8,0,GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic


In [34]:
# de-replicated genomes
drep_gen = file.path(work_dir, 'LLG_output', 'drep', 'dereplicated_genomes.tsv') %>%
    read.delim(sep='\t')
drep_gen

Name,Fasta
<fct>,<fct>
GCA_001563245.1_ASM156324v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_001563245.1_ASM156324v1_genomic.fna
GCA_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic.fna
⋮,⋮
GCA_002813085.1_ASM281308v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_002813085.1_ASM281308v1_genomic.fna
GCA_003111605.1_ASM311160v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_003111605.1_ASM311160v1_genomic.fna


In [35]:
# GTDBTk summary
tax = file.path(work_dir, 'LLG_output', 'gtdbtk', 'gtdbtk_ar_summary.tsv') %>%
    read.delim(sep='\t') %>%
    separate(classification, 
             c('Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'),
             sep=';') %>%
    select(-note, -classification_method, -pplacer_taxonomy,
           -other_related_references.genome_id.species_name.radius.ANI.AF.)
tax

user_genome,Domain,Phylum,Class,Order,Family,Genus,Species,fastani_reference,fastani_reference_radius,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
GCA_000016525.1_ASM1652v1_genomic,d__Archaea,p__Methanobacteriota,c__Methanobacteria,o__Methanobacteriales,f__Methanobacteriaceae,g__Methanobrevibacter_A,s__Methanobrevibacter_A smithii,GCF_000016525.1,95.0,⋯,1.0,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,100.0,1.0,98.09,11,,
GCA_000151225.1_ASM15122v1_genomic,d__Archaea,p__Methanobacteriota,c__Methanobacteria,o__Methanobacteriales,f__Methanobacteriaceae,g__Methanobrevibacter_A,s__Methanobrevibacter_A smithii,GCF_000016525.1,95.0,⋯,0.95,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,98.32,0.95,98.09,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
GCA_901111125.1_PRJEB32190_genomic,d__Archaea,p__Methanobacteriota,c__Methanobacteria,o__Methanobacteriales,f__Methanobacteriaceae,g__Methanobrevibacter_A,s__Methanobrevibacter_A smithii,GCF_000016525.1,95.0,⋯,0.95,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,98.36,0.95,98.09,11,,
GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic,d__Archaea,p__Methanobacteriota,c__Methanobacteria,o__Methanobacteriales,f__Methanobacteriaceae,g__Methanobrevibacter_A,s__Methanobrevibacter_A oralis,GCF_001639275.1,95.0,⋯,0.97,GCF_001639275.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A oralis,99.85,0.97,89.89,11,,


In [36]:
# checking overlap
cat('-- drep --\n')
overlap(basename(as.character(drep_gen$Fasta)), 
        basename(as.character(drep$genome)))
cat('-- checkm --\n')
overlap(drep$Bin.Id, checkm$Bin.Id)
cat('-- gtdbtk --\n')
overlap(drep$Bin.Id, tax$user_genome)

-- drep --
intersect(x,y): 59 
setdiff(x,y): 0 
setdiff(y,x): 44 
union(x,y): 103 
-- checkm --
intersect(x,y): 103 
setdiff(x,y): 0 
setdiff(y,x): 0 
union(x,y): 103 
-- gtdbtk --
intersect(x,y): 103 
setdiff(x,y): 0 
setdiff(y,x): 0 
union(x,y): 103 
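
The counts above are plain set operations on the genome identifiers. In this run, 59 of the 103 genomes in the dRep quality table were retained as de-replicated representatives, and all 103 have checkM and GTDB-Tk records. A minimal base-R sketch of such a report follows; it assumes that `overlap()` from LeyLabRMisc compares unique values, and `overlap_counts` is a hypothetical name.

```r
# Hypothetical re-implementation of an overlap report using base-R set functions
overlap_counts = function(x, y) {
    x = unique(as.character(x))
    y = unique(as.character(y))
    c(intersect  = length(intersect(x, y)),
      setdiff_xy = length(setdiff(x, y)),   # in x only
      setdiff_yx = length(setdiff(y, x)),   # in y only
      union      = length(union(x, y)))
}

# Toy identifiers: x is a subset of y, as with de-replicated vs. all genomes
x = c('genomeA.fna', 'genomeB.fna')
y = c('genomeA.fna', 'genomeB.fna', 'genomeC.fna')
res = overlap_counts(x, y)
print(res)
```

A non-zero `setdiff_xy` in the first comparison would mean a de-replicated genome is missing from the quality table, which would silently drop rows in a later `inner_join`.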


In [37]:
# joining based on Bin.Id
drep = drep %>%
    inner_join(checkm, c('Bin.Id')) %>%
    mutate(GEN = genome %>% as.character %>% basename) %>%
    inner_join(drep_gen %>% mutate(GEN = Fasta %>% as.character %>% basename),
               by=c('GEN')) %>%
    inner_join(tax, c('Bin.Id'='user_genome'))
drep

genome,completeness,contamination,Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<dbl>,<dbl>,<chr>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000016525.1_ASM1652v1_genomic.fna,100,0,GCA_000016525.1_ASM1652v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,1.0,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,100.0,1.0,98.09,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000151225.1_ASM15122v1_genomic.fna,100,0,GCA_000151225.1_ASM15122v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,0.95,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,98.32,0.95,98.09,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_900103415.1_IMG-taxon_2593339167_annotated_assembly_genomic.fna,100,1.85,GCA_900103415.1_IMG-taxon_2593339167_annotated_assembly_genomic,p__Euryarchaeota (UID3),148,188,125,100,1.85,⋯,,,,,,,94.79,11,0.9626524776204902,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic.fna,100,1.60,GCA_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic,p__Euryarchaeota (UID3),148,188,125,100,1.60,⋯,1.0,GCF_900114585.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter;s__Methanobrevibacter olleyae,100.0,1.0,96.88,11,,


In [56]:
# filtering by quality
hq_genomes = drep %>%
    filter(completeness >= 90,
           contamination < 5,
           Strain.heterogeneity < 50,
           X..contigs <= 200,
           Mean.contig.length..bp. >= 15000,
           N50..contigs. >= 50000)
hq_genomes

genome,completeness,contamination,Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<dbl>,<dbl>,<chr>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000016525.1_ASM1652v1_genomic.fna,100,0,GCA_000016525.1_ASM1652v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,1.0,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,100.0,1.0,98.09,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000151225.1_ASM15122v1_genomic.fna,100,0,GCA_000151225.1_ASM15122v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,0.95,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,98.32,0.95,98.09,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_015062915.1_ASM1506291v1_genomic.fna,96.8,0,GCA_015062915.1_ASM1506291v1_genomic,p__Euryarchaeota (UID3),148,188,125,96.8,0,⋯,,GCF_003111625.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A thaueri,84.14,0.68,96.99,11,0.9866057450935961,Genome not assigned to closest species as it falls outside its pre-defined ANI radius
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_015062935.1_ASM1506293v1_genomic.fna,97.6,0,GCA_015062935.1_ASM1506293v1_genomic,p__Euryarchaeota (UID3),148,188,125,97.6,0,⋯,,GCA_900314615.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A sp900314615,85.62,0.69,97.40,11,0.9815758924163569,Genome not assigned to closest species as it falls outside its pre-defined ANI radius
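
When a filter like this removes many genomes, it helps to see which threshold is responsible. The sketch below applies the same six criteria to a toy table and counts failures per criterion. The values are invented for illustration; only the column names and thresholds are taken from the cell above.

```r
# Toy table with the same column names as the joined dRep/checkM table
toy = data.frame(
    Bin.Id = c('g1', 'g2', 'g3'),
    completeness = c(100, 96.8, 85),
    contamination = c(0, 6.2, 0),
    Strain.heterogeneity = c(0, 0, 0),
    X..contigs = c(1, 40, 35),
    Mean.contig.length..bp. = c(1853160, 45000, 60000),
    N50..contigs. = c(1853160, 90000, 120000),
    stringsAsFactors = FALSE
)

# One logical column per criterion (TRUE = passes)
checks = with(toy, cbind(
    completeness    = completeness >= 90,
    contamination   = contamination < 5,
    strain_het      = Strain.heterogeneity < 50,
    n_contigs       = X..contigs <= 200,
    mean_contig_len = Mean.contig.length..bp. >= 15000,
    N50             = N50..contigs. >= 50000
))
rownames(checks) = toy$Bin.Id

# Number of genomes failing each criterion, and the genomes passing all of them
n_failed = colSums(!checks)
kept = toy$Bin.Id[rowSums(!checks) == 0]
print(n_failed)
print(kept)
```

Here `g2` fails on contamination and `g3` on completeness, so only `g1` is kept.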


In [57]:
# summarizing the taxonomy
df.dims(20)
hq_genomes %>%
    group_by(Family, Genus) %>%
    summarize(n_genomes = n(), .groups='drop')
df.dims()

Family,Genus,n_genomes
<chr>,<chr>,<int>
f__Methanobacteriaceae,g__Methanobrevibacter,3
f__Methanobacteriaceae,g__Methanobrevibacter_A,19
f__Methanobacteriaceae,g__Methanobrevibacter_B,3
f__Methanobacteriaceae,g__Methanobrevibacter_C,2
f__Methanobacteriaceae,g__UBA412,1


In [58]:
# filtering by taxonomy
hq_genomes = hq_genomes %>%
    filter(grepl('Methanobrevibacter', Genus)) 
hq_genomes

genome,completeness,contamination,Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<dbl>,<dbl>,<chr>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000016525.1_ASM1652v1_genomic.fna,100,0,GCA_000016525.1_ASM1652v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,1.0,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,100.0,1.0,98.09,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000151225.1_ASM15122v1_genomic.fna,100,0,GCA_000151225.1_ASM15122v1_genomic,p__Euryarchaeota (UID3),148,188,125,100,0,⋯,0.95,GCF_000016525.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A smithii,98.32,0.95,98.09,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_015062915.1_ASM1506291v1_genomic.fna,96.8,0,GCA_015062915.1_ASM1506291v1_genomic,p__Euryarchaeota (UID3),148,188,125,96.8,0,⋯,,GCF_003111625.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A thaueri,84.14,0.68,96.99,11,0.9866057450935961,Genome not assigned to closest species as it falls outside its pre-defined ANI radius
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_015062935.1_ASM1506293v1_genomic.fna,97.6,0,GCA_015062935.1_ASM1506293v1_genomic,p__Euryarchaeota (UID3),148,188,125,97.6,0,⋯,,GCA_900314615.1,95.0,d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter_A;s__Methanobrevibacter_A sp900314615,85.62,0.69,97.40,11,0.9815758924163569,Genome not assigned to closest species as it falls outside its pre-defined ANI radius
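
The `grepl('Methanobrevibacter', Genus)` filter is a substring match, so it keeps the suffixed GTDB genera (`_A`, `_B`, `_C`) and drops the single `g__UBA412` genome. The sketch below compares it with an anchored pattern, which is one possible stricter alternative and not what the notebook uses. The label `g__Methanobrevibacterium` is made up to show where the two patterns differ.

```r
# Genera observed in the taxonomy summary, plus one made-up label
genera = c('g__Methanobrevibacter', 'g__Methanobrevibacter_A',
           'g__Methanobrevibacter_B', 'g__Methanobrevibacter_C',
           'g__UBA412',
           'g__Methanobrevibacterium')   # hypothetical, for illustration only

# Substring match, as used in the notebook
loose = grepl('Methanobrevibacter', genera)

# Anchored match: the genus name, optionally followed by a GTDB suffix such as _A
strict = grepl('^g__Methanobrevibacter(_[A-Z]+)?$', genera)

res = data.frame(genera, loose, strict, stringsAsFactors = FALSE)
print(res)
```

For the genera actually present in this dataset, both patterns select the same rows.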


In [60]:
# summarizing
hq_genomes$X..contigs %>% summary_x('No. of contigs')
hq_genomes$Mean.contig.length..bp. %>% summary_x('Mean contig length')
hq_genomes$X..predicted.genes %>% summary_x('No. of genes')
hq_genomes$N50..contigs. %>% summary_x('N50')

Unnamed: 0,Min.,1st Qu.,Median,Mean,3rd Qu.,Max.,sd,sd_err_of_mean
No. of contigs,1,6,35,32.55556,45.5,176,64.436,26.306


Unnamed: 0,Min.,1st Qu.,Median,Mean,3rd Qu.,Max.,sd,sd_err_of_mean
Mean contig length,15742,43325,61125,551174.7,356244.5,2937203,1135233,463456.9


Unnamed: 0,Min.,1st Qu.,Median,Mean,3rd Qu.,Max.,sd,sd_err_of_mean
No. of genes,1585,1752.5,1841,1918.778,1984,2888,459.13,187.439


Unnamed: 0,Min.,1st Qu.,Median,Mean,3rd Qu.,Max.,sd,sd_err_of_mean
N50,83719,103966,127214,674844.8,1103877,2937203,1106067,451550.1


In [61]:
# writing samples table for LLPRIMER
outfile = file.path(work_dir, 'samples_genomes_hq.txt')
hq_genomes %>%
    select(Bin.Id, Fasta) %>%
    rename('Taxon' = Bin.Id) %>%
    mutate(Taxon = gsub('_chromosome.+', '', Taxon),
           Taxon = gsub('_bin_.+', '', Taxon),
           Taxon = gsub('_genomic', '', Taxon),
           Taxon = gsub('_annotated_assembly', '', Taxon),
           Taxid = taxid) %>%
    write_table(outfile)

File written: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/samples_genomes_hq.txt 
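
The chain of `gsub()` calls shortens each `Bin.Id` into a `Taxon` label by stripping assembly-file suffixes. The sketch below applies the same four substitutions, in the same order, to identifiers that appear in the tables above; `clean_taxon` is a hypothetical wrapper name.

```r
# Same substitutions as in the cell above, wrapped in a hypothetical helper
clean_taxon = function(x) {
    x = as.character(x)
    x = gsub('_chromosome.+', '', x)
    x = gsub('_bin_.+', '', x)
    x = gsub('_genomic', '', x)
    x = gsub('_annotated_assembly', '', x)
    x
}

bin_ids = c('GCA_000016525.1_ASM1652v1_genomic',
            'GCA_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic',
            'GCA_902384065.1_UHGG_MGYG-HGUT-02162_genomic')
taxa = clean_taxon(bin_ids)
print(taxa)

# The labels should stay unique after shortening
stopifnot(!any(duplicated(taxa)))
```

The patterns are not anchored to the end of the string, so the uniqueness check is a cheap guard against two genomes collapsing to the same label.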


# Primer design

### Config

In [23]:
F = file.path(work_dir, 'primers', 'config.yaml')
cat_file(F)

#-- I/O --#
samples_file: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/samples_genomes_hq.txt

# output location
output_dir: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/primers/

# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/

#-- software parameters --#
# See the README for a description
params:
  ionice: -c 3
  cgp:
    prodigal: ""    
    mmseqs:
      method: cluster    # or linclust (faster)
      run: --min-seq-id 0.8 -c 0.8
    core_genes: --frac 1 --max-clusters 500
    blastx: -evalue 1e-10 -max_target_seqs 3
    blastx_nontarget: -evalue 1e-5 -max_target_seqs 30
    align:
      method: linsi
      params: --auto --maxiterate 1000
    primer3:
      number: --num-primers 500
      size: --opt-size 20 --min-size 18 --max-size 24
      product: --opt-prod-size 150 --min-prod-size 100 --max-prod-size 200
      Tm: --opt-tm 62 --min-tm 55 --max-

### Run

```
(snakemake) @ rick:/ebio/abt3_projects/software/dev/ll_pipelines/llprimer
$ screen -L  -S llprimer-brevi ./snakemake_sge.sh /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/primers/config.yaml 30 -F
```

### Summary

In [62]:
primer_info = read.delim(file.path(work_dir, 'primers',  'cgp', 'primers_final_info.tsv'), sep='\t')
primer_info %>% unique_n('primer sets', primer_set)
primer_info

No. of unique primer sets: 5 


cluster_id,primer_set,amplicon_size_consensus,amplicon_size_avg,amplicon_size_sd,primer_id,primer_type,sequence,length,degeneracy,⋯,position_start,position_end,Tm_avg,Tm_sd,GC_avg,GC_sd,hairpin_avg,hairpin_sd,homodimer_avg,homodimer_sd
<int>,<int>,<int>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<int>,<int>,⋯,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5,123,112,112,0,123f,PRIMER_LEFT,CDGGWCCTGGTGCACAAGC,19,6,⋯,259,278,62.76498,1.208299,64.91228,2.481076,42.56951,0.00000,20.43330,0.9882
5,123,112,112,0,123r,PRIMER_RIGHT,CCRCCWGGDCKTCCWGTACC,20,48,⋯,351,371,63.62883,2.275517,66.66667,4.249183,45.49754,14.90341,14.79556,18.7045
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
38,471,156,155.037,0.1888526,471f,PRIMER_LEFT,GCYGCWCGTATGGTWATTGC,20,8,⋯,228,248,59.45945,1.2459783,52.5,2.5,0.00000,0.000000,-11.76346,13.11040
38,471,156,155.037,0.1888526,471r,PRIMER_RIGHT,TTACGWGCTCTWGCWCCAGG,20,8,⋯,364,384,60.11770,0.6980054,55.0,0.0,51.45057,8.608382,11.32991,11.09863
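
The `degeneracy` values in this table are consistent with the usual definition: the number of distinct non-degenerate sequences a primer expands to, which is the product of the number of bases each IUPAC code stands for. The sketch below recomputes it for the four primers shown. That LLPRIMER computes the column this way is an assumption, and `degeneracy` is a hypothetical helper.

```r
# Number of bases represented by each IUPAC nucleotide code
iupac_n = c(A = 1, C = 1, G = 1, T = 1,
            R = 2, Y = 2, S = 2, W = 2, K = 2, M = 2,
            B = 3, D = 3, H = 3, V = 3, N = 4)

# Hypothetical helper: product of per-position ambiguity
degeneracy = function(seq) {
    bases = strsplit(toupper(seq), '')[[1]]
    stopifnot(all(bases %in% names(iupac_n)))
    prod(iupac_n[bases])
}

primers = c('CDGGWCCTGGTGCACAAGC',     # 123f
            'CCRCCWGGDCKTCCWGTACC',    # 123r
            'GCYGCWCGTATGGTWATTGC',    # 471f
            'TTACGWGCTCTWGCWCCAGG')    # 471r
degs = vapply(primers, degeneracy, numeric(1), USE.NAMES = FALSE)
print(degs)   # 6 48 8 8, matching the degeneracy column above
```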


In [63]:
gene_annot = read.delim(file.path(work_dir, 'primers', 'cgp', 'core_clusters_blastx.tsv'), 
                        sep='\t') %>%
    mutate(cluster_id = gsub('cluster_', '', cluster_id) %>% as.Num) %>%
    semi_join(primer_info, c('cluster_id')) %>%
    mutate(gene_name = gsub(' \\[.+', '', subject_name),
           gene_taxonomy = gsub('.+\\[', '', subject_name),
           gene_taxonomy = gsub('\\]', '', gene_taxonomy))
gene_annot

cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
5,f06e9f8c6cc84758802515cc79a96221,WP_067145940.1,30S ribosomal protein S11 [Methanobrevibacter olleyae],100.000,117,0,1,351,1,117,1.32e-53,130,393,Methanobrevibacter olleyae,294671,3,30S ribosomal protein S11,Methanobrevibacter olleyae
5,f06e9f8c6cc84758802515cc79a96221,WP_012955709.1,30S ribosomal protein S11 [Methanobrevibacter ruminantium],99.145,117,1,1,351,1,117,3.62e-53,130,393,Methanobrevibacter ruminantium;Methanobrevibacter ruminantium M1,83816;634498,1,30S ribosomal protein S11,Methanobrevibacter ruminantium
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
38,dc805759468048c6ac14c024fd2792f4,WP_042694541.1,30S ribosomal protein S9 [Methanobrevibacter oralis],96.992,133,4,1,399,1,133,6.09e-90,133,402,Methanobrevibacter oralis;Methanobrevibacter oralis JMR01,66851;1415626,1,30S ribosomal protein S9,Methanobrevibacter oralis
38,dc805759468048c6ac14c024fd2792f4,MBE6501453.1,30S ribosomal protein S9 [Methanobrevibacter thaueri],95.489,133,6,1,399,1,133,5.18e-89,133,402,Methanobrevibacter thaueri,190975,2,30S ribosomal protein S9,Methanobrevibacter thaueri
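
The `gene_name` and `gene_taxonomy` columns come from splitting the BLAST subject title at its trailing `[organism]` bracket. The sketch below isolates the three `gsub()` calls so that their effect can be checked on subject names taken from the tables in this notebook; `parse_subject` is a hypothetical wrapper name.

```r
# Same regular expressions as in the cell above, wrapped in a hypothetical helper
parse_subject = function(subject_name) {
    subject_name = as.character(subject_name)
    # drop everything from " [" onward to keep the protein name
    gene_name = gsub(' \\[.+', '', subject_name)
    # drop everything up to the last "[" and the closing "]" to keep the organism
    gene_taxonomy = gsub('.+\\[', '', subject_name)
    gene_taxonomy = gsub('\\]', '', gene_taxonomy)
    data.frame(gene_name, gene_taxonomy, stringsAsFactors = FALSE)
}

subjects = c('30S ribosomal protein S11 [Methanobrevibacter olleyae]',
             '30S ribosomal protein S9, partial [Methanosphaera sp. rholeuAM130]',
             'MULTISPECIES: 30S ribosomal protein S9 [unclassified Methanothermobacter]')
parsed = parse_subject(subjects)
print(parsed)
```

Prefixes such as `MULTISPECIES:` and suffixes such as `, partial` stay in `gene_name`, so grouping by gene may need an extra cleanup step.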


In [67]:
# BLASTX hits to non-target taxa (note: overwrites the target-hit table in gene_annot)
gene_annot = read.delim(file.path(work_dir, 'primers', 'cgp', 'core_clusters_blastx_nontarget.tsv'), 
                        sep='\t') %>%
    mutate(cluster_id = gsub('cluster_', '', cluster_id) %>% as.Num) %>%
    semi_join(primer_info, c('cluster_id')) %>%
    mutate(gene_name = gsub(' \\[.+', '', subject_name),
           gene_taxonomy = gsub('.+\\[', '', subject_name),
           gene_taxonomy = gsub('\\]', '', gene_taxonomy))
gene_annot

cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
5,f06e9f8c6cc84758802515cc79a96221,NYB26310.1,30S ribosomal protein S11 [Methanobacteriaceae archaeon],88.034,117,14,1,351,1,117,1.85e-47,130,393,Methanobacterium sp. PtaU1.Bin097;Methanobacteriaceae archaeon,1811675;2099680,7,30S ribosomal protein S11,Methanobacteriaceae archaeon
5,f06e9f8c6cc84758802515cc79a96221,AXV38297.1,30S ribosomal protein S11 [Methanobacterium sp. BRmetb2],88.889,117,13,1,351,1,117,2.99e-47,130,393,Methanobacterium sp. BRmetb2,2025350,3,30S ribosomal protein S11,Methanobacterium sp. BRmetb2
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
38,dc805759468048c6ac14c024fd2792f4,MBI5459065.1,30S ribosomal protein S9 [Methanobacterium sp.],78.195,133,29,1,399,1,133,2.67e-70,133,402,Methanobacterium sp.,2164,20,30S ribosomal protein S9,Methanobacterium sp.
38,dc805759468048c6ac14c024fd2792f4,NYB52771.1,30S ribosomal protein S9 [Methanobacteriaceae archaeon],78.195,133,29,1,399,1,133,4.48e-70,133,402,Methanobacteriaceae archaeon,2099680,20,30S ribosomal protein S9,Methanobacteriaceae archaeon


In [68]:
# clusters most distinct from non-target taxa: lowest identity among the top-ranked non-target hits
df.dims(10)
gene_annot %>%
    filter(pident_rank == 1) %>%
    arrange(pident) %>%
    head(n=10)
df.dims()

Unnamed: 0_level_0,cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
Unnamed: 0_level_1,<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
1,38,dc805759468048c6ac14c024fd2792f4,RAP54649.1,"30S ribosomal protein S9, partial [Methanosphaera sp. rholeuAM130]",81.25,128,24,16,399,1,128,2.19e-71,128,402,Methanosphaera sp. rholeuAM130,1945578,1,"30S ribosomal protein S9, partial",Methanosphaera sp. rholeuAM130
2,5,f06e9f8c6cc84758802515cc79a96221,NYB52767.1,30S ribosomal protein S11 [Methanobacteriaceae archaeon],89.381,113,12,13,351,7,119,9.29e-46,132,393,Methanobacteriaceae archaeon,2099680,1,30S ribosomal protein S11,Methanobacteriaceae archaeon
3,5,f06e9f8c6cc84758802515cc79a96221,CDG64903.1,30S ribosomal protein S11 [Methanobacterium sp. MB1],89.381,113,12,13,351,7,119,1.1899999999999999e-45,132,393,Methanobacterium sp. MB1,1379702,1,30S ribosomal protein S11,Methanobacterium sp. MB1
4,34,53df7896588e4f308614317a72ecfe65,AXV38276.1,stress response translation initiation inhibitor YciH [Methanobacterium sp. BRmetb2],91.089,101,9,7,309,2,102,9.9e-59,106,348,Methanobacterium sp. BRmetb2,2025350,1,stress response translation initiation inhibitor YciH,Methanobacterium sp. BRmetb2


In [69]:
df.dims(30, 40)
primer_info %>%
    filter(cluster_id %in% c(38)) %>%
    filter(Tm_sd <= 2) %>%
    group_by(primer_set) %>%
    mutate(n = n()) %>%
    ungroup() %>%
    filter(n == 2) %>%
    select(cluster_id, primer_set, amplicon_size_avg, primer_id, sequence, length, degeneracy, 
           position_start, position_end, Tm_avg, Tm_sd)
df.dims()

cluster_id,primer_set,amplicon_size_avg,primer_id,sequence,length,degeneracy,position_start,position_end,Tm_avg,Tm_sd
<int>,<int>,<dbl>,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
38,259,156.037,259f,AGCYGCWCGTATGGTWATTGC,21,8,227,248,60.88055,1.2036566
38,259,156.037,259r,TTACGWGCTCTWGCWCCAGG,20,8,364,384,60.1177,0.6980054
38,283,161.037,283f,GCWGAAGCYGCWCGTATGG,19,8,222,241,60.88961,1.3018699
38,283,161.037,283r,TTACGWGCTCTWGCWCCAGG,20,8,364,384,60.1177,0.6980054
38,471,155.037,471f,GCYGCWCGTATGGTWATTGC,20,8,228,248,59.45945,1.2459783
38,471,155.037,471r,TTACGWGCTCTWGCWCCAGG,20,8,364,384,60.1177,0.6980054
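
The `group_by(primer_set)` / `filter(n == 2)` step keeps a primer set only when both its forward and reverse primers pass the `Tm_sd <= 2` cutoff. The base-R sketch below shows the same logic using `Tm_sd` values from the primer table above. Set 123 belongs to cluster 5 and is included only to show a pair being dropped; the cluster filter is left out for brevity.

```r
# Tm_sd values taken from the primer_info table above
toy = data.frame(
    primer_set = c(123, 123, 471, 471),
    primer_id = c('123f', '123r', '471f', '471r'),
    Tm_sd = c(1.208299, 2.275517, 1.2459783, 0.6980054),
    stringsAsFactors = FALSE
)

# Step 1: drop individual primers with a variable melting temperature
ok = toy[toy$Tm_sd <= 2, ]

# Step 2: keep only sets where both primers survived step 1
n_per_set = table(ok$primer_set)
complete_sets = names(n_per_set)[n_per_set == 2]
kept = ok[as.character(ok$primer_set) %in% complete_sets, ]
print(kept)
```

Primer `123r` exceeds the cutoff, so `123f` is removed along with it and only set 471 remains.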


In [73]:
# writing out the first passing primer pair for cluster 38 (primer set 259; head(n=2) relies on row order)
outF = file.path(work_dir, 'Methanobrevibacter_c38-259f-259r.tsv')
primer_info %>%
    filter(cluster_id %in% c(38)) %>%
    filter(Tm_sd <= 2) %>%
    group_by(primer_set) %>%
    mutate(n = n()) %>%
    ungroup() %>%
    filter(n == 2) %>%
    head(n=2) %>%
    write_table(outF)

File written: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/methanobrevibacter/Methanobrevibacter_c38-259f-259r.tsv 


In [78]:
# all hits to Methanothermobacter
df.dims(40)
gene_annot %>%
    filter(cluster_id == 38) %>%
    filter(grepl('Methanothermo', sscinames)) %>%
    arrange(-pident)
df.dims()

cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
38,dc805759468048c6ac14c024fd2792f4,WP_160323333.1,30S ribosomal protein S9 [Methanothermobacter sp. THM-2],79.699,133,27,1,399,1,133,2.07e-74,133,402,Methanothermobacter sp. THM-2,2606912,8,30S ribosomal protein S9,Methanothermobacter sp. THM-2
38,dc805759468048c6ac14c024fd2792f4,WP_013295348.1,30S ribosomal protein S9 [Methanothermobacter marburgensis],79.699,133,27,1,399,1,133,2.78e-74,133,402,Methanothermobacter marburgensis str. Marburg;Methanothermobacter marburgensis,79929;145263,8,30S ribosomal protein S9,Methanothermobacter marburgensis
38,dc805759468048c6ac14c024fd2792f4,WP_048174846.1,MULTISPECIES: 30S ribosomal protein S9 [unclassified Methanothermobacter],78.947,133,28,1,399,1,133,4.62e-73,133,402,Methanothermobacter sp. CaT2;Methanothermobacter sp. KEPCO-1;unclassified Methanothermobacter,866790;2603820;2631116,16,MULTISPECIES: 30S ribosomal protein S9,unclassified Methanothermobacter


# sessionInfo

In [70]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/tidyverse/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] LeyLabRMisc_0.1.6 ggplot2_3.3.1     tidyr_1.1.0       dplyr_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     magrittr_1.5     munsell_0.5.0    tidyselect_1.1.0
 [5] uuid_0.1-4       colorspace_1.4-1 R6_2.4.1         rlang_0.4.6     
 [9] tools_3.6.3