# Goal

* Design primers targeting core genes of the clade of interest (here, the genus Oscillibacter)

# Var

In [5]:
base_dir = '/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/'
clade = 'Oscillibacter'
taxid = 459786

# Init

In [6]:
library(dplyr)
library(tidyr)
library(ggplot2)
library(LeyLabRMisc)

In [4]:
df.dims()
work_dir = file.path(base_dir, clade)
make_dir(work_dir)

Directory already exists: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted//Oscillibacter 


# Genome download

* Downloading all GenBank genome assemblies for the genus Oscillibacter from NCBI

```
OUTDIR=/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/
mkdir -p "$OUTDIR"
ncbi-genome-download -p 12 -s genbank -F fasta --genera Oscillibacter -o "$OUTDIR" bacteria
```

# Genome quality

* Filtering genomes by quality (CheckM completeness, contamination, and strain heterogeneity) after de-replication with dRep

In [9]:
D = file.path(base_dir, clade, 'genbank')
files = list_files(D, '.fna.gz')
samps = data.frame(Name = files %>% as.character %>% basename,
                   Fasta = files,
                   Domain = 'Bacteria',
                   Taxid = taxid) %>%
    mutate(Name = gsub('\\.fna\\.gz$', '', Name),
           Fasta = gsub('/+', '/', Fasta))
samps

# writing file
outfile = file.path(D, 'samples.txt')
write_table(samps, outfile)

Name,Fasta,Domain,Taxid
<chr>,<chr>,<fct>,<dbl>
GCA_000283575.1_ASM28357v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/genbank/bacteria/GCA_000283575.1/GCA_000283575.1_ASM28357v1_genomic.fna.gz,Bacteria,459786
GCA_000307265.1_ASM30726v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/genbank/bacteria/GCA_000307265.1/GCA_000307265.1_ASM30726v1_genomic.fna.gz,Bacteria,459786
⋮,⋮,⋮,⋮
GCA_015052085.1_ASM1505208v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/genbank/bacteria/GCA_015052085.1/GCA_015052085.1_ASM1505208v1_genomic.fna.gz,Bacteria,459786
GCA_900115635.1_IMG-taxon_2623620509_annotated_assembly_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/genbank/bacteria/GCA_900115635.1/GCA_900115635.1_IMG-taxon_2623620509_annotated_assembly_genomic.fna.gz,Bacteria,459786


File written: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted//Oscillibacter/genbank/samples.txt 
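
Before handing the table to LLG, it can help to confirm that sample names are unique and that each FASTA path matches its name. The following is a minimal base-R sketch, assuming a table with the `Name` and `Fasta` columns shown above. `check_samples()` is a hypothetical helper (not part of LeyLabRMisc or LLG), and the toy paths are shortened versions of the real ones.

```r
# Hypothetical helper: basic sanity checks on a samples table
check_samples = function(samps, suffix = '\\.fna\\.gz$') {
    stopifnot(all(c('Name', 'Fasta') %in% colnames(samps)))
    name  = as.character(samps$Name)
    fasta = as.character(samps$Fasta)
    list(
        n_samples     = nrow(samps),
        # names that occur more than once
        duplicated    = unique(name[duplicated(name)]),
        # paths lacking the expected file suffix
        bad_suffix    = fasta[!grepl(suffix, fasta)],
        # names that differ from the suffix-stripped file name
        name_mismatch = name[sub(suffix, '', basename(fasta)) != name]
    )
}

# Toy table: names from the table above, paths shortened for illustration
toy = data.frame(
    Name  = c('GCA_000283575.1_ASM28357v1_genomic',
              'GCA_000307265.1_ASM30726v1_genomic'),
    Fasta = c('genbank/bacteria/GCA_000283575.1/GCA_000283575.1_ASM28357v1_genomic.fna.gz',
              'genbank/bacteria/GCA_000307265.1/GCA_000307265.1_ASM30726v1_genomic.fna.gz'),
    stringsAsFactors = FALSE
)
res = check_samples(toy)
str(res)
```

Empty `duplicated`, `bad_suffix`, and `name_mismatch` entries indicate that the table passes these checks.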


### LLG

#### Config

In [13]:
cat_file(file.path(work_dir, 'config_llg.yaml'))

# table with genome --> fasta_file information
samples_file: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/genbank/samples.txt

# output location
output_dir: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/

# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/

# batch processing of genomes for certain steps
## increase to better parallelize
batches: 2 

# Domain of genomes ('Archaea' or 'Bacteria')
## Use "Skip" if provided as a "Domain" column in the genome table
Domain: Skip

# software parameters
# Use "Skip" to skip any of these steps. If no params for rule, use ""
# dRep MAGs are not further analyzed, but you can de-rep & then use the de-rep genome table as input.
params:
  ionice: -c 3
  # assembly assessment
  seqkit: ""
  quast: Skip #""
  multiqc_on_quast: "" 
  checkm: ""
  # de-replication (requires checkm)
  drep: -com

#### Run

```
(snakemake) @ rick:/ebio/abt3_projects/software/dev/ll_pipelines/llg
$ screen -L -S llg-osc ./snakemake_sge.sh /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/config_llg.yaml 20 -F
```

### Samples table of high-quality genomes

In [18]:
# checkM summary
checkm = file.path(work_dir, 'LLG_output', 'checkM', 'checkm_qa_summary.tsv') %>%
    read.delim(sep='\t') 
checkm

Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,Strain.heterogeneity,Genome.size..bp.,X..ambiguous.bases,⋯,X0,X1,X2,X3,X4,X5.,assembly.Id,assembler.Id,taxon.Id,File
<fct>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<lgl>,<fct>
GCA_000283575.1_ASM28357v1_genomic,o__Clostridiales (UID1212),172,263,149,98.99,1.01,0,4470622,0,⋯,2,259,2,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|HMP_most-wanted|Oscillibacter|LLG_output|checkM|1|checkm|markers_qa_summary.tsv.1,markers_qa_summary.tsv.1,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/checkM/1/checkm/markers_qa_summary.tsv.1
GCA_000403435.2_Osci_bact_1-3_V1_genomic,o__Clostridiales (UID1212),172,263,149,99.33,1.45,0,4467686,37481,⋯,1,259,3,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|HMP_most-wanted|Oscillibacter|LLG_output|checkM|1|checkm|markers_qa_summary.tsv.2,markers_qa_summary.tsv.2,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/checkM/1/checkm/markers_qa_summary.tsv.2
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
GCA_014799915.1_ASM1479991v1_genomic,o__Clostridiales (UID1212),172,263,149,87.74,2.35,50.00,2875254,19077,⋯,50,209,4,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|HMP_most-wanted|Oscillibacter|LLG_output|checkM|2|checkm|markers_qa_summary.tsv.18,markers_qa_summary.tsv.18,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/checkM/2/checkm/markers_qa_summary.tsv.18
GCA_015052085.1_ASM1505208v1_genomic,o__Clostridiales (UID1212),172,263,149,89.56,11.63,82.61,2465477,0,⋯,45,195,23,0,0,0,|ebio|abt3_projects|software|dev|ll_pipelines|llprimer|experiments|HMP_most-wanted|Oscillibacter|LLG_output|checkM|2|checkm|markers_qa_summary.tsv.19,markers_qa_summary.tsv.19,,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/checkM/2/checkm/markers_qa_summary.tsv.19


In [19]:
# dRep summary
drep = file.path(work_dir, 'LLG_output', 'drep', 'checkm_markers_qa_summary.tsv') %>%
    read.delim(sep='\t') %>%
    mutate(Bin.Id = gsub('.+/', '', genome),
           Bin.Id = gsub('\\.fna$', '', Bin.Id))
drep

genome,completeness,contamination,Bin.Id
<fct>,<dbl>,<dbl>,<chr>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000283575.1_ASM28357v1_genomic.fna,98.99,1.01,GCA_000283575.1_ASM28357v1_genomic
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000403435.2_Osci_bact_1-3_V1_genomic.fna,99.33,1.45,GCA_000403435.2_Osci_bact_1-3_V1_genomic
⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_014799915.1_ASM1479991v1_genomic.fna,87.74,2.35,GCA_014799915.1_ASM1479991v1_genomic
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_015052085.1_ASM1505208v1_genomic.fna,89.56,11.63,GCA_015052085.1_ASM1505208v1_genomic


In [20]:
# de-replicated genomes
drep_gen = file.path(work_dir, 'LLG_output', 'drep', 'dereplicated_genomes.tsv') %>%
    read.delim(sep='\t')
drep_gen

Name,Fasta
<fct>,<fct>
GCA_000765235.1_ASM76523v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_000765235.1_ASM76523v1_genomic.fna
GCA_003525445.1_ASM352544v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_003525445.1_ASM352544v1_genomic.fna
⋮,⋮
GCA_001916835.1_ASM191683v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_001916835.1_ASM191683v1_genomic.fna
GCA_014799925.1_ASM1479992v1_genomic,/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/LLG_output/drep/drep/dereplicated_genomes/GCA_014799925.1_ASM1479992v1_genomic.fna


In [21]:
# GTDBTk summary
tax = file.path(work_dir, 'LLG_output', 'gtdbtk', 'gtdbtk_bac_summary.tsv') %>%
    read.delim(sep='\t') %>%
    separate(classification, 
             c('Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'),
             sep=';') %>%
    select(-note, -classification_method, -pplacer_taxonomy,
           -other_related_references.genome_id.species_name.radius.ANI.AF.)
tax

user_genome,Domain,Phylum,Class,Order,Family,Genus,Species,fastani_reference,fastani_reference_radius,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
GCA_000283575.1_ASM28357v1_genomic,d__Bacteria,p__Firmicutes_A,c__Clostridia,o__Oscillospirales,f__Oscillospiraceae,g__Oscillibacter,s__Oscillibacter valericigenes,GCF_000283575.1,95.0,⋯,1.0,GCF_000283575.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter valericigenes,100.0,1.0,95.38,11,,
GCA_000403435.2_Osci_bact_1-3_V1_genomic,d__Bacteria,p__Firmicutes_A,c__Clostridia,o__Oscillospirales,f__Oscillospiraceae,g__Oscillibacter,s__Oscillibacter sp000403435,GCF_000403435.2,95.0,⋯,0.99,GCF_000403435.2,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter sp000403435,99.99,0.99,95.67,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
GCA_014799915.1_ASM1479991v1_genomic,d__Bacteria,p__Firmicutes_A,c__Clostridia,o__Oscillospirales,f__Oscillospiraceae,g__,s__,,,⋯,,,,,,,79.35,11,0.9173648727242681,
GCA_015052085.1_ASM1505208v1_genomic,d__Bacteria,p__Firmicutes_A,c__Clostridia,o__Oscillospirales,f__Oscillospiraceae,g__Oscillibacter,s__Oscillibacter welbionis,GCF_005121165.1,95.0,⋯,0.88,GCF_005121165.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter welbionis,98.68,0.88,76.73,11,,
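
The `separate()` call above splits the semicolon-delimited GTDB-Tk `classification` string into one column per rank. As a minimal base-R sketch of the same idea on a single string (copied from the output above), with the optional prefix-stripping step being an illustration rather than something the notebook does:

```r
# Classification string copied from the GTDB-Tk output above
classification = paste0('d__Bacteria;p__Firmicutes_A;c__Clostridia;',
                        'o__Oscillospirales;f__Oscillospiraceae;',
                        'g__Oscillibacter;s__Oscillibacter valericigenes')
ranks = c('Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species')

# Split on ';' and label each field with its rank
fields = strsplit(classification, ';', fixed = TRUE)[[1]]
stopifnot(length(fields) == length(ranks))
taxonomy = setNames(fields, ranks)

# Illustrative extra step: strip the rank prefix (e.g. 'g__');
# an unclassified rank such as 'g__' would become an empty string
taxonomy_clean = setNames(sub('^[a-z]__', '', fields), ranks)
print(taxonomy_clean[c('Genus', 'Species')])
```

Keeping the prefixes, as the notebook does, makes unclassified ranks (`g__`, `s__`) easy to spot in later summaries.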


In [22]:
# checking overlap of genome IDs among the dRep, checkM, and GTDB-Tk tables
cat('-- drep --\n')
overlap(basename(as.character(drep_gen$Fasta)), 
        basename(as.character(drep$genome)))
cat('-- checkm --\n')
overlap(drep$Bin.Id, checkm$Bin.Id)
cat('-- gtdbtk --\n')
overlap(drep$Bin.Id, tax$user_genome)

-- drep --
intersect(x,y): 16 
setdiff(x,y): 0 
setdiff(y,x): 23 
union(x,y): 39 
-- checkm --
intersect(x,y): 39 
setdiff(x,y): 0 
setdiff(y,x): 0 
union(x,y): 39 
-- gtdbtk --
intersect(x,y): 39 
setdiff(x,y): 0 
setdiff(y,x): 0 
union(x,y): 39 
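
The counts above can be reproduced with base-R set operations. The sketch below is a hypothetical stand-in for `overlap()`; how LeyLabRMisc actually implements it is an assumption here, and the genome IDs are toy values.

```r
# Hypothetical stand-in for the set summary printed by overlap()
overlap_counts = function(x, y) {
    x = unique(as.character(x))
    y = unique(as.character(y))
    c(intersect  = length(intersect(x, y)),
      setdiff_xy = length(setdiff(x, y)),
      setdiff_yx = length(setdiff(y, x)),
      union      = length(union(x, y)))
}

# Toy IDs: 5 genomes in total, 2 of them kept after de-replication
all_genomes = paste0('genome_', 1:5)
derep_genomes = all_genomes[c(1, 3)]
counts = overlap_counts(derep_genomes, all_genomes)
print(counts)
```

A `setdiff(x,y)` of zero with a non-zero `setdiff(y,x)`, as in the dRep comparison above (16 shared, 23 only in `y`), is the pattern expected when the de-replicated genomes are a strict subset of all genomes.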


In [23]:
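# joining the checkM, de-replicated genome, and GTDB-Tk tables onto the dRep table
# (the inner join with `drep_gen` on the fasta file name keeps only de-replicated genomes)
drep = drep %>%
    inner_join(checkm, by=c('Bin.Id')) %>%
    mutate(GEN = genome %>% as.character %>% basename) %>%
    inner_join(drep_gen %>% mutate(GEN = Fasta %>% as.character %>% basename),
               by=c('GEN')) %>%
    inner_join(tax, by=c('Bin.Id'='user_genome'))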
drep

genome,completeness,contamination,Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<dbl>,<dbl>,<chr>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000283575.1_ASM28357v1_genomic.fna,98.99,1.01,GCA_000283575.1_ASM28357v1_genomic,o__Clostridiales (UID1212),172,263,149,98.99,1.01,⋯,1.0,GCF_000283575.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter valericigenes,100.0,1.0,95.38,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000403435.2_Osci_bact_1-3_V1_genomic.fna,99.33,1.45,GCA_000403435.2_Osci_bact_1-3_V1_genomic,o__Clostridiales (UID1212),172,263,149,99.33,1.45,⋯,0.99,GCF_000403435.2,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter sp000403435,99.99,0.99,95.67,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_003497655.1_ASM349765v1_genomic.fna,91.14,0.00,GCA_003497655.1_ASM349765v1_genomic,o__Clostridiales (UID1212),172,263,149,91.14,0.00,⋯,0.82,GCF_000765235.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__ER4;s__ER4 sp000765235,96.71,0.82,91.90,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_009774015.1_ASM977401v1_genomic.fna,91.03,3.68,GCA_009774015.1_ASM977401v1_genomic,o__Clostridiales (UID1212),172,263,149,91.03,3.68,⋯,,GCF_000403435.2,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter sp000403435,82.6,0.59,83.51,11,0.9623815263719147,


In [24]:
# filtering by quality: >=90% completeness, <5% contamination, <50% strain heterogeneity
hq_genomes = drep %>%
    filter(completeness >= 90,
           contamination < 5,
           Strain.heterogeneity < 50)
hq_genomes

genome,completeness,contamination,Bin.Id,Marker.lineage,X..genomes,X..markers,X..marker.sets,Completeness,Contamination,⋯,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,msa_percent,translation_table,red_value,warnings
<fct>,<dbl>,<dbl>,<chr>,<fct>,<int>,<int>,<int>,<dbl>,<dbl>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,<fct>,<fct>
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000283575.1_ASM28357v1_genomic.fna,98.99,1.01,GCA_000283575.1_ASM28357v1_genomic,o__Clostridiales (UID1212),172,263,149,98.99,1.01,⋯,1.0,GCF_000283575.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter valericigenes,100.0,1.0,95.38,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_000403435.2_Osci_bact_1-3_V1_genomic.fna,99.33,1.45,GCA_000403435.2_Osci_bact_1-3_V1_genomic,o__Clostridiales (UID1212),172,263,149,99.33,1.45,⋯,0.99,GCF_000403435.2,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter sp000403435,99.99,0.99,95.67,11,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_003497655.1_ASM349765v1_genomic.fna,91.14,0.00,GCA_003497655.1_ASM349765v1_genomic,o__Clostridiales (UID1212),172,263,149,91.14,0.00,⋯,0.82,GCF_000765235.1,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__ER4;s__ER4 sp000765235,96.71,0.82,91.90,11,,
/ebio/abt3_scratch/nyoungblut/LLG_62325884640/genomes/GCA_009774015.1_ASM977401v1_genomic.fna,91.03,3.68,GCA_009774015.1_ASM977401v1_genomic,o__Clostridiales (UID1212),172,263,149,91.03,3.68,⋯,,GCF_000403435.2,95.0,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter sp000403435,82.6,0.59,83.51,11,0.9623815263719147,
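
To make the effect of the three thresholds concrete, here is a minimal base-R sketch applied to four rows whose values are copied from the CheckM table above. The rows are used purely to illustrate the thresholds; in the notebook the filter runs on the joined, de-replicated table.

```r
# Values copied from the CheckM summary shown earlier
qc = data.frame(
    Bin.Id = c('GCA_000283575.1_ASM28357v1_genomic',
               'GCA_000403435.2_Osci_bact_1-3_V1_genomic',
               'GCA_014799915.1_ASM1479991v1_genomic',
               'GCA_015052085.1_ASM1505208v1_genomic'),
    completeness         = c(98.99, 99.33, 87.74, 89.56),
    contamination        = c(1.01, 1.45, 2.35, 11.63),
    Strain.heterogeneity = c(0, 0, 50.00, 82.61),
    stringsAsFactors = FALSE
)

# Same thresholds as the filter() call above
passes = qc$completeness >= 90 &
         qc$contamination < 5 &
         qc$Strain.heterogeneity < 50
hq_ids = qc$Bin.Id[passes]
print(hq_ids)
```

The first two genomes pass. The third fails on completeness (87.74) and on strain heterogeneity (50 is not below 50), and the fourth fails on all three criteria.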


In [25]:
# summarizing the taxonomy
df.dims(20)
hq_genomes %>%
    group_by(Family, Genus) %>%
    summarize(n_genomes = n(), .groups='drop')
df.dims()

Family,Genus,n_genomes
<chr>,<chr>,<int>
f__Oscillospiraceae,g__,1
f__Oscillospiraceae,g__CAG-83,1
f__Oscillospiraceae,g__ER4,3
f__Oscillospiraceae,g__Oscillibacter,9


In [26]:
# writing samples table for LLPRIMER
outfile = file.path(work_dir, 'samples_genomes_hq.txt')
hq_genomes %>%
    select(Bin.Id, Fasta) %>%
    rename('Taxon' = Bin.Id) %>%
    mutate(Taxon = gsub('_chromosome.+', '', Taxon),
           Taxon = gsub('_bin_.+', '', Taxon),
           Taxon = gsub('_genomic', '', Taxon),
           Taxon = gsub('_annotated_assembly', '', Taxon),
           Taxid = taxid) %>%
    write_table(outfile)

File written: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted//Oscillibacter/samples_genomes_hq.txt 
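
The `gsub()` chain above shortens the genome names before they are used as `Taxon` labels. A minimal base-R sketch of the same substitutions, applied to two names from the samples table, shows the result and checks that the shortened names stay unique. `clean_taxon()` is a hypothetical wrapper introduced only for this illustration.

```r
# Hypothetical wrapper around the same substitutions, in the same order
clean_taxon = function(x) {
    x = gsub('_chromosome.+', '', x)
    x = gsub('_bin_.+', '', x)
    x = gsub('_genomic', '', x)
    x = gsub('_annotated_assembly', '', x)
    x
}

# Names copied from the samples table above
bin_ids = c('GCA_000283575.1_ASM28357v1_genomic',
            'GCA_900115635.1_IMG-taxon_2623620509_annotated_assembly_genomic')
taxa = clean_taxon(bin_ids)
print(taxa)

# Shortening must not collapse two genomes onto the same label
stopifnot(!any(duplicated(taxa)))
```

Because each name still begins with its assembly accession, the labels remain unique after the suffixes are removed.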


# Primer design

### Config

In [37]:
F = file.path(work_dir, 'primers', 'config.yaml')
cat_file(F)

#-- I/O --#
samples_file: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/samples_genomes_hq.txt

# output location
output_dir: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/experiments/HMP_most-wanted/Oscillibacter/primers/

# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/

#-- software parameters --#
# See the README for a description
params:
  ionice: -c 3
  cgp:
    prodigal: ""    
    mmseqs:
      method: cluster    # or linclust (faster)
      run: --min-seq-id 0.8 -c 0.8
    core_genes: --frac 1 --max-clusters 500
    blastx: -evalue 1e-10 -max_target_seqs 3
    blastx_nontarget: -evalue 1e-5 -max_target_seqs 30
    align:
      method: linsi
      params: --auto --maxiterate 1000
    primer3:
      number: --num-primers 500
      size: --opt-size 20 --min-size 18 --max-size 24
      product: --opt-prod-size 150 --min-prod-size 100 --max-prod-size 200
      Tm: --opt-tm

### Run

```
(snakemake) @ rick:/ebio/abt3_projects/software/dev/ll_pipelines/llprimer
$ screen -L -S llprimer-Osc ./snakemake_sge.sh experiments/HMP_most-wanted/Oscillibacter/primers/config.yaml 50 --notemp -F
```

## Summary

### Primers

In [15]:
primer_info = read.delim(file.path(work_dir, 'primers', 'cgp', 'primers_final_info.tsv'), sep='\t')
primer_info %>% unique_n('primers', primer_set)
primer_info

No. of unique primers: 9 


cluster_id,primer_set,amplicon_size_consensus,amplicon_size_avg,amplicon_size_sd,primer_id,primer_type,sequence,length,degeneracy,⋯,position_start,position_end,Tm_avg,Tm_sd,GC_avg,GC_sd,hairpin_avg,hairpin_sd,homodimer_avg,homodimer_sd
<int>,<int>,<int>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<int>,<int>,⋯,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
20,6,200,200,0,6f,PRIMER_LEFT,GGYGCCAARGARATYAAGTG,20,16,⋯,42,62,57.85193,2.173835,50.00000,5.000000,0.00000,0.00000,-1.047273,28.68951
20,6,200,200,0,6r,PRIMER_RIGHT,TCRTCRAAVCGGACATAGGT,20,12,⋯,222,242,58.63174,1.853255,48.33333,4.249183,29.63751,21.23943,-10.153331,19.60310
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
36,472,158,158,0,472f,PRIMER_LEFT,GBTGGAACCCYAARATGGC,19,12,⋯,73,92,58.54693,2.003485,56.14035,4.472824,35.24887,25.37298,-35.758653,16.38444
36,472,158,158,0,472r,PRIMER_RIGHT,CTTCTTSGTDCCVACGAACA,20,18,⋯,211,231,58.88266,1.695892,50.00000,3.333333,40.93949,22.72916,8.310367,9.29436
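
The `degeneracy` column appears to count the number of distinct non-degenerate sequences that a primer encodes, i.e. the product of the number of bases allowed by each IUPAC code. That LLPRIMER computes it exactly this way is an assumption, but the base-R sketch below reproduces the values for the four primers shown above.

```r
# Number of bases represented by each IUPAC nucleotide code
iupac_n = c(A = 1, C = 1, G = 1, T = 1,
            R = 2, Y = 2, S = 2, W = 2, K = 2, M = 2,
            B = 3, D = 3, H = 3, V = 3, N = 4)

# Degeneracy = product of per-position base counts
degeneracy = function(primer) {
    bases = strsplit(toupper(primer), '')[[1]]
    stopifnot(all(bases %in% names(iupac_n)))
    prod(iupac_n[bases])
}

# Primer sequences copied from the table above
primers = c('6f'   = 'GGYGCCAARGARATYAAGTG',
            '6r'   = 'TCRTCRAAVCGGACATAGGT',
            '472f' = 'GBTGGAACCCYAARATGGC',
            '472r' = 'CTTCTTSGTDCCVACGAACA')
deg = sapply(primers, degeneracy)
print(deg)
```

For example, `6f` contains two `Y` and two `R` codes, each covering two bases, giving 2 × 2 × 2 × 2 = 16.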


### Gene cluster annotations

In [27]:
gene_annot = read.delim(file.path(work_dir, 'primers', 'cgp', 'core_clusters_blastx.tsv'), 
                        sep='\t') %>%
    mutate(cluster_id = gsub('cluster_', '', cluster_id) %>% as.Num) %>%
    semi_join(primer_info, c('cluster_id')) %>%
    mutate(gene_name = gsub(' \\[.+', '', subject_name),
           gene_taxonomy = gsub('.+\\[', '', subject_name),
           gene_taxonomy = gsub('\\]', '', gene_taxonomy))
gene_annot

cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
20,888b0c0302684d39b00941d56aa58710,WP_040663505.1,50S ribosomal protein L14 [Oscillibacter ruminantium],100.00,122,0,1,366,1,122,4.23e-81,122,369,Oscillibacter ruminantium GH1;Oscillibacter ruminantium,1007096;1263547,3,50S ribosomal protein L14,Oscillibacter ruminantium
20,888b0c0302684d39b00941d56aa58710,WP_187014712.1,50S ribosomal protein L14 [Dysosmobacter sp. BX15],99.18,122,1,1,366,1,122,2.03e-80,122,369,Oscillibacter sp. 57_20;Dysosmobacter sp. BX15,1897011;2763042,1,50S ribosomal protein L14,Dysosmobacter sp. BX15
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
36,72fb21e0be894548a80af498875e107b,WP_040661538.1,30S ribosomal protein S2 [Oscillibacter ruminantium],99.142,233,2,1,699,1,233,3.85e-173,244,759,Oscillibacter ruminantium GH1;Oscillibacter ruminantium,1007096;1263547,1,30S ribosomal protein S2,Oscillibacter ruminantium
36,72fb21e0be894548a80af498875e107b,WP_187028606.1,30S ribosomal protein S2 [Dysosmobacter sp. NSJ-60],93.562,233,15,1,699,1,233,1.25e-163,241,759,Dysosmobacter sp. NSJ-60,2763041,2,30S ribosomal protein S2,Dysosmobacter sp. NSJ-60
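
The `gene_name` and `gene_taxonomy` columns are derived from `subject_name` with regular expressions that split on the trailing `[organism]` tag. A minimal base-R sketch of that parsing follows. `parse_subject()` is a hypothetical helper, and the third input (`'hypothetical protein'`) is a made-up name without brackets, included to show an edge case.

```r
# Hypothetical helper using the same regular expressions as above
parse_subject = function(subject_name) {
    subject_name = as.character(subject_name)
    data.frame(
        # everything before ' [' is the gene product name
        gene_name = gsub(' \\[.+', '', subject_name),
        # everything inside the last '[...]' is the source organism
        gene_taxonomy = gsub('\\]', '', gsub('.+\\[', '', subject_name)),
        stringsAsFactors = FALSE
    )
}

subjects = c('50S ribosomal protein L14 [Oscillibacter ruminantium]',  # from the table above
             '30S ribosomal protein S2 [Dysosmobacter sp. NSJ-60]',    # from the table above
             'hypothetical protein')                                   # made-up, no brackets
parsed = parse_subject(subjects)
print(parsed)
```

When a subject name has no bracketed organism, neither pattern matches, so `gene_taxonomy` ends up identical to `gene_name`. Such rows may need separate handling if the taxonomy column is summarized.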


In [29]:
df.dims(50)
gene_annot %>%
    distinct(cluster_id, gene_name) 
df.dims()

cluster_id,gene_name
<dbl>,<chr>
20,50S ribosomal protein L14
23,50S ribosomal protein L20
36,30S ribosomal protein S2


In [30]:
df.dims(50)
gene_annot %>%
    distinct(cluster_id, gene_taxonomy) 
df.dims()

cluster_id,gene_taxonomy
<dbl>,<chr>
20,Oscillibacter ruminantium
20,Dysosmobacter sp. BX15
20,Oscillibacter valericigenes
23,Oscillibacter sp. 1-3
23,Oscillibacter sp.
23,Oscillibacter sp. CAG:155
36,Oscillibacter valericigenes
36,Oscillibacter ruminantium
36,Dysosmobacter sp. NSJ-60


### Gene clusters: closest related

In [31]:
gene_annot = read.delim(file.path(work_dir, 'primers', 'cgp', 'core_clusters_blastx_nontarget.tsv'), 
                        sep='\t') %>%
    mutate(cluster_id = gsub('cluster_', '', cluster_id) %>% as.Num) %>%
    semi_join(primer_info, c('cluster_id')) %>%
    mutate(gene_name = gsub(' \\[.+', '', subject_name),
           gene_taxonomy = gsub('.+\\[', '', subject_name),
           gene_taxonomy = gsub('\\]', '', gene_taxonomy))
gene_annot

cluster_id,query,subject,subject_name,pident,length,mismatch,qstart,qend,sstart,send,evalue,slen,qlen,sscinames,staxids,pident_rank,gene_name,gene_taxonomy
<dbl>,<fct>,<fct>,<fct>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<fct>,<fct>,<int>,<chr>,<chr>
20,888b0c0302684d39b00941d56aa58710,WP_187014712.1,50S ribosomal protein L14 [Dysosmobacter sp. BX15],99.180,122,1,1,366,1,122,2.03e-80,122,369,Dysosmobacter sp. BX15,2763042,1,50S ribosomal protein L14,Dysosmobacter sp. BX15
20,888b0c0302684d39b00941d56aa58710,WP_187028153.1,50S ribosomal protein L14 [Dysosmobacter sp. NSJ-60],95.082,122,6,1,366,1,122,1.23e-77,122,369,Dysosmobacter sp. NSJ-60,2763041,3,50S ribosomal protein L14,Dysosmobacter sp. NSJ-60
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
36,72fb21e0be894548a80af498875e107b,WP_050618278.1,30S ribosomal protein S2 [Intestinimonas massiliensis],84.581,227,35,13,693,2,228,1.92e-144,247,759,Intestinimonas massiliensis,1673721,29,30S ribosomal protein S2,Intestinimonas massiliensis
36,72fb21e0be894548a80af498875e107b,CUQ23443.1,30S ribosomal protein S2 [Flavonifractor plautii],84.581,227,35,13,693,2,228,2.37e-144,247,759,Flavonifractor plautii;uncultured Flavonifractor sp.,292800;1193534,29,30S ribosomal protein S2,Flavonifractor plautii


In [36]:
df.dims(50)
gene_annot %>%
    filter(pident > 80,
           pident_rank <= 3) %>%
    select(cluster_id, pident, gene_name, gene_taxonomy)
df.dims()

cluster_id,pident,gene_name,gene_taxonomy
<dbl>,<dbl>,<chr>,<chr>
20,99.18,50S ribosomal protein L14,Dysosmobacter sp. BX15
20,95.082,50S ribosomal protein L14,Dysosmobacter sp. NSJ-60
20,95.902,50S ribosomal protein L14,Ruminococcaceae bacterium
20,95.082,50S ribosomal protein L14,Ruminococcaceae bacterium
23,91.379,MULTISPECIES: 50S ribosomal protein L20,Oscillospiraceae
23,90.517,50S ribosomal protein L20,Clostridiales bacterium
23,89.655,50S ribosomal protein L20,Ruminococcaceae bacterium
23,89.655,50S ribosomal protein L20,Dysosmobacter sp. BX15
36,93.562,30S ribosomal protein S2,Dysosmobacter sp. NSJ-60
36,93.103,MULTISPECIES: 30S ribosomal protein S2,Oscillospiraceae
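
One way to read this table is to take, for each gene cluster, the highest percent identity among the listed non-target hits. The base-R sketch below does this with the `cluster_id` and `pident` values copied from the output above. Treating a lower maximum as a sign of more protein-level divergence from non-targets is a rough heuristic, not something the pipeline reports, and protein identity does not translate directly into primer specificity.

```r
# cluster_id and pident values copied from the filtered table above
hits = data.frame(
    cluster_id = c(20, 20, 20, 20, 23, 23, 23, 23, 36, 36),
    pident     = c(99.18, 95.082, 95.902, 95.082,
                   91.379, 90.517, 89.655, 89.655,
                   93.562, 93.103)
)

# Highest identity to any listed non-target hit, per cluster
best = aggregate(pident ~ cluster_id, data = hits, FUN = max)

# Clusters ordered from most to least divergent best hit
best = best[order(best$pident), ]
print(best)
```

In this subset, cluster 23 (50S ribosomal protein L20) has the most divergent best hit (91.379%), while cluster 20 (50S ribosomal protein L14) has a near-identical hit (99.18%).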


In [32]:
df.dims(50)
gene_annot %>%
    distinct(cluster_id, gene_name) 
df.dims()

cluster_id,gene_name
<dbl>,<chr>
20,50S ribosomal protein L14
20,MULTISPECIES: 50S ribosomal protein L14
23,MULTISPECIES: 50S ribosomal protein L20
23,50S ribosomal protein L20
23,
36,30S ribosomal protein S2
36,MULTISPECIES: 30S ribosomal protein S2
36,


# sessionInfo

In [28]:
pipelineInfo('/ebio/abt3_projects/software/dev/ll_pipelines/llg/')

LLG
===

Ley Lab Genome analysis pipeline (LLG)

* Version: 0.1.9
* Authors:
  * Nick Youngblut <nyoungb2@gmail.com>
* Maintainers:
  * Nick Youngblut <nyoungb2@gmail.com>

--- conda envs ---
==> /ebio/abt3_projects/software/dev/ll_pipelines/llg//bin/envs/gtdbtk.yaml <==
channels:
- conda-forge
- bioconda
dependencies:
- pigz
- bioconda::gtdbtk

==> /ebio/abt3_projects/software/dev/ll_pipelines/llg//bin/envs/checkm.yaml <==
channels:
- bioconda
dependencies:
- python=2.7
- pigz
- bioconda::prodigal
- bioconda::pplacer
- bioconda::checkm-genome

==> /ebio/abt3_projects/software/dev/ll_pipelines/llg//bin/envs/quast.yaml <==
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::seqkit
- bioconda::quast>=5.0.0

==> /ebio/abt3_projects/software/dev/ll_pipelines/llg//bin/envs/sourmash.yaml <==
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::sourmash=2.0.0a4

==> /ebio/abt3_projects/software/dev/ll_pipelines/llg//bin/envs/fastqc.yaml <==
channels:
- conda-forge
- bioconda


In [29]:
pipelineInfo('/ebio/abt3_projects/software/dev/ll_pipelines/llprimer/')

LLPRIMER

Ley Lab Primer generation pipeline (LLPRIMER)

* Version: 0.2.2
* Authors:
  * Nick Youngblut <nyoungb2@gmail.com>
* Maintainers:
  * Nick Youngblut <nyoungb2@gmail.com>

--- conda envs ---
==> /ebio/abt3_projects/software/dev/ll_pipelines/llprimer//bin/envs/pdp.yaml <==
channels:
- conda-forge
- bioconda
dependencies:
- python=3.7
- intervaltree
- prodigal
- blast
- bedtools
- mafft
- mummer=3.23
- emboss
- primer3=1.1.4
- biopython<1.78
- pybedtools
- joblib
- tqdm
- openpyxl

==> /ebio/abt3_projects/software/dev/ll_pipelines/llprimer//bin/envs/genes.yaml <==
channels:
- bioconda
dependencies:
- pigz
- python=3
- numpy
- pyfaidx
- bioconda::seqkit
- bioconda::fasta-splitter
- bioconda::vsearch
- bioconda::prodigal
- bioconda::mmseqs2

==> /ebio/abt3_projects/software/dev/ll_pipelines/llprimer//bin/envs/aln.yaml <==
channels:
- bioconda
- conda-forge
dependencies:
- pigz
- bioconda::kalign3
- bioconda::mafft

==> /ebio/abt3_projects/software/dev/ll_pipelines/llprimer//bin/env

In [30]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/tidyverse/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=C.UTF-8    LC_NUMERIC=C        LC_TIME=C          
 [4] LC_COLLATE=C        LC_MONETARY=C       LC_MESSAGES=C      
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C       
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] LeyLabRMisc_0.1.6 ggplot2_3.3.1     tidyr_1.1.0       dplyr_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     magrittr_1.5     munsell_0.5.0    tidyselect_1.1.0
 [5] uuid_0.1-4       colorspace_1.4-1 R6_2.4.1         rlang_0.4.6     
 [9] tools_3.6.3      grid_3.6.3       gtable_0.3.0     withr_2.2.0     
[13] htmltools_0.4.0  ellipsis_0.3.1 