
# Goal

* Create feature tables for the [Almeida et al., 2019](https://doi.org/10.1038/s41586-019-0965-1) dataset

# Var

In [1]:
work_dir = '/ebio/abt3_projects/databases_no-backup/DeepMAsED/MAG_datasets/AlmeidaA-2019/'

# CheckM results for all MAGs
checkm_res_file = file.path(work_dir, 'mags-gut_qs50_checkm.tab')

# Init

In [2]:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(18734)



# Load

In [3]:
checkm_res = read.delim(checkm_res_file, sep='\t') 
checkm_res %>% nrow %>% print
checkm_res %>% head(n=3)

[1] 92143


| MAG | Completeness | Contamination | Strain_heterogeneity | CheckM_lineage |
|---|---|---|---|---|
| SRR3496379_bin.19 | 96.98 | 0.67 | 0 | k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales |
| SRR3496379_bin.31 | 97.09 | 1.8 | 0 | k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae |
| SRR3496379_bin.37 | 78.63 | 0.17 | 100 | k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales |


# MG samples

In [4]:
# getting NCBI accessions
checkm_res = checkm_res %>%
    mutate(acc = gsub('_bin\\.[0-9]+$', '', MAG)) 

checkm_res$acc %>% unique %>% length %>% print

[1] 10902
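
The accession extraction above can be checked on a few identifiers. This is a minimal sketch in base R using toy MAG names in the same `<accession>_bin.<number>` form; the `no_match` check is an added assumption about what is worth verifying, not part of the original notebook.

```r
# Toy MAG identifiers in the same '<run accession>_bin.<number>' form as above
mags <- c('SRR3496379_bin.19', 'SRR3496379_bin.31', 'ERR1018311_bin.53')

# Same regex as in the cell above: drop the trailing '_bin.<digits>' suffix
acc <- gsub('_bin\\.[0-9]+$', '', mags)
print(acc)

# Sanity check: list any IDs that did not match the pattern.
# Such IDs would pass through gsub() unchanged and be counted as accessions.
no_match <- mags[!grepl('_bin\\.[0-9]+$', mags)]
print(length(no_match))
```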


In [5]:
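# randomly selecting a subset of samples
# NOTE: `acc` has one entry per MAG, so this draws rows rather than distinct samples;
# samples with more MAGs are more likely to be drawn, and accessions can repeat
acc_sub1 = sample(checkm_res$acc, 100)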
acc_sub1 %>% length %>% print
acc_sub1 %>% sort %>% head

[1] 100
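
Because `checkm_res$acc` holds one entry per MAG, the draw above may return fewer than 100 distinct accessions. The following is a sketch of the difference on toy data, not a change to the analysis. The accession names `S1` to `S5` are made up, and sampling from `unique()` is one possible alternative rather than what the notebook ran.

```r
set.seed(18734)

# Toy accession column with one entry per MAG, so accessions repeat
acc <- c(rep('S1', 50), rep('S2', 5), 'S3', 'S4', 'S5')

# Sampling rows: accessions with many MAGs are more likely to be drawn,
# and the same accession can be drawn more than once
acc_by_row <- sample(acc, 3)
print(length(unique(acc_by_row)))

# Sampling distinct accessions: each sample is equally likely, no repeats
acc_by_sample <- sample(unique(acc), 3)
print(length(unique(acc_by_sample)))
```

Checking `length(unique(acc_sub1))` on the real data would show whether any accession was drawn more than once.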


In [6]:
# number of MAGs belonging to the selected samples
checkm_res_f = checkm_res %>%
    filter(acc %in% acc_sub1) 

checkm_res_f %>% nrow

In [7]:
# CheckM stats: all MAGs
checkm_res %>%
    dplyr::select(Completeness, Contamination, Strain_heterogeneity) %>%
    summary

# CheckM stats: MAGs from the selected samples only
checkm_res_f %>%
    dplyr::select(Completeness, Contamination, Strain_heterogeneity) %>%
    summary

  Completeness    Contamination   Strain_heterogeneity
 Min.   : 50.03   Min.   :0.000   Min.   :  0.00      
 1st Qu.: 75.44   1st Qu.:0.220   1st Qu.:  0.00      
 Median : 88.77   Median :1.070   Median :  0.00      
 Mean   : 84.34   Mean   :1.563   Mean   : 14.67      
 3rd Qu.: 95.33   3rd Qu.:2.260   3rd Qu.: 20.00      
 Max.   :100.00   Max.   :9.730   Max.   :100.00      

  Completeness    Contamination   Strain_heterogeneity
 Min.   : 50.04   Min.   :0.000   Min.   :  0.00      
 1st Qu.: 76.49   1st Qu.:0.340   1st Qu.:  0.00      
 Median : 88.22   Median :1.200   Median :  0.00      
 Mean   : 84.37   Mean   :1.723   Mean   : 13.42      
 3rd Qu.: 94.67   3rd Qu.:2.510   3rd Qu.: 16.67      
 Max.   :100.00   Max.   :9.430   Max.   :100.00      

## Writing samples file

In [10]:
samples_file = file.path(work_dir, 'samples_n100.txt')
checkm_res_f %>%
    rename('Sample' = acc) %>%
    mutate(Remote = Sample) %>%
    distinct(Sample, Remote) %>%
    mutate(Run = 1, Lane = 1) %>%
    write.table(file=samples_file, sep='\t', quote=FALSE, row.names=FALSE)
cat('File written:', samples_file, '\n')

File written: /ebio/abt3_projects/databases_no-backup/DeepMAsED/MAG_datasets/AlmeidaA-2019//samples_n100.txt 
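
To make the layout of the samples file concrete, here is a small base-R sketch that builds the same four columns from toy data, writes them to a temporary file, and reads them back. The toy data frame `checkm_toy` and the temporary path are illustrative assumptions; the column names `Sample`, `Remote`, `Run`, and `Lane` come from the cell above.

```r
# Toy stand-in for checkm_res_f: one row per MAG, so accessions repeat
checkm_toy <- data.frame(
    MAG = c('ERR1018311_bin.53', 'ERR1018311_bin.57', 'SRR2047628_bin.7'),
    acc = c('ERR1018311', 'ERR1018311', 'SRR2047628'),
    stringsAsFactors = FALSE
)

# Base-R equivalent of the pipeline above: one row per distinct accession
samples <- data.frame(Sample = unique(checkm_toy$acc), stringsAsFactors = FALSE)
samples$Remote <- samples$Sample   # Remote is set to the accession, as in the cell above
samples$Run <- 1
samples$Lane <- 1

# Write to a temporary file, then read it back to confirm the layout
tmp_file <- tempfile(fileext = '.txt')
write.table(samples, file=tmp_file, sep='\t', quote=FALSE, row.names=FALSE)
samples_check <- read.delim(tmp_file, sep='\t', stringsAsFactors = FALSE)
print(samples_check)
```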


# LLMGQC

* Download and quality-control (QC) the selected metagenome (MG) samples

```{bash}
(snakemake_dev) @ rick:/ebio/abt3_projects/databases_no-backup/DeepMAsED/bin/llmgqc
$ screen -L -S llmgqc-DM ./snakemake_sge.sh /ebio/abt3_projects/databases_no-backup/DeepMAsED/MAG_datasets/AlmeidaA-2019/LLMGQC/config.yaml cluster.json /ebio/abt3_projects/databases_no-backup/DeepMAsED/MAG_datasets/AlmeidaA-2019/LLMGQC/SGE_log 20
```

# Creating genome-sample map table

* for mapping reads to genomes
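
The notebook does not show how this table was built. One plausible sketch, assuming each MAG is paired with the sample it was binned from, is below. The column names `Genome` and `Sample`, the toy data frame, and the output path are hypothetical and would need to match whatever the downstream read-mapping step expects.

```r
# Toy stand-in for checkm_res_f (columns MAG and acc as in the notebook)
checkm_toy <- data.frame(
    MAG = c('ERR1018311_bin.53', 'ERR1018311_bin.57', 'SRR2047628_bin.7'),
    acc = c('ERR1018311', 'ERR1018311', 'SRR2047628'),
    stringsAsFactors = FALSE
)

# One row per MAG, paired with the sample (accession) it was derived from.
# 'Genome' and 'Sample' are hypothetical column names.
genome_sample_map <- data.frame(
    Genome = checkm_toy$MAG,
    Sample = checkm_toy$acc,
    stringsAsFactors = FALSE
)

# Write as tab-delimited text, mirroring the samples file written earlier
map_file <- tempfile(fileext = '.tsv')
write.table(genome_sample_map, file=map_file, sep='\t', quote=FALSE, row.names=FALSE)
cat('File written:', map_file, '\n')
```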

# -- debug --

In [21]:
checkm_res_f %>%
    filter(acc == 'ERR1018311') %>%
    .$MAG %>% head %>% paste(collapse='\n') %>% cat

ERR1018311_bin.53
ERR1018311_bin.57
ERR1018311_bin.54
ERR1018311_bin.66
ERR1018311_bin.55
ERR1018311_bin.75

In [19]:
checkm_res_f %>%
    filter(acc == 'SRR2047628') %>%
    .$MAG %>% head %>% paste(collapse='\n') %>% cat

SRR2047628_bin.7
SRR2047628_bin.5
SRR2047628_bin.1