# Description:

* The first step is to download a microbial genome dataset.
* Here, we will download 3 bacterial genomes with differing G+C contents:
  * Clostridium ljungdahlii DSM13528 (G+C = 31.1%)
  * Escherichia coli 1303 (G+C = 50.7%)
  * Streptomyces pratensis ATCC33331 (G+C = 71.1%)
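
For readers unfamiliar with the term, G+C content is the percentage of bases in a sequence that are G or C. The sketch below shows one way to compute it from FASTA-formatted lines. The function name `fasta_gc_percent` is hypothetical (it is not part of SIPSim), and ignoring every character other than A, C, G, and T is an assumption made for this illustration.

```python
def fasta_gc_percent(lines):
    """Return the G+C percentage over all sequence lines of a FASTA input.

    `lines` is any iterable of strings, e.g. an open file handle.
    Header lines (starting with ">") are skipped, and only A/C/G/T
    are counted, so ambiguous bases such as N do not affect the result.
    """
    gc = at = 0
    for line in lines:
        if line.startswith('>'):
            continue  # skip FASTA header lines
        seq = line.strip().upper()
        gc += seq.count('G') + seq.count('C')
        at += seq.count('A') + seq.count('T')
    total = gc + at
    # avoid dividing by zero when no A/C/G/T bases were seen
    return 100.0 * gc / total if total else 0.0


# toy example: 2 of the 4 counted bases are G or C
print(fasta_gc_percent(['>toy_seq', 'ATGC', 'NN']))
```

Once the genomes have been downloaded, passing an open handle to one of the `.fna` files should give a value close to the G+C content listed above for that genome.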

# Setting variables

* "workDir" is the path to the working directory for this analysis (where the files will be downloaded to).
* **NOTE:** MAKE SURE to modify this path to the directory where YOU want to run the example.

In [51]:
workDir = '/home/nick/t/SIPSim_wSeq/'

# Initializing

* Loading packages & libraries
* Make sure you have all of the dependencies!

In [52]:
import os

In [53]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [54]:
# making directories
## working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)
%cd $workDir

/home/nick/t/SIPSim_wSeq


In [55]:
# making directories
## genome directory
workDirGenome = os.path.join(workDir, 'genomes')
if not os.path.isdir(workDirGenome):
    os.mkdir(workDirGenome)  
print(workDirGenome)

/home/nick/t/SIPSim_wSeq/genomes


**Let's check that SIPSim is installed**

In [56]:
%%bash
source activate py27
SIPSim -l

#-- Commands --#
BD_shift
communities
DBL
deltaBD
diffusion
fragment_KDE
fragment_KDE_cat
fragment_parse
fragments
genome_download
genome_index
genome_rename
gradient_fractions
HRSIP
incorp_config_example
isotope_incorp
KDE_bandwidth
KDE_info
KDE_parse
KDE_plot
KDE_sample
KDE_select_taxa
OTU_add_error
OTU_PCR
OTU_sample_data
OTU_subsample
OTU_sum
OTU_table
OTU_wide_long
qSIP
qSIP_atom_excess
tree_sim


# Downloading genomes

* Downloading the genome sequences from NCBI based on their accession numbers.
* If you have taxonomy IDs instead of accession numbers, you could download the genomes with [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download).

In [60]:
taxa="""Clostridium_ljungdahlii_DSM_13528	NC_014328.1
Escherichia_coli_1303	NZ_CP009166.1
Streptomyces_pratensis_ATCC_33331	NC_016114.1
"""

genome_file = os.path.join(workDir, 'genome_list.txt')
# text mode ('w') because `taxa` is a str; this runs under Python 2 and 3
with open(genome_file, 'w') as oFH:
    oFH.write(taxa)

print('File written: {}'.format(genome_file))

File written: /home/nick/t/SIPSim_wSeq/genome_list.txt


In [69]:
%%bash -s $genome_file
source activate py27

# downloading genomes
SIPSim genome_download -d genomes -n 3 $1

File written: genomes/Clostridium_ljungdahlii_DSM_13528.fna
File written: genomes/Escherichia_coli_1303.fna
File written: genomes/Streptomyces_pratensis_ATCC_33331.fna


In [70]:
!ls -thlc genomes

total 17M
-rw-rw-r-- 1 nick nick 7.1M Jul  9 14:02 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.8M Jul  9 14:02 Escherichia_coli_1303.fna
-rw-rw-r-- 1 nick nick 4.5M Jul  9 14:02 Clostridium_ljungdahlii_DSM_13528.fna


All 3 genomes should now be downloaded. Check that each file is listed and non-empty.
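
The same check can be done programmatically instead of reading the `ls` output by eye. This is a minimal sketch: the helper name `empty_genome_files` is hypothetical, and it assumes the genomes are the `*.fna` files in a single directory, as in the listing above.

```python
import glob
import os


def empty_genome_files(genome_dir, pattern='*.fna'):
    """Return the sorted paths of files matching `pattern` that are zero bytes."""
    paths = sorted(glob.glob(os.path.join(genome_dir, pattern)))
    return [p for p in paths if os.path.getsize(p) == 0]


# an empty list means every matching file has some content
print(empty_genome_files('genomes'))
```

Note that an empty list is also returned when no files match at all, so it is worth confirming that 3 files are present as well.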

# Renaming genome sequences

* Let's simplify the genome sequence names

In [71]:
# current sequence names
!grep ">" genomes/*fna | perl -pe 's/.+:>/>/'

>NC_014328.1 Clostridium ljungdahlii DSM 13528, complete genome
>NZ_CP009166.1 Escherichia coli 1303, complete genome
>NC_016114.1 Streptomyces pratensis ATCC 33331, complete genome


In [73]:
%%bash 
source activate py27

# making sure each sequence is unique
find ./genomes/ -name "*fna" | \
    SIPSim genome_rename -n 3 --prefix genomes_rn - 

File written: /home/nick/t/SIPSim_wSeq/genomes_rn/Escherichia_coli_1303.fna
File written: /home/nick/t/SIPSim_wSeq/genomes_rn/Clostridium_ljungdahlii_DSM_13528.fna
File written: /home/nick/t/SIPSim_wSeq/genomes_rn/Streptomyces_pratensis_ATCC_33331.fna


In [74]:
# NEW sequence names
!grep ">" genomes_rn/*fna | perl -pe 's/.+:>/>/'

>NC_014328_1_Clostridium_ljungdahlii_DSM_13528
>NZ_CP009166_1_Escherichia_coli_1303
>NC_016114_1_Streptomyces_pratensis_ATCC_33331
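
To make the before/after names easier to compare, here is a rough Python sketch that reproduces the three renamed headers shown above. It is not SIPSim's implementation: the helper `simplify_header` is hypothetical, and both rules (dropping everything after the first comma, and replacing other non-alphanumeric characters with underscores) are assumptions inferred only from this output.

```python
import re


def simplify_header(header):
    """Simplify a FASTA header line (hypothetical helper, not part of SIPSim)."""
    name = header.lstrip('>').strip()
    # assumption: drop the description after the first comma
    # (e.g. ", complete genome")
    name = name.split(',', 1)[0]
    # replace each run of characters other than letters, digits, or "_"
    # with a single underscore
    name = re.sub(r'[^A-Za-z0-9_]+', '_', name).strip('_')
    return '>' + name


print(simplify_header('>NZ_CP009166.1 Escherichia coli 1303, complete genome'))
```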


# Indexing genomes

* One more step!
  * Creating genome indices is needed for the upcoming *in-silico* PCR step

In [75]:
# changing the working directory
workDirGenome = os.path.join(workDir, 'genomes_rn')
%cd $workDirGenome

/home/nick/t/SIPSim_wSeq/genomes_rn


In [76]:
# making index file (taxon_name<tab>taxon_genome_file_name)
indexFile = """Clostridium_ljungdahlii_DSM_13528 Clostridium_ljungdahlii_DSM_13528.fna
Escherichia_coli_1303 Escherichia_coli_1303.fna
Streptomyces_pratensis_ATCC_33331 Streptomyces_pratensis_ATCC_33331.fna""".replace(' ', '\t')

F = os.path.join(workDirGenome, 'genome_index.txt')
# text mode ('w') because `indexFile` is a str; this runs under Python 2 and 3
with open(F, 'w') as oFH:
    oFH.write(indexFile)

print('File written: {}'.format(F))

File written: /home/nick/t/SIPSim_wSeq/genomes_rn/genome_index.txt
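
With more than a few genomes, typing the index by hand becomes error-prone. The sketch below builds the same `taxon_name<tab>file_name` text from a list of FASTA file names. The helper `build_genome_index` is hypothetical, and it assumes (as in the hand-written index above) that each taxon name is simply the file name without its extension.

```python
import os


def build_genome_index(fasta_files):
    """Return index text with one 'taxon_name<tab>file_name' line per file."""
    lines = []
    for fname in sorted(fasta_files):
        base = os.path.basename(fname)
        # assumption: the taxon name is the file name minus its extension
        taxon = os.path.splitext(base)[0]
        lines.append('{}\t{}'.format(taxon, base))
    return '\n'.join(lines)


print(build_genome_index(['Escherichia_coli_1303.fna',
                          'Clostridium_ljungdahlii_DSM_13528.fna']))
```

The returned string could then be written to `genome_index.txt` in the same way as `indexFile`.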


**Note:** this next step will use 3 processors (`--np`). Change this option if needed.

In [77]:
%%bash
source activate py27
# indexing genomes; saving log 
SIPSim genome_index genome_index.txt --fp . --np 3 > index_log.txt

Indexing: "Clostridium_ljungdahlii_DSM_13528"
Indexing: "Escherichia_coli_1303"
Indexing: "Streptomyces_pratensis_ATCC_33331"
#-- All genomes indexed --#


In [78]:
# checking all of the files produced in the ./genomes_rn/ directory
!ls -thlc

total 319M
-rw-rw-r-- 1 nick nick 4.4K Jul  9 14:04 index_log.txt
-rw-r--r-- 1 nick nick 135M Jul  9 14:04 Streptomyces_pratensis_ATCC_33331.fna.sqlite3.db
-rw-r--r-- 1 nick nick  83M Jul  9 14:04 Escherichia_coli_1303.fna.sqlite3.db
-rw-r--r-- 1 nick nick  82M Jul  9 14:04 Clostridium_ljungdahlii_DSM_13528.fna.sqlite3.db
-rw-rw-r-- 1 nick nick 1.8M Jul  9 14:03 Streptomyces_pratensis_ATCC_33331.fna.2bit
-rw-rw-r-- 1 nick nick 1.2M Jul  9 14:03 Escherichia_coli_1303.fna.2bit
-rw-rw-r-- 1 nick nick 1.2M Jul  9 14:03 Clostridium_ljungdahlii_DSM_13528.fna.2bit
-rw-rw-r-- 1 nick nick   91 Jul  9 14:03 Streptomyces_pratensis_ATCC_33331.fna.uni
-rw-rw-r-- 1 nick nick   81 Jul  9 14:03 Escherichia_coli_1303.fna.uni
-rw-rw-r-- 1 nick nick   91 Jul  9 14:03 Clostridium_ljungdahlii_DSM_13528.fna.uni
-rw-rw-r-- 1 nick nick  191 Jul  9 14:03 genome_index.txt
-rw-rw-r-- 1 nick nick 7.1M Jul  9 14:02 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.5M Jul  9 14:02 Clostrid

# Next steps

Now it's time to move on to a [simulation](./2_simulation-shotgun.ipynb)!
We will simulate some shotgun genome sequences.