# Description:

* The first thing to do is download a microbial genome dataset.
* Here, we will be downloading 3 bacterial genomes with differing G+C contents:
  * Clostridium ljungdahlii DSM 13528 (G+C = 31.1%)
  * Escherichia coli 1303 (G+C = 50.7%)
  * Streptomyces pratensis ATCC 33331 (G+C = 71.1%)
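
For reference, G+C content is the percentage of G and C bases among all unambiguous (A, C, G, T) bases in a sequence. The sketch below shows one way to compute it from a FASTA file. The function names `gc_content` and `fasta_gc_content` are hypothetical helpers written for this example; they are not part of SIPSim or `seqDB_tools`.

```python
def gc_content(seq):
    """Return the G+C percentage of a nucleotide sequence.

    Symbols other than A, C, G and T (e.g. N) are ignored.
    """
    seq = seq.upper()
    gc = seq.count('G') + seq.count('C')
    at = seq.count('A') + seq.count('T')
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def fasta_gc_content(path):
    """Return the G+C percentage pooled over all sequences in a FASTA file."""
    chunks = []
    with open(path) as fh:
        for line in fh:
            # skip header lines, keep sequence lines
            if not line.startswith('>'):
                chunks.append(line.strip())
    return gc_content(''.join(chunks))
```

Once the genomes have been downloaded, calling `fasta_gc_content()` on each `.fna` file gives a value that can be compared against the G+C contents listed above.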

# Setting variables

* "workDir" is the path to the working directory for this analysis (where the files will be downloaded to)
* **NOTE:** MAKE SURE to modify this path to the directory where YOU want to run the example. 

In [1]:
workDir = '/home/nick/t/SIPSim/'

# Initializing

* Loading packages & libraries
* Make sure you have all of the dependencies!

In [2]:
import os

In [3]:
%load_ext rpy2.ipython

In [4]:
# making directories
## working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)
%cd $workDir

## genome directory
workDirGenome = os.path.join(workDir, 'genomes')
if not os.path.isdir(workDirGenome):
    os.mkdir(workDirGenome)  
print(workDirGenome)

/home/nick/t/SIPSim
/home/nick/t/SIPSim/genomes


**Let's check that SIPSim is installed**

In [5]:
!SIPSim -h

SIPSim: simulate gradient fractionation of microbial community DNA

Usage:
  SIPSim <command> [<args>...]
  SIPSim -l | --list
  SIPSim -h | --help
  SIPSim --version

Options:
  -l --list     List subcommands.
  -h --help     Show this screen.
  --version     Show version.

Commands:
  Use the `list` option.
Description:
  Simulate how taxa would be distributed in isopycnic gradients as assessed by
  high throughput sequencing.


# Downloading genomes

* Downloading the genome sequences from NCBI based on their accession numbers.
* Here, I'm using a helper script `seqDB_tools` for downloading the genomes, but there are other ways.
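
As one example of the "other ways", the sketch below fetches a FASTA record directly from the NCBI E-utilities `efetch` endpoint using only the standard library. It is an illustration, not the method used by `seqDB_tools`. The helper names `efetch_url` and `download_fasta` are hypothetical, and the FASTA headers returned by NCBI will likely differ from those produced by `seqDB_tools`.

```python
try:
    # Python 3
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:
    # Python 2
    from urllib2 import urlopen
    from urllib import urlencode

EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def efetch_url(accession):
    """Build an efetch URL that requests a nucleotide record as FASTA text."""
    params = [('db', 'nuccore'),
              ('id', accession),
              ('rettype', 'fasta'),
              ('retmode', 'text')]
    return EFETCH + '?' + urlencode(params)


def download_fasta(accession, out_path):
    """Download one accession and write the raw FASTA bytes to out_path."""
    response = urlopen(efetch_url(accession))
    try:
        data = response.read()
    finally:
        response.close()
    with open(out_path, 'wb') as out_fh:
        out_fh.write(data)
```

For example, `download_fasta('NC_014328.1', 'genomes/Clostridium_ljungdahlii_DSM_13528.fna')` would request the first genome in the list below (network access required, and the `genomes` directory must already exist).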

In [6]:
taxa="""Clostridium_ljungdahlii_DSM_13528	NC_014328.1
Escherichia_coli_1303	NZ_CP009166.1
Streptomyces_pratensis_ATCC_33331	NC_016114.1
"""

F = os.path.join(workDir, 'genome_list.txt')
# text mode and print() work under both Python 2 and Python 3
with open(F, 'w') as oFH:
    oFH.write(taxa)

print('File written: {}'.format(F))

File written: /home/nick/t/SIPSim/genome_list.txt


**Note:** For the next step, `seqDB_tools` should be installed. See [here](https://github.com/nyoungb2/seqDB_tools) for installation

In [7]:
# The 'MSG: No whitespace allowed in FASTA ID' warnings are expected and can be ignored
!seqDB_tools accession-GI2fasta -n 1 -a 2 -o genomes < genome_list.txt

Starting batch->trial: 0->1

MSG: No whitespace allowed in FASTA ID [NC_014328|Clostridium ljungdahlii DSM 13528, complete genome.]
---------------------------------------------------

MSG: No whitespace allowed in FASTA ID [NC_014328|Clostridium ljungdahlii DSM 13528, complete genome.]
---------------------------------------------------

MSG: No whitespace allowed in FASTA ID [NC_016114|Streptomyces pratensis ATCC 33331, complete genome.]
---------------------------------------------------

MSG: No whitespace allowed in FASTA ID [NC_016114|Streptomyces pratensis ATCC 33331, complete genome.]
---------------------------------------------------

MSG: No whitespace allowed in FASTA ID [NZ_CP009166|Escherichia coli 1303, complete genome.]
---------------------------------------------------

MSG: No whitespace allowed in FASTA ID [NZ_CP009166|Escherichia coli 1303, complete genome.]
---------------------------------------------------


In [8]:
!ls -thlc genomes/

total 17M
-rw-rw-r-- 1 nick nick 4.8M Jun 28 13:17 Escherichia_coli_1303.fna
-rw-rw-r-- 1 nick nick 7.2M Jun 28 13:16 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.5M Jun 28 13:15 Clostridium_ljungdahlii_DSM_13528.fna


All 3 genomes should now be downloaded. Check that each file is present and non-empty.
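
The same check can be done programmatically instead of reading the `ls` output. The sketch below assumes that each genome is saved as `<taxon_name>.fna` in the genome directory, as in the listing above. `empty_or_missing` is a hypothetical helper written for this example.

```python
import os


def empty_or_missing(genome_dir, names, ext='.fna'):
    """Return the expected genome files that are missing or zero bytes long."""
    problems = []
    for name in names:
        path = os.path.join(genome_dir, name + ext)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    return problems
```

With the variables defined earlier, `empty_or_missing(workDirGenome, [line.split('\t')[0] for line in taxa.splitlines() if line])` should return an empty list if every download succeeded.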

# Renaming genome sequences

* Let's simplify the genome sequence names

In [9]:
# current sequence names
!grep ">" genomes/*fna | perl -pe 's/.+:>/>/'

>NC_014328|Clostridium ljungdahlii DSM 13528, complete genome. Clostridium ljungdahlii DSM 13528, complete genome.
>NZ_CP009166|Escherichia coli 1303, complete genome. Escherichia coli 1303, complete genome.
>NC_016114|Streptomyces pratensis ATCC 33331, complete genome. Streptomyces pratensis ATCC 33331, complete genome.


In [12]:
# making sure each sequence is unique
!find ./genomes/ -name "*fna" | \
    SIPSim genome_rename -n 3 --prefix genomes_rn - 

File written: /home/nick/t/SIPSim/genomes_rn/Clostridium_ljungdahlii_DSM_13528.fna
File written: /home/nick/t/SIPSim/genomes_rn/Escherichia_coli_1303.fna
File written: /home/nick/t/SIPSim/genomes_rn/Streptomyces_pratensis_ATCC_33331.fna


In [13]:
# NEW sequence names
!grep ">" genomes_rn/*fna | perl -pe 's/.+:>/>/'

>Clostridium_ljungdahlii_DSM_13528
>Escherichia_coli_1303
>Streptomyces_pratensis_ATCC_33331
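
To confirm the renaming without `grep`, the sketch below collects every FASTA header in a directory and reports any that contain whitespace or appear more than once. It only inspects the files and does not reproduce what `SIPSim genome_rename` does internally. `fasta_headers` and `check_headers` are hypothetical helpers written for this example.

```python
import glob
import os


def fasta_headers(path):
    """Return all header lines of a FASTA file, without the leading '>'."""
    headers = []
    with open(path) as fh:
        for line in fh:
            if line.startswith('>'):
                headers.append(line[1:].rstrip('\n'))
    return headers


def check_headers(fasta_dir, pattern='*.fna'):
    """Return (all_headers, problems) for the FASTA files in fasta_dir.

    A header is reported in `problems` if it contains whitespace
    or duplicates a header that was already seen.
    """
    seen = set()
    all_headers = []
    problems = []
    for path in sorted(glob.glob(os.path.join(fasta_dir, pattern))):
        for header in fasta_headers(path):
            all_headers.append(header)
            if header in seen or any(c.isspace() for c in header):
                problems.append(header)
            seen.add(header)
    return all_headers, problems
```

Running `check_headers('genomes_rn')` on the renamed genomes should return an empty `problems` list, whereas the original `genomes` directory should report headers containing whitespace.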


# Indexing genomes

* One more step!
  * Creating genome indices is needed for the upcoming *in-silico* PCR step

In [14]:
# changing the working directory
workDirGenome = os.path.join(workDir, 'genomes_rn')
%cd $workDirGenome

/home/nick/t/SIPSim/genomes_rn


In [15]:
# making index file (taxon_name<tab>taxon_genome_file_name)
indexFile = """Clostridium_ljungdahlii_DSM_13528 Clostridium_ljungdahlii_DSM_13528.fna
Escherichia_coli_1303 Escherichia_coli_1303.fna
Streptomyces_pratensis_ATCC_33331 Streptomyces_pratensis_ATCC_33331.fna""".replace(' ', '\t')

F = os.path.join(workDirGenome, 'genome_index.txt')
# text mode and print() work under both Python 2 and Python 3
with open(F, 'w') as oFH:
    oFH.write(indexFile)

print('File written: {}'.format(F))

File written: /home/nick/t/SIPSim/genomes_rn/genome_index.txt
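
Before indexing, it can help to verify that the index file really has two tab-separated columns (taxon name, genome file name) and that every listed file exists. The sketch below does this with the standard library only. `read_genome_index` and `missing_genome_files` are hypothetical helpers, not SIPSim functions.

```python
import os


def read_genome_index(index_path):
    """Parse a two-column, tab-delimited index into (taxon, file_name) pairs."""
    entries = []
    with open(index_path) as fh:
        for line_num, line in enumerate(fh, 1):
            line = line.rstrip('\n')
            if not line:
                continue  # ignore blank lines
            fields = line.split('\t')
            if len(fields) != 2:
                raise ValueError(
                    'Line {}: expected 2 tab-separated fields, got {}'.format(
                        line_num, len(fields)))
            entries.append((fields[0], fields[1]))
    return entries


def missing_genome_files(entries, genome_dir):
    """Return the file names from the index that do not exist in genome_dir."""
    return [file_name for _, file_name in entries
            if not os.path.isfile(os.path.join(genome_dir, file_name))]
```

For this example, `missing_genome_files(read_genome_index(F), workDirGenome)` should return an empty list.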


**Note:** this next step will use 3 processors (`--np`). Change this option if needed.

In [16]:
!SIPSim genome_index genome_index.txt --fp . --np 3 > index_log.txt

Indexing: "Clostridium_ljungdahlii_DSM_13528"
Indexing: "Escherichia_coli_1303"
Indexing: "Streptomyces_pratensis_ATCC_33331"
#-- All genomes indexed --#


In [17]:
# checking all of the files produced in the ./genomes_rn/ directory
!ls -thlc

total 333M
-rw-rw-r-- 1 nick nick 4.4K Jun 28 13:18 index_log.txt
-rw-r--r-- 1 nick nick 130M Jun 28 13:18 Streptomyces_pratensis_ATCC_33331.fna.sqlite3.db
-rw-r--r-- 1 nick nick  97M Jun 28 13:18 Escherichia_coli_1303.fna.sqlite3.db
-rw-r--r-- 1 nick nick  87M Jun 28 13:18 Clostridium_ljungdahlii_DSM_13528.fna.sqlite3.db
-rw-rw-r-- 1 nick nick 1.8M Jun 28 13:17 Streptomyces_pratensis_ATCC_33331.fna.2bit
-rw-rw-r-- 1 nick nick 1.2M Jun 28 13:17 Escherichia_coli_1303.fna.2bit
-rw-rw-r-- 1 nick nick   79 Jun 28 13:17 Streptomyces_pratensis_ATCC_33331.fna.uni
-rw-rw-r-- 1 nick nick 1.2M Jun 28 13:17 Clostridium_ljungdahlii_DSM_13528.fna.2bit
-rw-rw-r-- 1 nick nick   67 Jun 28 13:17 Escherichia_coli_1303.fna.uni
-rw-rw-r-- 1 nick nick   79 Jun 28 13:17 Clostridium_ljungdahlii_DSM_13528.fna.uni
-rw-rw-r-- 1 nick nick  191 Jun 28 13:17 genome_index.txt
-rw-rw-r-- 1 nick nick 7.2M Jun 28 13:17 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.8M Jun 28 13:17 Escheric

# Next steps

Now it's time to move on to a [simulation](./2_simulation-simple.ipynb)!