### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs and/or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [19]:
import ipyrad as ip
import ipyparallel as ipp
## Print the version
print(ip.__version__)

0.7.23


### Download the data
You will probably want to move the data to your scratch directory. You can run this code here to download it, or from a terminal. 

In [37]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

x ./ipsimdata/
x ./ipsimdata/pairgbs_example_R2_.fastq.gz
x ./ipsimdata/pairgbs_wmerge_example_barcodes.txt
x ./ipsimdata/rad_example_genome.fa
x ./ipsimdata/pairddrad_example_genome.fa
x ./ipsimdata/pairgbs_example_R1_.fastq.gz
x ./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
x ./ipsimdata/rad_example_genome.fa.fai
x ./ipsimdata/pairddrad_example_R2_.fastq.gz
x ./ipsimdata/pairddrad_example_genome.fa.sma
x ./ipsimdata/pairddrad_example_gen
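
If you would rather stay in Python than use a `%%bash` cell, the extraction step can be sketched with the standard library's `tarfile` module. This is a minimal sketch, not part of the original assignment: the helper name `extract_archive` is hypothetical, and it assumes `ipsimdata.tar.gz` has already been downloaded into the working directory.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the sorted member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest_dir)
    return names


# Only run the extraction if the download is actually present.
if os.path.exists("ipsimdata.tar.gz"):
    for name in extract_archive("ipsimdata.tar.gz"):
        print(name)
```

Returning the member names gives the same listing that `tar -xvzf` prints, so the result can be checked against the `ls ipsimdata/` cell.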

In [2]:
ls ipsimdata/

gbs_example_R1_.fastq.gz               pairgbs_example_R1_.fastq.gz
gbs_example_barcodes.txt               pairgbs_example_R2_.fastq.gz
gbs_example_genome.fa                  pairgbs_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_R2_.fastq.gz         pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        rad_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        rad_example_barcodes.txt
pairddrad_example_genome.fa.smi        rad_example_genome.fa
pairddrad_wmerge_example_R1_.fastq.gz  rad_example_genome.fa.fai
pairddrad_wmerge_example_R2_.fastq.gz  rad_example_genome.fa.sma
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.smi
pairddrad_wmerge_example_genome.fa


### Connect to an ipcluster instance

In [50]:
# If working on your own machine, open a new terminal and run:
#   ipcluster start --n=4
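
Once `ipcluster` is running, a notebook can confirm how many engines are available before starting an assembly. The following is a minimal sketch using `ipyparallel.Client`. The helper name `count_engines` is hypothetical, and it treats any failure to import or connect as "no engines" so that it also runs on a machine without a cluster.

```python
def count_engines(timeout=5):
    """Return the number of reachable ipcluster engines, or 0 if none can be reached."""
    try:
        import ipyparallel as ipp
        # Client() looks for the connection file written by `ipcluster start`.
        client = ipp.Client(timeout=timeout)
    except Exception:
        # ipyparallel is missing or no cluster is running: report zero engines.
        return 0
    try:
        return len(client.ids)
    finally:
        client.close()


print("engines available: {}".format(count_engines()))
```

With the cluster started as in the comment above, `--n=4` requests four engines; they can take a few seconds to register after `ipcluster` starts.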

### Assemble the dataset from step 1 to step 7

Create an Assembly object

In [4]:
ipsimdata = ip.Assembly("ipsimdata")

New Assembly: ipsimdata


Setting/modifying parameters for this Assembly object

In [5]:
ipsimdata.set_params('project_dir', "pedicularis")
ipsimdata.set_params('filter_adapters', 2)
ipsimdata.set_params('datatype', 'rad')
ipsimdata.set_params('barcodes_path', "./ipsimdata/rad_example_barcodes.txt")
ipsimdata.set_params('raw_fastq_path', "./ipsimdata/rad_example_R1_.fastq.gz")

# I am using the raw (not yet demultiplexed) fastq reads, so 'sorted_fastq_path',
# the directory for already demultiplexed reads, is left empty

# Print the parameters to the screen
ipsimdata.get_params()

0   assembly_name               ipsimdata                                    
1   project_dir                 ./pedicularis                                
2   raw_fastq_path              ./ipsimdata/rad_example_R1_.fastq.gz         
3   barcodes_path               ./ipsimdata/rad_example_barcodes.txt         
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                               

In [6]:
# Parameter values are checked when they are set

## e.g., clust_threshold cannot be 2.0
ipsimdata.set_params("clust_threshold", 2.0)

IPyradError:     Error setting parameter 'clust_threshold'
    clust_threshold must be a decimal value between 0 and 1.
    You entered: 2.0
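
The error above shows that `set_params` validates a value before storing it. The same kind of range check can be sketched in plain Python. This is an illustration only, not ipyrad's actual implementation: the function name `check_clust_threshold` is hypothetical, and treating both bounds as exclusive is an assumption based on the wording of the error message.

```python
def check_clust_threshold(value):
    """Return value as a float if it lies strictly between 0 and 1, else raise ValueError."""
    number = float(value)
    if not 0.0 < number < 1.0:
        raise ValueError(
            "clust_threshold must be a decimal value between 0 and 1. "
            "You entered: {}".format(value)
        )
    return number


# A valid threshold is returned unchanged.
print(check_clust_threshold(0.85))

# An out-of-range threshold raises an error instead of being stored.
try:
    check_clust_threshold(2.0)
except ValueError as err:
    print(err)
```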
    

In [7]:
# You can explore the Assembly's attributes
print(ipsimdata.dirs)

fastqs : 
edits : 
clusts : 
consens : 
outfiles : 



### Run Step 1 to create Samples objects:

If you rerun a step that has already finished, it is skipped and the existing data are not overwritten unless you pass `force=True`.

In [8]:
# Step 1 sorts (demultiplexes) the reads here because the data file was not yet demultiplexed
ipsimdata.run("1")

Assembly: ipsimdata
[####################] 100%  sorting reads         | 0:00:05 | s1 | 
[####################] 100%  writing/compressing   | 0:00:02 | s1 | 
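
As noted above, a finished step is skipped on rerun unless `force=True` is passed to `run()`. A minimal sketch of wrapping that choice in a helper follows. The name `rerun_step` is hypothetical, and `assembly` is assumed to be an ipyrad Assembly object such as `ipsimdata`.

```python
def rerun_step(assembly, steps, overwrite=False):
    """Run the given ipyrad steps, redoing finished steps only if overwrite is True."""
    # force=True tells run() to redo steps that have already finished;
    # with force=False those steps are skipped and existing data are kept.
    assembly.run(steps, force=overwrite)


# Example call (commented out so that step 1 is not accidentally redone):
# rerun_step(ipsimdata, "1", overwrite=True)
```

Keeping the default at `overwrite=False` means an accidental re-execution of a notebook cell does not discard earlier results.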


In [9]:
# keys are Sample names and the values of the dictionary are the Sample objects
ipsimdata.samples

{'1A_0': <ipyrad.core.sample.Sample at 0x10b821810>,
 '1B_0': <ipyrad.core.sample.Sample at 0x10b805290>,
 '1C_0': <ipyrad.core.sample.Sample at 0x10b821450>,
 '1D_0': <ipyrad.core.sample.Sample at 0x10b80e510>,
 '2E_0': <ipyrad.core.sample.Sample at 0x10b821050>,
 '2F_0': <ipyrad.core.sample.Sample at 0x10b821fd0>,
 '2G_0': <ipyrad.core.sample.Sample at 0x10b821bd0>,
 '2H_0': <ipyrad.core.sample.Sample at 0x10b7adb50>,
 '3I_0': <ipyrad.core.sample.Sample at 0x10b80e110>,
 '3J_0': <ipyrad.core.sample.Sample at 0x10b80ec50>,
 '3K_0': <ipyrad.core.sample.Sample at 0x10b805b10>,
 '3L_0': <ipyrad.core.sample.Sample at 0x10b821f90>}

In [10]:
print(ipsimdata.stats)

      state  reads_raw
1A_0      1      19862
1B_0      1      20043
1C_0      1      20136
1D_0      1      19966
2E_0      1      20017
2F_0      1      19933
2G_0      1      20030
2H_0      1      20199
3I_0      1      19885
3J_0      1      19822
3K_0      1      19965
3L_0      1      20008
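
A quick sanity check on step 1 is that the reads are spread evenly across samples. The sketch below uses only the standard library and the `reads_raw` values copied from the table above; the variable names are illustrative.

```python
# reads_raw per sample, copied from the stats table printed above
reads_raw = {
    "1A_0": 19862, "1B_0": 20043, "1C_0": 20136, "1D_0": 19966,
    "2E_0": 20017, "2F_0": 19933, "2G_0": 20030, "2H_0": 20199,
    "3I_0": 19885, "3J_0": 19822, "3K_0": 19965, "3L_0": 20008,
}

total = sum(reads_raw.values())
mean = total / float(len(reads_raw))
fewest = min(reads_raw, key=reads_raw.get)
most = max(reads_raw, key=reads_raw.get)

print("total reads: {}".format(total))
print("mean reads per sample: {:.1f}".format(mean))
print("fewest: {} ({})".format(fewest, reads_raw[fewest]))
print("most: {} ({})".format(most, reads_raw[most]))
```

For this data set the total is 239866 reads, with 3J_0 the smallest sample and 2H_0 the largest, so the samples differ by only a few hundred reads.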


### Run Step 2: Filter reads

In [11]:
ipsimdata.run("2")

Assembly: ipsimdata
[####################] 100%  processing reads      | 0:00:07 | s2 | 


In [12]:
# show results from step 2
print(ipsimdata.stats_dfs.s2)

      reads_raw  trim_adapter_bp_read1  trim_quality_bp_read1  \
1A_0      19862                    360                      0   
1B_0      20043                    362                      0   
1C_0      20136                    349                      0   
1D_0      19966                    404                      0   
2E_0      20017                    394                      0   
2F_0      19933                    376                      0   
2G_0      20030                    381                      0   
2H_0      20199                    386                      0   
3I_0      19885                    372                      0   
3J_0      19822                    381                      0   
3K_0      19965                    382                      0   
3L_0      20008                    424                      0   

      reads_filtered_by_Ns  reads_filtered_by_minlen  reads_passed_filter  
1A_0                     0                         0                19862  
1B

### Run Steps 3-6
* Step 3: Cluster reads within samples

* Step 4: Joint estimation of heterozygosity and error rate

* Step 5: Consensus base calls

* Step 6: Cluster across samples
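
The `run()` method takes the steps as a string of digits, so `"3456"` runs the four steps listed above in order. A small sketch of how such a step string maps onto step descriptions is given below. The dictionary `STEP_NAMES` and the helper `describe_steps` are hypothetical, and the descriptions are paraphrased from this notebook's headings rather than taken from ipyrad itself.

```python
# Step descriptions paraphrased from the headings in this notebook
STEP_NAMES = {
    "1": "Create Sample objects (demultiplex raw reads)",
    "2": "Filter reads",
    "3": "Cluster reads within samples",
    "4": "Joint estimation of heterozygosity and error rate",
    "5": "Consensus base calls",
    "6": "Cluster across samples",
    "7": "Filter and write output files",
}


def describe_steps(steps):
    """Return a list of (step, description) pairs for a step string such as '3456'."""
    unknown = [s for s in steps if s not in STEP_NAMES]
    if unknown:
        raise ValueError("unknown step(s): {}".format(", ".join(unknown)))
    return [(s, STEP_NAMES[s]) for s in steps]


for step, name in describe_steps("3456"):
    print("step {}: {}".format(step, name))
```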

In [13]:
ipsimdata.run("3456")

Assembly: ipsimdata
[####################] 100%  dereplicating         | 0:00:00 | s3 | 
[####################] 100%  clustering            | 0:00:01 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:34 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:04 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:20 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[####################] 100%  clustering across     | 0:00:04 | s6 | 
[####################] 100%  building clusters     | 0:00:00 | s6 | 
[####################] 100%  aligning clusters     | 0:00:07 | s6 | 
[#############

In [14]:
print(ipsimdata.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19862                19862            1000              1000   
1B_0      6      20043                20043            1000              1000   
1C_0      6      20136                20136            1000              1000   
1D_0      6      19966                19966            1000              1000   
2E_0      6      20017                20017            1000              1000   
2F_0      6      19933                19933            1000              1000   
2G_0      6      20030                20030            1000              1000   
2H_0      6      20199                20198            1000              1000   
3I_0      6      19885                19885            1000              1000   
3J_0      6      19822                19822            1000              1000   
3K_0      6      19965                19965            1000              1000   
3L_0      6      20008      

### Step 7: Filter and write output files

In [15]:
ipsimdata.run("7") 

Assembly: ipsimdata
[####################] 100%  filtering loci        | 0:00:10 | s7 | 
[####################] 100%  building loci/stats   | 0:00:00 | s7 | 
[####################] 100%  building vcf file     | 0:00:05 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:00 | s7 | 
[####################] 100%  writing outfiles      | 0:00:00 | s7 | 
Outfiles written to: ~/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles



### Save assembly object 

In [16]:
## the Assembly is also auto-saved after every run() command
ipsimdata.save()

In [21]:
## load assembly object
ipsimdata = ip.load_json("pedicularis/ipsimdata.json")

loading Assembly: ipsimdata
from saved path: ~/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata.json
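
Because the saved Assembly is a plain JSON file, it can also be inspected without ipyrad. A minimal sketch using the standard library's `json` module is shown below. The helper `top_level_keys` is hypothetical, and nothing is assumed here about which keys ipyrad writes.

```python
import json
import os


def top_level_keys(json_path):
    """Return the sorted top-level keys of a JSON file that contains an object."""
    with open(json_path) as handle:
        data = json.load(handle)
    return sorted(data)


# Path relative to the notebook directory, as used in the load_json call above.
path = "pedicularis/ipsimdata.json"
if os.path.exists(path):
    print(top_level_keys(path))
```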


### Print the final assembly stats

In [17]:
print(ipsimdata.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19862                19862            1000              1000   
1B_0      6      20043                20043            1000              1000   
1C_0      6      20136                20136            1000              1000   
1D_0      6      19966                19966            1000              1000   
2E_0      6      20017                20017            1000              1000   
2F_0      6      19933                19933            1000              1000   
2G_0      6      20030                20030            1000              1000   
2H_0      6      20199                20198            1000              1000   
3I_0      6      19885                19885            1000              1000   
3J_0      6      19822                19822            1000              1000   
3K_0      6      19965                19965            1000              1000   
3L_0      6      20008      
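
The `reads_raw` and `reads_passed_filter` columns show how many reads step 2 removed. The sketch below uses the standard library and the eleven complete rows printed above (the 3L_0 row is cut off in this output, so it is left out); the variable names are illustrative.

```python
# (reads_raw, reads_passed_filter) per sample, copied from the table above
counts = {
    "1A_0": (19862, 19862), "1B_0": (20043, 20043), "1C_0": (20136, 20136),
    "1D_0": (19966, 19966), "2E_0": (20017, 20017), "2F_0": (19933, 19933),
    "2G_0": (20030, 20030), "2H_0": (20199, 20198), "3I_0": (19885, 19885),
    "3J_0": (19822, 19822), "3K_0": (19965, 19965),
}

# Report every sample that lost at least one read during filtering.
for sample in sorted(counts):
    raw, passed = counts[sample]
    removed = raw - passed
    if removed:
        print("{}: {} of {} reads removed".format(sample, removed, raw))

total_removed = sum(raw - passed for raw, passed in counts.values())
print("reads removed across listed samples: {}".format(total_removed))
```

Among the listed samples only 2H_0 lost a read, which is expected for simulated data with no low-quality bases.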

### Show the location of your assembled output files

In [22]:
print(ipsimdata.outfiles)

alleles : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.alleles.loci
loci : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.loci
phy : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.phy
snpsmap : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.snps.map
snpsphy : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.snps.phy
vcf : /Users/meredithvanacker/PDSB/hw12/12-parallel-genomics/notebooks/pedicularis/ipsimdata_outfiles/ipsimdata.vcf
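
To check an output file by hand, one option is to count the loci in the `.loci` file. The sketch below assumes that each locus in that file ends with a separator line beginning with `//`; treat this as an assumption to verify against your own file. The helper name `count_loci` is hypothetical.

```python
import os


def count_loci(loci_path):
    """Count loci in a .loci file, assuming each locus ends with a line starting with '//'."""
    n_loci = 0
    with open(loci_path) as handle:
        for line in handle:
            if line.startswith("//"):
                n_loci += 1
    return n_loci


# Path relative to the notebook directory; adjust it to your own project_dir.
path = "pedicularis/ipsimdata_outfiles/ipsimdata.loci"
if os.path.exists(path):
    print("loci in file: {}".format(count_loci(path)))
```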

