### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. Download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [1]:
import ipyrad as ip
import ipyparallel as ipp


### Download the data
You will probably want to move the data to your scratch directory. You can run the cell below to download it, or run the same commands from a terminal.

In [2]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

x ./ipsimdata/
x ./ipsimdata/pairgbs_example_R2_.fastq.gz
x ./ipsimdata/pairgbs_wmerge_example_barcodes.txt
x ./ipsimdata/rad_example_genome.fa
x ./ipsimdata/pairddrad_example_genome.fa
x ./ipsimdata/pairgbs_example_R1_.fastq.gz
x ./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
x ./ipsimdata/rad_example_genome.fa.fai
x ./ipsimdata/pairddrad_example_R2_.fastq.gz
x ./ipsimdata/pairddrad_example_genome.fa.sma
x ./ipsimdata/pairddrad_example_genome.fa.fai
x ./ipsimdata/pairgbs_wmerge_example_genome.fa
x ./ipsimdata/pairddr

In [3]:
ls ipsimdata/

gbs_example_R1_.fastq.gz               pairgbs_example_R1_.fastq.gz
gbs_example_barcodes.txt               pairgbs_example_R2_.fastq.gz
gbs_example_genome.fa                  pairgbs_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_R2_.fastq.gz         pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        rad_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        rad_example_barcodes.txt
pairddrad_example_genome.fa.smi        rad_example_genome.fa
pairddrad_wmerge_example_R1_.fastq.gz  rad_example_genome.fa.fai
pairddrad_wmerge_example_R2_.fastq.gz  rad_example_genome.fa.sma
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.smi
pairddrad_wmerge_example_genome.fa
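
Each simulated dataset in the listing above shares a prefix before `_example` (reads, barcodes, and a reference genome). The following is a minimal sketch, using only the standard library, of grouping the file names by that prefix to see which files belong together. The helper name `group_by_dataset` is hypothetical, and the file list is a hand-copied subset of the listing.

```python
from collections import defaultdict


def group_by_dataset(filenames):
    """Group file names by the prefix that precedes '_example'."""
    groups = defaultdict(list)
    for name in filenames:
        prefix, sep, _rest = name.partition("_example")
        if sep:  # skip names that do not follow the naming pattern
            groups[prefix].append(name)
    return {key: sorted(value) for key, value in groups.items()}


# A subset of the names shown by `ls ipsimdata/` above.
files = [
    "gbs_example_R1_.fastq.gz",
    "gbs_example_barcodes.txt",
    "gbs_example_genome.fa",
    "rad_example_R1_.fastq.gz",
    "rad_example_barcodes.txt",
    "rad_example_genome.fa",
    "pairddrad_example_R1_.fastq.gz",
    "pairddrad_example_R2_.fastq.gz",
]

datasets = group_by_dataset(files)
for name in sorted(datasets):
    print(name, datasets[name])
```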


### Connect to an ipcluster instance

In [24]:
# Run this in a terminal first: ipcluster start --n=4

In [4]:
ipyclient = ipp.Client()

In [5]:
ipyclient.ids

[0, 1, 2, 3]
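
Engines can take a few seconds to register after `ipcluster start`, so `ipyclient.ids` may be shorter than expected if it is checked too early. The sketch below polls until the expected number of engines is listed. `wait_for_engines` is a hypothetical helper, not part of ipyparallel; it takes any callable that returns the current ids, so it only assumes that `ipyclient.ids` behaves as shown above.

```python
import time


def wait_for_engines(get_ids, expected, timeout=30.0, poll=0.5):
    """Poll get_ids() until `expected` engines are listed or time runs out."""
    deadline = time.time() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= expected:
            return ids
        if time.time() >= deadline:
            raise RuntimeError(
                "only %d of %d engines connected" % (len(ids), expected)
            )
        time.sleep(poll)


# With the client created above, the call would look like:
#   wait_for_engines(lambda: ipyclient.ids, expected=4)
```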

### Assemble the dataset from step 1 through step 7

In [6]:
gbs = ip.Assembly("gbs")

New Assembly: gbs


In [7]:
gbs.set_params("project_dir", "pedicularis")
gbs.set_params("raw_fastq_path", "./ipsimdata/gbs_example_R1_.fastq.gz")
gbs.set_params("barcodes_path", "./ipsimdata/gbs_example_barcodes.txt")
gbs.set_params('filter_adapters', 2)
gbs.set_params('datatype', 'gbs')
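
The same five parameters are set again for the rad dataset later in this notebook, so they can be built once and applied in a loop. This is a sketch only: `simdata_params`, `apply_params`, and `RecordingAssembly` are hypothetical names, and the path pattern assumes the single-end `gbs` and `rad` files shown in the listing above. The only ipyrad call it relies on is `set_params(name, value)`, as used in the cell above.

```python
def simdata_params(datatype, project_dir="pedicularis", data_dir="./ipsimdata"):
    """Build the parameter list used above for one single-end dataset."""
    prefix = "%s/%s_example" % (data_dir, datatype)
    return [
        ("project_dir", project_dir),
        ("raw_fastq_path", prefix + "_R1_.fastq.gz"),
        ("barcodes_path", prefix + "_barcodes.txt"),
        ("filter_adapters", 2),
        ("datatype", datatype),
    ]


def apply_params(assembly, params):
    """Call assembly.set_params(name, value) for each pair, in order."""
    for name, value in params:
        assembly.set_params(name, value)
    return assembly


class RecordingAssembly(object):
    """Hypothetical stand-in for an Assembly; it only records the calls."""

    def __init__(self):
        self.calls = []

    def set_params(self, name, value):
        self.calls.append((name, value))


demo = apply_params(RecordingAssembly(), simdata_params("gbs"))
for name, value in demo.calls:
    print(name, value)

# With ipyrad imported as `ip`, the real call would look like:
#   apply_params(ip.Assembly("rad"), simdata_params("rad"))
```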

In [8]:
gbs.get_params()

0   assembly_name               gbs                                          
1   project_dir                 ./pedicularis                                
2   raw_fastq_path              ./ipsimdata/gbs_example_R1_.fastq.gz         
3   barcodes_path               ./ipsimdata/gbs_example_barcodes.txt         
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    gbs                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                               

In [9]:
print(gbs.dirs)

fastqs : 
edits : 
clusts : 
consens : 
outfiles : 



In [10]:
gbs.run("1")

Assembly: gbs
[####################] 100%  sorting reads         | 0:00:04 | s1 | 
[####################] 100%  writing/compressing   | 0:00:00 | s1 | 


In [11]:
print(gbs.stats)

      state  reads_raw
1A_0      1      19862
1B_0      1      20043
1C_0      1      20136
1D_0      1      19966
2E_0      1      20017
2F_0      1      19933
2G_0      1      20030
2H_0      1      20199
3I_0      1      19885
3J_0      1      19822
3K_0      1      19965
3L_0      1      20008
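
A quick check after step 1 is whether the reads are spread evenly across samples. The sketch below copies the `reads_raw` column from the table above into a plain dictionary and summarizes it with the standard library. The table looks like a pandas DataFrame, in which case something like `gbs.stats.reads_raw.describe()` would likely give a similar summary, but that is an assumption and is not used here.

```python
# reads_raw values copied by hand from the stats table above.
reads_raw = {
    "1A_0": 19862, "1B_0": 20043, "1C_0": 20136, "1D_0": 19966,
    "2E_0": 20017, "2F_0": 19933, "2G_0": 20030, "2H_0": 20199,
    "3I_0": 19885, "3J_0": 19822, "3K_0": 19965, "3L_0": 20008,
}

total = sum(reads_raw.values())
mean = total / float(len(reads_raw))
lowest = min(reads_raw, key=reads_raw.get)
highest = max(reads_raw, key=reads_raw.get)

print("samples: %d" % len(reads_raw))
print("total reads: %d" % total)
print("mean reads per sample: %.1f" % mean)
print("fewest reads: %s (%d)" % (lowest, reads_raw[lowest]))
print("most reads: %s (%d)" % (highest, reads_raw[highest]))
```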


In [13]:
gbs.run("234567", show_cluster=True, force=True)

host compute node: [4 cores] on dyn-160-39-171-6.dyn.columbia.edu
Assembly: gbs
[####################] 100%  processing reads      | 0:00:11 | s2 | 
[####################] 100%  dereplicating         | 0:00:00 | s3 | 
[####################] 100%  clustering            | 0:00:01 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:43 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:04 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:27 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[####################] 100%  clustering across     | 0:00:03 | s6 | 
[####################] 
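
The string passed to `run()` lists the steps to execute, so `"234567"` continues from a sample that has finished step 1. A small sketch of deriving that string from the `state` column follows. `steps_after` is a hypothetical helper, and it assumes that `state` records the last completed per-sample step.

```python
def steps_after(state, last=7):
    """Return the step string for the steps that follow `state`."""
    if not 0 <= state <= last:
        raise ValueError("state must be between 0 and %d" % last)
    return "".join(str(step) for step in range(state + 1, last + 1))


print(steps_after(1))  # steps remaining for samples in state 1
print(steps_after(0))  # a fresh assembly needs every step
```

With the assembly above, the call would then look like `gbs.run(steps_after(1), force=True)`.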

In [14]:
# Repeat the same process for the rad dataset

rad = ip.Assembly("rad")

rad.set_params("project_dir", "pedicularis")
rad.set_params("raw_fastq_path", "./ipsimdata/rad_example_R1_.fastq.gz")
rad.set_params("barcodes_path", "./ipsimdata/rad_example_barcodes.txt")
rad.set_params('filter_adapters', 2)
rad.set_params('datatype', 'rad')

rad.get_params()

rad.run("1234567", show_cluster=True)

New Assembly: rad
0   assembly_name               rad                                          
1   project_dir                 ./pedicularis                                
2   raw_fastq_path              ./ipsimdata/rad_example_R1_.fastq.gz         
3   barcodes_path               ./ipsimdata/rad_example_barcodes.txt         
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6             

### Print the final assembly stats

In [15]:
print(gbs.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19862                19862            1000              1000   
1B_0      6      20043                20043            1000              1000   
1C_0      6      20136                20136            1000              1000   
1D_0      6      19966                19966            1000              1000   
2E_0      6      20017                20017            1000              1000   
2F_0      6      19933                19933            1000              1000   
2G_0      6      20030                20030            1000              1000   
2H_0      6      20199                20198            1000              1000   
3I_0      6      19885                19885            1000              1000   
3J_0      6      19822                19822            1000              1000   
3K_0      6      19965                19965            1000              1000   
3L_0      6      20008      

In [16]:
print(rad.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19862                19862            1000              1000   
1B_0      6      20043                20043            1000              1000   
1C_0      6      20136                20136            1000              1000   
1D_0      6      19966                19966            1000              1000   
2E_0      6      20017                20017            1000              1000   
2F_0      6      19933                19933            1000              1000   
2G_0      6      20030                20030            1000              1000   
2H_0      6      20199                20198            1000              1000   
3I_0      6      19885                19885            1000              1000   
3J_0      6      19822                19822            1000              1000   
3K_0      6      19965                19965            1000              1000   
3L_0      6      20008      
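
The two tables above print the same counts, and in both almost every read passes the step 2 filter. The sketch below makes that explicit by comparing `reads_raw` with `reads_passed_filter` for the rows that are fully visible; the `3L_0` row is cut off in the output, so it is left out rather than guessed.

```python
# (sample, reads_raw, reads_passed_filter) copied from the tables above.
rows = [
    ("1A_0", 19862, 19862), ("1B_0", 20043, 20043), ("1C_0", 20136, 20136),
    ("1D_0", 19966, 19966), ("2E_0", 20017, 20017), ("2F_0", 19933, 19933),
    ("2G_0", 20030, 20030), ("2H_0", 20199, 20198), ("3I_0", 19885, 19885),
    ("3J_0", 19822, 19822), ("3K_0", 19965, 19965),
]

# Samples that lost at least one read to filtering, with the number lost.
lost = {name: raw - passed for name, raw, passed in rows if raw != passed}

total_raw = sum(raw for _name, raw, _passed in rows)
total_passed = sum(passed for _name, _raw, passed in rows)

print("samples with filtered reads:", lost)
print("fraction passing: %.6f" % (total_passed / float(total_raw)))
```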

### Show the location of your assembled output files

In [17]:
ls ~/PDSB/12-parallel-genomics/notebooks/pedicularis

gbs.json         gbs_consens/     gbs_outfiles/    rad_clust_0.85/  rad_fastqs/
gbs_across/      gbs_edits/       rad.json         rad_consens/     rad_outfiles/
gbs_clust_0.85/  gbs_fastqs/      rad_across/      rad_edits/

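The project directory holds one JSON file and several output directories per assembly, all sharing the assembly name as a prefix. A minimal sketch of collecting them per assembly with the standard library is below. `summarize_project` is a hypothetical helper, and it assumes assembly names contain no underscore, which holds for `gbs` and `rad`.

```python
import os


def summarize_project(project_dir):
    """Map each assembly name to its JSON file and output directories."""
    summary = {}
    for entry in sorted(os.listdir(project_dir)):
        path = os.path.join(project_dir, entry)
        if os.path.isfile(path) and entry.endswith(".json"):
            name = entry[:-len(".json")]
            summary.setdefault(name, {"json": None, "dirs": []})["json"] = entry
        elif os.path.isdir(path) and "_" in entry:
            # Directory names look like "<assembly>_<kind>", e.g. gbs_outfiles.
            name = entry.split("_", 1)[0]
            summary.setdefault(name, {"json": None, "dirs": []})["dirs"].append(entry)
    return summary


# Only runs when the project directory from this notebook is present.
if os.path.isdir("pedicularis"):
    for name, found in sorted(summarize_project("pedicularis").items()):
        print(name, found["json"], found["dirs"])
```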