### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below; it is different from the one used in the linked tutorial. Be sure to download the data into your scratch space and to set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [5]:
import ipyrad as ip
import ipyparallel as ipp

### Download the data
You will probably want to move the data to your scratch directory. You can run the code below to download it, either from this notebook or from a terminal.
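
The same download can also be done from Python, which makes it easy to unpack straight into a scratch directory. The following is a minimal sketch, assuming Python 3 (for `urllib.request`); the helper names `extract_archive` and `fetch_ipsimdata` are hypothetical and not part of ipyrad, and the scratch path in the usage comment is a placeholder.

```python
import os
import tarfile
import urllib.request

# URL taken from the curl command in this notebook.
DATA_URL = "https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz"


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its sorted member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest_dir)
    return names


def fetch_ipsimdata(scratch_dir, url=DATA_URL):
    """Download the archive into scratch_dir, then unpack it there."""
    os.makedirs(scratch_dir, exist_ok=True)
    archive_path = os.path.join(scratch_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, scratch_dir)


# Example call (placeholder path, replace with your own scratch directory):
#   fetch_ipsimdata("/path/to/your/scratch")
```

Keeping the extraction step in its own function means it can be reused on an archive that was already downloaded with curl.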

In [13]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

./ipsimdata/
./ipsimdata/pairgbs_example_R2_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_barcodes.txt
./ipsimdata/rad_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa
./ipsimdata/pairgbs_example_R1_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
./ipsimdata/rad_example_genome.fa.fai
./ipsimdata/pairddrad_example_R2_.fastq.gz
./ipsimdata/pairddrad_example_genome.fa.sma
./ipsimdata/pairddrad_example_genome.fa.fai
./ipsimdata/pairgbs_wmerge_example_genome.fa
./ipsimdata/pairddrad_wmerge_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa.smi
./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
./ipsimdata/rad_example_genome.fa.smi
./ipsimdata/gbs_example_barcodes.txt
./ipsimdata/pairgbs_example_barcodes.txt
./ipsimdata/pairddrad_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_barcodes.txt
./ipsimdata/rad_example_barcodes.txt
./ipsimdata/pairddrad_wmerge_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_R2_.fastq.gz
./ipsimdata/gbs_example_R1_.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   147  100   147    0     0    835      0 --:--:-- --:--:-- --:--:--   830
100 11.8M  100 11.8M    0     0  10.7M      0  0:00:01  0:00:01 --:--:-- 32.4M


In [14]:
ls ipsimdata/

gbs_example_barcodes.txt               pairgbs_example_barcodes.txt
gbs_example_genome.fa                  pairgbs_example_R1_.fastq.gz
gbs_example_R1_.fastq.gz               pairgbs_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_genome.fa.smi        rad_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         rad_example_genome.fa
pairddrad_example_R2_.fastq.gz         rad_example_genome.fa.fai
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.sma
pairddrad_wmerge_example_genome.fa     rad_example_genome.fa.smi
pairddrad_wmerge_example_R1_.fastq.gz  rad_example_R1_.fast

### Connect to an ipcluster instance

In [16]:
ipyclient = ipp.Client()
ipyclient.ids

[0, 1, 2, 3]
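
Engines can take a few seconds to register after the cluster is started, so `ipyclient.ids` may be empty or incomplete at first. The sketch below polls until the expected number of engines appears. The helper `wait_for_engines` is hypothetical and takes a plain callable, so it makes no assumptions about the ipyparallel API beyond the `ids` attribute used above.

```python
import time


def wait_for_engines(get_ids, expected, timeout=30.0, poll=0.5):
    """Poll get_ids() until at least `expected` engine ids are listed.

    Returns the list of ids, or raises RuntimeError once `timeout`
    seconds have passed without enough engines showing up.
    """
    deadline = time.time() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= expected:
            return ids
        if time.time() >= deadline:
            raise RuntimeError(
                "only %d of %d engines connected" % (len(ids), expected))
        time.sleep(poll)


# With the client created above, this might be called as:
#   wait_for_engines(lambda: ipyclient.ids, expected=4)
```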

In [29]:
data_pairddrad = ip.Assembly("pairddrad")

data_pairddrad.set_params('project_dir', 'ipsimdata/pairddrad/')
data_pairddrad.set_params('barcodes_path', 'ipsimdata/pairddrad/pairddrad_example_barcodes.txt')
data_pairddrad.set_params('raw_fastq_path','ipsimdata/pairddrad/*.gz')
data_pairddrad.set_params('datatype', 'pairddrad')
data_pairddrad.set_params('filter_adapters', 2)
data_pairddrad.get_params() # prints the parameters to the screen

New Assembly: pairddrad
0   assembly_name               pairddrad                                    
1   project_dir                 ./ipsimdata/pairddrad                        
2   raw_fastq_path              ./ipsimdata/pairddrad/*.gz                   
3   barcodes_path               ./ipsimdata/pairddrad/pairddrad_example_barcodes.txt
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    pairddrad                                    
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6
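
The earlier `ls` output shows the example files directly under `ipsimdata/`, while these parameters point to `ipsimdata/pairddrad/`, so the files were presumably moved into that subdirectory first. A quick check that the paths resolve can catch a mistake before step 1 runs. This is a minimal sketch using only the standard library; `check_inputs` is a hypothetical helper, not an ipyrad function.

```python
import glob
import os


def check_inputs(raw_fastq_pattern, barcodes_path):
    """Return the sorted list of files matching raw_fastq_pattern.

    Raises IOError if the barcodes file is missing or the pattern
    matches no files.
    """
    if not os.path.isfile(barcodes_path):
        raise IOError("barcodes file not found: " + barcodes_path)
    fastqs = sorted(glob.glob(raw_fastq_pattern))
    if not fastqs:
        raise IOError("no files match: " + raw_fastq_pattern)
    return fastqs


# With the values passed to set_params above, this might be called as:
#   check_inputs('ipsimdata/pairddrad/*.gz',
#                'ipsimdata/pairddrad/pairddrad_example_barcodes.txt')
```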

### Assemble the dataset from step 1 to step 7

In [30]:
data_pairddrad.run("1234567", ipyclient=ipyclient)

Assembly: pairddrad
[####################] 100%  sorting reads         | 0:00:03 | s1 | 
[####################] 100%  writing/compressing   | 0:00:00 | s1 | 
[####################] 100%  processing reads      | 0:00:05 | s2 | 
[####################] 100%  dereplicating         | 0:00:01 | s3 | 
[####################] 100%  clustering            | 0:00:02 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:24 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:03 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:16 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[#############
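
When a long run fails partway, it helps to know which step broke. The sketch below runs the steps one at a time instead of passing them all at once. It assumes `run` accepts a single-step string such as `"1"` in the same way it accepts `"1234567"` above; `run_stepwise` is a hypothetical wrapper, not part of ipyrad.

```python
def run_stepwise(assembly, steps="1234567", ipyclient=None):
    """Call assembly.run() once per step and return the completed steps.

    If a step raises, a RuntimeError is raised that names the failing
    step and the steps that finished before it.
    """
    completed = []
    for step in steps:
        try:
            assembly.run(step, ipyclient=ipyclient)
        except Exception as err:
            done = "".join(completed) or "none"
            raise RuntimeError(
                "step %s failed (completed: %s): %s" % (step, done, err))
        completed.append(step)
    return completed


# With the objects defined above, this might be called as:
#   run_stepwise(data_pairddrad, "1234567", ipyclient=ipyclient)
```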

### Print the final assembly stats

In [32]:
print(data_pairddrad.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19835                19835            1000              1000   
1B_0      6      20071                20071            1000              1000   
1C_0      6      19969                19969            1000              1000   
1D_0      6      20082                20082            1000              1000   
2E_0      6      20004                20004            1000              1000   
2F_0      6      19899                19899            1000              1000   
2G_0      6      19928                19928            1001              1000   
2H_0      6      20110                20110            1000              1000   
3I_0      6      20078                20078            1000              1000   
3J_0      6      19965                19965            1000              1000   
3K_0      6      19846                19846            1000              1000   
3L_0      6      20025      
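
One simple sanity check on this table is the fraction of raw reads that passed filtering in each sample. The sketch below works on plain dictionaries so it runs without ipyrad. The helper `low_retention_samples` and the 0.9 threshold are illustrative assumptions, and the three rows are copied from the output above.

```python
def low_retention_samples(stats_rows, min_fraction=0.9):
    """Return sorted sample names whose passed/raw read ratio is below min_fraction."""
    flagged = []
    for name, row in stats_rows.items():
        raw = row["reads_raw"]
        passed = row["reads_passed_filter"]
        # Treat a sample with no raw reads as failing the check.
        if raw == 0 or float(passed) / raw < min_fraction:
            flagged.append(name)
    return sorted(flagged)


# Three rows copied from the stats table above.
rows = {
    "1A_0": {"reads_raw": 19835, "reads_passed_filter": 19835},
    "1B_0": {"reads_raw": 20071, "reads_passed_filter": 20071},
    "2G_0": {"reads_raw": 19928, "reads_passed_filter": 19928},
}
print(low_retention_samples(rows))  # prints []
```

If `data_pairddrad.stats` is a pandas DataFrame, as the printed layout suggests, `data_pairddrad.stats.to_dict("index")` should produce a dictionary of this shape.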

### Show the location of your assembled output files

In [33]:
ls ipsimdata/pairddrad

pairddrad_across/                pairddrad_example_genome.fa.sma
pairddrad_clust_0.85/            pairddrad_example_genome.fa.smi
pairddrad_consens/               pairddrad_example_R1_.fastq.gz
pairddrad_edits/                 pairddrad_example_R2_.fastq.gz
pairddrad_example_barcodes.txt   pairddrad_fastqs/
pairddrad_example_genome.fa      pairddrad.json
pairddrad_example_genome.fa.fai  pairddrad_outfiles/
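
To see what the final step wrote, the `pairddrad_outfiles/` directory from the listing above can be walked from Python. This is a minimal standard-library sketch; `list_outputs` is a hypothetical helper, and the file names it reports depend on the ipyrad version and output formats selected.

```python
import os


def list_outputs(outdir):
    """Return (relative_path, size_in_bytes) for every file under outdir, sorted by path."""
    found = []
    for root, _dirs, files in os.walk(outdir):
        for fname in files:
            full = os.path.join(root, fname)
            found.append((os.path.relpath(full, outdir), os.path.getsize(full)))
    return sorted(found)


# With the project directory used above, this might be called as:
#   for path, size in list_outputs("ipsimdata/pairddrad/pairddrad_outfiles"):
#       print(path, size)
```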
