### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and to set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [1]:
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp
import toyplot
import toytree



### Download the data
You will probably want to move the data to your scratch directory. You can run the code below to download it, or run the same commands from a terminal.

In [5]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

./ipsimdata/
./ipsimdata/pairgbs_example_R2_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_barcodes.txt
./ipsimdata/rad_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa
./ipsimdata/pairgbs_example_R1_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
./ipsimdata/rad_example_genome.fa.fai
./ipsimdata/pairddrad_example_R2_.fastq.gz
./ipsimdata/pairddrad_example_genome.fa.sma
./ipsimdata/pairddrad_example_genome.fa.fai
./ipsimdata/pairgbs_wmerge_example_genome.fa
./ipsimdata/pairddrad_wmerge_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa.smi
./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
./ipsimdata/rad_example_genome.fa.smi
./ipsimdata/gbs_example_barcodes.txt
./ipsimdata/pairgbs_example_barcodes.txt
./ipsimdata/pairddrad_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_barcodes.txt
./ipsimdata/rad_example_barcodes.txt
./ipsimdata/pairddrad_wmerge_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_R2_.fastq.gz
./ipsimdata/gbs_example_R1_.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   147  100   147    0     0   3340      0 --:--:-- --:--:-- --:--:--  3266
100 11.8M  100 11.8M    0     0  52.9M      0 --:--:-- --:--:-- --:--:-- 52.9M
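
The extraction step can also be done from Python with the standard-library `tarfile` module, which makes it easy to unpack straight into a scratch directory. This is a minimal sketch: `extract_archive` is a hypothetical helper, and the destination path in the usage line is a placeholder for your own scratch space.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the sorted member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest_dir)
    return names


# Example call (placeholder destination, adjust to your scratch directory):
# extract_archive("ipsimdata.tar.gz", "/path/to/your/scratch")
```

The returned list gives the same information as the `tar -xvzf` listing above.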


In [6]:
ls ipsimdata/

gbs_example_barcodes.txt               pairgbs_example_barcodes.txt
gbs_example_genome.fa                  pairgbs_example_R1_.fastq.gz
gbs_example_R1_.fastq.gz               pairgbs_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_genome.fa.smi        rad_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         rad_example_genome.fa
pairddrad_example_R2_.fastq.gz         rad_example_genome.fa.fai
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.sma
pairddrad_wmerge_example_genome.fa     rad_example_genome.fa.smi
pairddrad_wmerge_example_R1_.fastq.gz  rad_example

### Connect to an ipcluster instance

In [None]:
# after tunneling into jupyter-edu-node.sbatch run this in a terminal:
# ipcluster start

In [40]:
ipyclient = ipp.Client()

In [41]:
# use ipyrad to print cluster info
ip.cluster_info(ipyclient)

host compute node: [24 cores] on node302


### Assemble the dataset from step 1 to step 7

In [74]:
data = ip.Assembly("ipsim3")

New Assembly: ipsim3


In [75]:
data.set_params("project_dir", "/rigel/edu/w4050/users/jmz2134/12-parallel-genomics/notebooks/ipsimdata")
data.set_params("sorted_fastq_path", "/rigel/edu/w4050/users/jmz2134/12-parallel-genomics/notebooks/ipsimdata/rad*.gz")
data.set_params("clust_threshold", 0.90)
data.set_params("mindepth_majrule", 10)
data.set_params("mindepth_statistical", 10)
data.set_params("filter_adapters", 2)
data.set_params("output_formats", "*")
data.set_params("datatype", "rad")
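
Before running step 1 it can help to check which files the `sorted_fastq_path` pattern actually matches, because each matched R1 file is loaded as one sample. The following is a minimal sketch using the standard-library `glob` module; `files_matching` is a hypothetical helper and not part of ipyrad.

```python
import glob
import os


def files_matching(directory, pattern):
    """Return the sorted base names of files in `directory` matching the glob `pattern`."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(path) for path in paths)


# Example call against the downloaded directory:
# files_matching("ipsimdata", "rad*.gz")
```

If the list has only one entry, the assembly will contain only one sample.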

In [76]:
# print parameter settings for posterity
data.get_params()

0   assembly_name               ipsim3                                       
1   project_dir                 ./ipsimdata                                  
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path           ./ipsimdata/rad*.gz                          
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        10                                           
12  mindepth_majrule            10                              

In [77]:
data.run("12", ipyclient=ipyclient, show_cluster=True)

host compute node: [24 cores] on node302
Assembly: ipsim3
[####################] 100%  loading reads         | 0:00:00 | s1 | 
[####################] 100%  processing reads      | 0:00:05 | s2 | 


In [78]:
data.run("3", ipyclient=ipyclient, show_cluster=True)

host compute node: [24 cores] on node302
Assembly: ipsim3
[####################] 100%  dereplicating         | 0:00:00 | s3 | 
[####################] 100%  clustering            | 0:00:01 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:11 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 


In [79]:
data.run("456", ipyclient=ipyclient, show_cluster=True)

host compute node: [24 cores] on node302
Assembly: ipsim3
[####################] 100%  inferring [H, E]      | 0:00:06 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:01 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[####################] 100%  clustering across     | 0:00:00 | s6 | 
[####################] 100%  building clusters     | 0:00:00 | s6 | 
[####################] 100%  aligning clusters     | 0:00:00 | s6 | 
[####################] 100%  database indels       | 0:00:00 | s6 | 
[####################] 100%  indexing clusters     | 0:00:01 | s6 | 
[####################] 100%  building database     | 0:00:00 | s6 | 


In [80]:
min4 = data.branch("min4")
min10 = data.branch("min10")

In [81]:
min4.set_params("min_samples_locus", 4)
min10.set_params("min_samples_locus", 10)

In [86]:
min4.run("7")

Assembly: min4
[####################] 100%  filtering loci        | 0:00:03 | s7 | 
[####################] 100%  building loci/stats   | 0:00:00 | s7 | 

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below -------------------------------

    Exception: empty varcounts array. This could be because no samples 
    passed filtering, or it could be because you have overzealous filtering.
    Check the values for `trim_loci` and make sure you are not trimming the
    edge too far
    


In [87]:
min10.run("7")

Assembly: min10
[####################] 100%  filtering loci        | 0:00:03 | s7 | 
[####################] 100%  building loci/stats   | 0:00:00 | s7 | 

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below -------------------------------

    Exception: empty varcounts array. This could be because no samples 
    passed filtering, or it could be because you have overzealous filtering.
    Check the values for `trim_loci` and make sure you are not trimming the
    edge too far
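
One likely reading of this error: the assembly contains a single sample (`rad_example`, per `data.stats`), so no locus can have data for 4 or 10 samples, and the `min_samples_locus` filter removes every locus. The sketch below illustrates that logic only. `filter_loci_by_coverage` is a hypothetical function, the input list is made up, and this is not ipyrad's actual step 7 code.

```python
def filter_loci_by_coverage(samples_per_locus, min_samples_locus):
    """Keep loci that have data for at least `min_samples_locus` samples.

    A simplified illustration of a minimum-sample-coverage filter,
    not ipyrad's implementation.
    """
    return [n for n in samples_per_locus if n >= min_samples_locus]


# Hypothetical single-sample assembly: every locus is covered by exactly 1 sample.
single_sample_loci = [1] * 134
print(len(filter_loci_by_coverage(single_sample_loci, 4)))  # 0
print(len(filter_loci_by_coverage(single_sample_loci, 1)))  # 134
```

If the `rad*.gz` file is multiplexed raw data, one thing to try is setting `raw_fastq_path` and `barcodes_path` instead of `sorted_fastq_path`, so that step 1 demultiplexes it into separate samples. That is an assumption about this dataset, not something the output above confirms.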
    


### Print the final assembly stats

In [88]:
data.stats

| | state | reads_raw | reads_passed_filter | clusters_total | clusters_hidepth | hetero_est | error_est | reads_consens |
|---|---|---|---|---|---|---|---|---|
| rad_example | 6 | 239866 | 239866 | 1579 | 1217 | 0.046 | 0.009 | 134 |


In [89]:
min4.stats_dfs.s7_loci

| | locus_coverage | sum_coverage |
|---|---|---|
| 1 | 0 | 0 |


In [90]:
min10.stats_dfs.s7_loci

| | locus_coverage | sum_coverage |
|---|---|---|
| 1 | 0 | 0 |


### Show the location of your assembled output files

In [91]:
min10.outfiles

alleles : /rigel/edu/w4050/users/jmz2134/12-parallel-genomics/notebooks/ipsimdata/min10_outfiles/min10.alleles.loci
loci : /rigel/edu/w4050/users/jmz2134/12-parallel-genomics/notebooks/ipsimdata/min10_outfiles/min10.loci