### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [1]:
%%bash
export PATH=/rigel/home/$USER/miniconda3/bin:$PATH

In [2]:
import ipyrad as ip
import ipyparallel as ipp


### Download the data
You will probably want to store the data in your scratch directory. You can download it by running the code below in this notebook, or by running the same commands from a terminal.

In [3]:
%%bash

## move to the directory where the data should be downloaded
cd /rigel/home/ngs2116/w4050/users/ngs2116

## The curl command needs a capital O, not a zero
#curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
#tar -xvzf ipsimdata.tar.gz
#ls ipsimdata/
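
If you prefer to stay in Python, the same download and extraction can be sketched with the standard library. This is a minimal sketch that assumes a Python 3 kernel (the cells in this notebook appear to run under Python 2). `fetch_and_extract` and its `dest_dir` argument are hypothetical names; the URL is the one from the curl command above.

```python
import os
import tarfile
import urllib.request

# URL copied from the curl command in the bash cell
URL = "https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz"


def fetch_and_extract(url, dest_dir):
    """Download a .tar.gz archive into dest_dir and unpack it there.

    Returns the sorted directory listing of dest_dir after extraction.
    """
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, os.path.basename(url))
    # skip the download if the archive is already present
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

Calling `fetch_and_extract(URL, dest_dir)` with `dest_dir` set to your own scratch directory should leave an `ipsimdata/` folder next to the downloaded archive.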

### Connect to an ipcluster instance
Before connecting, submit an sbatch job that starts the ipcluster. Here I've named the job "MPI60", which is also the profile name passed to the client below.

In [4]:
## connect to the client
ipyclient = ipp.Client(profile="MPI60")

## print how many engines are connected
print(len(ipyclient), 'cores')

## or, use ipyrad to print cluster info
ip.cluster_info(ipyclient)

(20, 'cores')
host compute node: [20 cores] on node241
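
Engines can take a little while to register after the sbatch job starts, so the count printed above may be lower than expected if you connect too early. A minimal polling sketch is shown here; `wait_for_engines` is a hypothetical helper, not part of ipyrad or ipyparallel, and it assumes that `len(ipyclient)` reflects newly registered engines each time it is called.

```python
import time


def wait_for_engines(count_engines, expected, timeout=60.0, poll=1.0):
    """Poll count_engines() until it reports at least `expected` engines.

    Returns the engine count once it is large enough; raises RuntimeError
    if the timeout passes first.
    """
    deadline = time.time() + timeout
    while True:
        n = count_engines()
        if n >= expected:
            return n
        if time.time() >= deadline:
            raise RuntimeError(
                "only {} of {} engines connected after {}s".format(
                    n, expected, timeout))
        time.sleep(poll)

# Hypothetical usage in this notebook:
# wait_for_engines(lambda: len(ipyclient), expected=20)
```

Passing a callable rather than the client itself keeps the helper independent of the ipyparallel version in use.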


### Assemble the dataset from step 1 to step 7

I've chosen to use the pairddrad example dataset. First I need to create an Assembly object and then set the parameters. 

In [19]:
## create an Assembly object for the data
data2 = ip.Assembly("data1")

## set/modify the parameters
data2.set_params('project_dir', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad')
data2.set_params('barcodes_path', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/pairddrad_example_barcodes.txt')
data2.set_params('raw_fastq_path', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/*.gz')
data2.set_params('datatype', 'pairddrad')
data2.set_params('filter_adapters', 2)

## print the parameters to the screen
data2.get_params()

New Assembly: data1
0   assembly_name               data1                                        
1   project_dir                 /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad
2   raw_fastq_path              /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/*.gz
3   barcodes_path               /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/pairddrad_example_barcodes.txt
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    pairddrad                                    
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                      
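
Because the barcodes file and the raw fastq files are given as absolute paths, a typo in either one only surfaces once the assembly runs. A small pre-flight check can be sketched as follows; `check_inputs` is a hypothetical helper, and it only tests that the paths exist, not that the files are valid.

```python
import glob
import os


def check_inputs(barcodes_path, raw_fastq_glob):
    """Return a list of problems found; an empty list means both inputs exist."""
    problems = []
    if not os.path.isfile(barcodes_path):
        problems.append("barcodes file not found: " + barcodes_path)
    # raw_fastq_path is a glob pattern such as ".../*.gz"
    if not glob.glob(raw_fastq_glob):
        problems.append("no files match: " + raw_fastq_glob)
    return problems
```

In this notebook it would be called with the same two strings passed to `set_params` for `barcodes_path` and `raw_fastq_path`.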

In [21]:
## run all seven steps on the connected cluster
data2.run("1234567", ipyclient=ipyclient)

## show the Sample objects in the assembly, keyed by sample name
data2.samples

Assembly: data1
[####################] 100%  sorting reads         | 0:00:04 | s1 | 
[####################] 100%  writing/compressing   | 0:00:00 | s1 | 
[####################] 100%  processing reads      | 0:00:03 | s2 | 
[####################] 100%  dereplicating         | 0:00:00 | s3 | 
[####################] 100%  clustering            | 0:00:01 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:10 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:01 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:05 | s5 | 

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below ----

{'1A_0': <ipyrad.core.sample.Sample at 0x2aaaf62fd7d0>,
 '1B_0': <ipyrad.core.sample.Sample at 0x2aaaf6312690>,
 '1C_0': <ipyrad.core.sample.Sample at 0x2aaaf62e2e90>,
 '1D_0': <ipyrad.core.sample.Sample at 0x2aaaf62ec910>,
 '2E_0': <ipyrad.core.sample.Sample at 0x2aaaf630de90>,
 '2F_0': <ipyrad.core.sample.Sample at 0x2aaaf6312a50>,
 '2G_0': <ipyrad.core.sample.Sample at 0x2aaaf635fed0>,
 '2H_0': <ipyrad.core.sample.Sample at 0x2aaaf62e2a10>,
 '3I_0': <ipyrad.core.sample.Sample at 0x2aaaf62fde90>,
 '3J_0': <ipyrad.core.sample.Sample at 0x2aaaf635fe90>,
 '3K_0': <ipyrad.core.sample.Sample at 0x2aaaf6352d10>,
 '3L_0': <ipyrad.core.sample.Sample at 0x2aaaf62fd9d0>}

### Print the final assembly stats

In [22]:
## print per-sample stats (state, read counts, cluster counts)
print(data2.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      5      19835                19835            1000              1000   
1B_0      5      20071                20071            1000              1000   
1C_0      5      19969                19969            1000              1000   
1D_0      5      20082                20082            1000              1000   
2E_0      5      20004                20004            1000              1000   
2F_0      5      19899                19899            1000              1000   
2G_0      5      19928                19928            1001              1000   
2H_0      5      20110                20110            1000              1000   
3I_0      5      20078                20078            1000              1000   
3J_0      5      19965                19965            1000              1000   
3K_0      5      19846                19846            1000              1000   
3L_0      5      20025      
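
Every sample in the table is at state 5, which matches the progress output above: the run printed an error right after the step 5 progress bars. Assuming the `state` column records the last step each sample completed, a short check like the one below lists the samples that still need later steps. `samples_below_state` is a hypothetical helper, and the states are copied by hand from the table.

```python
def samples_below_state(states, target=6):
    """Return {sample: state} for samples whose state is below `target`."""
    return {name: state for name, state in states.items() if state < target}


# States copied from the stats table above. In the notebook the same mapping
# could presumably be built with data2.stats["state"].to_dict() (assumption).
states = {
    "1A_0": 5, "1B_0": 5, "1C_0": 5, "1D_0": 5,
    "2E_0": 5, "2F_0": 5, "2G_0": 5, "2H_0": 5,
    "3I_0": 5, "3J_0": 5, "3K_0": 5, "3L_0": 5,
}

lagging = samples_below_state(states)
print("{} of {} samples have not reached state 6".format(
    len(lagging), len(states)))
```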

### Show the location of your assembled output files

In [25]:
%%bash 
cd /rigel/home/ngs2116/w4050/users/ngs2116/ipsimdata/pairddrad
ls

data1_across
data1_clust_0.85
data1_consens
data1_edits
data1_fastqs
data1.json
data1_s1_demultiplex_stats.txt
pairddrad_example_barcodes.txt
pairddrad_example_genome.fa
pairddrad_example_genome.fa.fai
pairddrad_example_genome.fa.sma
pairddrad_example_genome.fa.smi
pairddrad_example_R1_.fastq.gz
pairddrad_example_R2_.fastq.gz
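
The listing contains the intermediate directories but no final output folder, which is consistent with the error reported during the run. Assuming ipyrad's convention of writing step 7 results to `<assembly_name>_outfiles` inside the project directory, the check below is a minimal sketch for confirming whether final outputs exist; `outfiles_dir` is a hypothetical helper.

```python
import os


def outfiles_dir(project_dir, assembly_name):
    """Return the expected step 7 output directory, or None if it is absent."""
    # "<assembly_name>_outfiles" is assumed from ipyrad's naming convention
    path = os.path.join(project_dir, assembly_name + "_outfiles")
    return path if os.path.isdir(path) else None
```

Here it would be called with the project directory set above and the assembly name `"data1"`; a return value of `None` means step 7 has not written its output yet.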
