### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and to set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs and/or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [1]:
%%bash
export PATH=/rigel/home/$USER/miniconda3/bin:$PATH
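
Each `%%bash` cell runs in its own subprocess, so an `export` made there does not carry over to later cells or to the Python kernel. If the kernel itself needs the Miniconda directory on its `PATH`, one option is to set it from Python. The following is a minimal sketch; the helper name `prepend_to_path` is hypothetical and not part of ipyrad.

```python
import os

def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed.

    `env` defaults to os.environ; pass a plain dict to try it out safely.
    """
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]
```

For the directory used above, the call would be `prepend_to_path(os.path.expandvars("/rigel/home/$USER/miniconda3/bin"))`.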

In [3]:
import ipyrad as ip
import ipyparallel as ipp

### Download the data
You will probably want to move the data to your scratch directory. You can run this code here to download it, or from a terminal. 

In [4]:
%%bash

## move into the directory where the data should be downloaded
cd /rigel/home/ngs2116/w4050/users/ngs2116

## The curl command needs a capital O, not a zero
#curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
#tar -xvzf ipsimdata.tar.gz
#ls ipsimdata/
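
The same download can also be done from Python rather than bash. The sketch below uses only the standard library and the URL from the cell above; the function names `extract_tarball` and `fetch_ipsimdata` are hypothetical helpers, not part of ipyrad.

```python
import os
import tarfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

URL = "https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz"

def extract_tarball(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the sorted member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest_dir)
    return names

def fetch_ipsimdata(dest_dir, url=URL):
    """Download the archive into dest_dir (skipped if already present) and unpack it."""
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(archive_path):
        urlretrieve(url, archive_path)
    return extract_tarball(archive_path, dest_dir)
```

Calling `fetch_ipsimdata` with your scratch directory as `dest_dir` downloads and unpacks the archive there and returns the names of the extracted members.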

### Connect to an ipcluster instance
Make sure to submit an sbatch job that starts ipcluster before connecting to it. Here I've named the job "MPI60".

In [5]:
## connect to the client
ipyclient = ipp.Client()

## use ipyrad to print cluster info
ip.cluster_info(ipyclient)

host compute node: [24 cores] on node163
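
Engines started by a batch job can take a little while to register, so it may help to wait until enough of them are visible before running the assembly. The sketch below is one way to do that; `wait_for_engines` is a hypothetical helper, and it assumes only that the client exposes an `ids` attribute listing the registered engines, as `ipyparallel.Client` does.

```python
import time

def wait_for_engines(client, minimum=1, timeout=60, poll=1.0):
    """Return the engine ids once at least `minimum` engines have registered.

    Raises RuntimeError if fewer than `minimum` are visible after `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        ids = list(client.ids)
        if len(ids) >= minimum:
            return ids
        if time.time() >= deadline:
            raise RuntimeError(
                "only %d of %d engines registered after %s seconds"
                % (len(ids), minimum, timeout)
            )
        time.sleep(poll)
```

With the client created above, `wait_for_engines(ipyclient, minimum=24)` would block until all 24 cores reported by `ip.cluster_info` are available, or raise after the timeout.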


### Assemble the dataset from step 1 to step 7

I've chosen to use the pairddrad example dataset. First I need to create an Assembly object and then set the parameters. 

In [6]:
data2 = ip.Assembly("data1")  # create an Assembly object named "data1"

## Then I need to set/modify the parameters
data2.set_params('project_dir', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad')
data2.set_params('barcodes_path', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/pairddrad_example_barcodes.txt')
data2.set_params('raw_fastq_path', '/rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/*.gz')
data2.set_params('datatype', 'pairddrad')
data2.set_params('filter_adapters', 2)
data2.get_params() # prints the parameters to the screen

New Assembly: data1
0   assembly_name               data1                                        
1   project_dir                 /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad
2   raw_fastq_path              /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/*.gz
3   barcodes_path               /rigel/edu/w4050/users/ngs2116/ipsimdata/pairddrad/pairddrad_example_barcodes.txt
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    pairddrad                                    
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                      
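
A mistyped path may only surface once step 1 starts reading files, so it can be worth checking the two input paths before calling `run`. The sketch below uses only the standard library; `check_inputs` is a hypothetical helper, not an ipyrad function.

```python
import glob
import os

def check_inputs(barcodes_path, raw_fastq_glob):
    """Return a list of problems found; an empty list means both paths look usable."""
    problems = []
    if not os.path.isfile(barcodes_path):
        problems.append("barcodes file not found: %s" % barcodes_path)
    if not glob.glob(raw_fastq_glob):
        problems.append("no files match: %s" % raw_fastq_glob)
    return problems
```

Pass it the same strings given to `set_params` for `barcodes_path` and `raw_fastq_path`, and proceed only if the returned list is empty.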

In [7]:
data2.run("1234567", ipyclient=ipyclient)  # run steps 1 through 7
data2.samples  # dict mapping sample names to Sample objects

Assembly: data1
[####################] 100%  sorting reads         | 0:00:13 | s1 | 
[####################] 100%  writing/compressing   | 0:00:02 | s1 | 
[####################] 100%  processing reads      | 0:00:04 | s2 | 
[####################] 100%  dereplicating         | 0:00:00 | s3 | 
[####################] 100%  clustering            | 0:00:00 | s3 | 
[####################] 100%  building clusters     | 0:00:01 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:00:09 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:02 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:05 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[#################

{'1A_0': <ipyrad.core.sample.Sample at 0x2aaaf66c58d0>,
 '1B_0': <ipyrad.core.sample.Sample at 0x2aaaf66b3b10>,
 '1C_0': <ipyrad.core.sample.Sample at 0x2aaaf66d5f90>,
 '1D_0': <ipyrad.core.sample.Sample at 0x2aaaf66d5d50>,
 '2E_0': <ipyrad.core.sample.Sample at 0x2aaaf66afb50>,
 '2F_0': <ipyrad.core.sample.Sample at 0x2aaaf66d9c10>,
 '2G_0': <ipyrad.core.sample.Sample at 0x2aaaf66af690>,
 '2H_0': <ipyrad.core.sample.Sample at 0x2aaaf66c5410>,
 '3I_0': <ipyrad.core.sample.Sample at 0x2aaaf66b3d10>,
 '3J_0': <ipyrad.core.sample.Sample at 0x2aaaf66afc50>,
 '3K_0': <ipyrad.core.sample.Sample at 0x2aaaf66ca710>,
 '3L_0': <ipyrad.core.sample.Sample at 0x2aaaf66c5e50>}

### Print the final assembly stats

In [8]:
print(data2.stats)  # per-sample summary: state, read counts, and cluster counts

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      19835                19835            1000              1000   
1B_0      6      20071                20071            1000              1000   
1C_0      6      19969                19969            1000              1000   
1D_0      6      20082                20082            1000              1000   
2E_0      6      20004                20004            1000              1000   
2F_0      6      19899                19899            1000              1000   
2G_0      6      19928                19928            1001              1000   
2H_0      6      20110                20110            1000              1000   
3I_0      6      20078                20078            1000              1000   
3J_0      6      19965                19965            1000              1000   
3K_0      6      19846                19846            1000              1000   
3L_0      6      20025      
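
One quick check on these numbers is the fraction of raw reads that passed filtering in each sample. The sketch below works on plain dictionaries, using three rows copied from the table above; `fraction_passed` is a hypothetical helper.

```python
def fraction_passed(rows):
    """Map each sample name to reads_passed_filter / reads_raw."""
    return {
        name: float(row["reads_passed_filter"]) / row["reads_raw"]
        for name, row in rows.items()
    }

# Three rows copied from the stats table above.
rows = {
    "1A_0": {"reads_raw": 19835, "reads_passed_filter": 19835},
    "1B_0": {"reads_raw": 20071, "reads_passed_filter": 20071},
    "1C_0": {"reads_raw": 19969, "reads_passed_filter": 19969},
}
fractions = fraction_passed(rows)
```

Assuming `data2.stats` is a pandas DataFrame, as the printed layout suggests, `data2.stats.to_dict(orient="index")` should produce a dictionary of this shape for all samples.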

### Show the location of your assembled output files

In [10]:
%%bash 
cd /rigel/home/ngs2116/w4050/users/ngs2116/ipsimdata/pairddrad/data1_outfiles
ls

data1.hdf5
data1.loci
data1.phy
data1.snps.map
data1.snps.phy
data1_stats.txt
data1.vcf
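
To get a quick count of assembled loci from the output, one option is to read `data1.loci` directly. The sketch below assumes that each locus block in a `.loci` file ends with a separator line beginning with `//`; check that against your own file before relying on it. `count_loci` is a hypothetical helper, not an ipyrad function.

```python
def count_loci(path):
    """Count locus separator lines (those starting with '//') in a .loci file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("//"))
```

Pass it the full path to `data1.loci` in the `data1_outfiles` directory shown above.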
