### Assignment: assemble an ipyrad example data set

Follow the instructions here: http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. You will need to download the dataset as instructed below. This dataset is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and to set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [None]:
# source activate py27
# conda install ipyrad -c pyrad
# sbatch /rigel/home/nk2777/w4050/users/nk2777/jupyter-one-hour-edu.sbatch
# cat ./outputs/slurm-6280137-notebook.out
# in bash $ ssh -N -L 9014:10.43.4.161:9014 habanero
# in browser $ localhost:9014

In [1]:
import ipyrad as ip
import ipyparallel as ipp



In [2]:
## print the version of ipyrad
print(ip.__version__)

0.7.23


### Download the data
You will probably want to move the data to your scratch directory. You can run this code here to download it, or from a terminal. 

In [3]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

./ipsimdata/
./ipsimdata/pairgbs_example_R2_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_barcodes.txt
./ipsimdata/rad_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa
./ipsimdata/pairgbs_example_R1_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
./ipsimdata/rad_example_genome.fa.fai
./ipsimdata/pairddrad_example_R2_.fastq.gz
./ipsimdata/pairddrad_example_genome.fa.sma
./ipsimdata/pairddrad_example_genome.fa.fai
./ipsimdata/pairgbs_wmerge_example_genome.fa
./ipsimdata/pairddrad_wmerge_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa.smi
./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
./ipsimdata/rad_example_genome.fa.smi
./ipsimdata/gbs_example_barcodes.txt
./ipsimdata/pairgbs_example_barcodes.txt
./ipsimdata/pairddrad_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_barcodes.txt
./ipsimdata/rad_example_barcodes.txt
./ipsimdata/pairddrad_wmerge_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_R2_.fastq.gz
./ipsimdata/gbs_example_R1_.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   147  100   147    0     0    599      0 --:--:-- --:--:-- --:--:--   602
100 11.8M  100 11.8M    0     0  19.4M      0 --:--:-- --:--:-- --:--:-- 19.4M
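
The same download-and-unpack step can also be done from Python instead of a `%%bash` cell. The following is a minimal sketch using only the standard library; the helper names `download` and `extract_archive` are hypothetical (they are not part of ipyrad), and the destination directory is whatever scratch path you choose.

```python
import os
import tarfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# URL taken from the curl command above
IPSIMDATA_URL = "https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz"


def download(url, dest_dir):
    """Download url into dest_dir and return the local file path (hypothetical helper)."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    urlretrieve(url, local_path)
    return local_path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the sorted member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest_dir)
    return names


# Example usage (not run here; replace the path with your scratch directory):
# archive = download(IPSIMDATA_URL, "/path/to/scratch")
# print(extract_archive(archive, "/path/to/scratch"))
```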


In [3]:
ls ipsimdata/

gbs_example_barcodes.txt               pairgbs_example_barcodes.txt
gbs_example_genome.fa                  pairgbs_example_R1_.fastq.gz
gbs_example_R1_.fastq.gz               pairgbs_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_genome.fa.smi        rad_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         rad_example_genome.fa
pairddrad_example_R2_.fastq.gz         rad_example_genome.fa.fai
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.sma
pairddrad_wmerge_example_genome.fa     rad_example_genome.fa.smi
pairddrad_wmerge_example_R1_.fastq.gz  rad_example

In [11]:
%%bash
# Move the pairddrad_wmerge files into a new directory called pairddrad

#pairddrad_wmerge_example_barcodes.txt  
#pairddrad_wmerge_example_genome.fa     
#pairddrad_wmerge_example_R1_.fastq.gz  
#pairddrad_wmerge_example_R2_.fastq.gz
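
The cell above only lists the files to move; the notebook does not show the command that was actually used. One way to do the move from Python is sketched below, assuming the files sit in `ipsimdata/` and should end up in `pairddrad/`. The helper `move_matching` is a hypothetical name introduced for illustration.

```python
import glob
import os
import shutil


def move_matching(src_dir, dest_dir, pattern):
    """Move every file in src_dir whose name matches pattern into dest_dir."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    moved = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        name = os.path.basename(path)
        shutil.move(path, os.path.join(dest_dir, name))
        moved.append(name)
    return moved


# Example usage (paths relative to the notebook's working directory):
# move_matching("ipsimdata", "pairddrad", "pairddrad_wmerge_example_*")
```

The function returns the names it moved, so an empty list signals that the pattern matched nothing.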


In [4]:
ls

ipsimdata/        nb-12.1-parallel-threading.ipynb          pairddrad/
ipsimdata.tar.gz  nb-12.2-multiprocesing-ipyparallel.ipynb
ipyrad_log.txt    nb-12.3-ipyrad-assignment.ipynb


### Connect to an ipcluster instance

In [3]:
%%bash
ipcluster start --n=4 --daemonize


In [4]:
# The run command automatically parallelizes work across all cores of a running ipcluster instance,
# which should be started outside of the notebook.
# If ipcluster is running on the default profile, ipyrad detects and uses it when run is called.
# If you start ipcluster with a specific profile name, connect to it with the ipyparallel
# library and pass the client object to ipyrad.

## connect to the client
ipyclient = ipp.Client()
## use ipyrad to print cluster info
ip.cluster_info(ipyclient)

host compute node: [4 cores] on node037
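
Because `ipcluster start --daemonize` returns right away, the engines may still be registering when the next cell runs. A small polling helper can guard against that. This is a sketch under the assumption that `client` is an `ipyparallel.Client`, whose `ids` attribute lists the currently registered engines; the function name `wait_for_engines` and its defaults are hypothetical.

```python
import time


def wait_for_engines(client, n_engines, timeout=60.0, poll=1.0):
    """Block until client reports n_engines engine ids, or raise after timeout seconds.

    client is assumed to be an ipyparallel.Client (or any object whose
    `ids` attribute lists the currently registered engines).
    """
    deadline = time.time() + timeout
    while True:
        n_found = len(client.ids)
        if n_found >= n_engines:
            return n_found
        if time.time() >= deadline:
            raise RuntimeError(
                "only %d of %d engines registered after %.0f s"
                % (n_found, n_engines, timeout)
            )
        time.sleep(poll)


# Example usage with the cluster started above (requires ipyparallel):
# import ipyparallel as ipp
# ipyclient = ipp.Client()
# wait_for_engines(ipyclient, 4)
```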


In [6]:
ipsimdata = ip.Assembly("ipsimdata")

New Assembly: ipsimdata


### Assemble the dataset from step 1 to step 7

In [8]:
## Set/modify the assembly parameters
## see http://ipyrad.readthedocs.io/parameters.html#filter-adapters

ipsimdata.set_params('project_dir', '/rigel/home/nk2777/w4050/users/nk2777/pairddrad')
ipsimdata.set_params('raw_fastq_path', '/rigel/home/nk2777/w4050/users/nk2777/pairddrad/*.gz')
ipsimdata.set_params('barcodes_path', '/rigel/home/nk2777/w4050/users/nk2777/pairddrad/pairddrad_wmerge_example_barcodes.txt')
ipsimdata.set_params('datatype', 'pairddrad') # paired ddrad type (2 different cutters)
ipsimdata.set_params('filter_adapters', 2) # reads are then searched for the common Illumina adapter, plus the reverse complement of the second cut site (if present), plus the barcode (if present), and that part of the read is trimmed
ipsimdata.set_params("output_formats", "*") # Make all output datatypes

# Print the parameters to the screen
ipsimdata.get_params() 

0   assembly_name               ipsimdata                                    
1   project_dir                 /rigel/home/nk2777/w4050/users/nk2777/pairddrad
2   raw_fastq_path              /rigel/home/nk2777/w4050/users/nk2777/pairddrad/*.gz
3   barcodes_path               /rigel/home/nk2777/w4050/users/nk2777/pairddrad/pairddrad_wmerge_example_barcodes.txt
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    pairddrad                                    
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_ma
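
Before launching the assembly, it can help to confirm that the `raw_fastq_path` glob matches something and that the barcodes file exists, since a wrong path would otherwise only surface once step 1 starts. The check below is a minimal sketch using the standard library; `check_inputs` is a hypothetical helper, not an ipyrad function.

```python
import glob
import os


def check_inputs(raw_fastq_glob, barcodes_path):
    """Return the sorted fastq files matching the glob after basic sanity checks."""
    fastqs = sorted(glob.glob(raw_fastq_glob))
    if not fastqs:
        raise IOError("no files match raw_fastq_path: %s" % raw_fastq_glob)
    if not os.path.isfile(barcodes_path):
        raise IOError("barcodes file not found: %s" % barcodes_path)
    return fastqs


# Example usage with the paths set above:
# check_inputs(
#     "/rigel/home/nk2777/w4050/users/nk2777/pairddrad/*.gz",
#     "/rigel/home/nk2777/w4050/users/nk2777/pairddrad/pairddrad_wmerge_example_barcodes.txt",
# )
```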

In [9]:
ipsimdata.run("1234567", ipyclient=ipyclient) #runs all steps
ipsimdata.samples #lists the name of the filtered samples

Assembly: ipsimdata
[####################] 100%  sorting reads         | 0:00:17 | s1 | 
[####################] 100%  writing/compressing   | 0:00:07 | s1 | 
[####################] 100%  processing reads      | 0:00:24 | s2 | 
[####################] 100%  dereplicating         | 0:00:04 | s3 | 
[####################] 100%  clustering            | 0:00:14 | s3 | 
[####################] 100%  building clusters     | 0:00:00 | s3 | 
[####################] 100%  chunking              | 0:00:00 | s3 | 
[####################] 100%  aligning              | 0:02:13 | s3 | 
[####################] 100%  concatenating         | 0:00:00 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:16 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:01:04 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[#############

{'1A_0': <ipyrad.core.sample.Sample at 0x2aaaf625ec10>,
 '1B_0': <ipyrad.core.sample.Sample at 0x2aaaf625ef90>,
 '1C_0': <ipyrad.core.sample.Sample at 0x2aaaf6234d50>,
 '1D_0': <ipyrad.core.sample.Sample at 0x2aaaf6234890>,
 '2E_0': <ipyrad.core.sample.Sample at 0x2aaaf624af50>,
 '2F_0': <ipyrad.core.sample.Sample at 0x2aaaf625b410>,
 '2G_0': <ipyrad.core.sample.Sample at 0x2aaaf6224a10>,
 '2H_0': <ipyrad.core.sample.Sample at 0x2aaaf6238b10>,
 '3I_0': <ipyrad.core.sample.Sample at 0x2aaaf625b9d0>,
 '3J_0': <ipyrad.core.sample.Sample at 0x2aaaf624ae50>,
 '3K_0': <ipyrad.core.sample.Sample at 0x2aaaf624a790>,
 '3L_0': <ipyrad.core.sample.Sample at 0x2aaaf625bf90>}

### Print the final assembly stats

In [10]:
print(ipsimdata.stats)

      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      6      20040                20040            1000              1000   
1B_0      6      19982                19982            1001              1000   
1C_0      6      20105                20105            1000              1000   
1D_0      6      20172                20172            1001              1000   
2E_0      6      20082                20082            1000              1000   
2F_0      6      20082                20082            1000              1000   
2G_0      6      20095                20095            1000              1000   
2H_0      6      20005                20005            1000              1000   
3I_0      6      19824                19824            1000              1000   
3J_0      6      20100                20100            1000              1000   
3K_0      6      20076                20076            1000              1000   
3L_0      6      19932      
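
A quick way to read the table is to compare `reads_passed_filter` with `reads_raw` for each sample. The sketch below copies those two columns from the rows printed above into a plain dictionary so it runs without ipyrad; the `3L_0` row is cut off in the output, so it is left out rather than guessed. The helper `retention` is a hypothetical name. With the live assembly object, the same ratio could presumably be computed directly on the columns of `ipsimdata.stats`.

```python
# (reads_raw, reads_passed_filter) copied from the rows printed above;
# the 3L_0 row is truncated in the output, so it is omitted.
reads = {
    "1A_0": (20040, 20040),
    "1B_0": (19982, 19982),
    "1C_0": (20105, 20105),
    "1D_0": (20172, 20172),
    "2E_0": (20082, 20082),
    "2F_0": (20082, 20082),
    "2G_0": (20095, 20095),
    "2H_0": (20005, 20005),
    "3I_0": (19824, 19824),
    "3J_0": (20100, 20100),
    "3K_0": (20076, 20076),
}


def retention(reads):
    """Fraction of raw reads that passed the filter, per sample."""
    return {name: float(passed) / raw for name, (raw, passed) in reads.items()}


fractions = retention(reads)
# Sample with the fewest raw reads among the rows shown
lowest = min(reads, key=lambda name: reads[name][0])
print("%s has the fewest raw reads: %d" % (lowest, reads[lowest][0]))
print("all reads retained: %s" % all(frac == 1.0 for frac in fractions.values()))
```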

### Show the location of your assembled output files

In [12]:
%%bash 
cd /rigel/home/nk2777/w4050/users/nk2777/pairddrad/ipsimdata_outfiles
ls

ipsimdata.alleles.loci
ipsimdata.geno
ipsimdata.gphocs
ipsimdata.hdf5
ipsimdata.loci
ipsimdata.nex
ipsimdata.phy
ipsimdata.snps.map
ipsimdata.snps.phy
ipsimdata_stats.txt
ipsimdata.str
ipsimdata.u.geno
ipsimdata.u.snps.phy
ipsimdata.ustr
ipsimdata.vcf
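
The output directory can also be located and listed from Python. This sketch assumes the naming pattern seen in the path above, `<project_dir>/<assembly_name>_outfiles`; that pattern is inferred from this notebook rather than taken from the ipyrad documentation, and `list_outfiles` is a hypothetical helper.

```python
import os


def list_outfiles(project_dir, assembly_name):
    """Return the assumed output directory and a sorted list of the files in it."""
    outdir = os.path.join(project_dir, assembly_name + "_outfiles")
    return outdir, sorted(os.listdir(outdir))


# Example usage with the project directory set earlier:
# outdir, files = list_outfiles("/rigel/home/nk2777/w4050/users/nk2777/pairddrad", "ipsimdata")
# print(outdir)
# print("\n".join(files))
```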
