# Installing and Intro to iPyrad

**20170512**
Here, I'm documenting installation and version number, going through the walkthrough provided on the iPyrad website, and taking notes from looking through documentation.


## Installing iPyrad

**20170511**

I just installed iPyrad v0.6.20 using this [installation tutorial](http://ipyrad.readthedocs.io/installation.html). I first installed Miniconda2 and then used conda to install iPyrad:

```
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
```
and then:

```
conda install -c ipyrad ipyrad
```

If I want to update my version of iPyrad, I can use:

```
conda update -c ipyrad ipyrad 
```


In [3]:
!ipyrad -v

ipyrad 0.6.20


In [1]:
!ipyrad

usage: ipyrad [-h] [-v] [-r] [-f] [-q] [-d] [-n new] [-p params]
              [-b [branch [branch ...]]] [-m [merge [merge ...]]] [-s steps]
              [-c cores] [-t threading] [--MPI] [--preview] [--ipcluster]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -r, --results         show results summary for Assembly in params.txt and
                        exit
  -f, --force           force overwrite of existing data
  -q, --quiet           do not print to stderror or stdout.
  -d, --debug           print lots more info to ipyrad_log.txt.
  -n new                create new file 'params-{new}.txt' in current
                        directory
  -p params             path to params file for Assembly:
                        params-{assembly_name}.txt
  -b [branch [branch ...]]
                        create a new branch of the Assembly as
                        params-{branc

## Introductory Tutorial

The folks who made iPyrad provide an introductory tutorial [here](http://ipyrad.readthedocs.io/tutorial_intro_cli.html). I'm going to use it to get the hang of the program.

#### Get the tutorial data
Navigate to the directory where you want to store the data. It will create a new folder with the data there.

In [8]:
cd /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/

/mnt/hgfs/SHARED_FOLDER/Learn_iPyrad


In [9]:
!curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   147  100   147    0     0    471      0 --:--:-- --:--:-- --:--:--   471
100 11.8M  100 11.8M    0     0  8433k      0  0:00:01  0:00:01 --:--:-- 46.6M


In [11]:
!tar -xvzf ipsimdata.tar.gz

./ipsimdata/
./ipsimdata/pairgbs_example_R2_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_barcodes.txt
./ipsimdata/rad_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa
./ipsimdata/pairgbs_example_R1_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
./ipsimdata/rad_example_genome.fa.fai
./ipsimdata/pairddrad_example_R2_.fastq.gz
./ipsimdata/pairddrad_example_genome.fa.sma
./ipsimdata/pairddrad_example_genome.fa.fai
./ipsimdata/pairgbs_wmerge_example_genome.fa
./ipsimdata/pairddrad_wmerge_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa.smi
./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
./ipsimdata/rad_example_genome.fa.smi
./ipsimdata/gbs_example_barcodes.txt
./ipsimdata/pairgbs_example_barcodes.txt
./ipsimdata/pairddrad_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_barcodes.txt
./ipsimdata/rad_example_barcodes.txt
./ipsimdata/pairddrad_wmerge_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_R2_.fastq.gz
./ipsimdata/gbs_example_R1_.fastq.gz

In [12]:
ls ipsimdata/

gbs_example_barcodes.txt*               pairgbs_example_barcodes.txt*
gbs_example_genome.fa*                  pairgbs_example_R1_.fastq.gz*
gbs_example_R1_.fastq.gz*               pairgbs_example_R2_.fastq.gz*
pairddrad_example_barcodes.txt*         pairgbs_wmerge_example_barcodes.txt*
pairddrad_example_genome.fa*            pairgbs_wmerge_example_genome.fa*
pairddrad_example_genome.fa.fai*        pairgbs_wmerge_example_R1_.fastq.gz*
pairddrad_example_genome.fa.sma*        pairgbs_wmerge_example_R2_.fastq.gz*
pairddrad_example_genome.fa.smi*        rad_example_barcodes.txt*
pairddrad_example_R1_.fastq.gz*         rad_example_genome.fa*
pairddrad_example_R2_.fastq.gz*         rad_example_genome.fa.fai*
pairddrad_wmerge_example_ba

In [13]:
# look at first three reads
!gunzip -c ./ipsimdata/rad_example_R1_.fastq.gz | head -n 12

@lane1_locus0_2G_0_0 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_1 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCCACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_2 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

gzip: stdout: Broken pipe
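
To make the four-line FASTQ layout concrete (header, sequence, `+`, quality), here is a small Python sketch that groups lines into records. The function name `parse_fastq` is my own and is not part of ipyrad; the example record is the first read shown above, shortened to its first 30 bases.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a trailing partial record)
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: %r" % header)
        yield header[1:], seq, qual


# First read from the output above, shortened to 30 bases for display
example = [
    "@lane1_locus0_2G_0_0 1:N:0:",
    "CTCCAATCCTGCAGTTTAACTGTTCAAGTT",
    "+",
    "B" * 30,
]

for header, seq, qual in parse_fastq(example):
    print(header, len(seq), len(qual))
```

Each record keeps the sequence and quality strings the same length, which is a handy check when reading files by hand.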


In [15]:
# look at barcodes file
!cat ./ipsimdata/rad_example_barcodes.txt

1A_0	CATCATCAT
1B_0	CCAGTGATA
1C_0	TGGCCTAGT
1D_0	GGGAAAAAC
2E_0	GTGGATATC
2F_0	AGAGCCGAG
2G_0	CTCCAATCC
2H_0	CTCACTGCA
3I_0	GGCGCATAC
3J_0	CCTTATGTC
3K_0	ACGTGTGTG
3L_0	TTACTAACA
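
The barcodes file maps each sample name to the barcode at the start of its reads. As a rough sketch of what demultiplexing does with it (my own simplified version with hypothetical function names, not ipyrad's code, and using exact barcode matches only), the first read shown earlier can be assigned to its sample like this:

```python
def read_barcodes(text):
    """Map barcode -> sample name from whitespace-separated 'name barcode' lines."""
    barcodes = {}
    for line in text.splitlines():
        if line.strip():
            name, barcode = line.split()
            barcodes[barcode] = name
    return barcodes


def assign_sample(read, barcodes):
    """Return (sample, read_without_barcode), or (None, read) if no barcode matches."""
    for barcode, name in barcodes.items():
        if read.startswith(barcode):
            return name, read[len(barcode):]
    return None, read


# Three of the twelve barcode lines from the file above
barcode_text = "1A_0\tCATCATCAT\n2G_0\tCTCCAATCC\n3L_0\tTTACTAACA\n"
barcodes = read_barcodes(barcode_text)

# Start of the first read in rad_example_R1_.fastq.gz
sample, trimmed = assign_sample("CTCCAATCCTGCAGTTTAACTGTTCAAGTT", barcodes)
print(sample, trimmed[:5])  # 2G_0 TGCAG
```

After the barcode is removed, the read starts with the restriction overhang `TGCAG` listed in the params file.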


#### Make params file

In [16]:
!ipyrad -n iptest


  New file 'params-iptest.txt' created in /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad



In [17]:
!cat params-iptest.txt

------- ipyrad params file (v.0.6.20)-------------------------------------------
iptest                         ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut

Change the params file to include the path to the raw files (``[2] raw_fastq_path``) and the path to the barcodes file (``[3] barcodes_path``). I did this in Atom:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Notebooks/images_for_notebooks/change_params.png?raw=true)
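
The same edit could be scripted instead of done in a text editor. This is only a sketch under my own assumptions: `set_param` is a hypothetical helper, the padding width is a guess at the file layout, and the two paths are the tutorial files I believe this walkthrough uses. It works on the text of the params lines shown above.

```python
def set_param(params_text, index, value):
    """Return params_text with the value for parameter [index] replaced."""
    tag = "## [%d] " % index  # closing bracket keeps [1] from matching [10]
    out = []
    for line in params_text.splitlines():
        if tag in line:
            comment = line[line.index("##"):]
            line = value.ljust(30) + " " + comment
        out.append(line)
    return "\n".join(out) + "\n"


# Two lines copied from params-iptest.txt (values still empty)
params_text = (
    "                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files\n"
    "                               ## [3] [barcodes_path]: Location of barcodes file\n"
)

# Assumed paths: the simulated RAD data and its barcodes file
edited = set_param(params_text, 2, "./ipsimdata/rad_example_R1_.fastq.gz")
edited = set_param(edited, 3, "./ipsimdata/rad_example_barcodes.txt")
print(edited)
```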

#### Step 1: Demultiplex

There are 4 main parts to this step: (1) it creates a new Assembly called iptest, since this is our first time running any steps for the named assembly; (2) it launches a number of parallel Engines (by default, the number of available CPUs on your machine); (3) it performs the step functions, which here means sorting the data and writing the outputs; and (4) it saves the Assembly.

In [19]:
!ipyrad -p params-iptest.txt -s 1


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 1: Demultiplexing fastq data to Samples
    Skipping: 12 Samples already found in Assembly iptest.
    (can overwrite with force argument)    



In [21]:
# look at results of this step in fastqs output directory
!ls iptest_fastqs

1A_0_R1_.fastq.gz  2F_0_R1_.fastq.gz  3K_0_R1_.fastq.gz
1B_0_R1_.fastq.gz  2G_0_R1_.fastq.gz  3L_0_R1_.fastq.gz
1C_0_R1_.fastq.gz  2H_0_R1_.fastq.gz  s1_demultiplex_stats.txt
1D_0_R1_.fastq.gz  3I_0_R1_.fastq.gz
2E_0_R1_.fastq.gz  3J_0_R1_.fastq.gz


In [22]:
# -r prints a results summary for the steps run so far (e.g. raw read counts)
!ipyrad -p params-iptest.txt -r


Summary stats of Assembly iptest
------------------------------------------------
      state  reads_raw
1A_0      1      19862
1B_0      1      20043
1C_0      1      20136
1D_0      1      19966
2E_0      1      20017
2F_0      1      19933
2G_0      1      20030
2H_0      1      20199
3I_0      1      19885
3J_0      1      19822
3K_0      1      19965
3L_0      1      20008


Full stats files
------------------------------------------------
step 1: ./iptest_fastqs/s1_demultiplex_stats.txt
step 2: None
step 3: None
step 4: None
step 5: None
step 6: None
step 7: None




In [24]:
# to see full stats from step one, try
!cat ./iptest_fastqs/s1_demultiplex_stats.txt

raw_file                               total_reads    cut_found  bar_matched
rad_example_R1_.fastq                       239866       239866       239866

sample_name                            total_reads
1A_0                                         19862
1B_0                                         20043
1C_0                                         20136
1D_0                                         19966
2E_0                                         20017
2F_0                                         19933
2G_0                                         20030
2H_0                                         20199
3I_0                                         19885
3J_0                                         19822
3K_0                                         19965
3L_0                                         20008

sample_name                               true_bar       obs_bar     N_records
1A_0                                     CATCATCAT     CATCATCAT         19862
1B_0
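
A quick sanity check on these numbers: since every read had a matching barcode (`bar_matched` equals `total_reads`), the per-sample counts should add up to the total. A small sketch that parses the per-sample block above (the function name is my own):

```python
def parse_sample_counts(text):
    """Return {sample_name: total_reads} from 'name count' lines, skipping headers."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


stats_block = """sample_name                            total_reads
1A_0                                         19862
1B_0                                         20043
1C_0                                         20136
1D_0                                         19966
2E_0                                         20017
2F_0                                         19933
2G_0                                         20030
2H_0                                         20199
3I_0                                         19885
3J_0                                         19822
3K_0                                         19965
3L_0                                         20008
"""

counts = parse_sample_counts(stats_block)
total = sum(counts.values())
print(len(counts), total)  # 12 239866
```

The sum matches the 239866 total reads reported for the raw file.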

#### Step 2: Filter reads

This step filters reads based on quality scores, and can be used to detect Illumina adapters in your reads, which is a common concern with any NGS data set, and especially so for homebrew type library preparations. Here the filter is set to the default value of 0 (zero), meaning it filters only based on quality scores of base calls, and does not search for adapters. This is a good option if your data are already pre-filtered. The resulting filtered files from step 2 are written to a new directory called ``iptest_edits/``.
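
To get a feel for what filtering on quality scores means, here is a minimal sketch of a Phred-based read filter. The thresholds (offset 33, minimum quality 20, at most 5 low-quality bases) are my assumptions for illustration; the real values come from the params file, and this is not ipyrad's implementation.

```python
def count_low_quality(qual, offset=33, min_q=20):
    """Count bases whose Phred score (ASCII code minus offset) is below min_q."""
    return sum(1 for ch in qual if ord(ch) - offset < min_q)


def passes_filter(qual, max_low_qual_bases=5, offset=33, min_q=20):
    """Keep a read only if it has no more than max_low_qual_bases low-quality bases."""
    return count_low_quality(qual, offset, min_q) <= max_low_qual_bases


# The simulated reads have quality 'B' everywhere: ord('B') - 33 = 33
good = "B" * 100
# A made-up read with ten very low quality calls ('#' is Q2 at offset 33)
bad = "#" * 10 + "B" * 90

print(passes_filter(good), passes_filter(bad))  # True False
```

With uniformly high quality strings like the simulated data, no reads are removed, which is consistent with `reads_passed_filter` equaling `reads_raw` in the results for this step.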

In [26]:
# filter only based on quality scores (will not search for adapters)
!ipyrad -p params-iptest.txt -s 2


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 2: Filtering reads 
  [####################] 100%  processing reads      | 0:00:08  



In [28]:
# view the output of filtering
!ls iptest_edits/

1A_0.trimmed_R1_.fastq.gz  2F_0.trimmed_R1_.fastq.gz  3K_0.trimmed_R1_.fastq.gz
1B_0.trimmed_R1_.fastq.gz  2G_0.trimmed_R1_.fastq.gz  3L_0.trimmed_R1_.fastq.gz
1C_0.trimmed_R1_.fastq.gz  2H_0.trimmed_R1_.fastq.gz  s2_rawedit_stats.txt
1D_0.trimmed_R1_.fastq.gz  3I_0.trimmed_R1_.fastq.gz
2E_0.trimmed_R1_.fastq.gz  3J_0.trimmed_R1_.fastq.gz


In [29]:
# Get current stats including # raw reads and # reads after filtering.
!ipyrad -p params-iptest.txt -r


Summary stats of Assembly iptest
------------------------------------------------
      state  reads_raw  reads_passed_filter
1A_0      2      19862                19862
1B_0      2      20043                20043
1C_0      2      20136                20136
1D_0      2      19966                19966
2E_0      2      20017                20017
2F_0      2      19933                19933
2G_0      2      20030                20030
2H_0      2      20199                20199
3I_0      2      19885                19885
3J_0      2      19822                19822
3K_0      2      19965                19965
3L_0      2      20008                20008


Full stats files
------------------------------------------------
step 1: ./iptest_fastqs/s1_demultiplex_stats.txt
step 2: ./iptest_edits/s2_rawedit_stats.txt
step 3: None
step 4: None
step 5: None
step 6: None
step 7: None




The tutorial provides a command to look at the filtered reads, but I couldn't get it to work (see the error below). The likely cause is a file name mismatch: on my computer the filtered files are gzipped and named like ``1A_0.trimmed_R1_.fastq.gz``, while the tutorial's command points to a name without "trimmed".

In [34]:
!head -n 12 ./iptest_edits/1A_0_R1_.fastq

head: cannot open './iptest_edits/1A_0_R1_.fastq' for reading: No such file or directory
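
Based on the file names listed in ``iptest_edits/`` above, a version that should work is to read the gzipped, "trimmed" file directly. This is a sketch I have not run against the tutorial output; `head_gz` is my own helper name.

```python
import gzip
import os
from itertools import islice


def head_gz(path, n=12):
    """Return the first n lines of a gzipped text file, without newlines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# File name taken from the `ls iptest_edits/` listing above
path = "./iptest_edits/1A_0.trimmed_R1_.fastq.gz"
if os.path.exists(path):
    for line in head_gz(path, 12):  # 12 lines = first three reads
        print(line)
else:
    print("not found:", path)
```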


### Step 3: Clustering within samples
Step 3 de-replicates and then clusters reads within each sample by the set clustering threshold and then writes the clusters to new files in a directory called ``iptest_clust_0.85/``. Intuitively we are trying to identify all the reads that map to the same locus within each sample. The clustering threshold specifies the minimum percentage of sequence similarity below which we will consider two reads to have come from different loci.

The true name of this output directory will be dictated by the value you set for the clust_threshold parameter in the params file.
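
To see what a clustering threshold of 0.85 means in practice, here is a toy sketch that compares equal-length reads position by position. Real clustering aligns reads and is more involved than this; the reads and function names below are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("this toy version needs equal-length reads")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def same_locus(a, b, clust_threshold=0.85):
    """Treat two reads as the same locus if similarity meets the threshold."""
    return identity(a, b) >= clust_threshold


read_a = "TGCAGCAAGATCACGGCGGA"
read_b = "TGCAGCAAGATCACGGCGGT"  # 1 mismatch in 20 bases
read_c = "TGCAGCTTGATCTTGGCGGA"  # 4 mismatches in 20 bases

print(identity(read_a, read_b), same_locus(read_a, read_b))
print(identity(read_a, read_c), same_locus(read_a, read_c))
```

Here `read_b` clusters with `read_a`, while `read_c` falls below 0.85 and would be treated as a different locus.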

In [35]:
# cluster based on % similarity in params file, here default = .85
!ipyrad -p params-iptest.txt -s 3


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:00:01  
  [####################] 100%  clustering            | 0:00:03  
  [####################] 100%  building clusters     | 0:00:00  
  [####################] 100%  chunking              | 0:00:00  
  [####################] 100%  aligning              | 0:00:26  
  [####################] 100%  concatenating         | 0:00:00  



In [36]:
# check out the stats output
!ipyrad -p params-iptest.txt -r


Summary stats of Assembly iptest
------------------------------------------------
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth
1A_0      3      19862                19862            1000              1000
1B_0      3      20043                20043            1000              1000
1C_0      3      20136                20136            1000              1000
1D_0      3      19966                19966            1000              1000
2E_0      3      20017                20017            1000              1000
2F_0      3      19933                19933            1000              1000
2G_0      3      20030                20030            1000              1000
2H_0      3      20199                20199            1000              1000
3I_0      3      19885                19885            1000              1000
3J_0      3      19822                19822            1000              1000
3K_0      3      19965                19965            1000

The aligned clusters found during this step are now located in ./iptest_clust_0.85/. You can get a feel for what this looks like by examining a portion of one of the files using the command below.

In [37]:
## Same as above, gunzip -c means print to the screen and
## `head -n 28` means just show me the first 28 lines. If
## you're interested in what more of the loci look like
## you can increase the number of lines you ask head for,
## e.g. ... | head -n 100
!gunzip -c iptest_clust_0.85/1A_0.clustS.gz | head -n 28

lane1_locus100_1A_0_0;size=18;*
TGCAGCAAGATCACGGCGGACAGAACCGCCCCTTTTCTTGTTGCTGGTTAACTTCACGCCGTCATGGTTAGTGGTCAGGCTTTACAGGTCC
lane1_locus100_1A_0_14;size=1;+
TGCAGCAAGATCACGGCGGACAGAACCGCCCCTTTGCTTGTTGCTGGTTAACTTCTCGCCGTCATGGTTAGTGGTCAGGCTTTACAGGTCC
lane1_locus100_1A_0_2;size=1;+
TGCAGCAAGATCACGGCGGACAGAACCGCCCCTTTTCTTGTTGCTGGTTAACTTCACGCCGTCATGGTTAGTGGACAGGCTTTACAGGTCC
//
//
lane1_locus10_1A_0_1;size=17;*
TGCAGACGTGATGGCTATCCATAGAGCGCCTTATTTGCGGGTACGTACACCCATCATGTGCCCCGAAGACTGGGTGATTTCGCCCGAGCGT
lane1_locus10_1A_0_0;size=1;+
TGCAGACGTGATGGCTATCCATAGAGCGCCTTATTTGCTGGTACGTACACCCATCATGTGCCCCGAAGACTGGGTGATTTCGCCCGAGCGT
lane1_locus10_1A_0_2;size=1;+
TGCAGACGTGATGGCTATCCATAGAGCGCCTTATTTGCGGGTACGTACACCCATCATGTGCCCCGAAGACTGGGTCATTTCGCCCGAGCGT
lane1_locus10_1A_0_8;size=1;+
TGCAGACGTGATGGCTATCCATAGAGCGCCTTATTTGCGGGTACGTGCACCCATCATGTGCCCCGAAGACTGGGTGATTTCGCCCGAGCGT
//
//
lane1_locus102_1A_0_0;size=16;*
TGCAGGAGCGGTGCTACGTCGTGATGCCTTCACCCTCAATGTTAATAGCAGGTCAGGGCCTAATTTGATAATGACTA

``size=`` refers to the number of reads with that exact sequence. So everything between one ``// //`` pair and the next is a single cluster: all the different sequences that were clustered together, and the number of times each appeared.
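
Here is a small sketch that reads this cluster layout and adds up the ``size=`` values to get the read depth of each cluster. The parser is my own and assumes the layout shown above (header line, sequence line, clusters separated by ``//`` lines); sequences are shortened for display.

```python
def parse_clusters(lines):
    """Return a list of clusters, each a list of (name, size, sequence) tuples."""
    clusters, current, header = [], [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("//"):
            if current:  # a pair of // lines closes one cluster
                clusters.append(current)
                current = []
        elif header is None:
            header = line
        else:
            name, size = header.split(";")[:2]
            current.append((name, int(size.split("=")[1]), line))
            header = None
    if current:
        clusters.append(current)
    return clusters


# Headers copied from the output above; sequences shortened to 20 bases
text = """lane1_locus100_1A_0_0;size=18;*
TGCAGCAAGATCACGGCGGA
lane1_locus100_1A_0_14;size=1;+
TGCAGCAAGATCACGGCGGA
lane1_locus100_1A_0_2;size=1;+
TGCAGCAAGATCACGGCGGA
//
//
lane1_locus10_1A_0_1;size=17;*
TGCAGACGTGATGGCTATCC
lane1_locus10_1A_0_0;size=1;+
TGCAGACGTGATGGCTATCC
lane1_locus10_1A_0_2;size=1;+
TGCAGACGTGATGGCTATCC
lane1_locus10_1A_0_8;size=1;+
TGCAGACGTGATGGCTATCC
//
//
"""

clusters = parse_clusters(text.splitlines())
depths = [sum(size for _, size, _ in cluster) for cluster in clusters]
print(depths)  # [20, 20]
```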

### Step 4: Joint estimation of heterozygosity and error rate

Step 4 jointly estimates sequencing error rate and heterozygosity to disentangle which reads are “real” and which are sequencing error. We need to know which reads are “real” because in diploid organisms there are a maximum of 2 alleles at any given locus. If we look at the raw data and there are 5 or 10 different “alleles”, with 2 of them at very high frequency and the rest singletons, then this gives us evidence that the 2 high frequency alleles are good reads and the rest are probably not. This step is pretty straightforward, and pretty fast.
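
The intuition can be put in numbers with a simple binomial calculation. This is not the model ipyrad fits in this step, just my own illustration: given an error rate, how likely is it that sequencing error alone produces k or more mismatching reads at a site covered by n reads? The error rate 0.00076 is the mean value ipyrad reports for this data set in step 5.

```python
from math import comb


def prob_at_least(k, n, error_rate):
    """P(X >= k) for X ~ Binomial(n, error_rate)."""
    return sum(
        comb(n, i) * error_rate**i * (1 - error_rate) ** (n - i)
        for i in range(k, n + 1)
    )


n_reads = 20
error_rate = 0.00076  # mean error reported in the step 5 output

# One odd read out of 20 is plausible as error; nine out of 20 is not
print(prob_at_least(1, n_reads, error_rate))
print(prob_at_least(9, n_reads, error_rate))
```

A singleton variant is easily explained by error, while a variant seen in close to half the reads is far better explained by a heterozygous site.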

In [38]:
!ipyrad -p params-iptest.txt -s 4


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 4: Joint estimation of error rate and heterozygosity
  [####################] 100%  inferring [H, E]      | 0:00:09  



This step does not produce new output files, only a stats file with the estimated heterozygosity and error rate parameters. You can also invoke the ``-r`` flag to see the estimated values.

In [39]:
!ipyrad -p params-iptest.txt -r


Summary stats of Assembly iptest
------------------------------------------------
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      4      19862                19862            1000              1000   
1B_0      4      20043                20043            1000              1000   
1C_0      4      20136                20136            1000              1000   
1D_0      4      19966                19966            1000              1000   
2E_0      4      20017                20017            1000              1000   
2F_0      4      19933                19933            1000              1000   
2G_0      4      20030                20030            1000              1000   
2H_0      4      20199                20199            1000              1000   
3I_0      4      19885                19885            1000              1000   
3J_0      4      19822                19822            1000              1000   
3K_0      4      19965    

### Step 5: Consensus base calls

Step 5 uses the inferred error rate and heterozygosity to call the consensus of sequences within each cluster. Here we are identifying what we believe to be the real haplotypes at each locus within each sample.

In [40]:
!ipyrad -p params-iptest.txt -s 5


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 5: Consensus base calling 
  Mean error  [0.00076 sd=0.00001]
  Mean hetero [0.00192 sd=0.00012]
  [####################] 100%  calculating depths    | 0:00:01  
  [####################] 100%  chunking clusters     | 0:00:00  
  [####################] 100%  consens calling       | 0:00:23  



In [41]:
# Again we can ask for the results with the -r flag
!ipyrad -p params-iptest.txt -r


Summary stats of Assembly iptest
------------------------------------------------
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  \
1A_0      5      19862                19862            1000              1000   
1B_0      5      20043                20043            1000              1000   
1C_0      5      20136                20136            1000              1000   
1D_0      5      19966                19966            1000              1000   
2E_0      5      20017                20017            1000              1000   
2F_0      5      19933                19933            1000              1000   
2G_0      5      20030                20030            1000              1000   
2H_0      5      20199                20199            1000              1000   
3I_0      5      19885                19885            1000              1000   
3J_0      5      19822                19822            1000              1000   
3K_0      5      19965    

And here the important information is the number of ``reads_consens``. This is the number of “good” reads within each sample that we’ll send on to the next step. As you’ll see in examples with empirical data, this is often a step where many reads are filtered out of the data set. If no reads were filtered, then the number of reads_consens should be equal to the number of clusters_hidepth.

I think "good" reads here refers to what we'd think of as retained loci. Number of good consensus sequences = number of good loci?

This step creates a new directory called ``./iptest_consens`` to store the consensus sequences for each sample. We can use our trusty head command to look at the output.

In [42]:
!gunzip -c iptest_consens/1A_0.consens.gz | head

>1A_0_0
TGCAGCAAGATCACGGCGGACAGAACCGCCCCTTTTCTTGTTGCTGGTTAACTTCACGCCGTCATGGTTAGTGGTCAGGCTTTACAGGTCC
>1A_0_1
TGCAGACGTGATGGCTATCCATAGAGCGCCTTATTTGCGGGTACGTACACCCATCATGTGCCCCGAAGACTGGGTGATTTCGCCCGAGCGT
>1A_0_2
TGCAGGAGCGGTGCTACGTCGTGATGCCTTCACCCTCAATGTTAATAGCAGGTCAGGGCCTAATTTGATAATGACTAACCTGTAAACCTAC
>1A_0_3
TGCAGCACGAAGTTAACTTCAACCCTCGCCACTACTGCGTACAAAACGCGAGAGGTCTCCATGAGTGTCGCATCCGCTGTGGTGGTTACAT
>1A_0_4
TGCAGTGCTCCCGATATGCATGAACACTTGGAGGGAGGACTTCTCCGTGAGTTCAAGGCTCAGTCGGCAAGACGTCAATGAATATGCGGTC

gzip: stdout: Broken pipe


You can see that each cluster within a sample has been reduced to one consensus sequence. Heterozygous sites are represented by IUPAC ambiguity codes (the tutorial points to a K in its sequence 1A_0_1; the five sequences printed above happen to contain none), and all other sites are homozygous.
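
For reference, the two-base IUPAC ambiguity codes are standard, so heterozygous sites in a consensus sequence can be located with a few lines of Python. The helper name and the example sequence are made up for illustration.

```python
# Standard IUPAC codes for two-base ambiguities
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def het_sites(consensus):
    """Return (position, code, bases) for each heterozygous site in a consensus."""
    return [
        (i, base, IUPAC[base])
        for i, base in enumerate(consensus.upper())
        if base in IUPAC
    ]


# Made-up consensus with one G/T heterozygous site
print(het_sites("TGCAGKACGT"))  # [(5, 'K', 'GT')]
```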

### Step 6: Cluster across samples

Step 6 clusters consensus sequences across samples. Now that we have good estimates for haplotypes within samples we can try to identify similar sequences at each locus between samples. We use the same clustering threshold as step 3 to identify sequences between samples that are probably sampled from the same locus, based on sequence similarity.

In [43]:
!ipyrad -p params-iptest.txt -s 6


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 6: Clustering at 0.85 similarity across 12 samples
  [####################] 100%  concat/shuffle input  | 0:00:01  
  [####################] 100%  clustering across     | 0:00:00  
  [####################] 100%  building clusters     | 0:00:00  
  [####################] 100%  aligning clusters     | 0:00:08  
  [####################] 100%  database indels       | 0:00:00  
  [####################] 100%  indexing clusters     | 0:00:01  
  [####################] 100%  building database     | 0:00:00  



This step differs from previous steps in that we are no longer applying a function to each Sample individually, but instead we apply it to all Samples collectively. Our end result is a map telling us which loci cluster together from which Samples. This output is stored as an HDF5 database (iptest_test.hdf5), which is not easily human readable. It contains the clustered sequence data, depth information, phased alleles, and other metadata. If you really want to see the contents of the database see the h5py cookbook recipe.

There is no simple way to summarize the outcome of step 6, so the output of ``ipyrad -p params-iptest.txt -r`` and the content of the ``./iptest_consens/s6_cluster_stats.txt`` stats file are uniquely uninteresting.

### Step 7: Filter and write output files

The final step is to filter the data and write output files in many convenient file formats. First we apply filters for maximum number of indels per locus, max heterozygosity per locus, max number of snps per locus, and minimum number of samples per locus. All these filters are configurable in the params file and you are encouraged to explore different settings, but the defaults are quite good and quite conservative.
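
As an example of what one of these filters does, here is a toy version of the minimum-samples-per-locus filter. The data are made up, the function name is mine, and the default of 4 is an assumption on my part; the actual setting should be checked in the params file.

```python
def filter_min_samples(loci, min_samples_locus=4):
    """Keep only loci that were recovered in at least min_samples_locus samples."""
    return {
        locus: samples
        for locus, samples in loci.items()
        if len(samples) >= min_samples_locus
    }


# Made-up loci: locus id -> samples with data at that locus
loci = {
    "locus_0": ["1A_0", "1B_0", "1C_0", "1D_0", "2E_0"],
    "locus_1": ["1A_0", "2E_0"],
    "locus_2": ["1A_0", "1B_0", "3I_0", "3J_0"],
}

kept = filter_min_samples(loci)
print(sorted(kept))  # ['locus_0', 'locus_2']
```

Raising the threshold trades number of loci for less missing data per locus.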

In [44]:
!ipyrad -p params-iptest.txt -s 7


 -------------------------------------------------------------
  ipyrad [v.0.6.20]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: iptest
  from saved path: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest.json
  host compute node: [1 cores] on ubuntu

  Step 7: Filter and write output files for 12 Samples
  [####################] 100%  filtering loci        | 0:00:04  
  [####################] 100%  building loci/stats   | 0:00:00  
  [####################] 100%  building vcf file     | 0:00:03  
  [####################] 100%  writing vcf file      | 0:00:00  
  [####################] 100%  building arrays       | 0:00:00  
  [####################] 100%  writing outfiles      | 0:00:00  
  Outfiles written to: /mnt/hgfs/SHARED_FOLDER/Learn_iPyrad/iptest_outfiles



A new directory is created called iptest_outfiles. This directory contains all the output files specified in the params file. The default is to create all supported output files which include PHYLIP(.phy), NEXUS(.nex), EIGENSTRAT’s genotype format(.geno), STRUCTURE(.str), as well as many others. Explore some of these files below.

### Final stats file

The final stats output file contains a large number of statistics telling you why some loci were filtered from the data set, how many loci were recovered per sample, how many loci were shared among some number of samples, and how much variation is present in the data. Check out the results file. (It's unclear from the tutorial how to access the stats file, so I'm going to poke around in the directories.)

In [47]:
# ha, I found it!
!head -n 100 iptest_outfiles/iptest_stats.txt



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci               1000              0           1000
filtered_by_rm_duplicates               0              0           1000
filtered_by_max_indels                  0              0           1000
filtered_by_max_snps                    0              0           1000
filtered_by_max_shared_het              0              0           1000
filtered_by_min_sample                  0              0           1000
filtered_by_max_alleles                 0              0           1000
total_filtered_loci                  1000              0           1000


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

      sample_coverage
1A_0             1000
1B_0             1000
1C_0             1000
1D_0             1000
2E_0  

Check out the .loci output (this is ipyrad's native internal format). Each locus is delineated by a pair of forward slashes //. Within each locus are all the reads from each sample that clustered together. The line containing the // also indicates the positions of SNPs in the sequence. See if you can spot the SNPs in the first locus. Many more output formats are available. See the section on output formats for more information.

In [51]:
!head -n 24 iptest_outfiles/iptest.loci

1A_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
1B_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
1C_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
1D_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
2E_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTACACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
2F_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTACACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
2G_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
2H_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTACACGTGGCAGGACCTGTTGGAAAAACACGCAGAGAGGA
3I_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
3J_0     TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA
3K_0     TTAGTTCTTAGACTATTCGTT
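
Spotting SNPs by eye is hard, so here is a short sketch that finds the variable columns in an alignment. The function name is my own, and the three sequences are copied from the first locus printed above (samples 1A_0, 2E_0 and 2H_0).

```python
def variable_sites(seqs):
    """Return (position, bases) for alignment columns with more than one base."""
    sites = []
    for i, column in enumerate(zip(*seqs.values())):
        bases = set(column) - {"N", "-"}  # ignore missing data and gaps
        if len(bases) > 1:
            sites.append((i, "".join(sorted(bases))))
    return sites


# Three rows from the first locus in iptest.loci
seqs = {
    "1A_0": "TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTATACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA",
    "2E_0": "TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTACACGTGGCAGGACCTGTTGGAAAAACACGCAGAAAGGA",
    "2H_0": "TTAGTTCTTAGACTATTCGTTAACTCGAGGCGAGTGCCCTAAGCGCTACACGTGGCAGGACCTGTTGGAAAAACACGCAGAGAGGA",
}

for position, bases in variable_sites(seqs):
    print(position, bases)
```

For these three samples it reports two variable sites: a C/T site shared by 2E_0 and 2H_0, and an A/G site found only in 2H_0.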

That's the end of the tutorial!

## Learning about branching assemblies

From iPyrad's site on [Assembly Outline](http://ipyrad.readthedocs.io/outline.html#branching-workflow).

The reason we separate assembly into distinct steps is to create a modular workflow that can be easily restarted if interrupted, and can be easily branched at different points to create assemblies under different combinations of parameter settings.

If you want to run all steps at once, you make your params file and call:
```
ipyrad -p params-data1.txt -s 1234567
```

Branching is where you use the same output files from one step to move forward using multiple parameter sets. Branching does not create hard copies of existing data files, and so is not an “expensive” action in terms of disk space or time. We suggest it be used quite liberally whenever applying a new set of parameters. The code for branching is only a tad more complicated:

```
## create an initial Assembly and params file, here called 'data1'
>>> ipyrad -n data1

## edit the params file for data1 with your text editor
## ... editing params-data1.txt

## run steps 1-2 with the params file
>>> ipyrad -p params-data1.txt -s 12

## create a new branch of 'data1' before step3, here called 'data2'.
>>> ipyrad -p params-data1.txt -b data2

## edit the params file for data2 using a text editor
## ... editing params-data2.txt

## run steps 3-7 for both assemblies
>>> ipyrad -p params-data1.txt -s 34567
>>> ipyrad -p params-data2.txt -s 34567

```
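
If I end up scripting this, the branching workflow could be wrapped in a few lines of Python. This sketch only builds and prints the commands, using the ``-p``, ``-s``, ``-b`` and ``-f`` flags from the help output; `ipyrad_cmd` is a hypothetical helper of mine, and the params files still need editing by hand between steps.

```python
def ipyrad_cmd(params, steps=None, branch=None, force=False):
    """Build an ipyrad command line as a list of arguments."""
    cmd = ["ipyrad", "-p", params]
    if steps:
        cmd += ["-s", steps]
    if branch:
        cmd += ["-b", branch]
    if force:
        cmd.append("-f")
    return cmd


workflow = [
    ipyrad_cmd("params-data1.txt", steps="12"),     # run steps 1-2
    ipyrad_cmd("params-data1.txt", branch="data2"),  # branch before step 3
    ipyrad_cmd("params-data1.txt", steps="34567"),  # finish data1
    ipyrad_cmd("params-data2.txt", steps="34567"),  # finish data2
]

for cmd in workflow:
    print(" ".join(cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```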

## Python API

Apparently I don't have a full understanding of what an API is, but lookie here, you can run iPyrad within Python. That sounds like it could be useful at times!

```
## import ipyrad
import ipyrad as ip

## create an Assembly and modify some parameter settings
data1 = ip.Assembly("data1")
data1.set_params("project_dir", "example")
data1.set_params("raw_fastq_path", "data/*.fastq")
data1.set_params("barcodes_path", "barcodes.txt")

## run steps 1-2
data1.run("12")

## create a new branch of this Assembly named data2
## and change some parameter settings
data2 = data1.branch("data2")
data2.set_params("clust_threshold", 0.90)

## run steps 3-7 for the two Assemblies
data1.run("34567")
data2.run("34567")
```