# Assembling Olympia oyster GBS data with *ipyrad*, Silliman (2018)

This is a reproducible notebook detailing the de novo assembly and filtering of genotyping-by-sequencing (GBS) data from [Silliman (2018) (open access)], which examines the population structure and connectivity of the Olympia oyster (*Ostrea lurida*) along the west coast of North America. The assembly took ~4 days to run on a remote workstation with 20 cores. This notebook will run with fewer cores, but it will take longer to complete, and I don't recommend using fewer than 8.

The dataset is composed of single-end 100 bp reads from self-made GBS libraries prepared with the ApeKI enzyme (two libraries were sequenced paired-end, but only the first read is used in the paper). Previously, raw sequencing reads were demultiplexed, all individuals with fewer than 200,000 raw sequencing reads were removed, and replicate sequencing runs were compared to identify the replicate with the greatest number of clusters. Those steps are detailed in the [Demultiplex.ipynb notebook]. All demultiplexed samples with more than 200,000 raw sequencing reads, including replicates, are available on NCBI SRA. Below I include instructions on how to download these from NCBI and select only the replicates that were used in the paper. Alternatively, you can download a folder of pre-filtered files from a public Dropbox (which may be more streamlined).

The figure below shows the sampling locations used in the study.

![Sampling Sites](img/Figure_Phylo_1_rev.jpg)

## Setup (software and data files)

I highly recommend running this assembly on a workstation or cluster that you can access remotely. This way, you don't tie up your computer for days, but you can still view and interact with the notebook on your personal computer. The `ipyrad` documentation site has a [great tutorial](https://ipyrad.readthedocs.io/HPC_Tunnel.html) on how to use SSH tunneling to run and access a Jupyter notebook remotely.
 
If you haven't done so yet, start by installing `ipyrad` with conda (see the ipyrad installation instructions) on the computer where the analysis will run. This is easiest to do in a terminal. The other two packages below are needed only if you intend to use `ipyrad` to download files from NCBI. Then start a `jupyter lab` or `jupyter-notebook` instance on your workstation and follow along with this notebook by copying and executing the code in the cells, adding your own documentation between them in markdown. Feel free to modify parameters to see their effects on the downstream results. You could also download this notebook from GitHub and run it as is.

In [2]:
## conda install ipyrad -c ipyrad
## conda install entrez-direct -c bioconda
## conda install sra-tools -c bioconda

In [None]:
## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print("ipyrad v.{}".format(ip.__version__))
print("ipyparallel v.{}".format(ipp.__version__))
print("numpy v.{}".format(np.__version__))

Start an `ipcluster` instance in a terminal on the computer where the assembly will run. If you are using [jupyter lab](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html) (which I also highly recommend!), all you need to do is open a terminal in the same `jupyter lab` instance as your notebook and type the following command, specifying the number of cores available to you (here, I have a job running with 20 cores).

In [None]:
## ipcluster start --n=20 --ip='*'

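The cell below shows two ways to check that you have connected to the right number of cores: counting the engine IDs on the client, and tallying the engines running on each host.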

In [3]:
## open direct and load-balanced views to the client
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()
print("{} total cores".format(len(ipyclient.ids)))

## confirm that we are connected to one 20-core node
hosts = ipyclient[:].apply_sync(socket.gethostname)
for hostname in set(hosts):
    print("host compute node: [{} cores] on {}"\
          .format(hosts.count(hostname), hostname))

20 total cores
host compute node: [20 cores] on crocorock.uchicago.edu
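
The per-host tally above can be reproduced without a running cluster. The sketch below is only an illustration: `count_engines_per_host` is a hypothetical helper, and the `hosts` list is a stand-in for the list that `apply_sync(socket.gethostname)` returned in the output above.

```python
from collections import Counter

def count_engines_per_host(hosts):
    """Return a dict mapping each hostname to its number of engines."""
    return dict(Counter(hosts))

# Stand-in for the list returned by apply_sync(socket.gethostname)
hosts = ["crocorock.uchicago.edu"] * 20

per_host = count_engines_per_host(hosts)
for hostname, n in sorted(per_host.items()):
    print("host compute node: [{} cores] on {}".format(n, hostname))

# Fail early if fewer engines came up than were requested
assert sum(per_host.values()) == 20
```

An assertion like the last line makes a partially started cluster fail loudly before a multi-day run begins.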


## Download the dataset from NCBI

There are two options for downloading the raw demultiplexed sequencing data. If you are interested in comparing across technical replicates or using the paired-end data available for two of the sequencing libraries, you can download the data from the NCBI Sequence Read Archive (SRA) under accession ID SRP174167 using the `ipyrad` analysis tools wrapper around the SRAtools software. Run the code below to download the fastq data files associated with this study. The data will be saved to the specified directory, which will be created if it does not already exist. If you pass your `ipyclient` to the `.run()` command below, the download will be parallelized.

Alternatively, you can download a folder with only single-end reads and excluding technical replicates from a [public Dropbox folder]. This may be more streamlined for most users, but could be slow as it isn't parallelized.

In [None]:
## download the Olympia oyster data set from NCBI
import ipyrad.analysis as ipa  # the analysis tools are not imported in the cell above
sra = ipa.sratools(accession="PRJNA511386", workdir="OL-all-fastqs")
sra.run(force=True, ipyclient=ipyclient)

## Reading in pre-sorted .fastq.gz files and setting parameters 

In [3]:
OL = ip.Assembly("OL-c85-t10")
## set parameters
OL.set_params("project_dir", "./")
OL.set_params("sorted_fastq_path", "OL-best-fastqs/*.fastq.gz")
OL.set_params("max_barcode_mismatch", 1)
OL.set_params("datatype", "gbs")
OL.set_params("restriction_overhang", ("CWGC"))
OL.set_params("mindepth_statistical","10")
OL.set_params("mindepth_majrule","10")
OL.set_params("clust_threshold", "0.85")
OL.set_params("filter_adapters", "2")
OL.set_params("max_Hs_consens",(10,10))
OL.set_params("max_SNPs_locus",(17,17))
OL.set_params("max_shared_Hs_locus","1.0")
OL.set_params("trim_loci",("0","5","0","0"))
OL.set_params("output_formats",('l','a','v'))

## see/print all parameters
OL.get_params()

New Assembly: OL-c85-t10
0   assembly_name               OL-c85-t10                                   
1   project_dir                 /home/ksilliman/Ostrea_Phylo/Full-c85-run    
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path           ./OL-best-fastqs/*.fastq.gz                  
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    gbs                                          
8   restriction_overhang        ('CWGC', 'CWGC')                             
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        10                                           
12  mindepth_majrule            10     

## Steps 1,2,3

In [None]:
OL.run("1")

In [7]:
#Number of samples
len(OL.samples.keys())

231

In [8]:
OL.run("2")

Assembly: OL-c85-t10
[####################] 100%  processing reads      | 0:56:02 | s2 | 


In [None]:
OL.run("3")

Assembly: OL-c85-t10
[####################] 100%  dereplicating         | 0:00:03 | s3 | 
[###                 ]  15%  clustering            | 1:21:19 | s3 | 

### Filtering out samples with < 15,000 clusters of 10 or more reads

In [12]:
## get list of samples with >15000 high-depth clusters
skeep = OL.stats.index[OL.stats.clusters_hidepth > 15000].tolist()
print(len(skeep))
## print who was excluded
print("excluded samples:\t\tclusters_hidepth")
for name, dat in OL.samples.items():
    if name not in skeep:
        print("  {:<22}\t{}".format(name, int(dat.stats.clusters_hidepth)))

182
excluded samples:		clusters_hidepth
  BC3_12_C4             	5674
  OR3_12_C3             	2352
  CA7_16_C4             	1508
  CA6_15_C4             	5941
  WA1_12_C3             	5673
  BC1_19_C5             	14033
  CA1_17_C1             	9561
  BC3_2_C6              	2951
  CA4_12_C1             	7125
  OR2_7_C5              	5732
  WA12_2_C6             	12274
  WA13_7_C5             	4373
  WA9_6_C3              	2275
  CA1_1_C4              	3298
  CA5_12_C3             	4088
  OR3_3_C7              	4473
  WA12_8_C9             	12413
  BC4_14_C5             	9925
  CA4_18_C6             	1920
  WA13_6_C3             	3614
  WA1_16_C4             	9348
  BC3_15_C5             	2767
  WA13_4_C8             	8453
  BC2_2_C2              	13099
  WA13_5_C4             	7465
  CA3_9_C9              	8076
  CA7_6_C1              	3477
  CA2_11_C6             	12403
  BC1_5_C2              	9637
  BC2_5_C3              	2666
  CA1_10_C7             	2428
  CA1_18_C4             	

In [13]:
#Drop samples with < 15,000 clusters
s3filt = OL.branch("OL-s3filt-c85-t10", subsamples=skeep)

## Steps 4, 5, 6, 7

In [None]:
nsamps = len(s3filt.samples.keys())/2
s3filt.set_params("min_samples_locus", nsamps)
s3filt.run("4567")

Assembly: OL-s3filt-c85-t10
[####################] 100%  inferring [H, E]      | 0:22:21 | s4 | 
[####################] 100%  calculating depths    | 0:02:26 | s5 | 
[####################] 100%  chunking clusters     | 0:02:24 | s5 | 
[################### ]  97%  consens calling       | 1:20:53 | s5 | 
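
As a quick check of the `min_samples_locus` value used in this run: 182 samples passed the step 3 filter, so half of them is 91. The sketch below is not part of the original run. It uses floor division (`//`), which returns an integer under both Python 2 and Python 3, whereas `/` returns a float under Python 3.

```python
n_samples = 182          # samples retained after the step 3 filter
nsamps = n_samples // 2  # floor division: an int in both Python 2 and 3
print(nsamps)
```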

In [15]:
cat $s3filt.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             389577              0         389577
filtered_by_rm_duplicates           27114          27114         362463
filtered_by_max_indels               2399           2399         360064
filtered_by_max_snps                 8161           1991         358073
filtered_by_max_shared_het              0              0         358073
filtered_by_min_sample             357970         336207          21866
filtered_by_max_alleles             88372           8369          13497
total_filtered_loci                 13497              0          13497


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

            sample_coverage
BC1_10_C6             12419
BC1_11_C2              4414
BC1_12_C4              3701
BC1_1_C2               5017
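
One way to read the filter table above: `applied_order` appears to be the number of loci each filter removes when the filters run in sequence, so subtracting it cumulatively from `total_prefiltered_loci` should reproduce the `retained_loci` column. The check below uses only numbers copied from the table; the `rows` list and variable names are illustrative, and the interpretation of the columns is an assumption rather than something taken from the ipyrad documentation.

```python
# (filter name, applied_order, retained_loci) copied from the table above
rows = [
    ("filtered_by_rm_duplicates", 27114, 362463),
    ("filtered_by_max_indels", 2399, 360064),
    ("filtered_by_max_snps", 1991, 358073),
    ("filtered_by_max_shared_het", 0, 358073),
    ("filtered_by_min_sample", 336207, 21866),
    ("filtered_by_max_alleles", 8369, 13497),
]

remaining = 389577  # total_prefiltered_loci
for name, applied, retained in rows:
    remaining -= applied
    # each running total should match the table's retained_loci
    assert remaining == retained, name

print(remaining)
```

Under this reading, `total_filters` can exceed `applied_order` (for example 8161 vs. 1991 for `filtered_by_max_snps`) because a locus flagged by several filters is only removed once.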


## Filtering out individuals recovered at fewer than 45% of loci

In [24]:
loci45 = int(s3filt.stats_dfs.s7_filters.total_filters.total_filtered_loci * 0.45)
loci45

6073
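
The cutoff works out as 13,497 final loci x 0.45 = 6073.65, which `int()` truncates to 6073. The sketch below repeats that arithmetic and applies the cutoff to the four samples shown in the step 7 stats output above. The `coverage` dict is a hand-copied stand-in for `s3filt.stats_dfs.s7_samples`, not the real data frame.

```python
total_filtered_loci = 13497               # from the step 7 stats above
loci45 = int(total_filtered_loci * 0.45)  # 6073.65 truncated to an int

# sample_coverage for the four samples shown in the stats output above
coverage = {
    "BC1_10_C6": 12419,
    "BC1_11_C2": 4414,
    "BC1_12_C4": 3701,
    "BC1_1_C2": 5017,
}

# keep samples recovered at 45% or more of the final loci
kept = sorted(name for name, n in coverage.items() if n >= loci45)
print("{} {}".format(loci45, kept))
```

Of these four samples, only `BC1_10_C6` clears the cutoff, which agrees with the kept and excluded listings in this notebook.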

In [26]:
## get list of samples found in at least 45% of loci after Step 7
skeep = s3filt.stats.index[s3filt.stats_dfs.s7_samples.sample_coverage >= loci45].tolist()
print(len(skeep))
## print who was excluded
print("excluded samples:\t\tsample_coverage")
for name, dat in s3filt.stats_dfs.s7_samples.iterrows():
    if name not in skeep:
        print("  {:<22}\t{}".format(name, int(dat['sample_coverage'])))

117
excluded samples:		sample_coverage
  BC1_11_C2             	4414
  BC1_12_C4             	3701
  BC1_1_C2              	5017
  BC1_2_C2              	2185
  BC2_14_C2             	3039
  BC2_1_C2              	2871
  BC2_3_C7              	3193
  BC2_6_C2              	4815
  BC2_9_C4              	2526
  BC3_11_C2             	3803
  BC3_18_C5             	6036
  BC3_1_C2              	5414
  BC3_4_C2              	2891
  BC4_21_C1             	2061
  BC4_2_C4              	5266
  BC4_6_C2              	4032
  BC4_7_C2              	4223
  CA1_21_C1             	3078
  CA1_5_C5              	3589
  CA1_9_C7              	4087
  CA2_5_C9              	2257
  CA3_11_C9             	3876
  CA3_3_C1              	2461
  CA3_4_C1              	4829
  CA4_13_C1             	3983
  CA4_15_C9             	3787
  CA4_16_C5             	3203
  CA4_7_C8              	6052
  CA5_13b_C3            	2967
  CA5_16_C1             	3208
  CA5_8_C1              	4069
  CA6_13_C1             	4819
 

In [27]:
## print the samples that were kept
for name, dat in s3filt.stats_dfs.s7_samples.iterrows():
    if name in skeep:
        print("  {:<22}\t{}".format(name, int(dat['sample_coverage'])))

  BC1_10_C6             	12419
  BC1_20_C6             	11823
  BC1_22_C7             	12964
  BC1_4_C3              	11858
  BC1_7_C5              	7637
  BC1_8_C4              	9577
  BC1_9_C5              	10987
  BC2_10_C5             	11967
  BC2_11_C5             	7228
  BC2_12_C6             	13220
  BC2_13_C4             	11733
  BC2_16_C6             	13268
  BC2_17_C7             	13114
  BC2_18_C7             	12728
  BC2_7_C4              	13082
  BC3_16_C7             	6410
  BC3_17_C3             	9280
  BC3_20_C7             	13241
  BC3_3_C6              	11732
  BC3_9_C9              	12994
  BC4_12_C5             	12540
  BC4_15_C5             	10934
  BC4_19b_C6            	12398
  BC4_3_C3              	6649
  BC4_9_C2              	6191
  CA1_15_C3             	11601
  CA1_16_C6             	8686
  CA1_19_C6             	12854
  CA1_20_C8             	6737
  CA1_22_C5             	10985
  CA1_2_C4              	10948
  CA1_4_C5              	7591
  CA2_10_C8       

## Rerunning steps 4-7 for the final samples

In [None]:
OL_s7filt45 = s3filt.branch("OL-s7filt45-c85-t10", subsamples = skeep)
nsamps = len(OL_s7filt45.samples.keys())*.7
OL_s7filt45.set_params("min_samples_locus", nsamps)
OL_s7filt45.set_params("output_formats", ('l','v','a'))
OL_s7filt45.run("4567", force = True)

Assembly: OL-s7filt45-c85-t10
[############        ]  61%  inferring [H, E]      | 0:14:17 | s4 | 
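
For this branch, `min_samples_locus` is set to 70% of the 117 retained samples, which is about 81.9 and therefore not a whole number. The sketch below shows the two integer values that could result. Which one ipyrad actually uses depends on how it coerces the parameter. Truncation to 81 is an assumption here, not something confirmed in this notebook.

```python
import math

n_samples = 117                   # samples kept after the 45% coverage filter
raw = n_samples * 0.7             # about 81.9, a float
truncated = int(raw)              # value if the float is simply truncated
rounded_up = int(math.ceil(raw))  # value for a strict "at least 70%" cutoff
print("{} {}".format(truncated, rounded_up))
```

Passing an explicit integer (for example `int(math.ceil(raw))`) removes the ambiguity if a strict 70% threshold matters for your analysis.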

In [31]:
len(OL_s7filt45.samples)

117

In [32]:
cat $OL_s7filt45.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             352818              0         352818
filtered_by_rm_duplicates           24993          24993         327825
filtered_by_max_indels               2020           2020         325805
filtered_by_max_snps                 6500           1578         324227
filtered_by_max_shared_het              0              0         324227
filtered_by_min_sample             322966         302399          21828
filtered_by_max_alleles             76761           7904          13924
total_filtered_loci                 13924              0          13924


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

            sample_coverage
BC1_10_C6             13178
BC1_20_C6             12790
BC1_22_C7             13503
BC1_4_C3              12673


In [33]:
## Try again...
## create a branch requiring at least 1 sample per population at each locus, with outgroups removed
## Could not get this to work!
OL_pops = OL_s7filt45.branch("OL-s7filt45-c85-t10-pops")
OL_pops.populations = {
    "San_Diego": (1, [i for i in OL_pops.samples if "CA1" in i]),
    "Mugu_Lagoon": (1, [i for i in OL_pops.samples if "CA7" in i]),
    "Elkhorn_Slough": (1, [i for i in OL_pops.samples if "CA5" in i]),
    "San_Francisco": (1, [i for i in OL_pops.samples if any(s in i for s in ["CA2","CA3"])]),
    "Tomales": (1, [i for i in OL_pops.samples if "CA4" in i]),
    "Humboldt": (1, [i for i in OL_pops.samples if "CA6" in i]),
    "Coos": (1, [i for i in OL_pops.samples if "OR1" in i]),
    "Yaquina": (1, [i for i in OL_pops.samples if "OR2" in i]),
    "Netarts": (1, [i for i in OL_pops.samples if "OR3" in i]),
    "Willapa": (1, [i for i in OL_pops.samples if any(s in i for s in ["WA9","WA1_"])]),
    "Puget_Sound": (1, [i for i in OL_pops.samples if any(s in i for s in ["WA10","WA11","WA12","WA13"])]),
    "Victoria": (1, [i for i in OL_pops.samples if "BC1" in i]),
    "Ladysmith": (1, [i for i in OL_pops.samples if "BC4" in i]),
    "Barkeley": (1, [i for i in OL_pops.samples if "BC3" in i]),
    "Klaskino": (1, [i for i in OL_pops.samples if "BC2" in i])
    }
OL_pops.run("7",force=True)

Assembly: OL-s7filt45-c85-t10-pops
[####################] 100%  filtering loci        | 0:02:12 | s7 | 
[####################] 100%  building loci/stats   | 0:02:00 | s7 | 
[####################] 100%  building alleles      | 0:02:19 | s7 | 
[####################] 100%  building vcf file     | 0:02:39 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:01:52 | s7 | 
[####################] 100%  writing outfiles      | 0:00:00 | s7 | 
Outfiles written to: ~/Ostrea_Phylo/Full-c85-run/OL-s7filt45-c85-t10-pops_outfiles
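
The population dictionary above assigns samples by substring matching, which is why Willapa uses `"WA1_"` rather than `"WA1"` (the latter would also match `WA10` through `WA13`). A possible alternative, sketched below, is to key on the site code before the first underscore. `SITE_TO_POP` and `build_populations` are hypothetical names. The site-to-population mapping is copied from the cell above, and the sample names come from listings earlier in this notebook and are used only to exercise the helper.

```python
# Site code -> population name, copied from the populations cell above
SITE_TO_POP = {
    "CA1": "San_Diego", "CA7": "Mugu_Lagoon", "CA5": "Elkhorn_Slough",
    "CA2": "San_Francisco", "CA3": "San_Francisco", "CA4": "Tomales",
    "CA6": "Humboldt", "OR1": "Coos", "OR2": "Yaquina", "OR3": "Netarts",
    "WA9": "Willapa", "WA1": "Willapa",
    "WA10": "Puget_Sound", "WA11": "Puget_Sound",
    "WA12": "Puget_Sound", "WA13": "Puget_Sound",
    "BC1": "Victoria", "BC4": "Ladysmith", "BC3": "Barkeley", "BC2": "Klaskino",
}

def build_populations(sample_names, min_per_pop=1):
    """Group sample names by population using the exact site-code prefix."""
    pops = {}
    for name in sample_names:
        site = name.split("_")[0]  # e.g. "WA12" from "WA12_2_C6"
        pop = SITE_TO_POP[site]    # a KeyError flags an unmapped site code
        pops.setdefault(pop, []).append(name)
    # same (min_samples, [names]) tuple layout as the populations cell above
    return {pop: (min_per_pop, sorted(names)) for pop, names in pops.items()}

names = ["BC1_10_C6", "CA1_15_C3", "CA5_13b_C3", "WA1_12_C3", "WA12_2_C6"]
pops = build_populations(names)
for pop in sorted(pops):
    print("{}: {}".format(pop, pops[pop]))
```

Unlike the original dictionary, this version omits populations that have no samples instead of listing them with an empty list.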



In [34]:
cat $OL_pops.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             352818              0         352818
filtered_by_rm_duplicates           24993          24993         327825
filtered_by_max_indels               2020           2020         325805
filtered_by_max_snps                 6500           1578         324227
filtered_by_max_shared_het              0              0         324227
filtered_by_min_sample             323061         302463          21764
filtered_by_max_alleles             76761           7886          13878
total_filtered_loci                 13878              0          13878


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

            sample_coverage
BC1_10_C6             13140
BC1_20_C6             12753
BC1_22_C7             13467
BC1_4_C3              12636
