## Parallelized *STRUCTURE* analyses on unlinked neutral SNPs

Part of the `ipyrad.analysis` toolkit. Here I use a Structure file created by the radiator R package from a filtered VCF, which requires a little reformatting first. The same workflow should otherwise work with a Structure file created by ipyrad.

### Required software
You can easily install the required software for this notebook with a locally installed `conda` environment. Just run the commented code below in a terminal. If you are working on an HPC cluster you **do not need** administrator privileges to install the software in this way, since it is only installed locally.

In [1]:
## conda install ipyrad -c ipyrad
## conda install structure -c ipyrad
## conda install clumpp -c ipyrad
## conda install toytree -c eaton-lab

### Import Python libraries

In [1]:
import ipyrad.analysis as ipa      ## ipyrad analysis toolkit
import ipyparallel as ipp          ## parallel processing
import toyplot                     ## plotting library



In [2]:
print("ipyparallel v.{}".format(ipp.__version__))

ipyparallel v.6.1.1


### Parallel cluster setup
Start an `ipcluster` instance in a separate terminal. An easy way to do this in a jupyter-notebook running on an HPC cluster is to go to your Jupyter dashboard, and click [new], and then [terminal], and run '`ipcluster start`' in that terminal. This will start a local cluster on the compute node you are connected to. See our [ipyparallel tutorial] (coming soon) for further details. 

In [3]:
##
## ipcluster start --n=24 --profile='n24'
##

In [15]:
## get parallel client
ipyclient = ipp.Client(profile="n24")
print(len(ipyclient), 'cores')

(24, 'cores')


### Enter input and output file locations
The `.str` file is a Structure-formatted file that includes all SNPs present in the data set. The `.str` file output by radiator needs a little reformatting first: the `sed` and `tail` commands below insert three empty columns after the first two columns and drop the header line of marker names. The `.snps.map` file is an optional file that records which locus each SNP comes from. If this file is used, each replicate analysis *randomly* samples a single SNP from each locus. The results from many reps therefore represent variation across unlinked SNP data sets, as well as variation caused by uncertainty. The `workdir` is the location where you want output files to be written; it will be created if it does not already exist.

In [3]:
%%sh
sed -r 's/^([^\t]*\t[^\t]*)(.*)$/\1\t\t\t\2/' OL-c85t10-x45m75-maf025-neutI2-filt.str | tail -n +2 > OL-c85t10-x45m75-maf025-neutI2-filt_head.str  
head OL-c85t10-x45m75-maf025-neutI2-filt_head.str | cut -f 1-10

BC2-10-C5	1				4	2	1	4	3
BC2-10-C5	1				4	2	1	1	1
BC2-11-C5	1				-9	2	-9	-9	3
BC2-11-C5	1				-9	2	-9	-9	3
BC2-12-C6	1				4	2	1	4	3
BC2-12-C6	1				4	2	1	1	3
BC2-13-C4	1				4	2	1	4	3
BC2-13-C4	1				4	2	1	4	3
BC2-16-C6	1				4	2	1	4	3
BC2-16-C6	1				4	2	1	4	3
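
For readers without GNU `sed`, the same reformatting can be sketched in plain Python. This is only an illustrative equivalent of the shell cell above, not part of ipyrad or radiator; the function names `pad_structure_line` and `reformat_str_lines` are hypothetical, and the sketch assumes every data row has at least two tab-separated fields.

```python
def pad_structure_line(line, n_empty=3):
    """Insert n_empty blank tab-separated columns after the first two fields."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:2] + [""] * n_empty + fields[2:])


def reformat_str_lines(lines):
    """Drop the header line (marker names) and pad every remaining line."""
    return [pad_structure_line(line) for line in lines[1:]]


# toy input mimicking the radiator layout: a header row, then genotype rows
example = [
    "locus_1__5__5\tlocus_1__9__9\n",
    "BC2-10-C5\t1\t4\t2\n",
    "BC2-10-C5\t1\t4\t1\n",
]
padded = reformat_str_lines(example)
print(padded[0].split("\t"))
```

To apply it to a real file, read the lines with `open(...).readlines()` and write the returned rows back out joined by newlines.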


In [4]:
## the structure formatted file
strfile = "./OL-c85t10-x45m75-maf025-neutI2-filt_head.str"

## an optional mapfile, to sample unlinked SNPs
mapfile = "./OL-c85t10-x45m75-maf025-neutI2-filt.snps.map"

## the directory where outfiles should be written
workdir = "./OL_t10x45m75_maf025_neutI2_filt/"

In [5]:
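## create the mapfile: one row per SNP, SNPs from the same locus share an index
with open("OL-c85t10-x45m75-maf025-neutI2-filt.str", "r") as infile:
    ## the first line of the radiator .str file holds the marker names
    loci = infile.readline().strip().split()

with open(mapfile, "w") as outfile:
    span = 0
    prev_tag = None
    for snp, locus in enumerate(loci, start=1):
        ## marker names look like locus_<id>__<pos>__<pos>; field 1 is the locus id
        tag = locus.split("_")[1]
        ## start a new locus index whenever the locus id changes
        if tag != prev_tag:
            span += 1
            prev_tag = tag
        outfile.write("{}\t{}\t0\t{}\n".format(span, locus, snp))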

In [6]:
%%sh
tail OL-c85t10-x45m75-maf025-neutI2-filt.snps.map

6154	locus_9983__38__38	0	13182
6154	locus_9983__39__39	0	13183
6154	locus_9983__46__46	0	13184
6154	locus_9983__47__47	0	13185
6154	locus_9983__74__74	0	13186
6155	locus_99894__13__13	0	13187
6155	locus_99894__28__28	0	13188
6155	locus_99894__37__37	0	13189
6156	locus_99918__38__38	0	13190
6157	locus_99985__67__67	0	13191
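
To make the role of the map file concrete, here is a minimal sketch of how one SNP per locus could be drawn from it. It illustrates the idea only and is not ipyrad's own implementation; the function name `sample_unlinked_snps` and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def sample_unlinked_snps(map_rows, seed=None):
    """Pick one SNP index at random from each locus.

    map_rows: iterable of (locus_index, snp_index) pairs, i.e. columns
    1 and 4 of the .snps.map file.
    """
    rng = random.Random(seed)
    by_locus = defaultdict(list)
    for locus, snp in map_rows:
        by_locus[locus].append(snp)
    return sorted(rng.choice(snps) for snps in by_locus.values())


# the last rows of the map file shown above, as (locus, snp) pairs
rows = [(6154, 13182), (6154, 13183), (6154, 13184), (6154, 13185),
        (6154, 13186), (6155, 13187), (6155, 13188), (6155, 13189),
        (6156, 13190), (6157, 13191)]
picked = sample_unlinked_snps(rows, seed=1)
print(picked)
```

Each call returns four SNP indices here, one per locus. Loci 6156 and 6157 have a single SNP each, so those two are always chosen, while the picks for 6154 and 6155 vary with the seed.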


### Create a *Structure* Class object
Structure is a rather old-fashioned program that requires creating quite a few input files to run, which makes it inconvenient to use in a programmatic and reproducible way. To work around this we've created a convenience wrapper object that makes it easy to submit Structure jobs and to summarize their results. 

In [7]:
## create a Structure object
structN = ipa.structure(name="OL_t10x45m75maf025_neutI2",
                       data=strfile,
                       mapfile=mapfile,
                       workdir=workdir)

### Set parameter options for this object
Our Structure object will be used to submit jobs to the cluster. It has associated with it a name, a set of input files, and a large number of parameter settings. You can modify the parameters by setting them like below. You can also use tab-completion to see all of the available options, or print them like below. See the [full structure docs here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0ahUKEwjt9tjpkszYAhWineAKHZ4-BxAQFgg4MAI&url=https%3A%2F%2Fwww.researchgate.net%2Ffile.PostFileLoader.html%3Fid%3D591c636cdc332d78a46a1948%26assetKey%3DAS%253A495017111953409%25401495032684846&usg=AOvVaw0WjG0uD0MXrs5ResMIHnik) for further details on the function of each parameter. In support of reproducibility, it is good practice to print both the mainparams and extraparams so it is clear which options you used. 

In [10]:
## set mainparams for object
structN.mainparams.burnin = 50000
structN.mainparams.numreps = 200000
structN.mainparams.popdata = 1
structN.mainparams.popflag = 1
# I don't want the popflag used as a prior, so set that to zero in the Structure object
structN.popflag = [0] * structN.ntaxa
## see all mainparams
print(structN.mainparams)

## see or set extraparams
print(structN.extraparams)

burnin             50000               
extracols          0                   
label              1                   
locdata            0                   
mapdistances       0                   
markernames        0                   
markovphase        0                   
missing            -9                  
notambiguous       -999                
numreps            200000              
onerowperind       0                   
phased             0                   
phaseinfo          0                   
phenotype          0                   
ploidy             2                   
popdata            1                   
popflag            1                   
recessivealleles   0                   

admburnin           500                 
alpha               1.0                 
alphamax            10.0                
alphapriora         1.0                 
alphapriorb         2.0                 
alphapropsd         0.025               
ancestdist          0            

### Submit jobs to run on the cluster
The function `run()` distributes jobs to run on the cluster and load-balances the parallel workload. It takes a number of arguments. The first, `kpop`, is the number of populations. The second, `nreps`, is the number of replicate runs to perform. Each rep has a different random seed, and if you entered a mapfile for your Structure object then unlinked SNPs are subsampled independently in each replicate. The `seed` argument can be used to make the replicate analyses reproducible: the `extraparams.seed` parameter is generated from it for each replicate. Finally, provide the `ipyclient` object that we created above. The Structure object stores an *asynchronous results object* for each submitted job so that we can query whether the jobs have finished. Using a simple for-loop we'll submit 5 replicate jobs at each of ten values of K (1 through 10). 

In [11]:
## a range of K-values to test
tests = [8,7,6,5,4,3,2,1,9,10]

In [12]:
## submit batches of 5 replicate jobs for each value of K 
for kpop in tests:
    structN.run(
        kpop=kpop,  
        nreps=5, 
        seed=12345,
        ipyclient=ipyclient)

submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-8]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-7]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-6]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-5]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-4]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-3]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-2]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-1]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-9]
submitted 5 structure jobs [OL_t10x45m75maf025_neutI2-K-10]
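
The idea of generating a per-replicate seed from one master `seed` can be sketched as follows. This is an assumption about the general technique, not a description of how ipyrad derives `extraparams.seed` internally; the helper `replicate_seeds` is hypothetical.

```python
import random


def replicate_seeds(master_seed, nreps):
    """Derive one seed per replicate from a single master seed."""
    rng = random.Random(master_seed)
    return [rng.randint(0, 10**9) for _ in range(nreps)]


# the same master seed always yields the same list of replicate seeds
seeds = replicate_seeds(12345, nreps=5)
print(seeds)
```

Because the generator is seeded once, rerunning with the same master seed reproduces the same sequence of replicate seeds, which is what makes a set of replicate runs repeatable.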


### Track progress until finished
You can check for finished results by using the `get_clumpp_table()` function, which tries to summarize the finished results files. If no results are ready it will simply print a warning message telling you to wait. If you want the notebook to block/wait until all jobs are finished then execute the `wait()` function of the ipyclient object, like below. 

In [16]:
## see submitted jobs (we query first 20 here)
structN.asyncs[:20]

[<AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>,
 <AsyncResult: _call_structure:finished>]

In [17]:
## query a specific job result by index
if structN.asyncs[1].ready():
    print(structN.asyncs[1].result())

In [None]:
## block/wait until all jobs finished
ipyclient.wait() 
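
If you would rather see progress than block silently, a small polling loop over the stored async objects is one option. The sketch below relies only on each job exposing a `ready()` method, as ipyparallel's `AsyncResult` does; the helper names and the `FakeJob` stand-in are hypothetical and exist only so the example runs without a cluster.

```python
import time


def count_finished(asyncs):
    """Return (finished, total) for a list of objects exposing .ready()."""
    done = sum(1 for job in asyncs if job.ready())
    return done, len(asyncs)


def wait_with_progress(asyncs, interval=30, sleep=time.sleep):
    """Print progress every `interval` seconds until every job is ready."""
    while True:
        done, total = count_finished(asyncs)
        print("{}/{} jobs finished".format(done, total))
        if done == total:
            return done
        sleep(interval)


class FakeJob(object):
    """Hypothetical stand-in for an AsyncResult, used only for this demo."""

    def __init__(self, polls_needed):
        self.polls_needed = polls_needed

    def ready(self):
        # report "not ready" for the first `polls_needed` polls
        self.polls_needed -= 1
        return self.polls_needed < 0


jobs = [FakeJob(0), FakeJob(1), FakeJob(2)]
finished = wait_with_progress(jobs, interval=0, sleep=lambda s: None)
```

In the notebook, the equivalent call would presumably be `wait_with_progress(structN.asyncs)`, with an interval suited to how long each Structure run takes.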