# msbayes buffering empirical test (neotropical butterflies)

These are the steps I followed to gather, convert, and reanalyse the very large mtDNA butterfly dataset from Garzón‐Orduña et al. (2014).

[Google sheet with all the species pairs](https://docs.google.com/spreadsheets/d/1QZFwnSzb4Y-K8dPvWfImJ3yuo3VAHGCMMWtJ1clAQi4/edit#gid=1642624377)

Garzón‐Orduña, I. J., Benetti‐Longhini, J. E., & Brower, A. V. (2014). Timing the diversification of the Amazonian biota: butterfly divergences are consistent with Pleistocene refugia. Journal of Biogeography, 41(9), 1631-1638.


First, several of the files had Windows-style line breaks, which is a drag if you're 
trying to do command-line parsing. To clean the ^M characters I opened each file in vi and did:
```
:%s/\r/\r/g
```
There's probably a better way to do this for all files, but this worked because I only had to fix a couple.
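
For more than a couple of files the same cleanup could be scripted. This is only a minimal sketch, assuming each file is small enough to read into memory; `normalize_newlines` and `fix_file` are hypothetical helper names, not part of any tool used here.

```python
def normalize_newlines(data):
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first so the lone-CR pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fix_file(path):
    """Rewrite one file in place with Unix line endings."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(normalize_newlines(data))
```

Looping `fix_file` over `glob.glob("*.fas")` would cover a whole directory; keep a backup first, since the files are rewritten in place.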

All the files were multi-line FASTA and I needed each sequence on a single line, so I used this awk 
script to join all the lines for each sequence:
```
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < ../Taygetis.fas > Taygetis.fas
```
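
The same unwrapping can be sketched in Python, which is handy if it needs to live alongside the other scripts. `unwrap_fasta` is a hypothetical helper; unlike the awk one-liner it does not emit a leading blank line.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each sequence joined onto a single line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header
        yield "".join(chunks)
```

Writing `"\n".join(unwrap_fasta(open(path)))` to a new file gives one header line followed by one sequence line per record.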


The next step is to convert the FASTA data for each species pair into .im format. I received ~15 FASTA files, one per genus, but each file held many dozens of seqs for many species, many of which had no properly identified sibling species, which made them useless for this analysis. I started doing this by hand (grepping through files and copy-pasting into new .im files), but it quickly became tedious, so I wrote a little script to do most of the dirty work. The script takes 2 arguments: the 2 sibling species to collect sequences for, separated by a '-', and the FASTA file to search.

```
./dopops.py erithalion-vertumnus Parides.fas
```

This outputs a file called `parides-erithalion-vertumnus.im`, which you still have to curate manually a bit: make sure the alignment is good, etc.

I checked questionable alignments with muscle:
```
# -A1 Shows one line of context after what you're searching for
# "wat|watdo" in egrep searches for _either_ string in the file
egrep -A1 "erithalion|vertumnus" Parides.fas > tmp.fa
muscle -in tmp.fa -out tmp.ali
cat tmp.ali
```

In [None]:
#!/usr/bin/env python2

## dopops.py script to pull individuals for each taxon out of a larger fasta file

import sys


pop1 = sys.argv[1].split("-")[0]
pop2 = sys.argv[1].split("-")[1]

fa = sys.argv[2]

outfile = fa.split(".")[0].lower()+"-"+sys.argv[1]+".im"
print(outfile)

with open(outfile, 'w') as out:

    ## write header
    out.write(fa.split(".")[0]+"\t\t\t"+pop1+"/"+pop2+"\n")
    out.write("pop1 pop2"+"\n")
    out.write("1"+"\n")
    out.write("gene1  ")

    # get counts for pop1 and pop2
    with open(fa, 'r') as fasta:
        seqs = fasta.readlines()

        npop1 = 0
        npop2 = 0
        pop1seqs=[]
        pop2seqs=[]
        maxlen = 0
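        for i, seq in enumerate(seqs):
            ## Headers are matched by substring; the sequence is the next line,
            ## so the input fasta must already be single-line
            if pop1 in seq:
                npop1 +=1
                pop1seqs.extend([seqs[i+1]])
                ## strip the trailing newline so it isn't counted in the length
                maxlen = max(maxlen, len(seqs[i+1].strip()))
            elif pop2 in seq:
                npop2 +=1
                pop2seqs.extend([seqs[i+1]])
                maxlen = max(maxlen, len(seqs[i+1].strip()))
        out.write(str(npop1)+" "+str(npop2)+" "+str(maxlen)+" I 0.25\n")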

        for i, seq in enumerate(pop1seqs):
            out.write("pop1_"+str(i+1)+"\t"+seq)
        for i, seq in enumerate(pop2seqs):
            out.write("pop2_"+str(i+1)+"\t"+seq)

# Create the master msbayes file and the properly formatted fasta
OK, so now I have a directory that's FULL of .im files and I need to convert them 
into the format that msbayes understands. It's a two-step process:

1) Get the infile.list of all input .im files to convert
```
cd <working directory>
ls -1 *.im > infile.list
```
2) Convert to msbayes format
```
<path_to_msbayes_src>/convertIM.pl infile.list
```
When you run this command you'll see a lot of complaining about gaps removed from the analysis. This is because msbayes doesn't like sites with missing data, so just do the best you can. It will also replace all ambiguous characters, all N's, and all gaps ('-') with '?'.

When it's done you'll have a file called `batch.masterIn.fromIM` and a directory called `fastaFromIM` with all your fasta files.
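
To get a feel for how many sites are at risk before converting, the alignment columns holding anything other than A, C, G, or T can be counted. This is a rough sketch only: it does not reproduce the logic of `convertIM.pl`, and `columns_with_missing` is a hypothetical helper that treats sequences shorter than the longest one as gap-padded.

```python
def columns_with_missing(seqs, valid="ACGT"):
    """Return indices of alignment columns holding any non-ACGT character."""
    bad = []
    for col in range(max(len(s) for s in seqs)):
        # Pad short sequences with '-' so ragged ends count as missing data.
        column = [s[col].upper() if col < len(s) else "-" for s in seqs]
        if any(base not in valid for base in column):
            bad.append(col)
    return bad
```

Comparing `len(columns_with_missing(seqs))` with the alignment length shows roughly how much of a taxon pair's data contains gaps or ambiguity codes.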

# Create the observed summary stats vector
```
<path_to_msbayes_src>/obsSumStats.pl -T obsSS.table batch.masterIn.fromIM > obsSS.txt
```
This creates a single observed summary stats vector called **obsSS.txt**, and an **obsSS.table** with all the stats broken out per stat type per species pair (more human readable). The human-readable version is nice because I could go through and check the value of pi.b for each taxon pair to verify that it was reasonable. Luckily I identified _several_ issues: 2 sequences just had bad alignments, and **all** the sequences with '+' in their names had been munged for some reason. convertIM.pl assigns a suffix to each sample name, as you see below; this problem only impacts taxon pairs that convertIM.pl decides to use 3-letter suffixes for. Taxon pairs with 2-letter suffixes are fine.

Something went wrong during the convertIM.pl phase and many of the output fasta files had to be hand curated (all of those listed below had pi.b >> 10%; upon inspection the seqs had been munged, so I had to fix them). In all cases the munged seqs looked like this. Apparently the convert script is not properly chomping the head of the sequence, which messes up the alignment:

>baeotus-aelius+beotus-deucalion_pop1_gene1_pop1_2_TGA
>GCTGGTATAGTAGGAACTTCACTTAGTTTATTAATTCGAACTGAATTAGGAAATCCAGGATCATTAATTGGTGATGATCAAATTTATAATACAATTGTAACAGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCAATTATAAT
>baeotus-aelius+beotus-deucalion_pop2_gene1_pop2_2_TG
>AGCTGGTATAGTGGGGACTTCACTTAGTCTATTAATTCGAACTGAGTTAGGAAATCCAGGATCATTAATTGGAGATGATCAAATTTATAATACAATTGTTACAGCTCATGCTTTTATTATAATTTTCTTTATAGTTATACCAATTATAAT

At first I guessed it didn't like something about the '+' in the file name. It's definitely the '+' in the file name, and it impacts all the members of pop2.

* BAEOTUS-AELIUS+BEOTUS-DEUCALION
* BLEPOLENIS-CATHARINAE+BATEA-BASSUS
* DASYOPHTHALMA-RUSINA+GERAENSIS-CREUSA
* ERESIA-PERNA+CLIO-LETITIA
* EUEIDES-LYBIA+TALES-ALIPHERA
* EUEIDES-VIBILIA+LAMPETO-PAVANA  0.662798
* FORSTERINARIA_SUBCLADE-PILOSA+PICHITA-GUANILOI
* HAMADRYAS-BELLADONNA+AMPHINOME-ARINOME
* HARJESIA-OBSCURA+SP-BLANDA
* HELICONIUS-BURNEYI+WALLACEI-EGERIA
* HELICONIUS-CLYSONIMUS+TELESIPHE-HORTENSE
* HELICONIUS-ELEUCHIA+CONGENER-SAPHO
* HELICONIUS-HIERAX+XANTHOCLES-DORIS
* HYPANARTIA-LETHE+GODMANII-BELLA
* JANATELLA-LEUCODESMA+HERA-FELLULA
* JUNONIA-VESTINA+GENOVEVA-EVARETE
* LYMANOPODA-ALBOMACULATA+APULIA-AFFINEOLA
* MORPHO-ACHILLES+HELENOR-GRANADENSIS
* NAPEOGENES-APULIA+GRACILIS-INACHIA
* NAPEOGENES-GLYCERA+SP1-PHARO
* NAPEOGENES-SODALIS+BENIGNA-SULPHUREOPHILA
* OLERIA-ATTALIA+CYRENE-BIOCULATA
* OLERIA-BOYERI+DERONDA-DERONDINA
* OLERIA-RUBESCENS+PAULA-ZELICA
* OLERIA-SP511+ESTELLA-GUNILLA
* PSEUDODEBIS-CELIA+PURITANA-MARPESSA
* TAYGETINA_SUBCLADE-KEREA+PERIBAEA-WEYMERI
* TAYGETIS-RUFOMARGINATA+VIRGILIA-ACUTA
* TAYGETIS-THAMYRA+SOSIS-SPPM04_04
* TAYGETIS-UNCINATA+SPUN0261-LACHES

These two were bad for different reasons:
* heliconius-numata-ismenius
* gnathrotiche-exclamationis-mundina
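
Since the munging shifts the start of some sequences, the affected files end up with sequences of unequal length, which is cheap to test for before trusting pi.b. A minimal sketch, assuming the files in `fastaFromIM` follow the standard FASTA layout (header lines start with `>`); `sequence_lengths` and `looks_misaligned` are hypothetical helpers.

```python
from collections import Counter


def sequence_lengths(lines):
    """Count how many sequences of each length a FASTA file contains."""
    lengths = Counter()
    current = None  # length of the record being read, None before any header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths[current] += 1
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths[current] += 1
    return lengths


def looks_misaligned(lines):
    """True when the sequences in one file do not all share a single length."""
    return len(sequence_lengths(lines)) > 1
```

Running `looks_misaligned(open(path))` over every file in `fastaFromIM` would give a shortlist to inspect by hand; it will not catch bad alignments that happen to keep all lengths equal.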

After I fixed these in the fastaFromIM directory I regenerated the sumstats by rerunning the `obsSumStats.pl` command above.


# Simulate the prior
Now I copied my obsSS.txt and batch.masterIn.fromIM to the msbayes/src directory and renamed them **butterfly.obsSS.txt** and **butterfly.buffer0.conf** (since I intend to test multiple buffering values).
```
cd /Volumes/WorkDrive/msbayes-buffering/bfly-im-files
cp obsSS.txt ../hickerlab-repository/msbayes-buffering/src/butterfly.obsSS.txt
cp batch.masterIn.fromIM ../hickerlab-repository/msbayes-buffering/src/butterfly.buffer0.conf
```
Now generate the prior, using the code in the next section.


# Code for generating infiles, converting, and generating the priors

In [26]:
## Define the msbayes priors command
## This command should be called inside a multiprocess Process() call, inside a for loop

## do_priors expects 3 args
##   - num is the number of replicates to do
##   - outname is the file to write to
##   - conffile is the msbayes conf file to simulate from

# Sorting. 7 is default (with sorting). 0 is no sorting. 
SORTING_FLAG=7

def do_priors(num, outname, conffile):
    cmd = MSBAYES_BIN \
        + " -s " + str(SORTING_FLAG)\
        + " -r " + str(num) \
        + " -c " + conffile \
        + " -o " + outname
    try:
        print(cmd)
        ## Keep a log of every command that gets run, one per line
        with open("/tmp/watt", 'a') as outfile:
            outfile.write(cmd + "\n")
        time.sleep(2)
        os.system(cmd)
        #subprocess.check_output([MSBAYES_BIN, "-h"],
        #                        stderr=subprocess.STDOUT)
    except Exception as inst:
        print(inst)

In [1]:
## Define directories and file paths

from __future__ import print_function

import os
import time
import subprocess

#msbayes paths
MSBAYES_ROOTDIR="/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/"
MSBAYES_EXECDIR=MSBAYES_ROOTDIR+"src/"

# Binaries
MSBAYES_BIN=MSBAYES_EXECDIR+"msbayes.pl"
MSCONVERT_PL=MSBAYES_EXECDIR+"convertIM.pl"
MSOBSSS_PL=MSBAYES_EXECDIR+"obsSumStats.pl"
MSREJECT_PL=MSBAYES_EXECDIR+"acceptRej.pl"

# Results directories
MSBAYES_DATADIR=MSBAYES_ROOTDIR+"data/"

# Specific paths for the butterfly data
BFLY_OUT=MSBAYES_DATADIR+"bfly/"
BFLY_CONF_DIR=BFLY_OUT+"conf/"
BFLY_PRIORS=BFLY_OUT+"priors/"
BFLY_RESULTS=BFLY_OUT+"results/"
BFLY_IMFILES_DIR="/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/"

## PATHS and files for the full dataset
BFLY_FULL_CONF_DIR=BFLY_CONF_DIR+"full/"
BFLY_FULL_INFILE=BFLY_FULL_CONF_DIR+"bfly-full-infile.txt"
BFLY_FULL_CONF=BFLY_FULL_CONF_DIR+"conf_bfly_full_buffer0.txt"
BFLY_FULL_PRIORS_DIR=BFLY_PRIORS+"full/"
BFLY_FULL_RESULTS_DIR=BFLY_RESULTS+"full/"

BFLY_OBS_SS=BFLY_FULL_CONF_DIR+"bfly_full_obsSS.txt"

## PATHS and files for the subset of data with a larger number of 
## samples per taxon pair
BFLY_SUBSET_CONF_DIR=BFLY_CONF_DIR+"subset/"
## Don't use this one, iterating over different values of maxn to keep
#BFLY_SUBSET_INFILE=BFLY_SUBSET_CONF_DIR+"bfly-subset-infile.txt"
BFLY_SUBSET_CONF=BFLY_SUBSET_CONF_DIR+"butterfly.subset.conf"
BFLY_SUBSET_PRIORS_DIR=BFLY_PRIORS+"subset/"
BFLY_SUBSET_RESULTS_DIR=BFLY_RESULTS+"subset/"

os.chdir(MSBAYES_EXECDIR)

OSError: [Errno 2] No such file or directory: '/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/'

## Full dataset - Making input files

In [19]:
## Make the infile, conf file and fastaFromIM dir/fastas for the full dataset.
## The `grep -v +` drops every file with a '+' in its name, i.e. the
## non-sibling species, the composites of multiple species
cmd = "ls -1 " + BFLY_IMFILES_DIR + "*.im |  grep -v + > " + BFLY_FULL_INFILE
print(cmd)
print(os.system(cmd))

## Do the conversion
os.chdir(BFLY_FULL_CONF_DIR)

## Set the mutation scalar to 1 since this is all single locus mtdna
cmd = MSCONVERT_PL + " -m 1 -o "+ BFLY_FULL_CONF + " " + BFLY_FULL_INFILE
print(cmd)
print(os.system(cmd))

## Now you have to go add these lines to the bfly_full_conf by hand cuz i'm too lazy to code it
"""
# Ensure all tau classes are constrained to be at least this far apart
bufferTauClasses = 0
"""

ls -1 /Volumes/WorkDrive/msbayes-buffering/bfly-im-files/*.im |  grep -v + > /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/bfly-full-infile.txt
0
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/convertIM.pl -m 1 -o /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.txt /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/bfly-full-infile.txt
0
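
The hand edit could be scripted along these lines. This is a sketch under an assumption not confirmed here: that the setting has to sit in the conf preamble, before a line reading `BEGIN SAMPLE_TBL`. The marker is a parameter so it can be changed if the generated conf looks different, and `add_buffer_setting` is a hypothetical helper.

```python
def add_buffer_setting(conf_text, buff="0", marker="BEGIN SAMPLE_TBL"):
    """Insert a bufferTauClasses line before the sample table, if absent."""
    if "bufferTauClasses" in conf_text:
        return conf_text  # already set, leave the file alone
    setting = (
        "# Ensure all tau classes are constrained to be at least this far apart\n"
        "bufferTauClasses = " + buff + "\n"
    )
    if marker in conf_text:
        head, tail = conf_text.split(marker, 1)
        return head + setting + marker + tail
    # Assumption: with no marker found, appending is the best we can do.
    return conf_text + setting
```

Reading `BFLY_FULL_CONF`, passing its contents through this function, and writing the result back would replace the manual step; check the output against a hand-edited conf before relying on it.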


In [14]:
# Make the conf files for different buffer values
BUFFER_VALUES = ["0.1", "0.05", "0.01"]
## Do the conversion
os.chdir(BFLY_FULL_CONF_DIR)

buffer_conf_dict={}
for buff in BUFFER_VALUES:
    BFLY_BUFF_CONF=BFLY_FULL_CONF.split("0")[0]+buff+".txt"
    print("Making - " + BFLY_BUFF_CONF)
    with open(BFLY_FULL_CONF, 'r') as infile:
        lines = infile.readlines()
        with open(BFLY_BUFF_CONF, 'w') as outfile:
            for line in lines:
                if "bufferTauClasses = 0" in line:
                    outfile.write("bufferTauClasses = " + buff + "\n")
                else:
                    outfile.write(line)
        buffer_conf_dict[buff]=BFLY_BUFF_CONF

Making - /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.1.txt
Making - /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.05.txt
Making - /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.01.txt
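
The `split("0")[0]` trick works here only because the first `0` in the path happens to be the one in `buffer0.txt`; any earlier `0` in a directory name would truncate the path. A slightly more defensive sketch, with `buffer_conf_path` as a hypothetical helper:

```python
def buffer_conf_path(base_conf, buff):
    """Swap the trailing 'buffer0.txt' of the base conf path for another buffer value."""
    suffix = "buffer0.txt"
    if not base_conf.endswith(suffix):
        raise ValueError("expected a path ending in " + suffix)
    return base_conf[:-len(suffix)] + "buffer" + buff + ".txt"
```

For example, `buffer_conf_path(BFLY_FULL_CONF, "0.05")` would give the same file name as the loop above while tolerating digits elsewhere in the path.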


## Full dataset - Generating observed summary stats

In [23]:
# This only needs to be done once

# Be sure you're in the directory with the fastaFromIM directory
os.chdir(BFLY_FULL_CONF_DIR)

cmd = MSOBSSS_PL + " -T " + BFLY_FULL_CONF_DIR+"bfly_obsSS.table " + BFLY_FULL_CONF + " > " + BFLY_OBS_SS

print(cmd)
print(os.system(cmd))

/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/obsSumStats.pl -T /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/bfly_obsSS.table /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.txt > /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/bfly_full_obsSS.txt
0


# Generate reference tables (sloowwww)

In [31]:
import multiprocessing

## Go to the msbayes working directory
os.chdir(MSBAYES_EXECDIR)

## Generate chunks of the reference table
## This takes the better part of a day, so don't run it unless you're _sure_
## you want it. If you need to kill it, open a terminal and `killall -9 perl`
BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]

for buff in BUFFER_VALUES:
    print("Doing reference tables for - ", buff)
    BFLY_BUFF_PRIORS_DIR=BFLY_FULL_PRIORS_DIR+"buffer"+buff+"/"
    directory=BFLY_BUFF_PRIORS_DIR
    if not os.path.exists(directory):
        os.makedirs(directory)

    ## Generate the priors
    ## Generate a chunk of the reference table
    PRIORS_SIZE=3000000
    NPROC=10
    ## Integer division so -r always gets a whole number of replicates
    CHUNK_SIZE=PRIORS_SIZE//NPROC
    print("chunk size = "+str(CHUNK_SIZE))
    for i in range(NPROC):
        outfile = BFLY_BUFF_PRIORS_DIR+"bfly-full-sorted-buffer"+buff+"-"+str(i)+".prior"
        conffile = BFLY_FULL_CONF.split("0")[0]+buff+".txt"
        print("Using - ", outfile, conffile)
        p = multiprocessing.Process(target=do_priors, args=(CHUNK_SIZE,outfile,conffile))
        p.start()
        time.sleep(2)


('Doing reference tables for - ', '0')
chunk size = 300000
('Using - ', '/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0-0.prior', '/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.txt')
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/msbayes.pl -s 7 -r 300000 -c /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.txt -o /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0-0.prior
('Using - ', '/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0-1.prior', '/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/conf_bfly_full_buffer0.txt')
/Volu
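
The chunking itself can be separated from the process launching, which makes it easy to check the plan before committing a day of CPU time. A small sketch; `plan_chunks` is a hypothetical helper, and it uses integer division so the replicate count stays a whole number under Python 3 as well.

```python
def plan_chunks(total, nproc, prefix, conffile):
    """Return one (replicates, outfile, conffile) tuple per worker process."""
    chunk = total // nproc  # any remainder is dropped
    return [
        (chunk, prefix + "-" + str(i) + ".prior", conffile)
        for i in range(nproc)
    ]
```

Each tuple matches the argument order of `do_priors`, so it could be passed as `args=task` to `multiprocessing.Process`. Keeping the returned process objects and calling `join()` on them would also make it possible to wait for one buffer value to finish before starting the next.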

In [None]:
## Concatenate all the chunks of the reference tables into one
## ginormous file
#BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]
#BUFFER_VALUES = ["0"]
BUFFER_VALUES = ["0.1", "0.05", "0.01"]

for buff in BUFFER_VALUES:
    print("Concatenating all chunks for - ", buff)
    ## I ended up doing this part by hand because of disk space limitations
    ## cp BFLY_BUFF_PRIORS_DIR+"bfly-full-sorted-buffer"+buff+"-"+"-0.prior"
    for i in range(1,10):
        BFLY_BUFF_PRIORS_DIR=BFLY_FULL_PRIORS_DIR+"buffer"+buff+"/"
        BUFF_CHUNK = BFLY_BUFF_PRIORS_DIR+"bfly-full-sorted-buffer"+buff+"-"+str(i)+".prior"
        
        FULL_BUFF_TMPDIR="/Users/iovercast/Desktop/tmp/"
        FULL_BUFF=FULL_BUFF_TMPDIR + "bfly-full-sorted-buffer" + buff + ".prior"

        #cmd = "ls -l " + BUFF_CHUNK
        #print(cmd)
        #print(os.system(cmd))
        
        ## The tail +2 removes the header from all but the first file
        cmd = "tail -n +2 " + BUFF_CHUNK + " >> " + FULL_BUFF
        print(cmd)
        print(os.system(cmd))
    

Concatenating all chunks for -  0.1
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0.1/bfly-full-sorted-buffer0.1-1.prior >> /Users/iovercast/Desktop/tmp/bfly-full-sorted-buffer0.1.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0.1/bfly-full-sorted-buffer0.1-2.prior >> /Users/iovercast/Desktop/tmp/bfly-full-sorted-buffer0.1.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0.1/bfly-full-sorted-buffer0.1-3.prior >> /Users/iovercast/Desktop/tmp/bfly-full-sorted-buffer0.1.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0.1/bfly-full-sorted-buffer0.1-4.prior >> /Users/iovercast/Desktop/tmp/bfly-full-sorted-buffer0.1.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buf
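
The concatenation can also be done without shelling out. A minimal sketch, assuming every chunk starts with exactly one header line (which is what the `tail -n +2` above relies on); `concat_priors` is a hypothetical helper, and it streams the data so memory use stays small, though the disk space needed is the same.

```python
import shutil


def concat_priors(chunks, dest):
    """Concatenate prior chunks into dest, keeping only the first header line."""
    with open(dest, "wb") as out:
        for index, chunk in enumerate(chunks):
            with open(chunk, "rb") as handle:
                if index > 0:
                    handle.readline()  # skip the header, like `tail -n +2`
                shutil.copyfileobj(handle, out)
```

Unlike the loop above, this writes the first chunk too, so no manual copy of chunk 0 is needed.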

## Now do the rejection step (Full dataset)

In [21]:
## I had to manually install several R packages at this point
##     install.packages("VGAM")
##     install.packages("locfit")

## Make sure you are in the msbayes working directory
os.chdir(MSBAYES_EXECDIR)

## This is the command we'll eventually run
## ./acceptRej.pl -s 'pi.b' -p test.pdf ../data/bfly/conf/full/bfly_full_obsSS.txt ../data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0-0.prior > test.out

directory=BFLY_FULL_RESULTS_DIR
if not os.path.exists(directory):
    os.makedirs(directory)

BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]

for buff in BUFFER_VALUES:
    print("Doing rejection step for - ", buff)

    ## create a results dir for each buffer value
    BUFF_OUT=BFLY_FULL_RESULTS_DIR + "buffer" + buff + "/"
    if not os.path.exists(BUFF_OUT):
        os.makedirs(BUFF_OUT)
    
    BFLY_BUFF_PRIORS_DIR=BFLY_FULL_PRIORS_DIR+"buffer"+buff+"/"
    BFLY_BUFF_PRIOR=BFLY_BUFF_PRIORS_DIR+"bfly-full-sorted-buffer"+buff+".prior"
    B_OUT_FILE=BUFF_OUT + "Results-" + buff + ".out"
    B_OUT_PDF=BUFF_OUT + "Results-" + buff + ".pdf"

    cmd = MSREJECT_PL + " -s 'pi.b' -p " + B_OUT_PDF + " "\
            + " -t 0.003 "\
            + BFLY_OBS_SS + " " + BFLY_BUFF_PRIOR \
            + " > " + B_OUT_FILE
    print(cmd)
    print(os.system(cmd))
    print("\n")


Doing rejection step for -  0
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/acceptRej.pl -s 'pi.b' -p /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/full/buffer0/Results-0.pdf  -t 0.00034 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/full/bfly_full_obsSS.txt /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0.prior > /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/full/buffer0/Results-0.out
0


Doing rejection step for -  0.1
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/acceptRej.pl -s 'pi.b' -p /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/full/buffer0.1/Results-0.1.pdf  -t 0.00034 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/
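
Two tolerance values appear in this notebook: `-t 0.003` in the cell and `-t 0.00034` in the recorded output. Assuming `-t` is the fraction of the reference table that is accepted (an assumption worth checking against the `acceptRej.pl` usage message), the number of retained draws is simple arithmetic; `accepted_draws` is a hypothetical helper.

```python
def accepted_draws(prior_size, tolerance):
    """Number of prior draws retained if tolerance is the accepted fraction."""
    # round() guards against floating-point error in the product.
    return int(round(prior_size * tolerance))
```

With the 3,000,000-row reference table built above, `accepted_draws(3000000, 0.003)` gives 9000 and `accepted_draws(3000000, 0.00034)` gives 1020 under that assumption.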

# Subset dataset (only taxon pairs with a reasonable number of inds per pop)

In [23]:
## Make the infile, conf file and fastaFromIM dir/fastas

import glob

# Minimum number of samples per population in each taxon pair
# (the output files keep the "maxn" prefix anyway)
MIN_N=[3,4,5]

for n in MIN_N:
    maxn_str = "maxn-"+str(n)
    pops_to_keep = []
    ## Scan through each input file and pull out the ones with >= n samples per population
    for imfile in glob.glob(BFLY_IMFILES_DIR+"*.im"):
        #print(imfile)
        with open(imfile, 'r') as infile:
            indat = infile.readlines()
            for line in indat:
                if "gene1" in line:
                    npop1 = line.split()[1]
                    npop2 = line.split()[2]
                    if int(npop1) >= n and int(npop2) >= n:
                        pops_to_keep.extend([imfile])
                        print(imfile.split("/")[-1] + "\tnpop1 - " + npop1 + "\tnpop2 - " + npop2)

    ## Do the conversion
    os.chdir(BFLY_SUBSET_CONF_DIR)

    BFLY_SUBSET_INFILE=maxn_str+"-bfly-subset-infile.txt"
    
    ## Write out the infile for the subset data
    with open(BFLY_SUBSET_INFILE, 'w') as outfile:
        for pop in pops_to_keep:
            print(pop)
            outfile.write(pop+"\n")


    cmd = MSCONVERT_PL + " " + BFLY_SUBSET_INFILE
    print(cmd)
    print(os.system(cmd))

    ## Generate the observed summary stats for this subset of the data
    cmd = MSOBSSS_PL + " -T " + maxn_str+"-bfly-subset-obsSS.table" + " batch.masterIn.fromIM > " + maxn_str+"-bfly-subset-obsSS.txt"
    print(cmd)
    print(os.system(cmd))
    
    ## Rename the batch.masterIn.fromIM and fastaFromIM dirs
    os.rename("batch.masterIn.fromIM", maxn_str+"-bfly-subset.conf")
    os.rename("fastaFromIM", maxn_str+"-bfly-subset-fastaFromIM")


achaeoprepona-demophon+camilla-phaedra.im	npop1 - 5	npop2 - 3
hamadryas-belladonna+amphinome-arinome.im	npop1 - 6	npop2 - 4
hyposcada-zarepha-anchiala.im	npop1 - 6	npop2 - 8
ithomia-iphianassa-salapia.im	npop1 - 6	npop2 - 6
ithomia-lagusa-xenos.im	npop1 - 3	npop2 - 3
oleria-sp511+estella-gunilla.im	npop1 - 3	npop2 - 17
parides-childrenae-sesostris.im	npop1 - 7	npop2 - 29
parides-erithalion+vertumnus-anchises.im	npop1 - 5	npop2 - 3
parides-eurimedes+zacynthus-neophilus.im	npop1 - 15	npop2 - 4
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/achaeoprepona-demophon+camilla-phaedra.im
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/hamadryas-belladonna+amphinome-arinome.im
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/hyposcada-zarepha-anchiala.im
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/ithomia-iphianassa-salapia.im
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/ithomia-lagusa-xenos.im
/Volumes/WorkDrive/msbayes-buffering/bfly-im-files/oleria-sp511+estella-gunilla.
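
The filtering rule is easier to test when it is pulled out of the loop. A sketch based on the locus line written by `dopops.py` (`gene1  <npop1> <npop2> <length> I 0.25`); `sample_counts` and `keep_pair` are hypothetical helpers, and matching on the first field is a little stricter than the substring test used above.

```python
def sample_counts(im_lines, locus="gene1"):
    """Return (npop1, npop2) from the locus line of an .im file, or None."""
    for line in im_lines:
        fields = line.split()
        if fields and fields[0] == locus:
            return int(fields[1]), int(fields[2])
    return None


def keep_pair(im_lines, min_n):
    """True when both populations have at least min_n samples."""
    counts = sample_counts(im_lines)
    return counts is not None and min(counts) >= min_n
```

For instance, a pair with 5 and 3 samples passes at `min_n=3` but is dropped at 4 and 5.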

## Generate reference tables for subsets

In [29]:
import multiprocessing

## Go to the msbayes working directory
os.chdir(MSBAYES_EXECDIR)

## Generate chunks of the reference table
## This takes forever on the butterfly data, so don't run it unless you're
## _sure_ you want it. If you need to kill it, open a terminal and `killall -9 perl`
#BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]
BUFFER_VALUES = ["0.01", "0.05"]
SUBSET_MAXN_5_CONF = BFLY_SUBSET_CONF_DIR + "maxn-5-bfly-subset.conf"

for buff in BUFFER_VALUES:
    print("Doing reference tables for - ", buff)
    BFLY_BUFF_PRIORS_DIR=BFLY_SUBSET_PRIORS_DIR+"buffer"+buff+"/"
    directory=BFLY_BUFF_PRIORS_DIR
    if not os.path.exists(directory):
        os.makedirs(directory)

    ## Generate the priors
    ## Generate a chunk of the reference table
    PRIORS_SIZE=3000000
    NPROC=10
    ## Integer division so -r always gets a whole number of replicates
    CHUNK_SIZE=PRIORS_SIZE//NPROC
    print("chunk size = "+str(CHUNK_SIZE))
    for i in range(NPROC):
        outfile = BFLY_BUFF_PRIORS_DIR+"bfly-subset-sorted-buffer"+buff+"-"+str(i)+".prior"
        conffile = SUBSET_MAXN_5_CONF
        print("Using - ", outfile, conffile)
        p = multiprocessing.Process(target=do_priors, args=(CHUNK_SIZE,outfile,conffile))
        p.start()
        time.sleep(2)



Doing reference tables for -  0.01
chunk size = 300000
Using -  /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0.01/bfly-subset-sorted-buffer0.01-0.prior /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/subset/maxn-5-bfly-subset.conf
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/msbayes.pl -s 7 -r 300000 -c /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/subset/maxn-5-bfly-subset.conf -o /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0.01/bfly-subset-sorted-buffer0.01-0.prior
Using -  /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0.01/bfly-subset-sorted-buffer0.01-1.prior /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/subset/maxn-5-bfly-subset.conf
/V

In [36]:
import os

## Concatenate all the chunks of the reference tables into one
## ginormous file
#BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]
#BUFFER_VALUES = ["0"]
BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]

for buff in BUFFER_VALUES:
    print("Doing buffer - {}".format(buff))
    ## The subset directory with all the priors file partial chunks
    BFLY_BUFF_PRIORS_DIR=BFLY_SUBSET_PRIORS_DIR+"buffer"+buff+"/"
    BFLY_FIRST_CHUNK = BFLY_BUFF_PRIORS_DIR+"bfly-subset-sorted-buffer"+buff+"-0.prior"
    ## Rename the first chunk to be the full prior file, then we'll just cat the
    ## rest of the files to the end of this one
    FULL_BUFF=BFLY_BUFF_PRIORS_DIR + "bfly-subset-sorted-buffer" + buff + ".prior"

    cmd = "mv " + BFLY_FIRST_CHUNK + " " + FULL_BUFF
    print(cmd)
    print(os.system(cmd))
    
    print("Concatenating all chunks for - ", buff)
    ## Append the remaining chunks (1-9) to the renamed first chunk
    for i in range(1,10):

        BUFF_CHUNK = BFLY_BUFF_PRIORS_DIR+"bfly-subset-sorted-buffer"+buff+"-"+str(i)+".prior"
        
        #cmd = "ls -l " + BUFF_CHUNK
        #print(cmd)
        #print(os.system(cmd))
        
        ## The tail +2 removes the header from all but the first file
        cmd = "tail -n +2 " + BUFF_CHUNK + " >> " + FULL_BUFF
        print(cmd)
        print(os.system(cmd))
    

Doing buffer - 0
mv /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0-0.prior /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0.prior
256
Concatenating all chunks for -  0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0-1.prior >> /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0-2.prior >> /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0.prior
0
tail -n +2 /Volumes/WorkDrive/msbayes-buffering/hickerlab-
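
The `256` printed after the `mv` is worth a second look: on POSIX systems `os.system` returns a wait status, not the exit code, so 256 means the command exited with code 1, i.e. the `mv` failed (for instance if the first chunk had already been renamed). A tiny sketch for decoding it; `exit_code` is a hypothetical helper that ignores the low byte, which carries signal information.

```python
def exit_code(status):
    """Decode the exit code from an os.system() wait status on POSIX."""
    return status >> 8
```

Checking `exit_code(os.system(cmd)) != 0` and stopping the loop would avoid appending chunks 1-9 to a file that lacks the first chunk and its header.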

## Do Rejection step for subset

In [44]:
## I had to manually install several R packages at this point
##     install.packages("VGAM")
##     install.packages("locfit")

## Make sure you are in the msbayes working directory
os.chdir(MSBAYES_EXECDIR)

## This is the command we'll eventually run
## ./acceptRej.pl -s 'pi.b' -p test.pdf ../data/bfly/conf/full/bfly_full_obsSS.txt ../data/bfly/priors/full/buffer0/bfly-full-sorted-buffer0-0.prior > test.out

directory=BFLY_SUBSET_RESULTS_DIR
if not os.path.exists(directory):
    os.makedirs(directory)

BFLY_OBS_SS = BFLY_SUBSET_CONF_DIR + "maxn-5-bfly-subset-obsSS.txt"

BUFFER_VALUES = ["0", "0.1", "0.05", "0.01"]

for buff in BUFFER_VALUES:
    print("Doing rejection step for - ", buff)

    ## create a results dir for each buffer value
    BUFF_OUT=BFLY_SUBSET_RESULTS_DIR + "buffer" + buff + "/"
    if not os.path.exists(BUFF_OUT):
        os.makedirs(BUFF_OUT)
    
    BFLY_BUFF_PRIORS_DIR=BFLY_SUBSET_PRIORS_DIR+"buffer"+buff+"/"
    BFLY_BUFF_PRIOR=BFLY_BUFF_PRIORS_DIR+"bfly-subset-sorted-buffer"+buff+".prior"
    B_OUT_FILE=BUFF_OUT + "Results-" + buff + ".out"
    B_OUT_PDF=BUFF_OUT + "Results-" + buff + ".pdf"

    ## Since we are using the subset with >= 5 inds per population we
    ## can use all the summary stats, not just pi.b
    cmd = MSREJECT_PL + " -p " + B_OUT_PDF + " "\
            + " -t 0.00034 "\
            + BFLY_OBS_SS + " " + BFLY_BUFF_PRIOR \
            + " > " + B_OUT_FILE
    print(cmd)
    print(os.system(cmd))
    
    ## Keep a copy of the posterior_table file
    os.rename(MSBAYES_EXECDIR+"posterior_table", BUFF_OUT+"posterior_table.buffer"+buff+".txt")
    print("\n")


Doing rejection step for -  0
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/acceptRej.pl -p /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/subset/buffer0/Results-0.pdf  -t 0.00034 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/conf/subset/maxn-5-bfly-subset-obsSS.txt /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/priors/subset/buffer0/bfly-subset-sorted-buffer0.prior > /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/subset/buffer0/Results-0.out
0


Doing rejection step for -  0.1
/Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/src/acceptRej.pl -p /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering/data/bfly/results/subset/buffer0.1/Results-0.1.pdf  -t 0.00034 /Volumes/WorkDrive/msbayes-buffering/hickerlab-repository/msbayes-buffering