## Stacks batch 2, new Populations run with write_random_snp model

The main difference between Kristen's and my data analysis was that she chose a random SNP at a locus to genotype, whereas I used the entire locus as a haplotype and then filtered for biallelic haplotypes. This may be causing the discrepancy in the number of loci that she and I are getting from stacks (she is getting far more loci). 

In this notebook:

1. Re-run my Alaskan cod Stacks pipeline with the `write_random_snp` option in `populations`, which chooses one random SNP per locus to genotype. Then compare the number of loci in that output to the number of loci after filtering for biallelic haplotypes.
2. Also compare against the number of loci that come out of `populations` when NOT using the `write_random_snp` option and also NOT doing biallelic filtering.
3. Re-run my Korean cod Stacks pipeline with the `write_random_snp` option in `populations`, which chooses one random SNP per locus to genotype. Then compare the number of loci in that output to the number of loci after filtering for biallelic haplotypes.
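To make the difference between the two genotyping approaches concrete, here is a toy sketch. It is not Stacks' code or Marine's script; the locus, the haplotypes, and the function names are all made up for illustration. It shows how a locus with two SNPs can carry three haplotypes (so it fails a biallelic-haplotype filter) even though each SNP on its own has only two alleles.

```python
import random

def distinct_haplotypes(haplotypes):
    """Return the set of distinct full-locus haplotypes observed."""
    return set(haplotypes)

def is_biallelic(alleles):
    """True if exactly two distinct alleles are observed."""
    return len(set(alleles)) == 2

def random_snp_alleles(haplotypes, rng):
    """Pick one SNP position at random and return the alleles at that position."""
    position = rng.randrange(len(haplotypes[0]))
    return [hap[position] for hap in haplotypes]

# Hypothetical locus: each string is a haplotype across two SNP positions.
observed = ["AC", "AC", "AT", "GC", "GC", "AT"]

rng = random.Random(1)
print(sorted(distinct_haplotypes(observed)))             # ['AC', 'AT', 'GC']
print(is_biallelic(observed))                            # False: haplotype locus is dropped
print(is_biallelic(random_snp_alleles(observed, rng)))   # True: either SNP alone is biallelic
```

In this made-up case the random-SNP approach keeps the locus while the biallelic-haplotype approach discards it, which is the kind of effect that could inflate Kristen's locus count relative to mine.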


<br>
<br>
### ONE: Choosing a single random SNP per locus, Alaskan data
#### 7/25/2017


I am going to use the batch 2 files to re-run `populations` with the `write_random_snp` model, to see how many loci I end up with compared to Kristen's run. But first, I need to:

**unzip the `.matches` files**

In [1]:
cd ../stacks_b2_wgenome/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks_b2_wgenome


In [2]:
!gzip -d *.matches.tsv.gz

Then I have to put all of the old populations files into a folder (for now).

In [3]:
!mkdir populations_original

In [4]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo


**run `populations` using the write_random_snp model**

In [7]:
cd scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts


In [8]:
!populations -b 2 -P ../stacks_b2_wgenome \
-M PopMap.txt \
-t 36 -r 0.75 -p 4 -m 10 \
--write_random_snp \
--genepop --fasta \
2>> populations_out_batch2_wgenome_randomSNP
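As a rough illustration of what `-r 0.75 -p 4` ask for, here is a sketch based on my reading of the Stacks documentation, not on the program's actual code. A locus counts for a population only if at least a fraction `r` of that population's individuals have it, and the locus is kept only if that holds in at least `p` populations. The function name, population labels, and counts are hypothetical, and the `-m` depth threshold is ignored here.

```python
def passes_r_p(genotyped_per_pop, pop_sizes, r=0.75, p=4):
    """Return True if a locus meets the r threshold in at least p populations.

    genotyped_per_pop: dict of population -> number of individuals with the locus
    pop_sizes: dict of population -> total individuals in that population
    """
    pops_passing = 0
    for pop, total in pop_sizes.items():
        if total > 0 and genotyped_per_pop.get(pop, 0) / float(total) >= r:
            pops_passing += 1
    return pops_passing >= p

# Made-up example: five populations of 20 individuals each.
sizes = {"A": 20, "B": 20, "C": 20, "D": 20, "E": 20}
print(passes_r_p({"A": 18, "B": 15, "C": 16, "D": 20, "E": 3}, sizes))   # True: 4 populations reach 75%
print(passes_r_p({"A": 18, "B": 14, "C": 16, "D": 10, "E": 3}, sizes))   # False: only 2 populations reach 75%
```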

#### 7/26/2017
**convert the `populations` genepop output to a fasta file**

`populations` puts out a file where each locus is technically a locus_snp. I use the script below to turn this into a fasta file, which allows me to count the number of true loci in the file. 
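The idea of collapsing locus_snp names to true loci can be sketched as follows. This is not `genBOWTIEfasta_fromGENEPOP.py` itself; it assumes marker names of the form `<locus ID>_<SNP column>` (for example `101_12`), and the marker names below are invented.

```python
def count_true_loci(marker_names):
    """Count distinct catalog locus IDs among locus_snp marker names."""
    locus_ids = set()
    for name in marker_names:
        # Assumed format: "<locus ID>_<SNP column>"; keep only the locus ID.
        locus_ids.add(name.strip().split("_")[0])
    return len(locus_ids)

# Hypothetical marker names: locus 101 appears twice with different SNP columns.
markers = ["101_12", "101_45", "202_7", "303_88"]
print(count_true_loci(markers))   # 3
```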

In [33]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts'

In [34]:
!head genBOWTIEfasta_fromGENEPOP.py

### This python script will create a list of loci from the `populations` output genepop file ###

## ARGUMENTS: 
#ARG 1 - genepop file from `populations`. 
#ARG 2 - the .catalog file output from `cstacks` (unzipped)

### output will appear in the same folder as this script and will automatically be named "seqsforBOWTIE.fa"

import sys



In [36]:
!python genBOWTIEfasta_fromGENEPOP.py \
../stacks_b2_wgenome/populations_randomSNP/batch_2.genepop \
../stacks_b2_wgenome/batch_2.catalog.tags.tsv

-----
Reading loci from file:
../stacks_b2_wgenome/populations_randomSNP/batch_2.genepop
Stacks version 1.44; Genepop version 4.1.3; July 25, 2017

Done reading loci

Using sequences from catalog file:
../stacks_b2_wgenome/batch_2.catalog.tags.tsv

Writing new fasta file...
Done.


In [37]:
!mv seqsforBOWTIE.fa ../stacks_b2_wgenome/populations_randomSNP/batch_2_wgenome_randomSNP.fa

**Count the number of loci right out of populations:**

In [38]:
!grep ">" ../stacks_b2_wgenome/populations_randomSNP/batch_2_wgenome_randomSNP.fa | wc -l

10715


**Compare to the number of unfiltered biallelic loci:** 

*(after Marine's script to re-call genotypes based on stack depth and to filter for biallelic loci)*

In [39]:
cd ../stacks_b2_wgenome/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks_b2_wgenome


In [40]:
cd populations_original/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks_b2_wgenome/populations_original


In [41]:
# Count the tab-separated locus names on the first line of the file.
with open("batch_2.CorrectedGenotypes_biallelic.txt", "r") as infile:
    loci_line = infile.readline()

loci_list = loci_line.strip().split("\t")
print(len(loci_list))

6377


<br>
<br>

**Using the `write_random_snp` model, the number of loci right out of `populations` (no further filtering) is 10,715.**

**Using the haplotype method and filtering for biallelic loci (no further filtering), the number of loci is 6,377.**



<br>
<br>
### TWO: Is this a result of using haplotypes instead of single SNPs, or is it a result of filtering for biallelic?

I found how many loci were in the genepop file *right* out of populations without using the `write_random_snp` option, before I filtered for biallelic loci.

In [42]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks_b2_wgenome/populations_original'

In [43]:
!python ../../scripts/genBOWTIEfasta_fromGENEPOP.py \
batch_2.genepop \
../batch_2.catalog.tags.tsv

-----
Reading loci from file:
batch_2.genepop
Stacks version 1.44; Genepop version 4.1.3; May 25, 2017

Done reading loci

Using sequences from catalog file:
../batch_2.catalog.tags.tsv

Writing new fasta file...
Done.


In [45]:
!mv seqsforBOWTIE.fa batch_2_wgenome_prefilter.fa

In [47]:
!grep ">" batch_2_wgenome_prefilter.fa | wc -l

10729


**With or without the `write_random_snp` model, and with NO BIALLELIC FILTERING, `populations` gives me about 10,700 loci (10,715 vs. 10,729).**
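A quick arithmetic check on the three Alaskan counts reported above. It assumes the biallelic file was derived from the same `populations` run as `batch_2.genepop` in `populations_original`; the variable names are mine.

```python
random_snp = 10715           # write_random_snp, no biallelic filter
haplotype_raw = 10729        # haplotypes, no biallelic filter
haplotype_biallelic = 6377   # haplotypes after the biallelic filter

diff_methods = haplotype_raw - random_snp
lost_to_filter = haplotype_raw - haplotype_biallelic

print(diff_methods)                                       # 14
print(lost_to_filter)                                     # 4352
print(round(100.0 * lost_to_filter / haplotype_raw, 1))   # 40.6
```

Under that assumption, the two models differ by only 14 loci, while the biallelic step (which also re-calls genotypes by depth) removes about 40% of loci, so the filter rather than the SNP model appears to account for most of the gap.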
<br>
<br>
<br>

### THREE: Running the write_random_snp model on the Korean data, batch 6

In [48]:
cd ../../

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo


In [49]:
cd ../PCod-Korea-repo/stacks_b6_wgenome/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/stacks_b6_wgenome


Temporarily moved `populations` output files into a new folder. 

In [50]:
!mkdir populations_orig

In [51]:
cd ../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/scripts


In [52]:
#run populations
!populations -b 6 -P ../stacks_b6_wgenome \
-M PopMap_L1-4.txt \
-t 36 -r 0.75 -p 4 -m 10 \
--write_random_snp \
--genepop --fasta \
2>> populations_out_batch6_wgenome_randomSNP

I put the new populations files into a folder:

In [56]:
cd ../stacks_b6_wgenome/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/stacks_b6_wgenome


In [57]:
!mkdir populations_randomSNP

In [58]:
cd ../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/scripts


In [60]:
#convert populations output to fasta file
!python genBOWTIEfasta_fromGENEPOP.py \
../stacks_b6_wgenome/populations_randomSNP/batch_6.genepop \
../stacks_b6_wgenome/batch_6.catalog.tags.tsv

-----
Reading loci from file:
../stacks_b6_wgenome/populations_randomSNP/batch_6.genepop
Stacks version 1.44; Genepop version 4.1.3; July 26, 2017

Done reading loci

Using sequences from catalog file:
../stacks_b6_wgenome/batch_6.catalog.tags.tsv

Writing new fasta file...
Done.


In [61]:
cd ../stacks_b6_wgenome/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/stacks_b6_wgenome


In [62]:
cd populations_randomSNP/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/stacks_b6_wgenome/populations_randomSNP


In [64]:
!mv ../../scripts/seqsforBOWTIE.fa batch_6_wgenome_randomSNP.fa

In [65]:
!grep ">" batch_6_wgenome_randomSNP.fa | wc -l

22026


**Before filtering, using the `write_random_snp` model gives me 22,026 loci right out of `populations`. So my Korean data still gives me more loci than the Alaskan data (about twice as many).**
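For reference, the `grep ">" ... | wc -l` counts can be reproduced in Python with a small helper, sketched below. The function names are mine, and it counts only lines that start with `>`, which should match `grep` for a well-formed FASTA file.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines starting with '>'."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

def fold_difference(n_a, n_b):
    """Return n_a / n_b as a float."""
    return float(n_a) / n_b

# Counts reported above for the Korean and Alaskan random-SNP runs.
print(round(fold_difference(22026, 10715), 2))   # 2.06
```

Calling `count_fasta_records("batch_6_wgenome_randomSNP.fa")` from the `populations_randomSNP` folder should give the same number as the `grep` cell above.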

<br>
<br>
<br>


### Kristen's stacks parameters compared to my stacks parameters

*This is according to the February 2017 draft manuscript*

| Parameter/Filter    | Kristen    | Mary    | Comments    |
|:------:|:------:|:------:|:-------|
| M | 2 | 3 | M - max # of nucleotide differences between sequences in a single stack. Since I used a larger M value, I would expect fewer stacks. But in parameter testing runs, M did not produce as large a difference in # of loci as is seen here. |
| m | 3 | 5 / 10 | m - minimum # sequences required to make a stack. This would definitely lead to more loci in Kristen's data than in my data. However, I am hesitant to put this parameter so low. | 
| N | 4 | 4 | default | 
| n | 3 | 3 | default | 
| max_locus_stacks | 3 | 3 | default |
| SNPs present in x% of samples per site | >= 80% | >= 75% | Kristen's is more stringent here.|
| choosing SNPs | write_random_snp | haplotypes, biallelic SNPs only | See above; this could be a big source of the discrepancy |
| Minor allele freq | MAF < 0.05 | MAF < 0.05 in all populations | Not sure if Kristen applied MAF < 0.05 in all populations or in any population; assuming she did the same as our lab, which is what I did. |
| HWE | uncorrected p-values <= 0.05 | multiple testing method | Wouldn't expect this to lead to a large difference in retained loci. |
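To pull out just the settings that differ, the table can be written as two dictionaries and compared. This is only a convenience sketch: the key names are my own labels for the table rows, the values are copied as strings from the table, and the MAF and HWE rows are left out because their descriptions are not directly comparable.

```python
kristen = {"M": "2", "m": "3", "N": "4", "n": "3", "max_locus_stacks": "3",
           "min_samples_per_site": ">= 80%", "snp_choice": "write_random_snp"}
mary = {"M": "3", "m": "5 / 10", "N": "4", "n": "3", "max_locus_stacks": "3",
        "min_samples_per_site": ">= 75%",
        "snp_choice": "haplotypes, biallelic SNPs only"}

def differing_settings(a, b):
    """Return the sorted names of settings whose values differ between a and b."""
    return sorted(key for key in a if a[key] != b.get(key))

print(differing_settings(kristen, mary))
# ['M', 'm', 'min_samples_per_site', 'snp_choice']
```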