# Preliminary results of population structure in *Crassadoma gigantea*

I wanted to briefly summarize what I have found so far while analyzing my data at the SNP and haplotype level.

### Genotyping error

I sequenced four scallops twice in the same lane, each with a unique barcode, so that I could quantify genotyping error. I calculated genotyping error as the proportion of genotypes that did not perfectly match between replicates, excluding any comparison where either replicate had missing data. Thus, a genotyping error includes both one-allele matches and no-allele matches (e.g., 0102 & 0202 would be a genotyping error with one allele matching, and 0202 & 0303 would be an error with neither allele matching).

SNP-level genotyping error: **0.38%**
<br>
Haplotype-level genotyping error: **1.76%**
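
To make the counting rule concrete, here is a minimal Python sketch of how the mismatch rate between two replicates could be tallied. It is not the script I actually ran: the function names are made up for illustration, genotypes are assumed to be Genepop-style strings like `0102`, all-zero genotypes are assumed to mean missing data, and alleles are compared without regard to order (so `0201` and `0102` count as a match), which is also an assumption.

```python
def split_alleles(genotype):
    """Split a Genepop-style genotype such as '0102' into two sorted allele codes."""
    half = len(genotype) // 2
    return sorted([genotype[:half], genotype[half:]])


def is_missing(genotype):
    """Assumption: an all-zero genotype ('0000' or '000000') means missing data."""
    return set(genotype) == {"0"}


def genotyping_error(rep_a, rep_b):
    """Return (mismatches, compared, error_rate) for two replicate genotype lists.

    Loci where either replicate is missing are skipped. Any imperfect match,
    whether one allele or neither allele matches, counts as one error.
    """
    mismatches = 0
    compared = 0
    for g_a, g_b in zip(rep_a, rep_b):
        if is_missing(g_a) or is_missing(g_b):
            continue
        compared += 1
        if split_alleles(g_a) != split_alleles(g_b):
            mismatches += 1
    rate = mismatches / compared if compared else float("nan")
    return mismatches, compared, rate


# Toy replicates (invented data, one genotype per locus)
rep_a = ["0102", "0202", "0101", "0000", "0201"]
rep_b = ["0202", "0303", "0101", "0102", "0102"]
mismatches, compared, rate = genotyping_error(rep_a, rep_b)
print(mismatches, compared, rate)
```

In the toy data, the first two loci are the one-allele and neither-allele mismatches from the example above, the fourth locus is skipped for missing data, and the rate is computed over the four loci that remain.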

### DAPCs

#### SNP level DAPC

To produce a SNP level DAPC, I used the Genepop file that ``populations`` outputs at the end of the second round of Stacks. This is after I've made a "reference genome" by ultra-filtering a catalog of de novo loci. I first filtered for missing values, removing any locus where the frequency of missing values within any population exceeded 0.2 (retained 86.5% of loci). Then, I filtered for a minor allele frequency of 0.05 across populations (retained 21.1% of loci). In the future, I'll write a script that filters for minor allele frequency based on per-population allele frequencies, but I didn't have time, and I figured filtering across populations would be a useful preliminary look given that my population sizes are pretty small and even. After these two filtering stages, I retained **15,079 loci**.
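
For reference, the two filters could be sketched in Python roughly as below. This is an illustration under assumptions, not my actual filtering script: the data layout (a dict of locus name to a dict of population name to genotype strings), the function names, the `0000` missing code, and the biallelic-SNP assumption are all mine for the sake of the example.

```python
from collections import Counter

MISSING = "0000"  # assumption: Genepop-style two-digit alleles, zeros = missing


def max_missing_fraction(genos_by_pop):
    """Highest within-population frequency of missing genotypes at one locus."""
    return max(
        sum(g == MISSING for g in genos) / len(genos)
        for genos in genos_by_pop.values()
    )


def minor_allele_frequency(genos_by_pop):
    """Minor allele frequency pooled across populations (biallelic SNP assumed)."""
    counts = Counter()
    for genos in genos_by_pop.values():
        for g in genos:
            if g != MISSING:
                counts[g[:2]] += 1
                counts[g[2:]] += 1
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no calls, or monomorphic
    return min(counts.values()) / total


def filter_loci(loci, max_missing=0.2, min_maf=0.05):
    """Keep loci passing the missing-data filter, then the MAF filter."""
    kept = []
    for name, genos_by_pop in loci.items():
        if max_missing_fraction(genos_by_pop) > max_missing:
            continue
        if minor_allele_frequency(genos_by_pop) < min_maf:
            continue
        kept.append(name)
    return kept


# Toy data (invented): two populations, five individuals each
loci = {
    "locus_1": {"popA": ["0101", "0102", "0202", "0101", "0102"],
                "popB": ["0102", "0101", "0101", "0202", "0101"]},
    "locus_2": {"popA": ["0000", "0000", "0101", "0102", "0101"],
                "popB": ["0102", "0101", "0101", "0202", "0101"]},
    "locus_3": {"popA": ["0101"] * 5, "popB": ["0101"] * 5},
}
kept = filter_loci(loci)
print(kept)
```

Here `locus_2` fails because 40% of genotypes are missing in `popA`, and `locus_3` fails the minor allele frequency filter because it is monomorphic.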

I made a DAPC, script [here](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Scripts/CRAGIG_RUN1_DAPC_snps_20170324.R):

![img](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/dapc_snp_20170324.png?raw=true)


#### Haplotype level DAPC

To produce a haplotype level DAPC, I first had to produce a haplotype Genepop, which is not an output of any step in the Stacks pipeline. I used the haplotypes file that ``populations`` outputs and wrote a script that builds a Genepop file from it. Then, I filtered for missing values, again removing any locus where the missing value frequency in any population exceeded 0.2 (retained 68.1% of loci). Next, I tried to filter for a low minor allele frequency, but it kept filtering out pretty much all of my data. So, I plotted a histogram of haplotype frequencies, calculated for each haplotype across individuals, to see how common different frequencies were. This is what it looked like:

![img](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/hap_freq_dist_20170324.png?raw=true)

No surprise that almost everything was getting filtered out with minor allele frequency thresholds of 0.03 and 0.05. For comparison, I made the same plot with Mary's cod data, and found a different distribution of haplotype frequencies, below. Not sure exactly what the difference means, but interesting nonetheless!

![image]()
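
The frequencies behind a histogram like the ones above could be tallied roughly as in this Python sketch. It is not the script I used. It assumes each genotype call is a string with heterozygotes written as two haplotypes joined by `/`, homozygotes written as a single haplotype, and `-` for missing data; the function names and the toy data are invented for illustration.

```python
from collections import Counter


def haplotype_frequencies(haps_by_locus, missing="-"):
    """Return the frequency of every haplotype at every locus, across individuals."""
    freqs = []
    for calls in haps_by_locus.values():
        counts = Counter()
        for call in calls:
            if call == missing:
                continue
            alleles = call.split("/")
            if len(alleles) == 1:
                alleles = alleles * 2  # homozygote carries two copies
            counts.update(alleles)
        total = sum(counts.values())
        freqs.extend(n / total for n in counts.values())
    return freqs


def bin_counts(freqs, n_bins=10):
    """Count frequencies falling in equal-width bins on [0, 1]."""
    bins = [0] * n_bins
    for f in freqs:
        idx = min(int(f * n_bins), n_bins - 1)  # a frequency of 1.0 goes in the last bin
        bins[idx] += 1
    return bins


# Toy data (invented): two loci, four individuals each
haps = {
    "1": ["ACG/ATG", "ACG", "ACG", "-"],
    "2": ["TTA/TTC", "TTA/TTG", "TTC/TTG", "TTA"],
}
freqs = haplotype_frequencies(haps)
bins = bin_counts(freqs)
print(sorted(freqs))
print(bins)
```

Because a locus can carry many haplotypes, each one tends to sit at a low frequency, which is the pattern that makes a single minor allele frequency cutoff remove so much.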


So, I skipped the minor allele frequency filtering because it didn't make any sense with these data. Filtering only for missing data, I retained **5,282 loci**. Then I plotted a DAPC, script [here](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Scripts/CRAGIG_RUN1_tags_DAPC_20170324.R):

![img](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/dapc_tags_noleg_20170324.png?raw=true)

### Fis

#### SNP level Fis distribution

I used the ``hierfstat`` package in R to calculate locus Fis, and plotted a histogram here using the SNP level Genepop. Script [here](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Scripts/HIERFSTAT_snps_cragigrun1_20170324.R).
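
As a reminder of what a per-locus Fis measures, here is a small Python sketch of the textbook definition, Fis = 1 - Ho/He, for one locus in one population. This is only an illustration: ``hierfstat`` applies its own estimators with sample-size corrections, so its values will not match this simple version exactly. The function name, the `0000` missing code, and the toy genotypes are assumptions.

```python
from collections import Counter


def locus_fis(genotypes, missing="0000"):
    """Uncorrected Fis = 1 - Ho/He for one locus; None if it cannot be computed."""
    called = [g for g in genotypes if g != missing]
    if not called:
        return None
    allele_counts = Counter()
    n_het = 0
    for g in called:
        a, b = g[:2], g[2:]
        allele_counts.update([a, b])
        if a != b:
            n_het += 1
    n_alleles = 2 * len(called)
    h_obs = n_het / len(called)  # observed heterozygosity
    h_exp = 1 - sum((c / n_alleles) ** 2 for c in allele_counts.values())
    if h_exp == 0:
        return None  # monomorphic locus, Fis undefined
    return 1 - h_obs / h_exp


# Toy genotypes (invented)
deficit = locus_fis(["0101", "0101", "0202", "0202"])  # no heterozygotes
excess = locus_fis(["0102", "0102", "0102", "0102"])   # all heterozygotes
hwe = locus_fis(["0101", "0102", "0102", "0202"])      # matches expectation
print(deficit, excess, hwe)
```

Positive values indicate fewer heterozygotes than expected and negative values indicate more, so the shape of the histogram across loci shows which direction most loci lean.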

Here's the distribution **across** populations:
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_across_pops.png?raw=true)

Here's the distribution **within** populations to see variation:
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_wa_strait_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_WA_dabob_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_WA_sanjuans_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_CA_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/snp_fis_AK_20170324.png?raw=true)

#### Haplotype level Fis distribution

I did the same, but with the haplotype-level Genepop file. R script [here](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Scripts/CRAGIG_RUN1_tags_DAPC_20170324.R).

Here's the distribution **across** populations:
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_across_pops.png?raw=true)

Here's the distribution **within** populations to see variation:
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_WA_Strait_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_WA_dabob_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_WA_sanjuans_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_CA_20170324.png?raw=true)
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Analyses/Fis/hap_fis_AK_20170324.png?raw=true)