# De novo, ref, and ref + de novo

**20180618**

Another RADseq paper on *Parastichopus californicus* came out this year, and its authors aligned their reads to the genome of a closely related species, *Parastichopus parvimensis*. I wanted to compare my de novo assembly with an assembly that uses that genome as a reference. ipyrad also has a reference + de novo option, so I made three ipyrad assemblies that differ only in whether they are assembled de novo, with a reference, or both. I'd like to know:

<br>[1] Which produces the most loci? 
<br>*Answer: All three methods produce almost the same number of loci.*
<br>
<br>[2] Are the loci produced in each method mostly the same loci? 
<br>*Answer: Yes. About 98% of loci map uniquely across methods.*
<br>
<br>[3] How does assembly method affect % loci lost at HWE filtering? 
<br>*Answer: It doesn't. The HWE filtering methods differ in how many loci they boot out, but the number of loci booted out is not affected by assembly method.*
![line](https://cdn-images-1.medium.com/max/1600/1*IH10jlQEJ7GW1_oq8s7WPw.png)


### **[1]** Number of loci

**Takeaway: number of loci almost the same across 3 methods**

Before filtering for just one SNP per locus and minor allele frequency of 0.05, this is how they compare.

- de novo -> 609 loci
- reference -> 613 loci
- ref + de novo -> 613 loci

After filtering for just one SNP per locus and minor allele frequency of 0.05, this is how they compare.

- de novo -> 520
- reference -> 523
- ref + de novo -> 523
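
For reference, here is a minimal sketch of what this kind of filter does, on made-up data. Everything in it is an assumption for illustration: the genotype coding (0/1/2 alt-allele counts), the function names, the choice to apply the MAF filter before picking a SNP, and the choice to keep the first surviving SNP per locus. It is not the script that produced the counts above.

```python
def minor_allele_freq(genotypes):
    """MAF from diploid genotypes coded as alt-allele counts (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_snps(snps, min_maf=0.05):
    """Drop SNPs with MAF below min_maf, then keep the first surviving SNP per locus."""
    kept = {}
    for locus, position, genotypes in snps:
        if minor_allele_freq(genotypes) < min_maf:
            continue
        # setdefault keeps only the first SNP seen for each locus
        kept.setdefault(locus, (locus, position, genotypes))
    return list(kept.values())


# Hypothetical toy data: (locus name, SNP position, genotypes across individuals)
toy_snps = [
    ("locus_1", 12, [0, 1, 2, 1, 0, 1]),     # passes MAF, first SNP on locus_1
    ("locus_1", 40, [0, 0, 1, 0, 0, 0]),     # passes MAF, but locus_1 already has a SNP
    ("locus_2", 7, [0, 0, 0, 0, 0, 0]),      # monomorphic, fails MAF
    ("locus_3", 55, [2, 2, 1, None, 2, 2]),  # passes MAF with one missing genotype
]
filtered = filter_snps(toy_snps)
kept_sites = [(locus, position) for locus, position, _ in filtered]
```

Filtering on MAF first means a locus is only lost when none of its SNPs pass; picking one SNP first and then filtering on MAF could drop more loci.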


### **[2]** Comparing loci with bowtie2

**Takeaway: loci from 3 methods were ~98% identical**

All assembly methods produced almost the same loci. When aligning the fasta from the reference and reference + de novo assemblies to the de novo assembly, 97.55% of loci mapped only once, and the remaining 15 loci did not map at all. When aligning the fasta from the de novo assembly to the reference and reference + de novo assemblies, 98.19% of loci mapped only once, and the remaining 11 loci did not map at all. No loci mapped to more than one location.
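
As a quick arithmetic check, the percentages can be recomputed from the locus counts. This sketch assumes the aligned fastas held the pre-filter loci from section [1] (613 for reference and ref + de novo, 609 for de novo); the function name is made up for illustration.

```python
def percent_mapped_once(total_loci, unmapped, multi_mapped=0):
    """Percent of query loci that aligned exactly once."""
    return 100 * (total_loci - unmapped - multi_mapped) / total_loci


# reference (or ref + de novo) loci aligned to the de novo assembly
ref_to_denovo = percent_mapped_once(total_loci=613, unmapped=15)
# de novo loci aligned to the reference (or ref + de novo) assembly
denovo_to_ref = percent_mapped_once(total_loci=609, unmapped=11)

print(f"{ref_to_denovo:.2f}%")  # 97.55%
print(f"{denovo_to_ref:.2f}%")  # 98.19%
```

Under that assumption, both directions leave the same 598 loci mapping once (613 - 15 and 609 - 11), which fits the assemblies sharing nearly all of their loci.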

### **[3]** HWE Filtering

**Takeaway: assembly method does not affect the number of loci lost due to HWE filtering**

At the core of all three HWE filtering methods is first running the exact pairwise HW test in Genepop. Then...

(1) Jon Puritz's method boots out loci that are out of HWE (p < 0.01) in more than half of the populations. This method is expected to boot out few loci. Jon's reasoning is that we expect a lot of loci to be out of HWE, and that we're not actually trying to remove those loci. Instead, we're trying to remove poorly assembled loci (loci that incorrectly combine repeat units, paralogous loci, etc.), which should be very few.
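
A minimal sketch of this rule, assuming the per-population HWE p-values have already been parsed out of the Genepop output into a dictionary per locus. The data structure, names, and toy p-values are all hypothetical, and populations with no p-value are not handled.

```python
def fails_majority_rule(pvalues_by_pop, alpha=0.01):
    """True if the locus is out of HWE (p < alpha) in more than half of the populations."""
    n_out = sum(p < alpha for p in pvalues_by_pop.values())
    return n_out > len(pvalues_by_pop) / 2


# Hypothetical p-values: locus -> population -> exact-test p-value
hwe_pvalues = {
    "locus_A": {"pop1": 0.50, "pop2": 0.20, "pop3": 0.004, "pop4": 0.80},
    "locus_B": {"pop1": 0.001, "pop2": 0.003, "pop3": 0.009, "pop4": 0.30},
}
removed = [locus for locus, pvals in hwe_pvalues.items() if fails_majority_rule(pvals)]
```

Here `locus_A` is out of HWE in only one of four populations and stays, while `locus_B` is out in three of four and is removed.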

(2) Mary's method uses Fisher's method for combining probabilities from independent tests of significance, which is essentially a way to account for multiple tests. Mary then removes loci if they are out of HWE in at least 2 populations.
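
For reference, Fisher's method itself can be sketched in a few lines. This only shows how k independent p-values are combined into one; how the combined value feeds into Mary's "at least 2 populations" rule is not spelled out here, so that step is left out. The function name is made up, and it assumes every p-value is greater than 0.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # For even degrees of freedom (2k), the chi-square upper tail has a closed form:
    # exp(-X/2) * sum over i in 0..k-1 of (X/2)^i / i!
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


# Two populations that are each borderline (p = 0.05) combine to stronger evidence
combined = fisher_combined_pvalue([0.05, 0.05])
```

If SciPy is available, `scipy.stats.combine_pvalues(pvalues, method="fisher")` should give the same p-value.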

(3) Charlie's method uses a multiple-testing correction based on the false discovery rate to calculate q-values per locus in each population. I removed loci if their q-value was < 0.05 in at least half of the populations.
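
A rough sketch of this kind of filter, with one swap named plainly: it uses Benjamini-Hochberg adjusted p-values as a stand-in for q-values, which may not be the FDR procedure Charlie actually uses. The function names and the set of p-values corrected together are assumptions for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def fails_half_rule(qvalues_by_pop, threshold=0.05):
    """True if the locus has q < threshold in at least half of the populations."""
    n_out = sum(q < threshold for q in qvalues_by_pop)
    return n_out >= len(qvalues_by_pop) / 2


# Hypothetical p-values for four loci within one population
example_q = bh_qvalues([0.01, 0.04, 0.03, 0.5])
```

Note the difference from rule (1): this one removes a locus at exactly half of the populations, while Jon's rule requires more than half.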

I ran all three HWE filtering techniques on the output of all three assembly methods. Mary's method removes most of my loci. Jon's and Charlie's methods remove almost none of my loci. The percent of loci removed did not differ meaningfully across assembly methods.

![HWE_table](https://github.com/nclowell/SeaCukes/blob/master/Imgs_for_Notebooks/HWE_assembly_method_table_20180620.png?raw=true)

#### DAPCs

Assembly method: de novo

![DAPC](https://github.com/nclowell/SeaCukes/blob/master/Imgs_for_Notebooks/PC_allR_1M_20K_002_biall_maf_oneSNP_inames_wpops_fHWE_noCA_DAPC.png?raw=true)
<br>![DAPC](https://github.com/nclowell/SeaCukes/blob/master/Imgs_for_Notebooks/PC_allR_1M_20K_002_biall_maf_oneSNP_inames_wpops_noCA_DAPC.png?raw=true)
<br>![DAPC](https://github.com/nclowell/SeaCukes/blob/master/Imgs_for_Notebooks/PC_allR_1M_20K_002_biall_maf_oneSNP_inames_wpops_post_HWEchisquare_fgp_noCA_DAPC.png?raw=true)

Assembly method: reference 

![DAPC]()
<br>![DAPC]()
<br>![DAPC]()
