---
title: "PhD report: Defining the Genome-wide Landscape in Rapeseed"
author: jmontero
date: 2024-11-06
format:
  html:
    toc: true
    toc-location: left
    self-contained: true
    embed-resources: true
execute:
  freeze: auto
  cache: true
---


# Variant calling

I repeated variant calling, this time accounting for the [**missing genotype format issue**](https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot-obsolete-as-of-GATK-4-6-0-0). I did **combined joint genotyping** on the 100 founders and 214 RILs together, kept only biallelic sites, and converted genotypes with **DP<4** and heterozygous genotypes to ./. (this accounts for all missing and low-depth variants). I then subset the VCF by 16-way family, kept only polymorphic sites with a **missing call rate <= 0.3** (genotype missing in fewer than 5 family members), and converted the VCF file to PED.
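The masking and per-family site filter described above can be sketched in plain Python. This is only an illustration of the logic on a toy family, not the commands I ran on the real VCFs; the function names `mask_genotype` and `keep_site` and the toy genotypes are hypothetical.

```python
MISSING = "./."


def mask_genotype(gt, dp, min_dp=4):
    """Return the genotype, or ./. if it is missing, low depth, or heterozygous."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or dp is None or dp < min_dp:
        return MISSING
    if len(set(alleles)) > 1:  # heterozygous call
        return MISSING
    return "/".join(alleles)


def keep_site(genotypes, max_missing=0.3):
    """Keep a site if its missing call rate is <= max_missing and it is polymorphic."""
    called = [g for g in genotypes if g != MISSING]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    is_polymorphic = len(set(called)) > 1
    return missing_rate <= max_missing and is_polymorphic


# Toy 16-member family at one biallelic site: (GT, DP) per individual.
family = (
    [("0/0", 10)] * 6
    + [("1/1", 8)] * 6
    + [("0/1", 12), ("1/1", 2), ("./.", 0), ("0/0", 3)]
)
masked = [mask_genotype(gt, dp) for gt, dp in family]
print(masked.count(MISSING), keep_site(masked))
```

In this toy family, 4 of 16 genotypes end up as ./. (rate 0.25), so the site passes; with 5 missing genotypes the rate would be 0.3125 and the site would be dropped.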

# haploRILs

I then ran [haploRILs](https://github.com/GoliczGenomeLab/haploRILs) on the PED file with different parameters (window size, step, filtering) and selected window size = 10 SNPs, step = 2, context-window size = 3 as the best combination (see [haploRILs GitHub](https://github.com/GoliczGenomeLab/haploRILs) for an explanation), based on the expected number of crossovers per individual (~8) and the resolution. Given how I am defining the target, the exact parameters should not play a big role, because only one crossover can be counted per region and crossovers tend to overlap.
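To illustrate why the exact parameters matter little once crossovers are reduced to one count per region, here is a minimal sketch. It assumes, purely for illustration, that regions are fixed-size bins along a chromosome; the bin size, chromosome length, and crossover positions are made up, and `binary_target` is a hypothetical helper, not part of haploRILs.

```python
def binary_target(crossover_positions, chrom_length, bin_size):
    """Label each fixed-size bin 1 if it contains at least one crossover, else 0."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    labels = [0] * n_bins
    for pos in crossover_positions:
        labels[pos // bin_size] = 1  # several crossovers in a bin still give 1
    return labels


# Hypothetical crossover calls (bp) from two parameter sets on the same chromosome.
calls_a = [120_000, 130_000, 910_000]
calls_b = [125_000, 905_000]

target_a = binary_target(calls_a, chrom_length=1_000_000, bin_size=100_000)
target_b = binary_target(calls_b, chrom_length=1_000_000, bin_size=100_000)
print(target_a == target_b)
```

Both call sets differ in the number and exact position of crossovers, yet they mark the same two bins, so the resulting target is identical.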

![haploRILs-detected crossovers (nSnp = 10, step = 2, K = 3) across 16-way families on chromosome chrA01.](variant_calling/karyoploter_chrA01_10_2_3.pdf){width=1000px height=800px fig-align="center"}

![haploRILs-detected crossovers (nSnp = 20, step = 2, K = 3) across 16-way families on chromosome chrA01. Points indicate variants with height relative to MAF.](variant_calling/karyoploter_chrA01_20_2_3.jpg){width=1000px height=800px fig-align="center"}

![haploRILs-detected crossovers (nSnp = 10, step = 2, K = 3) across 16-way families on chromosome chrA06.](variant_calling/karyoploter_chrA06_10_2_3.pdf){width=1000px height=800px fig-align="center"}

![haploRILs-detected crossovers (nSnp = 20, step = 2, K = 3) across 16-way families on chromosome chrA06.](variant_calling/karyoploter_chrA06_20_2_3.pdf){width=1000px height=800px fig-align="center"}

![haploRILs-detected crossovers (nSnp = 20, step = 2, K = 3) across chromosomes on one individual of the 16-way families. Points indicate variants with height relative to MAF.](variant_calling/karyoploter_RIL1_20_2_3.jpg){width=1000px height=800px fig-align="center"}

# Features and target

See the [preprocessing report](preprocessing/preprocessing.html). Also:

- About feature preparation (I need to update LabArchives):


```bash
# Display only, not executed on render. README files are on host lummerland:
ls /vol/agcpgl/jmontero/features/Features/*/README.md
```


- About the target and its preparation: [LabArchives](https://mynotebook.labarchives.com/share/Notebook%2520Jose%2520Montero/MTQwLjR8OTQ2MDc3LzEwOC0yNTUvVHJlZU5vZGUvMjA1ODUxMTQxNXwzNTYuNA==)

# Machine learning (preliminary)

Performance of the preliminary classifiers and feature importance from the random forest:

![Performance metrics obtained with the random forest classifier](machine_learning/haploRILs_10_2_3.rand_forest.performance.pdf){width=1000px height=800px fig-align="center"}

![Performance metrics obtained with the decision tree classifier](machine_learning/haploRILs_10_2_3.decision_tree.performance.pdf){width=1000px height=800px fig-align="center"}

![Performance metrics obtained with the logistic regression classifier](machine_learning/haploRILs_10_2_3.logistic_reg.performance.pdf){width=1000px height=800px fig-align="center"}

![Feature importance obtained with the random forest classifier](machine_learning/feature_importance.png){width=1000px height=800px fig-align="center"}