---
title: Calculating Br rat haplotype probabilities from HS founders
date: 11/8/2023
author: Sabrina
---

I am using Dan Munro's scripts to compute probabilities across the 8 founders per locus per individual. His code uses the R qtl2 package.

Github: [https://github.com/daniel-munro/qtl2-founder-haps/tree/main](https://github.com/daniel-munro/qtl2-founder-haps/tree/main)

The `genetic_map` subdirectory contains genetic mapping files from the older build rn6. I downloaded rn7 genetic markers from his preprint, [A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats](https://www.biorxiv.org/content/10.1101/2023.04.13.536694v2.article-info), [Supplementary Table S2](https://www.biorxiv.org/content/biorxiv/early/2023/09/29/2023.04.13.536694/DC2/embed/media-2.xlsx?download=true).

I first saved Table S2 as a CSV, then split it by chromosome.

## Write Genetic Mapping Files

```
import pandas as pd

# Read the marker table saved from Supplementary Table S2
df = pd.read_csv('/Users/sabrinami/Desktop/MAP.csv')

# Write one space-delimited, headerless, gzipped map file per chromosome
for name, group in df.groupby('CHR'):
    chr_map = group[['POS', 'CHR', 'cM']]  # renamed to avoid shadowing the built-in `map`
    chr_map.to_csv(f'/Users/sabrinami/Github/qtl2-founder-haps/rn7_genetic_map/MAP4chr{name}.txt.gz', sep=' ', index=False, header=False, compression='gzip')
```
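
Genetic map positions should not decrease as physical position increases along a chromosome, so it can be worth sanity-checking each per-chromosome table before writing it out. The sketch below is one way to do that, assuming the same `POS`, `CHR`, and `cM` columns as above. The helper name `check_map` and the toy values are hypothetical and only for illustration.

```python
import pandas as pd

def check_map(group: pd.DataFrame) -> list[str]:
    """Return a list of problems found in one chromosome's marker table."""
    problems = []
    ordered = group.sort_values("POS")
    if group["POS"].duplicated().any():
        problems.append("duplicate positions")
    # is_monotonic_increasing allows ties, so equal cM values are accepted
    if not ordered["cM"].is_monotonic_increasing:
        problems.append("cM decreases as POS increases")
    if group[["POS", "cM"]].isna().any().any():
        problems.append("missing values")
    return problems

# Toy data standing in for Table S2 (values are made up)
toy = pd.DataFrame({
    "CHR": [1, 1, 1, 2, 2],
    "POS": [100, 200, 300, 150, 250],
    "cM": [0.0, 0.1, 0.3, 0.2, 0.1],
})

# One report entry per chromosome; an empty list means no problems were found
report = {name: check_map(group) for name, group in toy.groupby("CHR")}
print(report)
```

Running the same dictionary comprehension on `df.groupby('CHR')` would flag chromosomes that need attention before the map files are handed to qtl2.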

## Process VCFs

I need to filter out the X and Y chromosomes and make a small formatting change (adding a `chr` prefix to the chromosome names).

**Br Rats**
```
cd ~/Desktop/Sabrina/2022-23/tutorials/enformer_pipeline_test/rn7_data
wget -P genotypes https://ratgtex.org/data/geno/genotypes/Brain.rn7.vcf.gz
gunzip genotypes/Brain.rn7.vcf.gz
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' genotypes/Brain.rn7.vcf > output.vcf
bgzip output.vcf
bcftools index -t output.vcf.gz
bcftools view -i 'CHROM!="chrX" && CHROM!="chrY"' output.vcf.gz -o filtered.Brain.rn7.vcf.gz -Oz
bcftools index -t filtered.Brain.rn7.vcf.gz

```
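
For readers less familiar with `awk` and `bcftools` expressions, the renaming and filtering logic above can be sketched in Python: header lines (starting with `#`) pass through unchanged, every record line gets a `chr` prefix, and records on `chrX` or `chrY` are dropped. The function name `prefix_and_filter` and the toy lines are hypothetical. The shell commands above are what was actually run.

```python
def prefix_and_filter(lines, drop=("chrX", "chrY")):
    """Add 'chr' to record lines and skip records on the chromosomes in `drop`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines are passed through untouched
            continue
        line = "chr" + line
        chrom = line.split("\t", 1)[0]  # first tab-separated field is CHROM
        if chrom not in drop:
            yield line

# Toy VCF lines (made up for illustration)
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t1000\t.\tA\tG\n",
    "X\t2000\t.\tC\tT\n",
]
result = list(prefix_and_filter(toy_vcf))
```

Like the `awk` command, this sketch leaves any `##contig` header lines untouched, so their IDs would keep the unprefixed names.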

**Founder Rats**

Most of the processing is documented [here](https://sabrina-dl.hakyimlab.org/posts/2023-09-12-processing-hs-founder-rat-genotypes/); the only additional step was filtering out the X and Y chromosomes.

```
bcftools view -i 'CHROM!="chrX" && CHROM!="chrY"' Palmer_HS_founders_mRatBN7_biallelic_snps.vcf.gz -o Palmer_HS_founders_mRatBN7_filtered.vcf.gz -Oz
bcftools index -t Palmer_HS_founders_mRatBN7_filtered.vcf.gz

```

## Run Python Code

```
conda activate genomics
cd ~/Github/qtl2-founder-haps
DATA_DIR=~/Desktop/Sabrina/2022-23/tutorials/enformer_pipeline_test/rn7_data
python qtl2-founder-haps.py $DATA_DIR/filtered.Brain.rn7.vcf.gz $DATA_DIR/Palmer_HS_founders_mRatBN7_filtered.vcf.gz probs.rds --gmap-dir rn7_genetic_map

```

The `make_qtl_inputs` function, at line 160 of `qtl2-founder-haps.py`, writes the input files for the R package qtl2, specifically for its `calc_genoprob` function, which computes haplotype probabilities for each individual at each locus. I was able to run `make_qtl_inputs` locally, but the process was killed during the R computations:

```
> library(qtl2); cross <- read_cross2("tmp-qtl2-founder-haps/control.yaml"); pr <- calc_genoprob(cross, error_prob = 0.01, cores = 1); pr <- genoprob_to_alleleprob(pr); saveRDS(pr, "probs.rds")
Error: vector memory exhausted (limit reached?)
Execution halted

```
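
A rough size estimate may help explain this error. The output of `calc_genoprob` holds, for each chromosome, a three-dimensional array of doubles indexed by individual, genotype, and marker. The sketch below assumes 36 genotype states (8 homozygous plus 28 heterozygous founder pairs, which is an assumption about how qtl2 treats an 8-founder cross on autosomes) and uses made-up individual and marker counts, so the numbers are illustrative only. The function name `genoprob_gib` is hypothetical.

```python
def genoprob_gib(n_individuals: int, n_markers: int, n_states: int = 36) -> float:
    """Approximate size in GiB of a dense array of 8-byte doubles."""
    bytes_per_double = 8
    return n_individuals * n_states * n_markers * bytes_per_double / 2**30

# Hypothetical counts, NOT taken from the actual VCFs
n_individuals = 300
n_markers = 100_000

full = genoprob_gib(n_individuals, n_markers)        # assumed 36 genotype states
alleles = genoprob_gib(n_individuals, n_markers, 8)  # 8 founder alleles after genoprob_to_alleleprob
print(f"genotype probs: {full:.1f} GiB, allele probs: {alleles:.1f} GiB")
```

Plugging in the real number of individuals and markers gives a lower bound on the memory needed, since intermediate copies are not counted. If that bound is close to the available RAM, processing one chromosome at a time or thinning markers are options to consider, though neither has been tested here.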

I copied the input files in `~/Github/qtl2-founder-haps/tmp-qtl2-founder-haps` over to Polaris and started an interactive job: `qsub -I -A AIHPC4EDU -l select=1:ncpus=64 -l walltime=2:00:00 -l filesystems=home -q preemptable`

```
cd ~/Github/qtl2-founder-haps/tmp-qtl2-founder-haps
module load conda
conda activate genomics
```
Then in R:

```
library(qtl2)
cross <- read_cross2("control.yaml")
pr <- calc_genoprob(cross, error_prob = 0.01, cores = 64)
pr <- genoprob_to_alleleprob(pr)
saveRDS(pr, "~/Github/qtl2-founder-haps/probs.rds")

```

**Debugging:** I am stuck at the step calling `calc_genoprob`; after 25 minutes of running, the process gets killed.

![R on Polaris Compute Node with 64 CPUs](polaris_calc_genoprob_bug.png)


