---
title: Computing Haplotype Probabilities for 8K HS Rats
author: Sabrina Mi
date: 7/31/2025
---
# Preparing 8K rat genotypes with body length and BMI data

## Write samples file

First, we obtain a list of samples with body length and BMI data, then intersect it with the genotyped rats.

In [None]:
import pandas as pd
all_pheno = pd.read_csv("/home/s1mi/enformer_rat_data/phenotypes/ALLTRAITSALLNORMALIZES_19jul24.csv", 
                        usecols = ['rfid', 'dissection:regressedlr_length_w_tail_cm', 'dissection:regressedlr_bmi_w_tail'],
                        index_col = 'rfid')
all_pheno.columns = ['bodylen', 'bmi']
samples = list(all_pheno[all_pheno['bodylen'].notna()].index)
# with open("bodylen_bmisamples.txt", "w") as f:
#     f.write("\n".join(samples))


I downloaded the [bed, bim, fam files](https://library.ucsd.edu/dc/object/bb5610743d), then made basic modifications to the files.
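
The intersection with the genotyped rats can also be checked in Python before calling `plink2`. This is a minimal sketch, assuming the `.fam` file has the standard six whitespace-separated PLINK columns with the sample ID in the second (IID) column. The helper name `intersect_with_fam` and the toy IDs are hypothetical.

```python
import io
import pandas as pd

def intersect_with_fam(pheno_ids, fam_source):
    """Return the phenotype IDs that also appear in a PLINK .fam file, in .fam order."""
    fam = pd.read_csv(
        fam_source, sep=r"\s+", header=None, dtype=str,
        names=["fid", "iid", "father", "mother", "sex", "pheno"],
    )
    wanted = set(map(str, pheno_ids))
    return [iid for iid in fam["iid"] if iid in wanted]

# Toy stand-in for the real .fam file (IDs are made up for illustration)
toy_fam = io.StringIO("A1 A1 0 0 1 -9\nB2 B2 0 0 2 -9\nC3 C3 0 0 1 -9\n")
overlap = intersect_with_fam(["C3", "A1", "Z9"], toy_fam)
print(overlap)  # ['A1', 'C3']
```

For the real data, the path to the downloaded `.fam` file would replace `toy_fam`, and `samples` from the cell above would replace the toy ID list.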

## Filter Samples

Filter to the 8K rats with phenotype data using `plink2`.

```
plink2 --bfile bb5610743d --keep samples.txt --export vcf bgz --out bodylen_bmi_8K_samples
```

`plink2` automatically generates sample IDs of the form `FID_IID`, so we first need to strip the duplicated ID from the VCF header.

```
bcftools query -l bodylen_bmi_8K_samples.vcf.gz > samples_FID_IID.txt
awk -F'_' '{print $0 "\t" $1}' samples_FID_IID.txt > rename_samples.txt
bcftools reheader -s rename_samples.txt -o renamed.vcf.gz bodylen_bmi_8K_samples.vcf.gz
```
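
The `awk` step keeps everything before the first underscore, which is only safe when each VCF sample name is an ID repeated twice (`ID_ID`). The sketch below builds the same rename table in Python and raises an error on names that do not follow that pattern. The function name `rename_table` and the example IDs are hypothetical.

```python
def rename_table(vcf_sample_ids, delim="_"):
    """Build 'old<TAB>new' lines, where new is the text before the first delimiter.

    Raises ValueError if a name is not of the form ID + delim + ID, since the
    simple split would then silently truncate the ID.
    """
    rows = []
    for old in vcf_sample_ids:
        new = old.split(delim)[0]
        if old != f"{new}{delim}{new}":
            raise ValueError(f"unexpected sample name: {old!r}")
        rows.append(f"{old}\t{new}")
    return rows

rows = rename_table(["00078A_00078A", "rat42_rat42"])
print("\n".join(rows))
```

The resulting lines have the same two-column layout as `rename_samples.txt` above.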

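Next, we add a `chr` prefix to the chromosome names and variant IDs, then compress and index the result.

```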
bcftools view renamed.vcf.gz | \
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$1="chr"$1; $3="chr"$3; print}' | \
bgzip > bodylen_bmi.vcf.gz
bcftools index -t bodylen_bmi.vcf.gz
rm renamed.vcf.gz
```
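
To make the `awk` transformation concrete, here is the same per-line logic as a small Python function. Like the `awk` command, it leaves header lines (including any `##contig` lines) unchanged. The variant ID format `1:12345` in the example is an assumption for illustration, and `add_chr_prefix` is a hypothetical name.

```python
def add_chr_prefix(vcf_line):
    """Prefix CHROM (column 1) and ID (column 3) with 'chr'; pass header lines through."""
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    fields[0] = "chr" + fields[0]
    fields[2] = "chr" + fields[2]
    return "\t".join(fields) + newline

record = "1\t12345\t1:12345\tA\tG\t.\t.\t.\tGT\t0/1\n"
print(add_chr_prefix(record), end="")
print(add_chr_prefix("##contig=<ID=1>\n"), end="")
```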

Some of the samples with phenotype data are not in the genotype file, so `plink2` keeps only the overlapping samples. We write out the list of samples that remain.

```
bcftools query -l bodylen_bmi.vcf.gz > /home/s1mi/Github/deep-learning-in-genomics/posts/2025-07-22-bmi-bodylen-rats-haplotype-probabilities/samples.txt
```

## Split VCF by Chromosome

```
mkdir -p ~/enformer_rat_data/genotypes/bodylen_bmi_VCFs
# Split VCF by chromosome
vcf_in=bodylen_bmi.vcf.gz
vcf_out_prefix=~/enformer_rat_data/genotypes/bodylen_bmi_VCFs/chr

for i in {1..20}
do
    echo "Working on chromosome ${i}..."
    bcftools view ${vcf_in} --regions chr${i} -o ${vcf_out_prefix}${i}.vcf.gz -Oz
done
```

## Index VCFs

```
for i in {1..20}
do
    echo "Indexing chromosome ${i}..."
    bcftools index -t ${vcf_out_prefix}${i}.vcf.gz
done
```

# Compute Haplotype Probabilities

Because the genotype files are so large, we split the samples into batches of 20 individuals before running qtl2.

In [8]:
import os
with open("samples.txt", "r") as f:
    samples = f.read().splitlines()
batch_size = 20
# Floor division: samples left over after the last full batch are not written to batches.txt
n_batches = len(samples)//batch_size
data_dir = '/eagle/AIHPC4Edu/sabrina/scratch/bodylen_bmi_VCFs_for_qtl2'
with open(f'{data_dir}/batches.txt', "w") as f:
    for i in range(n_batches):
        batch_str = ",".join(samples[batch_size*i:batch_size*(i+1)])
        f.write(f'{batch_str}\t-\tbatch{i}\n')
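
Because the cell above uses floor division, any samples beyond the last full batch are left out of `batches.txt`. If those samples should be kept, one option is to round the batch count up. This is a minimal sketch with hypothetical helper names (`make_batches`, `batch_lines`) and toy sample IDs, not the code that produced the results in this post.

```python
import math

def make_batches(samples, batch_size):
    """Split samples into consecutive batches; the last batch may be smaller."""
    n_batches = math.ceil(len(samples) / batch_size)
    return [samples[batch_size * i: batch_size * (i + 1)] for i in range(n_batches)]

def batch_lines(samples, batch_size):
    """Format one line per batch: comma-separated samples, '-', batch name."""
    return [
        f'{",".join(batch)}\t-\tbatch{i}'
        for i, batch in enumerate(make_batches(samples, batch_size))
    ]

toy_samples = [f"s{k}" for k in range(7)]
lines = batch_lines(toy_samples, 3)
print("\n".join(lines))
```

With 7 toy samples and a batch size of 3, this yields three batches, the last containing the single leftover sample.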
        

## Split VCF by batch

```
module use /soft/modulefiles; module load conda; conda activate genomics
output_dir="/eagle/AIHPC4Edu/sabrina/scratch/bodylen_bmi_VCFs_for_qtl2"
for i in {1..20}; do mkdir -p ${output_dir}/chr${i}_by_batch; done
log_file="split_VCF_by_batch.log"
for i in {1..20}
do
    start_time=$(date +%s)
    echo "Working on chromosome ${i}..." >> "$log_file"
    vcf_in=~/enformer_rat_data/genotypes/bodylen_bmi_VCFs/chr${i}.vcf.gz

    bcftools +split ${vcf_in} -S ${output_dir}/batches.txt --output-type z --output ${output_dir}/chr${i}_by_batch

    end_time=$(date +%s)
    duration=$((end_time - start_time))
    echo "Completed chromosome ${i} in ${duration}s" >> "$log_file"
done
```

## Run qtl2

This is example code for batch 0, which I ran in an interactive job. The loop below covers chromosomes 16 to 20; change the range to run the other chromosomes.

```
j=0

module use /soft/modulefiles; module load conda
data_dir=/home/s1mi/qtl2_data
scratch=/eagle/AIHPC4Edu/sabrina/scratch
samples_dir=${scratch}/bodylen_bmi_VCFs_for_qtl2
founders_dir=/home/s1mi/enformer_rat_data/genotypes/FounderVCFs
output_dir=${scratch}/qtl2_outputs
code_dir=/home/s1mi/Github/deep-learning-in-genomics/posts/2025-07-22-bmi-bodylen-rats-haplotype-probabilities

for i in {16..20}
do
    echo "Working on chromosome ${i}..."
    start_time=$(date +%s)
    conda run -n ml-python python ${code_dir}/make_qtl2_inputs.py ${samples_dir}/chr${i}_by_batch/batch${j}.vcf.gz ${founders_dir}/chr${i}.vcf.gz ${output_dir}/chr${i}_by_batch/batch${j}_probs.rds --working-dir ${scratch}/chr${i}_qtl2_inputs/batch${j} --gmap-dir ${data_dir}/genetic_map --cores 32

    end_time=$(date +%s)
    duration=$((end_time - start_time))
    echo "Prepared chromosome ${i} inputs for qtl2 in ${duration}s"

    cd ${scratch}/chr${i}_qtl2_founder_haps
    conda run -n genomics Rscript ${scratch}/chr${i}_qtl2_founder_haps/qtl2_calculate_prob.R

    end_time=$(date +%s)
    duration=$((end_time - start_time))
    echo "Computed chromosome ${i} haplotype probabilities in ${duration}s"
done

```

I ran this sequentially in two interactive jobs, 2 hours total. Since writing the input files takes much less time, I put that step in its own PBS script, `prepare_qtl2_inputs.pbs`, which runs the Python code in parallel by chromosome and batch. Then `run_qtl2.pbs` executes the R code in parallel.

First, create a file listing every (chromosome, batch) pair.

```
job_list=${PBS_O_WORKDIR}/job_list.txt
for i in {1..20}; do
  for j in {1..434}; do
    echo $i $j >> $job_list
  done
done
```
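
The same job list can be built in Python, along with a helper showing how a parallel task might look up its (chromosome, batch) pair by line index. The PBS scripts themselves are not shown in this post, so the lookup is only an assumption about how such a list is typically consumed. `build_job_list` and `parse_job` are hypothetical names.

```python
from itertools import product

def build_job_list(chromosomes, batches):
    """One 'chromosome batch' line per pair, chromosome-major like the shell loop."""
    return [f"{i} {j}" for i, j in product(chromosomes, batches)]

def parse_job(line):
    """Turn a 'chromosome batch' line back into a pair of integers."""
    chrom, batch = line.split()
    return int(chrom), int(batch)

jobs = build_job_list(range(1, 21), range(1, 435))
print(len(jobs))            # 8680
print(parse_job(jobs[0]))   # (1, 1)
print(parse_job(jobs[434])) # (2, 1)
```

Unlike the shell loop, which appends with `>>`, this builds the full list in memory, so rerunning it does not duplicate entries.
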
`qsub prepare_qtl2_inputs.pbs`


`qsub -I -A AIHPC4EDU -l walltime=1:00:00 -l filesystems=home:eagle -q debug`
