---
title: Splitting chr20 Brain rats into batches for qtl2 analysis
author: Sabrina Mi
date: 11/19/2023
---

We'll split the 340 Brain (Br) rats into batches of 10. The goal is to do this without creating a new VCF for each batch. The following code subsets the chr20 input files, specifically the geno.csv and covar.csv files, which were generated with [qtl2-founder-haps code](https://sabrina-dl.hakyimlab.org/posts/2023-11-14-calculating-genotype-probabilities-by-chromosome/). To set up, I copy over the other input files, which stay the same for every batch.

```
cd /Users/sabrinami/Desktop/qtl2_data
mkdir tmp-chr20-qtl2-founder-haps
cp chr20-qtl2-founder-haps/control.yaml tmp-chr20-qtl2-founder-haps/control.yaml
cp chr20-qtl2-founder-haps/founder_geno.csv tmp-chr20-qtl2-founder-haps/founder_geno.csv
cp chr20-qtl2-founder-haps/gmap.csv tmp-chr20-qtl2-founder-haps/gmap.csv
cp chr20-qtl2-founder-haps/pmap.csv tmp-chr20-qtl2-founder-haps/pmap.csv
```
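
To make the batching arithmetic concrete, here is a minimal sketch on a toy table. It assumes samples are stored as columns, which is how the batching loop in this post treats geno.csv. The helper name `batch_columns`, the toy marker IDs, and the toy sample names are hypothetical.

```python
import pandas as pd

def batch_columns(df, batch_size):
    """Yield (batch_index, sub-DataFrame) pairs of non-overlapping column blocks."""
    for start in range(0, df.shape[1], batch_size):
        yield start // batch_size, df.iloc[:, start:start + batch_size]

# Toy genotype table: 2 markers (rows) x 5 samples (columns); all names are made up
toy_geno = pd.DataFrame(
    [["A", "B", "A", "H", "B"], ["B", "B", "A", "A", "H"]],
    index=pd.Index(["m1", "m2"], name="id"),
    columns=["rat1", "rat2", "rat3", "rat4", "rat5"],
)

# Map each batch index to the sample names it contains
batches = {i: sub.columns.to_list() for i, sub in batch_columns(toy_geno, batch_size=2)}
print(batches)
```

Each sample lands in exactly one batch, and the last batch is simply shorter when the number of samples is not a multiple of the batch size.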

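```python
import math
import os
import subprocess
import time

import pandas as pd

data_dir = '/Users/sabrinami/Desktop/qtl2_data/chr20-qtl2-founder-haps'
tmp_dir = '/Users/sabrinami/Desktop/qtl2_data/tmp-chr20-qtl2-founder-haps'
output_dir = '/Users/sabrinami/Desktop/qtl2_data/chr20-qtl2-outputs'
os.makedirs(output_dir, exist_ok=True)  # saveRDS fails if the output directory is missing
# Samples are columns; the 'id' column becomes the index
geno = pd.read_csv(f'{data_dir}/geno.csv', index_col='id')
batch_size = 10
# Round up so a final partial batch is not dropped
n_batches = math.ceil(len(geno.columns) / batch_size)
```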

```python
def qtl_command(output_file, n_cores=1):
    """Build the R code that computes allele probabilities from the inputs in tmp_dir."""
    cmd = (
        'library(qtl2); '
        f'cross <- read_cross2("{tmp_dir}/control.yaml"); '
        f'pr <- calc_genoprob(cross, error_prob = 0.01, cores = {n_cores}); '
        f'pr <- genoprob_to_alleleprob(pr); saveRDS(pr, "{output_file}")'
    )
    return cmd
```
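
The R code built by `qtl_command` contains double quotes, so passing it through a shell means wrapping it in single quotes. One alternative, sketched here as an assumption rather than what this post does, is to hand the expression to `Rscript -e` as an argument list, which sidesteps shell quoting entirely. The helper name `r_argv` and the example paths are hypothetical.

```python
import subprocess

def r_argv(r_expr):
    """Build an argument list that runs an R expression without going through a shell."""
    return ["Rscript", "-e", r_expr]

# Hypothetical expression with embedded double quotes, similar in shape to qtl_command's output
example_expr = 'x <- readRDS("/path/to/input.rds"); saveRDS(x, "/path/to/output.rds")'
argv = r_argv(example_expr)
print(argv)

# To actually run it (requires R on PATH), something like:
# subprocess.run(argv, check=True)
```

Because the expression travels as a single list element, quotes inside it need no escaping.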

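```python
for i in range(n_batches):
    tic = time.perf_counter()
    # Select the samples (columns) for batch i; slicing from i alone would give overlapping windows
    start = i * batch_size
    geno_df = geno.iloc[:, start:start + batch_size]
    samples = geno_df.columns.to_list()
    covar_df = pd.DataFrame({'id': samples, 'generations': [90] * len(samples)})
    # Overwrite the batch-specific inputs in tmp_dir; the copied files stay untouched
    geno_df.to_csv(f'{tmp_dir}/geno.csv', index=True)
    covar_df.to_csv(f'{tmp_dir}/covar.csv', index=False)
    cmd = qtl_command(f'{output_dir}/batch{i}_prob.rds', n_cores=2)
    # check=True stops the loop if R exits with an error
    subprocess.run(f"R -e '{cmd}'", shell=True, check=True)
    toc = time.perf_counter()
    print("Batch:", i+1, "...", (toc-tic)/60, "minutes")
```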


```
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Platform: x86_64-apple-darwin17.0 (64-bit)

> library(qtl2); cross <- read_cross2("/Users/sabrinami/Desktop/qtl2_data/tmp-chr20-qtl2-founder-haps/control.yaml"); pr <- calc_genoprob(cross, error_prob = 0.01, cores = 2); pr <- genoprob_to_alleleprob(pr); saveRDS(pr, "/Users/sabrinami/Desktop/qtl2_data/chr20-qtl2-outputs/batch0_prob.rds")
```
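
Since each batch writes its own `batch{i}_prob.rds` file, a quick check after the loop can confirm that no batch is missing. This is a minimal sketch: the helper name `missing_batches` is hypothetical, and the demo uses a temporary directory with placeholder files rather than the real output directory.

```python
import os
import tempfile

def missing_batches(output_dir, n_batches):
    """Return the batch indices whose batch{i}_prob.rds file is absent from output_dir."""
    return [
        i for i in range(n_batches)
        if not os.path.exists(os.path.join(output_dir, f"batch{i}_prob.rds"))
    ]

# Demo in a temporary directory: create empty placeholder files for batches 0 and 2 only
with tempfile.TemporaryDirectory() as demo_dir:
    for i in (0, 2):
        open(os.path.join(demo_dir, f"batch{i}_prob.rds"), "w").close()
    missing = missing_batches(demo_dir, n_batches=4)

print(missing)
```

With the real run, the same function could be called on the output directory with the number of batches used in the loop; an empty list would mean every batch produced a file.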
