CALCROH - Estimating inbreeding using ROH summary statistics

CALCROH

A workflow for estimating the degree of consanguinity from the numbers and genetic lengths of runs of homozygosity (ROH).

Overview

Mating between close genetic relatives produces consanguineous (i.e., inbred) offspring, whose genomes are characterized by long ROH inherited from the same recent ancestor and by an increase in rare, recessive (and sometimes deleterious) variants. This package accompanies the Wall et al. (2022, Nature Communications) paper describing South Asian genomes. It provides a framework for estimating the degree of consanguinity of a sample from the distribution of the genetic lengths of its ROH. The input is the genetic lengths of all ROH within a sample, which can be obtained using a wide variety of existing methods. We summarize the input using two summary statistics and estimate the likelihood of obtaining these summaries from simulations under different known levels of consanguinity. The two summary statistics are

  1. The number of ROH ≥ 10 cM in length (called here N)
  2. The sum of the genetic lengths of the 10 longest ROH (called here S)

but it is straightforward to substitute other statistics. This package includes “Calcroh”, a program for simulating the lengths of ROH caused by recent inbreeding, and an example workflow, “In1”, that shows how to calculate the (summary) likelihood of a particular example data set.
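The two summary statistics can be illustrated with a short Python sketch. This is not code from the package: the function name `summarize_roh`, its parameters, and the example lengths are hypothetical and chosen only for illustration.

```python
def summarize_roh(lengths_cm, min_len_cm=10.0, top_k=10):
    """Return (N, S) for one sample's ROH genetic lengths, given in cM.

    N = number of ROH with length >= min_len_cm
    S = sum of the top_k longest ROH (all of them if fewer than top_k exist)
    """
    n_long = sum(1 for x in lengths_cm if x >= min_len_cm)
    s_top = sum(sorted(lengths_cm, reverse=True)[:top_k])
    return n_long, s_top


# Made-up ROH lengths (cM), for illustration only.
example_lengths = [42.5, 31.0, 18.2, 12.0, 9.9, 7.5, 5.0, 4.1, 3.3, 2.2, 1.0, 0.5]
n_stat, s_stat = summarize_roh(example_lengths)
print(n_stat, round(s_stat, 1))  # 4 ROH are >= 10 cM; the 10 longest sum to 135.7 cM
```

Swapping in other statistics would amount to changing the threshold, the number of ROH summed, or the returned values.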

Compilation

Programs are written in C. They can be compiled using the following commands:

cc -O -o Calcroh Calcroh.c rand1new.c -lm
cc -O -o filtsum2b filtsum2b.c

In addition, the random number generator uses a seed file “seedms” consisting of 3 short ints.

Simulating ROH lengths

Our approach for simulating the genetic lengths of ROH follows that of Clark (1999, AJHG 65:1489-1492), which assumes that the genetic lengths of chromosomal segments inherited from particular paternal and maternal ancestors follow an exponential distribution with mean equal to 100 cM divided by the total number of generations in the path from the proband back to the particular ancestors.

For example, a proband whose parents are 1st cousins has four paternal great grandparents, who together carried eight copies of each autosome three generations ago, and likewise eight copies on the maternal side. At any specific autosomal location in the proband, there are 8 x 8 = 64 possible inheritance patterns, of which four result in autozygosity. We model the proband’s autosomes as a series of blocks of ancestry, each with genetic length exponentially distributed with mean 100 / 6 cM. Any individual ancestry block has a 4 / 64 = 6.25% chance of being autozygous, with the additional constraint that neighboring blocks cannot come from the same pair of ancestral chromosomes.

Calcroh simulates ancestral blocks sequentially from one end of a chromosome to the other. If a given block is autozygous, the probability that the next one will also be autozygous is 3 / 63 = 4.76%; if a block is not, the probability that the next one will be is 4 / 63 = 6.35%. Note that when multiple contiguous ancestry blocks are autozygous, the ROH length is the sum of the genetic lengths of the contiguous autozygous blocks.

Genetic map lengths for the autosomes follow the GRCh38 PLINK-format genetic map generated by Xiaowen Tian (tianx3@uw.edu) and available from the BEAGLE website. This map was lifted over from a genetic map generated by Adam Auton based on patterns of LD and using the GRCh37 human genome assembly.

Calcroh requires two command line arguments: g, the number of generations in the inbreeding loop, and f, the inbreeding coefficient. For the example proband whose parents are 1st cousins, g = 6 and f = 0.0625.
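The block-by-block scheme described above can be sketched in Python for the 1st-cousin case. This is an illustrative reimplementation under stated assumptions, not the Calcroh source: the function name `simulate_chromosome`, the parameters `n_ibd` and `n_pairs` (4 autozygous patterns out of 64), the use of the unconditional probability 4 / 64 for the first block, the truncation of the last block at the chromosome end, and the 200 cM chromosome length are all assumptions made here.

```python
import random


def simulate_chromosome(length_cm, g=6, n_ibd=4, n_pairs=64, rng=random):
    """Return the ROH lengths (cM) simulated on one chromosome.

    Ancestry blocks have exponential lengths with mean 100 / g cM.
    A block after an autozygous block is autozygous with probability
    (n_ibd - 1) / (n_pairs - 1); after a non-autozygous block the
    probability is n_ibd / (n_pairs - 1).
    """
    mean_block = 100.0 / g
    roh = []
    pos = 0.0
    current = 0.0      # length of the ROH currently being accumulated
    previous = None    # state of the previous block; None at the chromosome start
    while pos < length_cm:
        if previous is None:
            p = n_ibd / n_pairs                 # assumed start: 4 / 64
        elif previous:
            p = (n_ibd - 1) / (n_pairs - 1)     # 3 / 63
        else:
            p = n_ibd / (n_pairs - 1)           # 4 / 63
        autozygous = rng.random() < p
        # Truncate the final block at the end of the chromosome.
        block = min(rng.expovariate(1.0 / mean_block), length_cm - pos)
        pos += block
        if autozygous:
            current += block                    # contiguous autozygous blocks merge
        elif current > 0.0:
            roh.append(current)
            current = 0.0
        previous = autozygous
    if current > 0.0:
        roh.append(current)
    return roh


rng = random.Random(1)
roh_lengths = simulate_chromosome(200.0, rng=rng)  # 200 cM is a placeholder length
print(roh_lengths)
```

A whole-genome simulation would repeat this for each autosome using the map lengths from the genetic map and pool the resulting ROH lengths.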

Calculating likelihoods

Given a summarized data set D = {N, S}, we repeatedly simulate ROH lengths under a specific inbreeding model and count how many simulations have a number of ROH ≥ 10 cM equal to N and a sum of the 10 longest ROH within 1% of S. The summary likelihood estimate is the number of simulations meeting these criteria divided by the total number of simulations run.

As an example, suppose we have an individual with D = {7, 150}. The script “example_workflow” shows how to simulate the ROH length distribution for 10,000 individuals whose parents are 1st cousins, calculate D for each simulation (output to the file “example_likelihoods”), and tabulate how many simulated D are approximately the same as {7, 150}. The estimated likelihood is then the output of “example_workflow” divided by the number of simulated individuals (i.e., 10,000).

A couple of additional points: It is straightforward to consider summary statistics other than the ones implemented here by modifying the workflow. In practice, if multiple individuals are being analyzed, it is more efficient to simulate likelihood distributions first and to reuse them when estimating likelihoods for all individuals. Finally, we recommend running at least 10^6 simulations for each potential degree of inbreeding.
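The tallying step can be sketched in Python as follows. This is not the packaged workflow: the function name `summary_likelihood`, the reading of "within 1% of S" as a relative tolerance on the observed S with an inclusive bound, and the five simulated (N, S) pairs are assumptions made for illustration.

```python
def summary_likelihood(observed, simulated, tol=0.01):
    """Estimate the summary likelihood of observed = (N, S).

    simulated is a sequence of (N, S) pairs, one per simulated individual.
    A simulation counts as a match when its N equals the observed N and
    its S lies within tol * S_observed of the observed S.
    """
    n_obs, s_obs = observed
    hits = sum(
        1
        for n_sim, s_sim in simulated
        if n_sim == n_obs and abs(s_sim - s_obs) <= tol * s_obs
    )
    return hits / len(simulated)


# Made-up simulated summaries; a real run would use many more simulations.
sims = [(7, 150.5), (7, 149.0), (7, 160.0), (6, 150.0), (8, 151.0)]
likelihood = summary_likelihood((7, 150.0), sims)
print(likelihood)  # 2 of the 5 pairs match, so 0.4
```

Because the simulated (N, S) pairs do not depend on the observed data, the same list can be reused for every individual being analyzed under a given inbreeding model.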

Contact information

Please direct any questions or comments to Jeff Wall (jeffwall.genetics@gmail.com).
