Skip to content

Commit

Permalink
Added LeveneHaldane with tex doc, HWEPerVariant, and Utils.D_*
Browse files Browse the repository at this point in the history
Added stats.LeveneHaldane and tests

Created doc folder, LeveneHaldane.tex, and bibfile.bib
Added docs/.gitignore

Added HWEPerVariant, test/resources/HWE_test.vcf and tests

Added Utils.time, now replaced by Utils.printTime

Used Option in "r*" (ratio) methods for missing values, now abstracted with Utils.divOption and Utils.someIf

Added rounding-error-tolerant comparison operators Utils.D_* and used where appropriate

replaced closeEnough with D_==
  • Loading branch information
alexb-3 authored and cseed committed Nov 12, 2015
1 parent 8f7c9d3 commit cc0064e
Show file tree
Hide file tree
Showing 12 changed files with 647 additions and 71 deletions.
6 changes: 6 additions & 0 deletions docs/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
*.aux
*.bbl
*.blg
*.log
*.pdf
*.synctex.gz
127 changes: 127 additions & 0 deletions docs/LeveneHaldane.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
\documentclass[letterpaper, 10pt]{article}
\usepackage[letterpaper,margin=1.25 in]{geometry}
\pdfoutput=1
\usepackage{amsmath, amsthm, amssymb}
\usepackage{graphicx}
\usepackage{epstopdf}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{natbib}
\bibpunct[ ]{(}{)}{,}{a}{}{,}

%\usepackage[color]{showkeys}

%\usepackage{setspace}
%\onehalfspacing
%\setstretch{1.2}

\usepackage{enumerate}

\allowdisplaybreaks[3]

\newcommand{\e}{\varepsilon}
\newcommand{\ph}{\varphi}
\newcommand{\kap}{\kappa}
\newcommand{\del}{\partial}
\newcommand{\nequiv}{\not\equiv}
\newcommand{\N}{\mathbb N}
\newcommand{\Z}{\mathbb Z}
\newcommand{\Q}{\mathbb Q}
\newcommand{\R}{\mathbb R}
\newcommand{\C}{\mathbb C}
\newcommand{\Cc}{\mathcal C}
\newcommand{\F}{\mathcal F}
\newcommand{\Hc}{\mathcal H}
\newcommand{\Dc}{\mathcal D}
\newcommand{\Lc}{\mathcal L}
\newcommand{\Pc}{\mathcal P}
\newcommand{\io}{\ \text{i.o.}}
\newcommand{\as}{\ \text{a.s.}}
\newcommand{\half}{\tfrac{1}{2}}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\Abs}[1]{\bigl\lvert#1\bigr\rvert}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\Norm}[1]{\bigl\lVert#1\bigr\rVert}
\newcommand{\ip}[1]{\left\langle#1\right\rangle}
\newcommand{\Ip}[1]{\bigl\langle#1\bigr\rangle}
\DeclareMathOperator{\supp}{supp}
\DeclareMathOperator{\diag}{diag}
\newcommand{\loc}{\mathrm{loc}}
\DeclareMathOperator{\vol}{vol}
\DeclareMathOperator{\Spec}{Spec}
\renewcommand{\P}{\operatorname{\mathbf{P}}}
\newcommand{\1}{\mathbf{1}}
\DeclareMathOperator{\E}{\mathbf E}
\DeclareMathOperator{\Var}{\mathbf{Var}}
\renewcommand{\Re}{\operatorname{Re}}
\renewcommand{\Im}{\operatorname{Im}}

\newcommand{\A}{\mathrm A}
\newcommand{\B}{\mathrm B}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
\newtheorem{conjecture}{Conjecture}
\newtheorem{fact}{Fact}
\newtheorem*{st}{Spectral theorem}

\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem*{problem}{Problem}
\theoremstyle{remark}
\newtheorem*{solution}{Solution}


\title{The Levene-Haldane Distribution}
\author{Alex Bloemendal}
\date{September 30, 2015}

\begin{document}
\maketitle
\thispagestyle{empty}

Consider a sample of $n$ diploid individuals at a biallelic autosomal locus. The sample includes $2n$ alleles, $n_\A$ copies of the rarer allele A and $n_\B$ copies of the common allele B. Under the assumption of Hardy-Weinberg equilibrium the alleles are allocated to the individuals uniformly at random; what is the resulting distribution on genotypes, i.e.\ on the numbers of individuals $n_{\A\A}, n_{\A\B}, n_{\B\B}$ carrying $\A\A, \A\B, \B\B$ respectively? Note that the number of heterozygotes $n_{\A\B}$ determines the numbers of homozygotes $n_{\A\A}, n_{\B\B}$ by the relations $2n_{\A\A} + n_{\A\B} = n_\A$, $2n_{\B\B} + n_{\A\B} = n_\B$.

There are
\[
\begin{pmatrix}
2n\\
n_{\A}
\end{pmatrix}
\;=\; \frac{(2n)!}{n_{\A}! n_{\B}!}
\]
possible arrangements for the alleles in the sample, of which
\[
2^{n_{\A\B}}
\begin{pmatrix}
& n &\\
n_{\A\A} & n_{\A\B} & n_{\B\B}
\end{pmatrix} \;=\;
\frac{2^{n_{\A\B}} n!}{n_{\A\A}! n_{\A\B}! n_{\B\B}!}
\]
yield exactly $n_{\A\B}$ heterozygotes. The resulting conditional distribution
\[
p(n_{\A\B} \mid n, n_\A) \; = \; \frac{2^{n_{\A\B}} n!}{n_{\A\A}! n_{\A\B}! n_{\B\B}!}\cdot\frac{n_{\A}! n_{\B}!}{(2n)!}
\]
is known as the Levene-Haldane distribution \citep{Weir, HWE_Wigginton, HWE_Graffelman}.

The distribution is supported on those $n_{\A\B}$ such that $0\le n_{\A\B}\le n_\A$ and $n_\A - n_{\A\B}$ is even. It is unimodal, with mode equal to the integer of the correct parity nearest
\[
\frac{(n_\A + 1)(n_\B + 1)}{2n + 3}.
\]
One can show the latter by considering the ratio of probabilities at adjacent values, which in particular is monotonic. It also follows that the tails decay faster than geometrically. The mean and variance are
\[
\frac{n_\A n_\B}{2n - 1} \qquad \text{and}\qquad \frac{n_\A n_\B}{2n - 1}\left(1 + \frac{(n_\A - 1)(n_\B - 1)}{2n - 3} - \frac{n_\A n_\B}{2n - 1}\right)
\]
respectively \citep{HWE_Okamoto}.

The implementation is based on \cite{HWE_Wigginton}. The mid-$p$-value correction for the left, right and two-sided exact tests is proposed in \cite{HWE_Graffelman}. \cite{HWE_Rohlfs} study the behavior of these discrete statistics under null and alternative hypotheses. The extension to multiallelic loci is computationally expensive; \cite{HWE_Engels} describes several proposed Monte-Carlo methods. Finally, \cite{HWE_Wakefield} argues for a Bayesian approach, reviewing and extending existing work in this direction.

\bigskip
\bibliographystyle{dcu}
\bibliography{bibfile}

\end{document}

77 changes: 77 additions & 0 deletions docs/bibfile.bib
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
%% This BibTeX bibliography file was created using BibDesk.
%% http://bibdesk.sourceforge.net/
%% Created for Alex Bloemendal at 2015-10-06 12:35:19 -0400
%% Saved with string encoding Unicode (UTF-8)
@book{Weir,
Author = {Weir, Bruce S},
Date-Added = {2015-10-06 16:30:46 +0000},
Date-Modified = {2015-10-06 16:35:09 +0000},
Publisher = {Sinauer Associates},
Title = {Genetic Data Analysis II},
Year = {1996}}

@article{HWE_Wakefield,
Author = {Wakefield, Jon},
Date-Added = {2015-09-30 19:29:13 +0000},
Date-Modified = {2015-09-30 19:29:33 +0000},
Journal = {Biometrics},
Pages = {257--265},
Title = {Bayesian methods for examining {H}ardy--{W}einberg equilibrium},
Volume = {66},
Year = {2010}}

@article{HWE_Okamoto,
Author = {Okamoto, Masashi and Ishii, Goro},
Date-Added = {2015-09-30 19:26:32 +0000},
Date-Modified = {2015-09-30 19:27:12 +0000},
Journal = {Biometrika},
Pages = {181--190},
Title = {Test of independence in intraclass 2 x 2 tables},
Volume = {48},
Year = {1961}}

@article{HWE_Graffelman,
Author = {Graffelman, Jan and Moreno, Victor},
Date-Added = {2015-09-30 19:24:11 +0000},
Date-Modified = {2015-09-30 19:24:32 +0000},
Journal = {Statistical Applications in Genetics and Molecular Biology},
Pages = {433--448},
Title = {The mid p-value in exact tests for {H}ardy--{W}einberg equilibrium},
Volume = {12},
Year = {2013}}

@article{HWE_Engels,
Author = {Engels, William R},
Date-Added = {2015-09-30 19:22:30 +0000},
Date-Modified = {2015-09-30 19:23:00 +0000},
Journal = {Genetics},
Pages = {1431--1441},
Title = {Exact tests for {H}ardy--{W}einberg proportions},
Volume = {183},
Year = {2009}}

@article{HWE_Wigginton,
Author = {Wigginton, Janis E and Cutler, David J and Abecasis, Gon{\c{c}}alo R},
Date-Added = {2015-09-30 19:18:58 +0000},
Date-Modified = {2015-09-30 19:19:46 +0000},
Journal = {The American Journal of Human Genetics},
Pages = {887--893},
Title = {A note on exact tests of {H}ardy--{W}einberg equilibrium},
Volume = {76},
Year = {2005}}

@article{HWE_Rohlfs,
Author = {Rohlfs, Rori V and Weir, Bruce S},
Date-Added = {2015-09-30 19:17:12 +0000},
Date-Modified = {2015-09-30 19:18:20 +0000},
Journal = {Genetics},
Pages = {1609--1616},
Title = {Distributions of {H}ardy--{W}einberg equilibrium test statistics},
Volume = {180},
Year = {2008}}
30 changes: 25 additions & 5 deletions src/main/scala/org/broadinstitute/hail/Utils.scala
Original file line number Diff line number Diff line change
Expand Up @@ -411,11 +411,6 @@ object Utils {
if (!p) throw new AssertionError
}

// FIXME This should be replaced by AB's version that assesses relative difference as well
def closeEnough(a: Double, b: Double, cutoff: Double = 0.0001) = {
math.abs(a - b) < cutoff
}

// FIXME Would be nice to have a version that averages three runs, perhaps even discarding an initial run. In this case the code block had better be functional!
def printTime[T](block: => T) = {
val timed = time(block)
Expand Down Expand Up @@ -469,4 +464,29 @@ object Utils {

implicit def toRichStringBuilder(sb: mutable.StringBuilder): RichStringBuilder =
new RichStringBuilder(sb)

def D_epsilon(a: Double, b: Double, tolerance: Double = 1.0E-6): Double =
math.max(java.lang.Double.MIN_NORMAL, tolerance * math.max(math.abs(a), math.abs(b)))

def D_==(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
math.abs(a - b) <= D_epsilon(a, b, tolerance)

def D_!=(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
math.abs(a - b) > D_epsilon(a, b, tolerance)

def D_<(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
a - b < -D_epsilon(a, b, tolerance)

def D_<=(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
a - b <= D_epsilon(a, b, tolerance)

def D_>(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
a - b > D_epsilon(a, b, tolerance)

def D_>=(a: Double, b: Double, tolerance: Double = 1.0E-6): Boolean =
a - b >= -D_epsilon(a, b, tolerance)


def flushDouble(a: Double):Double =
if (math.abs(a) < java.lang.Double.MIN_NORMAL) 0.0 else a
}
49 changes: 24 additions & 25 deletions src/main/scala/org/broadinstitute/hail/driver/SampleQC.scala
Original file line number Diff line number Diff line change
Expand Up @@ -12,29 +12,29 @@ import org.kohsuke.args4j.{Option => Args4jOption}
import scala.collection.mutable

object SampleQCCombiner {
val header = "nCalled" + "\t" +
"nNotCalled" + "\t" +
"nHomRef" + "\t" +
"nHet" + "\t" +
"nHomVar" + "\t" +
"alleleBalance" + "\t" +
"nSNP" + "\t" +
"nInsertion" + "\t" +
"nDeletion" + "\t" +
"nSingleton" + "\t" +
"nTransition" + "\t" +
"nTransversion" + "\t" +
"dpMean" + "\t" + "dpStDev" + "\t" +
"dpMeanHomRef" + "\t" + "dpStDevHomRef" + "\t" +
"dpMeanHet" + "\t" + "dpStDevHet" + "\t" +
"dpMeanHomVar" + "\t" + "dpStDevHomVar" + "\t" +
"gqMean" + "\t" + "gqStDev" + "\t" +
"gqMeanHomRef" + "\t" + "gqStDevHomRef" + "\t" +
"gqMeanHet" + "\t" + "gqStDevHet" + "\t" +
"gqMeanHomVar" + "\t" + "gqStDevHomVar" + "\t" +
"nNonRef" + "\t" +
"rTiTv" + "\t" +
"rHetHomVar" + "\t" +
val header = "nCalled\t" +
"nNotCalled\t" +
"nHomRef\t" +
"nHet\t" +
"nHomVar\t" +
"alleleBalance\t" +
"nSNP\t" +
"nInsertion\t" +
"nDeletion\t" +
"nSingleton\t" +
"nTransition\t" +
"nTransversion\t" +
"dpMean\tdpStDev\t" +
"dpMeanHomRef\tdpStDevHomRef\t" +
"dpMeanHet\tdpStDevHet\t" +
"dpMeanHomVar\tdpStDevHomVar\t" +
"gqMean\tgqStDev\t" +
"gqMeanHomRef\tgqStDevHomRef\t" +
"gqMeanHet\tgqStDevHet\t" +
"gqMeanHomVar\tgqStDevHomVar\t" +
"nNonRef\t" +
"rTiTv\t" +
"rHetHomVar\t" +
"rDeletionInsertion"
}

Expand Down Expand Up @@ -221,7 +221,7 @@ class SampleQCCombiner extends Serializable {
sb.append(nHet + nHomVar)
sb += '\t'

// nTiTv
// nTiTvf
sb.tsvAppend(divOption(nTi, nTv))
sb += '\t'

Expand All @@ -231,7 +231,6 @@ class SampleQCCombiner extends Serializable {

// rDeletionInsertion
sb.tsvAppend(divOption(nDel, nIns))

}
}

Expand Down

0 comments on commit cc0064e

Please sign in to comment.