### 4.1 Conditional Likelihood

In this section, we derive the probability distribution for the **number and lengths of IBD segments**, under the condition that **at least one segment is observed**. This conditional framing reflects the reality that we are only studying pairs who share detectable IBD—so the distribution is implicitly filtered on IBD presence.

We follow the notation of Ko and Nielsen (2017). Let a genealogical relationship be defined as:

$$
R = (u, v, a)
$$

where:

- $a \in \{1, 2\}$ indicates whether the pair shares one or two common ancestors,
- $u$ is the number of meioses from individual $i$ to the common ancestor(s),
- $v$ is the number of meioses from individual $j$ to the common ancestor(s).

From this, the total number of meioses separating the two individuals is:

$$
m = u + v
$$

and the **degree of relationship** is defined as:

$$
d = m - a + 1
$$
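As a quick check of these definitions, the sketch below computes $m$ and $d$ for a few familiar relationships. The `Relationship` class and its attribute names are illustrative choices, not part of the cited notation; only the symbols $u$, $v$, $a$, $m$, and $d$ come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """Genealogical relationship R = (u, v, a). Class name is hypothetical."""
    u: int  # meioses from individual i to the common ancestor(s)
    v: int  # meioses from individual j to the common ancestor(s)
    a: int  # number of shared common ancestors (1 or 2)

    @property
    def meioses(self) -> int:
        # m = u + v
        return self.u + self.v

    @property
    def degree(self) -> int:
        # d = m - a + 1
        return self.meioses - self.a + 1


full_siblings = Relationship(u=1, v=1, a=2)
half_siblings = Relationship(u=1, v=1, a=1)
first_cousins = Relationship(u=2, v=2, a=2)

for name, rel in [("full siblings", full_siblings),
                  ("half siblings", half_siblings),
                  ("first cousins", first_cousins)]:
    print(f"{name}: m={rel.meioses}, d={rel.degree}")
```

Full siblings and half siblings have the same $m = 2$ but differ in $a$, which is why the degree depends on both quantities.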

Now let $n$ denote the number of observed IBD segments between $i$ and $j$.

Some of these segments are inherited **through the common ancestor(s)** relevant to the relationship $R$—those are the segments we aim to model. Others may come from background shared ancestry with more distant individuals and are treated as noise.

We define:

- $n_d$: the number of IBD segments that descend from the ancestor(s) relevant to $R$,
- $n_b$: the number of IBD segments that arise from other ancestors in the broader pedigree.

By definition:

$$
n = n_d + n_b
$$

Our goal is to derive the distribution of $n$ and of the segment lengths, **conditional on observing $n \geq 1$**. This will allow us to construct likelihood functions for different relationship types, based only on the observed segment patterns between the pair.

Let $\{\ell_1, \ldots, \ell_n\}$ represent the lengths of the $n = n_d + n_b$ IBD segments observed between individuals $i$ and $j$, measured in centimorgans (cM).

We define the event $O$ to be: “$i$ and $j$ share at least one IBD segment.”  
Our goal is to compute the probability distribution:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O; m, a)
$$

That is, we want the **joint probability of observing these segment lengths**, conditional on having **at least one shared IBD segment** and given:
- $m$: the number of meioses separating $i$ and $j$
- $a$: the number of most-recent common ancestors (1 or 2)

We assume that the $n_d$ segments relevant to the relationship $R$ (i.e., from the targeted common ancestor(s)) are the ones we are modeling.

---

We follow the approach developed in Huff et al. (2011), who derived a similar distribution for IBD segment lengths in the **unconditional case** (i.e., without assuming at least one segment is observed).

To simplify the derivation, we adopt a key assumption:

> The $n_d$ segments transmitted through the most recent IBD-contributing ancestor(s) are the **longest segments** observed.

This assumption lets us **ignore the background segments** ($n_b$), which might come from more distant ancestors. It also **removes the need to sum over all possible subsets** of observed segments that could have originated from the focal ancestor(s).

With this simplifying assumption in place, the probability distribution of segment lengths becomes more tractable and can be derived directly from the genealogical parameters $(m, a)$.


### Interpreting the Conditional Probability Expression

The term

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
$$

represents the **joint probability** of observing the specific set of IBD segment lengths $\ell_1, \ldots, \ell_n$ between two individuals, **given**:

- $O$: the event that _at least one_ IBD segment is observed,
- $a$: the number of shared common ancestors (either 1 or 2),
- $m$: the total number of meioses separating the individuals.

---

This probability expression reflects a **conditional model** of how IBD segments arise from a genealogical relationship. It incorporates:

- The constraint that we are only analyzing pairs who **share detectable IBD** (i.e., $O$ has occurred),
- The genealogical structure of the relationship (number of shared ancestors and meioses),
- And the assumption that **segment lengths are informative** about the underlying relationship.

The conditioning on $O$ is important. In real-world applications, we **only observe pairs who share at least one segment**, so all inferences must be made under that constraint. Without conditioning on $O$, we would be modeling a space that includes pairs with no shared segments—an irrelevant and misleading scenario for most IBD-based inference.

This term therefore forms the **foundation** for likelihood-based methods that estimate the most probable relationship between individuals based on the **number and lengths of shared IBD segments**.


We now approximate the conditional probability of observing the segment lengths $\ell_1, \ldots, \ell_n$ as follows:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O; a, m)
\approx
\sum_{i = 1}^{n} \mathbb{P}(\ell_1, \ldots, \ell_n \mid n_d = i,\ n_b = n - i,\ O;\ a, m) \cdot
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
\tag{1a}
$$

This expression reflects a **mixture model**, where we marginalize over all possible values of $n_d$—the number of IBD segments that originate from the ancestor(s) relevant to the relationship $R$.

Each term in the sum includes:

- The **likelihood** of observing the given segment lengths assuming $n_d = i$ of them are from the genealogical relationship of interest and the rest ($n_b = n - i$) are background.
- The **probability** of having exactly $n_d = i$ direct segments (and $n_b$ background), given that at least one segment is observed and the genealogical parameters $(a, m)$.

In practice, this approach allows us to model the segment length distribution **without needing to know exactly which segments came from the target ancestor(s)**, by treating the counts probabilistically.
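The structure of the sum in (1a) can be sketched generically, with the two factors supplied as functions. This is only an outline of the bookkeeping: `mixture_likelihood`, `conditional_likelihood`, and `split_probability` are hypothetical names, and the toy functions in the usage lines are placeholders rather than real models.

```python
from typing import Callable, Sequence


def mixture_likelihood(
    lengths: Sequence[float],
    conditional_likelihood: Callable[[Sequence[float], int], float],
    split_probability: Callable[[int, int], float],
) -> float:
    """Sum over i = 1..n of
    P(lengths | n_d = i, n_b = n - i, O) * P(n_d = i, n_b = n - i | O)."""
    n = len(lengths)
    total = 0.0
    for i in range(1, n + 1):
        # One mixture component per possible number of direct segments.
        total += conditional_likelihood(lengths, i) * split_probability(i, n - i)
    return total


# Toy placeholders: a constant likelihood and a uniform weight over i = 1..n.
toy_lengths = [48.0, 30.2, 12.5, 7.1]
toy_result = mixture_likelihood(
    toy_lengths,
    conditional_likelihood=lambda lengths, i: 1.0,
    split_probability=lambda i, j: 1.0 / (i + j),
)
print(toy_result)
```

Passing the two factors as callables keeps the marginalization over $n_d$ separate from whatever segment-length and segment-count models are eventually plugged in.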


### Why This Is Interpreted as a Mixture Model

Although we apply the **law of total probability** to sum over possible values of $n_d$, the structure of the equation also reflects what is known in statistics as a **mixture model**.

In this context, we assume that the $n$ observed IBD segments may come from two biologically distinct sources:
- Some segments ($n_d$) are inherited through the genealogical relationship we are trying to infer (i.e., from the closest shared ancestor(s)),
- The remaining segments ($n_b = n - n_d$) arise from other, more distantly shared ancestors.

Because we do not observe the source of each segment, we construct the overall probability distribution by:
- Considering every possible way the $n$ segments could be split into “direct” and “background,” and
- Weighting the likelihood of each split by how probable that configuration is under the model.

This structure — summing over unobserved categories, each with its own likelihood and weight — is characteristic of a **mixture model**.

> While the **law of total probability** gives us the mathematical license to sum over hidden variables like $n_d$,  
> the interpretation as a **mixture model** comes from the biological assumption that segment lengths are generated by **two underlying sources**.  
> It is the presence of these separate sources that gives the model its mixture-like structure.


### Law of Total Probability Interpretation

The summation in the expression 1a above is an application of the **law of total probability**.

Because we don’t observe which of the $n$ segments came from the shared ancestor(s) and which came from background, we treat the number of direct segments $n_d$ as a latent (hidden) variable. The law of total probability lets us express the full probability as a **weighted average** over all possible values of $n_d$.

Here’s how each component contributes:

---

- **$\mathbb{P}(\ell_1, \ldots, \ell_n \mid n_d = i,\ n_b = n - i,\ O;\ a, m)$**  
  This is the likelihood of observing the specific set of segment lengths $\ell_1, \ldots, \ell_n$  
  **assuming exactly $i$ segments** came from the relationship of interest and the remaining $n - i$ came from background ancestry.  
  It is a **conditional likelihood** under a fixed configuration of segment origin.

---

- **$\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)$**  
  This is the **probability of that configuration occurring**, given that we observed at least one segment ($O$) and given genealogical parameters $(a, m)$.  
  It captures the uncertainty in how many of the $n$ observed segments actually reflect the genealogical relationship of interest.

---

- **$\sum_{i = 1}^{n} \cdots$**  
  This summation **marginalizes** over all possible values of $n_d = i$, from 1 to $n$, allowing the full likelihood to reflect all plausible ways the observed segments could have arisen.

---

In effect, we are computing a **mixture distribution**, where each term in the sum corresponds to one possible way of splitting the $n$ segments into “direct” and “background” sources.

By applying the law of total probability, we:
- Avoid having to specify which segments are which,
- Integrate over the hidden variable $n_d$,
- And obtain a valid, tractable expression for the joint probability of the observed data under genealogical model assumptions.

This approach enables principled inference, even in the presence of unobservable ancestral origins for each segment.

>### Note: Deriving $n_b = n - i$ from Basic Algebra
>
>We observe a total of $n$ IBD segments. In each term of the summation, we assume exactly $n_d = i$ of those segments came from the genealogical relationship of interest.
>
>We can derive the number of background segments $n_b$ as follows:
>
>$$
>\begin{array}{rcll}
>n & = & n_d + n_b & \text{(total segments = direct + background)} \\
>n - n_d & = & n_d + n_b - n_d & \text{(subtract $n_d$ from both sides)} \\
>n - n_d & = & n_b & \\
>n - i & = & n_b & \text{(substitute $n_d = i$)}
>\end{array}
>$$
>
>This confirms that for each assumed value of $n_d = i$, the number of background segments must be $n - i$.



Expanding the previous approximation in 1a, we can break the segment lengths into two components: those coming from the **target relationship** and those from **background ancestry**. This gives:

$$
= \sum_{i=1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)}; a, m) \cdot
\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)}) \cdot
\mathbb{P}(n_d = i,\ n_b = n - i \mid O; a, m)
\tag{1b}
$$

Where:

- $\ell^{(1)}, \ldots, \ell^{(n_d)}$ are the $n_d$ longest observed IBD segments, assumed to come from the ancestor(s) specified by the relationship $R$.
- $\ell^{(n_d+1)}, \ldots, \ell^{(n)}$ are the remaining $n_b = n - n_d$ segments, treated as arising from **background shared ancestry**.
- $\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)}; a, m)$ is the probability of the direct segments under the relationship model (e.g., Huff et al.).
- $\mathbb{P}_b(\cdot)$ is the probability model for background segments.
- $\mathbb{P}(n_d = i, n_b = n - i \mid O; a, m)$ is the conditional probability that exactly $i$ of the observed segments came from the focal ancestor(s), given that at least one IBD segment was observed.

This form explicitly separates the **signal (direct segments)** from the **noise (background segments)** and weights each mixture component according to its probability under the genealogical model.

### Segment Partitioning Assumption and Product Rule

To simplify the joint probability over all observed segment lengths, we assume the first $n_d$ **longest IBD segments** originate from the most recent IBD-contributing ancestor(s), while the remaining $n_b = n - n_d$ shorter segments are due to **other (distant) ancestors**.

This **segment-length ordering** allows us to deterministically assign the top $n_d$ segments to the focal relationship and the remaining $n_b$ to older genealogical sources, without needing to marginalize over all possible segment assignments.
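The deterministic assignment can be sketched as a sort followed by a split. The function name `partition_by_length` and the segment lengths are made up for illustration.

```python
from typing import List, Sequence, Tuple


def partition_by_length(
    lengths: Sequence[float], n_d: int
) -> Tuple[List[float], List[float]]:
    """Assign the n_d longest segments to the focal relationship
    and the remaining segments to background ancestry."""
    if not 0 <= n_d <= len(lengths):
        raise ValueError("n_d must be between 0 and the number of segments")
    ordered = sorted(lengths, reverse=True)  # longest first
    return ordered[:n_d], ordered[n_d:]


segments_cm = [12.5, 48.0, 7.1, 30.2, 9.8]  # made-up lengths in cM
direct, background = partition_by_length(segments_cm, n_d=2)
print(direct, background)
```

For a given $n_d$ there is exactly one split to evaluate, instead of one for every subset of size $n_d$.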

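The factorization in (1b) follows from the **product rule** of probability combined with an assumption of conditional independence. The general product rule is $\mathbb{P}(A, B \mid \theta) = \mathbb{P}(A \mid B, \theta) \cdot \mathbb{P}(B \mid \theta)$. If two sets of variables are conditionally independent given the parameters $\theta$, then $\mathbb{P}(A \mid B, \theta) = \mathbb{P}(A \mid \theta)$, and the joint probability reduces to the product of the two marginal probabilities: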

$$
\mathbb{P}(A, B \mid \theta) = \mathbb{P}(A \mid \theta) \cdot \mathbb{P}(B \mid \theta)
$$

In our case, we assume that the segment lengths inherited from the most recent ancestor(s) and those inherited from other ancestors are conditionally independent given the genealogical parameters $(a, m)$. This allows us to factor the joint likelihood:

$$
\underbrace{
\mathbb{P}(\ell_1, \ldots, \ell_n \mid n_d = i,\ n_b = n - i,\ O;\ a, m)
}_{\text{from 1a}}
=
\underbrace{
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)} \mid a, m) \cdot \mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)})
}_{\text{factorization introduced in 1b}}
$$


This use of the product rule allows us to treat the segments from the two sources separately in the likelihood function.
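In code, the factorization means the two groups of segments can be scored separately and their log-likelihoods added. The sketch below assumes, purely for illustration, that segment lengths within each group are independent draws from an exponential distribution; the text does not specify the length models at this point, and the means used here are hypothetical.

```python
import math
from typing import Sequence


def exp_log_density(length: float, mean: float) -> float:
    """Log density of an exponential distribution with the given mean."""
    return -math.log(mean) - length / mean


def factorized_log_likelihood(
    direct: Sequence[float],
    background: Sequence[float],
    direct_mean: float,
    background_mean: float,
) -> float:
    """log P(direct) + log P_b(background), assuming the two groups
    are independent and each segment is an independent exponential draw."""
    log_direct = sum(exp_log_density(x, direct_mean) for x in direct)
    log_background = sum(exp_log_density(x, background_mean) for x in background)
    return log_direct + log_background


# Hypothetical means (cM): longer on average for direct segments.
ll = factorized_log_likelihood([48.0, 30.2], [12.5], direct_mean=25.0, background_mean=5.0)
print(ll)
```

Working in log space avoids numerical underflow when many small densities are multiplied.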

**To recap the three terms of Equation (1b):**

**Direct segment component**:  
$$\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)}; a, m)$$  
The likelihood of observing the $n_d$ longest segments under a model defined by the number of shared common ancestors $a$ and the number of meioses $m$ separating the pair through the recent shared ancestor(s).

**Other ancestor segment component**:  
$$\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)})$$  
The likelihood of observing the remaining $n_b = n - n_d$ shorter segments arising from IBD with other, more distant ancestors.

**Split probability component (unchanged from Equation 1a):**  
$$\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)$$  
This term captures the uncertainty in how the observed $n$ IBD segments are partitioned between the recent ancestor(s) and other ancestors, given that at least one segment was observed.

Now that we have completed the factorization of the likelihood term using the segment partitioning assumption, we turn next to modeling this **split probability** and understanding how it depends on $(a, m)$.

### Summary of Equation (1b) Components

The likelihood of observing $n$ IBD segments is approximated as:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
\approx \sum_{i=1}^n
\underbrace{
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)} \mid a, m)
}_{\text{(1b) direct segment likelihood}}
\cdot
\underbrace{
\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)})
}_{\text{(1b) other segment likelihood}}
\cdot
\underbrace{
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
}_{\text{(1a) split probability}}
$$

| Component                              | Description                                                                                       | Origin in Equation |
|----------------------------------------|---------------------------------------------------------------------------------------------------|--------------------|
| $\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)} \mid a, m)$ | Probability of observing the $n_d$ longest segments under the focal relationship | Introduced in 1b   |
| $\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)})$       | Probability of observing the remaining $n_b = n - n_d$ segments from other ancestors              | Introduced in 1b   |
| $\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)$        | Probability of this partition between direct and other ancestors, conditioned on at least one IBD segment | Carried from 1a     |

This decomposition reflects both a **segment-length-based partitioning assumption** and a use of the **product rule**, which allows separation of the segment likelihoods due to presumed conditional independence. In the next section, we will model the split probability term to complete the evaluation of the full expression.


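We now make the conditioning on $O$ explicit. By the definition of conditional probability, the split probability in (1b) equals the joint probability of the partition and $O$, divided by the probability of observing **at least one IBD segment** given the genealogical parameters $(a, m)$. This gives: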

$$
= \sum_{i=1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)}; a, m) \cdot
\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)}) \cdot
\frac{
\mathbb{P}(n_d = i,\ n_b = n - i,\ O; a, m)
}{
\mathbb{P}(O;\ a, m)
}
\tag{1c}
$$

This is the **fully conditional probability** of observing segment lengths $\ell_1, \ldots, \ell_n$, given that at least one IBD segment is present. It includes:

- A **model for the direct segments** (first $i$ segments, assumed longest),
- A **model for background segments** (remaining $n - i$ segments),
- A **joint probability** of the segment count breakdown $(n_d, n_b)$ **and** the event $O$,
- Normalization by $\mathbb{P}(O;\ a, m)$ to ensure the distribution is properly scaled under the condition that at least one segment is observed.

This form helps isolate the contribution of the target genealogical relationship from other genealogical relationships, and it adheres to proper probabilistic conditioning.



### From Split Probability to Fully Conditional Likelihood

In the previous expression (1b), we summed over possible segment partitions to compute the likelihood of observing the segment lengths $\ell_1, \ldots, \ell_n$, given the observation of at least one IBD segment:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
=
\sum_{i=1}^n
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)} \mid a, m)
\cdot \mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)})
\cdot
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
\tag{1b}
$$

We now express the split probability $\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)$ using the definition of conditional probability:

$$
\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A, B)}{\mathbb{P}(B)}
$$

Applying this to our case:

$$
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
=
\frac{
\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)
}{
\mathbb{P}(O;\ a, m)
}
$$

Substituting this into Equation (1b) yields a fully normalized expression for the likelihood of observing the segment lengths $\ell_1, \ldots, \ell_n$, conditioned on the event $O$ that at least one segment is present:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
=
\sum_{i=1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)} \mid a, m) \cdot
\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)}) \cdot
\frac{
\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)
}{
\mathbb{P}(O;\ a, m)
}
\tag{1c}
$$

This normalization step is useful for several reasons:

1. **It makes the conditioning explicit** — Rather than embedding the observation event $O$ implicitly into each term, it shows exactly how $O$ enters into the probability expression.

2. **It isolates the normalization constant** $\mathbb{P}(O;\ a, m)$, ensuring that the final likelihood is properly scaled.

3. **It sets the stage for modeling** both the numerator and denominator:
   - The numerator $\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)$ captures the probability of a specific segment partition *and* the observation event.
   - The denominator $\mathbb{P}(O;\ a, m)$ gives the overall probability of observing at least one IBD segment under parameters $(a, m)$.
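The effect of dividing by $\mathbb{P}(O;\ a, m)$ can be checked numerically. The sketch below assumes independent Poisson counts for $n_d$ and $n_b$, which is an illustrative choice and not something the derivation has fixed at this point; the rates are hypothetical. Under that assumption, $\mathbb{P}(O) = 1 - \mathbb{P}(n_d = 0)\,\mathbb{P}(n_b = 0)$, and the conditional split probabilities over all $(i, j) \neq (0, 0)$ should sum to 1.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """P(K = k) for a Poisson variable with the given rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def prob_observed(rate_direct: float, rate_background: float) -> float:
    """P(O) = 1 - P(n_d = 0) * P(n_b = 0) under independent Poisson counts."""
    return 1.0 - poisson_pmf(0, rate_direct) * poisson_pmf(0, rate_background)


def split_given_observed(i: int, j: int, rate_direct: float, rate_background: float) -> float:
    """P(n_d = i, n_b = j | O) = P(n_d = i, n_b = j, O) / P(O)."""
    if i + j == 0:
        return 0.0  # the configuration (0, 0) is excluded by O
    joint = poisson_pmf(i, rate_direct) * poisson_pmf(j, rate_background)
    return joint / prob_observed(rate_direct, rate_background)


rate_d, rate_b = 0.8, 0.3  # hypothetical expected segment counts
total = sum(
    split_given_observed(i, j, rate_d, rate_b)
    for i in range(40)
    for j in range(40)
)
print(total)
```

The truncation at 40 segments per source is arbitrary; for rates this small the omitted tail is negligible.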

We continue by making a **conditional independence assumption** about the segment counts: the processes that generate segments  
from **direct** and **background** ancestors are treated as independent of each other.

This allows us to treat:

- **Direct segments**, $n_d$, as governed by the genealogical parameters $(a, m)$, and  
- **Background segments**, $n_b = n - i$, as drawn from a separate, fixed distribution independent of $(a, m)$.

With this assumption, we factor the joint probability as follows:

$$
\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)
=
\mathbb{P}(n_d = i;\ a, m) \cdot \mathbb{P}(n_b = n - i)
$$

The observation event $O$ drops out of the numerator because every term in the sum has $i \geq 1$, so $n_d = i$ already implies that at least one segment is shared. The conditioning on $O$ is  
still enforced by the overall normalization in the denominator $\mathbb{P}(O;\ a, m)$.

---

Putting everything together, the fully normalized form becomes:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
=
\sum_{i=1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)};\ a, m) \cdot
\mathbb{P}_b(\ell^{(n_d+1)}, \ldots, \ell^{(n)}) \cdot
\frac{
\mathbb{P}(n_d = i;\ a, m) \cdot \mathbb{P}(n_b = n - i)
}{
\mathbb{P}(O;\ a, m)
}
\tag{1}
$$

Where:

- $\mathbb{P}_b(\cdot)$ represents the probability distribution over **IBD segment lengths that originate from background ancestors**—i.e., segments **not** linked to the focal genealogical relationship.
- $\mathbb{P}(n_d = i;\ a, m)$ is the probability of observing $i$ direct IBD segments from the ancestor(s) in relationship $R$.
- $\mathbb{P}(n_b = n - i)$ is the background probability of observing the remaining segments from other sources.
- The denominator $\mathbb{P}(O;\ a, m)$ ensures that the total probability is **conditioned on the observation of at least one IBD segment**.

This decomposition allows us to **model direct and background contributions separately**, while still integrating them into a coherent, conditional probability model.

This final form cleanly separates the contributions of the **direct genealogical model**,  
the **background process**, and the **observed segment count** under the constraint that $O$ has occurred.
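To make Equation (1) concrete, the sketch below evaluates it end to end under stand-in models: Poisson segment counts and independent exponential segment lengths for both sources. These choices, the function names, and all numeric values are assumptions made for illustration. In a real model the direct rate and mean would be functions of $(a, m)$; here they are passed in directly.

```python
import math
from typing import Sequence


def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate**k / math.factorial(k)


def exp_density(length: float, mean: float) -> float:
    return math.exp(-length / mean) / mean


def conditional_likelihood(
    lengths: Sequence[float],
    rate_direct: float,       # stand-in for the (a, m)-dependent expected count
    mean_direct: float,       # stand-in for the (a, m)-dependent mean length
    rate_background: float,   # hypothetical background expected count
    mean_background: float,   # hypothetical background mean length
) -> float:
    """Evaluate Equation (1) under the illustrative models above."""
    ordered = sorted(lengths, reverse=True)  # longest segments first
    n = len(ordered)
    if n == 0:
        raise ValueError("Equation (1) conditions on at least one segment")
    # P(O) = 1 - P(n_d = 0) * P(n_b = 0) for independent Poisson counts.
    p_observed = 1.0 - math.exp(-(rate_direct + rate_background))
    total = 0.0
    for i in range(1, n + 1):
        p_direct = math.prod(exp_density(x, mean_direct) for x in ordered[:i])
        p_background = math.prod(exp_density(x, mean_background) for x in ordered[i:])
        p_counts = poisson_pmf(i, rate_direct) * poisson_pmf(n - i, rate_background)
        total += p_direct * p_background * p_counts / p_observed
    return total


value = conditional_likelihood(
    [48.0, 12.5, 30.2],
    rate_direct=0.8, mean_direct=25.0,
    rate_background=0.3, mean_background=5.0,
)
print(value)
```

Comparing this value across candidate parameter settings is the basic operation a likelihood-based relationship estimator would repeat for each relationship type.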


### Recap: Building the Fully Conditional Likelihood

We began by expressing the probability of observing the segment lengths $\ell_1, \ldots, \ell_n$, conditioned on the presence of at least one IBD segment, as a sum over possible partitions between direct and background segments. Each term in the sum accounts for a different number $i$ of segments attributed to the direct genealogical relationship, with the remaining $n - i$ segments treated as background. This gave rise to **Equation (1a)**:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
\approx
\sum_{i = 1}^{n}
\mathbb{P}(\ell_1, \ldots, \ell_n \mid n_d = i,\ n_b = n - i,\ O;\ a, m)
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
\tag{1a}
$$

Under the **segment partitioning assumption**, the $i$ longest segments are assigned to the direct relationship and the remaining $n - i$ to background ancestry. We further assume that the direct and background segment lengths are generated independently given their counts. This allows us to factor the first term in each summand as a product of two independent likelihoods: one for the direct segments and one for the background segments. This leads to **Equation (1b)**:

$$
=
\sum_{i = 1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)};\ a, m)
\cdot
\mathbb{P}_b(\ell^{(n_d + 1)}, \ldots, \ell^{(n)})
\cdot
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
\tag{1b}
$$

To clarify how conditioning on the observation event $O$ affects the expression, we apply the definition of conditional probability:

$$
\mathbb{P}(n_d = i,\ n_b = n - i \mid O;\ a, m)
=
\frac{
\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)
}{
\mathbb{P}(O;\ a, m)
}
$$

Substituting this into Equation (1b) yields **Equation (1c)**:

$$
=
\sum_{i = 1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)};\ a, m)
\cdot
\mathbb{P}_b(\ell^{(n_d + 1)}, \ldots, \ell^{(n)})
\cdot
\frac{
\mathbb{P}(n_d = i,\ n_b = n - i,\ O;\ a, m)
}{
\mathbb{P}(O;\ a, m)
}
\tag{1c}
$$

Finally, we make use of the **conditional independence assumption** between direct and background segment counts:

- The count of direct segments depends on genealogical parameters $a$ and $m$,
- The background segment count is modeled separately and is independent of $a, m$.

With that, we arrive at the fully factorized and normalized **Equation (1)**:

$$
\mathbb{P}(\ell_1, \ldots, \ell_n \mid O;\ a, m)
=
\sum_{i = 1}^{n}
\mathbb{P}(\ell^{(1)}, \ldots, \ell^{(n_d)};\ a, m)
\cdot
\mathbb{P}_b(\ell^{(n_d + 1)}, \ldots, \ell^{(n)})
\cdot
\frac{
\mathbb{P}(n_d = i;\ a, m)
\cdot
\mathbb{P}(n_b = n - i)
}{
\mathbb{P}(O;\ a, m)
}
\tag{1}
$$

This equation cleanly decomposes the likelihood into a sum over direct-background segment partitions, with each term reflecting:

- The segment length distributions for direct and background segments,
- The respective probabilities of observing $n_d = i$ and $n_b = n - i$,
- And normalization over the conditioning event $O$.

This expression serves as the foundation for modeling the likelihood using explicit models such as Huff et al. (2011) for segment lengths and segment counts.
