# Description

This notebook reads a pull request (PR) from a manuscript repository and pairs each original paragraph with its modified version.

# Modules

In [1]:
from pathlib import Path

import pandas as pd
from github import Auth, Github
from IPython.display import display
from proj import conf
from proj.utils import process_paragraph

# Settings/paths

In [2]:
REPO = "pivlab/manubot-ai-editor-code-test-mutator-epistasis-manuscript"
PR = (2, "gpt-3.5-turbo")

OUTPUT_FILE_PATH = None
REVERSED_OUTPUT_FILE_PATH = None

In [3]:
# Parameters
OUTPUT_FILE_PATH = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/epistasis-manuscript--gpt-3.5-turbo.pkl"
REVERSED_OUTPUT_FILE_PATH = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/epistasis-manuscript--gpt-3.5-turbo--reversed.pkl"

In [4]:
OUTPUT_FILE_PATH = Path(OUTPUT_FILE_PATH).resolve()
OUTPUT_FILE_PATH.parent.mkdir(parents=True, exist_ok=True)
display(OUTPUT_FILE_PATH)

PosixPath('/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/epistasis-manuscript--gpt-3.5-turbo.pkl')

In [5]:
REVERSED_OUTPUT_FILE_PATH = Path(REVERSED_OUTPUT_FILE_PATH).resolve()
REVERSED_OUTPUT_FILE_PATH.parent.mkdir(parents=True, exist_ok=True)
display(REVERSED_OUTPUT_FILE_PATH)

PosixPath('/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/epistasis-manuscript--gpt-3.5-turbo--reversed.pkl')

# Get Repo

In [6]:
auth = Auth.Token(conf.github.API_TOKEN)

In [7]:
g = Github(auth=auth)

In [8]:
repo = g.get_repo(REPO)

# Get Pull Request

In [9]:
pr = repo.get_pull(PR[0])

In [10]:
list(pr.get_files())

[File(sha="1b8737b53295e36d2865468636d4d701e3c023e9", filename="content/01.abstract.md"),
 File(sha="28bc2a3e44a83ef0fea21c5401b3e250ccaab8c8", filename="content/02.introduction.md"),
 File(sha="586426518c1666b04cf8c379844f07adf7a9bd68", filename="content/03.results.md"),
 File(sha="8e1d41d794fb25731e79ee5dd7f189b7191df039", filename="content/04.discussion.md"),
 File(sha="5f65333ddefa649a64e03cd20e5fb2941aefd97c", filename="content/05.methods.md")]

In [11]:
pr_commits = list(pr.get_commits())

In [12]:
pr_commits[0].parents

[Commit(sha="5c98a032259b13a812dc490a7a13edfa78768ba9")]

In [13]:
# parent of the first PR commit: the repository state just before the PR
pr_prev = pr_commits[0].parents[0].sha
print(pr_prev)

5c98a032259b13a812dc490a7a13edfa78768ba9


In [14]:
# first commit of the PR; this is the PR head only if the PR has a single commit
pr_curr = pr_commits[0].sha
print(pr_curr)

203ac6b47f445ca4b1fbe6ea0abc19e13ab0cff4


# Get file list

In [15]:
# keep only the Markdown files changed by the PR
pr_files = [f for f in pr.get_files() if f.filename.endswith(".md")]
display(pr_files)

[File(sha="1b8737b53295e36d2865468636d4d701e3c023e9", filename="content/01.abstract.md"),
 File(sha="28bc2a3e44a83ef0fea21c5401b3e250ccaab8c8", filename="content/02.introduction.md"),
 File(sha="586426518c1666b04cf8c379844f07adf7a9bd68", filename="content/03.results.md"),
 File(sha="8e1d41d794fb25731e79ee5dd7f189b7191df039", filename="content/04.discussion.md"),
 File(sha="5f65333ddefa649a64e03cd20e5fb2941aefd97c", filename="content/05.methods.md")]

# Sections

In [16]:
paragraph_matches = []

## Abstract

In [17]:
section_name = "abstract"

In [18]:
pr_filename = pr_files[0].filename
assert section_name in pr_filename
print(pr_filename)

content/01.abstract.md


### Original

In [19]:
# get the file content at the pre-PR commit (pr_prev)
orig_section_content = repo.get_contents(pr_filename, pr_prev).decoded_content.decode(
    "utf-8"
)
print(orig_section_content[:50])

## Abstract {.page_break_before}

Maintaining germ


In [20]:
# split by paragraph
orig_section_paragraphs = orig_section_content.split("\n\n")
display(len(orig_section_paragraphs))

3

### Modified

In [21]:
# get the file content at the PR commit (pr_curr)
mod_section_content = repo.get_contents(pr_filename, pr_curr).decoded_content.decode(
    "utf-8"
)
print(mod_section_content[:50])

## Abstract {.page_break_before}

The essential an


In [22]:
# split by paragraph
mod_section_paragraphs = mod_section_content.split("\n\n")
display(len(mod_section_paragraphs))

3
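
The index-based pairing used below only works if both revisions split into the same number of blocks and their headings line up. A minimal sketch of that sanity check follows; `split_paragraphs` and `check_alignable` are hypothetical helpers written for illustration (they are not part of `proj.utils`).

```python
from typing import List


def split_paragraphs(content: str) -> List[str]:
    """Split Markdown text into blocks separated by blank lines (same rule as above)."""
    return content.split("\n\n")


def check_alignable(orig: List[str], mod: List[str]) -> None:
    """Raise ValueError if the two revisions cannot be paired by position."""
    if len(orig) != len(mod):
        raise ValueError(f"paragraph counts differ: {len(orig)} vs {len(mod)}")
    for i, (o, m) in enumerate(zip(orig, mod)):
        # Assumption: headings are expected to survive the revision unchanged.
        if o.startswith("#") and o.strip() != m.strip():
            raise ValueError(f"heading mismatch at block {i}: {o!r} vs {m!r}")


# Toy stand-ins for the two revisions of a section file.
orig_demo = split_paragraphs("## Abstract\n\nOriginal text.\n")
mod_demo = split_paragraphs("## Abstract\n\nRevised text.\n")
check_alignable(orig_demo, mod_demo)  # passes silently when the blocks line up
```

Failing loudly here is a design choice: a silent off-by-one in the pairing would compare unrelated paragraphs in every later cell.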

### Match

In [23]:
orig_section_paragraphs[0]

'## Abstract {.page_break_before}'

In [24]:
mod_section_paragraphs[0]

'## Abstract {.page_break_before}'

#### Paragraph 00

In [25]:
par0 = process_paragraph(orig_section_paragraphs[1])
print(par0)

Maintaining germline genome integrity is essential and enormously complex. Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537]. While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals. DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of *de novo* mutations &mdash; that is, the frequency of each individual mutation type (C>T, A>G, etc.) &mdash; will differ between genomes that harbor either a mutator or wild-type allele at a given locus. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:35545679;@PMID:334

In [26]:
par1 = process_paragraph(mod_section_paragraphs[1])
print(par1)

The essential and immensely complex issue of maintaining germline genome integrity involves hundreds of proteins responsible for DNA replication, proofreading, and repair [@PMID:28485537]. While loss-of-function mutations in genes encoding these proteins can result in increased mutation rates, the detection of *mutator alleles* in mammals has been challenging. DNA replication and repair proteins often target specific sequence motifs or excise lesions at particular nucleotides, suggesting that the spectrum of *de novo* mutations (such as C>T, A>G, etc.) may vary between genomes with mutator or wild-type alleles at a specific locus. Previous research utilized quantitative trait locus mapping to identify potential mutator alleles in the DNA repair gene *Mutyh*, which elevated the C>A germline mutation rate in the BXD inbred mouse family [@PMID:35545679;@PMID:33472028]. In this study, a novel method called "aggregate mutation spectrum distance" was developed to identify alleles linked to m

In [27]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [28]:
display(paragraph_matches[-1])

('abstract',
 'Maintaining germline genome integrity is essential and enormously complex. Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537]. While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals. DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of *de novo* mutations &mdash; that is, the frequency of each individual mutation type (C>T, A>G, etc.) &mdash; will differ between genomes that harbor either a mutator or wild-type allele at a given locus. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:355
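
Each entry in `paragraph_matches` is a `(section, original, modified)` tuple. The cells that write `OUTPUT_FILE_PATH` and `REVERSED_OUTPUT_FILE_PATH` are not shown in this excerpt, so the following is only a sketch of one plausible reading: the "reversed" file holds the same pairs with the two paragraph positions swapped. The helper name `reverse_matches` and the use of the standard-library `pickle` module are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def reverse_matches(matches):
    """Swap the original and modified paragraphs in each (section, orig, mod) tuple."""
    return [(section, mod, orig) for section, orig, mod in matches]


# Toy data in the same shape as `paragraph_matches`.
demo_matches = [
    ("abstract", "Original abstract.", "Revised abstract."),
    ("introduction", "Original intro.", "Revised intro."),
]
demo_reversed = reverse_matches(demo_matches)

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "demo--reversed.pkl"
    with out_path.open("wb") as handle:
        pickle.dump(demo_reversed, handle)
    with out_path.open("rb") as handle:
        loaded = pickle.load(handle)
```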

## Introduction

In [29]:
section_name = "introduction"

In [30]:
pr_filename = pr_files[1].filename
assert section_name in pr_filename
print(pr_filename)

content/02.introduction.md


### Original

In [31]:
# get the file content at the pre-PR commit (pr_prev)
orig_section_content = repo.get_contents(pr_filename, pr_prev).decoded_content.decode(
    "utf-8"
)
print(orig_section_content[:50])

## Introduction

Germline mutation rates reflect t


In [32]:
# split by paragraph
orig_section_paragraphs = orig_section_content.split("\n\n")
display(len(orig_section_paragraphs))

8

### Modified

In [33]:
# get the file content at the PR commit (pr_curr)
mod_section_content = repo.get_contents(pr_filename, pr_curr).decoded_content.decode(
    "utf-8"
)
print(mod_section_content[:50])

## Introduction

Germline mutation rates are influ


In [34]:
# split by paragraph
mod_section_paragraphs = mod_section_content.split("\n\n")
display(len(mod_section_paragraphs))

8

### Match

In [35]:
orig_section_paragraphs[0]

'## Introduction'

In [36]:
mod_section_paragraphs[0]

'## Introduction'

#### Paragraph 00

In [37]:
par0 = process_paragraph(orig_section_paragraphs[1])
print(par0)

Germline mutation rates reflect the complex interplay between DNA proofreading and repair pathways, exogenous sources of DNA damage, and life-history traits. For example, parental age is an important determinant of mutation rate variability; in many mammalian species, the number of germline *de novo* mutations observed in offspring increases as a function of paternal and maternal age [@PMID:28959963;@PMID:31549960;@PMID:35771663;@PMID:32804933;@PMID:31492841]. Rates of germline mutation accumulation are also variable across human families [@PMID:26656846;@PMID:31549960], likely due to either genetic variation or differences in environmental exposures. Although numerous protein-coding genes contribute to the maintenance of genome integrity, genetic variants that increase germline mutation rates, known as *mutator alleles*, have proven difficult to discover in mammals.


In [38]:
par1 = process_paragraph(mod_section_paragraphs[1])
print(par1)

Germline mutation rates are influenced by DNA proofreading and repair pathways, external sources of DNA damage, and life-history traits. Parental age is a key factor affecting mutation rate variability, with the number of new mutations in offspring increasing as parents age in many mammalian species (Jones et al., 2017; Smith et al., 2019; Brown et al., 2020; White et al., 2020; Black et al., 2019). Mutation accumulation rates vary among human families, likely due to genetic differences or environmental exposures (Green et al., 2015; Smith et al., 2019). While many genes play a role in maintaining genome integrity, identifying genetic variants that raise germline mutation rates, called mutator alleles, has been challenging in mammals.


In [39]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [40]:
display(paragraph_matches[-1])

('introduction',
 'Germline mutation rates reflect the complex interplay between DNA proofreading and repair pathways, exogenous sources of DNA damage, and life-history traits. For example, parental age is an important determinant of mutation rate variability; in many mammalian species, the number of germline *de novo* mutations observed in offspring increases as a function of paternal and maternal age [@PMID:28959963;@PMID:31549960;@PMID:35771663;@PMID:32804933;@PMID:31492841]. Rates of germline mutation accumulation are also variable across human families [@PMID:26656846;@PMID:31549960], likely due to either genetic variation or differences in environmental exposures. Although numerous protein-coding genes contribute to the maintenance of genome integrity, genetic variants that increase germline mutation rates, known as *mutator alleles*, have proven difficult to discover in mammals.',
 'Germline mutation rates are influenced by DNA proofreading and repair pathways, external sources 
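
In this pair, the revised paragraph no longer contains the Manubot citation keys (`[@PMID:...]`) of the original and shows author-year references instead. A small check such as the hypothetical `missing_citations` below could flag pairs like this one. The regular expression is an assumption that covers only the `@PMID:` and `@doi:` keys visible in this notebook.

```python
import re

# Matches keys such as "@PMID:28959963" or "@doi:10.1101/2023.03.27.534460".
# Assumption: a key ends at whitespace, ";" or "]".
CITATION_KEY = re.compile(r"@(?:PMID|doi):[^\s;\]]+")


def citation_keys(paragraph):
    """Return the set of citation keys found in a paragraph."""
    return set(CITATION_KEY.findall(paragraph))


def missing_citations(original, modified):
    """Return the sorted keys present in the original but absent from the revision."""
    return sorted(citation_keys(original) - citation_keys(modified))


demo_orig = "Rates vary across families [@PMID:26656846;@PMID:31549960]."
demo_mod = "Rates vary across families (Green et al., 2015)."
demo_missing = missing_citations(demo_orig, demo_mod)
print(demo_missing)
```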

#### Paragraph 01

In [41]:
par0 = process_paragraph(orig_section_paragraphs[2])
print(par0)

The dearth of observed germline mutators in mammalian genomes is not necessarily surprising, since alleles that lead to elevated germline mutation rates would likely have deleterious consequences and be purged by negative selection if their effect sizes are large [@PMID:27739533]. Moreover, germline mutation rates are relatively low, and direct mutation rate measurements require whole-genome sequencing data from both parents and their offspring. As a result, large-scale association studies &mdash; which have been used to map the contributions of common genetic variants to many complex traits &mdash; are not currently well-powered to investigate the polygenic architecture of germline mutation rates [@PMID:31964835].


In [42]:
par1 = process_paragraph(mod_section_paragraphs[2])
print(par1)

The scarcity of observed germline mutators in mammalian genomes is not surprising. Mutations that increase germline mutation rates would likely have harmful effects and be eliminated by negative selection if they have significant impacts. Additionally, germline mutation rates are generally low, and accurately measuring mutation rates requires whole-genome sequencing data from both parents and their offspring. Therefore, large-scale association studies, which have been effective in identifying common genetic variants associated with many complex traits, currently lack the power to explore the polygenic nature of germline mutation rates.


In [43]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [44]:
display(paragraph_matches[-1])

('introduction',
 'The dearth of observed germline mutators in mammalian genomes is not necessarily surprising, since alleles that lead to elevated germline mutation rates would likely have deleterious consequences and be purged by negative selection if their effect sizes are large [@PMID:27739533]. Moreover, germline mutation rates are relatively low, and direct mutation rate measurements require whole-genome sequencing data from both parents and their offspring. As a result, large-scale association studies &mdash; which have been used to map the contributions of common genetic variants to many complex traits &mdash; are not currently well-powered to investigate the polygenic architecture of germline mutation rates [@PMID:31964835].',
 'The scarcity of observed germline mutators in mammalian genomes is not surprising. Mutations that increase germline mutation rates would likely have harmful effects and be eliminated by negative selection if they have significant impacts. Additionally,

#### Paragraph 02

In [45]:
par0 = process_paragraph(orig_section_paragraphs[3])
print(par0)

Despite these challenges, less traditional strategies have been used to identify a small number of mutator alleles in humans, macaques [@doi:10.1101/2023.03.27.534460], and mice. By focusing on families with rare genetic diseases, a recent study discovered two mutator alleles that led to significantly elevated rates of *de novo* germline mutation in human genomes [@PMID:35545669]. Other groups have observed mutator phenotypes in the germlines and somatic tissues of adults who carry cancer-predisposing inherited mutations in the POLE/POLD1 exonucleases [@PMID:34594041;@PMID:37336879]. Candidate mutator loci were also found by identifying human haplotypes from the Thousand Genomes Project with excess counts of derived alleles in genomic windows [@PMID:28095480].


In [46]:
par1 = process_paragraph(mod_section_paragraphs[3])
print(par1)

Despite facing challenges, researchers have utilized unconventional methods to discover a small number of mutator alleles in humans, macaques (Smith et al., 2023), and mice. For instance, a recent study focused on families with rare genetic diseases and identified two mutator alleles that significantly increased *de novo* germline mutation rates in human genomes (Jones et al., 2021). Additionally, other studies have observed mutator phenotypes in both germline and somatic tissues of adults carrying cancer-predisposing inherited mutations in the POLE/POLD1 exonucleases (Brown et al., 2022; White et al., 2023). Furthermore, candidate mutator loci were identified by analyzing human haplotypes from the Thousand Genomes Project, which showed an excess of derived alleles in specific genomic windows (Black et al., 2016).


In [47]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [48]:
display(paragraph_matches[-1])

('introduction',
 'Despite these challenges, less traditional strategies have been used to identify a small number of mutator alleles in humans, macaques [@doi:10.1101/2023.03.27.534460], and mice. By focusing on families with rare genetic diseases, a recent study discovered two mutator alleles that led to significantly elevated rates of *de novo* germline mutation in human genomes [@PMID:35545669]. Other groups have observed mutator phenotypes in the germlines and somatic tissues of adults who carry cancer-predisposing inherited mutations in the POLE/POLD1 exonucleases [@PMID:34594041;@PMID:37336879]. Candidate mutator loci were also found by identifying human haplotypes from the Thousand Genomes Project with excess counts of derived alleles in genomic windows [@PMID:28095480].',
 'Despite facing challenges, researchers have utilized unconventional methods to discover a small number of mutator alleles in humans, macaques (Smith et al., 2023), and mice. For instance, a recent study foc

#### Paragraph 03

In [49]:
par0 = process_paragraph(orig_section_paragraphs[4])
print(par0)

In mice, a germline mutator allele was recently discovered by sequencing a large family of inbred mice [@PMID:35545679]. Commonly known as the <u>B</u>X<u>D</u>s, these recombinant inbred lines (RILs) were derived from either F2 or advanced intercrosses of C57<u>B</u>L/6J and <u>D</u>BA/2J, two laboratory strains that exhibit significant differences in their germline mutation spectra [@PMID:33472028;@PMID:30753674]. The BXDs were maintained via brother-sister mating for up to 180 generations, and each BXD therefore accumulated hundreds or thousands of germline mutations on a nearly-homozygous linear mosaic of parental <u>B</u> and <u>D</u> haplotypes. Due to their husbandry in a controlled laboratory setting, the BXDs were largely free from confounding by environmental heterogeneity, and the effects of selection on *de novo* mutations were attenuated by strict inbreeding [@doi:10.1146/annurev.ecolsys.39.110707.173437].


In [51]:
par1 = process_paragraph(mod_section_paragraphs[4])
print(par1)

In a recent study, researchers identified a germline mutator allele in mice by analyzing a large family of inbred mice. These mice, known as the BXDs, were created from crosses between C57BL/6J and DBA/2J laboratory strains, which have different germline mutation patterns. The BXDs were bred through sibling mating for many generations, resulting in the accumulation of hundreds or thousands of germline mutations on a mosaic of parental haplotypes. The controlled laboratory environment in which the BXDs were raised minimized the impact of environmental factors and reduced the effects of selection on new mutations.


In [52]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [53]:
display(paragraph_matches[-1])

('introduction',
 'In mice, a germline mutator allele was recently discovered by sequencing a large family of inbred mice [@PMID:35545679]. Commonly known as the <u>B</u>X<u>D</u>s, these recombinant inbred lines (RILs) were derived from either F2 or advanced intercrosses of C57<u>B</u>L/6J and <u>D</u>BA/2J, two laboratory strains that exhibit significant differences in their germline mutation spectra [@PMID:33472028;@PMID:30753674]. The BXDs were maintained via brother-sister mating for up to 180 generations, and each BXD therefore accumulated hundreds or thousands of germline mutations on a nearly-homozygous linear mosaic of parental <u>B</u> and <u>D</u> haplotypes. Due to their husbandry in a controlled laboratory setting, the BXDs were largely free from confounding by environmental heterogeneity, and the effects of selection on *de novo* mutations were attenuated by strict inbreeding [@doi:10.1146/annurev.ecolsys.39.110707.173437].',
 'In a recent study, researchers identified a 

#### Paragraph 04

In [54]:
par0 = process_paragraph(orig_section_paragraphs[5])
print(par0)

In this previous study, whole-genome sequencing data from the BXD family were used to map a quantitative trait locus (QTL) for the C>A mutation rate [@PMID:35545679]. Germline C>A mutation rates were nearly 50% higher in mice with *D* haplotypes at the QTL, likely due to genetic variation in the DNA glycosylase *Mutyh* that reduced the efficacy of oxidative DNA damage repair. Pathogenic variants of *Mutyh* also appear to act as mutators in normal human germline and somatic tissues [@PMID:35803914;@PMID:30753674]. Importantly, the QTL did not reach genome-wide significance in a scan for variation in overall germline mutation rates, which were only modestly higher in BXDs with *D* alleles, demonstrating the utility of mutation spectrum analysis for mutator allele discovery. Close examination of the mutation spectrum is likely to be broadly useful for detecting mutator alleles, as genes involved in DNA proofreading and repair often recognize particular sequence motifs or excise specific t

In [55]:
par1 = process_paragraph(mod_section_paragraphs[5])
print(par1)

In a recent study, researchers used whole-genome sequencing data from the BXD family to identify a genetic region associated with an increase in C>A mutation rates in the germline of mice. This mutation rate was about 50% higher in mice with specific genetic markers at this region, known as a quantitative trait locus (QTL), potentially due to variations in a gene called *Mutyh*, which is involved in repairing oxidative DNA damage. Mutations in *Mutyh* have also been linked to increased mutation rates in both human germline and somatic cells. Although the QTL did not show strong associations with overall germline mutation rates, which were only slightly elevated in mice with specific genetic markers, analyzing the mutation spectrum proved valuable in identifying mutator alleles. By examining the specific types of mutations present in the DNA sequences, researchers can pinpoint genes involved in DNA repair and proofreading, which often have preferences for certain DNA sequence patterns o

In [56]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [57]:
display(paragraph_matches[-1])

('introduction',
 'In this previous study, whole-genome sequencing data from the BXD family were used to map a quantitative trait locus (QTL) for the C>A mutation rate [@PMID:35545679]. Germline C>A mutation rates were nearly 50% higher in mice with *D* haplotypes at the QTL, likely due to genetic variation in the DNA glycosylase *Mutyh* that reduced the efficacy of oxidative DNA damage repair. Pathogenic variants of *Mutyh* also appear to act as mutators in normal human germline and somatic tissues [@PMID:35803914;@PMID:30753674]. Importantly, the QTL did not reach genome-wide significance in a scan for variation in overall germline mutation rates, which were only modestly higher in BXDs with *D* alleles, demonstrating the utility of mutation spectrum analysis for mutator allele discovery. Close examination of the mutation spectrum is likely to be broadly useful for detecting mutator alleles, as genes involved in DNA proofreading and repair often recognize particular sequence motifs o

#### Paragraph 05

In [58]:
par0 = process_paragraph(orig_section_paragraphs[6])
print(par0)

Although mutation spectrum analysis can enable the discovery of mutator alleles that affect the rates of specific mutation types, early implementations of this strategy have suffered from a few drawbacks. For example, performing association tests on the rates or fractions of every $k$-mer mutation type can quickly incur a substantial multiple testing burden. Since germline mutation rates are generally quite low, estimates of $k$-mer mutation type frequencies from individual samples can also be noisy and imprecise. Moreover, inbreeding duration can vary considerably across samples in populations of RILs; for example, some BXDs were inbred for only 20 generations, while others were inbred for nearly 200. As a result, the variance of individual $k$-mer mutation rate estimates in those populations will be much higher than if all samples were inbred for the same duration. We were therefore motivated to develop a statistical method that could overcome the sparsity of *de novo* mutation spect

In [59]:
par1 = process_paragraph(mod_section_paragraphs[6])
print(par1)

Mutation spectrum analysis is a valuable tool for identifying mutator alleles that impact specific mutation rates. However, early applications of this approach have faced challenges. For instance, conducting association tests on the rates or proportions of every $k$-mer mutation type can lead to a significant burden of multiple testing. Additionally, due to the typically low germline mutation rates, estimates of $k$-mer mutation type frequencies from individual samples may be noisy and imprecise. Furthermore, the duration of inbreeding can vary widely among samples in populations of recombinant inbred lines (RILs); some BXDs were inbred for just 20 generations, while others underwent nearly 200 generations of inbreeding. Consequently, the variability in individual $k$-mer mutation rate estimates within these populations is much higher than if all samples had been inbred for the same duration. This motivated us to develop a statistical method that could address the sparse nature of *de 

In [60]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [61]:
display(paragraph_matches[-1])

('introduction',
 'Although mutation spectrum analysis can enable the discovery of mutator alleles that affect the rates of specific mutation types, early implementations of this strategy have suffered from a few drawbacks. For example, performing association tests on the rates or fractions of every $k$-mer mutation type can quickly incur a substantial multiple testing burden. Since germline mutation rates are generally quite low, estimates of $k$-mer mutation type frequencies from individual samples can also be noisy and imprecise. Moreover, inbreeding duration can vary considerably across samples in populations of RILs; for example, some BXDs were inbred for only 20 generations, while others were inbred for nearly 200. As a result, the variance of individual $k$-mer mutation rate estimates in those populations will be much higher than if all samples were inbred for the same duration. We were therefore motivated to develop a statistical method that could overcome the sparsity of *de n

#### Paragraph 06

In [62]:
par0 = process_paragraph(orig_section_paragraphs[7])
print(par0)

Here, we present a new mutation spectrum association test, called "aggregate mutation spectrum distance," that minimizes multiple testing burdens and mitigates the challenges of sparsity in *de novo* mutation datasets. We leverage this method to re-analyze germline mutation data from the BXD family and find compelling evidence for a second mutator allele that was not detected using previous approaches. The new allele appears to interact epistatically with the mutator that was previously discovered in the BXDs, further augmenting the C>A germline mutation rate in a subset of inbred mice. Our observation of epistasis suggests that mild DNA repair deficiencies can compound one another, as mutator alleles chip away at the redundant systems that collectively maintain germline integrity.


In [63]:
par1 = process_paragraph(mod_section_paragraphs[7])
print(par1)

In this study, we introduce a novel test called the "aggregate mutation spectrum distance" to analyze mutation spectra more effectively while reducing the need for multiple tests and addressing data sparsity issues in *de novo* mutation datasets. Using this method, we re-examined germline mutation data from the BXD family and identified a previously undetected mutator allele that interacts with a known mutator allele in the BXDs. This interaction enhances the C>A germline mutation rate in certain inbred mice. Our findings suggest that mild DNA repair deficiencies can have a cumulative effect, with mutator alleles compromising the redundant systems that safeguard germline integrity.


In [64]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [65]:
display(paragraph_matches[-1])

('introduction',
 'Here, we present a new mutation spectrum association test, called "aggregate mutation spectrum distance," that minimizes multiple testing burdens and mitigates the challenges of sparsity in *de novo* mutation datasets. We leverage this method to re-analyze germline mutation data from the BXD family and find compelling evidence for a second mutator allele that was not detected using previous approaches. The new allele appears to interact epistatically with the mutator that was previously discovered in the BXDs, further augmenting the C>A germline mutation rate in a subset of inbred mice. Our observation of epistasis suggests that mild DNA repair deficiencies can compound one another, as mutator alleles chip away at the redundant systems that collectively maintain germline integrity.',
 'In this study, we introduce a novel test called the "aggregate mutation spectrum distance" to analyze mutation spectra more effectively while reducing the need for multiple tests and a

## Results

In [66]:
section_name = "results"

In [67]:
pr_filename = pr_files[2].filename
assert section_name in pr_filename
print(pr_filename)

content/03.results.md


### Original

In [68]:
# get the file content at the pre-PR commit (pr_prev)
orig_section_content = repo.get_contents(pr_filename, pr_prev).decoded_content.decode(
    "utf-8"
)
print(orig_section_content[:50])

## Results

### A novel method for detecting mutat


In [69]:
# split by paragraph
orig_section_paragraphs = orig_section_content.split("\n\n")
display(len(orig_section_paragraphs))

28

### Modified

In [70]:
# get the file content at the PR commit (pr_curr)
mod_section_content = repo.get_contents(pr_filename, pr_curr).decoded_content.decode(
    "utf-8"
)
print(mod_section_content[:50])

## Results

### A novel method for detecting mutat


In [71]:
# split by paragraph
mod_section_paragraphs = mod_section_content.split("\n\n")
display(len(mod_section_paragraphs))

28

### Match

In [72]:
orig_section_paragraphs[0]

'## Results'

In [73]:
mod_section_paragraphs[0]

'## Results'
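
Unlike the abstract and introduction, this section opens with a subsection heading, so the first prose paragraph sits at index 2 rather than 1. A minimal sketch of pairing blocks by position while skipping headings is shown below. `pair_paragraphs` is a hypothetical helper; it assumes both revisions have the same number of blocks in the same order, and it uses `str.strip` as a stand-in for `process_paragraph`, whose behavior is not shown in this notebook.

```python
def pair_paragraphs(section_name, orig_blocks, mod_blocks):
    """Pair blocks by index, skipping Markdown headings and empty blocks."""
    if len(orig_blocks) != len(mod_blocks):
        raise ValueError("revisions have different numbers of blocks")
    pairs = []
    for orig, mod in zip(orig_blocks, mod_blocks):
        orig, mod = orig.strip(), mod.strip()
        # Headings start with "#"; they carry no prose to compare.
        if not orig or orig.startswith("#"):
            continue
        pairs.append((section_name, orig, mod))
    return pairs


# Toy section with a subsection heading, mirroring the layout of the Results file.
demo_orig = "## Results\n\n### A novel method\n\nWe developed a method.".split("\n\n")
demo_mod = "## Results\n\n### A novel method\n\nWe introduced a method.".split("\n\n")
demo_pairs = pair_paragraphs("results", demo_orig, demo_mod)
```

Each returned tuple has the same `(section, original, modified)` shape as the entries appended to `paragraph_matches`.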

#### Paragraph 00

In [77]:
par0 = process_paragraph(orig_section_paragraphs[2])
print(par0)

We developed a statistical method, termed "aggregate mutation spectrum distance" (AMSD), to detect loci that are associated with mutation spectrum variation in recombinant inbred lines (RILs) (Figure {@fig:distance-method}; *Materials and Methods*). Our approach leverages the fact that mutator alleles often leave behind distinct and detectable impressions on the *mutation spectrum*, even if they increase the overall mutation rate by a relatively small amount. Given a population of haplotypes, we assume that each has been genotyped at the same collection of biallelic loci and that each harbors *de novo* mutations which have been partitioned by $k$-mer context (Figure @fig:distance-method). At every locus, we calculate a cosine distance between the aggregate mutation spectra of haplotypes that inherited either parental allele. Using permutation tests, we then identify loci at which those distances are larger than what we'd expect by random chance. To account for polygenic effects on the 

In [78]:
par1 = process_paragraph(mod_section_paragraphs[2])
print(par1)

We introduced a statistical method, named "aggregate mutation spectrum distance" (AMSD), to identify loci linked to mutation spectrum variation in recombinant inbred lines (RILs) (Figure {@fig:distance-method}; *Materials and Methods*). Our method capitalizes on the fact that mutator alleles can produce distinct, identifiable patterns in the mutation spectrum, even if they only slightly increase the overall mutation rate. Assuming a population of genotyped haplotypes at the same set of biallelic loci, each containing *de novo* mutations categorized by $k$-mer context (Figure @fig:distance-method), we compute the cosine distance at each locus between the aggregate mutation spectra of haplotypes inheriting different parental alleles. Loci with larger distances than expected by chance are identified through permutation tests. To address polygenic influences on the mutation process shared among BXDs, we perform a regression of cosine distance against genetic similarity between haplotype gr

In [79]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)
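
The same three steps (process the original paragraph, process the modified one, append the pair) repeat for every paragraph in this section. A minimal sketch of how that pattern could be wrapped in a helper is given here. `collect_matches` and its `process` parameter are hypothetical names, not part of `proj.utils`, and the default `str.strip` is only a stand-in for `process_paragraph`, whose actual behavior is not shown in this notebook.

```python
from typing import Callable, List, Sequence, Tuple


def collect_matches(
    section_name: str,
    orig_paragraphs: Sequence[str],
    mod_paragraphs: Sequence[str],
    index_pairs: Sequence[Tuple[int, int]],
    process: Callable[[str], str] = str.strip,  # assumption: stand-in for process_paragraph
) -> List[Tuple[str, str, str]]:
    """Return (section, original, modified) tuples for the given index pairs."""
    matches = []
    for orig_idx, mod_idx in index_pairs:
        par0 = process(orig_paragraphs[orig_idx])
        par1 = process(mod_paragraphs[mod_idx])
        matches.append((section_name, par0, par1))
    return matches


# Toy data standing in for orig_section_paragraphs / mod_section_paragraphs.
demo_orig = ["# heading", " We developed a method. ", "Other text."]
demo_mod = ["# heading", " We introduced a method. ", "Other text."]
demo_matches = collect_matches("results", demo_orig, demo_mod, [(1, 1)])
```

In this notebook the call might look like `paragraph_matches.extend(collect_matches(section_name, orig_section_paragraphs, mod_section_paragraphs, [(3, 3), (6, 6)], process=process_paragraph))`. The index pairs would still be chosen by hand after inspecting the printed paragraphs.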

In [80]:
display(paragraph_matches[-1])

('results',
 'We developed a statistical method, termed "aggregate mutation spectrum distance" (AMSD), to detect loci that are associated with mutation spectrum variation in recombinant inbred lines (RILs) (Figure {@fig:distance-method}; *Materials and Methods*). Our approach leverages the fact that mutator alleles often leave behind distinct and detectable impressions on the *mutation spectrum*, even if they increase the overall mutation rate by a relatively small amount. Given a population of haplotypes, we assume that each has been genotyped at the same collection of biallelic loci and that each harbors *de novo* mutations which have been partitioned by $k$-mer context (Figure @fig:distance-method). At every locus, we calculate a cosine distance between the aggregate mutation spectra of haplotypes that inherited either parental allele. Using permutation tests, we then identify loci at which those distances are larger than what we\'d expect by random chance. To account for polygenic 

####  Paragraph 01

In [81]:
# Original version of the paragraph, before the AI revision
par0 = process_paragraph(orig_section_paragraphs[3])
print(par0)

Using simulated data, we find that our method's power is primarily limited by the initial mutation rate of the $k$-mer mutation type affected by a mutator allele and the total number of *de novo* mutations used to detect it (Figure {@fig:simulations}). Given 100 haplotypes with an average of 500 *de novo* germline mutations each, AMSD has approximately 90% power to detect a mutator allele that increases the C>A *de novo* mutation rate by as little as 20%. However, the approach has less than 20% power to detect a mutator of identical effect size that augments the C>G mutation rate, since C>G mutations are expected to make up a smaller fraction of all *de novo* germline mutations to begin with. Simulations also demonstrate that our approach is well-powered to detect large-effect mutator alleles (e.g., those that increase the mutation rate of a specific $k$-mer by 50%), even with a relatively small number of mutations per haplotype (Figure {@fig:simulations}). Both AMSD and traditional qu

In [82]:
par1 = process_paragraph(mod_section_paragraphs[3])
print(par1)

Simulated data shows that the power of our method is mainly limited by the initial mutation rate of the $k$-mer mutation type affected by a mutator allele and the total number of *de novo* mutations used for detection (Figure 1). With 100 haplotypes having an average of 500 *de novo* germline mutations each, AMSD can detect a mutator allele increasing the C>A *de novo* mutation rate by 20% with about 90% power. However, detecting a mutator with the same effect size that increases the C>G mutation rate has less than 20% power due to the lower fraction of C>G mutations initially. Simulations also show that the approach is effective in detecting large-effect mutator alleles (e.g., those increasing the mutation rate of a specific $k$-mer by 50%) even with a small number of mutations per haplotype (Figure 1). Both AMSD and traditional quantitative trait locus (QTL) mapping have similar power in detecting alleles that enhance individual 1-mer mutation rates (Figure 2), but AMSD has advantage

In [83]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [84]:
display(paragraph_matches[-1])

('results',
 "Using simulated data, we find that our method's power is primarily limited by the initial mutation rate of the $k$-mer mutation type affected by a mutator allele and the total number of *de novo* mutations used to detect it (Figure {@fig:simulations}). Given 100 haplotypes with an average of 500 *de novo* germline mutations each, AMSD has approximately 90% power to detect a mutator allele that increases the C>A *de novo* mutation rate by as little as 20%. However, the approach has less than 20% power to detect a mutator of identical effect size that augments the C>G mutation rate, since C>G mutations are expected to make up a smaller fraction of all *de novo* germline mutations to begin with. Simulations also demonstrate that our approach is well-powered to detect large-effect mutator alleles (e.g., those that increase the mutation rate of a specific $k$-mer by 50%), even with a relatively small number of mutations per haplotype (Figure {@fig:simulations}). Both AMSD and 

####  Paragraph 02

In [87]:
par0 = process_paragraph(orig_section_paragraphs[6])
print(par0)

We applied our aggregate mutation spectrum distance method to 117 BXDs (*Materials and Methods*) with a total of 65,552 *de novo* germline mutations [@PMID:35545679]. Using mutation data that were partitioned by 1-mer nucleotide context, we discovered a locus on chromosome 4 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}a; maximum adjusted cosine distance of 1.20e-2 at marker ID `rs27509845`; position 118.28 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 114.79 - 118.75 Mbp).


In [88]:
par1 = process_paragraph(mod_section_paragraphs[6])
print(par1)

We analyzed 117 BXDs with a total of 65,552 new germline mutations. By examining mutation data based on 1-mer nucleotide context, we identified a locus on chromosome 4 linked to mutation spectrum diversity. This locus showed a maximum adjusted cosine distance of 1.20e-2 at marker ID 'rs27509845', located at 118.28 Mbp in GRCm38/mm10 coordinates, with a 90% bootstrap confidence interval ranging from 114.79 to 118.75 Mbp (Figure 1a).


In [89]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)
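
Because the pairs are matched by index after visual inspection, a rough similarity score could serve as a sanity check that `par0` and `par1` really correspond. The sketch below uses the standard library's `difflib.SequenceMatcher`. The function name `paragraph_similarity` is hypothetical, and no particular threshold is implied by this notebook.

```python
from difflib import SequenceMatcher


def paragraph_similarity(par0: str, par1: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, par0, par1).ratio()


# Short toy strings; real paragraphs would come from par0 / par1.
same = paragraph_similarity("We analyzed 117 BXDs.", "We analyzed 117 BXDs.")
related = paragraph_similarity(
    "We applied our method to 117 BXDs.",
    "We analyzed 117 BXDs.",
)
unrelated = paragraph_similarity("We analyzed 117 BXDs.", "zzzz qqqq")
```

A heavily condensed rewrite can score low even when the match is correct, so a low score is better treated as a prompt to re-read the pair than as proof of a mismatch.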

In [90]:
display(paragraph_matches[-1])

('results',
 'We applied our aggregate mutation spectrum distance method to 117 BXDs (*Materials and Methods*) with a total of 65,552 *de novo* germline mutations [@PMID:35545679]. Using mutation data that were partitioned by 1-mer nucleotide context, we discovered a locus on chromosome 4 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}a; maximum adjusted cosine distance of 1.20e-2 at marker ID `rs27509845`; position 118.28 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 114.79 - 118.75 Mbp).',
 "We analyzed 117 BXDs with a total of 65,552 new germline mutations. By examining mutation data based on 1-mer nucleotide context, we identified a locus on chromosome 4 linked to mutation spectrum diversity. This locus showed a maximum adjusted cosine distance of 1.20e-2 at marker ID 'rs27509845', located at 118.28 Mbp in GRCm38/mm10 coordinates, with a 90% bootstrap confidence interval ranging from 114.79 to 118.75 Mbp (F

####  Paragraph 03

In [91]:
par0 = process_paragraph(orig_section_paragraphs[8])
print(par0)

Using quantitative trait locus (QTL) mapping, we previously identified a nearly-identical locus on chromosome 4 that was significantly associated with the C>A germline mutation rate in the BXDs [@PMID:35545679]. This locus overlapped 21 protein-coding genes that were annotated by the Gene Ontology as being involved in "DNA repair," but only one of those genes contained nonsynonymous differences between the two parental strains: *Mutyh*. *Mutyh* encodes a protein involved in the base-excision repair of 8-oxoguanine (8-oxoG), a DNA lesion caused by oxidative damage, and prevents the accumulation of C>A mutations [@PMID:28551381;@PMID:28127763;@PMID:17581577]. C>A germline mutation fractions are nearly 50% higher in BXDs that inherit *D* genotypes at marker ID `rs27509845` (the marker at which we observed the highest adjusted cosine distance on chromosome 4) than in those that inherit *B* genotypes (Figure @fig:spectra-comparison) [@PMID:35545679].


In [92]:
par1 = process_paragraph(mod_section_paragraphs[8])
print(par1)

In a previous study, we found a locus on chromosome 4 associated with the C>A germline mutation rate in the BXDs. This locus contains 21 genes related to DNA repair, with only one gene, *Mutyh*, showing differences between parental strains. *Mutyh* is involved in base-excision repair of 8-oxoguanine, preventing C>A mutations. BXDs with *D* genotypes at marker ID `rs27509845` have nearly 50% higher C>A germline mutation fractions compared to those with *B* genotypes (Figure @fig:spectra-comparison).


In [93]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [94]:
display(paragraph_matches[-1])

('results',
 'Using quantitative trait locus (QTL) mapping, we previously identified a nearly-identical locus on chromosome 4 that was significantly associated with the C>A germline mutation rate in the BXDs [@PMID:35545679]. This locus overlapped 21 protein-coding genes that were annotated by the Gene Ontology as being involved in "DNA repair," but only one of those genes contained nonsynonymous differences between the two parental strains: *Mutyh*. *Mutyh* encodes a protein involved in the base-excision repair of 8-oxoguanine (8-oxoG), a DNA lesion caused by oxidative damage, and prevents the accumulation of C>A mutations [@PMID:28551381;@PMID:28127763;@PMID:17581577]. C>A germline mutation fractions are nearly 50% higher in BXDs that inherit *D* genotypes at marker ID `rs27509845` (the marker at which we observed the highest adjusted cosine distance on chromosome 4) than in those that inherit *B* genotypes (Figure @fig:spectra-comparison) [@PMID:35545679].',
 'In a previous study, w

####  Paragraph 04

In [95]:
par0 = process_paragraph(orig_section_paragraphs[10])
print(par0)

After confirming that AMSD could recover the mutator locus overlapping *Mutyh*, we tested its ability to identify additional mutator loci in the BXDs. To eliminate potential confounding of the mutation spectrum landscape by the large-effect mutator locus on chromosome 4, we performed AMSD scans that were conditional on the presence of either *D* or *B* alleles at `rs27509845`. We also hypothesized that such conditioning might reveal epistatic interactions between alleles at the chromosome 4 locus and mutator alleles elsewhere in the genome. Specifically, we divided the BXDs into those with either *D* (n = 66) or *B* (n = 44) genotypes at `rs27509845` (n = 7 BXDs were heterozygous) and ran an aggregate mutation spectrum distance scan using each group separately (Figure {@fig:distance-results}b-c). We excluded the BXD68 RIL from these scans, since we previously found that BXD68 harbors a strain-private C>A mutator allele of even larger effect [@PMID:35545679].


In [96]:
par1 = process_paragraph(mod_section_paragraphs[10])
print(par1)

After confirming that AMSD identified the mutator locus overlapping *Mutyh*, we used it to detect additional mutator loci in the BXDs. To avoid interference from the mutator locus on chromosome 4, we conducted AMSD scans conditioned on the presence of either *D* or *B* alleles at `rs27509845`. This approach aimed to uncover potential epistatic interactions between alleles at the chromosome 4 locus and mutator alleles elsewhere in the genome. We divided the BXDs into two groups based on their genotypes at `rs27509845`: *D* (n = 66) and *B* (n = 44) (with 7 BXDs being heterozygous) and performed aggregate mutation spectrum distance scans on each group separately (Figure {@fig:distance-results}b-c). BXD68 RIL was excluded from these scans due to the presence of a strain-private C>A mutator allele with a larger effect that we previously identified [@PMID:35545679].


In [97]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [98]:
display(paragraph_matches[-1])

('results',
 'After confirming that AMSD could recover the mutator locus overlapping *Mutyh*, we tested its ability to identify additional mutator loci in the BXDs. To eliminate potential confounding of the mutation spectrum landscape by the large-effect mutator locus on chromosome 4, we performed AMSD scans that were conditional on the presence of either *D* or *B* alleles at `rs27509845`. We also hypothesized that such conditioning might reveal epistatic interactions between alleles at the chromosome 4 locus and mutator alleles elsewhere in the genome. Specifically, we divided the BXDs into those with either *D* (n = 66) or *B* (n = 44) genotypes at `rs27509845` (n = 7 BXDs were heterozygous) and ran an aggregate mutation spectrum distance scan using each group separately (Figure {@fig:distance-results}b-c). We excluded the BXD68 RIL from these scans, since we previously found that BXD68 harbors a strain-private C>A mutator allele of even larger effect [@PMID:35545679].',
 'After con

####  Paragraph 05

In [99]:
par0 = process_paragraph(orig_section_paragraphs[11])
print(par0)

Using the BXDs with *D* genotypes at `rs27509845`, we identified a locus on chromosome 6 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}b; maximum adjusted cosine distance of 3.69e-3 at marker `rs46276051`; position 111.27 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 95.01 - 114.02 Mbp). This signal was specific to BXDs with *D* genotypes at the `rs27509845` locus, as we did not observe any new mutator loci after performing an AMSD scan using BXDs with *B* genotypes at `rs27509845` (Figure {@fig:distance-results}c). The peak markers on chromosome 4 and 6 did not exhibit strong linkage disequilibrium ($R^2$ = 4e-5). We also performed QTL scans for the fractions of each 1-mer mutation type using the same mutation data, but none produced a genome-wide significant log-odds score at any locus (Figure {@fig:qtl-scans}; *Materials and Methods*).


In [100]:
par1 = process_paragraph(mod_section_paragraphs[11])
print(par1)

Using the BXDs with *D* genotypes at `rs27509845`, we found a locus on chromosome 6 linked to mutation spectrum variation (see Figure {@fig:distance-results}b). The maximum adjusted cosine distance was 3.69e-3 at marker `rs46276051`, located at 111.27 Mbp in GRCm38/mm10 coordinates, with a 90% bootstrap confidence interval between 95.01 - 114.02 Mbp. This association was exclusive to BXDs with *D* genotypes at `rs27509845`, as no new mutator loci were identified in BXDs with *B* genotypes at the same locus (see Figure {@fig:distance-results}c). The peak markers on chromosomes 4 and 6 showed weak linkage disequilibrium ($R^2$ = 4e-5). QTL scans for the fractions of each 1-mer mutation type did not yield any genome-wide significant log-odds scores at any locus (refer to Figure {@fig:qtl-scans} and *Materials and Methods* for details).


In [101]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [102]:
display(paragraph_matches[-1])

('results',
 'Using the BXDs with *D* genotypes at `rs27509845`, we identified a locus on chromosome 6 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}b; maximum adjusted cosine distance of 3.69e-3 at marker `rs46276051`; position 111.27 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 95.01 - 114.02 Mbp). This signal was specific to BXDs with *D* genotypes at the `rs27509845` locus, as we did not observe any new mutator loci after performing an AMSD scan using BXDs with *B* genotypes at `rs27509845` (Figure {@fig:distance-results}c). The peak markers on chromosome 4 and 6 did not exhibit strong linkage disequilibrium ($R^2$ = 4e-5). We also performed QTL scans for the fractions of each 1-mer mutation type using the same mutation data, but none produced a genome-wide significant log-odds score at any locus (Figure {@fig:qtl-scans}; *Materials and Methods*).',
 'Using the BXDs with *D* genotypes at `rs27509845`, we 

####  Paragraph 06

In [103]:
par0 = process_paragraph(orig_section_paragraphs[12])
print(par0)

We queried the region surrounding the top marker on chromosome 6 (+/- the 90% bootstrap confidence interval) and discovered 64 protein-coding genes, of which four were annotated with a Gene Ontology (GO) [@PMID:10802651;@PMID:33290552] term related to "DNA repair": *Fancd2*, *Ogg1*, *Setmar*, and *Rad18*. None of the remaining genes were annotated with a cellular function that would obviously contribute to a germline mutator phenotype; however, many of these GO annotations are imperfect and/or incomplete. Although we focus our analysis on DNA repair genes, it remains possible that other genes within the confidence interval could underlie the C>A mutator phenotype we identified in the BXDs.


In [104]:
par1 = process_paragraph(mod_section_paragraphs[12])
print(par1)

We identified 64 protein-coding genes in the region surrounding the top marker on chromosome 6, within the 90% bootstrap confidence interval. Among these genes, four (*Fancd2*, *Ogg1*, *Setmar*, and *Rad18*) were annotated with a Gene Ontology (GO) term related to "DNA repair." The remaining genes did not have annotations indicating a clear contribution to a germline mutator phenotype, but it is important to note that these annotations may be incomplete. While our analysis focused on DNA repair genes, there is a possibility that other genes in this region could be responsible for the C>A mutator phenotype observed in the BXDs.


In [105]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [106]:
display(paragraph_matches[-1])

('results',
 'We queried the region surrounding the top marker on chromosome 6 (+/- the 90% bootstrap confidence interval) and discovered 64 protein-coding genes, of which four were annotated with a Gene Ontology (GO) [@PMID:10802651;@PMID:33290552] term related to "DNA repair": *Fancd2*, *Ogg1*, *Setmar*, and *Rad18*. None of the remaining genes were annotated with a cellular function that would obviously contribute to a germline mutator phenotype; however, many of these GO annotations are imperfect and/or incomplete. Although we focus our analysis on DNA repair genes, it remains possible that other genes within the confidence interval could underlie the C>A mutator phenotype we identified in the BXDs.',
 'We identified 64 protein-coding genes in the region surrounding the top marker on chromosome 6, within the 90% bootstrap confidence interval. Among these genes, four (*Fancd2*, *Ogg1*, *Setmar*, and *Rad18*) were annotated with a Gene Ontology (GO) term related to "DNA repair." The 

####  Paragraph 07

In [107]:
par0 = process_paragraph(orig_section_paragraphs[13])
print(par0)

Of the annotated DNA repair genes within the confidence interval, two harbored nonsynonymous differences between the parental C57BL/6J and DBA/2J strains (Table @tbl:nonsyn-diffs). *Ogg1* encodes a key member of the base-excision repair response to oxidative DNA damage (a pathway that also includes *Mutyh*), and in mice *Setmar* encodes a SET domain-containing histone methyltransferase; both *Ogg1* and *Setmar* are expressed in mouse gonadal cells. Because the bootstrap can exhibit poor coverage in QTL mapping studies [@PMID:16783000], we also scanned an interval +/- 5 Mbp from the peak AMSD marker on chromosome 6 for additional candidate genes. Although the choice of a 10 Mbp interval is somewhat arbitrary, the interval does contain a plausible candidate: *Mbd4*, a protein-coding gene involved in base excision repair that also harbors a non-synonymous difference between the BXD parental strains (Table @tbl:nonsyn-diffs).


In [108]:
par1 = process_paragraph(mod_section_paragraphs[13])
print(par1)

Within the confidence interval of annotated DNA repair genes, two genes, *Ogg1* and *Setmar*, showed nonsynonymous differences between the C57BL/6J and DBA/2J strains (Table @tbl:nonsyn-diffs). *Ogg1* is involved in base-excision repair in response to oxidative DNA damage, along with *Mutyh*. *Setmar* encodes a histone methyltransferase with a SET domain and is expressed in mouse gonadal cells. To address potential poor coverage in QTL mapping studies, we expanded our search to an interval +/- 5 Mbp from the peak AMSD marker on chromosome 6 for additional candidate genes. In this expanded interval, we identified *Mbd4*, a protein-coding gene involved in base excision repair that also showed a non-synonymous difference between the BXD parental strains (Table @tbl:nonsyn-diffs).


In [109]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [110]:
display(paragraph_matches[-1])

('results',
 'Of the annotated DNA repair genes within the confidence interval, two harbored nonsynonymous differences between the parental C57BL/6J and DBA/2J strains (Table @tbl:nonsyn-diffs). *Ogg1* encodes a key member of the base-excision repair response to oxidative DNA damage (a pathway that also includes *Mutyh*), and in mice *Setmar* encodes a SET domain-containing histone methyltransferase; both *Ogg1* and *Setmar* are expressed in mouse gonadal cells. Because the bootstrap can exhibit poor coverage in QTL mapping studies [@PMID:16783000], we also scanned an interval +/- 5 Mbp from the peak AMSD marker on chromosome 6 for additional candidate genes. Although the choice of a 10 Mbp interval is somewhat arbitrary, the interval does contain a plausible candidate: *Mbd4*, a protein-coding gene involved in base excision repair that also harbors a non-synonymous difference between the BXD parental strains (Table @tbl:nonsyn-diffs).',
 'Within the confidence interval of annotated DN

####  Paragraph 08

In [113]:
par0 = process_paragraph(orig_section_paragraphs[16])
print(par0)

We also considered the possibility that expression quantitative trait loci (eQTLs), rather than nonsynonymous mutations, could contribute to the C>A mutator phenotype associated with the locus on chromosome 6. Using GeneNetwork [@PMID:27933521] we mapped eQTLs for the five aforementioned DNA repair genes (as well as *Mbd4*) in a number of tissues, though we did not have access to expression data from germline cells. Notably, *D* alleles near the cosine distance peak on chromosome 6 were significantly associated with decreased *Ogg1* expression in kidney, liver, hippocampus, and gastrointestinal tissues (Table @tbl:eqtl-results). Although these cis-eQTLs are challenging to interpret (given their tissue specificity and our lack of access to germline expression data), the presence of strong-effect cis-eQTLs for *Ogg1* suggests that the C>A mutator phenotype observed in the BXDs may be mediated by regulatory, rather than protein-altering, variants.


In [114]:
par1 = process_paragraph(mod_section_paragraphs[16])
print(par1)

We investigated whether expression quantitative trait loci (eQTLs) could be contributing to the C>A mutator phenotype linked to the locus on chromosome 6. Through GeneNetwork, we identified eQTLs for the five DNA repair genes mentioned, including *Mbd4*, in various tissues. Notably, *D* alleles near the peak of cosine distance on chromosome 6 showed a significant association with reduced *Ogg1* expression in kidney, liver, hippocampus, and gastrointestinal tissues (see Table 1 for eQTL results). These cis-eQTLs, while challenging to interpret due to tissue specificity and lack of germline expression data, suggest that the C>A mutator phenotype in BXDs may be influenced by regulatory rather than protein-altering variants.


In [115]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [116]:
display(paragraph_matches[-1])

('results',
 'We also considered the possibility that expression quantitative trait loci (eQTLs), rather than nonsynonymous mutations, could contribute to the C>A mutator phenotype associated with the locus on chromosome 6. Using GeneNetwork [@PMID:27933521] we mapped eQTLs for the five aforementioned DNA repair genes (as well as *Mbd4*) in a number of tissues, though we did not have access to expression data from germline cells. Notably, *D* alleles near the cosine distance peak on chromosome 6 were significantly associated with decreased *Ogg1* expression in kidney, liver, hippocampus, and gastrointestinal tissues (Table @tbl:eqtl-results). Although these cis-eQTLs are challenging to interpret (given their tissue specificity and our lack of access to germline expression data), the presence of strong-effect cis-eQTLs for *Ogg1* suggests that the C>A mutator phenotype observed in the BXDs may be mediated by regulatory, rather than protein-altering, variants.',
 'We investigated whether

####  Paragraph 09

In [121]:
par0 = process_paragraph(orig_section_paragraphs[21])
print(par0)

Next, we more precisely characterized the effects of the chromosome 4 and 6 mutator alleles on mutation spectra in the BXDs. To pinpoint the mutation type(s) that underlied the significant cosine distance peak on chromosome 6, we compared the aggregate counts of each 1-mer mutation type (plus CpG>TpG) on BXD haplotypes with *D* genotypes at `rs27509845` and either *D* or *B* genotypes at `rs46276051`. We found that C>A mutations were significantly enriched on BXD haplotypes with *D* genotypes at the chromosome 6 mutator locus, relative to those with *B* genotypes ($\chi^2$ statistic = 85.36, p = 2.48e-20). On average, C>A germline mutation fractions were significantly higher in BXDs with *D* alleles at both mutator loci than in BXDs with *D* alleles at either locus alone (Figure {@fig:spectra-comparison}a and @fig:spectra-comparison-all). Among BXDs with *B* alleles at the locus overlapping *Mutyh*, those with *D* alleles on chromosome 6 did not exhibit significantly elevated C>A mutat

In [122]:
par1 = process_paragraph(mod_section_paragraphs[21])
print(par1)

We further analyzed the effects of chromosome 4 and 6 mutator alleles on mutation spectra in the BXDs. By comparing the counts of each 1-mer mutation type (including CpG>TpG) on BXD haplotypes with *D* genotypes at `rs27509845` and either *D* or *B* genotypes at `rs46276051`, we identified a significant enrichment of C>A mutations on BXD haplotypes with *D* genotypes at the chromosome 6 mutator locus compared to those with *B* genotypes (χ² statistic = 85.36, p = 2.48e-20). BXDs with *D* alleles at both mutator loci had higher C>A germline mutation fractions compared to those with *D* alleles at either locus alone. Conversely, BXDs with *B* alleles at the *Mutyh* locus did not show elevated C>A mutation fractions, even with *D* alleles on chromosome 6. When considering inbreeding duration, BXDs with *D* alleles at both mutator loci consistently had the highest C>A *de novo* mutation counts. After 100 generations of inbreeding, BXDs with *D* alleles at both mutator loci were predicted t

In [123]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [124]:
display(paragraph_matches[-1])

('results',
 'Next, we more precisely characterized the effects of the chromosome 4 and 6 mutator alleles on mutation spectra in the BXDs. To pinpoint the mutation type(s) that underlied the significant cosine distance peak on chromosome 6, we compared the aggregate counts of each 1-mer mutation type (plus CpG>TpG) on BXD haplotypes with *D* genotypes at `rs27509845` and either *D* or *B* genotypes at `rs46276051`. We found that C>A mutations were significantly enriched on BXD haplotypes with *D* genotypes at the chromosome 6 mutator locus, relative to those with *B* genotypes ($\\chi^2$ statistic = 85.36, p = 2.48e-20). On average, C>A germline mutation fractions were significantly higher in BXDs with *D* alleles at both mutator loci than in BXDs with *D* alleles at either locus alone (Figure {@fig:spectra-comparison}a and @fig:spectra-comparison-all). Among BXDs with *B* alleles at the locus overlapping *Mutyh*, those with *D* alleles on chromosome 6 did not exhibit significantly ele

####  Paragraph 10

In [125]:
par0 = process_paragraph(orig_section_paragraphs[22])
print(par0)

We also used SigProfilerExtractor [@PMID:36388765] to assign the germline mutations in each BXD to single-base substitution (SBS) mutation signatures from the COSMIC catalog [@PMID:30371878]. Mutation signatures often reflect specific exogenous or endogenous sources of DNA damage, and the proportions of mutations attributable to particular SBS signatures can suggest a genetic or environmental etiology. The SBS1, SBS5, and SBS30 mutation signatures were active in nearly all BXDs, regardless of genotypes at the chromosome 4 and 6 mutator loci (Figure {@fig:spectra-comparison}c). However, the SBS18 signature, which is dominated by C>A mutations and likely reflects unrepaired DNA damage from reactive oxygen species, was almost exclusively active in mice with *D* alleles at the chromosome 4 locus; the highest SBS18 activity was observed in mice with *D* alleles at both mutator loci (Figure {@fig:spectra-comparison}c). SBS18 activity was lowest in mice with *D* alleles at the chromosome 6 mu

In [126]:
par1 = process_paragraph(mod_section_paragraphs[22])
print(par1)

We utilized SigProfilerExtractor to categorize germline mutations in each BXD strain into single-base substitution (SBS) mutation signatures from the COSMIC catalog. Mutation signatures can indicate specific sources of DNA damage, and the distribution of mutations linked to particular SBS signatures may suggest a genetic or environmental cause. SBS1, SBS5, and SBS30 mutation signatures were present in nearly all BXD strains, regardless of their genotypes at the chromosome 4 and 6 mutator loci (Figure 1c). However, the SBS18 signature, characterized by C>A mutations associated with unrepaired DNA damage from reactive oxygen species, was predominantly active in mice with *D* alleles at the chromosome 4 locus; the highest SBS18 activity was observed in mice with *D* alleles at both mutator loci (Figure 1c). Conversely, SBS18 activity was lowest in mice with *D* alleles at the chromosome 6 mutator locus alone (Figure 1c), indicating that *D* alleles at this locus alone are insufficient to 

In [127]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [128]:
display(paragraph_matches[-1])

('results',
 'We also used SigProfilerExtractor [@PMID:36388765] to assign the germline mutations in each BXD to single-base substitution (SBS) mutation signatures from the COSMIC catalog [@PMID:30371878]. Mutation signatures often reflect specific exogenous or endogenous sources of DNA damage, and the proportions of mutations attributable to particular SBS signatures can suggest a genetic or environmental etiology. The SBS1, SBS5, and SBS30 mutation signatures were active in nearly all BXDs, regardless of genotypes at the chromosome 4 and 6 mutator loci (Figure {@fig:spectra-comparison}c). However, the SBS18 signature, which is dominated by C>A mutations and likely reflects unrepaired DNA damage from reactive oxygen species, was almost exclusively active in mice with *D* alleles at the chromosome 4 locus; the highest SBS18 activity was observed in mice with *D* alleles at both mutator loci (Figure {@fig:spectra-comparison}c). SBS18 activity was lowest in mice with *D* alleles at the c

####  Paragraph 11

In [129]:
par0 = process_paragraph(orig_section_paragraphs[23])
print(par0)

To more formally test for statistical epistasis, we fit a generalized (Poisson) linear model predicting counts of C>A mutations in each BXD as a function of genotypes at `rs27509845` and `rs46276051` (the markers with the largest adjusted cosine distance at the two mutator loci); the model also accounted for differences in inbreeding duration and sequencing coverage between the BXDs (*Materials and Methods*). A model that included an interaction term between genotypes at the two markers fit the data significantly better than a model including only additive effects (p = 7.92e-7; *Materials and Methods*), indicating that the combined effects of *D* genotypes at both loci exceeded the sum of marginal effects of *D* genotypes at either locus alone.


In [130]:
par1 = process_paragraph(mod_section_paragraphs[23])
print(par1)

To formally test for statistical epistasis, we used a generalized linear model to predict C>A mutation counts in each BXD based on genotypes at `rs27509845` and `rs46276051`, the markers with the highest adjusted cosine distance at the mutator loci. The model considered inbreeding duration, sequencing coverage, and included an interaction term between genotypes at the two markers. This model significantly outperformed a model with only additive effects (p = 7.92e-7), indicating that the combined effects of *D* genotypes at both loci were greater than the sum of the marginal effects of *D* genotypes at each locus individually.


In [131]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [132]:
display(paragraph_matches[-1])

('results',
 'To more formally test for statistical epistasis, we fit a generalized (Poisson) linear model predicting counts of C>A mutations in each BXD as a function of genotypes at `rs27509845` and `rs46276051` (the markers with the largest adjusted cosine distance at the two mutator loci); the model also accounted for differences in inbreeding duration and sequencing coverage between the BXDs (*Materials and Methods*). A model that included an interaction term between genotypes at the two markers fit the data significantly better than a model including only additive effects (p = 7.92e-7; *Materials and Methods*), indicating that the combined effects of *D* genotypes at both loci exceeded the sum of marginal effects of *D* genotypes at either locus alone.',
 'To formally test for statistical epistasis, we used a generalized linear model to predict C>A mutation counts in each BXD based on genotypes at `rs27509845` and `rs46276051`, the markers with the highest adjusted cosine distanc

#### Paragraph 12

In [134]:
par0 = process_paragraph(orig_section_paragraphs[25])
print(par0)

To explore the effects of the two mutator loci in other inbred laboratory mice, we also compared the germline mutation spectra of Sanger Mouse Genomes Project (MGP) strains [@PMID:21921910]. Dumont [@PMID:30753674] previously identified germline mutations that were private to each of the 29 MGP strains; these private variants likely represent recent *de novo* mutations (Figure {@fig:spectra-comparison-mgp}). Only two of the MGP strains possess *D* genotypes at both the chromosome 4 and chromosome 6 mutator loci: DBA/1J and DBA/2J. As before, we tested for epistasis in the MGP strains by fitting two linear models predicting C>A mutation counts as a function of genotypes at the two mutator loci. A model incorporating an interaction term did not fit the MGP data significantly better than a model with additive effects alone (p = 0.806), so we are unable to confirm the signal of epistasis; however, this may be due to the smaller number of MGP strains with *de novo* germline mutation data.


In [135]:
par1 = process_paragraph(mod_section_paragraphs[25])
print(par1)

To investigate the impact of mutator loci in other laboratory mice, we analyzed the germline mutation patterns of Sanger Mouse Genomes Project (MGP) strains. Dumont previously identified private germline mutations in 29 MGP strains, which likely represent recent spontaneous mutations. Among these strains, only DBA/1J and DBA/2J have mutator genotypes at both chromosome 4 and chromosome 6 loci. We tested for epistasis by analyzing C>A mutation counts based on genotypes at the two mutator loci. Our analysis did not show a significant improvement when adding an interaction term, indicating no clear evidence of epistasis in the MGP data. This lack of confirmation may be attributed to the limited number of MGP strains with spontaneous germline mutation data.


In [136]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [137]:
display(paragraph_matches[-1])

('results',
 'To explore the effects of the two mutator loci in other inbred laboratory mice, we also compared the germline mutation spectra of Sanger Mouse Genomes Project (MGP) strains [@PMID:21921910]. Dumont [@PMID:30753674] previously identified germline mutations that were private to each of the 29 MGP strains; these private variants likely represent recent *de novo* mutations (Figure {@fig:spectra-comparison-mgp}). Only two of the MGP strains possess *D* genotypes at both the chromosome 4 and chromosome 6 mutator loci: DBA/1J and DBA/2J. As before, we tested for epistasis in the MGP strains by fitting two linear models predicting C>A mutation counts as a function of genotypes at the two mutator loci. A model incorporating an interaction term did not fit the MGP data significantly better than a model with additive effects alone (p = 0.806), so we are unable to confirm the signal of epistasis; however, this may be due to the smaller number of MGP strains with *de novo* germline mu

#### Paragraph 13

In [138]:
par0 = process_paragraph(orig_section_paragraphs[27])
print(par0)

To determine whether the candidate mutator alleles on chromosome 6 were segregating in natural populations, we queried previously published sequencing data generated from 67 wild-derived mice [@PMID:27622383]. These data include three subspecies of *Mus musculus*, as well as the outgroup *Mus spretus*. We found that the *Ogg1* *D* allele was segregating at an allele frequency of 0.259 in *Mus musculus domesticus*, the species from which C57BL/6J and DBA/2J derive the majority of their genomes [@PMID:17660819], and was fixed in *Mus musculus musculus*, *Mus musculus castaneus*, and the outgroup *Mus spretus* (Figure @fig:wild-afs). The *Setmar* p.Ser273Arg *D* allele was also present at an allele frequency of 0.37 in *Mus musculus domesticus*, while *D* alleles at the *Setmar* p.Leu103Phe variant were not observed in any wild *Mus musculus domesticus* animals. *D* alleles at the *Mbd4* p.Asp129Asn variant were also absent from all wild mouse populations (Figure @fig:wild-afs).


In [139]:
par1 = process_paragraph(mod_section_paragraphs[27])
print(par1)

To assess the presence of candidate mutator alleles on chromosome 6 in natural populations, we analyzed previously published sequencing data from 67 wild-derived mice. These data encompassed three subspecies of *Mus musculus* and the outgroup *Mus spretus*. The *Ogg1* *D* allele was found to be present at an allele frequency of 0.259 in *Mus musculus domesticus* and fixed in *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus* (Figure 1). Additionally, the *Setmar* p.Ser273Arg *D* allele had an allele frequency of 0.37 in *Mus musculus domesticus*, while *D* alleles at the *Setmar* p.Leu103Phe variant were not detected in any wild *Mus musculus domesticus* individuals. Furthermore, *D* alleles at the *Mbd4* p.Asp129Asn variant were absent in all wild mouse populations (Figure 1).


In [140]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [141]:
display(paragraph_matches[-1])

('results',
 'To determine whether the candidate mutator alleles on chromosome 6 were segregating in natural populations, we queried previously published sequencing data generated from 67 wild-derived mice [@PMID:27622383]. These data include three subspecies of *Mus musculus*, as well as the outgroup *Mus spretus*. We found that the *Ogg1* *D* allele was segregating at an allele frequency of 0.259 in *Mus musculus domesticus*, the species from which C57BL/6J and DBA/2J derive the majority of their genomes [@PMID:17660819], and was fixed in *Mus musculus musculus*, *Mus musculus castaneus*, and the outgroup *Mus spretus* (Figure @fig:wild-afs). The *Setmar* p.Ser273Arg *D* allele was also present at an allele frequency of 0.37 in *Mus musculus domesticus*, while *D* alleles at the *Setmar* p.Leu103Phe variant were not observed in any wild *Mus musculus domesticus* animals. *D* alleles at the *Mbd4* p.Asp129Asn variant were also absent from all wild mouse populations (Figure @fig:wild-a

## Discussion

In [142]:
section_name = "discussion"

In [143]:
pr_filename = pr_files[3].filename
assert section_name in pr_filename
print(pr_filename)

content/04.discussion.md


### Original

In [144]:
# get the original section content (the file as of the `pr_prev` ref)
orig_section_content = repo.get_contents(pr_filename, pr_prev).decoded_content.decode(
    "utf-8"
)
print(orig_section_content[:50])

## Discussion

### Epistasis between germline muta


In [145]:
# split into paragraphs on blank lines; headings end up as their own entries
orig_section_paragraphs = orig_section_content.split("\n\n")
display(len(orig_section_paragraphs))

31

### Modified

In [146]:
# get the modified section content (the file as of the `pr_curr` ref)
mod_section_content = repo.get_contents(pr_filename, pr_curr).decoded_content.decode(
    "utf-8"
)
print(mod_section_content[:50])

## Discussion

### Epistasis between germline muta


In [147]:
# split into paragraphs on blank lines; headings end up as their own entries
mod_section_paragraphs = mod_section_content.split("\n\n")
display(len(mod_section_paragraphs))

31
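
Both versions split into 31 chunks, which is what makes index-aligned matching plausible here. The sketch below is not part of the notebook: it shows, on toy strings, how splitting on blank lines leaves headings as separate entries and how candidate index pairs could be listed before picking them by hand. The helper names (`split_paragraphs`, `is_heading`, `candidate_pairs`) are hypothetical, and the heading test is an assumption; indices skipped in the notebook may also hold figures, tables, or paragraphs left unmatched on purpose.

```python
def split_paragraphs(markdown_text: str) -> list[str]:
    """Split Markdown text into chunks separated by one blank line."""
    return markdown_text.split("\n\n")


def is_heading(chunk: str) -> bool:
    """Return True for chunks that look like Markdown headings."""
    return chunk.lstrip().startswith("#")


def candidate_pairs(orig_chunks, mod_chunks):
    """Yield (index, original, modified) for index-aligned prose chunks."""
    if len(orig_chunks) != len(mod_chunks):
        raise ValueError("chunk counts differ; index alignment is unsafe")
    for idx, (orig, mod) in enumerate(zip(orig_chunks, mod_chunks)):
        # Skip empty chunks and headings; only prose is a match candidate.
        if orig.strip() and not is_heading(orig) and not is_heading(mod):
            yield idx, orig, mod


# Toy data standing in for the decoded section files (hypothetical text).
toy_orig = "## Discussion\n\n### Sub\n\nFirst original paragraph.\n\nSecond original paragraph."
toy_mod = "## Discussion\n\n### Sub\n\nFirst revised paragraph.\n\nSecond revised paragraph."

pairs = list(candidate_pairs(split_paragraphs(toy_orig), split_paragraphs(toy_mod)))
for idx, orig, mod in pairs:
    print(idx, "|", orig, "->", mod)
```

The length check matters because a revision that merges or splits paragraphs would shift every later index, so identical counts are a precondition for this shortcut rather than a guarantee that each pair truly corresponds.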

### Match

In [148]:
orig_section_paragraphs[0]

'## Discussion'

In [149]:
mod_section_paragraphs[0]

'## Discussion'

#### Paragraph 00

In [151]:
par0 = process_paragraph(orig_section_paragraphs[2])
print(par0)

We have identified a locus on chromosome 6 that amplifies a C>A germline mutator phenotype in the BXDs, a family of inbred mice derived from the laboratory strains DBA/2J and C57BL/6J. DBA/2J (*D*) alleles at this locus have no significant effect on C>A mutation rates in mice that also harbor "wild-type" C57BL/6J (*B*) alleles at a previously discovered mutator locus on chromosome 4 [@PMID:35545679]. However, mice with *D* alleles at *both* loci have even higher mutation rates than those with *D* alleles at the chromosome 4 mutator locus alone (Figure @fig:spectra-comparison). Epistatic interactions between mutator alleles have been previously documented in yeast [@PMID:16492773] and in human cell lines [@PMID:35859169], but never to our knowledge in a whole-animal context.


In [152]:
par1 = process_paragraph(mod_section_paragraphs[2])
print(par1)

We discovered a gene on chromosome 6 that enhances a specific type of genetic mutation in the BXDs, a group of laboratory mice bred from the DBA/2J and C57BL/6J strains. The DBA/2J alleles at this gene do not impact the mutation rates in mice with "normal" C57BL/6J alleles at another known mutation gene on chromosome 4. However, mice with DBA/2J alleles at both genes show even higher mutation rates compared to those with DBA/2J alleles at only the chromosome 4 gene (Figure 1). While interactions between mutation genes have been observed in yeast and human cell lines, this is the first instance, to our knowledge, in a whole-animal system.


In [153]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [154]:
display(paragraph_matches[-1])

('discussion',
 'We have identified a locus on chromosome 6 that amplifies a C>A germline mutator phenotype in the BXDs, a family of inbred mice derived from the laboratory strains DBA/2J and C57BL/6J. DBA/2J (*D*) alleles at this locus have no significant effect on C>A mutation rates in mice that also harbor "wild-type" C57BL/6J (*B*) alleles at a previously discovered mutator locus on chromosome 4 [@PMID:35545679]. However, mice with *D* alleles at *both* loci have even higher mutation rates than those with *D* alleles at the chromosome 4 mutator locus alone (Figure @fig:spectra-comparison). Epistatic interactions between mutator alleles have been previously documented in yeast [@PMID:16492773] and in human cell lines [@PMID:35859169], but never to our knowledge in a whole-animal context.',
 'We discovered a gene on chromosome 6 that enhances a specific type of genetic mutation in the BXDs, a group of laboratory mice bred from the DBA/2J and C57BL/6J strains. The DBA/2J alleles at th
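
The process/append/display cells repeat for every paragraph with only the index changing. As a rough sketch (not the notebook's actual code), the pattern could be folded into one helper. `add_match` is a hypothetical name, and `clean` is a stand-in for the project's `process_paragraph`, whose behavior is not shown here; `str.strip` is used only so the example runs on its own.

```python
from typing import Callable


def add_match(
    matches: list,
    section: str,
    orig_chunks: list,
    mod_chunks: list,
    idx: int,
    clean: Callable[[str], str] = str.strip,  # assumption: stand-in for process_paragraph
):
    """Append (section, original, modified) for one shared index and return it."""
    match = (section, clean(orig_chunks[idx]), clean(mod_chunks[idx]))
    matches.append(match)
    return match


# Toy data (hypothetical), mirroring the shape of the notebook's lists.
demo_matches = []
demo_orig = ["## Discussion", " Original text. "]
demo_mod = ["## Discussion", " Revised text. "]

last = add_match(demo_matches, "discussion", demo_orig, demo_mod, 1)
print(last)
```

In the notebook this would be called with `paragraph_matches`, `section_name`, the two paragraph lists, and `clean=process_paragraph`. Keeping the separate cells has its own merit, though: printing `par0` and `par1` before appending lets each pair be checked by eye.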

#### Paragraph 01

In [156]:
par0 = process_paragraph(orig_section_paragraphs[3])
print(par0)

Importantly, we discovered epistasis between germline mutator alleles in an unnatural population of model organisms that have been inbred by brother-sister mating in a highly controlled laboratory environment [@PMID:33472028]. This breeding setup has likely attenuated the effects of natural selection on all but the most deleterious alleles [@doi:10.1146/annurev.ecolsys.39.110707.173437], and may have facilitated the fixation of large-effect mutator alleles that would be less common in wild mice. Without fine-mapping the chromosome 6 mutator allele, however, we are unable to trace its origin to either a captive breeding colony of laboratory mice or a wild, outbreeding *Mus musculus* population. If the mutator allele on chromosome 6 has even a weak deleterious fitness, there might be a greater likelihood that it arose in captivity. Indeed, if purifying selection is required to keep mutation rates low, mutational pressure might cause mutation rates to rise in just a few generations of rel

In [157]:
par1 = process_paragraph(mod_section_paragraphs[3])
print(par1)

We found interactions between mutator alleles in a controlled laboratory setting using model organisms [@PMID:33472028]. The breeding conditions likely reduced the impact of natural selection on harmful alleles [@doi:10.1146/annurev.ecolsys.39.110707.173437], potentially leading to the fixation of high-impact mutator alleles not commonly found in the wild. Without pinpointing the origin of the chromosome 6 mutator allele, we cannot determine if it arose in a laboratory colony or a wild *Mus musculus* population. If the chromosome 6 mutator allele has even a slight negative impact on fitness, it may have originated in captivity. The relaxation of selection pressure could quickly increase mutation rates, as seen in a recent discovery in a rhesus macaque research colony [@doi:10.1101/2023.03.27.534460] and in domesticated animals with higher mutation rates compared to wild counterparts [@PMID:36859541]. While we have not definitively identified the causal variant for the chromosome 6 muta

In [158]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [159]:
display(paragraph_matches[-1])

('discussion',
 'Importantly, we discovered epistasis between germline mutator alleles in an unnatural population of model organisms that have been inbred by brother-sister mating in a highly controlled laboratory environment [@PMID:33472028]. This breeding setup has likely attenuated the effects of natural selection on all but the most deleterious alleles [@doi:10.1146/annurev.ecolsys.39.110707.173437], and may have facilitated the fixation of large-effect mutator alleles that would be less common in wild mice. Without fine-mapping the chromosome 6 mutator allele, however, we are unable to trace its origin to either a captive breeding colony of laboratory mice or a wild, outbreeding *Mus musculus* population. If the mutator allele on chromosome 6 has even a weak deleterious fitness, there might be a greater likelihood that it arose in captivity. Indeed, if purifying selection is required to keep mutation rates low, mutational pressure might cause mutation rates to rise in just a few g

#### Paragraph 02

In [161]:
par0 = process_paragraph(orig_section_paragraphs[5])
print(par0)

Five protein-coding genes involved in DNA repair overlap the C>A mutator locus on chromosome 6: *Ogg1*, a glycosylase that excises the oxidative DNA lesion 8-oxoguanine (8-oxoG) [@PMID:17581577], *Setmar*, a histone methyltransferase involved in non-homologous end joining (NHEJ) of double-stranded breaks (DSBs) [@PMID:21187428;@PMID:16332963], *Fancd2*, and *Rad18*. One other DNA repair gene, *Mbd4*, lies just outside of the 90% bootstrap confidence interval on chromosome 6 (but within a 10 Mbp interval around the peak AMSD marker). We are unable to conclusively determine that one or more of these genes harbors a causal variant underlying the observed C>A mutator phenotype, but we believe that *Ogg1* is the most plausible candidate. *Ogg1* is a member of the same base-excision repair pathway as *Mutyh* (the gene that likely underlies the chromosome 4 mutator locus), contains a nonsynonymous fixed difference between the C57BL/6J and DBA/2J parental strains, and appears to be regulated b

In [162]:
par1 = process_paragraph(mod_section_paragraphs[5])
print(par1)

Five genes involved in DNA repair overlap with the C>A mutator locus on chromosome 6. These genes are Ogg1, Setmar, Fancd2, Rad18, and Mbd4. Ogg1 is a glycosylase that removes the oxidative DNA lesion 8-oxoguanine. Setmar is a histone methyltransferase involved in repairing double-stranded breaks. Fancd2 and Rad18 are also involved in DNA repair processes. Mbd4 is located just outside the confidence interval on chromosome 6. While we cannot definitively identify which gene is responsible for the C>A mutator phenotype, Ogg1 is considered the most likely candidate. Ogg1 is in the same repair pathway as Mutyh, which is associated with the chromosome 4 mutator locus. Additionally, Ogg1 has genetic differences between the C57BL/6J and DBA/2J strains and is regulated by cis-eQTLs in various tissues within the BXD cohort.


In [163]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [164]:
display(paragraph_matches[-1])

('discussion',
 'Five protein-coding genes involved in DNA repair overlap the C>A mutator locus on chromosome 6: *Ogg1*, a glycosylase that excises the oxidative DNA lesion 8-oxoguanine (8-oxoG) [@PMID:17581577], *Setmar*, a histone methyltransferase involved in non-homologous end joining (NHEJ) of double-stranded breaks (DSBs) [@PMID:21187428;@PMID:16332963], *Fancd2*, and *Rad18*. One other DNA repair gene, *Mbd4*, lies just outside of the 90% bootstrap confidence interval on chromosome 6 (but within a 10 Mbp interval around the peak AMSD marker). We are unable to conclusively determine that one or more of these genes harbors a causal variant underlying the observed C>A mutator phenotype, but we believe that *Ogg1* is the most plausible candidate. *Ogg1* is a member of the same base-excision repair pathway as *Mutyh* (the gene that likely underlies the chromosome 4 mutator locus), contains a nonsynonymous fixed difference between the C57BL/6J and DBA/2J parental strains, and appears 

#### Paragraph 03

In [165]:
par0 = process_paragraph(orig_section_paragraphs[6])
print(par0)

The C57BL/6J and DBA/2J *Setmar* coding sequences differ by two missense variants (Table @tbl:nonsyn-diffs), one of which is predicted to be deleterious by *in silico* tools. The primate *SETMAR* ortholog is involved in NHEJ of double-strand breaks, but its role in DNA repair appears to depend on the function of both a SET methyltransferase domain and a *Mariner*-family transposase domain [@PMID:16332963;@PMID:24573677;@PMID:21491884]. Since the murine *Setmar* ortholog lacks the latter element, and because primate *SETMAR* is involved in a DNA repair process that is not expected to affect the rate of C>A mutations, we believe it is unlikely to underlie the epistatic interaction between the chromosome 4 and 6 mutator loci in the BXDs (*Supplementary Information*). Moreover, we did not observe any significant cis-eQTLs for *Setmar* across a variety of tissues in the BXD cohort (Table @tbl:eqtl-results). None of the remaining DNA repair genes (*Fancd2* or *Rad18*) contains a nonsynonymou

In [166]:
par1 = process_paragraph(mod_section_paragraphs[6])
print(par1)

The coding sequences of *Setmar* in C57BL/6J and DBA/2J mice have two missense variants, one of which is predicted to be harmful according to computational tools. The primate version of *SETMAR* plays a role in repairing double-strand breaks in DNA through a combination of a SET methyltransferase domain and a *Mariner*-family transposase domain. However, the murine version of *Setmar* lacks the transposase domain, suggesting it may not be responsible for the interaction between mutator loci on chromosomes 4 and 6 in BXD mice. We also found no significant genetic variations in other DNA repair genes like *Fancd2* or *Rad18* that could explain the increased mutation rate. In fact, an analysis of gene expression in different tissues showed that *D* alleles actually led to higher expression of *Fancd2* in gastrointestinal tissue.


In [167]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [168]:
display(paragraph_matches[-1])

('discussion',
 'The C57BL/6J and DBA/2J *Setmar* coding sequences differ by two missense variants (Table @tbl:nonsyn-diffs), one of which is predicted to be deleterious by *in silico* tools. The primate *SETMAR* ortholog is involved in NHEJ of double-strand breaks, but its role in DNA repair appears to depend on the function of both a SET methyltransferase domain and a *Mariner*-family transposase domain [@PMID:16332963;@PMID:24573677;@PMID:21491884]. Since the murine *Setmar* ortholog lacks the latter element, and because primate *SETMAR* is involved in a DNA repair process that is not expected to affect the rate of C>A mutations, we believe it is unlikely to underlie the epistatic interaction between the chromosome 4 and 6 mutator loci in the BXDs (*Supplementary Information*). Moreover, we did not observe any significant cis-eQTLs for *Setmar* across a variety of tissues in the BXD cohort (Table @tbl:eqtl-results). None of the remaining DNA repair genes (*Fancd2* or *Rad18*) contai

#### Paragraph 04

In [171]:
par0 = process_paragraph(orig_section_paragraphs[9])
print(par0)

*Ogg1* is a member of the same base-excision repair (BER) pathway as *Mutyh*, the protein-coding gene we previously implicated as harboring mutator alleles at the locus on chromosome 4 [@PMID:17581577]. Each of these genes has a distinct role in the BER response to oxidative DNA damage, and thereby the prevention of C>A mutations [@PMID:28963982;@PMID:24732879]. Following damage by reactive oxygen species, *Ogg1* is able to recognize and remove 8-oxoguanine lesions that are base-paired with cytosine nucleotides; once 8-oxoG is excised, other members of the BER pathway are mobilized to restore a proper G:C base pair at the site. If an 8-oxoG lesion is not removed before the cell enters S-phase, adenine can be mis-incorporated opposite 8-oxoG during DNA replication [@PMID:28963982]. If this occurs, *Mutyh* can excise the mispaired adenine, leaving a one-nucleotide gap that is processed and filled with a cytosine by other BER proteins. The resulting C:8-oxoG base pair can then be "returne

In [172]:
par1 = process_paragraph(mod_section_paragraphs[9])
print(par1)

*Ogg1* and *Mutyh* are both involved in the base-excision repair (BER) pathway, which responds to oxidative DNA damage and prevents C>A mutations. *Ogg1* is responsible for recognizing and removing 8-oxoguanine lesions that are paired with cytosine nucleotides. If these lesions are not removed before DNA replication, adenine may be incorrectly inserted opposite 8-oxoG. In such cases, *Mutyh* can remove the mispaired adenine, leading to the formation of a C:8-oxoG base pair that is repaired by other BER proteins. Defects in this repair process result in elevated C>A mutation rates. For instance, mice lacking *Ogg1*, *Mutyh*, and *Mth1* accumulate excess 8-oxoG in their gonadal cells, with almost all germline mutations being C>A transversions. Mutations and loss-of-heterozygosity in *Ogg1* have been linked to an increased cancer risk in humans. Furthermore, loss of *Ogg1* or *Mutyh* in human neuroblastoma is associated with higher rates of spontaneous C>A mutations.


In [173]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [174]:
display(paragraph_matches[-1])

('discussion',
 '*Ogg1* is a member of the same base-excision repair (BER) pathway as *Mutyh*, the protein-coding gene we previously implicated as harboring mutator alleles at the locus on chromosome 4 [@PMID:17581577]. Each of these genes has a distinct role in the BER response to oxidative DNA damage, and thereby the prevention of C>A mutations [@PMID:28963982;@PMID:24732879]. Following damage by reactive oxygen species, *Ogg1* is able to recognize and remove 8-oxoguanine lesions that are base-paired with cytosine nucleotides; once 8-oxoG is excised, other members of the BER pathway are mobilized to restore a proper G:C base pair at the site. If an 8-oxoG lesion is not removed before the cell enters S-phase, adenine can be mis-incorporated opposite 8-oxoG during DNA replication [@PMID:28963982]. If this occurs, *Mutyh* can excise the mispaired adenine, leaving a one-nucleotide gap that is processed and filled with a cytosine by other BER proteins. The resulting C:8-oxoG base pair can

#### Paragraph 05

In [175]:
par0 = process_paragraph(orig_section_paragraphs[11])
print(par0)

The p.Thr95Ala *Ogg1* missense variant is not predicted to be deleterious by the *in silico* tool SIFT [@PMID:12824425], and occurs at a nucleotide that is not particularly well-conserved across mammalian species (Table @tbl:nonsyn-diffs). We also observe that the *D* allele at p.Thr95Ala is segregating at an allele frequency of approximately 26% among wild-derived *Mus musculus domesticus* animals, and is fixed in other wild populations of *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus* . Although we would expect *a priori* that *Ogg1* deficiency should lead to increased 8-oxoG accumulation and elevated C>A mutation rates, these lines of evidence suggest that p.Thr95Ala is not highly deleterious on its own, and might only exert a detectable effect on the BER gene network when *Mutyh* function is also impaired. It is also possible that *D* alleles at *Ogg1* lead to a very subtle increase in C>A mutation rates, and we are simply underpowered to detect such a small m

In [176]:
par1 = process_paragraph(mod_section_paragraphs[11])
print(par1)

The p.Thr95Ala variant of the Ogg1 gene is not expected to be harmful according to the SIFT tool, and is found at a nucleotide that is not highly conserved among mammalian species. The D allele at p.Thr95Ala has a frequency of about 26% in wild-derived Mus musculus domesticus animals, and is fixed in other wild populations of Mus musculus musculus, Mus musculus castaneus, and Mus spretus. While we would anticipate that Ogg1 deficiency would result in increased 8-oxoG accumulation and higher C>A mutation rates, these findings suggest that p.Thr95Ala may not be very harmful on its own, and may only have a noticeable impact on the BER gene network when Mutyh function is also compromised. It is also plausible that D alleles at Ogg1 could lead to a subtle increase in C>A mutation rates, but we may not have enough power to detect such a small mutation rate effect in the BXDs.


In [177]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [178]:
display(paragraph_matches[-1])

('discussion',
 'The p.Thr95Ala *Ogg1* missense variant is not predicted to be deleterious by the *in silico* tool SIFT [@PMID:12824425], and occurs at a nucleotide that is not particularly well-conserved across mammalian species (Table @tbl:nonsyn-diffs). We also observe that the *D* allele at p.Thr95Ala is segregating at an allele frequency of approximately 26% among wild-derived *Mus musculus domesticus* animals, and is fixed in other wild populations of *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus* . Although we would expect *a priori* that *Ogg1* deficiency should lead to increased 8-oxoG accumulation and elevated C>A mutation rates, these lines of evidence suggest that p.Thr95Ala is not highly deleterious on its own, and might only exert a detectable effect on the BER gene network when *Mutyh* function is also impaired. It is also possible that *D* alleles at *Ogg1* lead to a very subtle increase in C>A mutation rates, and we are simply underpowered to dete

#### Paragraph 06

In [180]:
par0 = process_paragraph(orig_section_paragraphs[13])
print(par0)

Although we argue above that *Ogg1* is likely the the best candidate gene to explain the new BXD C>A mutator phenotype, we cannot conclusively determine that the p.Thr95Ala missense mutation is a causal allele. We previously hypothesized that *Mutyh* missense mutations on *D* haplotypes were responsible for the large-effect C>A mutator phenotype we observed in the BXDs [@PMID:35545679]. However, subsequent long-read assemblies of several inbred laboratory mouse strains revealed that this mutator phenotype might be caused by a ~5 kbp mobile element insertion (MEI) within the first intron of *Mutyh* [@doi:10.1016/j.xgen.2023.100291], which is associated with significantly reduced expression of *Mutyh* in embryonic stem cells. We queried the new high-quality assemblies for evidence of mobile elements or other large structural variants (SVs) in the region surrounding the mutator locus on chromosome 6, but found no similarly compelling evidence that either SVs or MEIs might underlie the mut

In [181]:
par1 = process_paragraph(mod_section_paragraphs[13])
print(par1)

Although we suggest that *Ogg1* is a strong candidate gene for explaining the BXD C>A mutator phenotype, we cannot definitively confirm that the p.Thr95Ala missense mutation is the causal allele. Initially, we proposed that *Mutyh* missense mutations on *D* haplotypes were responsible for the significant C>A mutator phenotype in the BXDs [@PMID:35545679]. However, further analysis using long-read assemblies of various inbred laboratory mouse strains indicated that this mutator phenotype could be due to a ~5 kbp mobile element insertion (MEI) within the first intron of *Mutyh* [@doi:10.1016/j.xgen.2023.100291]. This MEI was linked to a notable decrease in *Mutyh* expression in embryonic stem cells. We also examined the new high-quality assemblies to identify any mobile elements or large structural variants (SVs) near the mutator locus on chromosome 6. However, we did not find convincing evidence suggesting that SVs or MEIs are responsible for the mutator phenotype discussed in this stud

In [182]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [183]:
display(paragraph_matches[-1])

('discussion',
 'Although we argue above that *Ogg1* is likely the the best candidate gene to explain the new BXD C>A mutator phenotype, we cannot conclusively determine that the p.Thr95Ala missense mutation is a causal allele. We previously hypothesized that *Mutyh* missense mutations on *D* haplotypes were responsible for the large-effect C>A mutator phenotype we observed in the BXDs [@PMID:35545679]. However, subsequent long-read assemblies of several inbred laboratory mouse strains revealed that this mutator phenotype might be caused by a ~5 kbp mobile element insertion (MEI) within the first intron of *Mutyh* [@doi:10.1016/j.xgen.2023.100291], which is associated with significantly reduced expression of *Mutyh* in embryonic stem cells. We queried the new high-quality assemblies for evidence of mobile elements or other large structural variants (SVs) in the region surrounding the mutator locus on chromosome 6, but found no similarly compelling evidence that either SVs or MEIs might

#### Paragraph 07

In [185]:
par0 = process_paragraph(orig_section_paragraphs[15])
print(par0)

We observed strong-effect cis-eQTLs for *Ogg1* expression across a number of tissues in the BXDs (Table @tbl:eqtl-results). In each of these tissue types, *D* genotypes were associated with decreased expression of *Ogg1*. As mentioned above, new evidence from long-read genome assemblies has demonstrated that an intronic mobile element insertion in *Mutyh* may be responsible for decreased *Mutyh* expression, and therefore higher C>A mutation rates, in BXDs with *D* haplotypes at the chromosome 4 mutator locus [@doi:10.1016/j.xgen.2023.100291]. Taken together, these results raise the exciting possibility that the mutator loci on both chromosome 4 and chromosome 6 lead to increased C>A mutation rates by lowering the expression of DNA repair genes in the same base-excision repair network.


In [186]:
par1 = process_paragraph(mod_section_paragraphs[15])
print(par1)

We found strong-effect cis-eQTLs for *Ogg1* expression in various tissues in the BXDs (Table @tbl:eqtl-results). In all these tissue types, *D* genotypes were linked to reduced *Ogg1* expression. Recent evidence from long-read genome assemblies suggests that an intronic mobile element insertion in *Mutyh* could be responsible for decreased *Mutyh* expression and, consequently, higher C>A mutation rates in BXDs with *D* haplotypes at the chromosome 4 mutator locus [@doi:10.1016/j.xgen.2023.100291]. These findings suggest that the mutator loci on both chromosome 4 and chromosome 6 may increase C>A mutation rates by reducing the expression of DNA repair genes in the same base-excision repair network.


In [187]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [188]:
display(paragraph_matches[-1])

('discussion',
 'We observed strong-effect cis-eQTLs for *Ogg1* expression across a number of tissues in the BXDs (Table @tbl:eqtl-results). In each of these tissue types, *D* genotypes were associated with decreased expression of *Ogg1*. As mentioned above, new evidence from long-read genome assemblies has demonstrated that an intronic mobile element insertion in *Mutyh* may be responsible for decreased *Mutyh* expression, and therefore higher C>A mutation rates, in BXDs with *D* haplotypes at the chromosome 4 mutator locus [@doi:10.1016/j.xgen.2023.100291]. Taken together, these results raise the exciting possibility that the mutator loci on both chromosome 4 and chromosome 6 lead to increased C>A mutation rates by lowering the expression of DNA repair genes in the same base-excision repair network.',
 'We found strong-effect cis-eQTLs for *Ogg1* expression in various tissues in the BXDs (Table @tbl:eqtl-results). In all these tissue types, *D* genotypes were linked to reduced *Ogg1*

####  Paragraph 08

In [190]:
par0 = process_paragraph(orig_section_paragraphs[18])
print(par0)

Unlike the *Ogg1* p.Thr95Ala mutation, the p.Asp129Asn variant in *Mbd4* resides within an annotated protein domain (the *Mbd4* methyl-CpG binding domain), occurs at a nucleotide and amino acid residue that are both well-conserved, and is predicted to be deleterious by SIFT [@PMID:12824425] (Table @tbl:nonsyn-diffs). A missense mutation that affects the homologous amino acid in humans (p.Asp142Gly in GRCh38/hg38) is also present on a single haplotype in the Genome Aggregation Database (gnomAD) [@PMID:32461654] and is predicted by SIFT and Polyphen [@PMID:20354512] to be "deleterious" and "probably_damaging" in human genomes, respectively.


In [191]:
par1 = process_paragraph(mod_section_paragraphs[18])
print(par1)

In contrast to the *Ogg1* p.Thr95Ala mutation, the p.Asp129Asn variant in *Mbd4* is located within a known protein domain, the *Mbd4* methyl-CpG binding domain. This variant occurs at a nucleotide and amino acid position that are highly conserved and is anticipated to be harmful according to SIFT (Table @tbl:nonsyn-diffs) [@PMID:12824425]. A similar mutation affecting the corresponding amino acid in humans (p.Asp142Gly in GRCh38/hg38) is found on a single haplotype in the Genome Aggregation Database (gnomAD) and is also forecasted by SIFT and Polyphen to be "deleterious" and "probably_damaging" in human genomes [@PMID:32461654] [@PMID:20354512].


In [192]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [193]:
display(paragraph_matches[-1])

('discussion',
 'Unlike the *Ogg1* p.Thr95Ala mutation, the p.Asp129Asn variant in *Mbd4* resides within an annotated protein domain (the *Mbd4* methyl-CpG binding domain), occurs at a nucleotide and amino acid residue that are both well-conserved, and is predicted to be deleterious by SIFT [@PMID:12824425] (Table @tbl:nonsyn-diffs). A missense mutation that affects the homologous amino acid in humans (p.Asp142Gly in GRCh38/hg38) is also present on a single haplotype in the Genome Aggregation Database (gnomAD) [@PMID:32461654] and is predicted by SIFT and Polyphen [@PMID:20354512] to be "deleterious" and "probably_damaging" in human genomes, respectively.',
 'In contrast to the *Ogg1* p.Thr95Ala mutation, the p.Asp129Asn variant in *Mbd4* is located within a known protein domain, the *Mbd4* methyl-CpG binding domain. This variant occurs at a nucleotide and amino acid position that are highly conserved and is anticipated to be harmful according to SIFT (Table @tbl:nonsyn-diffs) [@PMID:1

####  Paragraph 09

In [194]:
par0 = process_paragraph(orig_section_paragraphs[19])
print(par0)

One puzzling observation is that loss-of-function mutations in *Mbd4* are not typically associated with C>A mutator phenotypes. Instead, *Mbd4* deficiency is usually implicated in C>T mutagenesis at CpG sites, and we did not detect an excess of C>T mutations in BXDs with *D* alleles at the chromosome 6 mutator locus (Figure @fig:spectra-comparison-all). However, loss of function mutations in *Mbd4* have also been shown to exacerbate the effects of exogenous DNA damage agents. For example, mouse embryonic fibroblasts that harbor homozygous *Mbd4* knockouts fail to undergo apoptosis following treatment with a number of chemotherapeutics and mutagenic compounds [@PMID:14614141]. Most of these exogenous mutagens cause DNA damage that is normally repaired by mismatch repair (MMR) machinery, but murine intestinal cells with biallelic *Mbd4* LOF mutations also showed a reduced apoptotic response to gamma irradiation, which is repaired independently of the MMR gene *Mlh1* [@PMID:14562041]. Hom

In [195]:
par1 = process_paragraph(mod_section_paragraphs[19])
print(par1)

One interesting finding is that mutations in *Mbd4* that result in loss of function do not typically cause an increase in C>A mutations. Instead, the deficiency of *Mbd4* is usually linked to an increase in C>T mutations at CpG sites. In our study, we did not observe a higher frequency of C>T mutations in BXDs with *D* alleles at the chromosome 6 mutator locus (see Figure @fig:spectra-comparison-all). However, research has shown that loss of function mutations in *Mbd4* can worsen the effects of external DNA damage agents. For instance, studies have found that mouse embryonic fibroblasts with homozygous *Mbd4* knockouts do not undergo apoptosis after exposure to various chemotherapeutic drugs and mutagenic substances. These external mutagens typically cause DNA damage that is repaired by mismatch repair (MMR) machinery. Nevertheless, murine intestinal cells with biallelic *Mbd4* LOF mutations also exhibit a reduced apoptotic response to gamma irradiation, which is repaired independentl

In [196]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [197]:
display(paragraph_matches[-1])

('discussion',
 'One puzzling observation is that loss-of-function mutations in *Mbd4* are not typically associated with C>A mutator phenotypes. Instead, *Mbd4* deficiency is usually implicated in C>T mutagenesis at CpG sites, and we did not detect an excess of C>T mutations in BXDs with *D* alleles at the chromosome 6 mutator locus (Figure @fig:spectra-comparison-all). However, loss of function mutations in *Mbd4* have also been shown to exacerbate the effects of exogenous DNA damage agents. For example, mouse embryonic fibroblasts that harbor homozygous *Mbd4* knockouts fail to undergo apoptosis following treatment with a number of chemotherapeutics and mutagenic compounds [@PMID:14614141]. Most of these exogenous mutagens cause DNA damage that is normally repaired by mismatch repair (MMR) machinery, but murine intestinal cells with biallelic *Mbd4* LOF mutations also showed a reduced apoptotic response to gamma irradiation, which is repaired independently of the MMR gene *Mlh1* [@PM

####  Paragraph 10

In [198]:
par0 = process_paragraph(orig_section_paragraphs[20])
print(par0)

Together, these lines of evidence suggest that *Mbd4* can modulate sensitivity to many types of exogenous mutagens, potentially through its role in determining whether cells harboring DNA damage should undergo apoptosis [@PMID:14614141;@PMID:14562041]. We speculate that in mice with deficient 8-oxoguanine repair &mdash; caused by a mutator allele in *Mutyh*, for example &mdash; reactive oxygen species (ROS) could cause accumulation of DNA damage in the germline. If those germ cells harbor fully functional copies of *Mbd4*, they might be able to trigger apoptosis and partially mitigate the effects of a *Mutyh* mutator allele. However, mice with reduced activity of both *Mbd4* and *Mutyh* may have a reduced ability to initiate cell death in response to DNA damage; as a result, their germ cells may accumulate even higher levels of ROS-mediated damage, leading to substantially elevated germline C>A mutation rates.


In [199]:
par1 = process_paragraph(mod_section_paragraphs[20])
print(par1)

These findings suggest that the gene *Mbd4* may influence sensitivity to various external factors that cause mutations, possibly by determining whether cells with damaged DNA should die [@PMID:14614141;@PMID:14562041]. It is hypothesized that in mice with a defective repair system for 8-oxoguanine, such as a mutator gene like *Mutyh*, oxidative stress could lead to an increase in DNA damage in the reproductive cells. If these germ cells have functional *Mbd4* genes, they might be able to induce cell death and partially reduce the impact of a *Mutyh* mutator gene. However, mice with reduced activity of both *Mbd4* and *Mutyh* may struggle to initiate cell death in response to DNA damage. Consequently, their germ cells could accumulate higher levels of damage caused by reactive oxygen species, resulting in significantly higher rates of C>A mutations in the germline.


In [200]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [201]:
display(paragraph_matches[-1])

('discussion',
 'Together, these lines of evidence suggest that *Mbd4* can modulate sensitivity to many types of exogenous mutagens, potentially through its role in determining whether cells harboring DNA damage should undergo apoptosis [@PMID:14614141;@PMID:14562041]. We speculate that in mice with deficient 8-oxoguanine repair &mdash; caused by a mutator allele in *Mutyh*, for example &mdash; reactive oxygen species (ROS) could cause accumulation of DNA damage in the germline. If those germ cells harbor fully functional copies of *Mbd4*, they might be able to trigger apoptosis and partially mitigate the effects of a *Mutyh* mutator allele. However, mice with reduced activity of both *Mbd4* and *Mutyh* may have a reduced ability to initiate cell death in response to DNA damage; as a result, their germ cells may accumulate even higher levels of ROS-mediated damage, leading to substantially elevated germline C>A mutation rates.',
 'These findings suggest that the gene *Mbd4* may influen

####  Paragraph 11

In [203]:
par0 = process_paragraph(orig_section_paragraphs[23])
print(par0)

Our aggregate mutation spectrum distance (AMSD) approach was able to identify a mutator allele that escaped notice using quantitative trait locus (QTL) mapping. To more systematically compare the power of AMSD and QTL mapping, we performed simulations under a variety of possible parameter regimes. Overall, we found that AMSD and QTL mapping have similar power to detect mutator alleles on haplotypes that each harbor tens or hundreds of *de novo* germline mutations (Figure @fig:ihd_vs_qtl_power). Nonetheless, only AMSD was able to discover the mutator locus on chromosome 6 in the BXDs, demonstrating that it outperforms QTL mapping in certain experimental systems. For example, simulations demonstrate that AMSD enjoys greater power than QTL mapping when haplotypes carry variable numbers of mutations that can be leveraged for mutator mapping (Figure @fig:ihd_vs_qtl_power_variable_counts). Because the BXDs were generated in six breeding epochs over a period of nearly 40 years, the oldest lin

In [204]:
par1 = process_paragraph(mod_section_paragraphs[23])
print(par1)

Our approach using aggregate mutation spectrum distance (AMSD) successfully identified a mutator allele that was previously overlooked by quantitative trait locus (QTL) mapping. To compare the effectiveness of AMSD and QTL mapping, we conducted simulations with various parameters. Our results showed that both methods have similar abilities to detect mutator alleles on haplotypes with numerous new germline mutations. However, only AMSD was able to locate the mutator locus on chromosome 6 in the BXDs, indicating its superiority over QTL mapping in certain experimental setups. Simulations also revealed that AMSD is more powerful than QTL mapping when haplotypes have varying numbers of mutations that can aid in mutator mapping. The BXDs, generated over nearly 40 years in six breeding epochs, have accumulated significantly more mutations in the oldest lines compared to the youngest ones, resulting in noisier mutation spectra. Unlike QTL mapping, which treats all sample measurements equally,

In [205]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [206]:
display(paragraph_matches[-1])

('discussion',
 'Our aggregate mutation spectrum distance (AMSD) approach was able to identify a mutator allele that escaped notice using quantitative trait locus (QTL) mapping. To more systematically compare the power of AMSD and QTL mapping, we performed simulations under a variety of possible parameter regimes. Overall, we found that AMSD and QTL mapping have similar power to detect mutator alleles on haplotypes that each harbor tens or hundreds of *de novo* germline mutations (Figure @fig:ihd_vs_qtl_power). Nonetheless, only AMSD was able to discover the mutator locus on chromosome 6 in the BXDs, demonstrating that it outperforms QTL mapping in certain experimental systems. For example, simulations demonstrate that AMSD enjoys greater power than QTL mapping when haplotypes carry variable numbers of mutations that can be leveraged for mutator mapping (Figure @fig:ihd_vs_qtl_power_variable_counts). Because the BXDs were generated in six breeding epochs over a period of nearly 40 year

####  Paragraph 12

In [208]:
par0 = process_paragraph(orig_section_paragraphs[24])
print(par0)

Another benefit of the AMSD approach is that it obviates the need to perform separate association tests for every possible $k$-mer mutation type, and therefore the need to adjust significance thresholds for multiple tests. Since AMSD compares the complete mutation spectrum between haplotypes that carry either allele at a site, it would also be well-powered to detect a mutator allele that exerted a coordinated effect on multiple $k$-mer mutation types (e.g., increased the rates of both C>T and C>A mutations).


In [209]:
par1 = process_paragraph(mod_section_paragraphs[24])
print(par1)

Another advantage of the AMSD approach is that it eliminates the requirement to conduct individual association tests for each potential $k$-mer mutation type, thus avoiding the necessity to modify significance thresholds for multiple tests. By comparing the entire mutation spectrum between haplotypes containing either allele at a specific site, AMSD is also highly effective in identifying a mutator allele that influences multiple $k$-mer mutation types simultaneously, such as enhancing the frequencies of both C>T and C>A mutations.


In [210]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [211]:
display(paragraph_matches[-1])

('discussion',
 'Another benefit of the AMSD approach is that it obviates the need to perform separate association tests for every possible $k$-mer mutation type, and therefore the need to adjust significance thresholds for multiple tests. Since AMSD compares the complete mutation spectrum between haplotypes that carry either allele at a site, it would also be well-powered to detect a mutator allele that exerted a coordinated effect on multiple $k$-mer mutation types (e.g., increased the rates of both C>T and C>A mutations).',
 'Another advantage of the AMSD approach is that it eliminates the requirement to conduct individual association tests for each potential $k$-mer mutation type, thus avoiding the necessity to modify significance thresholds for multiple tests. By comparing the entire mutation spectrum between haplotypes containing either allele at a specific site, AMSD is also highly effective in identifying a mutator allele that influences multiple $k$-mer mutation types simultan

####  Paragraph 13

In [213]:
par0 = process_paragraph(orig_section_paragraphs[26])
print(par0)

However, the AMSD method suffers a handful of drawbacks when compared to QTL mapping. Popular QTL mapping methods (such as R/qtl2 [@PMID:30591514]) use linear models to test associations between genotypes and phenotypes, enabling the inclusion of additive and interactive covariates, as well as kinship matrices, in QTL scans. Although we have developed methods to account for inter-sample relatedness in the AMSD approach (*Materials and Methods*), they are not as flexible as similar methods in QTL mapping software. Additionally, the AMSD method assumes that mutator alleles affect a subset of $k$-mer mutation types; if a mutator allele increased the rates of all mutation types equally on haplotypes that carried it, AMSD would be unable to detect it.


In [214]:
par1 = process_paragraph(mod_section_paragraphs[26])
print(par1)

However, the AMSD method has some limitations compared to QTL mapping. Common QTL mapping techniques, such as R/qtl2 (Broman et al., 2019), utilize linear models to assess relationships between genotypes and traits. This allows for the consideration of various factors, including additive and interactive covariates, as well as kinship matrices, in QTL analyses. While we have implemented strategies to address relatedness among samples in the AMSD method (see *Materials and Methods*), these approaches are not as versatile as those found in QTL mapping software. Moreover, the AMSD method assumes that mutator alleles impact a specific subset of mutation types. If a mutator allele were to uniformly increase mutation rates across all types on the haplotypes it is present in, AMSD would not be able to identify this effect.


In [215]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [216]:
display(paragraph_matches[-1])

('discussion',
 'However, the AMSD method suffers a handful of drawbacks when compared to QTL mapping. Popular QTL mapping methods (such as R/qtl2 [@PMID:30591514]) use linear models to test associations between genotypes and phenotypes, enabling the inclusion of additive and interactive covariates, as well as kinship matrices, in QTL scans. Although we have developed methods to account for inter-sample relatedness in the AMSD approach (*Materials and Methods*), they are not as flexible as similar methods in QTL mapping software. Additionally, the AMSD method assumes that mutator alleles affect a subset of $k$-mer mutation types; if a mutator allele increased the rates of all mutation types equally on haplotypes that carried it, AMSD would be unable to detect it.',
 'However, the AMSD method has some limitations compared to QTL mapping. Common QTL mapping techniques, such as R/qtl2 (Broman et al., 2019), utilize linear models to assess relationships between genotypes and traits. This a

####  Paragraph 14

In [218]:
par0 = process_paragraph(orig_section_paragraphs[28])
print(par0)

Our discovery of a second BXD mutator allele underscores the power of recombinant inbred lines (RILs) as a resource for dissecting the genetic architecture of germline mutation rates. Large populations of RILs exist for many model organisms, and we anticipate that as whole-genome sequencing becomes cheaper and cheaper, the AMSD method could be useful for future mutator allele discovery outside of the BXDs. At the same time, RILs are a finite resource that require enormous investments of time and labor to construct. If germline mutator alleles are only detectable in these highly unusual experimental populations, we are unlikely to discover more than a small fraction of the mutator alleles that may exist in nature.


In [219]:
par1 = process_paragraph(mod_section_paragraphs[28])
print(par1)

The identification of a second BXD mutator allele highlights the importance of using recombinant inbred lines (RILs) to study the genetic basis of germline mutation rates. RIL populations are available for many model organisms, and with the decreasing cost of whole-genome sequencing, the AMSD method could be applied to identify mutator alleles in other genetic backgrounds. However, constructing RIL populations is a time-consuming process that requires significant effort. If germline mutator alleles can only be identified in these specialized populations, we may only uncover a small portion of the mutator alleles present in natural populations.


In [220]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [221]:
display(paragraph_matches[-1])

('discussion',
 'Our discovery of a second BXD mutator allele underscores the power of recombinant inbred lines (RILs) as a resource for dissecting the genetic architecture of germline mutation rates. Large populations of RILs exist for many model organisms, and we anticipate that as whole-genome sequencing becomes cheaper and cheaper, the AMSD method could be useful for future mutator allele discovery outside of the BXDs. At the same time, RILs are a finite resource that require enormous investments of time and labor to construct. If germline mutator alleles are only detectable in these highly unusual experimental populations, we are unlikely to discover more than a small fraction of the mutator alleles that may exist in nature.',
 'The identification of a second BXD mutator allele highlights the importance of using recombinant inbred lines (RILs) to study the genetic basis of germline mutation rates. RIL populations are available for many model organisms, and with the decreasing cost

####  Paragraph 15

In [222]:
par0 = process_paragraph(orig_section_paragraphs[29])
print(par0)

Fortunately, the approach introduced in this paper is readily adaptable to datasets beyond RILs. Thousands of human pedigrees have been sequenced in an effort to precisely estimate the rate of human *de novo* germline mutation [@PMID:31549960;@PMID:28959963;@PMID:29700473], and as family sequencing has become a more common step in the diagnosis of many congenital disorders, these datasets are growing on a daily basis. Large cohorts of two- or three-generation families are an example of a regime in which AMSD could enjoy high power; by pooling sparse mutation counts across many individuals who share the same candidate mutator allele, even a subtle mutator signal could potentially rise above the noise of *de novo* germline mutation rate estimates. We note, however, that the aggregate mutation spectrum distance approach will require modification before it can be successfully applied to cohorts of outbred, sexually-reproducing individuals. AMSD assumes that individuals harbor one of two po

In [223]:
par1 = process_paragraph(mod_section_paragraphs[29])
print(par1)

Fortunately, the method introduced in this paper can be easily applied to datasets beyond RILs. Numerous human pedigrees have been sequenced to accurately determine the rate of human *de novo* germline mutation (Campbell et al., 2019; Rahbari et al., 2016; Jónsson et al., 2017). With the increasing use of family sequencing for diagnosing congenital disorders, these datasets are expanding daily. Large cohorts of two- or three-generation families provide a scenario where AMSD could be highly effective. By combining mutation counts from many individuals sharing the same mutator allele, even a subtle mutator signal could potentially be detected amidst the noise of *de novo* germline mutation rate estimates. It is important to note, however, that modifications will be needed to apply the aggregate mutation spectrum distance approach to outbred, sexually-reproducing cohorts. AMSD currently assumes individuals have one of two possible genotypes at each marker, without accounting for heterozyg

In [224]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [225]:
display(paragraph_matches[-1])

('discussion',
 'Fortunately, the approach introduced in this paper is readily adaptable to datasets beyond RILs. Thousands of human pedigrees have been sequenced in an effort to precisely estimate the rate of human *de novo* germline mutation [@PMID:31549960;@PMID:28959963;@PMID:29700473], and as family sequencing has become a more common step in the diagnosis of many congenital disorders, these datasets are growing on a daily basis. Large cohorts of two- or three-generation families are an example of a regime in which AMSD could enjoy high power; by pooling sparse mutation counts across many individuals who share the same candidate mutator allele, even a subtle mutator signal could potentially rise above the noise of *de novo* germline mutation rate estimates. We note, however, that the aggregate mutation spectrum distance approach will require modification before it can be successfully applied to cohorts of outbred, sexually-reproducing individuals. AMSD assumes that individuals har

####  Paragraph 16

In [226]:
par0 = process_paragraph(orig_section_paragraphs[30])
print(par0)

Selection on germline mutator alleles will likely prevent large-effect mutators from reaching high allele frequencies, but a subset may be detectable by sequencing a sufficient number of human trios [@PMID:35666194]. Since germline mutators often seem to exert their effects on a small number of mutation types, mutation spectrum analyses may have greater power to detect the genes that underlie heritable mutation rate variation, even if each gene has only a modest effect on the overall mutation rate per generation.


In [227]:
par1 = process_paragraph(mod_section_paragraphs[30])
print(par1)

Selection pressure on mutator alleles in the germline is expected to limit the prevalence of high-impact mutators, although some may still be identifiable through sequencing a significant number of human trios (Smith et al., 2021). Given that germline mutators typically influence a specific set of mutation types, analyzing mutation spectra may enhance the ability to identify the specific genes responsible for variations in heritable mutation rates, even if each gene has a relatively minor impact on the overall mutation rate per generation (Jones et al., 2019).


In [228]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [229]:
display(paragraph_matches[-1])

('discussion',
 'Selection on germline mutator alleles will likely prevent large-effect mutators from reaching high allele frequencies, but a subset may be detectable by sequencing a sufficient number of human trios [@PMID:35666194]. Since germline mutators often seem to exert their effects on a small number of mutation types, mutation spectrum analyses may have greater power to detect the genes that underlie heritable mutation rate variation, even if each gene has only a modest effect on the overall mutation rate per generation.',
 'Selection pressure on mutator alleles in the germline is expected to limit the prevalence of high-impact mutators, although some may still be identifiable through sequencing a significant number of human trios (Smith et al., 2021). Given that germline mutators typically influence a specific set of mutation types, analyzing mutation spectra may enhance the ability to identify the specific genes responsible for variations in heritable mutation rates, even if

## Methods

In [230]:
section_name = "methods"

In [231]:
pr_filename = pr_files[4].filename
assert section_name in pr_filename
print(pr_filename)

content/05.methods.md


### Original

In [232]:
# fetch the file content at the original revision (pr_prev)
orig_section_content = repo.get_contents(pr_filename, pr_prev).decoded_content.decode(
    "utf-8"
)
print(orig_section_content[:50])

## Materials and Methods

### Identifying *de novo


In [233]:
# split into paragraphs on blank lines
orig_section_paragraphs = orig_section_content.split("\n\n")
display(len(orig_section_paragraphs))

92

### Modified

In [234]:
# fetch the file content at the modified revision (pr_curr)
mod_section_content = repo.get_contents(pr_filename, pr_curr).decoded_content.decode(
    "utf-8"
)
print(mod_section_content[:50])

## Materials and Methods

### Identifying *de novo


In [235]:
# split into paragraphs on blank lines
mod_section_paragraphs = mod_section_content.split("\n\n")
display(len(mod_section_paragraphs))

134

### Match

In [236]:
orig_section_paragraphs[0]

'## Materials and Methods'

In [237]:
mod_section_paragraphs[0]

'## Materials and Methods'
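
The original methods section splits into 92 paragraphs and the modified one into 134, so list indices do not line up and the pairs below are chosen by hand. As a rough aid, candidate pairs could be suggested by text similarity. The following is a minimal sketch using the standard-library `difflib.SequenceMatcher`; the helper name `suggest_matches`, the 0.5 threshold, and the toy paragraphs are assumptions for illustration and are not part of this notebook's workflow.

```python
from difflib import SequenceMatcher


def suggest_matches(orig_pars, mod_pars, threshold=0.5):
    """Return (orig_idx, mod_idx, ratio) for the most similar modified paragraph per original one.

    Hypothetical helper: pairs scoring below `threshold` (an arbitrary value) are dropped.
    """
    suggestions = []
    for i, orig in enumerate(orig_pars):
        best_j, best_ratio = None, 0.0
        for j, mod in enumerate(mod_pars):
            ratio = SequenceMatcher(None, orig, mod).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None and best_ratio >= threshold:
            suggestions.append((i, best_j, round(best_ratio, 2)))
    return suggestions


# toy data standing in for orig_section_paragraphs / mod_section_paragraphs
toy_orig = [
    "## Materials and Methods",
    "The BXD resource currently comprises a total of 152 recombinant inbred lines.",
]
toy_mod = [
    "## Materials and Methods",
    "An unrelated paragraph inserted by the revision.",
    "The BXD resource currently comprises a total of 152 Recombinant Inbred Lines.",
]

toy_suggestions = suggest_matches(toy_orig, toy_mod)
for orig_idx, mod_idx, ratio in toy_suggestions:
    print(orig_idx, mod_idx, ratio)
```

Suggestions like these would still need manual review, since a modified paragraph may merge or split several original ones (as the slices used below show).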

####  Paragraph 00

In [239]:
par0 = process_paragraph(orig_section_paragraphs[2])
print(par0)

The BXD resource currently comprises a total of 152 recombinant inbred lines (RILs). BXDs were derived from either F2 or advanced intercrosses, and subsequently inbred by brother-sister mating for up to 180 generations [@PMID:33472028]. BXDs were generated in distinct breeding "epochs," which were each initiated with a distinct cross of C57BL/6J and DBA/2J parents; epochs 1, 2, 4, and 6 were derived from F2 crosses, while epochs 3 and 5 were derived from advanced intercrosses [@PMID:33472028]. Previously, we analyzed whole-genome sequencing data from the BXDs and identified candidate *de novo* germline mutations in each line [@PMID:35545679]. A detailed description of the methods used for DNA extraction, sequencing, alignment, and variant processing, as well as the characteristics of the *de novo* mutations, are available in a previous manuscript [@PMID:35545679].


In [241]:
par1 = (
    process_paragraph(mod_section_paragraphs[2:4])
    .replace("$$", "\n$$")
    .replace("\\text", "\n\\text")
)
print(par1)

The BXD resource currently comprises a total of 152 Recombinant Inbred Lines (RILs). BXDs were derived from either F2 or advanced intercrosses, and subsequently inbred by brother-sister mating for up to 180 generations (Zhu et al., 2021). BXDs were generated in distinct breeding "epochs," which were each initiated with a distinct cross of C57BL/6J and DBA/2J parents; epochs 1, 2, 4, and 6 were derived from F2 crosses, while epochs 3 and 5 were derived from advanced intercrosses (Zhu et al., 2021). Previously, we analyzed whole-genome sequencing data from the BXDs and identified candidate *de novo* germline mutations in each line (Zhu et al., 2023). 
$$ 
\text{A detailed description of the methods used for DNA extraction, sequencing, alignment, and variant processing, as well as the characteristics of the *de novo* mutations, are available in a previous manuscript (Zhu et al., 2023).} 
$$
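
The chained `.replace` calls in the cell above appear to serve readability only: they add line breaks around the `$$ \text{...} $$` wrapper found in the modified text. A small sketch on a made-up string (the sample text is an assumption, and `process_paragraph` is not used here):

```python
# made-up single-line text imitating a paragraph whose last sentence was wrapped in $$ \text{...} $$
sample = "First sentence. $$ \\text{Second sentence.} $$"

# start a new line before each $$ delimiter and before each \text command
pretty = sample.replace("$$", "\n$$").replace("\\text", "\n\\text")
print(pretty)
```

Note that the line breaks become part of the string itself, so whatever is stored afterwards contains them too.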


In [242]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [243]:
display(paragraph_matches[-1])

('methods',
 'The BXD resource currently comprises a total of 152 recombinant inbred lines (RILs). BXDs were derived from either F2 or advanced intercrosses, and subsequently inbred by brother-sister mating for up to 180 generations [@PMID:33472028]. BXDs were generated in distinct breeding "epochs," which were each initiated with a distinct cross of C57BL/6J and DBA/2J parents; epochs 1, 2, 4, and 6 were derived from F2 crosses, while epochs 3 and 5 were derived from advanced intercrosses [@PMID:33472028]. Previously, we analyzed whole-genome sequencing data from the BXDs and identified candidate *de novo* germline mutations in each line [@PMID:35545679]. A detailed description of the methods used for DNA extraction, sequencing, alignment, and variant processing, as well as the characteristics of the *de novo* mutations, are available in a previous manuscript [@PMID:35545679].',
 'The BXD resource currently comprises a total of 152 Recombinant Inbred Lines (RILs). BXDs were derived fr

####  Paragraph 01

In [246]:
par0 = process_paragraph(orig_section_paragraphs[3:9]).replace(" * ", "\n* ")
print(par0)

Briefly, we identified private single-nucleotide mutations in each BXD that were absent from all other BXDs, as well as from the C57BL/6J and DBA/2J parents. We required each private variant to be meet the following criteria:
* genotyped as either homozygous or heterozygous for the alternate allele, with at least 90% of sequencing reads supporting the alternate allele
* supported by at least 10 sequencing reads
* Phred-scaled genotype quality of at least 20
* must not overlap regions of the genome annotated as segmental duplications or simple repeats in GRCm38/mm10
* must occur on a parental haplotype that was inherited by at least one other BXD at the same locus; these other BXDs must be homozygous for the reference allele at the variant site


In [259]:
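# Put the display-math delimiters, the aligned environment, and each aligned row
# on their own lines so the printed paragraph is easier to compare by eye.
par1 = (
    process_paragraph(mod_section_paragraphs[4:7])
    .replace("$$", "\n$$")
    .replace("\\begin{", "\n\\begin{")
    .replace("&\\", "\n&\\")
    .replace("\\end{", "\n\\end{")
    .replace("$$ This", "$$\nThis")
)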
print(par1)

Briefly, we identified private single-nucleotide mutations in each BXD that were absent from all other BXDs, as well as from the C57BL/6J and DBA/2J parents. Each private variant had to meet the following criteria: 
$$ 
\begin{aligned} 
&\text{Genotyped as either homozygous or heterozygous for the alternate allele, with at least 90\% of sequencing reads supporting the alternate allele} \\ 
&\text{Supported by at least 10 sequencing reads} \\ 
&\text{Phred-scaled genotype quality of at least 20} \\ 
&\text{Must not overlap regions of the genome annotated as segmental duplications or simple repeats in GRCm38/mm10} \\ 
&\text{Must occur on a parental haplotype that was inherited by at least one other BXD at the same locus; these other BXDs must be homozygous for the reference allele at the variant site} 
\end{aligned} 
$$
This rigorous approach ensured the accuracy and reliability of the identified private mutations.


In [260]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [261]:
display(paragraph_matches[-1])

('methods',
 'Briefly, we identified private single-nucleotide mutations in each BXD that were absent from all other BXDs, as well as from the C57BL/6J and DBA/2J parents. We required each private variant to be meet the following criteria:\n* genotyped as either homozygous or heterozygous for the alternate allele, with at least 90% of sequencing reads supporting the alternate allele\n* supported by at least 10 sequencing reads\n* Phred-scaled genotype quality of at least 20\n* must not overlap regions of the genome annotated as segmental duplications or simple repeats in GRCm38/mm10\n* must occur on a parental haplotype that was inherited by at least one other BXD at the same locus; these other BXDs must be homozygous for the reference allele at the variant site',
 'Briefly, we identified private single-nucleotide mutations in each BXD that were absent from all other BXDs, as well as from the C57BL/6J and DBA/2J parents. Each private variant had to meet the following criteria: \n$$ \n\

####  Paragraph 02

In [265]:
par0 = process_paragraph(orig_section_paragraphs[13])
print(par0)

At each informative marker, we divide haplotypes into two groups based on the parental allele that they inherited. We then compute a $k$-mer mutation spectrum using the aggregate mutation counts in each haplotype group. The $k$-mer mutation spectrum contains the frequency of every possible $k$-mer mutation type in a collection of mutations, and can be represented as a vector of size $6 \times 4^{k - 1}$ after collapsing by strand complement. For example, the 1-mer mutation spectrum is a 6-element vector that contains the frequencies of C>T, C>G, C>A, A>G, A>T, and A>C mutations. Since C>T transitions at CpG nucleotides are often caused by a distinct mechanism (spontaneous deamination of methylated cytosine), we expand the 1-mer mutation spectrum to include a separate category for CpG>TpG mutations [@PMID:19488047].


In [273]:
par1 = (
    process_paragraph(mod_section_paragraphs[11:14])
    .replace("$$", "\n$$")
    .replace("$$ \\text", "$$\n\\text")
    .replace("$$ For example", "$$\nFor example")
)
print(par1)

At each informative marker, haplotypes are divided into two groups based on the parental allele they inherited. A $k$-mer mutation spectrum is then computed using the aggregate mutation counts in each haplotype group. The $k$-mer mutation spectrum contains the frequency of every possible $k$-mer mutation type in a collection of mutations and can be represented as a vector of size $6 \times 4^{k - 1}$ after collapsing by strand complement. 
$$
\text{Important Symbols:} \begin{align*} k & : \text{length of the mutation type} \\ \end{align*} 
$$
For example, the 1-mer mutation spectrum is a 6-element vector that contains the frequencies of C>T, C>G, C>A, A>G, A>T, and A>C mutations. Since C>T transitions at CpG nucleotides are often caused by a distinct mechanism (spontaneous deamination of methylated cytosine), the 1-mer mutation spectrum is expanded to include a separate category for CpG>TpG mutations [@PMID:19488047].


In [274]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [275]:
display(paragraph_matches[-1])

('methods',
 'At each informative marker, we divide haplotypes into two groups based on the parental allele that they inherited. We then compute a $k$-mer mutation spectrum using the aggregate mutation counts in each haplotype group. The $k$-mer mutation spectrum contains the frequency of every possible $k$-mer mutation type in a collection of mutations, and can be represented as a vector of size $6 \\times 4^{k - 1}$ after collapsing by strand complement. For example, the 1-mer mutation spectrum is a 6-element vector that contains the frequencies of C>T, C>G, C>A, A>G, A>T, and A>C mutations. Since C>T transitions at CpG nucleotides are often caused by a distinct mechanism (spontaneous deamination of methylated cytosine), we expand the 1-mer mutation spectrum to include a separate category for CpG>TpG mutations [@PMID:19488047].',
 'At each informative marker, haplotypes are divided into two groups based on the parental allele they inherited. A $k$-mer mutation spectrum is then comput
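
The vector size quoted in the matched paragraph, $6 \times 4^{k - 1}$, can be checked with a short sketch. The function name `kmer_spectrum_size` is hypothetical and chosen only for illustration; it is not part of `proj.utils` or of the manuscript's code.

```python
def kmer_spectrum_size(k: int) -> int:
    """Number of k-mer mutation types after collapsing by strand complement.

    There are 6 collapsed single-nucleotide substitution classes
    (C>T, C>G, C>A, A>G, A>T, A>C), and each of the k - 1 flanking
    positions can hold any of the 4 nucleotides.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 6 * 4 ** (k - 1)


print(kmer_spectrum_size(1), kmer_spectrum_size(3))  # 6 96
```

Adding the separate CpG>TpG category described in the paragraph would bring the 1-mer spectrum to 7 entries.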

####  Paragraph 03

In [288]:
par0 = (
    process_paragraph(orig_section_paragraphs[14:17])
    .replace("$$D", "\n$$\nD")
    .replace("}$$", "}\n$$")
    .replace("$$ where", "$$\nwhere")
)
print(par0)

At each marker, we then compute the cosine distance between the two aggregate spectra. The cosine distance between two vectors $\mathbf{A}$ and $\mathbf{B}$ is defined as 
$$
D^C = 1 - \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \ ||\mathbf{B}||}
$$
where $||\mathbf{A}||$ and $||\mathbf{B}||$ are the $L^2$ (or Euclidean) norms of $\mathbf{A}$ and $\mathbf{B}$, respectively. The cosine distance metric has a number of favorable properties for comparing mutation spectra. Since it adjusts for the magnitude of the two input vectors, cosine distance can be used to compare two spectra with unequal total mutation counts (even if those total counts are relatively small). Additionally, by calculating the cosine distance between mutation spectra, we avoid the need to perform separate comparisons of mutation counts at each individual $k$-mer mutation type.


In [291]:
par1 = (
    process_paragraph(mod_section_paragraphs[14:18])
    .replace("$$D", "\n$$\nD")
    .replace("}$$", "}\n$$")
    .replace("$$ where", "$$\nwhere")
)
print(par1)

At each marker, we computed the cosine distance between the two aggregate spectra. The cosine distance between two vectors $\mathbf{A}$ and $\mathbf{B}$ is defined as 
$$
D^C = 1 - \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \ ||\mathbf{B}||}
$$
where $||\mathbf{A}||$ and $||\mathbf{B}||$ are the $L^2$ (or Euclidean) norms of $\mathbf{A}$ and $\mathbf{B}$, respectively. The cosine distance metric has several favorable properties for comparing mutation spectra. It adjusts for the magnitude of the two input vectors, allowing comparison of two spectra with unequal total mutation counts, even if those total counts are relatively small. Additionally, by calculating the cosine distance between mutation spectra, the need to perform separate comparisons of mutation counts at each individual $k$-mer mutation type is avoided.


In [292]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [293]:
display(paragraph_matches[-1])

('methods',
 'At each marker, we then compute the cosine distance between the two aggregate spectra. The cosine distance between two vectors $\\mathbf{A}$ and $\\mathbf{B}$ is defined as \n$$\nD^C = 1 - \\frac{\\mathbf{A} \\cdot \\mathbf{B}}{||\\mathbf{A}|| \\ ||\\mathbf{B}||}\n$$\nwhere $||\\mathbf{A}||$ and $||\\mathbf{B}||$ are the $L^2$ (or Euclidean) norms of $\\mathbf{A}$ and $\\mathbf{B}$, respectively. The cosine distance metric has a number of favorable properties for comparing mutation spectra. Since it adjusts for the magnitude of the two input vectors, cosine distance can be used to compare two spectra with unequal total mutation counts (even if those total counts are relatively small). Additionally, by calculating the cosine distance between mutation spectra, we avoid the need to perform separate comparisons of mutation counts at each individual $k$-mer mutation type.',
 'At each marker, we computed the cosine distance between the two aggregate spectra. The cosine distance
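
As a minimal sketch of the formula in the matched paragraph, the cosine distance $D^C$ can be written in plain Python. The function name and the choice to raise on a zero vector are assumptions made here, not something the manuscript specifies.

```python
from math import sqrt


def cosine_distance(a, b):
    """Cosine distance D^C = 1 - (A . B) / (||A|| ||B||) for equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))  # L2 (Euclidean) norm of A
    norm_b = sqrt(sum(y * y for y in b))  # L2 (Euclidean) norm of B
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the distance as undefined for an all-zero spectrum.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Two made-up spectra with identical proportions but different total counts.
small = [10, 5, 5, 20, 5, 5]
large = [100, 50, 50, 200, 50, 50]
print(abs(cosine_distance(small, large)) < 1e-12)  # True
```

The example illustrates the property the paragraph highlights: scaling the total mutation count leaves the distance unchanged.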

####  Paragraph 04

In [294]:
par0 = (
    process_paragraph(orig_section_paragraphs[17])
)
print(par0)

Inspired by methods from QTL mapping [@PMID:7851788;@PMID:30591514], we use permutation tests to establish genome-wide cosine distance thresholds. In each of $N$ permutation trials, we randomly shuffle the per-haplotype mutation data such that haplotype labels no longer correspond to the correct mutation counts. Using the shuffled mutation data, we perform a genome-wide scan as described above, and record the maximum cosine distance observed at any locus. After $N$ permutations (usually 10,000), we compute the $1 - p$ percentile of the distribution of maximum statistics, and use that percentile value as a genome-wide significance threshold (for example, at $p = 0.05$).


In [295]:
par1 = (
    process_paragraph(mod_section_paragraphs[18])
)
print(par1)

Inspired by methods from QTL mapping (Lander and Botstein, 1987; Pritchard et al., 2018), we employed permutation tests to establish genome-wide cosine distance thresholds. In each of $N$ permutation trials, the per-haplotype mutation data was randomly shuffled so that haplotype labels no longer corresponded to the correct mutation counts. Using the shuffled mutation data, a genome-wide scan was conducted as described above, and the maximum cosine distance observed at any locus was recorded. After $N$ permutations (typically 10,000), the $1 - p$ percentile of the distribution of maximum statistics was computed, and that percentile value was used as a genome-wide significance threshold (e.g., at $p = 0.05$).


In [296]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [297]:
display(paragraph_matches[-1])

('methods',
 'Inspired by methods from QTL mapping [@PMID:7851788;@PMID:30591514], we use permutation tests to establish genome-wide cosine distance thresholds. In each of $N$ permutation trials, we randomly shuffle the per-haplotype mutation data such that haplotype labels no longer correspond to the correct mutation counts. Using the shuffled mutation data, we perform a genome-wide scan as described above, and record the maximum cosine distance observed at any locus. After $N$ permutations (usually 10,000), we compute the $1 - p$ percentile of the distribution of maximum statistics, and use that percentile value as a genome-wide significance threshold (for example, at $p = 0.05$).',
 'Inspired by methods from QTL mapping (Lander and Botstein, 1987; Pritchard et al., 2018), we employed permutation tests to establish genome-wide cosine distance thresholds. In each of $N$ permutation trials, the per-haplotype mutation data was randomly shuffled so that haplotype labels no longer corresp
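
The permutation procedure described in the matched paragraph could be sketched roughly as follows. This is an illustration under stated assumptions, not the manuscript's implementation: the data layout (`spectra[h]` as a per-haplotype count vector, `genotypes[m][h]` as a 0/1 parental allele), all function names, the nearest-rank quantile, and the handling of all-zero spectra are choices made here.

```python
import random
from math import sqrt


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    # Assumption: an all-zero aggregate spectrum contributes a distance of 0.
    return 1.0 - dot / norm if norm > 0 else 0.0


def max_scan_distance(spectra, genotypes):
    """Largest cosine distance between the two allele groups over all markers."""
    best = 0.0
    for marker in genotypes:
        groups = ([], [])
        for spectrum, allele in zip(spectra, marker):
            groups[allele].append(spectrum)
        if not groups[0] or not groups[1]:
            continue  # every haplotype carries the same allele: not informative
        agg0 = [sum(col) for col in zip(*groups[0])]
        agg1 = [sum(col) for col in zip(*groups[1])]
        best = max(best, cosine_distance(agg0, agg1))
    return best


def permutation_threshold(spectra, genotypes, n_permutations=1000, p=0.05, seed=0):
    """Genome-wide threshold: the (1 - p) quantile of per-permutation maxima."""
    rng = random.Random(seed)
    shuffled = list(spectra)
    maxima = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between haplotype labels and spectra
        maxima.append(max_scan_distance(shuffled, genotypes))
    maxima.sort()
    # Nearest-rank estimate of the quantile (an assumption; other estimators exist).
    index = min(len(maxima) - 1, int((1 - p) * len(maxima)))
    return maxima[index]
```

Recording only the maximum distance per permutation is what makes the resulting threshold genome-wide rather than per-marker.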

####  Paragraph 05

In [299]:
par0 = (
    process_paragraph(orig_section_paragraphs[19])
)
print(par0)

If we identified an adjusted cosine distance peak on a particular chromosome, we used a bootstrap resampling approach [@PMID:8725246] to estimate confidence intervals. In each of $N = 10,000$ trials, we resampled the mutation spectrum data and corresponding marker genotypes (on the chromosome of interest) with replacement. Using those resampled spectra and genotypes, we performed an aggregate mutation spectrum distance scan on the chromosome of interest and recorded the position of the marker with the largest adjusted cosine distance value. We then defined a 90% confidence interval by finding two marker locations between which 90% of all $N$ bootstrap samples produced a peak cosine distance value. In other words, we estimated the bounds of the 90% confidence interval by finding the markers that defined the 5th and 95th percentiles of the distribution of maximum adjusted cosine distance values across $N$ bootstrap trials. We note, however, that the bootstrap can exhibit poor performance

In [306]:
par1 = (
    process_paragraph([
        mod_section_paragraphs[21],
        mod_section_paragraphs[23],
        mod_section_paragraphs[25],
        mod_section_paragraphs[27],
    ])
)
print(par1)

If we identified an adjusted cosine distance peak on a particular chromosome, we used a bootstrap resampling approach (Lander and Kruglyak, 1995) to estimate confidence intervals. In each of $N = 10,000$ trials, we resampled the mutation spectrum data and corresponding marker genotypes (on the chromosome of interest) with replacement. Using those resampled spectra and genotypes, we performed an aggregate mutation spectrum distance scan on the chromosome of interest and recorded the position of the marker with the largest adjusted cosine distance value. We then defined a 90% confidence interval by finding two marker locations between which 90% of all $N$ bootstrap samples produced a peak cosine distance value. In other words, we estimated the bounds of the 90% confidence interval by finding the markers that defined the 5th and 95th percentiles of the distribution of maximum adjusted cosine distance values across $N$ bootstrap trials. We note, however, that the bootstrap can exhibit poor

In [307]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [308]:
display(paragraph_matches[-1])

('methods',
 'If we identified an adjusted cosine distance peak on a particular chromosome, we used a bootstrap resampling approach [@PMID:8725246] to estimate confidence intervals. In each of $N = 10,000$ trials, we resampled the mutation spectrum data and corresponding marker genotypes (on the chromosome of interest) with replacement. Using those resampled spectra and genotypes, we performed an aggregate mutation spectrum distance scan on the chromosome of interest and recorded the position of the marker with the largest adjusted cosine distance value. We then defined a 90% confidence interval by finding two marker locations between which 90% of all $N$ bootstrap samples produced a peak cosine distance value. In other words, we estimated the bounds of the 90% confidence interval by finding the markers that defined the 5th and 95th percentiles of the distribution of maximum adjusted cosine distance values across $N$ bootstrap trials. We note, however, that the bootstrap can exhibit po

####  Paragraph 06

In [309]:
par0 = (
    process_paragraph(orig_section_paragraphs[21])
)
print(par0)

We expect each BXD to derive approximately 50% of its genome from C57BL/6J and 50% from DBA/2J. As a result, every pair of BXDs will likely have identical genotypes at a fraction of markers. Pairs of more genetically similar BXDs may also have more similar mutation spectra, potentially due to shared polygenic effects on the mutation process. Therefore, at a given marker, if the BXDs that inherited *D* alleles are more genetically dissimilar from those that inherited *B* alleles (considering all loci throughout the genome in our measurement of genetic similarity), we might expect the aggregate mutation spectra in the two groups to also be more dissimilar.


In [311]:
par1 = (
    process_paragraph(mod_section_paragraphs[29])
)
print(par1)

We expect each BXD to derive approximately 50% of its genome from C57BL/6J and 50% from DBA/2J. As a result, every pair of BXDs will likely have identical genotypes at a fraction of markers. Pairs of more genetically similar BXDs may also have more similar mutation spectra, potentially due to shared polygenic effects on the mutation process. Therefore, at a given marker, if the BXDs that inherited *D* alleles are more genetically dissimilar from those that inherited *B* alleles (considering all loci throughout the genome in our measurement of genetic similarity), we might expect the aggregate mutation spectra in the two groups to also be more dissimilar.


In [312]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [313]:
display(paragraph_matches[-1])

('methods',
 'We expect each BXD to derive approximately 50% of its genome from C57BL/6J and 50% from DBA/2J. As a result, every pair of BXDs will likely have identical genotypes at a fraction of markers. Pairs of more genetically similar BXDs may also have more similar mutation spectra, potentially due to shared polygenic effects on the mutation process. Therefore, at a given marker, if the BXDs that inherited *D* alleles are more genetically dissimilar from those that inherited *B* alleles (considering all loci throughout the genome in our measurement of genetic similarity), we might expect the aggregate mutation spectra in the two groups to also be more dissimilar.',
 'We expect each BXD to derive approximately 50% of its genome from C57BL/6J and 50% from DBA/2J. As a result, every pair of BXDs will likely have identical genotypes at a fraction of markers. Pairs of more genetically similar BXDs may also have more similar mutation spectra, potentially due to shared polygenic effects 

####  Paragraph 07

In [314]:
par0 = (
    process_paragraph(orig_section_paragraphs[22])
)
print(par0)

We implemented a simple approach to account for these potential issues of relatedness. At each marker $g_i$, we divide BXD haplotypes into two groups based on the parental allele they inherited. As before, we first compute the aggregate mutation spectrum in each group of haplotypes and calculate the cosine distance between the two aggregate spectra ($D^{C}_{i}$). Then, within each group of haplotypes, we calculate the allele frequency of the *D* allele at every marker along the genome to obtain a vector of length $n$, where $n$ is the number of genotyped markers. To quantify the genetic similarity between the two groups of haplotypes, we calculate the Pearson correlation coefficient $r_i$ between the two vectors of marker-wide *D* allele frequencies.


In [316]:
par1 = (
    process_paragraph([
        mod_section_paragraphs[31],
        mod_section_paragraphs[33],
    ])
)
print(par1)

We implemented a simple approach to account for these potential issues of relatedness. At each marker $g_i$, we divide BXD haplotypes into two groups based on the parental allele they inherited. As before, we first compute the aggregate mutation spectrum in each group of haplotypes and calculate the cosine distance between the two aggregate spectra ($D^{C}_{i}$). Then, within each group of haplotypes, we calculate the allele frequency of the *D* allele at every marker along the genome to obtain a vector of length $n$, where $n$ is the number of genotyped markers. To quantify the genetic similarity between the two groups of haplotypes, we calculate the Pearson correlation coefficient $r_i$ between the two vectors of marker-wide *D* allele frequencies.


In [317]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [318]:
display(paragraph_matches[-1])

('methods',
 'We implemented a simple approach to account for these potential issues of relatedness. At each marker $g_i$, we divide BXD haplotypes into two groups based on the parental allele they inherited. As before, we first compute the aggregate mutation spectrum in each group of haplotypes and calculate the cosine distance between the two aggregate spectra ($D^{C}_{i}$). Then, within each group of haplotypes, we calculate the allele frequency of the *D* allele at every marker along the genome to obtain a vector of length $n$, where $n$ is the number of genotyped markers. To quantify the genetic similarity between the two groups of haplotypes, we calculate the Pearson correlation coefficient $r_i$ between the two vectors of marker-wide *D* allele frequencies.',
 'We implemented a simple approach to account for these potential issues of relatedness. At each marker $g_i$, we divide BXD haplotypes into two groups based on the parental allele they inherited. As before, we first comput

####  Paragraph 08

In [319]:
par0 = (
    process_paragraph(orig_section_paragraphs[23])
)
print(par0)

Put another way, at every marker $g_i$ along the genome, we divide BXD haplotypes into two groups and compute two metrics: $D^{C}_{i}$ (the cosine distance between the two groups' aggregate spectra) and $r_i$ (the correlation between genome-wide *D* allele frequencies in the two groups). To control for the potential effects of genetic similarity on cosine distances, we regress $\left(D^C_{1}, D^C_{2}, \ldots D^C_{n} \right)$ on $\left( r_1, r_2, \ldots r_n \right)$ for all $n$ markers using an ordinary least-squares model. We then use the residuals from the fitted model as the "adjusted" cosine distance values for each marker. If genome-wide genetic similarity between haplotypes perfectly predicts cosine distances at each marker, these residuals will all be 0 (or very close to 0). If genome-wide genetic similarity has no predictive power, the residuals will simply represent the difference between the observed cosine distance at a single marker and the marker-wide mean of cosine distanc

In [320]:
par1 = process_paragraph([mod_section_paragraphs[35]])
print(par1)

Put in another way, at every marker $g_i$ along the genome, we divide BXD haplotypes into two groups and calculate two metrics: $D^{C}_{i}$ (the cosine distance between the two groups' aggregate spectra) and $r_i$ (the correlation between genome-wide *D* allele frequencies in the two groups). To account for the potential effects of genetic similarity on cosine distances, we perform a regression of $\left(D^C_{1}, D^C_{2}, \ldots D^C_{n} \right)$ on $\left( r_1, r_2, \ldots r_n \right)$ for all $n$ markers using an ordinary least-squares model. Subsequently, we utilize the residuals from the model as the "adjusted" cosine distance values for each marker. If genome-wide genetic similarity between haplotypes perfectly predicts cosine distances at each marker, these residuals will all be 0 (or very close to 0). If genome-wide genetic similarity lacks predictive power, the residuals will simply denote the disparity between the observed cosine distance at a single marker and the marker-wide 

In [321]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [322]:
display(paragraph_matches[-1])

('methods',
 'Put another way, at every marker $g_i$ along the genome, we divide BXD haplotypes into two groups and compute two metrics: $D^{C}_{i}$ (the cosine distance between the two groups\' aggregate spectra) and $r_i$ (the correlation between genome-wide *D* allele frequencies in the two groups). To control for the potential effects of genetic similarity on cosine distances, we regress $\\left(D^C_{1}, D^C_{2}, \\ldots D^C_{n} \\right)$ on $\\left( r_1, r_2, \\ldots r_n \\right)$ for all $n$ markers using an ordinary least-squares model. We then use the residuals from the fitted model as the "adjusted" cosine distance values for each marker. If genome-wide genetic similarity between haplotypes perfectly predicts cosine distances at each marker, these residuals will all be 0 (or very close to 0). If genome-wide genetic similarity has no predictive power, the residuals will simply represent the difference between the observed cosine distance at a single marker and the marker-wide m
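
The adjustment described in the matched paragraph amounts to taking residuals from a one-predictor ordinary least-squares fit. A minimal pure-Python sketch follows; the function name `adjusted_distances` and the fallback to a zero slope when all correlations are equal are assumptions made for illustration.

```python
def adjusted_distances(distances, correlations):
    """Residuals of an OLS fit of cosine distances (D^C_i) on correlations (r_i)."""
    n = len(distances)
    if n != len(correlations) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 markers")
    mean_r = sum(correlations) / n
    mean_d = sum(distances) / n
    s_rr = sum((r - mean_r) ** 2 for r in correlations)
    s_rd = sum((r - mean_r) * (d - mean_d) for r, d in zip(correlations, distances))
    # Assumption: if every r_i is identical, fall back to an intercept-only fit.
    slope = s_rd / s_rr if s_rr > 0 else 0.0
    intercept = mean_d - slope * mean_r
    return [d - (intercept + slope * r) for d, r in zip(distances, correlations)]


# When distances are an exact linear function of r, the residuals vanish.
r = [0.1, 0.2, 0.3, 0.4]
d = [0.05 - 0.1 * x for x in r]
print(all(abs(e) < 1e-12 for e in adjusted_distances(d, r)))  # True
```

With a slope of zero the residuals reduce to each distance minus the marker-wide mean, which matches the second limiting case the paragraph describes.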

####  Paragraph 09

In [323]:
par0 = (
    process_paragraph(orig_section_paragraphs[25])
)
print(par0)

The current BXD family was generated in six breeding "epochs." As discussed previously, each epoch was initiated with a distinct cross of C57BL/6J and DBA/2J parents; BXDs in four of the epochs were generated following F2 crosses of C57BL/6J and DBA/2J, and BXDs in the other two were generated following advanced intercrosses. Due to this breeding approach the BXD epochs differ from each other in a few important ways. For example, BXDs derived in epochs 3 and 5 (i.e., from advanced intercross) harbor larger numbers of fixed recombination breakpoints than those from epochs 1, 2, 4, and 6 [@PMID:33472028]. Although the C57BL/6J and DBA/2J parents used to initialize each epoch were completely inbred, they each possessed a small number unique *de novo* germline mutations that were subsequently inherited by many of their offspring. A number of these "epoch-specific" variants have also been linked to phenotypic variation observed between BXDs from different epochs [@PMID:33472028;@PMID:312741

In [324]:
par1 = process_paragraph([mod_section_paragraphs[39]])
print(par1)

The current BXD family was generated in six breeding "epochs." Each epoch was initiated with a distinct cross of C57BL/6J and DBA/2J parents. BXDs in four of the epochs were generated following F2 crosses of C57BL/6J and DBA/2J, while BXDs in the other two were generated following advanced intercrosses. This breeding approach resulted in differences between the BXD epochs. For instance, BXDs derived in epochs 3 and 5 (from advanced intercross) harbor larger numbers of fixed recombination breakpoints compared to those from epochs 1, 2, 4, and 6 (Smith et al., 2021). Although the C57BL/6J and DBA/2J parents used to initialize each epoch were completely inbred, they each possessed a small number of unique *de novo* germline mutations that were subsequently inherited by many of their offspring. Several of these "epoch-specific" variants have also been associated with phenotypic variation observed between BXDs from different epochs (Smith et al., 2021; Johnson et al., 2019; Brown et al., 20

In [325]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [326]:
display(paragraph_matches[-1])

('methods',
 'The current BXD family was generated in six breeding "epochs." As discussed previously, each epoch was initiated with a distinct cross of C57BL/6J and DBA/2J parents; BXDs in four of the epochs were generated following F2 crosses of C57BL/6J and DBA/2J, and BXDs in the other two were generated following advanced intercrosses. Due to this breeding approach the BXD epochs differ from each other in a few important ways. For example, BXDs derived in epochs 3 and 5 (i.e., from advanced intercross) harbor larger numbers of fixed recombination breakpoints than those from epochs 1, 2, 4, and 6 [@PMID:33472028]. Although the C57BL/6J and DBA/2J parents used to initialize each epoch were completely inbred, they each possessed a small number unique *de novo* germline mutations that were subsequently inherited by many of their offspring. A number of these "epoch-specific" variants have also been linked to phenotypic variation observed between BXDs from different epochs [@PMID:3347202

####  Paragraph 10

In [327]:
par0 = (
    process_paragraph(orig_section_paragraphs[26])
)
print(par0)

To account for potential population structure, as well as these epoch-specific effects, we introduced the ability to perform stratified permutation tests in the aggregate mutation spectrum distance approach. Normally, in each of *N* permutations we shuffle the per-haplotype mutation spectrum data such that haplotype labels no longer correspond to the correct mutation spectra (i.e., shuffle mutation spectra *across* epochs). In the stratified approach, we instead shuffle per-haplotype mutation data *within* epochs, preserving epoch structure while still enabling mutation spectra permutations.


In [332]:
par1 = (
    process_paragraph([
        mod_section_paragraphs[40],
        mod_section_paragraphs[41],
    ])
)
print(par1)

To account for potential population structure, as well as epoch-specific effects, we introduced the ability to perform stratified permutation tests in the aggregate mutation spectrum distance approach. Normally, in each of *N* permutations, we shuffle the per-haplotype mutation spectrum data such that haplotype labels no longer correspond to the correct mutation spectra (i.e., shuffle mutation spectra *across* epochs). In the stratified approach, we instead shuffle per-haplotype mutation data *within* epochs, preserving epoch structure while still enabling mutation spectra permutations.


In [333]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [334]:
display(paragraph_matches[-1])

('methods',
 'To account for potential population structure, as well as these epoch-specific effects, we introduced the ability to perform stratified permutation tests in the aggregate mutation spectrum distance approach. Normally, in each of *N* permutations we shuffle the per-haplotype mutation spectrum data such that haplotype labels no longer correspond to the correct mutation spectra (i.e., shuffle mutation spectra *across* epochs). In the stratified approach, we instead shuffle per-haplotype mutation data *within* epochs, preserving epoch structure while still enabling mutation spectra permutations.',
 'To account for potential population structure, as well as epoch-specific effects, we introduced the ability to perform stratified permutation tests in the aggregate mutation spectrum distance approach. Normally, in each of *N* permutations, we shuffle the per-haplotype mutation spectrum data such that haplotype labels no longer correspond to the correct mutation spectra (i.e., shu
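
The difference between the ordinary and the stratified shuffle in the matched paragraph can be illustrated with a small sketch. The function name `stratified_shuffle` and the representation of epochs as a parallel list of labels are assumptions made here.

```python
import random


def stratified_shuffle(items, strata, rng):
    """Permute items only among positions that share the same stratum label."""
    if len(items) != len(strata):
        raise ValueError("items and strata must have the same length")
    positions_by_stratum = {}
    for index, label in enumerate(strata):
        positions_by_stratum.setdefault(label, []).append(index)
    result = list(items)
    for positions in positions_by_stratum.values():
        values = [items[i] for i in positions]
        rng.shuffle(values)  # shuffle within this epoch only
        for i, value in zip(positions, values):
            result[i] = value
    return result


# Hypothetical example: six haplotypes drawn from two epochs.
spectra = ["s0", "s1", "s2", "s3", "s4", "s5"]
epochs = [1, 1, 1, 2, 2, 2]
shuffled = stratified_shuffle(spectra, epochs, random.Random(0))
print(sorted(shuffled[:3]), sorted(shuffled[3:]))  # ['s0', 's1', 's2'] ['s3', 's4', 's5']
```

Whatever order the shuffle produces, each spectrum stays inside its own epoch, which is the structure the stratified test is meant to preserve.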

####  Paragraph 11

In [340]:
par0 = (
    process_paragraph(orig_section_paragraphs[34])
)
print(par0)

First, we simulate genotypes on a population of haplotypes at a collection of sites. We define a matrix $G$ of size $(s, h)$, where $s$ is the number of sites and $h$ is the number of haplotypes. We assume that every site is biallelic, and that the minor allele frequency at every site is 0.5. For every entry $G_{i,j}$, we take a single draw from a uniform distribution in the interval $[0.0, 1.0)$. If the value of that draw is less than 0.5, we assign the value of $G_{i,j}$ to be $1$. Otherwise, we assign the value of $G_{i,j}$ to be $0$.


In [345]:
par1 = process_paragraph([mod_section_paragraphs[50]])
print(par1)

First, genotypes are simulated on a population of haplotypes at a collection of sites. A matrix $G$ of size $(s, h)$ is defined, where $s$ represents the number of sites and $h$ represents the number of haplotypes. It is assumed that each site is biallelic, and the minor allele frequency at each site is 0.5. For each entry $G_{i,j}$, a single draw is taken from a uniform distribution in the interval $[0.0, 1.0)$. If the value of the draw is less than 0.5, the value of $G_{i,j}$ is assigned as $1$. Otherwise, the value of $G_{i,j}$ is assigned as $0$.


In [346]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [347]:
display(paragraph_matches[-1])

('methods',
 'First, we simulate genotypes on a population of haplotypes at a collection of sites. We define a matrix $G$ of size $(s, h)$, where $s$ is the number of sites and $h$ is the number of haplotypes. We assume that every site is biallelic, and that the minor allele frequency at every site is 0.5. For every entry $G_{i,j}$, we take a single draw from a uniform distribution in the interval $[0.0, 1.0)$. If the value of that draw is less than 0.5, we assign the value of $G_{i,j}$ to be $1$. Otherwise, we assign the value of $G_{i,j}$ to be $0$.',
 'First, genotypes are simulated on a population of haplotypes at a collection of sites. A matrix $G$ of size $(s, h)$ is defined, where $s$ represents the number of sites and $h$ represents the number of haplotypes. It is assumed that each site is biallelic, and the minor allele frequency at each site is 0.5. For each entry $G_{i,j}$, a single draw is taken from a uniform distribution in the interval $[0.0, 1.0)$. If the value of the d

####  Paragraph 12

In [351]:
par0 = (
    process_paragraph(orig_section_paragraphs[41])
)
print(par0)

Rather than simulate the same mean number of mutations ($m$) on every haplotype, we also performed a series of simulations in which the mean number of mutations on each haplotype was allowed to vary. The BXD RILs were inbred for variable numbers of generations, and each BXD therefore accumulated a variable number of *de novo* germline mutations [@PMID:35545679]. To more closely approximate the BXD haplotypes, we performed simulations in which the number of mutations ($m$) on each haplotype was drawn from a uniform distribution from $m$ to $20m$. In other words, we created a vector of mutation counts $M$ containing $h$ evenly-spaced integers from $m$ to $20m$, where $h$ is the number of simulated haplotypes. Thus, if we simulated between 100 and 2,000 mutations on 50 haplotypes, the $i$th entry of $M$ would be $100 + \frac{(2,000 - 100)}{50}i$. Each haplotype's mean number of mutations was then assigned by looking up the haplotype's index $i$ in $M$.
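
The vector of mutation counts $M$ follows directly from the formula in this paragraph. The sketch below assumes a 0-based index $i$ (the paragraph does not say where $i$ starts); the helper name `mutation_count_vector` and the `fold` parameter are hypothetical.

```python
def mutation_count_vector(m, n_haplotypes, fold=20):
    """Return h evenly spaced mean mutation counts starting at m.

    Hypothetical helper: the i-th entry is m + ((fold * m - m) / h) * i,
    with i assumed to run from 0 to h - 1.
    """
    step = (fold * m - m) / n_haplotypes
    return [m + step * i for i in range(n_haplotypes)]


# The worked example from the text: 100 to 2,000 mutations on 50 haplotypes,
# so consecutive entries differ by (2000 - 100) / 50 = 38.
M = mutation_count_vector(m=100, n_haplotypes=50)
```

With a 0-based index the last entry is 1,962 rather than 2,000; a 1-based index would shift every entry up by one step.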


In [354]:
par1 = (
    process_paragraph([
        mod_section_paragraphs[58],
    ])
)
print(par1)

Rather than simulating the same mean number of mutations ($m$) on every haplotype, a series of simulations were conducted where the mean number of mutations on each haplotype varied. The BXD RILs were inbred for variable numbers of generations, resulting in each BXD accumulating a variable number of *de novo* germline mutations (Jones et al., 2021). To better represent the BXD haplotypes, simulations were carried out where the number of mutations ($m$) on each haplotype was randomly selected from a uniform distribution ranging from $m$ to $20m$. Specifically, a vector of mutation counts $M$ was created, containing $h$ evenly-spaced integers from $m$ to $20m$, where $h$ represents the number of simulated haplotypes. Therefore, if simulations were performed with a range of 100 to 2,000 mutations on 50 haplotypes, the $i$th entry of $M$ would be $100 + \frac{(2,000 - 100)}{50}i$. Subsequently, each haplotype's mean number of mutations was determined by referencing the haplotype's index $i

In [355]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [356]:
display(paragraph_matches[-1])

('methods',
 "Rather than simulate the same mean number of mutations ($m$) on every haplotype, we also performed a series of simulations in which the mean number of mutations on each haplotype was allowed to vary. The BXD RILs were inbred for variable numbers of generations, and each BXD therefore accumulated a variable number of *de novo* germline mutations [@PMID:35545679]. To more closely approximate the BXD haplotypes, we performed simulations in which the number of mutations ($m$) on each haplotype was drawn from a uniform distribution from $m$ to $20m$. In other words, we created a vector of mutation counts $M$ containing $h$ evenly-spaced integers from $m$ to $20m$, where $h$ is the number of simulated haplotypes. Thus, if we simulated between 100 and 2,000 mutations on 50 haplotypes, the $i$th entry of $M$ would be $100 + \\frac{(2,000 - 100)}{50}i$. Each haplotype's mean number of mutations was then assigned by looking up the haplotype's index $i$ in $M$.",
 "Rather than simul

####  Paragraph 13

In [357]:
par0 = (
    process_paragraph(orig_section_paragraphs[42])
)
print(par0)

In our simulations, we assume that genotypes at a single site (the "mutator locus") are associated with variation in the mutation spectrum. That is, at a single site $s_i$, all of the haplotypes with $1$ alleles should have elevated rates of a particular mutation type and draw their mutation counts from $\lambda^{\prime}$, while all of the haplotypes with $0$ alleles should have "wild-type" rates of that mutation type and draw their mutation counts from $\lambda$. We therefore pick a random site $s_i$ to be the "mutator locus," and identify the indices of haplotypes in $G$ that were assigned $1$ alleles at $s_i$. We call these indices $h_{mut}$.


In [360]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[61:63],
    )
)
print(par1)

In our simulations, we assume that genotypes at a single site (the "mutator locus") are associated with variation in the mutation spectrum. That is, at a single site $s_i$, all of the haplotypes with $1$ alleles should have elevated rates of a particular mutation type and draw their mutation counts from $\lambda^{\prime}$, while all of the haplotypes with $0$ alleles should have "wild-type" rates of that mutation type and draw their mutation counts from $\lambda$. We therefore pick a random site $s_i$ to be the "mutator locus," and identify the indices of haplotypes in $G$ that were assigned $1$ alleles at $s_i$. We call these indices $h_{mut}$.
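
The two paragraphs printed for this pair appear to be identical, so a quick similarity score can help flag matches where the modified text did not actually change. The sketch below uses `difflib.SequenceMatcher` from the standard library; the helper name `paragraph_similarity` is an assumption and is not part of `proj.utils`.

```python
from difflib import SequenceMatcher


def paragraph_similarity(original, modified):
    """Return a similarity ratio in [0, 1] between two paragraph strings.

    Hypothetical helper: 1.0 means the strings are identical; lower values
    indicate heavier rewriting.
    """
    # autojunk=False avoids the popularity heuristic on long paragraphs.
    return SequenceMatcher(None, original, modified, autojunk=False).ratio()


# Short toy strings; in the notebook this would be called on par0 and par1.
same = paragraph_similarity("We call these indices.", "We call these indices.")
changed = paragraph_similarity("we run 100 trials", "we conduct 100 trials")
```

A ratio of exactly 1.0 suggests the pair is unchanged, which may be worth noting before appending it to `paragraph_matches`.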


In [361]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [362]:
display(paragraph_matches[-1])

('methods',
 'In our simulations, we assume that genotypes at a single site (the "mutator locus") are associated with variation in the mutation spectrum. That is, at a single site $s_i$, all of the haplotypes with $1$ alleles should have elevated rates of a particular mutation type and draw their mutation counts from $\\lambda^{\\prime}$, while all of the haplotypes with $0$ alleles should have "wild-type" rates of that mutation type and draw their mutation counts from $\\lambda$. We therefore pick a random site $s_i$ to be the "mutator locus," and identify the indices of haplotypes in $G$ that were assigned $1$ alleles at $s_i$. We call these indices $h_{mut}$.',
 'In our simulations, we assume that genotypes at a single site (the "mutator locus") are associated with variation in the mutation spectrum. That is, at a single site $s_i$, all of the haplotypes with $1$ alleles should have elevated rates of a particular mutation type and draw their mutation counts from $\\lambda^{\\prime}$

####  Paragraph 14

In [364]:
par0 = (
    process_paragraph(orig_section_paragraphs[47])
)
print(par0)

For each combination of parameters (number of simulated haplotypes, number of simulated markers, mutator effect size, etc.), we run 100 independent trials. In each trial, we simulate the genotype matrix $G$ and the mutation counts $C$. We calculate a "focal" cosine distance as the cosine distance between the aggregate mutation spectra of haplotypes with either genotype at $s_i$ (the site at which we artificially simulated an association between genotypes and mutation spectrum variation). We then perform an aggregate mutation spectrum distance scan using $N = 1,000$ permutations. If fewer than 5% of the $N$ permutations produced a cosine distance greater than or equal to the focal distance, we say that the approach successfully identified the mutator allele in that trial.
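
The focal-distance test in this paragraph can be sketched as a label-shuffling permutation test at a single site. This is a simplified, standard-library illustration rather than the manuscript's genome-wide scan; all function names are hypothetical, and the sketch assumes both alleles are present and every aggregate spectrum has nonzero counts.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def aggregate(spectra, indices):
    """Sum per-haplotype mutation counts over the given haplotype indices."""
    return [sum(spectra[i][k] for i in indices) for k in range(len(spectra[0]))]


def focal_distance(spectra, genotypes):
    """Distance between aggregate spectra of the 1-allele and 0-allele groups."""
    ones = [i for i, g in enumerate(genotypes) if g == 1]
    zeros = [i for i, g in enumerate(genotypes) if g == 0]
    return cosine_distance(aggregate(spectra, ones), aggregate(spectra, zeros))


def permutation_p_value(spectra, genotypes, n_permutations=1000, seed=None):
    """Fraction of permutations with distance >= the focal distance."""
    rng = random.Random(seed)
    focal = focal_distance(spectra, genotypes)
    shuffled = list(spectra)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the haplotype-to-spectrum link
        if focal_distance(shuffled, genotypes) >= focal:
            hits += 1
    return hits / n_permutations


# Toy data (assumed): two mutation types, carriers enriched for the first.
spectra = [[10, 1]] * 10 + [[1, 10]] * 10
genotypes = [1] * 10 + [0] * 10
p_value = permutation_p_value(spectra, genotypes, n_permutations=200, seed=1)
detected = p_value < 0.05  # the 5% criterion from the paragraph
```

Repeating this over 100 independent trials and counting how often `detected` is true would give a power estimate in the sense the paragraph describes.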


In [367]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[71],
    )
)
print(par1)

For each combination of parameters (number of simulated haplotypes, number of simulated markers, mutator effect size, etc.), we conduct 100 independent trials. Within each trial, we generate the genotype matrix $G$ and the mutation counts $C$. A "focal" cosine distance is computed as the cosine distance between the aggregate mutation spectra of haplotypes with either genotype at $s_i$ (the site at which an association between genotypes and mutation spectrum variation was artificially induced). Subsequently, an aggregate mutation spectrum distance scan is carried out using $N = 1,000$ permutations. If less than 5% of the $N$ permutations result in a cosine distance greater than or equal to the focal distance, we conclude that the method successfully identified the mutator allele in that specific trial.


In [368]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [369]:
display(paragraph_matches[-1])

('methods',
 'For each combination of parameters (number of simulated haplotypes, number of simulated markers, mutator effect size, etc.), we run 100 independent trials. In each trial, we simulate the genotype matrix $G$ and the mutation counts $C$. We calculate a "focal" cosine distance as the cosine distance between the aggregate mutation spectra of haplotypes with either genotype at $s_i$ (the site at which we artificially simulated an association between genotypes and mutation spectrum variation). We then perform an aggregate mutation spectrum distance scan using $N = 1,000$ permutations. If fewer than 5% of the $N$ permutations produced a cosine distance greater than or equal to the focal distance, we say that the approach successfully identified the mutator allele in that trial.',
 'For each combination of parameters (number of simulated haplotypes, number of simulated markers, mutator effect size, etc.), we conduct 100 independent trials. Within each trial, we generate the genot

####  Paragraph 15

In [370]:
par0 = (
    process_paragraph(orig_section_paragraphs[49])
)
print(par0)

Using simulated data, we also assessed the power of traditional quantitative trait locus (QTL) mapping to detect a locus associated with mutation spectrum variation. As described above, we simulated both genotype and mutation spectra for a population of haplotypes under various conditions (number of mutations per haplotype, mutator effect size, etc.). Using those simulated data, we used R/qtl2 [@PMID:30591514] to perform a genome scan for significant QTL as follows; we assume that the simulated genotype markers are evenly spaced (in physical Mbp coordinates) on a single chromosome. First, we calculate the fraction of each haplotype's *de novo* mutations that belong to each of the $6 \times 4^{k-1}$ possible $k$-mer mutation types. We then convert the simulated genotypes at each marker to genotype probabilities using the `calc_genoprob` function in R/qtl2, with `map_function = "c-f"` and `error_prob = 0`. For every $k$-mer mutation type, we use genotype probabilities and per-haplotype m

In [373]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[73:76],
    )
)
print(par1)

Using simulated data, we also evaluated the effectiveness of traditional quantitative trait locus (QTL) mapping in identifying a locus linked to mutation spectrum variability. As previously mentioned, we generated simulated genotype and mutation spectra for a haplotype population under different conditions, such as the number of mutations per haplotype and the mutator effect size. With these simulated data, we employed R/qtl2 (Broman et al., 2019) to conduct a genome scan for significant QTL. In this analysis, we assumed that the simulated genotype markers were evenly distributed along a single chromosome in physical Mbp coordinates. First, we determined the proportion of each haplotype's *de novo* mutations attributed to each of the $6 \times 4^{k-1}$ potential $k$-mer mutation types. Subsequently, we converted the simulated genotypes at each marker into genotype probabilities using the `calc_genoprob` function in R/qtl2, utilizing `map_function = "c-f"` and `error_prob = 0`. For each

In [374]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [375]:
display(paragraph_matches[-1])

('methods',
 'Using simulated data, we also assessed the power of traditional quantitative trait locus (QTL) mapping to detect a locus associated with mutation spectrum variation. As described above, we simulated both genotype and mutation spectra for a population of haplotypes under various conditions (number of mutations per haplotype, mutator effect size, etc.). Using those simulated data, we used R/qtl2 [@PMID:30591514] to perform a genome scan for significant QTL as follows; we assume that the simulated genotype markers are evenly spaced (in physical Mbp coordinates) on a single chromosome. First, we calculate the fraction of each haplotype\'s *de novo* mutations that belong to each of the $6 \\times 4^{k-1}$ possible $k$-mer mutation types. We then convert the simulated genotypes at each marker to genotype probabilities using the `calc_genoprob` function in R/qtl2, with `map_function = "c-f"` and `error_prob = 0`. For every $k$-mer mutation type, we use genotype probabilities and

####  Paragraph 16

In [376]:
par0 = (
    process_paragraph(orig_section_paragraphs[50])
)
print(par0)

**Note**: In our simulations, we augment the mutation rate of a single $k$-mer mutation type on haplotypes carrying the simulated mutator allele. However, in an experimental setting, we would not expect to have *a priori* knowledge of the mutation type affected by the mutator. Thus, by using an alpha threshold of 0.05 in our simulations, we would likely over-estimate the power of QTL mapping for detecting the mutator. Since we would need to perform 7 separate QTL scans (one for each 1-mer mutation type plus CpG>TpG) in an experimental setting, we calculate QTL LOD thresholds at a Bonferroni-corrected alpha value of $\alpha = \frac{0.05}{7}$.


In [377]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[76],
    )
)
print(par1)

In our simulations, we increase the mutation rate of a single $k$-mer mutation type on haplotypes carrying the simulated mutator allele. However, in an experimental setting, we would not expect to have prior knowledge of the mutation type affected by the mutator. Thus, by using an alpha threshold of 0.05 in our simulations, we would likely overestimate the power of QTL mapping for detecting the mutator. Since we would need to perform 7 separate QTL scans (one for each 1-mer mutation type plus CpG>TpG) in an experimental setting, we calculate QTL LOD thresholds at a Bonferroni-corrected alpha value of $\alpha = \frac{0.05}{7}$.


In [378]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [379]:
display(paragraph_matches[-1])

('methods',
 '**Note**: In our simulations, we augment the mutation rate of a single $k$-mer mutation type on haplotypes carrying the simulated mutator allele. However, in an experimental setting, we would not expect to have *a priori* knowledge of the mutation type affected by the mutator. Thus, by using an alpha threshold of 0.05 in our simulations, we would likely over-estimate the power of QTL mapping for detecting the mutator. Since we would need to perform 7 separate QTL scans (one for each 1-mer mutation type plus CpG>TpG) in an experimental setting, we calculate QTL LOD thresholds at a Bonferroni-corrected alpha value of $\\alpha = \\frac{0.05}{7}$.',
 'In our simulations, we increase the mutation rate of a single $k$-mer mutation type on haplotypes carrying the simulated mutator allele. However, in an experimental setting, we would not expect to have prior knowledge of the mutation type affected by the mutator. Thus, by using an alpha threshold of 0.05 in our simulations, we w

####  Paragraph 17

In [381]:
par0 = (
    process_paragraph(orig_section_paragraphs[53])
)
print(par0)

As in our previous manuscript [@PMID:35545679], we included mutation data from a subset of the 152 BXDs in our aggregate mutation spectrum distance scans. Specifically, we removed BXDs that were backcrossed to a C57BL/6J or DBA/2J parent at any point during the inbreeding process (usually, in order to rescue that BXD from inbreeding depression [@PMID:33472028]). We also removed BXD68 from our genome-wide scans, since we previously discovered a hyper-mutator phenotype in that line; the C>A germline mutation rate in BXD68 is over 5 times the population mean, likely due to a private deleterious nonsynonymous mutation in *Mutyh* [@PMID:35545679]. In our previous manuscript, we removed any BXDs that had been inbred for fewer than 20 generations, as it takes approximately 20 generations of strict brother-sister mating for an RIL genome to become >98% homozygous [@url:https://link.springer.com/book/10.1007/978-1-349-04904-2]. As a result, any potential mutator allele would almost certainly be

In [387]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[80:83],
    )
)
print(par1)

As in our previous manuscript (Smith et al., 2021), we included mutation data from a subset of the 152 BXDs in our aggregate mutation spectrum distance scans. Specifically, we removed BXDs that were backcrossed to a C57BL/6J or DBA/2J parent at any point during the inbreeding process (usually, in order to rescue that BXD from inbreeding depression; Jones et al., 2020). We also excluded BXD68 from our genome-wide scans due to a previously identified hyper-mutator phenotype in that line. The C>A germline mutation rate in BXD68 is over 5 times the population mean, likely attributed to a private deleterious nonsynonymous mutation in *Mutyh* (Smith et al., 2021). In our previous manuscript, we eliminated BXDs that had been inbred for fewer than 20 generations. It takes approximately 20 generations of strict brother-sister mating for a recombinant inbred line (RIL) genome to become >98% homozygous (Johnson et al., 1998). Therefore, any potential mutator allele would likely be fixed or lost a

In [388]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [389]:
display(paragraph_matches[-1])

('methods',
 'As in our previous manuscript [@PMID:35545679], we included mutation data from a subset of the 152 BXDs in our aggregate mutation spectrum distance scans. Specifically, we removed BXDs that were backcrossed to a C57BL/6J or DBA/2J parent at any point during the inbreeding process (usually, in order to rescue that BXD from inbreeding depression [@PMID:33472028]). We also removed BXD68 from our genome-wide scans, since we previously discovered a hyper-mutator phenotype in that line; the C>A germline mutation rate in BXD68 is over 5 times the population mean, likely due to a private deleterious nonsynonymous mutation in *Mutyh* [@PMID:35545679]. In our previous manuscript, we removed any BXDs that had been inbred for fewer than 20 generations, as it takes approximately 20 generations of strict brother-sister mating for an RIL genome to become >98% homozygous [@url:https://link.springer.com/book/10.1007/978-1-349-04904-2]. As a result, any potential mutator allele would almos

####  Paragraph 18

In [390]:
par0 = (
    process_paragraph(orig_section_paragraphs[56])
)
print(par0)

We investigated the region implicated by our aggregate mutation spectrum distance approach on chromosome 6 by subsetting the joint-genotyped BXD VCF file (European Nucleotide Archive accession PRJEB45429 [@url:https://www.ebi.ac.uk/ena/browser/view/PRJEB45429]) using `bcftools` [@PMID:33590861]. We defined the candidate interval surrounding the cosine distance peak on chromosome 6 as the 90% bootstrap confidence interval (extending from approximately 95 Mbp to 114 Mbp). To predict the functional impacts of both single-nucleotide variants and indels on splicing, protein structure, etc., we annotated variants in the BXD VCF using the following `snpEff` [@PMID:22728672] command:


In [393]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[85:87],
    )
)
print(par1)

We investigated the region implicated by our aggregate mutation spectrum distance approach on chromosome 6 by subsetting the joint-genotyped BXD VCF file (European Nucleotide Archive accession PRJEB45429 [1]) using `bcftools` [2]. We defined the candidate interval surrounding the cosine distance peak on chromosome 6 as the 90% bootstrap confidence interval (extending from approximately 95 Mbp to 114 Mbp). To predict the functional impacts of both single-nucleotide variants and indels on splicing, protein structure, etc., we annotated variants in the BXD VCF using the following `snpEff` [3] command:


In [394]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [395]:
display(paragraph_matches[-1])

('methods',
 'We investigated the region implicated by our aggregate mutation spectrum distance approach on chromosome 6 by subsetting the joint-genotyped BXD VCF file (European Nucleotide Archive accession PRJEB45429 [@url:https://www.ebi.ac.uk/ena/browser/view/PRJEB45429]) using `bcftools` [@PMID:33590861]. We defined the candidate interval surrounding the cosine distance peak on chromosome 6 as the 90% bootstrap confidence interval (extending from approximately 95 Mbp to 114 Mbp). To predict the functional impacts of both single-nucleotide variants and indels on splicing, protein structure, etc., we annotated variants in the BXD VCF using the following `snpEff` [@PMID:22728672] command:',
 'We investigated the region implicated by our aggregate mutation spectrum distance approach on chromosome 6 by subsetting the joint-genotyped BXD VCF file (European Nucleotide Archive accession PRJEB45429 [1]) using `bcftools` [2]. We defined the candidate interval surrounding the cosine distance 

####  Paragraph 19

In [400]:
par0 = (
    process_paragraph(orig_section_paragraphs[68])
)
print(par0)

We downloaded mutation data from a previously published analysis [@PMID:30753674] (Supplementary File 1, Excel Table S3) that identified strain-private mutations in 29 strains that were originally whole-genome sequenced as part of the Sanger Mouse Genomes (MGP) project [@PMID:21921910]. When comparing counts of each mutation type between MGP strains that harbored either *D* or *B* alleles at the chromosome 4 or chromosome 6 mutator loci, we adjusted mutation counts by the number of callable A, T, C, or G nucleotides in each strain as described previously [@PMID:35545679].


In [403]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[99],
    )
)
print(par1)

We downloaded mutation data from a previously published analysis (Lyon et al., 2019) that identified strain-private mutations in 29 strains that were originally whole-genome sequenced as part of the Sanger Mouse Genomes (MGP) project (Adams et al., 2011). When comparing counts of each mutation type between MGP strains that harbored either *D* or *B* alleles at the chromosome 4 or chromosome 6 mutator loci, we adjusted mutation counts by the number of callable A, T, C, or G nucleotides in each strain as described previously (Smith et al., 2021).


In [404]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [405]:
display(paragraph_matches[-1])

('methods',
 'We downloaded mutation data from a previously published analysis [@PMID:30753674] (Supplementary File 1, Excel Table S3) that identified strain-private mutations in 29 strains that were originally whole-genome sequenced as part of the Sanger Mouse Genomes (MGP) project [@PMID:21921910]. When comparing counts of each mutation type between MGP strains that harbored either *D* or *B* alleles at the chromosome 4 or chromosome 6 mutator loci, we adjusted mutation counts by the number of callable A, T, C, or G nucleotides in each strain as described previously [@PMID:35545679].',
 'We downloaded mutation data from a previously published analysis (Lyon et al., 2019) that identified strain-private mutations in 29 strains that were originally whole-genome sequenced as part of the Sanger Mouse Genomes (MGP) project (Adams et al., 2011). When comparing counts of each mutation type between MGP strains that harbored either *D* or *B* alleles at the chromosome 4 or chromosome 6 mutator

####  Paragraph 20

In [414]:
par0 = (
    process_paragraph(orig_section_paragraphs[70])
)
print(par0)

We used the online GeneNetwork resource [@PMID:27933521], which contains array- and RNA-seq-derived expression measurements in a wide variety of tissues, to find *cis*-eQTLs for the DNA repair genes we implicated under the cosine distance peak on chromosome 6. On the GeneNetwork homepage (genenetwork.org), we selected the "BXD Family" **Group** and used the **Type** dropdown menu to select each of the specific expression datasets described in Table @tbl:eqtl-provenance. In the **Get Any** text box, we then entered the listed gene name and clicked **Search**. After selecting the appropriate trait ID on the next page, we used the **Mapping Tools** dropdown to run Hayley-Knott regression [@PMID:16718932] with default parameters: 1,000 permutations, interval mapping, no cofactors, and WGS-based genotypes (2022).


In [417]:
par1 = (
    process_paragraph([
        mod_section_paragraphs[103],
        mod_section_paragraphs[105],
    ])
)
print(par1)

We used the online GeneNetwork resource (Ghazalpour et al., 2017), which contains array- and RNA-seq-derived expression measurements in a wide variety of tissues, to find *cis*-eQTLs for the DNA repair genes we implicated under the cosine distance peak on chromosome 6. On the GeneNetwork homepage (genenetwork.org), we selected the "BXD Family" group and used the type dropdown menu to select each of the specific expression datasets described in Table 1. In the "Get Any" text box, we then entered the listed gene name and clicked "Search". After selecting the appropriate trait ID on the next page, we used the mapping tools dropdown to run Hayley-Knott regression (Broman and Sen, 2009) with default parameters: 1,000 permutations, interval mapping, no cofactors, and WGS-based genotypes. The Hayley-Knott regression method is a statistical technique used for mapping quantitative trait loci (QTL) in genetic studies. It is based on a maximum likelihood approach and is commonly used for interval

In [418]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [419]:
display(paragraph_matches[-1])

('methods',
 'We used the online GeneNetwork resource [@PMID:27933521], which contains array- and RNA-seq-derived expression measurements in a wide variety of tissues, to find *cis*-eQTLs for the DNA repair genes we implicated under the cosine distance peak on chromosome 6. On the GeneNetwork homepage (genenetwork.org), we selected the "BXD Family" **Group** and used the **Type** dropdown menu to select each of the specific expression datasets described in Table @tbl:eqtl-provenance. In the **Get Any** text box, we then entered the listed gene name and clicked **Search**. After selecting the appropriate trait ID on the next page, we used the **Mapping Tools** dropdown to run Hayley-Knott regression [@PMID:16718932] with default parameters: 1,000 permutations, interval mapping, no cofactors, and WGS-based genotypes (2022).',
 'We used the online GeneNetwork resource (Ghazalpour et al., 2017), which contains array- and RNA-seq-derived expression measurements in a wide variety of tissues,

####  Paragraph 21

In [425]:
par0 = (
    process_paragraph(orig_section_paragraphs[71])
)
print(par0)

If we discovered a significant cis-eQTL for the gene of interest (that is, a locus on chromosome 6 with an LRS greater than or equal to the "significant LRS" genome-wide threshold), we then performed a second genome-wide association test for the trait of interest using GEMMA [@PMID:2453419] with the following parameters: WGS-based marker genotypes, a minor allele frequency threshold of 0.05, and leave-one-chromosome-out (LOCO). By using both Haley-Knott regression and GEMMA, we could first discover loci that exceeded a genome-wide LRS threshold, and then more precisely estimate the effect of those loci on gene expression [@doi:10.1101/2020.12.23.424047].


In [426]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[108],
    )
)
print(par1)

If a significant cis-eQTL for the gene of interest was identified (i.e., a locus on chromosome 6 with an LRS greater than or equal to the "significant LRS" genome-wide threshold), a second genome-wide association test for the trait of interest was conducted using GEMMA (Zhou et al., 2014) with the following parameters: WGS-based marker genotypes, a minor allele frequency threshold of 0.05, and leave-one-chromosome-out (LOCO). By employing both Haley-Knott regression and GEMMA, loci that exceeded a genome-wide LRS threshold could be initially identified, followed by a more precise estimation of the effect of those loci on gene expression (Yao et al., 2021).


In [427]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [428]:
display(paragraph_matches[-1])

('methods',
 'If we discovered a significant cis-eQTL for the gene of interest (that is, a locus on chromosome 6 with an LRS greater than or equal to the "significant LRS" genome-wide threshold), we then performed a second genome-wide association test for the trait of interest using GEMMA [@PMID:2453419] with the following parameters: WGS-based marker genotypes, a minor allele frequency threshold of 0.05, and leave-one-chromosome-out (LOCO). By using both Haley-Knott regression and GEMMA, we could first discover loci that exceeded a genome-wide LRS threshold, and then more precisely estimate the effect of those loci on gene expression [@doi:10.1101/2020.12.23.424047].',
 'If a significant cis-eQTL for the gene of interest was identified (i.e., a locus on chromosome 6 with an LRS greater than or equal to the "significant LRS" genome-wide threshold), a second genome-wide association test for the trait of interest was conducted using GEMMA (Zhou et al., 2014) with the following parameters

####  Paragraph 22

In [432]:
par0 = (
    process_paragraph(orig_section_paragraphs[76])
)
print(par0)

To determine the frequencies of the *Ogg1* and *Setmar* nonsynonymous mutations in other populations of mice, we queried a VCF file containing genome-wide variation in 67 wild-derived mice from four species of *Mus* [@PMID:27622383]. We calculated the allele frequency of each nonsynonymous mutation in each of the four species or subspecies (*Mus musculus domesticus*, *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus*), including genotypes that met the following criteria:


In [436]:
par1 = (
    process_paragraph(
        mod_section_paragraphs[114],
    )
)
print(par1)

To determine the frequencies of the *Ogg1* and *Setmar* nonsynonymous mutations in other populations of mice, a VCF file containing genome-wide variation in 67 wild-derived mice from four species of *Mus* was queried [@PMID:27622383]. The allele frequency of each nonsynonymous mutation in each of the four species or subspecies (*Mus musculus domesticus*, *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus*) was calculated, including genotypes that met the following criteria:


In [437]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [438]:
display(paragraph_matches[-1])

('methods',
 'To determine the frequencies of the *Ogg1* and *Setmar* nonsynonymous mutations in other populations of mice, we queried a VCF file containing genome-wide variation in 67 wild-derived mice from four species of *Mus* [@PMID:27622383]. We calculated the allele frequency of each nonsynonymous mutation in each of the four species or subspecies (*Mus musculus domesticus*, *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus*), including genotypes that met the following criteria:',
 'To determine the frequencies of the *Ogg1* and *Setmar* nonsynonymous mutations in other populations of mice, a VCF file containing genome-wide variation in 67 wild-derived mice from four species of *Mus* was queried [@PMID:27622383]. The allele frequency of each nonsynonymous mutation in each of the four species or subspecies (*Mus musculus domesticus*, *Mus musculus musculus*, *Mus musculus castaneus*, and *Mus spretus*) was calculated, including genotypes that met the following cr

####  Paragraph 23

In [443]:
par0 = (
    process_paragraph(orig_section_paragraphs[82])
)
print(par0)

In this model, `Count` is the count of C>A *de novo* mutations observed in each BXD. `ADJ_AGE` is the product of the number of "callable" cytosine/guanine nucleotides in each BXD (i.e., the total number of cytosines/guanines covered by at least 10 sequencing reads) and the number of generations for which the BXD was inbred. We included the logarithm of `ADJ_AGE` as an "offset" in order to model the response variable as a rate (expressed per base-pair, per generation) rather than an absolute count; the BXDs differ in both their durations of inbreeding and the proportions of their genomes that were sequenced to sufficient depth, which influences the number of mutations we observe in each BXD. The `Genotype_A` and `Genotype_B` terms represent the genotypes of BXDs at markers `rs27509845` and `rs46276051` (the markers with peak cosine distances on chromosomes 4 and 6 in the two aggregate mutation spectrum distance scans). We limited our analysis to the n = 108 BXDs that were homozygous at 

In [447]:
par1 = process_paragraph(mod_section_paragraphs[120])
print(par1)

In this model, `Count` represents the count of C>A *de novo* mutations observed in each BXD mouse strain. `ADJ_AGE` is calculated as the product of the number of "callable" cytosine/guanine nucleotides in each BXD (i.e., the total number of cytosines/guanines covered by at least 10 sequencing reads) and the number of generations for which the BXD strain was inbred. To account for differences in inbreeding duration and genome coverage depth among BXDs, we included the logarithm of `ADJ_AGE` as an "offset" to model the response variable as a mutation rate (expressed per base-pair, per generation) rather than an absolute count. The genotypes of BXDs at markers `rs27509845` and `rs46276051` are denoted by `Genotype_A` and `Genotype_B`, respectively, which correspond to the markers with peak cosine distances on chromosomes 4 and 6 in the two aggregate mutation spectrum distance scans. Our analysis focused on the n = 108 BXDs that were homozygous at both sites, allowing us to represent genot

In [448]:
paragraph_matches.append(
    (
        section_name,
        par0,
        par1,
    )
)

In [449]:
display(paragraph_matches[-1])

('methods',
 'In this model, `Count` is the count of C>A *de novo* mutations observed in each BXD. `ADJ_AGE` is the product of the number of "callable" cytosine/guanine nucleotides in each BXD (i.e., the total number of cytosines/guanines covered by at least 10 sequencing reads) and the number of generations for which the BXD was inbred. We included the logarithm of `ADJ_AGE` as an "offset" in order to model the response variable as a rate (expressed per base-pair, per generation) rather than an absolute count; the BXDs differ in both their durations of inbreeding and the proportions of their genomes that were sequenced to sufficient depth, which influences the number of mutations we observe in each BXD. The `Genotype_A` and `Genotype_B` terms represent the genotypes of BXDs at markers `rs27509845` and `rs46276051` (the markers with peak cosine distances on chromosomes 4 and 6 in the two aggregate mutation spectrum distance scans). We limited our analysis to the n = 108 BXDs that were 
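
The select, clean, and append steps repeated for each paragraph above could be folded into one helper. The following is only a sketch: `add_match` and its `process` parameter are hypothetical names, and `str.strip` is a stand-in used solely so the example runs without `proj.utils.process_paragraph`.

```python
from typing import Callable, List, Sequence, Tuple

Match = Tuple[str, str, str]


def add_match(
    matches: List[Match],
    section: str,
    orig_paragraphs: Sequence[str],
    mod_paragraphs: Sequence[str],
    orig_idx: int,
    mod_idx: int,
    process: Callable[[str], str] = str.strip,  # stand-in for process_paragraph
) -> Match:
    """Clean one original/modified paragraph pair and append it to `matches`."""
    match = (
        section,
        process(orig_paragraphs[orig_idx]),
        process(mod_paragraphs[mod_idx]),
    )
    matches.append(match)
    return match


# Toy data, not taken from the manuscript.
demo_matches: List[Match] = []
demo_orig = ["  We queried a VCF file.  ", "Unrelated paragraph."]
demo_mod = ["Other text.", "  A VCF file was queried.  "]
add_match(demo_matches, "methods", demo_orig, demo_mod, orig_idx=0, mod_idx=1)
print(demo_matches[-1])
```

In the notebook itself, `process=process_paragraph` would be passed, and the paragraph indices would still be chosen by hand after inspecting the printed text.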

# Close connections

In [450]:
g.close()

# Save

In [451]:
len(paragraph_matches)

64

In [452]:
paragraph_matches[:2]

[('abstract',
  'Maintaining germline genome integrity is essential and enormously complex. Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537]. While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals. DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of *de novo* mutations &mdash; that is, the frequency of each individual mutation type (C>T, A>G, etc.) &mdash; will differ between genomes that harbor either a mutator or wild-type allele at a given locus. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:3

In [453]:
# Each match is a (section, original, modified) tuple, so name the columns directly.
df = pd.DataFrame(
    paragraph_matches,
    columns=["section", "original", "modified"],
)

In [454]:
df.shape

(64, 3)
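
Before saving, a few sanity checks on the table can catch mismatched or empty pairs. The sketch below uses a hypothetical helper, `check_matches`, on toy rows. The specific counts it reports (missing values, blank strings, repeated original paragraphs) are assumptions about what is worth inspecting, not checks this notebook performs.

```python
import pandas as pd


def check_matches(df: pd.DataFrame) -> dict:
    """Return simple quality counts for a section/original/modified table."""
    text_cols = ["original", "modified"]
    # Count strings that are empty or whitespace-only (missing values excluded).
    n_blank = sum(
        int(df[col].dropna().str.strip().eq("").sum()) for col in text_cols
    )
    return {
        "n_rows": len(df),
        "n_missing": int(df[text_cols].isna().sum().sum()),
        "n_blank": n_blank,
        "n_repeated_original": int(df["original"].duplicated().sum()),
    }


# Toy rows, not taken from the manuscript.
toy_df = pd.DataFrame(
    [
        ("abstract", "First original.", "First modified."),
        ("methods", "Second original.", " "),
        ("methods", "Second original.", "Second modified."),
    ],
    columns=["section", "original", "modified"],
)
report = check_matches(toy_df)
print(report)
```

A nonzero `n_repeated_original` is not necessarily an error (one original paragraph may legitimately be split in the revision), so the counts are meant for review rather than as hard assertions.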

In [455]:
df.head()

| | section | original | modified |
|---|---|---|---|
| 0 | abstract | Maintaining germline genome integrity is essen... | The essential and immensely complex issue of m... |
| 1 | introduction | Germline mutation rates reflect the complex in... | Germline mutation rates are influenced by DNA ... |
| 2 | introduction | The dearth of observed germline mutators in ma... | The scarcity of observed germline mutators in ... |
| 3 | introduction | Despite these challenges, less traditional str... | Despite facing challenges, researchers have ut... |
| 4 | introduction | In mice, a germline mutator allele was recentl... | In a recent study, researchers identified a ge... |


In [456]:
df.to_pickle(OUTPUT_FILE_PATH)
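
To confirm that a saved file loads back unchanged, the table can be read with `pd.read_pickle` and compared with the in-memory version. This sketch writes a toy DataFrame to a temporary directory rather than to `OUTPUT_FILE_PATH`; the file name `toy-matches.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table with the same column layout as the notebook's DataFrame.
toy = pd.DataFrame(
    [("abstract", "Original text.", "Modified text.")],
    columns=["section", "original", "modified"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy-matches.pkl"
    toy.to_pickle(toy_path)
    loaded = pd.read_pickle(toy_path)

# Raises AssertionError if values, dtypes, or labels differ.
pd.testing.assert_frame_equal(toy, loaded)
```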

# Reverse original/modified columns

In [457]:
# Swap the two column names in a single call; the mapping is applied to each
# label independently, so no temporary name is needed.
df_reversed = df.rename(
    columns={"original": "modified", "modified": "original"}
)

In [458]:
df_reversed.shape

(64, 3)

In [459]:
df_reversed.head()

| | section | modified | original |
|---|---|---|---|
| 0 | abstract | Maintaining germline genome integrity is essen... | The essential and immensely complex issue of m... |
| 1 | introduction | Germline mutation rates reflect the complex in... | Germline mutation rates are influenced by DNA ... |
| 2 | introduction | The dearth of observed germline mutators in ma... | The scarcity of observed germline mutators in ... |
| 3 | introduction | Despite these challenges, less traditional str... | Despite facing challenges, researchers have ut... |
| 4 | introduction | In mice, a germline mutator allele was recentl... | In a recent study, researchers identified a ge... |
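
A quick way to confirm that the swap did what was intended is to compare column contents across the two tables. The sketch below reproduces the renaming on toy data (not the manuscript paragraphs) and checks that each column's contents now sit under the other name while the column positions stay the same.

```python
import pandas as pd

# Toy table with the same column layout as the notebook's DataFrame.
toy = pd.DataFrame(
    [
        ("abstract", "Original text.", "Modified text."),
        ("methods", "Another original.", "Another modified."),
    ],
    columns=["section", "original", "modified"],
)
toy_reversed = toy.rename(
    columns={"original": "modified", "modified": "original"}
)

# Contents are untouched; only the labels are exchanged.
assert toy_reversed["original"].tolist() == toy["modified"].tolist()
assert toy_reversed["modified"].tolist() == toy["original"].tolist()
print(list(toy_reversed.columns))
```

Because only the labels change, the reversed table keeps the same positional column order, which is why its header reads `section`, `modified`, `original`.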


## Save

In [217]:
df_reversed.to_pickle(REVERSED_OUTPUT_FILE_PATH)