# Description

This is a template notebook intended to be run multiple times, each time with a different set of parameters.

Based on the settings below, it loads an input file with paragraph pairs (original and revised) and uses an LLM-as-a-Judge approach to compare the quality of the two versions of each paragraph.
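
The parameter cell further down suggests that each run is configured by injecting values such as `INPUT_FILE`, `LLM_JUDGE`, and `N_REPS`. A minimal sketch of how such runs could be enumerated follows. The helper `build_runs`, the relative directory names, and the naming rule (input stem plus the judge name with `:` replaced by `_`) are assumptions inferred from the example paths in this notebook, not the project's actual tooling.

```python
from itertools import product
from pathlib import PurePosixPath


def build_runs(input_files, judges, n_reps=5):
    """Return one parameter dict per (input file, judge) combination."""
    runs = []
    for input_file, judge in product(input_files, judges):
        input_path = PurePosixPath(input_file)
        # Assumed naming rule: "<input stem>--<judge with ':' replaced by '_'>.pkl"
        output_name = f"{input_path.stem}--{judge.replace(':', '_')}.pkl"
        runs.append(
            {
                "INPUT_FILE": str(input_path),
                "OUTPUT_FILE": str(PurePosixPath("llm_pairwise") / output_name),
                "LLM_JUDGE": judge,
                "N_REPS": n_reps,
            }
        )
    return runs


runs = build_runs(
    ["paragraph_match/epistasis-manuscript--gpt-3.5-turbo--reversed.pkl"],
    ["openai:gpt-3.5-turbo"],
)
for run in runs:
    print(run["LLM_JUDGE"], "->", run["OUTPUT_FILE"])
```

Each resulting dictionary could then be handed to whatever tool executes the template, one notebook run per dictionary.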

# Modules

In [1]:
import pandas as pd
from IPython.display import display
from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache
from proj import conf
from proj.utils import llm_pairwise

# Settings/paths

In [2]:
# Input manuscript
REPO = None

INPUT_FILE = None
OUTPUT_FILE = None

# Model and its parameters
LLM_JUDGE = None
TEMPERATURE = None
MAX_TOKENS = 2000
SEED_INIT = 0

# Evaluation parameters
N_REPS = None
THROW_IF_FAILED = False

In [3]:
# Parameters
REPO = "pivlab/manubot-ai-editor-code-test-epistasis-manuscript"
INPUT_FILE = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/epistasis-manuscript--gpt-3.5-turbo--reversed.pkl"
OUTPUT_FILE = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/llm_pairwise/epistasis-manuscript--gpt-3.5-turbo--reversed--openai_gpt-3.5-turbo.pkl"
LLM_JUDGE = "openai:gpt-3.5-turbo"
TEMPERATURE = 0.5
MAX_TOKENS = 2000
SEED_INIT = 0
N_REPS = 5


In [4]:
conf.common.LLM_CACHE_DIR.mkdir(parents=True, exist_ok=True)
display(conf.common.LLM_CACHE_DIR)

PosixPath('/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/llm_cache')

# Set default LangChain cache file

In [5]:
default_cache_file = conf.common.LLM_CACHE_DIR / "default.db"
display(default_cache_file)
set_llm_cache(SQLiteCache(database_path=str(default_cache_file)))

PosixPath('/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/llm_cache/default.db')

# Load paragraphs

In [6]:
df = pd.read_pickle(INPUT_FILE)

In [7]:
df.shape

(63, 3)

In [8]:
df.head()

| | section | modified | original |
|---|---|---|---|
| 0 | abstract | Maintaining germline genome integrity is essen... | The essential and immensely complex issue of m... |
| 1 | introduction | Germline mutation rates reflect the complex in... | Germline mutation rates are influenced by DNA ... |
| 2 | introduction | The dearth of observed germline mutators in ma... | The scarcity of observed germline mutators in ... |
| 3 | introduction | Despite these challenges, less traditional str... | Despite facing challenges, researchers have ut... |
| 4 | introduction | In mice, a germline mutator allele was recentl... | In a recent study, researchers identified a ge... |


In [9]:
df.iloc[0]["original"]

'The essential and immensely complex issue of maintaining germline genome integrity involves hundreds of proteins responsible for DNA replication, proofreading, and repair [@PMID:28485537]. While loss-of-function mutations in genes encoding these proteins can result in increased mutation rates, the detection of *mutator alleles* in mammals has been challenging. DNA replication and repair proteins often target specific sequence motifs or excise lesions at particular nucleotides, suggesting that the spectrum of *de novo* mutations (such as C>T, A>G, etc.) may vary between genomes with mutator or wild-type alleles at a specific locus. Previous research utilized quantitative trait locus mapping to identify potential mutator alleles in the DNA repair gene *Mutyh*, which elevated the C>A germline mutation rate in the BXD inbred mouse family [@PMID:35545679;@PMID:33472028]. In this study, a novel method called "aggregate mutation spectrum distance" was developed to identify alleles linked to 

In [10]:
df.iloc[0]["modified"]

'Maintaining germline genome integrity is essential and enormously complex. Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537]. While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals. DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of *de novo* mutations &mdash; that is, the frequency of each individual mutation type (C>T, A>G, etc.) &mdash; will differ between genomes that harbor either a mutator or wild-type allele at a given locus. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:35545679;@PMID:33

# Test run

In [11]:
t_json = llm_pairwise(
    df.iloc[0]["original"],
    df.iloc[0]["modified"],
    df.iloc[0]["section"],
    model_name=LLM_JUDGE,
    model_params={
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
        "model_kwargs": {
            "seed": SEED_INIT,
        },
    },
    verbose=True,
)



> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: The essential and immensely complex issue of maintaining germline genome integrity involves hundreds of proteins responsible for DNA replication, proofreading, and repair [@PMID:28485537]. While loss-of-function mutations in genes encoding these proteins can result in increased mutation rates, the detection of *mutator alleles* in mammals has been challenging. DNA replication and repair proteins often target specific sequence motifs or excise lesions a


> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: The essential and immensely complex issue of maintaining germline genome integrity involves hundreds of proteins responsible for DNA replication, proofreading, and repair [@PMID:28485537]. While loss-of-function mutations in genes encoding these proteins can result in increased mutation rates, the detection of *mutator alleles* in mammals has been challenging. DNA replication and repair proteins often target specific sequence


> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: The essential and immensely complex issue of maintaining germline genome integrity involves hundreds of proteins responsible for DNA replication, proofreading, and repair [@PMID:28485537]. While loss-of-function mutations in genes encoding these proteins can result in increased mutation rates, the detection of *mutator alleles* in mammals has been challenging. DNA replication and repair proteins often target specific sequence


> Finished chain.


In [12]:
t_json

{'best': 'tie',
 'rationale': 'Both paragraphs have a clear sentence structure, are easy to follow, correct in grammar, and have no spelling errors. The quality of writing in both versions is similar, making it a tie.'}

In [13]:
type(t_json)

dict

# Run

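Because LLM outputs are stochastic, we run the pairwise comparison `N_REPS` times, each time with a different seed.

Each repetition uses its own cache file, so re-running the notebook does not hit the external API again for prompts that were already evaluated.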
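
To illustrate why one cache file per repetition still yields independent samples, here is a minimal sketch of a prompt cache backed by SQLite. It is not LangChain's `SQLiteCache` implementation; the `PromptCache` class, the `fake_llm` stand-in, and the in-memory databases are assumptions made for illustration only.

```python
import sqlite3


class PromptCache:
    """Tiny prompt -> response cache backed by SQLite (illustrative only)."""

    def __init__(self, database_path=":memory:"):
        self.conn = sqlite3.connect(database_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def get_or_call(self, prompt, call_fn):
        row = self.conn.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: the external call is skipped
        response = call_fn(prompt)
        self.conn.execute("INSERT INTO cache VALUES (?, ?)", (prompt, response))
        self.conn.commit()
        return response


api_calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for an external API; records each call."""
    api_calls.append(prompt)
    return "tie"


# One cache per repetition, mirroring the rep{idx}.db files used in the loop
caches = {rep_idx: PromptCache() for rep_idx in range(2)}
first = caches[0].get_or_call("compare A vs 1", fake_llm)  # miss: calls the API
again = caches[0].get_or_call("compare A vs 1", fake_llm)  # hit: no API call
other = caches[1].get_or_call("compare A vs 1", fake_llm)  # other repetition: new call
print(len(api_calls))
```

Within a repetition the same prompt is answered from the cache, while a different repetition triggers a fresh call. A single shared cache would instead return the same cached answer for every repetition.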

In [14]:
results = []

In [15]:
for rep_idx in range(N_REPS):
    # we cache prompt/results by repetition
    output_cache_file = conf.common.LLM_CACHE_DIR / f"rep{rep_idx}.db"
    set_llm_cache(SQLiteCache(database_path=str(output_cache_file)))

    print(f"{str(rep_idx).zfill(2)} ({output_cache_file.name}): ", end="", flush=True)

    for par_idx, par in df.iterrows():
        print(".", end="", flush=True)

        res = llm_pairwise(
            par["original"],
            par["modified"],
            par["section"],
            model_name=LLM_JUDGE,
            model_params={
                "temperature": TEMPERATURE,
                "max_tokens": MAX_TOKENS,
                "model_kwargs": {
                    "seed": SEED_INIT + rep_idx,
                },
            },
            throw_if_failed=THROW_IF_FAILED,
            verbose=False,
        )

        results.append(
            {
                "rep_index": rep_idx,
                "paragraph_index": par_idx,
                "paragraph_section": par["section"],
                "winner": res["best"],
                "rationale": res["rationale"],
            }
        )

    print(flush=True)

00 (rep0.db): ...............................................................
01 (rep1.db): ...............................................................
02 (rep2.db): ...............................................................
03 (rep3.db): ...............................................................
04 (rep4.db): ...............................................................

# Process results

In [16]:
winner_matchings = {
    "Paragraph A": "-1",  # Original
    "Paragraph 1": "1",  # Modified
    "tie": "0",
}

In [17]:
df_results = pd.DataFrame(results)

In [18]:
df_results.shape

(315, 5)

In [19]:
df_results.head()

| | rep_index | paragraph_index | paragraph_section | winner | rationale |
|---|---|---|---|---|---|
| 0 | 0 | 0 | abstract | tie | Both paragraphs exhibit clear sentence structu... |
| 1 | 0 | 1 | introduction | tie | Both paragraphs exhibit clear sentence structu... |
| 2 | 0 | 2 | introduction | Paragraph A | Paragraph A has a more varied transition word,... |
| 3 | 0 | 3 | introduction | tie | Both Paragraph A and Paragraph 1 exhibit clear... |
| 4 | 0 | 4 | introduction | Paragraph A | Paragraph A excels in clear sentence structure... |


In [20]:
df_results["winner"].value_counts()

winner
tie            162
Paragraph 1     77
Paragraph A     76
Name: count, dtype: int64

In [21]:
df_results = df_results[df_results["winner"].isin(winner_matchings.keys())]

In [22]:
df_results.shape

(315, 5)

In [23]:
df_results = df_results.assign(
    # Map each verdict label to its numeric score (-1, 0, or 1)
    winner_score=df_results["winner"].map(winner_matchings).astype(float)
)

In [24]:
df_results.shape

(315, 6)

In [25]:
df_results.head()

| | rep_index | paragraph_index | paragraph_section | winner | rationale | winner_score |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | abstract | tie | Both paragraphs exhibit clear sentence structu... | 0.0 |
| 1 | 0 | 1 | introduction | tie | Both paragraphs exhibit clear sentence structu... | 0.0 |
| 2 | 0 | 2 | introduction | Paragraph A | Paragraph A has a more varied transition word,... | -1.0 |
| 3 | 0 | 3 | introduction | tie | Both Paragraph A and Paragraph 1 exhibit clear... | 0.0 |
| 4 | 0 | 4 | introduction | Paragraph A | Paragraph A excels in clear sentence structure... | -1.0 |


In [26]:
df_results.dtypes

rep_index              int64
paragraph_index        int64
paragraph_section     object
winner                object
rationale             object
winner_score         float64
dtype: object

In [27]:
df_results.groupby("paragraph_section")["winner_score"].mean()

paragraph_section
abstract        0.200000
discussion     -0.094118
introduction   -0.114286
methods         0.016667
results         0.142857
Name: winner_score, dtype: float64
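
Because the scores are -1, 0, and 1, a mean score equals the number of wins for the modified paragraph minus the number of wins for the original, divided by the number of comparisons. Positive values favor the modified version and negative values favor the original. The following small sketch applies this to the overall verdict counts reported earlier; it is an illustration of the arithmetic, not part of the original analysis.

```python
from collections import Counter

# Verdict counts as reported by value_counts() earlier in this notebook
counts = Counter({"tie": 162, "Paragraph 1": 77, "Paragraph A": 76})
# Same label -> score mapping as winner_matchings, stored as floats
scores = {"Paragraph A": -1.0, "Paragraph 1": 1.0, "tie": 0.0}

total = sum(counts.values())
# Weighted mean: (wins for modified - wins for original) / total comparisons
mean_score = sum(scores[label] * n for label, n in counts.items()) / total
print(total, mean_score)
```

With 77 wins for the modified paragraphs and 76 for the originals out of 315 comparisons, the overall mean is 1/315, which is very close to zero.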

# Save

In [28]:
df_results.to_pickle(OUTPUT_FILE)