# Description

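This notebook is a template meant to be run with different parameters (input file, judge model, temperature, and number of repetitions).

It loads pairs of original and modified paragraphs, asks an LLM judge to compare each pair several times, and saves the resulting scores.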

# Modules

In [1]:
import pandas as pd
from IPython.display import display
from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache
from proj import conf
from proj.utils import llm_pairwise

# Settings/paths

In [2]:
# Input manuscript
REPO = None

INPUT_FILE = None
OUTPUT_FILE = None

# Model and its parameters
LLM_JUDGE = None
TEMPERATURE = None
MAX_TOKENS = 2000
SEED_INIT = 0

# Evaluation parameters
N_REPS = None

In [3]:
# Parameters
REPO = "pivlab/manubot-ai-editor-code-test-biochatter-manuscript"
INPUT_FILE = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/paragraph_match/biochatter-manuscript--gpt-3.5-turbo.pkl"
OUTPUT_FILE = "/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/llm_pairwise/biochatter-manuscript--gpt-3.5-turbo--mistral_7b-instruct-fp16.pkl"
LLM_JUDGE = "mistral:7b-instruct-fp16"
TEMPERATURE = 0.5
MAX_TOKENS = 2000
SEED_INIT = 0
N_REPS = 10
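
The output file name in the parameters above appears to combine the input file's stem with the judge model name, with `:` replaced by `_`. Below is a minimal sketch of that naming pattern. The helper `build_output_name` is hypothetical (it is not part of `proj`), and the input path is shortened for illustration.

```python
from pathlib import Path


def build_output_name(input_file: str, llm_judge: str) -> str:
    """Hypothetical helper: append a filesystem-safe judge tag to the input stem."""
    # ":" is not safe in file names on every platform, so swap it for "_"
    judge_tag = llm_judge.replace(":", "_")
    return f"{Path(input_file).stem}--{judge_tag}.pkl"


# Illustrative, shortened path (an assumption; the notebook uses an absolute path)
name = build_output_name(
    "results/paragraph_match/biochatter-manuscript--gpt-3.5-turbo.pkl",
    "mistral:7b-instruct-fp16",
)
print(name)
```

Deriving the name from the parameters keeps the results of different judge models from overwriting each other when the template is run repeatedly.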


In [4]:
conf.common.LLM_CACHE_DIR.mkdir(parents=True, exist_ok=True)
display(conf.common.LLM_CACHE_DIR)

PosixPath('/home/miltondp/projects/others/manubot/manubot-ai-editor-code/base/results/llm_cache')

# Load paragraphs

In [5]:
df = pd.read_pickle(INPUT_FILE)

In [6]:
df.shape

(37, 3)

In [7]:
df.head()

| | section | original | modified |
|---|---|---|---|
| 0 | abstract | Current-generation Large Language Models (LLMs... | Large Language Models (LLMs) have generated si... |
| 1 | introduction | Despite technological advances, understanding ... | Despite technological advances, understanding ... |
| 2 | introduction | Large Language Models (LLMs) of the current ge... | The latest generation of Large Language Models... |
| 3 | introduction | Computational biomedicine involves many tasks ... | Computational biomedicine encompasses various ... |
| 4 | results | The framework is designed to be modular: any o... | The framework is designed to be modular, allow... |


In [8]:
df.iloc[0]["original"]

'Current-generation Large Language Models (LLMs) have stirred enormous interest in recent months, yielding great potential for accessibility and automation, while simultaneously posing significant challenges and risk of misuse. To facilitate interfacing with LLMs in the biomedical space, while at the same time safeguarding their functionalities through sensible constraints, we propose a dedicated, open-source framework: BioChatter. Based on open-source software packages, we synergise the many functionalities that are currently developing around LLMs, such as knowledge integration / retrieval-augmented generation, model chaining, and benchmarking, resulting in an easy-to-use and inclusive framework for application in many use cases of biomedicine. We focus on robust and user-friendly implementation, including ways to deploy privacy-preserving local open-source LLMs. We demonstrate use cases via two multi-purpose web apps ([https://chat.biocypher.org](https://chat.biocypher.org)), and pr

In [9]:
df.iloc[0]["modified"]

'Large Language Models (LLMs) have generated significant interest due to their potential for accessibility and automation in various fields, including biomedicine. However, they also present challenges and risks of misuse. In this paper, we address the need for a framework to interface with LLMs in the biomedical domain while ensuring their safe and effective use. To meet this need, we introduce BioChatter, an open-source framework that integrates various functionalities of LLMs, such as knowledge integration, retrieval-augmented generation, model chaining, and benchmarking. By leveraging open-source software packages, we have developed a user-friendly and versatile platform that can be applied across a range of biomedicine use cases. Our focus is on implementing robust and privacy-preserving local open-source LLMs. We showcase the utility of BioChatter through two multi-purpose web apps available at [https://chat.biocypher.org](https://chat.biocypher.org) and provide comprehensive doc

# Test run

In [10]:
t_json = llm_pairwise(
    df.iloc[0]["original"],
    df.iloc[0]["modified"],
    df.iloc[0]["section"],
    model_name=LLM_JUDGE,
    model_params={
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
        "model_kwargs": {
            "seed": SEED_INIT,
        },
    },
    verbose=True,
)



> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: Current-generation Large Language Models (LLMs) have stirred enormous interest in recent months, yielding great potential for accessibility and automation, while simultaneously posing significant challenges and risk of misuse. To facilitate interfacing with LLMs in the biomedical space, while at the same time safeguarding their functionalities through sensible constraints, we propose a dedicated, open-source framework: BioChatter. Based on open-source 


> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: Current-generation Large Language Models (LLMs) have stirred enormous interest in recent months, yielding great potential for accessibility and automation, while simultaneously posing significant challenges and risk of misuse. To facilitate interfacing with LLMs in the biomedical space, while at the same time safeguarding their functionalities through sensible constraints, we propose a dedicated, open-source framework: BioCha


> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
System: You are an expert copyeditor with ample experience in scientific writing. You are assessing the quality of two versions of a paragraph from the Abstract of a scientific article.
Human: Evaluate the quality of the following paragraph by writing a list with positive (if any) and/or negative (if any) aspects on the following areas: 1) has a clear sentence structure, 2) is easy to follow, 3) is correct in grammar, 4) has no spelling errors.

Paragraph A: Current-generation Large Language Models (LLMs) have stirred enormous interest in recent months, yielding great potential for accessibility and automation, while simultaneously posing significant challenges and risk of misuse. To facilitate interfacing with LLMs in the biomedical space, while at the same time safeguarding their functionalities through sensible constraints, we propose a dedicated, open-source framework: BioCha


> Finished chain.


In [11]:
t_json

{'best': 'tie',
 'rationale': 'Both paragraphs have a clear sentence structure and are easy to follow. They are also correct in grammar and have no spelling errors. However, Paragraph 1 may be more accessible to non-scientific readers due to its simpler language and less technical jargon.'}

In [12]:
type(t_json)

dict
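
Since the judge's answer comes back as a plain dictionary, it can help to check its shape before trusting it. The sketch below assumes the only valid labels are the three that appear later in `winner_matchings`. The function `is_valid_verdict` and the example dictionaries are hypothetical and not part of `proj`.

```python
# Labels taken from the winner_matchings dictionary used later in the notebook
VALID_WINNERS = {"Paragraph A", "Paragraph 1", "tie"}


def is_valid_verdict(verdict) -> bool:
    """Return True if the judge output has the expected keys and a known label."""
    return (
        isinstance(verdict, dict)
        and verdict.get("best") in VALID_WINNERS
        and isinstance(verdict.get("rationale"), str)
    )


# Made-up examples for illustration only
good = {"best": "tie", "rationale": "Both paragraphs are clear."}
bad = {"best": "Paragraph B", "rationale": "Label that the prompt never offered."}

print(is_valid_verdict(good), is_valid_verdict(bad))
```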

# Run

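Because model outputs are stochastic, we run the pairwise comparison `N_REPS` times, each time with a different seed.

Each repetition uses its own cache file, so re-running the notebook does not query the model again for prompts it has already answered.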
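
As a rough sketch of what the loop below sets up, the following lists the seed and cache file for each repetition and the number of result rows to expect. The values are copied from the parameters cell and from `df.shape`; the cache directory is a placeholder, since the notebook takes it from `conf.common.LLM_CACHE_DIR`.

```python
from pathlib import Path

# Values copied from the parameters cell and from df.shape
SEED_INIT = 0
N_REPS = 10
N_PARAGRAPHS = 37

# Placeholder directory (assumption); the notebook uses conf.common.LLM_CACHE_DIR
cache_dir = Path("llm_cache")

# One seed and one cache file per repetition
plan = [
    {
        "rep_index": rep_idx,
        "seed": SEED_INIT + rep_idx,
        "cache_file": cache_dir / f"rep{rep_idx}.db",
    }
    for rep_idx in range(N_REPS)
]

for step in plan[:3]:
    print(step["rep_index"], step["seed"], step["cache_file"].name)

# Every paragraph is judged once per repetition
expected_rows = N_REPS * N_PARAGRAPHS
print(expected_rows)
```

With these values the loop should produce 370 rows, which matches the shape of `df_results` reported further down.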

In [13]:
results = []

In [14]:
for rep_idx in range(N_REPS):
    # we cache prompt/results by repetition
    output_cache_file = conf.common.LLM_CACHE_DIR / f"rep{rep_idx}.db"
    set_llm_cache(SQLiteCache(database_path=str(output_cache_file)))

    print(f"{str(rep_idx).zfill(2)} ({output_cache_file.name}): ", end="", flush=True)

    for par_idx, par in df.iterrows():
        print(".", end="", flush=True)

        res = llm_pairwise(
            par["original"],
            par["modified"],
            par["section"],
            model_name=LLM_JUDGE,
            model_params={
                "temperature": TEMPERATURE,
                "max_tokens": MAX_TOKENS,
                "model_kwargs": {
                    "seed": SEED_INIT + rep_idx,
                },
            },
            verbose=False,
        )

        results.append(
            {
                "rep_index": rep_idx,
                "paragraph_index": par_idx,
                "paragraph_section": par["section"],
                "winner": res["best"],
                "rationale": res["rationale"],
            }
        )

    print(flush=True)

00 (rep0.db): .....................................
01 (rep1.db): .....................................
02 (rep2.db): .....................................
03 (rep3.db): .....................................
04 (rep4.db): .....................................
05 (rep5.db): .....................................
06 (rep6.db): .....................................
07 (rep7.db): .....................................
08 (rep8.db): .....................................
09 (rep9.db): .....................................

# Process results

In [15]:
winner_matchings = {
    "Paragraph A": "-1",  # Original
    "Paragraph 1": "1",  # Modified
    "tie": "0",
}

In [16]:
df_results = pd.DataFrame(results)

In [17]:
df_results.shape

(370, 5)

In [18]:
df_results.head()

| | rep_index | paragraph_index | paragraph_section | winner | rationale |
|---|---|---|---|---|---|
| 0 | 0 | 0 | abstract | tie | Both paragraphs are well-written and cover the... |
| 1 | 0 | 1 | introduction | Paragraph 1 | While both paragraphs are well-written and mee... |
| 2 | 0 | 2 | introduction | tie | Both paragraphs are of similar quality. While ... |
| 3 | 0 | 3 | introduction | Paragraph 1 | While both paragraphs are similar in structure... |
| 4 | 0 | 4 | results | tie | Both paragraphs are similar in quality. They b... |


In [19]:
# Keep only rows whose verdict is one of the expected labels
df_results = df_results[df_results["winner"].isin(winner_matchings.keys())]

In [20]:
df_results.shape

(370, 5)

In [21]:
df_results = df_results.assign(
    winner_score=df_results["winner"].replace(winner_matchings).apply(float)
)
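
The scoring convention is -1 when the original paragraph wins, 1 when the modified one wins, and 0 for a tie. The toy example below sketches the same filter-and-score step on made-up rows, using `Series.map` with numeric values as one possible alternative to `replace(...).apply(float)`. The name `winner_to_score` and the sample data are assumptions for illustration.

```python
import pandas as pd

# Same convention as winner_matchings, but with numeric values (an assumption)
winner_to_score = {"Paragraph A": -1.0, "Paragraph 1": 1.0, "tie": 0.0}

# Made-up verdicts; "Paragraph B" stands in for an unexpected label
toy = pd.DataFrame({"winner": ["tie", "Paragraph 1", "Paragraph A", "Paragraph B"]})

# Drop rows with unexpected labels, then translate labels into scores
toy = toy[toy["winner"].isin(list(winner_to_score))]
toy = toy.assign(winner_score=toy["winner"].map(winner_to_score))

print(toy)
```

Filtering first matters here: `map` would turn an unknown label into `NaN`, whereas `replace` would leave the label unchanged and `float` would then fail on it.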

In [22]:
df_results.shape

(370, 6)

In [23]:
df_results.head()

| | rep_index | paragraph_index | paragraph_section | winner | rationale | winner_score |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | abstract | tie | Both paragraphs are well-written and cover the... | 0.0 |
| 1 | 0 | 1 | introduction | Paragraph 1 | While both paragraphs are well-written and mee... | 1.0 |
| 2 | 0 | 2 | introduction | tie | Both paragraphs are of similar quality. While ... | 0.0 |
| 3 | 0 | 3 | introduction | Paragraph 1 | While both paragraphs are similar in structure... | 1.0 |
| 4 | 0 | 4 | results | tie | Both paragraphs are similar in quality. They b... | 0.0 |


In [24]:
df_results.dtypes

rep_index              int64
paragraph_index        int64
paragraph_section     object
winner                object
rationale             object
winner_score         float64
dtype: object

In [25]:
df_results.groupby("paragraph_section")["winner_score"].mean()

paragraph_section
abstract        0.600000
discussion      0.300000
introduction    0.266667
methods         0.212500
results         0.060000
Name: winner_score, dtype: float64
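
With scores of -1, 0 and 1, the mean per section equals the share of comparisons won by the modified paragraph minus the share won by the original, so it hides how many ties there were. The sketch below, on made-up data, shows one way to look at the full breakdown with `pd.crosstab`. The toy rows are assumptions and not results from this notebook.

```python
import pandas as pd

# Made-up verdicts for two sections
toy = pd.DataFrame(
    {
        "paragraph_section": ["abstract", "abstract", "results", "results", "results", "results"],
        "winner": ["Paragraph 1", "tie", "tie", "tie", "Paragraph A", "Paragraph 1"],
    }
)

# Share of each verdict within a section (each row sums to 1)
shares = pd.crosstab(toy["paragraph_section"], toy["winner"], normalize="index")
print(shares)

# Net preference for the modified paragraph; equals the mean of winner_score
net = shares["Paragraph 1"] - shares["Paragraph A"]
print(net)
```

In this toy data the `results` section has a net score of zero even though half of its comparisons were not ties, which the mean alone would not reveal.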

# Save

In [26]:
df_results.to_pickle(OUTPUT_FILE)
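
`to_pickle` does not create missing parent directories, so the save can fail if the output folder does not exist yet. Below is a minimal sketch of creating the folder first and checking the round trip. It uses a temporary directory and a toy DataFrame, and the path is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_results
toy = pd.DataFrame({"winner": ["tie"], "winner_score": [0.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path inside a folder that does not exist yet
    output_file = Path(tmp) / "llm_pairwise" / "toy.pkl"
    output_file.parent.mkdir(parents=True, exist_ok=True)

    toy.to_pickle(output_file)
    restored = pd.read_pickle(output_file)

# True if the reloaded frame matches the one that was saved
print(restored.equals(toy))
```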