# Evaluation Notebook
This notebook evaluates the pipeline from `main.py`, using the drug target `CK1α` as the test case.

A ground truth dataset is extracted from [molgluedb](https://www.molgluedb.com/browseDB), with the `Targets` filter set to `CK1α`. This returns 207 molecules from 18 papers (see `test-query-generation/tmp.txt`). Only 13 of these papers are accessible on PubMed Central (i.e., have a PMCID). We run our evaluations on those 13 papers.

In [None]:
import dotenv
import pandas as pd
import requests

import llm 
import ncbi 
import slogpkg

from openai import OpenAI
from Bio import Entrez

In [None]:
slog = slogpkg.GoStyleLogger()

In [None]:
env = {
    "OPENAI_URL": "",
    "OPENAI_API_KEY": "",
    "MODEL_NAME": "",
    "NCBI_EMAIL": "",
    "NCBI_API_KEY": "",
}
for var in env:
    value = dotenv.get_key(".env", var)  # returns None when the key is absent
    if not value:  # catches both None and the empty string
        slog.Error("Missing required environment variable", variable=var)
        raise RuntimeError(f"Missing environment variable: {var}")
    env[var] = value

Entrez.email = env["NCBI_EMAIL"]
Entrez.api_key = env["NCBI_API_KEY"]

client = OpenAI(base_url=env["OPENAI_URL"], api_key=env["OPENAI_API_KEY"])

In [None]:
df = pd.read_csv("evals/molgluedb_targets-CK1a.csv", index_col="DATAID")

In [None]:
df

## Processing to a Ground Truth
Because the pipeline currently only searches PMC, we want to take this list of papers and molecules that target `CK1α` and find the PMCIDs of the relevant papers and the number of compounds from each paper.
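As a minimal sketch of this aggregation on toy data, the per-paper compound count is simply the number of rows that share a DOI once the resolver prefix is stripped. The URLs and the helper name `count_compounds_per_doi` are made up for illustration and are not taken from the dataset or the pipeline.

```python
from collections import Counter

DOI_PREFIX = "https://doi.org/"

# Hypothetical source URLs, one per compound row (illustrative only).
source_urls = [
    "https://doi.org/10.1000/example.a",
    "https://doi.org/10.1000/example.a",
    "https://doi.org/10.1000/example.b",
]


def count_compounds_per_doi(urls):
    """Strip the resolver prefix from each URL and count rows per DOI."""
    dois = [
        url[len(DOI_PREFIX):] if url.startswith(DOI_PREFIX) else url
        for url in urls
    ]
    return Counter(dois)


counts = count_compounds_per_doi(source_urls)
print(dict(counts))
```

The counts should always sum to the number of input rows, which is the same invariant that the notebook's 207-compound sanity check relies on.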

In [None]:
# remove the "https://doi.org/" prefix to extract the raw DOI
df["DOI"] = df["SourceAddress_Website"].apply(lambda x: x[len("https://doi.org/") :])

In [None]:
df

In [None]:
# Merge rows if they have the same DOI.
# "compounds" should be the number of rows in the original DataFrame that have the given DOI,
# and we perform a sanity check to ensure rows with the same DOI have the same publication year
def handle_year(group):
    if group["Year"].nunique() == 1:
        return group["Year"].iloc[0]
    else:
        raise RuntimeError("Years don't match")


df = (
    df.groupby("DOI")
    .apply(
        lambda group: pd.Series(
            {
                "compounds": len(group),  # Count of rows with same DOI
                "Year": handle_year(group),  # Raises if years disagree within a DOI
            }
        )
    ).reset_index()
)

In [None]:
df

In [None]:
assert df['compounds'].sum() == 207 # sanity check

### [PubMed Central ID Converter API](https://pmc.ncbi.nlm.nih.gov/tools/id-converter-api/)
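The converter returns one record per requested ID. The sketch below shows one possible way to turn such a response into a DOI-to-PMCID lookup keyed by DOI rather than by position. The `doi`, `pmcid`, and `status` fields mirror those read in the cells that follow; the sample payload and the helper name `map_doi_to_pmcid` are assumptions made for illustration.

```python
def map_doi_to_pmcid(records):
    """Return {doi: pmcid or None}; unresolved DOIs map to None."""
    mapping = {}
    for record in records:
        doi = record.get("doi")
        if doi is None:
            continue  # nothing to key on
        # DOIs are case-insensitive, so normalize before using them as keys.
        key = doi.lower()
        if record.get("status") == "error":
            mapping[key] = None
        else:
            mapping[key] = record.get("pmcid")
    return mapping


# Hypothetical records, shaped like the fields this notebook reads.
sample_records = [
    {"doi": "10.1000/example.a", "pmcid": "PMC0000001"},
    {"doi": "10.1000/example.b", "status": "error"},
]

lookup = map_doi_to_pmcid(sample_records)
print(lookup)
```

Keying by DOI avoids depending on the response preserving request order, at the cost of needing an explicit check for DOIs that are missing from the response.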

In [None]:
# convert DOI to PMCID
# Make the request
try:
    response = requests.get(
        url="https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/",
        params={
            "tool": "gamma",
            "email": Entrez.email,
            "ids": ",".join(df["DOI"]),
            "format": "json",
        },
        timeout=10,
    )

    # Check if request was successful
    if response.status_code == 200:
        data = response.json()
        print("Success!")
        print(f"Status: {data.get('status', 'unknown')}")
        print(f"Response: {data}")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")


In [None]:
pmcids = []
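for i, (record, expected_doi) in enumerate(zip(data.get("records"), df["DOI"])):
    print(i, record)
    doi = record.get("doi")
    # Records the converter could not resolve have status == "error" and no PMCID
    pmcid = "" if record.get("status") == "error" else record.get("pmcid", "")

    # Records are expected in request order; fail loudly if they are not
    assert doi == expected_doi, f"{doi} != {expected_doi}"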
    pmcids.append(pmcid)

In [None]:
# strip the "PMC" prefix; missing or blank PMCIDs become the empty string
df['PMCID'] = [(p or "").strip()[len("PMC"):] for p in pmcids]
df = df[df['PMCID'] != ""] # remove rows with no PMCID

In [None]:
df # should have PMCID | compounds | ...

## Summarization Accuracy
Given the ground-truth list of $n=13$ relevant PMCIDs, evaluate the pipeline's ability to extract the correct number of screened compounds from each paper.

Three metrics are computed between the predicted and true compound counts: the number of exact matches (a binary score per paper), the mean absolute deviation, and the mean squared error.
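To make the three metrics concrete, here is a small sketch on made-up counts. The helper name `count_metrics` and the example numbers are illustrative assumptions, not part of the pipeline or its results.

```python
def count_metrics(predictions, targets):
    """Compare predicted and true compound counts, paper by paper."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    n = len(targets)
    errors = [p - t for p, t in zip(predictions, targets)]
    return {
        "exact_match": sum(e == 0 for e in errors) / n,  # fraction predicted exactly
        "mad": sum(abs(e) for e in errors) / n,  # mean absolute deviation
        "mse": sum(e * e for e in errors) / n,  # mean squared error
    }


# Toy example: three papers, one of which is off by two compounds.
metrics = count_metrics([10, 5, 3], [10, 7, 3])
print(metrics)
```

Because the squared term grows quickly, a single paper with a large miscount dominates the mean squared error, while the mean absolute deviation weights every miscounted compound equally.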

In [None]:
def process_paper(paper):
    # Generate a summary for each paper
    paper_summary: str = llm.generate_paper_summary(
        slog, client, env["MODEL_NAME"], paper["XML_Content"]
    )
    slog.Info("Paper summary complete")

    # Use the summary + original paper to count the number of compounds screened
    res = llm.generate_paper_compounds(
        slog, client, env["MODEL_NAME"], paper["XML_Content"], paper_summary
    )
    n_screened = res if res else 0
    slog.Info("Compounds screened complete")

    processed_paper = paper.copy()
    processed_paper["Summary"] = paper_summary
    processed_paper["n"] = str(n_screened)

    return processed_paper

In [None]:
papers = ncbi.fetch_by_pmcid(df['PMCID']) # type: ignore # IDs are numeric strings; the "PMC" prefix was stripped above
results = [process_paper(paper) for paper in papers]

In [None]:
predictions = [result['n'] for result in results] 

In [None]:
print(predictions)
predictions = [int(x) for x in predictions]
print(predictions)
print(list(df['compounds']))

In [None]:
n_correct = 0
total_abs_dev = 0
total_sq_dev = 0
for y_hat, y in zip(predictions, df['compounds']):
    n_correct += 1 if y_hat == y else 0
    total_abs_dev += abs(y_hat - y)
    total_sq_dev += (y_hat - y) ** 2

In [None]:
n_papers = len(predictions)  # 13 for the CK1α ground truth
slog.Info("Exact matching performance:", correct=n_correct, total=n_papers)
slog.Info("Mean Absolute Deviation:", mad=total_abs_dev / n_papers)
slog.Info("Mean Squared Error:", mse=total_sq_dev / n_papers)

## Coverage
Given a drug target and related query keywords, measure how many of the relevant papers the LLM-driven keyword search retrieves. Suppose the search returns $n$ papers, $x$ of which are among the 13 expected papers.

### Precision: $\frac{\# \text{ of expected papers found}}{\# \text{ papers returned}} = \frac{x}{n}$

### Recall: $\frac{\# \text{ of expected papers found}}{\# \text{ of expected papers}} = \frac{x}{13}$
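Both ratios reduce to set arithmetic over paper identifiers. The following is a minimal sketch, assuming papers are identified by PMCID strings; the helper name `precision_recall` and the IDs are made up for illustration.

```python
def precision_recall(returned_ids, expected_ids):
    """Compute precision x/n and recall x/|expected| from two ID collections."""
    returned, expected = set(returned_ids), set(expected_ids)
    hits = len(returned & expected)  # x: expected papers that were found
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Toy example: four papers returned, two of which are among three expected.
precision, recall = precision_recall(["1", "2", "3", "4"], ["2", "3", "5"])
print(precision, recall)
```

Converting to sets first means duplicate hits from overlapping keyword queries are counted once, so $n$ here is the number of distinct papers returned.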