# Why are delta-AUC-PR scores so different across datasets?
The output of notebook 4.3 shows that our pipeline performs very differently depending on which dataset is used. On Prism, the delta-AUC-PR scores are only marginally better than chance, whereas on toxvaldb they are substantially higher (0.2-0.3). What explains such different performance across support set sizes?

I will start by exploring some of the following questions:
- Are there some types of assays (e.g. aquatic predictions) for which toxvaldb has significantly better classification performance?
- Does the average assay across datasets have a similar active ratio?
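The active-ratio question matters because of how delta-AUC-PR is usually built. The sketch below assumes the common convention that delta-AUC-PR is the AUC-PR (computed here as average precision) minus the fraction of actives, which is the score a random ranker is expected to achieve. This notebook does not state the exact definition used by the pipeline, so treat that as an assumption. The helper names `average_precision` and `delta_auc_pr` and the toy labels are hypothetical.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels (ties in y_score are not handled specially)."""
    # Rank compounds from highest to lowest predicted score
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            # Precision at the rank of each active compound
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def delta_auc_pr(y_true, y_score):
    """ASSUMED definition: average precision minus the active ratio (chance level)."""
    active_ratio = sum(y_true) / len(y_true)
    return average_precision(y_true, y_score) - active_ratio


# Toy labels and scores, made up purely for illustration
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
example_delta = delta_auc_pr(y_true, y_score)

# A perfect ranking scores 1 - active_ratio, so the ceiling depends on the assay
balanced_ceiling = delta_auc_pr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Under this assumed definition, an assay with 50% actives can reach at most 0.5, while an assay with 10% actives can reach 0.9. Datasets whose assays have very different active ratios are therefore not directly comparable on this metric.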

In [8]:
import os

import pandas as pd
import duckdb

## Assay makeup

In [9]:
def get_run_info(filepath):
    """Extract the dataset name and support set size from a run path."""
    # The target part of the path is the one that contains both
    # 'params.dataset' and 'params.support_set_size'
    candidates = [
        part for part in filepath.split("/")
        if "params.dataset" in part and "params.support_set_size" in part
    ]
    if not candidates:
        raise ValueError(f"No run parameters found in path: {filepath}")

    # Read each 'key=value' pair by name so the result does not depend on
    # the order in which the parameters appear
    info = {}
    for param in candidates[0].split(","):
        key, _, value = param.partition("=")
        if "params.dataset" in key:
            info["dataset"] = value
        elif "params.support_set_size" in key:
            info["support_set_size"] = int(value)  # convert to int for numerical operations

    return info

In [10]:
INPUT_DIR = "/Users/sethhowes/Desktop/FS-Tox/multirun/2023-07-19/11-23-46"
run_dirs = [os.path.join(INPUT_DIR, run_dir) for run_dir in os.listdir(INPUT_DIR) if os.path.isdir(os.path.join(INPUT_DIR, run_dir))]
run_dirs = [f"{run_dir}/data/processed/score/*.parquet" for run_dir in run_dirs]

support_sizes = [8, 16, 32, 64]
datasets = ["clintox", "tox21", "toxcast", "bbbp", "toxval", "nci60", "cancerrx", "prism"]

con = duckdb.connect()

dfs = []

for run_dir in run_dirs:
    info = get_run_info(run_dir)
    query = f"""
    SELECT delta_auc_pr
    FROM read_parquet('{run_dir}')
    """
    try:
        df = con.execute(query).df()
        df["support_set_size"] = info["support_set_size"]
        df["dataset"] = info["dataset"]
        dfs.append(df)
    except Exception:
        # Runs whose score files are missing or unreadable are skipped
        print(f"No data for {info['dataset']} with support set size of {info['support_set_size']}")

# Concatenate all dataframes into one
df_final = pd.concat(dfs, ignore_index=True)

No data for clintox with support set size of 16
No data for bbbp with support set size of 32
No data for tox21 with support set size of 64
No data for bbbp with support set size of 16
No data for clintox with support set size of 32
No data for bbbp with support set size of 64
No data for tox21 with support set size of 16
No data for tox21 with support set size of 32
No data for clintox with support set size of 64
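
A natural next step is to summarise the scores per dataset and support set size, which makes the gap described at the top of the notebook easy to inspect. The following is a minimal sketch rather than the notebook's own analysis. It assumes a frame shaped like `df_final` (columns `delta_auc_pr`, `support_set_size`, `dataset`). The `toy` frame and its values are invented for illustration, and `summarise_scores` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for df_final; these values are made up, not real results
toy = pd.DataFrame(
    {
        "dataset": ["toxval", "toxval", "prism", "prism"],
        "support_set_size": [8, 16, 8, 16],
        "delta_auc_pr": [0.25, 0.30, 0.01, 0.02],
    }
)


def summarise_scores(df):
    """Mean, median and number of scores per dataset and support set size."""
    return (
        df.groupby(["dataset", "support_set_size"])["delta_auc_pr"]
        .agg(["mean", "median", "count"])
        .reset_index()
    )


summary = summarise_scores(toy)
print(summary)
```

Calling `summarise_scores(df_final)` in the notebook would give one row per dataset and support set size. The `count` column is worth checking alongside the averages, since several runs above produced no data at all.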
