evalzoo requires a config `yml` file for each batch of profiles. Use `match_rep_df.csv` to create these config files.

Additionally, each batch requires two config files: one for replicates and one for matches.

1. Load `match_rep_df` to get the batch names and plate names (`Assay_Plate_Barcode`)
2. Iterate through each batch and update both the replicate and match config files
   1. Set `experiment.data_path` to the batch path and `experiment.plates` to **all** plates within that batch
3. Save the two config files under `params/{batch}/`
   1. `{batch}_replicates_config.yaml` and `{batch}_matches_config.yaml`
4. Run evalzoo


After evalzoo runs, the results are stored in a folder named with a random hash (unsure why). However, we want these results to be stored in folders named after the batch they came from.


Example `experiment` section of a config file:

experiment:
  data_path: "/input/Scope1_Nikon_10X"
  input_structure: "{data_path}/{{plate}}/{{plate}}_normalized_feature_select_negcon_batch.{extension}"
  extension: csv.gz
  plates:
    - BR00117060a10x
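
The doubled braces in the `input_structure` template above suggest it is filled in two passes: first `data_path` and `extension`, then `plate`. Python's `str.format` treats `{{...}}` the same way, so the sketch below reproduces the file path that would be expected for a plate, which can help confirm a profile exists before running evalzoo. The two-pass behaviour is an assumption read off the template, not confirmed evalzoo behaviour, and `resolve_profile_path` is a hypothetical helper.

```python
def resolve_profile_path(experiment: dict, plate: str) -> str:
    """Expand input_structure for one plate (assumed two-pass substitution)."""
    # Pass 1: fill {data_path} and {extension}; {{plate}} collapses to {plate}.
    per_plate_template = experiment["input_structure"].format(
        data_path=experiment["data_path"],
        extension=experiment["extension"],
    )
    # Pass 2: fill {plate}.
    return per_plate_template.format(plate=plate)


# Values copied from the example config above
experiment = {
    "data_path": "/input/Scope1_Nikon_10X",
    "input_structure": "{data_path}/{{plate}}/{{plate}}_normalized_feature_select_negcon_batch.{extension}",
    "extension": "csv.gz",
    "plates": ["BR00117060a10x"],
}

paths = [resolve_profile_path(experiment, plate) for plate in experiment["plates"]]
print(paths[0])
```

Passing each resolved path to `os.path.isfile` (after mapping `/input` to wherever the profiles live on the host) would flag missing profiles early.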

In [106]:
import yaml
import pandas as pd
import os
import shutil
import math
import matplotlib.pyplot as plt

In [107]:
match_rep_df = pd.read_csv("../checkpoints/match_rep_df.csv")

match_rep_df = match_rep_df[match_rep_df["sphering"] == True]

match_rep_df

Unnamed: 0,Vendor,Batch,Plate_Map_Name,Assay_Plate_Barcode,Modality,Images_per_well,Sites-SubSampled,Binning,Magnification,Number_of_channels,...,Size_MB_std,sphering,value_95_replicating,Percent_Replicating,channel_names,brightfield_z_plane_used,feature_channels_found,Percent_Matching,value_95_matching,cell_count
0,MolDev,Scope1_MolDev_10X,JUMP-MOA_compound_platemap,Plate2_PCO_6ch_4site_10XPA,Widefield,4,,1,10,6,...,0.000144,True,0.191908,60.000000,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.288099,2014937
2,MolDev,Scope1_MolDev_10X,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_10XPA_Crest,Confocal,4,,1,10,6,...,0.000183,True,0.269617,62.222222,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",18.604651,0.398249,2413350
4,MolDev,Scope1_MolDev_10X_4siteZ,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_10XPA_Crestz,Confocal,4,,1,10,6,...,0.000142,True,0.205121,66.666667,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.363114,2381443
6,MolDev,Scope1_MolDev_20X_4site,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_20XPA_Crestz,Confocal,4,,1,20,6,...,0.000114,True,0.182630,57.777778,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",18.604651,0.279178,527841
8,MolDev,Scope1_MolDev_20X_9site,JUMP-MOA_compound_platemap,Plate2_PCO_6ch_9site_20XPA,Widefield,9,,1,20,6,...,0.000153,True,0.184205,67.777778,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.291127,1101611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_5Ch,JUMP-MOA_compound_platemap,BRO0117056_20x,Confocal,9,4.0,1,20,5,...,0.000044,True,0.174914,57.777778,"AGP, DNA, ER, Mito, RNA",,"AGP, DNA, ER, Mito, RNA",23.255814,0.244983,544244
354,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_5Ch_12Z,JUMP-MOA_compound_platemap,BRO0117056_20xb,Confocal,9,4.0,1,20,5,...,0.000044,True,0.157136,60.000000,"AGP, DNA, ER, Mito, RNA",,"AGP, DNA, ER, Mito, RNA",20.930233,0.227059,543826
356,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_6Ch_BRO0...,JUMP-MOA_compound_platemap,BRO0117059_20X,Confocal,9,4.0,1,20,6,...,0.000583,True,0.179268,58.888889,"AGP, BrightField, DNA, ER, Mito, RNA",Z08,"AGP, BrightField, DNA, ER, Mito, RNA",20.930233,0.253483,489099
358,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_6Ch_BRO0...,JUMP-MOA_compound_platemap,BRO01177034_20x,Confocal,9,4.0,1,20,6,...,0.000014,True,0.139090,56.666667,"AGP, BrightField, DNA, ER, Mito, RNA",Z17,"AGP, BrightField, DNA, ER, Mito, RNA",18.604651,0.193171,452567


In [105]:

def make_params():
    # Load the template configs that are customised for each batch
    with open("within_matches.yml") as f:
        match_yaml = yaml.safe_load(f)

    with open("within_replicates.yml") as f:
        rep_yaml = yaml.safe_load(f)

    # One pair of config files per batch, each listing every plate in that batch
    for batch, grouped_df in match_rep_df.groupby("Batch"):
        plates = grouped_df["Assay_Plate_Barcode"].tolist()

        # NOTE: this is the same dict object as rep_yaml, not a copy. That is
        # fine here because the config is written before the next iteration.
        out_rep_yaml = rep_yaml
        out_rep_yaml["experiment"]["data_path"] = f"/input/profiles/{batch}"
        out_rep_yaml["experiment"]["plates"] = plates

        out_match_yaml = match_yaml
        out_match_yaml["experiment"]["data_path"] = f"/input/profiles/{batch}"
        out_match_yaml["experiment"]["plates"] = plates

        # Configs are saved under params/<batch>/
        save_dir = os.path.join("params", batch)
        os.makedirs(save_dir, exist_ok=True)

        with open(os.path.join(save_dir, f'{batch}_replicates_config.yaml'), 'w') as f:
            yaml.dump(out_rep_yaml, f, sort_keys=False)

        with open(os.path.join(save_dir, f'{batch}_matches_config.yaml'), 'w') as f:
            yaml.dump(out_match_yaml, f, sort_keys=False)


make_params()
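
In `make_params`, `out_rep_yaml = rep_yaml` binds a second name to the same dictionary rather than copying it. That works because each config is dumped to disk before the next batch overwrites the fields, but it would silently break if the configs were collected first and written later. Below is a minimal sketch of a safer pattern using `copy.deepcopy`. The template, batch names, plate names, and the `build_configs` helper are all hypothetical and only illustrate the idea.

```python
import copy


def build_configs(template: dict, plates_by_batch: dict) -> dict:
    """Return one independent config dict per batch."""
    configs = {}
    for batch, plates in plates_by_batch.items():
        config = copy.deepcopy(template)  # independent copy, template untouched
        config["experiment"]["data_path"] = f"/input/profiles/{batch}"
        config["experiment"]["plates"] = list(plates)
        configs[batch] = config
    return configs


# Hypothetical template and batch/plate names, for illustration only
template = {"experiment": {"data_path": None, "plates": [], "extension": "csv.gz"}}
plates_by_batch = {"batch_a": ["plate_1", "plate_2"], "batch_b": ["plate_3"]}

configs = build_configs(template, plates_by_batch)
print(configs["batch_a"]["experiment"]["plates"])
print(configs["batch_b"]["experiment"]["data_path"])
```

With plain assignment instead of `deepcopy`, both entries in `configs` would end up pointing at the same dictionary holding the last batch's values.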

Following an evalzoo matric run, you are left with results folders named with a random hash. However, each folder should be named after the batch from which it was derived. Each folder contains a copy of the config file that was used to generate it (`params.yaml`), so we can get the batch name from there. Additionally, each hashed folder is the output of a **single** config file, so there is one for match and one for rep.

Furthermore, each folder contains a copy of the profiles (`profiles.parquet`) for some reason, which we can just delete. We can also delete the `.html` plot files and their accompanying `_files` folders.

1. Rename the hash folder based on the `batch` and the `background_type` (i.e. `{batch}_{background_type}`)
2. Delete the `.html` plot files and `profiles.parquet`
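
The naming rule can be sketched as a pure function that takes an already-parsed `params.yaml` and returns the new folder name, which makes it possible to preview the renames before deleting or moving anything. `result_folder_name` is a hypothetical helper, and the example dictionary is an illustrative subset rather than a full evalzoo config.

```python
def result_folder_name(params: dict) -> str:
    """Build `{batch}_{background_type}` from a parsed params.yaml."""
    experiment = params["experiment"]
    # The batch is the last component of data_path, e.g. /input/profiles/<batch>
    batch = experiment["data_path"].rstrip("/").split("/")[-1]
    return f"{batch}_{experiment['background_type']}"


# Illustrative subset of a params.yaml, already loaded into a dict
params = {
    "experiment": {
        "data_path": "/input/profiles/Scope1_MolDev_10X",
        "background_type": "non_rep",
    }
}

new_name = result_folder_name(params)
print(new_name)
```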


In [108]:
results_dir = "results/results/"

def postprocess_evalzoo(results_dir):
    for result_folder in os.listdir(results_dir):
        folder_path = os.path.join(results_dir, result_folder)
        print(folder_path)
        # Skip macOS metadata files
        if ".DS_Store" in result_folder:
            continue

        # The config used for this run is stored alongside the results
        with open(os.path.join(folder_path, "params.yaml")) as f:
            param_file = yaml.safe_load(f)
        batch = param_file["experiment"]["data_path"].split("/")[-1]

        # Don't try to process a folder that has already been processed
        if batch in result_folder:
            continue

        # Delete the copied profiles and the HTML plots with their asset folders
        os.remove(os.path.join(folder_path, "profiles.parquet"))
        for plot in ("metrics_plot_level_1_pvalue", "metrics_plot_level_1_qvalue"):
            os.remove(os.path.join(folder_path, f"{plot}.html"))
            shutil.rmtree(os.path.join(folder_path, f"{plot}_files"))

        # Rename the hashed folder to {batch}_{background_type}
        background_type = param_file["experiment"]["background_type"]
        shutil.move(folder_path, os.path.join(results_dir, f"{batch}_{background_type}"))

postprocess_evalzoo(results_dir)


results/results/5c2d331f
results/results/23f70a55
results/results/3af8fa4c
results/results/98d17b38
results/results/063ac5d9
results/results/b680545c
results/results/4552c92d
results/results/10f78e48
results/results/56f331c4
results/results/0de2d346
results/results/13fda8a7
results/results/3586faba
results/results/65c43c5f
results/results/cf837086
results/results/1f04dab5
results/results/bd9d1200
results/results/046cf486
results/results/534d9740
results/results/3806dda6
results/results/8947fcd1
results/results/56ce1760
results/results/51f02018
results/results/926ba684
results/results/.DS_Store
results/results/8c8b24ff
results/results/9c1795fb
results/results/67a28550
results/results/904f0d81
results/results/51ca25d1
results/results/c575ff69
results/results/1de26514
results/results/81ba7224
results/results/d4b21731
results/results/2b2058f9
results/results/13f35689
results/results/fb3ecdd0
results/results/712aceba
results/results/d9c06ee4
results/results/c1ebad2b
results/results/f3e41706

In [109]:
# Using match_rep_df, load the evalzoo parquet file for each batch

evalzoo_result_dir = "results/results"


def validate_evalzoo(match_rep_df, evalzoo_dir):
    """For all batches in match_rep_df, checks if there is a corresponding
    evalzoo metric parquet file that contains every plate in the batch"""
    profile_types = ["ref", "non_rep"]

    # Check each batch once, even if it spans several plates
    batches = match_rep_df["Batch"].unique().tolist()

    for batch in batches:
        plates = match_rep_df[match_rep_df["Batch"] == batch]["Assay_Plate_Barcode"].tolist()
        for p_typ in profile_types:
            filename = os.path.join(evalzoo_dir, batch+"_"+p_typ, f"metrics_level_1_{p_typ}.parquet")
            if not os.path.isfile(filename):
                print(f"{filename} not found")
                continue

            check_df = pd.read_parquet(filename)
            if "Metadata_Plate" not in check_df.columns:
                print(f"{filename} doesn't have a Metadata_Plate column")
                continue

            # Plates expected for this batch but absent from the evalzoo output
            evalzoo_plates = check_df["Metadata_Plate"].unique().tolist()
            plate_diff = set(plates) - set(evalzoo_plates)
            if plate_diff:
                print(f"{filename} is missing {plate_diff}")

validate_evalzoo(match_rep_df, evalzoo_result_dir)
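
`validate_evalzoo` reports plates that are expected but absent from the evalzoo output. It does not flag plates that appear in the output but not in `match_rep_df`. If that direction matters too, a two-way comparison is a small extension. The helper name `compare_plates` and the plate names in the example are hypothetical.

```python
def compare_plates(expected, found):
    """Return (missing, unexpected) plate sets."""
    expected, found = set(expected), set(found)
    missing = expected - found      # in match_rep_df but not in the evalzoo output
    unexpected = found - expected   # in the evalzoo output but not in match_rep_df
    return missing, unexpected


# Hypothetical plate names, for illustration only
missing, unexpected = compare_plates(
    expected=["plate_1", "plate_2"],
    found=["plate_2", "plate_3"],
)
print(missing, unexpected)
```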


In [110]:
pd.read_parquet("results/results/Scope1_MolDev_10X_non_rep/metrics_level_1_non_rep.parquet")

Unnamed: 0,Metadata_moa,Metadata_reference_or_other,Metadata_Plate,sim_scaled_mean_non_rep_i_mean_i,sim_scaled_mean_non_rep_i_median_i,sim_scaled_median_non_rep_i_mean_i,sim_scaled_median_non_rep_i_median_i,sim_ranked_relrank_mean_non_rep_i_mean_i,sim_ranked_relrank_mean_non_rep_i_median_i,sim_ranked_relrank_median_non_rep_i_mean_i,...,sim_stat_signal_n_non_rep_i_median_i,sim_stat_background_n_non_rep_i_mean_i,sim_stat_background_n_non_rep_i_median_i,sim_retrieval_average_precision_non_rep_i_mean_i,sim_retrieval_average_precision_non_rep_i_median_i,sim_retrieval_r_precision_non_rep_i_mean_i,sim_retrieval_r_precision_non_rep_i_median_i,sim_retrieval_average_precision_non_rep_i_nlog10pvalue_mean_i,sim_retrieval_average_precision_non_rep_i_nlog10pvalue_median_i,sim_retrieval_average_precision_non_rep_i_nlog10qvalue_mean_i
0,AMPK inhibitor,pert,Plate2_PCO_6ch_4site_10XPA,-0.780680,-0.780680,-0.780680,-0.780680,0.800562,0.800562,0.800562,...,1.0,178.0,178.0,0.007056,0.007056,0.0,0.0,0.000000,0.000000,-0.000000
1,AMPK inhibitor,pert,Plate3_PCO_6ch_4site_10XPA_Crest,-0.780680,-0.780680,-0.780680,-0.780680,0.800562,0.800562,0.800562,...,1.0,178.0,178.0,0.007056,0.007056,0.0,0.0,0.000000,0.000000,-0.000000
2,Aurora kinase inhibitor,pert,Plate2_PCO_6ch_4site_10XPA,1.578701,1.635297,1.666066,1.758050,0.160985,0.136364,0.149148,...,3.0,176.0,176.0,0.136668,0.113966,0.0,0.0,1.123522,1.109713,0.405666
3,Aurora kinase inhibitor,pert,Plate3_PCO_6ch_4site_10XPA_Crest,1.578701,1.635297,1.666066,1.758050,0.160985,0.136364,0.149148,...,3.0,176.0,176.0,0.136668,0.113966,0.0,0.0,1.123522,1.109713,0.405666
4,BCL inhibitor,pert,Plate2_PCO_6ch_4site_10XPA,-0.440991,-0.398083,-0.447199,-0.409299,0.549242,0.532197,0.478693,...,3.0,176.0,176.0,0.021151,0.020741,0.0,0.0,0.200824,0.193570,0.130786
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,tricyclic antidepressant,pert,Plate3_PCO_6ch_4site_10XPA_Crest,-0.571860,-0.554555,-0.608000,-0.605951,0.689394,0.696023,0.723011,...,3.0,176.0,176.0,0.016368,0.016642,0.0,0.0,0.080729,0.086999,0.061844
90,tumor necrosis factor production inhibitor,pert,Plate2_PCO_6ch_4site_10XPA,-0.348024,-0.337229,-0.443433,-0.447126,0.447917,0.415720,0.473011,...,3.0,176.0,176.0,0.024789,0.025989,0.0,0.0,0.288618,0.320610,0.160588
91,tumor necrosis factor production inhibitor,pert,Plate3_PCO_6ch_4site_10XPA_Crest,-0.348024,-0.337229,-0.443433,-0.447126,0.447917,0.415720,0.473011,...,3.0,176.0,176.0,0.024789,0.025989,0.0,0.0,0.288618,0.320610,0.160588
92,ubiquitin specific protease inhibitor,pert,Plate2_PCO_6ch_4site_10XPA,-0.380265,-0.452995,-0.539984,-0.578758,0.587595,0.623106,0.607955,...,3.0,176.0,176.0,0.019244,0.017448,0.0,0.0,0.152715,0.108563,0.114086
